Would anybody explain to me what the takeaway from this approach is?
When I run the code I see something like:
38/100 = 42.0% - 38.0% accuracy, 90.47619% accuracy of rated data.
90% is nice, but it looks to me like a tradeoff between precision and recall, where we just label those samples with the highest probability of pos/neg...
The primary takeaway was meant to be that a cluster of Bayesian classifiers (each trained on a random sample of the training data) offers significant improvements over a traditional single Bayesian classifier.
Nothing revolutionary, just wanted to share a practical example that outperforms most Bayesian implementations I have come across. I also wanted to share the results of model pruning, and that tokenization matters.
I originally coded this without much intent on sharing, so the post write-up could probably use a lot of work. :)
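The clustered-classifier idea described above is essentially bagging applied to Naive Bayes. Since the original post's Kotlin code isn't reproduced in this thread, here is a minimal sketch in Python (all names are hypothetical, not from the post): several multinomial Naive Bayes models, each trained on a bootstrap sample of the data, combined by majority vote.

```python
import math
import random
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over token lists, with Laplace smoothing."""
    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, samples):
        # samples: list of (tokens, label) pairs
        for tokens, label in samples:
            self.class_counts[label] += 1
            for t in tokens:
                self.token_counts[label][t] += 1
                self.vocab.add(t)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, count in self.class_counts.items():
            lp = math.log(count / total)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for t in tokens:
                # +1 Laplace smoothing so unseen tokens don't zero out the class
                lp += math.log((self.token_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

def bagged_classifiers(data, n_models=5, sample_frac=0.7, seed=0):
    """Train n_models Naive Bayes models, each on a bootstrap sample (bagging)."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * sample_frac))
    models = []
    for _ in range(n_models):
        model = NaiveBayes()
        model.train([rng.choice(data) for _ in range(k)])  # sample with replacement
        models.append(model)
    return models

def vote(models, tokens):
    """Majority vote across the ensemble."""
    votes = Counter(m.predict(tokens) for m in models)
    return votes.most_common(1)[0][0]
```

The Decision Tree -> Random Forest analogy mentioned later in the thread maps directly onto this: one weak-ish base model, many random training subsets, one aggregated prediction.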
This idea seems similar to local linear regression [1] in a way: approximate a non-linear model with multiple linear models that are fit to different data selections.
Neat idea wrt. using a Naive Bayes classifier here!
This idea originally came to me after explaining Random Forests to someone as a way to improve their Decision Tree. On a whim, I built out a prototype and it offered improved results. :)
My best runs using our proprietary tokenizers and much more training data (tens of thousands of tweets, reviews, and open source training data), yield results of:
87-93% accuracy while rating 50-70% of the data, i.e., if the classifier isn't "confident", it doesn't rate the data.
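That "doesn't rate" behavior can be implemented as an abstaining classifier: normalize the per-class scores into a posterior and only emit a label when the top class clears a confidence threshold. A sketch under assumptions (the function name and default threshold are mine, not from the original post):

```python
import math

def predict_with_confidence(log_probs, threshold=0.8):
    """Given per-class log-probabilities, return (label, confidence) only if
    the normalized posterior clears the threshold; otherwise (None, confidence)."""
    # Softmax over log-probabilities, shifted by the max for numerical stability.
    m = max(log_probs.values())
    exp = {label: math.exp(lp - m) for label, lp in log_probs.items()}
    z = sum(exp.values())
    posterior = {label: v / z for label, v in exp.items()}
    label, p = max(posterior.items(), key=lambda kv: kv[1])
    return (label, p) if p >= threshold else (None, p)
```

Raising the threshold trades coverage (percent rated) for accuracy on the rated subset, which is exactly the tradeoff the parent comment asked about.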
I'll run a quick test to demonstrate its absolute accuracy. It definitely performs more than well enough to use in production, though, and requires relatively little effort. It performs especially well on aggregate stats (>99% accuracy).
My favorite finding was that clustering Bayesian classifiers, each trained on a random sample of the data, resulted in a significant improvement. (Think Decision Tree -> Random Forest.)
Perhaps the most important thing I took away from this write-up is the motivation: even though state-of-the-art NNs can perform better, they are nearly impossible for humans to debug. Interpretable methods like Bayesian classifiers and decision trees let humans inspect the underlying model, and so are often more useful in practice.
I never know exactly what is meant by "accuracy" in these kinds of write-ups: what were the precision and recall across the various categories (sentiments?) into which the data is being classified?
That is a good question. Typically, accuracy for something like this means "what percentage of the test data does the model predict correctly". In this example, the model is not penalized for declining to rate text; that is captured in the "percent rated" column. This example only demonstrates two classes of sentiment: positive and negative.
The main purpose is to demonstrate techniques to gain improvements over traditional/common Bayesian methods.
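The two numbers being discussed, overall accuracy and accuracy on rated data, can be computed together once abstentions are represented explicitly (e.g. as `None`). A small illustrative helper, my own construction rather than anything from the post:

```python
def rated_accuracy(predictions, labels):
    """predictions may contain None where the classifier abstained.
    Returns (percent_rated, accuracy_on_rated, overall_accuracy)."""
    rated = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    correct = sum(1 for p, y in rated if p == y)
    n = len(labels)
    percent_rated = len(rated) / n if n else 0.0
    acc_rated = correct / len(rated) if rated else 0.0
    overall = correct / n if n else 0.0  # abstentions count as misses here
    return percent_rated, acc_rated, overall
```

This mirrors the output quoted at the top of the thread: 38 correct out of 42 rated gives ~90.5% accuracy on rated data, but only 38% overall across all 100 samples.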
Also, I would like to see a better metric than simple accuracy. Maybe the harmonic mean of precision and recall (the F1 score)?
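For reference, the harmonic mean of precision and recall is exactly the F1 score; a tiny sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score).
    Defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Unlike a plain average, the harmonic mean punishes imbalance: a classifier with perfect precision but near-zero recall still scores near zero.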
Otherwise, it's great to see more stuff like this in more languages (Kotlin, here).
Edit: Oh, hang on, I missed this bit:
>> more developed tokenizers that understand language
"understand language"? Really? What about "can handle"?