Recently via Twitter I came across “Gibbs Sampling for the Uninitiated” by Philip Resnik and Eric Hardisty, a tutorial that shows how to use Gibbs sampling of a Naive Bayes model to estimate the labels on a set of documents. The paper works through the algebra in great detail and concludes with pseudocode. Resnik and Hardisty do such a good job of making it look easy that I decided to write my own Gibbs sampler. It was, in fact, pretty easy. I wrote a Numeric Python script and put it in a GitHub project called Naive Bayes Gibbs Sampler. Currently it just generates a small random corpus and runs over it for a few iterations, but the
__main__ block makes it clear how this script could be used to do real work.
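To give a flavor of what such a sampler looks like, here is a minimal sketch of a collapsed Gibbs sampler for Naive Bayes labels. This is my own illustrative code, not the script from the repository: the function name, hyperparameters, and the simplified per-word conditional (which ignores repeated words within the document being resampled) are all my choices.

```python
import numpy as np

def gibbs_nb(docs, n_labels, n_iters=50, alpha=1.0, beta=0.1, seed=0):
    """Collapsed Gibbs sampler for the labels of a Naive Bayes model.

    docs: list of 1-D int arrays, each an array of word ids.
    Returns the label assignment after the final sweep.
    """
    rng = np.random.default_rng(seed)
    vocab = max(int(d.max()) for d in docs if len(d)) + 1
    labels = rng.integers(n_labels, size=len(docs))
    # Count statistics that the conditional distribution needs.
    doc_counts = np.bincount(labels, minlength=n_labels).astype(float)
    word_counts = np.zeros((n_labels, vocab))  # per-label word tallies
    for d, z in zip(docs, labels):
        np.add.at(word_counts[z], d, 1)
    label_totals = word_counts.sum(axis=1)
    for _ in range(n_iters):
        for i, d in enumerate(docs):
            z = labels[i]
            # Remove document i's contribution to the counts.
            doc_counts[z] -= 1
            np.add.at(word_counts[z], d, -1)
            label_totals[z] -= len(d)
            # Unnormalized log P(z_i = k | everything else), using the
            # simplified per-word form that ignores within-doc repeats.
            logp = np.log(doc_counts + alpha)
            logp += np.log(word_counts[:, d] + beta).sum(axis=1)
            logp -= len(d) * np.log(label_totals + vocab * beta)
            p = np.exp(logp - logp.max())
            z = rng.choice(n_labels, p=p / p.sum())
            labels[i] = z
            # Add the counts back under the (possibly new) label.
            doc_counts[z] += 1
            np.add.at(word_counts[z], d, 1)
            label_totals[z] += len(d)
    return labels
```

The count bookkeeping (remove the document, sample, add it back) is the heart of any collapsed Gibbs sampler; everything else is just evaluating the Naive Bayes conditional.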
Further improvements to be done in my copious spare time:
- Currently this only works on an unlabeled corpus, but it would be easy to support partially labeled corpora.
- There are probably ways to improve performance through more sophisticated use of Numeric Python array operations. At various points I do matrix operations by enumerating the rows with Python for loops, and my experience with similar frameworks like MATLAB and R is that when you're writing a loop you're doing it wrong.
- Since this is an unsupervised learner, even if it gets the category labels exactly right, the actual symbols used to represent them likely won’t match the true ones. For example, if the true labels are [1,1,2,3,1,2], an unsupervised estimated label set of [2,2,3,1,2,3] is 100% correct because it is just a permutation of the labels: 1→2, 2→3, 3→1. It would be nice to add an evaluation metric that can handle this. I’m not sure what is standard in unsupervised learning applications, but my first choice would be the B-Cubed metric, which is typically used for scoring coreference resolution but is more generally a way of quantifying the overlap between two equivalence sets.
- I need test cases: some small data set for which I can be reasonably confident of what the accuracy should be. I wonder if there is a way to do this short of running it on a standard data set like Newsgroups and comparing to published results. That is to say, how do you do unit testing in addition to integration testing?
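On the loops-versus-vectorization point above, the usual fix in array frameworks is to replace row-by-row Python loops with fancy indexing and a reduction along an axis. Here's a small self-contained comparison (the variable names and the specific computation, a per-label word log-likelihood, are mine for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
word_counts = rng.integers(0, 50, size=(3, 1000)).astype(float)  # labels x vocab
doc = rng.integers(0, 1000, size=40)                             # word ids
beta = 0.1

# Loop version: one label (row) at a time.
loop_scores = np.array(
    [np.log(word_counts[k, doc] + beta).sum() for k in range(word_counts.shape[0])]
)

# Vectorized: fancy-index all rows at once, then reduce along the word axis.
vec_scores = np.log(word_counts[:, doc] + beta).sum(axis=1)

assert np.allclose(loop_scores, vec_scores)
```

Beyond being faster, the vectorized form makes the shape of the computation (labels by words, summed over words) explicit.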
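To make the relabeling issue concrete, here is a sketch of the B-Cubed metric as I understand it: for each item, precision is the fraction of the item's predicted cluster that shares its gold cluster, recall is the converse, and both are averaged over items. The function and variable names are my own, and this naive version is O(n²), which is fine for small corpora:

```python
def b_cubed(gold, pred):
    """B-Cubed precision/recall/F1 for two labelings of the same items.

    Invariant to how cluster labels are named, so a relabeled but
    otherwise perfect clustering scores 1.0.
    """
    n = len(gold)
    precision = recall = 0.0
    for i in range(n):
        # Overlap between item i's predicted cluster and its gold cluster.
        overlap = sum(1 for j in range(n) if pred[j] == pred[i] and gold[j] == gold[i])
        precision += overlap / sum(1 for j in range(n) if pred[j] == pred[i])
        recall += overlap / sum(1 for j in range(n) if gold[j] == gold[i])
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The permuted labeling from the example above scores perfectly:
print(b_cubed([1, 1, 2, 3, 1, 2], [2, 2, 3, 1, 2, 3]))  # → (1.0, 1.0, 1.0)
```

Since it only compares cluster memberships, never label symbols, it handles the 1→2, 2→3, 3→1 case directly.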
That said, I hope this can be a useful addition to the tutorial. Have at it.