“Gibbs Sampling for the Uninitiated” for the Uninitiated

Recently via Twitter I came across “Gibbs Sampling for the Uninitiated” by Philip Resnik and Eric Hardisty, a tutorial that shows how to use Gibbs sampling of a Naive Bayes model to estimate the labels on a set of documents. This paper goes through the algebra in great detail and concludes with pseudocode. Resnik and Hardisty do such a good job of making it look easy that I decided to write my own Gibbs sampler. It was, in fact, pretty easy. I wrote a Numeric Python script and put it in a github project called Naive Bayes Gibbs Sampler. Currently it just generates a small random corpus and runs over it for a few iterations, but the __main__ block makes it clear how this script could be used to do real work.

Further improvements to be done in my copious spare time:

  • Currently this only works on an unlabeled corpus, but it would be easy to support partially labeled corpora.
  • There are probably ways to improve performance by being more sophisticated in my use of Numeric Python array operations. At various points I do matrix operations by enumerating the rows with Python for loops, and my experience with similar frameworks like MATLAB and R is that when you’re writing a loop you’re doing it wrong.
  • Since this is an unsupervised learner, even if it gets the category labels exactly right, the actual symbols used to represent them likely won’t match the true ones. For example, if the true labels are [1,1,2,3,1,2], an unsupervised estimated label set of [2,2,3,1,2,3] is 100% correct because it is just a a rotation of the labels 1→2, 2→3, 3→1. It would be nice to add an evaluation metric that can handle this. I’m not sure what is standard in unsupervised learning applications, but my first choice would be the B-Cubed metric, which is typically used for scoring coreference resolution but is more generally a way of quantifying the overlap between two equivalence sets.
  • I need test cases. Some small data set for which I can be reasonably confident what the accuracy would be. I wonder if there is a way to do this short of running it on a standard data set like Newsgroups and comparing to published results. That is to say, how do you do unit testing in addition to integration testing?

That said, I hope this can be a useful addition to the tutorial. Have at it.

This entry was posted in Those that have just broken the flower vase. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s