⌘-F the World

For the most part information technology enables communication across large gulfs of space and time, but it has also had unintended side effects that have changed the way we interact with our ten-foot-or-so radius of personal space. By virtue of being portable, cellphones are also losable, but by virtue of being callable they are findable. So the refrain of the early 21st century has been, “I can’t find my phone–can you call it?” This has become such second nature that I feel an occasional flicker of confusion when I remember that the same strategy doesn’t work for keys, wallets, shoes, and glasses. I shouldn’t, though, because unlike those other household items, being findable is a core aspect of a cell phone’s functionality.

I also find myself wanting to be able to locate material objects–an item in a grocery store, say–by hitting ⌘-F. Mostly I realize that this is just an unwarranted extension of a computing metaphor into the physical world, but some reflex arc in me has sincerely come to associate knowing the name of a thing with being able to make it appear. Occasionally the modality of stuff offers an advantage over machine-managed information (say you’re trying to find the first mention in a novel of a minor character whose name you cannot recall; you’re out of luck on a Kindle, but with a paperback you’ll likely have some sense memory of approximately where in the physical book they appeared), but for the most part computers win the indexing challenge hands down, because computers process language, and an index of the physical world is what language is.

Posted in Mermaids, Those that have just broken the flower vase | Leave a comment

Spoiler Alert

The girl in The Crying Game is actually a robot.

Norman Bates’ mother isn’t a prisoner–she’s a ghost!

The whole time in The Sixth Sense, Bruce Willis was Suzanne Pleshette.

The aliens in “To Serve Man” are tennis pros.

Tom Hanks and Meg Ryan end up together.

It was all a dream.

Posted in Those that at a distance resemble flies | Leave a comment

Preternity

Some people believe that after you die your soul will move on to an afterlife which, if you were good, will be an eternity of joy but, if you were bad, will be a period of endless torment. They’re almost right. You do spend an eternity experiencing either pleasure or pain depending on the moral quality of your life, but this happens before you are born. The kindly grandmother who volunteers at the local soup kitchen has already been rewarded with an infinite period of blissful communion with her creator. As for the pillar-of-his-community child molester, the uncaptured serial killer, and the concentration camp guard who dies peacefully in his bed at age eighty, take heart–their skin was indeed flayed by demons in a lake of fire. True, this agony ended at the moment of their births, but it’s not like these monsters are getting away with anything, because it had no beginning. Only a mere mortal would quibble over ordering since the books end up balanced just the same.

–So what happens after you die?
–Gee, I don’t know. I never really thought about it.

Posted in Innumerable ones | Leave a comment

Wittgenstein Went to New Orleans and all he got was this Lousy Croissant

When you can speak multiple languages or dialects and switch back and forth between them in order to make a point, that’s called code-switching. When you speak a single language or dialect and employ the forms of a different one for no clear reason, that’s called affectation. As always there are boundary cases.

Croissant

The French pronounce this [ˈkwɔsɔn]. Most Americans say [kɹ̩ˈsɑnt]. Croissant is a loan word, but we’ve been eating croissants over here in North America for long enough that I think we can call this a fully-fledged member of the English lexicon, recognizably French spelling notwithstanding. However, it’s not uncommon for me to hear (presumably monolingual) English speakers say [ˈkwɔsɔn] to other (presumably monolingual) English speakers in an otherwise English-only context. This sounds affected to my ears, but whenever I’ve heard it the usage has been so unselfconscious–so devoid of wince-inducing social jockeying–that I wonder if there is an actual lexical change afoot.

New Orleans, Long Island, Shanghai

Here’s something that I do, so I don’t think it’s affected, but you be the judge. When saying the names of places I’ll lean towards a well-known native pronunciation instead of what would be standard for my dialect. So given a choice between [nju ɔɹˈlinz], [nju ˈɔɹlɪnz], or [ˈnɔlɪnz] (N’awlins for you non-linguists) I’ll opt for the last one, even though the total amount of time I’ve spent in New Orleans is about two days when I was five. Likewise I say [lɔnˈgaɪln̩d] instead of [lɔŋ aɪlæn̩d], hitting that medial /g/ hard like a loan shark does a deadbeat. More subtly for an American, I pronounce the name of China’s largest city [ʃɒŋhaɪ] instead of [ʃæŋhaɪ] because that’s how I’ve heard Chinese people say it. (Though I say [ʃæŋhaɪ] for the verb that means to kidnap someone and force them to serve in your navy, because that’s a different lexical item. I also refer to the 1986 Madonna/Sean Penn vehicle as [ʃæŋhaɪ supraɪz], but perhaps it is best if we don’t speak of this at all.)

There’s a limit of course. I’m not going to say “Last summer I took a lovely vacation in [mɛxiko]” because I don’t want to sound like a schmuck. I suggest the native pronunciation without trying to sound convincingly native myself. What this translates to in phonetic terms is borrowing segments and broad stress patterns, but drawing the subtler aspects of pronunciation (e.g. vowel quality) from my own dialect.

Wittgenstein

All these issues come together in a crucial aspect of Wittgenstein scholarship: how should an English speaker say his name? As I see it, there are four options.

  1. [wɪtgɪnstaɪn]…This is reasonable because it is consistently anglicized in a here-in-America-we-speak-American-buddy kind of way.
  2. [vɪtgɪnstaɪn]…This is how I say it. It’s pretty standard and sounds reasonable, but when you think about it, just changing the initial consonant while leaving everything else the same is kind of sloppy. This pronunciation belongs to a dialect that I call War Movie German, which is identical to English except for the addition of the lexical items “achtung” and “jawohl” and the single phonetic rule #w → v.
  3. [vɪtgɪnʃtaɪn]…Logically this seems better than (2). If you’re going to try and sound German you might as well go all the way, but to my ears something is off. I think it’s because the segment change w→v is so much more famous as a German stereotype than s→ʃ that the latter perversely comes off as affected by comparison.
  4. [wɪtgɪnʃtaɪn]…See sounding like a schmuck above.
Posted in Mermaids | 2 Comments

The Prospector and the Snowstorm

There was a gold prospector who got caught in a blizzard in the Alaskan wilderness. He hiked for three days in blinding whiteness, lost, starving, frozen, and alone. The prospector prayed to God to save him, but all that came was more snow. On the fourth day the prospector’s strength failed him, and he lay down to die. Just at that moment an Eskimo hunting party wandered past. They strapped the prospector to their sled and dragged him back into town.

That night as he lay safe and warm in his bed, the prospector prayed to God again. “Lord,” he prayed, “when I asked for your help and it didn’t come, I thought you had forsaken me. But then as I was about to die, you sent those Eskimos. It is a great comfort for me to know that the moment things seem the most hopeless is the moment when you will intervene.”

All at once the prospector felt a strange and powerful presence, and God’s voice sounded in his head. “No,” God said, “you are mistaken. I was the one who sent the blizzard. I had nothing to do with those Eskimos. It was dumb luck that they came by when they did. If you want to take comfort in something, take comfort in knowing that if you should freeze to death in the wilderness, that would be part of My plan, and My plan is the right one because it’s Mine.”

Posted in Those that at a distance resemble flies | Leave a comment

Gay Animal Hoarders

As mental illness reality TV goes, Animal Hoarders lies somewhere in the middle of the voyeurism spectrum–not on a par with the legitimate public service that is Intervention but also nowhere near the rank exploitiveness of Celebrity Rehab with Dr. Drew. In keeping with its genre, each installment is as formulaic and interchangeable as a sea chanty or episode of CSI. There is a brief prelude in which a happy couple expresses love for their pets followed by a reveal of just how many pets we’re talking about here–thirty dogs, a hundred cats, two hundred screeching budgies colonizing every square inch of ceiling space. The hoarder half of the couple insists to the camera that everything is fine, the depth of their denial underscored by shots of domestic chaos and animal filth. The non-hoarder half of the couple tries to talk tough about the need for change, but it quickly becomes apparent that they are enablers of this behavior. At the midpoint the producers intervene, negotiating the removal of at least some of the animals. The episode concludes with a follow-up some time later in which the animal level is generally still high but below where it was, and the couple expresses hope for the future.

About ninety percent of these couples are married people of the opposite sex. About ten percent are same-sex couples. This has absolutely no relevance for either the hoarding behavior or the clockwork manner in which it is depicted. The soft-butch spaniel enthusiast is precisely as crazy as the married cat lady, and in precisely the same way. Even if the producers believed the hoarders’ sexual orientation to be relevant, they would probably end up ignoring it anyway because these episodes have to hit a very specific sequence of marks in only twenty minutes. The result is that a TV show with no political agenda whatsoever ends up normalizing homosexuality in a particularly insidious way, by permitting us to gawk at unattractive queer couples who are messed up for reasons that have nothing to do with sex.

There are only so many hours in the day and so many words that can be spoken in an hour, and the narrowness of a communication channel naturally tilts it towards conservatism. It is easier to buttress a consensus than challenge it because the pat phrases are already out there, the groundwork has been laid. In the thirty-second news slot the spokesman for received wisdom always has the home field advantage. But contrary to the whingeing of pols and conspiracy theorists alike who feel like they can’t get a fair shake from the media, this is a purely structural phenomenon. It is ideologically neutral, and sometimes concision helps to undermine a consensus, or at least hustle a fading one out the door a little faster.

Twenty years ago a show like Animal Hoarders wouldn’t have been able to let the sexual orientation of its profilees pass without comment. It would have had to be acknowledged and somehow minimized, either for fear of offending homophobic viewers, or of providing more ammunition for their prejudices, or both. The easiest thing would have been to quietly adopt a heterosexuals-only policy for the show, but from the producers’ standpoint that would have been a hassle because there probably aren’t that many non-camera-shy animal hoarders out there, and the less picky you can be the better. So the moment this delicacy is no longer absolutely required, the invisible hand of the reality TV marketplace pushes it to the curb. An outmoded sexual taboo is abandoned literally because no one has time for it.

Posted in Those drawn with a very fine camel’s-hair brush | Leave a comment

Importing scikit-learn Models into Java

Currently scikit-learn is the best general-purpose machine learning package. It is part of the Scientific Python family of tools, built on top of the NumPy matrix processing engine. The code is readable, the documentation extensive, and the package popular, so there’s plenty of help available on Stack Overflow when you need it. But perhaps scikit-learn’s best selling point is that it’s written in Python, a language well suited for the ad hoc exploratory working style typical of machine learning. Java machine learning toolkits like Weka and Mallet are mathematically solid, but running mathematical algorithms is only part of the job of data science. There’s also inevitably lots of format munging, directory groveling, glue code, and trying things that don’t work. You want the basics to be as easy as possible. The Python command line achieves a level of transparency that Java–with its boilerplate, IDEs, compilers, complex build systems, and lack of a REPL–cannot match.

[Figure: illustrations of machine learning classification]

Still, the JVM is a popular platform, and it would be nice to be able to train a model in scikit-learn and then deploy it in Java. There is currently no support for this. The right thing would be to have scikit-learn export its model files to some common format like PMML, but that feature does not currently exist.[1] scikit-learn’s only serialization is Python’s native pickle format, which works great, but only for other Python programs. In theory, writing your own serialization should be easy: a model is just a set of numbers. But it only works if the test-time code exactly reproduces the training code’s processing of its input. Any deviation and your finely tuned vector of coefficients becomes nothing more than a numeric jumble.

Let’s take a look at a fairly simple but still non-trivial machine learning model and see what is involved in exporting its semantics in a cross-language way. Say I want to do text classification. I have a corpus of short documents drawn from two genres: cookbooks and descriptions of farm life. I have tab-delimited text files that look like this.

0   The horse and the cow lived on the farm
1   Boil two eggs for five minutes
0   The hayloft of the barn was full
1   Drain the pasta

The first column is an integer class label and the second is a document. I want the computer to learn how to hypothesize a 0 or 1 for any string input it is given. A standard approach would be to treat the documents as bags of words and build a Naive Bayes model over them. To make things more sophisticated, let’s train on bi-grams in addition to individual words, and work with Tf-Idf values instead of raw counts. scikit-learn makes this easy. Here is the bulk of the code needed to train such a model.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def train_model(corpus):
    # corpus yields UTF-8 encoded lines of the form "<integer label>\t<document text>".
    labels = []
    data = []
    for line in corpus:
        label, text = line.decode('utf-8').split("\t", 1)
        labels.append(int(label))
        data.append(text)
    # Count unigrams and bigrams, reweight the counts by Tf-Idf, then fit Naive Bayes.
    # Note that ngram_range expects a (min_n, max_n) tuple.
    model = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])
    model.fit(data, labels)
    return model
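
Before worrying about export, it is worth sanity-checking the model on the Python side. Here is a minimal usage sketch; the corpus file name and the test sentences are placeholders, while predict and predict_log_proba are the standard scikit-learn Pipeline methods.

with open('corpus.tsv', 'rb') as corpus:    # a tab-delimited file like the sample above
    model = train_model(corpus)

docs = ['Place the garlic in a pan', 'We fed the horses and the pigs']
print(model.predict(docs))              # integer class predictions
print(model.predict_log_proba(docs))    # per-class log probabilities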

The model returned is a scikit-learn object. If we want to export it to another language, we have to extract its meaningful parts and serialize them in a general-purpose way. These meaningful parts are sets of numbers. Specifically, for each dimension in the vector space (that is, for each term) there is an Idf score and one coefficient per class. Additionally there is a scalar bias for each class. So what we have is a vector of numbers plus a mapping from strings to vectors of numbers.

Bias 0    Bias 1
-0.693    -0.587

Term         Idf      Coefficient 0   Coefficient 1
garlic       4.673    -8.327          -6.825
peel garlic  3.522    -12.805         -10.505

You have to do some detective work to figure out where inside the scikit-learn objects these numbers actually reside, but once you have them you can serialize them in a language-agnostic way by writing them out as JSON. Sure, the file will be huge, and the representation of floating point numbers as strings is wildly inefficient, but we can always gzip the thing.
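
To make that detective work concrete, here is a minimal export sketch for the pipeline built by train_model above. The attribute names (vocabulary_, idf_, feature_log_prob_, class_log_prior_) are scikit-learn’s own; the JSON layout and the helper function are just one plausible choice of serialization, not a standard format.

import json

def export_model(model, path):
    # The fitted pieces live on the Pipeline's named steps.
    vocabulary = model.named_steps['vect'].vocabulary_          # term -> column index
    idf = model.named_steps['tfidf'].idf_                       # Idf score per column
    coefficients = model.named_steps['clf'].feature_log_prob_   # one row of coefficients per class
    biases = model.named_steps['clf'].class_log_prior_          # scalar bias per class
    # Re-key everything by term so the decoder never needs the column indices.
    terms = {term: {'idf': float(idf[i]),
                    'coef': [float(coefficients[k][i]) for k in range(len(biases))]}
             for term, i in vocabulary.items()}
    with open(path, 'w') as out:
        json.dump({'biases': [float(b) for b in biases], 'terms': terms}, out)

The float() calls are there because NumPy scalars are not directly JSON-serializable.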

Now the Java decoder needs to 1) load this file, 2) turn the input into n-gram terms, 3) build a vector of term Tf-Idf scores, and 4) linearly transform that vector using the model’s coefficients and biases. None of this is particularly difficult, but you have to make sure that the Java decoder performs each of these steps in exactly the same way as the Python encoder, so that the numbers passed between them retain their meaning.
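
The Java implementation is the point of the project, but the arithmetic is easier to see in a few lines of Python. The sketch below decodes against the hypothetical JSON layout from the export sketch above; it is a reference for what the Java code has to reproduce, including CountVectorizer’s default lowercasing and two-character token pattern and TfidfTransformer’s default L2 normalization, not the project’s actual decoder.

import json
import math
import re

# CountVectorizer's default token pattern: lowercased runs of two or more word characters.
TOKEN = re.compile(r'(?u)\b\w\w+\b')

def classify(text, model):
    tokens = TOKEN.findall(text.lower())
    # Unigrams plus space-joined bigrams, matching ngram_range=(1, 2).
    terms = tokens + [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    # Term frequency weighted by Idf, restricted to terms the model knows about.
    weights = {}
    for term in terms:
        if term in model['terms']:
            weights[term] = weights.get(term, 0.0) + model['terms'][term]['idf']
    # TfidfTransformer L2-normalizes each document vector by default.
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    # Linear transform: start from the class biases and add the Tf-Idf/coefficient dot product.
    scores = list(model['biases'])
    for term, weight in weights.items():
        for k, coefficient in enumerate(model['terms'][term]['coef']):
            scores[k] += (weight / norm) * coefficient
    return scores.index(max(scores)), scores

with open('model.json') as f:    # the hypothetical file written by export_model above
    print(classify('Place the garlic in a pan', json.load(f)))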

The Linear N-gram Model project contains a Python training script and a Java decoder that do this. Train a model in Python on a corpus like the one pictured above, run it in Java on unlabeled text, and it will produce class predictions and log likelihoods like so.

0   -47.8674 -47.1280   The harvest was finished early this year
0   -47.0950 -42.8352   We fed the horses and the pigs
1   -45.3605 -46.8341   Place the garlic in a pan

This project can serve as starter example code for machine learning researchers faced with a similar cross-language serialization task.

[1] But check out Py2PMML, which looks like it gets you part of the way there. (Hat tip darknightelf.)

Posted in Innumerable ones, Those that have just broken the flower vase | Leave a comment