A big thing in machine-learning approaches to language processing right now is word embedding vectors. The word cat is represented to the computer not as the string of letters c-a-t, nor as an item in a vocabulary (which would boil down to an index, i.e. an integer), but rather as a point in a high-dimensional continuous vector space. Not just any point, though. One trains a language model on a large corpus of English text, and in the course of doing so produces these “word embeddings” as a side effect: an intermediate data structure built on the way to the standard language modeling task of predicting which words are likely to appear near which other words. It turns out that treating these vectors as a proxy for the meaning of the words they correspond to is helpful in many other natural language tasks, so instead of throwing them out you set them aside for future use, the way a chef would set aside chicken bones to make a broth.
In the 300-dimensional GloVe embedding of English, for example, the word cat is represented by the numbers -0.15067, -0.024468, -0.23368, -0.23378, -0.18382, and so on 295 more times. This is, in some crude way, what cat “means”. The numbers corresponding to any given word are individually uninterpretable, but taken as a whole the system makes a certain degree of sense. For example, the words whose points in the vector space lie closest to cat (kitten, dog, kitty, pet, feline) refer to things obviously similar to cats.
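The nearest-neighbor computation itself is simple: rank every other word by the cosine of the angle between its vector and yours. Here is a minimal sketch; the four-dimensional vectors are invented for illustration (real GloVe vectors are 300-dimensional and learned from a corpus), so only the relative geometry, not the particular numbers, means anything.

```python
import numpy as np

# Toy embeddings, invented for illustration -- NOT real GloVe values.
embeddings = {
    "cat":    np.array([0.90, 0.80, 0.10, 0.00]),
    "kitten": np.array([0.85, 0.75, 0.20, 0.05]),
    "dog":    np.array([0.80, 0.60, 0.15, 0.10]),
    "car":    np.array([0.10, 0.00, 0.90, 0.80]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, k=3):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    query = embeddings[word]
    others = [(w, cosine(query, v)) for w, v in embeddings.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

print(nearest("cat"))  # kitten first, then dog, with car far behind
```

With a real embedding file you would do the same thing over a few hundred thousand rows; libraries like gensim package exactly this query.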
We can even do basic semantic “arithmetic”. If we subtract the vector for man from the vector for king we get a point in the space relatively close to the result of subtracting woman from queen. Meaning is captured not by the individual pieces, but in the structure of the whole.
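A sketch of that arithmetic, using tiny hypothetical vectors in which a “royalty” axis and “gender” axes are made explicit. Learned embeddings have no such labeled dimensions, and the analogy only holds approximately there; in this toy construction it holds exactly, by design.

```python
import numpy as np

# Hypothetical 3-d vectors: dimensions stand for (royalty, male, female).
# Invented for illustration; real embeddings encode such directions
# only implicitly and approximately.
vecs = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

# king - man and queen - woman both isolate the same offset.
offset_king = vecs["king"] - vecs["man"]
offset_queen = vecs["queen"] - vecs["woman"]

# Completing the analogy: king - man + woman should land nearest queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = min(
    (w for w in vecs if w not in ("king", "man", "woman")),
    key=lambda w: np.linalg.norm(vecs[w] - target),
)
print(best)  # queen
```

The standard convention (following the word2vec papers) is to exclude the three query words themselves from the candidates, as done above, since the raw result often lands closest to one of the inputs.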
What’s more, this structure is essentially differential. The only thing distinctive about the vector embedding of cat is that it is some sequence of numbers and not another. Crucially, it is different from the sequences for dog, queen, man, and so forth. Different, yet intermingled with them. By virtue of being mapped into the same space, every word bears a relationship to every other word, and this relationship is itself no accident. It is, as I said above, the result of having a computer churn through an enormous body of English text, the product of millions upon millions of calculations. Imagine shaking a vast multi-dimensional numeric matrix like a snow globe until its individual elements arrange themselves in a shape that allows the statistical patterns peculiar to the English language to pass through with relative ease. You could never work backwards from the vector embeddings to the training text, but nevertheless you know that each word’s vector is where it is because of its interplay with the other words in actual living language. Each point in the space bears the invisible trace of all the others.
Have you ever opened a dictionary to look up an unfamiliar word and thought, why, this definition is just composed of other words, all of which also appear in this dictionary? And if you looked up their definitions, they would just be other words, and so on. This can produce a vertiginous feeling, the realization that language has no beginning and no outside vantage point. Word vector space is like that, except even more vertiginous because it is a continuous space. Your mind naturally turns to imagining a topology of this space. There could be surfaces and manifolds. About each word’s point we cannot help but imagine a little hypersphere, its semantic penumbra. There will be an infinite number of points within that hypersphere that do not correspond to any English word, but nevertheless could correspond to a word, if only a word were to happen to have appeared in such-and-such a set of contexts. If we were to go back and insert our novel term into our training texts, would it make sense? Would it express a novel concept, but one nevertheless similar to the concepts near it in the embedding space? Perhaps the move from the discrete space of words to the continuous space of embeddings reverses language’s discretizing nature, its ability to chop the smooth flow of experience into discrete atoms of meaning. Continuity, after all, is just infinity standing on its head. So this is not just an interplay of signifiers, but an endless interplay.
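The hypersphere image can be made concrete. The sketch below draws a point uniformly from a small ball around a word vector (the vector and the radius are both invented for illustration); the sampled point is a perfectly valid location in the space that corresponds to no English word at all.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A toy 4-d "word vector"; real embedding spaces have hundreds of dims.
cat = np.array([0.9, 0.8, 0.1, 0.0])
radius = 0.05  # size of the "semantic penumbra" -- arbitrary here

# Uniform sampling inside a hypersphere: pick a random direction on the
# unit sphere, then scale the radius by u^(1/d) so points fill the ball's
# volume evenly instead of clustering at the center.
d = cat.shape[0]
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
r = radius * rng.uniform() ** (1 / d)
nearby = cat + r * direction

# `nearby` sits inside cat's penumbra but names nothing in English.
print(np.linalg.norm(nearby - cat))
```

The u^(1/d) correction matters more as the dimension grows: in 300 dimensions almost all of a ball's volume is concentrated near its surface, one of the many counterintuitive facts about the spaces these embeddings live in.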
What is the result of all this work? Well, we can create computer programs that solve a number of practical problems. They can summarize documents, group similar news articles, help people find the information they need, and transcribe speech. All useful and impressive, and all seemingly impossible just a few decades ago before we had invented these particular mathematical tools. But still there’s something unsatisfying about the whole business, because at the end of the day it’s all just bits on a machine. Your computer programs process reams of text in order to produce…more text. And that can’t be all there is. We know from our experience that language isn’t just some complicated interplay of tokens. At some point it has to touch on the outside world. It has to be about something. And yet all the computer can model for us is a closed system. Disappointingly, there’s nothing outside the text.
It’s all very strange, very counterintuitive. Honestly it makes your head hurt just to think about it. Such an odd conceptualization must be an artifact of the software engineering process, the awkward attempt to force the fundamentally human phenomenon of language onto a computer. No one would be so perverse as to conceive of language in this manner unless practical engineering necessity forced them to.