Written by: Stephen Hsu
Primary Source: Information Processing
This NYTimes Magazine article describes the implementation of a new deep neural net version of Google Translate. The previous version used statistical methods whose effectiveness had plateaued, limited by their reliance on short-range correlations in conditional probabilities. I’ve found the new version to be much better than the old one (this is quantified a bit in the article).
More deep learning.
NYTimes: … There was, however, another option: just design, mass-produce and install in dispersed data centers a new kind of chip to make everything faster. These chips would be called T.P.U.s, or “tensor processing units,” … “Normally,” Dean said, “special-purpose hardware is a bad idea. It usually works to speed up one thing. But because of the generality of neural networks, you can leverage this special-purpose hardware for a lot of other things.” [ Nvidia currently has the lead in GPUs used in neural network applications, but perhaps TPUs will become a sideline business for Google if their TensorFlow software becomes widely used … ]
Just as the chip-design process was nearly complete, Le and two colleagues finally demonstrated that neural networks might be configured to handle the structure of language. He drew upon an idea, called “word embeddings,” that had been around for more than 10 years. When you summarize images, you can divine a picture of what each stage of the summary looks like — an edge, a circle, etc. When you summarize language in a similar way, you essentially produce multidimensional maps of the distances, based on common usage, between one word and every single other word in the language. The machine is not “analyzing” the data the way that we might, with linguistic rules that identify some of them as nouns and others as verbs. Instead, it is shifting and twisting and warping the words around in the map. In two dimensions, you cannot make this map useful. You want, for example, “cat” to be in the rough vicinity of “dog,” but you also want “cat” to be near “tail” and near “supercilious” and near “meme,” because you want to try to capture all of the different relationships — both strong and weak — that the word “cat” has to other words. It can be related to all these other words simultaneously only if it is related to each of them in a different dimension. You can’t easily make a 160,000-dimensional map, but it turns out you can represent a language pretty well in a mere thousand or so dimensions — in other words, a universe in which each word is designated by a list of a thousand numbers. Le gave me a good-natured hard time for my continual requests for a mental picture of these maps. “Gideon,” he would say, with the blunt regular demurral of Bartleby, “I do not generally like trying to visualize thousand-dimensional vectors in three-dimensional space.”
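The "multidimensional map" idea above can be made concrete with a toy sketch. The vectors below are invented 3-dimensional stand-ins (real embeddings use hundreds or thousands of dimensions, learned from data); the point is only that distance in the map tracks relatedness in usage:

```python
import math

# Toy 3-dimensional "embeddings". These numbers are made up purely for
# illustration; trained systems learn ~1000-dimensional vectors from text.
embedding = {
    "cat":  [0.9, 0.8, 0.1],
    "dog":  [0.8, 0.9, 0.2],
    "tail": [0.7, 0.3, 0.6],
    "car":  [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# In a well-built map, "cat" sits nearer "dog" (and "tail") than "car".
assert cosine(embedding["cat"], embedding["dog"]) > cosine(embedding["cat"], embedding["car"])
```

In a learned embedding, "cat" can be close to "dog", "tail", and "meme" simultaneously precisely because each relationship can live along a different dimension.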
Still, certain dimensions in the space, it turned out, did seem to represent legible human categories, like gender or relative size. If you took the thousand numbers that meant “king” and literally just subtracted the thousand numbers that meant “queen,” you got the same numerical result as if you subtracted the numbers for “woman” from the numbers for “man.” And if you took the entire space of the English language and the entire space of French, you could, at least in theory, train a network to learn how to take a sentence in one space and propose an equivalent in the other. You just had to give it millions and millions of English sentences as inputs on one side and their desired French outputs on the other, and over time it would recognize the relevant patterns in words the way that an image classifier recognized the relevant patterns in pixels. You could then give it a sentence in English and ask it to predict the best French analogue.
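The king/queen arithmetic is just vector subtraction. A minimal sketch, using made-up 2-dimensional vectors (one axis loosely "royalty", one loosely "gender") rather than anything from a trained model:

```python
# Toy 2-dimensional embeddings; the numbers are invented to illustrate
# the arithmetic, not taken from any real trained model.
vec = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

def sub(u, v):
    """Componentwise vector difference."""
    return [a - b for a, b in zip(u, v)]

# king - queen and man - woman both isolate the same "gender" direction.
assert sub(vec["king"], vec["queen"]) == sub(vec["man"], vec["woman"])
```

In real embeddings the equality is only approximate, but the nearest word to `king - man + woman` famously turns out to be `queen`.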
That the conceptual vocabulary of human language (and hence, of the human mind) has dimensionality of order 1000 is kind of obvious*** if you are familiar with Chinese ideograms. (Ideogram = a written character symbolizing an idea or concept.) One can read the newspaper with mastery of roughly 2-3k characters. Of course, some minds operate in higher dimensions than others ;-)
The major difference between words and pixels, however, is that all of the pixels in an image are there at once, whereas words appear in a progression over time. You needed a way for the network to “hold in mind” the progression of a chronological sequence — the complete pathway from the first word to the last. In a period of about a week, in September 2014, three papers came out — one by Le and two others by academics in Canada and Germany — that at last provided all the theoretical tools necessary to do this sort of thing. That research allowed for open-ended projects like Brain’s Magenta, an investigation into how machines might generate art and music. It also cleared the way toward an instrumental task like machine translation. Hinton told me he thought at the time that this follow-up work would take at least five more years.
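The 2014 papers describe encoder-decoder ("seq2seq") networks; the core trick of "holding in mind" a sequence can be sketched minimally. The cell below is untrained, with arbitrary toy weights, but it shows the mechanism: a recurrent update folds words, one at a time and in order, into a single fixed-size state vector:

```python
import math

HIDDEN = 3  # size of the fixed "memory" vector

def step(state, word_vec, w_state=0.5, w_input=0.5):
    # One recurrent update: mix the old state with the new word.
    # tanh keeps the state bounded no matter how long the sentence is.
    return [math.tanh(w_state * s + w_input * x)
            for s, x in zip(state, word_vec)]

def encode(sentence_vecs):
    state = [0.0] * HIDDEN       # start with an empty memory
    for v in sentence_vecs:      # words arrive as a progression over time
        state = step(state, v)
    return state                 # fixed-size summary of the whole sequence

# Unlike a bag of pixels, order matters: the same two words in a
# different order produce a different summary vector.
a = encode([[1, 0, 0], [0, 1, 0]])
b = encode([[0, 1, 0], [1, 0, 0]])
assert a != b
```

A real encoder-decoder then conditions a second recurrent network on this summary to emit the translation word by word; attention mechanisms (one of the 2014 ideas) let the decoder look back at all intermediate states rather than the final one alone.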
The entire article is worth reading (there’s even a bit near the end which addresses Searle’s Chinese Room confusion). However, the author underestimates the importance of machine translation. The “thought vector” structure of human language encodes the key primitives used in human intelligence. Efficient methods for working with these structures (e.g., for reading and learning from vast quantities of existing text) will greatly accelerate AGI.
*** Some further explanation, from the comments:
The average person has a vocabulary of perhaps 10-20k words. But if you eliminate redundancy (synonyms, plus the concatenations discussed below), you are probably left with only a few thousand words. With these words one could express most concepts (e.g., those required for newspaper articles). Some ideas might require concatenations of multiple words: “cougar” = “big mountain cat”, etc.
But the ~1k figure gives you some idea of how many distinct “primitives” (= “big”, “mountain”, “cat”) are found in human thinking. It’s not the number of distinct concepts, but rather the rough number of primitives out of which we build everything else.
Of course, truly deep areas of science discover / invent new concepts which are almost new primitives (fundamental, but didn’t exist before!), such as “entropy”, “quantum field”, “gauge boson”, “black hole”, “natural selection”, “convex optimization”, “spontaneous symmetry breaking”, “phase transition” etc.
If we trained a deep net to translate sentences about Physics from Martian to English, we could (roughly) estimate the “conceptual depth” of the subject. We could even compare two different subjects, such as Physics versus Art History.