Written by: Stephen Hsu
Primary Source: Information Processing
Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Nature 521, 436–444 (28 May 2015) doi:10.1038/nature14539
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
The article seems to give a somewhat, er, compressed, version of the history of the field. See these comments by Schmidhuber:
Machine learning is the science of credit assignment. The machine learning community itself profits from proper credit assignment to its members. The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it). Relatively young research areas such as machine learning should adopt the honor code of mature fields such as mathematics: if you have a new theorem, but use a proof technique similar to somebody else’s, you must make this very clear. If you “re-invent” something that was already known, and only later become aware of this, you must at least make it clear later.
As a case in point, let me now comment on a recent article in Nature (2015) about “deep learning” in artificial neural networks (NNs), by LeCun & Bengio & Hinton (LBH for short), three CIFAR-funded collaborators who call themselves the “deep learning conspiracy” (e.g., LeCun, 2015). They heavily cite each other. Unfortunately, however, they fail to credit the pioneers of the field, which originated half a century ago. All references below are taken from the recent deep learning overview (Schmidhuber, 2015), except for a few papers listed beneath this critique focusing on nine items.
1. LBH’s survey does not even mention the father of deep learning, Alexey Grigorevich Ivakhnenko, who published the first general, working learning algorithms for deep networks (e.g., Ivakhnenko and Lapa, 1965). A paper from 1971 already described a deep learning net with 8 layers (Ivakhnenko, 1971), trained by a highly cited method still popular in the new millennium. Given a training set of input vectors with corresponding target output vectors, layers of additive and multiplicative neuron-like nodes are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set, where regularisation is used to weed out superfluous nodes. The numbers of layers and nodes per layer can be learned in problem-dependent fashion.
2. LBH discuss the importance and problems of gradient descent-based learning through backpropagation (BP), and cite their own papers on BP, plus a few others, but fail to mention BP’s inventors. BP’s continuous form was derived in the early 1960s (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only. BP’s modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Dreyfus (1973) used BP to change weights of controllers in proportion to such gradients. By 1980, automatic differentiation could derive BP for any differentiable graph (Speelpenning, 1980). Werbos (1982) published the first application of BP to NNs, extending thoughts in his 1974 thesis (cited by LBH), which did not have Linnainmaa’s (1970) modern, efficient form of BP. BP for NNs on computers 10,000 times faster per Dollar than those of the 1960s can yield useful internal representations, as shown by Rumelhart et al. (1986), who also did not cite BP’s inventors. [ THERE ARE 9 POINTS IN THIS CRITIQUE ]
… LBH may be backed by the best PR machines of the Western world (Google hired Hinton; Facebook hired LeCun). In the long run, however, historic scientific facts (as evident from the published record) will be stronger than any PR. There is a long tradition of insights into deep learning, and the community as a whole will benefit from appreciating the historical foundations.
One very striking aspect of the history of deep neural nets, which is acknowledged both by Schmidhuber and LeCun et al., is that the subject was marginal to “mainstream” AI and CS research for a long time, and that new technologies (i.e., GPUs) were crucial to its current flourishing in terms of practical results. The theoretical results, such as they are, appeared decades ago! It is clear that there are many unanswered questions concerning guarantees of optimal solutions, the relative merits of alternative architectures, use of memory networks, etc.
Some additional points:
1. Prevalence of saddle points over local minima in high dimensional geometries: apparently early researchers were concerned about incomplete optimization of DNNs due to local minima in parameter space. But saddle points are much more common in high dimensional spaces and local minima have turned out not to be a big problem.
2. Optimized neural networks are similar in important ways to biological (e.g., monkey) brains! When monkeys and ConvNet are shown the same pictures, the activation of high-level units in the ConvNet explains half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex.
Some comments on the relevance of all this to the quest for human-level AI from an earlier post:
.. evolution has encoded the results of a huge environment-dependent optimization in the structure of our brains (and genes), a process that AI would have to somehow replicate. A very crude estimate of the amount of computational power used by nature in this process leads to a pessimistic prognosis for AI even if one is willing to extrapolate Moore’s Law well into the future. [ Moore’s Law (Dennard scaling) may be toast for the next decade or so! ] Most naive analyses of AI and computational power only ask what is required to simulate a human brain, but do not ask what is required to evolve one. I would guess that our best hope is to cheat by using what nature has already given us — emulating the human brain as much as possible.
If indeed there are good (deep) generalized learning architectures to be discovered, that will take time. Even with such a learning architecture at hand, training it will require interaction with a rich exterior world — either the real world (via sensors and appendages capable of manipulation) or a computationally expensive virtual world. Either way, I feel confident in my bet that a strong version of the Turing test (allowing, e.g., me to communicate with the counterpart over weeks or months; to try to teach it things like physics and watch its progress; eventually for it to teach me) won’t be passed until at least 2050 and probably well beyond.
Relevant remarks from Schmidhuber:
[Link] …Ancient algorithms running on modern hardware can already achieve superhuman results in limited domains, and this trend will accelerate. But current commercial AI algorithms are still missing something fundamental. They are no self-referential general purpose learning algorithms. They improve some system’s performance in a given limited domain, but they are unable to inspect and improve their own learning algorithm. They do not learn the way they learn, and the way they learn the way they learn, and so on (limited only by the fundamental limits of computability). As I wrote in the earlier reply: “I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on this topic.” However, additional algorithmic breakthroughs may be necessary to make this a practical reality.
[Link] The world of RNNs is such a big world because RNNs (the deepest of all NNs) are general computers, and because efficient computing hardware in general is becoming more and more RNN-like, as dictated by physics: lots of processors connected through many short and few long wires. It does not take a genius to predict that in the near future, both supervised learning RNNs and reinforcement learning RNNs will be greatly scaled up. Current large, supervised LSTM RNNs have on the order of a billion connections; soon that will be a trillion, at the same price. (Human brains have maybe a thousand trillion, much slower, connections – to match this economically may require another decade of hardware development or so). In the supervised learning department, many tasks in natural language processing, speech recognition, automatic video analysis and combinations of all three will perhaps soon become trivial through large RNNs (the vision part augmented by CNN front-ends). The commercially less advanced but more general reinforcement learning department will see significant progress in RNN-driven adaptive robots in partially observable environments. Perhaps much of this won’t really mean breakthroughs in the scientific sense, because many of the basic methods already exist. However, much of this will SEEM like a big thing for those who focus on applications. (It also seemed like a big thing when in 2011 our team achieved the first superhuman visual classification performance in a controlled contest, although none of the basic algorithms was younger than two decades: http://people.idsia.ch/~juergen/superhumanpatternrecognition.html)
So what will be the real big thing? I like to believe that it will be self-referential general purpose learning algorithms that improve not only some system’s performance in a given domain, but also the way they learn, and the way they learn the way they learn, etc., limited only by the fundamental limits of computability. I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on this topic, but now I can see how it is starting to become a practical reality. Previous work on this is collected here: http://people.idsia.ch/~juergen/metalearner.html
See also Solomonoff universal induction. I don’t believe that completely general purpose learning algorithms have to become practical before we achieve human-level AI. Humans are quite limited, after all! When was the last time you introspected to learn about the way you learn you learn …? Perhaps it is happening “under the hood” to some extent, but not in maximum generality; we have hardwired limits.
Do we really need Solomonoff? Did Nature make use of his Universal Prior in producing us? It seems like cheaper tricks can produce “intelligence” ;-)