A Very Short History of Artificial Neural Networks

by James V Stone

“If you want to understand a really complicated device, like a brain, you should build one.”
G Hinton, 2018.

The development of modern artificial neural networks occurred in a series of step-wise dramatic improvements, interspersed with periods of stasis (often called neural network winters). As early as 1943, before commercial computers existed, McCulloch and Pitts began to explore how small networks of artificial neurons could mimic brain-like processes. The cross-disciplinary nature of this early research is evident from the fact that McCulloch and Pitts also contributed to the classic neurophysiological paper What the frog’s eye tells the frog’s brain (Lettvin et al, 1959); remarkably, this paper was not published in a biology journal, but in the Proceedings of the Institute of Radio Engineers.

Figure 1. A taxonomy of artificial neural networks.

In subsequent years, the increasing availability of computers allowed neural networks to be tested on simple pattern recognition tasks. But progress was slow, partly because most research funding was allocated to more conventional approaches that did not attempt to mimic the neural architecture of the brain, and partly because early artificial neural network learning algorithms were limited. A taxonomy of artificial neural networks is shown in Figure 1.

The Perceptron

A major landmark in the history of neural networks was Frank Rosenblatt’s perceptron in 1958, which was explicitly modelled on the neuronal structures in the brain. The perceptron learning algorithm allowed a perceptron to learn to associate inputs with outputs in a manner apparently similar to learning in humans. Specifically, the perceptron could learn an association between a simple input ‘image’ and a desired output, where this output indicated whether or not a particular object was present in the image. A simple perceptron is shown in Figure 2.

Figure 2. The simplest perceptron, with two input units and one output unit. The connections from input to output units have weights w1 and w2.
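To make the idea of learning an input/output association concrete, here is a minimal sketch of a two-input perceptron trained with the classic error-correction rule. The bias term, the learning rate, the number of epochs, and the choice of logical AND as the task are all assumptions made for illustration; they are not taken from Rosenblatt's paper or from the book.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum of the two inputs exceeds zero, else 0."""
    total = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, rate=1.0):
    """Error-correction learning: nudge weights only when the output is wrong."""
    weights, bias = [0.0, 0.0], 0.0  # w1, w2 and an assumed bias term
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights[0] += rate * error * x[0]
            weights[1] += rate * error * x[1]
            bias += rate * error
    return weights, bias


# Logical AND is linearly separable, so a single perceptron can learn it.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
outputs = [predict(weights, bias, x) for x, _ in AND_EXAMPLES]
print(outputs)
```

Because the weights change only when the output is wrong, learning stops by itself once every example is classified correctly.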

Neural Networks and Human Memory

Perceptrons, and neural networks in general, share three key properties with human memory. First, unlike conventional computer memory, neural network memory is content addressable, which means that recall is triggered by an image or a sound, as shown in Figure 3. In contrast, computer memory can be accessed only if the specific location (address) of the required information is known.

Figure 3. Content addressable memory. Even a simple neural network can recover a previously learned image (right) from a corrupted version of that image (left). The image on the left was used as input, and the image on the right is the output. From Artificial Intelligence Engines, 2019.

Second, a common side-effect of content addressable memory is that, given a learned association between an input image and an output, recall can be triggered by an input image that is similar (but not identical) to the original input of a particular input/output association. This ability to generalise beyond learned associations is a critical human-like property of artificial neural networks.

Third, if a single weight or unit is destroyed, this does not eliminate any particular association; instead, it usually degrades all associations to some extent. This graceful degradation is thought to resemble human memory.

The First Neural Network Winter

Despite these human-like qualities, the perceptron was dealt a severe blow in 1969, when Minsky and Papert famously proved that it could not learn associations unless they were of a particularly simple kind (i.e. linearly separable, as described in Chapter 2 of Artificial Intelligence Engines). This marked the beginning of the first neural network winter, during which neural network research was undertaken by only a handful of scientists. During this winter, the capabilities of linear networks were explored in the forms of holophones by Longuet-Higgins (1968), correlographs by Longuet-Higgins, Willshaw, and Buneman (1970), and correlation matrix memories by Kohonen (1972).
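The linear separability limit can be illustrated with a small brute-force search. The sketch below tries every combination of two integer weights and a bias (a hypothetical grid chosen for illustration) and records the best score a single threshold unit achieves on AND and on XOR. It is an illustration, not a substitute for Minsky and Papert's proof.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]  # linearly separable
XOR_TARGETS = [0, 1, 1, 0]  # not linearly separable


def best_score(targets, values=range(-5, 6)):
    """Best number of correct outputs over a grid of weights and biases."""
    best = 0
    for w1, w2, bias in product(values, repeat=3):
        outputs = [1 if w1 * x1 + w2 * x2 + bias > 0 else 0
                   for x1, x2 in INPUTS]
        score = sum(o == t for o, t in zip(outputs, targets))
        best = max(best, score)
    return best


and_best = best_score(AND_TARGETS)
xor_best = best_score(XOR_TARGETS)
print(and_best, xor_best)
```

No finer grid would help with XOR. A correct unit would need bias ≤ 0, w1 + bias > 0, w2 + bias > 0, and w1 + w2 + bias ≤ 0. Adding the two middle inequalities gives w1 + w2 + bias > −bias ≥ 0, which contradicts the last one.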

Hopfield Networks

“The ability of large collections of neurons to perform “computational” tasks may in part be a spontaneous collective consequence of having a large number of interacting simple neurons”.
JJ Hopfield, 1982.

The modern era of neural networks began in 1982 with the Hopfield net, shown in Figure 4. Although Hopfield nets are not practically very useful, Hopfield introduced a theoretical framework based on statistical mechanics, which laid the foundations for Ackley, Hinton, and Sejnowski’s Boltzmann machine in 1985.

Figure 4. A Hopfield net with seven fully interconnected binary units.
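As a rough illustration of how such a net behaves as a content addressable memory, the sketch below stores two patterns in a seven-unit network and then recalls one of them from a corrupted cue. The ±1 coding of unit states, the Hebbian (outer-product) storage rule, the fixed update order, and the two example patterns are assumptions made for this sketch, not details taken from the article.

```python
def train(patterns):
    """Hebbian storage: strengthen weights between units that agree."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, cue, sweeps=5):
    """Update units one at a time until the state settles."""
    state = list(cue)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            state[i] = 1 if field >= 0 else -1
    return state


PATTERN_A = [1, 1, 1, 1, -1, -1, -1]
PATTERN_B = [1, -1, 1, -1, 1, -1, 1]
weights = train([PATTERN_A, PATTERN_B])

corrupted = list(PATTERN_B)
corrupted[0] = -corrupted[0]  # flip one unit to corrupt the cue
recovered = recall(weights, corrupted)
print(recovered == PATTERN_B)
```

The cue is never compared with the stored patterns directly. The recovered pattern emerges from repeated local updates, which is the sense in which the memory is addressed by content rather than by location.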

The Boltzmann Machine

“By studying a simple and idealized machine that is in the same general class of computational device as the brain, we can gain insight into the principles that underlie biological computation.”
G Hinton, T Sejnowski, and D Ackley, 1984.

Unlike a Hopfield net, in which the states of all units are specified by the associations being learned, a Boltzmann machine has a reservoir of hidden units, which can be used to learn complex associations, as shown in Figure 5. The Boltzmann machine is important because it facilitated a conceptual shift away from the idea of a neural network as a passive associative machine towards the view of a neural network as a generative model.

Figure 5. A Boltzmann machine with four input units, two hidden units, and four output units. This configuration is known as a 4–2–4 autoencoder.

The only problem is that Boltzmann machines learn at a rate that is best described as glacial (see Figure 6). But on a practical level, the Boltzmann machine demonstrated that neural networks could learn to solve complex toy (i.e. small-scale) problems, which suggested that they could learn to solve almost any problem (at least in principle, and at least eventually).

Figure 6. Training a Boltzmann machine consists of an outer loop and an inner loop using simulated annealing.

Backpropagation Neural Networks

“Until recently, learning in multilayered networks was an unsolved problem and considered by some impossible.”
T Sejnowski and C Rosenberg, 1986.

The impetus supplied by the Boltzmann machine gave rise to a more tractable method, devised in 1986 by Rumelhart, Hinton, and Williams: the backpropagation learning algorithm. In its simplest form, a backpropagation network consists of three layers of units: an input layer, which is connected by weighted connections to a hidden layer, which in turn is connected to an output layer, as shown in Figure 7.

Figure 7. A backprop neural network with two input units and one output unit. For each learned association, the error at the output is propagated back to hidden units. These errors are then used to learn correct weights between units.
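The error propagation described in the caption can be sketched in a few lines. The example below is a minimal illustration, assuming sigmoid units, three hidden units, a squared-error measure, a fixed random seed, and the XOR task; none of these choices comes from the article, and the learning rate and epoch count are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(net, x):
    """Return hidden-unit activities and the single output activity."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in net["hidden"]]
    out_w = net["output"]  # one weight per hidden unit, then a bias
    output = sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_w[-1])
    return hidden, output


def mean_squared_error(net, data):
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data) / len(data)


def train(net, data, epochs=5000, rate=0.5):
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(net, x)
            # Error signal at the output unit.
            delta_out = (output - target) * output * (1.0 - output)
            # Propagate the output error back to each hidden unit.
            delta_hidden = [delta_out * net["output"][j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
            # Adjust hidden-to-output weights and the output bias.
            for j, h in enumerate(hidden):
                net["output"][j] -= rate * delta_out * h
            net["output"][-1] -= rate * delta_out
            # Adjust input-to-hidden weights and the hidden biases.
            for j, w in enumerate(net["hidden"]):
                w[0] -= rate * delta_hidden[j] * x[0]
                w[1] -= rate * delta_hidden[j] * x[1]
                w[2] -= rate * delta_hidden[j]


random.seed(0)  # fixed seed so the run is repeatable
N_HIDDEN = 3    # assumed number of hidden units
net = {
    "hidden": [[random.uniform(-1, 1) for _ in range(3)] for _ in range(N_HIDDEN)],
    "output": [random.uniform(-1, 1) for _ in range(N_HIDDEN + 1)],
}
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

initial_error = mean_squared_error(net, XOR_DATA)
train(net, XOR_DATA)
final_error = mean_squared_error(net, XOR_DATA)
print(initial_error, final_error)
```

XOR is used here because it is the standard example of an association that a single-layer perceptron cannot learn, so any reduction in error depends on the hidden layer.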

The backpropagation algorithm is important because it demonstrated the potential of neural networks to learn sophisticated tasks in a human-like manner. Crucially, for the first time, a backpropagation neural network called NETtalk learned to ‘speak’, inasmuch as it translated text to phonemes (the basic elements of speech), which a voice synthesizer then used to produce speech. During the learning process, it was claimed that NETtalk produced outputs which were akin to babbling in human infants. This type of anthropomorphic description attracted much attention from the popular press at the time.

Reinforcement Learning

In parallel with the evolution of neural networks, reinforcement learning was developed throughout the 1980s and 1990s, principally by Sutton and Barto (see Sutton and Barto, 2018). Reinforcement learning is an inspired fusion of game playing by computers, as developed by Shannon (1950) and Samuel (1959), optimal control theory, and stimulus–response experiments in psychology. More recently, deep learning networks (see below) have been used to augment traditional reinforcement learning algorithms.

Figure 8. Reinforcement learning can be used to balance a pole on a cart by nudging it left or right. Reproduced with permission from https://github.com/david78k/pendulum.

Early results showed that hard, albeit small-scale, problems (such as balancing a pole, Figure 8) can be solved using feedback in the form of simple reward signals (Michie and Chambers, 1968; Barto, Sutton, and Anderson, 1983). More recently, reinforcement learning has been combined with deep learning to produce impressive skill acquisition, such as in the case of a glider that learns to gain height on thermals (see Figure 9).

Figure 9. Learning to soar using reinforcement learning. (a) Glider used for learning. (b) Before learning, flight is disorganised, and glider descends. (c) After learning, glider ascends to 0.6 km. Note the different scales on the vertical axes of b and c. Reproduced with permission from Guilliard et al. (2018); (b,c) from Reddy et al. (2016).

Additionally, the list of successful game-playing applications of reinforcement learning is impressive. These game-playing applications follow in the footsteps of an early success in backgammon, known as TD-Gammon (Tesauro, 1995), which was to produce the best backgammon player in the world (Sutton and Barto, 2018); the letters TD stand for temporal difference, which is a component of reinforcement learning. An intriguing aspect of TD-Gammon is that it developed a style of playing that was novel, and which was subsequently widely adopted by grandmasters of the game.
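The idea behind temporal difference learning can be sketched with a toy example. The code below uses the simplest tabular variant, TD(0), on a hypothetical three-state chain that always moves one step to the right and pays a reward of 1 on reaching the end. This is not how TD-Gammon worked (it combined a TD method with a neural network); the chain, the step size alpha, and the discount factor gamma are all assumptions chosen for illustration.

```python
def td0_chain(n_states=3, episodes=100, alpha=0.5, gamma=1.0):
    """Estimate state values on a deterministic left-to-right chain."""
    # One extra entry for the terminal state, whose value stays at 0.
    values = [0.0] * (n_states + 1)
    for _ in range(episodes):
        for state in range(n_states):
            next_state = state + 1
            reward = 1.0 if next_state == n_states else 0.0
            # Temporal difference: the gap between successive predictions.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values[:n_states]


state_values = td0_chain()
print(state_values)
```

Each state's estimate is corrected using the estimate of the state that follows it, so information about the final reward spreads backwards along the chain over repeated episodes.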

In 2016, AlphaGo beat Lee Sedol, an 18-time world champion at the game of Go (Silver et al., 2016). A year later, AlphaGo beat a team of the world’s top five players. Whereas AlphaGo initially learned from observing 160,000 human games, AlphaGo Zero learned through sheer trial and error before beating AlphaGo 100 games to none (Silver et al., 2017). Both AlphaGo and AlphaGo Zero relied on a combination of reinforcement learning and deep learning.

Like TD-Gammon, AlphaGo Zero generated novel moves that surprised and (initially) mystified human observers, but which led to successful outcomes. Just as TD-Gammon altered the strategies used by humans to play backgammon, so AlphaGo Zero is changing the strategies that humans use to play Go. So, in a sense, humans are starting to learn from machines that learn.

As far back as 1951, Alan Turing anticipated these achievements:

“Once the machine thinking method has started, it would not take long to outstrip our feeble powers.”
A Turing, 1951.

Given Turing’s prescient words, perhaps we should not be surprised at the accomplishments of neural networks. Even so, the importance of AlphaGo Zero in beating the machine (AlphaGo) that beat the human world champion cannot be overstated. We can try to rationalise the achievements of AlphaGo Zero by pointing out that it played many more games than a human could possibly play in a lifetime. But the fact remains that a computer program has learned to play a game so well that it can beat every one of the 7.7 billion people on the planet. Statistically speaking, that places AlphaGo Zero above the 99.9999999th percentile in terms of performance.

From Backprop to Deep Learning

In theory, a backprop network with just two hidden layers of units can associate pretty much any inputs with any outputs, which means that it should be able to perform most tasks. However, getting a backprop network to learn the tasks that it should be able to perform in theory is problematic. A plausible solution is to add more units to each layer and more layers of hidden units, because this should improve learning (at least in theory). In practice, it was found that conventional backprop networks struggled to learn if they had deep learning architectures like that shown in Figure 10.

Figure 10. A deep network with three hidden layers.

With the benefit of hindsight, the field of artificial intelligence stems from the research originally done on Hopfield nets, Boltzmann machines, the backprop algorithm, and reinforcement learning. However, the evolution of backprop networks into deep learning networks had to wait for three related developments: 1) much faster computers, 2) massively bigger training data sets, and 3) incremental improvements in learning algorithms.

Precisely how much each of these developments contributed to the rapid rate of progress in recent years is hard to gauge (Sejnowski, 2018). Whatever the exact cause, given the enormous resources currently being allocated to neural network research by governments and industry, it seems likely that the final neural network winter is now over.

Note: This is an edited extract from Artificial Intelligence Engines by James V Stone. Inevitably, not all contributors to the field of artificial neural networks can be included in a brief summary such as this. For a more detailed historical overview, please see Further Readings, below.

Also by James V Stone: The Emperor’s New AI?

This is an extract from the book Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning by James V Stone (2019).

https://jim-stone.staff.shef.ac.uk/AIEngines/ http://jim-stone.staff.shef.ac.uk/AIGuide

References


Barto, AG, Sutton, RS, and Anderson, CW. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Sys. Man Cyb., 13(5):834–846, 1983.

Hinton, GE, Sejnowski, TJ, and Ackley, DH. Boltzmann machines: Constraint satisfaction networks that learn. Technical report, Department of Computer Science, Carnegie-Mellon University, 1984.

Hopfield, JJ. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. USA, 79(8):2554–2558, 1982.

Kohonen, T. Correlation matrix memories. IEEE Trans. Computers, 100(4):353–359, 1972.

Lettvin, JV, Maturana, HR, McCulloch, WS, and Pitts, WH. What the frog’s eye tells the frog’s brain. Proceedings of the Institute of Radio Engineers, pages 1940–1951, 1959.

Longuet-Higgins, HC. The non-local storage of temporal information. Proc. R. Soc. Lond. B, 171(1024):327–334, 1968.

Longuet-Higgins, HC, Willshaw, DJ, and Buneman, OP. Theories of associative recall. Quarterly Reviews of Biophysics, 3(2):223–244, 1970.

McCulloch, WS and Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophysics, 5:115–133, 1943.

Michie, D and Chambers, RA. BOXES: An experiment in adaptive control. Machine Intelligence, 2(2):137–152, 1968.

Minsky, M and Papert, S. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.

Rosenblatt, F. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

Rumelhart, DE, Hinton, GE, and Williams, RJ. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

Samuel, AL. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.

Sejnowski, TJ and Rosenberg, CR. NETtalk. Complex Systems, 1(1), 1987.

Sejnowski, TJ. The Deep Learning Revolution. MIT Press, 2018.

Shannon, CE. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, 1950.

Silver, D et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Silver, D et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Sutton, RS and Barto, AG. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

Further Readings

Sejnowski, TJ. The Deep Learning Revolution. MIT Press, 2018. This book gives a personal history, from a scientist who has played a pivotal role in the development of neural network algorithms.

James V Stone is an Honorary Associate Professor at the University of Sheffield, England. Published books: jim-stone.staff.shef.ac.uk/books.html
