Deep Learning

Singularity, Deep Learning, and AI

DRAFT: January, 2019

CREDIT: The Deep Learning Revolution, by Terrence J. Sejnowski



Sadly, I gave up on speech recognition just as it was emerging. After 30 years of waiting, I quit around 1995. Bad idea.

Speech recognition stopped being just cute and exploded onto the world scene during the late 1990s. It has taken almost two decades to commercialize, but the technologies birthed in the late 1990s have now yielded commercial-grade results.

Why then? Why the late 1990s?

In reading “The Deep Learning Revolution”, by Terrence J. Sejnowski, I learned why: an underlying technology called “deep learning” had come of age.

“Deep learning” was birthed in the late 1990s, but the research leading up to the term goes back to the 1980s (and the foundations of all this go back to 1965).

Many trace the current revolution in deep learning to October 2012, when researchers won a large-scale ImageNet competition. A similar approach won that year’s ICPR contest on the analysis of large medical images for cancer detection.

In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced, following a similar trend in large-scale speech recognition. The Wolfram Image Identification project publicized these improvements.

Image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.

In 2015, spectacular practical applications began to burst onto the scene. Speech recognition was one; facial recognition was a second. Pattern recognition can now identify cats, dogs, and dog breeds, and it powers applications that allow medical diagnosticians to improve their diagnoses.

Today, applications address computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection, and board game programs, where they have produced results comparable to, and in some cases superior to, human experts.

It turns out that the new approaches to Deep Learning have broad applicability. But one application that has broken into mass commercialization is … speech recognition. These breakthroughs trace back to results in “speaker recognition” achieved at SRI.

To understand the massive improvements, consider this: In 2015, Google Voice Search experienced a dramatic performance jump of 49%.

Or consider this: All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are based on deep learning.

More on speaker recognition: The recent history traces back to breakthroughs at SRI in the late 1990s. The research arms of NSA and DARPA needed answers, and to get them they turned to SRI International. SRI made the biggest breakthroughs, cracking “speaker recognition” at that time. They failed, however, to crack “speech recognition”. That came later, around 2003.

Specifically, important papers were published in the late 1990s describing how deep learning could solve the nagging issues of speaker and speech recognition.

The deep learning method used was called long short-term memory (LSTM). (Hochreiter and Schmidhuber, 1997.)
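
To give a feel for what an LSTM actually does, here is a minimal sketch of a single LSTM cell in plain Python. The weights are hand-picked for illustration only (real systems learn them, and use vectors rather than scalars); the point is how the forget, input, and output gates let the cell carry a memory across time steps:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step for a scalar cell.

    W holds (input_weight, hidden_weight, bias) for each of the four
    parts: forget gate, input gate, candidate value, output gate.
    """
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    c = f * c_prev + i * g   # cell state: keep some old memory, add some new
    h = o * math.tanh(c)     # hidden state: a gated view of the memory
    return h, c

# Illustrative weights; in practice these are learned by backpropagation.
W = {"f": (0.0, 0.0, 2.0),   # bias of 2.0: mostly remember the old state
     "i": (1.0, 0.0, 0.0),
     "g": (1.0, 0.0, 0.0),
     "o": (0.0, 0.0, 2.0)}

h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0]:   # a single spike of input, then silence
    h, c = lstm_step(x, h, c, W)
print(c)                          # the spike is still held in the cell state
```

With the forget gate biased toward “remember”, the input spike at the first time step is still present in the cell state several steps later – exactly the kind of long-range memory that made LSTM useful for speech.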

Deep learning for speech recognition came later, in the early 21st century. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later it was combined with connectionist temporal classification (CTC) in stacks of LSTM RNNs.

Google Voice Search drew upon “CTC-trained LSTM” – in other words, the LSTM technologies birthed in the late 1990s had by 2015 yielded commercial-grade results.

Today, lay people understand the power of speech recognition by using “Siri” – or by using the voice transcription technologies on their iPhones. Everyone has noted the vast improvements in the last several years. All of these improvements are due to Deep Learning. 

Let me step back at this point and trace the breakthroughs by researchers. I begin with a glossary:

AI – Artificial Intelligence

ANN – Artificial Neural Networks

DNN – Deep Neural Networks – a variant of artificial intelligence in which software “learns to recognize patterns in distinct layers”

RNN – Recurrent Neural Networks

“Deep” – The “deep” in “deep learning” refers to the number of layers through which the data is transformed.

“Layers” – each layer represents a level of abstraction that allows the machine to separate like data from unlike data (the machine classifies). Each successive layer uses the output from the previous layer as its input. Deep learning helps to disentangle these abstractions and pick out which features improve performance.
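
The idea of successive layers can be sketched in a few lines of plain Python. The weights below are hand-picked for illustration only; the point is that each layer consumes the previous layer’s output and produces a more abstract representation:

```python
import math

def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum of the inputs per unit, squashed by a sigmoid."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(1.0 / (1.0 + math.exp(-z)))   # sigmoid nonlinearity
    return outputs

# Illustrative, hand-picked weights: 3 raw inputs -> 2 hidden units -> 1 output.
x = [0.5, -1.0, 2.0]                                       # raw input features
h = dense_layer(x, [[1.0, 0.5, -0.5], [-1.0, 2.0, 0.5]], [0.0, 0.1])
y = dense_layer(h, [[2.0, -1.5]], [0.0])                   # consumes the layer below
print(h)   # the intermediate, more abstract representation
print(y)   # the final classification score
```

A “deep” network is simply many such layers stacked, each feeding the next.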

Pattern Recognition

Image Recognition (In 2011, deep learning-based image recognition became “superhuman”, producing more accurate results than human contestants.)

Speech Recognition (and ASR – Automatic Speech recognition)

Speaker Recognition (In 1998, deep learning-based speaker recognition was proven to be effective)

Visual Recognition – recognizing objects, faces, handwritten ZIP codes, etc.

Facial Recognition – for example, Facebook’s AI lab performs tasks such as automatically tagging uploaded pictures with the names of the people in them.

Object recognition – (in 1992, a method of extracting 3-D objects from a cluttered scene)

Medical Imaging – where each neural-network layer operates both independently and in concert, separating aspects such as color, size, and shape before integrating the outcomes

Deep Learning Techniques

Supervised – uses classifications (learns from examples labeled by humans)

Unsupervised – uses pattern recognition (finds structure in unlabeled data, without human assistance)
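
A toy contrast between the two, in plain Python with made-up one-dimensional data: the supervised learner is handed labels and fits a decision threshold, while the unsupervised learner must discover the same grouping on its own (here, via a tiny 2-means clustering loop):

```python
# Toy 1-D data: two natural groups, around 1.0 and around 8.0.
points = [0.9, 1.1, 1.0, 7.9, 8.2, 8.0]

# Supervised: labels are provided, so we can fit a decision threshold directly.
labels = [0, 0, 0, 1, 1, 1]
mean0 = sum(p for p, l in zip(points, labels) if l == 0) / labels.count(0)
mean1 = sum(p for p, l in zip(points, labels) if l == 1) / labels.count(1)
threshold = (mean0 + mean1) / 2   # classify new points against this boundary

# Unsupervised: no labels -- a few iterations of 2-means discover the groups.
c0, c1 = points[0], points[1]     # arbitrary initial cluster centers
for _ in range(10):
    cluster0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
    cluster1 = [p for p in points if abs(p - c0) > abs(p - c1)]
    if cluster0:
        c0 = sum(cluster0) / len(cluster0)
    if cluster1:
        c1 = sum(cluster1) / len(cluster1)

print(threshold)          # supervised boundary, near 4.5
print(sorted([c0, c1]))   # unsupervised centers, near 1.0 and 8.0
```

Both end up describing the same two groups; the difference is whether a human supplied the answer key.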

Backpropagation (backprop) – passing error information in the reverse direction through the network and adjusting the weights to reflect that information.
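
Here is a minimal, illustrative backpropagation loop in plain Python: a tiny 1-2-1 network, toy data, and hand-picked starting weights. The error at the output is passed backward through each layer, and the weights are nudged accordingly:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data and a tiny 1-2-1 network: one input, two sigmoid hidden
# units, one linear output.  All numbers are illustrative.
data = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]   # (input, target) pairs
w1, b1 = [0.5, -0.5], [0.0, 0.0]              # input -> hidden weights
w2, b2 = [0.5, 0.5], 0.0                      # hidden -> output weights
lr = 0.1                                      # learning rate

def loss():
    total = 0.0
    for x, t in data:
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
        y = w2[0] * h[0] + w2[1] * h[1] + b2
        total += (y - t) ** 2
    return total

first = loss()
for _ in range(5000):
    for x, t in data:
        # Forward pass: compute the prediction.
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
        y = w2[0] * h[0] + w2[1] * h[1] + b2
        # Backward pass: push the error back through the layers.
        dy = 2.0 * (y - t)                    # dLoss/dOutput
        for j in range(2):
            dh = dy * w2[j]                   # error reaching hidden unit j
            dz = dh * h[j] * (1.0 - h[j])     # through the sigmoid's slope
            w2[j] -= lr * dy * h[j]           # adjust output-layer weight
            w1[j] -= lr * dz * x              # adjust input-layer weight
            b1[j] -= lr * dz
        b2 -= lr * dy
print(first, loss())   # the squared error falls as the weights adjust
```

The “reverse direction” is visible in the backward pass: the output error `dy` is multiplied back through the output weights and the sigmoid’s slope before it reaches the first layer’s weights.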

LSTM – long short-term memory 

CTC – connectionist temporal classification

CAP – credit assignment path – the chain of transformations from input to output. CAPs describe potentially causal connections between input and output.

More precisely, deep learning systems are distinguished by a substantial credit assignment path (CAP) depth.


TAMER – in 2008, proposed new methods for robots or computer programs to learn how to perform tasks by interacting with a human instructor.

TAMER (Deep TAMER) – in 2018, a new algorithm using deep learning to give a robot the ability to learn new tasks through observation (the robot learns a task with a human trainer, watching video streams or observing a human perform the task in person). The robot practices the task with the help of some coaching from the trainer, who provides feedback such as “good job” and “bad job.”

CRESCEPTRON – in 1991, a method for performing 3-D object recognition in cluttered scenes.


GPU – in 2009, Nvidia graphics processing units (GPUs) were used by Google Brain to create capable DNNs. This increased the speed of deep-learning systems by about 100 times.

Training Sets

TIMIT (automatic speech recognition training set)

MNIST (image classification training set)

The MNIST database is composed of handwritten digits. It includes 60,000 training examples and 10,000 test examples. As with TIMIT, its small size lets users test multiple configurations.

With this glossary, a few simple statements can pinpoint why the current revolution is exploding:

Hardware has advanced, thanks to GPUs applied to deep learning starting in 2009.

Software has advanced, thanks to GPU-based successes in cancer image identification in 2012. 

Pattern recognition has advanced, with speech recognition leading the way. The TIMIT training set has allowed rapid progress, especially by 2015.

Robotics has advanced, thanks to Deep TAMER breakthroughs in 2018.