Deep Learning and Voice Recognition

Singularity, Deep Learning, and AI

DRAFT: January, 2019

CREDIT: The Deep Learning Revolution, by Terrence J. Sejnowski

CREDIT: https://en.wikipedia.org/wiki/Deep_learning

====================

Sadly, I was giving up on voice recognition just as it was emerging. I gave up, after 30 years of waiting, around 1995. Bad idea.

Voice recognition stopped being just cute and exploded onto the world scene during the late 1990s. It has taken almost two decades to commercialize, but the technologies birthed in the late 1990s have now yielded commercial-grade results.

Why then? Why the late 1990s? Because an underlying technology called “deep learning” had come of age. Because the research arms of the NSA and DARPA needed answers. Because, to get those answers, they turned to SRI International. SRI made the biggest breakthroughs: it cracked “speaker recognition” at that time. It failed, however, to crack “speech recognition”. That came later.

Specifically, important papers were published in the late 1990s describing how deep learning could solve the nagging issues facing speaker and voice recognition. A key deep learning method was long short-term memory (LSTM), a kind of recurrent neural network (RNN) designed to retain information across long sequences (Hochreiter and Schmidhuber, 1997).
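To make the “long short-term memory” idea concrete, here is a minimal Python sketch of one time step of a single LSTM unit. It is an illustration under simplifying assumptions, not the 1997 formulation: inputs and state are scalars, and it includes a forget gate, which was added to the design after the 1997 paper. The function name, gate names and weight dictionary are hypothetical. The point to notice is the additive cell-state update, c = f*c_prev + i*g, which lets a value persist over many steps while the forget gate stays near 1.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a hypothetical single-unit LSTM (scalar input and state).

    w maps each gate name to an (input_weight, recurrent_weight, bias) triple.
    """
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h_prev + b)

    f = gate("forget", sigmoid)       # fraction of the old cell state to keep
    i = gate("input", sigmoid)        # fraction of the new candidate to write
    o = gate("output", sigmoid)       # fraction of the cell state to expose
    g = gate("candidate", math.tanh)  # candidate value to write
    c = f * c_prev + i * g            # additive cell-state update (the "memory")
    h = o * math.tanh(c)              # new hidden state
    return h, c

# Usage: a forget gate biased open and an input gate biased shut keep c near 1.0.
zero = {k: (0.0, 0.0, 0.0) for k in ("forget", "input", "output", "candidate")}
remember = dict(zero, forget=(0.0, 0.0, 10.0), input=(0.0, 0.0, -10.0))
h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.5, 0.1]:
    h, c = lstm_step(x, h, c, remember)
print(round(c, 3))  # prints 1.0
```

With all weights at zero, every sigmoid gate sits at 0.5, so the stored value halves each step instead; the learned gates are what decide how long the “short-term” memory lasts.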

Deep learning for speech recognition came later. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later, it was combined with connectionist temporal classification (CTC) in stacks of LSTM RNNs.
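CTC lets a network trained on unsegmented audio emit one label, or a special “blank”, per time frame, without a frame-by-frame alignment of the transcript. The Python sketch below shows only the decoding side, under simplifying assumptions: the rule that maps a frame-level path to text (merge repeated labels, then drop blanks) and greedy “best path” decoding, which picks the most probable label in each frame. The blank symbol "_", the toy alphabet and the probability table are hypothetical, and the CTC training loss, which sums over every path that collapses to the target transcript, is not shown.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC "blank" label

def ctc_collapse(path):
    """Map a frame-level label path to output text using the CTC rule:
    merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != BLANK)

def greedy_decode(probs, alphabet):
    """Best-path decoding: take the most probable label in each frame, then collapse."""
    path = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in probs]
    return ctc_collapse(path)

# Toy per-frame probabilities over the hypothetical alphabet [blank, c, a, t].
alphabet = [BLANK, "c", "a", "t"]
probs = [
    [0.1, 0.7, 0.1, 0.1],  # most likely "c"
    [0.2, 0.6, 0.1, 0.1],  # "c" again: a repeat, merged away
    [0.6, 0.1, 0.2, 0.1],  # blank
    [0.1, 0.1, 0.7, 0.1],  # "a"
    [0.1, 0.1, 0.1, 0.7],  # "t"
]
print(greedy_decode(probs, alphabet))   # prints cat
print(ctc_collapse("hh_ee_ll_ll_oo"))   # prints hello
```

The blank is what lets CTC spell a doubled letter: without it, the path “hheelllloo” would collapse to “helo”.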

Voice recognition today is widely commercialized. In 2015, Google Voice Search reportedly experienced a dramatic performance jump of 49%. Google drew upon “CTC-trained LSTM”; in other words, the LSTM technologies birthed in the late 1990s had yielded commercial-grade results by 2015.

All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are based on deep learning.

“Deep learning” was birthed in the late 1990s, but the research leading up to the term goes back to the 1980s.