Voice Recognition Explodes

CREDIT: The Deep Learning Revolution, by Terrence J. Sejnowski

CREDIT: https://en.wikipedia.org/wiki/Deep_learning

====================

Sadly, I was giving up on voice recognition just as it was emerging. I gave up, after 30 years of waiting, around 1995. Bad idea.

Voice recognition stopped being merely cute and exploded onto the world scene in the late 1990s. It took almost two decades to commercialize, but the technologies born in the late 1990s now deliver commercial-grade results.

Why then? Why the late 1990s?

In reading “The Deep Learning Revolution”, by Terrence J. Sejnowski, I learned why: an underlying technology called “deep learning” had come of age.

“Deep learning” was born in the late 1990s, but the research leading up to the term goes back to the 1980s.

It turns out that new approaches to deep learning have broad applicability. But one application that has broken into mass commercialization is… voice recognition.

To grasp the scale of the improvement, consider this: in 2015, Google Voice Search reportedly saw a dramatic 49% jump in performance.

Or consider this: all major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are based on deep learning.

The recent history traces back to breakthroughs at SRI International in the late 1990s. The research arms of the NSA and DARPA needed answers, and to get them, they turned to SRI. SRI made the key breakthroughs: it cracked speaker recognition (identifying who is speaking) at that time. It failed, however, to crack speech recognition (identifying what is being said). That came later.

Meanwhile, important papers published in the late 1990s laid the groundwork for solving the nagging problems of speech recognition with deep learning. The key method was a type of recurrent neural network called long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997).

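Deep learning for speech recognition came later, in the early 21st century. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later, it was combined with connectionist temporal classification (CTC), a training method that lets a network learn from audio that has not been aligned word-by-word with its transcript, in stacks of LSTM recurrent neural networks (RNNs).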

Google Voice Search’s 2015 jump drew on “CTC-trained LSTM”. In other words, the LSTM technology born in the late 1990s had yielded commercial-grade results by 2015.

Today, lay people experience the power of speech recognition by using “Siri” or the voice transcription features on their iPhones. Most users have noticed the vast improvements of the last several years. These improvements are due to deep learning.