Big Data shakes up the Speech Industry

I spent a few hours at the Mobile Voice conference and left with an appreciation of Google’s impact on the speech industry. Google’s speech offerings loomed over the few sessions I attended. Some of that was probably due to Michael Cohen’s keynote (1) describing Google’s philosophy and approach, but clearly Google has the attention of all the speech vendors. Tim’s recent blog post on the emerging Internet Operating System captured the growing importance of networked applications that rely on massive amounts of data, and it was interesting to observe its impact on an industry in person. (Google’s speech and language technologies were among the examples Tim cited.)

Google thinks of seamless voice-driven interfaces as having two key features: (1) ubiquitous availability, so users can access speech interfaces from any app and on any device, and (2) high performance, so speech technologies lead to frictionless user interactions. To produce and deliver ubiquitous, high-performance speech interfaces, Michael Cohen explained, Google relies on the same big data systems that are key to how it develops all of its services.

Having speech technologies in the cloud lets Google iterate quickly and push enhanced speech engines on a regular basis. More importantly, their speech engines learn and get trained using real data from their many interconnected services. Speech engines typically rely on both language and acoustic models. Language models are statistical models of word sequences and patterns. Cohen pointed out that their language models use data collected from web searches, giving them access to an ever-growing corpus that few can match (230 billion words collected, refined to a vocabulary of the million most common words). Cohen disclosed that some of the more recent acoustic models they’re evaluating are built using unsupervised machine-learning algorithms. (These are speech algorithms trained on recorded speech that hasn’t been transcribed by hand.) While he coyly avoided explaining how an accurate system can be built from unsupervised techniques, it’s likely they use data from their 411 service (something Tim predicted three years ago). [Update (4/27): Readers point to YouTube and Google's Voice Search on smartphones as likely sources of data for tuning speech engines.]
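To make the “statistical models of word sequences” idea concrete, here is a toy Python sketch of a bigram language model: count how often word pairs occur in a corpus, then score candidate transcriptions by how plausible their word sequences are. This is only an illustration of the general technique (with made-up data and add-one smoothing), not Google’s implementation; the point is that the more text you can count over, the better the estimates get.

```python
from collections import Counter

# Toy "corpus"; a real system would count over billions of words of search-query text.
corpus = "recognize speech with a speech engine that can recognize speech".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word, vocab_size, smoothing=1.0):
    """P(word | prev) with add-one smoothing so unseen pairs still get a small probability."""
    return (bigrams[(prev, word)] + smoothing) / (unigrams[prev] + smoothing * vocab_size)

def sequence_score(words):
    """Product of bigram probabilities; higher means a more plausible word sequence."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word, vocab_size=len(unigrams))
    return score

# The model prefers word sequences it has seen over acoustically similar nonsense.
print(sequence_score("recognize speech".split()))      # relatively high
print(sequence_score("wreck a nice beach".split()))    # relatively low
```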

Of course, having access to relevant real-world, user-generated data is pointless if one can’t operate at large scale. Fortunately, Google pioneered many of the recently popular big data management and parallel computing technologies, so they’re probably the company best equipped to use large-scale data. Big data technologies are essential pieces of infrastructure that Google engineers tap into. In fact, their speech algorithms wade through massive amounts of data on a regular basis, resulting in a virtuous cycle of refinements.
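For a sense of what “wading through massive amounts of data” looks like in practice, here is a toy, single-machine sketch of the map/reduce pattern Google popularized, applied to bigram counting (the raw material for a language model). This is purely illustrative: a real system shards the map and reduce steps across thousands of machines, but the shape of the computation is the same.

```python
from collections import defaultdict
from itertools import chain

# Stand-in "documents"; in practice these would be shards of web-scale query logs.
documents = [
    "directions to the nearest coffee shop",
    "directions to the airport",
]

def map_phase(doc):
    """Emit a (bigram, 1) pair for every adjacent word pair in one document."""
    words = doc.split()
    return [((w1, w2), 1) for w1, w2 in zip(words, words[1:])]

def reduce_phase(pairs):
    """Sum the counts emitted for each bigram across all documents."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

bigram_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in documents))
print(bigram_counts[("directions", "to")])  # -> 2
```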

There are situations where embedded speech engines make sense (e.g., speech-enabled navigation systems should still work in the “middle of nowhere”). But Google’s access to relevant data and their big data skills make their general-purpose, cloud-based (2) speech engine formidable. Hybrid systems that use cloud services when available, and otherwise fall back to embedded speech engines, were mentioned frequently at the conference. This is great news for players like Nuance that have both embedded and cloud engines. But as network connections become more reliable and ubiquitous, Google’s cloud-based (and big data driven) speech engines are going to get harder to beat. In recent years many speech companies have amassed lots of data, but in Google they face a competitor that leverages web-scale data.
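The hybrid approach that kept coming up at the conference is straightforward to picture. Below is a rough Python sketch of the fallback logic, assuming hypothetical cloud and embedded recognizer objects (none of this corresponds to any vendor’s actual API): prefer the cloud engine when the network is available, and fall back to the embedded engine otherwise.

```python
class HybridRecognizer:
    """Rough sketch of a hybrid recognizer; the engines passed in are hypothetical."""

    def __init__(self, cloud_engine, embedded_engine, network_available):
        self.cloud = cloud_engine            # e.g., a network-backed recognizer
        self.embedded = embedded_engine      # e.g., a small on-device recognizer
        self.network_available = network_available  # callable returning True/False

    def transcribe(self, audio):
        if self.network_available():
            try:
                # Cloud engine: bigger models, regularly retrained on fresh data.
                return self.cloud.transcribe(audio)
            except ConnectionError:
                pass  # connection dropped mid-request; fall through to embedded
        # Embedded engine: smaller vocabulary, but works in the "middle of nowhere."
        return self.embedded.transcribe(audio)
```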

Microsoft, with its search engine and call center data, speech products, and research group, is also a major player. It just isn’t clear whether they are using data from their interconnected services to benefit their speech products as efficiently as Google does.

(1) Michael Cohen was one of the founders of Nuance and is currently the Manager of Speech Technology at Google.

(2) Some of Google’s speech engines are easily accessible (at least on Android) through simple APIs.

  • Troy McConaghy

    Google’s 411 service isn’t their only source of data for building their automatic speech recognition (ASR) technology.

    YouTube has a feature called auto-timing, where if you upload a video plus a text transcript, it will automatically figure out the time codes (i.e. when to show each bit of transcript text in the captions).

    Auto-timing is a cool feature for everyday YouTube users, but it’s also a great source of audio file + transcript pairs: very useful for training and testing speech recognition systems. I don’t know if they’re *actually* using that data to improve their ASR systems, but they certainly could.

  • Jose Marinez

    Don’t forget the “Voice Search” feature on Android and iPhone as a source of data. I bet this is the bulk of it as well as the voicemail from Google Voice.

  • Bill Meisel

    As a co-organizer of the conference, I compliment you on getting to the gist of the matter so accurately. This area will heat up over the next year to boiling, as exemplified by the Apple acquisition of Siri, a voice-enabled “concierge,” after the conference. (Siri had a presentation at the conference.) In essence, I foresee a fight over who will provide us with our “personal assistant” that we ask for what we want in flexible language (extending the basic search paradigm), either by speaking or typing, whichever is most convenient.

  • Pearson Cummings

    I work on the Microsoft Tellme team and I thought I would weigh in on the last point in your post. For background, the Microsoft Tellme team includes all of the company’s speech recognition assets.

    Microsoft Tellme processes over 11 billion (yep, that’s a “b”) voice requests each year in the cloud and uses those to improve the Microsoft Tellme speech platform. Even before they were part of Microsoft, Tellme made a big bet on the cloud, so the infrastructure is mature and extremely effective.

    Additionally, because we created a unified network that includes Tellme’s speech self-service, the voice search in Bing for mobile, and the voice recognition in the Ford Sync (just to name a few), we’ve been able to build the largest cloud-based speech platform in the industry, one that learns from the 11 billion voice requests it processes each year.

    Just thought I would answer the question at the end of your post. Hopefully I added a little bit extra to an interesting and informative story.