Voice in Google Mobile App: A Tipping Point for the Web?

As I wrote in “Daddy, Where’s Your Phone?”, it’s time to start thinking of the phone as a first-class device for accessing web services, not as a way of repurposing content or applications originally designed to be accessed on a keyboard and big screen. The release of speech recognition in Google Mobile App for iPhone continues the process begun with the iPhone itself: building a new, phone-native way of delivering computing services. Here are two of the key elements:

  1. Sensor-based interfaces. Apple wowed us with the iPhone’s touch screen, but the inclusion of the accelerometer was almost as important, and now Google has shown us how it can be used as a key component of an application user interface. Put the phone to your ear, and the application starts listening, triggered by the natural gesture rather than by an artificial tap or click. Yes, the accelerometer has been used in games like Tilt and parlor amusements like iPint, but Google has pushed things further by integrating it into a kind of workflow with the phone’s main sensor, the microphone.

    This is the future of mobile: to invent interfaces that throw away the assumptions of the previous generation. Point and click was a breakthrough for PCs, but it’s a trap for mobile interface design. Right now, the iPhone (and other similar smartphones) have an array of sensors: the microphone, the camera, the touchscreen, the accelerometer, the location sensor (GPS or cell triangulation), and yes, on many, the keyboard and pointing device. Future applications will surprise us by using them in new ways, and in new combinations; future devices will provide richer and richer arrays of senses (yes, senses, not just sensors) for paying attention to what we want.

    Could a phone recognize the gesture of raising the camera up and then holding it steady to launch the camera application? Could we talk to the phone to adjust camera settings? (There’s a constrained language around lighting and speed and focus that should be easy to recognize.) Could a phone recognize the motion of a car and switch automatically to voice dialing? And of course, there are all the Wii-like interactions with other devices that are possible when we think of the phone as a controller. Sensor-based workflows are the future of UI design.

  2. Cloud integration. It’s easy to forget that the speech recognition isn’t happening on your phone. It’s happening on Google’s servers. It’s Google’s vast database of speech data that makes the speech recognition work so well. It would be hard to pack all that into a local device.

    And that of course is the future of mobile as well. A mobile phone is inherently a connected device with local memory and processing. But it’s time we realized that the local compute power is a fraction of what’s available in the cloud. Web applications take this for granted — for example, when we request a map tile for our phone — but it’s surprising how many native applications settle themselves comfortably in their silos. (Consider my long-ago complaint that the phone address book cries out to be a connected application powered by my phone company’s call-history database, annotated by data harvested from my online social networking applications as well as other online sources.)
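As a thought experiment, the raise-to-ear trigger could be sketched as a small decision over recent sensor readings. The thresholds, the `SensorFrame` structure, and the combination of tilt plus proximity below are my own illustrative assumptions, not Google’s actual implementation:

```python
# Illustrative sketch of a raise-to-ear trigger: combine accelerometer
# and proximity readings into a single "start listening" decision.
# All values and thresholds here are invented for illustration.

from dataclasses import dataclass

@dataclass
class SensorFrame:
    tilt: float        # degrees from horizontal, from the accelerometer
    proximity: bool    # True when something covers the earpiece sensor

def should_listen(frames: list[SensorFrame]) -> bool:
    """Fire only when the phone has been upright AND held against the
    ear for the last few readings, so a bump in a pocket or a single
    noisy sample does not trigger listening."""
    recent = frames[-3:]
    return len(recent) == 3 and all(
        f.tilt > 60.0 and f.proximity for f in recent
    )

idle = [SensorFrame(5.0, False)] * 5          # phone flat on a table
raised = idle + [SensorFrame(75.0, True)] * 3  # lifted to the ear
print(should_listen(idle), should_listen(raised))  # -> False True
```

The interesting design point is the debouncing: a gesture interface has to integrate a short window of readings rather than react to any single sensor value.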

Put these two trends together, and we can imagine the future of mobile: a sensor-rich device with applications that use those sensors both to feed and interact with cloud services. The location sensor knows you’re here so you don’t need to tell the map server where to start; the microphone knows the sound of your voice, so it unlocks your private data in the cloud; the camera images an object or a person, sends it to a remote application that recognizes it, and retrieves relevant data. All of these things already exist in scattered applications, but eventually, they will be the new normal.
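That sensor-to-cloud pipeline can be sketched with the server side reduced to a stub. The function names and the canned result are hypothetical; the point is only that the phone ships raw sensor data and the intelligence lives remotely:

```python
# Illustrative sketch: the handset only captures and uploads sensor
# data; recognition happens server-side, where the big models and
# datasets live. Both functions below are stand-ins.

def capture_audio() -> bytes:
    """Stand-in for the phone's microphone buffer."""
    return b"\x00\x01\x02\x03"  # pretend PCM samples

def recognize_in_cloud(audio: bytes) -> str:
    """Stand-in for the remote service; a real one would run acoustic
    and language models trained on a vast corpus of speech data."""
    return "pizza near union square" if audio else ""

def voice_query() -> str:
    # The client does no recognition at all; it just moves bytes.
    return recognize_in_cloud(capture_audio())

print(voice_query())  # -> pizza near union square
```

Everything below `capture_audio` could be swapped out server-side without touching the phone, which is exactly why the cloud half of the system can keep learning while the device stays the same.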

This is an incredibly exciting time in mobile application design. There are breakthroughs waiting to happen. Voice and gesture recognition in the Google Mobile App is just the beginning.

  • Tim,

    I completely agree – much of the advance with the iPhone has been based on the hardware; now software is really exploiting the possibilities.

    The reign of the GUI, the mouse and the finger, may be coming to an end.

    Hints of the Sensory Operating System (SOS) have been around for a while.

    Ray Kurzweil made some good comments back in the spring about this.

    I scribbled some notes on this here too

  • bowerbird

    sounds great. as long as it’s affordable to the masses.


  • Agreed, for too long interactions with phones have been based upon our past understanding of human-computer interaction. It’s analogous to how television was first used to broadcast radio plays, where we could watch the people standing around the microphones. Voice certainly has huge potential; combine that with the contextual information phones have about your location, status, and so on, and they can become far more useful entry points to the cloud than they currently are. I actually think the tipping point will come when we have perpetually connected mobile devices; that is when the real innovation and the next wave will come. When you have a billion devices with a perpetual connection, a la the “Evernet”, there will be enormous innovation in how we interact with everything.

  • This was a really good and thought-provoking post.

    >This is an incredibly exciting time in mobile application design.

    Sure is.

    This bit:

    >It’s easy to forget that the speech recognition isn’t happening on your phone. It’s happening on Google’s servers. It’s Google’s vast database of speech data that makes the speech recognition work so well.

    combined with this from Karl Long (comment above):

    >Voice certainly has huge potential

    led me to think:

    What if Google were to open up their speech recognition capability and that speech database (that makes it work well), to developers via APIs? It could potentially lead to a huge surge in innovative speech-recognition related applications. Who knows, they may have plans in this area already …

  • Bradley Mazurek

    Another application that uses sensors in novel ways is HappyWakeUp (http://www.happywakeup.com) for Nokia mobile phones.

    Twenty minutes before your alarm time, the phone begins to listen. If you begin to roll around or stir, it triggers the alarm. If you don’t move before the scheduled wake up time, it triggers the alarm anyway.

    Worst case, you’re no worse off than with a normal alarm.

    Usually, though, the phone will trigger the alarm when you’re in a more wakeful state…a state in which you’re more amenable to being woken up.

    Not a networked sensor, but a sensor-enhanced experience.
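For what it’s worth, the wake-window logic described in this comment reduces to a short loop. The 20-minute window, the movement threshold, and the minute-indexed samples are assumptions for illustration only, not HappyWakeUp’s actual code:

```python
# Sketch of the HappyWakeUp idea: during a window before the scheduled
# alarm, fire early if movement is detected; otherwise fall back to
# the scheduled time. Threshold and units are hypothetical.

WINDOW_MINUTES = 20
MOVEMENT_THRESHOLD = 0.5  # hypothetical movement-level units

def wake_minute(movement_by_minute: list[float], alarm_minute: int) -> int:
    """Return the minute of the day at which the alarm should fire."""
    start = alarm_minute - WINDOW_MINUTES
    for minute in range(start, alarm_minute):
        if movement_by_minute[minute] > MOVEMENT_THRESHOLD:
            return minute        # you stirred: wake now, slightly early
    return alarm_minute          # worst case: the normal alarm time

still = [0.0] * 480
restless = still[:]
restless[470] = 0.9              # stirring 10 minutes before minute 480
print(wake_minute(still, 480), wake_minute(restless, 480))  # -> 480 470
```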

  • george

    It’s a little nitpicky, but I think it’s the proximity sensor that Google is accessing to turn on the voice feature, not the accelerometer. This doesn’t really discount anything you’ve said here, though.

  • Vic Gundotra

    We are using both the proximity and accelerometer as signals.

    VP, Engineering, Google

  • Tim,

    If mobiles can start using multiple senses for input and output, it can have amazing effects on the inclusion of those with disabilities.

    Also, Ray Kurzweil earlier described using mobiles for real-time speech translation: although the mobile has such limited processing ability, it can use the power of the cloud to carry out the translations.

  • Sensors (and especially combinations of sensors) are changing not only mobile phones, but also environments and more traditional appliances and consumer electronics as well. And hey, this is what my new O’Reilly book, Designing Gestural Interfaces, is all about.


    For an interesting take on a device changing based on how it’s held, check out the Bar of Soap project at MIT.

  • I wonder if there is something deeper than technology going on here?

    It’s easy to focus on just one sensor or technology. GPS, for example, seems to be regarded as a separate beast. But, as Tim describes, real magic happens when this stuff works together. People, collectively, make technologies work together for individual purposes. OK, that’s obvious.

    There’s a less-obvious social change afoot. It seems to me that there is a new ethic rising out of collective efforts — Open Source and Google’s practice of publishing their innovations are examples. This unnamed ethic values and rewards contribution to the common good. Sometimes the rewards are greater than those for traditional hoard-the-IP models.

    I wonder if a shift to a new ethic is the tipping point rather than anything to do with technology?

  • @David Sonnen: As noted, Google does the speech parsing at its servers. You and I don’t have access to that. The “common good” would be Apple making speech recognition (and parsing for various knowledge domains) part of the iPhone OS so that it’s available to all apps, not just one.

    I’m not complaining about Google providing this service for “free” but let’s recognize the fact that you too can be beneficent with your IP if it takes a gazillion servers to make any meaningful use of it.

  • With these kinds of things, the integration of cloud computing with voice search and much more to come, we can imagine a future that is easy and sophisticated for our children.

  • @Kontra –

    But with local speech on the phone, it’s likely that you will get inferior results. You won’t have the characteristic that Google has, of being a massively distributed dynamic learning engine.

    That’s one of the key drivers of Web 2.0. Some things are simply better when everyone is connected.

  • Richard N.

    > Apple wowed us with iPhone touch screen

    Are you serious? You were blown away by the touchscreen? My cell phone is 6+ years old, and it has had a full-screen touchscreen for a long, long time. Yes, it’s running Windows Mobile (PocketPC), but touchscreens are nothing new. Apple just put the touchscreen in a prettier box, but where have you been? Under a rock?

    I’ve got this neat device called a web browser that might wow you.

  • Todd

    “…Put these two trends together, and we can imagine the future of mobile: a sensor-rich device with applications that use those sensors both to feed and interact with cloud services.”

    Home run! Finally…I was beginning to think I was the only one who came to this conclusion, like I am that crazy guy on the street corner with the “REPENT” sign.

  • @Tim O’Reilly: “But with local speech on the phone, it’s likely that you will get inferior results.”

    That may or may not be technically accurate; I’m in no position to say.

    What’s undeniable, though, is that some services like Google’s spelling corrections/suggestions are absolutely better/possible only if your dataset is huge. Perhaps Vic can illuminate if that’s the case for speech recognition here.

    However, Google opening up this or similar IP is somewhat irrelevant, in that without huge datasets and servers the successful execution of the service is all but unreachable for those not named Google.

  • @Richard N. –

    If your six-year-old touch screen had the kind of gestural interface that Apple popularized with the iPhone, I have indeed been living under a rock. Either that, or you’ve never tried an iPhone and don’t understand how it’s different from previous touch-based devices.

  • I totally agree that sensors being used in new ways is a turning point. Once I discovered the App Store, I was most interested in what people could invent around those sensors. There are lots of clever uses of location, microphone, web, etc., although I didn’t see anything as elegant as detecting the phone-to-ear gesture.

    But… didn’t Apple already do that with the phone itself? It dims when you put it to your ear (could be just a timeout), but it definitely reactivates the screen when you move it away to look at it again.

  • Yes, awesome… I’m a strong believer in the mobile handset as the personal gateway to the Internet, to our friends and family, and to the things around us. At the center of this are the interactions and their meanings, taking advantage of the user’s mobile context, all of this made possible via the handset’s sensors and the connectivity to the cloud… I’ve been writing about this for some time now, and more stuff is coming.


  • Voice recognition has to be nearly 100% accurate for mobile users to stick with it. Many voice-to-X services on cellular have come and gone. (A far cooler example of cloud recognition is the Shazam music-recognition app, now linked to iTunes – still my fave mobile app!)

    Gesture-UI and haptics are interesting, but not habit-changing for mobile. App store more seismic. It has changed our habits – downloading mobile apps now commonplace, easy, no-brainer experience. This is fertile ground for innovation, user experimentation = tipping point potentiality!

    @CEO has it right – context is key to mobile experience. Sensors + proximity = the next mobile paradigm. It has already landed with RFID and QR codes – just waiting for a tipping point. Expect that everywhere we go we shall be waving our mobiles over sensors. Still untapped = still big commercial oppy.

    As for mobile as web access device. Yes, definitely, but mobile needs a slightly different architectural emphasis, which is messaging-centricity.

    But who knows. The mobile future is still unfolding.

  • I’m considering getting a G1/Android phone due to an app that uses sensory input in a way that totally amazed me. Biggu’s app ShopSavvy (http://www.biggu.com/) is something I could convince my wife to buy a G1 for. It lets you use the camera on your phone as a barcode scanner, and then it comparison-shops for you online.

    I’m very interested in the continued development of Apps that make use of sensory input devices available on the G1 or iPhone. I see it very much like a high-tech, software Swiss Army Knife.

  • Uwe Trenkner

    Hi Tim,

    After reading some news on a new(?) Android app that is pitched as a Siri challenger, I went back to this old post of yours to re-read what you predicted exactly 3 years ago.

    Already in 2007, you had speculated that Google was offering 411 services to acquire a huge amount of voice data to be used for their future services.

    Now my question to you: Why has Google not yet cashed in on this original lead? Why was it Apple and not Google that came up with what you predicted here in this post? Has Google lost their focus? Or was their focus too narrowly put on integrating voice and search? Or have they been working on something even “bigger” than Siri?

    Do you have any ideas on this?