Fri

Jun 15
2007

Tim O'Reilly

Collective Intelligence in Word 2007 Spell Checker

Gregor Hochmuth wrote in from Berlin to point out how Microsoft is employing collective intelligence to automatically solicit information from its users to improve Word 2007's spell checker:

I thought you might enjoy this: When I was closing Word 2007 today, I was surprised to see the attached dialog pop up. Microsoft's new spell checker asked me whether it could transmit certain unknown words and phrases that I used in the last several weeks.

Among them are choice examples like Wikipedia, Gladwell, shortcode and others -- words that were certainly not in the original distributable. I assume Microsoft will re-distribute the most frequently submitted words in an upcoming spell checker update. Brilliant! And it reminds me of the way in which Google first introduced its "Did you mean...?" feature -- by tracking how users corrected their own spelling mistakes before re-trying a search.
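As a rough illustration of the aggregation Gregor guesses at (and not Microsoft's actual pipeline, which nothing here describes), the Python sketch below counts how many distinct users submitted each unknown word and keeps only the words that clear a threshold. The function name `words_to_redistribute` and the `min_users` parameter are hypothetical, chosen only for this example.

```python
from collections import Counter
from typing import Iterable, List


def words_to_redistribute(
    submissions: Iterable[Iterable[str]],
    min_users: int = 2,
) -> List[str]:
    """Return words submitted by at least `min_users` distinct users.

    Each element of `submissions` is one user's list of unknown words.
    The function name and the threshold are assumptions for illustration.
    """
    user_counts = Counter()
    for user_words in submissions:
        # Count each word once per user, so a single prolific user cannot
        # push a private term or a personal typo over the threshold.
        user_counts.update(set(user_words))
    # Most widely submitted words first; ties are broken alphabetically.
    ranked = sorted(user_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count >= min_users]


# Hypothetical submissions from three users.
submissions = [
    ["Wikipedia", "Gladwell", "shortcode"],
    ["Wikipedia", "shortcode", "teh"],
    ["Wikipedia", "Gladwell"],
]
candidates = words_to_redistribute(submissions, min_users=2)
```

With this made-up data, a word that only one user submitted (here the typo "teh") stays below the threshold, while words that several users share become candidates for the next dictionary update.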

Frankly, any company today that isn't doing this kind of thing is nuts. Learning from your users in real time and automatically improving your software based on what you learn is one of the things that distinguishes Web 2.0 ("Live Software") applications. I'm a bit surprised that Microsoft asks permission. So many companies today (e.g. Google, Amazon) are collecting data from their users as a matter of course, and applying it for the benefit of other users.

I'm curious about changing mores with regard to this kind of data collection, when its use is limited to improving the software. Do you think it should be "opt in" or "opt out"?


tags: web 2.0

Comments: 14

  Chris Vail [06.15.07 11:16 AM]

The question is, to what purpose is the software being improved? If the purpose is reducing the privacy of the user, for example, then this needs the user's informed consent. "Improving the user experience" is in the eye (ear?) of the beholder; probably users should be informed that the application is using their inputs to improve itself, and they should be able to opt out of that process (life is complicated, people are paranoid).

Open source software has a good balance; you can see what the code is doing, and you can voluntarily submit improvements yourself. So maybe an open source application that automatically improves itself (and allows users to update/correct/remove the relevant improvements) is the way to go.

  Duh [06.15.07 11:23 AM]

In this case, Microsoft got it right. If you go to Google and punch in your words then you're already giving Google information. Word is different. People write private, confidential, proprietary and classified documents with Word. Sending back unknown words or phrases without express permission could constitute actual crimes and put MS at legal risk.

  Chris Smith [06.15.07 11:29 AM]

I would normally assume that anything that happens on a company's servers could be used for improvements, unless you opt out. On the other hand, anything that happens locally on the user's computer should only be sent/used if they explicitly opt in.

I can't justify why I feel this to myself very well (other than it seems to be the norm for most things), but I did come up with this analogy while thinking about it: if I was browsing a catalogue in a shop, I wouldn't be annoyed or even concerned if I found they were somehow observing my actions (e.g. which sections of the catalogue I looked at); if I took one home and found out they were somehow tracking what I was doing, I'd be outraged.

So maybe it's the feeling that my local computer is private that would make me annoyed if an application phoned home without me knowing. If I'm using a web application/site, then I *know* it's on their terms.

  The Dawn Treader [06.15.07 11:31 AM]

The big difference is that when you type something into a web application then it is in the provider's database and they can do what they will without your knowledge. When I use a Windows application the data resides on my box, and I can track if the application tries to send my data off to the mothership. Can you imagine the screaming that would happen if someone with a network monitor noticed that Word automatically uploaded your private dictionary every night?

  DinoHorse [06.15.07 11:55 AM]

I do like the idea... collective collaboration is key... just go to Wikipedia and see how instant the updates are thanks to the community.

  Tobi [06.15.07 01:53 PM]

I do think that MS is right to ask for permission, given the distrust many people have of MS (just remember the discussion about the data being transmitted while activating Windows).

  John P [06.15.07 02:30 PM]

Anytime desktop software connects to the internet to transmit info, it must ask permission.

If you phone home without asking, people will notice the activity, and wonder what private data is being sent.

This is being a good citizen for desktop software. Web software doesn't need to ask. Nothing to notice in this case.

  Scott Carpenter [06.15.07 02:43 PM]

Why do I have the suspicion that MS will try to use copyright to restrict use of its collaboratively generated spelling lists? (Or some other kind of restriction, if copyrighting correct spelling is even too ridiculous for the current state of the art in IP maximalism.) :-)

  Coleman [06.15.07 02:57 PM]

It should be opt-out, because it's too much of a hassle otherwise. I trust Google, Microsoft, etc. with my data.

  Joe Wikert [06.15.07 06:25 PM]

The privacy component is a non-issue as far as I'm concerned. I don't care if Microsoft (or any other company) does something like this with my data. On the other hand, what I *do* worry about is the volume of new typos this would introduce. There are countless acronyms, abbreviations and other phrases that are unique to industries and organizations around the globe. Those would get added to the spelling database over time and many are likely to not only be completely useless to me, but they might be some of my more common typos (e.g., teh instead of the). If those are no longer flagged as questionable my docs will be laced with typos!

  Dave [06.19.07 07:10 PM]

Everything old is new again.

Back in 1978 the university I was at had a punch-card driven word processor for thesis typesetting. It understood all the thesis formatting requirements, and could do math and chemistry typesetting on an IBM line printer (with special print chain). Of course, the escape codes you needed to punch were wicked....

Anyway, it had a hyphenation dictionary that worked on user contributions. The system hyphenation dictionary was always consulted, and you could add your own deck of hyphenations as part of your input deck. When you were done, you gave your cards to the system admin, who added your words to the global hyphenation dictionary after review.

  Colin [06.20.07 10:45 AM]

How will they manage the possibility of contaminating the system with user-generated incorrect spellings, e.g. 'follow suite'?

  Tim O'Reilly [06.20.07 10:51 AM]

Colin -- language is dynamic and evolving. Ultimately, the correct spelling becomes the one used by the most people, regardless of what the dictionary says. For emerging usage today, people can simply google two spellings and compare the counts. I remember when there was a race between e-mail and email, and eventually email became the accepted form. So if people submit enough misspellings, they will eventually be correct...

  MEL [07.12.08 10:07 AM]

This is dreadful. This "service" uploaded nearly the entire contents of proprietary technical proposals that I've been running through my spell checker. Not just words, but sentences, company names, phrases, technical data from tables that went through the spell check. Can you scream SECURITY RISK?
