Thy self thou gav’st, thy own worth then not knowing
(This post is the fourth in a series called
“Being online: identity, anonymity, and all things in between.”)
Voracious data foraging leads advertisers along two paths. One of
their aims is to differentiate you from other people. If vendors know
what condiments you put in your lunch or what material you like your
boots made from, they can pinpoint their ads and promotions more
precisely at you. That’s why they love it when you volunteer that
information on your blog or social network, just as do the college
development staff we examined before.
The companies’ second aim is to insert you into a group of people for
which they can design a unified marketing campaign. That is, in
addition to differentiation, they want demographics.
The first aim, differentiation, is fairly easy to understand. Imagine
you are browsing web sites about colic. An observer (and I’ll discuss
in a moment how observations take place) can file away the reasonable
deduction that there is a baby in your life, and can load your browser
window with ads for diapers and formula. This is called behavioral
advertising.
Since behavioral advertising is normally a pretty smooth operator, you
may find it fun to try a little experiment that could lift the curtain
on it a bit. Hand your computer over for a few hours to a friend or
family member who differs from you a great deal in interests, age,
gender, or other traits. (Choose somebody you trust, of course.) Let
him or her browse the web and carry on his or her normal business.
When you return and resume your own regular activities, check the ads
in your browser windows, which will probably take on a slant you never
saw before. Of course, the marketers reading this article will be
annoyed that I asked you to pollute their data this way.
Experiences like this might make you conscious of every online
twitch and scratch, just as you may feel in real life in the presence
of a security guard whose suspicion you’ve aroused, or when on stage,
or just being a normal teenager. Online, paranoia is level-headedness.
Someone indeed is collecting everything they can about you: the amount
of time you spend on one page before moving on to the next, the links
you click on, the search terms you enter. But it’s all being collected
by a computer, and no human eyes are ever likely to gaze upon it.
Your identity in the computerized eyes of the advertiser is a strange
pastiche of events from your past. As mentioned at the beginning of
the article, Google’s Dashboard lets you see what Google knows about
you, and even remove items–an impressive concession for a company
that has mastered better than any other how to collect information on
casual Web users and build a business on it. Of course, you have to
establish an identity with them before you can check what they know
about your identity. This is not the last irony we’ll encounter when
But advertisers do more than direct targeting, and I actually find
the other path their tracking takes–demographic analysis–more
problematic. Let’s return to the colicky baby example. Advertisers add
you to their collection of known (or assumed) baby caretakers and tag
your record with related information to help them understand the
general category of “baby care.” Anything they know about your age,
income, and other traits helps them understand modern parenting.
wrote over a decade ago,
this kind of data mining typecasts us and encourages us to head down
well-worn paths. Unlike differentiation, demographics affect you
whether or not you play the game. Even if you don’t go online, the
activities of other people like you determine how companies judge your
The latest stage in the evolution of demographic data mining is
sentiment analysis, which trawls through social networking messages to
measure the pulse of the public on some issue chosen by the
researcher. A crude application of sentiment analysis is to search for
“love” or “hate” followed by a product trademark, but the natural
language processing can become amazingly subtle. Once the data is
parsed, companies can track, for instance, the immediate reaction to a
product release, and then how that reaction changed after a review or
ad was widely disseminated. Results affect not only advertising but
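The crude form of sentiment analysis described above can be sketched
in a few lines. This is only an illustration, not how any real vendor
does it: the function name, the sample messages, and the product name
"Acme skis" are all made up, and allowing an optional "my" or "the"
before the product name is my own small extension.

```python
import re
from collections import Counter


def crude_sentiment(messages, product):
    """Count "love"/"hate" mentions that directly precede a product name.

    Hypothetical helper for illustration only. An optional "my" or "the"
    is allowed between the verb and the product name.
    """
    pattern = re.compile(
        r"\b(love|hate)\s+(?:my\s+|the\s+)?" + re.escape(product) + r"\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for message in messages:
        for match in pattern.finditer(message):
            counts[match.group(1).lower()] += 1
    return counts


# Made-up status messages standing in for a social network feed.
sample_messages = [
    "I love my Acme skis",
    "hate Acme skis so much",
    "Acme skis are fine, I guess",
    "We LOVE the Acme skis!",
]
sample_counts = crude_sentiment(sample_messages, "Acme skis")
print(sample_counts["love"], sample_counts["hate"])
```

A pattern match like this misses sarcasm, negation ("I don't love..."),
and every other way people express opinions, which is why serious
efforts turn to natural language processing.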
Once again, my reaction to sentiment analysis mixes respect for its
technical sophistication with worries about what it does to our
independence. If you add your voice to the Twittersphere, it may be
used by people you’ll never know to draw far-reaching conclusions. On
the other hand, if you refuse to participate, your opinion will be
Google’s Dashboard tells you only what they preserve on you
personally, not the aggregated statistics they calculate that
presumably include anonymous browsing. But you can peek at those as
well, and carry on some rough sentiment analysis of your own, through
Considering all this demographic analysis (behavioral, sentiment, and
other) catapults me into a bit of a 21st-century-style existential
crisis. If a marketer is able to combine facts about my age, income,
place of birth, and purchases to accurately predict that I’ll want a
particular song or piece of clothing, how can I flaunt my identity as
an autonomous individual?
Perhaps we should resolve to face the brave new world stoically and
help the companies pursue their goals. Social networking sites are
developing APIs and standards that allow you to copy information
easily between them. For instance, there are sites that let you
simultaneously post the same message instantly to both Twitter and
Facebook. I think we should all step up and use these services. After
all, if your off-the-cuff Tweet about your skis from the lounge of a
ski resort goes into planning a multimillion dollar campaign, wouldn’t
it be irresponsible to send the advertiser mixed messages?
My call to action sounds silly, of course, because the data gathering
and analysis will obviously not be swayed by a single Tweet. In fact,
sophisticated forms of data mining depend on the recent upsurge of new
members onto the forums where the information is collected. The volume
of status messages has to be so high that idiosyncrasies get ironed
out. And companies must also trust that the margin of error caused by
malicious competitors or other actors will be negligible.
We saw in an earlier section that your online presence is signaled by
a slim swath of information. At the low end, marketers know only your
approximate location through your IP address. At the other extreme
they can feast on the data provided by someone who not only logs into
a site–creating a persistent identity–but fills out a form with
demographic information (which the vendor hopes is truthful).
As another example of modern data-driven advertising, Facebook
delivers ads to you based on the information you enter there, such as age
and marital status. A tech journal reported that
the Google Droid phone combines contacts from many sources,
but I haven’t experienced this on my Droid and I don’t see
technically how it could be done.
Most browsing takes place in an identity zone lying between the IP
address and the filled-out profile. We saw this zone in my earlier
example from the coffee shop. The visitor does not identify himself,
but lets the browser accept a cookie by default from each site.
Each cookie–so long as you don’t take action to remove one, as I did
in my experiment–is returned to the server that left it on your
browser. If you use a different browser, the server doesn’t know
you’re the same person, and if a family member uses your browser to
visit the same server, it doesn’t know you’re different people.
Because the browser returns the cookie only to servers from the same
domain–say, yahoo.com–that sent the cookie, your identity
is automatically segmented. Whatever yahoo.com knows about
you, oreilly.com and google.com do not. Servers can
also subdivide domains, so that mail.yahoo.com can use the
cookie to keep track of your preferred mail settings while
weather.yahoo.com serves meteorological information
appropriate for your location.
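The segmentation described above can be sketched as a toy cookie jar.
This is a simplification under stated assumptions: the class and
function names are hypothetical, the cookie values are invented, and
real browsers also weigh paths, expiration dates, security flags, and
lists of public suffixes before returning a cookie.

```python
def domain_matches(host, cookie_domain):
    """Return True if a cookie set for cookie_domain may be sent to host.

    Simplified rule: the host is the cookie's domain itself or a
    subdomain of it.
    """
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    return host == cookie_domain or host.endswith("." + cookie_domain)


class ToyCookieJar:
    """Hypothetical stand-in for the browser's cookie storage."""

    def __init__(self):
        self._cookies = []  # list of (domain, name, value) tuples

    def store(self, domain, name, value):
        self._cookies.append((domain, name, value))

    def cookies_for(self, host):
        """Collect only the cookies whose domain matches the host."""
        return {
            name: value
            for domain, name, value in self._cookies
            if domain_matches(host, domain)
        }


jar = ToyCookieJar()
jar.store("yahoo.com", "id", "abc123")            # shared across the domain
jar.store("mail.yahoo.com", "layout", "compact")  # mail settings only

print(jar.cookies_for("mail.yahoo.com"))     # both cookies
print(jar.cookies_for("weather.yahoo.com"))  # only the shared id
print(jar.cookies_for("oreilly.com"))        # nothing at all
```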
This wall between cookies would seem to protect your browsing and
purchasing habits from being dumped into a large vat and served up to
advertisers. But for every technical measure protecting privacy, there
is another technical trick that clever companies can use to breach
privacy. In the case of cookies, the trick exploits the ability of a
web page to display content from multiple domains simultaneously. Such
flexibility in serving domains is normally used (aside from tweaks to
improve performance) to embed images from one domain in a web page
sent by another, and in particular to embed advertising images.
Now, if advertisers all contract with a single ad agency, such as DoubleClick
(the biggest of the online ad companies), all the ads from different
vendors are served under the doubleclick.com domain and can
retrieve the same cookie. You don’t have to click on an ad for the
cookie to be returned. Furthermore, each ad knows the page on which it
appears.
Therefore, if you visit web pages about colic, skis, and Internet
privacy at various times, and if DoubleClick shows an ad on each page,
it can tell that the same person viewed those disparate topics and use
that information to choose ads for future pages you visit. In the
United States, unlike other countries, no laws prohibit DoubleClick
from sharing that information with anyone it wants. Furthermore, each
advertiser knows whether you click on their ad and what activity you
carry on subsequently at their site, including any purchases you make
and any personal information you fill out in a form.
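To make the mechanism concrete, here is a rough simulation of an ad
server that sees one cookie across many unrelated sites. Everything
in it is an assumption for illustration: the class name, the visitor
ID format, and the page URLs are invented, and I simply pass in the
page address, which a real ad server would typically learn from the
HTTP Referer header or from parameters in the ad's URL.

```python
import itertools
from collections import defaultdict


class ToyAdServer:
    """Hypothetical ad server reached from pages on many different sites."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = defaultdict(list)  # cookie value -> pages seen

    def serve_ad(self, cookie, page_url):
        """Record the page an ad was shown on and return the cookie to keep.

        A browser with no cookie yet (None) is assigned a fresh ID.
        """
        if cookie is None:
            cookie = f"visitor-{next(self._ids)}"
        self.history[cookie].append(page_url)
        return cookie


ads = ToyAdServer()
pages = [
    "http://babies.example/colic",
    "http://slopes.example/skis",
    "http://news.example/internet-privacy",
]

cookie = None  # the browser starts out with no cookie for the ad domain
for page in pages:
    # No click is needed: merely loading the ad sends the cookie back.
    cookie = ads.serve_ad(cookie, page)

print(cookie, ads.history[cookie])
```

The three sites never exchange anything with each other. The common
thread is the ad domain embedded in each page.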
Put it all together, and you are probably far from anonymous on the
Internet. In addition, a more recent form of persistent data,
controlled by the popular Flash environment through a technology
called local shared objects, makes promiscuous sharing easy and
removing the information much harder.
The purchase of DoubleClick in 2007 by Google, which already had more
information on individuals than anybody else, spurred a great protest
from the privacy community, and the FTC took a hard look before
approving the merger. A similar controversy may surround Google’s
recently announced purchase of
which provides a service similar to DoubleClick for advertisers on
So far I’ve just covered everyday corporate treatment of web browsing
and e-commerce. The frontiers of data mining extend far into
the rich veins of user content.
Deep packet inspection allows your Internet provider to snoop on your
traffic. Normally, the ISP is supposed to look only at the IP address
on each packet, but some ISPs check inside the packet’s content for
various reasons that could redound to your benefit (if it squelches a
computer virus) or detriment (if it truncates a file-sharing session).
I haven’t heard of any ISPs using this kind of inspection for
marketing, but many predictions have been aired that we’ll cross that
Governments have been snooping at the hubs that route Internet traffic
for years. China simply blocks references to domains, IP addresses, or
topics it finds dangerous, and monitors individuals for other
suspected behavior. The Bush administration and American telephone
companies got into hot water for collecting large gobs of traffic
without a court order. But for years before that, the Echelon project
was filtering all international traffic that entered or left the US
and several of its allies.
One alternative to being tossed on the waves of marketing is to join
the experiments in Vendor Relationship Management (VRM), which I
covered in a recent blog post.
Although not really implemented anywhere yet, this movement holds out
the promise that we can put out bids for what we want and get back
proposals for products and services. Maybe VRM will make us devote
more conscious thinking to how we present ourselves online–and how
many selves we want to present. These are the subjects of the next section.
- Being online: Your identity in real life–what people know
- Your identity online: getting down to basics
- Your identity to advertisers: it’s not all about you (this post)
- Group identities and social network identities