Is Your Survey Data Lying to You?

As the book industry continues to change, we are inundated
with statistics about user behavior:

  • 49% of e-book readers are bought as gifts
    [Bowker]
  • 28% of US adults are avid (5+ hours/week)
    readers [Verso]
    - 64MM avid readers
  • The heart of the U.S. romance novel readership
    is women aged 31-49 who are currently in a romantic relationship.
    [Romance Writers of America]

These statistical nuggets are appealing because, even in isolation, they give
us a glimpse into why people do what they do and how we can adjust our
business to match market needs. But how often do we blindly accept data
because it comes with pretty graphs and sound bites that seem to make sense?
Probably more often than we’d like to admit.

The best way to ensure that we are not led astray is to look at what biases
have been introduced into a study before using its data to make a decision.
Bias is systematic favoritism in the data collection process that produces
misleading results. Two kinds of bias are common hazards in studies:
selection bias and measurement bias.

  • Selection Bias can occur when the group that is
    surveyed does not accurately reflect the target of the study, or is simply too
    small to matter. For example, if a study claims to describe the behavior of all
    readers in the U.S. but only surveys 30 stay-at-home moms in Indiana, it is
    hardly representative of every reader in the country.
  • Measurement Bias occurs when the questions asked
    favor a specific outcome. A survey question like “Do you agree that e-books are
    replacing print books as the preferred medium?” will deliver very different
    results than one that asks readers to choose their preferred medium from among
    e-books, purchased p-books or books checked out from the library.

As you read a study, ask yourself the following
questions to determine if the authors tried to mitigate bias. Remember: the
target population is the group that you want to generalize about, and the
sample is the group that you actually survey in order to make those
generalizations.

  1. Is the target population (sometimes called the sampling frame) well-defined?
    If it isn’t, the study may contain people outside the target, or it may exclude
    people who are relevant. In researching e-book reader purchase behavior, a
    well-defined population could be American consumers who purchased an e-book
    reader either online or in a physical store over the last 2 years. But if a
    study only looked at online shoppers at Christmas, the results could be skewed
    towards gift givers, and they could not be generalized to consumers who bought
    e-book readers in stores.
  2. Is the sample randomly selected from the target population?
    In a truly random sample, every member of the target population has the same
    chance of being included in the study. When asking this question, be wary of
    surveys that are conducted exclusively on the web but draw generalizations
    about all people. The participants in these studies are not randomly selected:
    they capture only a slice of the traffic to a given domain, and at best the
    results can speak to the habits of the users of the particular site conducting
    the research.
  3. Does the sample represent the target population?
    Here it is important to look at all
    of the characteristics of the target population to see if they are mirrored in
    the sample. If you are looking to figure out the book purchase habits of
    Americans, make sure the sample has the same diversity of ethnicity, geographic
    distribution and age as is reported in the latest U.S. census. 
  4. Is the sample large enough?
    The larger the sample, the more accurate the results. A quick way to estimate
    whether a sample is large enough to produce a reasonably small margin of error
    is to divide 1 by the square root of the sample size (Margin of Error =
    1/√Sample Size). So a 1,500-person survey would produce a margin of error of
    about 2.58%. It is also important that the sample size in this calculation be
    the number of people who responded to the survey, not the number of survey
    requests that were sent out. (A quick calculation illustrating this rule of
    thumb appears after this list.)
  5. What is the response rate for the survey?
    The response rate is the share of the target population that actually
    responded to a given survey. If the response rate is too low, a study may
    only reflect people who have a strong opinion about the topic, making the
    results biased toward their opinions rather than those of the larger and
    less vociferous target population. A “good” response rate depends on the
    margin of error that a study is looking to achieve (or that it claims) and
    the size of the target population being studied. There are two factors to
    consider here. The first is a no-brainer: the higher the response rate, the
    more accurate the study. The second is a little more subtle: the larger the
    target population being examined, the lower the response rate required for
    the same level of accuracy. The linked figure helps explain the correlation
    graphically. (According to the chart, for a study that is looking to achieve
    a margin of error of +/- 5% and is studying a population of 2,000 people,
    the response rate needs to approach 20% to achieve the desired result; the
    second sketch after this list works through this example.) At the end of the
    day, know the response rate and make sure it closely matches the stated
    margin of error that the study purports to achieve.
  6. Do the questions appear to be leading the respondents into a particular answer?
    If they do, run the other way! This means that the researchers’ agenda is
    adding a measurement bias and the results aren’t worth the paper they are
    printed on. Also be wary of any study that doesn’t share its sampling method,
    sample characteristics and survey questions.
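
As a quick illustration of the rule of thumb in question 4, here is a minimal
Python sketch (the 1,500-person figure is just the example from the text):

    import math

    def margin_of_error(respondents: int) -> float:
        """Rule-of-thumb margin of error: 1 / sqrt(number of actual respondents)."""
        return 1 / math.sqrt(respondents)

    # Use the number of people who actually responded, not the number of
    # survey requests that were sent out.
    print(f"{margin_of_error(1500):.2%}")  # ~2.58%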
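
The figure linked in question 5 is not reproduced here, but the relationship it
describes can be approximated with the same rule of thumb plus the standard
finite population correction. A hedged sketch (the exact numbers depend on the
assumptions behind the original chart):

    import math

    def required_respondents(margin_of_error: float, population: int) -> int:
        """Respondents needed for a target margin of error, using n0 = 1 / MOE^2
        (the rule of thumb above) with a finite population correction."""
        n0 = 1 / margin_of_error ** 2
        n = n0 / (1 + (n0 - 1) / population)
        return math.ceil(n)

    population = 2000
    needed = required_respondents(0.05, population)            # ~334 respondents
    print(f"response rate needed: {needed / population:.0%}")  # ~17%, approaching 20%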

In the end, the goal of a
survey is to accurately describe a larger population. This can only be done if
great care is taken to 1) ensure that the results wouldn’t change much if
another sample were taken under the same conditions and 2) reduce the biases
that can be introduced into the system.

About the Author
Jeevan Padiyar
is a technology entrepreneur and product strategist with ten years’ experience in e-commerce and product development. He is passionate about using data to validate growth strategies for new market penetration.

A pioneer in the book rental industry, Jeevan is CEO/CFO of BookSwim. Jeevan helped shape podcast monetization as chairman and CFO of RawVoice, Inc., making ad deals with GoDaddy, Citrix and HBO. Prior to that, he studied medicine at Albert Einstein College of Medicine as a Howard Hughes Medical Institute Fellow. Before coming to New York, Jeevan founded arena blimp manufacturer Simply Blimps. He led it to $30M in sales in five years, with clients like NHL, NBA, Yum Brands and Subway.

Jeevan holds degrees in chemistry and biochemistry from Kansas State, graduating Phi Beta Kappa.

  • http://www.microsoft.com Jon

    Great article.

    I struggle with these very issues when looking at studies. With so much data flying around it is hard to sometimes tell which end is up. Thanks for stating things so simply.

  • Matt

    “In a truly random sample, every member of the target population has the same chance of being included in the study.”

    I know that happens on the Pick 4 lottery drawing because I can see the ping-pong balls on TV. But in a market research survey, what if it doesn’t say in the report? Or if it just says “participants were randomly selected”?

  • YoungBrud

    Thanks for writing this. It is good to have a healthy level of suspicion when you hear survey results and it should drive you to investigate further. There are far too many beliefs being formed and decisions being made based on unreliable survey data. Bad surveys can harm as much as good surveys can help. I hope you write more, in fact 93% of people I surveyed want you to write more.

  • http://www.bookswim.com Jeevan Padiyar

    Matt,

    Great question. If a study doesn’t say its participants were randomly selected, or if it only uses a one-line descriptor like “based on random sampling,” I would be wary of the results without doing further investigation. First, I’d look for other studies that describe your subject matter. If you can’t find other data, I would reach out to the author of the study and ask very directly whether they used a random sample, and what their methodology was for generating random entrants to their sample population.

    There are three major types of probability sampling methods used:

    The first is purely random sampling, i.e., assigning each person in your sampling frame a number and then using a random number generator to construct your sample population.

    Systematic sampling is another method. After the required sample size has been calculated, every Nth record is selected from a list of population members. As long as the list does not contain any hidden order, this sampling method is as good as the random sampling method. Its only advantage over the random sampling technique is simplicity.

    The third is stratified sampling. A stratum is a subset of the population whose members share at least one common characteristic. Examples of stratums might be males and females, or managers and non-managers. The researcher first identifies the relevant stratums and their actual representation in the population. Random sampling is then used to select a sufficient number of subjects from each stratum. “Sufficient” refers to a sample size large enough for us to be reasonably confident that the stratum represents the population. Stratified sampling is often used when one or more of the stratums in the population have a low incidence relative to the other stratums.

    Remember that a study is only as good as the sample population it examines. If this is not defined, or if care is not taken to ensure its randomness, the data can’t be generally applied to anyone but the folks who were studied. (A short sketch illustrating these three methods follows below.)
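
    A minimal, illustrative Python sketch of the three probability sampling
    methods described above (the frame, field names and stratum labels are made
    up for the example, and the proportional allocation shown for the stratified
    case is just one possible allocation scheme):

        import random

        # Toy sampling frame: each member has an id and a stratum label.
        frame = [{"id": i, "segment": "manager" if i % 5 == 0 else "non-manager"}
                 for i in range(1000)]

        def simple_random_sample(frame, n):
            """Pure random sampling: every member has the same chance of selection."""
            return random.sample(frame, n)

        def systematic_sample(frame, n):
            """Systematic sampling: take every Nth record after a random start."""
            step = len(frame) // n
            start = random.randrange(step)
            return frame[start::step][:n]

        def stratified_sample(frame, n):
            """Stratified sampling: draw a random sample from each stratum,
            here in proportion to the stratum's share of the population."""
            strata = {}
            for member in frame:
                strata.setdefault(member["segment"], []).append(member)
            sample = []
            for members in strata.values():
                k = round(n * len(members) / len(frame))
                sample.extend(random.sample(members, k))
            return sample

        print(len(simple_random_sample(frame, 100)),
              len(systematic_sample(frame, 100)),
              len(stratified_sample(frame, 100)))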

  • Eric Eltman

    It is unfortunate that so many marketing studies don’t know how to conduct good data analysis. It seems like everyone is pushing their own agenda. I wish there was a peer review process like there is in science to vet what is good and what stinks.

    Conferences could do this, but it seems like speakers who present data care less about education and more about pushing their wares.

    The world could use fewer pushy consultants who don’t do anything but try to sell their overpriced, under-performing services.

    Bravo to you for helping to educate us all.

  • Dave

    I hope you write more on this topic. I’d like to know how studies that predict behavior measure up against reality.

    Also I looked at bookswim. Sounds like a great service.

  • Arvind

    All reporters should learn statistics, or at least have the sense to write about it smartly.

  • Jeff R

    I’m reminded of a quote from James Mills:

    “If you torture your data long enough, they will tell you whatever you want to hear.”

    Although he meant it more in terms of inappropriate statistical methods, it’s not inapplicable here. Too often we take these types of studies as fact without thinking about the biases, etc, that are implicit in all survey samples. Worse, of course, is when the data supports our preferences – we are even less likely to question its foundation then.

    It’s too bad, as other comments have said, that more attention isn’t paid to this in the media.

  • http://www.carolynjewel.com Carolyn Jewel

    Thank you for this article. I, too, would love to hear more from you on the subject.

  • http://abaditya.com Aditya Banerjee

    Very relevant points to consider for a survey. In fact there’s a nice book by Darrell Huff called “How to lie with statistics” – http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728 – which touches upon similar points.

  • Julien

    There’s another trap out there, but possibly more widespread in media than in sciences: the confusion between self-reporting and observation.
    As in “how many books do you read every year?”
    as opposed to observing it (more difficult, of course).

    and even better than that, the confusion between opinions and truth.
    “90% of people think that life is possible on mars”
    What does that tell us? That a lot of uninformed nobodies think something. No problem with that, except that very often it is treated rhetorically as if it were an expert observation.