Trying to Track Swine Flu Across Cities in Realtime

John Geraci is a guest blogger and heads up the DIY City movement. He will be speaking about DIY City at Where 2.0 in San Jose on 5/20.

Since early last Friday, when I got a tip about swine flu in Mexico City from a health researcher, the team behind SickCity has been working to make the system something that can (or could) detect swine flu outbreaks in cities around the world.

It hasn’t been easy.

SickCity is a “realtime disease detection for your city” service, created by people at DIYcity. Launched last month, it works by monitoring Twitter for local mentions of various terms that mean “I’m getting sick” and plotting them by location. Until Friday, SickCity seemed to work reasonably well for the very rough beta tool that it is. It showed incidences of people reporting that they had flu, chicken pox, or other illnesses, broken down by city. You could look at a graph of the past 30 days for your city and see the days when mentions of certain diseases and symptoms were higher or lower. You could sometimes see trends. No one claimed that SickCity was ready for prime time, but those working on it felt there was a very worthwhile idea in it that, with a bit of refinement, would be of huge value to communities.

On Friday, all of that got turned upside down.

Going to SickCity’s Mexico City page early in the day, I saw a sudden, several-hundred percent increase in mentions of flu. The problem was, not a single one of them was about actually having the flu – all were about the gigantic swine flu media event that was just beginning. Our disease detection tool had turned into a media event detection tool overnight.

Since then, we’ve been in a constant struggle to filter out the media effect from the data. The problem is, as the story grows and changes, the terms we have to filter for keep growing and changing. On Saturday we made a series of changes to the filters and search terms, and thought we were fine. By Sunday, those had become totally insufficient in the face of the growing Twitter storm surrounding swine flu (70 more results in the time it took me to write that sentence). We made more changes Sunday. Today, those additional filters seemed puny and insufficient. People are now calling swine flu “piggy flu”, “pork flu”, “bacon flu”, “wine flu”. They’re talking about Obama having flu. They’re talking about bird flu. The list of tweeting topics grows at an exponential rate. The topic of swine flu is incredibly viral.

So how do you get down below this huge cloud of noise, to the relatively tiny (but very important) signal down beneath? There are probably several thousand tweets happening right now about the idea of flu for every one that is about actually having the flu. The number of people actually coming down with flu right now in fact seems very low (let’s hope it stays that way).

Tracking other terms related to flu seems more promising – the term “fever” seems like a good one to look for, and once you get rid of the tweets mentioning spring fever, cabin fever and Doctor Johnny Fever, you’ve got a pretty good data set to use. But how representative of the flu population is that term?
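The filtering described above amounts to a symptom-term whitelist combined with a growing blacklist of noisy phrases. Here is a minimal sketch of that idea in Python — the term lists and the function name are illustrative examples for this post, not SickCity's actual code or vocabulary:

```python
# Illustrative term lists -- the real ones grow as the media storm evolves.
SYMPTOM_TERMS = {"flu", "fever"}
STOP_PHRASES = {
    "swine flu", "piggy flu", "pork flu", "bacon flu", "wine flu",
    "bird flu", "spring fever", "cabin fever",
}

def is_sick_tweet(text: str) -> bool:
    """Return True if the tweet looks like a first-person sickness report."""
    lowered = text.lower()
    # Drop anything matching a known noisy phrase (media chatter, jokes, puns).
    if any(phrase in lowered for phrase in STOP_PHRASES):
        return False
    # Keep the tweet only if a bare symptom term remains.
    words = set(lowered.replace(".", " ").replace(",", " ").split())
    return bool(words & SYMPTOM_TERMS)
```

The weakness this post describes is visible right in the sketch: the stop list has to be updated by hand every time the conversation invents a new way to talk *about* the flu, which is exactly the losing race the team found itself in over the weekend.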

Maybe tracking actual flu tweets in this situation isn’t really possible?

Still, the payoff in terms of value to communities and health organizations is huge if the developers can get something that can be demonstrated to work. As a public health researcher following SickCity told me, realtime outbreak detection is currently terrible at best. To improve on what’s there, you just have to give people a reliable signal that *something* is happening in a city. You don’t need to have exact numbers. You don’t even need to know whether what’s happening is actually flu, or food poisoning, or plague, really – the health officials can figure that out for themselves pretty quickly with all of the other tools at their disposal, once they know to be on the lookout. You just need to be able to reliably say “there is a sickness event happening right now in this city”, and that’s enough. You just need a canary in the coal mine.

So the developers behind SickCity, volunteers from DIYcity (mainly Paul Watson and Dan Greenblatt at this time, plus a few others), keep working to make it just that. And right now they’re working round the clock. (It’s a public project – if you want to pitch in, by all means do so – you can get more info here.)

Even if SickCity fails to detect swine flu in cities around the world, it will have become a much more robust tool in the process of failing. If it doesn’t succeed in catching this pandemic, maybe it will be better prepared to catch the next one?


  • This is an unsurprising outcome. The sickcity effort appears to be uncontrolled, socially irresponsible, human subject experimentation on an involuntary population over a matter of significant public interest. It’s bad engineering and bad science and a misleading presentation on that web site.

    This crossed my desk today and seems applicable.

    It may be wishful thinking on my part but I suspect, stepping back from the heat of the moment on this particular issue, that efforts to optimize data-mining of twitter are premature in that I see no indication of (and indications against) the longevity of such a service. I think twitter’s a flash in the pan, but we’ll see.

    Meanwhile, there is just no conceivable scientific framework in which something of the form and function of “sickcity” works out to be a positive contribution to the public welfare. I can’t fathom the mindset that led to its actual construction and deployment.


  • David Sonnen

    You may have found a useful answer, just not one you expected.

    Tracking the accurate locations of something like flu involves a lot of different information. Examples: DNA verification of the virus, patient’s travel and personal contacts, patient’s symptoms, and a host of other data. Unless public health people freely publish their findings, that detailed information simply cannot exist in ‘Net-based streams.

    Even location-specific news aggregators have a problem with accurate location.

    However, you may have found a way to measure and map public awareness or concern. As you’ve found, signal to noise is a big problem for determining location of actual cases, but location of concerned people might be easier to detect.

    It seems like the location of concerned people would be useful to public health and emergency preparedness folks, as well as the public.

    You’ve tackled a hard problem. Good luck.

  • Hi David,

    > Unless public health people freely publish their findings,
    > that detailed information simply cannot exist in ‘Net-based streams.

    Not true! An employee of the NYC Department of Health whom I met with last week regarding SickCity told me that the way these ‘public health people’ assemble their data in the first place is by tuning into the public data streams available and making sense of it all. Ex: A person goes into an emergency room and writes down their symptoms (a self-diagnosis) on paper, and 24 hours later that paper is in the hands of someone at the DoH, who tries to make sense of it all in aggregate. Is self-reporting on Twitter so different?

    I think not, and I think this kind of self-reporting, taken in aggregate, will prove extremely useful once all of the kinks are worked out, as a primary data point. People talking publicly about their symptoms in social streams is just another resource for health officials to consider.

    Thanks for the comments!

  • yes, self-reporting on Twitter is entirely different from being assessed by a health care professional. Reporting supposed (in this case bogus and panic-mongering) aggregation results is also entirely different when conducted by sickcity so as to bump up the number of “uniques” and “page views” vs. when reported in a network of responsible public health types who think very hard about appropriate communication. It (sickcity) is a dreadfully irresponsible and unscientific experiment.

    Yes, officials do mine public sources like that but not primarily and not trying to create a real-time panic-prone bogus feedback loop.


  • Now hold on there, Thomas – “dreadfully irresponsible” is a term generally reserved for experiments that could result in genetic mutations to wildlife and such. What this is is a group of enthusiastic people trying to wrangle a new data type that has only existed for a year or two and harness it for good use. I don’t call that an irresponsible experiment, I call that a really exciting experiment. If you disagree, you’re free to not follow along, but I don’t think name calling is warranted.

    For those who do find this kind of thing interesting, the SickCity team just dropped the noise in the system by an order of magnitude. They’re crowdsourcing the blacklisting of terms now – if you want to join in, you can add any stop words you see in noisy tweets using this form:

    They’ll keep working on it tonight – hopefully it will be better in the morning.

  • I think what SickCity is doing is pretty interesting for its use of a public data set, the fact that it doesn’t seem to be working yet is more due to the fact that twitter is so new as a medium that no one really understands how it can be used to successfully run experiments of this nature.

    It seems that we’re going to see more and more public data sets, especially as parts of the government open themselves up, and the more people start thinking about constructive uses of this data, the better it will be for everyone involved.

    I expect the sickcity team to tweak the app and make it more useful as time goes on, and it is obviously not about pageviews since it is a .org after all not some crazy VC funded nightmare!

  • John,

    I think it is offensive that the sickcity web site does not include prominent display on every page that it is not a scientifically proved warning system and that it is based on wild guesses and that it is easily gamed by malicious actors and that it is easily fooled. I think it is offensive that the site does not strongly advise that nobody take it seriously. I am an engineer, at least in theory, and I regard a hallmark of my profession to be social responsibility – it is what distinguishes us from evil, mad scientists. This exhibit – that web site – fails badly at displaying social responsibility.

    Just today a very nice essay on the topic came to my attention: How False Rumors Can Cost Lives. It can give some perspective on the kinds of involuntary human experimentation and false representation sickcity exemplifies.


  • this may go without saying, but what about a hashtag for people who have/have seen flu symptoms (e.g. #flusymptoms)? if the topic of swine flu is so viral, perhaps the hashtag would be RT’d enough to make it useful. more broadly, you can keep refining your search terms, but is there a way to let the people help you?

  • i should say in full disclosure that i’ve helped in the development of sickcity.

    @stephanie – having people help is a great idea, and we’re definitely planning on crowdsourcing some of the refinement of the ‘sick tweet’ data set. one feature in the works is letting people flag or delete tweets that they deem as ‘not sick’ — i.e. false positives.

    re: hashtags, yes – they’re very useful if people use them, but one of the potentially very powerful things about sickcity is that it doesn’t require people to use hashtags. that being said, perhaps defining a hashtag people could use to unequivocally state their sickness could be a good idea.

    i know one should never expect to change behaviors when it comes to this sort of thing, but one change i think would be awesomely powerful would be to see people start explicitly letting others know they’re sick to contribute to this kind of effort, and not just as a ‘sick tweet’ side-effect of talking about all the other things they’re doing that day…

    so we’re putting a tool in place that works well without this behavior, but if/when it emerges (and i don’t expect it will :), then something like this becomes really reliable and powerful.

  • The problems described here are the subject of a burgeoning research area, see:

    Eysenbach G
    Infodemiology and Infoveillance: Framework for an Emerging Set of Public Health Informatics Methods to Analyze Search, Communication and Publication Behavior on the Internet
    J Med Internet Res 2009;11(1):e11

    The infovigil system, described briefly in this article, which also analyzes twitter streams, is actually working with public health agencies and epidemiologists, and is using more advanced natural language processing approaches, and will – in the near future – link to questionnaires and include sentinel households. I admire what SickCity is doing, but it is essentially a simplified version of infovigil. My concern is that it distracts the public and public health professionals from more rigorous, research-based systems. Research funding in this area is sparse, and I am not sure how helpful it is to set up competing systems.

  • Gunther,

    Don’t worry, you can keep your research dollars – SickCity is not a research project, and we’re not looking for research funding. Though researchers are welcome to listen in if they want, even give us pointers and advice as some are doing.

    I can’t imagine why you would be in favor of one, single project inquiring into using publicly available data to tell people interesting things about their communities – isn’t a diversity of teams, with a diversity of approaches a positive thing when trying to find value in new resources? I come from the world of internet startups, and that’s what I’ve always been taught. I think the scientific community is the same, no? Seems crazy to me at this early stage of the game to say “this is something being looked into by ‘authorities’ – please leave it alone”.

    Thanks for the link to your paper, and thanks for kindly referencing SickCity in it.

  • (I am the Paul Watson referred to in the article as having worked on SickCity.)

    Thomas, thanks for your comment. Looking past your strong language I do see merit in what you are saying and will talk to the others about putting up a notice. We are excited about the project and do think it has genuine, defensible value.

  • @stephanie, regarding your suggestion about ‘letting the people help you’ … we’ve added the feature which lets people remove tweets which they identify as ‘false positives’ — i.e. the tweet is not from a sick person talking about being sick, but is instead a snarky comment, link to an article, expression of concern, etc. It’s live now on — check it out! Thanks all for the feedback / suggestions.

  • @Dan, thanks for responding. delighted to hear about the new and upcoming features. also, you wrote:

    “i know one should never expect to alter changes in behaviors when it comes to this sort of thing, but one change i think would be awesomely powerful would be to see people start explicitly letting others know they’re sick to contribute to this kind of effort, and not just as a ‘sick tweet’ as side-effect of talking about all the other things they’re doing that day…”

    superduper agreed. it’s likewise awesomely powerful to entrust citizens with this responsibility – which inspires us to be that much more invested, and that much less prey to simply swimming in (and perpetuating) uninformative hype.

    anyways. danke schön again and good luck!

  • Rekha Murthy

    Are any of you familiar with I’ve wondered what would motivate someone to publicly announce their illness, but it seems that’s what some of the commenters are suggesting here as well. The next question would be – how do you extrapolate the self-announcements of the extremely small group of active Web contributors into any generalizable observations?

    I can understand the qualitative value of viewing individuals’ expression of symptoms, as John Geraci notes in the comments. But Sickcity seems to present this information as hard data, far more quantitative than it actually is. I’m not sure I see the credibility in that. And I certainly don’t see the originality, given Google Flu Trends. How is this different?