Predictive data analytics is saving lives and taxpayer dollars in New York City

City governments, faced with decreased resources after the Great Recession and rising citizen demand for services with increased urbanization, must be able to make better decisions that are informed by data. To put it another way, in 2012, mayors need to start playing Moneyball in government with evidence-based analysis.

From public health to education to energy policy, if governments can shift resources to where they’re needed more quickly and accurately, there are substantial positive outcomes for citizens from the application of data for public good.

Predictive data analytics, like any data analysis, can only be as effective as the data it is based upon. Data quality is a long-term concern for any policymaker who wishes to make data-driven decisions, from foreign policy to energy to transportation. If the data is bad, policymakers are going to have a problem, even with superior methodology and algorithms.

In that context, the approach that Mike Flowers and his data analytics team in New York City government have taken to detecting financial fraud and other crimes or problems is interesting, and the outcomes from it are notable.

According to Flowers, applying predictive data analytics toward “preemptive government” in New York City has resulted in:

A five-fold return on the time of building inspectors looking for illegal apartments.
An increase in the rate of detection for dangerous buildings that are highly likely to result in firefighter injury or death.
More than doubling the hit rate for discovering stores selling bootlegged cigarettes.
A five-fold increase in the detection of business licenses being flipped.
Fighting the prescription drug epidemic through detection of the 21 pharmacies (out of an estimated total of 2,150 in NYC) that accounted for more than 60% of total Medicaid reimbursements for Oxycodone in the city.

A timely introduction by former Maryland chief innovation officer Bryan Sivak, now the chief technology officer of the U.S. Department of Health and Human Services, led me to Flowers, who now serves as the director of analytics for the Office of Policy and Strategic Planning in the Office of the Mayor of New York City. Flowers, who began his career as an assistant district attorney in the New York County district attorney’s office, previously served as a counsel to the U.S. Senate Permanent Subcommittee on Investigations for the 110th and 111th Congress and was deputy director of the U.S. Department of Justice’s Regime Crimes Liaison’s Office in Baghdad, Iraq, where he supported investigations and trials of Saddam Hussein and other high-ranking members of the former regime.

Today, Flowers is a pioneer in the field of urban predictive data analytics, an emerging practice that applies data science to discover and act upon patterns in databases. This application of data analytics has come to national attention through algorithmic crimefighting in San Jose, Calif. and predictive policing in Memphis, Tenn. In Washington, agencies and auditors have quietly been applying data analytics towards fraud detection and anti-terrorism.

In the following interview, which was lightly edited for content and clarity, Flowers explains more about how his team achieved these results, what’s necessary to go beyond performance measurement, and what’s next for the application of predictive data analytics in New York City.

What do you do on a day-to-day or a week-to-week basis?

Flowers: We’re trying to do risk-assessment and predictive resource allocation for all 60-some city agencies, ranging from fire inspections, building inspections, audits by revenue collecting agencies or business licenses, all the way to entrepreneurial assistance, in the form of identifying locations where there’s a specific combination of businesses, to having a suppressive impact on certain catastrophic outcomes, like crime or fire or water main breaks or things like that. That’s the short version.

What are the tools you bring to bear to all of those challenges?

Flowers: Human capital and technology. From a human capital standpoint, I have a staff of five statisticians — or data scientists, as I guess they would be called in your world. What I look for are people with economics degrees. They’re very young, fresh out of school, for the most part. A couple of them have a couple of years’ experience but also have a creative bent to them somewhere.

I hired my chief analyst not just because he has a degree in mathematical economics, but because he’s also a huge fantasy baseball guy. Another team member was a music major in high school.

From a technological standpoint, we use a variety of tools, ranging all the way from Excel to the most robust versions of SAS and skills in coding with Python and SQL. It’s a mishmash of things, whatever we have available to us.

When you combine human capital with technological tools, what’s the outcome downstream? What are you able to do and what results have you achieved?

Flowers: We never wanted to be a solution in search of a problem. By way of example, the city receives roughly 20,000 to 25,000 complaints for something called an ‘illegal conversion’ every year. An illegal conversion is a situation where you have an apartment or a house that’s zoned for six people to live in safely and a landlord’s chopping them up and putting 60 people in there. They represent significant public safety hazards, and not just from fire, but from crime and from epidemiological issues. To throw at those 20,000 to 25,000 complaints, we have roughly 200 inspectors at the Department of Buildings.

What we’ve done is come up with a way to prioritize those [complaints] which represent the greatest catastrophic risk, such as a structural fire. In doing that, we built a basic flat file of all 900,000 structures in the city of New York and populated them with data from about 19 agencies, ranging from whether or not an owner was in arrears on property taxes, if a property was in foreclosure, the age of the structure, et cetera. Then, we cross-tabulated that with about five years of historical fire data of all of the properties that had structural fires in the city, ranging in severity.
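
As a rough illustration of the flat-file idea described above, the Python sketch below merges a few per-agency tables into one row per building and cross-tabulates a single attribute against fire history. The building IDs, field names (tax_arrears, in_foreclosure) and sample values are all hypothetical; the interview does not describe the city's actual schema or tooling for this step.

```python
from collections import defaultdict

# Hypothetical per-agency records keyed by a shared building ID.
# Field names and values are illustrative, not the city's actual schema.
tax_records = {
    "B1": {"tax_arrears": True},
    "B2": {"tax_arrears": False},
    "B3": {"tax_arrears": True},
    "B4": {"tax_arrears": False},
}
foreclosure_records = {
    "B1": {"in_foreclosure": True},
    "B3": {"in_foreclosure": False},
}
# Buildings that had a structural fire in the historical window.
fire_history = {"B1", "B3"}


def build_flat_file(building_ids, *agency_tables):
    """Merge every agency's fields into one row per building."""
    flat = {}
    for bid in building_ids:
        row = {"building_id": bid}
        for table in agency_tables:
            row.update(table.get(bid, {}))
        flat[bid] = row
    return flat


def fire_rate_by_flag(flat, fires, flag):
    """Cross-tabulate a boolean attribute against historical fires."""
    counts = defaultdict(lambda: [0, 0])  # flag value -> [fires, buildings]
    for bid, row in flat.items():
        value = bool(row.get(flag, False))
        counts[value][1] += 1
        if bid in fires:
            counts[value][0] += 1
    return {value: n_fires / n_total for value, (n_fires, n_total) in counts.items()}


flat_file = build_flat_file(["B1", "B2", "B3", "B4"], tax_records, foreclosure_records)
rates = fire_rate_by_flag(flat_file, fire_history, "tax_arrears")
print(rates)
```

In this toy data every building in tax arrears had a fire and no other building did, which is the kind of signal that would then be checked with inspectors on the ground.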

After we had some findings and saw certain things pop as being highly correlative to a fire, we went back to the inspectors at the individual agencies, the Department of Buildings, and the fire department, and just asked their people on the ground, “Are these the kinds of conditions that you see when you go in post-hoc, after this catastrophic event? Is this the kind of place that has a high number of rat complaints? Is the property in serious disrepair before you go in?”

And the answer was “yes.” That told us we were going down the right road.

What we’ve done now is run every new complaint that comes in against that flat file. We find those [complaints] which represent the top five percent for historic fire risk and then send that top five percent back out to the inspectors to follow up on with urgency.
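
A minimal sketch of that triage step, assuming each building already has a numeric risk score derived from the flat file. The function name top_percent, the score values and the complaint fields are hypothetical; only the five percent cutoff comes from the interview.

```python
def top_percent(complaints, risk_by_building, percent=5):
    """Return the highest-risk complaints (at least one), sorted by risk."""
    scored = sorted(
        complaints,
        key=lambda c: risk_by_building.get(c["building_id"], 0.0),
        reverse=True,
    )
    # Integer arithmetic avoids floating-point surprises at the cutoff.
    keep = max(1, len(scored) * percent // 100)
    return scored[:keep]


# Hypothetical risk scores and 40 new complaints, one per building.
risk_by_building = {f"B{i}": i / 40 for i in range(40)}
complaints = [{"complaint_id": i, "building_id": f"B{i}"} for i in range(40)]

urgent = top_percent(complaints, risk_by_building)
print([c["building_id"] for c in urgent])
```

With 40 complaints, the top five percent is the two highest-scoring buildings; the rest stay in the normal inspection queue.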

Historically speaking, when the Department of Buildings went out to inspect a property because of a complaint, they were finding seriously high-risk conditions 13 percent of the time, as reflected in a “vacate order.” An order to empty the building in whole or in part is an extreme outcome but that’s what you want to find to remediate.

Using our system, they’re finding these risky conditions at a sustained level of about 70 to 80 percent of the time in the complaints that we send them out to [investigate]. From a Department of Buildings standpoint, they’re very happy, because that’s a fivefold return on inspection man hours. From the fire department standpoint, it also turns out that these buildings we send them to are 15 to 17 times more likely to result in a fireman being injured or killed in the response to the fire, so they love it. It’s been going on for a year. We do it on a weekly basis and it’s worked out spectacularly well.
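
The fivefold figure follows from the hit rates quoted above. A small Python check, using only the percentages Flowers cites (the function name lift is illustrative):

```python
def lift(baseline_rate, targeted_rate):
    """How many times more often a targeted inspection finds a problem."""
    return targeted_rate / baseline_rate


baseline = 0.13  # share of complaint inspections that ended in a vacate order
low = lift(baseline, 0.70)
high = lift(baseline, 0.80)
print(f"{low:.1f}x to {high:.1f}x")
```

By that arithmetic the improvement is a little over fivefold across the quoted 70 to 80 percent range.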

Are you applying this kind of predictive modeling to other areas beyond building inspections?

Flowers: Absolutely. Because it worked out so well, my first reaction was one of being elated and shocked, frankly, that it had that level of impact right away. The second reaction was that there’s so much more we can do, because anything that we allocate resources towards, if we just take an outcome-based view of it, as opposed to an individual agency.

So another area we do [predictive data analytics] with is cigarette tax inspections. We have a big problem in New York City with cigarette tax evasion. It costs about $12 a pack here. And that’s because of taxes, right? And there’s a reason for that. It’s a social policy decision that we don’t want people smoking because it has a lot of impact on our public services.

If you go to Virginia, however, you can pay $5 a pack. It’s very easy to load up a van with 50 to 70 cartons that you buy in Virginia and then sell in the city. What we did was we took the exact same approach.

In this instance, the Department of Finance, in the Office of the Sheriff, goes out and does cigarette tax inspections around the city. Historically, their hit rate, meaning when they actually found stores that were selling unstamped cigarettes or bootleg cigarettes, was just shy of 30 percent. We applied the same approach and the same methodology. Using our system, they’re now somewhere close to 82 percent.

For the sheriff, this is a wonderful tool because these inspections cost money. You send a guy out and you have to do the job and you want it to be fruitful, from an enforcement standpoint. To have that nearly threefold increase in efficiency by the sheriff is great. Plus, we’re getting more cigarettes off the street for less money.

Any other use cases?

Flowers: A lot of projects are extremely complementary. The data that we use for the cigarette tax inspection also includes business licensing information from another agency called the Department of Consumer Affairs (DCA). They license roughly 57 categories of businesses in New York City — about 150,000 different businesses in the city. What we did was we took violations issued by their inspectors and used them as a predictor for whether or not somebody would be violating the cigarette tax or being compliant with the cigarette tax regulations. During the course of that [analysis], we also found that we could use them in reverse. We could use the things that we were getting from other agencies to assist in DCA inspections.

We now have something where we’re able to identify people who are unlawfully flipping their license. If you get a certain number of violations from DCA, or any other agencies for that matter, the Department of Consumer Affairs can yank your license. Or they can initiate a suspension proceeding.

A very fascinating tell is to look holistically at all agency activity (not just DCA) around a specific business and see whether it’s consistent over a period of time but that the [DCA] license has changed hands. If it’s changed hands, but the violation activities remain consistent, it is highly likely that there was a “flip.” A flip is a situation where, if I’m an owner of a small store or a big store and I get a number of violations, I’m putting my business license at risk, so I simply go to you and say, “I’ll give you $5,000 to apply for a license for this location,” just as kind of a mule. And so it goes: the license is now in your name. You have a clean slate, vis-à-vis DCA data, and you just continue as before, but in reality, I’m the one running the show. I’ve just paid you an upfront fee of $5,000 to use your name. That happens with some frequency because the size of DCA’s remit is so large and they only have a certain number of inspectors, about 65.

What we’ve been able to do is the same thing back for them. We now can show them that they need to send somebody out to this location because it is quite likely there is a flip going on, and you need to scrutinize their records. The success rate for that has been fivefold.
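
One way to express that tell in code is sketched below in Python: flag a location when the license holder changes but the monthly violation rate across all agencies stays roughly the same. The record layout, the tolerance of 25 percent and the minimum rate are assumptions made for illustration, not parameters described in the interview.

```python
def looks_like_flip(periods, tolerance=0.25, min_rate=0.5):
    """Flag a location whose license changed hands while violations held steady.

    `periods` is a chronological list of dicts with hypothetical keys:
    holder, months, violations (all-agency violations at the address).
    """
    for before, after in zip(periods, periods[1:]):
        if before["holder"] == after["holder"]:
            continue  # no change of license holder, so nothing to check
        rate_before = before["violations"] / before["months"]
        rate_after = after["violations"] / after["months"]
        steady = abs(rate_after - rate_before) <= tolerance * rate_before
        if rate_before >= min_rate and steady:
            return True
    return False


# Hypothetical locations.
suspect = [
    {"holder": "X", "months": 24, "violations": 24},
    {"holder": "Y", "months": 12, "violations": 11},
]
cleaned_up = [
    {"holder": "X", "months": 24, "violations": 24},
    {"holder": "Z", "months": 12, "violations": 1},
]
print(looks_like_flip(suspect), looks_like_flip(cleaned_up))
```

A genuine change of ownership would usually change the pattern of violations; a flagged location is only a lead for an inspector to check, not proof of a flip.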

Another example involves prescription drugs. There are about 2,150 pharmacies operating in New York City. They’re licensed, to a certain degree, by another agency. Those pharmacies receive Medicaid reimbursement for certain prescription drugs, specifically Oxycontin. [Oxycontin] is a big problem, not just for us, but for everywhere in the country right now. It’s really kind of taking off as ‘legal crack.’

Out of those 2,150, we needed to figure out the small number that were overrepresented in Oxycodone distribution. So, we did a simple screen for whether a pharmacy distributes a high amount of Oxycodone relative to its square footage and the size and the total amount of revenue it generates. We were able to isolate 21 out of those 2,150 pharmacies that were real outliers. That’s less than one percent of the pharmacy universe but accounting for north of 60 percent of total Oxycodone Medicaid reimbursement. Then we cross-tabulated that with the law enforcement activities on the locations. 20 of the 21 had had fraud events at those locations.

That told us that the metrics that we came up with were very good predictors. What we’ve done now is, just on a forward-looking basis, that’s how we analyze Medicaid reimbursements. It’s a way for us to capture problem pharmacies.
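
The screen Flowers describes can be sketched in a few lines of Python. This version uses only the ratio of Oxycodone reimbursement to total revenue (square footage is left out for brevity), and the threshold, field names and dollar figures are invented for the example.

```python
def screen_pharmacies(pharmacies, ratio_threshold=0.5):
    """Flag pharmacies whose Oxycodone reimbursement is large relative to revenue.

    Returns the flagged pharmacies and their share of all Oxycodone reimbursement.
    """
    flagged = [
        p for p in pharmacies
        if p["oxy_reimbursement"] / p["total_revenue"] >= ratio_threshold
    ]
    total = sum(p["oxy_reimbursement"] for p in pharmacies)
    share = sum(p["oxy_reimbursement"] for p in flagged) / total
    return flagged, share


# Hypothetical pharmacies; amounts are in dollars.
pharmacies = [
    {"name": "P1", "oxy_reimbursement": 900_000, "total_revenue": 1_000_000},
    {"name": "P2", "oxy_reimbursement": 50_000, "total_revenue": 2_000_000},
    {"name": "P3", "oxy_reimbursement": 30_000, "total_revenue": 1_500_000},
    {"name": "P4", "oxy_reimbursement": 20_000, "total_revenue": 800_000},
]

flagged, share = screen_pharmacies(pharmacies)
print([p["name"] for p in flagged], f"{share:.0%}")
```

The second return value is the figure of interest in the interview: how much of the total reimbursement a very small set of outliers accounts for.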

Another example is waste disposal data. The Department of Sanitation will only pick up your trash if you’re a resident. If you’re not a resident, then you have to contract with a private waste hauler. A city agency regulates private waste hauling in the city because historically we had mafia issues. Frankly, there are organized crime issues surrounding waste hauling in New York City. There’s no surprise there.

If there’s data from another agency, like the Department of Consumer Affairs or the Department of Health or the Department of Finance, indicating that [a company] is a going concern, but there’s no existing license activity from the Business Integrity Commission (the private waste hauling industry regulator), that is a good tell for us that they’re not registered with the Business Integrity Commission, which smells bad, right? (No pun intended.)

That means that you’re not doing what you need to do [with the regulator] which is a good tell for something being afoot. There’s also a public health hazard there, because you’re talking about medical waste. For example, the Department of Sanitation won’t pick that up, including certain food waste generated by restaurants. So if somebody’s just illegally dumping, this is a way to capture them.
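
In data terms this is an anti-join: keep the businesses that show activity with other agencies but have no record with the regulator. A minimal Python sketch follows, with made-up business names and a simplified notion of what counts as a record at the Business Integrity Commission.

```python
def missing_registrations(active_by_agency, registered):
    """Businesses seen by other agencies but absent from the regulator's list."""
    seen = {}
    for agency, businesses in active_by_agency.items():
        for name in businesses:
            seen.setdefault(name, set()).add(agency)
    return {
        name: sorted(agencies)
        for name, agencies in seen.items()
        if name not in registered
    }


# Hypothetical activity records from three agencies.
active_by_agency = {
    "DCA": {"Joe's Diner", "Acme Deli"},
    "Health": {"Joe's Diner", "Corner Cafe"},
    "Finance": {"Acme Deli"},
}
# Hypothetical set of businesses with waste hauling records at the regulator.
registered = {"Acme Deli", "Corner Cafe"}

leads = missing_registrations(active_by_agency, registered)
print(leads)
```

The list of agencies attached to each lead shows how much independent evidence there is that the business is actually operating.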

We’re doing that for the Business Integrity Commission. There’s about 15 to 20 other exercises just like that where the methodology is the same.

You’ve mentioned methodology a number of times now. What is the methodology by which you’re finding these kinds of patterns? Where did you start? And where are you going?

Flowers: The City of New York knows so much about persons, places, and businesses through its regulatory activity.

The example I like to give is the coffee shop that you go into. If you walk into any coffee shop, usually, you are going to see six or seven stickers on the door. The lion’s share of those stickers are going to be generated by a municipality, whether it’s the Department of Health if they serve food; the fire department if there’s people publicly congregating; the Department of Buildings which is concerned about architecture and structural integrity, et cetera, et cetera, et cetera. It’s just never been married up before.

People are able to arbitrage these different agencies, like ‘we’ll give one agency one piece of information, another agency another piece of information.’ So by putting it all in one place, what we’ve been able to do is identify places that should have data but don’t, right? It’s very “Sherlock Holmes,” to a certain degree, and ‘the dog that didn’t bark.’ If there’s no activity where other pieces of data indicate there should be activity, that’s something we pay attention to.

You can take out of 900,000 buildings or nine million permanent residents or 65,000 miles of roads or whatever and zero-in on the one to five percent that really pose a problem for the city from a regulatory, or even a law enforcement, standpoint and allocate our limited resources towards remediating, bringing them to bear where the real problems are. We take all of the information we know, as a city, about persons and locations and businesses, consistent with our statutory privacy obligations, and then cross-tab that with whatever outcome a specific agency is tasked with addressing. That’s the basic methodology. And we do share that.

The issue that I always think about with these kinds of systems is data quality. If you put garbage in, you get garbage out. If people rely upon analysts to help them understand the world, and they’re getting bad data, they’re going to make bad decisions. How do you deal with data quality issues? Do you post the data online after it’s cleaned up?

Flowers: You’re absolutely right. The data ranges in quality. Data from some agencies, like NYPD (which is a real leader, frankly, in using their data to drive their policing decisions), is really, really tight. The geocoding is down to the latitude and longitude. The names are generally quite tight, because there are multiple points to corroborate.

Some of the entities with which they deal, all the way down to certain agencies, just the error rate [of their data] is north of like 20 percent, which is horrifying. By having a breadth and diversity of sources and sorting just by location, or business entity name, it enables us to eliminate, to a certain degree, some of those errors.

So, say we have a higher error rate, like a 20 percent error rate, in waste hauling data. If we have a much higher rate of confidence in other data, from other sources that also touch on that location, it’ll give us a sense of actually how to fix those errors individually in those datasets. That’s one way.
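
A simple version of that cross-source correction is a majority vote: when several sources describe the same location, take the value most of them agree on. The Python sketch below assumes this voting rule, which the interview does not spell out; the source names and the mistyped address are invented.

```python
from collections import Counter


def consensus_value(values_by_source, min_agreement=2):
    """Return the value most sources agree on, or None without enough agreement."""
    counts = Counter(value for value in values_by_source.values() if value)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= min_agreement else None


# Hypothetical address field for one location as recorded by three sources.
records = {
    "waste_hauling": "322 E 19 ST",  # likely typo in the lower-quality source
    "finance": "332 E 19 ST",
    "buildings": "332 E 19 ST",
}

fixed = consensus_value(records)
print(fixed)
```

A production version would presumably weight sources by their known error rates rather than counting each one equally.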

Another way is that we just do basic cleanup. I’d say about 30 percent of the time of my analysts is spent cleaning the data sources that they get for the specific purpose for which it’s being brought to bear.

By way of example, I have one person here who discovered that there are little tools within SAS that have a “sounds-like” component. Here in New York, with a high non-native population, names that are not spelled out in the Roman alphabet manifest themselves in the Roman alphabet in many different ways. You could have 19 different versions of Mohammed, 12 versions of Jong, things like that. There are little tools that allow us to do a sounds-like component which gets us around two problems.

First, is the data entry problem. Somebody’s typing it in and spelling it in many different ways, as well as getting around just the fact that there may not be any garbage component: it’s just that people use different ways of spelling their name on different forms, for cultural reasons.
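
Flowers does not name the specific SAS tool. Soundex is one classic sounds-like encoding, and the Python sketch below implements it from scratch to show how differently spelled names can collapse to the same key. Treat it as an example of the general technique, not as the team's actual method.

```python
# American Soundex: keep the first letter, encode the rest as digits.
CODES = {}
for digit, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex key for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        # H and W do not separate repeated codes; vowels do.
        if ch not in "HW":
            previous = code
    return (encoded + "000")[:4]


variants = ["Mohammed", "Muhammad", "Mohamed", "Mohamad"]
keys = {soundex(v) for v in variants}
print(keys, soundex("Jong"), soundex("Jung"))
```

Matching on the key rather than the raw spelling groups these variants together, although Soundex only helps when the first letter is the same: Jong and Chong, for instance, still get different keys.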

Second, what’s the product? How are we bringing the insight to bear? It’s one thing to say, ‘Okay. I’ve got a list of 30 targets that you need to go out and audit now.’ Right? An audit costs $65,000 to initiate. That’s a lot of money. We treat the agencies as clients. If I send one of my clients out on a wild goose chase where only half the time they’re going to have success, then they’ve just wasted a lot of time and money on something that didn’t bear any fruit. I can’t have that. We try and get it as tight as possible.

What we do is we complement different jobs. Say you have the Department of Consumer Affairs concerned about somebody flipping a license. That person is also a ripe target for an audit, but we’ll do it in sequence. We’ll send the DCA people out first. It’s easier for them, because all they do is just walk in and do an eyes-on on a location. If that eyes-on at a location indicates that it’s actually a going business or that something’s afoot, then they just write the citations and do their enforcement thing. If not, if there’s nothing wrong, then that’s their day-to-day job. It’s what they have to do, so it hasn’t been really much of a waste. But if it is a positive, that gives us increased confidence to send that very target over to the Department of Finance to audit. So we have multiple layers of dealing with bad data, ranging from simple tech solutions to diversity of corroboration all the way to how you ultimately sequence the operationalization. Does that make any sense?

It does. New York City has been a strong proponent of open data in recent years, given statements about improving its digital ‘data mine’ and sharing it online. There are vast amounts of NYC regulatory data about people, places and things. Is that being posted into this same data store that NYC is using that the public can then see?

Flowers: We’re getting there. I actually work really closely with DoITT [the New York City Department of Information Technology and Telecommunications] on these issues.

There’s a real problem with posting “garbage” publicly, too. Say you post all of the 311 calls that we received in 2011. We get 65,000 of them a day, so it’s a massive amount of information. Your error rate on those is so high that it’s actually pretty useless for somebody to do something with. Through our cleaning exercises, what we’ve done is feed the clean data that we can feed (we get agency permission to do that) back to DoITT to post. That’s one way.

How do we put data out that is governmentally available but not publicly available? Is it fair, for example, to post that a restaurant got a complaint about something when we don’t know if that complaint was founded? In other words, there’s a lot of reasons people call 311. It could be you’re actually pissed off. It could be that you’re trying to leverage something for business purposes. People do that here, for real estate reasons and other reasons. It’s a ferociously competitive environment here in New York City. We really need to be careful that the data we put out isn’t going to unfairly compromise a person or a business, unless there’s real merit to the complaint. What we’re trying to do is work with them and put out founded complaints: things that were actually true. Screening those out from the ones that aren’t is really the biggest lift, in terms of pushing this data out to the public.

Government entities, at all levels, can lose productivity because of a lack of data sharing. One classic example is ‘stovepiping’ in the intelligence community. How are you handling that internally, so that when other agencies want to collaborate internally, they can, even if you’re not publishing it externally for the reasons you just listed?

Flowers: We’re about to roll out an analytics warehouse that would have feeds from all of the agencies that we have gotten data from, that would be made available back out to the governmental community here in New York City.

Through that data store, the Department of Buildings will have the ability to directly access information from the fire department or the Department of Finance or whatever in a way that they can then import and use internally.

In other words, we’re in the process of doing just that, breaking down the data stovepipe. What my office ends up doing most of the time is breaking down the “operationalization stovepipe,” embarking on joint exercises. The reason that we’ve had so much success, frankly, is because the agency sees that it’s beneficial to them and so they come to us. In the beginning, it was I would go to them and say, “Give me a wishlist. What’s your biggest headache? And let’s see what we can do about it.”

Now, we’re just drowning in requests from agencies who get it and are really onboard. Getting the data from my part is easy, if they give it to me. It’s a condition precedent for me to do work for you: I need backend access to your system. Hopefully, through the warehouse, they won’t need to go through me anymore. That’s the plan, that it makes it more institutional and more sustainable at that point.

The fact of the matter is that there’s a lot of mistrust and a lot of turf [disputes] among agencies that’s hard to get around. We’ve managed to get around it by showing them that they have skin in the game and that it’s ultimately to their benefit.

Parked on top of that platform, by the way, are a number of analytics tools, including SAS, Palantir, and a couple of others, that will permit the agencies to do the kind of work that my group’s been doing on their own. We’ll just kind of sit back and consult for them on an as-needed basis.

By the very success of the projects that my group does, it shows that this is a great way to go for each individual agency out there. We’re also building a long-term sustainable platform for them to start doing for themselves. That also has a robust training component to it that my office is conducting.

Will the rest of what goes into that data warehouse be subject to Freedom of Information law requests? How does all of this data relate to public accountability? You can imagine open government watchdogs, good government advocates and other groups that are interested in not just city performance but also how agencies are performing. As they create their own lists of restaurants or services, that regulatory data could certainly be quite relevant to that work. Do you anticipate regulatory data will be affirmatively published?

Flowers: It is complicated. First off, the Mayor just signed a local law passed by city council to push almost all of our transactional data out to the public by 2017. My thing, the analytics warehouse, is going to be a big component of that, because it’s going to be clean. Or relatively clean, I should say.

I would also note that about 95 percent of what my group works with is already publicly available. But when you say publicly available, what do you mean? Do you go onto a website and there’s a dropdown box and I can individually search for 332 East 19th Street in Manhattan as opposed to getting the entire batch file?

The latter is definitely not true right now. The former is true. You could go on and find out a lot by what at least 15 or 16 agencies know about a location, just through that dropdown search.

But to do the real fun work, frankly, the holistic big data work, you need the full set, not just a one-off. Layered into that, we do have stringent privacy regulations adhering to certain data streams that we work with and we have to respect those.

Enterprises like this fail because of a failure to respect the privacy obligations that the city has. They’re significant on some data sets. Tax data, for example, is extremely sensitive. Property tax is public, but personal income tax or even business income tax is protected information. We could never share that data.

Certain human services data is extremely sensitive, from a HIPAA standpoint. Or FERPA, which is the student records issue. Those things, we can’t negotiate with, nor would we want to.

Whatever we put out is going to be compliant with those regulations as well. There are also state and local privacy regulations that we have to adhere to. Balancing all of those equities, I feel like we’ll be able to put out the vast majority of what, at the end of the day, is citizen data. We’re a government of the people, by the people, for the people. We’re here to serve them. It’s their data. So it’ll get out there, but in a way that’s consistent with our privacy concerns and our privacy obligations under federal, state, and local law.

Why is it important for government to pursue more data-driven policy? Has New York City been able to create a culture more driven by measurements and data, as opposed to, say, the gut-instincts of politicians or their political allegiances?

Flowers: Necessity. We are facing tremendous fiscal challenges, just like every other city in the country. Our pension costs, our legacy costs. There are so many other cities out there — the vast majority, frankly — that are in the same position as GM was. We have to make the decision how much are we going to cut our capacity to service the citizens of today to pay for what we did yesterday. And so frankly, we have declining resources. There’s been successive hiring freezes citywide that the administration.

I’m not trying to flog for the Mayor, although, frankly, this is all possible because of him and his outlook. The fact is that we need to continue to drive crime down. We need to continue to drive fires down. We need to continue to make sure that our public health systems are working. And we need to do that in an era of extreme belt tightening that is still with us. New York City has weathered the recession better than anyone else — but we’re still facing a recession. And the fact of the matter is if we are to have any hope of continuing to service the citizenry at not just the levels we do now but higher levels — because our population continues to grow as well, urbanization marches forward — we have to take this approach.

It’s not a matter of, “Oh, let’s do this because it’s cool.” We are facing serious fiscal constraints in how to continue to keep this city vibrant and going as a global leader. The administration has taken this approach, a data-driven outcome-driven approach, as one of the tools to get us there. That’s what’s driving it for us. It’s quite fascinating to look at things this way, but the fact of the matter is it wouldn’t be getting as much traction as it’s getting if we didn’t need to do it.

Closing analysis

New York City Mayor Michael Bloomberg has been saying for years that “In God we trust; everyone else, show me the data.” A series of interviews over the past few years has suggested that his outlook on using data analytics in government has been filtering down into city departments in the Big Apple.

“I’ve long said that my time in business and government has taught me that if you can’t measure it, you can’t manage it,” said Mayor Bloomberg in a statement last year. “Our Administration has made usage of data a hallmark of our problem-solving strategies. Like us, other cities across the country are also working to come up with innovative ways to use data — particularly in these times of fiscal discipline — and we should all be learning from each other’s experiences.”

As I observed last October in a feature on data and open government in NYC, Bloomberg’s approach is no surprise to anyone familiar with the mission statement of his eponymous financial data company:

“Bloomberg started out with one core belief: that bringing transparency to capital markets through access to information could increase capital flows, produce economic growth and jobs, and significantly reduce the cost of doing business.”

Under the Bloomberg administration, New York City has been a national leader in evangelizing open data, making the people’s data available to the people with the goals of improved transparency, accountability, insight into performance, and citizensourcing better ideas or processes along the way.

A longstanding open question for this observer, however, has been how well the Big Apple was consuming its own data, particularly with respect to how the city allocates resources and responds to public safety issues. This interview with Mike Flowers answered that question and suggests many more to ask in the future.