In an age of technology-fueled transparency, corporations are subject to the same powerful disruption as governments. In that context, data journalism has profound importance for society. If a researcher needs data for business journalism, OpenCorporates is a bona fide resource.
Today, OpenCorporates is making a new open database of corporate officers and directors available to the world.
“It’s pretty cool, and useful for journalists, to be able to search not just all the companies with directors for a given name in a given state, but across multiple states,” said Chris Taggart, founder of OpenCorporates, in an email interview. “Not surprisingly, loads of people, from journalists to corruption investigators, are very interested in this.”
OpenCorporates is the largest open database of companies and corporate data in the world. The service now contains public data from around the world, from health and safety violations in the United Kingdom to official public notices in Spain to a register of federal contractors. The database has been built by the open data community, under a bounty scheme in conjunction with ScraperWiki. The site also has a useful Google Refine reconciliation function that matches legal entities to company names. Taggart’s presentation on OpenCorporates from the 2012 NICAR conference provides an overview.
The OpenCorporates open application programming interface can be used with or without a key, although an API key does increase usage limits. The open data site’s business model comes with an interesting hook: while OpenCorporates makes its data both free and open under a Share-Alike Attribution Open Database License, users who wish to import the data into a proprietary database or use it without attribution must pay to do so.
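As a rough illustration of what "with or without a key" can look like in practice, here is a minimal Python sketch that only builds an officer-search URL and makes no network call. The base URL, the `v0.4` version string, the `officers/search` path and the `q`, `jurisdiction_code` and `api_token` parameter names are assumptions based on the publicly documented API, not details given in this article, so check them against the current API reference before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL and version; neither is stated in the article.
API_BASE = "https://api.opencorporates.com"
API_VERSION = "v0.4"


def officer_search_url(name, jurisdiction_code=None, api_token=None):
    """Build an officer-search URL; the key is optional.

    Parameter names (q, jurisdiction_code, api_token) are assumptions
    and should be verified against the API documentation.
    """
    params = {"q": name}
    if jurisdiction_code:
        # Restrict the search to one register, e.g. "us_fl" for Florida.
        params["jurisdiction_code"] = jurisdiction_code
    if api_token:
        # A key is only appended when supplied; it raises usage limits.
        params["api_token"] = api_token
    return f"{API_BASE}/{API_VERSION}/officers/search?{urlencode(params)}"


# Keyless search across all jurisdictions.
print(officer_search_url("john smith"))
# Same search limited to one state, with a placeholder key.
print(officer_search_url("john smith", "us_fl", api_token="YOUR_KEY"))
```

Leaving out `jurisdiction_code` is what gives the cross-state search Taggart describes; passing it narrows the query to a single register.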
“The critical thing about our Directors import, and *all* the other data in OpenCorporates, is that we give the provenance, both where and when we got the information,” said Taggart. “This is in contrast to the proprietary databases who never give this, because they don’t want you to go straight to the source, which also means it’s problematic in tracing the source of errors. We’ve had several instances of the data being wrong at the source, like U.K. health and safety violations.”
Taggart offered more perspective on the source of OpenCorporates director data, corporate data availability and the landscape around a universal business ID in the rest of our interview:
Where does the officer and director data come from? How is it validated and cleaned?
It’s all from the official company registers. Most are scraped (we’ve scraped millions of pages); a couple (e.g. Vermont) are from downloads that the registries provide. We just need to make sure we’re scraping and importing properly. We do some cleaning up (e.g. removing some of the ‘**NO DIRECTOR**’ entries), but to a degree this has to be done post-import, as you often don’t know about these until they’re imported, which is why there are still a few in there.
By the way, in case you were wondering, the reason there are so many more directors than in the filters to the right is that there are about 3 million and counting Florida directors.
Was this data available anywhere before? If no, why not?
As far as I’m aware, only in proprietary databases. Proprietary databases have dominated company data. The result is massive duplication of effort, databases that have opaque errors in them, because they don’t have many eyes on them, and lack of access to the public, small businesses, and as you will have heard from NICAR, journalists. I’m tempted to offer a bottle of champagne to the first journalist who finds a story in the directors data.
Who else is working on the universal business ID issue? I heard Beth Noveck propose something along these lines, for instance.
Several organizations have been working on this, mostly from a semi-proprietary point of view, or at least trying to generate a monopoly ID. In other words, it might be open, but in order to get anything on the company, you have to use their site as a lookup table.
OpenCorporates is different in that if you know the URI you know the jurisdiction and identity issued by the company register and vice versa. This means you don’t need to ask OpenCorporates what the company ID is, as it’s there in the ID. It also works with the EU/W3C’s Business Vocabulary, which has just been published.
ISO has been working on one, but it’s got exactly this problem. Also, their database won’t contain the company number, meaning it doesn’t link to the legal entity. Bloomberg have been working on one, as have Thomson Reuters, as they need an alternative to the DUNS number, but from the conversations I had in D.C., nobody’s terribly interested in this.
I don’t really know the status of Beth’s project. They were intending to create a new ID too. From speaking to Jim Hendler, it didn’t seem to be connected to the legal entity but instead to represent a search of the name (actually a hash of a SPARQL query). You can see a demo site at http://tw.rpi.edu/orgpedia/companies. I have severe doubts regarding this.
Finally, there’s the Financial Stability Board’s (part of the G20) work on a global legal entity identifier — we’re on the advisory board for this. This also would be a new number, and be voluntary, but on the other hand will be openly licensed.
I don’t think it’s a solution to the problem, as it won’t be complete and for other reasons, but it may surface more information. We’d definitely provide an entity resolution service to it.
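Taggart's earlier point, that an OpenCorporates URI already carries the jurisdiction and the register-issued company number, can be sketched in a few lines of Python. The URI pattern `https://opencorporates.com/companies/<jurisdiction>/<company_number>` and the helper names below are assumptions made for illustration; the interview does not spell out the exact format.

```python
from typing import NamedTuple
from urllib.parse import urlparse

# Assumed URI prefix; the article does not give the exact pattern.
URI_PREFIX = "https://opencorporates.com/companies"


class CompanyId(NamedTuple):
    jurisdiction: str     # e.g. "gb" or "us_fl"
    company_number: str   # identifier issued by the official register


def company_uri(jurisdiction, company_number):
    """Compose the URI from register data alone, with no lookup table."""
    return f"{URI_PREFIX}/{jurisdiction}/{company_number}"


def parse_company_uri(uri):
    """Recover jurisdiction and company number from the URI itself."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "companies":
        raise ValueError(f"not a company URI: {uri!r}")
    return CompanyId(parts[1], parts[2])


# Hypothetical example values, used only to show the round trip.
uri = company_uri("gb", "00102498")
print(uri)
print(parse_company_uri(uri))
```

The round trip works in both directions without contacting any server, which is the contrast Taggart draws with identifiers that are only meaningful through their issuer's lookup site.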