There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?
NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-’90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It’s something they don’t have to think about all that much.
The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it’s all searchable and downloadable.
It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if it followed astronomy’s lead. The fact that the discipline’s literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and for sociological research.
For example, the .Astronomy series of conferences is a small workshop that brings together the best and the brightest of the technical community: researchers, developers, educators and communicators. Billed as “20% time for astronomers,” it gives these people space to think about how new technologies affect both how research is done and how it is communicated to their peers and to the public.
[Disclosure: I'm a member of the advisory board to the .Astronomy conference, and I previously served as a member of the programme organising committee for the conference series.]
It should perhaps come as little surprise that one of the more interesting projects to come out of a hack day held as part of this year’s .Astronomy meeting in Heidelberg was work by Robert Simpson, Karen Masters and Sarah Kendrew that focused on data mining the astronomical literature.
The team grabbed and processed the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS) since each of those journals started publication — and that’s 1827 in the case of MNRAS.
By the end of the day, they’d found some interesting results showing how various terms have trended over time. The results were similar to what’s found in Google Books’ Ngram Viewer.
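To give a flavour of that kind of trend analysis, here is a minimal sketch in Python. It is not the team's actual code: the function name term_share_by_year, the (year, text) record layout and the toy sample records are all assumptions made for illustration, and a real run would read titles and abstracts exported from ADS instead.

```python
from collections import defaultdict


def term_share_by_year(papers, term):
    """Return {year: fraction of that year's papers whose text mentions term}.

    `papers` is an iterable of (year, text) pairs. This layout is a
    hypothetical one chosen for the sketch, not an ADS export format.
    """
    totals = defaultdict(int)  # papers seen per year
    hits = defaultdict(int)    # papers per year that mention the term
    needle = term.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():  # simple case-insensitive substring match
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up toy records for illustration only, not real ADS data.
sample_papers = [
    (1979, "Rotation curves and dark matter in spiral galaxies"),
    (1979, "Photometry of variable stars"),
    (2012, "Dark matter halo profiles"),
    (2012, "Constraints on dark matter annihilation"),
    (2012, "Exoplanet transit timing"),
    (2012, "Stellar populations in dwarf galaxies"),
]

trend = term_share_by_year(sample_papers, "dark matter")
```

Dividing by the number of papers published each year matters here: the literature grows over time, so raw counts of a term would rise even if interest in it stayed flat.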
Since the meeting, however, Robert has built on those initial results and dug further into his new corpus of astronomical literature. He’s explored various visualisations of the data, including word matrices for related terms and for astrochemistry terms.
He’s also taken a look at authorship in astronomy and is starting to find some interesting trends.
You can see that single-author papers dominated for most of the 20th century. Around 1960, we see the decline begin, as two- and three-author papers begin to become a significant chunk of the whole. In 1978, multi-author papers become more prevalent than single-author papers.
Here we see that people begin to outpace papers in the 1960s. This may reflect the fact that as we get more technical as a field, and more specialised, it takes more people to write the same number of papers, which is a sort of interesting result all by itself.
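The single-author share behind trends like these is simple to compute once each paper carries a year and an author list. The following Python sketch assumes a hypothetical (year, authors) record layout and uses invented sample records; it illustrates the calculation and is not Robert's own code.

```python
from collections import Counter, defaultdict


def single_author_share(records):
    """Return {year: fraction of that year's papers with exactly one author}.

    `records` is an iterable of (year, authors) pairs, where `authors`
    is a list of names. The layout is an assumption made for this sketch.
    """
    counts = defaultdict(Counter)
    for year, authors in records:
        kind = "single" if len(authors) == 1 else "multi"
        counts[year][kind] += 1
    return {
        year: c["single"] / (c["single"] + c["multi"])
        for year, c in sorted(counts.items())
    }


# Invented sample records for illustration only.
sample_records = [
    (1955, ["A. Author"]),
    (1955, ["B. Author"]),
    (1955, ["C. Author", "D. Author"]),
    (2012, ["E. Author"]),
    (2012, ["F. Author", "G. Author"]),
    (2012, ["H. Author", "I. Author", "J. Author"]),
    (2012, ["K. Author", "L. Author"]),
]

shares = single_author_share(sample_records)
```

Counting distinct author names per year in the same pass would give the "people versus papers" comparison, although that count is only as good as the name matching behind it.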
Interview with Robert Simpson: Behind the project and what lies ahead
What made you think about data mining the ADS?
Robert Simpson: At the .Astronomy 4 Hack Day in July, Sarah Kendrew had the idea to try to do an astronomy version of BrainSCANr, a project that generates new hypotheses in the neuroscience literature. I’ve had a go at mining ADS and arXiv before, so it seemed like a great excuse to dive back in.
Do you think there might be actual science that could be done here?
Robert Simpson: Yes, in the form of finding questions that were unexpected. With such large volumes of peer-reviewed papers being produced daily in astronomy, there is a lot being said. Most researchers can only try to keep up with it all — my daily RSS feed from arXiv is next to useless, it’s so bloated. In amongst all that text, there must be connections and relationships that are being missed by the community at large, hidden in the chatter. Maybe we can develop simple techniques to highlight potential missed links, i.e. generate new hypotheses from the mass of words and data.
Are the results coming out of the work useful for auditing academics?
Robert Simpson: Well, perhaps, but that would be tricky territory in my opinion. I’ve only just begun to explore the data around authorship in astronomy. One thing that is clear is that we can see a big trend toward collaborative work. In 2012, only 6% of papers were single-author efforts, compared with 70+% in the 1950s.
We can measure how large groups are becoming, and who is part of which groups. In that sense, we can audit research groups, and maybe individual people. The big issue is keeping track of people through variations in their names and affiliations. Identifying authors is probably a solved problem if we look at ORCID.
What about citations? Can you draw any comparisons with h-index data?
Robert Simpson: I haven’t looked at h-index stuff specifically, at least not yet, but citations are fun. I looked at the trends surrounding the term “dark matter” and saw something interesting. Mentions of dark matter rise steadily after it first appears in the late ’70s.
In the data, astronomy becomes more and more obsessed with dark matter — the term appears in 1% of all papers by the end of the ’80s and 6% today.
Looking at citations changes the picture. The community is writing papers about dark matter more and more each year, but they are getting fewer citations than they used to (the peak for this was in the late ’90s). These trends are normalised, so the only recency effect I can think of is that dark matter papers take more than 10 years to become citable. Either that or dark matter studies are currently in a trough for impact.
Can you see where work is dropped by parts of the community and picked up again?
Robert Simpson: Not yet, but I see what you mean. I need to build a better picture of the community and its components.
Can you build a social graph of astronomers out of this data? What about (academic) family trees?
Robert Simpson: Identifying unique authors is my next step, followed by creating fingerprints of individuals at a given point in time. When do people create their first-author papers, when do they have the most impact in their careers, stuff like that.
What tools did you use? In hindsight, would you do it differently?
Robert Simpson: My aim is still to attempt real hypothesis generation. I’ve begun the process by investigating correlations between terms in the literature, but I think the power will be in being able to compare all terms with all terms and looking for the unexpected. Terms may correlate indirectly (via a third term, for example), so the entire corpus needs to be processed and optimised to make it work comprehensively.
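As a rough illustration of what comparing terms pairwise might look like, the Python sketch below scores every pair of terms by how often they appear in the same abstracts. The interview does not say which correlation measure Robert uses; the Jaccard overlap here is an assumption chosen for simplicity, and the function name and sample abstracts are invented.

```python
from itertools import combinations


def term_cooccurrence(abstracts, terms):
    """Score each pair of terms by the Jaccard overlap of the abstracts they appear in.

    `terms` should be lowercase. Jaccard overlap is a stand-in measure
    chosen for this sketch, not one confirmed by the interview.
    """
    # For each term, the set of abstract indices that mention it.
    docs_with = {
        term: {i for i, text in enumerate(abstracts) if term in text.lower()}
        for term in terms
    }
    scores = {}
    for a, b in combinations(terms, 2):
        union = docs_with[a] | docs_with[b]
        shared = docs_with[a] & docs_with[b]
        scores[(a, b)] = len(shared) / len(union) if union else 0.0
    return scores


# Invented sample abstracts for illustration only.
sample_abstracts = [
    "Dark matter halos and galaxy rotation curves",
    "Galaxy rotation curves without dark matter",
    "Dust in protoplanetary disks",
    "Dark matter and gravitational lensing",
]

scores = term_cooccurrence(
    sample_abstracts, ["dark matter", "rotation curves", "dust"]
)
```

A pairwise score like this only captures direct co-occurrence. Finding the indirect links Robert mentions, where two terms are connected through a third, would mean looking at chains of high-scoring pairs across the whole corpus.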
Science between the cracks
I’m really looking forward to seeing more results coming out of Robert’s work. This sort of analysis hasn’t really been possible before. It’s showing a lot of promise, both from a sociological angle, with the ability to do research into how science is done and how that has changed, and ultimately as a hypothesis engine, something that can generate new science in and of itself. This is just a hack day experiment. Imagine what could be done if the literature were more open and this sort of analysis could be done across fields.
Right now, a lot of the most interesting science is being done in the cracks between disciplines, but the hardest part of that sort of work is often trying to understand the literature of the discipline that isn’t your own. Robert’s project offers a lot of hope that this may soon become easier.