Harvard University announced this week that it would make more than 12 million catalog records from its 73 libraries publicly available. These records contain bibliographic information about books, manuscripts, maps, videos, and audio recordings. The Harvard Library is making these records available under a Creative Commons 0 license, in accordance with its Open Metadata Policy.
The records will be available for download from Harvard and via an API from the Digital Public Library of America (DPLA), an initiative that’s aiming to build an online national public library. The records released from Harvard are in the MARC21 format and include information that describes the various works — author, title, publisher, date, subject headings.
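Because the records are in standard MARC21, existing open source tools can parse them. Here is a minimal sketch using the pymarc library; the filename is hypothetical, and the exact field layout of Harvard's dump may vary:

```python
# A hedged sketch: reading bibliographic fields from a MARC21 dump with pymarc.
# "harvard_records.mrc" is a hypothetical filename for one of the record files.
from pymarc import MARCReader

with open("harvard_records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        title = record["245"]["a"] if record["245"] else None    # title statement
        author = record["100"]["a"] if record["100"] else None   # main author entry
        subjects = [f["a"] for f in record.get_fields("650")]    # topical subjects
        print(title, author, subjects)
```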
“This is Big Data for books,” David Weinberger, co-director of Harvard’s Library Lab, told The New York Times’ Quentin Hardy. “There might be 100 different attributes for a single object.”
The hope is that by making the metadata openly available, other libraries will follow suit and developers will be able to build new applications. “By instituting a policy of open metadata, the Harvard Library has expressed its appreciation for the great potential that library metadata has for innovative uses,” said Stuart Shieber, Library Board member and Professor of Computer Science at Harvard, in the press release.
Cloudera has released the latest beta version of its Hadoop distribution: CDH4. It offers upgrades to Flume, Sqoop, Hue, Oozie and Whirr, and support for new versions of Red Hat, CentOS, SUSE, Ubuntu and Debian.
Cloudera says CDH4 has a great many enhancements over CDH3, including better availability, utilization, extensibility and security. The new version also contains a “significantly redesigned MapReduce.” However, Cloudera says it plans to support both generations of MapReduce for the life of CDH4.
The “operational intelligence” company Splunk had its IPO this past week. As Forbes writer Josh Bersin noted, the initial offering was hot, coming in with “a valuation at 28X revenue ($3.2 billion). This valuation trumps the hot companies in social networking: Jive trades at 20X revenue, Google trades at 5X revenue, and Facebook, well we’ll see.” Bersin argues that “big data” is “big news” and “big business,” and he draws several lessons from the IPO and the market’s response for HR and talent management, including the observation that “most businesses today have plenty of data with which to make decisions.”
Photo: Harvard College Library bookplate with withdrawal stamp by kladcat, on Flickr
Screenshot from the Wikidata Data Model page.
The Wikimedia Foundation — the good folks behind Wikipedia — recently proposed a Wikidata initiative: a new project to build a free secondary database of structured data that could, in turn, support Wikipedia and other Wikimedia projects. According to the proposal:
“Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata, you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today.”
But in The Atlantic this week, Mark Graham, a research fellow at the Oxford Internet Institute, takes a look at the proposal, calling these “changes that have worrying connotations for the diversity of knowledge in the world’s sixth most popular website.” Graham points to the different language editions of Wikipedia, noting that the encyclopedic knowledge contained therein is highly diverse. “Not only does each language edition include different sets of topics, but when several editions do cover the same topic, they often put their own, unique spin on the topic. In particular, the ability of each language edition to exist independently has allowed each language community to contextualize knowledge for its audience.”
Graham fears that emphasizing a standardized, machine-readable, semantic-oriented Wikipedia will lose this local flavor:
“The reason that Wikidata marks such a significant moment in Wikipedia’s history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is the fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups.”
His arguments raise questions about the perceived universality of data, when in fact what we might find instead is terribly nuanced and localized, particularly when that data is contributed by humans who are distributed globally.
Netflix’s recommendation engine is often cited as a premier example of how user data can be mined and analyzed to build a better service. This week, Netflix’s Xavier Amatriain and Justin Basilico penned a blog post offering insights into the challenges that the company — and thanks to the Netflix Prize, the data mining and machine learning communities — have faced in improving the accuracy of movie recommendation engines.
The Netflix post raises some interesting questions about how the means of content delivery have changed recommendations. In other words, when Netflix refocused on its streaming product, viewing interests changed (and not just because the selection changed). The same holds true for the multitude of ways in which we can now watch movies via Netflix (there are hundreds of different device options for accessing and viewing content from the service).
Amatriain and Basilico write:
“Now it is clear that the Netflix Prize objective, accurate prediction of a movie’s rating, is just one of the many components of an effective recommendation system that optimizes our members’ enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts.”
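For context, the Netflix Prize objective mentioned above was minimizing root-mean-square error (RMSE) on predicted star ratings, with the winning team required to beat Netflix's own system by 10%. A toy illustration of the metric, with made-up numbers:

```python
# RMSE between predicted and actual star ratings (hypothetical data).
import math

actual = [4, 3, 5, 2, 4]
predicted = [3.8, 3.4, 4.6, 2.5, 4.1]

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                 / len(actual))
print(round(rmse, 3))  # lower is better; the Prize chased a 10% improvement
```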
The National Archives released the 1940 U.S. Census records on Monday, after a mandatory 72-year waiting period. The release marks the single largest collection of digital information ever made available online by the agency.
Screenshot from the digital version of the 1940 Census.
The 1940 Census, conducted as a door-to-door survey, included questions about age, race, occupation, employment status, income, and participation in New Deal programs — all important (and intriguing) following the previous decade’s Great Depression. One data point: in 1940, there were 5.1 million farmers. According to the 2010 American Community Survey (not the census, mind you), there were just 613,000.
The ability to glean these sorts of insights proved to be far more compelling than the National Archives anticipated, and the website hosting the data, Archives.com, was temporarily brought down by the traffic load. The site is now up, so anyone can investigate the records of approximately 132 million Americans. The records are searchable by map — or rather, “the appropriate enumeration district” — but not by name.
The Obama administration unveiled its “Big Data Research and Development Initiative” late last week, with more than $200 million in financial commitments. Among the White House’s goals: to “advance state-of-the-art core technologies needed to collect, store, preserve, manage, analyze, and share huge quantities of data.”
The new big data initiative was announced with a number of departments and agencies already on board with specific plans, including grant opportunities from the Department of Defense and the National Science Foundation, new spending on DARPA’s XDATA program to build new computational tools, and open data initiatives such as the 1000 Genomes Project.
“In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use big data for scientific discovery, environmental and biomedical research, education, and national security,” said Dr. John P. Holdren, assistant to the President and director of the White House Office of Science and Technology Policy in the official press release (PDF).
When the Girls Around Me app was released, using data from Foursquare and Facebook to notify users when there were females nearby, many commentators called it creepy. “Girls Around Me is the perfect complement to any pick-up strategy,” the app’s website once touted. “And with millions of chicks checking in daily, there’s never been a better time to be on the hunt.”
“Hunt” is an interesting choice of words here, and Cult of Mac, among other blogs, asked if the app was encouraging stalking. The resulting outcry prompted Foursquare to revoke the app’s API access, and the developers later pulled it from the App Store voluntarily.
Many of the responses to the app raised issues about privacy and user data, and questioned whether women in particular should be extra cautious about sharing their information with social networks. But as Amit Runchal writes in TechCrunch, this response blames the victims:
“You may argue, the women signed up to be a part of this when they signed up to be on Facebook. No. What they signed up for was to be on Facebook. Our identities change depending on our context, no matter what permissions we have given to the Big Blue Eye. Denying us the right to this creates victims who then get blamed for it. ‘Well,’ they say, ‘you shouldn’t have been on Facebook if you didn’t want to …’ No. Please recognize them as a person. Please recognize what that means.”
Writing here at Radar, Mike Loukides expands on some of these issues, noting that the questions are always about data and social context:
“It’s useful to imagine the same software with a slightly different configuration. Girls Around Me has undeniably crossed a line. But what if, instead of finding women, the app was Hackers Around Me? That might be borderline creepy, but most people could live with it, and it might even lead to some wonderful impromptu hackathons. EMTs Around Me could save lives. I doubt that you’d need to change a single line of code to implement either of these apps, just some search strings. The problem isn’t the software itself, nor is it the victims, but what happens when you move data from one context into another. Moving data about EMTs into context where EMTs are needed is socially acceptable; moving data into a context that facilitates stalking isn’t acceptable, and shouldn’t be.”
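To make Loukides’ point concrete, here is a toy sketch (hypothetical code, not Foursquare’s actual API) in which the mechanics are entirely generic and only the search string determines whether the result is helpful or creepy:

```python
# A toy illustration (hypothetical code, not Foursquare's real API): the
# mechanics are generic; the search string alone sets the social meaning.
import math

def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def people_around_me(checkins, query, here, radius_km=1.0):
    """Filter public check-ins by a profile search string and proximity."""
    return [c for c in checkins
            if query.lower() in c["profile"].lower()
            and distance_km(c["latlon"], here) <= radius_km]

# Same code, very different social contexts:
#   people_around_me(checkins, "EMT", here)     # could save lives
#   people_around_me(checkins, "female", here)  # facilitates stalking
```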
Ars Technica’s James Grimmelmann examines the recent history of the Principality of Sealand, a World War II anti-aircraft platform located six miles off the coast of England. Some reports claim WikiLeaks is looking to relocate its servers there, ostensibly out of reach of legal threats and government interference. Why Sealand? It claims it’s an independent nation, and as such it “sounds perfect for WikiLeaks: a friendly, legally unassailable host with an anything-goes attitude,” writes Grimmelmann.
But as Grimmelmann notes, Sealand’s history isn’t exactly the “cryptographers’ paradise” one might expect. In the early 2000s, a company called HavenCo set up shop there with a “no-questions-asked colocation” facility. Dandy in theory, but not in practice. The endeavor was never remotely successful, and the company spiraled downward, eventually becoming “nationalized” by Sealand. “HavenCo no longer had real technical experts or the competitive advantage of being willing to host legally risky content,” Grimmelmann writes. “What it did have was an absurdly inefficient cost structure. Every single piece of equipment, drop of fuel, and scrap of food had to be brought in by boat or helicopter. By 2006, ‘Sealand’ hosting was in a London data center. By 2008, even the HavenCo website was offline.”
It’s a fascinating story about the promises of data havens and the long arm of the law. It’s also a cautionary tale for WikiLeaks, suggests Grimmelmann. “Sealand isn’t going to save WikiLeaks any more than putting the site’s servers in a former nuclear bunker would. The legal system figured out a long time ago that throwing the account owner in jail works just as well as seizing the server.”
ThinkUp, one of the flagship products from the non-profit Expert Labs, will get a reboot as a for-profit company, write founders Gina Trapani and Anil Dash. The ThinkUp app is an open source tool that allows users to store, search and analyze all their social media activity (posts to Facebook, Twitter, Google+, etc.).
It’s a simple tool, says Dash:
“But what ThinkUp represents is a lot of important concepts: Owning your actions and words on the web. Encouraging more positive and fruitful conversations on social networks. Gaining insights into ourselves and our friends based on what we say and share. And the possibility of discovering important information or different perspectives if we can return the web back to its natural state of not being beholden to any one company or proprietary network.”
ThinkUp will remain open source but it will evolve to include an “easy-to-use product with mainstream appeal,” says Trapani. Expert Labs will be winding down, but the new company that has grown out of it will share many parts of the organization’s original mission.
With the headline “Just the Facts. Yes, All of Them,” The New York Times profiles Gil Elbaz, the founder of the data startup Factual. “The world is one big data problem,” Elbaz tells journalist Quentin Hardy.
“Data has always been seen as just a side effect in computing, something you look up while you are doing work,” Elbaz says in the Times piece. “We see it as a whole separate layer that everyone is going to have to tap into, data you want to solve a problem, but that you might not have yourself, and completely reliable.”
Principality of Sealand coat of arms via Wikimedia Commons.
The “Data Science Debate” panel at Strata California 2012. Watch the debate.
The Oxford-style debate at Strata continues to be one of the most talked-about events from the conference. This week, it’s O’Reilly’s Mike Loukides who weighs in with his thoughts on the debate, which had the motion “In data science, domain expertise is more important than machine learning skill.” (For those who weren’t there, the machine learning side “won.” See Mike Driscoll’s summary and full video from the debate.)
Loukides moves from the unreasonable effectiveness of data to examine the “unreasonable necessity of subject experts.” He writes:
“Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes ‘unreasonably effective’ through the conversation that takes place after the numbers have been crunched … We can only take our inexplicable results at face value if we’re just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they’re based. And that’s the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can’t forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.”
Microsoft has hired Raghu Ramakrishnan as a technical fellow for its Server and Tools Business (STB), reports ZDNet’s Mary Jo Foley. According to his new company bio, Ramakrishnan’s work will involve “big data and integration between STB’s cloud offerings and the Online Services Division’s platform assets.”
Ramakrishnan comes to Microsoft from Yahoo, where he’s been the chief scientist for three divisions — Audience, Cloud Platforms and Search. As Foley notes, Ramakrishnan’s move is another indication that Microsoft is serious about “playing up its big data assets.” Strata chair Edd Dumbill examined Microsoft’s big data strategy earlier this year, noting in particular its work on a Hadoop distribution for Windows server and Azure.
How much is your data worth? The Atlantic’s Alexis Madrigal does a little napkin math based on figures from the Internet Advertising Bureau to come up with a broad and ambiguous range between half a cent and $1,200 — depending on how you decide to make the calculation, of course.
In an effort to make those measurements easier and more useful, Google unveiled some additional reports as part of its Analytics product this week. It’s a move Google says will help marketers:
“… identify the full value of traffic coming from social sites and measure how they lead to direct conversions or assist in future conversions; understand social activities happening both on and off of your site to help you optimize user engagement and increase social key performance indicators (KPIs); and make better, more efficient data-driven decisions in your social media marketing programs.”
Engagement and conversion metrics for each social network will now be trackable through Google Analytics. Partners for this new Social Data Hub include Disqus, Echo, Reddit, Diigo, and Digg, among others.
The visualization site Visual.ly launched a new tool this week that helps users create their own infographics. Aptly called Visual.ly Create, the new feature lets people take publicly available datasets (such as information from a Twitter hashtag), select a template, and publish their own infographics.
Segment from a Visual.ly Create infographic of the #stratconf hashtag.
As GigaOm’s Derrick Harris observes, it’s fairly easy to spot the limitations with this service — in the data you can use, in the templates that are available, and in the visualizations that are created. But after talking to Visual.ly’s co-founder and Chief Content Officer Lee Sherman about some “serious customization options” that are in the works, Harris wonders if a tool like this could be something to spawn interest in data science:
“The problem is that we need more people with math skills to meet growing employer demand for data scientists and data analysts. But how do you get started caring about data in the first place when the barriers are so high? Really working with data requires a deep understanding of both math and statistics, and Excel isn’t exactly a barrel of monkeys (nor are the charts it creates).”
Could Visual.ly be an on-ramp for more folks to start caring about and playing with data?
Late last week, San Francisco Mayor Ed Lee unveiled the new data.SFgov.org, a cloud-based open data website that will replace DataSF.org, one of the earliest examples of civic open data initiatives.
“By making City data more accessible to the public secures San Francisco’s future as the world’s first 2.0 City,” said Lee in an announcement. “It’s only natural that we move our Open Data platform to the cloud and adopt modern open interface to facilitate that flow and access to information and develop better tools to enhance City services.”
The city’s Chief Innovation Officer Jay Nath told TechCrunch that the update to the website expands access to information while saving the city money.
The new site contains some 175 datasets, including map-based crime data, active business listings, and various financial datasets. It’s powered by the Seattle-based data startup Socrata.
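Socrata portals expose datasets through the SODA API as simple JSON endpoints, so pulling rows takes only a few lines. The dataset ID below is hypothetical; real IDs are listed on data.SFgov.org:

```python
# A hedged sketch: pulling rows from a Socrata-backed portal over its SODA API.
# The dataset ID "abcd-1234" is hypothetical; real IDs are listed on the site.
import requests

url = "https://data.sfgov.org/resource/abcd-1234.json"
rows = requests.get(url, params={"$limit": 5}).json()  # SODA supports $limit
for row in rows:
    print(row)
```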
“One day I’m sure everyone will routinely collect all sorts of data about themselves,” writes Mathematica and Wolfram Alpha creator Stephen Wolfram. “But because I’ve been interested in data for a very long time, I started doing this long ago. I actually assumed lots of other people were doing it too, but apparently they were not. And so now I have what is probably one of the world’s largest collections of personal data.”
And what a fascinating collection of data it is, including emails received and sent, phone calls made, calendar events planned, keystrokes made, and steps taken. Through this, you can see Wolfram’s sleep, social, and work patterns, and even how various chapters of his book and Mathematica projects took shape.
“The overall pattern is fairly clear,” Wolfram writes. “It’s meetings and collaborative work during the day, a dinnertime break, more meetings and collaborative work, and then in the later evening more work on my own. I have to say that looking at all this data, I am struck by how shockingly regular many aspects of it are. But in general, I am happy to see it. For my consistent experience has been that the more routine I can make the basic practical aspects of my life, the more I am able to be energetic — and spontaneous — about intellectual and other things.”
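Anyone curious to try a small-scale version of this kind of personal analytics can start with little more than a file of timestamps. A minimal sketch, assuming one ISO-format timestamp per line:

```python
# A minimal sketch of Wolfram-style personal analytics: plot the time of day
# of each sent email against its date. The timestamps file is hypothetical,
# with one ISO-format timestamp per line.
import datetime as dt
import matplotlib.pyplot as plt

with open("sent_email_timestamps.txt") as fh:
    stamps = [dt.datetime.fromisoformat(line.strip()) for line in fh]

dates = [t.date() for t in stamps]
hours = [t.hour + t.minute / 60 for t in stamps]

plt.scatter(dates, hours, s=1)
plt.ylabel("hour of day")
plt.title("Sent email, by date and time of day")
plt.show()
```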
Over the past week, O’Reilly’s Alex Howard has profiled a number of practicing data journalists, following up on the National Institute for Computer-Assisted Reporting’s (NICAR) 2012 conference. Howard argues that data journalism has enormous importance, but “given the reality that those practicing data journalism remain a tiny percentage of the world’s media, there’s clearly still a need for its foremost practitioners to show why it matters, in terms of impact.”
Edd Dumbill takes a look at data marketplaces, the online platforms that host data from various publishers and offer it for sale to consumers. Dumbill compares four of the most mature data marketplaces — Infochimps, Factual, Windows Azure Data Marketplace, and DataMarket — and examines their different approaches and offerings.
Dumbill says marketplaces like these are useful in three ways:
“First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume.”
The Atlantic’s Dashiell Bennett examines the MIT Sloan Sports Analytics Conference, a “festival of sports statistics” that has grown over the past six years from 175 attendees to more than 2,200.
Bennett writes:
“For a sports conference, the event is noticeably athlete-free. While a couple of token pros do occasionally appear as panel guests, this is about the people behind the scenes — those who are trying to figure out how to pick those athletes for their team, how to use them on the field, and how much to pay them without looking like a fool. General managers and team owners are the stars of this show … The difference between them and the CEOs of most companies is that the sports guys have better data about their employees … and a lot of their customers have it memorized.”
DataSift, one of the two companies with official access to the Twitter firehose (the other being Gnip), announced its new Historics service this week, giving customers access to up to two years’ worth of historical tweets. (By comparison, Gnip offers 30 days of Twitter data, and other developers and users have access to roughly a week’s worth of tweets.)
GigaOm’s Barb Darrow responded to those who might be skeptical about the relevance of this sort of historic Twitter data in a service that emphasizes real-time. Darrow noted that DataSift CEO Rob Bailey said companies planning new products, promotions or price changes would do well to study the impact of their past actions before proceeding and that Twitter is the perfect venue for that.
Another indication of the desirability of this new Twitter data: the waiting list for Historics already includes a number of Fortune 500 companies. The service will get its official launch in April.
Although there are plenty of ways to receive formal training in math, statistics and engineering, there aren’t a lot of options when it comes to an education specifically in data science.
To that end, the Open Knowledge Foundation and Peer to Peer University (P2PU) have proposed a School of Data, arguing that:
“It will be years before data specialist degree paths become broadly available and accepted, and even then, time-intensive degree courses may not be the right option for journalists, activists, or computer programmers who just need to add data skills to their existing expertise. What is needed are flexible, on-demand, shorter learning options for people who are actively working in areas that benefit from data skills, particularly those who may have already left formal education programmes.”
The organizations are seeking volunteers to help develop the project, whether that’s in the form of educational materials, learning challenges, mentorship, or a potential student body.
The Strata Conference wraps up today in Santa Clara, Calif. If you missed Strata this year and weren’t able to catch the livestream of the conference, look for excerpts and videos posted here on Radar and through the O’Reilly YouTube channel in the coming weeks.
And be sure to make plans for Strata New York, being held October 23-25. That event will mark the merger with Hadoop World. The call for speaker proposals for Strata NY is now open.
The big data marketplace Infochimps announced this week that it will begin offering the platform it has built for itself to other companies — as both a platform-as-a-service and an on-premise solution. “The technical needs for Infochimps are pretty substantial,” says CEO Joe Kelly, and the company now plans to help others get up to speed with implementing a big data infrastructure.
Infochimps has offered datasets for download or via API for a number of years (see my May 2011 interview with the company here), but the startup is now making the transition to offer its infrastructure to others. Likening its big data marketplace to an “iTunes for data,” Infochimps says it’s clear that we still need a lot more “iPods” in production before most companies are able to handle the big data deluge.
Infochimps will now offer its in-house expertise to others. That includes a number of tools that one might expect: AWS, Hadoop, and Pig. But it also includes Ironfan, Infochimps’ management tool built on top of Chef.
Infochimps isn’t abandoning the big data marketplace piece of its business. However, its move to support companies with their big data efforts is an indication that there’s still quite a bit of work to do before everyone is ready to “do stuff” with the big data we’re accumulating.
A fascinating piece of research is set to appear at IEEE S&P on the subject of Internet-scale authorship identification based on “stylometry,” an analysis of writing style. The paper was co-authored by Arvind Narayanan, Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song. They were able to correctly identify writers 20% of the time based on looking at what those writers had published online before. It’s a finding with serious implications for online anonymity and free speech, the team notes.
“The good news for authors who would like to protect themselves against de-anonymization is it appears that manually changing one’s style is enough to throw off these attacks,” says Narayanan.
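The paper’s actual models are more sophisticated, but the general flavor of stylometric attribution is easy to sketch: turn each text into character n-gram features (which capture punctuation, spacing, and spelling habits) and train a classifier on texts by known authors. A toy version with hypothetical data:

```python
# A toy stylometry sketch: character n-gram features + a linear classifier.
# Illustrates the general technique only, not the paper's actual models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training corpus: texts with known authors.
train_texts = [
    "I simply cannot agree with the premise of this argument.",
    "lol yeah that build is totally broken again",
    "In my considered view, the evidence suggests otherwise.",
    "ugh, rebooted the router twice already and nothing",
]
train_authors = ["alice", "bob", "alice", "bob"]

# Character 3-5-grams capture punctuation, spacing, and spelling habits
# that tend to persist across everything an author writes.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_authors)

# Attribute an unseen, "anonymous" text to its most likely author.
print(model.predict(["honestly the router firmware is just broken lol"]))
```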
O’Reilly Media has just published a report on “Data for the Public Good.” In the report, Alex Howard makes the argument for a systemic approach to thinking about open data and the public sector, examining the case for a “public good” around public data as well as around governmental, journalistic, healthcare, and crisis situations (to name but a few scenarios and applications).
Howard notes that the success of recent open data initiatives “won’t depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered.” Although many municipalities have made the case for open data initiatives, there’s more to the puzzle, Howard argues, including recognizing the importance of personal data and making the case for a “hybridized public-private data.”
The “Data for the Public Good” report is available for free as a PDF, EPUB, or MOBI download.
Yahoo offered a peek behind the scenes of its front page with the release of the Yahoo C.O.R.E. Data Visualization. The visualization provides a way to view some of the demographic details behind what Yahoo visitors are clicking on.
The C.O.R.E. (Content Optimization and Relevance Engine) technology was created by Yahoo Labs. The tech is used by Yahoo News and its Today module to personalize results for its visitors — resulting in some 13,000,000 unique story combinations per day. According to Yahoo:
“C.O.R.E. determines how stories should be ordered, dependent on each user. Similarly, C.O.R.E. figures out which story categories (i.e. technology, health, finance, or entertainment) should be displayed prominently on the page to help deepen engagement for each viewer.”
Screenshot from Yahoo’s CORE data visualization. See the full visualization here.
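As a conceptual illustration only (not Yahoo’s algorithm), here is the simplest form of the underlying idea: rank content categories for a user by a smoothed click-through estimate, one of many signals a relevance engine might combine:

```python
# A conceptual sketch (not Yahoo's algorithm): rank story categories for one
# user by a smoothed click-through estimate. Counts here are hypothetical.
clicks = {"tech": 12, "health": 3, "finance": 7}
impressions = {"tech": 40, "health": 30, "finance": 35}

def ctr(category):
    # Laplace smoothing keeps rarely shown categories from scoring zero.
    return (clicks.get(category, 0) + 1) / (impressions.get(category, 0) + 2)

ranked = sorted(impressions, key=ctr, reverse=True)
print(ranked)  # ['tech', 'finance', 'health'] for these counts
```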
Over on the High Scalability blog, Todd Hoff examines how the blogging site Tumblr was able to scale its infrastructure, something that Hoff describes as more challenging than the scaling that was necessary at Twitter.
To give some idea of the scope of the problem, Hoff cites these figures:
“Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers.”
Hoff interviews Blake Matheny, distributed systems engineer at Tumblr, for a look at the architecture of both “old” and “new” Tumblr. When the startup began, it was hosted on Rackspace where “it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate.”
The article also describes the Tumblr firehose, noting again its differences from Twitter’s. “A challenge is to distribute so much data in real-time,” Hoff writes. “[Tumblr] wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed.” Although Tumblr initially used Scribe/Hadoop, “this model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second.”
Data scientist Pete Warden offers his own lessons learned about building visualizations this week in a story here on Radar. His first tip: “Play with your data” — that is, before you decide what problem you want to solve or visualization you want to create, take the time to know the data you’re working with.
Warden writes:
“The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization.”
Warden explains how he was able to create a visualization for his new travel startup, Jetpac, that showed where American Facebook users go on vacation. Warden’s tips aren’t simply about the tools he used; he also walks through the conceptualization of the project as well as the crunching of the data.
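In that spirit, the “play with your data” step often starts with nothing fancier than loading the raw file and eyeballing it. A minimal sketch with pandas, using a hypothetical dataset:

```python
# "Play with your data": inspect the raw file before designing a visualization.
# The CSV and its columns are hypothetical stand-ins for your own dataset.
import pandas as pd

df = pd.read_csv("checkins.csv")
print(df.head())                            # eyeball a few raw rows
print(df.describe(include="all"))           # ranges, counts, obvious outliers
print(df["state"].value_counts().head(10))  # where do most records come from?
```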
The computational knowledge engine Wolfram|Alpha unveiled a pro version this week. For $4.99 per month ($2.99 for students), Wolfram|Alpha Pro offers access to more of the computational power “under the hood” of the site, in part by allowing users to upload their own datasets, which Wolfram|Alpha will in turn analyze.
Wolfram|Alpha Pro subscribers can upload and analyze their own datasets.
There’s also a new extended keyboard that contains the Greek alphabet and other special characters for manually entering data. Data and analysis from these entries and any queries can also be downloaded.
“In a sense,” writes Wolfram’s founder Stephen Wolfram, “the concept is to imagine what a good data scientist would do if confronted with your data, then just immediately and automatically do that — and show you the results.”
Ushahidi‘s Patrick Meier takes a look at the recently released Data Protection Manual issued by the International Organization for Migration (IOM). According to the IOM, the manual is meant to serve as a guide to help:
” … protect the personal data of the migrants in its care. It follows concerns about the general increase in data theft and loss and the recognition that hackers are finding ever more sophisticated ways of breaking into personal files. The IOM Data Protection Manual aims to protect the integrity and confidentiality of personal data and to prevent inappropriate disclosure.”
Meier describes the manual as “required reading” but notes that there is no mention of social media in the 150-page document. “This is perfectly understandable given IOM’s work,” he writes, “but there is no denying that disaster-affected communities are becoming more digitally-enabled — and thus, increasingly the source of important, user-generated information.”
Meier moves through the Data Protection Manual’s principles, highlighting the ones that may be challenged when it comes to user-generated, crowdsourced data and raising important questions about consent, privacy, and security.
Many online dating websites claim that their algorithms are able to help match singles with their perfect mate. But a forthcoming article in “Psychological Science in the Public Interest,” a journal of the Association for Psychological Science, casts some doubt on the data science of dating.
According to the article’s lead author Eli Finkel, associate professor of social psychology at Northwestern University, “there is no compelling evidence that any online dating matching algorithm actually works.” Finkel argues that dating sites’ algorithms do not “adhere to the standards of science,” and adds that “it is unlikely that their algorithms can work, even in principle, given the limitations of the sorts of matching procedures that these sites use.”
It’s “relationship science” versus the in-take questions that most dating sites ask in order to help users create their profiles and suggest matches. Finkel and his coauthors note that some of the strongest predictors for good relationships — such as how couples interact under pressure — aren’t assessed by dating sites.
The paper calls for the creation of a panel to grade the scientific credibility of each online dating site.
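For a sense of what questionnaire-based matching amounts to mechanically, and why Finkel finds it thin, here is a toy similarity score over profile answers; it is illustrative only, not any site’s actual method:

```python
# A toy profile-similarity score (illustrative only, not any site's method).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical questionnaire answers on a 1-5 scale.
alice = [5, 1, 3, 4]
bob = [4, 2, 3, 5]
print(round(cosine(alice, bob), 3))  # high similarity != a good relationship
```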
When the file-storage and sharing site Megaupload had its domain name seized, assets frozen and website shut down in mid-January, the U.S. Justice Department contended that the owners were operating a site dedicated to copyright infringement. But that posed a huge problem for those who were using Megaupload for the legitimate and legal storage of their files. As the EFF noted, these users weren’t given any notice of the seizure, nor were they given an opportunity to retrieve their data.
Moreover, it seemed this week that those users would have all their data deleted, as Megaupload would no longer be able to pay its server fees.
While it appears that users have won a two-week reprieve before any deletion actually occurs, the incident does raise a number of questions about users’ data rights and control in the cloud. Specifically: What happens to user data when a file hosting / cloud provider goes under? And how much time and notice should users have to reclaim their data?
This is what you see when you visit Megaupload.com.
The financial news and information company Bloomberg opened its market data distribution interface this week. The BLPAPI is available under a free-use license at open.bloomberg.com. According to the press release, some 100,000 people already use the BLPAPI, but with this week’s announcement, the interface will be more broadly available.
The company introduced its Bloomberg Open Symbology back in 2009, a move to provide an alternative to some of the proprietary systems for identifying securities (particularly those services offered by Bloomberg’s competitor Thomson Reuters). This week’s opening of the BLPAPI is a similar gesture, one that the company says is part of its “Open Market Data Initiative, an ongoing effort to embrace and promote open solutions for the financial services industry.”
The BLPAPI works with a range of programming languages, including Java, C, C++, .NET, COM and Perl. But while the interface itself is free to use, the content is not.
Pentaho’s extract-transform-load technology Pentaho Kettle is being moved to the Apache License, Version 2.0. Kettle was previously available under the GNU Lesser General Public License (LGPL).
By moving to the Apache license, Pentaho says it will be more in line with the licensing of Hadoop, HBase, and a number of NoSQL projects.
Kettle downloads and documentation are available at the Pentaho Big Data Community Home.
Andy Baio took a look at some of the data surrounding piracy and the Oscar screening process. There has long been concern that the review copies of movies distributed to members of the Academy of Motion Picture Arts and Sciences were making their way online. Baio observed that while a record number of films have been nominated for Oscars this year (37), just eight of the “screeners” have been leaked online, “a record low that continues the downward trend from last year.”
However, while the number of screeners available online has diminished, almost all of the nominated films (34) had already been leaked online. “If the goal of blocking leaks is to keep the films off the Internet, then the MPAA [Motion Picture Association of America] still has a long way to go,” Baio wrote.
Baio has a number of additional observations about these leaks (and he also made the full data dump available for others to examine). But as the MPAA and others make arguments (and help pen related legislation) to crack down on Internet piracy, a good look at piracy trends seems particularly important.
GigaOm’s Derrick Harris explores some of the big data obstacles and opportunities surrounding genome research. He notes that:
When the Human Genome Project successfully concluded in 2003, it had taken 13 years to complete its goal of fully sequencing the human genome. Earlier this month, two firms — Life Technologies and Illumina — announced instruments that can do the same thing in a day, one for only $1,000. That’s likely going to mean a lot of data.
But as Harris observes, the promise of quick and cheap genomics is leading to other problems, particularly as the data reaches a heady scale. A fully sequenced human genome is about 100GB of raw data. But citing DNAnexus founder Andreas Sundquist, Harris says that:
… volume increases to about 1TB by the time the genome has been analyzed. He [Sundquist] also says we’re on pace to have 1 million genomes sequenced within the next two years. If that holds true, there will be approximately 1 million terabytes (or 1,000 petabytes, or 1 exabyte) of genome data floating around by 2014.
That makes the promise of a $1,000 genome sequencing service challenging when it comes to storing and processing petabytes of data. Harris posits that it will be cloud computing to the rescue here, providing the necessary infrastructure to handle all that data.
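The article’s arithmetic is easy to check under its stated assumptions:

```python
# Back-of-envelope check of the numbers cited above.
analyzed_tb_per_genome = 1        # ~1 TB per analyzed genome, per Sundquist
genomes = 1_000_000               # projected sequenced genomes by 2014

total_tb = genomes * analyzed_tb_per_genome
print(total_tb, "TB =", total_tb / 1_000, "PB =", total_tb / 1_000_000, "EB")
# -> 1000000 TB = 1000.0 PB = 1.0 EB, matching the article's estimate
```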
Literary critic and New York Times opinionator Stanley Fish has been on a bit of a rampage in recent weeks, taking on the growing field of the “digital humanities.” Prior to the annual Modern Language Association meeting, Fish cautioned that alongside the traditional panels and papers on Ezra Pound and William Shakespeare and the like, there was going to be a flood of sessions devoted to:
…’the digital humanities,’ an umbrella term for new and fast-moving developments across a range of topics: the organization and administration of libraries, the rethinking of peer review, the study of social networks, the expansion of digital archives, the refining of search engines, the production of scholarly editions, the restructuring of undergraduate instruction, the transformation of scholarly publishing, the re-conception of the doctoral dissertation, the teaching of foreign languages, the proliferation of online journals, the redefinition of what it means to be a text, the changing face of tenure — in short, everything.
That “everything” was narrowed down substantially in Fish’s editorial this week, in which he blasted the digital humanities for what he sees as its fixation “with matters of statistical frequency and pattern.” In other words: data and computational analysis.
According to Fish, the problem with digital humanities is that this new scholarship relies heavily on the machine — and not the literary critic — for interpretation. Fish contends that digital humanities scholars are all teams of statisticians and positivists, busily digitizing texts so they can data-mine them and systematically and programmatically uncover something of interest — something worthy of interpretation.
University of Illinois, Urbana-Champaign English professor Ted Underwood argues that Fish not only mischaracterizes what digital humanities scholars do, but he misrepresents how his own interpretive tradition works:
… by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account, we normally begin with a hypothesis (which seems to have sprung, like Sin, fully-formed from our head), and test it against a single sentence.
One of the most interesting responses to Fish’s recent rants about the humanities’ digital turn comes from University of North Carolina English professor Daniel Anderson, who demonstrates in a video a far fuller picture of what digital “data” — creation and interpretation — can look like.
Two big data events announced this week that they’ll be merging: Hadoop World will now be part of the Strata Conference in New York this fall.
[Disclosure: The Strata events are run by O’Reilly Media.]
Cloudera first started Hadoop World back in 2009, and as Hadoop itself has seen increasing adoption, Hadoop World, too, has become more popular. Strata is a newer event — its first conference was held in Santa Clara, Calif., in February 2011, and it expanded to New York in September 2011.
With the merger, Hadoop World will be a featured program at Strata New York 2012 (Oct. 23-25).
In other Hadoop-related news this week, Strata chair Edd Dumbill took a close look at Microsoft’s Hadoop strategy. Although it might be surprising that Microsoft has opted to adopt an open source technology as the core of its big data plans, Dumbill argues that:
Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly-growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.
Also, Cloudera data scientist Josh Wills takes a closer look at one aspect of that ecosystem: the work of scientists whose research falls outside of statistics and machine learning. His blog post specifically addresses one use case for Hadoop — seismology, for which there is now Seismic Hadoop — but the post also provides a broad look at what constitutes the practice of data science.
Photo: Bootstrap DNA by Charles Jencks, 2003 by mira66, on Flickr