"strata week" entries

Strata Week: Big data’s big future

Big data in 2013, and beyond; the Sunlight Foundation's new data mining app; and the growth of our planet's central nervous system.

Here are a few stories from the data space that caught my attention this week.

Big data will continue to be a big deal

“Big data” became something of a buzz phrase in 2012, with its role in the US presidential election and with businesses large and small starting to realize the benefits and challenges of the mountains of data they are amassing — so much so that NPR’s linguist contributor Geoff Nunberg thinks it should have been the phrase of the year.

Nunberg says that though “it didn’t get the wide public exposure given to items like ‘frankenstorm,’ ‘fiscal cliff’ and YOLO,” and might not have been “as familiar to many people as ‘Etch A Sketch’ and ‘47 percent’” were during the election, big data has become a phenomenon affecting our lives: “It’s responsible for a lot of our anxieties about intrusions on our privacy, whether from the government’s anti-terrorist data sweeps or the ads that track us as we wander around the Web.” He also notes that big data has transformed statistics into “a sexy major” and predicts the term will long outlast “Gangnam Style.” (You can read Nunberg’s full case for big data at NPR.)

Read more…

Strata Week: When will big data outgrow our current metric system?

What's bigger than a yottabyte, the role big data will play in health care, and the potential impact of vehicle data.

Here are a few stories from the data space that caught my attention this week.

Bigger and bigger … and bigger … big data

MIT Technology Review’s business editor Jessica Leber reports this week on a conference presentation by MIT’s Andrew McAfee, in which McAfee predicts data volumes will soon surpass the current upper bound of metric measurement — the yottabyte. McAfee discussed in his presentation (and on his blog) how we’ve moved through the data measurement eras — terabyte, petabyte, exabyte, and soon the zettabyte … leaving us only with the yottabyte for the future. The yottabyte, Leber notes, was the largest scale of measurement scientists could imagine at the 1991 General Conference on Weights and Measures where it was established.

Leber reports that as we head into the zettabyte era, a threshold Cisco predicts we’ll surpass by the end of 2016, McAfee predicts the General Conference on Weights and Measures will convene before the end of the decade to contemplate the yottabyte’s successor. McAfee’s favorite contender prefix, Leber notes, is the “hella.”

Stacey Higginbotham at GigaOm recently covered this issue as well (I reported on it here). She reports that during a recent presentation, Intel’s Shantanu Gupta predicted the next prefixes: brontobytes and gegobytes. Higginbotham notes that the brontobyte is “apparently recognized by some people in the measurement community.”
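For a sense of the scale jump under discussion, here's a small Python sketch laying out the official SI-based units alongside the speculative ones mentioned above; the values for the unofficial prefixes are the ones commonly floated, not anything standardized:

```python
# Back-of-the-envelope look at the data-volume prefixes discussed above.
# The SI prefixes through yotta are official; the entries after that are
# the speculative proposals mentioned in the articles (commonly floated
# as 10^27 and 10^30, though none has been adopted).

PREFIXES = [
    ("terabyte",   1e12),
    ("petabyte",   1e15),
    ("exabyte",    1e18),
    ("zettabyte",  1e21),
    ("yottabyte",  1e24),   # largest official SI prefix (adopted in 1991)
    ("hellabyte",  1e27),   # proposed "hella" prefix -- not official
    ("brontobyte", 1e27),   # informal usage -- not official
    ("gegobyte",   1e30),   # informal usage -- not official
]

for name, size in PREFIXES:
    # Each step up the official ladder is a factor of 1,000
    print(f"1 {name} = {size:.0e} bytes")
```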

Read more…

Strata Week: The Open Data Institute aims to mine the gold in open government data

The ODI's official launch, MIT's Kinect Kinetics project, and legal ways authorities are tracking us.

Here are a few stories from the data space that caught my attention this week.

Open government data gets a startup incubator

The Open Data Institute (ODI), founded by Tim Berners-Lee and artificial intelligence pioneer Nigel Shadbolt, officially launched this week in the U.K. As Berners-Lee and Shadbolt noted in “There’s gold to be mined from all our data” (PDF), the institute was initially funded and commissioned by the U.K. government to “help the public sector to use its own data more effectively,” and by “[w]orking with private companies and universities, it will also develop the capability of U.K. businesses to exploit open data, fostering a generation of open data entrepreneurs.” The institute’s mission is outlined on its website:

“The Open Data Institute will catalyse the evolution of an open data culture to create economic, environmental, and social value. It will unlock supply, generate demand, create and disseminate knowledge to address local and global issues. We will convene world-class experts to collaborate, incubate, nurture and mentor new ideas, and promote innovation. We will enable anyone to learn and engage with open data, and empower our teams to help others through professional coaching and mentoring.”

Jamillah Knowles reports at The Next Web that the institute is already hosting its first startups, including agile big data specialists Mastodon C; corporate information aggregator OpenCorporates; location-based data startup Placr; and Locatable, a startup aiming to help people find their perfect place to live.

Coinciding with the launch, the institute received an investment boost. As Ingrid Lunden reports at TechCrunch, the U.K. government has committed £10 million over the next five years (about $16 million); this week, investment firm Omidyar Network, co-founded by eBay founder Pierre Omidyar and his wife Pam, invested an additional $750,000 in the ODI. Lunden notes that though the ODI is focused on the U.K., having an international investment company on board “gives the effort a potential profile beyond these borders.”

In related news, O’Reilly Radar’s Alex Howard talked with open government developer Eric Mill, who together with GovTrack.us founder Josh Tauberer and New York Times developer Derek Willis published data and scrapers for legislation in Congress from THOMAS.gov in the public domain at github.com/unitedstates. Mill told Howard he’s hoping this work will serve as an example for government to publish the information themselves in the future:

“It would be fantastic if the relevant bodies published this data themselves and made these datasets and scrapers unnecessary. It would increase the information’s accuracy and timeliness, and probably its breadth. It would certainly save us a lot of work! Until that time, I hope that our approach to this data, based on the joint experience of developers who have each worked with it for years, can model to government what developers who aim to serve the public are actually looking for online.”

You can read Howard’s full interview with Mill about building the scraper and the accompanying dataset, using GitHub as a platform, and how the data is being used here.

Read more…

Strata Week: Big data gets warehouse services

AWS Redshift and BitYota launch, big data's problems could shift to real time, and NYPD may be crossing a line with cellphone records.

Here are a few stories from the data space that caught my attention this week.

Amazon, BitYota launch data warehousing services

Amazon announced the beta launch of Amazon Redshift, its Amazon Web Services (AWS) data warehouse service, this week. Paul Sawers at The Next Web reports that Amazon hopes to democratize data warehousing, offering affordable options that make such services viable for small businesses while enticing large companies with cheaper alternatives. Depending on the service plan, customers can launch Redshift clusters scaling to more than a petabyte for less than $1,000 per terabyte per year.

So far, the service has drawn in some big players — Sawers notes that the initial private beta has more than 20 customers, including NASA/JPL, Netflix, and Flipboard.

Brian Proffitt at ReadWrite took an in-depth look at the service, noting its potential speed capabilities and the importance of its architecture. Proffitt writes that Redshift’s massively parallel processing (MPP) architecture “means that unlike Hadoop, where data just sits cheaply waiting to be batch processed, data stored in Redshift can be worked on fast — fast enough for even transactional work.”
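Part of what makes that speed accessible is that Redshift speaks standard SQL over the PostgreSQL wire protocol, so an ad hoc query needs nothing more exotic than an ordinary database driver. A minimal sketch, with a hypothetical cluster endpoint and events table standing in for real ones:

```python
# A minimal sketch of querying a Redshift cluster the way you would any
# PostgreSQL database. The host, credentials, and "events" table are
# hypothetical placeholders -- substitute your own cluster endpoint.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,                  # Redshift's default port
    dbname="analytics",
    user="admin",
    password="...",
)

with conn.cursor() as cur:
    # An ad hoc aggregation of the kind Redshift's MPP architecture is
    # built to parallelize across nodes.
    cur.execute("""
        SELECT event_type, COUNT(*) AS n
        FROM events
        WHERE event_date >= '2012-11-01'
        GROUP BY event_type
        ORDER BY n DESC;
    """)
    for event_type, n in cur.fetchall():
        print(event_type, n)

conn.close()
```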

Proffitt also notes that Redshift isn’t without its red flags, pointing out that a public cloud service raises issues not only of data security, but also of the cost of data access — the bandwidth costs of transferring data back and forth. He also raises concerns that this service may play into Amazon’s typical business model of luring customers into its ecosystem bits at a time. Proffitt writes:

“If you have been keeping your data and applications local, shifting to Redshift could also mean shifting your applications to some other part of the AWS ecosystem as well, just to keep the latency times and bandwidth costs reasonable. In some ways, Redshift may be the AWS equivalent of putting the milk in the back of the grocery store.”

In related news, startup BitYota also launched a data warehousing service this week. Larry Dignan reports at ZDNet that BitYota is built on a cloud infrastructure and uses SQL technology, and that service plans will start at $1,500 per month for 500GB of data. As to competition with AWS Redshift, BitYota co-founder and CEO Dev Patel told Dignan that it’s a non-issue: “[Redshift is] not a competitor to us. Amazon is taking the traditional data warehouse and making it available. We focus on a SaaS approach where the hardware layer is abstracted away,” he said.

Read more…

Strata Week: Big data’s daily influence

Big data's broad effect on our world, myriad uses for traffic data, and Obama's big data practice vs. policy.

Here are a few stories from the data space that caught my attention this week.

How big data is transforming just about everything

Professor John Naughton took a look this week at how big data is transforming various industries that affect our daily lives.

He highlights finance, of course, which he says has been “pathologically mathematised”; marketing, for which there is more data about human behavior than we’ve ever had; and the very broad category of science. Naughton notes that researchers used to conjure up theories and then look to data to support or refute them; now, researchers turn to data to find patterns and connections that might inspire new theories. Naughton also looks at medicine, which is just on the brink of delving into the big data realm. He writes:

“Last week’s news about how Cambridge researchers stopped an MRSA outbreak affecting 12 babies in the Rosie Hospital by rapidly sequencing the genome of the bacteria illustrates how medicine has become a data-intensive field. Even a few years ago, the resources required to achieve this would have involved a roomful of computers and upwards of a week.”

Naughton addresses the use of big data in sports as well, speculating that baseball has been the sport most transformed by data. He’ll likely find agreement there. Barry Eggers goes into depth on the dramatic effect big data is having on baseball over at TechCrunch. He notes that the simple statistical analysis baseball has embraced since its beginnings has evolved into gathering mountains of unstructured data and employing Hadoop to gain new and better insights from data that isn’t part of the structured game information. Eggers writes:

“By having his data scientist run a Hadoop job before every game, [San Francisco Giants manager] Bruce Bochy can not only make an informed decision about where to locate a 3-1 Matt Cain pitch to Prince Fielder, but he can also predict how and where the ball might be hit, how much ground his infielders and outfielders can cover on such a hit, and thus determine where to shift his defense. Taken one step further, it’s not hard to imagine a day where managers like Bochy have their locker room data scientist run real-time, in-game analytics using technologies like Cassandra, Hbase, Drill, and Impala.”

Read more…

Strata Week: Investors embrace Hadoop BI startups

Platfora, Continuuity secure funding; the Internet of Things gets connected; and personal big data needs a national awareness campaign.

Here are a few stories from the data space that caught my attention this week.

Two Hadoop BI startups secure funding

There were a couple of notable pieces of investment news this week. Platfora, a startup looking to democratize Hadoop as a business intelligence (BI) tool for everyday business users, announced it has raised $20 million in series B funding, bringing its total funding to $25.7 million, according to a report by Derrick Harris at GigaOm.

Harris notes that investors seem to get the technology — CEO Ben Werther told Harris that in this funding round, discussions moved to signed term sheets in just three weeks. Harris writes that the smooth investment experience “probably has something to do with the consensus the company has seen among venture capitalists, who project Hadoop will take about 20 percent of a $30 billion legacy BI market and are looking for the startups with the vision to win that business.”

Platfora faces plenty of well-funded legacy BI competitors, but Werther told Christina Farr at VentureBeat that Platfora’s edge is speed: “People can visualize and ask questions about data within hours. There is no six-month cycle time to make Hadoop amazing.”

In other investment news, Continuuity announced it has secured $10 million in series A funding to further develop AppFabric, its cloud-based platform-as-a-service tool designed to host Hadoop-based BI applications. Alex Wilhelm reports at The Next Web that Continuuity is looking to make AppFabric “the de facto location where developers can move their big data tools from idea to product, without worrying about building their own backend, or fretting about element integration.”

Read more…

Strata Week: Real-time Hadoop

Cloudera ventures into real-time queries with Impala, data centers are the new landfill, and Jesper Andersen looks at the relationship between art and data.

Here are a few stories from the data space that caught my attention this week.

Cloudera’s Impala takes Hadoop queries into real-time

Cloudera ventured into real-time Hadoop querying this week, opening up its Impala software platform. As Derrick Harris reports at GigaOm, Impala — an SQL query engine — doesn’t rely on MapReduce, making it faster than tools such as Hive. Cloudera estimates its queries run 10 times faster than Hive’s, and Charles Zedlewski, Cloudera’s VP of products, told Harris that “small queries can run in less than a second.”

Harris notes that Zedlewski pointed out that Impala wasn’t designed to replace business intelligence (BI) tools, and that “Cloudera isn’t interested in selling BI or other analytic applications.” Rather, Impala serves as the execution engine, still relying on software from Cloudera partners — Zedlewski told Harris, “We’re sticking to our knitting as a platform vendor.”
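To make the “execution engine” idea concrete, the sketch below runs the same sort of SQL a Hive batch job would handle, but submits it to an Impala daemon for an interactive answer. It uses impyla, Cloudera's Python DB-API client, with a hypothetical host and page_views table; treat the connection details as assumptions rather than anything specified in these reports:

```python
# A sketch of an interactive query against Impala using the impyla client
# (a Cloudera-maintained DB-API library). The host, port, and page_views
# table are assumptions for illustration, not details from the articles.
from impala.dbapi import connect

conn = connect(host="impalad.example.internal", port=21050)
cur = conn.cursor()

# The same HiveQL-style SQL a Hive batch job would run, but executed by
# Impala's own daemons rather than compiled into MapReduce jobs.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM page_views
    WHERE view_date = '2012-10-24'
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")

for url, hits in cur.fetchall():
    print(url, hits)
```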

Joab Jackson at PC World reports that “[e]ventually, Impala will be the basis of a Cloudera commercial offering, called the Cloudera Enterprise RTQ (Real-Time Query), though the company has not specified a release date.”

Impala has plenty of competition on this playing field, which Harris also covers, and he notes the significance of all the recent Hadoop innovation:

“I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.”

You can read more from Harris’ piece here and Jackson’s piece here. Wired also has an interesting piece on Impala, covering the Google F1 database upon which it is based and the Googler Cloudera hired away to help build it.

(Cloudera CEO Mike Olson discussed Impala, Hadoop and the importance of real-time at this week’s Strata Conference + Hadoop World.)

Read more…

Strata Week: A realistic look at big data obstacles

Obstacles for big data, big data intelligence, and a privacy plugin puts Google and Facebook settings in the spotlight.

Here are a few stories from the data space that caught my attention this week.

Big obstacles for big data

For the latest issue of Foreign Policy, Uri Friedman put together a summarized history of big data to show “[h]ow we arrived at a term to describe the potential and peril of today’s data deluge.” A couple of months ago, MIT’s Alex “Sandy” Pentland took a look at some of that big data potential for Harvard Business Review; this week, he looked at some of the perilous aspects. Pentland writes that to be realistic about big data, it’s important to look not only at its promise but also at its obstacles. He identifies finding meaningful correlations as one of big data’s biggest obstacles:

“When your volume of data is massive, virtually any problem you tackle will generate a wealth of ‘statistically significant’ answers. Correlations abound with Big Data, but inevitably most of these are not useful connections. For instance, your Big Data set may tell you that on Mondays, people who drive to work rather than take public transportation are more likely to get the flu. Sounds interesting, and traditional research methods show that it’s factually true. Jackpot!

“But why is it true? Is it causal? Is it just an accident? You don’t know. This means, strangely, that the scientific method as we normally use it no longer works, because there are so many possible relationships to consider that many are bound to be ‘statistically significant’. As a consequence, the standard laboratory-based question-and-answering process — the method that we have used to build systems for centuries — begins to fall apart.”
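Pentland's “wealth of statistically significant answers” is easy to reproduce: test enough unrelated variables against one another and a predictable fraction will clear the usual p < 0.05 bar by chance alone. A small simulation sketch on pure noise makes the point:

```python
# Demonstrates Pentland's point: with enough variables, purely random data
# yields "statistically significant" correlations by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_vars = 1000, 200
data = rng.normal(size=(n_obs, n_vars))   # pure noise -- no real relationships

significant = 0
tests = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        tests += 1
        if p < 0.05:
            significant += 1

# With ~20,000 pairwise tests, roughly 5% (about 1,000 pairs) will look
# "significant" even though every variable is random.
print(f"{significant} of {tests} correlations significant at p < 0.05")
```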

Pentland says that big data is going to push us out of our comfort zone, requiring us to conduct experiments in the real world — outside our familiar laboratories — and change the way we test the causality of connections. He also addresses the challenges of understanding those correlations well enough to put them to use, of knowing who owns the data and forging the new types of collaborations needed to use it, and of putting individuals in charge of their own data to address big data privacy concerns. This piece and Pentland’s earlier post on big data’s potential are this week’s recommended reads.

Read more…

Strata Week: Dueling views on data center efficiency

The New York Times questions the environmental impact of data centers. Also, big data as hiring manager and inside Foursquare's data science.

The NYT investigates data center pollution, Google buys wind power

The New York Times (NYT) has conducted a year-long investigation into data centers and their environmental impact, and the first reports from that investigation were published this week. NYT writer James Glanz reports that the industry behind the tens of thousands of data centers required around the world to process the vast amounts of data produced by billions of users each day “is sharply at odds with its image of sleek efficiency and environmental friendliness.” Glanz says that through interviews and research, the NYT found data centers to be wasteful with electricity. Glanz reports:

“Online companies typically run their facilities at maximum capacity around the clock, whatever the demand. As a result, data centers can waste 90 percent or more of the electricity they pull off the grid, The Times found. To guard against a power failure, they further rely on banks of generators that emit diesel exhaust. The pollution from data centers has increasingly been cited by the authorities for violating clean air regulations, documents show. … Worldwide, the digital warehouses use about 30 billion watts of electricity, roughly equivalent to the output of 30 nuclear power plants, according to estimates industry experts compiled for The Times. Data centers in the United States account for one-quarter to one-third of that load, the estimates show.”

Glanz also notes the findings showed that only about 6 to 12% of the electricity data centers consume for their servers is actually used to perform computations — the remaining 88 to 94% goes to keeping idle servers at the ready for surges in site activity. You can find Glanz’s full report, along with analysis and industry interviews, here.
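The arithmetic behind those figures is worth spelling out; assuming the roughly one-gigawatt-per-plant equivalence implied by the Times' comparison, a few lines of Python lay out the implied numbers:

```python
# The implied arithmetic behind the NYT estimates quoted above.
worldwide_watts = 30e9            # ~30 billion watts of data center draw
nuclear_plants = 30               # the Times' point of comparison
per_plant = worldwide_watts / nuclear_plants
print(f"Implied output per plant: {per_plant / 1e9:.0f} GW")  # ~1 GW

# US data centers: one-quarter to one-third of the worldwide load
us_low, us_high = worldwide_watts / 4, worldwide_watts / 3
print(f"US share: {us_low / 1e9:.1f}-{us_high / 1e9:.1f} GW")

# Only 6-12% of that electricity does computation; the rest keeps idle
# servers standing by, so the wasted share is roughly 88-94%.
for used in (0.06, 0.12):
    print(f"{used:.0%} used for computation -> {1 - used:.0%} overhead")
```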

Some have criticized the NYT investigation for lumping all data centers together and for relying on old information without looking at the advances taking place in the industry. Those advances were highlighted this week as Google announced it will be powering one of its data centers with wind-generated power. Google’s director of energy and sustainability, Rick Needham, told Robert McMillan at Wired that Google has committed to a 10-year agreement with the Grand River Dam Authority utility company for 48 megawatts of wind power for its data center in Mayes County, Oklahoma. McMillan reports that construction on a 300-megawatt facility to provide the wind energy is underway. The facility is expected to go online later this year.

Read more…

Strata Week: Big problems in the age of big data

Big data and big problems, open data monetization, Hortonworks' first year, and the launch of a new Hadoop Partner Ecosystem.

Here are a few stories that caught my attention in the data space this week.

Big data, Big Brother, big problems

Adam Frank took a look at some of the big problems with big data this week over at NPR. Frank addresses the difficulty of analyzing the sheer volume of complex information inherent in big data. Learning to sort through and mine vast amounts of data to extrapolate meaning will be a “trick,” he writes, but it turns out the big problems with big data go deeper than volume.

Creating computer models to simulate complex systems with big data, Frank notes, ultimately creates something a bit different from reality: “the very act of bringing the equations over to digital form means you have changed them in subtle ways and that means you are solving a slightly different problem than the real-world version.” Analysis, therefore, “requires trained skepticism, sophistication and, remarkably, some level of intuition about the systems we study,” he writes.
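That observation will be familiar to anyone who has done numerical simulation: even a textbook differential equation becomes a slightly different problem once it is discretized. As an illustrative sketch (the equation and step sizes here are mine, not Frank's), forward-Euler integration of simple exponential decay drifts from the exact answer by an amount that depends entirely on the chosen step size:

```python
# Toy illustration of Frank's point: discretizing dx/dt = -x for the
# computer means solving a slightly different problem than the continuous
# original, and the difference depends on choices like the step size.
import math

def euler_decay(x0, rate, dt, t_end):
    """Forward-Euler simulation of dx/dt = -rate * x."""
    x, t = x0, 0.0
    while t < t_end:
        x += dt * (-rate * x)
        t += dt
    return x

exact = 1.0 * math.exp(-1.0 * 5.0)        # exact solution at t = 5
for dt in (1.0, 0.1, 0.01):
    approx = euler_decay(x0=1.0, rate=1.0, dt=dt, t_end=5.0)
    print(f"dt={dt:>5}: simulated={approx:.5f}  exact={exact:.5f}  "
          f"error={abs(approx - exact):.5f}")
```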

Frank also raises the problem of big data becoming a threat to individuals within society:

“Everyday we are scattering ‘digital breadcrumbs’ into the data-verse. Credit card purchases, cell phone calls, Internet searches: Big Data means memory storage has become so cheap that all data about all those aspects of our lives can be harvested and put to use. And it’s exactly the use of all that harvested data that can pose a threat to society.”

The threat comes from the Big Brother aspect of being constantly monitored in ways we’ve never before imagined, and Frank writes, “It may also allow levels of manipulation that are new and truly unimaginable.” You can read more of Frank’s thoughts on what it means to live in the age of big data here. (We’ve covered related ethics issues with big data here on Strata.)

Read more…