"nosql" entries

Augmenting Unstructured Data

OSCON 2013 Speaker Series

Our world is filled with unstructured data. By some estimates, it makes up as much as 80% of all data.

Unstructured data is data that isn’t in a specific format. It isn’t separated by a delimiter that you could split on to get all of the individual pieces of data. Most often, this data comes directly from humans. Human-generated data isn’t the best kind to place in a relational database and run queries on. You’d need to run some algorithms on unstructured data to gain any insight.

Play-by-Play

Advanced NFL Stats was kind enough to open up their play-by-play data for NFL games from 2002 to 2012. This dataset consisted of both structured and unstructured data. Here is a sample line showing a single play in the dataset:

This line is comma separated and has various queryable elements. For example, we could query on the teams playing and the scores. We couldn’t query on the unstructured portion, specifically:

There are many more insights that can be gleaned from this portion. For example, we can see that San Francisco’s quarterback, Colin Kaepernick, passed the ball to the wide receiver, Michael Crabtree, who was tackled by Charles Tillman. Crabtree caught the ball at the 25-yard line and gained 1 yard. After catching the ball, he didn’t make any forward progress, so he had no yards after catch (“0-yds YAC”).

Writing a synopsis like that is easy for me as a human. I’ve spent my life working with language and trying to comprehend its meaning. It’s not as easy for a computer. We have to use various methods to extract the information from unstructured data like this. These can vary dramatically in complexity; some use simple string lookups or substring (“contains”) checks, while others use natural language processing.

Parsing would be easy if all of the plays looked like the one above, but they don’t. Each play has slightly different formatting and varies in the amount of data. The general format is different for each type of play: run, pass, punt, etc. Each type of play needs to be treated a little differently. Also, each person writing play descriptions words them a little differently from the others.
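
The simple end of that spectrum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword table, the function names, and the sample description are all made up for this example (the sample is not a line from the Advanced NFL Stats dataset). The only token taken from the article is the “0-yds YAC” notation.

```python
import re

# Hypothetical keyword table: the first keyword found in the description
# decides the play type. Real descriptions would need many more rules.
PLAY_KEYWORDS = [
    ("punt", "punt"),
    ("field goal", "field_goal"),
    ("pass", "pass"),
]

# Matches the "<N>-yds YAC" notation quoted in the article.
YAC_PATTERN = re.compile(r"(\d+)-yds YAC")


def classify_play(description):
    """Classify a play with plain substring ("contains") checks."""
    text = description.lower()
    for keyword, play_type in PLAY_KEYWORDS:
        if keyword in text:
            return play_type
    return "unknown"


def extract_yac(description):
    """Return yards after catch as an int, or None if the token is absent."""
    match = YAC_PATTERN.search(description)
    return int(match.group(1)) if match else None


# Invented description, written only to resemble the kind of text discussed.
sample = "Quarterback pass short left to Receiver for 1 yard. 0-yds YAC"
print(classify_play(sample), extract_yac(sample))
```

Substring checks like these are cheap, but they break as soon as a writer phrases a play differently, which is why each play type tends to need its own handling.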

Augmenting Data

As I showed in the synopsis of the play, one can get some interesting insight out of the structured and unstructured data. However, we’re still limited in what we can do and the kinds of queries we can write.

I had a conversation with a Canadian about American football and the effects of the weather. I was talking about how the NFL now favors domed locations to take weather out as a variable for the Super Bowl or NFL championship. With the play-by-play dataset, I couldn’t write a query to tell me the effects of weather one way or another. The data simply doesn’t exist in the dataset.

The physical location and date of the game are captured in the play-by-play. We can see that in the first portion, “20121119_CHI@SF”. Here, Chicago is playing at San Francisco on 2012-11-19. There are other datasets that could enable us to augment our play-by-play. These datasets would give us the physical location of the stadium where the game was played (the team’s name doesn’t mean that their stadium is in that city). The dataset for stadiums is a relatively small one. This dataset also shows whether the stadium is domed, in which case the ambient temperature is a non-issue.
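
As a rough sketch of that augmentation step, the Python below splits the game identifier into its date and teams and looks up the home team in a stadium table. The function names, the `STADIUMS` table, and its single placeholder row are assumptions made for illustration; they are not taken from any real stadium dataset.

```python
from datetime import date, datetime
from typing import NamedTuple


class GameInfo(NamedTuple):
    played_on: date
    away_team: str
    home_team: str


def parse_game_id(game_id):
    """Split an identifier such as '20121119_CHI@SF' into date and teams."""
    date_part, teams_part = game_id.split("_", 1)
    away_team, home_team = teams_part.split("@", 1)
    played_on = datetime.strptime(date_part, "%Y%m%d").date()
    return GameInfo(played_on, away_team, home_team)


# Placeholder stadium table keyed by home-team code. The row is illustrative
# only; a real table would also need to track stadium changes over the years.
STADIUMS = {
    "SF": {"stadium": "Example Stadium", "domed": False},
}


def augment(game_id, stadiums):
    """Attach stadium details (if known) to the parsed game identifier."""
    game = parse_game_id(game_id)
    stadium = stadiums.get(game.home_team, {})
    return {
        "played_on": game.played_on,
        "away_team": game.away_team,
        "home_team": game.home_team,
        "stadium": stadium.get("stadium"),
        "domed": stadium.get("domed"),
    }


print(augment("20121119_CHI@SF", STADIUMS))
```

With the stadium and date attached to each play, a weather dataset could then be joined on location and day for games where `domed` is false.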

Analytic engines that factor in security labels

Data stores are rolling out easy-to-use analysis tools

Originated by the NSA, Apache Accumulo is a BigTable-inspired data store known for being highly scalable and for its interesting security model. Federal agencies and defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells.

What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem. It includes support for interactive SQL, full-text search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, query and analytic results incorporate those access levels. Analysts interact with data as they normally would. For example, Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase, SQL queries in Sqrrl are converted into optimized Accumulo iterators.
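
To make the idea of label-aware results concrete, here is a deliberately simplified Python sketch. It is not Accumulo’s or Sqrrl’s API: the `Cell` class, the `visible_cells` function, and the label names are hypothetical, and the model only handles the case where a reader must hold every label on a cell, whereas Accumulo’s visibility expressions can also combine labels with “or”.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # labels a reader must hold to see the value


def visible_cells(cells, authorizations):
    """Keep only the cells whose labels are all covered by the reader."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= auths]


# Made-up cells and labels, for illustration only.
SAMPLE_CELLS = [
    Cell("emp1", "name", "Alice", frozenset()),
    Cell("emp1", "salary", "100", frozenset({"finance"})),
    Cell("emp1", "review", "ok", frozenset({"finance", "hr"})),
]

for cell in visible_cells(SAMPLE_CELLS, {"finance"}):
    print(cell.column, cell.value)
```

Because the filter runs before any results are returned, two analysts issuing the same query can see different answers depending on the labels they hold.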


Returning transactions to distributed data stores

Principles for the next generation of NoSQL databases

By David Rosenthal and Stephen Pimentel

Rise of NoSQL

Database technologies are undergoing rapid evolution, with new approaches being actively explored after decades of relative stability. As late as 2008, the term “NoSQL” barely existed and relational databases were both commercially dominant and entrenched in the developer community. Since then, NoSQL systems have rapidly gained prominence, and early systems such as Google’s Bigtable and Amazon’s Dynamo have inspired dozens of new databases (HBase, Cassandra, Voldemort, MongoDB, etc.) that fall under the NoSQL umbrella.

The first generation of NoSQL databases aimed to achieve the dual goals of fault tolerance and horizontal scalability on clusters of commodity hardware. There are now a variety of NoSQL systems available that, at their best, achieve these goals. Unfortunately, the cost for these benefits is high: limited data model flexibility and extensibility, and weak guarantees for applications due to the lack of multi-statement (global) transactions.

Four short links: 1 March 2013

Drone Journalism, DNS Sniffing, E-Book Lending, and Structured Data Server

  1. Drone Journalism — two universities in the US have already incorporated drone use in their journalism programs. The Drone Journalism Lab at the University of Nebraska and the Missouri Drone Journalism Program at the University of Missouri both teach journalism students how to make the most of what drones have to offer when reporting a story. They also teach students how to fly drones, the Federal Aviation Administration (FAA) regulations and ethics.
  2. passivedns — A network sniffer that logs all DNS server replies for use in a passive DNS setup.
  3. IFLA E-Lending Background Paper (PDF) — The global dominance of English language eBook title availability reinforced by eReader availability is starkly evident in the statistics on titles available by country: in the USA: 1,000,000; UK: 400,000; Germany/France: 80,000 each; Japan: 50,000; Australia: 35,000; Italy: 20,000; Spain: 15,000; Brazil: 6,000. Many more stats in this paper prepared as context for the International Federation of Library Associations.
  4. The god Architecture — a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format. Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.

Four short links: 15 June 2012

On Anonymous, Graph Database, Leap Second, and Debugging Creativity

  1. In Flawed, Epic Anonymous Book, the Abyss Gazes Back (Wired) — Quinn Norton’s review of a book about Anonymous is an excellent introduction to Anonymous. Anonymous made us, its mediafags, masters of hedging language. The bombastic claims and hyperbolic declarations must be reported from their mouths, not from our publications. And yet still we make mistakes and publish lies and assumptions that slip through. There is some of this in all of journalism, but in a world where nothing is true and everything is permitted, it’s a constant existential slog. It’s why there’s not many of us on this beat.
  2. Titan (GitHub) — Apache2-licensed distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster. Cassandra and HBase backends, implements the Blueprints graph API. (via Hacker News)
  3. Extra Second This June — we’re getting a leap second this year: there’ll be 2012 June 30, 23h 59m 60s. Calendars are fun.
  4. On Creativity (Beta Knowledge) — I wanted to create a game where even the developers couldn’t see what was coming. Of course I wasn’t thinking about debugging at this point. The people who did the debugging asked me what was a bug. I could not answer that. — Keita Takahashi, game designer (Katamari Damacy, Noby Noby Boy). Awesome quote.

MySQL in 2012: Report from Percona Live

Checking in on the state of MySQL.

Contrasting deployments at craigslist and Pinterest, trends, commercial offerings, and more

Data’s next steps

RedMonk's Steve O'Grady weighs in on data's pressing issues.

RedMonk analyst Steve O'Grady discusses the demand for data scientists, the problem of using data to ask the right questions, and why you shouldn't rush into a NoSQL investment.

Top stories: February 6-10, 2012

The NoSQL movement, a victory for the web, and it's time to end DRM and embrace a unified ebook format.

This week on O'Reilly: Mike Loukides surveyed the NoSQL database landscape, the open web scored an important victory in court, and Joe Wikert said it's time to embrace a unified ebook format and abandon DRM.

The NoSQL movement

How to think about choosing a database.

A relational database is no longer the default choice. Mike Loukides charts the rise of the NoSQL movement and explains how to choose the right database for your application.
