"database" entries

Augmenting Unstructured Data

OSCON 2013 Speaker Series

Our world is filled with unstructured data. By some estimates, it’s as high as 80% of all data.

Unstructured data is data that isn’t in a specific format. It isn’t separated by a delimiter that you could split on and get all of the individual pieces of data. Most often, this data comes directly from humans. Human generated data isn’t the best kind to place in a relational database and run queries on. You’d need to run some algorithms on unstructured data to gain any insight.

Play-by-Play

Advanced NFL Stats was kind enough to open up their play-by-play data for NFL games from 2002 to 2012. This dataset consisted of both structured and unstructured data. Here is a sample line showing a single play in the dataset:

20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7 ,2012

This line is comma separated and has various queryable elements. For example, we could query on the teams playing and the scores. We couldn’t query on the unstructured portion, specifically:

(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC

There are many more insights that can be gleaned from this portion. For example, we can see that San Francisco’s quarterback, Colin Kaepernick, passed the ball to the wide receiver, Michael Crabtree. He was tackled by Charles Tillman. Michael Crabtree caught the ball at the 25 yard line, gained 1 yard. After catching the ball, he didn’t make any forward progress or yards after catch (“0-yds YAC“).

Writing a synopsis like that is easy for me as a human. I’ve spent my life working with language and trying to comprehend the meaning. It’s not as easy for a computer. We have to use various methods to extract the information from unstructured data like this. These can vary dramatically in complexity; some use simple string lookups or contains and others use natural language processing.

Parsing would be easy if all of the plays looked like the one above, but they don’t. Each play has a little bit different formatting and varies in the amount of data. The general format is different for each type of play: run, pass, punt, etc. Each type of play needs to be treated a little differently. Also, each person writing the play description will be a little different from the others.

Augmenting Data

As I showed in the synopsis of the play, one can get some interesting insight out of the structured and unstructured data. However, we’re still limited in what we can do and the kinds of queries we can write.

I had a conversation with a Canadian about American Football and the effects of the weather. I was talking about how the NFL now favors domed locations to take weather out as a variable for the Superbowl or NFL championship. With the play-by-play dataset, I couldn’t write a query to tell me the effects of weather one way or another. The data simply doesn’t exist in the dataset.

The physical location and date of the game is captured in the play-by-play. We can see that in the first portion “20121119_CHI@SF“. Here, Chicago is playing at San Francisco on 2012-11-19. There are other datasets that could enable us to augment our play-by-play. These datasets would give us the physical location of the stadium where the game was played (the team’s name doesn’t mean that their stadium is in that city). The dataset for stadiums is a relatively small one. This dataset also shows if the stadium is domed and the ambient temperature is a non-issue.
Read more…

Four short links: 5 July 2013

Four short links: 5 July 2013

Tracking Bitcoin, Gaming Deflation, Bloat-Aware Design, and Mapping Entity Relationships

  1. Quantitative Analysis of the Full Bitcoin Transaction Graph (PDF) — We analyzed all these large transactions by following in detail the way these sums were accumulated and the way they were dispersed, and realized that almost all these large transactions were descendants of a single transaction which was carried out in November 2010. Finally, we noted that the subgraph which contains these large transactions along with their neighborhood has many strange looking structures which could be an attempt to conceal the existence and relationship between these transactions, but such an attempt can be foiled by following the money trail in a succinctly persistent way. (via Alex Dong)
  2. Majority of Gamers Today Can’t Finish Level 1 of Super Mario Bros — Nintendo test, and the President of Nintendo said in a talk, We watched the replay videos of how the gamers performed and saw that many did not understand simple concepts like bottomless pits. Around 70 percent died to the first Goomba. Another 50 percent died twice. Many thought the coins were enemies and tried to avoid them. Also, most of them did not use the run button. There were many other depressing things we noted but I can not remember them at the moment. (via Beta Knowledge)
  3. Bloat-Aware Design for Big Data Applications (PDF) — (1) merging and organizing related small data record objects into few large objects (e.g., byte buffers) instead of representing them explicitly as one-object-per-record, and (2) manipulating data by directly accessing buffers (e.g., at the byte chunk level as opposed to the object level). The central goal of this design paradigm is to bound the number of objects in the application, instead of making it grow proportionally with the cardinality of the input data. (via Ben Lorica)
  4. Poderopedia (Github) — originally designed for investigative journalists, the open src software allows you to create and manage entity profile pages that include: short bio or summary, sheet of connections, long newsworthy profiles, maps of connections of an entity, documents related to the entity, sources of all the information and news river with external news about the entity. See the announcement and website.
Four short links: 19 April 2013

Four short links: 19 April 2013

Sterling on Disruption, Coding Crypto Fun, Distributed File System, and Asset Packaging

  1. Bruce Sterling on DisruptionIf more computation, and more networking, was going to make the world prosperous, we’d be living in a prosperous world. And we’re not. Obviously we’re living in a Depression. Slow first 25% but then it takes fire and burns with the heat of a thousand Sun Microsystems flaming out. You must read this now.
  2. The Matasano Crypto Challenges (Maciej Ceglowski) — To my delight, though, I was able to get through the entire sequence. It took diligence, coffee, and a lot of graph paper, but the problems were tractable. And having completed them, I’ve become convinced that anyone whose job it is to run a production website should try them, particularly if you have no experience with application security. Since the challenges aren’t really documented anywhere, I wanted to describe what they’re like in the hopes of persuading busy people to take the plunge.
  3. Tachyona fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. Berkeley-licensed open source.
  4. Jammit (GitHub) — an industrial strength asset packaging library for Rails, providing both the CSS and JavaScript concatenation and compression that you’d expect, as well as YUI Compressor, Closure Compiler, and UglifyJS compatibility, ahead-of-time gzipping, built-in JavaScript template support, and optional Data-URI / MHTML image and font embedding. (via Joseph Misiti)
Four short links: 23 November 2012

Four short links: 23 November 2012

Island Traps, Apolitical Technology, 3D Printing Patent Suits, and Disk-Based Graph Tool

  1. Trap Island — island on most maps doesn’t exist.
  2. Why I Work on Non-Partisan Tech (MySociety) — excellent essay. Obama won using big technology, but imagine if that effort, money, and technique were used to make things that were useful to the country. Political technology is not gov2.0.
  3. 3D Printing Patent Suits (MSNBC) — notable not just for incumbents keeping out low-cost competitors with patents, but also (as BoingBoing observed) Many of the key patents in 3D printing start expiring in 2013, and will continue to lapse through ’14 and ’15. Expect a big bang of 3D printer innovation, and massive price-drops, in the years to come. (via BoingBoing)
  4. GraphChican run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google’s Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph.

Google’s Spanner is all about time

Did Google just prove the industry wrong? Early thoughts on the Spanner database.

In case you missed it, Google Research published another one of “those” significant research papers — a paper like the BigTable paper from 2006 that had ramifications for the entire industry (that paper was one of the opening volleys in the NoSQL movement).  

Google’s new paper is about a distributed relational database called Spanner that was a follow up to a presentation from earlier in the year about a new database for AdWords called F1.   If you recall, that presentation revealed Google’s migration of AdWords from MySQL to a new database that supported SQL and hierarchical schemas — two ideas that buck the trend from relational databases.

Meet Spanner

This new database, Spanner, is a database unlike anything we’ve seen.   It’s a database that embraces ACID, SQL, and transactions, that can be distributed across thousands of nodes spanning multiple data centers across multiple regions.  The paper dwells on two main features that define this database:

  • Schematized Semi-relational Tables — A hierarchical approach to grouping tables that allows Spanner to co-locate related data into directories that can be easily stored, replicated, locked, and managed on what Google calls spanservers.    They have a modified SQL syntax that allows for the data to be interleaved, and the paper mentions some changes to support columns encoded with Protobufs.
  • “Reification of Clock Uncertainty” — This is the real emphasis of the paper.    The missing link in relational database scalability was a strong emphasis on coordination backed by a serious attempt to minimize time uncertainty.  In Google’s new global-scale database, the variable that matters is epsilon — time uncertainty.   Google has achieved very low overhead (14ms introduced by Spanner in this paper for datacenters at 1ms network distance) for read-write (RW) transactions that span U.S. East Coast and U.S. West Coast (data centers separated by around 2ms of network time) by creating a system that facilitates distributed transactions bound only by network distance (measured in milliseconds) and time uncertainty (epsilon).

Read more…

Four short links: 12 September 2012

Four short links: 12 September 2012

Time-Series Database, Multi-Device TV, C# to Javascript, and Tiny Research

  1. Seriesly — time-series database written in go.
  2. Tablets and TV (Luke Wroblewski) — In August 2012, 77% of TV viewers used another device at the same time in a typical day. 81% used a smartphone and TV at the same time. 66% used a laptop and TV at the same time.
  3. Saltarelle — open source (Apache2) C# to Javascript compiler. (via Javascript Weekly)
  4. Tiny Transactions on Computer Science — computer science research in 140 characters or fewer.

Heavy data and architectural convergence

Data is getting heavier relative to the networks that carry it around the data center.

Imagine a future where large clusters of like machines dynamically adapt between programming paradigms depending on a combination of the resident data and the required processing.

Four short links: 15 March 2012

Four short links: 15 March 2012

Javascript STM, HTML5 Game Mashup, BerkeleyDB Architecture, and API Ontologies

  1. atomize.jsa distributed Software Transactional Memory implementation in Javascript.
  2. mari0 — not only a great demonstration of what’s possible in web games, but also a clever mashup of Mario and Portal.
  3. Lessons From BerkeleyDB — chapter on BerkeleyDB’s design, architecture, and development philosophy from Architecture of Open Source Applications. (via Pete Warden)
  4. An API Ontology I currently see most real-world deployed APIs fit into a few different categories. All have their pros and cons, and it’s important to see how they relate to one other.

Top stories: February 6-10, 2012

The NoSQL movement, a victory for the web, and it's time to end DRM and embrace a unified ebook format.

This week on O'Reilly: Mike Loukides surveyed the NoSQL database landscape, the open web scored an important victory in court, and Joe Wikert said it's time to embrace a unified ebook format and abandon DRM.

Helping educators find the right stuff

The Learning Registry looks to crack the education resource discovery problem.

There are countless repositories of high-quality content available to teachers, but it is still nearly impossible to find content to use with a particular lesson plan for a particular grade aligned to particular standards. That's where the Department of Education's new Learning Registry comes in.