- Neglected Machine Learning Ideas — Perhaps my list is a “send me review articles and book suggestions” cry for help, but perhaps it is useful to others as an overview of neat things.
- First Crowdfunded Book on Booker Shortlist — Booker excludes self-published works, but “The Wake” was through Unbound, a Threadless-style “if we hit this limit, the book is printed and you have bought a copy” site.
- Watson Can Debate Its Opponents (io9) — Speaking in nearly perfect English, Watson/The Debater replied: “Scanned approximately 4 million Wikipedia articles, returning ten most relevant articles. Scanned all 3,000 sentences in top ten articles. Detected sentences which contain candidate claims. Identified borders of candidate claims. Assessed pro and con polarity of candidate claims. Constructed demo speech with top claim predictions. Ready to deliver.”
- ipfs — a global, versioned, peer-to-peer file system. It combines good ideas from Git, BitTorrent, Kademlia, and SFS. You can think of it like a single BitTorrent swarm, exchanging Git objects, making up the web. IPFS provides an interface much simpler than HTTP, but has permanence built in.. (via Sourcegraph)
ENTRIES TAGGED "database"
When ActiveRecord just isn’t enough
In Just Enough Arel, we explored a bit into how the Arel library transforms our Ruby code into SQL to be executed by the database. To do so, we discovered that Arel abstracts database tables and the fields therein as objects, which in turn receive messages not normally available in ActiveRecord queries. Wrapping up the article, we also looked at arguments for using Arel over falling back to SQL.
As alluded at the end of the previous article, Arel can do much more than merely provide a handful of comparison operators. In this post, we’ll look at how we can call native database functions, construct unions and intersects, and we’ll wrap things up by explicitly building joins with Arel.
Can version control manage content?
Web designers? Git? Github? Aren’t those for programmers? At Artifact, Christopher Schmitt showed designers how much their peers are already doing with Github, and what more they can do. Github (and the underlying Git toolset) changes the way that all kinds of people work together.
Sharing with Git
As amazing as Linux may be, I keep thinking that Git may prove to be Linus Torvalds’ most important contribution to computing. Most people think of it, if they think of it at all, as a tool for managing source code. It can do far more, though, providing a drastically different (and I think better) set of tools for managing distributed projects, especially those that use text.
Git tackles an unwieldy problem, managing the loosely structured documents that humans produce. Text files are incredibly flexible, letting us store everything from random notes to code of all kinds to tightly structured data. As awesome as text files are—readable, searchable, relatively easy to process—they tend to become a mess when there’s a big pile of them.
Hadoop, Sqoop, and ZooKeeper
Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram (@praxagora) sat down to discuss how to work with structured and unstructured data as well as how to keep a system up and running that is crunching that data.
Key highlights include:
- Misconfigurations consist of almost half of the support issues that the team at Cloudera is seeing [Discussed at 0:22]
- ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
- Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
- Sqoop is a bulk data transfer tool [Discussed at 2:47]
- Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
- ZooKeep is not for storage, but coordination, reliability, availability [Discussed at 4:44]
You can view the full interview here:
Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.
Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).
When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.
On the other hand, when lacking experience, we tend to over value the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). This isn’t so useful as similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.
How Fit are You, Bro?
You need to extract useful information from experience (which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more inline with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.
Imagine you have a sufficient amount of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”
Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?
So, like all decision making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant scale systems. We tend to draw our assumptions from our limited, or biased smaller scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.
Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems and divine some rules for ourselves that entirely miss the point.
In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built in redundancy (usually replication, the root of all distributed evil).
OSCON 2013 Speaker Series
Our world is filled with unstructured data. By some estimates, it’s as high as 80% of all data.
Unstructured data is data that isn’t in a specific format. It isn’t separated by a delimiter that you could split on and get all of the individual pieces of data. Most often, this data comes directly from humans. Human generated data isn’t the best kind to place in a relational database and run queries on. You’d need to run some algorithms on unstructured data to gain any insight.
Advanced NFL Stats was kind enough to open up their play-by-play data for NFL games from 2002 to 2012. This dataset consisted of both structured and unstructured data. Here is a sample line showing a single play in the dataset:
20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7 ,2012
This line is comma separated and has various queryable elements. For example, we could query on the teams playing and the scores. We couldn’t query on the unstructured portion, specifically:
(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
There are many more insights that can be gleaned from this portion. For example, we can see that San Francisco’s quarterback, Colin Kaepernick, passed the ball to the wide receiver, Michael Crabtree. He was tackled by Charles Tillman. Michael Crabtree caught the ball at the 25 yard line, gained 1 yard. After catching the ball, he didn’t make any forward progress or yards after catch (“
Writing a synopsis like that is easy for me as a human. I’ve spent my life working with language and trying to comprehend the meaning. It’s not as easy for a computer. We have to use various methods to extract the information from unstructured data like this. These can vary dramatically in complexity; some use simple string lookups or contains and others use natural language processing.
Parsing would be easy if all of the plays looked like the one above, but they don’t. Each play has a little bit different formatting and varies in the amount of data. The general format is different for each type of play: run, pass, punt, etc. Each type of play needs to be treated a little differently. Also, each person writing the play description will be a little different from the others.
As I showed in the synopsis of the play, one can get some interesting insight out of the structured and unstructured data. However, we’re still limited in what we can do and the kinds of queries we can write.
I had a conversation with a Canadian about American Football and the effects of the weather. I was talking about how the NFL now favors domed locations to take weather out as a variable for the Superbowl or NFL championship. With the play-by-play dataset, I couldn’t write a query to tell me the effects of weather one way or another. The data simply doesn’t exist in the dataset.
The physical location and date of the game is captured in the play-by-play. We can see that in the first portion “
20121119_CHI@SF“. Here, Chicago is playing at San Francisco on 2012-11-19. There are other datasets that could enable us to augment our play-by-play. These datasets would give us the physical location of the stadium where the game was played (the team’s name doesn’t mean that their stadium is in that city). The dataset for stadiums is a relatively small one. This dataset also shows if the stadium is domed and the ambient temperature is a non-issue.
Tracking Bitcoin, Gaming Deflation, Bloat-Aware Design, and Mapping Entity Relationships
- Quantitative Analysis of the Full Bitcoin Transaction Graph (PDF) — We analyzed all these large transactions by following in detail the way these sums were accumulated and the way they were dispersed, and realized that almost all these large transactions were descendants of a single transaction which was carried out in November 2010. Finally, we noted that the subgraph which contains these large transactions along with their neighborhood has many strange looking structures which could be an attempt to conceal the existence and relationship between these transactions, but such an attempt can be foiled by following the money trail in a succinctly persistent way. (via Alex Dong)
- Majority of Gamers Today Can’t Finish Level 1 of Super Mario Bros — Nintendo test, and the President of Nintendo said in a talk, We watched the replay videos of how the gamers performed and saw that many did not understand simple concepts like bottomless pits. Around 70 percent died to the first Goomba. Another 50 percent died twice. Many thought the coins were enemies and tried to avoid them. Also, most of them did not use the run button. There were many other depressing things we noted but I can not remember them at the moment. (via Beta Knowledge)
- Bloat-Aware Design for Big Data Applications (PDF) — (1) merging and organizing related small data record objects into few large objects (e.g., byte buffers) instead of representing them explicitly as one-object-per-record, and (2) manipulating data by directly accessing buffers (e.g., at the byte chunk level as opposed to the object level). The central goal of this design paradigm is to bound the number of objects in the application, instead of making it grow proportionally with the cardinality of the input data. (via Ben Lorica)
- Poderopedia (Github) — originally designed for investigative journalists, the open src software allows you to create and manage entity profile pages that include: short bio or summary, sheet of connections, long newsworthy profiles, maps of connections of an entity, documents related to the entity, sources of all the information and news river with external news about the entity. See the announcement and website.
Sterling on Disruption, Coding Crypto Fun, Distributed File System, and Asset Packaging
- Bruce Sterling on Disruption — If more computation, and more networking, was going to make the world prosperous, we’d be living in a prosperous world. And we’re not. Obviously we’re living in a Depression. Slow first 25% but then it takes fire and burns with the heat of a thousand Sun Microsystems flaming out. You must read this now.
- The Matasano Crypto Challenges (Maciej Ceglowski) — To my delight, though, I was able to get through the entire sequence. It took diligence, coffee, and a lot of graph paper, but the problems were tractable. And having completed them, I’ve become convinced that anyone whose job it is to run a production website should try them, particularly if you have no experience with application security. Since the challenges aren’t really documented anywhere, I wanted to describe what they’re like in the hopes of persuading busy people to take the plunge.
- Tachyon — a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. Berkeley-licensed open source.
Island Traps, Apolitical Technology, 3D Printing Patent Suits, and Disk-Based Graph Tool
- Trap Island — island on most maps doesn’t exist.
- Why I Work on Non-Partisan Tech (MySociety) — excellent essay. Obama won using big technology, but imagine if that effort, money, and technique were used to make things that were useful to the country. Political technology is not gov2.0.
- 3D Printing Patent Suits (MSNBC) — notable not just for incumbents keeping out low-cost competitors with patents, but also (as BoingBoing observed) Many of the key patents in 3D printing start expiring in 2013, and will continue to lapse through ’14 and ’15. Expect a big bang of 3D printer innovation, and massive price-drops, in the years to come. (via BoingBoing)
- GraphChi — can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google’s Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph.