"database" entries

Four short links: 15 December 2015

Barbie Broken, JSON Database, Lightbulb DRM, and Graph Database

1. Crypto is Hard says Hello Barbie: We discovered several issues with the Hello Barbie app including: it utilizes an authentication credential that can be re-used by attackers; it connects a mobile device to any unsecured Wi-Fi network if it has “Barbie” in the name; it shipped with unused code that serves no function but increases the overall attack surface. On the server side, we also discovered: client certificate authentication credentials can be used outside of the app by attackers to probe any of the Hello Barbie cloud servers; the ToyTalk server domain was on a cloud infrastructure susceptible to the POODLE attack. (via Ars Technica)
2. Kinto — Mozilla’s open source lightweight JSON storage service with synchronisation and sharing abilities. It is meant to be easy to use and easy to self-host.
3. Philips Blocks 3rd Party Lightbulbs — DRM for light fixtures. cf @internetofsh*t
4. gaffer — GCHQ-released open source graph database. …a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics such as counts, histograms, and sketches. These statistics summarise the properties of the nodes and edges over time windows, and they can be dynamically updated over time. Gaffer is a graph database, rather than a graph processing system. It is optimised for retrieving data on nodes of interest. IHNJH,IJLTS “nodes of interest.”

Four short links: 9 October 2015

Page Loads, Data Engines, Small Groups, and Political Misperception

1. Ludicrously Fast Page Loads: A Guide for Full-Stack Devs (Nate Berkopec) — walks slowly through the steps of page loading using Chrome Developer Tools’ timeline. Very easy to follow.
2. Specialised and Hybrid Data Management and Processing Engines (Ben Lorica) — wrap-up of data engines uncovered at Strata + Hadoop World NYC 2015.
3. Power of Small Groups (Matt Webb) — Matt’s joined a small Slack community of like-minded friends. There’s a space where articles written or edited by members automatically show up. I like that. I caught myself thinking: it’d be nice to have Last.FM here, too, and Dopplr. Nothing that requires much effort. Let’s also pull in Instagram. Automatic stuff so I can see what people are doing, and people can see what I’m doing. Just for this group. Back to those original intentions. Ambient awareness, togetherness. cf Clay Shirky’s situated software. Everything useful from 2004 will be rebuilt once the fetish for scale passes.
4. Asymmetric Misperceptions (PDF) — research into the systematic mismatch between how politicians think their constituents feel on issues, and how the constituents actually feel. Our findings underscore doubts that policymakers perceive opinion accurately: politicians maintain systematic misperceptions about constituents’ views, typically erring by over 10 percentage points, and entire groups of politicians maintain even more severe collective misperceptions. A second, post-election survey finds the electoral process fails to ameliorate these misperceptions.

Add columns to a table on the fly without altering its schema.

MariaDB and similar SQL database systems allow for a variety of data types that may be used for storing data in columns within tables. When creating or altering a table’s schema, it’s good to know what to expect, to know what kind of data will be stored in each column. If you know that a column will contain numbers, use a numeric data type like INT, not VARCHAR. It’s best to use the appropriate data type for a column. Generally, you’ll have better control of the data and possibly better performance.
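To make the INT-versus-VARCHAR point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MariaDB (an assumption, chosen only so the example runs without a server), and the table and column names are hypothetical. It stores the same three numbers in a text column and a numeric column, then sorts by each.

```python
import sqlite3

# Stand-in for MariaDB: an in-memory SQLite database (an assumption made so
# the sketch runs anywhere). Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sightings (count_as_text VARCHAR(255), count_as_int INT)"
)
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?)",
    [(str(n), n) for n in (9, 10, 100)],
)

# The text column is compared character by character when sorted.
text_order = [row[0] for row in conn.execute(
    "SELECT count_as_text FROM sightings ORDER BY count_as_text")]

# The numeric column is compared by value when sorted.
int_order = [row[0] for row in conn.execute(
    "SELECT count_as_int FROM sightings ORDER BY count_as_int")]

print(text_order)
print(int_order)
conn.close()
```

The text column comes back with '10' and '100' ahead of '9', while the numeric column comes back in numeric order. That is the kind of control over the data that the appropriate type buys you.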

But sometimes you can’t predict what type of data might be entered into a column. For such a situation, you might use VARCHAR set to 255 characters wide, or maybe TEXT if plenty of data might be entered. There is a very cool and fairly new alternative: you could create a table in which you would add columns on the fly, but without altering the table’s schema. That may sound absurd, but it’s possible to do this in MariaDB with dynamic columns.

Dynamic columns are basically columns within a column. If you know programming well, they’re like a hash within an array. That may sound confusing, but it will make more sense when you see it in action. To illustrate this, I’ll pull some ideas from my new book, Learning MySQL and MariaDB (O’Reilly 2015). All of the examples in my book and this article are based on a database for bird-watchers.
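The "hash within an array" comparison can be sketched in plain Python. This is only an analogy under stated assumptions, not MariaDB's API: the list plays the table, each dict plays a row with a fixed schema, and the hypothetical "extras" key plays the single column that holds the dynamic columns. The bird names and values are invented for illustration.

```python
# The "table" is a list of rows with a fixed set of keys (the schema).
# "extras" stands in for the one column that holds the dynamic columns.
# All names and values are hypothetical.
birds = [
    {"bird_id": 1, "common_name": "Killdeer", "extras": {"habitat": "shoreline"}},
    {"bird_id": 2, "common_name": "Mallard", "extras": {}},
]
schema = set(birds[0])  # the fixed keys, captured before any change


def add_dynamic(row, name, value):
    """Add or overwrite a dynamic column on one row only."""
    row["extras"][name] = value


def get_dynamic(row, name, default=None):
    """Read a dynamic column, or the default when this row lacks it."""
    return row["extras"].get(name, default)


add_dynamic(birds[1], "wing_span_cm", 95)

print(get_dynamic(birds[1], "wing_span_cm"))     # the row that received the new column
print(get_dynamic(birds[0], "wing_span_cm"))     # a row that never got it
print(all(set(row) == schema for row in birds))  # the fixed schema is unchanged
```

One row gains a new "column" while the other rows and the schema stay as they were. MariaDB's own functions for this (COLUMN_CREATE, COLUMN_ADD, COLUMN_GET) operate on a blob column in roughly this spirit; the Python helpers above are stand-ins, not that API.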

Four short links: 28 July 2014

Secure Server, Angular Style, Recursion History, and Aerospike Open Source

1. streisand: sets up a new server running L2TP/IPsec, OpenSSH, OpenVPN, Shadowsocks, Stunnel, and a Tor bridge. It also generates custom configuration instructions for all of these services. At the end of the run you are given an HTML file with instructions that can be shared with friends, family members, and fellow activists.
2. Angular.js Style Guide: my opinionated styleguide for syntax, building and structuring Angular applications.
3. How Recursion Got into Programming: Committee member F.L. Bauer registered his protest by characterizing the addition of recursion to the language as an “Amsterdam plot”.
4. aerospike — open source database server and client, with bold claims of performance.

Four short links: 24 July 2014

Neglected ML, Crowdfunded Recognition, Debating Watson, and Versioned p2p File System

1. Neglected Machine Learning Ideas: Perhaps my list is a “send me review articles and book suggestions” cry for help, but perhaps it is useful to others as an overview of neat things.
2. First Crowdfunded Book on Booker Shortlist — Booker excludes self-published works, but “The Wake” was through Unbound, a Threadless-style “if we hit this limit, the book is printed and you have bought a copy” site.
3. Watson Can Debate Its Opponents (io9) — Speaking in nearly perfect English, Watson/The Debater replied: “Scanned approximately 4 million Wikipedia articles, returning ten most relevant articles. Scanned all 3,000 sentences in top ten articles. Detected sentences which contain candidate claims. Identified borders of candidate claims. Assessed pro and con polarity of candidate claims. Constructed demo speech with top claim predictions. Ready to deliver.”
4. ipfs: a global, versioned, peer-to-peer file system. It combines good ideas from Git, BitTorrent, Kademlia, and SFS. You can think of it like a single BitTorrent swarm, exchanging Git objects, making up the web. IPFS provides an interface much simpler than HTTP, but has permanence built in. (via Sourcegraph)

Four short links: 10 June 2014

Trusting Code, Deep Pi, Docker DevOps, and Secure Database

1. Trusting Browser Code (Tim Bray) — on the fundamental weakness of the ‘net as manifest in the browser.
2. Deep Learning in the Raspberry Pi (Pete Warden) — $30 now gets you a computer you can run deep learning algorithms on. Awesome.
3. Announcing Docker Hub and Official Repositories — as Docker went 1.0 and people rave about how they use it, comes this. They’re thinking hard about “integrating into the build ship run loop”, which aligns well with DevOps-enabling tool use.
4. Apple’s Secure Database for Users (Ian Waring) — excellent breakdown of how Apple have gone out of their way to make their cloud database product safe and robust. They may be slow to “the cloud” but they have decades of experience having users as customers instead of products.

More than enough Arel

When ActiveRecord just isn’t enough

In Just Enough Arel, we explored a bit into how the Arel library transforms our Ruby code into SQL to be executed by the database. To do so, we discovered that Arel abstracts database tables and the fields therein as objects, which in turn receive messages not normally available in ActiveRecord queries. Wrapping up the article, we also looked at arguments for using Arel over falling back to SQL.

As alluded to at the end of the previous article, Arel can do much more than merely provide a handful of comparison operators. In this post, we’ll look at how we can call native database functions, construct unions and intersects, and we’ll wrap things up by explicitly building joins with Arel.

Restructuring the Web with Git

Can version control manage content?

Web designers? Git? GitHub? Aren’t those for programmers? At Artifact, Christopher Schmitt showed designers how much their peers are already doing with GitHub, and what more they can do. GitHub (and the underlying Git toolset) changes the way that all kinds of people work together.

Sharing with Git

As amazing as Linux may be, I keep thinking that Git may prove to be Linus Torvalds’ most important contribution to computing. Most people think of it, if they think of it at all, as a tool for managing source code. It can do far more, though, providing a drastically different (and I think better) set of tools for managing distributed projects, especially those that use text.

Git tackles an unwieldy problem, managing the loosely structured documents that humans produce. Text files are incredibly flexible, letting us store everything from random notes to code of all kinds to tightly structured data. As awesome as text files are—readable, searchable, relatively easy to process—they tend to become a mess when there’s a big pile of them.

Dealing with Data in the Hadoop Ecosystem

Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram (@praxagora) sat down to discuss how to work with structured and unstructured data as well as how to keep a system up and running that is crunching that data.

Key highlights include:

• Misconfigurations account for almost half of the support issues that the team at Cloudera is seeing [Discussed at 0:22]
• ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
• Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
• Sqoop is a bulk data transfer tool [Discussed at 2:47]
• Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
• ZooKeeper is not for storage, but coordination, reliability, availability [Discussed at 4:44]


NoSQL Choices: To Misfit or Cargo Cult?

Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, so of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.

Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).

When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual point could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.

On the other hand, when lacking experience, we tend to overvalue the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). This isn’t so useful as similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.

How Fit are You, Bro?

You need to extract useful information from experience (for which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more in line with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.

Imagine you have a sufficient number of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular-sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”

Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?

So, like all decision-making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant-scale systems. We tend to draw our assumptions from our limited or biased smaller-scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.

Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems, and divine some rules for ourselves that entirely miss the point.
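Overfit is easier to see with numbers than with words. The following is a minimal sketch with invented data, not a model of any real system: six hypothetical small-scale measurements follow the trend y = 2x + 1 plus a little hand-picked noise. A plain least-squares line treats the noise as noise. A polynomial forced through every point (Lagrange interpolation, chosen here as a stand-in for "trusting every anecdote") matches the data it has seen perfectly, then is asked to extrapolate to a scale it has never seen.

```python
# Hypothetical measurements: the underlying trend is y = 2x + 1, and each
# observation carries a small, hand-picked error (the noise).
xs = [0, 1, 2, 3, 4, 5]
noise = [0.5, -0.4, 0.3, -0.6, 0.4, -0.2]
ys = [2 * x + 1 + n for x, n in zip(xs, noise)]


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x):
    """Lagrange polynomial through every point: no error on the data it has seen."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


slope, intercept = fit_line(xs, ys)
target_x = 7                  # a scale beyond anything we measured
truth = 2 * target_x + 1      # what the underlying trend says
line_guess = slope * target_x + intercept
poly_guess = interpolate(xs, ys, target_x)

print(f"truth={truth}, line={line_guess:.2f}, polynomial={poly_guess:.2f}")
```

With these numbers the line lands within a fraction of a unit of the true value, while the polynomial that fit every small-scale point perfectly misses by more than a hundred. The rule that explained all of our limited experience is the one that fails worst at scale.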

In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built-in redundancy (usually replication, the root of all distributed evil).