Four short links: 16 Feb 2009

A lot of Python and databases today, with some hardware and Twitter pranking/security worries to taste:

  1. Free Telephony Project, Open Telephony Hardware — professionally-designed mass-manufactured hardware for telephony projects. E.g., IP04 runs Asterisk and has four phone jacks and removable Flash storage. Software, schematics, and PCB files released under GPL v2 or later.
  2. Don’t Click Prank Explained — inside the Javascript prank going around Twitter. Transparent overlays would appear to be dangerous.
  3. Tokyo Cabinet: A Modern Implementation of DBM — ok, so there’s definitely something going on with these alternative databases. Here’s the 1979 BTree library reinvented for the modern age, then extended with PyTyrant, a database server for Tokyo Cabinet that offers HTTP REST, memcached, and a simple binary protocol. Cabinet is staggeringly fast, as this article makes clear. And if that wasn’t enough wow for one day, Tokyo Dystopia is the full-text search engine. The Tyrant tutorial shows you how to get the server up and running. And what would technology be without a Slideshare presentation? (via Stinky)
  4. Whoosh — a pure Python fulltext search library.
tags: , , , , , , ,
  • Not exactly a Twitter “prank”, but quite a development nonetheless…

  • “What’s going on” in alternative DBs is pretty easy to understand. No, really.

    You have to understand all databases, of all types, as a layered architecture with a few cross-cutting issues (issues that “cut across” multiple layers or otherwise orthogonal modules). And you have to understand those things against the memory hierarchy and with respect to the quantitative performance of CPU and memory hierarchy and network.

    At the bottom, there is some physical way of storing the primary data in (mis-named) “tertiary storage”. Above that layer are systems of indexing.

    Those layers are typically highly parameterized: I might have 10 different indexing strategies to choose from or I might be choosing between “row” v. “column” storage etc. Physical Schema is an abstraction layer for expressing different configurations of those lowest layers.

    A brilliant insight associated with Codd et al. is that you can switch around physical layout and physical indexing parameters and, at the same time, keep constant a database that has a specific abstract model of the datums stored and the relations among them. Thus Codd et al. identify, in the design space for software architectures, the level of “logical schema”. Below a certain abstraction barrier you are making choices about the physical practice of indexing and storing data – above that layer you can talk abstractly about the data itself, sorta ignoring the physical considerations.

    Brilliant insight number 3 (or so, I forget how they’re numbered) of Codd et al.: different applications want different domain-specific APIs for the logical model of the data but all of the different APIs can easily be expressed in terms of fairly simple programming languages whose primitive terms are basically “finite sets and relations”. Thus, above the basic logical layer there’s a need for a programmatic layer – a query and update language. A common mis-statement is to think that when these guys pointed out the need for such a language that meant SQL and now that SQL is here that need is done. That’s a mistake, though: the abstract theory just says that in the architecture of a given system – you want a programming language of some sort there. It doesn’t have to be SQL. XQuery is an example of an alternative.

    Cross cutting issues: Well, there’s mainly transactions (ACID properties), types (what kinds of data can be most directly represented), reflection (trying to let the whole stack be controllable from the highest layer), and “misc.” (e.g., auto-generated IDs).

    Now, given that architectural theory, DB2, Oracle, later MySQL all doubled down on a particular sub-style, architecturally speaking, with SQL at the top. That settles (mostly) their choices about “types” and (kinda) “transactions”.

    The key thing to understand about SQL for the purposes of this discussion is that SQL was designed with the hypothesis in mind that queries and updates could be optimized by machine, given a logical-layer specification of those queries and updates – and that this would cover all interesting cases. Those systems were further designed with the hypothesis in mind that SQL data types and the SQL language were more than sufficient for all interesting cases.

    Well, those hypotheses are false. I think that should have been obvious up-front but there was really a debate there. The debate is moot: experience has shown.

    To give a kind of architectural “theory” of why programming purely at a reflective physical layer, like SQL, doesn’t work — well, I can’t do it very scientifically, this is architecture, not science. With that caveat: there are too many physical layer choices and trade-offs possible – the combinatorics are ridiculous. There’s no “heuristic” guess possible, given just a logical schema and typical queries, that leads unambiguously (automatably) to the right physical designs – and yet people can often sort of make good guesses and improvise their way around to find very good guesses. In some cases, there are sub-spaces where new heuristics – that don’t really fit a classical, commodity SQL world – are found and then those can be built-out as side-by-side alternatives to SQL.

    So, you get these alternative databases when people look at the SQL offerings, look at their application needs, their memory hierarchy, their processors, and their network and realize that the SQL engines are not even in the ballpark of handling this stuff well. For example, Stonebraker figures out that there are a whole class of apps for which column-oriented storage is needed and, while he sticks to SQL, he finds the economy in making a fully separate implementation. Another example: the SQL “types” (cross-cutting issue) and consequently its heuristic query optimizing, indexing etc. is not so effective for the kinds of data traversals and updates wanted for hierarchical data in the style of XML datums – so people invent XQuery. Many web applications have very limited ACID needs but want instead massive parallelism. When they finally hit physical indexes any optimization from logical to physical tends to be trivial. So they invent map-reduce engines and hashcaches and the like. Years before, the unix community (skeptical of SQL as compared to a file system) started using files for primary data and things like libdb (later Berkeley DB) for indexing, shunning very much automated optimization and instead substituting general purpose languages like C and Perl for SQL (trashing Codd’s architectural conception of the perfect abstraction barriers but to good effect).

    Something like a “map-reduce” engine – as an example – has no ambition to handle the full range of SQL queries as well as, say, an Oracle product. Rather, it picks out a large, handy class of queries that even the best SQL engines handle poorly and that is itself “simple” to optimize the heck out of.

    Notice how all of the alternatives tend to restore (in contrast to the SQL world) both the programmer’s freedom and obligation to more directly code physical stuff, particularly as it relates to transfers of data between layers of memory hierarchy or across networks. There are still new “high level” abstractions there – my map-reduce program doesn’t have to list how many or which specific machines it runs across – but the nature of the query mechanism forces / encourages the application writer to give strong hints about those physical details. Hints that can’t be found in logically equivalent SQL (in any clean way).

    The architectural failure of the project of Codd, et al., was that it was interpreted (if not intended) to lead to storage, queries, and updates as a solved problem – once and for all. It was taken as the path for liberating programmers from any further consideration of anything below the layer of high level query languages. It turns out, nature won’t be tamed that way. The physics of storage and indexing are just to complex to be so simply reduced.