ENTRIES TAGGED "databases"
Constant KV Store, Google Me, Learned Bias, and DRM-Stripping Lego Robot
- Sparkey — Spotify’s open-sourced simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.
- The Truth of Fact, The Truth of Feeling (Ted Chiang) — story about what happens when lifelogs become searchable. Now with Remem, finding the exact moment has become easy, and lifelogs that previously lay all but ignored are now being scrutinized as if they were crime scenes, thickly strewn with evidence for use in domestic squabbles. (via BoingBoing)
- Algorithms Magnifying Misbehaviour (The Guardian) — when the training set embodies biases, the machine will exhibit biases too.
- Lego Robot That Strips DRM Off Ebooks (BoingBoing) — so. damn. cool. If it had been controlled by a C64, Cory would have hit every one of my geek erogenous zones with this find.
Flexible Layouts, Web Components, Distributed SQL Database, and Reverse-Engineering Dropbox Client
- intention.js — manipulates the DOM via HTML attributes. The methods for manipulation are placed with the elements themselves, so flexible layouts don’t seem so abstract and messy.
- F1: A Distributed SQL Database That Scales — a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
- Looking Inside The (Drop)Box (PDF) — This paper presents new and generic techniques, to reverse engineer frozen Python applications, which are not limited to just the Dropbox world. We describe a method to bypass Dropbox’s two factor authentication and hijack Dropbox accounts. Additionally, generic techniques to intercept SSL data using code injection techniques and monkey patching are presented. (via Tech Republic)
Approximate Queries, Spreadsheet as Database, China Robot Plans, and Open Source Google App Engine
- blinkdb — The current version of BlinkDB supports a slightly constrained set of SQL-style declarative queries and provides approximate results for standard SQL aggregate queries, specifically queries involving COUNT, AVG, SUM and PERCENTILE and is being extended to support any User-Defined Functions (UDFs). Queries involving these operations can be annotated with either an error bound, or a time constraint, based on which the system selects an appropriate sample to operate on.
- China Plans to Become a Leader in Robotics (Quartz) — The ODCCC too funds high risk research initiatives through the Thousand Talent Project (TTP), a three-year term project with possible extension. The goal of the TTP is to recruit thousands of foreign researchers with strong expertise in hardware and software to help develop innovation in China. There are already more than 100 foreign researchers working in China since 2008, the year TTP started.
- AppScale (GitHub) — open source implementation of Google App Engine.
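The BlinkDB entry above describes trading accuracy for speed by answering aggregate queries from a sample, sized according to an error bound. Here is a minimal sketch of that general idea in plain Python. It is not BlinkDB's implementation or API: the function names, the pilot-sample approach, and the normal-approximation confidence interval are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def approximate_avg(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    Hypothetical helper, not part of any real library.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


def sample_size_for_error(values, max_margin, pilot_size=100, z=1.96, seed=0):
    """Pick a sample size expected to keep the margin under `max_margin`.

    A small pilot sample estimates the spread of the data; the usual
    n = (z * s / margin)^2 rule then gives the sample size.
    """
    rng = random.Random(seed)
    pilot = rng.sample(values, min(pilot_size, len(values)))
    spread = statistics.stdev(pilot)
    needed = math.ceil((z * spread / max_margin) ** 2)
    # Clamp: at least 2 rows (stdev needs two), at most the whole table.
    return max(2, min(needed, len(values)))


if __name__ == "__main__":
    # Synthetic column of 100,000 values; the exact average is 49.5.
    column = [i % 100 for i in range(100_000)]
    n = sample_size_for_error(column, max_margin=1.0)
    estimate, margin = approximate_avg(column, n)
    print(f"scanned {n} of {len(column)} rows: AVG ~ {estimate:.2f} +/- {margin:.2f}")
```

A time-constrained query could work the other way around: fix the sample size to whatever can be scanned in the time allowed and report the resulting margin.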
Model-Driven Configuration, 1,000 RSS Readers Bloom, JSON Query Language, and Doug Engelbart's Vision
- ansible — Model-driven configuration management, multi-node deployment/orchestration, and remote task execution system. Uses SSH by default, so no special software has to be installed on the nodes you manage. Ansible can be extended in any language.
- The Golden Age of RSS — One of the things I expected least to see in 2013 was that this year would mark the greatest flourishing of RSS reader applications in the decade since it first came to prominence on the web.
- JSONiq: the JSON Query Language — expressive and highly optimizable language to query and update NoSQL stores. It enables developers to leverage the same productive high-level language across a variety of NoSQL products. Implemented in Zorba, an Apache-licensed virtual machine for JSONiq and XQuery queries.
- Bret Victor on Doug Engelbart — If you attempt to make sense of Engelbart’s design by drawing correspondences to our present-day systems, you will miss the point, because our present-day systems do not embody Engelbart’s intent. Engelbart hated our present-day systems. Poetic, articulate, and bang on the money.
Paperclip Computing, Packet Capture, Offline Wikipedia, and Sensor Databases
- How to Build a Working Digital Computer Out of Paperclips (Evil Mad Scientist) — from a 1967 popular science book showing how to build everything from parts that you might find at a hardware store: items like paper clips, little light bulbs, thread spools, wire, screws, and switches (that can optionally be made from paper clips).
- Moloch (GitHub) — an open source, large scale IPv4 packet capturing (PCAP), indexing and database system with a simple web GUI.
- Offline Wikipedia Reader (Amazon) — genius, because what Wikipedia needed to be successful was to be read-only. (via BoingBoing)
- Storing and Publishing Sensor Data — rundown of apps and sites for sensor data. (via Pete Warden)
Processing for Illustrator, Archiving Tools, Sweet Retro Art, and More Database Tools
- Drawscript — Processing for Illustrator. (via BERG London)
- Archive Team Warrior — a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive. (via Ed Vielmetti)
- Retro Vectors — royalty-free and free of charge.
- TokuDB Goes Open Source — Tokutek’s high-performance, transactional storage engine for MySQL and MariaDB. See the announcement.
Drone Journalism, DNS Sniffing, E-Book Lending, and Structured Data Server
- Drone Journalism — two universities in the US have already incorporated drone use in their journalism programs. The Drone Journalism Lab at the University of Nebraska and the Missouri Drone Journalism Program at the University of Missouri both teach journalism students how to make the most of what drones have to offer when reporting a story. They also teach students how to fly drones, the Federal Aviation Administration (FAA) regulations and ethics.
- passivedns — A network sniffer that logs all DNS server replies for use in a passive DNS setup.
- IFLA E-Lending Background Paper (PDF) — The global dominance of English language eBook title availability reinforced by eReader availability is starkly evident in the statistics on titles available by country: in the USA: 1,000,000; UK: 400,000; Germany/France: 80,000 each; Japan: 50,000; Australia: 35,000; Italy: 20,000; Spain: 15,000; Brazil: 6,000. Many more stats in this paper prepared as context for the International Federation of Library Associations.
- The god Architecture — a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format. Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.
SQL Indexes, Instagram Effects in JS, Evil Fake Keyboard, and Preschool UX
- Use The Index, Luke — free ebook on tuning SQL database access.
- Don’t Stick That There — USB device pretending to be a keyboard. The benefit of this is that even with USB auto-run disabled, our exploit will still work as it emulates a keyboard. No one ever blocks USB keyboards! (via David Sklar)
- Best Practices: Designing Touch Tablet Experiences for Preschoolers (Sesame Workshop) — the good people at Sesame Workshop explain what works and what doesn’t when you make tablet touch UIs for kids. Double Tap: Children expect immediate feedback from their touch and tend to think the app is unresponsive when a double tap is required. We suggest only using double tap to prevent a child from accidental navigation (e.g., leaving an activity, accessing parent content).
Big Data's Big Picture, Real-Time Queries, Real-Time Queries, Single-Process Real-Time Queries
- Big Data: the Big Picture (Vimeo) — Jim Stogdill’s excellent talk: although Big Data is presented as part of the Gartner Hype Cycle, it’s an epoch of the Information Age which will have significant effects on the structure of corporations and the economy.
- Impala (GitHub) — Cloudera’s open source (Apache) implementation of Google’s F1 (PDF), for realtime queries across clusters. Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Furthermore, Impala does not leverage MapReduce, allowing Impala to return results in real time. (via Wired)
- druid (GitHub) — an open source (GPLv2), distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing.
- Supersonic (Google Code) — an ultra-fast, column-oriented query engine library written in C++. It provides a set of data transformation primitives which make heavy use of cache-aware algorithms, SIMD instructions and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper-pipelined CPUs. It is designed to work in a single process. Apache-licensed.
Matching the missing to the dead involves reconciling two national databases.
Javier Reveron went missing from Ohio in 2004. His wallet turned up in New York City, but he was nowhere to be found. By the time his parents arrived to search for him and hand out fliers, his remains had already been buried in an unmarked indigent grave. In New York, where coroner’s resources are precious, remains wait a few months to be claimed before they’re buried by convicts in a potter’s field on uninhabited Hart Island, just off the Bronx in Long Island Sound.
The story, reported by the New York Times last week, has as happy an ending as it could have, given that beginning. In 2010 Reveron’s parents added him to a national database of missing persons. A month later police in New York matched him to an unidentified body and his remains were disinterred, cremated and given burial ceremonies in Ohio.
Reveron’s ordeal suggests an intriguing, and impactful, machine-learning problem. The Department of Justice maintains separate national, public databases for missing people, unidentified people and unclaimed people. Many records are full of rich data that is almost never a perfect match to data in other databases — hair color entered by a police department might differ from how it’s remembered by a missing person’s family; weights fluctuate; scars appear. Photos are provided for many missing people and some unidentified people, and matching them is difficult. Free-text fields in many entries describe the circumstances under which missing people lived and died; a predilection for hitchhiking could be linked to a death by the side of a road.
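To make the matching problem concrete, here is a rough sketch of how imperfect records might be scored against each other. Everything in it is an assumption for illustration: the field names, the weights, the tolerances and the sample records are invented, not taken from the DOJ databases, and a real system would presumably learn its weights from resolved cases rather than hard-code them.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; a trained model would replace these.
WEIGHTS = {"hair": 0.2, "height_cm": 0.3, "weight_kg": 0.2, "notes": 0.3}


def text_similarity(a, b):
    """Fuzzy string similarity in [0, 1], tolerant of wording differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a, b, tolerance):
    """1.0 when equal, falling linearly to 0.0 at `tolerance` apart."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def match_score(missing, unidentified):
    """Weighted similarity in [0, 1] between a missing-person record
    and an unidentified-person record."""
    return (
        WEIGHTS["hair"] * text_similarity(missing["hair"], unidentified["hair"])
        + WEIGHTS["height_cm"]
        * numeric_similarity(missing["height_cm"], unidentified["height_cm"], 10)
        # Weight gets a wide tolerance because weights fluctuate.
        + WEIGHTS["weight_kg"]
        * numeric_similarity(missing["weight_kg"], unidentified["weight_kg"], 20)
        + WEIGHTS["notes"] * text_similarity(missing["notes"], unidentified["notes"])
    )


def rank_candidates(missing, unidentified_records):
    """Return unidentified records sorted from best to worst match."""
    return sorted(
        unidentified_records,
        key=lambda record: match_score(missing, record),
        reverse=True,
    )


if __name__ == "__main__":
    # Invented records, for illustration only.
    missing = {"hair": "dark brown", "height_cm": 178, "weight_kg": 80,
               "notes": "often hitchhiked along rural highways"}
    unidentified = [
        {"id": "B", "hair": "blond", "height_cm": 160, "weight_kg": 55,
         "notes": "recovered from river"},
        {"id": "A", "hair": "brown", "height_cm": 177, "weight_kg": 72,
         "notes": "found beside a highway"},
    ]
    for record in rank_candidates(missing, unidentified):
        print(record["id"], round(match_score(missing, record), 3))
```

Character-level similarity on the free-text notes is a crude stand-in; linking a predilection for hitchhiking to a roadside death is exactly the kind of inference that needs a model trained on matched cases.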
I’ve called the Department of Justice (DOJ) to ask about the extent to which they’ve worked with computer scientists to match missing and unidentified people, and will update when I hear back. One thing that’s not immediately apparent is the public availability of the necessary training set — cases that have been successfully matched and removed from the lists. The DOJ apparently doesn’t comment on resolved cases, which could make getting this data difficult. But perhaps there’s room for a coalition to request the anonymized data and manage it to the DOJ’s satisfaction while distributing it to capable data scientists.