Andy Oram

Andy Oram is an editor at O'Reilly Media. An employee of the company since 1992, Andy currently specializes in open source technologies and software engineering. His work for O'Reilly includes the first books ever released by a U.S. publisher on Linux, the 2001 title Peer-to-Peer, and the 2007 best-seller Beautiful Code.

Entities move data from tables to knowledge

Cleaning and combining fields can turn messy data into actionable insight.


We often talk in business and computing about moving from “raw data” to “knowledge,” hoping to take useful actions based on the data our organization has collected over time. Before you can view trends in your data or do other analytics, you need tools for data cleaning and for combining multiple data sources into meaningful collections of information, known as entities. An entity may be a customer, a product, a point of sale, an incident being investigated by the police, or anything else around which you want to build meaningful context.

In this post, we’ll explore some of the complexities in real-life data that create headaches — and how analytical software can help users prepare data for sophisticated queries and visualizations. Read more…


What the IoT can learn from the health care industry

Federated authentication and authorization could provide security solutions for the Internet of Things.

Adrian Gropper co-authored this post.

After a short period of excitement and rosy prospects in the movement we’ve come to call the Internet of Things (IoT), designers are coming to realize that it will survive or implode around the twin issues of security and user control: a few electrical failures could scare people away for decades, while a nagging sense that someone is exploiting our data without our consent could sour our enthusiasm. Early indicators already point to a heightened level of scrutiny — Senator Ed Markey’s office, for example, recently put the automobile industry under the microscope for computer and network security.

In this context, what can the IoT draw from well-established technologies in federated trust? Federated trust in technologies as diverse as Kerberos and SAML has allowed large groups of users to collaborate securely, never having to share passwords with people they don’t trust. OpenID was probably the first truly mass-market application of federated trust.
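
To make the federated-trust idea concrete, here is a minimal Python sketch of the pattern (my own illustration, not drawn from Kerberos, SAML, or any specific IoT product): a trusted identity provider issues a short-lived signed token, and the device that relies on it verifies the signature and scope without ever seeing the user’s password. The key, claim names, and scope string are all invented for the example.

    import base64, hashlib, hmac, json, time

    # Shared secret between the identity provider and the relying device.
    # A real federation would use a public/private key pair; a shared HMAC
    # key keeps the sketch self-contained. (Hypothetical value.)
    ISSUER_KEY = b"demo-issuer-key"

    def issue_token(user_id: str, scope: str, ttl_seconds: int = 300) -> str:
        """Identity provider: sign a short-lived statement about the user."""
        claims = {"sub": user_id, "scope": scope, "exp": time.time() + ttl_seconds}
        body = base64.urlsafe_b64encode(json.dumps(claims).encode())
        sig = hmac.new(ISSUER_KEY, body, hashlib.sha256).hexdigest()
        return body.decode() + "." + sig

    def verify_token(token: str, required_scope: str) -> bool:
        """Relying device: check the signature and scope, never a password."""
        body, _, sig = token.partition(".")
        expected = hmac.new(ISSUER_KEY, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return False
        claims = json.loads(base64.urlsafe_b64decode(body))
        return claims["exp"] > time.time() and claims["scope"] == required_scope

    token = issue_token("alice", scope="thermostat:write")
    print(verify_token(token, "thermostat:write"))  # True

A real deployment would use asymmetric keys and a standard token format, such as OAuth bearer tokens or SAML assertions, rather than this hand-rolled HMAC scheme.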

OpenID and OAuth, which have proven their value on the Web, have an equally vital role in the exchange of data in health care. This task — often cast as the interoperability of electronic health records — can reasonably be described as the primary challenge facing the health care industry today, at least in the IT space. Reformers across the health care industry (and even Congress) have pressured the federal government to make data exchange the top priority, and the Office of the National Coordinator for Health Information Technology has declared it the centerpiece of upcoming regulations. Read more…


Marshal your data with entity resolution

Analytics can make combining or comparing data faster and less painful.

Entity resolution refers to processes that businesses and other organizations have to do all the time in order to produce full reports on people, organizations, or events. Entity resolution can be used, for instance, to:

  • Combine your customer data with a list purchased from a data broker. Identical data may appear in columns with different names, such as “last” and “surname.” Connecting columns from different databases is a common extract, transform, and load (ETL) task.
  • Extract values from one database and match them against one or more columns in another. For instance, if you get a party list, you might want to find your clients among the attendees. A police detective might want to extract the names of people involved in a crime report and see whether any suspects are among them.
  • Find a match in dirty data, such as a person whose name is spelled differently in different rows.

Dirty, inconsistent, or unstructured data is the chief challenge in entity resolution. Jenn Reed, director of product management for Novetta Entity Analytics, points out that it’s easy for two numbers to get switched, such as a person’s driver’s license and social security numbers. Over time, sophisticated rules have been developed to compare data, and confirming a match often requires comparing several fields. (For instance, health information exchanges use up to 17 different types of data to make sure the Marcia Marquez who just got admitted to the ER is the same Marcia Marquez who visited her doctor last week.) Read more…
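
As a rough illustration of that multi-field approach, here is a small Python sketch (a simplified example of my own, not Novetta’s actual matching logic; the field names, weights, and threshold are invented): map differently named columns onto a common schema, score each field fuzzily, and accept a match only when the weighted evidence is strong enough.

    from difflib import SequenceMatcher

    def field_similarity(a: str, b: str) -> float:
        """Fuzzy similarity between two field values, from 0.0 to 1.0."""
        return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

    def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
        """Weighted average of per-field similarities over the shared fields."""
        total = sum(weights.values())
        return sum(field_similarity(str(rec_a[f]), str(rec_b[f])) * w
                   for f, w in weights.items()) / total

    # Column names differ between sources ("last" vs. "surname"), so map them
    # onto a common schema before comparing. All values here are hypothetical.
    er_visit = {"first": "Marcia", "last": "Marquez", "dob": "1961-04-02"}
    clinic = {"first": "Marcia", "surname": "Marqez", "dob": "1961-04-02"}
    clinic_mapped = {"first": clinic["first"], "last": clinic["surname"],
                     "dob": clinic["dob"]}

    weights = {"first": 1.0, "last": 2.0, "dob": 3.0}
    score = match_score(er_visit, clinic_mapped, weights)
    print(f"{score:.2f}", "match" if score > 0.9 else "no match")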


How browsers get to know you in milliseconds

Behind the scenes of a real-time ad auction on the web.

A small technological marvel occurs on almost every visit to a web page. In the seconds that elapse between the user’s click and the display of the page, an ad auction takes place in which hundreds of bidders gather whatever information they can get on the user, determine which ads are likely to be of interest, place bids, and transmit the winning ad to be placed in the page.

How can all that happen in approximately 100 milliseconds? Let’s explore the timeline and find out what goes on behind the scenes in a modern ad auction. Most of the information I have comes from two companies that handle different stages of the auction: the ad exchange AppNexus and the demand side platform Yashi. Both store critical data in an Aerospike database running on flash to achieve sub-second speeds.

Read more…
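
To get a feel for the timing pressure, here is a toy Python sketch of the core of such an auction (a generic second-price auction with a latency budget, invented for illustration; it is not how AppNexus or Yashi actually implement theirs, and the bidder names and numbers are made up):

    import random, time

    LATENCY_BUDGET = 0.100  # seconds: bidders that answer too late are ignored

    def ask_bidder(name: str, user_profile: dict) -> float:
        """Stand-in for a network call to one bidder; returns a bid price."""
        time.sleep(random.uniform(0.01, 0.15))        # simulated network delay
        return round(random.uniform(0.50, 4.00), 2)   # simulated bid

    def run_auction(bidders, user_profile):
        deadline = time.monotonic() + LATENCY_BUDGET
        bids = {}
        for name in bidders:
            if time.monotonic() >= deadline:
                break                                  # budget spent: stop asking
            bids[name] = ask_bidder(name, user_profile)
        if not bids:
            return None
        ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
        winner, top_bid = ranked[0]
        # Second-price rule: the winner pays the runner-up's bid (or its own if alone).
        price = ranked[1][1] if len(ranked) > 1 else top_bid
        return winner, price

    print(run_auction(["bidder_a", "bidder_b", "bidder_c"], {"interests": ["travel"]}))

A real exchange fans the bid requests out to all bidders in parallel and simply drops any response that misses the deadline; the sequential loop here only keeps the sketch short.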


Fast data calls for new ways to manage its flow

Examples of multi-layer, three-tier data-processing architecture.


Much as CPU caches tend to be arranged in multiple levels, modern organizations direct their data into different data stores, on the principle that a small amount is needed for real-time decisions and the rest for long-range business decisions. This article looks at options for data storage, focusing on one that’s particularly appropriate for the “fast data” scenario described in a recent O’Reilly report.

Many organizations deal with data on at least three levels:

  1. They need data at their fingertips, rather like a reference book you leave on your desk. Organizations use such data for things like determining which ad to display on a web page, what kind of deal to offer a visitor to their website, or what email message to suppress as spam. They store such data in memory, often in key/value stores that allow fast lookups, as sketched in the example after this list. Flash is a second layer (slower than memory, but much cheaper), as I described in a recent article. John Piekos, vice president of engineering at VoltDB, which makes an in-memory database, says that this type of data storage is used in situations where delays of just 20 or 30 milliseconds mean lost business.
  2. For business intelligence, these organizations use a traditional relational database or a more modern “big data” tool such as Hadoop or Spark. Although the use of a relational database for background processing is generally called online analytic processing (OLAP), it is nowhere near as “online” as the first tier, where data is used within milliseconds for real-time decisions.
  3. Some data is archived with no immediate use in mind. It can be compressed and perhaps even stored on magnetic tape.
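
Here is a minimal sketch of how the first two tiers might interact in application code (generic Python, not tied to VoltDB, Hadoop, or any other product mentioned here; the store contents and timings are invented): serve real-time decisions from the in-memory tier, and fall back to the slower analytical store only on a miss.

    import time

    # Tier 1: in-memory key/value store for millisecond lookups (hypothetical data).
    hot_store = {"user:42": {"segment": "frequent_buyer", "spam_score": 0.02}}

    # Tier 2: slower warehouse/batch store queried for business intelligence.
    def warehouse_lookup(key: str) -> dict:
        time.sleep(0.2)                      # simulates a multi-hundred-ms query
        return {"segment": "unknown", "spam_score": 0.5}

    def get_profile(key: str) -> dict:
        """Serve real-time decisions from memory; fall back to the warehouse."""
        profile = hot_store.get(key)
        if profile is not None:
            return profile                   # fast enough for ad or spam decisions
        profile = warehouse_lookup(key)      # too slow for the request path in practice
        hot_store[key] = profile             # promote so the next lookup is fast
        return profile

    print(get_profile("user:42"))
    print(get_profile("user:99"))            # miss: falls back, then gets cached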

For the new fast data tier, where performance is critical, techniques such as materialized views further improve responsiveness. According to Piekos, materialized views bypass a certain amount of database processing to cut milliseconds off of queries. Read more…
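
The idea behind a materialized view can be illustrated without any particular database (a conceptual Python sketch, not VoltDB syntax): do a little extra work when each row arrives so that the query itself becomes a cheap lookup rather than a scan.

    # "Base table" of raw events plus a materialized running aggregate.
    orders = []                      # every row, kept for OLAP-style queries
    revenue_by_product = {}          # the "materialized view": updated on write

    def insert_order(product: str, amount: float) -> None:
        orders.append({"product": product, "amount": amount})
        # Extra work at write time keeps the aggregate current.
        revenue_by_product[product] = revenue_by_product.get(product, 0.0) + amount

    def revenue_scan(product: str) -> float:
        """Query without the view: scans every row (slow as the table grows)."""
        return sum(o["amount"] for o in orders if o["product"] == product)

    def revenue_view(product: str) -> float:
        """Query with the view: a single lookup, milliseconds saved per query."""
        return revenue_by_product.get(product, 0.0)

    insert_order("widget", 19.99)
    insert_order("widget", 5.00)
    print(revenue_scan("widget"), revenue_view("widget"))   # both 24.99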


The big data sweet spot: policy that balances benefits and risks

Deciding what data to collect is hard when consequences are unpredictable.


A big reason why discussions of “big data” get complicated — and policy-makers resort to vague hand-waving, as in the well-known White House executive office report — is that its ripple effects travel fast and far. Your purchase, when recorded by a data broker, affects not only the ads and deals they offer you in the future, but also the ones they offer innumerable people around the country who share some demographic with you.

Policy-making might be simple if data collectors or governments could say, “We’ll collect certain kinds of data for certain purposes and no others” — but the impacts of data collection are rarely predictable. And if one did restrict big data that way, its value would be seriously reduced.

Follow my steps: big data privacy vs collection

Data collection will explode as we learn how to connect you to different places you’ve been by the way you walk or to recognize certain kinds of diseases by your breath.

When such data exhaust is being collected, you can’t evade consequences by paying cash and otherwise living off the grid. In fact, trying to do so may disadvantage you even more: people who lack access to sophisticated technologies leave fewer tracks and therefore may not be served by corporations or governments. Read more…


Commodity data analytics for health care

Predixion service could signal a trend for smaller health facilities.

Analytics are expensive and labor intensive; we need them to be routine and ubiquitous. I complained earlier this year that analytics are hard for health care providers to muster because there’s a shortage of analysts and because every data-driven decision takes huge expertise.

Currently, only major health care institutions such as Geisinger, the Mayo Clinic, and Kaiser Permanente incorporate analytics into day-to-day decisions. Research facilities employ analytics teams for clinical research, but perhaps not so much for day-to-day operations. Large health care providers can afford departments of analysts, but most facilities — including those forming accountable care organizations — cannot.

Imagine that you are running a large hospital and are awake nights worrying about the Medicare penalty for readmitting patients within 30 days of their discharge. Now imagine you have access to analytics that can identify about 40 measures that combine to predict a readmission, and a convenient interface that tells clinicians in a simple way which patients are most at risk of readmission. Better still, the interface suggests specific interventions to reduce readmission risk: giving the patient a 30-day supply of medication, arranging transportation to rehab appointments, etc. Read more…
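
For readers who want to picture what such a model might look like under the hood, here is a toy Python sketch using scikit-learn (my own illustration on synthetic data, not Predixion’s method, and with no clinical validity; the features, relationships, and threshold are all invented):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic stand-ins for a handful of the ~40 measures a real model might use.
    # Columns: prior admissions, number of medications, age, lives-alone flag.
    X = rng.normal(size=(500, 4))
    # Invented relationship purely so the example trains; no clinical meaning.
    y = (0.9 * X[:, 0] + 0.6 * X[:, 1] + 0.3 * X[:, 3]
         + rng.normal(size=500) > 0.5).astype(int)

    model = LogisticRegression().fit(X, y)

    # Score today's inpatients and flag the highest-risk ones for interventions
    # such as a medication supply at discharge or transport to rehab appointments.
    todays_patients = rng.normal(size=(5, 4))
    risk = model.predict_proba(todays_patients)[:, 1]
    for i, p in enumerate(risk):
        flag = "flag for intervention" if p > 0.6 else "routine follow-up"
        print(f"patient {i}: readmission risk {p:.0%} ({flag})")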


Fast data fuels real-time streaming applications

A new report describes an imminent shift in real-time applications and the data architecture they require.

The era is here: we’re starting to see computers making decisions that people used to make, through a combination of historical and real-time data. These streams of data come together in applications that answer questions like:

  • What news items or ads is this website visitor likely to be interested in?
  • Is current network traffic part of a Distributed Denial of Service attack?
  • Should our banking site offer a visitor a special deal on a mortgage, based on her credit history?
  • What promotion will entice this gamer to stay on our site longer?
  • Is a particular part of the assembly line overheating, and does it need to be shut down?

Such decisions require the real-time collection of data from the particular user or device, along with others in the environment, and often need to be made on a per-person or per-event basis. For instance, leaderboarding (determining who is the top candidate among a group of users, based on some criteria) requires a database that tracks all the relevant users. Such a database nowadays often resides in memory. Read more…
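
Leaderboarding itself takes only a few lines to sketch (a generic in-memory Python example, not drawn from any particular product; the users and scores are invented). The point is that every relevant user’s running score has to be instantly at hand, which is why this state tends to live in memory:

    import heapq
    from collections import defaultdict

    # In-memory score table: one entry per relevant user (hypothetical data).
    scores = defaultdict(float)

    def record_event(user: str, points: float) -> None:
        """Update a user's running score as events stream in."""
        scores[user] += points

    def top_candidates(n: int = 3):
        """Return the n highest-scoring users right now."""
        return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

    for user, pts in [("ana", 40), ("bo", 25), ("ana", 15), ("cy", 60), ("bo", 10)]:
        record_event(user, pts)

    print(top_candidates(2))   # [('cy', 60.0), ('ana', 55.0)]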


New Swift language shows Apple history

Swift still reflects conventions going back to Apple’s original adoption of Objective-C

Like many Apple programmers (and new programmers who are curious about iOS), I treated Apple’s Swift language as a breath of fresh air. I welcomed it when Vandad Nahavandipoor updated his persistently popular iOS Programming Cookbook to cover Swift exclusively. But I soon realized that the LLVM compiler and iOS runtime have persistent attributes of their own that do not go away when programmers adopt Swift. This post tries to alert new iOS programmers to the idiosyncrasies of the runtime that they still need to learn.

Read more…


How Flash changes the design of database storage engines

High-performing memory throws many traditional decisions overboard


Over the past decade, SSD drives (popularly known as Flash) have radically changed computing at both the consumer level — where USB sticks have effectively replaced CDs for transporting files — and the server level, where it offers a price/performance ratio radically different from both RAM and disk drives. But databases have just started to catch up during the past few years. Most still depend on internal data structures and storage management fine-tuned for spinning disks.

Citing price and performance, one author advised a wide range of database vendors to move to Flash. Certainly, a database administrator can speed up old databases just by swapping out disk drives and inserting Flash, but doing so captures just a sliver of the potential performance improvement promised by Flash. For this article, I asked several database experts — including representatives of Aerospike, Cassandra, FoundationDB, RethinkDB, and Tokutek — how Flash changes the design of storage engines for databases. The various ways these companies have responded to its promise in their database designs are instructive to readers designing applications and looking for the best storage solutions.

Read more…
