- Trajectory Data Mining: An Overview (Paper a Day) — This is the data created by a moving object, as a sequence of locations, often with uncertainty around the exact location at each point. This could be GPS trajectories created by people or vehicles, spatial trajectories obtained via cell phone tower IDs and corresponding transmission times, the moving trajectories of animals (e.g. birds) fitted with trackers, or even data concerning natural phenomena such as hurricanes and ocean currents. It turns out, there’s a lot to learn about working with such data!
- Search Engine Manipulation Effect (PNAS) — Internet search rankings have a significant impact on consumer choices, mainly because users trust and choose higher-ranked results more than lower-ranked results. Given the apparent power of search rankings, we asked whether they could be manipulated to alter the preferences of undecided voters in democratic elections. They could. Read the article for their methodology. (via Aeon)
- Keshif — open source interactive data explorer.
- proselint — analyse text for sins of usage and abusage.
By David Andrzejewski of SumoLogic
A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions, two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.
More data, more problems
This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data”. In particular two common motifs among the talks at Strata were the difficulties around:
- mechanics: the technical details of data collection, storage, and analysis
- semantics: extracting understandable and actionable information from the data deluge
A distributed, near real-time system simplifies the collection, storage, and mining of massive amounts of event data
One of the keys to Twitter’s ability to process 500 millions tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and alert teams when problems occur. The Observability stack collects 170 million individual metrics (time-series) every minute and serves up 200 million queries per day. Simple query tools are used to populate charts and dashboards (a typical user monitors about 47 charts).
The stack is about three years old1 and consists of instrumentation2 (data collection primarily via Finagle), storage (Apache Cassandra), a query language and execution engine3, visualization4, and basic analytics. Four distinct Cassandra clusters are used to serve different requirements (real-time, historical, aggregate, index). A lot of engineering work went into making these tools as simple to use as possible. The end result is that these different pieces provide a flexible and interactive framework for developers: insert a few lines of (instrumentation) code and start viewing charts within minutes5.
Compelling large-scale data platforms originate from the world of IT Operations
I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big data1 problems and I love that it’s beginning to attract the attention2 of many more data scientists and data engineers.
It’s not surprising that many of the interesting large-scale systems that target time-series and event data have come from ops teams: in an earlier post on time-series, several of the tools I highlighted came out of IT operations. IT operations involves monitoring many different hardware and software systems, a task that requires a variety of tools and which quickly leads to “metrics overload”. A partial list includes data captured from a wide range of application log files, network traffic, energy and power sources.
The volume of IT ops data has led to new tools like OpenTSDB and KairosDB – time series databases that leverage HBase and Cassandra. But storage, simple charts, and lookups are just the foundation of what’s needed. IT Ops track many interdependent systems, some of which might be correlated3. Not only are IT ops faced with highlighting “unknown unknowns” in their massive data sets, they often need to do so in near realtime.