How Twitter monitors millions of time-series

One of the keys to Twitter’s ability to process 500 millions tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and alert teams when problems occur. The Observability stack collects 170 million individual metrics (time-series) every minute and serves up 200 million queries per day. Simple query tools are used to populate charts and dashboards (a typical user monitors about 47 charts).

The stack is about three years old¹ and consists of instrumentation² (data collection primarily via Finagle), storage (Apache Cassandra), a query language and execution engine³, visualization⁴, and basic analytics. Four distinct Cassandra clusters are used to serve different requirements (real-time, historical, aggregate, index). A lot of engineering work went into making these tools as simple to use as possible. The end result is that these different pieces provide a flexible and interactive framework for developers: insert a few lines of (instrumentation) code and start viewing charts within minutes⁵.

The Observability stack’s suite of analytic functions is a work in progress – only simple tools are currently available. Potential anomalies are highlighted visually and users can input simple alerts (“if the value exceeds 100 for 10 minutes, alert me”). While rule-based alerts are useful, they cannot proactively detect unexpected problems (or “unknown unknowns”). When faced with tracking a large number of time series, correlations are essential: if one time series signals an anomaly, it’s critical to know what others one should be worried about. In place of automatic correlation detection, for now Observability users leverage Zipkin (a distributed tracing system) to identify service dependencies. But its solid technical architecture should allow the Observability team to easily expand its analytic capabilities. Over the coming months, the team plans to add tools⁶ for pattern matching (search), and automatic correlation and anomaly detection.

While latency requirements tend to grab headlines (e.g. high frequency trading), Twitter’s Observability stack addresses a more common pain point: managing and mining many millions of time-series. In an earlier post I noted that many interesting systems developed for monitoring IT Operations are beginning to tackle this problem. As self-tracking apps continue to proliferate, massively scalable backend systems for time series need to be built. So while I appreciate Twitter’s decision to open source Summingbird, I think just as many users will want to get their hands on an open source version of their Observability stack. I certainly hope the company decides to open source it in the near future.

Logo of Twitter's Observability stack

Related posts:

(0) Thanks to Franklin Hu and Yann Ramin of Twitter for walking me through the details of the Observability stack.
(1) The precursor to the Observability stack was a system that relied on tools like Ganglia and Nagios.
(2) “Just as easy as adding a print statement.”
(3) In-house tools written in Scala, the queries are written in a “declarative, functional inspired language”. In order to achieve near real-time latency, in-memory caching techniques are used.
(4) In-house tools based on HTML + Javascript, including command line tools for creating charts and dashboards.
(5) The system is best described as near real-time. Or more precisely, human real-time (since humans are still in the loop).
(6) Dynamic time warping at massive scale is on their radar. Since historical data is archived, simulation tools (for what-if scenario analysis) are possible but currently not planned. In an earlier post I highlighted one such tool from CloudPhysics.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

Strata Rx Health Data Conference: September 25-27 | Boston, MA
Strata + Hadoop World: October 28-30 | New York, NY
Strata in London: November 15-17 | London, England

How Twitter monitors millions of time-series

A distributed, near real-time system simplifies the collection, storage, and mining of massive amounts of event data

Get the O’Reilly Data Newsletter