The log: The lifeblood of your data pipeline

Why every data pipeline should have a Unified Logging Layer.

The value of log data for business is unimpeachable. On every level of the organization, the question, “How are we doing?” is answered, ultimately, by log data. Error logs tell developers what went wrong in their applications. User event logs give product managers insights on usage. If the CEO has a question about the next quarter’s revenue forecast, the answer ultimately comes from payment/CRM logs. In this post, I explore the ideal frameworks for collecting and parsing logs.

Apache Kafka Architect Jay Kreps wrote a wonderfully crisp survey on log data. He begins with the simple question of “What is the log?” and elucidates its key role in thinking about data pipelines. Jay’s piece focuses mostly on storing and processing log data. Here, I focus on the steps before storing and processing.

Changing the way we think about log data

oreilly_radar_fluentd_1

The old paradigm — machines to humans, and the new — machines to machines. Image courtesy of Kiyoto Tamura.

Over the last decade, the primary consumer of log data shifted from humans to machines.

Software engineers still read logs, especially when their software behaves in an unexpected manner. However, in terms of “bytes processed,” humans account for a tiny fraction of the total consumption. Much of today’s “big data” is some form of log data, and businesses run tens of thousands of servers to parse and mine these logs to gain competitive edge. Read more…

Comment

Investigating Spark’s performance

A deep dive into performance bottlenecks with Spark PMC member Kay Ousterhout.

Ousterhout_WebcastFor many who use and deploy Apache Spark, knowing how to find critical bottlenecks is extremely important. In a recent O’Reilly webcast, Making Sense of Spark Performance, Spark committer and PMC member Kay Ousterhout gave a brief overview of how Spark works, and dove into how she measured performance bottlenecks using new metrics, including block-time analysis. Ousterhout walked through high-level takeaways from her in-depth analysis of several workloads, and offered a live demo of a new performance analysis tool and explained how you can use it to improve your Spark performance.

Her research uncovered surprising insights into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark), and one production workload. As part of our overall series of webcasts on big data, data science, and engineering, this webcast debunked commonly held ideas surrounding network performance, showing that CPU — not I/O — is often a critical bottleneck, and demonstrated how to identify and fix stragglers.

Network performance is almost irrelevant

While there’s been a lot of research work on performance — mainly surrounding the issues of whether to cache input data in-memory or on machine, scheduling, straggler tasks, and network performance — there haven’t been comprehensive studies into what’s most important to performance overall. This is where Ousterhout’s research comes in — taking on what she refers to as “community dogma,” beginning with the idea that network and disk I/O are major bottlenecks. Read more…

Comment

Beyond bitcoin and the blockchain to booming business

Widespread blockchain adoption requires understanding between developers and domain experts.

Chain_of_fate_Hiroyuki_Takeda_Flickr

Editor’s note: this post is part of our investigation into the future of money. The full video compilation from our first event, Bitcoin & the Blockchain, is now available.

The vision for bitcoin and the blockchain is unabashedly optimistic, though already it is being realized. More and more technologists, venture capitalists, financial institutions, and even regulators are seeing its long-term potential to transform industries, from financial services to data management to the Internet of Things. In the medium term, there remain hurdles to overcome before blockchain technology can offer sufficiently compelling solutions for the complex financial and technological world we live in, but there is progress to date — and it’s promising.

Blockchain-based remittance vehicles offered by Coins.ph, BitPagos, and BitPesa, though early stage, aim to take a chunk of the $450 billion remittance industry by offering speedier, more efficient, and cheaper alternatives to traditional solutions. BitPay offers bitcoin/fiat payment processing for merchants as well as bank integration. Increasingly, private investors are diversifying their portfolios by purchasing bitcoin alongside traditional assets. Most recently, Coinbase even received funding from a group of blue-chip investors, including the New York Stock Exchange, and launched its own exchange, signaling both greater acceptance by the financial services industry as well as confidence in its future value. Ripple Labs has taken a very different approach with its protocol, permitting the decentralized transmission of practically any currency type — cryptographic or fiat — like an SMTP for money, and circumventing traditional payment networks. And to this end, it’s already inked agreements with Cross River Bank (New Jersey), CBW Bank (Kansas), and Fidor Bank (Germany), with more on the horizon. Read more…

Comments: 3

The VR growth cycle: What’s different this time around

A chat with Tony Parisi on where we are with VR, where we need to go, and why we're going to get there this time.

chaos_geralt_Pixabay

Consumer virtual reality (VR) is in the midst of a dizzying and exhilarating upswing. A new breed of systems, pioneered by Oculus and centered on head-worn displays with breakthrough quality, are minting believers — whether investors, developers, journalists, or early-adopting consumers. Major new hardware announcements and releases are occurring on a regular basis, game studios and production houses big and small are tossing their hats into the ring, and ambitious startups are getting funded to stake out many different application domains. Is it a boom, a bubble, or the birth of a new computing platform?

Underneath this fundamental quandary, there are many basic questions that remain unresolved: Which hardware and software platforms will dominate? What input and touch feedback technologies will prove themselves? What are the design and artistic principles in this medium? What role will standards play, who will develop them, and when? The list goes on.

For many of these questions, we’ll need to wait a bit longer for answers to emerge; like smartphones in 2007, we can only speculate about, say, the user interface conventions that will emerge as designers grapple with this new paradigm. But on other issues, there is some wisdom to be gleaned. After all, VR has been around for a long time, and there are some poor souls who have been working in the mines all along. Read more…

Comments: 4

Building big data systems in academia and industry

The O'Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.

Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.

I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).

mikio-big-data-solution-fig

An example of a big data solution. Source: Mikio Braun, used with permission.

Read more…

Comment

A real-time tool for a real-time problem

Using VoltDB and the Lambda Architecture to locate abnormal behavior.

Pattern_Language_Rob_Deutscher_Flickr

Subscriber Identity Module box (SIMbox) fraud is a type of telecommunications fraud where users avoid an international outbound-calls charge by redirecting the call through voice over IP to a SIM in the country where the destination is located. This is an issue we helped a client address at Wise Athena.

Taking on this type of problem requires a stream-based analysis of the Call Detail Record (CDR) logs, which are typically generated quickly. Detecting this kind of activity requires in-memory computations of streaming data. You might also need to scale horizontally.

We recently evaluated the use of VoltDB together with our cognitive analytics and machine-learning system to analyze CDRs and provide accurate and fast SIMbox fraud detection. At the beginning, we used batch processing to detect SIMbox fraud, but the response time took too long, so we switched to a technology that allows in-memory computations in order to reach the desired time constraints.

VoltDB’s in-memory distributed database provides transactions at streaming speed in a fast environment. It can support millions of small transactions per second. It also allows streaming aggregation and fast counters over incoming data. These attributes allowed us to develop a real-time analytics layer on top of VoltDB. Read more…

Comments: 9