A deep dive into performance bottlenecks with Spark PMC member Kay Ousterhout.
For many who use and deploy Apache Spark, knowing how to find critical bottlenecks is extremely important. In a recent O’Reilly webcast, Making Sense of Spark Performance, Spark committer and PMC member Kay Ousterhout gave a brief overview of how Spark works and dove into how she measured performance bottlenecks using new metrics, including blocked-time analysis. Ousterhout walked through high-level takeaways from her in-depth analysis of several workloads, gave a live demo of a new performance analysis tool, and explained how you can use it to improve your Spark performance.
Her research uncovered surprising insights into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark), and one production workload. As part of our overall series of webcasts on big data, data science, and engineering, this webcast debunked commonly held ideas surrounding network performance, showing that CPU — not I/O — is often a critical bottleneck, and demonstrated how to identify and fix stragglers.
Network performance is almost irrelevant
While there’s been a lot of research work on performance — mainly surrounding the issues of whether to cache input data in-memory or on machine, scheduling, straggler tasks, and network performance — there haven’t been comprehensive studies into what’s most important to performance overall. This is where Ousterhout’s research comes in — taking on what she refers to as “community dogma,” beginning with the idea that network and disk I/O are major bottlenecks. Read more…
An enterprise architecture solution for scale and efficiency.
Data processing in the enterprise goes very swiftly from “good enough” to “we need to be faster!” as expectations grow. The Zeta Architecture is an enterprise architecture that enables simplified business processes and defines a scalable way for increasing the speed of integrating data into the business. Following a bit of history and a description of the architecture, I’ll use Google as an example and look at the way the company deploys technologies for Gmail.
Origin story and motivation
I’ve worked on a variety of different information systems over my career, each with its own class of challenges. The most interesting from a capacity perspective was for a company that delivers digital advertising. The biggest technical problems in that industry flow from the sheer volume of transactions that occur on a daily basis. Traffic flows at all hours of the day, but there are certainly peak periods, which means all planning must revolve around the capacity needed during the peak hours. This solution space isn’t altogether different from that of Amazon, which had to build its infrastructure to handle massive loads of peak traffic. Both Amazon and digital advertising, incidentally, have a Black Friday spike.
Many different architectural ideas came to my mind while I was in digital advertising. Real-time performance tracking of the advertising platform was one such thing. This was well before real-time became a hot buzzword in the technology industry. There was a point in time where this digital advertising company was “satisfied” with, or perhaps tolerated, having a two-to-three-hour delay between making changes to the system and having complete insight into the effects of the changes. After nearly a year at this company, I was finally able to get a large architectural change made to streamline log collection and management. Before the implementation started, I told everyone involved what would happen: although this approach would enable the business to see the performance within approximately 5-10 minutes of the time a change was made, it would not be good enough once people got a feel for what real-time could deliver. Since people didn’t have that taste in their mouths, they wouldn’t yet support going straight to real-time for this information. The implementation of this architecture was in place a few months after I departed the company for a new opportunity. The implementation worked great, and after about three months of experience with the new architecture, my former colleagues contacted me and told me they were looking to re-architect the entire solution to go to real time. Read more…
Widespread blockchain adoption requires understanding between developers and domain experts.
Editor’s note: this post is part of our investigation into the future of money. The full video compilation from our first event, Bitcoin & the Blockchain, is now available.
The vision for bitcoin and the blockchain is unabashedly optimistic, though already it is being realized. More and more technologists, venture capitalists, financial institutions, and even regulators are seeing its long-term potential to transform industries, from financial services to data management to the Internet of Things. In the medium term, there remain hurdles to overcome before blockchain technology can offer sufficiently compelling solutions for the complex financial and technological world we live in, but there is progress to date — and it’s promising.
Blockchain-based remittance vehicles offered by Coins.ph, BitPagos, and BitPesa, though early stage, aim to take a chunk of the $450 billion remittance industry by offering speedier, more efficient, and cheaper alternatives to traditional solutions. BitPay offers bitcoin/fiat payment processing for merchants as well as bank integration. Increasingly, private investors are diversifying their portfolios by purchasing bitcoin alongside traditional assets. Most recently, Coinbase even received funding from a group of blue-chip investors, including the New York Stock Exchange, and launched its own exchange, signaling both greater acceptance by the financial services industry and confidence in its future value. Ripple Labs has taken a very different approach with its protocol, permitting the decentralized transmission of practically any currency type (cryptographic or fiat), like an SMTP for money, and circumventing traditional payment networks. To this end, it has already inked agreements with Cross River Bank (New Jersey), CBW Bank (Kansas), and Fidor Bank (Germany), with more on the horizon. Read more…
A chat with Tony Parisi on where we are with VR, where we need to go, and why we're going to get there this time.
Consumer virtual reality (VR) is in the midst of a dizzying and exhilarating upswing. A new breed of systems, pioneered by Oculus and centered on head-worn displays with breakthrough quality, is minting believers, whether investors, developers, journalists, or early-adopting consumers. Major new hardware announcements and releases are occurring on a regular basis, game studios and production houses big and small are tossing their hats into the ring, and ambitious startups are getting funded to stake out many different application domains. Is it a boom, a bubble, or the birth of a new computing platform?
Underneath this fundamental quandary, there are many basic questions that remain unresolved: Which hardware and software platforms will dominate? What input and touch feedback technologies will prove themselves? What are the design and artistic principles in this medium? What role will standards play, who will develop them, and when? The list goes on.
For many of these questions, we’ll need to wait a bit longer for answers to emerge; like smartphones in 2007, we can only speculate about, say, the user interface conventions that will emerge as designers grapple with this new paradigm. But on other issues, there is some wisdom to be gleaned. After all, VR has been around for a long time, and there are some poor souls who have been working in the mines all along. Read more…
The O'Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.
Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.
I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).
Using VoltDB and the Lambda Architecture to locate abnormal behavior.
Subscriber Identity Module box (SIMbox) fraud is a type of telecommunications fraud where users avoid an international outbound-calls charge by redirecting the call through voice over IP to a SIM in the country where the destination is located. This is an issue we helped a client address at Wise Athena.
Taking on this type of problem requires a stream-based analysis of the Call Detail Record (CDR) logs, which are typically generated quickly. Detecting this kind of activity requires in-memory computations of streaming data. You might also need to scale horizontally.
We recently evaluated the use of VoltDB together with our cognitive analytics and machine-learning system to analyze CDRs and provide accurate and fast SIMbox fraud detection. Initially, we used batch processing to detect SIMbox fraud, but responses took too long, so we switched to a technology that allows in-memory computations in order to meet the desired time constraints.
VoltDB’s in-memory distributed database provides transactions at streaming speed in a fast environment. It can support millions of small transactions per second. It also allows streaming aggregation and fast counters over incoming data. These attributes allowed us to develop a real-time analytics layer on top of VoltDB. Read more…
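To make the idea of streaming aggregation and fast counters concrete, here is a minimal in-memory sketch in plain Python. It is not VoltDB code and not the system built at Wise Athena. The record layout (a caller ID and a timestamp in seconds), the names `SlidingWindowCounter` and `flag_heavy_callers`, and the "too many calls in a short window" rule are all assumptions chosen for illustration, and the sketch assumes records arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` of event time."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # key -> timestamps of recent events, oldest first
        self._events = defaultdict(deque)

    def add(self, key, timestamp):
        """Record one event and return the key's count inside the window."""
        timestamps = self._events[key]
        timestamps.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        return len(timestamps)


def flag_heavy_callers(cdrs, window_seconds=60, threshold=3):
    """Return callers that reach `threshold` calls within the window.

    `cdrs` is an iterable of (caller_id, timestamp) pairs; this simplified
    record shape and the threshold rule are hypothetical.
    """
    counter = SlidingWindowCounter(window_seconds)
    flagged = set()
    for caller, timestamp in cdrs:
        if counter.add(caller, timestamp) >= threshold:
            flagged.add(caller)
    return flagged


# Hypothetical sample stream: sim-A places three calls within 60 seconds.
sample = [("sim-A", 0), ("sim-B", 5), ("sim-A", 10), ("sim-A", 20), ("sim-B", 200)]
suspects = flag_heavy_callers(sample)
```

Each record is processed once as it arrives, and only the timestamps still inside the window are kept in memory, which is the property that makes this style of counter suitable for low-latency detection. A real deployment would keep the counters in a distributed store and use a far richer fraud signal than call volume alone.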