"data science" entries
An enterprise architecture solution for scale and efficiency.
Data processing in the enterprise goes very swiftly from “good enough” to “we need to be faster!” as expectations grow. The Zeta Architecture is an enterprise architecture that enables simplified business processes and defines a scalable way for increasing the speed of integrating data into the business. Following a bit of history and a description of the architecture, I’ll use Google as an example and look at the way the company deploys technologies for Gmail.
Origin story and motivation
I’ve worked on a variety of different information systems over my career, each with their own classes of challenge. The most interesting from a capacity perspective was for a company that delivers digital advertising. The biggest technical problems in that industry flow from the sheer volume of transactions that occur on a daily basis. Traffic flows in all hours of the day, but there are certainly peak periods, which means all planning must revolve around the capacity during the peak hours. This solution space isn’t altogether different than that of Amazon; they had to build their infrastructure to handle massive loads of peak traffic. Both Amazon and digital advertising, incidentally, have a Black Friday spike.
Many different architectural ideas came to my mind while I was in digital advertising. Real-time performance tracking of the advertising platform was one such thing. This was well before real-time became a hot buzzword in the technology industry. There was a point in time where this digital advertising company was “satisfied” with, or perhaps tolerated, having a two-to-three-hour delay between making changes to the system and having complete insight into the effects of the changes. After nearly a year at this company, I was finally able to get a large architectural change made to streamline log collection and management. Before the implementation started, I told everyone involved what would happen. Although this approach would enable the business to see the performance within approximately 5-10 minutes of the time a change was made, that this would not be good enough after people got a feel for what real-time could deliver. Since people didn’t have that taste in their mouths, they wouldn’t yet support going straight to real-time for this information. The implementation of this architecture was in place a few months after I departed the company for a new opportunity. The implementation worked great, and after about three months of experience with the new architecture, my former colleagues contacted me and told me they were looking to re-architect the entire solution to go to real time. Read more…
A chat with Tony Parisi on where we are with VR, where we need to go, and why we're going to get there this time.
Consumer virtual reality (VR) is in the midst of a dizzying and exhilarating upswing. A new breed of systems, pioneered by Oculus and centered on head-worn displays with breakthrough quality, are minting believers — whether investors, developers, journalists, or early-adopting consumers. Major new hardware announcements and releases are occurring on a regular basis, game studios and production houses big and small are tossing their hats into the ring, and ambitious startups are getting funded to stake out many different application domains. Is it a boom, a bubble, or the birth of a new computing platform?
Underneath this fundamental quandary, there are many basic questions that remain unresolved: Which hardware and software platforms will dominate? What input and touch feedback technologies will prove themselves? What are the design and artistic principles in this medium? What role will standards play, who will develop them, and when? The list goes on.
For many of these questions, we’ll need to wait a bit longer for answers to emerge; like smartphones in 2007, we can only speculate about, say, the user interface conventions that will emerge as designers grapple with this new paradigm. But on other issues, there is some wisdom to be gleaned. After all, VR has been around for a long time, and there are some poor souls who have been working in the mines all along. Read more…
The O'Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.
Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.
I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).
Using VoltDB and the Lambda Architecture to locate abnormal behavior.
Subscriber Identity Module box (SIMbox) fraud is a type of telecommunications fraud where users avoid an international outbound-calls charge by redirecting the call through voice over IP to a SIM in the country where the destination is located. This is an issue we helped a client address at Wise Athena.
Taking on this type of problem requires a stream-based analysis of the Call Detail Record (CDR) logs, which are typically generated quickly. Detecting this kind of activity requires in-memory computations of streaming data. You might also need to scale horizontally.
We recently evaluated the use of VoltDB together with our cognitive analytics and machine-learning system to analyze CDRs and provide accurate and fast SIMbox fraud detection. At the beginning, we used batch processing to detect SIMbox fraud, but the response time took too long, so we switched to a technology that allows in-memory computations in order to reach the desired time constraints.
VoltDB’s in-memory distributed database provides transactions at streaming speed in a fast environment. It can support millions of small transactions per second. It also allows streaming aggregation and fast counters over incoming data. These attributes allowed us to develop a real-time analytics layer on top of VoltDB. Read more…
Things are moving fast in the stream processing world.
Register for Strata + Hadoop World, London. Editor’s note: Ben Lorica is an advisor to Databricks and Graphistry. Many of the technologies discussed in this post will be covered in trainings, tutorials, and sessions at Strata + Hadoop World in London this coming May.
There’s renewed interest in stream processing and analytics. I write this based on some data points (attendance in webcasts and conference sessions; a recent meetup), and many conversations with technologists, startup founders, and investors. Certainly, applications are driving this recent resurgence. I’ve written previously about systems that come from IT operations as well as how the rise of cheap sensors are producing stream mining solutions from wearables (mostly health-related apps) and the IoT (consumer, industrial, and municipal settings). In this post, I’ll provide a short update on some of the systems that are being built to handle large amounts of event data.
Apache projects (Kafka, Storm, Spark Streaming, Flume) continue to be popular components in stream processing stacks (I’m not yet hearing much about Samza). Over the past year, many more engineers started deploying Kafka alongside one of the two leading distributed stream processing frameworks (Storm or Spark Streaming). Among the major Hadoop vendors, Hortonworks has been promoting Storm, Cloudera supports Spark Streaming, and MapR supports both. Kafka is a high-throughput distributed pub/sub system that provides a layer of indirection between “producers” that write to it and “consumers” that take data out of it. A new startup (Confluent) founded by the creators of Kafka should further accelerate the development of this already very popular system. Apache Flume is used to collect, aggregate, and move large amounts of streaming data, and is frequently used with Kafka (Flafka or Flume + Kafka). Spark Streaming continues to be one of the more popular components within the Spark ecosystem, and its creators have been adding features at a rapid pace (most recently Kafka integration, a Python API, and zero data loss). Read more…
Cleaning and combining fields can turn messy data into actionable insight.
We often talk in business and computing about moving from “raw data” to “knowledge,” hoping to take useful actions based on the data our organization has collected over time. Before one can view trends in your data or do other analytics, you need tools for data cleaning and for combining multiple data sources into meaningful collections of information, known as entities. An entity may be a customer, a product, a point of sale, an incident being investigated by the police, or anything else around which you want to build meaningful context.
In this post, we’ll explore some of the complexities in real-life data that create headaches — and how analytical software can help users prepare data for sophisticated queries and visualizations. Read more…
The O'Reilly Data Show Podcast: Erich Nachbar on testing and deploying open source, distributed computing components.
When I first hear of a new open source project that might help me solve a problem, the first thing I do is ask around to see if any of my friends have tested it. Sometimes, however, the early descriptions sound so promising that I just jump right in and try it myself — and in a few cases, I transition immediately (this was certainly the case for Spark).
I recently had a conversation with Erich Nachbar, founder and CTO of Virtual Power Systems, and one of the earliest adopters of Spark. In the early days of Spark, Nachbar was CTO of Quantifind, a startup often cited by the creators of Spark as one of the first “production deployments.” On the latest episode of the O’Reilly Data Show Podcast, we talk about the ease with which Nachbar integrates new open source components into existing infrastructure, his contributions to Mesos, and his new “software-defined power distribution” startup.
Ecosystem of open source big data technologies
When evaluating a new software component, nothing beats testing it against workloads that mimic your own. Nachbar has had the luxury of working in organizations where introducing new components isn’t subject to multiple levels of decision-making. But, as he notes, everything starts with testing things for yourself:
“I have sort of my mini test suite…If it’s a data store, I would just essentially hook it up to something that’s readily available, some feed like a Twitter fire hose, and then just let it be bombarded with data, and by now, it’s my simple benchmark to know what is acceptable and what isn’t for the machine…I think if more people, instead of reading papers and paying people to tell them how good or bad things are, would actually set aside a day and try it, I think they would learn a lot more about the system than just reading about it and theorizing about the system. Read more…
Tensor methods for machine learning are fast, accurate, and scalable, but we'll need well-developed libraries.
Data scientists frequently find themselves dealing with high-dimensional feature spaces. As an example, text mining usually involves vocabularies comprised of 10,000+ different words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone working on implementing machine learning algorithms ends up needing a good library for matrix analysis and operations.
But why stop at 2D representations? In a recent Strata + Hadoop World San Jose presentation, UC Irvine professor Anima Anandkumar described how techniques developed for higher-dimensional arrays can be applied to machine learning. Tensors are generalizations of matrices that let you look beyond pairwise relationships to higher-dimensional models (a matrix is a second-order tensor). For instance, one can examine patterns between any three (or more) dimensions in data sets. In a text mining application, this leads to models that incorporate the co-occurrence of three or more words, and in social networks, you can use tensors to encode arbitrary degrees of influence (e.g., “friend of friend of friend” of a user).
Being able to capture higher-order relationships proves to be quite useful. In her talk, Anandkumar described applications to latent variable models — including text mining (topic models), information science (social network analysis), recommender systems, and deep neural networks. A natural entry point for applications is to look at generalizations of matrix (2D) techniques to higher-dimensional arrays. Read more…