"strata submissions" entries
Preview of an upcoming tutorial at Strata Santa Clara 2013
This month at Strata, the U.C. Berkeley AMPLab will be running a full day of big data tutorials. In this post, we present the motivation and vision for the Berkeley Data Analytics Stack (BDAS), and an overview of several BDAS components that we released over the past two years, including Mesos, Spark, Spark Streaming, and Shark.
While batch processing systems like Hadoop MapReduce paved the way for organizations to ask questions about big datasets, they represent only the beginning of what users need to do with big data. More and more, users wish to move from periodically building reports about datasets to continuously using new data to make informed business decisions in real time. Achieving these goals imposes three key requirements on big data processing:
- Low latency queries: Interactive ad-hoc queries allow data scientists to find valuable inferences faster, or to explore a larger solution space to make better decisions. Furthermore, there is an increasing need for stream processing, as this allows organizations to make decisions in real time, such as detecting an SLA violation and fixing the problem before the users notice, or deciding what ads to show based on users’ live tweets.
- Sophisticated analysis: People are increasingly looking to use new state-of-the-art algorithms, such as predictive machine learning algorithms, to make better forecasts and decisions.
- Unification of existing data computation models: Users want to integrate interactive queries, batch processing, and stream processing to handle the ever-increasing requirements of their processing pipelines. For example, detecting anomalies in user behavior may require (1) stream processing to compare the behavior of users in real time across different segments (e.g., genre, age, location, device), (2) interactive queries to detect differences in users’ daily (or weekly) behavior, and (3) batch processing to build sophisticated predictive models.
In response to the above requirements, more than three years ago we began building BDAS.
Join us in the data revolution.
When I told some of my friends and family that I was joining O’Reilly Media as an editor focusing on ORM’s Strata practice area, their responses reflected the diversity of my loved ones.
I’ve paraphrased some of the best ones here:
- “That is great! I have a bunch of their books. Everyone I know has the animal books.”
- “Bill O’Reilly owns a media company?”
- “I don’t get you techie people. Didn’t you already do a bunch of weird ninja-y data type stuff?”
- “Congrats! I have a lot of respect for ORM.”
- “… wait a sec, didn’t you STOP being a Java editor years ago to go work at an assessment data startup?”
The people in my life have a few things in common. They are smart, articulate, really truly not afraid to say what they think, and seek to be the change they wish to see in the world. We don’t always agree [massive understatement]. Yet, our motivations are the same.
Why am I telling you this?
I believe that at our core, no matter how different we may seem, we do not actively seek to harm. Yet, everyone who works with data has faced, or will face, certain choices about what to do with that data. Choices that are obviously for good or for evil. Choices that are neither completely for good nor completely for evil. Choices that we are reluctant to discuss because we do not want to implicate ourselves or the companies we work for. Yet, just because we are reluctant to discuss them does not mean we are not facing these challenges.
If you have the courage to speak out regarding the real everyday challenges that you experience while working with data, then I want to listen. If you have discovered solutions to these everyday challenges, then I want to publish your insight. If you engage with anything I publish, whether you agree or disagree or have suggestions for how things could be different or better, then please say something.