"strata santa clara 2013" entries
Insights from a Strata Santa Clara 2013 session
Strata Santa Clara 2013 is a wrap, and I had a great time speaking and interacting with all of the amazing attendees. I’d like to recap the talk that Tim Palko and I gave, entitled “Large-Scale Data Collection and Real-Time Analytics Using Redis”, and maybe even answer a few questions we were asked following our time on stage.
Our talk centered on a system we designed to collect environmental data from remote sensors located across the country and to provide real-time visualization, monitoring, and event detection. Our primary challenge in the initial phase of development proved to be scaling the system to collect data from thousands of nodes, each of which sent sensor readings roughly once per second, while maintaining the ability to query the data in real time for event detection. While each data record was only ~300 bytes, our expected maximum sensor load indicated a collection rate of about 27 million records, or 8GB, per hour. However, our primary issue was not data size but data rate. A large number of inserts had to happen each second, and we could not buffer inserts into batches or transactions without delaying the real-time data stream.
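As a quick back-of-the-envelope check, the hourly figures quoted above can be turned into a per-second insert rate and an average record size. This is only a sketch: the constant and function names are made up for illustration, and 8GB is assumed to mean 8 × 10^9 bytes.

```python
# Figures quoted in the post; 8 GB is assumed to be decimal (8 * 10**9 bytes).
RECORDS_PER_HOUR = 27_000_000
BYTES_PER_HOUR = 8 * 10**9
SECONDS_PER_HOUR = 3600


def inserts_per_second(records_per_hour: int) -> float:
    """Average number of inserts the store must absorb each second."""
    return records_per_hour / SECONDS_PER_HOUR


def average_record_bytes(bytes_per_hour: int, records_per_hour: int) -> float:
    """Average size of one record implied by the hourly totals."""
    return bytes_per_hour / records_per_hour


rate = inserts_per_second(RECORDS_PER_HOUR)
size = average_record_bytes(BYTES_PER_HOUR, RECORDS_PER_HOUR)
print(f"{rate:.0f} inserts/s, about {size:.0f} bytes per record")
```

At roughly one reading per sensor per second, that insert rate corresponds to several thousand sensors reporting at once, which is why the rate mattered more than the volume.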
Preview of upcoming session at the Strata Conference
As a preview, let’s talk about two pretty pictures.
I’m running some typical distributed systems (HDFS, MapReduce, Impala, HBase, Zookeeper) on a small, seven-node cluster. The diagram above has individual processes and the TCP connections they’ve established to each other. Some processes are “masters” and they end up talking to many other processes.
Preview of an upcoming session at Strata Santa Clara
In many modern web and big data applications the data arrives in a streaming fashion and needs to be processed on the fly. In these applications, the data is usually too large to fit in main memory, and the computations need to be done incrementally upon arrival of new pieces of data. Sketching techniques allow these applications to be realized with high levels of efficiency in memory, computation, and network communications.
In the algorithms research community, sketching techniques first appeared in the literature in the 1980s, e.g., in the seminal work of Philippe Flajolet and G. Nigel Martin. They attracted wider attention in the late 1990s, partially inspired by the award-winning work of Noga Alon, Yossi Matias, and Mario Szegedy, and took off in the 2000s and 2010s, when sketches were successfully designed not only for fundamental problems such as heavy hitters, but also for matrix computations, network algorithms, and machine learning. These techniques are now at an inflection point in their history, due to the following factors:
1. Untapped potential: Because they are so new, their huge practical potential has barely been tapped.
2. Breadth and maturity: They are now both broad and mature enough to start to be widely used across a variety of big data applications, and even act as basic building blocks for new highly efficient big data management systems.
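To give a flavor of what a sketch looks like in practice, here is a minimal Count-Min sketch, one well-known structure for approximate frequency counting on a stream (the building block behind many heavy-hitters methods). It is chosen here purely as an illustration and is not necessarily one of the techniques covered in the session; the class name, default sizes, and hashing scheme are assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts in fixed memory (illustrative sketch only)."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        # depth rows of width counters; memory does not grow with the stream.
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One salted hash per row picks a single counter in that row.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum across rows
        # is never below the true count.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for word in ["redis"] * 5 + ["hbase"] * 2 + ["impala"]:
    sketch.add(word)  # items are processed one at a time, as they arrive

print(sketch.estimate("redis"))  # at least 5; may overcount on collisions
```

The trade-off is typical of sketches: a small, fixed table is updated incrementally, and exact counts are given up in exchange for an estimate with a one-sided error.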
Preview of upcoming session "Who is Fake?" at the Strata Conference
By Lutz Finger
In The Matrix, the idea of a computer algorithm determining what we think may have seemed far-fetched. Really? Far-fetched? Let’s look at some numbers.
About half of all Americans get their news in digital form. This news is written up by journalists, half of whom at least partially source their stories from social media. They use tools to harvest the real-time knowledge of 100,000 tweets per second and more.
But what if someone could influence those tools and create messages that look as though they were part of a common consensus? Or create the appearance of trending?
Preview of The Laws of Data Mining Session at Strata Santa Clara 2013
Many years ago I was taught about the three laws of thermodynamics. When that didn’t stick, I was taught a quick way to remember them, originally identified by C.P. Snow:
- 1st Law: you can’t win
- 2nd Law: you can’t draw
- 3rd Law: you can’t get out of the game
These laws (well, the real ones) were firmly established by the mid-19th century. Yet it wasn’t until the 1930s that the value of the 0th law was identified.
They may possibly, just possibly, not be as important as the laws of thermodynamics, but at Strata they will be supported by an equally important 0th Law.
Strata Santa Clara session preview on core data science skills
The McKinsey Global Institute forecasts a shortage of over 140,000 data scientists in the U.S. by 2018. I forecast a shortage of 140,000 people to explain to their respective hiring managers that “make it Hadoop” is not an appropriate articulation of what these people can or should do. If big data is the new bubble, then here’s to the prolonged correct data recession that hopefully follows.
Correct data? Such skills used to be called unsexy names like statistics or scientific experiments, but we now prefer to spice up the job titles (and salaries!) a bit and brand ourselves as data scientists, data storytellers, data prophets, or—if my next promotion comes through—Lord High Chancellor of Data, appointed by the Sovereign on the advice of the Prime Minister to oversee Her Majesty’s Terabytes. Modesty, it sometimes feels, is low on the burgeoning list of big data skills.