"data science" entries
For maximum business value, big data applications have to involve multiple Hadoop ecosystem components.
Data is deluging today’s enterprise organizations from ever-expanding sources and in ever-expanding formats. To gain insight from this valuable resource, organizations have been adopting Apache Hadoop with increasing momentum. Now, the most successful players in big data enterprise are no longer only utilizing Hadoop “core” (i.e., batch processing with MapReduce), but are moving toward analyzing and solving real-world problems using the broader set of tools in an enterprise data hub (often interactively) — including components such as Impala, Apache Spark, Apache Kafka, and Search. With this new focus on workload diversity comes an increased demand for developers who are well-versed in using a variety of components across the Hadoop ecosystem.
Due to the size and variety of the data we’re dealing with today, a single use case or tool — no matter how robust — can camouflage the full, game-changing potential of Hadoop in the enterprise. Rather, developing end-to-end applications that incorporate multiple tools from the Hadoop ecosystem, not just the Hadoop core, is the first step toward activating the disparate use cases and analytic capabilities of which an enterprise data hub is capable. Whereas MapReduce code primarily leverages Java skills, developers who want to work on full-scale big data engineering projects need to be able to work with multiple tools, often simultaneously. An authentic big data applications developer can ingest and transform data using Kite SDK, write SQL queries with Impala and Hive, and create an application GUI with Hue. Read more…
We need primitives; pipeline synthesis tools; and most importantly, error analysis and verification.
There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem then you’re almost done.
Of course, in practice, most machine learning projects can’t be reduced to simple optimization problems. Data scientists have to manage and maintain complex data projects, and the analytic problems they need to tackle usually involve specialized machine learning pipelines. Decisions at one stage affect things that happen downstream, so interactions between parts of a pipeline are an area of active research.
In his Strata+Hadoop World New York presentation, UC Berkeley Professor Ben Recht described new UC Berkeley AMPLab projects for building and managing large-scale machine learning pipelines. Given AMPLab’s ties to the Spark community, some of the ideas from their projects are starting to appear in Apache Spark. Read more…
Becoming more familiar with mathematics will help cross pollinate ideas between mathematics and software engineering.
Editor’s note: Alice Zheng will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.
During my first year in graduate school, I had an epiphany about mathematics that changed my whole perspective about the field. I had chosen to study machine learning, a cross-disciplinary research area that combines elements of computer science, statistics, and numerous subfields of mathematics, such as optimization and linear algebra. It was a lot to take in, and all of us first-year students were struggling to absorb the deluge of new concepts.
One night, I was sitting in the office trying to grok linear algebra. A wonderfully lucid textbook served as my guide: Introduction to Linear Algebra, written by Gilbert Strang. But I just wasn’t getting it. I was looking at various definitions — eigen decomposition, Jordan canonical forms, matrix inversions, etc. — and I thought, “Why?” Why does everything look so weird? Why is the inverse defined this way? Come to think of it, why are any of the matrix operations defined the way they are?
While staring at a hopeless wall of symbols, a flash of lightning went off in my mind. I had an insight: math is a design. Prior to that moment, I had approached mathematics as if it were universal truth: transcendent in its perfection, almost unknowable by mere mortals. But on that night, I realized that mathematics is a human-constructed tool. Math is designed, just like software programs are designed, and using many of the same design principles. These principles may not be apparent, but they are comprehensible. In that moment, mathematics went from being unknowable to reasonable. Read more…
Drawing inspiration from recent advances in data preparation.
One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps as well as tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is in data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.
At Strata + Hadoop World New York, NY, two presentations from academic spinoff start-ups — Mike Stonebraker of Tamr and Joe Hellerstein and Sean Kandel of Trifacta — focused on data preparation and curation. While data wrangling is just one component of a data science pipeline, and granted we’re still in the early days of productivity tools in data science, some of the lessons these companies have learned extend beyond data preparation.
Scalability ~ data variety and size
Not only are enterprises faced with many data stores and spreadsheets, data scientists have many more (public and internal) data sources they want to incorporate. The absence of a global data model means integrating data silos, and data sources requires tools for consolidating schemas.
Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
A look at the social and moral implications of living in a deeply connected, analyzed, and informed world.
We’ll now look at both the light and the shadows of this new dawn, the social and moral implications of living in a deeply connected, analyzed, and informed world. This is both the promise and the peril of big data in an age of widespread sensors, fast networks, and distributed computing.
Solving the big problemsThe planet’s systems are under strain from a burgeoning population. Scientists warn of rising tides, droughts, ocean acidity, and accelerating extinction. Medication-resistant diseases, outbreaks fueled by globalization, and myriad other semi-apocalyptic Horsemen ride across the horizon.
Can data fix these problems? Can we extend agriculture with data? Find new cures? Track the spread of disease? Understand weather and marine patterns? General Electric’s Bill Ruh says that while the company will continue to innovate in materials sciences, the place where it will see real gains is in analytics.
It’s often been said that there’s nothing new about big data. The “iron triangle” of Volume, Velocity, and Variety that Doug Laney coined in 2001 has been a constraint on all data since the first database. Basically, you could have any two you want fairly affordably. Consider:
- A coin-sorting machine sorts a large volume of coins rapidly, but assumes a small variety of coins. It wouldn’t work well if there were hundreds of coin types.
- A public library, organized by the Dewey Decimal System, has a wide variety of books and topics, and a large volume of those books — but stacking and retrieving the books happens at a slow velocity.
What’s new about big data is that the cost of getting all three Vs has become so cheap it’s almost not worth billing for. A Google search happens with great alacrity, combs the sum of online knowledge, and retrieves a huge variety of content types. Read more…