Looking at the collision of hardware and software through the eyes of a data scientist.
Editor’s note: This is part one of a two-part series reflecting on the O’Reilly Solid Conference from the perspective of a data scientist. Normally we wouldn’t publish takeaways from an event held nearly two months ago, but these insights were so good we thought they needed to be shared.
In mid-May, I was at Solid, O’Reilly’s new conference on the convergence of hardware and software. I went in as something close to a blank slate on the subject, as someone with (I thought) not very strong opinions about hardware in general.
The talk on the grapevine in my community, data scientists who tend to deal primarily with web data, was that hardware data was the next big challenge, the place that the “alpha geeks” were heading. There are still plenty of big problems left to solve on the web, but I was curious enough to want to go check out Solid to see if I was missing out on the future. I don’t have much experience with hardware — beyond wiring up LEDs as a kid, making bird houses in shop class in high school, and mucking about with an Arduino in college. Read more…
Business users are becoming more comfortable with graph analytics.
The rise of sensors and connected devices will lead to applications that draw from network/graph data management and analytics. As the number of devices surpasses the number of people — Cisco estimates 50 billion connected devices by 2020 — one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies.
This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes and edges). I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark ecosystem.
A practical example of how anomaly detection makes complex data problems easier to solve.
As new tools for distributed storage and analysis of big data are becoming more stable and widely known, there is a growing need for discovering best practices for analytics at this scale. One of the areas of widespread interest that crosses many verticals is anomaly detection.
At its best, anomaly detection is used to find unusual, rarely occurring events or data for which little is known in advance. Examples include changes in sensor data reported for a variety of parameters, suspicious behavior on secure websites, or unexpected changes in web traffic. In some cases, the data patterns being examined are simple and regular and, thus, fairly easy to model.
Anomaly detection approaches start with some essential but sometimes overlooked ideas about anomalies:
- Anomalies are defined not by their own characteristics but in contrast to what is normal.
- Before you can spot an anomaly, you first have to figure out what “normal” actually is.
This need to first discover what is considered “normal” may seem obvious, but it is not always obvious how to do it, especially in situations with complicated patterns of behavior. Best results are achieved when you use statistical methods to build an adaptive model of events in the system you are analyzing as a first step toward discovering anomalous behavior. Read more…
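To make the "learn normal first" idea concrete, here is a minimal sketch in Python. It is only one simple way to build an adaptive statistical model, not necessarily the method the full post goes on to describe: it keeps an exponentially weighted mean and variance of a stream and flags a reading that falls too many standard deviations from the current mean. The class name `AdaptiveDetector`, the parameters `alpha`, `threshold`, and `warmup`, and the sample readings are all assumptions chosen for illustration.

```python
import math


class AdaptiveDetector:
    """Toy adaptive model: exponentially weighted mean and variance of a stream."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha = alpha          # how quickly "normal" adapts to new readings
        self.threshold = threshold  # standard deviations that count as anomalous
        self.warmup = warmup        # readings to absorb before flagging anything
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def observe(self, x):
        """Return True if x looks anomalous relative to the model so far."""
        self.count += 1
        if self.count == 1:
            self.mean = x  # the first reading seeds the model
            return False

        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )

        if not is_anomaly:
            # Only normal-looking readings update the model, so a single spike
            # does not drag the definition of "normal" toward itself.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly


# Hypothetical sensor readings with one obvious spike at index 7.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.9, 10.1, 25.0, 10.0, 10.1]
detector = AdaptiveDetector()
flagged = [i for i, value in enumerate(readings) if detector.observe(value)]
print(flagged)  # [7]
```

Skipping the update on flagged readings is a design choice with a cost: if the system genuinely shifts to a new level, this sketch keeps flagging the new level instead of adapting to it. A real deployment would need a policy for that case.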
The Lambda Architecture has its merits, but alternatives are worth exploring.
Nathan Marz wrote a popular blog post describing an idea he called the Lambda Architecture (“How to beat the CAP theorem”). The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. This has proven to be a surprisingly popular idea, with a dedicated website and an upcoming book. Since I’ve been involved in building out the real-time data processing infrastructure at LinkedIn using Kafka and Samza, I often get asked about the Lambda Architecture. I thought I would describe my thoughts and experiences.
What is a Lambda Architecture and how do I become one?
The Lambda Architecture looks something like this:
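As a rough illustration in code, the general shape of the pattern as it is commonly described can be sketched as follows. This is a toy under stated assumptions, not Marz's implementation: plain Python stands in for both the batch system (MapReduce) and the stream processor (Storm), and the names `batch_view`, `SpeedLayer`, and `query`, along with the page-view events, are hypothetical.

```python
from collections import Counter


def batch_view(master_log):
    """Batch layer (stand-in for MapReduce): recompute the whole view
    from the immutable log of events."""
    return Counter(event["page"] for event in master_log)


class SpeedLayer:
    """Speed layer (stand-in for a stream processor): incrementally count
    events that the latest batch run has not covered yet."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event["page"]] += 1

    def reset(self):
        # Called once a new batch view has absorbed these events.
        self.counts.clear()


def query(page, batch, speed):
    """Serving step: stitch the batch view and the real-time view together."""
    return batch[page] + speed.counts[page]


# Hypothetical events: three already covered by a batch run, two still recent.
master_log = [{"page": "home"}, {"page": "home"}, {"page": "about"}]
recent = [{"page": "home"}, {"page": "pricing"}]

batch = batch_view(master_log)
speed = SpeedLayer()
for event in recent:
    speed.ingest(event)

print(query("home", batch, speed))  # 3
```

Even in a sketch this small, the counting logic lives in two places, once in `batch_view` and once in `SpeedLayer.ingest`, and the two have to agree for `query` to return a sensible answer.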
Data from the Internet of Things makes an integrated data strategy vital.
The Internet of Things (IoT) is more than a network of smart toasters, refrigerators, and thermostats. For the moment, though, domestic appliances are the most visible aspect of the IoT. But they represent merely the tip of a very large and mostly invisible iceberg.
IDC predicts that by the end of 2020, the IoT will encompass 212 billion “things,” including hardware we tend not to think about: compressors, pumps, generators, turbines, blowers, rotary kilns, oil-drilling equipment, conveyor belts, diesel locomotives, and medical imaging scanners, to name a few. Sensors embedded in such machines and devices use the IoT to transmit data on such metrics as vibration, temperature, humidity, wind speed, location, fuel consumption, radiation levels, and hundreds of other variables. Read more…
Why my understanding of AI is different from yours.
Editor’s note: this post is part of our Intelligence Matters investigation.
Let me start with a secret: I feel self-conscious when I use the terms “AI” and “artificial intelligence.” Sometimes, I’m downright embarrassed by them.
Before I get into why, though, answer this question: what pops into your head when you hear the phrase artificial intelligence?
For the layperson, AI might still conjure HAL’s unblinking red eye, and all the misfortune that ensued when he became so tragically confused. Others jump to the replicants of Blade Runner or more recent movie robots. Those who have been around the field for some time, though, might instead remember the “old days” of AI — whether with nostalgia or a shudder — when intelligence was thought to primarily involve logical reasoning, and truly intelligent machines seemed just a summer’s work away. And for those steeped in today’s big-data-obsessed tech industry, “AI” can seem like nothing more than a high-falutin’ synonym for the machine-learning and predictive-analytics algorithms that are already hard at work optimizing and personalizing the ads we see and the offers we get — it’s the term that gets trotted out when we want to put a high sheen on things. Read more…