"distributed computing" entries

Using Apache Spark to predict attack vectors among billions of users and trillions of events

The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher, TuneIn, iTunes, SoundCloud, RSS.


In this episode of the O’Reilly Data Show, I spoke with Fang Yu, co-founder and CTO of DataVisor. We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain.

DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft, the startup has developed large-scale unsupervised algorithms on top of Apache Spark to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.”
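
DataVisor hasn’t published the details of its algorithms, but the general shape of unsupervised fraud detection on Spark is easy to sketch: derive behavioral features per account, cluster them, and flag accounts that sit far from every cluster of “normal” behavior. The data path, feature names, column names, and threshold below are invented for illustration; this is a minimal sketch, not DataVisor’s pipeline.

```python
# Illustrative sketch only: DataVisor's actual algorithms are proprietary.
# Cluster per-account behavioral features with Spark MLlib, then flag
# accounts far from every centroid as candidates for review.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("fraud-sketch").getOrCreate()

# Hypothetical per-account aggregates; a real system derives far more features.
events = spark.read.parquet("s3://bucket/user_event_features/")  # assumed path

assembler = VectorAssembler(
    inputCols=["events_per_hour", "distinct_ips", "account_age_days"],  # invented
    outputCol="features")
users = assembler.transform(events)

model = KMeans(k=50, seed=7, featuresCol="features").fit(users)
centers = model.clusterCenters()
clustered = model.transform(users)  # adds a "prediction" (cluster id) column

def distance_to_centroid(row):
    """Distance from an account's feature vector to its assigned centroid."""
    return float(np.linalg.norm(row["features"].toArray() - centers[row["prediction"]]))

# Accounts far from all clusters of normal behavior are review candidates.
scores = clustered.rdd.map(lambda r: (r["user_id"], distance_to_centroid(r)))
suspicious = scores.filter(lambda kv: kv[1] > 10.0).collect()  # arbitrary threshold
```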

Several years ago, I found myself immersed in the security space, and at that time, tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices.

Read more…

From search to distributed computing to large-scale information extraction

The O'Reilly Data Show Podcast: Mike Cafarella on the early days of Hadoop/HBase and progress in structured data extraction.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

February 2016 marks the 10th anniversary of Hadoop, arriving at a point when many IT organizations actively use Hadoop and/or one of the open source big data projects that originated after it and, in some cases, depend on it.

During the latest episode of the O’Reilly Data Show Podcast, I had an extended conversation with Mike Cafarella, assistant professor of computer science at the University of Michigan. Along with Strata + Hadoop World program chair Doug Cutting, Cafarella is the co-founder of both Hadoop and Nutch. In addition, Cafarella was the first contributor to HBase.

We talked about the origins of Nutch, Hadoop (HDFS, MapReduce), HBase, and his decision to pursue an academic career and step away from these projects. Cafarella’s pioneering contributions to open source search and distributed systems fit neatly with his work in information extraction. We discussed a new startup he recently co-founded, ClearCutAnalytics, to commercialize a highly regarded academic project for structured data extraction (full disclosure: I’m an advisor to ClearCutAnalytics). As I noted in a previous post, information extraction (from a variety of data types and sources) is an exciting area that will lead to the discovery of new features (i.e., variables) that may end up improving many existing machine learning systems. Read more…
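
Cafarella’s extraction systems are far more sophisticated than anything that fits in a few lines, but a toy sketch conveys the basic idea of turning unstructured text into structured features a downstream model could use. The patterns and field names here are invented for illustration, not his system’s approach.

```python
# Toy information-extraction sketch (not Cafarella's system): pull structured
# attributes out of free text and emit them as machine learning features.
import re

PRICE = re.compile(r"\$(\d+(?:\.\d{2})?)")
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def extract_features(text: str) -> dict:
    """Extract simple structured attributes from unstructured text."""
    prices = [float(m.group(1)) for m in PRICE.finditer(text)]
    years = [int(m.group(0)) for m in YEAR.finditer(text)]
    return {
        "max_price": max(prices) if prices else None,
        "mentions_year": years[0] if years else None,
        "n_prices": len(prices),
    }

print(extract_features("Listed at $1250.00 in 2014, reduced from $1500.00"))
# {'max_price': 1500.0, 'mentions_year': 2014, 'n_prices': 2}
```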

Managing complexity in distributed systems

The O'Reilly Radar Podcast: Astrid Atkinson on optimization, and Kelsey Hightower on distributed computing.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.


In this week’s episode, O’Reilly’s Mac Slocum talks to Astrid Atkinson, director of software engineering at Google, about the delicate balance of managing complexity in distributed systems and her experience working on-call rotations at Google.

Here are a few snippets from their chat:

I think it’s often really hard for organizations that are scaling quickly to find time to manage complexity in their systems. That can be really a trap, because if you’re really always just focused on the next deadline or whatever, and never planning for what you’re going to live with when you’re done, then you might never find the time.

You can only optimize what you pay attention to, and so if you can’t see what your system is doing, if you can’t see whether it’s working, it’s not working.

I used to get paged awake at two in the morning. You go from zero to Google is down. That’s a lot to wake up to.

Read more…

Cheap sensors, fast networks, and distributed computing

The history of computing has been a constant pendulum — that pendulum is now swinging back toward distribution.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. You can download the free report here.

The trifecta of cheap sensors, fast networks, and distributed computing is changing how we work with data. But making sense of all that data takes help, which is arriving in the form of machine learning. Here’s one view of how that might play out.

Clouds, edges, fog, and the pendulum of distributed computing

The history of computing has been a constant pendulum, swinging between centralization and distribution.

The first computers filled rooms, and operators were physically within them, switching toggles and turning wheels. Then came mainframes, which were centralized, with dumb terminals.

As the cost of computing dropped and the applications became more democratized, user interfaces mattered more. The smarter clients at the edge became the first personal computers; many broke free of the network entirely. The client got the glory; the server merely handled queries.

Once the web arrived, we centralized again. The LAMP stack (Linux, Apache, MySQL, PHP) lived deep inside data centers, with the computer at the other end of the connection relegated to little more than a smart terminal rendering HTML. Load-balancers sprayed traffic across thousands of cheap machines. Eventually, the web turned from static sites to complex software as a service (SaaS) applications.

Then the pendulum swung back to the edge, and the clients got smart again. First with AJAX, Java, and Flash; then in the form of mobile apps, where the smartphone or tablet did most of the hard work and the back end was a communications channel for reporting the results of local action. Read more…

Four short links: 18 November 2014

A Worm Mind Forever LEGO Voyaging, Automatic Caption Generator, ELK Stack, and Amazonian Deployment

  1. A Worm’s Mind in a Lego Body — the C. elegans worm’s 302 neurons have been mapped, modelled in open source code, and now hooked up to a Lego robot. It is claimed that the robot behaved in ways that are similar to observed C. elegans. Stimulation of the nose stopped forward motion. Touching the anterior and posterior touch sensors made the robot move forward and back accordingly. Stimulating the food sensor made the robot move forward. There is video.
  2. Show and Tell: A Neural Image Caption Generator — Google Research paper on generating captions like “Two pizzas sitting on top of a stove top oven” from a photo. Wow.
  3. Big Data with the ELK Stack — Elasticsearch, Logstash, and Kibana. Interesting and powerful combination of tools!
  4. Apollo: Amazon’s Deployment Engine — Apollo will stripe the rolling update to simultaneously deploy to an equivalent number of hosts in each location. This keeps the fleet balanced and maximizes redundancy in the case of any unexpected events. When the fleet scales up to handle higher load, Apollo automatically installs the latest version of the software on the newly added hosts. Lust. (A sketch of the striping idea follows this list.)
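
The striping behavior described in that last item is worth unpacking: rather than draining one data center at a time, each wave of a rolling update takes an equal slice of hosts from every location, so the fleet stays balanced throughout. Here is a minimal reconstruction of that scheduling logic; it’s my sketch of the idea, not Amazon’s code.

```python
# Sketch of location-striped rolling deployment, as described for Apollo
# (a reconstruction of the idea, not Amazon's implementation).
from itertools import zip_longest

def striped_waves(hosts_by_location: dict, stripe_size: int):
    """Yield deployment waves that each take `stripe_size` hosts from every
    location at once, so no single location is disproportionately drained."""
    chunked = {
        loc: [hosts[i:i + stripe_size] for i in range(0, len(hosts), stripe_size)]
        for loc, hosts in hosts_by_location.items()
    }
    for wave in zip_longest(*chunked.values(), fillvalue=[]):
        yield [host for chunk in wave for host in chunk]

fleet = {
    "us-east": ["e1", "e2", "e3", "e4"],
    "us-west": ["w1", "w2", "w3", "w4"],
}
for wave in striped_waves(fleet, stripe_size=2):
    print(wave)  # each wave touches both locations, never all of one
# ['e1', 'e2', 'w1', 'w2']
# ['e3', 'e4', 'w3', 'w4']
```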

Small brains, big data

How neuroscience is benefiting from distributed computing — and how computing might learn from neuroscience.


When we think about big data, we usually think about the web: the billions of users of social media, the sensors on millions of mobile phones, the thousands of contributions to Wikipedia, and so forth. Due to recent innovations, web-scale data can now also come from a camera pointed at a small, but extremely complex object: the brain. New progress in distributed computing is changing how neuroscientists work with the resulting data — and may, in the process, change how we think about computation. Read more…
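
The pattern behind that work is worth a quick sketch: treat each neuron (or pixel) as a record keyed by ID, and let a cluster framework such as Spark map the same time-series analysis over all of them in parallel. The data layout and statistics below are invented for illustration; this is a minimal sketch of the pattern, not any particular lab’s pipeline.

```python
# Minimal sketch: parallelize per-neuron time-series analysis with Spark.
# The (neuron_id, activity samples) layout is an invented example.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("brain-sketch").getOrCreate()
sc = spark.sparkContext

# Each record: (neuron_id, time series of activity measurements).
records = sc.parallelize([
    (0, [0.1, 0.4, 0.35, 0.2]),
    (1, [0.0, 0.05, 0.9, 0.8]),
])

def summarize(series):
    """Per-neuron summary statistics; any analysis could go here."""
    x = np.asarray(series)
    return {"mean": float(x.mean()), "peak_t": int(x.argmax())}

# The same map scales unchanged from two neurons to millions of pixels.
stats = records.mapValues(summarize).collectAsMap()
print(stats)
# {0: {'mean': 0.2625, 'peak_t': 1}, 1: {'mean': 0.4375, 'peak_t': 2}}
```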

Revisiting “What is DevOps”

If all companies are software companies, then all companies must learn to manage their online operations.


Two years ago, I wrote “What is DevOps.” Although that article was good for its time, our understanding of organizational behavior, and its relationship to the operation of complex systems, has grown.

A few themes have become apparent in the two years since that last article. They were latent in that article, I think, but now we’re in a position to call them out explicitly. It’s always easy to think of DevOps (or of any software industry paradigm) in terms of the tools you use; in particular, it’s very easy to think that if you use Chef or Puppet for automated configuration, Jenkins for continuous integration, and some cloud provider for on-demand server power, you’re doing DevOps. But DevOps isn’t about tools; it’s about culture, and it extends far beyond the cubicles of developers and operators. As Jeff Sussna says in Empathy: The Essence of DevOps:

…it’s not about making developers and sysadmins report to the same VP. It’s not about automating all your configuration procedures. It’s not about tipping up a Jenkins server, or running your applications in the cloud, or releasing your code on Github. It’s not even about letting your developers deploy their code to a PaaS. The true essence of DevOps is empathy.

Read more…