"Big Data" entries

Is 2016 the year you let robots manage your money?

The O’Reilly Data Show podcast: Vasant Dhar on the race to build “big data machines” in financial investing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


In this episode of the O’Reilly Data Show, I sat down with Vasant Dhar, a professor at the Stern School of Business and Center for Data Science at NYU, founder of SCT Capital Management, and editor-in-chief of the Big Data Journal (full disclosure: I’m a member of the editorial board). We talked about the early days of AI and data mining, and recent applications of data science to financial investing and other domains.

Dhar’s first steps in applying machine learning to finance

I joke with people, I say, ‘When I first started looking at finance, the only thing I knew was that prices go up and down.’ It was only when I actually went to Morgan Stanley and took time off from academia that I learned about finance and financial markets. … What I really did in that initial experiment is I took all the trades, I appended them with information about the state of the market at the time, and then I cranked it through a genetic algorithm and a tree induction algorithm. … When I took it to the meeting, it generated a lot of really interesting discussion. … Of course, it took several months before we actually finally found the reasons for why I was observing what I was observing.

Read more…

Four short links: 22 December 2015

Four short links: 22 December 2015

Machine Poetry, Robo Script Kiddies, Big Data of Love, and Virtual Currency and the Nation State

  1. How Machines Write PoetryHarmon would love to have writers or other experts judge FIGURE8’s work, too. Her online subjects tended to rate the similes better if they were obvious. “The snow continued like a heavy rain” got high scores, for example, even though Harmon thought this was quite a bad effort on FIGURE8’s part. She preferred “the snow falls like a dead cat,” which got only middling ratings from humans. “They might have been cat lovers,” she says. FIGURE8 (PDF) system generates figurative language.
  2. The Decisions the Pentagon Wants to Leave to Robots“You cannot have a human operator operating at human speed fighting back at determined cyber tech,” Work said. “You are going to need have a learning machine that does that.” I for one welcome our new robot script kiddie overlords.
  3. Love in the Age of Big DataOver decades, John has observed more than 3,000 couples longitudinally, discovering patterns of argument and subtle behaviors that can predict whether a couple would be happily partnered years later or unhappy or divorced. Turns out, “don’t be a jerk” is good advice for marriages, too. (via Cory Doctorow)
  4. National Security Implications of Virtual Currency (PDF) — Rand research report examining the potential for non-state actor deployment.
Four short links: 7 December 2015

Four short links: 7 December 2015

Telepresent Axeman, Toxic Workers, Analysis Code, and Cryptocurrency Attacks

  1. Axe-Wielding Robot w/Telepresence (YouTube) — graphic robot-on-wall action at 2m30s. (via IEEE)
  2. Toxic Workers (PDF) — In comparing the two costs, even if a firm could replace an average worker with one who performs in the top 1%, it would still be better off by replacing a toxic worker with an average worker by more than two-to-one. Harvard Business School research. (via Fortune)
  3. Replacing Sawzall (Google) — At Google, most Sawzall analysis has been replaced by Go […] we’ve developed a set of Go libraries that we call Lingo (for Logs in Go). Lingo includes a table aggregation library that brings the powerful features of Sawzall aggregation tables to Go, using reflection to support user-defined types for table keys and values. It also provides default behavior for setting up and running a MapReduce that reads data from the logs proxy. The result is that Lingo analysis code is often as concise and simple as (and sometimes simpler than) the Sawzall equivalent.
  4. Attacks in the World of Cryptocurrency — a review of some of the discussed weakness, attacks, or oddities in cryptocurrency (esp. bitcoin).
Four short links: 24 November 2015

Four short links: 24 November 2015

Tabular Data, Distrusting Authority, Data is the Future, and Remote Working Challenges

  1. uitable — cute library for tabular data in console golang programs.
  2. Did Carnegie Mellon Attack Tor for the FBI? (Bruce Schneier) — The behavior of the researchers is reprehensible, but the real issue is that CERT Coordination Center (CERT/CC) has lost its credibility as an honest broker. The researchers discovered this vulnerability and submitted it to CERT. Neither the researchers nor CERT disclosed this vulnerability to the Tor Project. Instead, the researchers apparently used this vulnerability to deanonymize a large number of hidden service visitors and provide the information to the FBI. Does anyone still trust CERT to behave in the Internet’s best interests? Analogous to the CIA organizing a fake vaccination drive to get close to Osama. “Intelligence” agencies.
  3. Google Open-Sourcing TensorFlow Shows AI’s Future is Data not Code (Wired) — something we’ve been saying for a long time.
  4. Challenges of Working Remote (Moishe Lettvin) — the things that make working remote hard aren’t, primarily, logistical; they’re emotional.

Jai Ranganathan on architecting big data applications in the cloud

The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


Given the quick pace of innovation in the data ecosystem, we like to take a step back from the details of individual components, architecture, and applications, in order to take a wider view of the landscape of big data. This allows us to evaluate the progress of technology and infrastructure along the way, shifting our attention from the details of individual components like Spark and Kafka, to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and hardware trends:

Large-scale machine learning

This sounds a bit like this should already exist in really good form right now, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a singular environment learning scikit-learn or R, the set of things you can do in a distributed fashion is limited. …  It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems just to provide the kind of depth that you typically are used to in the singular environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.

Read more…

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

Read more…

Four short links: 28 October 2015

Four short links: 28 October 2015

DRM-Breaking Broken, IT Failures, Social Graph Search, and Dataviz Interview

  1. Librarian of Congress Grants Limited DRM-Breaking Rights (Cory Doctorow) — The Copyright Office said you will be able to defeat locks on your car’s electronics, provided: You wait a year first (the power to impose waiting times on exemptions at these hearings is not anywhere in the statute, is without precedent, and has no basis in law); You only look at systems that do not interact with your car’s entertainment system (meaning that car makers can simply merge the CAN bus and the entertainment system and get around the rule altogether); Your mechanic does not break into your car — only you are allowed to do so. The whole analysis is worth reading—this is not a happy middle-ground; it’s a mess. And remember: there are plenty of countries without even these exemptions.
  2. Lessons from a Decade of IT Failures (IEEE Spectrum) — full of cautionary tales like, Note: No one has an authoritative set of financials on ECSS. That was made clear in the U.S. Senate investigation report, which expressed frustration and outrage that the Air Force couldn’t tell it what was spent on what, when it was spent, nor even what ECSS had planned to spend over time. Scary stories to tell children at night.
  3. Unicorn: A System for Searching the Social Graph (Facebook) — we describe the data model and query language supported by Unicorn, which is an online, in-memory social graph-aware indexing system designed to search trillions of edges between tens of billions of users and entities on thousands of commodity servers. Unicorn is based on standard concepts in information retrieval, but it includes features to promote results with good social proximity. It also supports queries that require multiple round-trips to leaves in order to retrieve objects that are more than one edge away from source nodes.
  4. Alberto Cairo InterviewSo, what really matters to me is not the intention of the visualization – whether you created it to deceive or with the best of intentions; what matters is the result: if the public is informed or the public is misled. In terms of ethics, I am a consequentialist – meaning that what matters to me ethically is the consequences of our actions, not so much the intentions of our actions.
Four short links: 26 October 2015

Four short links: 26 October 2015

Dataflow Computers, Data Set Explorer, Design Brief, and Coping with Uncertainty

  1. Dataflow Computers: Their History and Future (PDF) — entry from 2008 Wiley Encyclopedia of Computer Science and Engineering.
  2. Mirador — open source tool for visual exploration of complex data sets. It enables users to discover correlation patterns and derive new hypotheses from the data.
  3. How 23AndMe Got Regulatory Approval Back (Fast Company) — In order to meet FDA requirements, the design team had to prove that the reports provided on the website would be comprehensible to any American consumer, regardless of their background or education level. And you thought YOUR design brief was hard.
  4. Getting Comfortable with Uncertainty (The Atlantic) — We have this natural distaste for things that are unfamiliar to us, things that are ambiguous. It goes up from situational stressors, on an individual level and a group level. And we’re stuck with it simply because we have to be ambiguity-reducers.

Turning big data into actionable insights

The O’Reilly Data Show podcast: Evangelos Simoudis on data mining, investing in data startups, and corporate innovation.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


Can developments in data science and big data infrastructure drive corporate innovation? To be fair, many companies are still in the early stages of incorporating these ideas and tools into their organizations.

Evangelos Simoudis has spent many years interacting with entrepreneurs and executives at major global corporations. Most recently, he’s been advising companies interested in developing long-term strategies pertaining to big data, data science, cloud computing, and innovation. He began his career as a data mining researcher and practitioner, and is counted among the pioneers who helped data mining technologies get adopted in industry.

In this episode of the O’Reilly Data Show, I sat down with Simoudis and we talked about his thoughts on investing, data applications and products, and corporate innovation:

Open source software companies

I very much appreciate open source. I encourage my portfolio companies to use open source components as appropriate, but I’ve never seen the business model as being one that is particularly easy to really build the companies around them. Everybody points to Red Hat, and that may be the exception, but I have not seen companies that have, on the one hand, remained true to the open source principles and become big and successful companies that do not require constant investment. … The revenue streams never prove to be sufficient for building big companies. I think the companies that get started from open source in order to become big and successful … [are] ones that, at some point, decided to become far more proprietary in their model and in the services that they deliver. Or they become pure professional services companies as opposed to support services companies. Then they reach the necessary levels of success.

Read more…

Get started with cloud-based data science

Learn how to deploy machine learning solutions using Azure ML.


Download the free, updated report “Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update.

Cloud-based machine learning platforms, like Microsoft’s Azure Machine Learning (Azure ML), provide a simplified path to create and deploy analytic solutions. Azure ML is a fully managed and secure machine learning platform that resides within the Microsoft Cortana Analytics Suite.

Azure ML workflows (known as “experiments”) are constructed using a combination of drag-and-drop modules, SQL, R, and Python scripts. The wide range of built modules support the typical steps in a machine learning workflow, from data ingestion and data munging to model construction and cross validation.

Once your Azure ML experiment is ready, there are several options to deploy it. Azure ML experiments can access large-scale data stored in Azure Blob storage, Azure SQL and Hive, to name a few options. Similarly, your experiment can write results back to multiple scalable Azure storage options.

Read more…