How Big Data Impacts Analytics

Research for our just-published report on Big Data management technologies included conversations with teams at the forefront of analyzing massive data sets. We were particularly impressed with the work being produced by LinkedIn’s analytics team. [We have more details on LinkedIn’s analytics team in an article in the upcoming issue of Release 2.0.]

At the second Social Web Foo Camp, I had a chance to visit with LinkedIn’s Chief Scientist, DJ Patil. A mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the federal government. Years later, he ended up in an analytics role at eBay, where his prior experience with massive data sets came in handy. In the short video below, DJ shares his observations on how analytics has changed in recent years, especially as Big Data becomes increasingly common. Companies are casting a wider net and are hiring scientists from fields not traditionally known as fertile recruiting grounds for data intelligence teams.

DJ also talks about his personal journey from mathematics to e-commerce and social networks. Among his previous stints, DJ worked with the DOD and used “… social network analysis to identify terrorists.”

Other short videos from Social Web Foo Camp:

  • Ty Ahmad-Taylor on the Challenges Facing Television
  • Steve Ganz’ observations midway through Social Web Foo Camp Year 2

    • http://friendfeed.com/egonw Egon Willighagen

      ORF++

    • Falafulu Fisi

      LinkedIn’s data is pretty much static, and there isn’t much pre-processing involved, unless I am wrong here. It is a huge dataset, no doubt, but it is not a fast-evolving dataset of the kind that complicates analysis, such as data from scientific experimentation (particle physics, electromagnetics, fluid dynamics, etc.) or from financial markets, where trades are registered worldwide every few milliseconds or so. There the analytics, including the pre-processing steps, is very complex and needs to be dynamic and real-time. This means that financial models built using historical data up to 10 minutes ago (i.e., T = t-10) would be different from models built using historical data up to this very moment (i.e., T = t). With LinkedIn, by comparison, the analytic model built yesterday (or even last week) is pretty much the same as one built today. This is due to the almost static nature of LinkedIn’s subscriber base. Sure, there are new people joining every day, but the change is unnoticeable. Google’s data is also an evolving dataset, since new documents pop up all the time somewhere in the world. Google needs to update its dataset (every hour or so, perhaps even longer) for PageRank to re-crunch so that it stays recent.

      Finally, a question to DJ Patil. Could you tell us readers about the sort of analytics that LinkedIn is doing? What sorts of metrics are you looking at or analyzing? Just curious, that’s all.

    • A. Aramburu

      Could you recommend any books on Analytics for business?

    • http://www.stat.yale.edu/~jay/ Jay Emerson

      Not only are big-data analytic infrastructures like Hadoop now available in the open-source world, but the analytic methods themselves are available too.

      R (a statistical programming environment, http://www.r-project.org) is the pre-eminent open source project for statistical analysis. There are more than a few of us in the R community working on Big Data projects. My graduate student and I offer a library of functions in R to help scale those analytics to very large data sets (http://cran.r-project.org/web/packages/bigmemory/) using shared memory and (optionally) file-backed matrices for larger-than-RAM data sets.

      In addition, R has just been integrated with Hadoop by Saptarshi Guha (http://ml.stat.purdue.edu/rhipe), and the commercial open-source company REvolution Computing (whom I consult for) offers large-scale distributed computing tools for R (http://www.revolution-computing.com/products/parallel-r.php).

      I completely agree with DJ Patil that with these open-source tools available (and increasingly easy to use), we’re set to see an explosion in their use for “analytics.” LinkedIn, Facebook and Google are already using open-source tools to analyze their data; see http://dataspora.com/blog/predictive-analytics-using-r.

      • http://radar.oreilly.com/ben/ Ben Lorica

        Hi Jay,

        Great points. I started using S and S-Plus in the 1990s, made a seamless transition to R, and am quite happy to see R steadily gaining traction. Open source has already made an impact in analytics, and I’m glad to see the community is making sure the tools we’ve come to love are part of the Big Data age.

        Regards,
        Ben Lorica

    • Falafulu Fisi

      Jay Emerson, just a passing comment. R was started here in New Zealand (University of Auckland) by Prof. Ross Ihaka of the Statistics Department. I know Prof. Ihaka (a former lecturer of mine).

    • Falafulu Fisi

      Jay Emerson said…
      …but the analytic methods themselves are available too.

      There is also another popular open-source Java data-mining project, WEKA, which has been available for over a decade here in New Zealand at the University of Waikato. It is still growing because it has lots of contributors.

    • http://andrei.lopatenko.com/ Andrei Lopatenko

      Do not forget about the huge number of open-source implementations of inference (“analytics”) tools, such as libsvm (perfect integration with Python) and many other SVM implementations (in C, Java, R packages, Matlab packages), BUGS (Bayesian inference based on Gibbs sampling), and Apache’s Mahout (they are trying to build scalable statistical inference algorithms).
      I believe almost any standard method of statistical inference is implemented as open source nowadays.
      Huge amounts of data have also become available to the public (the recent Amazon initiative, the Zillow data API, the Yelp data API, etc.).