How Big Data Impacts Analytics
by Ben Lorica | @dliman | comments: 9
Research for our just-published report on Big Data management technologies included conversations with teams who are at the forefront of analyzing massive data sets. We were particularly impressed with the work being produced by LinkedIn's analytics team. [We have more details on LinkedIn's analytics team in an article in the upcoming issue of Release 2.0.]
At the second Social Web Foo camp, I had a chance to visit with LinkedIn's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the federal government. Years later, he ended up in an analytics role at eBay, where his prior experience with massive data sets came in handy. In the short video below, DJ shares his observations on how analytics has changed in recent years, especially as Big Data becomes increasingly common. Companies are casting a wider net and are hiring scientists from fields not traditionally known as fertile recruiting grounds for data intelligence teams.
DJ also talks about his personal journey from mathematics to e-commerce and social networks. Among his previous stints, DJ worked with the DOD and used "... social network analysis to identify terrorists."
tags: analytics, big data, foo camp, hadoop, social networking, social web, swfoo, video
Comments: 9
LinkedIn's data is pretty much static, and unless I am wrong there isn't much pre-processing involved. It is a huge dataset, no doubt, but it is not a fast-evolving dataset of the kind that complicates analysis, such as data from scientific experimentation (particle physics, engineering electromagnetics, fluid dynamics, etc.) or from financial markets, where trades are registered every few milliseconds or so worldwide, the analytics (including the pre-processing steps) are very complex, and the analysis needs to be dynamic and real-time. This means that financial models built using historical data up to 10 minutes ago (i.e., T = t-10) would be different from models built using historical data up to this very moment (i.e., T = t). In LinkedIn's case, by comparison, the analytic model built yesterday (or even last week) is pretty much the same as the model built today. This is due to the almost static nature of LinkedIn's subscriber base: sure, there are new people joining every day, but the change is unnoticeable. Google's data is also an evolving dataset, since new documents pop up all the time somewhere in the world. Google needs to update its dataset (every hour or so, perhaps even longer) for PageRank to re-crunch so that the results stay recent.
Finally, a question for DJ Patil: could you tell us readers about the sort of analytics that LinkedIn is doing? What sorts of metrics do you look at or analyze? Just curious, that's all.
A. Aramburu
Here are a few:
1. Programming Collective Intelligence
2. Excel Scientific and Engineering Cookbook
3. Analyzing Business Data with Excel
Good luck,
Ben
Not only are big-data analytic infrastructures like Hadoop now available in the open-source world, but the analytic methods themselves are available too.
R (a statistical programming environment, http://www.r-project.org) is the pre-eminent open source project for statistical analysis. There are more than a few of us in the R community working on Big Data projects. My graduate student and I offer a library of functions in R to help scale those analytics to very large data sets (http://cran.r-project.org/web/packages/bigmemory/) using shared memory and (optionally) file-backed matrices for larger-than-RAM data sets.
In addition, R has just been integrated with Hadoop by Saptarshi Guha (http://ml.stat.purdue.edu/rhipe), and the commercial open-source company REvolution Computing (whom I consult for) offers large-scale distributed computing tools for R (http://www.revolution-computing.com/products/parallel-r.php).
I completely agree with DJ Patil that with these open-source tools available (and increasingly easy to use), we're set to see an explosion in their use for "analytics." LinkedIn, Facebook and Google are already using open-source tools to analyze their data -- see http://dataspora.com/blog/predictive-analytics-using-r.
Hi Jay,
Great points. I started using S and S-Plus in the 1990s, seamlessly made the transition to R, and am quite happy to see R steadily gaining traction. Open source has already made an impact in analytics, and I'm glad to see the community is making sure the tools we've come to love are part of the Big Data age.
Regards,
Ben Lorica
Jay Emerson, just a passing comment. R was started here in New Zealand (University of Auckland) by Prof. Ross Ihaka of the Statistics Department. I know Prof. Ihaka (a former lecturer of mine).
Jay Emerson said...
...but the analytic methods themselves are available too.
There is also another popular open-source Java data-mining project named WEKA, which has been available for over a decade here in New Zealand at the University of Waikato. It is still growing because it has many contributors.
Do not forget about the huge number of open-source implementations of inference ("analytics") tools, like libsvm (perfect integration with Python) and many other SVM implementations (in C, Java, R packages, Matlab packages), BUGS (Gibbs-sampling-based Bayesian inference), and Apache's Mahout (they are trying to build scalable statistical inference algorithms).
I believe almost any standard method of statistical inference is implemented as open source nowadays.
Huge amounts of data have also become available to the public (the recent Amazon initiative, the Zillow data API, the Yelp data API, etc.).


Egon Willighagen [2009-04-28 04:49 AM]
ORF++