Big Data: SSDs, R, and Linked Data Streams

The Solid State Storage Revolution: If you haven't seen it, I recommend watching Andy Bechtolsheim's keynote at the recent MySQL Conference. We covered SSDs in our just-published report on Big Data management technologies. Since then, we've gotten additional signals from our network of alpha geeks, and our interest in SSDs remains high.

R and Linked Data Streams: I had a chance to visit with Dataspora founder and blogger Mike Driscoll, an enthusiastic advocate for the open source statistical computing language R. After founding and leading an online retailer, Mike went back to grad school and earned a doctorate in bioinformatics. He has applied data analysis and programming across a range of domains, including retail, biotech, academia, and government projects.

Having been an avid user of S/S-Plus in the 1990s, I switched seamlessly to R in the early 2000s. To this day, I consider the S/S-Plus user manuals the best reference and introductory books on the R programming language. (Mike wholeheartedly agrees.) R has long been popular in the statistics community, but I've been noticing that its visualization and analytic capabilities are attracting interest from developers as well. Moreover, recent efforts by the R community to improve its ability to scale to large data sets (see the brief update from Jay Emerson) will strengthen R's place in the Big Data stack.
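The scaling work Emerson describes centers, as I understand it, on backing large matrices with memory-mapped files on disk rather than holding them entirely in RAM (the approach taken by R packages such as bigmemory). A rough sketch of the same idea in Python, using `numpy.memmap` — the filename and dimensions here are purely illustrative:

```python
import numpy as np

# Create a disk-backed array: the data lives in a file, not in RAM.
# (Filename and shape are illustrative, not from any real data set.)
arr = np.memmap("big_matrix.dat", dtype="float64", mode="w+",
                shape=(1000, 100))

# Pages are mapped in on demand, so slices of arrays much larger
# than available memory can be read and written transparently.
arr[:] = 1.0          # initialize every cell
arr[0:10, :] *= 2.0   # update a slice in place

# Aggregates stream through the file rather than loading it whole.
col_means = arr.mean(axis=0)
```

The operating system's virtual memory machinery does the heavy lifting: only the pages being touched need to be resident, which is what lets this pattern scale past physical RAM.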

While we talked about statistics and R, our main focus was Big Data. Mike is particularly excited about the growing number of open data sources, and the potential for linking them together to create interesting applications. The growing importance of data is something we've covered in recent years. Tim highlighted early on that companies that accumulate data are usually able to develop interesting services, many of which involve non-obvious uses of their vast data collections (see "Data as the new Intel Inside"). In addition, the concept of linking different data sources was at the heart of our Money:Tech conference.
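A toy illustration of the kind of linking Mike has in mind: join two independent data sources on a shared key, and derive a signal neither source provides on its own. Both "data sets" below are made-up stand-ins for real open data feeds:

```python
# Two hypothetical open data sources, keyed by city name.
# (All figures are invented for illustration.)
population = {"Austin": 950_000, "Boulder": 108_000, "Portland": 650_000}
median_rent = {"Austin": 1450, "Portland": 1700, "Denver": 1600}

# Inner join on the shared key: keep only cities present in both sources.
linked = {
    city: {"population": population[city], "median_rent": median_rent[city]}
    for city in population.keys() & median_rent.keys()
}

# A derived metric that exists only once the sources are linked.
rent_per_capita = {
    city: row["median_rent"] / row["population"]
    for city, row in linked.items()
}
```

Real linked-data applications face the hard version of this problem — reconciling keys that don't match exactly across sources — but the join-then-derive pattern is the same.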



  • Falafulu Fisi

    I have used R in the past, and although I still look at it once in a while to find an algorithm that isn't available in other languages, I am now firmly addicted to both Matlab and Mathematica for numerical algorithm prototyping. I am not sure, but I believe the underlying numerical linear algebra engine (NLAE) in R is the Fortran LAPACK library or a variant of it. The NLAE in Matlab is the same LAPACK. The LAPACK version for Java, which is the one I use, is JLapack, a direct port of the LAPACK Fortran code to Java.

  • Amyric Duclert

    You don’t discuss R’s capacity for interfacing with Hadoop — but this would be a key component of any emerging “Big Data stack.” Here is a project that attempts to do just that:

    So instead of LAMP, we have LAHR (Hadoop as the data layer, R as the analytics layer). Not a sexy acronym, but statisticians (and the R community) aren’t known for their sexiness.

  • OK, that went straight over my head…

  • I use RPy and Matlab at times for most of my social network analysis work.

    Loved the video, by the way.
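As the LAPACK discussion in the comments suggests, most numerical environments ultimately call into the same Fortran routines. NumPy is another example: its `linalg` module delegates to LAPACK as well. A minimal sketch of a LAPACK-backed solve (the small system here is invented for illustration):

```python
import numpy as np

# numpy.linalg.solve delegates to a LAPACK routine (*gesv) --
# the same Fortran library the comment above credits for R and Matlab.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)

# Sanity check: the solution should satisfy A @ x = b.
residual = float(np.abs(A @ x - b).max())
```

Whether you reach LAPACK through R, Matlab, JLapack, or NumPy is largely a question of which surface language you prefer; the decades-tested numerics underneath are shared.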