"R" entries

MATLAB, R, and Julia: Languages for data analysis

Inside core features of specialized data analysis languages.

by Avi Bryant | October 15, 2012

Big data frameworks like Hadoop have received a lot of attention recently, and with good reason: when you have terabytes of data to work with — and these days, who doesn’t? — it’s amazing to have affordable, reliable and ubiquitous tools that allow you to spread a computation over tens or hundreds of CPUs on commodity hardware. The dirty truth is, though, that many analysts and scientists spend as much time or more working with mere megabytes or gigabytes of data: a small sample pulled from a larger set, or the aggregated results of a Hadoop job, or just a dataset that isn’t all that big (like, say, all of Wikipedia, which can be squeezed into a few gigs without too much trouble).

At this scale, you don’t need a fancy distributed framework. You can just load the data into memory and explore it interactively in your favorite scripting language. Or, maybe, a different scripting language: data analysis is one of the few domains where special-purpose languages are very commonly used. Although in many respects these are similar to other dynamic languages like Ruby or Javascript, these languages have syntax and built-in data structures that make common data analysis tasks both faster and more concise. This article will briefly cover some of these core features for two languages that have been popular for decades — MATLAB and R — and another, Julia, that was just announced this year.

MATLAB

MATLAB is one of the oldest programming languages designed specifically for data analysis, and it is still extremely popular today. MATLAB was conceived in the late ’70s as a simple scripting language wrapped around the FORTRAN libraries LINPACK and EISPACK, which at the time were the best way to efficiently work with large matrices of data — as they arguably still are, through their successor LAPACK. These libraries, and thus MATLAB, were solely concerned with one data type: the matrix, a two-dimensional array of numbers.

This may seem very limiting, but in fact, a very wide range of scientific and data-analysis problems can be represented as matrix problems, and often very efficiently. Image processing, for example, is an obvious fit for the 2D data structure; less obvious, perhaps, is that a directed graph (like Twitter’s follow graph, or the graph of all links on the web) can be expressed as an adjacency matrix, and that graph algorithms like Google’s PageRank can be easily implemented as a series of additions and multiplications of these matrices. Similarly, the winning entry to the Netflix Prize recommendation challenge relied, in part, on a matrix representation of everyone’s movie ratings (you can imagine every row representing a Netflix user, every column a movie, and every entry in the matrix a rating), and in particular on an operation called Singular Value Decomposition, one of those original LINPACK matrix routines that MATLAB was designed to make easy to use.
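To make the adjacency-matrix idea concrete, here is a minimal sketch of PageRank-style power iteration in plain Python. It is an illustration under stated assumptions, not Google's implementation: the three-page `links` matrix, the `pagerank` function name, the damping factor of 0.85 and the fixed iteration count are all hypothetical choices. The inner loops spell out what MATLAB would write as a single matrix-vector product.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power iteration over a dense adjacency matrix (list of lists).

    adjacency[i][j] == 1 means page i links to page j.
    damping and iterations are illustrative defaults (assumptions).
    """
    n = len(adjacency)
    out_degree = [sum(row) for row in adjacency]
    rank = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
            else:
                # Split page i's rank equally among the pages it links to.
                share = damping * rank[i] / out_degree[i]
                for j in range(n):
                    if adjacency[i][j]:
                        new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
ranks = pagerank(links)
```

Page 2, which is linked from both other pages, ends up with the largest share, and the ranks always sum to one.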

Four short links: 24 August 2012

Public Speaking App, Wacky Javascript, Open Science in R, and Surviving DDOS

by Nat Torkington | @gnat | August 24, 2012

Speak Like a Pro (iTunes) — practice public speaking, and your phone will rate your performance and give you tips to improve. (via Idealog)
If Hemingway Wrote Javascript — glorious. I swear I marked Andre Breton’s assignments at university. (via BoingBoing)
R Open Sci — open source R packages that provide programmatic access to a variety of scientific data, full-text of journal articles, and repositories that provide real-time metrics of scholarly impact.
Keeping Your Site Alive (EFF) — guide to surviving DDOS attacks. (via BoingBoing)

Four short links: 5 July 2011

Organising Conferences, Moving to the JVM, Language Crowdsourcing, and Bayesian Computing

by Nat Torkington | @gnat | July 5, 2011

Conference Organisers Handbook — accurate guide to running a two-day 300-person conference. See also Yet Another Perl Conference guidelines.
Twitter Shifting More Code to JVM — interesting how, at scale, there are some tools and techniques of the scorned Enterprise that the web cool kids must turn to. Some. Business Process Workflow XML Schemas will never find love.
Luis von Ahn on Duolingo — from the team that gave us “OCR books as you verify you are a human” CAPTCHAs comes “learn a new language as you translate the web”. I would love to try this; it sounds great (and is an example of what crowdsourcing can be).
Fully Bayesian Computing (PDF) — A fully Bayesian computing environment calls for the possibility of defining vector and array objects that may contain both random and deterministic quantities, and syntax rules that allow treating these objects much like any variables or numeric arrays. Working within the statistical package R, we introduce a new object-oriented framework based on a new random variable data type that is implicitly represented by simulations. Perl made text processing easy because strings were first-class objects with a rich set of functions to operate on them; Node.js has a sweet HTTP library; it’s interesting to see how much more intuitive an algorithm becomes when random variables are a data type. (via BigData)
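A rough Python sketch of the "random variable as a data type" idea, for readers who want to see why it makes algorithms read more naturally. This is not the paper's R framework or its API: the `RandomVariable` class, its `normal` constructor and the draw count of 10,000 are hypothetical names and values chosen for illustration. The only thing borrowed from the paper's description is that a random quantity is represented implicitly by its simulations, so ordinary arithmetic can be applied to it.

```python
import random


class RandomVariable:
    """A quantity represented implicitly by a list of simulated draws."""

    def __init__(self, draws):
        self.draws = list(draws)

    @classmethod
    def normal(cls, mean, sd, n=10000, rng=random):
        # n and rng are illustrative parameters (assumptions).
        return cls(rng.gauss(mean, sd) for _ in range(n))

    def _combine(self, other, op):
        # Random op random: combine draw by draw.
        if isinstance(other, RandomVariable):
            return RandomVariable(
                op(a, b) for a, b in zip(self.draws, other.draws)
            )
        # Random op constant: apply the constant to every draw.
        return RandomVariable(op(a, other) for a in self.draws)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    __radd__ = __add__

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    __rmul__ = __mul__

    def mean(self):
        return sum(self.draws) / len(self.draws)


rng = random.Random(0)  # seeded so the sketch is repeatable
x = RandomVariable.normal(0.0, 1.0, rng=rng)
y = 2 * x + 3  # reads like ordinary arithmetic on numbers
```

Here `y` is itself a random variable whose simulated mean sits close to 3, and the code that produced it looks the same as it would for plain numbers.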

Four short links: 9 March 2011

R IDE, Audience Participation, Machine Learning, Surviving Success

by Nat Torkington | @gnat | March 9, 2011

RStudio — AGPLv3-licensed IDE for R. It brings your R console, source code, plots, help, history, and workspace browser into one cohesive package. We’ve added some neat productivity features like a searchable endless command history, function/symbol completion, data import dialog with preview, one-click Sweave compile, and more. Source on github. Built as a web-app on Google Web Toolkit, from Joe Cheng who did Windows Live Writer at Microsoft. (via DeWitt Clinton)
Adventures in Participatory Audience — Nina Simon helped thirteen students produce three projects to encourage participation in museum audiences: Xavier, Stringing Connections, and Dirty Laundry. My favourite was Dirty Laundry, where people shared secrets connected to works of art. Nina’s description of what she learned has some nuggets: friendly faces welcoming people in gets better response than a card with instructions, and I am still flummoxed as to what would make someone admit to an affair or bad parenting in a sterile art gallery, or the devastating one that read, “I avoid the important, difficult conversations with those I love the most.” Audience participation in the real world has lessons on what works for those who would build social software.
Why Generic Machine Learning Fails — Returns for increasing data size come from two sources: (1) the importance of tails and (2) the cost of model innovation. When tails are important, or when model innovation is difficult relative to cost of data capture, then more data is the answer. […] Machine learning is not undifferentiated heavy lifting, it’s not commoditizable like EC2, and closer to design than coding. The Netflix prize is a good example: the last 10% reduction in RMSE wasn’t due to more powerful generic algorithms, but rather due to some very clever thinking about the structure of the problem; observations like “people who rate a whole slew of movies at one time tend to be rating movies they saw a long time ago” from BellKor.
Anatomy of a Crushing — Maciej Ceglowski describes how pinboard.in survived the flood of Delicious émigrées. It took several rounds of rewrites to get the simple tag cloud script right, and this made me very skittish about touching any other parts of the code over the next few days, even when the fixes were easy and obvious. The part of my brain that knew what to do no longer seemed to be connected directly to my hands.

Four short links: 1 September 2010

Faces in R, Open Source Web Analytics, Small File Store, Building Mapper

by Nat Torkington | @gnat | September 1, 2010

R Library for Chernoff Faces — faces represent the rows of a data matrix by faces. plot.faces plots faces into a scatterplot. Interesting emotional way to visualize data, which was used to good effect (though not with this library) by BERG in Schooloscope. (via the tutorial at Flowing Data)
Piwik — GPLed web analytics package.
Pomegranate — a data store for billions of tiny files. (via the High Scalability blog interview with the creator of Pomegranate)
New Backpack Makes 3D Maps of Buildings — the backpack indoor equivalent of the Google Maps cars, from Berkeley researchers.

Four short links: 1 January 2010

Fonty Inkness, Machine Learning, Time-Series Indexes, and Graph Analysis

by Nat Torkington | @gnat | January 1, 2010

Measuring Type — clever way to measure which font uses more ink.
Vowpal Wabbit — fast learning software from Yahoo! Research and Hunch. Code available in git. (via zecharia on Delicious)
Literature Review on Indexing Time-Series Data — a graduate student’s research work included this literature review of papers on indexing time-series data. (via jpatanooga on Delicious)
igraph — programming library for manipulating graph data, with the usual algorithms (minimum spanning tree, network flow, cliques, etc.) available in R, Python, and C.

Four short links: 5 November 2009

Heat Maps in R, EC2 Blackhat Tricks, Snickersome Unicode, and Decoding Statistics

by Nat Torkington | @gnat | November 5, 2009

Heat Maps in R — We used financial data here because it’s easier to access than the airline data, but it’s actually a pretty interesting way of looking at a financial time series. Weekend and holiday effects are a bit more obvious, and it’s a bit like being able to see the daily, weekly, monthly and yearly closes all at once (by scanning your eye over the calendar in different directions). Includes source code. (via migurski on Delicious)
BlackHat and EC2 — Theft of resources is the red-headed step-child of attack classes and doesn’t get much attention, but on cloud platforms where resources are shared amongst many users these attacks can have a very real impact. With this in mind, we wanted to show how EC2 was vulnerable to a number of resource theft attacks and the videos below demonstrate three separate attacks against EC2 that permit an attacker to boot up massive numbers of machines, steal computing time/bandwidth from other users and steal paid-for AMIs. (via straup on Delicious)
Funny Characters in Unicode — I never get tired of the wacky stuff in Unicode. I love the thought of a Unicode committee somewhere arguing passionately about the number of buttons on the snowman …. (via Hacker News)
Statistics to English Translation — The terms sensitivity and specificity generally refer to diagnostic or screening procedures, such as HIV or allergy tests. The sensitivity of a test is its true positive rate; the specificity is its true negative rate, although it can be more intuitive to think of specificity as the complement of the false positive rate. This matters. Bandying around numbers with misleading labels, or misinterpreting numbers that have a precise and defined meaning, does not further understanding. (Said 78.4% of statisticians, with a 20% confidence factor probability of false positives)
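The sensitivity and specificity definitions quoted above are easy to check with a few lines of Python. The counts below are made up purely for illustration, and the function names are hypothetical; only the formulas (true positive rate, true negative rate, and specificity as the complement of the false positive rate) come from the quoted text.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: share of actual positives the test catches."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: share of actual negatives the test clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical screening results (invented counts, not real data):
# 100 people with the condition, 200 without.
tp, fn = 90, 10
tn, fp = 160, 40

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
false_positive_rate = fp / (fp + tn)
```

With these counts the sensitivity is 90/100 and the specificity is 160/200, and the specificity plus the false positive rate adds up to one, which is the "complement" reading from the quote.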

Four short links: 28 August 2009

The Future, Python Metrics, Distributed Version Control, and Stylish R

by Nat Torkington | @gnat | August 28, 2009

What The Future’s All About (Webstock Words) — Bruce Sterling on the future. We’re not going to get a future Cloud World as somehow opposed to a future Augmented Reality World. It can’t happen. The ideas can be clearly distinguished, but ideas about technology, labels for technology, predictions and suppositions about technology, they don’t map onto actual real-world technology. Human culture doesn’t work like a logical argument.
PyMetrics — code analysis software that produces metrics for your code. (via the excellent 10 Ways To Let People Know You’re a Bad Python Programmer by Noah Gift)
Prophet and SD 0.7 Are Now Available — Prophet is a lightweight schemaless database designed for peer to peer replication and disconnected operation. Prophet keeps a full copy of your data (and history) on your laptop, desktop or server. Prophet syncs when you want it to, so you can use Prophet-backed applications whether or not you have network. SD (Simple Defects) is a peer-to-peer issue tracking system built on top of Prophet. In addition to being a full-fledged distributed bug tracker, SD can also bidirectionally sync with your RT, Hiveminder, Trac, GitHub or Google Code issue tracker.
Google’s R Style Guide — R is a high-level programming language used primarily for statistical computing and graphics. The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify. The rules below were designed in collaboration with the entire R user community at Google. (via Bo Cowgill’s blog)

Four short links: 20 August 2009

DIY SPY, Screencasting, Social Network Analysis, Term Extraction

by Nat Torkington | @gnat | August 20, 2009

DIY SPY – a homebrew 2.4GHz wi-fi spectrum analyzer — As proof of concept (and a cool toy for anyone who has one of these lying around), I have implemented a working Wi-Fi spectrum analyzer on TI’s ez430-RF2500 development kit ($50), a 2-part USB dongle which consists essentially of a CC2500 radio strapped to an MSP430 low-power microcontroller (detachable bottom half) and a USB interface which enumerates as a virtual serial port (top half). The top half doubles as a standalone MSP430 programmer, so this kit is a great cheap way to get started playing with them. (via joshua on Delicious)
Screenr — Instant screencasts for Twitter. Flash-based, uploads to their site and tweets the URL. The whole “for Twitter” thing is going a little too far: who records screencasts only for Twitter? It’s like having a spellchecker only for three-letter words.
Social Network Analysis in R — video and slides for talk on doing social network analysis with R.
We’re Keeping the Term Extraction Service — Yahoo!’s useful API gets a stay of execution. OK, we heard you. You’ve made it clear to us that shutting down the Term Extraction Service would be a mistake. So, we’ve changed our plans. We’re leaving the service up and running indefinitely. (via Simon Willison)

Making Government Transparent Using R

Danese Cooper thinks it will be an important tool in Open Gov

by James Turner | @blackbearnh | July 14, 2009

With Open Source now considered an accepted part of the software industry, some people are starting to wonder if we can’t bring the same degree of openness and innovation into government. Danese Cooper, who is actively involved in the open source community through her work with the Open Source Initiative and Apache, as well as working as an R wonk for Revolution Computing, would love to see the government become more open. Part of that openness is being able to access and interpret the mass of data that the government collects, something Cooper thinks R would be a great tool for. She’ll be talking about R and Open Government at O’Reilly’s Open Source Convention, OSCON.