Ray DiGiacomo, Jr.

A Hands-on Introduction to R

OSCON 2013 Speaker Series

R is an open-source statistical computing environment similar to SAS and SPSS that allows for the analysis of data using various techniques like sub-setting, manipulation, visualization and modeling. There are versions that run on Windows, Mac OS X, Linux, and other Unix-compatible operating systems.

To follow along with the examples below, download and install R from your local CRAN mirror found at r-project.org. You’ll also want to place the example CSV into your Documents folder (Windows) or home directory (Mac/Linux).

After installation, open the R application. The R Console will pop-up automatically. This is where R code is processed. To begin writing code, open an editor window (File -> New Script on Windows or File -> New Document on a Mac) and type the following code into your editor:

Place your cursor anywhere on the “1+1” code line, then hit Control-R (in Windows) or Command-Return (in Mac). You’ll notice that your “1+1” code is automatically executed in the R Console. This is the easiest way to run code in R. You can also run R code by typing the code directly into your R Console, but using the editor is much easier.

If you want to refresh your R Console, click anywhere inside of it and hit Control-L (in Windows) or Command-Option-L (in Mac).

Now let’s create a Vector, the simplest possible data structure in R. A Vector is similar to a column of data inside a spreadsheet. We use the combine function to do so:

To view the contents of raysVector, just run the line of code above. After running the code shown above, double-click on raysVector (in the editor) and then run the code that is automatically highlighted after double-clicking. You will now see the contents of raysVector in your R Console.

The object we just created is now stored in memory and we can see this by running the following code:

R is an interpreted language with support for procedural and object-oriented programming. Here we use the mean statistical function to calculate the statistical mean of raysVector:

Getting help on the mean function is easy using:

We can create a simple plot of raysVector using:

Importing CSV files is simple too:

We can subset the CSV data in many different ways. Here are two different methods that do the same thing:

There are many ways to transform your data in R. Here’s a method that doubles everyone’s age:

The apply function allows us to apply a standard or custom function without loops. Here we apply the mean function column-wise to the first 3 rows of the dataset in order to analyze the age and height columns of the dataset. We will also ignore missing values during the calculation:

Here we build a linear regression model that predicts a person’s weight based on their age and height:

We can plot our residuals like this:

We can install the Predictive Model Markup Language (PMML) package to quickly deploy our predictive model in a Business Intelligence system without custom SQL: Read more…

Comment