# ENTRIES TAGGED "OSCON"

## Open Source In The Classroom

### OSCON 2013 Speaker Series

To teach computer programming to a young person with no experience, you must imagine what it’s like to know nothing about languages, algorithms, data structures, design patterns, object orientation, development tools, etc. Yet the kids I’ve seen in high school over the past several years are so immersed in a world of computing, they have interesting partial understandings of how things work and usually far deeper knowledge of what’s being done with the technologies at a consumer level than their teacher. The contemporary adolescent, a child of the late 1980’s and early 1990’s, has lived a life that parallels the advent of the web with no experience of a world before such powerful new verbs as email, Google, or blog. On the other hand, most have never seen the HTML of a web page or the IP address of a networked computer, and none of them can guess what a compiler does. As a high school computer programming teacher, I am usually the first to help them bridge that enormous and growing gap, and it doesn’t help that the first thing they ask is “How do I write an app for my phone?”

So I need to remember what it was like for me to learn programming for the first time given that, as a child of the early 1970’s, my own life parallels the advent of the home computer. I go backwards in time, way before my modern toolkit mainstays like Java, XML, or Linux (or their amalgam in Android). Programming days came before my Webmaster days, before my analyst days, before I employed scripting languages like Perl, JavaScript, Tcl, or even the Unix shells. I was programming before college, where I used older languages like C, Lisp, or Pascal. When I fondly reminisce about banging out BASIC programs on my beloved Commodore-64 I think surely I’ve arrived at the beginning, but no. When I really do recall the very first time I wrote an original computer program, that was even longer ago: it was 1984. I remember my seventh grade math class, in a Massachusetts public school, being escorted into a lab crammed with no more than a dozen new Apple IIe computers. There we wrote programs in a language called Logo, although everybody just called it “turtle”.

Logo, first created in Cambridge, MA in 1967, was designed from the outset as a programming language for teaching about programming. The open source KDE education project includes an application, KTurtle, for writing programs in Turtlescript, a modern adaptation of Logo. I have found KTurtle to be an extremely effective introduction to computer programming in my own classroom. The complete and well-written Turtlescript language reference can be printed in a small packet of about eight double-sided pages which I can distribute to each pupil. Your first view is a simple IDE, with a code editor that does multi-color highlighting, an inspector for tracing variables and functions, and a canvas for displaying the output of your program. I can demonstrate commands in seconds, gratifying students with immediate results as the onscreen turtle draws lines on the canvas. I can slow the speed of the turtle or step through a program to show what’s going on at each command of a larger program. I can save my final output as an image file, having limited but definite value to a young person. The kids get it, and with about 15 minutes of explanation, a few examples, and a handful of commands they are ready and eager to try themselves.

A modern paradigm of teaching secondary mathematics is giving students a sequence of tasks that give them a progressive understanding of concepts by putting them in situations where they struggle enough to realize the need for new insights. If I ask the students to use KTurtle to draw a square, having only provided them with the basic commands for moving the turtle, they can quickly come up with something like this:

If the next tasks involve a greater number of adjacent squares, this approach becomes tiresome. Now I have the opportunity to introduce looping control structures (along with the programming virtue of laziness) to show them a better way:

As I give them a few more tasks with squares, some students may continue to try to draw the pictures in one giant block of commands, creating complicated Eulerian trails to trace each figure, but are eventually convinced by the obvious benefits and their peers to make the switch. When they graduate to cutting and pasting their square loop repeatedly, I can introduce subroutines:

## Why Choose a Graph Database

### Collaborative filtering with Neo4j

By this time, chances are very likely that you’ve heard of NoSQL, and of graph databases like Neo4j.

NoSQL databases address important challenges that we face today, in terms of data size and data complexity. They offer a valuable solution by providing particular data models to address these dimensions.

On one side of the spectrum, these databases resolve issues for scaling out and high data values using compounded aggregate values, on the other side is a relationship based data model that allows us to model real world information containing high fidelity and complexity.

Neo4j, like many other graph databases, builds upon the property graph model; labeled nodes (for informational entities) are connected via directed, typed relationships. Both nodes and relationships hold arbitrary properties (key-value pairs). There is no rigid schema, but with node-labels and relationship-types we can have as much meta-information as we like. When importing data into a graph database, the relationships are treated with as much value as the database records themselves. This allows the engine to navigate your connections between nodes in constant time. That compares favorably to the exponential slowdown of many-JOIN SQL-queries in a relational database.

How can you use a graph database?

Graph databases are well suited to model rich domains. Both object models and ER-diagrams are already graphs and provide a hint at the whiteboard-friendliness of the data model and the low-friction mapping of objects into graphs.

Instead of de-normalizing for performance, you would normalize interesting attributes into their own nodes, making it much easier to move, filter and aggregate along these lines. Content and asset management, job-finding, recommendations based on weighted relationships to relevant attribute-nodes are some use cases that fit this model very well.

Many people use graph databases because of their high performance online query capabilities. They process large amounts or high volumes of raw data with Map/Reduce in Hadoop or Event-Processing (like Storm, Esper, etc.) and project the computation results into a graph. We’ve seen examples of this from many domains from financial (fraud detection in money flow graphs), biotech (protein analysis on genome sequencing data) to telco (mobile network optimizations on signal-strength-measurements).

Graph databases shine when you can express your queries as a local search using a few starting points (e.g., people, products, places, orders). From there, you can follow relevant relationships to accumulate interesting information, or project visited nodes and relationships into a suitable result.

Comment: 1

## A Hands-on Introduction to R

### OSCON 2013 Speaker Series

R is an open-source statistical computing environment similar to SAS and SPSS that allows for the analysis of data using various techniques like sub-setting, manipulation, visualization and modeling. There are versions that run on Windows, Mac OS X, Linux, and other Unix-compatible operating systems.

To follow along with the examples below, download and install R from your local CRAN mirror found at r-project.org. You’ll also want to place the example CSV into your Documents folder (Windows) or home directory (Mac/Linux).

After installation, open the R application. The R Console will pop-up automatically. This is where R code is processed. To begin writing code, open an editor window (File -> New Script on Windows or File -> New Document on a Mac) and type the following code into your editor:

Place your cursor anywhere on the “1+1” code line, then hit Control-R (in Windows) or Command-Return (in Mac). You’ll notice that your “1+1” code is automatically executed in the R Console. This is the easiest way to run code in R. You can also run R code by typing the code directly into your R Console, but using the editor is much easier.

If you want to refresh your R Console, click anywhere inside of it and hit Control-L (in Windows) or Command-Option-L (in Mac).

Now let’s create a Vector, the simplest possible data structure in R. A Vector is similar to a column of data inside a spreadsheet. We use the combine function to do so:

To view the contents of raysVector, just run the line of code above. After running the code shown above, double-click on raysVector (in the editor) and then run the code that is automatically highlighted after double-clicking. You will now see the contents of raysVector in your R Console.

The object we just created is now stored in memory and we can see this by running the following code:

R is an interpreted language with support for procedural and object-oriented programming. Here we use the mean statistical function to calculate the statistical mean of raysVector:

Getting help on the mean function is easy using:

We can create a simple plot of raysVector using:

Importing CSV files is simple too:

We can subset the CSV data in many different ways. Here are two different methods that do the same thing:

There are many ways to transform your data in R. Here’s a method that doubles everyone’s age:

The apply function allows us to apply a standard or custom function without loops. Here we apply the mean function column-wise to the first 3 rows of the dataset in order to analyze the age and height columns of the dataset. We will also ignore missing values during the calculation:

Here we build a linear regression model that predicts a person’s weight based on their age and height:

We can plot our residuals like this:

We can install the Predictive Model Markup Language (PMML) package to quickly deploy our predictive model in a Business Intelligence system without custom SQL: Read more…

Comment

## Get Hadoop, Hive, and HBase Up and Running in Less Than 15 Minutes

### OSCON 2013 Speaker Series

If you have delved into Apache Hadoop and related projects, you know that installing and configuring Hadoop is hard. Often, a minor mistake during installation or configuration with messy tarballs will lurk for a long time until some otherwise innocuous change to the system or workload causes difficulties. Moreover, there is little to no integration testing among different projects (e.g. Hadoop, Hive, HBase, Zookeeper, etc.) in the ecosystem. Apache Bigtop is an open source project aimed at bridging exactly those gaps by:

1. Making it easier for users to deploy and configure Hadoop and related projects on their bare metal or virtualized clusters.

2. Performing integration testing among various components in the Hadoop ecosystem.

The primary goal of Apache Bigtop is to build a community around the packaging and interoperability testing of Hadoop related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

The latest released version of Apache Bigtop is Bigtop 0.5 which integrates the latest versions of various projects including Hadoop, Hive, HBase, Flume, Sqoop, Oozie and many more! The supported platforms include CentOS/RHEL 5 and 6, Fedora 16 and 17, SuSE Linux Enterprise 11, OpenSuSE 12.2, Ubuntu LTS Lucid and Precise, and Ubuntu Quantal.

Who uses Bigtop?

Folks who use Bigtop can be divided into two major categories. The first category of users are those who leverage Bigtop to power their own Hadoop Distributions. The second category of users are those who use Bigtop for deployment purposes.

In alphabetical order, they are:

Comment

## Augmenting Unstructured Data

### OSCON 2013 Speaker Series

Our world is filled with unstructured data. By some estimates, it’s as high as 80% of all data.

Unstructured data is data that isn’t in a specific format. It isn’t separated by a delimiter that you could split on and get all of the individual pieces of data. Most often, this data comes directly from humans. Human generated data isn’t the best kind to place in a relational database and run queries on. You’d need to run some algorithms on unstructured data to gain any insight.

## Play-by-Play

Advanced NFL Stats was kind enough to open up their play-by-play data for NFL games from 2002 to 2012. This dataset consisted of both structured and unstructured data. Here is a sample line showing a single play in the dataset:

This line is comma separated and has various queryable elements. For example, we could query on the teams playing and the scores. We couldn’t query on the unstructured portion, specifically:

There are many more insights that can be gleaned from this portion. For example, we can see that San Francisco’s quarterback, Colin Kaepernick, passed the ball to the wide receiver, Michael Crabtree. He was tackled by Charles Tillman. Michael Crabtree caught the ball at the 25 yard line, gained 1 yard. After catching the ball, he didn’t make any forward progress or yards after catch (“0-yds YAC“).

Writing a synopsis like that is easy for me as a human. I’ve spent my life working with language and trying to comprehend the meaning. It’s not as easy for a computer. We have to use various methods to extract the information from unstructured data like this. These can vary dramatically in complexity; some use simple string lookups or contains and others use natural language processing.

Parsing would be easy if all of the plays looked like the one above, but they don’t. Each play has a little bit different formatting and varies in the amount of data. The general format is different for each type of play: run, pass, punt, etc. Each type of play needs to be treated a little differently. Also, each person writing the play description will be a little different from the others.

## Augmenting Data

As I showed in the synopsis of the play, one can get some interesting insight out of the structured and unstructured data. However, we’re still limited in what we can do and the kinds of queries we can write.

I had a conversation with a Canadian about American Football and the effects of the weather. I was talking about how the NFL now favors domed locations to take weather out as a variable for the Superbowl or NFL championship. With the play-by-play dataset, I couldn’t write a query to tell me the effects of weather one way or another. The data simply doesn’t exist in the dataset.

The physical location and date of the game is captured in the play-by-play. We can see that in the first portion “20121119_CHI@SF“. Here, Chicago is playing at San Francisco on 2012-11-19. There are other datasets that could enable us to augment our play-by-play. These datasets would give us the physical location of the stadium where the game was played (the team’s name doesn’t mean that their stadium is in that city). The dataset for stadiums is a relatively small one. This dataset also shows if the stadium is domed and the ambient temperature is a non-issue.

Comment

## Scaling People, Process, and Technology with Python

### OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.

We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.

In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn't have the team to do any of the required data modeling.

## People

So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. In order to give us the largest talent pool possible to recruit from, we had to choose a tech stack that is both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.

Python is easy to learn. We found that people coding in R, Matlab, Java, PHP, and even those who have never programmed before could quickly learn and get up to speed with Python. This opened us up to hiring a tremendous pool of talent who we could train in Python once they joined AppNexus. To top it off, there’s a great community for hiring engineers and the PyData community is full of programmers who specialize in modeling and automation.

Additionally, Python has great libraries for data modeling. It offers great analytical tools for analysts and quants and when combined, Pandas, IPython, and Matplotlib give you a lot of the functionality of Matlab or R. This made it easy to hire and onboard our quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analysis through the browser with IPython.

## Process

Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.

First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.

Imagine you have a MySQL database. You don’t want to hardcode all of your connection information because you want to have a different config for different users, or for different environments. Link allows you to define your “environment” in a JSON config file, and then reference it in code as if it is a Python object.

Now, with only three lines of code you have a database connection and a data frame straight from your mysql database. This same methodology works for Vertica, Netezza, Postgres, Sqlite, etc. New “wrappers” can be added to accommodate new technologies, allowing team members to focus on modeling the data, not how to connect to all these weird data sources.

By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.

Second, we wanted to minimize the amount of work it took to take an algorithm from research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers are using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.

Comment

## Six Ways to Make Your Peer Code Reviews More Effective

### OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Emma Jane’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

The critique is a design school staple. You can find more than a few blog posts from design students about how “the crit” either made them as a designer, or broke their will to live. Substitute “critique” with “code review” and you’re likely to find just as many angst-filled blog posts in the open source community. Or, more likely, you’ll notice a lack of contributors following a particularly harsh text-based transaction.

Somewhere along the way we dropped one of the original meanings of the word “critique”. According to Wikipedia, a critique is “a method of disciplined, systematic analysis of a written or oral discourse”.

A good critique is more than just a positive feedback sandwich that wraps “negative” feedback between two slices of “positive” feedback. An effective critique needs to have three components in place to be successful:

1. An agreed-upon framework for the evaluation.
2. Reviewer objectivity.
3. A creator uncoupled from his or her work.

## The Quantitative Review

Quantitative reviews are the easiest to conduct. And, generally, coding standards for a project are quantitative. A quantitative review answers questions such as: Is this code formatted according to coding standards? Does the code cause any performance regressions? Quantitative evaluations can be easily measured. The code being reviewed is either right, or non-compliant. The findings of quantitative reviews are generally hard to argue with. (Ever tried to pick a fight with Jenkins?) Having a quantitative framework in place often means that your project will also be able to free up reviewer time by putting automated tests in place. Which leaves time for a more difficult type of review: the qualitative review.

## The Qualitative Review

A framework for a qualitative review is incredibly difficult to create because there is space for subjective thought as it addresses the values of a project. There is room for opinion. Two people could be right at the same time, even though they are not at all in agreement. Qualitative frameworks are so difficult to create that most projects don’t even try; however, an effective qualitative framework helps both the coder and the reviewer to improve their skills. A qualitative framework teaches non-experts what attributes they should be looking for in the code they are reviewing. Questions posed during a qualitative review may include: Does this code implement a pattern we’ve already seen somewhere in the project? Has the code been sufficiently abstracted to make it reusable in other situations?

Comment

## HTML 5 Geolocation, SharePoint Tech, Strangeloop, and More

### Tech events you don't want to miss.

Each Monday, we round up upcoming event highlights from the programming and technology spaces. Have an event to share? Send us a note.

Intro to Raspberry Pi : Ed Snajder explains what a Raspberry Pi is, how it differs from an Arduino and shows attendees some cool things you can do with a Raspberry Pi. Register for this free webcast.

Date: 10 a.m. PT, June 25 Location: Online webcast

Graphlab Workshop on Large Scale Machine Learning: This workshop is a meeting place for both academia and industry to discuss upcoming challenges of large scale machine learning and solution methods. The main goal for this year’s workshop is to bring together top researchers from academia as well as top data scientists from the industry, with the special focus of large-scale machine learning on sparse graphs. For more information and to register, visit the event page.

Date: 8 a.m. to 7 p.m. PT, July 1 Location: San Francisco, CA

Comment

### OSCON 2013 Speaker Series

Ed Snajder is a 3D-printer aficionado, DBA at Jive Software and OSCON 2013 speaker. We talk about the ins and outs of the new world of 3D printing (a little sneak preview of what Ed will be speaking about at this year’s OSCON). If you are interested in attending to check out Ed’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% your registration fee.

Key highlights include:

• What is RepRap? [Discussed at 0:20]
• Hack a printer together from scratch, purchase a kit or get one ready to print (sort of) [Discussed at 3:15]
• How do you get from 2D to 3D? [Discussed at 5:54]
• If you smell popcorn and you’re not making it, your 3D printer is burning [Discussed at 9:59]
• Creating a greener world one object at a time [Discussed at 12:36]
• 3D printer = Piracy machine? [Discussed at 14:25]

You can view the full interview here:

Comment

## Intro to Raspberry Pi, Wharton Web Conference, Agile 2013, and More

### Tech events you don't want to miss

Each Monday, we round up upcoming event highlights from the programming and technology spaces. Have an event to share? Send us a note.

The Revolution Will Not Be Televised webcast: Jonathan Stark discusses the coming wireless wave and how it will profoundly affect every aspect of society—the iPhone will look like a fax machine compared to what’s coming next. Register for this free webcast.
Date: 10 a.m. PT, June 20 Location: Online webcast

Intro to Raspberry Pi : Ed Snajder explains what a Raspberry Pi is, how it differs from an Arduino and shows attendees some cool things you can do with a Raspberry Pi. Register for this free webcast.
Date: 10 a.m. PT, June 25 Location: Online webcast

Comment