"OSCON" entries

Will Developers Move to Sputnik?

The past, present, and future of Dell's project

Barton George (@barton808) is the Director of Development Programs at Dell, and the lead on Project Sputnik—Dell’s Ubuntu-based developer laptop (and its accompanying software). He sat down with me at OSCON to talk about what’s happened in the past year since OSCON 2012, and why he thinks Sputnik has a real chance at attracting developers.

Key highlights include:

  • The developers that make up Sputnik’s ideal audience [Discussed at 1:00]
  • The top three reasons you should try Sputnik [Discussed at 2:46]
  • What Barton hopes to be talking about in 2014 [Discussed at 4:36]
  • The key to building a community is documentation [Discussed at 5:20]

You can view the full interview here:

Read more…

In Praise of the Lone Contributor

The O'Reilly Open Source Awards 2013

Over the years, OSCON has become a big conference. With over 3,900 registered this year, it was hard not to look at the packed hallways and sessions and think what a huge crowd it is. The number of big-name companies participating (Microsoft, Google, Dell, and even General Motors) reinforces the popular refrain that open source has come a long way; it’s all mainstream now.

Which is as it should be. And it’s been a long haul. But thinking of open source in terms of numbers and size puts us in danger of forgetting the very thing that makes open source special, and that’s the individual contributor. So while open source software has indeed found a place in almost every organization that exists, it was made possible by the hard work of real people who saw the need for it, most of them volunteering in their spare time.

The O’Reilly Open Source Awards were created to recognize and thank these individuals. It’s a community-driven effort: nominations come in from the open source community (this year there were around 50) and then are judged by the previous year’s winners. It’s not intended to be political or a popularity contest, but honest appreciation for hard work that matters. Let’s look at this year’s winners.
Read more…

Open Source convention considers situational awareness in cars, and more

A report from OSCon

Every conference draws people looking to make contacts, but the Open Source convention also inspires them with content. I had one friend withdraw from an important business meeting (sending an associate instead) in order to attend a tutorial.

Lots of sessions and tutorials had to turn away attendees. This was largely fallout from the awkward distribution of seats in the Oregon Convention Center: there are just half a dozen ballroom-sized spaces, forcing the remaining sessions into smaller rooms more appropriate for casual meetings of a few dozen people. When the conference organizers measure the popularity of the sessions, I suggest that any session at or near capacity have its attendance counted as infinity.

More than 3,900 people registered for OSCon 2013, and a large contingent kept attending sessions all the way through Friday.

Read more…

Open Source In The Classroom

OSCON 2013 Speaker Series

To teach computer programming to a young person with no experience, you must imagine what it’s like to know nothing about languages, algorithms, data structures, design patterns, object orientation, development tools, etc. Yet the kids I’ve seen in high school over the past several years are so immersed in a world of computing, they have interesting partial understandings of how things work and usually far deeper knowledge of what’s being done with the technologies at a consumer level than their teacher. The contemporary adolescent, a child of the late 1980s and early 1990s, has lived a life that parallels the advent of the web, with no experience of a world before such powerful new verbs as email, Google, or blog. On the other hand, most have never seen the HTML of a web page or the IP address of a networked computer, and none of them can guess what a compiler does. As a high school computer programming teacher, I am usually the first to help them bridge that enormous and growing gap, and it doesn’t help that the first thing they ask is “How do I write an app for my phone?”

So I need to remember what it was like for me to learn programming for the first time given that, as a child of the early 1970s, my own life parallels the advent of the home computer. I go backwards in time, way before my modern toolkit mainstays like Java, XML, or Linux (or their amalgam in Android). My programming days came before my Webmaster days, before my analyst days, before I employed scripting languages like Perl, JavaScript, Tcl, or even the Unix shells. I was programming before college, where I used older languages like C, Lisp, or Pascal. When I fondly reminisce about banging out BASIC programs on my beloved Commodore-64, I think surely I’ve arrived at the beginning, but no. When I really do recall the very first time I wrote an original computer program, that was even longer ago: it was 1984. I remember my seventh grade math class, in a Massachusetts public school, being escorted into a lab crammed with no more than a dozen new Apple IIe computers. There we wrote programs in a language called Logo, although everybody just called it “turtle”.

Logo, first created in Cambridge, MA, in 1967, was designed from the outset as a programming language for teaching about programming. The open source KDE education project includes an application, KTurtle, for writing programs in Turtlescript, a modern adaptation of Logo. I have found KTurtle to be an extremely effective introduction to computer programming in my own classroom. The complete and well-written Turtlescript language reference can be printed in a small packet of about eight double-sided pages, which I can distribute to each pupil. The first thing you see is a simple IDE, with a code editor that does multi-color highlighting, an inspector for tracing variables and functions, and a canvas for displaying the output of your program. I can demonstrate commands in seconds, gratifying students with immediate results as the onscreen turtle draws lines on the canvas. I can slow the speed of the turtle or step through a program to show what’s going on at each command of a larger program. I can save my final output as an image file, which has limited but definite value to a young person. The kids get it, and with about 15 minutes of explanation, a few examples, and a handful of commands, they are ready and eager to try themselves.

A modern paradigm of teaching secondary mathematics is to give students a sequence of tasks that builds a progressive understanding of concepts by putting them in situations where they struggle just enough to realize the need for new insights. If I ask the students to use KTurtle to draw a square, having only provided them with the basic commands for moving the turtle, they can quickly come up with something like this:

reset
forward 100
turnright 90
forward 100
turnright 90
forward 100
turnright 90
forward 100

If the next tasks involve a greater number of adjacent squares, this approach becomes tiresome. Now I have the opportunity to introduce looping control structures (along with the programming virtue of laziness) to show them a better way:

reset
repeat 4 {
  forward 100
  turnright 90
}

[Figure: output of the first square-drawing tasks]

As I give them a few more tasks with squares, some students may continue to try to draw the pictures in one giant block of commands, creating complicated Eulerian trails to trace each figure, but the obvious benefits, and their peers, eventually convince them to make the switch. When they graduate to cutting and pasting their square loop repeatedly, I can introduce subroutines:
Read more…

Why Choose a Graph Database

Collaborative filtering with Neo4j

By this time, chances are you’ve heard of NoSQL, and of graph databases like Neo4j.

NoSQL databases address important challenges that we face today, in terms of data size and data complexity. They offer a valuable solution by providing particular data models to address these dimensions.

On one side of the spectrum, these databases address scaling out and high data volumes by using compound aggregate values; on the other side is a relationship-based data model that allows us to model real-world information with high fidelity and complexity.

Neo4j, like many other graph databases, builds upon the property graph model: labeled nodes (for informational entities) are connected via directed, typed relationships. Both nodes and relationships hold arbitrary properties (key-value pairs). There is no rigid schema, but with node labels and relationship types we can attach as much meta-information as we like. When importing data into a graph database, the relationships are treated with as much value as the records themselves. This allows the engine to navigate connections between nodes in constant time, which compares favorably to the exponential slowdown of many-join SQL queries in a relational database.

[Figure: the property graph model, with labeled nodes connected by directed, typed relationships]

How can you use a graph database?

Graph databases are well suited to model rich domains. Both object models and ER-diagrams are already graphs and provide a hint at the whiteboard-friendliness of the data model and the low-friction mapping of objects into graphs.

Instead of de-normalizing for performance, you would normalize interesting attributes into their own nodes, making it much easier to move, filter, and aggregate along those lines. Content and asset management, job finding, and recommendations based on weighted relationships to relevant attribute nodes are some use cases that fit this model very well.

Many people use graph databases because of their high-performance online query capabilities. They process large volumes of raw data with MapReduce in Hadoop or event processing (Storm, Esper, etc.) and project the computation results into a graph. We’ve seen examples of this in many domains, from finance (fraud detection in money-flow graphs) and biotech (protein analysis on genome-sequencing data) to telco (mobile-network optimization based on signal-strength measurements).

Graph databases shine when you can express your queries as a local search using a few starting points (e.g., people, products, places, orders). From there, you can follow relevant relationships to accumulate interesting information, or project visited nodes and relationships into a suitable result.
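
To make that access pattern concrete, here is a minimal, library-free Python sketch of such a local search. It is not Neo4j or its query language; the tiny dataset and the BOUGHT relationship type are invented purely for illustration. It models labeled, property-carrying nodes and directed, typed relationships as plain data, then produces a simple collaborative-filtering recommendation as a two-hop traversal from a single starting node.

from collections import Counter

# Property graph as plain Python data: labeled nodes with properties,
# plus directed, typed relationships (all names here are hypothetical).
nodes = {
    "alice":  {"label": "Person",  "name": "Alice"},
    "bob":    {"label": "Person",  "name": "Bob"},
    "book":   {"label": "Product", "title": "Graph Databases"},
    "coffee": {"label": "Product", "title": "Espresso Beans"},
}
rels = [
    ("alice", "BOUGHT", "book"),
    ("bob",   "BOUGHT", "book"),
    ("bob",   "BOUGHT", "coffee"),
]

def recommend(person):
    # Collaborative filtering as a local, two-hop traversal:
    # person -> products they bought -> other buyers -> what else those buyers bought.
    bought = {dst for src, rel, dst in rels if src == person and rel == "BOUGHT"}
    similar = {src for src, rel, dst in rels
               if rel == "BOUGHT" and dst in bought and src != person}
    suggestions = Counter(dst for src, rel, dst in rels
                          if rel == "BOUGHT" and src in similar and dst not in bought)
    return [nodes[p]["title"] for p, _ in suggestions.most_common()]

print(recommend("alice"))   # ['Espresso Beans']

In a real graph database the same idea is expressed as a query that starts at one node and walks its relationships directly, rather than scanning lists, which is what keeps local searches cheap.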

Read more…

A Hands-on Introduction to R

OSCON 2013 Speaker Series

R is an open-source statistical computing environment, similar to SAS and SPSS, that allows for the analysis of data using techniques like subsetting, manipulation, visualization, and modeling. There are versions that run on Windows, Mac OS X, Linux, and other Unix-compatible operating systems.

To follow along with the examples below, download and install R from your local CRAN mirror found at r-project.org. You’ll also want to place the example CSV into your Documents folder (Windows) or home directory (Mac/Linux).

After installation, open the R application. The R Console will pop-up automatically. This is where R code is processed. To begin writing code, open an editor window (File -> New Script on Windows or File -> New Document on a Mac) and type the following code into your editor:

1+1

Place your cursor anywhere on the “1+1” line of code, then hit Control-R (on Windows) or Command-Return (on a Mac). You’ll notice that your “1+1” code is automatically executed in the R Console. This is the easiest way to run code in R. You can also run R code by typing it directly into the R Console, but using the editor is much easier.

If you want to clear your R Console, click anywhere inside it and hit Control-L (on Windows) or Command-Option-L (on a Mac).

Now let’s create a vector, the simplest data structure in R. A vector is similar to a column of data inside a spreadsheet. We use the combine function, c(), to do so:

raysVector <- c(2, 5, 1, 9, 4)

To view the contents of raysVector, first run the line of code above to create the vector. Then double-click raysVector in the editor and run the highlighted selection. You will now see the contents of raysVector in your R Console.

The object we just created is now stored in memory and we can see this by running the following code:

ls()

R is an interpreted language with support for procedural and object-oriented programming. Here we use the mean function to calculate the arithmetic mean of raysVector:

mean(raysVector)

Getting help on the mean function is easy using:

?mean

We can create a simple plot of raysVector using:

barplot(raysVector, col = "red")

Importing CSV files is simple too:

data <- read.csv("raysData3.csv", na.strings = "")

We can subset the CSV data in many different ways. Here are two different methods that do the same thing:

data[ 1:2, 2:4 ]
data[ 1:2, c("age", "weight", "height") ]

There are many ways to transform your data in R. Here’s a method that doubles everyone’s age:

dataT <- transform( data, age = age * 2 )

The apply function allows us to apply a standard or custom function without writing loops. Here we apply the mean function column-wise to the first three rows of the age and height columns, ignoring missing values during the calculation:

apply( data[1:3, c("age", "height")], 2, mean, na.rm = T )

Here we build a linear regression model that predicts a person’s weight based on their age and height:

raysModel <- lm(weight ~ age + height, data = data)

We can plot our residuals like this:

plot(raysModel$residuals, pch = 15, col = "red")

We can install the Predictive Model Markup Language (PMML) package to quickly deploy our predictive model in a Business Intelligence system without custom SQL:
Read more…

Get Hadoop, Hive, and HBase Up and Running in Less Than 15 Minutes

OSCON 2013 Speaker Series

If you have delved into Apache Hadoop and related projects, you know that installing and configuring Hadoop is hard. Often, a minor mistake made while installing or configuring from messy tarballs will lurk for a long time, until some otherwise innocuous change to the system or workload causes difficulties. Moreover, there is little to no integration testing among the different projects (e.g. Hadoop, Hive, HBase, ZooKeeper, etc.) in the ecosystem. Apache Bigtop is an open source project aimed at bridging exactly those gaps by:

1. Making it easier for users to deploy and configure Hadoop and related projects on their bare metal or virtualized clusters.

2. Performing integration testing among various components in the Hadoop ecosystem.

More about Apache Bigtop

The primary goal of Apache Bigtop is to build a community around the packaging and interoperability testing of Hadoop related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

The latest released version of Apache Bigtop is Bigtop 0.5, which integrates the latest versions of various projects including Hadoop, Hive, HBase, Flume, Sqoop, Oozie, and many more! The supported platforms include CentOS/RHEL 5 and 6, Fedora 16 and 17, SuSE Linux Enterprise 11, OpenSuSE 12.2, Ubuntu LTS Lucid and Precise, and Ubuntu Quantal.

Who uses Bigtop?

Folks who use Bigtop fall into two major categories: those who leverage Bigtop to power their own Hadoop distributions, and those who use Bigtop for deployment.

In alphabetical order, they are:
Read more…

Augmenting Unstructured Data

OSCON 2013 Speaker Series

Our world is filled with unstructured data. By some estimates, it’s as high as 80% of all data.

Unstructured data is data that isn’t in a specific format: it isn’t separated by a delimiter that you could split on to get the individual pieces of data. Most often, this data comes directly from humans, and human-generated data isn’t the best kind to place in a relational database and run queries on. You need to run algorithms on unstructured data to gain any insight.

Play-by-Play

Advanced NFL Stats was kind enough to open up their play-by-play data for NFL games from 2002 to 2012. This dataset consisted of both structured and unstructured data. Here is a sample line showing a single play in the dataset:

20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7 ,2012

This line is comma separated and has various queryable elements. For example, we could query on the teams playing and the scores. We couldn’t query on the unstructured portion, specifically:

(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC

There are many more insights that can be gleaned from this portion. For example, we can see that San Francisco’s quarterback, Colin Kaepernick, passed the ball to the wide receiver, Michael Crabtree, who was tackled by Charles Tillman. Crabtree caught the ball at the 25-yard line and gained 1 yard. After catching the ball, he made no further forward progress, or yards after catch (“0-yds YAC”).

Writing a synopsis like that is easy for me as a human; I’ve spent my life working with language and trying to comprehend its meaning. It’s not as easy for a computer. We have to use various methods to extract the information from unstructured data like this, and those methods vary dramatically in complexity: some use simple string lookups or substring matching, while others use natural language processing.

Parsing would be easy if all of the plays looked like the one above, but they don’t. Each play is formatted a little differently and varies in the amount of data it contains. The general format differs for each type of play (run, pass, punt, etc.), so each type needs to be treated a little differently. And each person writing a play description phrases things a little differently from the others.
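
As a rough illustration of the simpler end of that spectrum, here is a small Python sketch that pulls a few fields out of the pass play shown earlier with a regular expression. The pattern and field names are my own invention, it only handles this one narrow shape of pass description, and a real parser would need many such patterns (or natural language processing) to cover the whole dataset.

import re

# The unstructured portion of the sample play shown above.
play = ("(2:48) C.Kaepernick pass short right to M.Crabtree "
        "to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC")

# A deliberately narrow pattern for simple completed passes; field names are invented.
pass_pattern = re.compile(
    r"\((?P<clock>\d+:\d+)\) "
    r"(?P<passer>[A-Z]\.\w+) pass (?P<depth>short|deep) (?P<direction>left|middle|right) "
    r"to (?P<receiver>[A-Z]\.\w+) to (?P<spot>[A-Z]+ \d+) for (?P<yards>-?\d+) yards?"
)

match = pass_pattern.search(play)
if match:
    print(match.groupdict())
    # {'clock': '2:48', 'passer': 'C.Kaepernick', 'depth': 'short',
    #  'direction': 'right', 'receiver': 'M.Crabtree', 'spot': 'SF 25', 'yards': '1'}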

Augmenting Data

As I showed in the synopsis of the play, one can get some interesting insight out of the structured and unstructured data. However, we’re still limited in what we can do and the kinds of queries we can write.

I had a conversation with a Canadian about American football and the effects of the weather. I was talking about how the NFL now favors domed locations for the Super Bowl, its championship game, to take weather out of the equation. With the play-by-play dataset, I couldn’t write a query to tell me the effects of weather one way or another; the data simply doesn’t exist in the dataset.

The physical location and date of the game are captured in the play-by-play. We can see that in the first portion, “20121119_CHI@SF”: Chicago is playing at San Francisco on 2012-11-19. There are other datasets that could enable us to augment our play-by-play by giving us the physical location of the stadium where the game was played (a team’s name doesn’t mean that its stadium is in that city). The dataset for stadiums is a relatively small one, and it also shows whether the stadium is domed, making the ambient temperature a non-issue.
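
As a sketch of what that augmentation might look like, the Python snippet below splits the game key out of the sample line and joins it against a tiny, hand-built stadium lookup. The stadium table and field names here are hypothetical; in practice you would load a real stadium dataset.

# Hypothetical stadium lookup keyed by home-team abbreviation.
STADIUMS = {
    "SF": {"city": "San Francisco", "dome": False},
    "NO": {"city": "New Orleans",   "dome": True},
}

def augment(play_line):
    # The game key is the first comma-separated field, e.g. "20121119_CHI@SF".
    game_key = play_line.split(",", 1)[0]
    date, matchup = game_key.split("_")
    away, home = matchup.split("@")
    stadium = STADIUMS.get(home, {})
    return {"date": date, "away": away, "home": home,
            "city": stadium.get("city"), "dome": stadium.get("dome")}

row = "20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,..."
print(augment(row))
# {'date': '20121119', 'away': 'CHI', 'home': 'SF', 'city': 'San Francisco', 'dome': False}
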
Read more…

Scaling People, Process, and Technology with Python

OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.

We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.

In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn’t have the team to do any of the required data modeling.

People

So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. In order to give us the largest talent pool possible to recruit from, we had to choose a tech stack that is both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.

Python is easy to learn. We found that people coding in R, Matlab, Java, or PHP, and even those who had never programmed before, could quickly learn Python and get up to speed. This opened us up to hiring from a tremendous pool of talent whom we could train in Python once they joined AppNexus. To top it off, there’s a great community of Python engineers to hire from, and the PyData community is full of programmers who specialize in modeling and automation.

Additionally, Python has great libraries for data modeling. It offers great analytical tools for analysts and quants: combined, Pandas, IPython, and Matplotlib give you much of the functionality of Matlab or R. This made it easy to hire and onboard quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analysis through the browser with IPython.
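
As a small, hypothetical example of that workflow (the file and column names here are invented, not AppNexus code), an analyst might explore a day of auction data like this:

import pandas as pd
import matplotlib.pyplot as plt

# Load a hypothetical export of auction results and chart clicks per day.
df = pd.read_csv("auction_results.csv")
daily_clicks = df.groupby("date")["clicks"].sum()
daily_clicks.plot(kind="bar", title="Clicks per day")
plt.show()

Run inside IPython, the same few lines can render the chart inline, which is how analysis ends up being shared through the browser.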

Process

Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.

First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.

Imagine you have a MySQL database. You don’t want to hardcode all of your connection information because you want to have a different config for different users, or for different environments. Link allows you to define your “environment” in a JSON config file, and then reference it in code as if it is a Python object.

 { "dbs":{
  "my_db": {
   "wrapper": "MysqlDB",
   "host": "mysql-master.123fakestreet.net",
   "password": "",
   "user": "",
   "database": ""
  }
 }}

Now, with only three lines of code, you have a database connection and a data frame straight from your MySQL database. This same methodology works for Vertica, Netezza, Postgres, SQLite, etc. New “wrappers” can be added to accommodate new technologies, allowing team members to focus on modeling the data, not on how to connect to all these weird data sources.

In [1]: from link import lnk
 
In [2]: my_db = lnk.dbs.my_db
 
In [3]: df = my_db.select('select * from my_table').as_dataframe()
 

Int64Index: 325 entries, 0 to 324
Data columns:
id    325 non-null values
user_id   323 non-null values
app_id   325 non-null values
name    325 non-null values
body    325 non-null values
created   324 non-null values

By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.

Second, we wanted to minimize the amount of work it took to take an algorithm from research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers are using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.
Read more…

Six Ways to Make Your Peer Code Reviews More Effective

OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Emma Jane’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

The critique is a design school staple. You can find more than a few blog posts from design students about how “the crit” either made them as a designer or broke their will to live. Swap “critique” for “code review” and you’re likely to find just as many angst-filled blog posts in the open source community. Or, more likely, you’ll notice a lack of contributors following a particularly harsh text-based exchange.

Somewhere along the way we dropped one of the original meanings of the word “critique”. According to Wikipedia, a critique is “a method of disciplined, systematic analysis of a written or oral discourse”.

A good critique is more than just a feedback sandwich that wraps “negative” feedback between two slices of “positive” feedback. An effective critique needs three components in place to be successful:

  1. An agreed-upon framework for the evaluation.
  2. Reviewer objectivity.
  3. A creator uncoupled from his or her work.

The Quantitative Review

Quantitative reviews are the easiest to conduct. And, generally, coding standards for a project are quantitative. A quantitative review answers questions such as: Is this code formatted according to coding standards? Does the code cause any performance regressions? Quantitative evaluations can be easily measured: the code being reviewed is either compliant or it isn’t. The findings of quantitative reviews are generally hard to argue with. (Ever tried to pick a fight with Jenkins?) Having a quantitative framework in place often means that your project will also be able to free up reviewer time by putting automated tests in place, which leaves time for a more difficult type of review: the qualitative review.

The Qualitative Review

A framework for a qualitative review is incredibly difficult to create because it addresses the values of a project, which leaves space for subjective thought. There is room for opinion: two people can be right at the same time, even though they don’t agree at all. Qualitative frameworks are so difficult to create that most projects don’t even try; however, an effective qualitative framework helps both the coder and the reviewer improve their skills. It teaches non-experts what attributes they should be looking for in the code they are reviewing. Questions posed during a qualitative review may include: Does this code implement a pattern we’ve already seen somewhere in the project? Has the code been sufficiently abstracted to make it reusable in other situations?

Read more…