"Python" entries

Four short links: 30 August 2013

Flexible Layouts, Web Components, Distributed SQL Database, and Reverse-Engineering Dropbox Client

  1. intention.js — manipulates the DOM via HTML attributes. The methods for manipulation are placed with the elements themselves, so flexible layouts don’t seem so abstract and messy.
  2. Introducing Brick: Minimal-markup Web Components for Faster App Development (Mozilla) — a cross-browser library that provides new custom HTML tags to abstract away common user interface patterns into easy-to-use, flexible, and semantic Web Components. Built on Mozilla’s x-tags library, Brick allows you to plug simple HTML tags into your markup to implement widgets like sliders or datepickers, speeding up development by saving you from having to initially think about the under-the-hood HTML/CSS/JavaScript.
  3. F1: A Distributed SQL Database That Scales — a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
  4. Looking Inside The (Drop)Box (PDF) — This paper presents new and generic techniques, to reverse engineer frozen Python applications, which are not limited to just the Dropbox world. We describe a method to bypass Dropbox’s two factor authentication and hijack Dropbox accounts. Additionally, generic techniques to intercept SSL data using code injection techniques and monkey patching are presented. (via Tech Republic)

Data analysis tools target non-experts

Tools simplify the application of advanced analytics and the interpretation of results

A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while others make it easier to combine code, visuals, and text in the same workflow. They enable users who aren’t statisticians or data geeks to do data analysis. While most of the focus is on enabling the application of analytics to data sets, some tools also help users with the often tricky task of interpreting results. In the process, users are able to discern patterns and evaluate the value of data sources by themselves, and only call upon expert data analysts when faced with non-routine problems.

Visual Analysis and Simple Statistics
Three SaaS startups – DataHero, DataCracker, Statwing – make it easy to perform simple data wrangling, visual analysis, and statistical analysis. All three (particularly DataCracker) appeal to users who analyze consumer surveys. Statwing and DataHero simplify the creation of Pivot Tables and suggest charts that work well with your data. Statwing users are also able to execute and view the results of a few standard statistical tests in plain English (detailed statistical outputs are also available).

Statistics and Machine Learning
BigML and Datameer’s Smart Analytics are examples of recent tools that make it easy for business users to apply machine-learning algorithms to data sets (massive data sets, in the case of Datameer). It makes sense to offload routine data analysis tasks to business analysts, and I expect other vendors such as Platfora and ClearStory to provide similar capabilities in the near future.

Read more…

So, You Want to Run a Young Coders Class?

Teaching Future Coders

Ever since PyCon 2013, interest in the Young Coders class has been intensifying. Organizers of practically every Python conference since then have asked about running one, and several have run their own. Classes outside of conferences have sprung up as well, from one-time workshops to after-school clubs.

As more classes happen, more people have been asking about running their own. These classes do take quite a bit of effort to set up, but the payoff is enormous, and once you’ve run one, each subsequent class gets easier.
Read more…

In Praise of the Lone Contributor

The O'Reilly Open Source Awards 2013

Over the years, OSCON has become a big conference. With over 3,900 attendees registered this year, it was hard not to look at the packed hallways and sessions and think what a huge crowd it was. The number of big-name companies participating – Microsoft, Google, Dell, and even General Motors – reinforces the popular refrain that open source has come a long way; it’s all mainstream now.

Which is as it should be. And it’s been a long haul. But thinking of open source in terms of numbers and size puts us in danger of forgetting the very thing that makes open source special, and that’s the individual contributor. So while open source software has indeed found a place in almost every organization that exists, it was made possible by the hard work of real people who saw the need for it, most of them volunteering in their spare time.

The O’Reilly Open Source Awards were created to recognize and thank these individuals. It’s a community-driven effort: nominations come in from the open source community (this year there were around 50) and then are judged by the previous year’s winners. It’s not intended to be political or a popularity contest, but honest appreciation for hard work that matters. Let’s look at this year’s winners.
Read more…

Zero Downtime Application Updates with Ansible

OSCON 2013 Speaker Series

Automating the configuration management of your operating systems and the rollout of your applications is one of the most important things an administrator or developer can do to avoid surprises when updating services, scaling up, or recovering from failures. However, it’s often not enough. Some of the most common operations in your datacenter (or cloud environment) involve large numbers of machines working together, with humans mediating the process. While we have been able to remove a lot of human effort from configuration, software able to handle these higher-level operations has been lacking.

I used to work for a hosted web application company where the IT process for executing an application update involved locking six people in a room for sometimes 3-4 hours, each person pressing the right buttons at the right time. This process almost always hit a glitch somewhere: someone forgot to run the right command, or something wasn’t well tested beforehand. While some technical solutions were applied to handle configuration automation, nothing that could perform configuration could also accomplish that high-level choreography on top. This is why I wrote Ansible.

Ansible is a configuration management, application deployment, and IT orchestration system. One of Ansible’s strong points is its very simple, human-readable language, which gives users very fine, precise control over what happens on which machines at what times.

Getting started

To get started, create an inventory file, for instance ~/ansible_hosts, that defines the machines you are managing; machines are frequently organized into groups. Ansible can also pull inventory from multiple cloud sources, but an inventory file is a quick way to get started:

[webservers]
www01.example.com
www02.example.com
# add more webservers here

[monitoring]
nagios1.example.com

[lbservers]
haproxy1.example.com
haproxy2.example.com
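
With the inventory in place, you can sanity-check connectivity with an ad-hoc command (a quick aside of mine, not from the article; ping is a core Ansible module):

ansible webservers -i ~/ansible_hosts -m ping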

Now that you have defined what machines you are managing, you have to define what you are going to do on the remote machines.

Ansible calls this description of processes a “playbook,” and you don’t have to have just one; you can keep different playbooks for different kinds of tasks.
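A minimal playbook might look like this (a sketch of mine in 2013-era Ansible syntax; the nginx package is illustrative, and the webservers group matches the inventory above):

---
- hosts: webservers
  user: root
  tasks:
    - name: ensure nginx is installed
      yum: name=nginx state=installed
    - name: ensure nginx is running
      service: name=nginx state=started

You would run it with ansible-playbook -i ~/ansible_hosts playbook.yml.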

Let’s look at an example describing a rolling update process. This example is somewhat involved because it uses haproxy, but haproxy is freely available. Ansible also includes modules for dealing with NetScaler and F5 load balancers, so this is just an example — ordinarily you would start more simply and work up to something like this:
Read more…

Scaling People, Process, and Technology with Python

OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.

We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.

In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn’t have the team to do any of the required data modeling.

People

So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. To give ourselves the largest talent pool possible to recruit from, we had to choose a tech stack that was both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.

Python is easy to learn. We found that people coding in R, Matlab, Java, PHP, and even those who had never programmed before could quickly get up to speed with Python. This opened us up to a tremendous pool of talent whom we could train in Python once they joined AppNexus. To top it off, there’s a great community to hire engineers from, and the PyData community is full of programmers who specialize in modeling and automation.

Additionally, Python has great libraries for data modeling. It offers great analytical tools for analysts and quants: combined, Pandas, IPython, and Matplotlib give you much of the functionality of Matlab or R. This made it easy to hire and onboard quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analysis through the browser with IPython.
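As a tiny illustration of that combination (hypothetical numbers, not AppNexus code or data):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-site campaign stats
df = pd.DataFrame({'impressions': [1200.0, 3400.0, 5600.0, 4100.0],
                   'clicks': [30.0, 120.0, 250.0, 140.0]})
df['ctr'] = df['clicks'] / df['impressions']   # click-through rate
df['ctr'].plot(kind='bar')
plt.show()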

Process

Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.

First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.

Imagine you have a MySQL database. You don’t want to hardcode all of your connection information, because you want different configs for different users or different environments. Link allows you to define your “environment” in a JSON config file and then reference it in code as if it were a Python object.

 { "dbs":{
  "my_db": {
   "wrapper": "MysqlDB",
   "host": "mysql-master.123fakestreet.net",
   "password": "",
   "user": "",
   "database": ""
  }
 }}

Now, with only three lines of code, you have a database connection and a data frame straight from your MySQL database. The same methodology works for Vertica, Netezza, Postgres, SQLite, etc. New “wrappers” can be added to accommodate new technologies, allowing team members to focus on modeling the data, not on how to connect to all these weird data sources.

In [1]: from link import lnk
 
In [2]: my_db = lnk.dbs.my_db
 
In [3]: df = my_db.select('select * from my_table').as_dataframe()
 

Int64Index: 325 entries, 0 to 324
Data columns:
id    325 non-null values
user_id   323 non-null values
app_id   325 non-null values
name    325 non-null values
body    325 non-null values
created   324 non-null values

By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.

Second, we wanted to minimize the amount of work it took to take an algorithm from research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers are using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.
Read more…

Python’s New-Style Inheritance Algorithm

This article takes a brief look at the inheritance search mechanism in the Python programming language. Like some other aspects of Python today, this mechanism varies by release line: inheritance has grown much more convoluted in 3.X, though 2.X users still have a choice in the matter. To truly understand the current state of affairs, then, we need to begin our story in simpler times.

Classic Inheritance

Once upon a time (well, in 2.X’s default and still widely used classic classes), Python attribute inheritance—the object.name lookup at the heart of object-oriented code—was fairly simple. It essentially boiled down to this:

Attribute name references search the instance, its class, and the class’s superclasses depth-first and left-to-right, and use the first occurrence found along the way. Attribute assignments normally store in the target object itself.

And that’s it. The reference search may be kicked off from either an instance or a class, and there are special cases for __getattr__ (run if the lookup failed to find a name) and __setattr__ (run for all attribute assignments), but the procedure is by and large straightforward.
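
To see the rule in action, here’s a tiny diamond of classic classes (my example, not the article’s; run it under Python 2, where classes are classic by default):

class A:
    attr = 'A'

class B(A):
    pass

class C(A):
    attr = 'C'

class D(B, C):
    pass

print D().attr    # prints 'A': depth-first search reaches A via B before trying C

Under the new-style rules described next, the linearized MRO (D, B, C, A) would find 'C' instead.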

New-Style Inheritance

In new-style classes—an option in 2.X and mandated in 3.X—inheritance is richer but substantially more complex, potentially requiring knowledge of advanced topics to accurately resolve an attribute name’s meaning, including descriptors, metaclasses, and the linearized class-tree paths known as MROs. We won’t delve into those prerequisite topics here, but the following is a cursory overview of the algorithm used, taken from the newly released Learning Python, 5th Edition, where you’ll find new and more complete coverage.

To look up an attribute name:

  1. From an instance I, search the instance, its class, and its superclasses, as follows:
    a. Search the __dict__ of all classes on the __mro__ found at I’s __class__
    b. If a data descriptor was found in step a, call its __get__ and exit
    c. Else, return a value in the __dict__ of the instance I
    d. Else, call a nondata descriptor or return a value found in step a
  2. From a class C, search the class, its superclasses, and its metaclasses tree, as follows:
    a. Search the __dict__ of all metaclasses on the __mro__ found at C’s __class__
    b. If a data descriptor was found in step a, call its __get__ and exit
    c. Else, call a descriptor or return a value in the __dict__ of a class on C’s own __mro__
    d. Else, call a nondata descriptor or return a value found in step a
  3. In both rule 1 and rule 2, built-in operations essentially use just step a sources for their implicit name lookup (described further in the book).

Name sources in this procedure are attempted in order, either as numbered or per their left-to-right order in “or” conjunctions. On top of all this, method __getattr__ may be run if defined when an attribute is not found; method __getattribute__ may be run for every attribute fetch; and the implied object superclass provides some defaults at the top of every class and metaclass tree (that is, at the end of every MRO).
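
For instance, rule 1’s ordering means a data descriptor found on a class beats a value stored in the instance’s own __dict__. A small demonstration of that precedence (my example, not the book’s):

class DataDesc(object):
    def __get__(self, instance, owner):
        return 'from descriptor'
    def __set__(self, instance, value):
        raise AttributeError('read-only')   # defining __set__ makes this a data descriptor

class C(object):
    attr = DataDesc()

i = C()
i.__dict__['attr'] = 'from instance'        # sneak past __set__
print(i.attr)                               # 'from descriptor': step b wins over step c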

Read more…

Printing Plastic Tchotchkes Was Fun, but MakerBot Was Just Too High-Maintenance

Breaking up with MakerBot

I’ll never forget the day I first met MakerBot. It was August 1, 2012 when he*—a bright, shiny first-generation Replicator—arrived at our Cambridge, MA, office, greeted by screams of delight by a throng of fans. I must admit, I was a bit intimidated and star-struck: MakerBot’s reputation preceded him. He was a rockstar in the DIY community, a true maverick of a machine, ushering in the “Wild West of 3D printing” among our sedate sea of MacBook Air laptops running Adobe InDesign. All we had ever made here before were PDF files, but with MakerBot humming cheerfully in the lounge next to the kitchen, that had all changed. We were now maker-magicians, spinning ABS thread into gold.

At first, it was hard to get any quality time with MakerBot. I’d come into the office in the morning, and he’d already be surrounded by three or four groupies, who were browsing the catalog at Thingiverse, selecting a fresh set of STL models to print: from Mario and Batman to Mayan Robot.

The T-Rex (far left) and Barack Obama figurine (bottom-right) were made with glow-in-the-dark ABS thread (hence “Glowbama”).

But MakerBot didn’t just allow me and my coworkers to print out other people’s models; he offered us the promise of designing our own plastic masterpieces. He came packaged with the open source software ReplicatorG, which provides a nice GUI for doing simple modifications on existing models (scaling, rotating, etc.). ReplicatorG isn’t a tool for constructing models from scratch, however, so I also started experimenting with other 3D rendering applications like Blender, MeshLab, and OpenSCAD.

I was interested in the possibility of transforming 2D photos into 3D models that MakerBot could print, so I started experimenting with a Python tool called img2scad, which can convert a JPEG image into a .scad file (convertible to an STL file with OpenSCAD) by transforming each pixel in the image into a rectangular prism whose height is directly proportional to how dark or light the pixel is. When this SCAD model is printed, the output is a photograph embossed into a sheet of plastic. Pretty cool—although, in practice, the results were somewhat lackluster, since much of the detail captured in the subtle shading differences among pixels in the source JPEG didn’t get preserved in the conversion to prisms.
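
The core idea is simple enough to sketch (my approximation using Pillow, not img2scad’s actual code; photo.jpg is a placeholder filename):

from PIL import Image

img = Image.open('photo.jpg').convert('L')     # load as grayscale
width, height = img.size
with open('photo.scad', 'w') as scad:
    for y in range(height):
        for x in range(width):
            darkness = (255 - img.getpixel((x, y))) / 255.0
            h = 0.5 + 2.0 * darkness           # base plate plus up to 2 units of relief
            scad.write('translate([%d, %d, 0]) cube([1, 1, %.2f]);\n'
                       % (x, height - y, h))

Rendering photo.scad in OpenSCAD and exporting to STL yields the embossed-photo effect described above.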
Read more…

Easily invoke common protocols with Twisted

Spin up Python-friendly services with 0 lines of code

Twisted is a framework for writing, testing, and deploying event-driven clients and servers in Python. In my previous Twisted blog post, we explored an architectural overview of Twisted and examples of simple TCP, UDP, SSL, and HTTP echo servers.

While Twisted makes it easy to build servers in just a few lines of Python, you can actually use Twisted to spin up servers with 0 lines of code!

We can accomplish this with twistd (pronounced twist-dee), a command line utility that ships with Twisted for deploying your Twisted applications. In addition to providing a standardized deployment interface for common production features like daemonization, logging, and authentication, twistd can use Twisted’s plugin architecture to run flexible servers for a variety of protocols. Here are some examples:

twistd web --port 8000 --path .

Run an HTTP server on port 8000, serving both static and dynamic content out of the current working directory. Visit http://localhost:8000 to see the directory listing:
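
Other plugins follow the same pattern; for instance, this should start an anonymous FTP server rooted in the current directory (my example, based on the twistd ftp plugin’s --port and --root options of that era; treat it as a sketch rather than the article’s):

twistd ftp --port 2121 --root .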

Read more…