Amen Break, MySQL Scale, Spooky Source, and Graph Analytics Engine

- The Amen Break (YouTube) — a fascinating 20m history of the Amen break, a few bars of drum solo from a forgotten 1969 song that became the foundation of a huge amount of popular music, from rap to jungle to commercials, and of the contested material at the heart of sample-based music. Remix it and weep. (via Beta Knowledge)
- The MySQL Ecosystem at Scale (PDF) — nice summary of how MySQL is used by sites with massive user bases, and where the sweet spots have been found.
- Lab41 (Github) — open sourced code from a spook hacklab in Silicon Valley.
- Faunus — open source Hadoop-based graph analytics engine for analyzing graphs spread across a multi-machine compute cluster. A breadth-first version of the graph traversal language Gremlin operates on graphs stored in the distributed graph database Titan, in any Rexster-fronted graph database, or in HDFS via various text and binary formats.
Add columns to a table on the fly without altering its schema.
MariaDB and similar SQL database systems offer a variety of data types for storing data in table columns. When creating or altering a table’s schema, it’s good to know what kind of data you expect each column to hold. If you know that a column will contain numbers, use a numeric data type like INT, not a string type like VARCHAR. It’s best to use the appropriate data type for a column: you’ll generally have better control of the data and possibly better performance.
But sometimes you can’t predict what type of data might be entered into a column. For such a situation, you might use a VARCHAR column 255 characters wide, or maybe TEXT if plenty of data might be entered. There is a very cool and fairly new alternative, though: you can create a table in which you add columns on the fly, without altering the table’s schema. That may sound absurd, but it’s possible in MariaDB with dynamic columns.
Dynamic columns are basically columns within a column. If you know programming well, they’re like a hash within an array. That may sound confusing, but it will make more sense when you see it in action. To illustrate this, I’ll pull some ideas from my new book, Learning MySQL and MariaDB (O’Reilly 2015). All of the examples in my book and this article are based on a database for bird-watchers.
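To see dynamic columns in action, here is a minimal sketch; the table, names, and data are my own illustration, not examples from the book. Extra details about each bird sighting are packed into a single BLOB column with COLUMN_CREATE() and read back with COLUMN_GET():

CREATE TABLE bird_sightings (
  sighting_id INT AUTO_INCREMENT PRIMARY KEY,
  bird_name VARCHAR(100),
  details BLOB
);

-- Store arbitrary name/value pairs inside the details column
INSERT INTO bird_sightings (bird_name, details)
VALUES ('Red-tailed Hawk',
        COLUMN_CREATE('location', 'riverbank', 'head_count', 2));

-- Read a dynamic column back, casting it to the type you expect
SELECT bird_name,
       COLUMN_GET(details, 'location' AS CHAR(50)) AS location,
       COLUMN_GET(details, 'head_count' AS INTEGER) AS head_count
FROM bird_sightings;

Notice that location and head_count never appear in the table’s schema; each row can carry its own set of name/value pairs inside details.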
Powering your app with open source and OpenShift
As a software developer, you are no doubt familiar with the process of abstracting away unnecessary detail in code — imagine if that same principle were applied to application hosting. Say hello to Platform as a Service (PaaS), which enables you to host your applications in the cloud without having to worry about the logistics, leaving you to focus on your code. This post will discuss five ways in which PaaS benefits software developers, using the open source OpenShift PaaS by Red Hat as an example.
No More Tedious Config Tasks
Most of us don’t become developers to do system administration, but when you are running your own infrastructure you end up doing exactly that. A PaaS can take that pain away by handling pesky config and important security updates for you. As a bonus, it makes your sys admin happy too by allowing you to provision your own environment for that killer new app idea you want to tinker with, rather than nagging them for root access on a new VM.
On OpenShift, it goes like this: let’s say you decide you want to test an idea for a Java app, using Tomcat and PostgreSQL (yes, we could argue about the merits of those choices, but work with me here). You can spin that up with a one-line terminal command:
rhc app create myawesomeapp tomcat-7 postgresql-9.2 -s
That -s on the end is telling the platform to make the app auto-scaling, which I will elaborate on later; yes, that’s all it takes. RHC (Red Hat Cloud) is just a Ruby Gem wrapping calls to the OpenShift REST API. You could also use the OpenShift web console or an IDE plugin to do this, or call the API directly if that’s how you roll. The key technologies in play here are just plain old Git and SSH — there’s nothing proprietary.
OSCON 2013 Speaker Series
NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.
Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.
We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.
In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn't have the team to do any of the required data modeling.
So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. In order to give us the largest talent pool possible to recruit from, we had to choose a tech stack that is both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.
Python is easy to learn. We found that people coding in R, Matlab, Java, or PHP, and even those who had never programmed before, could quickly get up to speed with Python. This opened us up to a tremendous pool of talent whom we could train in Python once they joined AppNexus. To top it off, there’s a great community to hire engineers from, and the PyData community is full of programmers who specialize in modeling and automation.
Additionally, Python has great libraries for data modeling and offers strong analytical tools for analysts and quants: combined, Pandas, IPython, and Matplotlib give you much of the functionality of Matlab or R. This made it easy to hire and onboard quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analyses through the browser with IPython.
Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.
First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.
Imagine you have a MySQL database. You don’t want to hardcode all of your connection information because you want to have a different config for different users, or for different environments. Link allows you to define your “environment” in a JSON config file, and then reference it in code as if it is a Python object.
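A Link config might look something like the sketch below; the wrapper and field names here are my best guess at the format, so check the Link documentation for the exact keys:

{
    "dbs": {
        "my_db": {
            "wrapper": "MysqlDB",
            "host": "localhost",
            "user": "analyst",
            "password": "secret",
            "database": "production"
        }
    }
}

Each environment (development, staging, production) gets its own version of this file, and the code that references lnk.dbs.my_db never changes.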
Now, with only three lines of code, you have a database connection and a data frame straight from your MySQL database. The same methodology works for Vertica, Netezza, Postgres, SQLite, etc. New “wrappers” can be added to accommodate new technologies, letting team members focus on modeling the data rather than on how to connect to all these weird data sources.
In [1]: from link import lnk
In [2]: my_db = lnk.dbs.my_db
In [3]: df = my_db.select('select * from my_table').as_dataframe()

Int64Index: 325 entries, 0 to 324
id         325 non-null values
user_id    323 non-null values
app_id     325 non-null values
name       325 non-null values
body       325 non-null values
created    324 non-null values
By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.
Second, we wanted to minimize the amount of work it took to take an algorithm from research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers are using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.
OSCON 2013 Speaker Series
Henrik Ingo (@h_ingo) is a MySQL architect, author, and OSCON 2013 Speaker. In this interview we talk about high availability in MySQL and why he switched to MongoDB.
NOTE: If you are interested in attending OSCON to check out Henrik’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.
Key highlights include:
- Why is high availability so important now? [Discussed at 2:25]
- Percona and MariaDB are not much different from plain old MySQL. [Discussed at 6:24]
- Moving from MySQL to MongoDB [Discussed at 8:40]
You can view the full interview here.
Velocity 2013 Speaker Series
Be honest, have you ever wanted to play Steve Souders for a day and pull some revealing stats or trends about some web sites of your choice? Or maybe dig around the HTTP archive? You can do that and more by setting up your own HTTP Archive.
httparchive.org is a fantastic tool to track, monitor, and review how the web is built. You can dig into trends around page size, page load time, content delivery network (CDN) usage, distribution of different mimetypes, and many other stats. With the integration of WebPagetest, it’s a great tool for synthetic testing as well.
You can download an HTTP Archive MySQL dump (warning: it’s quite large) and the source code from the download page and dissect a snapshot of the data yourself. Once you’ve set up the database, you can easily query anything you want.
You need MySQL, PHP, and your own webserver running. As I mentioned above, HTTP Archive relies on WebPagetest—if you choose to run your own private instance of WebPagetest, you won’t have to request an API key. I decided to ask Patrick Meenan for an API key with limited query access. That was sufficient for me at the time. If I ever wanted to use more than 200 page loads per day, I would probably want to set up a private instance of WebPagetest.
To find more details on how to set up an HTTP Archive instance yourself and any further advice, please check out my blog post.
Going back to the scenario I described above: the real motivation is that often you don’t want to throw your website(s) into a pile of other websites (e.g. ones not related to your business) to compare or define trends. Our digital properties at the Canadian Broadcasting Corporation (CBC) span dozens of URLs that serve different purposes and audiences. For example, CBC Radio covers most of the Canadian radio landscape, CBC News offers the latest breaking news, CBC Hockey Night in Canada offers great insights on anything related to hockey, and CBC Video is the home for any video available on CBC. It’s valuable for us not only to compare cbc.ca to the top 100K Alexa sites, but also to verify stats and data against our own pool of websites.
In this case, we want to use a set of predefined URLs that we can collect HTTP Archive stats for. Hence a private instance can come in handy—we can run tests every day, or every week, or just every month to gather information about the performance of the sites we’ve selected. From there, it’s easy to not only compare trends from httparchive.org to our own instance as a performance baseline, but also have a great amount of data in our local database to run queries against and to do proper performance monitoring and investigation.
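For example, pulling the distribution of mime types across our selected sites is a single query away. This is a sketch; the requests table and mimeType column follow the schema that ships with the HTTP Archive dump, so verify the names against your snapshot:

-- Top mime types by number of responses in the crawl
SELECT mimeType, COUNT(*) AS responses
FROM requests
GROUP BY mimeType
ORDER BY responses DESC
LIMIT 10;

Results like these are the raw material for the mime-type visualization described below.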
The beautiful thing about having your own instance is that you can be your own master of data visualization: you can now create more charts in addition to the ones that came out of the box with the default HTTP Archive setup. And if you don’t like Google chart tools, you may even want to check out D3.js or Highcharts instead.
The image below shows all mime types used by CBC web properties that are captured in our HTTP archive database, using D3.js bubble charts for visualization.
Laine Campbell on why AWS is a good platform option for running MySQL at scale
In the following interview, PalominoDB owner and CEO Laine Campbell discusses advantages and disadvantages of using Amazon Web Services (AWS) as a platform for running MySQL. The solution provides a functional environment for young startups that can’t afford a database administrator (DBA), Campbell says, but there are drawbacks to be aware of, such as a lack of access to your database’s file system, and troubleshooting “can get quite hairy.” This interview is a sneak preview of Campbell’s upcoming Velocity session, “Using Amazon Web Services for MySQL at Scale.”
Why is AWS a good platform for scaling MySQL?
Laine Campbell: The elasticity of Amazon’s cloud service is key to scaling on most tiers in an application’s infrastructure, and this is true with MySQL as well. Concurrency is a recurring constraint on MySQL’s ability to scale, and as traffic and concurrent queries grow, one has to introduce some fairly traditional scaling patterns. One such pattern is adding replicas to distribute read I/O and reduce contention, which is easy to do with rapid deployment of new instances and Elastic Block Storage (EBS) snapshots.
Additionally, sharding can be done with less impact by using EBS snapshots to recreate the dataset on each new shard and then removing the data that is not part of that shard. Amazon’s relational database service for MySQL—RDS—is also a new, rather compelling scaling pattern for the early stages of a company’s life, when resources are scarce and administrators have not been hired. RDS is a great pattern for people to emulate in terms of rapid deployment of replicas, ease of master failovers, and the ability to easily redeploy hosts when errors occur, rather than spending extensive time trying to repair or clean up data.
Processing for Illustrator, Archiving Tools, Sweet Retro Art, and More Database Tools
- Drawscript — Processing for Illustrator. (via BERG London)
- Archive Team Warrior — a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive. (via Ed Vielmetti)
- Retro Vectors — royalty-free and free of charge.
- TokuDB Goes Open Source — a high-performance, transactional storage engine for MySQL and MariaDB. See the announcement.
Drone Overload, Mac MySQL Tool, Better Cancer Diagnosis Through AI, and Inconstant Identifiers
- Drones Over Somalia are Hazard to Air Traffic (Washington Post) — In a recently completed report, U.N. officials describe several narrowly averted disasters in which drones crashed into a refugee camp, flew dangerously close to a fuel dump and almost collided with a large passenger plane over Mogadishu, the capital. (via Jason Leopold)
- Sequel Pro — free and open source Mac app for managing MySQL databases. It’s an update of CocoaMySQL.
- Neural Network Improves Accuracy of Least Invasive Breast Cancer Test — nice use of technology to make lives better, for which the creator won the Google Science Fair. Oh yeah, she’s 17. (via Miss Representation)
- Free Harder to Find on Amazon — so much for ASINs being permanent and unchangeable. Amazon “updated” the ASINs for a bunch of Project Gutenberg books, which means they’ve lost all the reviews, purchase history, incoming links, and other juice that would have put them at the top of searches for those titles. Incompetence, malice, greed, or a purely innocent mistake? (via Glyn Moody)