MySQL in 2012: Report from Percona Live

Checking in on the state of MySQL.

The big annual MySQL conference, started by MySQL AB in 2003 and run
by my company O’Reilly for several years, lives on under the able
management of Percona. This fast-growing company started out doing
consulting on MySQL, particularly in the area of performance, and
branched out into development and many other activities. The
principals of this company wrote the most recent two editions of the
popular O’Reilly book High Performance MySQL.

Percona started offering conferences a couple of years ago and stepped
in when O’Reilly decided not to run the annual MySQL conference any
more. Oracle did not participate in Percona Live, but has announced
its own MySQL conference for next September.

Percona Live struck me as a success, with about one thousand attendees
and the participation of leading experts from all over the MySQL
world, save for Oracle itself. The big players in the MySQL user
community came out in force: Facebook, HP, Google, Pinterest (the
current darling of the financial crowd), and so on.

The conference followed the pattern laid down by earlier ones in just
about every way, with the same venue (the Santa Clara Convention
Center, which is near a light-rail stop but nothing else of interest), the
same food (scrumptious), the standard format of one day of tutorials
and two days of sessions (although with an extra developer day tacked
on, which I will describe later), an expo hall (smaller than before,
but with key participants in the ecosystem), and even community awards
(O’Reilly Media won an award as Corporate Contributor of the Year).
Monty Widenius was back as always with a MariaDB entourage, so it
seemed like old times. The keynotes seemed less well attended than the
ones from previous conferences, but the crowd was persistent and
showed up in impressive numbers for the final events–and I don’t
believe it was because everybody thought they might win one of the
door prizes.

Jeremy Zawodny ready to hand out awards.

Two contrasting database deployments

I checked out two well-attended talks by system architects from two
high-traffic sites: Pinterest and craigslist. The radically divergent
paths they took illustrate the range of options open to data centers
nowadays–and the importance of studying these options so a data
center can choose the path appropriate to its mission and
applications.

Jeremy Zawodny (co-author of the first edition of High Performance
MySQL) presented the design of craigslist’s site, which illustrates
the model of software accretion over time and an eager embrace of
heterogeneity. Among their components are:

  • Memcache, lying between the web servers and the MySQL database in
    classic fashion.

  • MySQL to serve live postings, handle abuse, provide data for the
    monitoring system, and meet other immediate needs.

  • MongoDB to store almost 3 billion items related to archived (no longer
    live) postings.

  • HAproxy to direct requests to the proper MySQL server in a cluster.

  • Sphinx for text searches, with indexes over all live postings,
    archived postings, and forums.

  • Redis for temporary items such as counters and blobs.

  • An XFS filesystem for images.

  • Other helper functions that Zawodny lumped together as “async
    services.”

Care and feeding of this menagerie becomes a job in itself.
Although craigslist hires enough developers to assign them to
different areas of expertise, they have also built an object layer
that understands MySQL, the cache, Sphinx, and MongoDB. The original purpose
of this layer was to aid in migrating old data from MySQL to MongoDB
(a procedure Zawodny admitted was painful and time-consuming) but it
was extended into a useful framework that most developers can use
every day.

Zawodny praised MySQL’s durability and its method of replication. But
he admitted that they used MySQL partly because it was present when
they started and they were familiar with it. Still, the newer entrants
into the data store arena were by no means adopted haphazardly or just
to try out cool new tools. Each one precisely meets particular needs
of the site.

For instance, besides being fast and offering built-in sharding,
MongoDB was appealing because they don’t have to run ALTER TABLE every
time they add a new field to the database. Old entries coexist happily
with newer ones that have different fields. Zawodny also likes using a
Perl client to interact with a database, and the Perl client provided
by MongoDB is unusually robust because it was developed by 10gen
directly, in contrast to many other datastores, whose Perl support was
contributed by some random volunteer.
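
Zawodny works with MongoDB through its Perl client, but the schema
flexibility he described is easy to illustrate with a short Python
sketch using the pymongo driver; the collection and field names here
are hypothetical, not craigslist’s:

    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017/")
    archive = client["cl_archive"]["postings"]

    # An older archived posting with the original set of fields.
    archive.insert_one({"posting_id": 1, "title": "futon", "body": "..."})

    # A newer posting carries an extra field; no ALTER TABLE is needed,
    # and the two documents coexist in the same collection.
    archive.insert_one({"posting_id": 2, "title": "bike", "body": "...",
                        "neighborhood": "Mission"})

    print(archive.find_one({"posting_id": 2})["neighborhood"])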

The architecture at craigslist was shrewdly chosen to match their
needs. For instance, because most visitors click on the limited set
of current listings, the Memcache layer handles the vast majority of
hits and the MySQL database has a relatively light load.

However, the MySQL deployment is also carefully designed. Clusters are
vertically partitioned in two nested ways. First, different types of
items are stored on separate partitions. Then, within each type, the
nodes are further divided by the type of query:

  • A single master to handle all writes.

  • A group for very fast reads (such as lookups on a primary key)

  • A group for “long reads” taking a few seconds

  • A special node called a “thrash handler” for rare, very complex
    queries

It’s up to the application to indicate what kind of query it is
issuing, and HAproxy interprets this information to direct the query
to the proper set of nodes.
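
Here is a minimal sketch, in Python with the pymysql driver, of how an
application might tag each query with a class and reach the right
group of nodes, assuming (as one common setup) that each class
corresponds to a different HAproxy frontend port. The ports,
hostnames, and credentials are illustrative assumptions, not
craigslist’s actual configuration:

    import pymysql

    # Hypothetical HAproxy frontends, one per query class.
    QUERY_CLASS_PORTS = {
        "write": 3306,      # the single master
        "fast_read": 3307,  # primary-key lookups
        "long_read": 3308,  # reads taking a few seconds
        "thrash": 3309,     # rare, very complex queries
    }

    def run_query(sql, params=(), query_class="fast_read"):
        """Send the query to the HAproxy frontend for its class."""
        port = QUERY_CLASS_PORTS[query_class]
        conn = pymysql.connect(host="haproxy.internal", port=port,
                               user="app", password="secret", database="cl")
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        finally:
            conn.close()

    # A primary-key lookup goes to the fast-read pool:
    # run_query("SELECT * FROM postings WHERE id = %s", (12345,), "fast_read")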

Naturally, redundancy is built in at every stage (three HAproxy
instances used in round robin, two Memcache instances holding the same
data, two data centers for the MongoDB archive, etc.).

It’s also interesting which recent developments have been eschewed by
craigslist. They self-host everything and use no virtualization.
Zawodny admits this leads to an inefficient use of hardware, but it
avoids the overhead associated with virtualization. For efficiency,
they have switched to SSDs, allowing them to scale down from 20
servers to only 3. They don’t use a CDN, finding that with aggressive
caching and good capacity planning they can handle the load
themselves. They send backups and logs to a SAN.

Let’s turn now from the teeming environment of craigslist to the
decidedly lean operation of Pinterest, a much younger and smaller
organization. As presented by Marty Weiner and Yashh Nelapati, when
Pinterest hit web-scale growth in the autumn of 2011, they reacted
somewhat like craigslist, but with much less thinking ahead, throwing
in all sorts of software such as Cassandra and MongoDB and
diversifying a bit recklessly.
Finally they came to their senses and went on a design diet. Their
resolution was to focus on MySQL–but the way they made it work is
unique to their data and application.

They decided against using a cluster, afraid that bad application code
could crash everything. Sharding is much simpler and doesn’t require
much maintenance. Their advice for implementing MySQL sharding
included:

  • Make sure you have a stable schema, and don’t add features for a
    couple of months.

  • Remove all joins and complex queries for a while.

  • Do simple shards first, such as moving a huge table into its own
    database.

They use Pyres, a
Python clone of Resque, to move data into shards.
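
As a rough illustration of that kind of background work, here is a
sketch of a batched copy job of the sort a Pyres worker might perform
when moving a huge table into its own database. The table, columns,
hosts, and batch size are hypothetical; this is not Pinterest’s code:

    import pymysql

    BATCH = 1000

    def copy_batch(src, dst, last_id):
        """Copy one batch of rows with id > last_id; return the new high id."""
        with src.cursor() as cur:
            cur.execute("SELECT id, user_id, data FROM pins "
                        "WHERE id > %s ORDER BY id LIMIT %s", (last_id, BATCH))
            rows = cur.fetchall()
        if not rows:
            return None
        with dst.cursor() as cur:
            cur.executemany("INSERT INTO pins (id, user_id, data) "
                            "VALUES (%s, %s, %s)", rows)
        dst.commit()
        return rows[-1][0]

    def migrate():
        src = pymysql.connect(host="db-main", user="app",
                              password="secret", database="main")
        dst = pymysql.connect(host="db-shard-pins", user="app",
                              password="secret", database="pins_shard")
        last_id = 0
        while last_id is not None:
            last_id = copy_batch(src, dst, last_id)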

However, sharding imposes severe constraints that led them to
hand-crafted work-arounds.

Many sites want to leave open the possibility for moving data between
shards. This is useful, for instance, if they shard along some
dimension such as age or country, and they suddenly experience a rush
of new people in their 60s or from China. The implementation of such a
plan requires a good deal of coding, described in the O’Reilly book
MySQL High Availability, including the creation of a service that just
accepts IDs and determines what shard currently contains the ID.

The Pinterest staff decided the ID service would introduce a single
point of failure, and decided just to hard-code a shard ID in every ID
assigned to a row. This means they never move data between shards,
although shards can be moved bodily to new nodes. I think this works
for Pinterest because they shard on arbitrary IDs and don’t have a
need to rebalance shards.
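
The idea of baking the shard into the ID itself can be sketched in a
few lines of Python. The particular bit layout below (16 bits of
shard, 48 bits of local ID) is an assumption for illustration, not
Pinterest’s actual format:

    SHARD_BITS = 48  # low bits hold the local ID; high bits hold the shard

    def make_id(shard_id, local_id):
        """Pack the shard ID into the high bits of a 64-bit row ID."""
        return (shard_id << SHARD_BITS) | local_id

    def shard_for(row_id):
        """Recover the shard directly from the ID; no central service needed."""
        return row_id >> SHARD_BITS

    def local_part(row_id):
        return row_id & ((1 << SHARD_BITS) - 1)

    pin_id = make_id(shard_id=42, local_id=1001)
    assert shard_for(pin_id) == 42 and local_part(pin_id) == 1001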

Even more interesting is how they avoid joins. Suppose they want to
retrieve all pins associated with a certain board associated with a
certain user. In classical, normalized relational database practice,
they’d have to do a join on the board, pin, and user tables. But
Pinterest maintains extra mapping tables. One table maps users to
boards, while another maps boards to pins. They query the
user-to-board table to get the right board, query the board-to-pin
table to get the right pin, and then do simple queries without joins
on the tables with the real data. In a way, they implement a custom
NoSQL model on top of a relational database.
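
A sketch of what such joinless lookups might look like in application
code, using DB-API-style cursors (as in pymysql); the table and column
names are hypothetical:

    def pins_for_user(conn, user_id):
        """Fetch a user's pins with simple single-table queries, no joins."""
        with conn.cursor() as cur:
            # 1. user -> boards, from one mapping table.
            cur.execute("SELECT board_id FROM user_boards WHERE user_id = %s",
                        (user_id,))
            board_ids = [r[0] for r in cur.fetchall()]

            # 2. board -> pins, from another mapping table.
            pin_ids = []
            for board_id in board_ids:
                cur.execute("SELECT pin_id FROM board_pins WHERE board_id = %s",
                            (board_id,))
                pin_ids.extend(r[0] for r in cur.fetchall())

            # 3. Fetch the actual pin rows by primary key.
            pins = []
            for pin_id in pin_ids:
                cur.execute("SELECT id, url, caption FROM pins WHERE id = %s",
                            (pin_id,))
                pins.append(cur.fetchone())
        return pins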

Pinterest does use Memcache and Redis in addition to MySQL. As with
craigslist, they find that most queries can be handled by Memcache.
And the actual images are stored in S3, an interesting choice for a
site that is already enormous.

It seems to me that the data and application design behind Pinterest
would have made it a good candidate for a non-ACID datastore. They
chose to stick with MySQL, but like organizations that use NoSQL
solutions, they relinquished key aspects of the relational way of
doing things. They made calculated trade-offs that worked for their
particular needs.

My take-away from these two fascinating and well-attended talks was
that you must understand your application, its scaling and performance
needs, and its data structure, in order to know what you can sacrifice
and which solution gives you your sweet spot. craigslist solved its
problem through the very precise application of different tools, each
with particular jobs that fulfilled craigslist’s requirements.
Pinterest made its own calculations and found an entirely different
solution, depending on some clever hand-coding instead of
off-the-shelf tools.

Current and future MySQL

The conference keynotes surveyed the state of MySQL and some
predictions about where it will go.

Conference co-chair Sarah Novotny at the keynotes.

The world of MySQL is much more complicated than it was a couple years
ago, before Percona got heavily into the work of releasing patches to
InnoDB, before they created entirely new pieces of software, and
before Monty started MariaDB with the express goal of making a better
MySQL than MySQL. You can now choose among Oracle’s official MySQL
releases, Percona’s supported version, and MariaDB’s supported
version. Because these are all open source, a major user such as
Facebook can even apply patches to get the newest features.

Nor are these different versions true forks, because Percona and
MariaDB create their enhancements as patches that they pass back to
Oracle, and Oracle is happy to include many of them in a later
release. I haven’t even touched on the commercial ecosystem around
MySQL, which I’ll look at later in this article.

In his opening keynote, Percona founder Peter Zaitsev praised the
latest MySQL
release by Oracle. With graceful balance he expressed pleasure that
the features most users need are in the open (community) edition, but
allowed that the proprietary extensions are useful too. In short, he
declared that MySQL is less buggy and has more features than ever.

The former CEO of MySQL AB, Mårten Mickos, also found that MySQL is
doing well under Oracle’s wing. His one criticism was that Oracle
fails to work as well as it should with potential partners (by which I
assume he meant Percona and MariaDB). He lauded Oracle’s community
managers but said the rest of the company should support them more.

Keynote by Mårten Mickos.

Brian Aker presented an OpenStack MySQL service developed by his
current
employer, Hewlett-Packard. His keynote retold the story that had led
over the years to his developing Drizzle (a true fork of MySQL
that tries to return it to its lightweight, Web-friendly roots) and
eventually working on cloud computing for HP. He described modularity,
effective use of multiple cores, and cloud deployment as the future of
databases.

A panel on the second day of the conference brought together
high-level managers from many of the companies that have entered the
MySQL space from a variety of directions for a discussion of the
database engine’s future. Like most panels, the conversation ranged
over a variety of topics–NoSQL, modular architecture, cloud
computing–but hit some depth only on the topic of security, which was
not represented very strongly at the conference and was discussed here
at the insistence of Slavik Markovich from McAfee.

Keynote by Brian Aker.

Many of the conference sessions disappointed me, being either very
high level (although presumably useful to people who are really new to
various topics, such as Hadoop or flash memory) or unvarnished
marketing pitches. I may have judged the latter too harshly, though,
because a decent number of attendees came, stayed to the end, and
crowded around the speakers for information.

Two talks, though, were so fast-paced and loaded with detail that I
couldn’t possibly keep my typing up with the speakers.

One such talk was the keynote by Mark Callaghan of Facebook. (Like the
other keynotes, it should be posted online soon.) A smattering of
points from it:

  • Percona and MariaDB are adding critical features that make replication
    and InnoDB work better.

  • When a logical backup runs, it is responsible for 50% of IOPS.

  • Defragmenting InnoDB improves compression.

  • Resharding is not worthwhile for a large, busy site (an insight also
    discovered by Pinterest, as I reported earlier).

The other fact-filled talk was by Yoshinori Matsunobu of Facebook, and
concerned how to achieve NoSQL-like speeds while sticking with MySQL
and InnoDB. Much of the talk discussed an InnoDB memcached plugin,
which unfortunately is still in the “lab” or “pre-alpha” stage. But he
also suggested some other routes to better performance, some involving
Memcache and others more roundabout:

  • Coding directly with the storage engine API, which is storage-engine
    independent.

  • Using HandlerSocket, which queues write requests and performs them
    through a single thread, avoiding costly fsync() calls. This can
    achieve 30,000 writes per second, robustly (the batching idea is
    sketched below).
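
This is not the HandlerSocket API itself, but the underlying idea
(funneling writes through one thread so that a single commit, and
therefore one fsync(), covers a whole batch) can be sketched
generically in Python; the table, connection details, and batch size
are assumptions:

    import queue
    import threading

    import pymysql  # any DB-API driver would do; pymysql is an assumption

    write_queue = queue.Queue()

    def writer_loop(conn):
        """Single writer thread: drain the queue and commit many writes
        at once, so one fsync() covers the whole batch."""
        while True:
            batch = [write_queue.get()]  # block for the first item
            while not write_queue.empty() and len(batch) < 1000:
                batch.append(write_queue.get_nowait())
            with conn.cursor() as cur:
                for sql, params in batch:
                    cur.execute(sql, params)
            conn.commit()  # one durable commit per batch

    conn = pymysql.connect(host="db", user="app", password="secret",
                           database="test")
    threading.Thread(target=writer_loop, args=(conn,), daemon=True).start()

    # Client threads simply enqueue their writes and return immediately:
    write_queue.put(("INSERT INTO kv (k, v) VALUES (%s, %s)", (1, "x")))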

Matsunobu claimed that many optimizations are available within MySQL
because a lot of data can fit in main memory. For instance, if you
have 50 million users and store 400 bytes per user, the entire user
table can fit in 20 GB. Matsunobu’s tests have shown that most CPU time
in MySQL is spent in functions that are not essential for processing
data, such as opening and closing a table. Each statement opens a
separate connection, which in turn requires opening and closing the
table again. Furthermore, a lot of data is sent over the wire besides
the specific fields requested by the client. The solutions in the talk
evade all this overhead.

The commercial ecosystem

Both as vendors and as sponsors, a number of companies have always
lent another dimension to the MySQL conference. Some of them have
nothing to do with MySQL itself, but offer drop-in replacements for
it; others fill a niche for MySQL users. Here are a few that I
happened to talk to:

  • Clustrix provides a very
    different architecture for relational data. They handle sharding
    automatically, permitting such success stories as the massive scaling
    up of the social media site Massive Media NV without extra
    administrative work. Clustrix also claims to be more efficient by
    breaking queries into fragments (such as the WHERE clauses of joins)
    and executing them on different nodes, passing around only the data
    produced by each clause.

  • Akiban also offers faster
    execution through a radically different organization of data: it
    flattens the tables of a normalized database into a single data
    structure, so that, for instance, a customer and his orders may be
    located sequentially in memory. This seems to me an import of the
    document store model into the relational model. By creating, in
    effect, an object that maps pretty closely to the objects used in
    the application program, Akiban allows common queries to be
    executed very quickly, and it could be deployed as an adjunct to a
    MySQL database.

  • Tokutek produced a drop-in
    replacement for InnoDB. The founders developed a new data structure
    called a fractal tree as a faster alternative to the B-tree structures
    normally used for indexes. The existence of Tokutek vindicates both
    the open source distribution of MySQL and its unique modular design,
    because these allowed Tokutek’s founders to do what they do
    best–create a new storage engine–without needing to create a whole
    database engine with the related tools and interfaces it would
    require.

  • Nimbus Data Systems creates a
    flash-based hardware appliance that can serve as a NAS or SAN to
    support MySQL. They support a large number of standard data transfer
    protocols, such as InfiniBand, and provide such optimizations as
    caching writes in DRAM and making sure they write complete 64KB blocks
    to flash, thus speeding up transfers as well as preserving the life of
    the flash.

Post-conference events

A low-key developer’s day followed Percona Live on Friday. I talked to
people in the Drizzle and
Sphinx tracks.

Because Drizzle is a relatively young project, its talks were aimed
mostly at developers interested in contributing. I heard talks about
its kewpie test framework and about build and release conventions. But
in keeping with its goal of making database use easy and lightweight,
the project has added some cool features.

Thanks to a JSON interface and a built-in web server, Drizzle now
presents you with a
Web interface for entering SQL commands. The Web interface translates
Drizzle’s output to simple HTML tables for display, but you can also
capture the JSON directly, making programmatic access to Drizzle
easier. A developer explained to me that you can also store JSON
directly in Drizzle; it is simply stored as a single text column and
the JSON fields can be queried directly. This reminded me of the
XQuery interfaces added to some databases years ago. There too, the
XML was simply stored as a text field and a new interface was added to
run the XQuery selects.
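
For programmatic access, a request to the JSON interface might look
something like the sketch below. The port (8086) and the /sql endpoint
are assumptions based on my recollection of the json_server plugin’s
defaults; check the plugin documentation before relying on them:

    import json
    import urllib.request

    def drizzle_query(sql, url="http://localhost:8086/sql"):
        """POST raw SQL to Drizzle's assumed JSON endpoint and decode the reply."""
        req = urllib.request.Request(url, data=sql.encode("utf-8"),
                                     method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    # result = drizzle_query("SELECT 1")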

Sphinx, in contrast to Drizzle, is a mature product with commercial
support and (as mentioned earlier in the article) production
deployments at places such as craigslist, as well as an O’Reilly
book. I understood better, after attending today’s sessions, what
makes Sphinx appealing. Its quality is unusually high, due to the use
of sophisticated ranking algorithms from the research literature. The
team is looking at recent research to incorporate even better
algorithms. It is also fast and scales well. Finally, integration with
MySQL is very clean, so it’s easy to issue queries to Sphinx and pick
up results.

Recent enhancements include an add-on called fSphinx
to make faceted searches faster (through caching) and easier, and
access to Bayesian Sets to find “items similar to this one.” In Sphinx
itself, the team is working to add high availability, include a new
morphology (stemming, etc.) engine that handles German, improve
compression, and make other enhancements.

The day ended with a reception and copious glasses of Monty Widenius’s
notorious licorice-flavored vodka, an ending that distinguishes the
MySQL conference from others for all time.



Comments

  • http://www.skysql.com/blogs/skysql Jean-Jerome (JJ)

    Thanks for this summary of the week in Santa Clara!

    Post-conference events: I’m not sure that ‘low-key’ is the word I would use to describe the events on Friday.

    There were 200+ participants throughout the day with three events hosted by SkySQL (with MariaDB), Sphinx & Drizzle. 160+ people participated in the SkySQL & MariaDB Day, which would have been difficult to miss (code words: smiling blue seals everywhere). The lunch & reception were hosted by SkySQL for instance.

    All in all, it was quite a successful day for the three organisers and it was a great ending (as you pointed out) to the conference week overall!

  • http://www.percona.com/ Baron Schwartz

    I’m not sure I agree on some of your points. I think we were well above last year’s event, which was about 800 attendees. From my vantage point on stage, the opening keynote audience was a lot bigger than last year’s. And the expo hall was the same size as last year (last year it was walled off; this year we didn’t wall it off but put the seating next to it) but was full; last year the hall had a lot of empty booths and some open spaces for snacks and things. I believe we had quite a few more exhibitors than last year if you check the websites. Not as big as 2008/2009 for sure, but definitely headed back in the right direction compared to last year’s showing.

    As for the sessions being high-level, I’m not sure which sessions you went to. We made a mistake in not marking the sponsored sessions clearly. But speaking as a committee member on the last several O’Reilly events, I thought our lineup of sessions (other than the sponsored ones) was second to none I’ve seen before, and the sessions I managed to attend were very technical and detailed.

  • http://openlife.cc Henrik Ingo

    Hi Andy

    It was great to see you again. Nice writeup of the week.

    I think when Mårten criticised Oracle’s handling of MySQL partners, he didn’t mean the competing forks of MariaDB and Percona Server, rather companies and startups that used to be MySQL AB and Sun partners. These would be for instance HA partners like Linbit, Continuent or storage engine developers like PBXT, Infobright, ScaleDB, etc…

    Oracle for some reason has had no desire to engage with them but only sells their own solutions. In some respects it is understandable: why promote ScaleDB which is a direct low-cost competitor to Oracle RAC, or Infobright which is a low cost alternative to Oracle in data warehousing.

    I suppose even O’Reilly Media itself is an example of Oracle’s
    behavior towards that partner ecosystem.