Data Week: Becoming a data scientist

Data Pointed, CouchDB in the Cloud, Launching Strata

Note: Data Week is a new series that collects notable stories and developments from the data world.

Data visualization and art

Artist and scientist Stephen von Worley announced the launch of a new blog, Data Pointed, showcasing his data visualization research. “One part magazine and two parts blog,” the site tells the story of von Worley’s own data visualizations and surveys choice picks from others.

The lead story covers the results of the XKCD color name survey, illustrating the entertaining differences in the way different genders refer to colors.

Across the top, witness the nuanced verbal repertoire of feminine color differentiation. While us men are busy grunting, guzzling beer, and shoving our hands down our pants, women get specific by mixing fruits, animals, spices, flowers, and other such familiarities with finely-honed modifiers like neon and dusty. The result? A vast panoply of warm-fuzzy color names that seemingly trounces anything our Y-chromosomes have to offer.

The visualization shows colors organized horizontally by hue, and vertically by gender preference. Immediately obvious is the contrast in nuance between female, at the top, and male. Click the graphic below to reach the full article and interactive visualization.

His and Hers Colors

Becoming a data scientist

The community Q&A site Quora is rich with information about data science, analytics and computing. One especially illuminating answer this week addressed the question “How do I become a data scientist?” Specifically, how does someone with a computer science background get the math and statistics knowledge required for data science?

Providing an extensive reply, Alex Kamil gives eight points from his perspective as an undergraduate student. Many of these reference statistics and math, and Kamil provides an excellent list of papers, websites and technologies to tinker with.

Several of Kamil’s suggested starting points struck me as common themes among those who define themselves as data scientists:

  • Start learning statistics by coding with R: whatever the size of the data you’re working with, many data analysts perform and prototype investigations using the R language. Some will later translate these into larger MapReduce jobs to be run on Hadoop, for instance. R provides a hands-on way for developers to teach themselves statistics in practice.
  • Linear algebra: a grounding in linear algebra is common among many data scientists, and important because matrix math underpins many data mining applications, such as the famous PageRank.
  • Machine learning: allowing computers to alter behavior based on input data is fundamental to many innovative data-based products and services. Many developers start out ad hoc, but a great deal of literature is available. Kamil references Bradford Cross’ extensive list of machine learning resources.
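To make the linear algebra point concrete, here is a minimal Python sketch of the power-iteration idea behind PageRank. It is an illustration only, not taken from Kamil’s answer or from Google’s implementation: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page `toy_web` graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by repeated multiplication (power iteration).

    `links` maps each page to the list of pages it links to.
    Every linked-to page must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small "teleport" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's damped rank among its outgoing links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for this example.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each pass through the loop is equivalent to multiplying the rank vector by the link matrix, which is why a grounding in matrix math pays off when reading about algorithms like this one.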

There are many more starting points referenced in the full answer.

The field of data science is a place where book learning meets code and produces results. In the words of Kurt Lewin: “There’s nothing so practical as a good theory.”

CouchDB clusters and particles

Cloudant, providers of hosted CouchDB infrastructure, have released their clustering technology, BigCouch, as open source. CouchDB is a document-oriented “NoSQL” database, noted for its replication features. Use it as part of your application and you can count on database replication “for free.”

As part of offering cloud-based CouchDB services, Cloudant has developed software to create clusters of CouchDB instances, distributed among many servers. In Cloudant’s words, “Instead of one big honking CouchDB, the result is an elastic data store which is fully CouchDB API-compliant.”

The most direct comparison to existing technologies is with Amazon’s Dynamo, according to the announcement:

The clustering layer [features] consistent hashing, replication, and quorum for read/write operations. CouchDB view indexing occurs in parallel on each partition, and can achieve impressive speedups as compared to standalone serial indexing.
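For readers unfamiliar with the Dynamo-style terms in that quote, the following Python sketch shows one common way consistent hashing and read/write quorums fit together. It is not BigCouch’s code or API: the `HashRing` class, the `preference_list` and `quorums_overlap` names, the use of SHA-256, the virtual-node count, and the node names are all hypothetical choices made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """A toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Place several points per node on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, doc_id, n):
        """Return the first n distinct nodes clockwise from doc_id."""
        n = min(n, self._node_count)
        index = bisect.bisect(self._keys, self._hash(doc_id))
        chosen = []
        while len(chosen) < n:
            node = self._ring[index % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            index += 1
        return chosen


def quorums_overlap(n, r, w):
    """True when every read quorum must share a node with every write quorum."""
    return r + w > n


# Hypothetical four-node cluster storing three copies of each document.
ring = HashRing(["node1", "node2", "node3", "node4"])
replicas = ring.preference_list("doc-42", 3)
print(replicas)
print(quorums_overlap(n=3, r=2, w=2))
```

The appeal of this layout is that adding a server moves only the documents that now hash to the new node, rather than reshuffling the whole data set. Choosing read and write quorum sizes so that `r + w > n` means a read always contacts at least one replica that saw the latest successful write.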

You can download BigCouch from its GitHub project page, and collaborate on the #cloudant IRC channel on Freenode.

Elsewhere this week, ReadWriteEnterprise reports that scientists working at CERN are using CouchDB to support their work on the Compact Muon Solenoid Experiment (CMS) on the famous Large Hadron Collider.

Chief among the attractions of CouchDB to the scientists are the ability to manage petabytes of data, the built-in replication features, and easy compatibility with Oracle. Simon Metson of the Data Management and Workflow Management group at the CMS project reports that CouchDB has a shallow learning curve, but is harder for those with deep SQL backgrounds: “The more you know Oracle, the harder it is to pick up.”

Strata: The Business of Data

If you enjoyed any of the previous items, stay tuned: we’re excited to announce the launch of Strata, an O’Reilly conference focusing on the business and practice of data. The conference will be held in Santa Clara, Calif. from Feb. 1-3, 2011.

At O’Reilly, we believe that the future belongs to those who understand how to collect and use their data successfully. A change in both the skills of data analysts and the technology they use is sweeping through industry and science. Our aim with Strata is to be the defining event for that change: for practitioners, businesses and data vendors.

The call for participation is open until Sept. 28. We’re looking for proposals from practitioners, business leaders, analysts, designers, and developers covering the spectrum of data business and practice. Suggested topics include:

  • Distributed data processing, Hadoop ecosystem
  • From research to product
  • Streaming data processing
  • Becoming a data-driven organization
  • Data science best practices
  • Data acquisition, cleaning, distribution and markets
  • Machine learning
  • Training and recruitment of data scientists
  • Applications, case studies, and cautionary tales
  • Visualization and design principles
  • Augmented reality and immersive interfaces
  • Data protection, privacy, and policy
  • Changing role of business intelligence

Send us news

Email us news, tips and interesting tidbits at dataweek@oreilly.com.
