"data tools" entries

Dimensionality reduction at the command line

Introducing Tapkee, an efficient command-line tool and C++ library for linear and nonlinear dimensionality reduction.

Editor’s Note: This post is a slightly adapted excerpt from Jeroen Janssens’ recent book, “Data Science at the Command Line.” To follow along with the code, and learn more about the various tools, you can install the Data Science Toolbox, a free virtual machine that runs on Microsoft Windows, Mac OS X, and Linux, and has all the command-line tools pre-installed.

The goal of dimensionality reduction is to map high-dimensional data points onto a lower-dimensional space. The challenge is to keep similar data points close together in the lower-dimensional mapping. As we’ll see in the next section, our data set contains 13 features. We’ll stick with two dimensions because those are straightforward to visualize.

Dimensionality reduction is often regarded as part of the exploration step. It’s useful when there are too many features to plot. You could create a scatter-plot matrix, but that shows only two features at a time. Dimensionality reduction is also useful as a preprocessing step for other machine-learning algorithms. Most dimensionality reduction algorithms are unsupervised, which means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.

In this post, we’ll use Tapkee, a new command-line tool for performing dimensionality reduction. More specifically, we’ll demonstrate two techniques: PCA, which stands for Principal Components Analysis (Pearson, 1901), and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008). Coincidentally, t-SNE was discussed in detail in a recent O’Reilly blog post. But first, let’s obtain, scrub, and explore the data set we’ll be using.
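
To make the two techniques concrete, here’s a minimal sketch in Python using scikit-learn’s PCA and TSNE rather than Tapkee; scikit-learn’s bundled wine data, which also happens to have 13 features, stands in for the post’s data set:

    # Minimal sketch: PCA and t-SNE via scikit-learn (not Tapkee).
    # The bundled wine data stands in for the post's 13-feature data set.
    from sklearn.datasets import load_wine
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X, y = load_wine(return_X_y=True)      # 178 samples, 13 features
    X = StandardScaler().fit_transform(X)  # scale features before reducing

    X_pca = PCA(n_components=2).fit_transform(X)    # linear mapping
    X_tsne = TSNE(n_components=2).fit_transform(X)  # nonlinear mapping

    print(X_pca.shape, X_tsne.shape)       # both (178, 2), ready to plot

Note that neither reducer uses the labels y, which is the unsupervised quality described above, and that the features are scaled first, since both PCA and t-SNE are sensitive to the relative scale of the inputs.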

2013 Data Science Salary Survey

Tools, Trends, What Pays (and What Doesn't) for Data Professionals

There is no shortage of news about the importance of data or the career opportunities within data. Yet a discussion of modern data tools can help us understand what the current data evolution is all about, and it can serve as a guide for those considering stepping into the data space or progressing within it.

In our report, 2013 Data Science Salary Survey, we make our own data-driven contribution to the conversation. We surveyed attendees of the Strata Conference in New York and Santa Clara, California, about tool usage and salary.

Strata attendees span a wide spectrum within the data world: Hadoop experts and business leaders, software developers and analysts. By no means does everyone use data on a “Big” scale, but almost all attendees have some technical aspect to their role. Strata attendees may not represent a random sample of all professionals working with data, but they do represent a broad slice of the population. If there is a bias, it is likely toward the forefront of the data space, with attendees using the newest tools (or being very interested in learning about them).

An update on in-memory data management

In-memory data management brings data close to the computation.

By Ben Lorica and Roger Magoulas

We wanted to give you a brief update on what we’ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have emerged, some expected, others surprising.

Performance improves as you put data as close to the computation as possible. We talked to people in systems, data management, web applications, and scientific computing who have embraced this concept. Some solutions go down to the lowest level of the hardware (L1 and L2 cache). The next generation of SSDs will have latency closer to that of main memory, potentially blurring the distinction between storage and memory. For performance and power-consumption reasons, we can imagine a future where systems are sized primarily by the amount of non-volatile memory* deployed.

Putting data in-memory does not negate the importance of distributed computing environments. Data size and the ability to leverage parallel environments are frequently cited reasons. The same characteristics that make distributed environments compelling also apply to in-memory systems: fault tolerance and parallelism for performance. An additional consideration is the ability to gracefully spill over to disk when main memory is full.
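
As a toy illustration of that last point, here’s a minimal sketch in Python (not drawn from any system we discussed): a key-value store that serves entries from main memory and spills the overflow to disk once a size limit is reached.

    # Toy sketch of graceful spillover: keep hot entries in memory,
    # move the overflow to disk once a size limit is exceeded.
    import os, pickle, tempfile

    class SpilloverStore:
        def __init__(self, max_in_memory=1000):
            self.mem = {}                  # fast tier: main memory
            self.max = max_in_memory
            self.dir = tempfile.mkdtemp()  # slow tier: disk

        def put(self, key, value):
            self.mem[key] = value
            if len(self.mem) > self.max:       # over the limit: spill
                oldest = next(iter(self.mem))  # oldest insertion
                path = os.path.join(self.dir, str(oldest))
                with open(path, "wb") as f:
                    pickle.dump(self.mem.pop(oldest), f)

        def get(self, key):
            if key in self.mem:            # hit in main memory
                return self.mem[key]
            path = os.path.join(self.dir, str(key))
            with open(path, "rb") as f:    # fall back to disk
                return pickle.load(f)

A real system adds eviction policy, fault tolerance, and concurrency, but the shape is the same: serve from memory when you can, fall back to slower storage when you must.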

Six ways data journalism is making sense of the world, around the world

Early responses from our investigation into data-driven journalism had an international flavor.

When I wrote that Radar was investigating data journalism and asked for your favorite examples of good work, we heard back from around the world.

I received emails from Los Angeles, Philadelphia, Canada, and Italy that featured data visualization, explored the role of data in government accountability, and shared how open data can revolutionize environmental reporting. A tweet pointed me to a talk about how R is being used in the newsroom. Another tweet linked to relevant interviews on social science and the media.

Two of the case studies focused on data visualization, an important practice that my colleague Julie Steele and other editors at O’Reilly Media have been exploring over the past several years.

Several other responses are featured at greater length below. After you read through, make sure to also check out this terrific Ignite talk on data journalism recorded at this year’s Newsfoo in Arizona.

Health records support genetics research at Children’s Hospital of Philadelphia

Michael Italia on making use of data collected in health care settings.

Michael Italia from Children's Hospital of Philadelphia discusses the tools and methods his team uses to manage health care data.

Everyone has a big data problem

MetaLayer's Jonathan Gosier on data tools and the data divide.

MetaLayer's Jonathan Gosier talks about the need to democratize data tools because everyone has a big data problem.

Why data visualization matters

The best data visualizations expose something new.

Effective data visualizations go beyond aesthetics; they also allow organizations to make quick and correct decisions from massive amounts of information.

Embracing the chaos of data

Pete Warden on the upside of unstructured data.

Data scientists, it's time to welcome errors and uncertainty into your data projects. In this interview, Jetpac CTO Pete Warden discusses the advantages of unstructured data.

Global Adaptation Index enables better data-driven decisions

The Global Adaptation Index combines development indicators from 161 countries.

Speed, accessibility and open data have come together in the Global Adaptation Index, a new data browser that rates a given country's vulnerability to environmental shifts.