Introducing Tapkee, an efficient command-line tool and C++ library for linear and nonlinear dimensionality reduction.
Editor’s Note: This post is a slightly adapted excerpt from Jeroen Janssens’ recent book, “Data Science at the Command Line.” To follow along with the code, and learn more about the various tools, you can install the Data Science Toolbox, a free virtual machine that runs on Microsoft Windows, Mac OS X, and Linux, and has all the command-line tools pre-installed.
The goal of dimensionality reduction is to map high-dimensional data points onto a lower-dimensional space. The challenge is to keep similar data points close together in the lower-dimensional mapping. As we’ll see in the next section, our data set contains 13 features. We’ll reduce these to two dimensions because that’s straightforward to visualize.
Dimensionality reduction is often regarded as part of the exploring step. It’s useful when there are too many features to plot. You could draw a scatter plot matrix, but that only shows you two features at a time. It’s also useful as a preprocessing step for other machine-learning algorithms. Most dimensionality reduction algorithms are unsupervised, which means that they don’t use the labels of the data points to construct the lower-dimensional mapping.
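To make the idea of an unsupervised mapping concrete, here is a minimal Python sketch of one classical linear technique, principal component analysis (PCA), that projects 13-feature points onto two dimensions. It is only an illustration under stated assumptions: the data is synthetic, the function names (`center`, `covariance`, `top_component`, `pca`) are hypothetical, and the eigenvectors are approximated with simple power iteration. It is not how Tapkee is implemented or invoked.

```python
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=500):
    """Approximate the dominant eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    vec = [1.0 / d ** 0.5] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    cov_vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
    return vec, eigenvalue


def pca(rows, n_components=2):
    """Project rows onto their first n_components principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        vec, val = top_component(cov)
        components.append(vec)
        # Deflation: remove the direction just found so that the next
        # power iteration converges to the following component.
        cov = [[cov[i][j] - val * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]


# Synthetic stand-in data (an assumption, not the data set from the post):
# 100 points with 13 features, two of which vary much more than the rest.
random.seed(0)
scales = [10.0, 5.0] + [1.0] * 11
data = [[random.gauss(0, s) for s in scales] for _ in range(100)]

# Note that no labels are passed in: the mapping is built from the features alone.
embedding = pca(data, n_components=2)
print(len(embedding), "points,", len(embedding[0]), "dimensions each")
```

Each row of `embedding` is a pair of coordinates that could be drawn directly in a scatter plot. PCA is linear, so it can only rotate and project the data; nonlinear techniques take a different approach to keeping similar points close together.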
In this post, we’ll use Tapkee, a new command-line tool for dimensionality reduction. More specifically, we’ll demonstrate two techniques: PCA, which stands for Principal Components Analysis (Pearson, 1901), and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008). Coincidentally, t-SNE was discussed in detail in a recent O’Reilly blog post. But first, let’s obtain, scrub, and explore the data set we’ll be using.