# "data scientists" entries

## Clustering bitcoin accounts using heuristics

### In this O'Reilly Data Show Podcast: Sarah Meiklejohn on analytic applications for blockchain and cryptocurrency technology.

Editor’s note: we’ll explore present and future applications of cryptocurrency and blockchain technologies at our upcoming Radar Summit: Bitcoin & the Blockchain on Jan. 27, 2015, in San Francisco.

A few data scientists are starting to play around with cryptocurrency data, and as bitcoin and related technologies start gaining traction, I expect more to wade in. As the space matures, there will be many interesting applications based on analytics over the transaction data produced by these technologies. The blockchain — the distributed ledger that contains all bitcoin transactions — is publicly available, and the underlying data set is of modest size. Data scientists can work with this data once it’s loaded into familiar data structures, but producing insights requires some domain knowledge and expertise.

Subscribe to the O’Reilly Data Show Podcast

I recently spoke with Sarah Meiklejohn, a lecturer at UCL, and an expert on computer security and cryptocurrencies. She was part of an academic research team that studied pseudo-anonymity (“pseudonymity”) in bitcoin. In particular, they used transaction data to compare “potential” anonymity to the “actual” anonymity achieved by users. A bitcoin user can use many different public keys, but careful research led to a few heuristics that allowed them to cluster addresses belonging to the same user:

“In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. So, if they are a legitimate businessman on the one hand, they can use a certain set of pseudonyms for that activity, and then if they are dealing drugs on Silk Road, they might use a completely different set of pseudonyms for that, and you wouldn’t be able to tell that that’s the same user.

Comment: 1

## Cheap sensors, fast networks, and distributed computing

### The history of computing has been a constant pendulum — that pendulum is now swinging back toward distribution.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. You can download the free report here.

The trifecta of cheap sensors, fast networks, and distributing computing are changing how we work with data. But making sense of all that data takes help, which is arriving in the form of machine learning. Here’s one view of how that might play out.

## Clouds, edges, fog, and the pendulum of distributed computing

The history of computing has been a constant pendulum, swinging between centralization and distribution.

The first computers filled rooms, and operators were physically within them, switching toggles and turning wheels. Then came mainframes, which were centralized, with dumb terminals.

As the cost of computing dropped and the applications became more democratized, user interfaces mattered more. The smarter clients at the edge became the first personal computers; many broke free of the network entirely. The client got the glory; the server merely handled queries.

Once the web arrived, we centralized again. LAMP (Linux, Apache, MySQL, PHP) buried deep inside data centers, with the computer at the other end of the connection relegated to little more than a smart terminal rendering HTML. Load-balancers sprayed traffic across thousands of cheap machines. Eventually, the web turned from static sites to complex software as a service (SaaS) applications.

Then the pendulum swung back to the edge, and the clients got smart again. First with AJAX, Java, and Flash; then in the form of mobile apps, where the smartphone or tablet did most of the hard work and the back end was a communications channel for reporting the results of local action. Read more…

Comment

### Exploring open web crawl data — what if you had your own copy of the entire web, and you could do with it whatever you want?

For the last few millennia, libraries have been the custodians of human knowledge. By collecting books, and making them findable and accessible, they have done an incredible service to humanity. Our modern society, culture, science, and technology are all founded upon ideas that were transmitted through books and libraries.

Then the web came along, and allowed us to also publish all the stuff that wasn’t good enough to put in books, and do it all much faster and cheaper. Although the average quality of material you find on the web is quite poor, there are some pockets of excellence, and in aggregate, the sum of all web content is probably even more amazing than all libraries put together.

Google (and a few brave contenders like Bing, Baidu, DuckDuckGo and Blekko) have kindly indexed it all for us, acting as the web’s librarians. Without search engines, it would be terribly difficult to actually find anything, so hats off to them. However, what comes next, after search engines? It seems unlikely that search engines are the last thing we’re going to do with the web. Read more…

## Building Apache Kafka from scratch

### In this episode of the O'Reilly Data Show Podcast, Jay Kreps talks about data integration, event data, and the Internet of Things.

At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, a new set of (mostly open source) software components have become critical to tackling data integration problems at scale. By now, many people have heard of tools like Hadoop, Spark, and NoSQL databases, but there are a number of lesser-known components that are “hidden” beneath the surface.

In my conversations with data engineers tasked with building data platforms, one tool stands out: Apache Kafka, a distributed messaging system that originated from LinkedIn. It’s used to synchronize data between systems and has emerged as an important component in real-time analytics.

Subscribe to the O’Reilly Data Show Podcast

In my travels over the past year, I’ve met engineers across many industries who use Apache Kafka in production. A few months ago, I sat down with O’Reilly author and Radar contributor Jay Kreps, a highly regarded data engineer and former technical lead for Online Data Infrastructure at LinkedIn, and most recently CEO/co-founder of Confluent. Read more…

Comment

## 2014 Data Science Salary Survey

### Salary insights from more than 800 data professionals reveal a correlation to skills and tools.

Data is growing: Whether in terms of data-driven applications, the diversity of tools or the actual quantities of data we collect and process, the data space is characterized by expansion. The excitement around data has been tempered in some circles — the first two query completion suggestions for a Google search of “Is data science” are “dead” and “a fad” — but from a practitioner’s perspective, things are looking quite rosy.

In the results of this year’s O’Reilly Media Data Science Salary Survey, we found a median total salary of $98k ($144k for US respondents only). The 816 data professionals in the survey included engineers, analysts, entrepreneurs, and managers (although almost everyone had some technical component in their role).

Why the high salaries? While the demand for data applications has increased rapidly, the number of people who set up the systems and perform advanced analytics has increased much more slowly. Newer tools such as Hadoop and Spark should have even fewer expert users, and correspondingly we found that users of these tools have particularly high salaries. Read more…

Comment: 1

## Fast data calls for new ways to manage its flow

### Examples of multi-layer, three-tier data-processing architecture.

Like CPU caches, which tend to be arranged in multiple levels, modern organizations direct their data into different data stores under the principle that a small amount is needed for real-time decisions and the rest for long-range business decisions. This article looks at options for data storage, focusing on one that’s particularly appropriate for the “fast data” scenario described in a recent O’Reilly report.

Many organizations deal with data on at least three levels:

1. They need data at their fingertips, rather like a reference book you leave on your desk. Organizations use such data for things like determining which ad to display on a web page, what kind of deal to offer a visitor to their website, or what email message to suppress as spam. They store such data in memory, often in key/value stores that allow fast lookups. Flash is a second layer (slower than memory, but much cheaper), as I described in a recent article. John Piekos, vice president of engineering at VoltDB, which makes an in-memory database, says that this type of data storage is used in situations where delays of just 20 or 30 milliseconds mean lost business.
2. For business intelligence, theses organizations use a traditional relational database or a more modern “big data” tool such as Hadoop or Spark. Although the use of a relational database for background processing is generally called online analytic processing (OLAP), it is nowhere near as online as the previous data used over a period of just milliseconds for real-time decisions.
3. Some data is archived with no immediate use in mind. It can be compressed and perhaps even stored on magnetic tape.

For the new fast data tier, where performance is critical, techniques such as materialized views further improve responsiveness. According to Piekos, materialized views bypass a certain amount of database processing to cut milliseconds off of queries. Read more…

Comment

## The science of moving dots: the O’Reilly Data Show Podcast

### Rajiv Maheswaran talks about the tools and techniques required to analyze new kinds of sports data.

Many data scientists are comfortable working with structured operational data and unstructured text. Newer techniques like deep learning have opened up data types like images, video, and audio.

Other common data sources are garnering attention. With the rise of mobile phones equipped with GPS, I’m meeting many more data scientists at start-ups and large companies who specialize in spatio-temporal pattern recognition. Analyzing “moving dots” requires specialized tools and techniques.

Subscribe to the O’Reilly Data Show Podcast

A few months ago, I sat down with Rajiv Maheswaran founder and CEO of Second Spectrum, a company that applies analytics to sports tracking data. Maheswaran talked about this new kind of data and the challenge of finding patterns:

“It’s interesting because it’s a new type of data problem. Everybody knows that big data machine learning has done a lot of stuff in structured data, in photos, in translation for language, but moving dots is a very new kind of data where you haven’t figured out the right feature set to be able to find patterns from. There’s no language of moving dots, at least not that computers understand. People understand it very well, but there’s no computational language of moving dots that are interacting. We wanted to build that up, mostly because data about moving dots is very, very new. It’s only in the last five years, between phones and GPS and new tracking technologies, that moving data has actually emerged.”

## The big data sweet spot: policy that balances benefits and risks

### Deciding what data to collect is hard when consequences are unpredictable.

A big reason why discussions of “big data” get complicated — and policy-makers resort to vague hand-waving, as in the well-known White House executive office report — is that its ripple effects travel fast and far. Your purchase, when recorded by a data broker, affects not only the the ads and deals they offer you in the future, but the ones they offer innumerable people around the country that share some demographic with you.

Policy-making might be simple if data collectors or governments could say, “We’ll collect certain kinds of data for certain purposes and no others” — but the impacts of data collection are rarely predictable. And if one did restrict big data that way, its value would be seriously reduced.

## Follow my steps: big data privacy vs collection

Data collection will explode as we learn how to connect you to different places you’ve been by the way you walk or to recognize certain kinds of diseases by your breath.

When such data exhaust is being collected, you can’t evade consequences by paying cash and otherwise living off the grid. In fact, trying to do so may disadvantage you even more: people who lack access to sophisticated technologies leave fewer tracks and therefore may not be served by corporations or governments. Read more…

Comment

## The problem of managing schemas

### Schemas inevitably will change — Apache Avro offers an elegant solution.

When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?

This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.

The usual first choice of file formats is either comma delimited text files, since these are easy to dump from many databases, or JSON format, often used for event data or data arriving from a REST API.

There are many benefits to this approach — text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources and all applications in the Hadoop ecosystem will be able to process them. Read more…