"Big Data" entries

Four short links: 19 December 2014

Statistical Causality, Clustering Bitcoin, Hardware Security, and A Language for Scripts

1. Distinguishing Cause and Effect using Observational Data — research paper evaluating effectiveness of the “additive noise” test, a nifty statistical trick to identify causal relationships from observational data. (via Slashdot)
2. Clustering Bitcoin Accounts Using Heuristics (O’Reilly Radar) — In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. […] It turns out in reality, though, the way most users and services are using bitcoin, was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms.
3. A Primer on Hardware Security: Models, Methods, and Metrics (PDF) — Camouflaging: This is a layout-level technique to hamper image-processing-based extraction of gate-level netlist. In one embodiment of camouflaging, the layouts of standard cells are designed to look alike, resulting in incorrect extraction of the netlist. The layout of nand cell and the layout of nor cell look different and hence their functionality can be extracted. However, the layout of a camouflaged nand cell and the layout of camouflaged nor cell can be made to look identical and hence an attacker cannot unambiguously extract their functionality.
4. Prompter: A Domain-Specific Language for Versu (PDF) — literally a scripting language (you write theatrical-style scripts, characters, dialogues, and events) for an inference engine that lets you talk to characters and have a different story play out each time.

Clustering bitcoin accounts using heuristics

In this O'Reilly Data Show Podcast: Sarah Meiklejohn on analytic applications for blockchain and cryptocurrency technology.

Editor’s note: we’ll explore present and future applications of cryptocurrency and blockchain technologies at our upcoming Radar Summit: Bitcoin & the Blockchain on Jan. 27, 2015, in San Francisco.

A few data scientists are starting to play around with cryptocurrency data, and as bitcoin and related technologies start gaining traction, I expect more to wade in. As the space matures, there will be many interesting applications based on analytics over the transaction data produced by these technologies. The blockchain — the distributed ledger that contains all bitcoin transactions — is publicly available, and the underlying data set is of modest size. Data scientists can work with this data once it’s loaded into familiar data structures, but producing insights requires some domain knowledge and expertise.

Subscribe to the O’Reilly Data Show Podcast

I recently spoke with Sarah Meiklejohn, a lecturer at UCL, and an expert on computer security and cryptocurrencies. She was part of an academic research team that studied pseudo-anonymity (“pseudonymity”) in bitcoin. In particular, they used transaction data to compare “potential” anonymity to the “actual” anonymity achieved by users. A bitcoin user can use many different public keys, but careful research led to a few heuristics that allowed them to cluster addresses belonging to the same user:

“In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. So, if they are a legitimate businessman on the one hand, they can use a certain set of pseudonyms for that activity, and then if they are dealing drugs on Silk Road, they might use a completely different set of pseudonyms for that, and you wouldn’t be able to tell that that’s the same user.
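
A clustering heuristic of the kind described above can be sketched in a few lines. The excerpt does not spell out which heuristics the team used, so this sketch assumes one commonly cited rule: addresses that appear together as inputs to a single transaction are treated as belonging to the same user. The transaction data and all names below are hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure used to merge addresses into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        # Path halving keeps the trees shallow.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_addresses(transactions):
    """Group addresses that ever appear together as inputs to one transaction.

    Assumption: co-spent inputs share an owner (the "multi-input" rule).
    """
    uf = UnionFind()
    for inputs in transactions:
        for address in inputs:
            uf.find(address)  # register addresses that are only ever spent alone
        for other in inputs[1:]:
            uf.union(inputs[0], other)
    clusters = defaultdict(set)
    for address in list(uf.parent):
        clusters[uf.find(address)].add(address)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical data: each inner list holds the input addresses of one transaction.
example_transactions = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
clusters = cluster_addresses(example_transactions)
```

Because A is spent with B and B with C, all three pseudonyms collapse into one cluster even though A and C never appear in the same transaction. That transitivity is why careless reuse of keys erodes pseudonymity so quickly.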

Cheap sensors, fast networks, and distributed computing

The history of computing has been a constant pendulum — that pendulum is now swinging back toward distribution.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. You can download the free report here.

The trifecta of cheap sensors, fast networks, and distributed computing is changing how we work with data. But making sense of all that data takes help, which is arriving in the form of machine learning. Here’s one view of how that might play out.

Clouds, edges, fog, and the pendulum of distributed computing

The history of computing has been a constant pendulum, swinging between centralization and distribution.

The first computers filled rooms, and operators were physically within them, switching toggles and turning wheels. Then came mainframes, which were centralized, with dumb terminals.

As the cost of computing dropped and the applications became more democratized, user interfaces mattered more. The smarter clients at the edge became the first personal computers; many broke free of the network entirely. The client got the glory; the server merely handled queries.

Once the web arrived, we centralized again. LAMP (Linux, Apache, MySQL, PHP) stacks were buried deep inside data centers, with the computer at the other end of the connection relegated to little more than a smart terminal rendering HTML. Load-balancers sprayed traffic across thousands of cheap machines. Eventually, the web turned from static sites to complex software as a service (SaaS) applications.

Then the pendulum swung back to the edge, and the clients got smart again. First with AJAX, Java, and Flash; then in the form of mobile apps, where the smartphone or tablet did most of the hard work and the back end was a communications channel for reporting the results of local action. Read more…

Four short links: 16 December 2014

Memory Management, Stream Processing, Robot's Google, and Emotive Words

1. Effectively Managing Memory at Gmail Scale — how they gathered data, how JavaScript memory management works, and what they did to nail down leaks.
2. tigon — an open-source, real-time, low-latency, high-throughput stream processing framework.
3. Robo Brain — machine knowledge of the real world for robots. (via MIT Technology Review)
4. The Structure and Interpretation of the Computer Science Curriculum — convincing argument for teaching intro to programming with Scheme, but not using the classic text SICP.

Update: the original fourth link to Depeche Mood led only to a README on GitHub; we’ve replaced it with a new link.

Exploring open web crawl data

What if you had your own copy of the entire web, and you could do with it whatever you want?

For the last few millennia, libraries have been the custodians of human knowledge. By collecting books, and making them findable and accessible, they have done an incredible service to humanity. Our modern society, culture, science, and technology are all founded upon ideas that were transmitted through books and libraries.

Then the web came along, and allowed us to also publish all the stuff that wasn’t good enough to put in books, and do it all much faster and cheaper. Although the average quality of material you find on the web is quite poor, there are some pockets of excellence, and in aggregate, the sum of all web content is probably even more amazing than all libraries put together.

Google (and a few brave contenders like Bing, Baidu, DuckDuckGo and Blekko) have kindly indexed it all for us, acting as the web’s librarians. Without search engines, it would be terribly difficult to actually find anything, so hats off to them. However, what comes next, after search engines? It seems unlikely that search engines are the last thing we’re going to do with the web. Read more…

Four short links: 10 December 2014

Clearing Tor, Offline Cookbook, Burning Great Things, and Batch Pipelines

1. Clearing the Air Around Tor (Quinn Norton) — Occasionally the stars align between spooks and activists and governments and anarchists. Tor, like a road system or a telephone network or many pieces of public infrastructure, is useful to all of these people and more (hence the debate on child pornographers and drug markets) because it’s just such a general architecture of encryption. The FBI may want Tor to be broken, but I promise any spies who are counting on it for mission and life don’t.
2. Offline Cookbook — how Chrome intends to solve the offline problem in general. I hope it works and takes off because offline is the bane of this webapp-user’s life.
3. The Pirate Bay, Down Forever? — As a big fan of the KLF I once learned that it’s great to burn great things up. At least then you can quit while you’re on top.
4. Luigi (GitHub) — a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in. (via Asana engineering blog)
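
To make "dependency resolution" for batch pipelines a little more concrete, here is a minimal sketch that uses only the Python standard library. It deliberately does not use Luigi's own API; the job names and the `run_pipeline` helper are hypothetical, and a real scheduler would also handle retries, outputs, and failures.

```python
from graphlib import TopologicalSorter

# Hypothetical batch jobs, each mapped to the set of jobs it depends on.
pipeline = {
    "extract_logs": set(),
    "clean_logs": {"extract_logs"},
    "aggregate_daily": {"clean_logs"},
    "load_users": set(),
    "join_users": {"aggregate_daily", "load_users"},
    "build_report": {"join_users"},
}


def run_pipeline(jobs):
    """Visit every job after its dependencies and return the execution order."""
    order = []
    # static_order() raises graphlib.CycleError if the jobs depend on each other in a loop.
    for job in TopologicalSorter(jobs).static_order():
        order.append(job)  # a real scheduler would execute the job here
    return order


execution_order = run_pipeline(pipeline)
```

Any order the sorter returns is valid as long as each job comes after the jobs it depends on, which is the guarantee a pipeline tool has to provide before it can run anything.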

Building Apache Kafka from scratch

In this episode of the O'Reilly Data Show Podcast, Jay Kreps talks about data integration, event data, and the Internet of Things.

At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, a new set of (mostly open source) software components have become critical to tackling data integration problems at scale. By now, many people have heard of tools like Hadoop, Spark, and NoSQL databases, but there are a number of lesser-known components that are “hidden” beneath the surface.

In my conversations with data engineers tasked with building data platforms, one tool stands out: Apache Kafka, a distributed messaging system that originated from LinkedIn. It’s used to synchronize data between systems and has emerged as an important component in real-time analytics.

In my travels over the past year, I’ve met engineers across many industries who use Apache Kafka in production. A few months ago, I sat down with O’Reilly author and Radar contributor Jay Kreps, a highly regarded data engineer and former technical lead for Online Data Infrastructure at LinkedIn, and most recently CEO/co-founder of Confluent. Read more…

2014 Data Science Salary Survey

Salary insights from more than 800 data professionals reveal a correlation to skills and tools.

Data is growing: Whether in terms of data-driven applications, the diversity of tools or the actual quantities of data we collect and process, the data space is characterized by expansion. The excitement around data has been tempered in some circles — the first two query completion suggestions for a Google search of “Is data science” are “dead” and “a fad” — but from a practitioner’s perspective, things are looking quite rosy.

In the results of this year’s O’Reilly Media Data Science Salary Survey, we found a median total salary of $98k ($144k for US respondents only). The 816 data professionals in the survey included engineers, analysts, entrepreneurs, and managers (although almost everyone had some technical component in their role).

Why the high salaries? While the demand for data applications has increased rapidly, the number of people who set up the systems and perform advanced analytics has increased much more slowly. Newer tools such as Hadoop and Spark should have even fewer expert users, and correspondingly we found that users of these tools have particularly high salaries. Read more…

Four short links: 4 December 2014

1. One Click Captcha (Wired) — Google’s new Captcha tech is just a checkbox: “I am not a robot”. Instead of depending upon the traditional distorted word test, Google’s “reCaptcha” examines cues every user unwittingly provides: IP addresses and cookies provide evidence that the user is the same friendly human Google remembers from elsewhere on the Web. And Shet says even the tiny movements a user’s mouse makes as it hovers and approaches a checkbox can help reveal an automated bot.
2. The Responsive Enterprise: Embracing the Hacker Way (ACM) — Letting developers wander around without clear goals in the vastness of the software universe of all computable functions is one of the major reasons why projects fail, not because of lack of process or planning. I like all of this, although at times it can be a little like what I imagine it would be like if Cory Doctorow wrote a management textbook. (via Greg Linden)
3. Pizza Hut Tests Ordering via Eye-Tracking — The digital menu shows diners a canvas of 20 toppings and builds their pizza, from one of 4,896 combinations, based on which toppings they looked at longest.
4. How Browsers Get to Know You in Milliseconds (Andy Oram) — breaks down info exchange, data exchange, timing, even business relationships for ad auctions. Augment understanding of the user from third-party data (10 milliseconds). These third parties are the companies that accumulate information about our purchasing habits. The time allowed for them to return data is so short that they often can’t spare time for network transmission, and instead co-locate at the AppNexus server site. In fact, according to Magnusson, the founders of AppNexus created a cloud server before opening their exchange.

How browsers get to know you in milliseconds

Behind the scenes of a real-time ad auction on the web.

A small technological marvel occurs on almost every visit to a web page. In the seconds that elapse between the user’s click and the display of the page, an ad auction takes place in which hundreds of bidders gather whatever information they can get on the user, determine which ads are likely to be of interest, place bids, and transmit the winning ad to be placed in the page.

How can all that happen in approximately 100 milliseconds? Let’s explore the timeline and find out what goes on behind the scenes in a modern ad auction. Most of the information I have comes from two companies that handle different stages of the auction: the ad exchange AppNexus and the demand side platform Yashi. Both store critical data in an Aerospike database running on flash to achieve sub-second speeds.