"data mining" entries

Four short links: 20 May 2015

Robots and Shadow Work, Time Lapse Mining, CS Papers, and Software for Reproducibility

  1. Rise of the Robots and Shadow Work (NY Times) — In “Rise of the Robots,” Ford argues that a society based on luxury consumption by a tiny elite is not economically viable. More to the point, it is not biologically viable. Humans, unlike robots, need food, health care and the sense of usefulness often supplied by jobs or other forms of work. Two thought-provoking and related books about the potential futures that technology-driven change could bring.
  2. Time Lapse Mining from Internet Photos (PDF) — First, we cluster 86 million photos into landmarks and popular viewpoints. Then, we sort the photos by date and warp each photo onto a common viewpoint. Finally, we stabilize the appearance of the sequence to compensate for lighting effects and minimize flicker. Our resulting time-lapses show diverse changes in the world’s most popular sites, like glaciers shrinking, skyscrapers being constructed, and waterfalls changing course. (A toy sketch of the flicker-smoothing step appears after this list.)
  3. Git Repository of CS Papers — The intention here is to both provide myself with backups and easy access to papers, while also collecting a repository of links so that people can always find the paper they are looking for. Pull the repo and you’ll never be short of airplane/bedtime reading.
  4. Software For Reproducible Science — This quality is indeed central to doing science with code. What good is a data analysis pipeline if it crashes when I fiddle with the data? How can I draw conclusions from simulations if I cannot change their parameters? As soon as I need trust in code supporting a scientific finding, I find myself tinkering with its input, and often breaking it. Good scientific code is code that can be reused, that can lead to large-scale experiments validating its underlying assumptions.
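The full pipeline in the second link (cluster, sort, warp, stabilize) is well beyond a snippet, but the final flicker-minimization step can be shown in miniature. The sketch below is only a toy stand-in, not the paper's method: it assumes you already have an aligned stack of frames as a NumPy array and smooths each pixel with a temporal median; the frame size and window length are arbitrary.

```python
import numpy as np

def smooth_flicker(frames, window=5):
    """Toy appearance stabilization: replace each frame with the per-pixel
    median of a temporal window centred on it, which damps short-lived
    lighting changes (flicker) while keeping slow scene changes."""
    frames = np.asarray(frames)                  # shape (T, H, W, 3)
    half = window // 2
    smoothed = np.empty_like(frames)
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        smoothed[t] = np.median(frames[lo:hi], axis=0).astype(frames.dtype)
    return smoothed

# Example: a static scene with per-frame lighting jitter added.
rng = np.random.default_rng(0)
base = rng.integers(0, 256, (48, 64, 3))         # one "aligned" view
flicker = rng.normal(0, 20, (100, 1, 1, 1))      # global brightness jitter
stack = np.clip(base + flicker, 0, 255).astype(np.uint8)
stable = smooth_flicker(stack, window=7)
```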
Four short links: 29 April 2015

Deceptive Visualisation, Small Robots, Managing Secrets, and Large Time Series

  1. Disinformation Visualisation: How to Lie with Datavis — We don’t spread visual lies by presenting false data. That would be lying. We lie by misrepresenting the data to tell the very specific story we’re interested in telling. If this is making you slightly uncomfortable, that’s a good thing; it should. If you’re concerned about adopting this new and scary habit, well, don’t worry; it’s not new. Just open your CV to be reminded you’ve lied with truthful data before. This time, however, it will be explicit and visual. (via Regine Debatty)
  2. Microtugs — a new type of small robot that can apply orders of magnitude more force than it weighs. This is in stark contrast to previous small robots that have become progressively better at moving and sensing, but lacked the ability to change the world through the application of human-scale loads.
  3. Vault — a tool for securely managing secrets and encrypting data in transit.
  4. iSAX: Indexing and Mining Terabyte Sized Time Series (PDF) — Our approach allows both fast exact search and ultra-fast approximate search. We show how to exploit the combination of both types of search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real-world data sets, containing millions of time series. (via Benjamin Black)
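iSAX builds on the SAX symbolic representation of time series, and that underlying representation is small enough to sketch. The snippet below shows plain SAX symbolization only (z-normalize, Piecewise Aggregate Approximation, then map segment means to symbols via standard-normal breakpoints); it is not the paper's indexing structure, and the word length and alphabet size are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import norm

def sax_word(series, segments=8, alphabet=4):
    """Convert a 1-D series to a SAX word: z-normalize, reduce with
    Piecewise Aggregate Approximation (PAA), then map each segment mean
    to a symbol using equiprobable breakpoints of the standard normal."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)        # z-normalize
    usable = len(x) // segments * segments        # drop a ragged tail, if any
    paa = x[:usable].reshape(segments, -1).mean(axis=1)
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet + 1)[1:-1])
    return "".join("abcdefghijklmnop"[s] for s in np.digitize(paa, breakpoints))

# Example: two similar noisy sine waves yield similar SAX words.
t = np.linspace(0, 4 * np.pi, 256)
print(sax_word(np.sin(t) + 0.1 * np.random.randn(256)))
print(sax_word(np.sin(t + 0.2) + 0.1 * np.random.randn(256)))
```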
Four short links: 11 June 2014

Right to Mine, Summarising Microblogs, C Sucks for Stats, and Scanning Logfiles

  1. UK Copyright Law Permits Researchers to Data Mine — the changes mean Copyright holders can require researchers to pay to access their content but cannot then restrict text or data mining for non-commercial purposes thereafter, under the new rules. However, researchers that use the text or data they have mined for anything other than a non-commercial purpose will be said to have infringed copyright, unless the activity has the consent of rights holders. In addition, the sale of the text or data mined by researchers is prohibited. The derivative works will be very interesting: if a university mines the journals and finds a promising new Thing that is then verified experimentally, is that Thing the university’s to license commercially for profit?
  2. Efficient Online Summary of Microblogging Streams (PDF) — research paper. The algorithm we propose uses a word graph, along with optimization techniques such as decaying windows and pruning. It outperforms the baseline in terms of summary quality, as well as time and memory efficiency.
  3. Statistical Shortcomings in Standard Math Libraries — or “Why C Derivatives Are Not Popular With Statistical Scientists”. The following mathematical functions are necessary for implementing any rudimentary statistics application; and yet they are general enough to have many applications beyond statistics. I hereby propose adding them to the standard C math library and to the libraries which inherit from it. For purposes of future discussion, I will refer to these functions as the Elusive Eight. (A short illustration of the kind of function in question appears after this list.)
  4. fail2ban — open source tool that scans logfiles for signs of malice, and triggers actions (e.g., iptables updates).
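The post's “Elusive Eight” aren't enumerated here, but special functions such as the regularized incomplete gamma and incomplete beta are typical of what basic statistics needs and what C's math.h lacks. As a hedged illustration (SciPy stands in here, which is my choice rather than the post's), two common distribution functions reduce directly to them:

```python
import numpy as np
from scipy import special, stats

# The chi-square CDF is just the regularized lower incomplete gamma
# function P(df/2, x/2), a function absent from C's standard math.h.
x, df = 7.3, 4
via_special = special.gammainc(df / 2.0, x / 2.0)
assert np.isclose(via_special, stats.chi2.cdf(x, df))

# Likewise, the Student-t CDF (for t > 0) reduces to the regularized
# incomplete beta function.
t, nu = 1.7, 9
via_beta = 1.0 - 0.5 * special.betainc(nu / 2.0, 0.5, nu / (nu + t * t))
assert np.isclose(via_beta, stats.t.cdf(t, nu))

print(via_special, via_beta)
```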

Big data and privacy: an uneasy face-off for government to face

MIT workshop kicks off Obama campaign on privacy

Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.

Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers cheered about the value of data (invoking the “big data” buzzword often), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that could analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.


How did we end up with a centralized Internet for the NSA to mine?

The Internet is naturally decentralized, but it's distorted by business considerations.

I’m sure it was a Wired editor, and not the author Steven Levy, who assigned the title “How the NSA Almost Killed the Internet” to yesterday’s fine article about the pressures on large social networking sites. Whoever chose the title, it’s justifiably grandiose because to many people, yes, companies such as Facebook and Google constitute what they know as the Internet. (The article also discusses threats to divide the Internet infrastructure into national segments, which I’ll touch on later.)

So my question today is: How did we get such industry concentration? Why is a network famously based on distributed processing, routing, and peer connections characterized now by a few choke points that the NSA can skim at its leisure?

Mining the social web, again

If you want to engage with the data that's surrounding you, Mining the Social Web is the best place to start.

When we first published Mining the Social Web, I thought it was one of the most important books I worked on that year. Now that we’re publishing a second edition (which I didn’t work on), I find that I agree with myself. With this new edition, Mining the Social Web is more important than ever.

While we’re seeing more and more cynicism about the value of data, and particularly “big data,” that cynicism isn’t shared by most people who actually work with data. Data has undoubtedly been overhyped and oversold, but the best way to arm yourself against the hype machine is to start working with data yourself, to find out what you can and can’t learn. And there’s no shortage of data around. Everything we do leaves a cloud of data behind it: Twitter, Facebook, Google+ — to say nothing of the thousands of other social sites out there, such as Pinterest, Yelp, Foursquare, you name it. Google is doing a great job of mining your data for value. Why shouldn’t you?

There are few better ways to learn about mining social data than by starting with Twitter; Twitter is really a ready-made laboratory for the new data scientist. And this book is without a doubt the best and most thorough approach to mining Twitter data out there.


Investigating the Twitter Interest Graph

Why Is Twitter All the Rage?

I’m presenting a short webcast entitled Why Twitter Is All the Rage: A Data Miner’s Perspective that is loosely adapted from material that appears early in Mining the Social Web (2nd Ed). I wanted to share out the content that inspired the topic. The remainder of this post is a slightly abridged reproduction of a section that appears early in Chapter 1. If you enjoy it, you can download all of Chapter 1 as a free PDF to learn more about mining Twitter data.

Writing Paranoid Code

Computing Twitter Influence, Part 2

In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some of the variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.

Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve the full account profiles for those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts don’t have enough followers for the total number of requests to either API resource to present rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcement, which manifests as an HTTP error from the RESTful API.
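That paragraph translates almost directly into code. The sketch below is my rough version of the pattern rather than the sample code from the post: cursor through GET /followers/ids in pages of 5,000, look up profiles 100 at a time with GET /users/lookup, and wrap every request in a helper that sleeps through HTTP 429 rate-limit responses. The requests and requests_oauthlib libraries, the placeholder credentials, and the retry policy are all assumptions on my part.

```python
import time
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials: supply your own app's keys and tokens.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "TOKEN", "TOKEN_SECRET")
BASE = "https://api.twitter.com/1.1"

def paranoid_get(url, params, max_retries=5):
    """GET with a little paranoia: sleep through HTTP 429 rate limits
    (using the x-rate-limit-reset header) and back off on other errors."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, auth=auth)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:  # rate limited: wait for the window to reset
            reset = int(resp.headers.get("x-rate-limit-reset", time.time() + 60))
            time.sleep(max(reset - time.time(), 0) + 1)
        else:
            time.sleep(2 ** attempt)  # transient server errors and the like
    raise RuntimeError("Giving up on %s" % url)

def follower_profiles(screen_name):
    """ceiling(X/5000) calls to followers/ids, then ceiling(X/100) to users/lookup."""
    ids, cursor = [], -1
    while cursor != 0:
        page = paranoid_get(BASE + "/followers/ids.json",
                            {"screen_name": screen_name,
                             "cursor": cursor, "count": 5000})
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    profiles = []
    for i in range(0, len(ids), 100):
        batch = ",".join(str(u) for u in ids[i:i + 100])
        profiles.extend(paranoid_get(BASE + "/users/lookup.json", {"user_id": batch}))
    return profiles
```

A fuller version would also handle protected accounts and dropped connections, which is the sort of paranoia the post's title is getting at.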

Computing Twitter Influence, Part 1: Arriving at a Base Metric

The subtle variables affecting a base metric

This post introduces a series that explores the problem of approximating a Twitter account’s influence. With the ubiquity of social media and its effects on everything from how we shop to how we vote at the polls, it’s critical that we be able to employ reasonably accurate and well-understood measurements for approximating influence from social media signals.

Unlike social networks such as LinkedIn and Facebook, in which connections between entities are symmetric and typically correspond to a real-world connection, Twitter’s underlying data model is fundamentally predicated upon asymmetric following relationships. Another way of thinking about a following relationship is to consider that it’s little more than a subscription to a feed about some content of interest. In other words, when you follow another Twitter user, you are expressing interest in that user and opting in to whatever content they would like to place in your home timeline. As such, Twitter’s underlying network structure can be interpreted as an interest graph and mined for insights about the relative popularity of one user when compared to another.
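To make the interest-graph framing concrete, here is a toy example (networkx is my choice of tool, not necessarily the post's): treat each following relationship as a directed edge and compare accounts by in-degree, the most naive popularity signal, alongside PageRank, which also weighs who is doing the following.

```python
import networkx as nx

# Toy interest graph: an edge u -> v means "u follows (is interested in) v".
follows = [
    ("alice", "data_sci_daily"), ("bob", "data_sci_daily"),
    ("carol", "data_sci_daily"), ("alice", "bob"), ("dave", "carol"),
]
G = nx.DiGraph(follows)

# In-degree = follower count, the crudest popularity measure...
popularity = sorted(G.in_degree(), key=lambda kv: kv[1], reverse=True)

# ...while PageRank also accounts for the interests of the followers themselves.
rank = nx.pagerank(G)

print(popularity[:3])
print(sorted(rank, key=rank.get, reverse=True)[:3])
```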