"data mining" entries

Four short links: 11 June 2014

Right to Mine, Summarising Microblogs, C Sucks for Stats, and Scanning Logfiles

  1. UK Copyright Law Permits Researchers to Data Mine — changes mean Copyright holders can require researchers to pay to access their content but cannot then restrict text or data mining for non-commercial purposes thereafter, under the new rules. However, researchers that use the text or data they have mined for anything other than a non-commercial purpose will be said to have infringed copyright, unless the activity has the consent of rights holders. In addition, the sale of the text or data mined by researchers is prohibited. The derivative works will be very interesting: if a university mines the journals, finds a new possibility for a Thing, and that possibility is verified experimentally, is that Thing the university’s to license commercially for profit?
  2. Efficient Online Summary of Microblogging Streams (PDF) — research paper. The algorithm we propose uses a word graph, along with optimization techniques such as decaying windows and pruning. It outperforms the baseline in terms of summary quality, as well as time and memory efficiency.
  3. Statistical Shortcomings in Standard Math Libraries — or “Why C Derivatives Are Not Popular With Statistical Scientists”. The following mathematical functions are necessary for implementing any rudimentary statistics application; and yet they are general enough to have many applications beyond statistics. I hereby propose adding them to the standard C math library and to the libraries which inherit from it. For purposes of future discussion, I will refer to these functions as the Elusive Eight.
  4. fail2ban — open source tool that scans logfiles for signs of malice, and triggers actions (e.g., iptables updates).
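
To make the fail2ban idea concrete, here is a minimal Python sketch of the general technique: scan log lines for failed logins and flag addresses that fail repeatedly. It is not fail2ban's own implementation. The sshd-style log format, the `find_offenders` name, and the `max_retry` threshold are assumptions made for illustration, and the sketch only reports offenders rather than updating a firewall.

```python
import re
from collections import Counter

# Assumed log format: sshd-style "Failed password ... from <ip>" lines.
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})")


def find_offenders(log_lines, max_retry=3):
    """Return the set of IPs with at least max_retry failed logins."""
    failures = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group(1)] += 1
    return {ip for ip, count in failures.items() if count >= max_retry}


# Made-up sample lines using documentation-reserved IP ranges.
sample_log = [
    "Jun 11 10:00:01 host sshd[101]: Failed password for root from 203.0.113.7 port 4242 ssh2",
    "Jun 11 10:00:03 host sshd[101]: Failed password for root from 203.0.113.7 port 4243 ssh2",
    "Jun 11 10:00:05 host sshd[102]: Failed password for invalid user bob from 198.51.100.2 port 5151 ssh2",
    "Jun 11 10:00:07 host sshd[101]: Failed password for admin from 203.0.113.7 port 4244 ssh2",
    "Jun 11 10:00:09 host sshd[103]: Accepted password for alice from 192.0.2.10 port 6000 ssh2",
]

offenders = find_offenders(sample_log)
```

A real deployment would also need a time window for counting failures and an action to run for each offender (the iptables update mentioned above); both are left out of this sketch.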

Big data and privacy: an uneasy face-off for government to face

MIT workshop kicks off Obama campaign on privacy

Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.

Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers cheered about the value of data (invoking the “big data” buzzword often), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that could analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.

Read more…

The technical aspects of privacy

The first of three public workshops kicked off a conversation with the federal government on data privacy in the US.

Comments: 7

How did we end up with a centralized Internet for the NSA to mine?

The Internet is naturally decentralized, but it's distorted by business considerations.

I’m sure it was a Wired editor, and not the author Steven Levy, who assigned the title “How the NSA Almost Killed the Internet” to yesterday’s fine article about the pressures on large social networking sites. Whoever chose the title, it’s justifiably grandiose because to many people, yes, companies such as Facebook and Google constitute what they know as the Internet. (The article also discusses threats to divide the Internet infrastructure into national segments, which I’ll touch on later.)

So my question today is: How did we get such industry concentration? Why is a network famously based on distributed processing, routing, and peer connections characterized now by a few choke points that the NSA can skim at its leisure?
Read more…

Comments: 7

Mining the social web, again

If you want to engage with the data that's surrounding you, Mining the Social Web is the best place to start.

When we first published Mining the Social Web, I thought it was one of the most important books I worked on that year. Now that we’re publishing a second edition (which I didn’t work on), I find that I agree with myself. With this new edition, Mining the Social Web is more important than ever.

While we’re seeing more and more cynicism about the value of data, and particularly “big data,” that cynicism isn’t shared by most people who actually work with data. Data has undoubtedly been overhyped and oversold, but the best way to arm yourself against the hype machine is to start working with data yourself, to find out what you can and can’t learn. And there’s no shortage of data around. Everything we do leaves a cloud of data behind it: Twitter, Facebook, Google+ — to say nothing of the thousands of other social sites out there, such as Pinterest, Yelp, Foursquare, you name it. Google is doing a great job of mining your data for value. Why shouldn’t you?

There are few better ways to learn about mining social data than by starting with Twitter; Twitter is really a ready-made laboratory for the new data scientist. And this book is without a doubt the best and most thorough approach to mining Twitter data out there. Read more…

Comments: 2

Investigating the Twitter Interest Graph

Why Is Twitter All the Rage?

I’m presenting a short webcast entitled Why Twitter Is All the Rage: A Data Miner’s Perspective that is loosely adapted from material that appears early in Mining the Social Web (2nd Ed). I wanted to share the content that inspired the topic. The remainder of this post is a slightly abridged reproduction of a section that appears early in Chapter 1. If you enjoy it, you can download all of Chapter 1 as a free PDF to learn more about mining Twitter data.
Read more…

Writing Paranoid Code

Computing Twitter Influence, Part 2

In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.

Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for up to Y of those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts may not have enough followers that the total number of requests to each API resource presents rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcements that manifest as an HTTP error in RESTful APIs.
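
The request arithmetic above can be sketched in a few lines of Python. The batch sizes of 5,000 and 100 come from the text; the function names and the `calls_per_window` parameter are hypothetical, and the per-window limit is deliberately left as an input rather than a claim about Twitter's actual rate limits.

```python
FOLLOWER_IDS_PER_CALL = 5000  # batch size for GET /followers/ids (from the text)
PROFILES_PER_CALL = 100       # batch size for GET /users/lookup (from the text)


def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def estimate_api_calls(num_followers):
    """Return (followers/ids calls, users/lookup calls) for X followers."""
    ids_calls = ceil_div(num_followers, FOLLOWER_IDS_PER_CALL)
    lookup_calls = ceil_div(num_followers, PROFILES_PER_CALL)
    return ids_calls, lookup_calls


def windows_needed(num_calls, calls_per_window):
    """Rate-limit windows needed, given an assumed per-window call limit."""
    return ceil_div(num_calls, calls_per_window)


ids_calls, lookup_calls = estimate_api_calls(12_345)
```

For an account with 12,345 followers this gives 3 calls to GET /followers/ids and 124 calls to GET /users/lookup. Passing either count to `windows_needed` with whatever limit applies shows how many rate-limit windows the collection would span.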

Read more…

Computing Twitter Influence, Part 1: Arriving at a Base Metric

The subtle variables affecting a base metric

This post introduces a series that explores the problem of approximating a Twitter account’s influence. With the ubiquity of social media and its effects on everything from how we shop to how we vote at the polls, it’s critical that we be able to employ reasonably accurate and well-understood measurements for approximating influence from social media signals.

Unlike social networks such as LinkedIn and Facebook in which connections between entities are symmetric and typically correspond to a real world connection, Twitter’s underlying data model is fundamentally predicated upon asymmetric following relationships. Another way of thinking about a following relationship is to consider that it’s little more than a subscription to a feed about some content of interest. In other words, when you follow another Twitter user, you are expressing interest in that other user and are opting in to whatever content that user would like to place in your home timeline. As such, Twitter’s underlying network structure can be interpreted as an interest graph and mined for insights about the relative popularity of one user when compared to another.
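
As a rough illustration of the interest graph idea, the following Python sketch stores following relationships as directed (follower, followed) pairs and uses follower counts as a crude popularity signal. The account names and helper functions are made up for the example, and a raw follower count is only a starting point, not the metric this series arrives at.

```python
from collections import defaultdict

# Hypothetical (follower, followed) pairs; direction matters on Twitter.
follows = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "bob"),
    ("dave", "alice"),
]


def follower_counts(edges):
    """Count incoming edges (followers) for each followed account."""
    counts = defaultdict(int)
    for _follower, followed in edges:
        counts[followed] += 1
    return dict(counts)


def is_mutual(edges, a, b):
    """True only if a follows b and b follows a."""
    edge_set = set(edges)
    return (a, b) in edge_set and (b, a) in edge_set


counts = follower_counts(follows)
```

Here `bob` has three followers and `alice` has two, while `is_mutual(follows, "dave", "bob")` is false because the relationship only runs one way. In a symmetric network every connection would be mutual by construction.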
Read more…

Demographics are dead: the new, technical face of marketing

Technology has changed the way we understand targeting and contextual relevance. How will marketing adapt?

Over the past five years, marketing has transformed from a primarily creative process into an increasingly data-driven discipline with strong technological underpinnings.

The central purpose of marketing hasn’t changed: brands still aim to tell a story, to emotionally connect with a prospective customer, with the goal of selling a product or service. But while the need to tell an interesting, authentic story has remained constant, customers and channels have fundamentally changed. Old Marketing took a spray-and-pray approach aimed at a broad, passive audience: agencies created demographic or psychographic profiles for theoretical consumers and broadcast ads on mass-consumption channels, such as television, print, and radio. “Targeting” was primarily about identifying high concentrations of a given consumer type in a geographic area.

The era of demographics is over. Advances in data mining have enabled marketers to develop highly specific profiles of customers at the individual level, using data drawn from actual personal behavior and consumption patterns. Now when a brand tells a story, it has the ability to tailor the narrative in such a way that each potential customer finds it relevant, personally. Users have become accustomed to this kind of sophisticated targeting; broad-spectrum advertising on the Internet is now essentially spam. At the same time, there is still a fine line between “well-targeted” and “creepy.” Read more…

Comments: 4