"data mining" entries

Four short links: 4 October 2012

Vannevar Bush, Topic Transparency, Ancient Maps, and Concussion Sensors

  1. As We May Think (Vannevar Bush) — incredibly prescient piece he wrote for The Atlantic in 1945.
  2. Transparency and Topic Models (YouTube) — a talk from DataGotham 2012, by Hanna Wallach. She uses latent Dirichlet allocation topic models to mine text data in declassified documents where the metadata are useless. She’s working on predicting classification durations (AWESOME!). (via Matt Biddulph)
  3. Slippy Map of the Ancient World — this. is. so. cool!
  4. Technology in the NFL — X2IMPACT’s Concussion Management System (CMS) is a great example of this trend. CMS, when combined with a digital mouth guard, also made by X2, enables coaches to see head impact data in real-time and assess concussions through monitoring the accelerometers in a player’s mouth guard. That data helps teams to decide whether to keep a player on the field or take them off for their own safety. Insert referee joke here.
Comments: 2
Four short links: 30 August 2012

Decoding ToS, Impact Factors are Nonsense, Crappy Open Source Code, and Data Mining History

  1. TOS;DR — terms of service rendered comprehensible. “Make the hard stuff easy” is a great template for good ideas, and this just nails it.
  2. Sick of Impact Factors — typically only 15% of the papers in a journal account for half the total citations. Therefore only this minority of the articles has more than the average number of citations denoted by the journal impact factor. Take a moment to think about what that means: the vast majority of the journal’s papers — fully 85% — have fewer citations than the average. The impact factor is a statistically indefensible indicator of journal performance; it flatters to deceive, distributing credit that has been earned by only a small fraction of its published papers. (via Sci Blogs)
  3. A Generation Lost in the Bazaar (ACM) — Today’s Unix/Posix-like operating systems, even including IBM’s z/OS mainframe version, as seen with 1980 eyes are identical; yet the 31,085 lines of configure for libtool still check if <sys/stat.h> and <stdlib.h> exist, even though the Unixen, which lacked them, had neither sufficient memory to execute libtool nor disks big enough for its 16-MB source code. […] That is the sorry reality of the bazaar Raymond praised in his book: a pile of old festering hacks, endlessly copied and pasted by a clueless generation of IT “professionals” who wouldn’t recognize sound IT architecture if you hit them over the head with it. It is hard to believe today, but under this embarrassing mess lies the ruins of the beautiful cathedral of Unix, deservedly famous for its simplicity of design, its economy of features, and its elegance of execution. (Sic transit gloria mundi, etc.)
  4. History as Science (Nature) — Turchin and his allies contend that the time is ripe to revisit general laws, thanks to tools such as nonlinear mathematics, simulations that can model the interactions of thousands or millions of individuals at once, and informatics technologies for gathering and analysing huge databases of historical information.
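
The impact factor arithmetic in item 2 is easier to see with numbers. Here is a minimal sketch using a hypothetical journal of 200 papers; the citation counts are invented purely to match the quoted 15%/half split and are not taken from the linked post or any real journal.

```python
from statistics import mean, median

# Hypothetical journal, invented for illustration only:
# 30 highly cited papers (15% of 200) and 170 rarely cited ones.
citations = [34] * 30 + [6] * 170

total = sum(citations)

# Share of all citations earned by the top 15% of papers.
top_share = sum(citations[:30]) / total

# The "average paper" figure that an impact factor reports.
average = mean(citations)

# Fraction of papers that fall below that average.
below_average = sum(1 for c in citations if c < average) / len(citations)

print(f"top 15% of papers hold {top_share:.0%} of citations")
print(f"mean citations per paper: {average}")
print(f"median citations per paper: {median(citations)}")
print(f"papers below the mean: {below_average:.0%}")
```

In this toy case the mean is pulled well above what a typical paper receives, which is why the median is usually the more honest summary of a skewed distribution like this one.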
Comments: 3
Four short links: 23 August 2012

Computational Social Science, Infrastructure Drives Design, Narcodrones Imminent, and Muscle Memory

  1. Computational Social Science (Nature) — Facebook and Twitter data drives social science analysis. (via Vaughan Bell)
  2. The Single Most Important Object in the Global Economy (Slate) — Companies like Ikea have literally designed products around pallets: Its “Bang” mug, notes Colin White in his book Strategic Management, has had three redesigns, each done not for aesthetics but to ensure that more mugs would fit on a pallet (not to mention in a customer’s cupboard). (via Boing Boing)
  3. Narco Ultralights (Wired) — it’s just a matter of time until there are no humans on the ultralights. Remote-controlled narcodrones can’t be far away.
  4. Shortcut Foo — a typing tutor for editors, Photoshop, and the command line, to build muscle memory of frequently-used keystrokes. Brilliant! (via Irene Ros)
Comment: 1

Mining the astronomical literature

A clever data project shows the promise of open and freely accessible academic literature.

There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?

NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-’90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It’s something they don’t have to think about all that much.

The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it’s all searchable and downloadable.

Number of papers published in the three main astronomy journals each year. CREDIT: Robert Simpson

The existence of the ADS, along with the arXiv pre-print server, has meant that most astronomers haven’t seen the inside of a brick-built library since the late 1990s.

It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if they followed astronomy’s lead. The fact that the discipline’s literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and sociological research.

Comments: 10
Four short links: 24 May 2012

Maker Tribe, Concept Mapping, Magic Wand, and Site Performance Matters

  1. Last Saturday My Son Found His People at the Maker Faire — aww to the power of INFINITY.
  2. Dictionaries Linking Words to Concepts (Google Research) — Wikipedia entries for concepts, text strings from searches and the oppressed workers down the Text Mines, and a count indicating how often the two were related.
  3. Magic Wand (Kickstarter) — I don’t want the game, I want a Bluetooth magic wand. I don’t want to click the OK button, I want to wave a wand and make it so! (via Pete Warden)
  4. E-Commerce Performance (Luke Wroblewski) — If a page load takes more than two seconds, 40% are likely to abandon that site. This is why you should follow Steve Souders like a hawk: if your site is slower than it could be, you’re leaving money on the table.
Comment: 1
Four short links: 8 February 2012

Text Mining, Unstoppable Sociality, Unicode Fun, and Scholarly Publishing

  1. Mavuno — an open source, modular, scalable text mining toolkit built upon Hadoop. (Apache-licensed)
  2. Cow Clicker — Wired profile of Cow Clicker creator Ian Bogost. I was impressed by “Cow Clickers […] have turned what was intended to be a vapid experience into a source of camaraderie and creativity.” People create communities around social activities, even when they are antisocial. (via BoingBoing)
  3. Unicode Has a Pile of Poo Character (BoingBoing) — this is perfect.
  4. The Research Works Act and the Breakdown of Mutual Incomprehension (Cameron Neylon) — an excellent summary of how researchers and publishers view each other and their place in the world.
Comment

Unstructured data is worth the effort when you’ve got the right tools

Alyona Medelyan and Anna Divoli on the opportunities in chaotic data.

Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.

Comment
Four short links: 13 January 2012

Internet in Culture, Flash Security Tool, Haptic E-Books, and Facebook Mining Private Updates

  1. How The Internet Gets Inside Us (The New Yorker) — at any given moment, our most complicated machine will be taken as a model of human intelligence, and whatever media kids favor will be identified as the cause of our stupidity. When there were automatic looms, the mind was like an automatic loom; and, since young people in the loom period liked novels, it was the cheap novel that was degrading our minds. When there were telephone exchanges, the mind was like a telephone exchange, and, in the same period, since the nickelodeon reigned, moving pictures were making us dumb. When mainframe computers arrived and television was what kids liked, the mind was like a mainframe and television was the engine of our idiocy. Some machine is always showing us Mind; some entertainment derived from the machine is always showing us Non-Mind. (via Tom Armitage)
  2. SWFScan — Windows-only Flash decompiler to find hardcoded credentials, keys, and URLs. (via Mauricio Freitas)
  3. Paranga — haptic interface for flipping through an ebook. (via Ben Bashford)
  4. Facebook Gives Politico Deep Access to Users’ Political Sentiments (All Things D) — Facebook will analyse all public and private updates that mention candidates and an exclusive partner will “use” the results. Remember, if you’re not paying for it then you’re the product and not the customer.
Comment: 1
Four short links: 12 January 2012

Smart Meter Snitches, Company Culture, Text Classification, and Live Face Substitution

  1. Smart Hacking for Privacy — can mine smart power meter data (or even snoop it) to learn what’s on the TV. Wow. (You can also watch the talk). (via Rob Inskeep)
  2. Conditioning Company Culture (Bryce Roberts) — a short read but thought-provoking. It’s easy to create mindless mantras, but I’ve seen the technique that Bryce describes and (when done well) it’s highly effective.
  3. hydrat (Google Code) — a declarative framework for text classification tasks.
  4. Dynamic Face Substitution (FlowingData) — Kyle McDonald and Arturo Castro play around with a face tracker and color interpolation to replace their own faces, in real-time, with those of celebrities such as Brad Pitt and Paris Hilton. Awesome. And creepy. Amen.
Comment: 1