"data mining" entries

Strata Week: Building data startups

Strata registration opens, making money with data, dolphins and cellphones, data in the dirt

In this week's look at the world of data, learn how to build a money-making data startup, register for Strata 2011, and hear of new developments in the mining of offline social networks.

Four short links: 21 October 2010

MySQL as NoSQL, Handmade SLR, Mac App Store, and Datamining Privacy Workshop

  1. Using MySQL as NoSQL — 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients.
  2. Making an SLR Camera from Scratch — amazing piece of hardware devotion. (via hackaday.com)
  3. Mac App Store Guidelines — Apple announce an app store for the Macintosh, similar to its app store for iPhones and iPads. “Mac App” no longer means generic “program”; it has a new and specific meaning: a program that must be installed through the App Store and which has limited functionality (only one can run at a time, it’s full-screen, etc.). The list of guidelines for what kinds of programs you can’t sell through the App Store is interesting. Many have good reasons to be, but “It creates a store inside itself for selling or distributing other software (i.e., an audio plug-in store in an audio app)” is pure greed. Some are afeared that the next step is to make the App Store the only way to install apps on a Mac, a move that would drive me away. It would be a sad day for Mac-lovers if Microsoft were to be the more open solution than Apple. Cf. the Owner’s Manifesto.
  4. Privacy Aspects of Data Mining — CFP for an IEEE workshop in December. (via jschneider on Twitter)
Four short links: 19 October 2010

Positive Gov2, Psychology of Places, Open Source Embedded Devices, and Dilbert on Data

  1. YIMBY — Swedish site for “Yes, In My Back Yard”. Provides an opportunity for the net to aggregate positive desires (“please put a bus stop on my street”, “we want wind power”) rather than simply aggregating complaints. (via cityofsound on Twitter)
  2. Getting People in the Door — a summary of some findings about people’s approaches to the physical layout of shopping space. People like to walk in a loop. They avoid “cul de sacs” that they can see are dead-ends, because they don’t want to get bored walking through the same merchandise twice. Apply these to your next office space.
  3. OpenBricks — embedded Linux framework that provides easy creation of custom distributions for industrial embedded devices. It features a complete embedded development kit for rapid deployment on x86, ARM, PowerPC and MIPS systems.
  4. Dilbert on Data — pay attention, data miners. (via Kevin Marks)
Four short links: 14 October 2010

Google Price Index, The High Cost of Freemium, Literate Programming, Results Clustering

  1. Google Creates New Inflation Measure (The Guardian) — The Google Price Index will be based on the cost of goods sold online and could use real-time search data to forecast official figures. Clever use of unique data, but can the GPI findings be reproduced by another agency? I do like the idea of moving national statistical measures into real-time.
  2. How To Break The Trust of Your Customers In Just One Day — some horrifying revelations about how freemium worked for Chargify and their customers: Over the past year, we discovered that the customer that never paid had the highest support load. […] Everyone’s always talking about freemium, but very few people actually use it, and we discovered this in looking at our customers for the past year. The reality was that less than 0.4% of customers had any sizeable number of free customers on their accounts. (via Hacker News)
  3. Annotated Backbone.js — very readable literate programming. (via Simon Willison)
  4. Carrot2 — open source results clustering engine.
Four short links: 13 October 2010

Data Privacy, Journalism and Dataviz, Web Shell, and Kindle Singles

  1. ‘Scrapers’ Dig Deep for Data on Web (WSJ) — our users’ data comprise a valuable resource to mine and sell, but so do their kidneys. The data world faces serious issues with informed consent, control, and exploitation–it’s not just a shiny new business model, it can also leave people feeling very violated. Again, if you’re not paying for it then you’re the product and not the customer. The majority of humanity is not conscious of the difference between “user” and “customer”. (via Mike Brown on Twitter)
  2. Journalism in the Age of Data (Video) — Stanford video, with annotations and links, on the challenge of using dataviz as a storytelling medium. (via Ben Goldacre on Twitter)
  3. webshell (Github) — open source (Apache-licensed) console utility, requiring node.js, for debugging and understanding HTTP connections. (via Chris Shiflett on Twitter, who prefers it to yesterday’s htty)
  4. Amazon to Launch Kindle Singles (press release) — shorter-form works (think: novellas) as a format to expand publishing market rather than shrink it. Damn near every business book ever written should have been this size instead of 300 pages of tedium.
Four short links: 14 June 2010

Open Data, Open PCR, Open Sara Winge, and Open Source Big Graph Mining

  1. Learning from Libraries: the Literacy Challenge of Open Data (David Eaves) — a powerful continuation of the theme from my Rethinking Open Data post. David observes that dumping data over the fence isn’t enough, we must help citizens engage. We have a model for that help, in the form of libraries: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.
  2. OpenPCR on Kickstarter — In 1983, Kary Mullis first developed PCR, for which he later received a Nobel Prize. But the tool is still expensive, even though the technology is almost 30 years old. If computing grew at the same pace, we would all still be paying $2,000+ for a 1 MHz Apple II computer. Innovation in biotech needs a kick start!
  3. Wingeing It — profile of O’Reilly’s wonderful Sara Winge by the ever fabulous Quinn Norton.
  4. PEGASUS — petascale graph mining toolkit from CMU. See their most recent publication. (via univerself on Delicious)
Four short links: 25 May 2010

European Economic Crisis, Scaling Guardian API, Cheerful Pessimism, and Science Mapping

  1. Lending Merry-Go-Round — these guys have been Australia’s sharpest satire for years, filling the role of the Daily Show. Here they ask some strong questions about the state of Europe’s economies … (via jdub on Twitter)
  2. What’s Powering the Guardian’s Content API — Scala and Solr/Lucene on EC2 is the short answer. The long answer reveals the details of their setup, including some of the indexing tricks that mean Solr can index all their content in just an hour. (via Simon Willison)
  3. What I Learned About Engineering from the Panama Canal (Pete Warden) — I consider myself a cheerful pessimist. I’ve been through enough that I know how steep the odds of success are, but I’ve made a choice that even a hopeless fight in a good cause is worthwhile. What a lovely attitude!
  4. Mapping the Evolution of Scientific Fields (PLoSone) — clever use of data. We build an idea network consisting of American Physical Society Physics and Astronomy Classification Scheme (PACS) numbers as nodes representing scientific concepts. Two PACS numbers are linked if there exist publications that reference them simultaneously. We locate scientific fields using a community finding algorithm, and describe the time evolution of these fields over the course of 1985-2006. The communities we identify map to known scientific fields, and their age depends on their size and activity. We expect our approach to quantifying the evolution of ideas to be relevant for making predictions about the future of science and thus help to guide its development.
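
The network construction described in that abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the toy publication list and the codes in it are hypothetical, and plain connected components stand in for the paper's community finding algorithm, which the abstract does not name.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(publications):
    """Link two codes whenever at least one publication lists both."""
    graph = defaultdict(set)
    for codes in publications:
        unique = set(codes)
        for code in unique:
            graph[code]  # register every code as a node, even if isolated
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Crude stand-in for community finding: group mutually reachable codes."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        components.append(group)
    return components


# Hypothetical toy data: each publication is the list of codes it references.
publications = [
    ["A1", "A2"],
    ["A2", "A3"],
    ["B1", "B2"],
    ["C1"],
]

graph = build_cooccurrence_graph(publications)
fields = connected_components(graph)
```

On real data a proper community detection method would be needed, since a dense co-occurrence graph tends to collapse into one giant connected component.
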
Four short links: 20 May 2010

New Take on Ubicomp, Language Insight, Sexy Viz, and iPad Usability

  1. People are Walking Architecture — presentation by Matt Jones of BERG, taking a new lens to this AR/ubicomp/whatever-it-is-today world. “[Mobile phones are] a whole toy box full of playful, inventive strategies for exploring cities ….”
  2. Lexicalist — insight into geographic and age distribution of language use, based on Twitter data. (via Language Log)
  3. Advanced Visualization Techniques — nice overview of some non-standard visualization techniques. Short shameful confession: I love polar dendrograms with a passion. These techniques are to visualizers as algorithms and data structures to programmers: each is used in specific circumstances and compromises some things to gain in others. (via Flowing Data)
  4. iPad Usability Report (Nielsen-Norman Group) — 93-page report based on user studies. The iPad etched-screen aesthetic does look good. No visual distractions or nerdy buttons. The penalty for this beauty is the re-emergence of a usability problem we haven’t seen since the mid-1990s: Users don’t know where they can click. For the last 15 years of Web usability research, the main problems have been that users don’t know where to go or which option to choose — not that they don’t even know which options exist. With iPad UIs, we’re back to this square one. (via Andrew Savikas)
Four short links: 13 May 2010

Open Facebook, Internet Stats, Handling Interviews, and Textual Relationships

  1. Don’t Simply Build a More Open Facebook, Build a Better One — Most people don’t care so much about whether technology is “open” or “closed” so long as it works. (Case in point: iPhone.) Rather than starting your plans by picking which “open” standards you’ll use, start by designing a better social networking service and then determine how “open” specs will help you build that service. (via David Recordon)
  2. Internet Stats from Google — very nice categorized factoids about internet use, technology, trends, etc. 64% of C-level executives conduct six or more searches per day to locate business information.
  3. Qualitative Methods for IS Research — summary of qualitative methods (interviews, documents, observation data) as applied to IS. Written for academics, so you have to choke back passive voice vomit (sorry, “passive voice vomit must be choked back”) but it’s got lots of useful information on approaches and tools. (via johnny723 on Twitter)
  4. Social Signaling and Language Use — turns out the stopwords like “to”, “be”, and “on” are the ones that indicate manager-subordinate relationships. In so many fields I see again and again that you keep data at each stage of transformation, because transforming for one use prevents others. (via terrycojones on Twitter)
Four short links: 5 March 2010

GMail CRM, Django Best Practices, Stats-Think, and WoW Number Crunching

  1. Rapportive — a simple social CRM built into Gmail. They replace the ads in Gmail with photos, bio, and info from social media sites. (via ReadWrite Web)
  2. Best Practices in Web Development with Django and Python — great set of recommendations. (via Jon Udell’s article on checklists)
  3. Think Like a Statistician Without The Math (Flowing Data) — Finally, and this is the most important thing I’ve learned, always ask why. When you see a blip in a graph, you should wonder why it’s there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper. Numbers are great, but you have to remember that when humans are involved, errors are always a possibility. This is basically how to be a scientist: know the big picture, study the details to find deviations, and always ask “why”.
  4. WoW Armory Data Mining — a blog devoted to data mining on the info from the WoW Armory, which has a lot of data taken from the servers. It’s baseball statistics for World of Warcraft. Fascinating! (via Chris Lewis)