ENTRIES TAGGED "data mining"
Four short links: 25 February 2011
Banshee Bucks, Log Mining, Visualization Secrets, and Repression Tools
- Canonical’s New Plan for Banshee — Canonical produces the Linux distribution Ubuntu. They will distribute the popular iTunes-alike Banshee, but instead of the standard Amazon store plugin (which generates much $ in affiliate revenue for the GNOME Foundation) they will ship Canonical’s own Amazon store plugin and keep 75% of the revenue (25% going to the GNOME Foundation). They’re legally within their rights, and it underscores for me how the goal of providing freedom from control is incompatible with a goal of making money. Free and open source software gives self-determination with software, and that includes the right to replace your money pump with theirs.
- Oluolu — an open source query log mining tool which works on Hadoop. This tool provides resources to add new features to search engines. Concretely Oluolu supports automatic dictionary creation such as spelling correction, context queries or frequent query n-grams from query log data. The dictionaries are applied to search engines to add features such as ‘did you mean’ or ‘related keyword suggestion’ service in search engines. (via Matt Biddulph on Delicious)
- Information is Beautiful Process (David McCandless) — David’s process for creating his beautiful and moving visualizations.
- Facebook for Repressive Regimes — The purpose of this blog post is not to help repressive regimes use Facebook better, but rather to warn activists about the risks they face when using Facebook. (via Justine Sanderson on Delicious)
Four short links: 26 January 2011
Identifying Communities, Web Principles, Wiring Library, and Instapaper Interview
- Find Communities — algorithm for uncovering communities in networks of millions of nodes, for producing identifiable subgroups as in LinkedIn InMaps. (via Matt Biddulph’s Delicious links)
- Seven Ways to Think Like The Web (Jon Udell) — seven principles that will head off a lot of mistakes. They should be seared into the minds of anyone working in the web. 2. Pass by reference rather than by value. [pass URLs, not copies of data] [...] Why? Nobody else cares about your data as much as you do. If other people and other systems source your data from a canonical URL that you advertise and control, then they will always get data that’s as timely and accurate as you care to make it.
- Wire It — an open-source JavaScript library to create web wirable interfaces for dataflow applications, visual programming languages, graphical modeling, or graph editors. (via Pete Warden)
- Interview with Marco Arment (Rands in Repose) — Most people assume that online readers primarily view a small number of big-name sites. Nearly everyone who guesses at Instapaper’s top-saved-domain list and its proportions is wrong. The most-saved site is usually The New York Times, The Guardian, or another major traditional newspaper. But it’s only about 2% of all saved articles. The top 10 saved domains are only about 11% of saved articles. (via Courtney Johnston’s Instapaper Feed)
Strata gems: What your inbox knows
Mining implicit data trails makes CRM more effective
One of the richest sources of data exhaust, email logs contain valuable information. When added to data from a traditional CRM, email analytics can provide a much fuller picture of your company's relationships and activity.
Four short links: 17 December 2010
Systems Programming, Peer Review, Web Mining, Facebook Design
- Down the ls(1) Rabbit Hole — exactly how ls(1) does what it does, from logic to system calls to kernel. This is the kind of deep understanding of systems that lets great programmers cut great code. (via Hacker News)
- Towards a scientific concept of free will as a biological trait: spontaneous actions and decision-making in invertebrates (Royal Society) — peer-reviewed published paper that was initially reviewed and improved in Google Docs and got comments there, in FriendFeed, and on the author’s blog. The bitter irony: Royal Society charged him €2000 to make it available for free download. (via Fabiana Kubke)
- Bixo — an open source web mining toolkit. (via Matt Biddulph on Delicious)
- How Facebook Does Design — podcast (with transcript) with stories about how tweaking design improved the user activity on Facebook. One of the designers thought closing your account should be more like leaving summer camp (you know a place which has all your friends, and you don’t want to leave.) So he created this page above for deactivation which has all your friends waving good-bye to you as you deactivate. Give you that final tug of the heart before you leave. This reduced the deactivation rate by 7%.
Four short links: 3 December 2010
Snake Oil, JSON v XML, Pac Man, and the Full Stack
- Data is Snake Oil (Pete Warden) — data is powerful but fickle. A lot of theoretically promising approaches don’t work because there’s so many barriers between spotting a possible relationship and turning it into something useful and actionable. This is the pin of reality which deflates the bubble of inflated expectations. Apologies for the camel’s nose of rhetoric poking under the metaphoric tent.
- XML vs the Web (James Clark) — resignation and understanding from one of the markup legends. I think the Web community has spoken, and it’s clear that what it wants is HTML5, JavaScript and JSON. XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire. (via Simon Willison)
- Understanding Pac Man Ghost Behaviour — The ghosts’ AI is very simple and short-sighted, which makes the complex behavior of the ghosts even more impressive. Ghosts only ever plan one step into the future as they move about the maze. Whenever a ghost enters a new tile, it looks ahead to the next tile that it will reach, and makes a decision about which direction it will turn when it gets there. Really detailed analysis of just one component of this very successful game. (via Hacker News)
- The Full Stack (Facebook) — we like to think that programming is easy. Programming is easy, but it is difficult to solve problems elegantly with programming. I like to think that a CS education teaches you this kind of “full stack” approach to looking at systems, but I suspect it’s a side-effect and not a deliberate output. This is the core skill of great devops: to know what’s happening up and down the stack so you’re not solving a problem at level 5 that causes problems at level 3.
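The one-step lookahead described in the Pac Man item above can be sketched in a few lines. This is only an illustration under stated assumptions, not the game's actual code: the maze encoding, the `choose_turn` name, the no-reversing rule, the straight-line distance metric, and the up/left/down/right tie-break order are all choices made for this sketch rather than details given in the linked description.

```python
from math import hypot

# Direction vectors as (dx, dy). Dict order doubles as the tie-break
# order (up, left, down, right), which is an assumption of this sketch.
DIRECTIONS = {"up": (0, -1), "left": (-1, 0), "down": (0, 1), "right": (1, 0)}
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def choose_turn(maze, next_tile, heading, target):
    """Decide which way to turn on reaching next_tile.

    maze      -- list of equal-length strings, '#' marks a wall (hypothetical encoding)
    next_tile -- (x, y) of the tile the ghost is about to enter
    heading   -- direction the ghost is currently travelling
    target    -- (x, y) tile the ghost is trying to reach
    Returns a direction name, or None if every non-reversing exit is blocked.
    """
    x, y = next_tile
    best, best_dist = None, float("inf")
    for name, (dx, dy) in DIRECTIONS.items():
        if name == OPPOSITE[heading]:
            continue  # assumed rule: the ghost never reverses at a tile
        nx, ny = x + dx, y + dy
        if not (0 <= ny < len(maze) and 0 <= nx < len(maze[ny])):
            continue  # off the grid
        if maze[ny][nx] == "#":
            continue  # wall
        # Only the tile one step beyond next_tile is scored: no deeper search.
        dist = hypot(target[0] - nx, target[1] - ny)
        if dist < best_dist:
            best, best_dist = name, dist
    return best


corridor = ["#####", "#...#", "#.#.#", "#...#", "#####"]
open_room = ["#####", "#...#", "#...#", "#...#", "#####"]
corner_turn = choose_turn(corridor, (1, 1), "left", (3, 3))
junction_turn = choose_turn(open_room, (2, 2), "up", (3, 3))
```

Because each candidate exit is scored only by the tile immediately beyond it, the ghost can happily turn toward a dead end that merely looks closer to the target, which is the short-sightedness the article points to.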
Four short links: 1 November 2010
Crap Phones, HTML Editors, Digital Rights Minimization, and Data Munging
- The Most Popular Phone in the World (Gizmodo) — I have a mate who does prototyping R&D type stuff at a telco and this is his phone. “Why’d you carry a crap phone like that?” “Because this is the most popular phone with our customers.” The Gizmodo article talks about an upcoming Nokia that looks very promising: full keyboard, camera, et al. for under $100. (via Andrew Hedges on Twitter)
- Aloha Editor — very nice open source (AGPL3) HTML5 text editor widget for web apps. (via Jessy Cowan-Sharp on Twitter)
- How Do We Solve a Problem Like Geographic Restrictions — if you’re building a new business in the US around ebooks, digital music, or digital video, then be aware that your international uptake will be absolutely buggerized by rights issues. YouTube is the only US media site that doesn’t suck for overseas users: don’t rave to us about Hulu, it’s inaccessible to the rest of the world. (via Liza Daly on Twitter)
- Needlebase — tool with AI-type smarts to help you merge, munge, and export data. Check out Thread, the query language, for an interesting way of querying graphs. Was made by ITA Software, now owned by Google. Wonder what it’ll be wrapped into or released as …
Strata Week: Building data startups
Strata registration opens, making money with data, dolphins and cellphones, data in the dirt
In this week's look at the world of data, learn how to build a money-making data startup, register for Strata 2011, and hear of new developments in the mining of offline social networks.
Four short links: 21 October 2010
MySQL as NoSQL, Handmade SLR, Mac App Store, and Datamining Privacy Workshop
- Using MySQL as NoSQL — 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients.
- Making an SLR Camera from Scratch — amazing piece of hardware devotion. (via hackaday.com)
- Mac App Store Guidelines — Apple announce an app store for the Macintosh, similar to its app store for iPhones and iPads. “Mac App” no longer means generic “program”, it has a new and specific meaning, a program that must be installed through the App Store and which has limited functionality (only one can run at a time, it’s full-screen, etc.). The list of guidelines for what kinds of programs you can’t sell through the App Store is interesting. Many are there for good reasons, but the ban on any app where “It creates a store inside itself for selling or distributing other software (i.e., an audio plug-in store in an audio app)” is pure greed. Some are afeared that the next step is to make the App Store the only way to install apps on a Mac, a move that would drive me away. It would be a sad day for Mac-lovers if Microsoft were to be the more open solution than Apple. cf the Owner’s Manifesto.
- Privacy Aspects of Data Mining — CFP for an IEEE workshop in December. (via jschneider on Twitter)
Four short links: 14 October 2010
Google Price Index, The High Cost of Freemium, Literate Programming, Results Clustering
- Google Creates New Inflation Measure (The Guardian) — The Google Price Index will be based on the cost of goods sold online and could use real-time search data to forecast official figures. Clever use of unique data, but can the GPI findings be reproduced by another agency? I do like the idea of moving national statistical measures into real-time.
- How To Break The Trust of Your Customers In Just One Day — some horrifying revelations about how freemium worked for Chargify and their customers: Over the past year, we discovered that the customer that never paid had the highest support load. [...] Everyone’s always talking about freemium, but very few people actually use it, and we discovered this in looking at our customers for the past year. The reality was that less than 0.4% of customers had any sizeable number of free customers on their accounts. (via Hacker News)
- Annotated Backbone.js — very readable literate programming. (via Simon Willison)
- Carrot2 — open source results clustering engine.