Four short links: 7 August 2009

Recovery.gov, Meme tracking, RFID Scans, Open Source Search Engines

  1. Defragging the Stimulus — each [recovery] site has its own silo of data, and no site is complete. What we need is a unified point of access to all sources of information: firsthand reports from Recovery.gov and state portals, commentary from StimulusWatch and MetaCarta, and more. Suggests that Recovery.gov should be the hub for this presently decentralised pile of recovery data.
  2. Memetracker — site accompanying the research written up by the New York Times as Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs […] For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours […] a relative handful of blog sites are the quickest to pick up on things that later gain wide attention on the Web. Confirming that blogs and traditional media have a symbiotic relationship, not a parasitic one. (A minimal sketch of this kind of phrase tracking appears after the list.) (via Stats article in NY Times)
  3. Feds at DefCon Alarmed After RFIDs Scanned (Wired) — RFID badges make for convenient security, and for convenient attack. Black hats can read your security cards from 2 or 3 feet away, and few in government are aware of the attack vector. To help prevent surreptitious readers from siphoning RFID data, a company named DIFRWear was doing brisk business at DefCon selling leather Faraday-shielded wallets and passport holders lined with material that prevents readers from sniffing RFID chips in proximity cards.
  4. A Comparison of Open Source Search Engines and Indexing Twitter — Detailed write-up of the open source search options and how they stack up on a pile of Tweets. While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found: Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep … And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance. (A toy indexing-and-ranking sketch, showing why such numbers are hard to compare, also appears after the list.) (via joshua on Delicious)
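
The phrase-tracking idea behind Memetracker is easy to sketch at toy scale: pull short quoted phrases out of each article, normalise near-identical variants, and record when each kind of source first used a phrase, so the lead/lag between mainstream outlets and blogs can be measured. The sketch below only illustrates that idea on invented data; it is not the Cornell pipeline, and the article format, source labels, and function names are assumptions made for the example.

    import re
    from collections import defaultdict
    from datetime import datetime

    QUOTE_RE = re.compile(r'"([^"]{10,120})"')  # short quoted phrases are the unit being tracked

    def normalize(phrase):
        """Lowercase and strip punctuation so near-identical variants collapse together."""
        return re.sub(r"[^a-z0-9 ]", "", phrase.lower()).strip()

    def first_appearances(articles):
        """Map each normalised phrase to the earliest time each source type used it."""
        seen = defaultdict(dict)
        for art in articles:  # each article: {"source": "blog" or "mainstream", "time": datetime, "text": str}
            for raw in QUOTE_RE.findall(art["text"]):
                phrase = normalize(raw)
                prev = seen[phrase].get(art["source"])
                if prev is None or art["time"] < prev:
                    seen[phrase][art["source"]] = art["time"]
        return seen

    def average_lag_hours(appearances):
        """Average (blog time minus mainstream time), in hours, for phrases both picked up."""
        lags = [
            (times["blog"] - times["mainstream"]).total_seconds() / 3600
            for times in appearances.values()
            if "blog" in times and "mainstream" in times
        ]
        return sum(lags) / len(lags) if lags else None

    articles = [
        {"source": "mainstream", "time": datetime(2009, 8, 7, 9, 0),
         "text": 'The senator said "we cannot afford to wait any longer" on Friday.'},
        {"source": "blog", "time": datetime(2009, 8, 7, 11, 30),
         "text": 'Quote of the day: "we cannot afford to wait any longer".'},
    ]
    print(average_lag_hours(first_appearances(articles)))  # 2.5 for this toy pair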
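
On the search-engine comparison, the moving parts every engine in that list shares (an inverted index plus a ranking function shaped by the analyzer) fit in a few lines, which also shows why benchmark numbers are so sensitive to configuration: changing nothing but how hashtags are tokenised changes the ranking. The toy below is not any of the engines named above; the TinyIndex class, the TF-IDF weighting, and the sample tweets are all simplifications invented for illustration.

    import math
    import re
    from collections import Counter, defaultdict

    def tokenize(text, keep_hashtags=False):
        """Toy analyzer: lowercase word split; optionally keep #hashtags as distinct terms."""
        pattern = r"#?[a-z0-9]+" if keep_hashtags else r"[a-z0-9]+"
        return re.findall(pattern, text.lower())

    class TinyIndex:
        """Minimal in-memory inverted index with TF-IDF ranking, for illustration only."""
        def __init__(self, tweets, **analyzer_opts):
            self.tweets = tweets
            self.analyzer_opts = analyzer_opts
            self.postings = defaultdict(Counter)  # term -> Counter of doc_id -> term frequency
            for doc_id, tweet in enumerate(tweets):
                for term in tokenize(tweet, **analyzer_opts):
                    self.postings[term][doc_id] += 1

        def search(self, query, k=3):
            n = len(self.tweets)
            scores = Counter()
            for term in tokenize(query, **self.analyzer_opts):
                docs = self.postings.get(term)
                if not docs:
                    continue
                idf = math.log(n / len(docs))  # rarer terms count for more
                for doc_id, tf in docs.items():
                    scores[doc_id] += tf * idf
            return [(self.tweets[d], round(s, 3)) for d, s in scores.most_common(k)]

    tweets = [
        "Indexing 1.6M tweets with #lucene tonight",
        "lucene vs sphinx benchmarks are apples to oranges",
        "relevance matters more than raw speed for vertical search",
    ]
    for opts in ({"keep_hashtags": False}, {"keep_hashtags": True}):
        print(opts, TinyIndex(tweets, **opts).search("lucene benchmarks"))

The two analyzer settings return different result lists for the same query, which is the small-scale version of the isolation problem the article complains about: numbers from differently configured engines are not directly comparable.
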
  • Ken Krugler (http://ken-blog.krugler.org)

    Hi Nat,

    Re comparing OSS search engines – that generated a lot of commentary on the various project lists. The main issue is that to really do the comparison, you need to get the help of each solution’s community to ensure you’ve configured things properly…and that’s a lot of work.

    Less critical (but more common in the search world) is the challenge of actually comparing the quality of results. Most people use TREC, but for many applications the quality of the results will depend heavily on weightings of different fields, stemming, tokenization, stop words, etc.

    But it was great to have somebody throw up a rough draft of this comparison :) I’d love to see each community work to optimize results for their particular solution.

    – Ken
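
To make the evaluation point concrete: a TREC-style comparison ultimately scores each configuration's ranked output against human relevance judgments, for example with precision at k. The snippet below is a minimal sketch of that scoring step only; the document ids, the judged-relevant set, and the configuration labels are all invented for illustration.

    def precision_at_k(ranked_ids, relevant_ids, k=5):
        """Fraction of the top-k results that a human judged relevant."""
        top = ranked_ids[:k]
        return sum(1 for doc_id in top if doc_id in relevant_ids) / k if top else 0.0

    # Hypothetical ranked output of two engine configurations on the same query,
    # plus a hand-judged set of relevant tweet ids (all invented for this example).
    judged_relevant = {3, 7, 11}
    config_a = [3, 5, 7, 2, 9]   # e.g. stemming on, stop words removed
    config_b = [5, 2, 9, 3, 14]  # e.g. stemming off, stop words kept

    for name, ranking in [("config A", config_a), ("config B", config_b)]:
        print(name, precision_at_k(ranking, judged_relevant))

Swapping a single analyzer choice can move a score like this noticeably on small collections, which is why each project's community needs a say in configuration before a cross-engine number means much.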