"search" entries

Four short links: 7 August 2009

Recovery.gov, Meme tracking, RFID Scans, Open Source Search Engines

  1. Defragging the Stimulus: “each [recovery] site has its own silo of data, and no site is complete. What we need is a unified point of access to all sources of information: firsthand reports from Recovery.gov and state portals, commentary from StimulusWatch and MetaCarta, and more.” The post suggests that Recovery.gov should be the hub for this presently-decentralised pile of recovery data.
  2. Memetracker — site accompanying the research written up by the New York Times: “Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs […] For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours […] a relative handful of blog sites are the quickest to pick up on things that later gain wide attention on the Web.” This confirms that blogs and traditional media have a symbiotic relationship, not a parasitic one. (via Stats article in NY Times)
  3. Feds at DefCon Alarmed After RFIDs Scanned (Wired) — RFID badges make for convenient security, and for convenient attack. Black hats can read your security cards from 2 or 3 feet away, and few in government are aware of the attack vector. To help prevent surreptitious readers from siphoning RFID data, a company named DIFRWear was doing brisk business at DefCon selling leather Faraday-shielded wallets and passport holders lined with material that prevents readers from sniffing RFID chips in proximity cards.
  4. A Comparison of Open Source Search Engines and Indexing Twitter — Detailed write-up of the open source search options and how they stack up on a pile of Tweets. “While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found: Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep … And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance.” (via joshua on Delicious)
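
As a rough illustration of the phrase tracking described in item 2 above, the sketch below records when each phrase first appears in each kind of source and reports how far blogs trail news outlets. The observations, the source labels, and the helper names `first_seen` and `blog_lag_hours` are all hypothetical; this is not the Cornell researchers' method, which is not described here.

```python
from datetime import datetime

# Hypothetical observations: (phrase, source type, time the phrase was seen).
OBSERVATIONS = [
    ("phrase a", "news", datetime(2009, 8, 1, 9, 0)),
    ("phrase a", "blog", datetime(2009, 8, 1, 11, 30)),
    ("phrase a", "blog", datetime(2009, 8, 1, 14, 0)),
    ("phrase b", "blog", datetime(2009, 8, 2, 8, 0)),
    ("phrase b", "news", datetime(2009, 8, 2, 10, 0)),
]


def first_seen(observations):
    """Map (phrase, source type) to the earliest timestamp observed."""
    earliest = {}
    for phrase, source, when in observations:
        key = (phrase, source)
        if key not in earliest or when < earliest[key]:
            earliest[key] = when
    return earliest


def blog_lag_hours(observations):
    """Hours by which blogs trail news for each phrase (negative: blogs led)."""
    earliest = first_seen(observations)
    lags = {}
    for phrase, source in earliest:
        if source != "news":
            continue
        blog_time = earliest.get((phrase, "blog"))
        if blog_time is not None:
            delta = blog_time - earliest[(phrase, "news")]
            lags[phrase] = delta.total_seconds() / 3600
    return lags


if __name__ == "__main__":
    for phrase, lag in sorted(blog_lag_hours(OBSERVATIONS).items()):
        print(f"{phrase}: blogs trail news by {lag:+.1f} hours")
```

Averaging such per-phrase lags over many phrases is one plausible way to arrive at a single "typical" figure like the one quoted, though the study may have computed it differently.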
Four short links: 14 July 2009

Twenty Questions, CC Pix, INSERT INTO WEB, and Wash Your Hands!

  1. Twenty Questions about GPLv3 (Jacob Kaplan-Moss) — twenty very challenging questions about the GPLv3. For example: “foo.js is a JavaScript library released under the GPLv3. bar.js is a library with all rights reserved. For performance reasons, I would like to minimize all my site’s JavaScript into a single compressed file called foobar.js. If I distribute this file, must I also distribute bar.js under the GPL?”
  2. CC Searching within Google Image Search — exactly what it sounds like. (via waxy)
  3. YQL INSERT INTO: insert into {table} (status,username,password) values ("new tweet from YQL", "twitterusernamehere","twitterpasswordhere"). That’s too cool. (via Simon Willison)
  4. CleanWell — very low-cost recyclable enviro-friendly antimicrobials to battle third-world disease. Met the founder at Sci Foo. He said women wash hands more than men, because women enter bathrooms in pairs. The single easiest way to increase handwashing compliance is to put sinks and basins outside the room, in public view.
Four short links: 10 July 2009

Network File System, Internet Use, Lovelace Comic, Search User Interfaces

  1. Ceph — open source distributed filesystem from UCSC. Ceph is built from the ground up to seamlessly and gracefully scale from gigabytes to petabytes and beyond. Scalability is considered in terms of workload as well as total storage. Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file, or write to the same directory: usage scenarios that bring typical enterprise storage systems to their knees. (via joshua on delicious)
  2. Daily Internet Activities, 2000-2009 — Pew Charitable Trusts’ Internet usage survey. We’ve finally broken 50% of Americans using the Internet daily. Twitter is almost a rounding error. (via dhowell on Twitter)
  3. The Thrilling Adventures of Lovelace and Babbage — fantastic comic, with end-notes that explain how Babbage and Lovelace’s lives and works are reflected in the action of the comic. (via suw on Twitter)
  4. Search User Interfaces — full text of this book about the different (successful and un-) interfaces to search. (via sebchan on Twitter)

Bing's Sanaz Ahari on Query Level Categorization (1 of 2)

A couple of weeks ago Bing had a small search summit for analysts, bloggers, SEO experts, entrepreneurs and advertisers. It was held in Bellevue; they put us up in the hotel and fed us. While there we received demos from Bing project teams. I was able to snag an interview with Sanaz Ahari, Lead PM on Bing. She led the team that developed the categories you see on a Bing web search. The interview was based on the slides from her presentation at the event. I have posted the significant images from her slides. The first portion of the interview focuses on how the Bing team handles Query level categorization and some of the problems they face.

Four short links: 18 June 2009

Weaker Copyright Good, YQL.gov, GeoSPARQL, Happiness

  1. Harvard Study Finds Weaker Copyright Protection Has Benefited Society (Michael Geist) — Given the increase in artistic production along with the greater public access, the authors conclude that “weaker copyright protection, it seems, has benefited society.” This is consistent with the authors’ view that weaker copyright is “unambiguously desirable if it does not lessen the incentives of artists and entertainment companies to produce new works.” (read the original paper)
  2. Using Public Data for Good With the Power of YQL: “The first part is a new batch of YQL tables providing data on the U.S. government, earthquake data, and the non-profit micro-lender Kiva. The second part is an incredibly easy way to render YQL queries on websites. After all, what good is data that no one can see?”
  3. GeoSPARQL — RDF meets geo goodness. SELECT ?s ?p ?o WHERE { ?s gn:name "Dallas" . ?s ?p ?o } (via the geowanking mailing list)
  4. How To Be Happy in Business — this Venn diagram makes me happy. (via Ned Batchelder)
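
To unpack the query quoted in item 3 above: it first binds ?s to every subject whose gn:name is "Dallas", then returns every triple about those subjects. Here is a minimal pure-Python sketch of that two-pattern join over a toy in-memory list of triples. The triples and the `describe_by_name` helper are invented for illustration; a real setup would send the query to a SPARQL endpoint rather than filter a list.

```python
# Toy triple store: (subject, predicate, object). All data here is invented.
TRIPLES = [
    ("place:1", "gn:name", "Dallas"),
    ("place:1", "gn:countryCode", "US"),
    ("place:2", "gn:name", "Austin"),
    ("place:2", "gn:countryCode", "US"),
    ("place:3", "gn:name", "Dallas"),
    ("place:3", "gn:featureClass", "P"),
]


def describe_by_name(triples, name):
    """Mimic: SELECT ?s ?p ?o WHERE { ?s gn:name <name> . ?s ?p ?o }"""
    # First pattern: bind ?s to every subject carrying the requested gn:name.
    subjects = {s for s, p, o in triples if p == "gn:name" and o == name}
    # Second pattern: return every triple whose subject is one of those bindings.
    return [(s, p, o) for s, p, o in triples if s in subjects]


if __name__ == "__main__":
    for row in describe_by_name(TRIPLES, "Dallas"):
        print(row)
```

Note that two different subjects share the name here, so the result mixes triples from both; the query as written does the same.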

Search for Developers

Vanessa Fox just posted her slides from her talk Diagnosing Technical Issues With Search Engine Optimization. They are packed with handy SEO/SEM suggestions, checklists and resources. It’s worth going through at least once.

Google Squared is an Exponential Improvement in Search

One of the things I’ve learned about Google is that the most amazing things will come out of them with barely a whisper of fanfare. Such is the case with Google Squared, a new Google Labs tool that was released today. What does Google Squared do? It organizes and tables information from searches for you in a way that makes it much more useful.

Google Engineering Explains Microformat Support in Searches

Today, Google is releasing support for parsing and display of microformat data in their search results. While the initial launch will be limited to a specific set of partners (including LinkedIn, Yelp and CNet reviews), the intent is that very quickly, anyone who marks their pages up with the appropriate microformat data will be able to make their information understandable…
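
For a sense of what "marking up a page" means to a parser, here is a small sketch that pulls a few hCard properties out of HTML using only Python's standard library. The sample markup, the `HCardParser` class, and the choice of the `fn`, `title`, and `org` properties are assumptions made for illustration; this says nothing about how Google's own parser works or which properties it reads.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with hCard class names.
SAMPLE = """
<div class="vcard">
  <span class="fn">Example Person</span>,
  <span class="title">Engineer</span> at
  <span class="org">Example Corp</span>
</div>
"""


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names a known property."""

    PROPERTIES = {"fn", "title", "org"}

    def __init__(self):
        super().__init__()
        self.stack = []   # property name (or None) for each open element
        self.fields = {}  # property name -> extracted text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only keep text that sits directly inside a property element.
        if self.stack and self.stack[-1]:
            prop = self.stack[-1]
            self.fields[prop] = self.fields.get(prop, "") + data.strip()


if __name__ == "__main__":
    parser = HCardParser()
    parser.feed(SAMPLE)
    print(parser.fields)
```

The sketch assumes well-nested markup without void elements such as `<br>`; a production parser would need to handle those and nested properties.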

Four short links: 29 Apr 2009

4chan, urban redesign, 3d printing, python

  1. Moot Wins, Time Inc. Loses — summary of how the 4chan group Anonymous rigged the voting in Time’s 100 Most Influential poll to not just put their man at the top, but also spell an in-joke with the initial letters of the first 21 people. Time tried weakly to prevent the vote-rigging, and ReCAPTCHA gave the Internet scallywags their biggest setback, but check out how they automated as much as possible so that human effort was targeted most effectively. It’s the same mindset that built Google’s project management, ops, and dev systems. Notice how they tried to game ReCAPTCHA, a collective intelligence app whose users train the system to read OCRed words, by essentially outvoting genuine users so that every word was read as “penis”. Collective intelligence should never be the only security/discovery/etc. feature because such apps are often vulnerable to coordinated action.
  2. The old mint in downtown SF painted by 7 perfectly mapped HD projectors — looks absolutely spectacular. I love the combination of permanent and fleeting, architecture and infotexture. (via BoingBoing)
  3. 3-D Printing Hits Rock-bottom Prices With Homemade Ceramics Mix (Science Daily) — University of Washington researchers invent, and give away, a new 3D printer supply mix that costs under a dollar a pound (versus current commercial mixes of $30-50/pound).
  4. Haystack and Whoosh Notes (Richard Crowley) — notes on installing the search framework Haystack and the search back-end Whoosh, both pure Python. It’s a quick get-up-and-go so you can add quite sophisticated search to your Django apps. (via Simon Willison)
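
The weakness noted in item 1 above can be shown with a toy majority-vote transcription scheme. This is a sketch of the general idea only, not of how ReCAPTCHA actually aggregates answers; the `consensus` function, the vote counts, and the words are hypothetical.

```python
from collections import Counter


def consensus(votes):
    """Return the most common transcription among the submitted votes."""
    return Counter(votes).most_common(1)[0][0]


# Genuine users mostly agree on the scanned word.
honest_votes = ["harbour"] * 8 + ["harbor"] * 2

# A coordinated group submits the same wrong answer many times.
attack_votes = ["bogus"] * 25

if __name__ == "__main__":
    print(consensus(honest_votes))
    print(consensus(honest_votes + attack_votes))
```

Because the tally trusts every vote equally, the result follows whoever submits the most answers, which is why a second, independent check is worth having.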

Practical Tips for Government Web Sites (And Everyone Else!) To Improve Their Findability in Search

In an earlier post, I said that ensuring content on government sites can be easily found in search engines is key to government opening its data to citizens, being more transparent, and improving the relationship between citizens and government in our web 2.0 world. Architecting sites to be search engine friendly, particularly sites with as much content and legacy code as those the government manages, can be a resource-intensive process that takes careful long-term planning. But two keys are assessing who the audience is and what they’re searching for, and ensuring the site architecture is easily crawlable…