"search" entries

Four short links: 2 August 2010

Search Tips, Web Parsing, DNS Blacklists, Complex Machines

  1. Hidden Features of Google (StackExchange) — rather than Google’s list of search features, here are the features that real (sophisticated) users find useful. My new favourite: the ~ operator for approximate searching. (via Hacker News)
  2. Natural Language Parsing for the Web — JSON API to the Stanford Natural Language Parser. I wonder why the API to the library isn’t an open source library, given the Stanford parser is GPLv2. It’d be super-cool to have this as an EC2 instance, Ubuntu package, or Chef recipe so it’s trivial to add to an existing hosted project.
  3. Taking Back the DNS (Paul Vixie) — defining a spec whereby you can subscribe to blacklists for DNS, as “most new domain names are malicious.”
  4. Building Complex Machines with Lego — I saw the (Lego) Antikythera Mechanism at Sci Foo. It’s as amazing as it looks.
Four short links: 23 July 2010

Reputation Systems, Faceted Search Tutorial, Video Utility, and Chinese Slang

  1. 5 Reputation Missteps (and how to avoid them) (YouTube) — a Google Tech Talk from one of the authors of the O’Reilly-published Building Web Reputation Systems.
  2. Solr on EC2 Tutorial — the tutorial shows how to index Wikipedia with Solr. (via Matt Biddulph)
  3. clive — a command line utility for extracting (or downloading) videos from YouTube and other video sharing Web sites. It was originally written to bypass the Adobe Flash requirement needed to view the hosted videos.
  4. ChinaSmack — how to talk smack online in Chinese. (via BoingBoing)

Search is the Web's fun and wicked problem

"Search Patterns" author Peter Morville looks at the next wave of search and reveals the one innovation that led to a watershed moment

We may think of search as static and mature, but it’s a tool in flux. Developments in mobile, augmented reality, and social graphs signal big changes ahead. In this Q&A, “Search Patterns” author Peter Morville shows how experiments at the periphery and weird ideas will shape search’s future. He also reveals the one semi-recent innovation that unlocked a watershed moment for search (it’s not what you’d expect).

Four short links: 4 February 2010

Personal Ad Preferences, Android Kernel, EC2 Deconstructed, Symbian Opened

  1. Google Ad Preferences — my defaults look reasonable and tailored to my interests. Creepy but kinda cool: I guess that if I have to have ads, they should be ones I’m not going to hate. (via rabble on Twitter)
  2. Android and the Linux Kernel — the Android kernel is forked from the standard Linux kernel, and a Linux kernel maintainer says that Google has made no efforts to integrate. (via Slashdot)
  3. On Amazon EC2’s Underlying Architecture — fascinating deconstruction of the EC2 physical and virtual servers, without resorting to breaking NDAs. (via Hacker News)
  4. First Full Open Source Symbian Release (BBC) — source code will be available for download from the Symbian Foundation web site as of 1400GMT. Nokia bought Symbian for US$410M in 2008 (for comparison, AOL bought Netscape for $4.2B in 1999 but the source code tarball had been escape-podded from the company a year before the deal closed). This makes Symbian more open than Android, says the head of the foundation: “About a third of the Android code base is open and nothing more,” says Williams. “And what is open is a collection of middleware. Everything else is closed or proprietary.” (quote from Wired’s story).

Forget Google, social search is all about mobile

New research from Aardvark shows higher social search use on the mobile side

A new research report from social answering service Aardvark finds that social search is more popular with mobile users. It raises the question: will the mix of social search and mobile apps catalyze search's next evolution?

Four short links: 7 January 2010

London Data, SEO Deathspiral, Subversion Search, Entity Extraction APIs

  1. London Datastore to Launch — the Mayor of London will launch a site full of London data. (via Edd Dumbill)
  2. Google Destroyed the Web — It’s hard to disagree with the basic contention that SEO aimed at Google’s rankings has fucked the web. It’s a vicious circle, too: the more fake content sites are created to game Google, the harder it will be for any new web search startup to filter that effluent and deliver meaningful results in competition to Google. This is a grim feedback loop.
  3. ReposSearch — search Subversion repositories.
  4. Survey of Entity Extraction APIs — he describes the qualitative differences in the APIs and their responses, finding that Evri and OpenAlchemy had the best for his needs.

Robots.Txt and the .Gov TLD

I’m on the board of CommonCrawl.Org, a nonprofit corporation that is attempting to provide a web crawl for use by all. An interesting report just got sent to us about the use of robots.txt files, the mechanism defined by the Robots Exclusion Standard, within the .gov top-level domain. In examining about 32,000 subdomains in .gov, it turns out that at least 1,188 of these have a robots.txt file with a “global disallow,” meaning robots are excluded from indexing this content. Even more curious, on 175 of these sites, while there is a global disallow, there is a specific bypass that allows the Googlebot to index the data.
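
To make the “global disallow with a Googlebot bypass” pattern concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.gov` URL, the `SomeOtherBot` agent name, and the `can_fetch` helper are all hypothetical illustrations, not taken from the report; actual files on .gov sites may express the bypass differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is shut out, except Googlebot,
# whose empty Disallow line means "nothing is disallowed".
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.gov/data/report.html"  # hypothetical URL
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, can_fetch(ROBOTS_TXT, agent, url))
```

Under these assumptions a crawler that honours the standard and is not Googlebot would skip the whole site, which is the situation a shared crawl like ours runs into.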

Four short links: 20 November 2009

Social Network Search for Morons, Bulking Up Bio Data, Better E-Mail, Better Standards

  1. Spokeo — abysmal indictment of society, first prize in mankind’s race to the bottom. Uncover personal photos, videos, and secrets … GUARANTEED! Spokeo deep searches within 48 major social networks to find truly mouth-watering news about friends and coworkers. PS, anybody who gives their gmail username and password to a site that specializes in dishing dirt can only be described as a fucking idiot. (via Jim Stogdill, who was equally disappointed in our species)
  2. Biologists rally to sequence ‘neglected’ microbes (Nature) — The Genomic Encyclopedia of Bacteria and Archaea is a project to sequence genomes from more branches of the evolutionary tree of life. Eisen’s team selected and sequenced more than 100 ‘neglected’ species that lacked close relatives among the 1,000 genomes already in GenBank. The researchers reported earlier this year at the JGI’s Fourth Annual User Meeting that even mapping the first 56 of these microbes’ genomes increased the rate of discovery of new gene and protein families with new biological properties. It also improved the researchers’ ability to predict the role of genes with unknown functions in already sequenced organisms. (via Jonathan Eisen)
  3. Mail Learning: The What and the How (Simon Cozens) — a few things that a really good mail analysis tool needs to do. I hope that my mail client and server do these out of the box in the next five years.
  4. Introducing the Open Web Foundation Agreement — The Open Web Foundation Agreement itself establishes the copyright and patent rights for a specification, ensuring that downstream consumers may freely implement and reuse the licensed specification without seeking further permission. In addition to the agreement itself, we also created an easy-to-read “Deed” that provides a high level overview of the agreement. Applying the open source approach to better standards.

Real Time Search with Wowd: A Conversation with CEO Mark Drummond

During last year’s Summit I had the good fortune to interview Kevin Kelly (see Technology is the Seventh Kingdom of Life). In the interview Kevin made the case that we have only scratched the surface on how to coordinate group activities on the web: there must be hundreds of effective methods to run an auction, crowdsource products, and so on. So why stop at eBay and Threadless?

Four Short Links: 25 August 2009

Reverse Search, PDF Stripping, Flash Visualization, Failure

  1. Tineye — reverse search engine; you upload an image and they find you similar images so you know where else it’s used. Check out their cool searches.
  2. PDF Pirate — upload a PDF and this web site will give it back to you minus the restrictions on copying/printing/etc.
  3. Flare — an ActionScript library for creating visualizations that run in the Adobe Flash Player. BSD-licensed, modelled on Prefuse. When there’s a visualisation library for every platform, will we start to get people who know how to make them?
  4. The Importance of Failure (Marco Tabini) — This is a point that I don’t often hear made when people talk about failure; the moral behind a failure-related story is usually about preventing it, or dealing with the aftermath, but not about the fact that sometimes things go bad despite your best efforts, and all the careful risk management and contingency planning won’t keep you from going down in flames. This is important, because it forces every person to establish a risk threshold that they are willing to accept in every one of their life efforts.