With a modern search engine and smart planning, web sites can provide visitors with a better search experience than Google. For instance, Google may well turn up interesting results if you search for a certain kind of shirt, but a well-designed clothing site can also pull up related trousers, skirts, and accessories. It’s not Google’s job to understand the intricate interrelationships of data on a particular web property, but the site’s own team can constantly tune searches to reflect what the site has to offer and what its visitors uniquely need.
Hence the importance of search engines like Solr, which is based on the Lucene library. Both are open source Apache projects, backed commercially by Lucid Imagination, a company founded to commercialize the underlying technology. I attended parts of Lucid Imagination’s conference this week, Lucene Revolution, and found Lucene evolving in the same directions as much of the computer industry.
Wait till they get big
In his opening remarks, CEO Paul Doscher showed some statistics from the sign-ups, indicating that many of the 350 attendees were new to Lucene and Solr: one third had less than a year of experience. That explains why turnout for the regular tracks was higher than for the new “big data” track on advanced processing and performance issues, which I had expected to draw more participants. Speakers in the big data track had some fascinating applications to show off, suggesting that this is a case of the future not yet being evenly distributed.
For instance, Mark Davis gave a fast-paced presentation on the use of Solr along with Hadoop, <a href="http://mahout.apache.org/">Mahout</a>, and systems hosting GPUs at the information processing firm Kitenga. A RESTful API from LucidWorks Enterprise gives Solr access to Hadoop to run jobs. Glenn Engstrand described how Zoosk, the “Romantic Social Network,” keeps slow operations on the update side of the operation so that searches can be simple and fast. As in many applications, Solr at Zoosk pulls information from MySQL. Other tools they use include the High-speed ObjectWeb Logger (HOWL) to log transactions and RabbitMQ for auto-acknowledged messaging. HOWL is also useful for warming Solr’s cache with recent searches, because certain operations flush the cache.
Along these lines, Apache has released a replication tool called Solr Cloud that is supposed to make it much easier to manage sharding (partitioning) and multiple servers in Solr. Lucid Imagination used the show to announce its LucidWorks Big Data platform, now accepting beta applicants, which will allow organizations to do pretty much what Davis described in his talk without having to configure all the tools on local systems. I suspect that the first uses of this cloud service will be restricted to early adopters, but that next year both the “big data” presentations and LucidWorks Big Data will be popular.
The flexibility of a good search
Several presenters pointed out that Google has spoiled users and they expect every commercial site, health provider, or other major organization to provide a local site with Google-like features, including auto-completion and auto-suggestion, fuzzy searches and spelling correction (“Did you mean to search for…?”), and of course highly relevant “give me what I’m thinking of” search results.
Many companies offer search solutions (O’Reilly actually has a book on another open source project with some very sophisticated back-end features, Introduction to Search with Sphinx), but Lucene, with its strong Apache branding, is the most popular open source solution and, again according to Paul Doscher, probably the most popular independent search engine anywhere.
Sudarshan Gaikaiwari presented a talk on auto-completion, concentrating on geospatially informed results. For instance, if you enter “pi” into a search box, you may be presented with pizza joints, piano bars, and other popular searches within a few miles. Gaikaiwari achieved this with careful mining of log files and by creating a hidden prefix to the search term (for instance, “pi” can be altered to “times square new york city pi”). The long prefix matters because the longer a search string is, the fewer results have to be returned and the quicker you can present suggested search items while the user is still typing. (To feel responsive, a site should present results to the user within 140 milliseconds.)
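The hidden-prefix idea can be sketched roughly as follows. This is an illustrative toy, not code from the talk: the completion index, the location token, and the function names are all assumptions, and a real system would mine the augmented strings from query logs and store them in a proper suggester.

```python
# Hypothetical completion index mined from logs: augmented query -> popularity.
# In a real deployment this would live in Solr, not a Python dict.
COMPLETIONS = {
    "times square new york city pizza": 1200,
    "times square new york city piano bars": 300,
    "mission san francisco pizza": 900,
}

def suggest(typed, location_token, index=COMPLETIONS, limit=10):
    """Return popular completions for `typed`, scoped to the user's area."""
    # Prepend the hidden, location-derived prefix, e.g. "pi" becomes
    # "times square new york city pi"; the longer string is far more selective.
    augmented = location_token + " " + typed
    hits = [(count, q) for q, count in index.items() if q.startswith(augmented)]
    # Most popular first; strip the hidden prefix before showing the user.
    return [q[len(location_token) + 1:] for count, q in sorted(hits, reverse=True)[:limit]]
```

With this toy index, `suggest("pi", "times square new york city")` yields the pizza and piano-bar completions while ignoring the San Francisco entry, even though all three begin with a query containing “pi”.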
The geospatial information is retrieved through geohashes, a way of representing the world’s grid as arbitrary strings. Shorter strings represent larger geographical areas, and as you add a character to the end of the string you zoom in on a smaller area. By a mixture of four-character and five-character strings, you can create a reasonable area in which to show local search results.
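A minimal geohash encoder shows why this works; the sketch below implements the standard algorithm (interleaving longitude and latitude bisection bits, five bits per base-32 character), and is not the specific library used at the conference.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lon, precision=5):
    """Encode a point as a geohash string of `precision` characters."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, code = [], []
    even = True  # geohash interleaves bits, starting with longitude
    while len(code) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half of the range
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        even = not even
        if len(bits) == 5:  # each base-32 character encodes 5 bits
            code.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(code)
```

Because each added character just refines the same bisection, a shorter geohash is always a prefix of the longer one for the same point, which is exactly the property that lets prefix queries select everything inside a bounding cell.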
Some of the other interesting parts of Gaikaiwari’s talk included:
Check each search string you recommend against the main search index to make sure that someone clicking on that search string will come up with at least one actual document. Users quickly come to distrust your recommendations if they click one and come up with an empty set.
Measure “time to first click” to check how good your recommendations are. This metric is valuable because it combines two important criteria: presenting suggestions to the user quickly and success in actually producing a suggestion the user likes. Gaikaiwari also listed several other metrics.
Interestingly, search engines such as Solr and Sphinx functioned as NoSQL replacements for relational databases (though usually used to offload the search function from these databases) long before the term NoSQL was invented. Although people don’t tend to think of the search tools in that light, they do in fact work like NoSQL in that they perform specific functions more efficiently than a relational database can, and they sometimes compete with document stores like CouchDB and MongoDB. But search engines have evolved tremendously to intersect with the worlds of taxonomy and analytics. And now they’re dealing with big enough data sets to require sharding and replication as well.