Google Custom Search

This morning, after several rounds of phone tag with India, I was rewarded with a fascinating interview with Ramanathan V. Guha, presently at Google. Guha helped me understand the significance of the Open Educational Resources search portal recently announced with the Hewlett Foundation.

[N.B.: In a very long-distance telephone conversation, mishearings are inevitable, and any mistakes here are certainly the result of my aging ears combined with the inefficiencies of global telecom infrastructure.]

The OE Search project is a particularly compelling example of Google’s Custom Search Engine (CSE) offering, which is itself derived from the earlier Google Coop initiative, in which Guha was heavily involved. I had followed the Google Coop work in its early days because it offered a way to construct informed portals of rich content curated by interested parties. For example, health topics are contributed by leading organizations such as the CDC, Stanford Hospital, UCSF, Kaiser, the New England Journal of Medicine, and many others.

CSE builds on this concept by permitting individuals, community domains, or organizations to associate Google search with a compilation of resources; that search specification can itself be made available for embedding in other sites, or it can be private to the originator or social network that owns it. CSE, in effect, enables a portal designer to create the semblance of an autonomous index while retaining the rich features of Google’s larger indexing operations. While most “home-built” portal search offerings are limited to a delimited set of links and resources, Google CSE is able to incorporate analysis of incoming anchor text in rankings of search results (among other things). Google’s optimizations are thus brought into user-defined, more tightly or expertly-scoped domains, yielding a sophisticated range of functionality.

An interesting sample CSE is the Islamic Clothing Search site. A search for “shawl” returns highly tuned page listings, the work of a knowledgeable and informed curator. (A more academic example is Cornell Law Library’s Legal Research Engine.)

Returning to the educational context, Guha remarked that any given implementer of a heavily populated site, such as what OER Search will hopefully become through wide participation, could boost locally held results. For example, an MIT implementation of the OE Search portal could choose to boost MIT-hosted sites, providing higher visibility for MIT’s OpenCourseWare offerings.
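To make the idea of boosting concrete, here is a minimal Python sketch. It is not Google's ranking method, which Guha did not describe to me; the `Result` type, the `boost_results` function, the base scores, the multiplier, and the sample URLs are all hypothetical. It simply multiplies a result's score when its host is one the implementer has chosen to favor, then re-sorts.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Result:
    url: str
    score: float  # hypothetical base relevance score, higher is better


def boost_results(results, boosted_hosts, factor=2.0):
    """Re-sort results after multiplying the score of any boosted host.

    A host counts as boosted if it equals a boosted host or is a
    subdomain of one. The original Result objects are not modified.
    """
    def adjusted(result):
        host = urlparse(result.url).hostname or ""
        boosted = any(host == h or host.endswith("." + h) for h in boosted_hosts)
        return result.score * factor if boosted else result.score

    return sorted(results, key=adjusted, reverse=True)


# Illustrative data only: the paths and scores are made up.
results = [
    Result("https://example.org/physics-notes", 0.9),
    Result("https://ocw.mit.edu/courses/physics", 0.6),
    Result("https://example.com/lectures", 0.5),
]
ranked = boost_results(results, boosted_hosts={"ocw.mit.edu"})
for result in ranked:
    print(result.url)
```

With these made-up numbers, the OpenCourseWare page moves ahead of a result that scored higher before boosting, which is the kind of local visibility an implementer might want.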

Google’s CSE setup page provides a simple but flexible set of restrictions and options guiding the construction of a CSE. Advertising can be turned on or off; selected web resources can be boosted, or not; and so forth. Notably, the metadata defining a CSE need not be a one-time submission to Google, or a manually maintained revision, but instead could reside externally and be generated programmatically. Google (Guha) seemed to welcome suggestions for features that might be attractive for custom sites; e.g., there might be aspects of faceted presentation, separating results by MIME type or specified metadata fields.
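As a rough illustration of what "generated programmatically" might look like on the portal owner's side, the sketch below builds an engine definition from locally held records and serializes it. The field names (`title`, `show_ads`, `sites`, `boosted_hosts`), the `build_engine_definition` function, and the sample records are invented for this example; they are not Google's actual specification format, which I have not reproduced here.

```python
import json


def build_engine_definition(title, records, boosted_hosts=(), show_ads=False):
    """Build a hypothetical engine definition from locally held records.

    Duplicate URLs are collapsed and the site list is sorted so that
    regenerating the definition from the same records gives the same output.
    """
    sites = sorted({record["url"] for record in records})
    return {
        "title": title,
        "show_ads": show_ads,
        "sites": sites,
        "boosted_hosts": sorted(boosted_hosts),
    }


# Illustrative records, as they might come out of a portal's own database.
records = [
    {"url": "https://example.edu/collection/b", "mime_type": "text/html"},
    {"url": "https://example.edu/collection/a", "mime_type": "application/pdf"},
    {"url": "https://example.edu/collection/a", "mime_type": "application/pdf"},
]
definition = build_engine_definition(
    "Sample portal", records, boosted_hosts=["example.edu"]
)
document = json.dumps(definition, indent=2, sort_keys=True)
print(document)
```

The point of the design is only that the definition is derived from the portal's own data each time, rather than being edited by hand; the serialized document could then be published wherever the search provider expects to find it.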

It is interesting for me to speculate about how academic portals such as my organization’s own nascent Aquifer, which is a collection of metadata resource records in the broad area of American Social History, or portals such as AnthroCite, or social networking systems such as Nature’s Connotea, might fare under the CSE regime.

Several potential issues with CSE leap to mind. One of the first, endemic to any centrally provisioned networked offering and particularly one from Google, is privacy. Many universities have chosen not to adopt Google Apps for Education, primarily because of privacy concerns and secondarily because of indemnification. These are not trivial concerns, and they will continue to evolve for quite some time. They are potentially relevant for certain types of CSE offerings.

One operational issue is that users of Google’s main index are not driven to matching, high-quality curated CSE search portals in returned results. A better way to describe this problem may be to observe that there is no boosting for CSE sites based on an analysis of their fitness for specific user-supplied search parameters or terms. A main index search for “Islamic clothing” yields very respectable results, but not as good as those from the previously mentioned CSE; certainly for less tightly scoped, more ambiguous, or more interpretative search terms, a user would benefit from awareness of closely associated CSEs. Guha acknowledged this issue of “findability,” and stressed that it was a known problem currently being addressed by Engineering.

This obliquely raises another issue: as the field of CSEs grows, so does the diversity of resources that must be sorted through. Obviously, Google would appreciate the development of a rich ecosystem of CSEs as social networks, organizations, academic niches, and domains of practice turn to the development, care, and feeding of their own embeddable portals. Spam farms are an obvious problem, but even setting them aside, Google will need to analyze carefully and algorithmically the utility of individual CSEs in order to boost their findability; herein are rich problems in translating domain ontologies (whether formally specified or not, and more often not) into normative rankings.

A final note arises from Google’s increasing desire to incorporate content directly into its site: trading for its own account, as Tim O’Reilly nicely observed. Its recent soft announcement of the deal with the Associated Press, folding those results directly into its News site, and the rapidly growing corpus of works in Google Book Search both encourage one to speculate about the incorporation of Google’s internal resources into a CSE. This could be done “by hand,” but many observers have noted the desirability of an API for retrieving domain-targeted results from such content pools. Bridging “in” and “out” resources will be a growing challenge for Google, and one that will also offer new opportunities for customization.