The Traditional Future

Mon

09.17.07

A prominent U.S. sociologist and student of professions, Andrew Abbott of the University of Chicago, has written a thought-provoking thesis on what he terms "library research" -- that is, research as performed with library-held resources by historians et al., via the reading and browsing of texts -- compared to social science research, which has a more linear, "Idea->Question->Data->Method->Result" type of methodology.

The pre-print, "The Traditional Future: A Computational Theory of Library Research," is full of insights about library-centric research, including intriguing parallels between library research and neural net computing architectures, a comparison that made me think anew, and with more clarity, about how the science of history is conducted. Armed with a distinctive interpretation of library research, Abbott is able to draw some incisive conclusions about the ramifications of large repositories of digitized texts (such as Google Book Search) on the conduct of scholarship.

Library research, Abbott notes, "is not interested in creating a model of reality based on fixed meanings and then querying reality whether this model is right or wrong. ... Rather, it seeks to contribute to an evolving conversation about what human activity means. Its premise is that the real world has no inherent or single meaning, but becomes what we make it."

This has immediate ramifications for the potential utility of search premised on concordance-based indexes for humanistic research. "[I]t is by no means clear that increasing the efficiency of library research will improve its overall quality. For example, it is not clear that increasing the speed of access to library materials by orders of magnitude has improved the quality of library-based research."

There are other, inherently structural characteristics of how automated discovery is provisioned that bear on the optimization of library research. One of these relates to the presence of noise, or randomness, that inevitably arises when there are multiple paths to discovery. With more and more information accessible through a dwindling number of search interfaces, the variation in returned results is reduced. Research is not served well when one receives the same answers to the same questions; no learning lies there.

As anyone who has worked in optimization recently knows, stripping the randomness out of a computing system is a bad idea. Harnessing randomness is what optimization is all about today. (Even algorithms designed for convergence make extensive use of randomness, and it is clear that library research in particular thrives on it.) But it is evident that much of the technologization of libraries is destroying huge swaths of randomness. First, the reduction of access to a relatively small number of search engines, with fairly simple-minded indexing systems -- most typically concordance indexing (not keywords, which are assigned by humans) -- has meant a vast decrease in the randomness of retrieval. Everybody who asks the same questions of the same sources gets the same answers. The centralization and simplification of access tools thus has major and dangerous consequences. This comes even through reduction of temporal randomness. In major indexes without cumulations - the Readers Guide, for example - substantial randomness was introduced by the fact that researchers in different periods tended to see different references. With complete cumulations, that variation is gone.

That's an interesting observation - almanacs or compilations often present slices, or ever-varying accumulations of results, and so even identical questions would inevitably return different results depending upon when in the publication sequence they were asked. As more and more information is aggregated into composite sets, this temporal variation is also lost.
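To make the cumulation point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `volumes` data, the reference names, and the two lookup functions are invented for illustration and do not come from Abbott's paper. It assumes, as a simplification, that a researcher working in a given period consults only that period's volume.

```python
# Hypothetical data: each volume of a non-cumulated index lists
# only the references gathered for that period.
volumes = {
    1950: {"libraries": ["Ref A", "Ref B"]},
    1960: {"libraries": ["Ref C"]},
    1970: {"libraries": ["Ref D", "Ref E"]},
}

def lookup_volume(year, term):
    """References seen by a researcher who consults only the volume for `year`."""
    return list(volumes[year].get(term, []))

def lookup_cumulated(term):
    """References seen by everyone once all volumes are merged into one index."""
    results = []
    for year in sorted(volumes):
        results.extend(volumes[year].get(term, []))
    return results

# Same question, different volumes: each researcher gets a different answer.
per_volume = {year: lookup_volume(year, "libraries") for year in volumes}

# Same question, one cumulated index: every researcher gets the same answer.
cumulated = lookup_cumulated("libraries")

print(per_volume)
print(cumulated)
```

In the per-volume case the same query yields three different result lists; in the cumulated case there is one list and everyone receives it. That lost variation is the temporal randomness Abbott describes.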

Dr. Abbott makes a final point about the transformation of browsing and discovery, and the underlying nature of library-based research: often the investigator does not know exactly what they are looking for, and that is at least as much of an obstacle as not knowing the best sources to look in.

This argument makes it clear why "efficient" search is actually dangerous. The more technology allows us to find exactly what we want the more we lose this browsing power. But library research, as any real adept knows, consists in the first instance in knowing, when you run across something suddenly interesting, that you ought to have wanted to look for it in the first place. Library research is almost never a matter of looking for known items. But looking for known items is the true - indeed the only - glory of the technological library. The technological library thus helps us do something faster but it is something we almost never want to do and, furthermore, it strips us in the process of much of the randomness-in-order on which browsing naturally feeds. In this sense, the technologized library is a disaster.

Google Book Search is a wonderful thing. But it is not so wonderful that we should assume it will transform education and research. Nor should we assume that in the future we might not be able to generate architectures that make books live more intelligently amongst each other - and more freely - than anything that Google might envision. As libraries who might be participating in digitization: let us challenge the fundamental assumptions we are handed - that must seem so dangerously obvious - and rethink the landscape of our profession, and how we might best support our real work of learning.

tags: books, digitization, google, research, search

Comments: 8

Search Engines WEB [09.17.07 09:31 PM]

>>> This argument makes it clear why "efficient" search is actually dangerous. The more technology allows us to find exactly what we want the more we lose this browsing power

If search efficiency does get THAT extreme, technology could be programmed to make vertical or related suggestions concerning areas or topics of likely interest based on patterns of recent searches (personalization).

Tim O'Reilly [09.18.07 05:05 AM]

Peter,

I'm not so sure I buy the premise that online search destroys the serendipitous find of unknown documents. In fact, many of the anecdotes about the value of Google Book Search in academic research are precisely about people finding documents that they didn't know existed, and that, once found, opened up new avenues for thought and further research.

This seems to me like one more anachronistic "our old way is better," post. It's also silly. If research in the physical library is better for the kind of deep research the article is talking about, there's nothing to stop it from continuing.

Peter Brantley [09.18.07 07:14 AM]

Tim,

I think the argument is far more nuanced than (paraphrasing) "new is bad, old is good." There is absolutely no question that GBS and kin facilitate a certain conduct of research.

I think Andy's points are distinct, and perhaps I should have added more narrative. I will, with prejudice, attempt to highlight a couple of them with more care.

One point that Andy makes is that highly automated search might not make that much impact, or have a deleterious impact ultimately, on the quality of research conducted. Crassly, perhaps more time is spent on discovery, and less on analysis. I am not sure I support that perspective, but it should be worthy of more consideration than dismissal. I agree that nothing currently obviates the ability to perform traditional work. However, the rush to mine resources such as GBS is certain to cause at least a momentary distraction from other lines of research; what its permanent ramifications are remains to be seen.

Just as we see older forms of content production and distribution relegated to niches or backwaters, so we may also see forms of research deprecated. Simply being able to "do the old" when the new is in place is a naive defense of the new - "new" is far more transformative on the "old" than its mere existence. Since there are so few ways of understanding our world to begin with, that impact must be by definition a loss, even as we gain something novel. It is not a trade of equivalents.

Another point that I would pull out of Andy's post is that there is a mitigation of randomness. This is a far more serious point than whether or not we can provide some sort of pseudo-browsing, which itself is not likely to ever be as serendipitous as physical browsing. The ontology of browsing is a dive into the consideration of different types of experience that I do not want to take here. But the damage done by the overall mitigation of randomness is of a more profound nature.

Even supposing that some randomness might be inserted into the dominant search engines' algorithm suites, one is still left with a paucity of places to perform inquiry. We can add more levers and knobs to our oracles, but if we slay the majority of our seers, consciously or not, then we get fewer ways of knowing the world.

Note also, search engines will play with the insertion of randomness only so much before they feel their result streams are compromised; SEO is labeled a science, a necessary one as the NYT pay-wall demolition demonstrates, but it is driven by the search for order and rational derivation of argument and conclusion. That is not the whole way of the world.

Think also beyond the world as you know it. If, as I have considered previously, there is a settlement between Google and the publishers (AAP) and the authors (the Authors Guild) for out-of-print, in-copyright works, then Andy's points will have even more profound salience. Any type of voluntary collective licensing will raise profound barriers to entry for a large mass of digitally-available books that possess far higher value than the largest possible corpus of public domain works, for its sheer volume alone.

The scarcity in such a scenario of places to ask will be -- must be -- profoundly unsettling. It is more than simply an implicit monopoly in inquiry; it is a diminution in the ways in which we can question ourselves.

It is as if we have drained all of the ponds in the world except one. We may still throw different size rocks into the pond, and we shall witness different waves and turbulence, and the reflections and light will be ever unceasingly unique. Yet the contours of that pond shall never be altered, the boulders within it not ever moved; our visions have been confined by a habit enforced by comforts too persuasive to remove ourselves from this shore and perceive ourselves in the lakes and rivers and brooks of our earth.

Jerome McDonough [09.18.07 07:24 AM]

Tim, I think you may be letting the information retrieval/information science research community (e.g., me) off a little too easy with your comment "If research in the physical library is better for the kind of deep research the article is talking about, there's nothing to stop it from continuing." Google and other search engines may support a certain type of serendipitous discovery through their sheer mass of resources, but one of the advantages of digital resources has supposedly been the ability to not only browse but dynamically reorder resources for browsing based on user desires. Search engines support one, narrow approach to that, but I think all of them could do a better job. I'd like to see a lot more work done on the types of user interfaces that scholars of different types might benefit from for discovery work, and while there is the occasional flash of novelty from companies like Groxis, 'one size fits all' really seems to be the model most search engines are stuck in. I don't think serendipitous discovery should be relegated to the physical stacks (even if they are good at it); and I think we have a ways to go before the electronic world really lives up to its promise. Building up the world's largest repository of full-text is only the first stage of building a great digital library; there's a lot of work to be done after that in making it useful. If Google *does* manage to scan every book/journal article/manuscript in the world, I don't think the information retrieval research community gets to declare the end of history for IR. :)

Rick Prelinger [09.18.07 08:27 AM]

Serendipity can and will be simulated, and I'm sure we will see efforts to build virtual bookstacks beyond what libraries are doing now. It may take years to discover whether or not simulated semi-randomization and discovery can provide the value of physical browsing and riffling through books chosen subjectively. Similarly, it's premature to conclude very much about the analog/digital library split -- massive text datasets are very young (with tools that are still toylike), and physical libraries are stepping up their pace of evolution.

That said, I think there are limits to query-based browsing. Having to type something in a query box, choosing search terms, or even having to ask a natural-language question; each of these acts forecloses surprise and limits browsing. Queries work on indexes, but they cut straight paths through forests in which you might prefer to ramble.

Our growing body of experience in serendipity-based, query-free (amateur) librarianship has revealed that loose organization and surprise yield unexpectedly useful and often gratifying research results. We need to make sure that library users can continue to "find what they are not looking for."

Thomas Lord [09.18.07 09:25 AM]

Where is Nicholson Baker when you need him?

A physical research library is much more than just the contents of its books and catalogs. It is an archeological site, occupied by a society descended from the societies that built it.

For example, a library is a record -- a "condensation" or a "consequence" -- of the history of collection development and culling decisions made over the years.

Another example: The shelving plan of a good (and well supported) research library is the outcome of a purposeful design -- an architecture for moving bodies through space on which information is mapped. To "browse" such a space is not to cast some random I Ching made out of Dewey decimal numbers; rather, to browse such a space is to study a collectively constructed sculpture whose artistic subject is precisely your motion as you browse.

In a research library, when the archeological project uncovers material whose documentation is deficient, often members of the occupying society can remember the history of the material or at least make good guesses about where to look for clues.

Do you see? There is nothing serendipitous or random about the practice of using a research library. It is a more controlled and carefully constructed experience and environment than Disneyland.

Materials -- books, for example -- are like paint or like clay. They are the raw material of an art form. Libraries are sculptures created using those materials. A mass digitization, a la GBS, is just an efficient way to synthesize knock-offs of original materials, plus some catalogs for picking them out.

Research will converge. It is *a bit* fun to imagine a really open form of GBS with lots of contributors designing new views of a "big bag of everything we can lay our hands on to digitize". But: It will be a *lot* of fun when I can turn those same visualization tools to smaller data sets -- namely, models of the contents of actual research libraries. A new kind of art appreciation -- a new way to *see* the archeological sculptures which are libraries.

-t

Alex Tolley [09.18.07 11:49 AM]

Much of the referenced article seems to be about the serendipitous nature of browsing in a research library and its difference to the directed search for information retrieval.

However, I think that much of this serendipity is available in electronic search - social bookmarking, folksonomies, etc. are the very metadata that the article describes a research librarian using.

Indeed the excellent O'Reilly book "Ambient Findability" (no plug intended) covers these issues in a broader context.

It is very clear that the article makes a plea for the highly educated individual to do library research, much like search on DIALOG in the last century was best done by experienced library people. But it is also true that the structure of the library is a relatively static structure developed by the collective intelligence that designed it. This makes the serendipitous browsing not as productive as the author indicates. I would suggest that electronic maps showing the paths previous researchers have taken would be hugely beneficial to researchers, not just showing the main informational highways, but also the unexplored links that have yet to be visited to generate new links and insights.

As always, Hesse's "The Glass Bead Game" (aka Magister Ludi) comes to mind.

orcmid [09.18.07 04:04 PM]

Another problem, perhaps one that will be cured after much pain, is the utter lack of good librarianship along with the poor cataloging effort that is revealed in Google Book Search. This must have participating research librarians apoplectic.

Charles Petzold has posted some critiques and demonstrations that are chilling. This is almost worse than not having the material online and it shows complete lack of appreciation for what is involved in curating and cataloging library materials. Here are the three successive posts from Charles:

http://www.charlespetzold.com/blog/2007/09/070444.html

http://www.charlespetzold.com/blog/2007/09/090206.html

http://www.charlespetzold.com/blog/2007/09/100111.html