Mssv

Defending the Library of Google

May 29th, 2008 · 3 Comments

In the current issue of The New York Review of Books, Robert Darnton, Director of the University Library at Harvard, writes about Google’s efforts to digitise the world’s books and create a new universal library. For the most part, the article is really very well-written and enlightening.

However, when he comes around to criticising Google Book Search on eight fundamental points, he seriously missteps on at least two of them:

6. As in the case of microfilm, there is no guarantee that Google’s copies will last. Bits become degraded over time. Documents may get lost in cyberspace, owing to the obsolescence of the medium in which they are encoded. Hardware and software become extinct at a distressing rate. Unless the vexatious problem of digital preservation is solved, all texts “born digital” belong to an endangered species. The obsession with developing new media has inhibited efforts to preserve the old. We have lost 80 percent of all silent films and 50 percent of all films made before World War II. Nothing preserves texts better than ink imbedded in paper, especially paper manufactured before the nineteenth century, except texts written on parchment or engraved in stone. The best preservation system ever invented was the old-fashioned, pre-modern book.

(Adrian: Also read point 4, which is related in that it addresses the built-in obsolescence of electronic media in general, and of companies such as Google in particular.)

One of the most insightful comments I have ever read on the Internet (I’m not sure where – perhaps it was Slashdot) was that digital information does not last any longer than analog information. All digital information exists on some form of physical medium, whether it’s on a length of tape, a hard drive or a DVD. Any of those media can be damaged, and certainly we know that CDs and DVDs become degraded over mere decades or years. As Darnton points out, ink truly is one of the best preservation systems, lasting potentially for millennia, and handily beating digital media.

No, the true advantage of digital information is that it can be perfectly copied. Digital information, after all, is just a series of 0s and 1s, and even as floppy discs were superseded by CDs, which were then superseded by DVDs, we can be sure that we can perfectly preserve the information on old media through the act of copying. Another useful, although not integral, advantage of digital information is that it can also be copied very rapidly. Analog information, on the other hand, is much slower and more difficult to copy reliably, and if Darnton thinks that Google will make mistakes while digitising books (point 5), he should also know that far more mistakes are made when books are copied by any other method.
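The point about perfect copying can be made concrete: because digital data is just bytes, a copy can be verified as bit-for-bit identical to the original by comparing cryptographic hashes. Here is a minimal sketch in Python; the file names and contents are invented for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as d:
    original = Path(d) / "original.txt"
    original.write_bytes(b"A digitised page of text.")

    # Copying to a new file stands in for migrating data to newer media.
    duplicate = Path(d) / "duplicate.txt"
    shutil.copyfile(original, duplicate)

    # Matching digests prove the copy is identical: no generational loss,
    # unlike photocopying, microfilm, or hand transcription.
    assert sha256(original) == sha256(duplicate)
```

Repeat the copy a thousand times and the thousandth duplicate still hashes identically, which is precisely the property analog media lack.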

The best way of preserving information over centuries and millennia is not to print it on pages with ink, or to carve it into stone; the number of books, pamphlets, tablets, scrolls and stelae that have been damaged, destroyed or lost over history is uncountable. Instead, the most proven method is to make copies of the information, and lots of them. It is telling that the stories that survive from ancient Egypt and Rome are the most popular ones, the ones that were copied hundreds or thousands of times. What happened to all the other scrolls, the ones with fewer copies? They are lost.

Not everything digital is backed up and copied hundreds of times, but believe it or not, there do exist many copies of large portions of the entire Internet. The Internet Archive holds some copies, but it’s Google’s – and Yahoo’s, and Microsoft’s – servers that hold the bulk. After all, to perform a search of the full text of the whole world wide web in a fraction of a second doesn’t just require one copy of the whole web – it requires many, many copies, spread over many, many hard drives around the world. And on a more familiar scale, it is relatively trivial to copy the entire content of Wikipedia to an iPhone. There are valid questions about using formats that will be readable in the future, and the problems of reading old formats, and I would direct you to the Wikipedia article on Digital Preservation to learn more.

No doubt we shouldn’t make the mistake of obsessing over new media only to inhibit efforts to preserve the old; there is no reason why we cannot do both. But Darnton is missing one of the most valuable and unique properties of digital information by ignoring its copying abilities.

7. Google plans to digitize many versions of each book, taking whatever it gets as the copies appear, assembly-line fashion, from the shelves; but will it make all of them available? If so, which one will it put at the top of its search list? Ordinary readers could get lost while searching among thousands of different editions of Shakespeare’s plays, so they will depend on the editions that Google makes most easily accessible. Will Google determine its relevance ranking of books in the same way that it ranks references to everything else, from toothpaste to movie stars? It now has a secret algorithm to rank Web pages according to the frequency of use among the pages linked to them, and presumably it will come up with some such algorithm in order to rank the demand for books. But nothing suggests that it will take account of the standards prescribed by bibliographers, such as the first edition to appear in print or the edition that corresponds most closely to the expressed intention of the author.

Google employs hundreds, perhaps thousands, of engineers but, as far as I know, not a single bibliographer. Its innocence of any visible concern for bibliography is particularly regrettable in that most texts, as I have just argued, were unstable throughout most of the history of printing. No single copy of an eighteenth-century best-seller will do justice to the endless variety of editions. Serious scholars will have to study and compare many editions, in the original versions, not in the digitized reproductions that Google will sort out according to criteria that probably will have nothing to do with bibliographical scholarship.

This is a pretty cheap shot. Darnton asks, “Will Google determine its relevance ranking of books in the same way that it ranks references to everything else, from toothpaste to movie stars?” and assumes the answer is “Yes”. But why? Google are not stupid; if people want to order information in their book archives by bibliographic standards, then they will offer that option. It’s not as if it’s difficult to do. He then goes on to beat this strawman into a dusty corpse by making further assumptions based on Google’s supposed final stance of only scanning one copy of a book.

I can’t speak for Google – maybe they are that stupid – but neither can Darnton. We will just have to wait and see, and if Google really are this shortsighted, then no harm is done – and perhaps another search engine will just do a better job.

In an effort to warn readers about some of Google's legitimately problematic practices, Darnton has overreached in attempting to fashion a far wider argument, namely that Google Book Search (or any similar venture) is – and always will be – inferior to the 'research library'. However, this argument is fatally damaged by the assumptions he makes with respect to the present and future capabilities of digital technology.

Darnton admits that the internet has transformed the world of words in only a few years, yet then presumes that the digitisation of the world’s books is an impossible task; and that even if it were to be accomplished, the information would not be available – and could never be made available – in the desired format of scholars. He conflates Google with all potential book digitisation projects, and would have us believe that Google Book Search must work perfectly and last forever in order for its ideals to succeed – as if the incompetence, closure or destruction of any single library system would somehow prove that all libraries are flawed. He is wrong.

Tags: book · future · google · tech · writing
