November 2007 Transcript: Paul Duguid

Joy: Welcome to First Monday Podcast. I’m Joy Austria.

AJ: And I’m AJ Hannah. The Google Books Project recently divided the academic community between the Google faithful and critics wary of the privatization of public goods. University of California, Berkeley adjunct professor Paul Duguid briefly assessed the quality of scanning and search results in the Google Books Project.

Joy: So I took your article to be an assessment of how well Google is scanning books in their book project. Can you briefly describe how you went about doing that?

Paul: Yeah, perhaps I need to backtrack a bit, but still within the context of First Monday: I wrote an earlier piece, which you very kindly published, called the laws of quality, which was trying to assess how well open source methods worked in what could be called cultural projects. We knew about them in software; people would try to move from software to other projects, such as Project Gutenberg or Wikipedia, and I wanted to say, well, do they transfer, and if they don’t, what are the methods by which we can maintain quality in such projects?

And one of the methods that I used to try and see how well Project Gutenberg seemed to be doing was to take this book Tristram Shandy. As I acknowledged, I think, being moderately unkind, it’s a wonderful book, it’s a funny book, but it’s a tricky book, particularly if you are trying to put it into ASCII. I gave a couple of talks on the piece before writing it, and when I was writing the piece and people were reading it, almost everyone said to me, of course Google Books will solve these problems. And I believed that Google Books would solve these problems. And so I was amazed at the sort of trouble that Google Books seemed to me to have. So I took one aspect of Google Books and said, what will an ordinary reader get from Google Books?

And so the first thing I found was that a lot of the pages in the first edition that Google threw up on the page when I did a search were just incredibly badly scanned. The first word of the text was missing. The first ten pages had five that were unreadable. So one, there seemed to be a problem with the scanning. Now again, it was just one book but when I asked around, a lot of people said that when I looked at this book or when I’ve looked at this body of work, the scans were terrible. So it seemed worth saying that the scans looked pretty bad.

The second thing, which I think was highly problematic and remains so, is that no one can understand how Google Book Search is ranked. So all I could do, again, as a method, was just start with the first one that was up there and see what the problems were with that, and then I went on to the second and there were problems with that and so on. So naturally how I came to the book was just to say, let’s continue this research, let’s see what Google Books looks like, and let’s just take the same book that I used before and see what that throws up.

AJ: How can metadata help these types of projects?

Paul: Well, metadata is particularly significant in a book like the edition of Tristram Shandy or in fact the first three editions of Tristram Shandy that when I did the search were thrown up, because all of them were multi–volume books and Google didn’t seem to understand that. So that if you pick the first book of Tristram Shandy that I looked at, I think it was the first one or maybe the second one, it was actually volume two of a four volume edition.

And I went back to Google today just before talking to you and the first two volumes that are up there now are different. One of them, though it doesn’t indicate this, is volume three, and the other one, which doesn’t indicate it at all, is volume four of the four volume set, which contains the last three volumes of Tristram Shandy (I’m looking at the screen now) plus various other works by Sterne.

So if you don’t have a very simple form of metadata that would distinguish four volumes with exactly the same name, one from the other, you have this problem: how do people know what they’re reading? When you open the text and it says chapter one and you start reading, if that happens to be chapter one of book five of Sterne’s Tristram Shandy, you’re none the wiser and Google makes you none the wiser, and it’s really only metadata and taking metadata seriously that can sort out an issue like that.

Joy: Besides the obvious answer of legibility and readability, why should we care if Google does a good job of scanning books? Are there bigger issues that need to be considered?

Paul: Yeah, I think there are. I think there are two issues that, again, began with the first paper I wrote: one is the question of openness and the other is the question of quality. I brought up just some of the issues of quality. I mean, there were many more, some much more serious I think; I just thought I shouldn’t go on with my answer.

And the other issue is the question of openness. On the one hand, just simple things: how many books does a library have, what are the titles that it has, can you see them all in any reasonable ... you know, by a particular author. How are they ranked? All of those things are not open, so we can’t understand them. And of course, more worryingly, Google is a private company, and its interests are not those of the public library. So on the one side there’s the question of openness that I think we need to worry about, because the lack of it simply doesn’t allow us to make assessments of quality, and on the other side there’s just the questions of quality.

AJ: By undertaking this project what is Google’s responsibility?

Paul: Well, Google’s responsibility is to itself. We’re not here to tell Google what to do or how to spend its money. I think though that there are some sort of tricky issues. One is, again this is really with a view to their shareholders, they clearly don’t want to be declared a public utility like AT&T and have the federal government sitting over their shoulders, and the more monopolistic they become, the more there is a chance of that. But I think beyond that, the very persona they have created for their company, claiming somehow that “Do no evil” as a mantra makes them better than everybody else, and the sort of things that they have said, and the PR they have put out around Google Books as being in some way for the people, an almost quasi–philanthropic project, places some obligation on them to resist the tendency to turn this simply into a part of the Google empire.

Joy: I want to go back to this idea of openness that you spoke of earlier and the fact that Google is a private company. We don’t know how they are structuring their metadata or how they’re conducting their quality controls and that sort of thing. Did you get a chance to listen to Siva [Vaidhyanathan]’s podcast?

Paul: I did, I did indeed. He preempted a great deal of what I might say.

Joy: Excellent. Well basically he suggested a way to address these quality control issues is to remove the task of digitization from the private sector and give that responsibility to public institutions or get universities together to do this sort of work. Do you agree with Siva’s recommendation or are there other remedies that we haven’t considered yet?

Paul: Well, I’m not quite sure...whether you characterized him completely correctly, because I don’t know if he was saying that we should remove this task from the private sector, only that we shouldn’t simply leave this task to the private sector. I’m only too happy for Google to go on and do what it’s doing; it’s a great service and it has huge benefits, but I think one of the dangers that it has is that, because they are so impressive, well known and so wealthy, they are persuading a lot of other people that we don’t need alternative projects. They’re crowding out the space, in some way, and the reason that we need alternative projects, I think, exactly as Siva brings up, is that we actually need to make sure that we have this stuff around if Google goes under or changes its mind or rewrites its contracts.

And people made an awful lot of fun a few years ago when [Jean–Noël] Jeanneney, the director of the Bibliothèque Nationale in France, said that he was worried about Google taking over so much of the cultural space, but I think we ought to be. There are public endeavors and public interests which Google has no obligation to serve; it has a right to pay attention to its obligations to its owners and its shareholders. But there are public obligations that things like libraries really do need to pay attention to, and one of them is making sure that these sort of things remain open to the public at large, and are never shut off or privatized.

And so I’m quite happy for the two to exist side by side, but I do think the danger of Google, in part its success, or perceived success, is likely to limit the libraries who are willing to engage with other digitization projects and indeed the people who are willing to fund other digitization projects because they ain’t cheap.

AJ: Google has not been billing itself as a one stop shop for information, but that’s how a lot of people begin or follow through with their research. With this in mind, what kind of suggestions would you make to Google or to another open access project to fix this perception or mistake?

Paul: I have talked, and I have to say off the record, to librarians in various libraries cooperating with Google, and the people working on the project seem not to be enamored with Google, because of, I think, Google’s arrogance in believing that it can do with technology what others have done previously with different methods, like creating metadata. Google has just thought for a long time that all that stuff just isn’t worthwhile. And its idea of metadata is pretty restricted. So I think, and I think this is very unlikely too, a little humility on Google’s part wouldn’t be a bad thing.

I think that if it really did want to make this into a public spirited project rather than just a PR campaign, it would follow some of the Open Content Alliance ideas about how content should indeed be made public, and certainly you would think it would be reasonable if they placed fewer restrictions on what the libraries could do with the files that Google gives back to them. So I think there are things like that which I suspect Google is unlikely to want to do.

I ought to say that I have been and remain a fan of Google Books. And I have to acknowledge that just taking one book is clearly not a very good way to sample such a huge project. But then again it becomes very difficult to know how to sample reliably such a huge project. So there were lots of limits in what I did, but I thought that if I looked, I would confirm what everyone else was telling me, that Google Books will have no trouble with a book like this.

Another thing I need to say, because it’s tricky to understand what people think Google Books is. Some people thought that what I was doing was being unreasonably scholarly. In fact it was almost the reverse, so then people accused me of being condescending. But when I had talked about Project Gutenberg, I had said Project Gutenberg doesn’t claim to be scholarly. If somebody finds a misplaced comma or a wrong quotation, they’re going to say that's not what we’re doing. We’re providing books for the general reader.

Now I went to Google Books with the same idea. Google’s a terribly good technology company, but does it really understand books, was the question that I came down to in the end. And I think, and I’ve occasionally talked to people at Google about this, their assumption is that brute strength of technology can replace all the other quality control and quality assurance processes that we have, and I’m just hesitant about whether that’s true.