Thursday, November 16, 2006

Software Tools

I'm a systems programmer, and a tool-maker, and I think that every library would benefit from having a software tool maker around. Being a tool maker means that I write small, relatively simple programs that only do one thing. I've never written an editor, but I've written lots of data conversion and simple analysis programs: programs that read one or two files and produce one or two files of output, and I always rely on having a command line close by to run my programs.

When I became a librarian, my need to write tools decreased, but it didn't disappear. Lists are the bane of collection librarians, and we regularly receive spreadsheets full of bibliographic data for books, or generate lists of journal titles from JCR, which we then have to use as checklists to find out how many of the titles we own, and how many of the titles that the accreditation board expects us to own are absent. When the list of titles is brief, this process isn't too painful, but it primarily involves cutting ISBNs or titles out of a spreadsheet and pasting them into the web OPAC, then making a note in a column of the spreadsheet. Unfortunately, the lists of titles are rarely brief. For most categories of journals in JCR, there are fewer than one hundred titles, and even that is most of a day's work. I simplified this for myself by writing code for the emacs editor that would automatically query the ISBN in the catalogue for me, eliminating some of the cutting and pasting, and speeding the process up somewhat. Unfortunately, such primitive tools are insufficient when faced with a list of six hundred e-books, and a need to determine the titles that we already own, especially when the ISBN in the list may be for a different format than the one we own.

So I wrote a program. The challenge is figuring out how to get information out of the catalogue: the web OPAC is useless for programs, since they can't easily read nicely formatted HTML tables, and the system doesn't provide a simple web service interface like SRU for querying the database. Fortunately, my catalogue has a (bad) Z39.50 server, and it's possible to find Z39.50 client modules for most scripting languages nowadays, so I just used Z39.50 to talk to my catalogue. Of course, this will only tell me if I own exactly the same edition of a book as the one that the publisher told me about, and I know that's not the whole picture, since we commonly buy the paper edition, rather than the hardcover, and we also already own electronic versions of some books. This is where the whole "Web 2.0" thing takes over. OCLC is providing a cross-ISBN server, xISBN, that is a simple web service: it takes an ISBN as input, and it returns an XML document that is a list of "related" ISBNs: the paper, cloth, electronic version, and whatever else they think might be associated with it.
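Pulling the related ISBNs out of a response like that takes only a few lines. The following is a minimal sketch in Python, not the program described here: the XML shape in SAMPLE_RESPONSE (an idlist element containing isbn elements) is an assumption made for illustration, and the real xISBN format may differ. The function names are hypothetical too.

```python
import xml.etree.ElementTree as ET

# Hypothetical response shape, assumed for illustration only.
SAMPLE_RESPONSE = """<idlist>
  <isbn>0596002815</isbn>
  <isbn>0-596-00797-3</isbn>
  <isbn> 9780596002817 </isbn>
</idlist>"""


def normalize_isbn(raw):
    """Strip hyphens, spaces, and other punctuation; keep digits and a check 'X'."""
    return "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()


def related_isbns(xml_text):
    """Return the normalized ISBNs found in <isbn> elements, in document order."""
    root = ET.fromstring(xml_text)
    return [
        normalize_isbn(el.text)
        for el in root.iter("isbn")
        if el.text and el.text.strip()
    ]


if __name__ == "__main__":
    for isbn in related_isbns(SAMPLE_RESPONSE):
        print(isbn)
```

Normalizing before comparing matters because a publisher's spreadsheet and a catalogue record rarely agree on hyphenation.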

Adding xISBN into the mix means that if we don't own the exact ISBN given by the spreadsheet, then I ship it off to OCLC and check the ISBNs in the return list to see if we have one of the related titles. In a perfect world, I'd record this information in a new column in the spreadsheet, indicating whether we owned the title or a related title, and providing a link to the web OPAC so that the librarian could click from the spreadsheet into the OPAC to check circulation statistics and other useful staff information. But reading and writing Excel files is non-trivial, and storing a URL in a CSV means you end up with a URL displayed in Excel, rather than a friendly link, so I just write out an HTML file that is a list of the titles we own, as links into the OPAC, as desired. After having spent five or six hours programming (aka "having fun"), it took a mere three minutes to process the list of six hundred computer science titles and identify the one hundred thirty titles that we own. But now I've got the tool built, so when this comes up again, or when I need to check journal holdings, it'll take no time at all.
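The matching logic described above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the original program: the catalogue check and the related-ISBN lookup are passed in as plain functions (owns and related, both hypothetical stand-ins for the Z39.50 query and the xISBN call), and the OPAC URL pattern is invented for the example.

```python
import html


def find_owned(titles, owns, related):
    """Match (title, isbn) pairs against the catalogue.

    owns(isbn) -> bool and related(isbn) -> list of ISBNs are supplied by the
    caller; here they stand in for the catalogue query and the related-ISBN
    lookup. Returns (title, matched_isbn, exact_match) for each title owned.
    """
    matches = []
    for title, isbn in titles:
        if owns(isbn):
            matches.append((title, isbn, True))
            continue
        # Exact edition not held: try each related edition in turn.
        for alt in related(isbn):
            if alt != isbn and owns(alt):
                matches.append((title, alt, False))
                break
    return matches


def render_html(matches, opac_url_template):
    """Render the owned titles as an HTML list of links into the OPAC."""
    items = []
    for title, isbn, exact in matches:
        url = opac_url_template.format(isbn=isbn)
        note = "" if exact else " (related edition)"
        items.append(
            '<li><a href="{}">{}</a>{}</li>'.format(
                html.escape(url, quote=True), html.escape(title), note
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>\n"


if __name__ == "__main__":
    # Fake holdings and related-ISBN data, for demonstration only.
    holdings = {"111", "333"}
    related_map = {"222": ["222", "333"], "444": ["555"]}
    titles = [("A", "111"), ("B", "222"), ("C", "444")]
    found = find_owned(
        titles, holdings.__contains__, lambda i: related_map.get(i, [])
    )
    # Hypothetical OPAC URL pattern.
    print(render_html(found, "https://opac.example.edu/search?isbn={isbn}"))
```

Keeping the two lookups as parameters means the same loop can be pointed at a different catalogue, or at journal holdings, without changing the matching code.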

Web 2.0, and by extension Library 2.0, is about providing modular services so that users can build what they want for whatever reason they have. On the web, this is mostly for social or "fun" purposes, but the same philosophy also improves work productivity. Peter Murray spoke at the recent Future of the ILS symposium that the University of Windsor sponsored, and he talked about the importance of componentized business processes for users. But building the right components for our business processes also makes our business more flexible and easier to mash up. This is a big part of what the library vendors are missing: they think they know how we should use our data, when even we don't know how we want to use it. But that, as they say, is a story for another day.