Tuesday, January 20, 2015

Finding ISBNs in the digits of π

For some reason, a blog post from 2010 about searching for ISBNs in the first fifty million digits of π suddenly became popular on the net again at the end of last week (mid-January 2015). The only problem is that Geoff, the author, only looks for ISBN-13s, which all start with the sequence "978". There aren't many occurrences of "978" in even the first fifty million digits of π, so it's not hard to check them all to see if they are the beginning of a potential ISBN, and then find out if that potential ISBN was ever assigned to a book. But he completely ignores all of the ISBN-10s that might be hidden in π. So, since I already have code to validate ISBN checksums and to look up ISBNs in OCLC WorldCat, I decided to check for ISBN-10s myself.

I don't have easy access to the first fifty million digits of π, but I did manage to find the first million digits online without too much difficulty.

An ISBN-10 is a ten-character string that uniquely identifies a book. An example is "0-13-152414-3". The dashes are optional and exist mostly to make the number easier for humans to read, just like the dashes in a phone number. The first character of an ISBN-10 indicates the language in which the book is published: 0 and 1 are for English, 2 is for French, and so on. The last character of the ISBN is a "check digit", which is supposed to help systems figure out whether the ISBN is correct. It will catch many common types of errors, like swapping two characters in the ISBN: "0-13-125414-3" is invalid.
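The check digit rule can be sketched in a few lines of Python. This is a minimal illustration of the standard ISBN-10 checksum (weights 10 down to 1, sum divisible by 11), not the code actually used for this post; the function name `is_valid_isbn10` is made up for the example.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` has a correct ISBN-10 check character.

    Hypothetical helper for illustration. Dashes and spaces are ignored.
    """
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        weight = 10 - position  # weights run 10, 9, ..., 1
        if char in "0123456789":
            value = int(char)
        elif char == "X" and position == 9:
            value = 10  # "X" stands for 10, and only as the check character
        else:
            return False
        total += weight * value
    # The ISBN is well formed when the weighted sum is a multiple of 11.
    return total % 11 == 0


print(is_valid_isbn10("0-13-152414-3"))  # True
print(is_valid_isbn10("0-13-125414-3"))  # False: two characters swapped
```

Since π contains only digits, an ISBN-10 whose check character is "X" can never turn up in this search.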

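Here are the first one hundred decimal digits of π:

3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679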

To search for "potential (English) ISBN-10s", all one needs to do is find every 0 or 1 in the first 999,990 digits of π and check whether the ten-digit sequence starting with that 0 or 1 has a valid check digit at the end. (There is a "1" three digits from the end of the million, but by then there aren't enough digits left to make a full ISBN, so we can stop early.) The sequence "1415926535", which starts right after the decimal point, fails the test, because "5" is not the correct check digit (it would have to be "0"); but the sequence "0781640628", a little further along, is a potential ISBN.
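Here is a rough sketch of that scan, run over just the first hundred decimal digits. It is not the original code: the names `PI_DIGITS_100` and `find_potential_isbn10s` are invented for the example, the leading "3" is left out, and the offsets it reports are zero-based positions within the decimal part, which may not match the position numbering used for the dataset.

```python
# First 100 decimal digits of pi (the leading "3" is omitted).
PI_DIGITS_100 = (
    "1415926535" "8979323846" "2643383279" "5028841971" "6939937510"
    "5820974944" "5923078164" "0628620899" "8628034825" "3421170679"
)


def has_valid_check_digit(candidate: str) -> bool:
    """ISBN-10 checksum for a string of exactly ten decimal digits."""
    total = sum((10 - i) * int(d) for i, d in enumerate(candidate))
    return total % 11 == 0


def find_potential_isbn10s(digits: str, first_digits: str = "01"):
    """Yield (offset, candidate) for each ten-digit window that starts with
    one of `first_digits` and passes the ISBN-10 checksum."""
    # The last possible starting point is ten digits before the end.
    for offset in range(len(digits) - 9):
        candidate = digits[offset:offset + 10]
        if candidate[0] in first_digits and has_valid_check_digit(candidate):
            yield offset, candidate


for offset, candidate in find_potential_isbn10s(PI_DIGITS_100):
    print(offset, candidate)
```

Passing a different `first_digits` string (for example "2" for French) would scan for other language groups with the same loop.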

There are approximately 200,000 zeros and ones in the first million digits of π, but "only" 18,273 of them appear at the beginning of a potential ISBN-10. Checking those 18,273 potentials against the WorldCat bibliographic database turns up 1,168 ISBNs that were actually assigned to books. The first one is at position 3,102: ISBN 0306803844, for the book The evolution of weapons and warfare by Trevor N. Dupuy. The last one is at position 996,919: ISBN 0415597234, for the book Exploring language assessment and testing : language in action by Anthony Green.
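As a sanity check on the 18,273 figure, one can count what fraction of all ten-digit strings starting with 0 or 1 pass the checksum. The sketch below does that exactly with a small dynamic program over the remainder mod 11. It assumes the digits of π behave like independent, uniformly random digits (plausible, but not something anyone has proven), and `valid_fraction` is a name invented for this example.

```python
from fractions import Fraction


def valid_fraction(first_digits: str = "01") -> Fraction:
    """Exact fraction of ten-digit strings, starting with one of
    `first_digits`, whose ISBN-10 weighted sum is divisible by 11."""
    # counts[r] = number of prefixes whose weighted sum is r (mod 11)
    counts = [0] * 11
    for d in first_digits:
        counts[(10 * int(d)) % 11] += 1  # the first digit has weight 10
    for weight in range(9, 0, -1):  # remaining weights: 9, 8, ..., 1
        new_counts = [0] * 11
        for residue, n in enumerate(counts):
            for digit in range(10):
                new_counts[(residue + weight * digit) % 11] += n
        counts = new_counts
    total = len(first_digits) * 10 ** 9
    return Fraction(counts[0], total)


fraction = valid_fraction()
print(float(fraction))            # very close to 1/11
print(round(200_000 * fraction))  # expected count among ~200,000 starting points
```

The fraction comes out at almost exactly one in eleven, so about 18,182 of roughly 200,000 starting points should pass the checksum, which is in line with the 18,273 actually found.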

Here's the full dataset.


Geoff said...

Many thanks for this David, I've updated my original post with your further research - now, to start analysing your dataset for trends... :)

David J. Fiander said...

The next, ultra-geeky step is to parse the ISBNs to identify the publishers involved....

symac said...

Hello David,
thanks for that, fun. Any reason to limit your tool to English ISBNs apart from performance questions?

David J. Fiander said...

Since it takes less than a second to check all the zeros and ones to see if they're valid, it would take less than ten seconds to check all the other languages' potential ISBNs. The problem is that it would take a long time to run those ISBNs against WorldCat to see if they were actually used, and WorldCat's data on non-English books is probably not as good as its data on English content.