I’m working on an app that, given the URL for a book, needs to scrape the page for an ISBN. I have seen several regular expressions for this out there, but I didn’t think they were quite robust enough.
The main difficulty is the format: an ISBN can be 10 or 13 digits, optionally broken into sections separated by hyphens. (One could use spaces, but you have to draw the line somewhere…) The saving grace is that on every site I’ve sampled, once the HTML tags are stripped out, the text “isbn” precedes the number itself. This should be the case even for simple tables (the two <td> elements are normally in the same row, thus consecutive). Then there is the hyphen issue: ISBN legitimately supports variable-length groups, so the hyphens can fall in different places from one number to the next. That is why the expressions below allow an optional hyphen after every digit except the last.
So I have two regular expressions, one for ISBN-10 and one for ISBN-13. They are .NET style, using a named group; remove the string “?<isbn>” from them to get a standard regex.
Here they are defined in C#:
Regex rexIsbnNum10 = new Regex(@"isbn(.{0,3}10)?[^\w]{1,10}(?<isbn>\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d\b)");
Regex rexIsbnNum13 = new Regex(@"isbn(.{0,3}13)?[^\w]{1,10}(?<isbn>\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d\b)");
To use them, grab the contents of the web page, strip out the HTML tags, make it lowercase for simplicity (or use case-insensitive regexes), and apply rexIsbnNum10.Matches() to the string. You should then have all the ISBN-10 values nicely enumerated, each available through the named group “isbn”. Likewise for ISBN-13.
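The steps above might look something like the following. This is a minimal sketch rather than the code the app actually uses: the helper name FindIsbn13, the crude tag-stripping regex, and the sample HTML are all assumptions made for illustration, and it relies on C# top-level statements (C# 9 or later). A real scraper would probably want a proper HTML parser rather than a regex for removing tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample page; a real app would download this from the book's URL.
string sampleHtml = "<table><tr><td>ISBN-13:</td><td>978-0130190772</td></tr></table>";

foreach (string isbn in FindIsbn13(sampleHtml))
{
    Console.WriteLine(isbn);
}

// Illustrative helper (not from the original post): returns every ISBN-13
// found in the page, with the hyphens removed.
static List<string> FindIsbn13(string html)
{
    // The ISBN-13 pattern, with the named group "isbn" around the number.
    var rexIsbnNum13 = new Regex(
        @"isbn(.{0,3}13)?[^\w]{1,10}(?<isbn>\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d\b)");

    // Crude tag stripping (an assumption for this sketch): each tag becomes a space.
    string text = Regex.Replace(html, "<[^>]+>", " ");

    // Lowercase so the literal "isbn" in the pattern also matches "ISBN".
    text = text.ToLowerInvariant();

    var results = new List<string>();
    foreach (Match m in rexIsbnNum13.Matches(text))
    {
        // The named group holds just the number, without the "isbn" label.
        results.Add(m.Groups["isbn"].Value.Replace("-", ""));
    }
    return results;
}
```

Replacing each tag with a space, rather than deleting it, keeps the label and the number in adjacent table cells from running together while still leaving them within the 10-character gap the pattern allows.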
Using the ISBN-13 as an example, this regex will match:
ISBN 978-0130190772
ISBN-13: 978-0130190772
ISBN13: 978-0130190772
ISBN: 978-0130190772
ISBN-13: 9-78-01301-9077-2
ISBN-13 — 9-7-8-0-1-3-0-1-9-0-7-7-2
It will not match if there are fewer or more than 13 digits, if the “ISBN” and the “13” are more than 3 characters apart, or if the ISBN-13 label is more than 10 characters away from the number itself (or a letter or number falls in between).
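One way to check these rules is to run the pattern over a handful of samples. The sketch below assumes the text has already been lowercased; the non-matching strings are made-up inputs chosen to trip each rule, not cases taken from real sites.

```csharp
using System;
using System.Text.RegularExpressions;

var rexIsbnNum13 = new Regex(
    @"isbn(.{0,3}13)?[^\w]{1,10}(?<isbn>\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d-?\d\b)");

// The samples listed above, lowercased.
string[] shouldMatch =
{
    "isbn 978-0130190772",
    "isbn-13: 978-0130190772",
    "isbn13: 978-0130190772",
    "isbn: 978-0130190772",
    "isbn-13: 9-78-01301-9077-2",
    "isbn-13 \u2014 9-7-8-0-1-3-0-1-9-0-7-7-2",   // \u2014 is the dash separator
};

// Hypothetical inputs, one per rule.
string[] shouldNotMatch =
{
    "isbn: 978-013019077",                  // only 12 digits
    "isbn: 978-01301907721",                // 14 digits
    "isbn ---- 13: 978-0130190772",         // "13" more than 3 characters after "isbn"
    "isbn-13: ..........978-0130190772",    // more than 10 characters before the number
    "isbn-13: ref 978-0130190772",          // letters between the label and the number
};

foreach (string s in shouldMatch)
{
    Console.WriteLine($"{s} => {rexIsbnNum13.IsMatch(s)}");   // expected: True
}
foreach (string s in shouldNotMatch)
{
    Console.WriteLine($"{s} => {rexIsbnNum13.IsMatch(s)}");   // expected: False
}
```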