Navigating the unnavigable on NCBI Bookshelf

Update (as of 10 Dec 2011): it appears the below approach no longer works! I have not yet tried to decode the latest NCBI Bookshelf. Please check back soon, or let me know if you discover a solution.

Note: the typesetting –but not content– of this article will change soon.

The National Center for Biotechnology Information has a bookshelf on which many nice books (e.g., “Neuroscience” (2001) by Purves et al.) may be freely accessed. Sadly, this “free access” usually prohibits browsing, as noted by a statement printed at the top of the table of contents for such books:

By agreement with the publisher, this book is accessible by the search feature, but cannot be browsed.

Luckily, the source of each webpage (one per “part” or section) of every book includes two navigation tags located immediately before the beginning of the text:

<p class="prev main">
<p class="next main">

Inside each tag is a partially complete link to the previous section and the next section, respectively. The part field may be extracted and entered into the browser's address bar –replacing the current “part” value– in order to navigate.

This may be better understood by an example. Consider the book “Molecular Cell Biology” (2000) by Lodish et al. If you search the page source (XHTML) of the Table of Contents, you will not find the hidden navigation links. They do appear to be on all of the other pages of the book, so let us search for and then go to the Preface. To do this manually, type “preface” in the search bar toward the upper-right corner of the webpage, and press ENTER. There should be only one or two related results; go to the obvious Preface link.

Once you are viewing the Preface page, note the form of the URL (shown in your browser's address bar, likely located near the top of the window). It should end with “part=A103”. This part value specifies the particular section we wish to view, i.e. the Preface. Note that the book name is also seen nearby as “book=mcb”. Now view the source for this page (in Firefox, try Ctrl+U, or “Page Source” under the “View” menu toward the top-left of the browser window).

To quickly find the tags we are interested in, simply search the text for “prev main”. Once found, you will see an href link within each. The part values listed may be used to go to the previous or next sections of the book. In this example, you should see

<p class="prev main"><a href="br.fcgi?book=&amp;part=A102" class="new-related-obj">*</a></p>
<p class="next main"><a href="br.fcgi?book=&amp;part=A119" class="new-related-obj">*</a></p>

Thus, to go to the previous section, you would replace the part value currently in the address with “A102”. Similarly, to step forward through the book (as you might do if you were reading the book in a library), you would change the part value to “A119”.
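The manual steps above can also be sketched in Python. This is only an illustration that assumes the page source looks exactly like the two tags quoted above; the function names find_nav_parts and with_part are hypothetical, and the full Preface URL in the usage lines is assembled from the “book=mcb” and “part=A103” values mentioned earlier.

```python
import re
from html import unescape
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Matches the hidden navigation tags as quoted in this article (an assumption
# about the markup; other page layouts would need a different pattern).
NAV_RE = re.compile(r'<p class="(prev|next) main">\s*<a href="([^"]+)"')


def find_nav_parts(page_source):
    """Return a dict such as {'prev': 'A102', 'next': 'A119'}."""
    parts = {}
    for direction, href in NAV_RE.findall(page_source):
        # hrefs in the page source are entity-escaped ("&amp;"), so decode first.
        query = urlsplit(unescape(href)).query
        values = parse_qs(query).get("part")
        if values:
            parts[direction] = values[0]
    return parts


def with_part(url, part):
    """Return url with its 'part' query value replaced by part."""
    pieces = urlsplit(url)
    query = parse_qs(pieces.query, keep_blank_values=True)
    query["part"] = [part]
    return urlunsplit(pieces._replace(query=urlencode(query, doseq=True)))


sample = (
    '<p class="prev main"><a href="br.fcgi?book=&amp;part=A102" '
    'class="new-related-obj">*</a></p>\n'
    '<p class="next main"><a href="br.fcgi?book=&amp;part=A119" '
    'class="new-related-obj">*</a></p>'
)
nav = find_nav_parts(sample)
preface = "http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mcb&part=A103"
next_url = with_part(preface, nav["next"])
```

Here nav holds the two part values, and next_url is the Preface address with its part value swapped for the “next” one, which is the same substitution you would otherwise type into the address bar.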

I have tested this method in a few of the other books on the NCBI Bookshelf with good success. The next step is to create a browser plugin that can override the stylesheet which causes these navigation links to be hidden, or perhaps a Python script that samples every (section) page in a book and builds a table of contents using these part values.

To facilitate browsing, I wrote a Python script that, given a seed page whence to start, will generate a basic (X)HTML file with section titles and links. The script is called maketoc (see also my misc page) and is free and open source. Usage is quite simple and includes an optional sleep time before each server request.
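To give a feel for what such a section index might look like, here is a minimal sketch of turning a list of (part value, section title) pairs into a basic HTML file of links. It is not the maketoc source: render_index and its arguments are hypothetical names, and the two sample entries simply reuse part values quoted in this article for the Lodish book.

```python
from html import escape


def render_index(book_title, base_url, sections):
    """Return a basic HTML page that lists (part, title) pairs as links."""
    items = []
    for part, title in sections:
        # Inside an href attribute, "&" must be written as "&amp;".
        href = escape(base_url + part, quote=True)
        items.append('<li><a href="%s">%s</a></li>' % (href, escape(title)))
    return "\n".join([
        "<html>",
        "<head><title>%s</title></head>" % escape(book_title),
        "<body>",
        "<h1>%s</h1>" % escape(book_title),
        "<ul>",
        *items,
        "</ul>",
        "</body>",
        "</html>",
    ])


base = "http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mcb&part="
page = render_index(
    "Molecular Cell Biology",
    base,
    [("A103", "Preface"), ("A250", "Chemical Foundations")],
)
```

Writing page to a file and opening it in a browser gives a clickable list, one entry per section.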

Though you may prefer having no artificial delays (i.e., calling maketoc with sleep time of 0 seconds), accessing webpages at the speed of an unbridled script is an obvious sign that a machine is the client, not a human sitting at a Web browser. In the case of the NCBI Bookshelf, the PubMed Central servers will notice how quickly you're accessing book sections and temporarily block your IP address. Before I added pre-request delays, I made this mistake and could not access any books for about 20 minutes. To avoid this, maketoc will pause for a given number of seconds before sending each server request. Since this script need only be run once to build the book section link list (again, the section index is merely a list of links in an (X)HTML file), I suggest using a large pause time, e.g. 5 seconds, and letting it run in the background. You can save section indices for later use as you read and refer to books.
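The pause-before-each-request idea can be sketched as a small loop. This is an assumption-laden illustration rather than the actual maketoc code: walk_sections, fetch, and next_part are hypothetical names, and the fetching and parsing steps are passed in as functions so that the loop can be tried without touching the server.

```python
import time


def walk_sections(seed, fetch, next_part, delay=5.0, sleep=time.sleep, limit=1000):
    """Follow 'next' links starting at seed, pausing before every request.

    fetch(part) returns the page for a part value; next_part(page) returns
    the following part value, or None at the end of the book.
    """
    visited = []
    seen = set()
    part = seed
    while part is not None and part not in seen and len(visited) < limit:
        sleep(delay)          # pause *before* each server request
        page = fetch(part)
        visited.append(part)
        seen.add(part)        # guards against navigation loops
        part = next_part(page)
    return visited


# Dry run against a made-up three-section "book"; no network access involved.
fake_book = {"A250": "A251", "A251": "A252", "A252": None}
pauses = []
order = walk_sections(
    "A250",
    fetch=lambda part: fake_book[part],   # the "page" is just the next part
    next_part=lambda page: page,
    delay=5.0,
    sleep=pauses.append,                  # record pauses instead of sleeping
)
```

In the dry run the loop visits the three sections in order and records one 5-second pause per request; with a real fetch function and the default time.sleep, the same loop would wait for real.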

To make usage clear, we consider a brief example. Suppose you wish to build a section index for “Molecular Cell Biology” (2000) by Lodish et al. Following the example earlier in this article, we know that the base URL is

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mcb&part=

We shall use the Chapter 2 “Chemical Foundations” section page as our seed. Note that in general, any substantive section page should work as the seed; however, I have found that the “Preface” or “Acknowledgments” pages –typically located at the beginning of books– are in local minima and do not have hidden navigation tags leading to the book content. This can be seen, e.g. with “Biochemistry” (2002) by J.M. Berg et al., which seems to have two divisions: beginning to Acknowledgments, and “Prelude” (part value of A140) to end of book. Thus, you could use any “Biochemistry” book section from the Prelude onward as a seed to maketoc.

Assuming you are in the same directory as maketoc (and have set appropriate execution bits, etc.), at the command-line enter

./maketoc.py 'http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mcb&part=' A250 mcb.html 5

This should work on any Unix (-like) or Mac platform with a base Python installation. The “base URL” and seed are given as separate arguments to simplify parsing code. The resulting section index is written to mcb.html, and the maketoc script sleeps for 5 seconds before each server request. Running on my GNU/Linux machine, this command completed in 25 minutes, 10 seconds and yielded a nice section list.
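To illustrate why separate “base URL” and seed arguments keep the parsing simple, here is a hedged sketch of reading the four command-line arguments with argparse. The argument names and the default sleep time are my assumptions for illustration, not necessarily what maketoc uses internally.

```python
import argparse


def parse_command_line(argv):
    """Parse: base URL, seed part value, output file, optional sleep time."""
    parser = argparse.ArgumentParser(prog="maketoc.py")
    parser.add_argument("base_url", help="URL up to and including 'part='")
    parser.add_argument("seed", help="part value of the starting section")
    parser.add_argument("output", help="HTML file to write the index to")
    # Assumed default; the article only says the sleep time is optional.
    parser.add_argument("sleep", nargs="?", type=float, default=5.0,
                        help="seconds to pause before each request")
    return parser.parse_args(argv)


args = parse_command_line([
    "http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mcb&part=",
    "A250",
    "mcb.html",
    "5",
])
# With the base URL and seed kept separate, any section's address is a
# plain string concatenation; no URL parsing is needed.
seed_url = args.base_url + args.seed
```

The same concatenation works for every part value discovered later, which is presumably the simplification the separate arguments buy.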

I also tried using a 2-second sleep time, and though I could make more progress than in the 0-second (i.e. no artificial delay) case, the server nonetheless caught my activity and temporarily blocked my IP address. A more general and robust approach is to vary the sleep times according to some stochastic process, which is entirely straightforward to implement but unnecessary for an effort as small as maketoc. Adding stochasticity to server request timing helps mask the presence of a machine, giving the illusion of a human navigating webpages and pausing to read paragraphs occasionally.
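One possible way to randomize the pauses, sketched here purely as an illustration (maketoc itself uses a fixed sleep time), is to add an exponentially distributed extra wait on top of a fixed minimum. The function name and all three parameter values are assumptions; the exponential distribution is just one convenient choice that yields mostly short extras with an occasional long “reading” pause.

```python
import random


def human_like_delay(minimum=5.0, mean_extra=3.0, maximum=30.0, rng=random):
    """Return a pause in seconds: a fixed minimum plus a random extra wait."""
    # expovariate takes a rate (1 / mean), so the extra averages mean_extra.
    extra = rng.expovariate(1.0 / mean_extra)
    # Cap the result so a rare large draw cannot stall the run.
    return min(minimum + extra, maximum)


rng = random.Random(2011)   # seeded only to make the demonstration repeatable
delays = [human_like_delay(rng=rng) for _ in range(5)]
```

Each value in delays could then be passed to time.sleep before the corresponding request, in place of a constant.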