It is advisable to check "finished" pages before they are exposed to your eager public. The program check_pages checks all internal and external links, verifies that every referenced image file exists, and sends pages to the W3C validator site.
By default check_pages runs all four checks: pictures (p), internal links (i), external links (e) and validation (v), but any subset can be selected using the options p, i, e and v. For example:
python ../SCRIPTS/check_pages.py -ve
checks external links (e) and validation (v) only.
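The script's real option handling is not shown here, but as a sketch of how such combined single-letter flags might be handled, the standard library's argparse accepts forms like -ve directly (the flag definitions below are assumptions based on the description above):

import argparse

parser = argparse.ArgumentParser(description="Check finished pages.")
parser.add_argument("-p", action="store_true", help="check pictures")
parser.add_argument("-i", action="store_true", help="check internal links")
parser.add_argument("-e", action="store_true", help="check external links")
parser.add_argument("-v", action="store_true", help="send pages to the W3C validator")
args = parser.parse_args()

# With no flags given, run all four checks (the default behaviour).
if not (args.p or args.i or args.e or args.v):
    args.p = args.i = args.e = args.v = True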
check_pages checks all the files ending in ".html" in the current directory. So, if you are following the directory structure suggested here, do the following. Generate all the pages for your site; they will now be in the directory COMPLETE. In directory COMPLETE type:

python ../SCRIPTS/check_pages.py

Any problems will be listed.
It is important that you generate all the pages first, because otherwise the link checker will report internal links to the not-yet-generated pages as missing.
The code for talking to the W3C validator was adapted from Christof Hoeke's batch validator. His method was more general than I needed, but the remaining code is essentially his. Thanks.
The W3C validator is, to me, an essential resource, but it is time-consuming to use a web browser to upload and check large numbers of pages. To save time, check_pages sends files to the validator and reports any that fail; these alone can then be checked interactively using a browser. Don't be alarmed if the validator reports huge numbers of errors: often a single mistake, such as forgetting to close a paragraph, triggers a cascade of apparent problems. Close the paragraph and all the errors may disappear.
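For the curious, here is a minimal sketch of the round trip. It is not the author's (or Christof Hoeke's) code; it assumes the current checker at validator.w3.org/nu/, which accepts the document in the body of a POST and can return its findings as JSON:

import json
import urllib.request

def validation_errors(path):
    # Assumption: the Nu checker's JSON interface at /nu/?out=json.
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        "https://validator.w3.org/nu/?out=json",
        data=body,
        headers={"Content-Type": "text/html; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        messages = json.load(resp)["messages"]
    # An empty result means the page passed.
    return [m["message"] for m in messages if m.get("type") == "error"]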
I find it quite easy to get the names of image files wrong, or to forget to copy them into the appropriate directory. The image checker finds all the image tags in the XHTML and tests whether the named files exist. At present it does not check image files on external sites.
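A minimal sketch of the idea (assumed, not the script's actual code): collect the src attribute of every <img> tag and report named files that do not exist:

import glob
import os
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

for page in glob.glob("*.html"):
    collector = ImgCollector()
    with open(page, encoding="utf-8") as f:
        collector.feed(f.read())
    for src in collector.srcs:
        if src.startswith(("http://", "https://")):
            continue  # images on external sites are not checked
        if not os.path.exists(src):
            print(page, "refers to missing image", src)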
External links are those addressed using the protocol and hostname prefix, e.g. "http://example.com"; internal links consist only of file names and anchor names, e.g. my_file.html#my_anchor. Links to your own site should not include the protocol and hostname prefix: if they do, check_pages will check them against the external copies, and this is not what you want.
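The distinction can be made mechanically. A sketch of one way, using the standard library (again an assumption, not the script's own test):

from urllib.parse import urlparse

def is_external(href):
    # A link is external if it carries a scheme and a hostname.
    parts = urlparse(href)
    return bool(parts.scheme) and bool(parts.netloc)

is_external("http://example.com/page.html")  # True  -> checked remotely
is_external("my_file.html#my_anchor")        # False -> checked locally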
Laboured example: suppose you discover that in file "x.html" you have misspelled an anchor name, say "my_nchor" instead of "my_anchor", but that a link to it from another file, say "y.html", is spelled correctly. In "x.html" you correct the anchor to "my_anchor" and re-run check_pages before uploading the files. check_pages will find the link "http://example.com/x.html#my_anchor" in "y.html", load "x.html" from the site "example.com", search for the anchor "my_anchor" and fail to find it, because the copy on the server still has the misspelling. What you wanted was for it to search the local copy.
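The local search the example calls for might look like this rough sketch (the file and anchor names are those of the example; anchors in XHTML are defined by id attributes and by <a name="...">):

import re

def local_anchors(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Both id="..." and name="..." define anchors in XHTML.
    return set(re.findall(r'(?:id|name)="([^"]+)"', text))

# The corrected link in y.html now passes against the local copy:
"my_anchor" in local_anchors("x.html")  # True after the fix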
We are all familiar with the browser's "404" error message when trying to access sites that cannot be found, but of course there are many other codes (including "200", which means: OK, succeeded). check_pages reports all codes other than "200". If necessary you can look them up in the W3 HTTP Status Code Definitions. Not all are real problems: for example, "302" means "The requested resource resides temporarily under a different URI", so no change is required; "301", however, means "The requested resource has been assigned a new permanent URI", so you should change the address. Below, in Figure 1, is a list of the results I got for sourgumdrop.org.uk.
Figure 1: status codes reported for sourgumdrop.org.uk

Code  Reason             Host               Path      Found in
302   Found              www.wordpress.org            sm_1.html
301   Moved Permanently  sudoku.com         /         contacts.html
404   /sudoku/index.jsp  www.act365.com     /sudoku/  contacts.html
302   Found              www.setbb.com      /phpbb/   contacts.html
301   Moved Permanently  www.webforumz.com  /         contacts.html
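If you want to see how such raw codes can be collected, here is a hedged sketch (not the script's code): urllib follows redirects silently, so redirection has to be suppressed before a 301 or 302 can be seen at all:

import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # decline to follow; urllib raises HTTPError instead

opener = urllib.request.build_opener(NoRedirect)

def status_of(url):
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.getcode(), resp.reason
    except urllib.error.HTTPError as e:
        return e.code, e.reason      # e.g. 301, 302, 404
    except urllib.error.URLError as e:
        return None, str(e.reason)   # e.g. unknown host

# Report everything other than 200, as check_pages does.
code, reason = status_of("http://www.wordpress.org/")
if code != 200:
    print(code, reason)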
The programs work on Windows. Simply use "\" instead of "/". For example:
In directory GAMES\MINESWEEPER\DESCRIPTIONS I run make_pages:

python ..\..\SCRIPTS\make_pages.py

This assumes that pages.txt is the page description file, and it puts the completed files in GAMES\COMPLETE.