I could spend hours and hours just copying and pasting stuff... but meh. Enter BeautifulSoup, urllib, and sqlite... all glued together with Python.
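Since sqlite only gets a passing mention here, a minimal sketch of the storage side, assuming one table of (url, html) pairs; the table name and schema are my own, not taken from migrate_web.py:

```python
import sqlite3

# Hypothetical schema: one row per scraped page (not from migrate_web.py).
conn = sqlite3.connect("site.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

def save_page(url, html):
    """Insert or update a scraped page's cleaned-up markup."""
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()
```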
Now, for the most part, the content I want to scrape from these static HTML pages also contains links to the other pages I want to keep. So here's the plan: use urllib to grab the content from the web server, use BeautifulSoup to extract the value of the href attribute from every "a" element, and keep a list of those values (checking to make sure each one points to a site we want to keep).
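A minimal sketch of that crawl step, using today's module layout (bs4 and urllib.request); the allowed_netloc parameter standing in for the "sites we want to keep" check is my own invention:

```python
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def collect_links(url, allowed_netloc):
    """Fetch a page and return the href of every <a> element that
    points back at the site we want to keep."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        target = urljoin(url, a["href"])  # resolve relative links
        if urlparse(target).netloc == allowed_netloc:
            links.append(target)
    return links

# e.g. collect_links("http://example.com/", "example.com")
```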
Then I use BeautifulSoup to parse the page, keeping only the bits I want. Lucky for me, the pages I'm particularly interested in have handy "InstanceBeginEditable" and "InstanceEndEditable" comments (thanks, Dreamweaver), so I just strip out everything before and after those and parse what's left over. Of course, this sort of thing is going to be different for everybody, but luckily, parsing XHTML isn't that difficult thanks to BeautifulSoup's mostly-comprehensive documentation.
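For those Dreamweaver markers, plain string slicing on the raw markup is enough before handing the remainder to BeautifulSoup; a sketch under that assumption (the marker handling is mine, not lifted from migrate_web.py):

```python
from bs4 import BeautifulSoup

def editable_region(html):
    """Keep only the markup between the InstanceBeginEditable and
    InstanceEndEditable comments, then parse what's left over."""
    begin = html.find("InstanceBeginEditable")
    end = html.find("InstanceEndEditable")
    if begin == -1 or end == -1:
        return None  # page wasn't built from the template
    begin = html.find("-->", begin) + len("-->")  # skip past the opening comment
    end = html.rfind("<!--", 0, end)              # back up to the closing comment
    return BeautifulSoup(html[begin:end], "html.parser")
```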
So, the part you've been waiting for... Download the Code, try it out, and leave me some feedback!
http://bradmontgomery.net/files/migrate_web.zip

migrate_web.py by Brad Montgomery is licensed under a Creative Commons Attribution 3.0 United States License.
Based on a work at bradmontgomery.net.