I wrote a Python script to convert Word documents to mostly clean HTML. Get it at https://github.com/bradmontgomery/word2html.
Ah, Microsoft Word...
That glorious business-class software used all around the world. It's perfect for those long, legal documents consisting of nothing but headers, paragraphs, and bulleted lists. All of which we can easily convert into simple HTML, right? Right?
File > Save As > Web Page (.htm). Easy as... No wait, was that supposed to be File > Save As > Web Page, Filtered (.htm)?
O.M.G. What is all this MsoTitle crap? Why are there so many inline styles for simple black & white text? Why are all of my bulleted lists paragraph tags!? Why oh why are we 20 years into having a world wide web, and the world's foremost business document software can't even generate a simple HTML page?
Never fear, there's hope.
The gist of the code looks something like this:
import pypandoc
from tidylib import tidy_document

# Convert the Word document to HTML with pandoc, then clean up the markup with HTML Tidy.
output = pypandoc.convert_file(your_filename, 'html')
output, errors = tidy_document(output)
with open(output_file, 'w') as f:
    f.write(output)
Grab the repo, install the requirements, and run the command:
python convert.py MyGloriousDoc.docx
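If you'd rather call the conversion from your own code, here is a rough sketch of how the pieces could be wrapped in a function. This is not the repo's actual convert.py: the names output_path_for and convert are made up for illustration, and writing the result next to the source file with an .html extension is my assumption about a sensible default.

```python
import sys
from pathlib import Path


def output_path_for(source):
    """Return the source path with its extension swapped for .html (assumed naming)."""
    return Path(source).with_suffix(".html")


def convert(source):
    """Convert one Word document to tidied HTML and return the output path."""
    # Imported here so the path helper above works without these installed.
    import pypandoc
    from tidylib import tidy_document

    # pandoc does the .docx -> HTML conversion; HTML Tidy cleans up the result.
    html = pypandoc.convert_file(str(source), "html")
    html, errors = tidy_document(html)
    if errors:
        # Tidy returns its warnings as plain text; send them to stderr.
        print(errors, file=sys.stderr)

    target = output_path_for(source)
    target.write_text(html, encoding="utf-8")
    return target
```

With pandoc, pypandoc, and pytidylib installed, calling convert("MyGloriousDoc.docx") should leave a MyGloriousDoc.html file beside the original.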
Happy converting your Word docs to HTML. Long live the web!