Let's convert a Word Doc to HTML

Published on April 29, 2016, 11:04 a.m.

html pandoc python word

tl;dr

I wrote a Python script to convert Word documents to mostly clean HTML. Get it at https://github.com/bradmontgomery/word2html.

Ah, Microsoft Word...

That glorious business-class software used all around the world. It's perfect for those long legal documents consisting of nothing but headers, paragraphs, and bulleted lists. All of which we can easily convert into simple HTML, right? Right?

File > Save As > Web Page (.htm). Easy as... No wait, was that supposed to be File > Save As > Web Page, Filtered (.htm)?

O.M.G. What is all this MsoTitle crap? Why are there so many inline styles for simple black & white text? Why are all of my bulleted lists paragraph tags!? Why, oh why, are we 20 years into having a world wide web, and the world's foremost business document software can't even generate a simple HTML page?

Never fear, there's hope.

Disclaimer: this is a quick and dirty hack. Check out my word2html script. With the magic of Python, pandoc, and pytidylib/HTML Tidy, doing this conversion isn't soooo bad.

The gist of the code looks something like this:

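import pypandoc
from tidylib import tidy_document

# your_filename is the input .docx path; output_file is where the HTML goes.
# Newer pypandoc releases dropped convert() in favor of convert_file().
output = pypandoc.convert_file(your_filename, 'html')
# tidy_document returns the cleaned markup plus any warnings it raised.
output, errors = tidy_document(output)
with open(output_file, 'w') as f:
    f.write(output)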
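If you'd rather have that gist as something you can call, here's a minimal sketch. The function names output_path_for and convert are made up for illustration and are not necessarily what the word2html repo uses; the only library calls are pypandoc.convert_file and tidylib's tidy_document. The sketch also assumes you want the HTML written next to the source file with the extension swapped.

```python
from pathlib import Path


def output_path_for(source):
    """Return the source path with its extension swapped for .html."""
    return Path(source).with_suffix(".html")


def convert(source, destination=None):
    """Convert a Word document to tidied HTML and return the output path.

    Hypothetical helper: the real word2html script may be organized differently.
    """
    # Imported here so output_path_for() works even without these installed.
    import pypandoc
    from tidylib import tidy_document

    destination = Path(destination) if destination else output_path_for(source)
    # pandoc does the heavy lifting: .docx in, HTML fragment out.
    html = pypandoc.convert_file(str(source), "html")
    # HTML Tidy wraps and cleans the markup; warnings come back in `errors`.
    html, errors = tidy_document(html)
    destination.write_text(html, encoding="utf-8")
    return destination
```

Calling convert("MyGloriousDoc.docx") should leave a MyGloriousDoc.html beside the original, assuming pandoc and HTML Tidy are installed.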

Grab the repo, install the requirements, and run the command:

 python convert.py MyGloriousDoc.docx
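If that command falls over, one thing worth ruling out is a missing dependency. Here's a rough sketch of a check using only the standard library. The function name missing_requirements is hypothetical, and the check is deliberately shallow: it looks for a pandoc executable on your PATH and for the pypandoc and tidylib Python packages, but it does not verify that the HTML Tidy shared library itself is installed, and it may report pandoc as missing when pypandoc has located it some other way.

```python
import importlib.util
import shutil


def missing_requirements():
    """Return a list of the pieces the conversion needs that can't be found."""
    missing = []
    # pandoc is an external program, so look for it on the PATH.
    if shutil.which("pandoc") is None:
        missing.append("pandoc (executable)")
    # These two are ordinary Python packages.
    for module in ("pypandoc", "tidylib"):
        if importlib.util.find_spec(module) is None:
            missing.append(f"{module} (Python package)")
    return missing
```

An empty list means everything this sketch knows how to look for was found.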

Happy converting your Word docs to HTML. Long live the web!
