MS Word "save as HTML" and HTML Purifier

Written by Peter Davies.

Whilst creating an older version of this website, I realised that saving a Microsoft Word document as HTML (filtered webpage) and then importing the source into the Jaws CMS caused quite a few issues. For example, the supposedly "filtered" output from MS Word is far from filtered and is likely to destroy any existing formatting created by the CMS.

This got me thinking, and with some simple code built on a recently updated HTML Purifier I managed to extract a filtered HTML document suitable for generating the articles you see on the right-hand side. This code has reduced the time it takes to convert a Word document to a fully formed XHTML document.

The PHP code is based around the PHP5 version of HTML Purifier 2.0:
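As a minimal sketch of this kind of clean-up, and not necessarily the exact code used on this site, the snippet below runs Word's HTML through HTML Purifier with a small whitelist of elements. The include path, the helper name clean_word_html and the list of allowed tags are assumptions made for illustration. The configuration calls use the dotted directive names of HTML Purifier 4.x; the 2.x releases took the namespace and key as separate arguments, for example set('HTML', 'Doctype', ...). The input is assumed to be UTF-8 already.

```php
<?php
// Assumption: HTML Purifier is unpacked next to this script; adjust the path.
require_once 'library/HTMLPurifier.auto.php';

/**
 * Hypothetical helper: strips Word-specific markup from saved HTML.
 *
 * @param string $dirtyHtml HTML as saved by Word (assumed to be UTF-8)
 * @return string cleaned XHTML fragment
 */
function clean_word_html($dirtyHtml)
{
    $config = HTMLPurifier_Config::createDefault();

    // Emit XHTML rather than HTML 4.
    $config->set('HTML.Doctype', 'XHTML 1.0 Transitional');

    // Whitelist of elements and attributes (an illustrative choice).
    // Anything not listed, including class and style attributes,
    // is dropped; the text inside removed elements is kept.
    $config->set(
        'HTML.Allowed',
        'h1,h2,h3,p,br,strong,b,em,i,ul,ol,li,a[href],table,tr,th,td'
    );

    // Disable the definition cache so no writable directory is needed.
    $config->set('Cache.DefinitionImpl', null);

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirtyHtml);
}

// Example input in the style Word produces.
$dirty = '<p class="MsoNormal" style="margin:0cm">'
       . '<span style="font-family:Calibri">Hello <b>world</b></span></p>';

echo clean_word_html($dirty);
```

Using a whitelist rather than trying to strip individual Word constructs keeps the rule set short: only markup the CMS is known to cope with gets through, whatever Word decides to emit.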

Access to the working demo will be available soon in a new "tools" area of the site. Of course, the combination of PHP4 and PHP5 might prove interesting; perhaps a link to the development server will do.