Word to HTML and HTML to Word Conversions

5 Jun 2024, 22 minutes to read

Essential DocIO converts HTML files into Word documents and vice versa. Word documents in the DOC, DOCX, RTF, DOT, DOTX, DOCM, and DOTM formats can be converted into HTML.

The Word library (DocIO) uses XmlReader to parse the content of the input HTML. The input HTML must therefore be well-formed XML (every opening tag needs a matching closing tag), even if you specify the XHTMLValidationType parameter as XHTMLValidationType.None.
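
Because the parser expects well-formed XML, it can help to check the markup before handing it to DocIO. The following is a minimal sketch that uses only the .NET System.Xml classes; the XhtmlPreflight class and its IsWellFormed method are hypothetical helpers, not part of DocIO, and the check treats the input as a complete document with a single root element. Named HTML entities such as &nbsp; are also reported as errors here, because the sketch ignores the DTD that would declare them.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper (not a DocIO API): checks that markup parses as well-formed XML.
public static class XhtmlPreflight
{
    // Returns true when the markup is well-formed; otherwise false with the parser message.
    public static bool IsWellFormed(string html, out string error)
    {
        var settings = new XmlReaderSettings
        {
            // Accept a DOCTYPE line without loading or processing the DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(html), settings))
            {
                // Reading to the end forces the parser to visit every node.
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

For example, "<p>one<br/>two</p>" passes this check, while "<p>one<br>two</p>" fails because the br element is never closed.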

Assemblies and NuGet packages required

Refer to the following links for assemblies and NuGet packages required based on platforms to convert the HTML file into Word document and vice versa using the .NET Word Library (DocIO).

Convert HTML to Word

The following code example shows how to convert an HTML file into a Word document.

NOTE

Refer to the appropriate tabs in the code snippets section: C# [Cross-platform] for ASP.NET Core, Blazor, Xamarin, UWP, .NET MAUI, and WinUI; C# [Windows-specific] for WinForms and WPF; VB.NET [Windows-specific] for VB.NET applications.

C# [Cross-platform]
FileStream fileStreamPath = new FileStream("Input.html", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
//Loads the HTML file stream through the constructor of the WordDocument class
using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Html))
{
    //Saves the Word document to MemoryStream
    MemoryStream stream = new MemoryStream();
    document.Save(stream, FormatType.Docx);
    //Closes the Word document
    document.Close();
}

C# [Windows-specific]
//Loads the HTML document against validation type none
WordDocument document = new WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None);
//Saves the Word document
document.Save("HTMLtoWord.docx", FormatType.Docx);
//Closes the document
document.Close();

VB.NET [Windows-specific]
'Loads the HTML document against validation type none
Dim document As New WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None)
'Saves the Word document
document.Save("HTMLtoWord.docx", FormatType.Docx)
'Closes the document
document.Close()

You can download a complete working sample from GitHub.

XHTML Validation

All HTML content is validated against a Document Type Definition (DTD), which is a set of markup declarations that defines a document type for an SGML-family markup language (GML, SGML, XML, HTML).

XHTML validation types

The following XHTML validation types are supported in Essential DocIO while importing HTML content.

XHTML validation type | Description
XHTMLValidationType.None | Does not perform any schema validation, but the given HTML content should meet the XHTML 1.0 format.
XHTMLValidationType.Transitional | Allows several attributes within the tags.
XHTMLValidationType.Strict | Does not allow the attributes inside the tags.

Customizing the HTML to Word conversion

Essential DocIO provides the following settings for HTML to Word conversion:

  • Validate the HTML string against XHTML 1.0 Strict and Transitional schema.
  • Insert the HTML string at the specified position of the document body contents.
  • Append HTML string to the specified paragraph.

The following code example shows how to customize the HTML to Word conversion.

C# [Windows-specific]
//Loads the template document
WordDocument document = new WordDocument("Template.docx");
//Html string to be inserted
string htmlstring = "<p><b>This text is inserted as HTML string.</b></p>";
//Validates the Html string
bool isValidHtml = document.LastSection.Body.IsValidXHTML(htmlstring, XHTMLValidationType.Transitional);
//When the Html string passes validation, it is inserted to the document
if (isValidHtml)
{
    //Inserts the Html string at paragraph index 2, paragraph item index 0
    document.Sections[0].Body.InsertXHTML(htmlstring, 2, 0);
    //Appends the Html string to first paragraph in the document
    document.Sections[0].Body.Paragraphs[0].AppendHTML(htmlstring);
}
//Saves and closes the document
document.Save("Sample.docx");
document.Close();

VB.NET [Windows-specific]
'Loads the template document
Dim document As New WordDocument("Template.docx")
'Html string to be inserted
Dim htmlstring As String = "<p><b>This text is inserted as HTML string.</b></p>"
'Validates the Html string
Dim isValidHtml As Boolean = document.LastSection.Body.IsValidXHTML(htmlstring, XHTMLValidationType.Transitional)
'When the Html string passes validation, it is inserted to the document
If isValidHtml Then
    'Inserts the Html string at paragraph index 2, paragraph item index 0
    document.Sections(0).Body.InsertXHTML(htmlstring, 2, 0)
    'Appends the Html string to first paragraph in the document
    document.Sections(0).Body.Paragraphs(0).AppendHTML(htmlstring)
End If
'Saves and closes the document
document.Save("Sample.docx")
document.Close()

You can download a complete working sample from GitHub.

NOTE

  1. Inserting an XHTML string is not supported in Silverlight, Windows Phone, and Xamarin applications.
  2. XHTML validation against the XHTML 1.0 Strict and Transitional schemas is not supported in Windows Store applications.
  3. XHTMLValidationType.None: The default validation type while importing an HTML file.
  4. XHTMLValidationType.None: Checks the HTML file against the XHTML format but does not perform any schema validation.

Customize image data

Essential DocIO provides an ImageNodeVisited event for customizing image data while importing and exporting HTML files. Implement your image-handling logic in a handler for this event.

The following code example shows how to load image data based on the image source path when importing an HTML file.

C# [Cross-platform]
//Open the file as Stream
FileStream docStream = new FileStream("Input.html", FileMode.Open, FileAccess.Read);
//Creates a new instance of WordDocument
WordDocument document = new WordDocument();
//Hooks the ImageNodeVisited event to open the image from a specific location
document.HTMLImportSettings.ImageNodeVisited += OpenImage;
//Opens the input HTML document
document.Open(docStream, FormatType.Html);
//Unhooks the ImageNodeVisited event after loading HTML
document.HTMLImportSettings.ImageNodeVisited -= OpenImage;
//Creates an instance of memory stream and saves the Word document to it
MemoryStream stream = new MemoryStream();
document.Save(stream, FormatType.Docx);
//Closes the WordDocument instance
document.Close();

C# [Windows-specific]
//Creates a new instance of WordDocument
WordDocument document = new WordDocument();
//Hooks the ImageNodeVisited event to open the image from a specific location
document.HTMLImportSettings.ImageNodeVisited += OpenImage;
//Opens the input HTML document
document.Open("Input.html", FormatType.Html);
//Unhooks the ImageNodeVisited event after loading HTML
document.HTMLImportSettings.ImageNodeVisited -= OpenImage;
//Saves the Word document
document.Save("HtmlToWord.docx", FormatType.Docx);
//Closes the WordDocument instance
document.Close();

VB.NET [Windows-specific]
'Creates a new instance of WordDocument
Dim document As WordDocument = New WordDocument()
'Hooks the ImageNodeVisited event to open the image from a specific location
AddHandler document.HTMLImportSettings.ImageNodeVisited, AddressOf OpenImage
'Opens the input HTML document
document.Open("Input.html", FormatType.Html)
'Unhooks the ImageNodeVisited event after loading HTML
RemoveHandler document.HTMLImportSettings.ImageNodeVisited, AddressOf OpenImage
'Saves the Word document
document.Save("HtmlToWord.docx", FormatType.Docx)
'Closes the WordDocument instance
document.Close()

The following code example shows how to read the image from the specified path when importing the HTML files.

C# [Cross-platform] and C# [Windows-specific]
private void OpenImage(object sender, ImageNodeVisitedEventArgs args)
{
    //Read the image from the specified (args.Uri) path
    args.ImageStream = System.IO.File.OpenRead(args.Uri);
}

VB.NET [Windows-specific]
Private Sub OpenImage(ByVal sender As Object, ByVal args As ImageNodeVisitedEventArgs)
    'Read the image from the specified (args.Uri) path
    args.ImageStream = System.IO.File.OpenRead(args.Uri)
End Sub

You can download a complete working sample from GitHub.

NOTE

Hooking the above event is mandatory on the ASP.NET Core, UWP, and Xamarin platforms to preserve the images in HTML conversions.
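
The handler above assumes that the image source is a file path that can be opened directly. As a rough sketch of a more tolerant lookup, the hypothetical ImageSourceResolver helper below (not a DocIO API) resolves relative paths against a base folder and decodes base64 data URIs. Whether DocIO passes data URIs or relative paths to the event unchanged is an assumption here, so verify it against your own input.

```csharp
using System;
using System.IO;

// Hypothetical helper (not a DocIO API): turns an image reference into a readable stream.
public static class ImageSourceResolver
{
    // Returns a stream for the image reference, or null when it cannot be resolved.
    public static Stream Open(string uri, string baseFolder)
    {
        if (string.IsNullOrWhiteSpace(uri))
            return null;

        // Embedded image of the form data:image/png;base64,<payload>
        const string marker = ";base64,";
        if (uri.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
        {
            int index = uri.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
            if (index < 0)
                return null;
            byte[] bytes = Convert.FromBase64String(uri.Substring(index + marker.Length));
            return new MemoryStream(bytes);
        }

        // File reference: resolve relative paths against the folder that holds the HTML file.
        string path = Path.IsPathRooted(uri) ? uri : Path.Combine(baseFolder, uri);
        return File.Exists(path) ? File.OpenRead(path) : null;
    }
}
```

Inside an OpenImage handler, this could be used as Stream image = ImageSourceResolver.Open(args.Uri, "Data"); followed by assigning args.ImageStream only when the result is not null. The "Data" folder name is a placeholder.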

Convert Word to HTML

The following code example shows how to convert the Word document into HTML.

C# [Cross-platform]
FileStream fileStreamPath = new FileStream("Template.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
//Loads the Word document stream through the constructor of the WordDocument class
using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Docx))
{
    //Saves the document as HTML to MemoryStream
    MemoryStream stream = new MemoryStream();
    document.Save(stream, FormatType.Html);
    //Closes the Word document
    document.Close();
}

C# [Windows-specific]
//Loads the template document
WordDocument document = new WordDocument("Template.docx", FormatType.Docx);
//Saves the document as Html file
document.Save("WordToHtml.html", FormatType.Html);
//Closes the document
document.Close();

VB.NET [Windows-specific]
'Loads the template document
Dim document As New WordDocument("Template.docx", FormatType.Docx)
'Saves the document as Html file
document.Save("WordToHtml.html", FormatType.Html)
'Closes the document
document.Close()

You can download a complete working sample from GitHub.

Customizing the Word to HTML conversion

You can customize the Word to HTML conversion with the following options:

  • Extract the images used in the HTML document to the specified file directory.
  • Specify whether to export the header and footer of the Word document in the HTML.
  • Specify whether to treat text input form fields as editable fields or as text.
  • Specify the CSS style sheet type and its name.
  • Export the images as Base-64 embedded images.
  • Omit the XML declaration in the exported HTML file using HtmlExportOmitXmlDeclaration.

NOTE

  1. When exporting headers and footers, DocIO exports the header content of the first section at the top of the HTML file and the footer content of the first section at the end of the HTML file.
  2. HtmlExportImagesFolder and HtmlExportCssStyleSheetFileName APIs are only supported in the .NET Framework.

The following code sample illustrates how to customize Word to HTML conversion.

C# [Cross-platform]
//Load an existing Word document into DocIO instance.
using (FileStream fileStreamPath = new FileStream("Input.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Docx))
    {
        //The header and footer in the input are exported.
        document.SaveOptions.HtmlExportHeadersFooters = true;
        //Export the text form fields as editable.
        document.SaveOptions.HtmlExportTextInputFormFieldAsText = false;
        //Set the style sheet type.
        document.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.Inline;
        //Set value to omit XML declaration in the exported html file.
        //True to omit the XML declaration, otherwise false.
        document.SaveOptions.HtmlExportOmitXmlDeclaration = false;
        //Create a file stream.
        using (FileStream outputFileStream = new FileStream("WordToHTML.html", FileMode.Create, FileAccess.ReadWrite))
        {
            //Save the HTML file to file stream.
            document.Save(outputFileStream, FormatType.Html);
        }
    }
}

C# [Windows-specific]
//Loads an existing document
WordDocument document = new WordDocument("Template.docx");
HTMLExport export = new HTMLExport();
//The images in the input document are copied to this folder
document.SaveOptions.HtmlExportImagesFolder = @"D:/Data/";
//The headers and footers in the input are exported
document.SaveOptions.HtmlExportHeadersFooters = true;
//Exports the text form fields as editable
document.SaveOptions.HtmlExportTextInputFormFieldAsText = false;
//Sets the style sheet type
document.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.External;
//Sets name for style sheet
document.SaveOptions.HtmlExportCssStyleSheetFileName = "UserDefinedFileName.css";
//Export the Word document image as Base-64 embedded image
document.SaveOptions.HTMLExportImageAsBase64 = true;
//Set value to omit XML declaration in the exported html file.
document.SaveOptions.HtmlExportOmitXmlDeclaration = true;
//Saves the document as html file
export.SaveAsXhtml(document, "WordtoHtml.html");
document.Close();

VB.NET [Windows-specific]
'Loads an existing document
Dim document As New WordDocument("Template.docx")
Dim export As New HTMLExport()
'The images in the input document are copied to this folder
document.SaveOptions.HtmlExportImagesFolder = "D:/Data/"
'The headers and footers in the input are exported
document.SaveOptions.HtmlExportHeadersFooters = True
'Exports the text form fields as editable
document.SaveOptions.HtmlExportTextInputFormFieldAsText = False
'Sets the style sheet type
document.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.External
'Sets name for style sheet
document.SaveOptions.HtmlExportCssStyleSheetFileName = "UserDefinedFileName.css"
'Export the Word document image as Base-64 embedded image
document.SaveOptions.HTMLExportImageAsBase64 = True
'Set value to omit XML declaration in the exported html file.
document.SaveOptions.HtmlExportOmitXmlDeclaration = True
'Saves the document as html file
export.SaveAsXhtml(document, "WordtoHtml.html")
document.Close()

You can download a complete working sample from GitHub.

Customize image path

DocIO provides an ImageNodeVisited event, which is used to customize the image path that is set in the output HTML file and save images externally while converting a Word document to HTML.

The following code example illustrates how to save image files during a Word to HTML conversion.

C# [Cross-platform]
//Open the file as a Stream.
using (FileStream docStream = new FileStream("Data/Input.docx", FileMode.Open, FileAccess.Read))
{
    //Load the file stream into a Word document.
    using (WordDocument document = new WordDocument(docStream, FormatType.Docx))
    {
        //Hook the event to customize the image.
        document.SaveOptions.ImageNodeVisited += SaveImage;
        using (FileStream outputStream = new FileStream("WordtoHTML.html", FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite))
        {
            //Save the HTML file.
            document.Save(outputStream, FormatType.Html);
        }
    }
}

C# [Windows-specific]
//Open an existing Word document.
using (WordDocument document = new WordDocument("Input.docx"))
{
    //Hook the event to customize the image.
    document.SaveOptions.ImageNodeVisited += SaveImage;
    //Save the Word document as an HTML file.
    document.Save("WordtoHTML.html", FormatType.Html);
}

VB.NET [Windows-specific]
'Open an existing Word document.
Using document As WordDocument = New WordDocument("Input.docx")
    'Hook the event to customize the image.
    AddHandler document.SaveOptions.ImageNodeVisited, AddressOf SaveImage
    'Save the Word document as an HTML file.
    document.Save("WordtoHTML.html", FormatType.Html)
End Using

The following code example illustrates the event handler for customizing the image path and saving the image in an external folder.

C# [Cross-platform] and C# [Windows-specific]
static void SaveImage(object sender, ImageNodeVisitedEventArgs args)
{
    string imagepath = @"D:\Temp\Image.png";
    //Save the image stream as a file.
    using (FileStream fileStreamOutput = File.Create(imagepath))
        args.ImageStream.CopyTo(fileStreamOutput);
    //Set the URI to be used for the image in the output HTML.
    args.Uri = imagepath;
}

VB.NET [Windows-specific]
Private Shared Sub SaveImage(ByVal sender As Object, ByVal args As ImageNodeVisitedEventArgs)
    Dim imagepath = "D:\Temp\Image.png"
    'Save the image stream as a file.
    Using fileStreamOutput = File.Create(imagepath)
        args.ImageStream.CopyTo(fileStreamOutput)
    End Using
    'Set the URI to be used for the image in the output HTML.
    args.Uri = imagepath
End Sub

TIPS

By utilizing the event handler mentioned above, you can also implement logic to store images in the Cloud or other online storage platforms.
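
The handler shown earlier writes every image to the same fixed path, so a document with several images would keep only the last one. One possible way around this, sketched below, is to give each image its own file name. The ImageStore class is a hypothetical helper rather than a DocIO API, and the default ".png" extension is an assumption because the handler does not show how to detect the actual image format.

```csharp
using System;
using System.IO;

// Hypothetical helper (not a DocIO API): stores each image under a unique file name.
public static class ImageStore
{
    // Copies the image stream to a new, uniquely named file in 'folder' and returns the file path.
    public static string Save(Stream image, string folder, string extension = ".png")
    {
        // Create the target folder if it does not exist yet.
        Directory.CreateDirectory(folder);
        string path = Path.Combine(folder, Guid.NewGuid().ToString("N") + extension);
        using (FileStream output = File.Create(path))
        {
            image.CopyTo(output);
        }
        return path;
    }
}
```

Inside the SaveImage handler, the body could then be reduced to args.Uri = ImageStore.Save(args.ImageStream, @"D:\Temp"); so that each image in the output HTML points at its own file.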

You can download a complete working sample from GitHub.

Export HTML with body content alone

While saving a Word document as an HTML file using the .NET Word Library, you can save only the content within the <body> tags, excluding other elements, through the HtmlExportBodyContentAlone API.

The following code example illustrates how to export the HTML file with only the body content.

C# [Cross-platform]
//Load an existing Word document.
using (FileStream fileStreamPath = new FileStream("Input.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Docx))
    {
        //Enable the flag, to save HTML with elements inside body tags alone.
        document.SaveOptions.HtmlExportBodyContentAlone = true;

        using (FileStream outputFileStream = new FileStream("WordToHTML.html", FileMode.Create, FileAccess.ReadWrite))
        {
            //Save Word document as HTML.
            document.Save(outputFileStream, FormatType.Html);
        }
    }
}

C# [Windows-specific]
//Load an existing Word document.
WordDocument document = new WordDocument("Input.docx", FormatType.Docx);
//Enable the flag, to save HTML with elements inside body tags alone.
document.SaveOptions.HtmlExportBodyContentAlone = true;
//Saves the document as html file
document.Save("WordtoHtml.html", FormatType.Html);
document.Close();

VB.NET [Windows-specific]
'Loads an existing document
Dim document As New WordDocument("Input.docx")
'Enable the flag, to save HTML with elements inside body tags alone.
document.SaveOptions.HtmlExportBodyContentAlone = True
'Saves the document as html file
document.Save("WordtoHtml.html", FormatType.Html)
document.Close()

You can download a complete working sample from GitHub.
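
A body-only export is typically embedded in a page that supplies its own head section. The sketch below shows one way to do that with plain string handling. HtmlFragmentHost is a hypothetical helper, not a DocIO API, and it assumes that the exported fragment contains only the markup that belongs inside the body element, without the <body> tag itself.

```csharp
using System;
using System.Net;

// Hypothetical helper (not a DocIO API): places a body-only HTML fragment in a minimal page.
public static class HtmlFragmentHost
{
    // Wraps the fragment in a page skeleton; the title is HTML-encoded, the fragment is inserted as-is.
    public static string Wrap(string bodyFragment, string title)
    {
        return "<!DOCTYPE html>\n" +
               "<html>\n<head>\n<meta charset=\"utf-8\" />\n" +
               "<title>" + WebUtility.HtmlEncode(title) + "</title>\n" +
               "</head>\n<body>\n" + bodyFragment + "\n</body>\n</html>\n";
    }
}
```

The fragment is inserted without any sanitizing, so this approach only suits HTML that comes from documents you trust.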

Supported and unsupported items

The following document elements and attributes are supported by DocIO in Word to HTML and HTML to Word conversions.

Document Element | Attribute | Support Status | Notes
Bookmark | Id | Yes | -
Border | Color | Yes | -
Border | Distance from text | Yes | -
Border | Line style | Partial | Some line styles are rendered as solid.
Border | Line width | Yes | -
Document Properties | - | Yes | -
Field | - | Yes | -
Footnotes and Endnotes | - | Yes | -
Form Field | Text input, Checkbox and combo box | Yes | -
Header / Footer | Different per section | Partial | Only odd header of the first section is preserved in HTML export.
Hyperlink | External URL | Yes | -
Hyperlink | Local | Yes | -
Image | Inline | Yes | -
Image | Scale | Yes | -
List | Custom bullets | Yes | -
List | Multi-level | Yes | -
List | Numbered | Yes | -
List | Restart numbering | Yes | -
List | Standard bullets | Yes | -
Comment | - | No | -
Symbols | - | Yes | -
Paragraph | Alignment | Yes | -
Paragraph | Borders | Yes | See Borders, for more details.
Paragraph | Keep lines and paragraphs together | Yes | -
Paragraph | Paragraph Indents | Yes | -
Paragraph | Line spacing | Yes | -
Paragraph | Page break before | Yes | -
Paragraph | Shading | Yes | See Shading, for more details.
Paragraph | Spacing before and after | Yes | -
Shading | Background color | Partial | Solid background colors are supported.
Shading | Foreground color | Partial | Solid foreground color is used when background color is auto.
Styles | Paragraph styles | Yes | -
Styles | Character styles | Yes | -
Styles | List styles | Yes | -
Table | Alignment | Yes | -
Table | Cell margins | Yes | -
Table | Column widths | Yes | -
Table | Indent from left | Yes | -
Table | Preferred width | Yes | -
Table | Spacing between cells | Yes | -
Table | Borders | Partial | See Borders, for more details.
Table | Shading | Partial | See Shading, for more details.
Nested Table | - | Yes | -
Table Cell | Borders | Partial | See Borders, for more details.
Table Cell | Cell margins | Yes | -
Table Cell | Horizontal merge | Yes | -
Table Cell | Shading | Partial | See Shading, for more details.
Table Cell | Vertical alignment | Yes | -
Table Cell | Vertical merge | Yes | -
Table Row | Height | Yes | -
Table Row | Padding | Yes | -
Text | All caps | Yes | -
Text | Bold | Yes | -
Text | Character spacing | Yes | -
Text | Color | Yes | -
Text | Emboss | Partial | Rendered as bold.
Text | Engrave | Partial | Rendered as bold.
Text | Font | Yes | -
Text | Hidden | Yes | -
Text | Highlighting | Yes | -
Text | Imprint | Partial | Rendered as bold.
Text | Italic | Yes | -
Text | Line breaks | Yes | -
Text | Outline | Partial | Rendered as bold.
Text | Page breaks | Yes | -
Text | Shading | Partial | See Shading, for more details.
Text | Small caps | Yes | -
Text | Special symbols | Yes | -
Text | Strike out | Yes | -
Text | Subscript / Superscript | Yes | -
Text | Underline | Partial | Underline types and colors are ignored.
