Word to HTML and HTML to Word Conversions
5 Jun 2024
Essential DocIO converts HTML files into Word documents, and Word documents (DOC, DOCX, RTF, DOT, DOTX, DOCM, and DOTM) into HTML format.
The Word library (DocIO) uses XmlReader to parse the input HTML. The input HTML must therefore be well-formed XML (every opening tag needs a matching closing tag), even when you pass XHTMLValidationType.None as the XHTMLValidationType parameter.
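Because the input must parse as XML, it can help to check well-formedness before handing a file to DocIO. The following is a minimal sketch that uses only the .NET base class library; the HtmlPrecheck class and IsWellFormed method are hypothetical names, not part of DocIO, and the check only approximates what DocIO's own parser accepts.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper (not a DocIO API): reports whether markup is well-formed XML.
public static class HtmlPrecheck
{
    // Returns true when the markup parses as a well-formed XML document;
    // otherwise returns false and passes back the parser's message.
    public static bool IsWellFormed(string html, out string error)
    {
        var settings = new XmlReaderSettings
        {
            // Skip any DOCTYPE instead of rejecting or resolving it.
            DtdProcessing = DtdProcessing.Ignore,
            // Never fetch external resources while checking.
            XmlResolver = null
        };

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(html), settings))
            {
                // Reading to the end forces the parser to visit every node.
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

For example, markup containing an unclosed `<br>` fails this check, while the same markup with `<br/>` passes. The returned message includes the line and position reported by the parser, which is usually enough to locate the offending tag.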
Assemblies and NuGet packages required
Refer to the following links for assemblies and NuGet packages required based on platforms to convert the HTML file into Word document and vice versa using the .NET Word Library (DocIO).
Convert HTML to Word
The following code example shows how to convert the HTML file into Word document.
NOTE
Refer to the appropriate tabs in the code snippets section: C# [Cross-platform] for ASP.NET Core, Blazor, Xamarin, UWP, .NET MAUI, and WinUI; C# [Windows-specific] for WinForms and WPF; VB.NET [Windows-specific] for VB.NET applications.
FileStream fileStreamPath = new FileStream("Input.html", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
//Opens an existing document from file system through constructor of WordDocument class
using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Html))
{
//Saves the Word document to MemoryStream
MemoryStream stream = new MemoryStream();
document.Save(stream, FormatType.Docx);
//Closes the Word document
document.Close();
}
//Loads the HTML document against validation type none
WordDocument document = new WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None);
//Saves the Word document
document.Save("HTMLtoWord.docx", FormatType.Docx);
//Closes the document
document.Close();
' Loads the HTML document against validation type none
Dim document As New WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None)
'Saves the Word document
document.Save("HTMLtoWord.docx", FormatType.Docx)
'Closes the document
document.Close()
You can download a complete working sample from GitHub.
XHTML Validation
HTML content is validated against a Document Type Definition (DTD), a set of markup declarations that defines a document type for an SGML-family markup language (GML, SGML, XML, HTML).
XHTML validation types
The following XHTML validation types are supported in Essential DocIO when importing HTML content.
XHTML validation type | Description |
---|---|
XHTMLValidationType.None | Performs no schema validation, but the given HTML content must still conform to the XHTML 1.0 format. |
XHTMLValidationType.Transitional | Validates against the XHTML 1.0 Transitional schema, which allows several attributes within the tags. |
XHTMLValidationType.Strict | Validates against the XHTML 1.0 Strict schema, which does not allow those attributes within the tags. |
Customizing the HTML to Word conversion
Essential DocIO provides the following options for HTML to Word conversion:
- Validate the HTML string against XHTML 1.0 Strict and Transitional schema.
- Insert the HTML string at the specified position of the document body contents.
- Append HTML string to the specified paragraph.
The following code example shows how to customize the HTML to Word conversion.
//Loads the template document
WordDocument document = new WordDocument("Template.docx");
//Html string to be inserted
string htmlstring = "<p><b>This text is inserted as HTML string.</b></p>";
//Validates the Html string
bool isValidHtml = document.LastSection.Body.IsValidXHTML(htmlstring, XHTMLValidationType.Transitional);
//When the Html string passes validation, it is inserted to the document
if (isValidHtml)
{
//Inserts the Html string at paragraph index 2, paragraph item index 0 of the text body
document.Sections[0].Body.InsertXHTML(htmlstring, 2, 0);
//Appends the Html string to first paragraph in the document
document.Sections[0].Body.Paragraphs[0].AppendHTML(htmlstring);
}
//Saves and closes the document
document.Save("Sample.docx");
document.Close();
'Loads the template document
Dim document As New WordDocument("Template.docx")
'Html string to be inserted
Dim htmlstring As String = "<p><b>This text is inserted as HTML string.</b></p>"
'Validates the Html string
Dim isValidHtml As Boolean = document.LastSection.Body.IsValidXHTML(htmlstring, XHTMLValidationType.Transitional)
'When the Html string passes validation, it is inserted to document
If isValidHtml Then
'Inserts the Html string at paragraph index 2, paragraph item index 0 of the text body
document.Sections(0).Body.InsertXHTML(htmlstring, 2, 0)
'Appends the Html string to first paragraph in the document
document.Sections(0).Body.Paragraphs(0).AppendHTML(htmlstring)
End If
'Saves and closes the document
document.Save("Sample.docx")
document.Close()
You can download a complete working sample from GitHub.
NOTE
- Inserting XHTML string is not supported in Silverlight, Windows Phone, and Xamarin applications.
- XHTML validation against XHTML 1.0 Strict and Transitional schema is not supported in Windows Store applications.
- XHTMLValidationType.None: Default validation while importing an HTML file. It validates the HTML file against the XHTML format and does not perform any schema validation.
Customize image data
Essential DocIO provides the ImageNodeVisited event for customizing image data while importing and exporting HTML files. Implement your customization logic in a handler for this event.
The following code example shows how to load image data based on image source path when importing the HTML files.
//Open the file as Stream
FileStream docStream = new FileStream("Input.html", FileMode.Open, FileAccess.Read);
//Creates a new instance of WordDocument
WordDocument document = new WordDocument();
//Hooks the ImageNodeVisited event to open the image from a specific location
document.HTMLImportSettings.ImageNodeVisited += OpenImage;
//Opens the input HTML document
document.Open(docStream, FormatType.Html);
//Unhooks the ImageNodeVisited event after loading HTML
document.HTMLImportSettings.ImageNodeVisited -= OpenImage;
//Saves the Word document to MemoryStream
MemoryStream stream = new MemoryStream();
document.Save(stream, FormatType.Docx);
//Closes the WordDocument instance
document.Close();
//Creates a new instance of WordDocument
WordDocument document = new WordDocument();
//Hooks the ImageNodeVisited event to open the image from a specific location
document.HTMLImportSettings.ImageNodeVisited += OpenImage;
//Opens the input HTML document
document.Open("Input.html", FormatType.Html);
//Unhooks the ImageNodeVisited event after loading HTML
document.HTMLImportSettings.ImageNodeVisited -= OpenImage;
//Saves the Word document
document.Save("HtmlToWord.docx", FormatType.Docx);
//Closes the WordDocument instance
document.Close();
'Creates a new instance of WordDocument
Dim document As WordDocument = New WordDocument()
'Hooks the ImageNodeVisited event to open the image from a specific location
AddHandler document.HTMLImportSettings.ImageNodeVisited, AddressOf OpenImage
'Opens the input HTML document
document.Open("Input.html", FormatType.Html)
'Unhooks the ImageNodeVisited event after loading HTML
RemoveHandler document.HTMLImportSettings.ImageNodeVisited, AddressOf OpenImage
'Saves the Word document
document.Save("HtmlToWord.docx", FormatType.Docx)
'Closes the WordDocument instance
document.Close()
The following code example shows how to read the image from the specified path when importing the HTML files.
private void OpenImage(object sender, ImageNodeVisitedEventArgs args)
{
//Read the image from the specified (args.Uri) path
args.ImageStream = System.IO.File.OpenRead(args.Uri);
}
private void OpenImage(object sender, ImageNodeVisitedEventArgs args)
{
//Read the image from the specified (args.Uri) path
args.ImageStream = System.IO.File.OpenRead(args.Uri);
}
Private Sub OpenImage(ByVal sender As Object, ByVal args As ImageNodeVisitedEventArgs)
'Read the image from the specified (args.Uri) path
args.ImageStream = System.IO.File.OpenRead(args.Uri)
End Sub
You can download a complete working sample from GitHub.
NOTE
Handling the above event is mandatory on the ASP.NET Core, UWP, and Xamarin platforms to preserve images in HTML conversions.
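The OpenImage handler above passes args.Uri straight to File.OpenRead, which throws when the path is relative to a different folder or the file is missing. The following sketch shows one way to resolve the source first. ImageSourceResolver and its Open method are hypothetical helpers built only on System.IO; they are not DocIO APIs, and how DocIO treats an ImageStream that is left unset is not covered here.

```csharp
using System;
using System.IO;

// Hypothetical helper (not a DocIO API): resolves an image source to a stream.
public static class ImageSourceResolver
{
    // Returns a readable stream for the image, or null when the source
    // is empty or the file does not exist.
    public static Stream Open(string uri, string baseDirectory)
    {
        if (string.IsNullOrWhiteSpace(uri))
            return null;

        // Resolve relative sources against the folder that holds the HTML file.
        string path = Path.IsPathRooted(uri) ? uri : Path.Combine(baseDirectory, uri);

        return File.Exists(path) ? File.OpenRead(path) : null;
    }
}

// Assumed usage inside the handler shown above, where htmlFolder is a
// caller-supplied variable holding the input file's directory:
//   Stream image = ImageSourceResolver.Open(args.Uri, htmlFolder);
//   if (image != null) args.ImageStream = image;
```

Keeping the lookup in a separate method makes it possible to test the path handling without loading a document.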
Convert Word to HTML
The following code example shows how to convert the Word document into HTML.
FileStream fileStreamPath = new FileStream("Template.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
//Opens an existing document from file system through constructor of WordDocument class
using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Docx))
{
//Saves the Word document to MemoryStream
MemoryStream stream = new MemoryStream();
document.Save(stream, FormatType.Html);
//Closes the Word document
document.Close();
}
//Loads the template document
WordDocument document = new WordDocument("Template.docx", FormatType.Docx);
//Saves the document as Html file
document.Save("WordToHtml.html", FormatType.Html);
//Closes the document
document.Close();
'Loads the template document
Dim document As New WordDocument("Template.docx", FormatType.Docx)
'Saves the document as Html file
document.Save("WordToHtml.html", FormatType.Html)
'Closes the document
document.Close()
You can download a complete working sample from GitHub.
Customizing the Word to HTML conversion
You can customize the Word to HTML conversion with the following options:
- Extract the images used in the HTML document to a specified folder.
- Export the header and footer of the Word document to the HTML.
- Treat text input form fields as editable fields or as plain text.
- Specify the CSS style sheet type and its file name.
- Export the images as Base64 embedded images.
- Omit the XML declaration in the exported HTML file using HtmlExportOmitXmlDeclaration.
NOTE
- When exporting the header and footer, DocIO exports the header content of the first section at the top of the HTML file and the footer content of the first section at the end of the HTML file.
- HtmlExportImagesFolder and HtmlExportCssStyleSheetFileName APIs are only supported in the .NET Framework.
The following code sample illustrates how to customize Word to HTML conversion.
//Load an existing Word document into DocIO instance.
using (FileStream fileStreamPath = new FileStream("Input.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Docx))
{
//The header and footer in the input are exported.
document.SaveOptions.HtmlExportHeadersFooters = true;
//Export the text form fields as editable .
document.SaveOptions.HtmlExportTextInputFormFieldAsText = false;
//Set the style sheet type.
document.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.Inline;
//Set value to omit XML declaration in the exported html file.
//True- to omit xml declaration, otherwise false.
document.SaveOptions.HtmlExportOmitXmlDeclaration = false;
//Create a file stream.
using (FileStream outputFileStream = new FileStream("WordToHTML.html", FileMode.Create, FileAccess.ReadWrite))
{
//Save the HTML file to file stream.
document.Save(outputFileStream, FormatType.Html);
}
}
}
//Loads an existing document
WordDocument document = new WordDocument("Template.docx");
HTMLExport export = new HTMLExport();
//The images in the input document are copied to this folder
document.SaveOptions.HtmlExportImagesFolder = @"D:/Data/";
//The headers and footers in the input are exported
document.SaveOptions.HtmlExportHeadersFooters = true;
//Exports the text form fields as editable
document.SaveOptions.HtmlExportTextInputFormFieldAsText = false;
//Sets the style sheet type
document.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.External;
//Sets name for style sheet
document.SaveOptions.HtmlExportCssStyleSheetFileName = "UserDefinedFileName.css";
//Export the Word document image as Base-64 embedded image
document.SaveOptions.HTMLExportImageAsBase64 = true;
//Set value to omit XML declaration in the exported html file.
document.SaveOptions.HtmlExportOmitXmlDeclaration = true;
//Saves the document as html file
export.SaveAsXhtml(document, "WordtoHtml.html");
document.Close();
'Loads an existing document
Dim document As New WordDocument("Template.docx")
Dim export As New HTMLExport()
'The images in the input document are copied to this folder
document.SaveOptions.HtmlExportImagesFolder = "D:/Data/"
'The headers and footers in the input are exported
document.SaveOptions.HtmlExportHeadersFooters = True
'Exports the text form fields as editable
document.SaveOptions.HtmlExportTextInputFormFieldAsText = False
'Sets the style sheet type
document.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.External
'Sets name for style sheet
document.SaveOptions.HtmlExportCssStyleSheetFileName = "UserDefinedFileName.css"
'Export the Word document image as Base-64 embedded image
document.SaveOptions.HTMLExportImageAsBase64 = True
'Set value to omit XML declaration in the exported html file.
document.SaveOptions.HtmlExportOmitXmlDeclaration = True
'Saves the document as html file
export.SaveAsXhtml(document, "WordtoHtml.html")
document.Close()
You can download a complete working sample from GitHub.
Customize image Path
DocIO provides an ImageNodeVisited event, which is used to customize the image path that is set in the output HTML file and save images externally while converting a Word document to HTML.
The following code example illustrates how to save image files during a Word to HTML conversion.
//Open the file as a Stream.
using (FileStream docStream = new FileStream("Data/Input.docx", FileMode.Open, FileAccess.Read))
{
//Load the file stream into a Word document.
using (WordDocument document = new WordDocument(docStream, FormatType.Docx))
{
//Hook the event to customize the image.
document.SaveOptions.ImageNodeVisited += SaveImage;
using (FileStream outputStream = new FileStream("WordtoHTML.html", FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite))
{
//Save the HTML file.
document.Save(outputStream, FormatType.Html);
}
}
}
//Open an existing Word document.
using (WordDocument document = new WordDocument("Input.docx"))
{
//Hook the event to customize the image.
document.SaveOptions.ImageNodeVisited += SaveImage;
//Save a Word document as a HTML file.
document.Save("WordtoHTML.html", FormatType.Html);
}
'Open an existing Word document.
Using document As WordDocument = New WordDocument("Input.docx")
'Hook the event to customize the image.
AddHandler document.SaveOptions.ImageNodeVisited, AddressOf SaveImage
'Save a Word document as a HTML file.
document.Save("WordtoHTML.html", FormatType.Html)
End Using
The following code example illustrates the event handler for customizing the image path and saving the image in an external folder.
static void SaveImage(object sender, ImageNodeVisitedEventArgs args)
{
string imagepath = @"D:\Temp\Image.png";
//Save the image stream as a file.
using (FileStream fileStreamOutput = File.Create(imagepath))
args.ImageStream.CopyTo(fileStreamOutput);
//Set the URI to be used for the image in the output HTML.
args.Uri = imagepath;
}
static void SaveImage(object sender, ImageNodeVisitedEventArgs args)
{
string imagepath = @"D:\Temp\Image.png";
//Save the image stream as a file.
using (FileStream fileStreamOutput = File.Create(imagepath))
args.ImageStream.CopyTo(fileStreamOutput);
//Set the image URI to be used in the output HTML.
args.Uri = imagepath;
}
Private Shared Sub SaveImage(ByVal sender As Object, ByVal args As ImageNodeVisitedEventArgs)
Dim imagepath = "D:\Temp\Image.png"
'Save the image stream as a file.
Using fileStreamOutput = File.Create(imagepath)
args.ImageStream.CopyTo(fileStreamOutput)
End Using
'Set the URI to be used for the image in the output HTML.
args.Uri = imagepath
End Sub
TIPS
By utilizing the event handler mentioned above, you can also implement logic to store images in the Cloud or other online storage platforms.
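The SaveImage handler above writes every image to the same file, so a document with several images keeps only the last one. The following sketch generates a distinct path per call. ImageFileNamer is a hypothetical helper, not a DocIO API, and the ".png" default extension is an assumption carried over from the sample path; the actual image format is not known from the event arguments shown here.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper (not a DocIO API): hands out a new file path per image.
public sealed class ImageFileNamer
{
    private readonly string folder;
    private int counter;

    public ImageFileNamer(string folder)
    {
        this.folder = folder;
        // Make sure the target folder exists before any image is written.
        Directory.CreateDirectory(folder);
    }

    // Returns <folder>/image-1.png, <folder>/image-2.png, ... on successive calls.
    public string NextPath(string extension = ".png")
    {
        int n = Interlocked.Increment(ref counter);
        return Path.Combine(folder, "image-" + n + extension);
    }
}

// Assumed usage inside the SaveImage handler shown above, where namer is a
// caller-created ImageFileNamer instance:
//   string imagepath = namer.NextPath();
```

The same idea carries over to cloud storage: replace the folder with a container or bucket prefix and keep the counter (or a GUID) as the object name.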
You can download a complete working sample from GitHub.
Export HTML with body content alone
When saving a Word document as an HTML file using the .NET Word Library, you can export only the content within the <body> tags, excluding other elements, through the HtmlExportBodyContentAlone API.
The following code example illustrates how to export the HTML file with only the body content.
//Load an existing Word document.
using (FileStream fileStreamPath = new FileStream("Input.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (WordDocument document = new WordDocument(fileStreamPath, FormatType.Docx))
{
//Enable the flag, to save HTML with elements inside body tags alone.
document.SaveOptions.HtmlExportBodyContentAlone = true;
using (FileStream outputFileStream = new FileStream("WordToHTML.html", FileMode.Create, FileAccess.ReadWrite))
{
//Save Word document as HTML.
document.Save(outputFileStream, FormatType.Html);
}
}
}
//Load an existing Word document.
WordDocument document = new WordDocument("Input.docx", FormatType.Docx);
//Enable the flag, to save HTML with elements inside body tags alone.
document.SaveOptions.HtmlExportBodyContentAlone = true;
//Saves the document as html file
document.Save("WordtoHtml.html", FormatType.Html);
document.Close();
'Loads an existing document
Dim document As New WordDocument("Input.docx")
'Enable the flag, to save HTML with elements inside body tags alone.
document.SaveOptions.HtmlExportBodyContentAlone = True
'Saves the document as html file
document.Save("WordtoHtml.html", FormatType.Html)
document.Close()
You can download a complete working sample from GitHub.
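Body-only output is typically meant to be placed inside a page you control. The following sketch wraps such a fragment in a minimal host page. HtmlFragmentHost and Wrap are hypothetical names, not DocIO APIs, and the sketch assumes the exported fragment does not itself contain html, head, or body elements.

```csharp
using System;
using System.Net;

// Hypothetical helper (not a DocIO API): embeds an exported fragment in a host page.
public static class HtmlFragmentHost
{
    // Builds a complete page around the fragment. The title is HTML-encoded;
    // the fragment is inserted as-is because it is already markup.
    public static string Wrap(string title, string bodyFragment)
    {
        return "<!DOCTYPE html>\n" +
               "<html>\n<head>\n" +
               "<title>" + WebUtility.HtmlEncode(title) + "</title>\n" +
               "</head>\n<body>\n" +
               bodyFragment + "\n" +
               "</body>\n</html>\n";
    }
}
```

Because the fragment is inserted without encoding, only pass markup from a trusted source, such as a document you converted yourself.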
Supported and unsupported items
The following table lists the support status of document elements and attributes in Word to HTML and HTML to Word conversions in DocIO.
Document Element | Attribute | Support Status | Notes |
---|---|---|---|
Bookmark | Id | Yes | - |
Border | Color | Yes | - |
Border | Distance from text | Yes | - |
Border | Line style | Partial | Some line styles are rendered as solid. |
Border | Line width | Yes | - |
Document Properties | - | Yes | - |
Field | - | Yes | - |
Footnotes and Endnotes | - | Yes | - |
Form Field | Text input, Checkbox and combo box | Yes | - |
Header / Footer | Different per section | Partial | Only odd header of the first section is preserved in HTML export. |
Hyperlink | External URL | Yes | - |
Hyperlink | Local | Yes | - |
Image | Inline | Yes | - |
Image | Scale | Yes | - |
List | Custom bullets | Yes | - |
List | Multi-level | Yes | - |
List | Numbered | Yes | - |
List | Restart numbering | Yes | - |
List | Standard bullets | Yes | - |
Comment | - | No | - |
Symbols | - | Yes | - |
Paragraph | Alignment | Yes | - |
Paragraph | Borders | Yes | See Borders for more details. |
Paragraph | Keep lines and paragraphs together | Yes | - |
Paragraph | Paragraph Indents | Yes | - |
Paragraph | Line spacing | Yes | - |
Paragraph | Page break before | Yes | - |
Paragraph | Shading | Yes | See Shading for more details. |
Paragraph | Spacing before and after | Yes | - |
Shading | Background color | Partial | Solid background colors are supported. |
Shading | Foreground color | Partial | Solid foreground color is used when background color is auto. |
Styles | Paragraph styles | Yes | - |
Styles | Character styles | Yes | - |
Styles | List styles | Yes | - |
Table | Alignment | Yes | - |
Table | Cell margins | Yes | - |
Table | Column widths | Yes | - |
Table | Indent from left | Yes | - |
Table | Preferred width | Yes | - |
Table | Spacing between cells | Yes | - |
Table | Borders | Partial | See Borders for more details. |
Table | Shading | Partial | See Shading for more details. |
Nested Table | - | Yes | - |
Table Cell | Borders | Partial | See Borders for more details. |
Table Cell | Cell margins | Yes | - |
Table Cell | Horizontal merge | Yes | - |
Table Cell | Shading | Partial | See Shading for more details. |
Table Cell | Vertical alignment | Yes | - |
Table Cell | Vertical merge | Yes | - |
Table Row | Height | Yes | - |
Table Row | Padding | Yes | - |
Text | All caps | Yes | - |
Text | Bold | Yes | - |
Text | Character spacing | Yes | - |
Text | Color | Yes | - |
Text | Emboss | Partial | Rendered as bold. |
Text | Engrave | Partial | Rendered as bold. |
Text | Font | Yes | - |
Text | Hidden | Yes | - |
Text | Highlighting | Yes | - |
Text | Imprint | Partial | Rendered as bold. |
Text | Italic | Yes | - |
Text | Line breaks | Yes | - |
Text | Outline | Partial | Rendered as bold. |
Text | Page breaks | Yes | - |
Text | Shading | Partial | See Shading for more details. |
Text | Small caps | Yes | - |
Text | Special symbols | Yes | - |
Text | Strike out | Yes | - |
Text | Subscript / Superscript | Yes | - |
Text | Underline | Partial | Underline types and colors are ignored. |