Friday, September 24, 2010

Merging Word Documents using Office Open XML (OOXML)

A common scenario when working with OOXML is the need to merge or append several Word 2007 or 2010 documents. In this article I illustrate how this can be done with relative ease using C#.

 

Add a Reference

To start with, you will need to add a reference to WindowsBase.dll, which you may need to browse to at "C:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0\WindowsBase.dll".

Next, you will need to add a using directive to your class for the System.IO.Packaging namespace, as follows:

using System.IO.Packaging;

 

Document XML        

Now that we have a using directive for the Packaging namespace in our class, we can access the types within it directly. The Packaging namespace supports the Open Packaging Conventions, a ZIP-based container format, and since Word docx files are packages of this kind, we can use its types to look at the OOXML files within the document file. To start with, we use the Open method of the Package class to open the file, like so:

Package package = Package.Open(fileName, FileMode.Open, FileAccess.ReadWrite);

Since we are only interested in the actual body of the document, we now want to get access to the Document XML file within the package. If you were to open the Word document yourself with a file compression utility, you would see that the document is made up of several XML files organised into folders. The file we are interested in is called document.xml and is located within the word folder. To access a file within the package we use the Package method GetPart, which returns a PackagePart object, as follows:

PackagePart docPart = package.GetPart(new Uri("/word/document.xml", UriKind.Relative));
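The path /word/document.xml is the conventional location, but a package records where its main document lives through a relationship, so the part can also be looked up that way. The following is a minimal sketch of that idea; the class and method names are my own for illustration, and it falls back to the conventional path when no matching relationship is found:

```csharp
using System;
using System.IO.Packaging;

static class MainDocumentLocator
{
    // Relationship type that points from the package root to the main document part.
    const string OfficeDocumentRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Returns the main document part, or null if it cannot be found.
    public static PackagePart GetMainDocumentPart(Package package)
    {
        foreach (PackageRelationship rel in package.GetRelationshipsByType(OfficeDocumentRelType))
        {
            // Relationship targets are relative to the package root.
            Uri partUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), rel.TargetUri);
            if (package.PartExists(partUri))
            {
                return package.GetPart(partUri);
            }
        }

        // Fallback (assumption): the conventional location used in this article.
        Uri fallback = new Uri("/word/document.xml", UriKind.Relative);
        return package.PartExists(fallback) ? package.GetPart(fallback) : null;
    }
}
```

Checking PartExists first avoids the exception that GetPart throws when the requested part is missing.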

Now that we have access to the Document XML part, we want to load its XML into an XmlDocument (this requires a using directive for the System.Xml namespace). Here is how we do that:

XmlDocument document = new XmlDocument();

document.Load(docPart.GetStream());
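The stream returned by GetStream is a disposable resource, so one option is to wrap the load in a using block so that it is released as soon as the XML has been read. A small sketch, with a helper name of my own choosing:

```csharp
using System.IO;
using System.IO.Packaging;
using System.Xml;

static class DocumentXmlReader
{
    // Loads the XML content of a package part into a new XmlDocument.
    public static XmlDocument Load(PackagePart part)
    {
        XmlDocument document = new XmlDocument();

        // Open the part for reading only and dispose the stream when done.
        using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
        {
            document.Load(stream);
        }

        return document;
    }
}
```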

Because we have now loaded the Document XML file into an XmlDocument type object we can use all the methods available in the XmlDocument class to navigate through the XML. Because we are only interested in the actual body of the document, we want to select the body node of the document for later use, like so:

XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);

nsm.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

XmlNode body = document.SelectSingleNode("/w:document/w:body", nsm);
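The "w" prefix registered with the XmlNamespaceManager only has meaning inside the XPath expression; what is actually matched is the namespace URI. The sketch below (the helper name is hypothetical) shows the same body lookup wrapped in a method, which finds the body even when the document itself uses a different prefix:

```csharp
using System.Xml;

static class BodySelector
{
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body element of a WordprocessingML document, or null if there is none.
    public static XmlNode SelectBody(XmlDocument document)
    {
        XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
        // "w" is just the prefix used in the XPath below; the URI is what counts.
        nsm.AddNamespace("w", WordNs);
        return document.SelectSingleNode("/w:document/w:body", nsm);
    }
}
```

For example, a document whose root is written as x:document, with the x prefix bound to the same URI, is still matched by /w:document/w:body, whereas an XPath without prefixes such as /document/body would return null.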

 

Appending/Merging Other Documents

Because we now have a reference to the body node of the document, we can simply add more children to the end of that node to append or merge other documents into it. We repeat the process above for the second and any subsequent documents, and then loop through the child nodes of each of those documents' body nodes, appending them to the end of the first document's body node. Here is the snippet illustrating the looping through and appending of the child nodes, where lastPage is the XmlDocument loaded from the second document:

XmlNode last = body.LastChild;
foreach (XmlNode childNode in lastPage.SelectSingleNode("/w:document/w:body", nsm).ChildNodes)
{
    XmlNode docChildNode = document.ImportNode(childNode, true);
    body.InsertAfter(docChildNode, last);
    last = docChildNode;
}

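It is worth noting the use of the XmlDocument ImportNode method. Every XmlNode belongs to the XmlDocument that created it, so a node from one document cannot be inserted directly into another; attempting to do so throws an ArgumentException. ImportNode creates a copy of the node (and, because we pass true, of all its descendants) that is owned by the target document and can then be inserted.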
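To see why the import step matters, here is a small self-contained sketch using two throwaway in-memory documents (the class, method and element names are purely illustrative). It tries to append a node directly across documents, and then does the same via ImportNode:

```csharp
using System;
using System.Xml;

static class ImportNodeDemo
{
    // Returns true if appending a node owned by another XmlDocument is rejected.
    public static bool DirectAppendIsRejected()
    {
        XmlDocument target = new XmlDocument();
        target.LoadXml("<root/>");
        XmlDocument source = new XmlDocument();
        source.LoadXml("<root><child/></root>");

        try
        {
            // The child node is owned by 'source', not by 'target'.
            target.DocumentElement.AppendChild(source.DocumentElement.FirstChild);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    // Imports the node first and returns the target root's child count afterwards.
    public static int AppendViaImport()
    {
        XmlDocument target = new XmlDocument();
        target.LoadXml("<root/>");
        XmlDocument source = new XmlDocument();
        source.LoadXml("<root><child/></root>");

        // ImportNode makes a deep copy owned by 'target'; 'source' is left unchanged.
        XmlNode copy = target.ImportNode(source.DocumentElement.FirstChild, true);
        target.DocumentElement.AppendChild(copy);
        return target.DocumentElement.ChildNodes.Count;
    }
}
```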

 

Finally
The final thing we need to do is save the updated Document XML back into its PackagePart. We simply call the Save method of the XmlDocument, passing it a writable stream for the part, as follows:

document.Save(docPart.GetStream(FileMode.Create, FileAccess.Write));
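As with loading, the stream used for saving can be wrapped in a using block so that it is flushed and released before the package is closed. This is a sketch with a helper name of my own; it requests FileMode.Create because the intent is to replace the part's existing content rather than write over the start of it:

```csharp
using System.IO;
using System.IO.Packaging;
using System.Xml;

static class DocumentXmlWriter
{
    // Writes the XmlDocument back into the given package part.
    public static void Save(XmlDocument document, PackagePart part)
    {
        // The package must have been opened with write access for this to succeed.
        using (Stream stream = part.GetStream(FileMode.Create, FileAccess.Write))
        {
            document.Save(stream);
        }
    }
}
```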

 

The Complete Code

The following is a complete example of merging/appending two Word documents:

using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

class OpenOfficeXmlExample
{
    public static void MergeFiles()
    {
        const string DOC_URL = "/word/document.xml";
        const string OUTPUT_FILE = "C:\\TEMP\\MergedFile_{0:ddMMyy}-{0:HHmmss}.docx";
        const string FIRST_PAGE = "C:\\TEMP\\DocOne.docx";
        const string LAST_PAGE = "C:\\TEMP\\DocTwo.docx";

        // Work on a time-stamped copy so that DocOne itself is left untouched.
        string fileName = string.Format(OUTPUT_FILE, DateTime.Now);
        File.Copy(FIRST_PAGE, fileName);

        using (Package package = Package.Open(fileName, FileMode.Open, FileAccess.ReadWrite))
        {
            // Load the Document XML of the copy.
            PackagePart docPart = package.GetPart(new Uri(DOC_URL, UriKind.Relative));

            XmlDocument document = new XmlDocument();
            document.Load(docPart.GetStream());

            XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
            nsm.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            XmlNode body = document.SelectSingleNode("/w:document/w:body", nsm);

            // Open the second document read-only and load its Document XML.
            using (Package lastPackage = Package.Open(LAST_PAGE, FileMode.Open, FileAccess.Read))
            {
                XmlDocument lastPage = new XmlDocument();
                lastPage.Load(lastPackage.GetPart(new Uri(DOC_URL, UriKind.Relative)).GetStream());

                // Import each child of the second body and append it after the last node of the first.
                XmlNode last = body.LastChild;
                foreach (XmlNode childNode in lastPage.SelectSingleNode("/w:document/w:body", nsm).ChildNodes)
                {
                    XmlNode docChildNode = document.ImportNode(childNode, true);
                    body.InsertAfter(docChildNode, last);
                    last = docChildNode;
                }
            }

            // Write the merged XML back into the document part of the copy.
            document.Save(docPart.GetStream(FileMode.Create, FileAccess.Write));
        }
    }
}

In this example, we first copy the DocOne file to a new, time-stamped file called MergedFile. We then open the new file and append the contents of the body of DocTwo to the body of the MergedFile.
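One caveat with appending after the last child: in WordprocessingML the final child of the body is normally a w:sectPr element holding the section properties, and the schema expects it to stay last. A possible variation, sketched below with hypothetical class and method names, inserts the imported nodes before the target's w:sectPr and skips the w:sectPr of the source document. It works purely on the XML, so anything the second document references from other parts of its package (images, styles, numbering and so on) is not carried across.

```csharp
using System.Xml;

static class BodyMerger
{
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Copies the body content of 'source' to the end of the body of 'target',
    // keeping the target's section properties as the last child.
    public static void AppendBody(XmlDocument target, XmlDocument source)
    {
        XmlNamespaceManager targetNs = new XmlNamespaceManager(target.NameTable);
        targetNs.AddNamespace("w", WordNs);
        XmlNamespaceManager sourceNs = new XmlNamespaceManager(source.NameTable);
        sourceNs.AddNamespace("w", WordNs);

        XmlNode targetBody = target.SelectSingleNode("/w:document/w:body", targetNs);
        XmlNode sourceBody = source.SelectSingleNode("/w:document/w:body", sourceNs);

        // Body-level section properties of the target; null if it has none.
        XmlNode targetSectPr = targetBody.SelectSingleNode("w:sectPr", targetNs);

        foreach (XmlNode child in sourceBody.ChildNodes)
        {
            // Skip the source's own body-level section properties.
            if (child.LocalName == "sectPr" && child.NamespaceURI == WordNs)
            {
                continue;
            }

            XmlNode imported = target.ImportNode(child, true);
            // With a null reference node, InsertBefore appends at the end.
            targetBody.InsertBefore(imported, targetSectPr);
        }
    }
}
```

Because the method takes two XmlDocument objects, it can be called in a loop to append a third or fourth document to the same target before saving.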

Office Open XML (OOXML)

What is Office Open XML?
Office Open XML, or OOXML, is an Ecma standard (ECMA-376, later also published as ISO/IEC 29500) which can be used to represent word processing documents, spreadsheets and presentations. It was originally developed by Microsoft, and has been the default file format since Office 2007. A Microsoft Office Open XML document is signified by the x on the end of the extension, for example docx for a Word document (traditionally doc) or xlsx for an Excel spreadsheet (traditionally xls).

Each of these Microsoft Office documents is actually a compressed package containing all the OOXML files that represent the document. With your file compression utility you can open the package to view or extract the files it contains. Or, even more simply, you can rename the file to have a zip extension and then just double click it to view its contents.
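The same inspection can be done in code without renaming anything. The sketch below assumes .NET Framework 4.5 or later, where the System.IO.Compression types are available (with references to System.IO.Compression and System.IO.Compression.FileSystem); the class and method names are my own:

```csharp
using System.Collections.Generic;
using System.IO.Compression;

static class DocxZipInspector
{
    // Returns the full names of all entries in the document's ZIP container.
    public static List<string> ListEntries(string docxPath)
    {
        List<string> names = new List<string>();

        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }

        return names;
    }
}
```

For a Word document the list would typically include entries such as [Content_Types].xml, _rels/.rels and word/document.xml.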

What can I use Office Open XML for?
Actually, OOXML is very useful. It can be used to create new documents, edit existing ones, or extract data from them. You could set up a Word document template and use it programmatically in a mail merge type scenario, or you could extract data from an Excel spreadsheet to be used for other purposes.
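As a small illustration of the data extraction case, the sketch below pulls the plain text of each paragraph out of a Word Document XML that has already been loaded into an XmlDocument. The names are hypothetical, and it is deliberately simplified: it just concatenates every w:t text element found beneath each w:p paragraph, ignoring tabs, line breaks and other special content.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml;

static class ParagraphTextExtractor
{
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per paragraph in the body of the document.
    public static List<string> GetParagraphTexts(XmlDocument document)
    {
        XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
        nsm.AddNamespace("w", WordNs);

        List<string> paragraphs = new List<string>();
        foreach (XmlNode paragraph in document.SelectNodes("/w:document/w:body//w:p", nsm))
        {
            // A paragraph's text is split across runs, each holding w:t elements.
            StringBuilder text = new StringBuilder();
            foreach (XmlNode t in paragraph.SelectNodes(".//w:t", nsm))
            {
                text.Append(t.InnerText);
            }
            paragraphs.Add(text.ToString());
        }

        return paragraphs;
    }
}
```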

But can't I already do all that in other ways?
There are of course many ways to solve a problem, but Office Open XML provides a simple approach to achieving all of these aims through a single technology. Automation is one other way, but you can suffer DLL hell when different clients have different versions of Office, and Microsoft advises against using Office automation in server applications. You can also use Jet/ACE OLEDB to access Excel spreadsheets, but this doesn't work with other document types.

By using Office Open XML you have a standard, flexible and safe way in which to access and manipulate all aspects of Office documents. XML is widely supported by libraries and tools, so there are many ways in which you can work with it.

Just Another C# Blog - The Beginnings

With thousands of C# blogs already out there, I am not going to pretend that mine offers anything new or exciting that hasn't already been blogged on in the past. My aim instead is to write about the particular technologies I use and how I use them, and over time I hope to build up a comprehensive catalogue of articles relating to these technologies.