Friday, September 24, 2010

Merging Word Documents using Office Open XML (OOXML)

A common scenario when working with OOXML is the need to merge or append several Word 2007 or 2010 documents. In this article I show how this can be done in C# with relative ease.

 

Add a Reference

To start with, you will need to add a reference to WindowsBase.dll, which you may need to browse to at "C:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0\WindowsBase.dll".

Next, you will need to add a using directive to your class for the System.IO.Packaging namespace, as follows:

using System.IO.Packaging;

 

Document XML        

Now that we have a using directive for the Packaging namespace in our class, we can directly access the types within it. The Packaging namespace provides access to packages that follow the Open Packaging Conventions, which are stored as ZIP archives. Since a Word docx file is such a package, we can use these types to look at the OOXML files within the document file. To start with, we use the Open method of the Package class to open the file, like so:

Package package = Package.Open(fileName, FileMode.Open, FileAccess.ReadWrite);

Since we are only interested in the actual body of the document, we now want to get access to the Document XML file within the package. If you were to open the Word document yourself with a file compression utility, you would see that the XML that makes up the document is organised into several files. The Document XML file we are interested in is called document.xml and is located within the word folder. To access a file within the package we use the GetPart method of the Package class, which returns a PackagePart object, as follows:

PackagePart docPart = package.GetPart(new Uri("/word/document.xml", UriKind.Relative));
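To see which parts a package actually contains without reaching for a compression utility, the parts can also be listed in code. The following is a minimal sketch rather than part of the merge itself: it builds a tiny in-memory package so that it runs without a real docx file, and the class and method names (PackagePartsSketch, CreateSamplePackage, ListParts) are made up for illustration. On newer .NET versions the same types are available from the System.IO.Packaging NuGet package instead of WindowsBase.dll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

static class PackagePartsSketch
{
    static readonly Uri DocUri = new Uri("/word/document.xml", UriKind.Relative);

    // Hypothetical helper: builds a small package in memory that has a part
    // at the same location Word uses for the document body.
    public static byte[] CreateSamplePackage()
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
            {
                PackagePart part = package.CreatePart(DocUri, "application/xml");
                using (StreamWriter writer = new StreamWriter(part.GetStream(FileMode.Create, FileAccess.Write)))
                {
                    writer.Write("<root/>");
                }
            }
            // The package is flushed when it is disposed, so read the bytes afterwards.
            return buffer.ToArray();
        }
    }

    // Returns the URI of every part in the package.
    public static List<string> ListParts(byte[] packageBytes)
    {
        List<string> uris = new List<string>();
        using (MemoryStream stream = new MemoryStream(packageBytes))
        using (Package package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                uris.Add(part.Uri.ToString());
            }
        }
        return uris;
    }

    public static void Main()
    {
        foreach (string uri in ListParts(CreateSamplePackage()))
        {
            Console.WriteLine(uri);
        }
    }
}
```

Running the same loop against a real docx file opened with Package.Open(fileName, FileMode.Open, FileAccess.Read) should list document.xml alongside the other parts of the document.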

Now that we have access to the Document XML part, we load its XML into an XmlDocument (this needs a using directive for the System.Xml namespace):

XmlDocument document = new XmlDocument();
document.Load(docPart.GetStream());

With the Document XML file loaded into an XmlDocument object, we can use all the methods available in the XmlDocument class to navigate through the XML. Since we are only interested in the actual body of the document, we select the body node for later use. The elements live in the WordprocessingML namespace, so the XPath query needs an XmlNamespaceManager that maps a prefix to it:

XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
nsm.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");
XmlNode body = document.SelectSingleNode("/w:document/w:body", nsm);
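The role of the namespace manager is easier to see on a small example. The sketch below runs against a hand-written XML string instead of a real document; the class name BodyLookupSketch and the helper FindBody are made up for illustration. The sample XML deliberately uses the prefix x where the query uses w, to show that only the namespace URI has to match.

```csharp
using System;
using System.Xml;

static class BodyLookupSketch
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Selects the body element using a prefix registered for the query only.
    public static XmlNode FindBody(XmlDocument document)
    {
        XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
        nsm.AddNamespace("w", W);
        return document.SelectSingleNode("/w:document/w:body", nsm);
    }

    public static XmlDocument LoadSample()
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(
            "<x:document xmlns:x=\"" + W + "\">" +
            "<x:body><x:p/><x:p/></x:body>" +
            "</x:document>");
        return document;
    }

    public static void Main()
    {
        XmlDocument document = LoadSample();

        // An unprefixed query looks for elements in no namespace and finds nothing.
        Console.WriteLine(document.SelectSingleNode("/document/body") == null);

        // With the namespace manager the body and its two paragraphs are found.
        XmlNode body = FindBody(document);
        Console.WriteLine(body.ChildNodes.Count);
    }
}
```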

 

Appending/Merging Other Documents

Now that we have a reference to the body node of the document, we can simply add more children to the end of that node to append or merge other documents into it. We repeat the process above for the second and any subsequent documents (in the snippet below, lastPage is the XmlDocument loaded from the second document), then loop through the child nodes of each of those documents' body nodes, appending them to the end of the first document's body node. Here is the snippet illustrating the looping and appending of the child nodes:

XmlNode last = body.LastChild;
foreach (XmlNode childNode in lastPage.SelectSingleNode("/w:document/w:body", nsm).ChildNodes)
{
    XmlNode docChildNode = document.ImportNode(childNode, true);
    body.InsertAfter(docChildNode, last);
    last = docChildNode;
}

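It is worth noting the use of the XmlDocument ImportNode method. We need it because every XmlNode belongs to the XmlDocument that created it, and the nodes we are appending come from a different document. Inserting such a node directly throws an ArgumentException, so ImportNode is used to create a copy that belongs to the target document (passing true makes it a deep copy that includes all descendants).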
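The behaviour of ImportNode can be checked in isolation. This is a small sketch on two throwaway XML strings, with made-up names (ImportNodeSketch, AppendWithoutImportFails); it shows that appending a node from another document fails, and that importing copies the node and leaves the source document unchanged.

```csharp
using System;
using System.Xml;

static class ImportNodeSketch
{
    // Tries to append a node that belongs to another document.
    // Returns true if the attempt was rejected with an ArgumentException.
    public static bool AppendWithoutImportFails(XmlDocument target, XmlNode foreignNode)
    {
        try
        {
            target.DocumentElement.AppendChild(foreignNode);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    public static XmlDocument Load(string xml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xml);
        return document;
    }

    public static void Main()
    {
        XmlDocument target = Load("<body><p>one</p></body>");
        XmlDocument source = Load("<body><p>two</p></body>");
        XmlNode foreign = source.DocumentElement.FirstChild;

        Console.WriteLine(AppendWithoutImportFails(target, foreign));

        // ImportNode creates a deep copy owned by the target document.
        XmlNode copy = target.ImportNode(foreign, true);
        target.DocumentElement.AppendChild(copy);

        Console.WriteLine(target.DocumentElement.ChildNodes.Count);
        Console.WriteLine(source.DocumentElement.ChildNodes.Count);
    }
}
```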

 

Finally

The final thing we need to do is save the updated Document XML back into its PackagePart. We simply call the Save method of the XmlDocument, passing it a stream that replaces the existing content of the part:

document.Save(docPart.GetStream(FileMode.Create, FileAccess.Write));
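As a variation on the one-liner above, the load and save steps can be wrapped in helpers that dispose each stream as soon as it has been used, which is the safer habit when the same part is read and then rewritten. This is a sketch under my own assumptions: the names SavePartSketch, LoadFromPart, SaveToPart and RoundTripLeavesEmptyBody are hypothetical, and an in-memory package stands in for a real docx file.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

static class SavePartSketch
{
    static readonly Uri DocUri = new Uri("/word/document.xml", UriKind.Relative);

    // Replaces the content of the part with the serialized XmlDocument.
    public static void SaveToPart(XmlDocument document, PackagePart part)
    {
        using (Stream stream = part.GetStream(FileMode.Create, FileAccess.Write))
        {
            document.Save(stream);
        }
    }

    // Loads the part into a new XmlDocument and closes the stream again.
    public static XmlDocument LoadFromPart(PackagePart part)
    {
        XmlDocument document = new XmlDocument();
        using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
        {
            document.Load(stream);
        }
        return document;
    }

    // Saves a long version, then a short one, and reads the result back.
    public static bool RoundTripLeavesEmptyBody()
    {
        using (MemoryStream buffer = new MemoryStream())
        using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart part = package.CreatePart(DocUri, "application/xml");

            XmlDocument document = new XmlDocument();
            document.LoadXml("<body><p>a long first version of the content</p></body>");
            SaveToPart(document, part);

            // FileMode.Create discards the old, longer content before writing.
            document.LoadXml("<body/>");
            SaveToPart(document, part);

            return !LoadFromPart(part).DocumentElement.HasChildNodes;
        }
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripLeavesEmptyBody());
    }
}
```

The second save writes less data than the first, so the round trip only parses cleanly if the old content was really replaced and not just partly overwritten.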

 

The Complete Code

The following is a complete example of merging/appending two Word documents:

using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

class OpenOfficeXmlExample
{
    public static void MergeFiles()
    {
        const string DOC_URL = "/word/document.xml";
        const string OUTPUT_FILE = "C:\\TEMP\\MergedFile_{0:ddMMyy}-{0:HHmmss}.docx";
        const string FIRST_PAGE = "C:\\TEMP\\DocOne.docx";
        const string LAST_PAGE = "C:\\TEMP\\DocTwo.docx";

        // Work on a time-stamped copy so that DocOne itself is left untouched.
        string fileName = string.Format(OUTPUT_FILE, DateTime.Now);
        File.Copy(FIRST_PAGE, fileName);

        using (Package package = Package.Open(fileName, FileMode.Open, FileAccess.ReadWrite))
        {
            // Load the Document XML part of the copy.
            PackagePart docPart = package.GetPart(new Uri(DOC_URL, UriKind.Relative));

            XmlDocument document = new XmlDocument();
            document.Load(docPart.GetStream());

            XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
            nsm.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            XmlNode body = document.SelectSingleNode("/w:document/w:body", nsm);

            // Open the second document read-only and load its Document XML part.
            using (Package lastPackage = Package.Open(LAST_PAGE, FileMode.Open, FileAccess.Read))
            {
                XmlDocument lastPage = new XmlDocument();
                lastPage.Load(lastPackage.GetPart(new Uri(DOC_URL, UriKind.Relative)).GetStream());

                // Import each child of the second body and append it after the last node added.
                XmlNode last = body.LastChild;
                foreach (XmlNode childNode in lastPage.SelectSingleNode("/w:document/w:body", nsm).ChildNodes)
                {
                    XmlNode docChildNode = document.ImportNode(childNode, true);
                    body.InsertAfter(docChildNode, last);
                    last = docChildNode;
                }
            }

            // Write the merged XML back into the Document XML part of the copy.
            document.Save(docPart.GetStream(FileMode.Create, FileAccess.Write));
        }
    }
}

In this example, we first copy the DocOne file to a new, time-stamped file called MergedFile. We then open the new file and append the contents of the body of DocTwo to the body of MergedFile.
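One thing the example does not deal with is the w:sectPr (section properties) element, which in WordprocessingML is expected to be the last child of w:body. Appending after body.LastChild places the new content behind the first document's w:sectPr and also copies the second document's w:sectPr. The following is a possible variation and not the method used above: it inserts the imported nodes before the target's trailing w:sectPr, skips the source's body-level w:sectPr, and can optionally put a page-break paragraph between the two documents. The names SectPrAwareMergeSketch, AppendBody and Load are made up for illustration, and the sketch works on XmlDocument objects only. Like the example above, it does not copy styles, images or other parts that the second document refers to.

```csharp
using System;
using System.Xml;

static class SectPrAwareMergeSketch
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static XmlNamespaceManager Namespaces(XmlDocument document)
    {
        XmlNamespaceManager nsm = new XmlNamespaceManager(document.NameTable);
        nsm.AddNamespace("w", W);
        return nsm;
    }

    // Appends the body content of source to target, keeping the target's
    // body-level w:sectPr (if any) as the last child of the body.
    public static void AppendBody(XmlDocument target, XmlDocument source, bool pageBreak)
    {
        XmlNode targetBody = target.SelectSingleNode("/w:document/w:body", Namespaces(target));
        XmlNode sourceBody = source.SelectSingleNode("/w:document/w:body", Namespaces(source));
        XmlNode sectPr = targetBody.SelectSingleNode("w:sectPr", Namespaces(target));

        if (pageBreak)
        {
            // Builds <w:p><w:r><w:br w:type="page"/></w:r></w:p>.
            XmlElement p = target.CreateElement("w", "p", W);
            XmlElement r = target.CreateElement("w", "r", W);
            XmlElement br = target.CreateElement("w", "br", W);
            XmlAttribute type = target.CreateAttribute("w", "type", W);
            type.Value = "page";
            br.Attributes.Append(type);
            r.AppendChild(br);
            p.AppendChild(r);
            // InsertBefore appends at the end when the reference node is null.
            targetBody.InsertBefore(p, sectPr);
        }

        foreach (XmlNode child in sourceBody.ChildNodes)
        {
            // Skip the source's own body-level section properties.
            if (child.LocalName == "sectPr" && child.NamespaceURI == W)
            {
                continue;
            }
            targetBody.InsertBefore(target.ImportNode(child, true), sectPr);
        }
    }

    // Hypothetical helper that builds a one-paragraph document for the demo.
    public static XmlDocument Load(string text)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(
            "<w:document xmlns:w=\"" + W + "\"><w:body>" +
            "<w:p><w:r><w:t>" + text + "</w:t></w:r></w:p><w:sectPr/>" +
            "</w:body></w:document>");
        return document;
    }

    public static void Main()
    {
        XmlDocument first = Load("One");
        XmlDocument second = Load("Two");
        AppendBody(first, second, true);
        Console.WriteLine(first.OuterXml);
    }
}
```

In the complete example, a call such as AppendBody(document, lastPage, true) would take the place of the foreach loop.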

2 comments:

  1. Cool, it's working,
    but I have a problem: when I added two docx files, the page break was not affected.
    Can you please tell me how I would do this?
