Mike Robbins

Sitecore Developer Blog


Sitecore Data Importer / Word docx Importer

I recently had a requirement from a client to allow them to import content into Sitecore items from Word docx files. After looking on the Sitecore Marketplace and GitHub I realised nothing existed, so I decided to write a module. I quickly realised why nobody had written one before: docx files are basically zip files containing a number of files, with the main document content written in XML.
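To make the "zip file full of XML" point concrete, here is a minimal sketch that inspects a docx using only the .NET zip classes. The class and method names (DocxInspector, ListParts, ReadMainDocumentXml) are made up for illustration and are not part of the module. It also assumes the main document part lives at the conventional word/document.xml location, which is where Word puts it but which the format does not strictly guarantee.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper for illustration only; not part of the importer module.
public static class DocxInspector
{
    // Returns the names of all parts stored inside the docx package.
    public static List<string> ListParts(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries.Select(e => e.FullName).ToList();
        }
    }

    // Reads the raw XML of the main document part.
    // Assumption: the part sits at the conventional path word/document.xml.
    public static string ReadMainDocumentXml(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            var entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                return null; // not a Word document, or an unusual package layout
            }

            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Calling ListParts on a typical document returns entries such as the content types file, the relationship files and word/document.xml. Reading the XML by hand like this quickly gets tedious, which is why a library that understands the format is the more practical route.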

The document

To be useful, I wanted the importer to split the docx file into fields within a Sitecore item rather than drop the entire contents into one rich text editor field. This raised an issue: how could I tell which section of the document related to which field?

The solution I came up with was to use the “Title” style button in Microsoft Word to mark each section of the document. The title matches up exactly with the field name within Sitecore. Using this structure means that I can programmatically read all content between titles and know which field in Sitecore that content relates to.
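As a rough illustration of what a title looks like inside the document XML, the sketch below checks a single paragraph element for the “Title” style using LINQ to XML. The TitleDetector class is hypothetical and only for illustration. It assumes the paragraph style ID is literally Title, which is what an English copy of Word writes; other language versions or custom templates may use a different ID.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of the importer module.
public static class TitleDetector
{
    // The WordprocessingML namespace used by the w: prefix in document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // True when the paragraph (a w:p element) has <w:pStyle w:val="Title"/>
    // in its paragraph properties. Assumption: the style ID is "Title".
    public static bool IsTitle(XElement paragraph)
    {
        var style = paragraph.Element(W + "pPr")
                             ?.Element(W + "pStyle")
                             ?.Attribute(W + "val")
                             ?.Value;
        return style == "Title";
    }

    // Concatenates the text of every w:t run in the paragraph.
    public static string GetText(XElement paragraph)
    {
        return string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
    }
}
```

With this structure, a document marked up with the titles "Page Title" and "Body" would map any paragraphs following each title to the Sitecore fields of the same name.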

Sample Document


Import Module

The Sitecore Data Importer module has a simple interface for uploading a file to import (it supports docx files, CSV files and zip files containing docx files). The module also takes the path where you want the items created and the template of the items you want to create.

The main work of the importer, parsing the docx file into fields, is handled by the code below. I'm using the Open XML SDK to read the docx format, then iterating through each paragraph: if it is a title, I store the title in a variable; otherwise I grab the paragraph text and add it to the current field. Each field name / value pair is stored in a dictionary, which gets written out through the Sitecore API later on. You can also upload a zip file containing multiple docx files, which is uncompressed before each document is passed to this function.

Sitecore Data Importer

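using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

public Dictionary<string, string> ExtractFields(string path)
{
  var fields = new Dictionary<string, string>();

  // Open read-only (isEditable = false): the importer never changes the document.
  using (var myDocument = WordprocessingDocument.Open(path, false))
  {
    var body = myDocument.MainDocumentPart.Document.Body;

    // Name of the field currently being filled. Content that appears before
    // the first title ends up under the empty key.
    var dictionaryKey = "";

    // Walk the top-level elements of the body, skipping empty ones.
    foreach (var element in body.Where(e => !string.IsNullOrEmpty(e.InnerText)))
    {
      // Paragraphs using Word's "Title" style carry w:val="Title" in their XML.
      if (element.InnerXml.Contains("w:val=\"Title\""))
      {
        dictionaryKey = element.InnerText;
      }
      else if (fields.ContainsKey(dictionaryKey))
      {
        // Second or later paragraph for this field: make sure the existing
        // value is wrapped in <p> tags, then append the new paragraph.
        var existing = fields[dictionaryKey];
        if (!existing.EndsWith("</p>"))
        {
          existing = "<p>" + existing + "</p>";
        }
        fields[dictionaryKey] = existing + "<p>" + element.InnerText + "</p>";
      }
      else
      {
        // First paragraph for this field is stored as plain text.
        fields.Add(dictionaryKey, element.InnerText);
      }
    }
  }
  return fields;
}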
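
The zip upload mentioned earlier could be handled along the following lines. This is only a sketch of the idea, not the module's actual code: ZipImportSketch and ExtractAll are hypothetical names, and the extractFields parameter stands in for a function like ExtractFields above. Each docx entry is extracted to a temporary file, handed to the extractor, and then deleted.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration only; not the module's implementation.
public static class ZipImportSketch
{
    // Returns the extracted fields for each docx entry, keyed by entry name.
    public static Dictionary<string, Dictionary<string, string>> ExtractAll(
        string zipPath,
        Func<string, Dictionary<string, string>> extractFields)
    {
        var results = new Dictionary<string, Dictionary<string, string>>();

        using (var archive = ZipFile.OpenRead(zipPath))
        {
            foreach (var entry in archive.Entries)
            {
                // Ignore folders and anything that is not a docx file.
                if (!entry.FullName.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))
                {
                    continue;
                }

                // Extract to a generated temp file name rather than trusting
                // the path stored in the archive.
                var tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".docx");
                try
                {
                    entry.ExtractToFile(tempPath, true);
                    results[entry.FullName] = extractFields(tempPath);
                }
                finally
                {
                    File.Delete(tempPath);
                }
            }
        }

        return results;
    }
}
```

A call such as ZipImportSketch.ExtractAll(uploadedZipPath, ExtractFields) would then give one dictionary of field values per document, ready to be turned into one Sitecore item each.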

Source Code and Sitecore Marketplace

If you find any issues or bugs with the module, please let me know. This is the initial release, and the module still needs testing with more docx structures.

https://github.com/komainu85/SitecoreDataImporter

http://marketplace.sitecore.net/en/Modules/Sitecore_Data_Importer.aspx