WordML to FlowDocument – how to convert docx files to WPF FlowDocument

April 22, 2008


[This blog was migrated. You will not be able to comment here.
The new URL of this post is http://khason.net/blog/wordml-to-flowdocument-%e2%80%93-how-to-convert-docx-files-to-wpf-flowdocument/]


Recently we spoke about converting XPS files and FixedDocuments to FlowDocuments. It works, but there are a lot of problems, because a FixedDocument (or an XPS/PDF document) carries less information than the original file. Those files are really just a printout of the original document. We also know how to use the Windows Vista Preview Handler to view original MS Office files inside a WPF application. So why not work with the originals? Why not convert a Microsoft Office document into a FlowDocument and then view it as XAML inside the native FlowDocumentReader? Can we do this? Sure we can. Let’s see how…


First of all, we should understand what a WordML (docx) document is and how the new Word format (docx) differs from the old one (doc).

WordML (ExcelML, etc.) is a new open format from Microsoft. It’s very similar to XPS: a package with a bunch of XML files inside it. We can work with those files directly from WPF code by using the System.IO.Packaging namespace, or we can download the technology preview of the Open XML Format SDK, which adds handy classes for reading and writing Open XML documents.
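
To get a feel for that package structure, here is a minimal sketch that lists the parts of a docx file with System.IO.Packaging only. The PackageInspector class and DescribeParts method are names I made up for illustration; they are not part of the sample code for this post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

// Hypothetical helper, for illustration only.
static class PackageInspector
{
    // Returns "uri (content type)" for every part stored in the package.
    public static List<string> DescribeParts(string path)
    {
        var result = new List<string>();
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                result.Add(part.Uri + " (" + part.ContentType + ")");
            }
        }
        return result;
    }
}
```

Printing the result for a typical docx should show the main part, /word/document.xml, usually next to a styles part and a few relationship parts.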

Let’s start coding. First of all, we should read the file. We can do it with either the Package class or the WordprocessingDocument class:

using (WordprocessingDocument wdoc = WordprocessingDocument.Open(path, false))
            {

Now, let’s read the main part (word/document.xml) and load it into an XDocument. Yes, we’ll use XLinq to do all the work. Why? ‘Cos it’s Passover now, I’m tired of matzo and want spaghetti :) It’s also because of Eric White, who looked for a new job inside Microsoft to run away from such code, but he’s the only man who really understands what’s happening inside those evil lines, so he stayed in his position.

using (StreamReader sr = new StreamReader(wdoc.MainDocumentPart.GetStream()))
                {
                    xdoc = XDocument.Load(sr);

The next step is to read all the paragraphs inside the main document. See? Paragraphs… We have Paragraphs in FlowDocument too. All we have to do is convert them:

var paragraphs = from par in xdoc
                                         .Root
                                         .Element(w_body)
                                         .Descendants(w_p)
                                     let par_style = par
                                         .Elements(w_pPr)
                                         .Elements(w_pStyle)
                                         .FirstOrDefault()
                                     let par_inline = par
                                         .Elements(w_pPr)
                                         .FirstOrDefault()
                                     let par_list = par
                                         .Elements(w_pPr)
                                         .Elements(w_numPr)
                                         .FirstOrDefault()                                    
                                     select new
                                     {
                                         pElement = par,
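                                         // Use the paragraph's own style if it has one; otherwise fall back to the default paragraph style.
                                         // Casting an XAttribute to string yields null when the attribute is missing, so styles without w:default do not throw.
                                         pStyle = par_style != null ? par_style.Attribute(w_val).Value : (from d_style in xstyle
                                                                                                               .Root
                                                                                                               .Elements(w_style)
                                                                                                          where
                                                                                                              (string)d_style.Attribute(w_type) == "paragraph" &&
                                                                                                              (string)d_style.Attribute(w_default) == "1"
                                                                                                          select d_style).First().Attribute(w_styleId).Value,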
                                         pAttrs = par_inline,
                                         pRuns = par.Elements().Where(e => e.Name == w_r || e.Name == w_ins || e.Name == w_link || e.Name == w_numId || e.Name == w_numPr || e.Name == w_ilvl),
                                         pList = par_list
                                     };
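
The query above relies on XName constants such as w_body and w_p that this excerpt never defines. A minimal sketch of how they could be declared follows. The two namespace URIs are the standard WordprocessingML and relationships namespaces; the WordNames class, the ParagraphTexts helper, and the guess that w_link stands for the w:hyperlink element are my assumptions, not something taken from the sample code.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical holder for the XName constants used by the queries in this post.
static class WordNames
{
    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    public static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Relationships namespace (the "r:" prefix), used for r:id on hyperlinks.
    public static readonly XNamespace r =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    public static readonly XName w_body = w + "body";
    public static readonly XName w_p = w + "p";
    public static readonly XName w_pPr = w + "pPr";
    public static readonly XName w_pStyle = w + "pStyle";
    public static readonly XName w_r = w + "r";
    public static readonly XName w_t = w + "t";
    public static readonly XName w_val = w + "val";
    public static readonly XName w_link = w + "hyperlink"; // assumed mapping
    public static readonly XName rels_id = r + "id";

    // Returns the plain text of every paragraph in the document body.
    public static List<string> ParagraphTexts(XDocument xdoc)
    {
        return xdoc.Root
            .Element(w_body)
            .Descendants(w_p)
            .Select(p => string.Join("",
                p.Descendants(w_t).Select(t => t.Value).ToArray()))
            .ToList();
    }
}
```

ParagraphTexts is only there to show the names in action: it walks the same body and paragraph elements as the big query, but keeps nothing except the text.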

Remember the spaghetti? Here is some macaroni XLinq code. Next, for each WordML paragraph we’ll create a FlowDocument Paragraph and read the WordML runs (runs? we have a Run class in FlowDocument too):

foreach (var par in paragraphs)
                    {
                        Paragraph p = new Paragraph();

var runs = from run in par.pRuns
                                   let run_style = run
                                       .Elements(w_rPr)
                                       .FirstOrDefault()
                                   let run_istyle = run
                                       .Elements(w_rPr)
                                       .Elements(w_rStyle)
                                       .FirstOrDefault()
                                   let run_graph = run
                                       .Elements(w_drawing)
                                   select new
                                   {
                                       pRun = run,
                                       pRunType = run.Name.LocalName,
                                       pStyle = run_istyle != null ? run_istyle.Attribute(w_val).Value : string.Empty,
                                       pAttrs = run_style,
                                       pText = run.Descendants(w_t),
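                                       // Elements() never returns null, so test for an actual w:br child with Any().
                                       pBB = run.Elements(w_br).Any(),
                                       // A hyperlink may carry no relationship id; the string cast returns null instead of throwing.
                                       pExRelID = run.Name == w_link ? ((string)run.Attribute(rels_id) ?? string.Empty) : string.Empty,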
                                       pGraphics = run_graph
                                   };

                        foreach (var run in runs)
                        {

But what to do with styles? Simple: let’s read them from the original document. To do that, we have to get the StyleDefinitionsPart of our document and convert the Open XML styles into WPF styles:

XDocument xstyle, xdoc;
                using (StreamReader sr = new StreamReader(wdoc.MainDocumentPart.StyleDefinitionsPart.GetStream()))
                {
                    xstyle = XDocument.Load(sr);
                    var styles = from style in xstyle
                                     .Root
                                     .Descendants(w_style)
                                 let pPr = style
                                     .Elements(w_pPr)
                                     .FirstOrDefault()
                                 let rPr = style
                                     .Elements(w_rPr)
                                     .FirstOrDefault()
                                 select new
                                 {
                                     pStyleName = style.Attribute(w_styleId).Value,
                                     pName = style.Element(w_name).Attribute(w_val).Value,
                                     pPStyle = pPr,
                                     pRStyle = rPr
                                 };

                    foreach (var style in styles)
                    {
                        Style pStyle = style.pPStyle.ToWPFStyle();
                        pStyle.BasedOn = style.pRStyle.ToWPFStyle();

                        doc.Resources.Add(style.pStyleName, pStyle);
                    }
                }

And what happens inside the ToWPFStyle extension method? It just enumerates the style’s child elements, picks out the well-known tags and creates an appropriate Setter for each of those properties:

internal static Style ToWPFStyle(this XElement elem)
        {
            Style style = new Style();
            if (elem != null)
            {
                var setters = elem.Descendants().Select(elm =>
                    {
                        Setter setter = null;
                        if (elm.Name == w + "left" || elm.Name == w + "right" || elm.Name == w + "top" || elm.Name == w + "bottom")
                        {
                            ThicknessConverter tk = new ThicknessConverter();
                            Thickness thickness = (Thickness)tk.ConvertFrom(elm.Attribute(w+"sz").Value);

                            BrushConverter bc = new BrushConverter();
                            Brush color = (Brush)bc.ConvertFrom(string.Format("#{0}",elm.Attribute(w+"color").Value));

                            setter = new Setter(Block.BorderThicknessProperty,thickness);
                            //style.Setters.Add(new Setter(Block.BorderBrushProperty,color));
                        }                       
                        else if (elm.Name == w + "rFonts")
                        {
                            FontFamilyConverter ffc = new FontFamilyConverter();
                            setter = new Setter(TextElement.FontFamilyProperty,ffc.ConvertFrom(elm.Attribute(w+"ascii").Value));
                        }
                        else if (elm.Name == w + "b")
                        {
                            setter = new Setter(TextElement.FontWeightProperty, FontWeights.Bold);
                        }
                        else if (elm.Name == w + "color")
                        {
                            BrushConverter bc = new BrushConverter();
                            setter = new Setter(TextElement.ForegroundProperty, bc.ConvertFrom(string.Format("#{0}",elm.Attribute(w_val).Value)));
                        }                       
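
One thing to keep in mind for the border branch above: WordprocessingML stores border widths (w:sz on a border element) in eighths of a point and font sizes (w:sz inside w:rPr) in half-points, while WPF measures in device-independent pixels, 96 to the inch. The snippet above passes the raw value straight to the converter, so a unit conversion is probably wanted in real code. A minimal sketch, with a hypothetical WordUnits helper that is not part of the sample code:

```csharp
using System.Globalization;

// Hypothetical unit helper, for illustration only.
static class WordUnits
{
    // WPF lengths are device-independent pixels (96 per inch); a point is 1/72 inch.
    const double DipsPerPoint = 96.0 / 72.0;

    // Border w:sz is expressed in eighths of a point.
    public static double BorderSizeToDips(string sz)
    {
        return double.Parse(sz, CultureInfo.InvariantCulture) / 8.0 * DipsPerPoint;
    }

    // Run w:sz (font size) is expressed in half-points.
    public static double FontSizeToDips(string sz)
    {
        return double.Parse(sz, CultureInfo.InvariantCulture) / 2.0 * DipsPerPoint;
    }
}
```

For example, a font size of w:sz="24" is 12 points, which comes out as 16 WPF units.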

Now that we have Paragraphs, Runs and Styles, we can do similar transformations for images, hyperlinks, graphics, tables and lists. Almost every element used in WordML has a sibling WPF class. Let’s create an extension method for FlowDocument and we’re done:

FlowDocument fd = new FlowDocument();
            fd.LoadFromWordML("../../testdoc.docx");
            reader.Document = fd;
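
The body of LoadFromWordML is not shown in this post, so here is a stripped-down sketch of what such an extension method could look like. It is not the implementation from the sample code: it uses System.IO.Packaging instead of the SDK’s WordprocessingDocument, copies only plain text into Paragraphs and Runs, and ignores styles, lists, images and tables. The FlowDocumentExtensions class name is an assumption.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Windows.Documents;
using System.Xml;
using System.Xml.Linq;

// Hypothetical, text-only version of the extension method used above.
public static class FlowDocumentExtensions
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Relationship type that points from the package root to the main document part.
    const string OfficeDocumentRel =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    public static void LoadFromWordML(this FlowDocument doc, string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            // Locate the main part (normally /word/document.xml) through its relationship.
            PackageRelationship rel = package.GetRelationshipsByType(OfficeDocumentRel).First();
            Uri partUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), rel.TargetUri);
            PackagePart mainPart = package.GetPart(partUri);

            XDocument xdoc;
            using (Stream stream = mainPart.GetStream(FileMode.Open, FileAccess.Read))
            using (XmlReader reader = XmlReader.Create(stream))
            {
                xdoc = XDocument.Load(reader);
            }

            // One WPF Paragraph per w:p, one WPF Run per w:r.
            foreach (XElement par in xdoc.Root.Element(w + "body").Elements(w + "p"))
            {
                Paragraph p = new Paragraph();
                foreach (XElement run in par.Elements(w + "r"))
                {
                    string text = string.Join("",
                        run.Elements(w + "t").Select(t => t.Value).ToArray());
                    p.Inlines.Add(new Run(text));
                }
                doc.Blocks.Add(p);
            }
        }
    }
}
```

Because it only looks at top-level paragraphs, text inside tables is skipped entirely; that is one of the places where the real work starts.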

Pretty easy, isn’t it? So what are you waiting for? Convert the rest in order to be able to display Word (and other Office 2007 and later) documents inside FlowDocumentReader or any other FlowDocument viewer in your WPF application. It’s also very easy to build an Office add-in that lets you save documents as XAML FlowDocuments and read them inside a WPF application later.

This is the best way to use Microsoft Office documents in .NET Framework 3.0 and 3.5 applications. Download the sample source code for this article to see the whole class (this is not a final product; you have a lot of work to do to make it 100% compliant with the WordML specification).

Have a nice day and be good people.


2 comments

  1. Jose Walker, September 21, 2008 at 17:56

    Hi Tamir,

    Great job! very useful for what I'm working at the moment…

    I would like to ask you whether you noticed that this piece of code doesn't work very well with tables? also, it puts all the text in a "newspaper-like" manner, which is not accurate of how the document was created.

    Would you give me some hints of how to treat Tables and correct formatting of the document, hopefully a mirror of the word document….?? appreciated!! :-)

    Thanks,

    Jose Walker.

  2. Roo, November 4, 2008 at 5:45

    Any news on dealing with tables?
