A Casual Overview of DOCX

Around a billion individuals and workplaces rely on Microsoft Office, which makes its DOCX format the most widely used standard for sharing documents. Its main competitor, the ODT format, lacks this reach, supported only by specific software like Open/LibreOffice and some open-source options. Although commonly used, PDFs are not a direct competitor. Their lack of editing capabilities and comprehensive document structure limits them to minor additions like watermarks or signatures. This dominance of the DOCX format stems from the absence of a viable alternative.

Despite its intricate structure, manual parsing of the DOCX format is possible for tasks like indexing, converting to TXT, or implementing small modifications. This article aims to provide a developer-friendly explanation of DOCX internals, simplifying the extensive 5,000-page ECMA specifications.

Creating a basic one-word document in MSWord and observing the underlying XML changes upon editing is key to understanding the format. This approach helps decipher formatting issues and understand the XML structure’s impact on the document’s appearance.

Drawing from my year-long experience developing a collaborative DOCX editor, CollabOffice, I aim to share insights with the developer community. This article bridges the gap between the complex ECMA specification and oversimplified online tutorials, providing a comprehensive understanding of the DOCX file structure. Accompanying files are available in the “toptal-docx” project on my GitHub account.

Inside a Simple DOCX

A DOCX file is essentially a ZIP archive containing XML files. Creating a simple document with the word “Test” in MSWord and unzipping it reveals this structure:

Despite the document’s simplicity, MSWord generates default themes, properties, font tables, and more, all in XML format.

All the files inside a DOCX are XML files, even those with the ".rels" extension.
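
Since a DOCX is just a ZIP archive, listing its parts takes only a few lines. The following is a minimal sketch in Python using only the standard library. The helper name list_parts is hypothetical, and the package built in memory is a stand-in with placeholder contents, not a file MSWord would open:

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the sorted names of all parts (files) stored in a DOCX package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


# Build a tiny stand-in package in memory so the example runs without a real file.
# The part contents are placeholders (assumption), not valid DOCX parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("_rels/.rels", "<Relationships/>")
    package.writestr("word/document.xml", "<document/>")

parts = list_parts(buffer.getvalue())
print(parts)
```

To inspect a real document, read the file’s bytes from disk and pass them to the same function.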

For clarity, let’s focus on document.xml, containing the main text elements. Ensure that when deleting any files, you remove all references to them from other XML files. Here is a code-diff example on how dependencies for app.xml and core.xml were removed. Any unresolved or missing references will render the file corrupt in MSWord.

Here’s the streamlined structure of our minimal DOCX document (and here’s the project on GitHub):

Let’s break down each file:

_rels/.rels

This file directs MS Word to the document’s content, in this instance, word/document.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
                 Target="word/document.xml"/>
</Relationships>
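
A parser can follow the same pointer that MS Word does. The sketch below reads the relationships XML shown above and returns the target of the officeDocument relationship. The function name main_document_target is an assumption made for illustration:

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)

# The same content as _rels/.rels above, inlined so the example is self-contained.
RELS_XML = f"""<Relationships xmlns="{REL_NS}">
   <Relationship Id="rId1" Type="{OFFICE_DOCUMENT}" Target="word/document.xml"/>
</Relationships>"""


def main_document_target(rels_xml):
    """Return the Target of the officeDocument relationship, or None if absent."""
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOCUMENT:
            return rel.get("Target")
    return None


print(main_document_target(RELS_XML))
```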

_rels/document.xml.rels

This file defines references to resources such as embedded images. It is empty in our example because the document contains none:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
</Relationships>

[Content_Types].xml

[Content_Types].xml provides information about the document’s media types, which, in this case, is solely text:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
   <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
   <Default Extension="xml" ContentType="application/xml"/>
   <Override PartName="/word/document.xml"
             ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>

document.xml

This XML file holds the document’s core text content. While some namespace declarations are omitted here for brevity (refer to the GitHub project for the complete version), note that removing seemingly unused namespace references from the full file is not recommended, as MSWord relies on them.

Simplified example:

<w:document>
   <w:body>
       <w:p w:rsidR="005F670F" w:rsidRDefault="005F79F5">
           <w:r><w:t>Test</w:t></w:r>
       </w:p>
       <w:sectPr w:rsidR="005F670F">
           <w:pgSz w:w="12240" w:h="15840"/>
           <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720"
                    w:gutter="0"/>
           <w:cols w:space="720"/>
           <w:docGrid w:linePitch="360"/>
       </w:sectPr>
   </w:body>
</w:document>

The primary node, <w:document>, represents the entire document. <w:body> houses paragraphs, with page dimensions defined by nested <w:sectPr> tags.

You can disregard the w:rsidR and w:rsidRDefault attributes; they pertain to MS Word’s internal workings.

Expanding on this, let’s examine a document with three paragraphs. The XML highlighting corresponds to the colors in the Microsoft Word screenshot, illustrating the correlation:

<w:p w:rsidR="0081206C" w:rsidRDefault="00E10CAE">
   <w:r>
       <w:t xml:space="preserve">This is our example first paragraph. It’s default is left aligned, and now I’d like to introduce</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
           <w:color w:val="000000"/>
       </w:rPr>
       <w:t>some bold</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
           <w:b/>
           <w:color w:val="000000"/>
       </w:rPr>
       <w:t xml:space="preserve"> text</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
           <w:color w:val="000000"/>
       </w:rPr>
       <w:t xml:space="preserve">, </w:t>
   </w:r>
   <w:proofErr w:type="gramStart"/>
   <w:r>
       <w:t xml:space="preserve">and also change the</w:t>
   </w:r>
   <w:r w:rsidRPr="00E10CAE">
       <w:rPr>
           <w:rFonts w:ascii="Impact" w:hAnsi="Impact"/>
       </w:rPr>
       <w:t>font style</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Impact" w:hAnsi="Impact"/>
       </w:rPr>
       <w:t xml:space="preserve"> </w:t>
   </w:r>
   <w:r>
       <w:t>to ‘Impact’.</w:t>
   </w:r>
</w:p>
<w:p w:rsidR="00E10CAE" w:rsidRDefault="00E10CAE">
   <w:r>
       <w:t>This is new paragraph.</w:t>
   </w:r>
</w:p>
<w:p w:rsidR="00E10CAE" w:rsidRPr="00E10CAE" w:rsidRDefault="00E10CAE">
   <w:r>
       <w:t>This is one more paragraph, a bit longer.</w:t>
   </w:r>
</w:p>

Dissecting Paragraph Structure

A basic document comprises paragraphs, each consisting of runs (text sharing the same font, color, etc.), which are further broken down into individual characters within <w:t> tags. Multiple characters can reside within a single <w:t> tag, and a single run can contain multiple <w:t> tags.

Again, disregard the w:rsidR attributes.
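
The paragraph, run, and text nesting maps directly onto a plain-text converter. Here is a rough sketch, assuming a trimmed-down document.xml with only the w: namespace declared. The function name paragraphs_as_text is hypothetical:

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# A trimmed sample (assumption): a real document.xml declares many more namespaces.
DOCUMENT_XML = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
   <w:body>
       <w:p>
           <w:r><w:t xml:space="preserve">This is </w:t></w:r>
           <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
       </w:p>
       <w:p><w:r><w:t>This is new paragraph.</w:t></w:r></w:p>
   </w:body>
</w:document>"""


def paragraphs_as_text(document_xml):
    """Return one string per <w:p>, joining the text of every <w:t> inside it."""
    root = ET.fromstring(document_xml)
    lines = []
    # Note: iter() also visits paragraphs nested in tables, which is fine for indexing.
    for paragraph in root.iter(W + "p"):
        pieces = [t.text or "" for t in paragraph.iter(W + "t")]
        lines.append("".join(pieces))
    return lines


print("\n".join(paragraphs_as_text(DOCUMENT_XML)))
```

Formatting is deliberately ignored here: the run boundaries disappear and only the characters survive, which is usually what an indexer or a TXT export wants.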

Properties of Text

Basic text properties encompass font, size, color, style, etc., with approximately 40 tags dictating text appearance. As seen in our three-paragraph example, each run defines properties like <w:color>, <w:rFonts>, and boldness (<w:b>) within <w:rPr>.

Importantly, property distinctions exist between normal and complex script characters (e.g., Arabic). Corresponding tags are used for each.

Most normal script tags have complex script counterparts distinguished by an appended “Cs” suffix. For example, <w:i> (italic) becomes <w:iCs>, and <w:b> (bold) becomes <w:bCs>.

Styles - A Deep Dive

MS Word’s style toolbar, offering options like Normal, No Spacing, Heading 1, etc., stores these styles in /word/styles.xml (absent in our simplified example; create a new DOCX to observe this file).

Defining text with a particular style results in a reference within the paragraph properties tag, <w:pPr>. Here’s an example using the “Heading 1” style:

<w:p>
   <w:pPr>
       <w:pStyle w:val="Heading1"/>
   </w:pPr>
   <w:r>
       <w:t>My heading 1</w:t>
   </w:r>
</w:p>

And the corresponding style definition from styles.xml:

<w:style w:type="paragraph" w:styleId="Heading1">
   <w:name w:val="heading 1"/>
   <w:basedOn w:val="Normal"/>
   <w:next w:val="Normal"/>
   <w:link w:val="Heading1Char"/>
   <w:uiPriority w:val="9"/>
   <w:qFormat/>
   <w:rsid w:val="002F7F18"/>
   <w:pPr>
       <w:keepNext/>
       <w:keepLines/>
       <w:spacing w:before="480" w:after="0"/>
       <w:outlineLvl w:val="0"/>
   </w:pPr>
   <w:rPr>
       <w:rFonts w:asciiTheme="majorHAnsi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi"
                 w:cstheme="majorBidi"/>
       <w:b/>
       <w:bCs/>
       <w:color w:val="365F91" w:themeColor="accent1" w:themeShade="BF"/>
       <w:sz w:val="28"/>
       <w:szCs w:val="28"/>
   </w:rPr>
</w:style>

The w:style/w:rPr/w:b XPath specifies bold text, while w:style/w:rPr/w:color dictates font color. <w:basedOn> instructs MSWord to inherit missing properties from the “Normal” style.

The Cascade of Properties

Text properties follow an inheritance model. A run possesses individual properties (w:p/w:r/w:rPr/*) while inheriting from the paragraph level (w:p/w:pPr/*). Both can reference style properties within /word/styles.xml.

<w:r>
 <w:rPr>
   <w:rStyle w:val="DefaultParagraphFont"/>
   <w:sz w:val="16"/>
 </w:rPr>
 <w:tab/>
</w:r>

Default properties apply to both paragraphs and runs: w:styles/w:docDefaults/w:rPrDefault/* and w:styles/w:docDefaults/w:pPrDefault/*. Determining the final rendered properties of a character involves this order:

1. Apply default run/paragraph properties.
2. Append run/paragraph style properties.
3. Append local run/paragraph properties.
4. Override paragraph properties with final run properties.

“Appending” B to A involves iterating through B’s properties, overriding corresponding properties in A while preserving non-intersecting properties.
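
The order above can be sketched with plain dictionaries. This is an illustration only: the property names and values are made up, the function names are hypothetical, and toggle properties such as bold are deliberately left out because they do not follow simple overriding:

```python
def append_properties(base, extra):
    """'Append' extra to base: extra wins on shared keys, other keys are kept."""
    merged = dict(base)
    merged.update(extra)
    return merged


def effective_properties(paragraph_layers, run_layers):
    """Resolve each level as defaults -> style -> local, then let the run win."""
    paragraph = {}
    for layer in paragraph_layers:
        paragraph = append_properties(paragraph, layer)
    run = {}
    for layer in run_layers:
        run = append_properties(run, layer)
    # Final step: run properties override paragraph properties.
    return append_properties(paragraph, run)


# Hypothetical layers, each list ordered as [defaults, style, local].
paragraph_layers = [{"jc": "left"}, {"jc": "center", "color": "365F91"}, {}]
run_layers = [{"sz": 22}, {}, {"sz": 16, "color": "000000"}]

resolved = effective_properties(paragraph_layers, run_layers)
print(resolved)
```

In this example the alignment comes from the paragraph style, while the size and color come from the run’s local properties.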

Another potential source for default properties is the <w:style> tag with w:type="paragraph" and w:default="1". Note that characters within a run never have default styles, so <w:style w:type="character" w:default="1"> has no impact.

Characters in a run can inherit from its paragraph and both can inherit from styles.xml.

The Toggle Effect

Some properties, like <w:b> (bold) or <w:i> (italic), function as “toggle” attributes, behaving like an XOR operation.

If both a parent style and a child run are bold, the resulting text will be regular (non-bold).

Thorough testing and reverse-engineering are crucial for accurate handling of these toggle attributes. For a deeper understanding of their behavior, refer to section 17.7.3 of the ECMA-376 Open XML specification.

Toggle properties are the most complex for a layouter to handle correctly.
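
The XOR reading described above can be written down in a couple of lines. Treat this as a sketch of the simplified rule only: the function name resolve_toggle is hypothetical, and the specification section mentioned above contains further rules that this example ignores:

```python
from functools import reduce
from operator import xor


def resolve_toggle(levels):
    """Combine a toggle property across levels with XOR.

    levels is a list of booleans, one per level (for example parent style,
    child run), where True means that level turns the toggle on.
    """
    return reduce(xor, levels, False)


print(resolve_toggle([True]))        # only one level sets bold
print(resolve_toggle([True, True]))  # two levels set bold and cancel out
```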

Fonts and their Definitions

Fonts adhere to the same principles as other text attributes, with default values defined within a separate theme file referenced in word/_rels/document.xml.rels, as follows:

<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>

Following this reference, the default font name resides in word/theme/theme1.xml, specifically within the <a:theme> tag under a:themeElements/a:fontScheme/a:majorFont or a:minorFont.

The default font size is 10, unless the w:docDefaults/w:rPrDefault tag is missing altogether, in which case it becomes 11.
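
Reading the theme fonts might look like the sketch below. The inline theme is a heavily trimmed stand-in, the typeface names in it are sample values, and the a:latin child with its typeface attribute is a detail not covered in the text above. The function name theme_fonts is hypothetical:

```python
import xml.etree.ElementTree as ET

A = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}

# Trimmed stand-in for word/theme/theme1.xml (sample typefaces, assumption).
THEME_XML = """<a:theme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
   <a:themeElements>
       <a:fontScheme name="Office">
           <a:majorFont><a:latin typeface="Cambria"/></a:majorFont>
           <a:minorFont><a:latin typeface="Calibri"/></a:minorFont>
       </a:fontScheme>
   </a:themeElements>
</a:theme>"""


def theme_fonts(theme_xml):
    """Return the Latin typeface of the major and minor theme fonts."""
    root = ET.fromstring(theme_xml)
    fonts = {}
    for kind in ("majorFont", "minorFont"):
        latin = root.find(f"a:themeElements/a:fontScheme/a:{kind}/a:latin", A)
        fonts[kind] = latin.get("typeface") if latin is not None else None
    return fonts


print(theme_fonts(THEME_XML))
```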

Text Alignment - A Closer Look

The <w:jc> tag controls text alignment, offering four w:val modes: "left", "center", "right", and "both".

"left" (the default) aligns text to the left margin of the paragraph rectangle, typically the page width. (This sentence demonstrates left alignment.)

"center" positions text centrally within the page width. (This sentence exemplifies center alignment.)

"right" aligns text to the right margin. (As you can see, this sentence is right-aligned.)

"both" increases inter-word spacing to stretch lines across the full paragraph width, except for the final line, which defaults to left alignment. (This sentence showcases the “both” alignment mode.)
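
Reading the alignment of a paragraph is a matter of looking for <w:jc> inside <w:pPr> and falling back to "left". The sketch below only checks the paragraph’s local properties and does not consult styles.xml. The function name paragraph_alignment is hypothetical:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

CENTERED = (
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:jc w:val="center"/></w:pPr>'
    "<w:r><w:t>Centered</w:t></w:r></w:p>"
)
PLAIN = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>Plain</w:t></w:r></w:p>'


def paragraph_alignment(paragraph_xml):
    """Return the w:val of the paragraph's local <w:jc>, defaulting to "left"."""
    paragraph = ET.fromstring(paragraph_xml)
    jc = paragraph.find("w:pPr/w:jc", NS)
    if jc is None:
        return "left"
    return jc.get(f"{{{W_NS}}}val", "left")


print(paragraph_alignment(CENTERED))
print(paragraph_alignment(PLAIN))
```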

Incorporating Images

DOCX supports two image types: inline and floating.

Inline images, appearing within a paragraph alongside text, utilize the <w:drawing> tag instead of <w:t> (text). Extract the image ID using this XPath:

w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed

This ID helps locate the corresponding filename within word/_rels/document.xml.rels, which should point to a GIF or JPEG file within the “word/media” subfolder. (Refer to the word/_rels/document.xml.rels file in the GitHub project for a practical example.)
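
Put together, the lookup could be sketched as follows. The relationship ID rId4 and the file name image1.jpeg are made-up sample values, and the function name inline_image_paths is hypothetical. Python’s ElementTree cannot select an attribute at the end of a path, so the code finds the a:blip element and reads r:embed from it:

```python
import posixpath
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
}
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
BLIP_PATH = ".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip"

# Trimmed sample run and relationships file (sample values, assumption).
RUN_XML = f"""<w:r xmlns:w="{NS['w']}" xmlns:wp="{NS['wp']}" xmlns:a="{NS['a']}"
     xmlns:pic="{NS['pic']}" xmlns:r="{R_NS}">
   <w:drawing><wp:inline><a:graphic><a:graphicData><pic:pic>
       <pic:blipFill><a:blip r:embed="rId4"/></pic:blipFill>
   </pic:pic></a:graphicData></a:graphic></wp:inline></w:drawing>
</w:r>"""
RELS_XML = f"""<Relationships xmlns="{REL_NS}">
   <Relationship Id="rId4" Type="{R_NS}/image" Target="media/image1.jpeg"/>
</Relationships>"""


def inline_image_paths(run_xml, rels_xml):
    """Map each inline image's r:embed ID to its path inside the package."""
    rels = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).findall(f"{{{REL_NS}}}Relationship")
    }
    paths = []
    for blip in ET.fromstring(run_xml).iterfind(BLIP_PATH, NS):
        rel_id = blip.get(f"{{{R_NS}}}embed")
        if rel_id in rels:
            # Targets in word/_rels/document.xml.rels are relative to word/.
            paths.append(posixpath.join("word", rels[rel_id]))
    return paths


print(inline_image_paths(RUN_XML, RELS_XML))
```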

Floating images are positioned relative to paragraphs, with text flowing around them. (The GitHub project’s sample document provides a visual representation of this.)

Floating images use <wp:anchor> in place of <wp:inline> inside <w:drawing>. Exercise caution when deleting text within <w:p>: the anchor lives inside the paragraph, so removing its runs can unintentionally remove the image as well.

MS Word's image options refer to image alignment as "text wrapping mode".

Tables and Their Structure

Table tags in DOCX closely resemble their HTML counterparts: <w:tbl> acts like <table>, <w:tr> corresponds to <tr>, and so on.

<w:tbl>, representing the table itself, contains table properties within <w:tblPr>. Each column’s properties are defined by <w:gridCol> tags within <w:tblGrid>. Rows are represented by consecutive <w:tr> tags, each containing the same number of columns as specified in <w:tblGrid>:

<w:tbl>
 <w:tblPr>
   <w:tblW w:w="5000" w:type="pct" />
 </w:tblPr>
 <w:tblGrid><w:gridCol/><w:gridCol/></w:tblGrid>
 <w:tr>
   <w:tc><w:p><w:r><w:t>left</w:t></w:r></w:p></w:tc>
   <w:tc><w:p><w:r><w:t>right</w:t></w:r></w:p></w:tc>
 </w:tr>
</w:tbl>
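
Because the structure mirrors HTML tables, pulling the cell text out into a list of rows is straightforward. The sketch below uses the same table as above with the w: namespace declared. The function name table_rows is hypothetical:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

TABLE_XML = f"""<w:tbl xmlns:w="{W_NS}">
 <w:tblPr>
   <w:tblW w:w="5000" w:type="pct" />
 </w:tblPr>
 <w:tblGrid><w:gridCol/><w:gridCol/></w:tblGrid>
 <w:tr>
   <w:tc><w:p><w:r><w:t>left</w:t></w:r></w:p></w:tc>
   <w:tc><w:p><w:r><w:t>right</w:t></w:r></w:p></w:tc>
 </w:tr>
</w:tbl>"""


def table_rows(tbl_xml):
    """Return the table as a list of rows, each a list of cell strings."""
    tbl = ET.fromstring(tbl_xml)
    rows = []
    for tr in tbl.findall("w:tr", NS):
        row = []
        for tc in tr.findall("w:tc", NS):
            # A cell may hold several paragraphs and runs; join all their text.
            row.append("".join(t.text or "" for t in tc.iter(f"{{{W_NS}}}t")))
        rows.append(row)
    return rows


column_count = len(ET.fromstring(TABLE_XML).findall("w:tblGrid/w:gridCol", NS))
print(column_count, table_rows(TABLE_XML))
```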

Common DOCX XML Unit Conversions

Unit                     Conversion    Example      Tags using this
20th of a point (dxa)                  11906        pgSz/pgMar/w:spacing
Points                   dxa/20        595.3
Inches                   pt/72         8.27…
Centimeters              in*2.54       21.00086…
Font half size           pt/144        4.135        w:sz
EMU                      in*914400     7562088      wp:extent, a:ext
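
The length conversions are easy to wrap in small helpers. This sketch covers only the dxa, point, inch, centimeter, and EMU conversions, and the function names are hypothetical. The sample value 12240 is the page width (w:pgSz w:w) from the minimal document.xml shown earlier:

```python
EMU_PER_INCH = 914400


def dxa_to_points(dxa):
    """Twentieths of a point to points."""
    return dxa / 20


def points_to_inches(points):
    return points / 72


def inches_to_cm(inches):
    return inches * 2.54


def inches_to_emu(inches):
    """Inches to English Metric Units, rounded to a whole number."""
    return round(inches * EMU_PER_INCH)


page_width_dxa = 12240
width_pt = dxa_to_points(page_width_dxa)
width_in = points_to_inches(width_pt)
print(width_pt, width_in, inches_to_cm(width_in), inches_to_emu(width_in))
```

For that page width the helpers give 612 points, 8.5 inches, roughly 21.59 centimeters, and 7772400 EMU.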