How to Create a Word Document in RPG

Published Mar 10, 2011

In a recent thread in the System iNetwork Forums, someone asked how to produce Microsoft Word documents from IBM i. He wanted to create a document in Word, but then insert data from a DB2 for i database into the document with an RPG program.

I advised that he should investigate the DOCX format that became Word's standard format starting with Office 2007. I knew that it was based on XML, and so you should be able to create it from any programming language, but I didn't elaborate, because I hadn't done it myself. Now I've decided it's time to experiment with it. In this article, I'll tell you what I discovered, and demonstrate how to insert data into a Word document from RPG.

Microsoft Word and the DOCX Format

With the release of Microsoft Office 2007, Microsoft changed its file formats dramatically. Word, Excel, and PowerPoint all have new native formats for storing their data. They still support the old formats, of course, but the XML-based formats are now the "standard" ones.

Microsoft Word is no exception. Its older format uses the extension .DOC to denote a Word document, and the new format uses the extension .DOCX, which denotes the XML variant of the Word document. Since I know that XML is plain text, and I know I can read and write XML from RPG, I should be able to work with the XML format, right?

Turns out, it's a little more complex than that. A .DOCX file isn't a single XML file; it's actually a .ZIP file that contains a whole directory structure. Within that structure are many XML files.

In order to experiment with this, I opened up Microsoft Word and created the following Word document:

Notice that I've put placeholders where I wanted to insert data from my RPG program. For the date, I put ====DATE====. I figure that my RPG program can search the XML document for that string and replace it with the actual date. Likewise, I have placeholders like ====RECIP==== and ====TITLE==== for their corresponding fields from my RPG program. I chose the = character because it has no special meaning in XML, it works across all character sets, and it's unlikely that four consecutive = characters would appear in a normal business letter.

I saved this document to my PC as ACME.docx. I made certain to use the Office 2007 "docx" format, since I know that's a Zipped XML document. I used FTP in binary mode to upload this document to an IFS directory on my system that runs IBM i.

Next, I used the InfoZip Utility in PASE to unzip the Word document. (Actually, my first attempt used 7-Zip, which worked well for unzipping, but when I zipped up my result, it didn't work. Apparently, 7-Zip creates .ZIP files that Microsoft Word doesn't understand. InfoZip doesn't have as many features as 7-Zip, but seems to be compatible with Word.) From QShell, I typed the following command:

unzip ACME.docx

And the command's output looked like this:

Archive:  ACME.docx
  inflating: [Content_Types].xml
   creating: _rels/
  inflating: _rels/.rels
   creating: docProps/
  inflating: docProps/core.xml
  inflating: docProps/app.xml
  inflating: docProps/custom.xml
   creating: word/
   creating: word/_rels/
  inflating: word/_rels/document.xml.rels
  inflating: word/_rels/settings.xml.rels
  inflating: word/document.xml
  inflating: word/footnotes.xml
  inflating: word/endnotes.xml
  inflating: word/header1.xml
   creating: word/theme/
  inflating: word/theme/theme1.xml
  inflating: word/settings.xml
  inflating: word/webSettings.xml
  inflating: word/styles.xml
  inflating: word/numbering.xml
  inflating: word/fontTable.xml

At this point, you may be wondering how that worked! After all, it was a .DOCX file, not a .ZIP file! How was I able to unzip it? In truth, it was a zip file. That's what today's Word documents are: .ZIP files that contain a particular directory structure. The files inside that structure are XML documents that contain the layout of the Word document.

I was amazed at how much data is stored inside a Word document. The files that contain the phrase "rels" are relationship documents that describe how the files relate to one another. Most of the others, including styles.xml, fontTable.xml, settings.xml, and theme1.xml, are XML documents that describe how the document looks. What are the fonts? How is everything laid out? For now, I'm content to let Word figure all of that out.

The only file I'm interested in is the document.xml file that's found in the word subdirectory. It contains the actual document, including my ==== placeholders. If I load it up into my RPG program, I should be able to find those placeholders, insert my own text, save it back to disk, and re-zip it.

The Document.xml File

The document.xml file is, of course, an XML file. You can open it and look at its contents, and you'll see that it contains the text you typed into Word. I opened mine with the Firefox web browser, since Firefox formats XML nicely on the screen, making it very easy to read. Here's an excerpt from the document.xml file:

<w:document>
  <w:body>
    .
    .
    <w:p w:rsidR="00BD0BBB" w:rsidRPr="00BD0BBB" w:rsidRDefault="001E5AC9"
         w:rsidP="00BD0BBB">
      <w:pPr>
        <w:pStyle w:val="Date"/>
      </w:pPr>
      <w:r>
        <w:t>====DATE====</w:t>
      </w:r>
    </w:p>
    .
    .

As you can see, it's an XML file, but what are all of the elements in it? What does <w:pPr> do, for example? What is w:rsidR="00BD0BBB"? I certainly don't know. Fortunately, I don't have to worry too much about them. I just need to replace ====DATE==== with data from my RPG program, and then I can save the rest of it back to disk unchanged.

So I did that. I wrote an RPG program that follows these steps:

  1. Calls InfoZip to unzip the .DOCX file into a temporary directory.
  2. Reads the document.xml file into a character string in my RPG program.
  3. Uses the %SCAN and %REPLACE BIFs to replace my placeholders with data from my program.
  4. Saves the document.xml file back to the IFS.
  5. Calls InfoZip to .ZIP the XML files again, creating a new .DOCX file.

One Tricky Problem

It partially worked. All of the fields were replaced except my ====STATE==== and ====POSTAL==== fields. For some reason, they did not get replaced! It took a while, but I eventually found the problem. In my document.xml, I was expecting to find this:

<w:t>====CITY====, ====STATE====  ====POSTAL====</w:t>

However, I didn't find that. Instead, I found this:

<w:t>====CITY====, ====STATE===</w:t></w:r><w:proofErr w:type="gramStart"/><w:r><w:t>=  =</w:t></w:r> <w:proofErr w:type="gramEnd"/><w:r><w:t>===POSTAL====</w:t>

It appears that Word decided that my placeholders were bad grammar, so it inserted "proofErr" tags to show where the grammar error started and ended. Because those tags landed in the middle of my ====STATE==== and ====POSTAL==== placeholders, my RPG program couldn't find the strings, and failed to replace them.

Once I finally realized this, I went into Word and disabled its spelling and grammar checking, and tried again. This time, it worked!
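
Turning off proofing fixes this template, but the same thing could happen again whenever someone edits the template in Word. Here is a hedged sketch of a safety check: after all of the replacements are done, look for any remaining run of four = characters in the XML. The procedure name unreplaced is hypothetical and is not part of the downloadable code. It accepts the same 65535-byte varying string that holds document.xml in the code later in this article.

     P unreplaced      B
     D                 PI              n
     D   doc                      65535a   varying const
      /free
         // a surviving '====' means some placeholder was split
         // by Word, or was never replaced by the program
         return (%len(doc) > 0 and %scan( '====': doc ) > 0);
      /end-free
     P                 E

It could be called just before the XML is written back to disk:

         if unreplaced( vbuf );
            // report the problem instead of creating the .DOCX
         endif;

This only catches splits that leave at least four consecutive = characters behind, which is the case in the example above, so it is a warning aid rather than a guarantee.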

The RPG Code

How does the RPG code work? It works by calling the QCMDEXC API to invoke QShell, and uses QShell to unzip the DOCX file.

         // ------------------------------------------------------
         // Extract the DOCX Template to a temporary
         // directory, and mark the document.xml file w/CCSID 1208
         // ------------------------------------------------------

         cmd = 'QSH CMD(''export PATH=$PATH:/usr/local/bin +
                       && mkdir "' + TMPDIR + '" +
                       && cd "' + TMPDIR + '" +
                       && unzip "' + Template + '" +
                       && setccsid 1208 word/document.xml'')';
         QCMDEXC(cmd: %len(cmd));

I found I had to set the CCSID of document.xml to 1208 (UTF-8) in order for the IFS APIs to perform proper translation of the data when my program reads it in. In the preceding code, I used QShell's setccsid utility to do this. The CHGATR CL command is another good way to change the CCSID, but since I was already in QShell, I opted for QShell's command.
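
For comparison, here is a minimal sketch of the CHGATR alternative, run as an ordinary CL command through QCMDEXC after the unzip step. It reuses the cmd, TMPDIR, and QCMDEXC names from the code above, it is not part of the downloadable program, and it assumes that TMPDIR contains no apostrophes.

         // set the CCSID attribute of document.xml to 1208 (UTF-8)
         cmd = 'CHGATR OBJ(''' + TMPDIR + '/word/document.xml'') '
             + 'ATR(*CCSID) VALUE(1208)';
         QCMDEXC(cmd: %len(cmd));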

Now that my .DOCX file has been unzipped, I use the IFS APIs to load document.xml into a variable in my RPG program.

     D buf             s          65535a
     D vbuf            s          65535a   varying
         .
         .
         // ------------------------------------------------------
         //   Load the document.xml file from the template
         //   into an RPG variable.
         // ------------------------------------------------------
         IfsPath = TMPDIR + '/word/document.xml';
         fd = open( IfsPath
                  : O_RDONLY + O_CCSID + O_TEXTDATA
                  : 0
                  : 0 );
         if (fd = -1);
            // handle error here
         endif;

         len = read(fd: %addr(buf): %size(buf));
         callp close(fd);
         if (len < 1);
            // handle read error or empty file here
         endif;

         vbuf = %subst(buf:1:len);

Since I decided I wanted my code to remain V5R4 compatible, I used an alphanumeric field that's only 65535 bytes long. It was more than large enough for my simple Word document. However, it's easy to imagine a situation where I might want to handle larger documents. In IBM i 6.1, you can change the size of buf and vbuf to much larger sizes, up to 16 MB. I'll leave that change as an exercise for the reader.
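
A rough sketch of that exercise, assuming IBM i 6.1 or later (this is an illustration, not code from the download), might grow the declarations to 1 MB like this:

     D buf             s        1000000a
     D vbuf            s        1000000a   varying

Once a varying field is longer than 65535 bytes, its length prefix is 4 bytes instead of 2, so any code that skips over the prefix with a fixed 2-byte offset has to change as well. In 6.1, %addr with the *data option returns the address of the data portion directly, so the write call shown later in this article could become:

         callp write(fd: %addr(vbuf: *data): %len(vbuf));

Any procedure that receives vbuf by reference would also need its parameter definition changed to match the new length.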

Since I'm working with V5R4 code, I can't use RPG's new Scan and Replace (%SCANRPL) BIF, either, so I wrote myself a subprocedure to perform scanning and replacing.

     P scanrpl         B
     D                 PI
     D   vbuf                     65535a   varying
     D   oldval                     100a   varying const
     D   newval                     100a   varying const
     D pos             s             10i 0
      /free
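         pos = %scan( oldval: vbuf );
         dow pos > 0;
            vbuf = %replace( newval: vbuf: pos: %len(oldval));
            // stop if the replacement ended the string; a start
            // position past the end is out of range for %scan
            if pos + %len(newval) > %len(vbuf);
               leave;
            endif;
            pos = %scan( oldval: vbuf: pos+%len(newval) );
         enddo;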
      /end-free
     P                 E

With this procedure, I can easily scan for my placeholders and replace them with data from my RPG program.

     D WordRepl_fields_t...
     D                 ds                  qualified
     D   date                        10a   varying
     D   recip                       30a   varying
     D   recipnm                     30a   varying
     D   title                       30a   varying
     D   company                     30a   varying
     D   address                     30a   varying
     D   city                        20a   varying
     D   state                        2a
     D   postal                      10a   varying

     D my              ds                  likeds(WordRepl_fields_t)
         .
         .
         scanrpl( vbuf : '====DATE===='    : my.date    );
         scanrpl( vbuf : '====RECIP===='   : my.recip   );
         scanrpl( vbuf : '====RECIPNM====' : my.recipnm );
         scanrpl( vbuf : '====TITLE===='   : my.title   );
         scanrpl( vbuf : '====COMPANY====' : my.company );
         scanrpl( vbuf : '====ADDRESS====' : my.address );
         scanrpl( vbuf : '====CITY===='    : my.city    );
         scanrpl( vbuf : '====STATE===='   : my.state   );
         scanrpl( vbuf : '====POSTAL===='  : my.postal  );

Prior to the preceding code, I set values for recipient, title, company, et al, in the my data structure. I just use simple variable assignment to hard-code these values in my RPG program. However, in a real-world program, you'd probably want to get this information either from the user or from a database file.
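
One thing to watch when the values come from a user or a database: document.xml is XML, so a value containing &, <, or > (a company name like Smith & Sons, for example) would leave the file malformed, and Word would likely refuse to open it. A minimal sketch of an escaping helper follows. The name xmlEscape is hypothetical and is not part of the downloadable code. It assumes values of at most 100 characters, and an escaped value that grows past 100 characters is silently truncated.

     P xmlEscape       B
     D                 PI           100a   varying
     D   value                      100a   varying const
     D result          s            100a   varying
     D i               s             10i 0
     D ch              s              1a
      /free
         result = '';
         // copy one character at a time, swapping the three
         // characters that are special in XML text for entities
         for i = 1 to %len(value);
            ch = %subst(value: i: 1);
            select;
            when ch = '&';
               result += '&amp;';
            when ch = '<';
               result += '&lt;';
            when ch = '>';
               result += '&gt;';
            other;
               result += ch;
            endsl;
         endfor;
         return result;
      /end-free
     P                 E

Each value would then be escaped as it is passed to scanrpl:

         scanrpl( vbuf : '====COMPANY====' : xmlEscape(my.company) );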

The result is that the placeholders were replaced with data from variables in my program. Now my vbuf variable contains the final XML document, with the data already filled in. I need to write it out to the IFS using the IFS APIs:

         fd = open( IfsPath
                  : O_TRUNC + O_WRONLY + O_TEXTDATA + O_CCSID
                  : 0
                  : 0 );
         if (fd = -1);
            // handle error here
         endif;

         callp write(fd: %addr(vbuf)+VARPREF: %len(vbuf));
         callp close(fd);

And the final step is to create a new .DOCX file by zipping up my temporary directory. I used QShell and InfoZip to do this.

         cmd = 'QSH CMD(''export PATH=$PATH:/usr/local/bin +
                       && cd "' + TMPDIR + '" +
                       && zip -r "' + NewDoc + '" *'')';
         QCMDEXC(cmd: %len(cmd));

The Result

When I open my new .DOCX file with Microsoft Word, it looks like this:

Code Download

Click here to download the RPG code described in this article.

Click here to download a copy of InfoZip that has been compiled for PASE