Published Mar 10, 2011
In a recent thread in the System iNetwork Forums, someone asked how to produce Microsoft Word documents from IBM i. He wanted to create a document in Word, but then insert data from a DB2 for i database into the document with an RPG program.
I advised that he should investigate the DOCX format that became Word's standard format starting with Office 2007. I knew that it was based on XML, and so you should be able to create it from any programming language, but I didn't elaborate, because I hadn't done it myself. Now I've decided it's time to experiment with it. In this article, I'll tell you what I discovered, and demonstrate how to insert data into a Word document from RPG.
With the release of Microsoft Office 2007, Microsoft changed their file formats dramatically. Word, Excel and Powerpoint all have new native formats to store their data in. They still support the old formats, of course, but their "standard" format is the XML ones.
Microsoft Word is no exception. Their older format uses the extension .DOC to denote that it's a Word document, and the new format uses the extension .DOCX, which denotes that it's an XML variant of the Word document. Since I know that XML is plain text, and I know I can read and write XML from RPG, I should be able to work with the XML format, right?
Turns out, it's a little more complex than that. A .DOCX file isn't a single XML file, it's actually a .ZIP file that contains a whole directory structure. Within that structure are many XML files.
In order to experiment with this, I opened up Microsoft Word and created the following Word document:
Notice that I've put placeholders where I wanted to insert data from my RPG program. For the date, I put ====DATE====, I figure that my RPG program can search the XML document for that string, and replace it with the actual date. Likewise, I have placeholders like ====RECIP==== and ====TITLE==== for their corresponding fields from my RPG program. I chose the = character because it has no special meaning in XML, it works across all character sets, and it's unlikely that four consecutive = characters would appear in a normal business letter.
I saved this document to my PC as ACME.docx. I made certain to use the Office 2007 "docx" format, since I know that's a Zipped XML document. I used FTP in binary mode to upload this document to an IFS directory on my system that runs IBM i.
Next, I used the InfoZip Utility in PASE to unzip the Word document. (Actually, my first attempt used 7-Zip, which worked well for unzipping, but when I zipped up my result, it didn't work. Apparently, 7-Zip creates .ZIP files that Microsoft Word doesn't understand. InfoZip doesn't have as many features as 7-Zip, but seems to be compatible with Word.) From QShell, I typed the following command:
unzip ACME.docx
And the command's output looked like this:
Archive: ACME.docx inflating: [Content_Types].xml creating: _rels/ inflating: _rels/.rels creating: docProps/ inflating: docProps/core.xml inflating: docProps/app.xml inflating: docProps/custom.xml creating: word/ creating: word/_rels/ inflating: word/_rels/document.xml.rels inflating: word/_rels/settings.xml.rels inflating: word/document.xml inflating: word/footnotes.xml inflating: word/endnotes.xml inflating: word/header1.xml creating: word/theme/ inflating: word/theme/theme1.xml inflating: word/settings.xml inflating: word/webSettings.xml inflating: word/styles.xml inflating: word/numbering.xml inflating: word/fontTable.xml
At this point, you may be wondering how that worked! After all, it was a .DOCX file, not a .ZIP file! How was I able to unzip it? In truth, it was a zip file. That's what today's Word documents are, they are .ZIP files that contain a particular directory structure. The files inside that structure are XML documents that contain the layout of the Word document.
I was amazed at how much data is stored inside a Word document. The files that contain the phrase "rels", are relationship documents that describe how the files relate to one another. Most of the others, including styles.xml, fontTable.xml, settings.xml and theme1.xml are XML documents that describe how the document looks. What are the fonts? How is everything laid out? For now, I'm content to let Word figure all of that out.
The only file I'm interested in is the document.xml file that's found in the word subdirectory. It contains the actual document, including my ==== placeholders. If I load it up into my RPG program, I should be able to find those placeholders, insert my own text, save it back to disk, and re-zip it.
The document.xml file is, of course, an XML file. You can open it and look at it's contents, and you'll see that it contains the text you typed into Word. I opened mine with the Firefox web browser, since Firefox will format XML nicely on the screen, making it very easy to read. Here's an excerpt from the document.xml file:
<w:document> <w:body> . . <w:p w:rsidR="00BD0BBB" w:rsidRPr="00BD0BBB" w:rsidRDefault="001E5AC9" w:rsi dP="00BD0BBB"> <w:pPr> <w:pStyle w:val="Date"/> </w:pPr> <w:r> <w:t>====DATE====</w:t> </w:r> </w:p> . .
As you can see, it's an XML file, but what are all of the elements in it? What does <w:Pr> do, for example? What is a w:rsidR="00BD0BBB"? I certainly don't know. Fortunately, I don't have to worry too much about them, I just need to replace ====DATE==== with data from my RPG program, and then I can save the rest of it back to disk unchanged.
So I did that. I wrote an RPG program that follows these steps:
It partially worked. All of the fields were replaced except my ====STATE==== and ====POSTAL=== fields. For some reason, they did not get replaced! It took awhile, but I eventually found the problem. In my document.xml, I was expecting to find this:
<w:t>====CITY====, ====STATE==== ====POSTAL====</w:t>
However, I didn't find that. Instead, I found this:
<w:t>====CITY====, ====STATE===</w:t></w:r><w:proofErr w:type="gramStart"/><w:r><w:t>= =</w:t></w:r> <w:proofErr w:type="gramEnd"/><w:r><w:t>===POSTAL====</w:t>
It appears tha Word decided that my placeholders were bad grammar, so it inserted "proofErr" tags to show me where my grammar error started and ended. Because it happened to be in the middle of my ====STATE==== and ====POSTAL==== placeholders, my RPG program couldn't find the strings, and failed to replace them properly.
Once I finally realized this, I went into Word and disabled it's spelling and grammar checking, and tried again. This time, it worked!
How does the RPG code work? It works by calling the QCMDEXC API to invoke QShell, and uses QShell to unzip the DOCX file.
// ------------------------------------------------------ // Extract the DOCX Template to a temporary // directory, and mark the document.xml file w/CCSID 1208 // ------------------------------------------------------ cmd = 'QSH CMD(''export PATH=$PATH:/usr/local/bin + && mkdir "' + TMPDIR + '" + && cd "' + TMPDIR + '" + && unzip "' + Template + '" + && setccsid 1208 word/document.xml'')'; QCMDEXC(cmd: %len(cmd));
I found I had to set the CCSID of document.xml to 1208 (UTF-8) in order for the IFS APIs to perform proper translation of the data when my program reads it in. In the preceding code, I used QShell's setccsid utility to do this. The CHGATR CL command is another good way to change the CCSID, but since I was already in QShell, I opted for QShell's command.
Now that my .DOCX file has been unzipped, I used the IFS APIs to load it into a variable in my RPG program.
D buf s 65535a D vbuf s 65535a varying . . // ------------------------------------------------------ // Load the document.xml file from the template // into an RPG variable. // ------------------------------------------------------ IfsPath = TMPDIR + '/word/document.xml'; fd = open( IfsPath : O_RDONLY + O_CCSID + O_TEXTDATA : 0 : 0 ); if (fd = -1); // handle error here endif; len = read(fd: %addr(buf): %size(buf)); callp close(fd); vbuf = %subst(buf:1:len);
Since I decided I wanted my code to remain V5R4 compatible, I used an alphanumeric field that's only 65535 bytes long. It was more than large enough for my simple Word document. However, it's easy to imagine a situation where I might want to handler larger documents. In IBM i 6.1, you can change the size of buf and vbuf to much larger sizes, up to 16 MB. I'll leave that change as an exercise for the reader.
Since I'm working with V5R4 code, I can't use RPG's new Scan and Replace (%SCANRPL) BIF, either, so I wrote myself a subprocedure to perform scanning and replacing.
P scanrpl B D PI D vbuf 65535a varying D oldval 100a varying const D newval 100a varying const D pos s 10i 0 /free pos = %scan( oldval: vbuf ); dow pos > 0; vbuf = %replace( newval: vbuf: pos: %len(oldval)); pos = %scan( oldval: vbuf: pos+%len(newval) ); enddo; /end-free P E
With this procedure, I can easily scan for my placeholders and replace them with data from my RPG program.
D WordRepl_fields_t... D ds qualified D date 10a varying D recip 30a varying D recipnm 30a varying D title 30a varying D company 30a varying D address 30a varying D city 20a varying D state 2a D postal 10a varying D my ds likeds(WordRepl_fields_t) . . scanrpl( vbuf : '====DATE====' : my.date ); scanrpl( vbuf : '====RECIP====' : my.recip ); scanrpl( vbuf : '====RECIPNM====' : my.recipnm ); scanrpl( vbuf : '====TITLE====' : my.title ); scanrpl( vbuf : '====COMPANY====' : my.company ); scanrpl( vbuf : '====ADDRESS====' : my.address ); scanrpl( vbuf : '====CITY====' : my.city ); scanrpl( vbuf : '====STATE====' : my.state ); scanrpl( vbuf : '====POSTAL====' : my.postal );
Prior to the preceding code, I set values for recipient, title, company, et al, in the my data structure. I just use simple variable assignment to hard-code these values in my RPG program. However, in a real-world program, you'd probably want to get this information either from the user or from a database file.
The result is that the placeholders were replaced with data from variables in my program. Now my vbuf variable contains the final XML document, with the data already filled in. I need to write it out to the IFS using the IFS APIs:
fd = open( IfsPath : O_TRUNC + O_WRONLY + O_TEXTDATA + O_CCSID : 0 : 0 ); if (fd = -1); // handle error here endif; callp write(fd: %addr(vbuf)+VARPREF: %len(vbuf)); callp close(fd);
And the final step is to create a new .DOCX file by zipping up my temporary directory. I used QShell and InfoZip to do this.
cmd = 'QSH CMD(''export PATH=$PATH:/usr/local/bin + && cd "' + TMPDIR + '" + && zip -r "' + NewDoc + '" *'')'; QCMDEXC(cmd: %len(cmd));
When I open my new .DOCX file with Microsoft Word, it looks like this:
Click here to download the RPG code described in this article.
Click here to download a copy of InfoZip that has been compiled for PASE