My uncle from Hong Kong recently contacted me about some problems he was having with his research. He is a professor of theology there; I’m not exactly sure what he does or where he teaches, but apparently he needed some help with MSWord doc files. So I told him to e-mail me what he already had and what he wanted me to do, and we’d go from there.
So for the past few years, my uncle has been working on a Chinese <-> English <-> German <-> Hebrew <-> (Klingon?) lexicon for translating words and phrases in the Bible. His grand scheme is to parse the doc files and enter them into a database, and from there web-ify it so that researchers and laymen alike can study the Bible in these languages. He had a former student of his work on the parsing and database stuff, and he said he’d send me the code for it. Great, I love code…ish :S
My heart sank when I first saw the code.
Well, actually, I lied: first I had to find the code. All I got was a file called “parser.doc” that appeared to be an unusually large, completely empty Word document. The problem turned out to be my Mac (again), because iWork’s Pages hates MS Office macros.
Yes, that’s right, Macros.
The “parser” was a 1,700-line, flat VB.net class file embedded in the MSWord document as one gigantic macro. I tried to run it and, frankly, I had no idea what the hell it did. And I wasn’t about to go through 1,700 lines of uncommented VB code while school was in session.
A few weeks later, my uncle calls me up and tells me that he needs some help again. This time he wants to remove all tables and all occurrences of 〔*〕 in his Word document. And since I actually had a bit of spare time, I said sure, why not? The first thing was to figure out how to edit Word files without going through the trouble of using VB macros. My reasons were twofold: 1) using Compuware’s TestPartner while doing automated testing at CAST has given me a healthy loathing of VB, and 2) from what I saw in the “parser” macro, you cannot batch-process multiple files; you have to select your file each and every time. Well, that’s awfully inconvenient. So what can we do?
Well, my research (a.k.a. Googling) showed that you can use Python + the win32com module (part of the pywin32 package) to do everything that I wanted. The only problem was that there wasn’t a whole lot of documentation for it, but I was still able to get it done relatively quickly.
The code’s actually not that long; I’ve posted the full program below.
The hardest part about this was 1) finding documentation on its usage and 2) figuring out a way to send this to my uncle so he could actually run it.
I tried fooling around with py2exe, but that was just a bag of hurt, so I opted to just tell my uncle how to install Python and gave him detailed instructions to get everything set up.
Also, I found that it’s impossibly hard to manually edit text while going through the COM module. I tried for literally minutes and I just gave up once I found out you can use wildcard replacement on Find and Replace.
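For what it’s worth, the same wildcard idea can be tried out in plain Python without Word. This is only a sketch and not what the script does: it assumes Word’s * wildcard behaves roughly like a non-greedy .*? in a regular expression, it does not match across line breaks, and the function name and sample sentence are made up for illustration.

```python
import re

LEFT = u"\u3014"   # left tortoise-shell bracket
RIGHT = u"\u3015"  # right tortoise-shell bracket

# Assumption: Word's "*" wildcard is approximated by a non-greedy ".*?",
# so each match stops at the first closing bracket it meets.
BRACKETED = re.compile(LEFT + u".*?" + RIGHT)


def strip_bracketed(text):
    """Remove every bracketed span, brackets included."""
    return BRACKETED.sub(u"", text)


# Hypothetical sample text with two bracketed notes in it.
sample = u"In the beginning\u3014note 1\u3015 God created\u3014a\u3015 the heavens"
cleaned = strip_bracketed(sample)
print(cleaned)
```

The non-greedy match matters here: a greedy .* would swallow everything from the first opening bracket to the last closing bracket on the line, including the text in between.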
Oh, and one last thought: PowerShell + Python + Vim = win! That’s about it, though; don’t try to do anything else in it :S
import os

import win32com.client
from win32com.client import constants
from win32com.client.gencache import EnsureDispatch

#the makepy cache can be read-only (for example in a frozen build);
#make it writable and rebuild it so the wrappers can be generated
if win32com.client.gencache.is_readonly:
    win32com.client.gencache.is_readonly = False
    win32com.client.gencache.Rebuild()

#EnsureDispatch (rather than plain Dispatch) generates the type library
#wrappers, which is what makes constants.wdFindContinue etc. available
word = EnsureDispatch("Word.Application")
word.Visible = False

work_dir = "modified"
leftBracket = u"\u3014"
rightBracket = u"\u3015"


def processDoc(name):
    #tell word to open the document and keep a handle to it
    doc = word.Documents.Open(os.path.join(os.getcwd(), name))

    #delete ALL tables; always delete the first one, because deleting
    #while iterating over the collection skips tables
    while doc.Tables.Count > 0:
        doc.Tables(1).Delete()

    #wildcard find and replace: remove the brackets and everything between them
    find = word.Selection.Find
    find.ClearFormatting()
    find.Replacement.ClearFormatting()
    find.Text = leftBracket + "*" + rightBracket
    find.Replacement.Text = ""
    find.Forward = True
    find.Wrap = constants.wdFindContinue
    find.MatchWildcards = True
    find.Execute(Replace=constants.wdReplaceAll)

    #re-save in the modified folder
    doc.SaveAs(os.path.join(os.getcwd(), work_dir, name))

    #close the document
    doc.Close()


def findLocalDocs():
    #look at what's local; os.walk also descends into subfolders, but
    #processDoc only looks for the name in the current folder
    for name in os.listdir("."):
        #skip Word's "~" lock files and anything that is not a .doc
        if not name.startswith("~") and name.lower().endswith(".doc"):
            processDoc(name)


def main():
    try:
        os.mkdir(work_dir)
    except OSError:
        #the folder already exists
        pass
    try:
        findLocalDocs()
    finally:
        #always shut Word down, even if a document fails
        word.Quit()


if __name__ == "__main__":
    main()
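The filename check in findLocalDocs is the one part of the script that can be tested without Word installed. Here is a minimal sketch of a slightly stricter version of that check, pulled out into its own function; the helper name is_word_doc and the sample filenames are made up for illustration.

```python
def is_word_doc(name):
    """Return True for .doc files, skipping names that start with "~"."""
    return not name.startswith("~") and name.lower().endswith(".doc")


# Hypothetical directory listing.
names = ["genesis.doc", "~$enesis.doc", "notes.txt", "EXODUS.DOC", "parser.docx"]
to_process = [n for n in names if is_word_doc(n)]
print(to_process)
```

Checking for the full ".doc" suffix (dot included, case ignored) avoids picking up .docx files or names that merely end in the letters "doc".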