Grksm To Unicode
First, some background and other related blog posts:
Editing MS Word documents through Python
An explanation of the problem: Grksm2Unicode
Grksm 2 Unicode
Setup
How It’s Supposed To Work
Technical Details – Dealing with Word
Dealing with Word: Setup
Dealing with Word: Launch!
Dealing with Word: Read it!
Technical Details – Text Conversion
Text Conversion: Generating the Mappings
Text Conversion: Converting the Text
Testing It
Usage
Download
Setup
I used Visual Studio 2008 as my IDE for this project. You’ll also need to have Microsoft Word installed, or else this will not work.
How It’s Supposed to Work
Technical Details – Dealing with Word
Dealing with Word: Setup
First, you’ll need to add the MS Word and MS Office object libraries to your project so that the compiler knows what you’re talking about. The Microsoft support documents (found here) were actually very helpful for this part, so I’ll let them do the explaining. Just follow step 3 and you should be good to go. Now that the libraries are referenced, you’ll need to add a “using” directive (kind of like Java’s “import” statement) at the top of your code:
    using Word = Microsoft.Office.Interop.Word;
This creates an alias called “Word” for the namespace “Microsoft.Office.Interop.Word”. So now you’ll only have to type “Word.Application” instead of the fully qualified name (which would be “Microsoft.Office.Interop.Word.Application”).
Dealing with Word: Launch!
Let’s connect to Word and manipulate some documents.
    object missing = System.Type.Missing;
    object filename = @"C:\Path\To\My\Word.doc";
    Word._Application oWord = new Word.Application();
    oWord.Visible = true;
    oWord.Options.CheckGrammarAsYouType = false;
    oWord.Options.CheckSpellingAsYouType = false;
    Word._Document oDoc = oWord.Documents.Open(ref filename,
        ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing);
The basic model for dealing with Word applications can be seen from this nifty model here. At the topmost level we have the Word application itself. Inside each Application we have some Documents (more than one doc may be open at a time within the same instance of a Word application), and inside each Document we have a lot of junk that describes the contents of each document, like Paragraphs, Range and Selection. The code above will open up a new instance of Word and load up the file specified by the object filename (which is really just a String). A few interesting points here:
- object missing is a reference to the special value Type.Missing. It is passed in for optional arguments that you don’t want to specify, and it tells Word to use its default for that parameter. The interop methods take every argument by reference, which is why it has to live in a variable.
- object filename uses the very nifty “@” prefix. In C#, putting an @ character in front of a string literal makes the compiler interpret the string very literally. Among other things, it will ignore the escape character “\”, so now I don’t have to type "C:\\why\\do\\i\\have\\to\\do\\this?". One of the other cool things about this feature is that line breaks are kept as-is, so you don’t need to insert ‘\n’ if you type out Strings that go across multiple lines.
- Word._Application oWord = new Word.Application(): this line actually tells the OS to run MS Word.
- oWord.Visible = true is purely for show. This makes the actual application window visible. It’s great for debugging purposes and for sanity checks. Once you’re ready to deploy, I’d suggest that you set it to false for the sake of speed. If you insist on leaving it on, then I suggest you add the line oWord.ScreenUpdating = false. This stops Word from redrawing the window in real time. Seriously, once it’s working, just turn it off and print messages out to Console. It’ll shave off a lot of time.
Dealing with Word: Read it!
Now, in order to read/change the text WITHOUT destroying the structure of the document, we’ll have to go through the Document’s Paragraph objects.
If you do a bit of research on the interwebz, you’ll find that most sites tell you to do something like this:
    object missing = System.Type.Missing;
    foreach (Word.Paragraph p in oDoc.Paragraphs)
    {
        p.Range.Font.Name = "Arial";
        p.Range.Text = convertSnippet(p.Range.Text);
    }
In theory, this should convert all of the text within each Paragraph’s body to the Arial font, and the text itself should change to whatever was returned by the convertSnippet method call.
Well it doesn’t work.
This only works if you do ANYTHING other than set the value of p.Range.Text. If you only set the Font, then this works perfectly. If you want to set the text, well, Word freaks out and mysteriously stays at the same paragraph for ever and ever.
So what’s going on? Why aren’t we able to advance through Paragraphs once we start modifying the text? Well, after a very long and tedious debugging session, I found out that the pointer to the Paragraph automatically advances to the next Paragraph when you set the Text. Once you assign p.Range.Text to a new value, the existing Paragraph p will now point to the next Paragraph in the document. That’s why the foreach loop was acting screwy. In fact, you have to do this to make it work:
    object missing = System.Type.Missing;
    int maxPara = oDoc.Paragraphs.Count;
    int j = 0;
    Word.Paragraph p = oDoc.Paragraphs.First;
    while (j < maxPara)
    {
        if (p.Range.Font.Name.Equals("Grksm") || p.Range.Font.Name.Equals(""))
        {
            p.Range.Font.Name = "Arial";
            // Setting Text moves p on to the next Paragraph, so no explicit advance here.
            p.Range.Text = convertSnippet(p.Range.Text);
        }
        else
        {
            // This Paragraph was left untouched, so advance manually.
            p = p.Next(ref missing);
        }
        j++;
    }
Technical Details – Text Conversion
Text Conversion: Generating the Mappings
For each of the possible characters in the Grksm font range, I had to figure out its equivalent Unicode code point. First, let’s fire up Windows’ Character Map program.
Looking at the Character Map, we can see each character’s assigned character code. Now, all we have to do is look up the same character’s Unicode value and create a dictionary from which we can perform our text conversion. The Unicode charts for Greek and Coptic, Greek Extended and Combining Diacritical Marks were immensely useful for this.
The easiest characters to convert are the upper- and lowercase letters Alpha through Zeta. The accented characters are a bit trickier to convert because of the way Grksm represents various accents.
Using Grksm, the character ἄ is represented as α + ῎
However, in Unicode, the same character ἄ can be represented as either a single code ἄ or as a combination of codes: α + ̓ + ́
The problem here is twofold. First, we need to handle the Grksm combining “accent” characters that actually map into two separate characters in the Unicode world. Second, we need to settle on one Unicode representation of the characters: either we fully combine our characters or we don’t. We cannot have a mixed bag of each.
Let’s deal with the second problem first. As it turns out, this problem can be solved with a simple API call. Strings in C# have a Normalize() method (Java has had the equivalent java.text.Normalizer class since Java 6) which will automagically compose/decompose Unicode strings into either their short (composed) or long (decomposed) forms. That way, all strings stored and received will have the same underlying representation.
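To make the composed/decomposed distinction concrete, here is a minimal sketch using the ἄ example from above. The class and method names (NormalizeDemo, Compose, Decompose, CodePoints) are made up for illustration and are not taken from the converter's source; only String.Normalize and NormalizationForm are the real .NET API.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Compose a base letter plus combining marks into the short form (NFC).
    public static string Compose(string text)
    {
        return text.Normalize(NormalizationForm.FormC);
    }

    // Split precomposed characters into base letter plus combining marks (NFD).
    public static string Decompose(string text)
    {
        return text.Normalize(NormalizationForm.FormD);
    }

    // Render each UTF-16 code unit as U+XXXX so the result is readable in a console.
    public static string CodePoints(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // alpha + combining comma above (psili) + combining acute (oxia)
        string decomposed = "\u03B1\u0313\u0301";

        Console.WriteLine(CodePoints(Compose(decomposed)));  // U+1F04
        Console.WriteLine(CodePoints(Decompose("\u1F04")));  // U+03B1 U+0313 U+0301
    }
}
```

Whichever form you pick, the point is to apply it consistently, so that two strings that look the same also compare as equal.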
With regards to the first problem, I simply chose to ignore it for now. If a character maps into two Unicode characters, I simply noted what those two characters were and decided to handle that case later in the actual conversion. For now, all we need is the x -> y mapping of each Grksm character.
In the end, the text file which contained the mapping looked like this:
Character codes are always : delimited. The first term is always the Grksm source code, and following that there is an arbitrary number of associated Unicode character codes.
Text Conversion: Converting the Text
The character map is read in as a plain text file. The mapping is stored in a Dictionary where the key is the Grksm character code (in hex), and the value is a String array that contains all of the Unicode character codes which make up the associated Grksm character (also in hex).
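As a rough sketch of that loading step, the code below parses colon-delimited lines into a Dictionary. It is not the converter's actual source: the class name MappingLoader, its methods, and the sample entries in Main are all hypothetical, and it assumes one mapping per line with hex codes written without a prefix.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MappingLoader
{
    // Parse one line such as "61:03B1" into a key and its target codes.
    // Returns false for blank lines or lines with no target code.
    public static bool TryParseLine(string line, out string key, out string[] values)
    {
        key = null;
        values = null;
        if (line == null) return false;

        string[] parts = line.Trim().Split(new char[] { ':' },
                                           StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 2) return false;

        // Upper-case the hex so "03b1" and "03B1" end up identical.
        key = parts[0].Trim().ToUpperInvariant();
        values = new string[parts.Length - 1];
        for (int i = 1; i < parts.Length; i++)
        {
            values[i - 1] = parts[i].Trim().ToUpperInvariant();
        }
        return true;
    }

    // Read every line into a Grksm-code -> Unicode-codes dictionary.
    public static Dictionary<string, string[]> Load(TextReader reader)
    {
        Dictionary<string, string[]> map = new Dictionary<string, string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string key;
            string[] values;
            if (TryParseLine(line, out key, out values))
            {
                map[key] = values; // a later duplicate key overwrites an earlier one
            }
        }
        return map;
    }

    public static void Main()
    {
        // Made-up sample entries, only to show the shape of the data.
        string sample = "61:03B1\n2F:0313:0301\n";
        Dictionary<string, string[]> map = Load(new StringReader(sample));
        foreach (KeyValuePair<string, string[]> entry in map)
        {
            Console.WriteLine(entry.Key + " -> " + string.Join(" + ", entry.Value));
        }
    }
}
```

Taking a TextReader rather than a file path keeps the parser easy to exercise from a string; for a real file you would pass in a StreamReader.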
Finally, the text in each Paragraph of the Word document is converted to Unicode.
The trickiest part is handling the character replacement. In the case where 1 Grksm -> 1 Unicode, this is trivial. But what do we do for cases where 1 Grksm -> 2 + Unicode? It would be inefficient to simply expand the size of the Char[] every time.
In order to address this problem, I added the translated characters to an ArrayList instead of replacing the characters directly in the Char[].
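The idea of appending to a growable list can be sketched as follows. This is an illustration under assumptions, not the original code: it uses the generic List<char> in place of ArrayList, the names SnippetConverter and ConvertText are invented, dictionary keys are assumed to be upper-case hex with at least two digits, and every target code is assumed to fit in a single char.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SnippetConverter
{
    // Translate each source character through the map, appending one or
    // more output characters per input character.
    public static string ConvertText(string source, IDictionary<string, string[]> map)
    {
        List<char> output = new List<char>(source.Length);
        foreach (char c in source)
        {
            // Assumed key format: upper-case hex, e.g. 'a' -> "61".
            string key = ((int)c).ToString("X2");
            string[] targets;
            if (map.TryGetValue(key, out targets))
            {
                foreach (string hex in targets)
                {
                    int code = int.Parse(hex, NumberStyles.HexNumber,
                                         CultureInfo.InvariantCulture);
                    output.Add((char)code); // assumes the code fits in one char
                }
            }
            else
            {
                output.Add(c); // no mapping: keep the character as-is
            }
        }
        return new string(output.ToArray());
    }

    public static void Main()
    {
        // Made-up mapping entries for demonstration only.
        Dictionary<string, string[]> map = new Dictionary<string, string[]>();
        map["61"] = new string[] { "03B1" };
        map["2F"] = new string[] { "0313", "0301" };

        string converted = ConvertText("a/", map);
        Console.WriteLine(converted.Length); // 3
    }
}
```

Because the list grows on its own, a one-to-two mapping costs nothing more than an extra Add call, and the final string is built once at the end.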
One final kink: the Unicode rendering was not very happy with characters that were constructed with two or more combining characters. The two combining characters would combine into a single character (usually a character with two accents or modifiers), but that resulting character would not combine with the original A-Z characters.
In order to address this, I made a call to Normalize() using the argument NormalizationForm.FormC on a String constructed with the current character and the previous character in the ArrayList. This ensures that the combining characters are applied to the “root” character instead of another combining character. It’s a wee bit complicated/confusing, but you can take a look at the source code to figure out what I mean.
Testing It
As with all things, we must test our code to make sure it works! I was sick of C# by this point, so I wrote a quick Python script that would write out the characters A-Z, a-z and apply all the Grksm combining marks into a plain text document. I then opened that document in Word, applied the Grksm font to it and saved it again as a doc file.
When you run the converter on it, you’ll notice that not all of the characters are rendered exactly as they appear when Grksm is used. I think that’s OK, because for those characters, that particular accent and character combination would not be used in the actual language. I could be wrong, and I would like to hear from you if I am.
Usage
I haven’t been able to test it on other configurations yet, but I know for sure that it works on Windows Vista + Office 2007 (that’s my current configuration).
The program itself looks for files that follow the pattern *.doc*. If your version of Word opens docx, then by all means, this program will support docx as well.
The tool itself is to be used from the command line. The app takes up to 3 arguments:
Argument 1: Path to the directory containing the Word files that you want to convert.
Argument 2 (Optional): Path to the directory where you want to place your converted files. If this argument is not given then all files will be placed in a new folder called “modified” in your current working directory.
Argument 3 (Optional): The path to the mapping file. In theory, this program could be used to convert characters of any font to any other font. You just need to provide the proper mapping for it.
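For readers who want to see how those three arguments and the *.doc* pattern might fit together, here is a small sketch. It is not the tool's actual code: ConverterOptions, its members, and the default mapping file name "mapping.txt" are hypothetical, since the default for Argument 3 is not stated here.

```csharp
using System;
using System.IO;

public sealed class ConverterOptions
{
    public string InputDir;
    public string OutputDir;
    public string MappingFile;

    // Hypothetical default; the real tool's default mapping path is not known.
    public const string DefaultMappingFile = "mapping.txt";

    // Turn the raw argument array into options, filling in defaults.
    public static ConverterOptions Parse(string[] args, string workingDir)
    {
        if (args == null || args.Length < 1 || args.Length > 3)
        {
            throw new ArgumentException(
                "Expected 1 to 3 arguments: <inputDir> [outputDir] [mappingFile]");
        }

        ConverterOptions o = new ConverterOptions();
        o.InputDir = args[0];
        // Argument 2 falls back to a "modified" folder in the working directory.
        o.OutputDir = args.Length >= 2 ? args[1] : Path.Combine(workingDir, "modified");
        o.MappingFile = args.Length >= 3 ? args[2] : DefaultMappingFile;
        return o;
    }

    // List the Word files in the input directory (.doc, .docx, and so on).
    public string[] FindDocuments()
    {
        return Directory.GetFiles(InputDir, "*.doc*");
    }

    public static void Main(string[] args)
    {
        ConverterOptions o = Parse(args, Directory.GetCurrentDirectory());
        Console.WriteLine("Input:   " + o.InputDir);
        Console.WriteLine("Output:  " + o.OutputDir);
        Console.WriteLine("Mapping: " + o.MappingFile);
    }
}
```

Passing the working directory in as a parameter, rather than reading it inside Parse, keeps the default-path logic easy to check without touching the file system.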
Download
Binary Executable
Source Files
All of the downloadable code and binaries are released under the BSD License.