Converting Grksm to Unicode

An update on the Greek lexicon project that I am working on (hereafter arbitrarily renamed Project Gibson).

The Greek lexicon that I am analyzing consists of a series of Word documents, each representing a specific chapter in the lexicon (for the most part, one document holds all the words starting with Alpha, another those starting with Beta, and so on).

The text itself is written in a combination of Greek and Chinese. The Chinese portions use a font that follows the Unicode standard; the Greek portions do not. This is going to be a problem. Before I can even begin to parse the text, I need to make sure that the entire document follows the same encoding scheme and is not font-specific.

But before I go on, I should briefly explain what exactly the problem is. If you already understand this, you can jump down to the end.

Unicode Fonts vs Symbol Fonts

During the early days of computing, ASCII was the only widely supported character encoding system. If you wanted to type in a language other than English, often your only option was to use special fonts called symbol fonts (think Wingdings). These fonts completely change the way typical English characters are drawn, so that they resemble foreign characters in your text editor.

In the Word documents that I’m looking at, the Greek was written using a proprietary symbol font called Grksm. This font is so ancient that a Google search turns up very little information about it. All I know is that it was copyrighted by a Donald Reiher in 1996, and that it is used by at least one New Testament scholar.

At first glance, Grksm appears to allow the user to actually type in Greek. For example, typing “a” or “b” on the keyboard will show their Greek equivalents in Word: α and β. However, upon closer inspection, you’ll find that the underlying character encoding is unchanged; the text editor still sees the character as 0x61 (the encoding for the English character “a”). So although you see the Greek letter alpha, it’s really just the letter “a” to the machine.

As you can see, using these special fonts can be quite a nuisance. First, if you want to type in a mix of more than one language, you’ll have to constantly switch between fonts. Second, if you want to pass your text file on to someone else, you’ll have to give the recipient a copy of the font that you used as well.

This is where the Unicode standard comes in. With Unicode, every character, whether from English or any other writing system, is given its own unique code point. That way, the user can use the same font and type in various languages at the same time. Furthermore, because characters from various languages are given their own ranges in the Unicode encoding scheme, it’s a lot easier to distinguish between different writing systems at the machine level.
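To make the point about ranges concrete, here is a minimal Python sketch (not part of the project itself) that labels a character by the Unicode block its code point falls in. The block boundaries are the published ones for Greek and Coptic, Greek Extended, and CJK Unified Ideographs; the function name `script_of` and the coarse labels are just illustrative choices.

```python
# Block boundaries from the Unicode code charts.
GREEK = range(0x0370, 0x0400)           # Greek and Coptic
GREEK_EXTENDED = range(0x1F00, 0x2000)  # polytonic Greek
CJK = range(0x4E00, 0xA000)             # CJK Unified Ideographs


def script_of(ch: str) -> str:
    """Return a coarse script label for a single character."""
    cp = ord(ch)
    if cp in GREEK or cp in GREEK_EXTENDED:
        return "greek"
    if cp in CJK:
        return "cjk"
    if cp < 0x80:
        return "ascii"
    return "other"


# The Latin letter, the Greek letter, and a Chinese character each live
# in a different range, no matter which font is used to draw them.
for ch in "aα神":
    print(f"{ch!r} U+{ord(ch):04X} {script_of(ch)}")
```

A character typed in a symbol font would still come out as "ascii" here, which is exactly why such text cannot be told apart from English at the machine level.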

Now, since half of the documents are written using the Grksm scheme, I’m going to have to figure out a way to convert the Grksm Greek characters to their Unicode equivalents. And that…is going to be a problem.
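At its core, a conversion like this is a lookup table from the character the machine sees to the Greek character the font displays. The Python sketch below is only an illustration under that assumption: the table name `GRKSM_TO_UNICODE` and the function `grksm_to_unicode` are hypothetical, and the only entries included are the two pairs mentioned above ("a" for α, "b" for β). A complete table would have to be built by inspecting every glyph in the font.

```python
# Hypothetical, deliberately incomplete table: only the "a" and "b"
# pairs are known from the text above.
GRKSM_TO_UNICODE = {
    "a": "\u03b1",  # GREEK SMALL LETTER ALPHA
    "b": "\u03b2",  # GREEK SMALL LETTER BETA
}

# str.maketrans turns the dict into a table usable by str.translate.
_TABLE = str.maketrans(GRKSM_TO_UNICODE)


def grksm_to_unicode(text: str) -> str:
    """Translate Grksm-encoded text; unmapped characters pass through."""
    return text.translate(_TABLE)


print(grksm_to_unicode("ab"))
```

Letting unmapped characters pass through unchanged makes gaps in the table easy to spot in the output, since they show up as stray Latin letters among the Greek.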

The Solution

As some of you know, I’ve been playing around with controlling MS Word through Python for a while now. Well, I eventually gave up on that. There are a lot of reasons behind it, but in the end I decided that Python might not offer the low level of control that I need to do this work. That, or it’s just an excuse for my weak Python-fu.

I poked around with Java for a while, but the (free) libraries that let you talk with Windows COM modules simply sucked.  There was a lot of wrapping of Objects that I simply did not want to deal with.

In the end, I settled on working with C# using Visual Studio 2008. C# is very similar to Java, though it definitely has its own share of quirks.

On top of its syntactic similarity with Java, C# also offers a direct connection to the COM modules required to communicate with Word. After a week or so of stumbling and poking around, I was finally able to whip up a program that would access Word documents, convert all Grksm symbols into their Unicode equivalents, and save the new Word file in a new folder.
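The one subtlety in a conversion like this is that only text set in the Grksm font should be touched; an “a” anywhere else really is an “a”. The Python sketch below illustrates that idea on a toy model, not on Word itself and not as the C# program's actual logic: the `Run` class is a stand-in for a stretch of text sharing one font, the two-entry table is hypothetical, and the replacement font name and the "SimSun" font in the example are assumptions chosen only for the demonstration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text sharing one font (a stand-in, not Word's object model)."""
    font: str
    text: str


# Hypothetical two-entry table; a real one would cover the whole font.
GRKSM_TABLE = str.maketrans({"a": "\u03b1", "b": "\u03b2"})


def convert_runs(runs, unicode_font="Times New Roman"):
    """Return new runs with Grksm text translated and its font replaced."""
    converted = []
    for run in runs:
        if run.font.lower() == "grksm":
            # Translate the characters and switch to a Unicode-aware font.
            converted.append(Run(unicode_font, run.text.translate(GRKSM_TABLE)))
        else:
            # Chinese and any other text is left untouched.
            converted.append(run)
    return converted


doc = [Run("Grksm", "ab"), Run("SimSun", "a神")]
for run in convert_runs(doc):
    print(run.font, run.text)
```

Switching the font at the same time as the characters matters: leaving converted text in the symbol font would display the wrong glyphs or none at all.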

I’ve already written an essay and a half here, so I’ll save the technical details for my next post.

One final thought: I was a bit surprised to find out that this tool might turn out to be incredibly useful for other biblical scholars. The Grksm font is widely used for writing academic papers, so a Grksm to Unicode converter could save other people a lot of work. I have no idea who might use it, but I think it’d be prudent to share the code with those in need, so I’ll post it up somewhere and see who bites.
