pne: A picture of a plush toy, halfway between a duck and a platypus, with a green body and a yellow bill and feet. (Default)
[personal profile] pne

I thought I had this great idea for counting letter frequencies in Klingon.

You see, I thought that in order to count Klingon letters, I could ignore the multi-letter graphemes (is that the right word?) such as "tlh" and "gh" and simply count letter by (ASCII) letter and compensate afterwards.

My theory was that there are only three multi-letter graphemes in the traditional Latin-based orthography: ch, gh, tlh. Also, every "c" occurs only as part of "ch" and every "g" only as part of "gh". Hence, if you subtract one "h" for every "c" and "g" you've seen, any "h"s left will have to come from "tlh". Then you can subtract that many "t"s and "l"s so that the proper count for those letters can be found.
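A minimal sketch of that bookkeeping in Python, under the same assumption the paragraph makes (that ch, gh and tlh are the only multi-letter graphemes). The function name `compensate` and the sample strings are made up for illustration.

```python
from collections import Counter

def compensate(text):
    """Derive grapheme counts from plain ASCII letter counts.

    Assumption (from the paragraph above): ch, gh and tlh are the only
    multi-letter graphemes, so every c and g brings one h with it.
    """
    counts = Counter(text)
    ch = counts.pop("c", 0)                # every "c" is part of "ch"
    gh = counts.pop("g", 0)                # every "g" is part of "gh"
    tlh = counts.pop("h", 0) - ch - gh     # leftover "h"s come from "tlh"
    counts["t"] -= tlh                     # those "t"s and "l"s were not standalone
    counts["l"] -= tlh
    counts.update({"ch": ch, "gh": gh, "tlh": tlh})
    return +counts                         # unary plus drops zero entries

print(compensate("batlh"))   # one each of b, a, tlh
print(compensate("chegh"))   # one each of ch, e, gh
```

Capital H is a separate key in the counter, so it never interferes with the lowercase h arithmetic.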

The weakness of this plan manifested itself when I had another look at the alphabet and saw "ng". That means that I cannot tell from the number of "g"s in the input data how many "gh"s and how many "ng"s there were. Also, I can't compensate by counting "n"s because that letter can also occur alone. Bummer.
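To make the problem concrete, here is a small Python check with two artificial grapheme sequences (not real words, just chosen for illustration). They produce identical ASCII letter counts even though the graphemes differ, so no amount of after-the-fact arithmetic on letter counts can tell them apart.

```python
from collections import Counter

# Two made-up grapheme sequences, not actual Klingon words.
seq_a = ["gh", "n", "t", "l"]   # four graphemes
seq_b = ["ng", "tlh"]           # two graphemes

# What a letter-by-letter counter would see for each sequence.
ascii_a = Counter("".join(seq_a))
ascii_b = Counter("".join(seq_b))

print(ascii_a == ascii_b)                  # True: same ASCII letter counts
print(Counter(seq_a) == Counter(seq_b))    # False: different grapheme counts
```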

So it looks as if the only correct way is to take the orthography into account and include a switch "if the letter is c, g, n, or t, look ahead to see whether more letters follow that contribute to the current grapheme". (Or, alternatively, use a regular expression that explicitly lists the multi-letter graphemes first, something like (tlh|ch|gh|ng|['abDeHIjlmnopqQrStuvwy]).)
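A rough sketch of the regular-expression variant in Python. The names `GRAPHEME` and `count_graphemes` are invented here, and the single-letter class uses capital I, as the standard romanization does. The sketch takes the alternation as written and does nothing special for an n directly followed by gh.

```python
import re
from collections import Counter

# Multi-letter graphemes are listed first so they win over single letters.
GRAPHEME = re.compile(r"tlh|ch|gh|ng|['abDeHIjlmnopqQrStuvwy]")

def count_graphemes(text):
    """Count graphemes; characters that match nothing (spaces etc.) are skipped."""
    return Counter(GRAPHEME.findall(text))

print(GRAPHEME.findall("tlhIngan"))   # ['tlh', 'I', 'ng', 'a', 'n']
print(count_graphemes("tlhIngan Hol"))
```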

Date: Monday, 15 September 2003 12:32 (UTC)
From: [identity profile] nik-w.livejournal.com
My dear Mr. Newton, you have far too much free time on your hands, but I applaud your use of it just the same!

Date: Monday, 15 September 2003 14:23 (UTC)
From: [identity profile] n-true.livejournal.com
Can't you kind of replace those tlh, gh, ng, ch and probably even rgh? Hm, maybe that doesn't help.

But what if a word is *Qunghom? The program might think that the ng belongs together, mightn't it?

segmenting

Date: Monday, 15 September 2003 21:35 (UTC)
From: [identity profile] pne.livejournal.com
Well, if you replace those combinations, you're already analysing the stream of characters at a higher level than letter-by-letter, so you're not gaining much.

As for such words—good point. My first thought was that they are ambiguous in the same sense as Esperanto flughaveno when using the h-convention (it's flug-haveno rather than fluĝaveno). However, Klingon probably is unambiguous if you look far enough ahead because g as the start of a grapheme can only be followed by h and can never stand alone. Therefore, if a g is followed by anything other than h, the two letters must belong to separate graphemes (and then the g must be preceded by an n, since that is the only other place it may occur).

So monghom must be segmented m·o·n·gh·o·m "capital (city) group"; for the n and g to belong together, the next letter must be different, say, H rather than h: mongHom is unambiguously m·o·ng·H·o·m "neck bone".

So any program that wants to segment Klingon words into sounds must look at least two letters ahead, I suppose—at least, if the current letter is n. Thanks for pointing that out to me.
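As a sketch of what that lookahead could look like, here is a small hand-written scanner in Python. The names `segment` and `SINGLE` are hypothetical; the only rule beyond the plain grapheme list is that "ng" counts as one grapheme unless the letter after it is a lowercase h.

```python
# Single-letter graphemes (capital I, as in the standard romanization).
SINGLE = set("'abDeHIjlmnopqQrStuvwy")

def segment(word):
    """Split a word in Latin-based Klingon orthography into graphemes."""
    out = []
    i = 0
    while i < len(word):
        if word.startswith("tlh", i):
            size = 3
        elif word.startswith(("ch", "gh"), i):
            size = 2
        elif word.startswith("ng", i) and not word.startswith("h", i + 2):
            # Two letters of lookahead: "ngh" can only be n + gh,
            # because a g that starts a grapheme must be followed by h.
            size = 2
        elif word[i] in SINGLE:
            size = 1
        else:
            raise ValueError(f"unexpected character {word[i]!r} at position {i}")
        out.append(word[i:i + size])
        i += size
    return out

print(segment("monghom"))   # ['m', 'o', 'n', 'gh', 'o', 'm']
print(segment("mongHom"))   # ['m', 'o', 'ng', 'H', 'o', 'm']
```

The same effect could probably be had in a regular expression with a negative lookahead on the ng alternative, but the explicit scanner makes the two-letter lookahead visible.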

Re: segmenting

Date: Monday, 15 September 2003 23:52 (UTC)
From: [identity profile] pne.livejournal.com
(And mongghom must be m·o·ng·gh·o·m "neck group".)

Profile

Philip Newton
