The Voynich Ninja

Full Version: Bigram = phoneme theory (language agnostic)
I am motivated to post this idea in the spirit of Emma May Smith's approach of language- and content-agnostic analysis, but under the assumption that the text is linguistic and written in plain text.

I briefly brought up the outline of the idea in the devil's advocate for glossolalia post, but I realize it may have gotten lost in the 100,000 or so other words of text surrounding it. So I would like to put it forward separately and more clearly here:

At first glance one would think it absurd to propose that each Voynich character *bigram* could represent a single letter or phoneme in any language: surely with 15-25 characters, there must be many hundreds of such bigrams, and no language could have that many phonemes. But in fact upon closer inspection, so long as the bigrams are paired off naturally, and certain obvious non-bigrams are excluded (initial [q-], many final [-y]'s, initial [d-] in some cases, etc.), then it becomes apparent that there are not anywhere near hundreds of such bigrams that occur with any frequency beyond the rare or accidental appearance; rather there are only a couple dozen or so of them.

As a simple example, in the vord [otchy], clearly one must not consider the pairs [tc] and [hy] as bigrams! Obviously the bigrams are [ot] and [ch], and [-y] is a single character at the end. In this case it is obvious because we all know [ch] is one unit, not two separate letters or phonemes. Likewise with the notorious [sh], however many different forms of it may occur in the ms text: in any case, we know the [h] cannot be separated from the [s]!

In this spirit, I propose that the following inventory of bigrams constitutes a substantially large majority of the ms text:

[ch], [dy], [ai], [ok], [in], [ol], [ee], [sh], [da], [ey], [ot], [eo], [ar], [al], [or], [od], [yk], [sa], [yt], [os], [do], [so], [ky], [ty], [oy]

Naturally the apparent ligatures [cth] and [ckh] must be accounted for here as well.

As I noted above, certain obvious and frequent non-bigram single characters must be accounted for separately:
many [y]'s, many [d]'s, [q], many [s]'s, an occasional initial [k] and [t], and the odd extraneous [o], [i], or [e].

But I stress that these latter occasional or extraneous characters are very much the infrequent exception in the ms text, not the common rule. Likewise, it still remains to deal with [p] and [f], not to mention [m], [g], and a few others! But they will hardly affect the reading of the vast majority of the ms text.
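To make the pairing idea concrete, here is a minimal sketch of one way to split a vord into the bigrams listed above plus leftover single glyphs. The inventory is copied from the post; the greedy left-to-right rule is only an assumption standing in for "paired off naturally", and the function names `segment` and `coverage` are hypothetical. It ignores the gallows ligatures [cth] and [ckh].

```python
# Inventory copied from the post; the greedy pairing rule is an assumption.
BIGRAMS = {
    "ch", "dy", "ai", "ok", "in", "ol", "ee", "sh", "da", "ey", "ot", "eo", "ar",
    "al", "or", "od", "yk", "sa", "yt", "os", "do", "so", "ky", "ty", "oy",
}

def segment(vord, bigrams=BIGRAMS):
    """Greedy left-to-right split of an EVA vord into listed bigrams and single glyphs."""
    units = []
    i = 0
    while i < len(vord):
        pair = vord[i:i + 2]
        if len(pair) == 2 and pair in bigrams:
            units.append(pair)      # consume a listed bigram
            i += 2
        else:
            units.append(vord[i])   # leftover single glyph (e.g. initial q, final y)
            i += 1
    return units

def coverage(vords, bigrams=BIGRAMS):
    """Fraction of glyphs that end up inside a listed bigram."""
    total = sum(len(v) for v in vords)
    paired = sum(2 for v in vords for u in segment(v, bigrams) if len(u) == 2)
    return paired / total if total else 0.0

print(segment("otchy"))    # ['ot', 'ch', 'y']
print(segment("qokeedy"))  # ['q', 'ok', 'ee', 'dy']
```

Running `coverage` over a full transcription would be one way to check the "substantially large majority" claim. Note that a greedy rule can pair glyphs differently from a hand segmentation: it splits [daiin] as [da], [i], [in] rather than [d], [ai], [in].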

=====

Further, we can make even more sense of this bigram inventory as a phoneme inventory if we regard certain pairs of bigrams as *the initial and final forms* of the same phoneme. The variance in form of letters in initial and/or medial vs. final position is the absolute rule in the Arabic script, exists for a number of letters in Syriac and Hebrew, is known to many in the case of the Greek letter sigma, and existed until modern times in the English letter "s" (the funny-looking "f" without the bar occurring in initial/medial position).

So, for example, perhaps [ok] is an initial form and [ky] a final form of the same letter/phoneme. Likewise [ot] and [ty]. I note on Emma May Smith's blog the suggestion that [a] and [y] may be equivalent: perhaps then [da] and [dy] are the initial and final forms of a very frequent letter/phoneme? More speculatively, but perhaps usefully, might [ch] and [ey] be the initial and final forms of the same letter/phoneme? Further suggestions for the same phenomenon include the pair [sa] and [ar], and the pair [so] and [or].

With such an inventory, we have now perhaps accounted at least somewhat for the thorny issue of initial vs. final glyphs and sequences, and we still have a decent and reasonably sized inventory of distinct letters/phonemes by this method, not too large and not too small.

=====

I recall that somewhere on René Zandbergen's voynich.nu website, there is the observation that the 3rd character in each vord is much less predictable, and thus contains much more information, than either the 1st or 2nd character. If the text is indeed composed of bigrams, and the initial bigram/letter/phoneme in the language happens to be rather predictable (cf. the Hebrew article prefix h-), then it would indeed make sense that the variation, and with it the information content, would not appear until the 3rd character.
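One rough way to look at predictability by position is the Shannon entropy of the glyph found at each position across a vord list. The sketch below is a simplification and an assumption on my part: it measures unconditional entropy per position, which may not be the exact statistic used on voynich.nu, and the sample vords are made up purely for illustration.

```python
import math
from collections import Counter

def positional_entropy(vords, position):
    """Shannon entropy in bits of the glyph at a 0-based position.

    Only vords long enough to have a glyph at that position are counted.
    """
    counts = Counter(v[position] for v in vords if len(v) > position)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Made-up sample: the first glyph is fixed, later glyphs vary.
sample = ["okal", "otar", "okol", "otor"]
for pos in range(4):
    print(pos + 1, positional_entropy(sample, pos))
```

On a real transcription, the interesting comparison would be positions 1 and 2 against position 3, ideally after deciding whether an initial [q] counts as its own position.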

The bigram theory does introduce the problem of extremely short vords. This would be less of an issue in a Semitic abjad, in which vowels are not written. And we may also consider the idea that each vord may not be a complete word, but only a part of a word, however we may define that. 

It is just one theory, in any case. I hope some folks here may find it worth considering and discussing, if not accepting.

-Geoffrey Caveney
Hello Geoffrey,

I think that looking into bigrams more closely is an excellent idea.

You may find some inspiration here:

You are not allowed to view links. Register or Login to view.

I think it shows quite visually what you are trying to bring across.
It also shows that one has to be careful w.r.t. the transcription alphabet to use.
In particular, Eva is not the best choice for doing statistics.
I don't know WHERE I am going to find the time to clean it up and post it, but I have a paper on this.
(double post)
Geoffrey, I just glanced through my paper.

It needs to be double-checked for typos, and I have several charts that should probably be moved from the appendix to the main text (easier to read that way).

I will try to do it this weekend. It's directly relevant to your post.
Hello Geoffrey,

It should be noted that the frequencies of very common bigrams (not just EVA-ed, which differs between "languages" A and B) vary a lot from page to page. For example, f. 15v has the highest frequency of EVA-or, and f. 58rv has the highest frequency of EVA-al. On the other hand, some very common bigrams are missing or almost missing on some pages. For example, there is no EVA-dy on f. 5v, 6r, 19v, 25v, 35v, and no EVA-or on f. 26r.

If these very common bigrams stand for some cleartext or phoneme by themselves, to account for the high variability in frequency there may be homophones: other bigrams that play the same role. It should be possible (in principle) to identify them by finding an optimum partition of the common bigrams set (or any set of common patterns) that keeps the frequency of each group of bigrams (or patterns) as stable as possible over pages of large, relatively homogeneous portions of the VMs (one or several quires). I haven't tried it yet, maybe someone else has...
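A minimal sketch of the scoring half of this idea, under stated assumptions: it does not search for the optimum partition, it only measures how stable a candidate group of bigrams is across pages, using the coefficient of variation of the group's share of all adjacent glyph pairs. The function names and the tiny page data are hypothetical, not real folio counts.

```python
from statistics import mean, pstdev

def count_bigram(vords, bigram):
    """Count occurrences of a two-glyph sequence inside vords (overlaps allowed)."""
    return sum(1 for v in vords for i in range(len(v) - 1) if v[i:i + 2] == bigram)

def group_variability(pages, group):
    """Coefficient of variation, across pages, of the group's share of adjacent pairs.

    `pages` maps a page label to its list of vords. Lower values mean the
    combined frequency of the group is more stable from page to page.
    """
    shares = []
    for vords in pages.values():
        positions = sum(max(len(v) - 1, 0) for v in vords)
        if positions == 0:
            continue  # skip pages with no adjacent glyph pairs
        hits = sum(count_bigram(vords, b) for b in group)
        shares.append(hits / positions)
    m = mean(shares) if shares else 0.0
    return pstdev(shares) / m if m else float("inf")

# Made-up pages: "or" dominates one, "al" the other.
pages = {"p1": ["oror", "dal"], "p2": ["alal", "dor"]}
print(group_variability(pages, ["or"]))        # varies between pages
print(group_variability(pages, ["or", "al"]))  # stable when grouped
```

A search over partitions could then try to minimise the sum of these scores over all groups, but how to constrain that search (group sizes, which pages count as homogeneous) is left open here.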
Lately I have been trying to figure out why the bigram to reverse bigram ratio is so unbalanced, e.g. ~7000 EVA-dy and only ~200 EVA-yd.

Few common bigrams are relatively well balanced relative to their reverse: yt-ty, yk-ky, os-so.
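For anyone who wants to reproduce this kind of count, here is a minimal sketch. It assumes bigrams are counted as every adjacent glyph pair inside a vord (overlaps included, no pairs across spaces), which may differ from the counting behind the figures above; the function names and the three sample vords are only illustrative.

```python
from collections import Counter

def bigram_counts(vords):
    """Count every adjacent glyph pair inside each vord."""
    counts = Counter()
    for v in vords:
        for i in range(len(v) - 1):
            counts[v[i:i + 2]] += 1
    return counts

def reverse_ratio(counts, bigram):
    """Ratio of a bigram's count to the count of its reverse (inf if the reverse never occurs)."""
    reverse_count = counts[bigram[::-1]]
    return counts[bigram] / reverse_count if reverse_count else float("inf")

counts = bigram_counts(["chedy", "shedy", "ydar"])
print(reverse_ratio(counts, "dy"))  # 2.0 in this toy sample
```

Sorting all bigrams by this ratio on a full transcription would show which pairs are close to balanced and which are almost strictly one-directional.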
(06-03-2019, 12:32 PM)nablator Wrote: Lately I have been trying to figure out why the bigram to reverse bigram ratio is so unbalanced, e.g. ~7000 EVA-dy and only ~200 EVA-yd.

Few common bigrams are relatively well balanced relative to their reverse: yt-ty, yk-ky, os-so.

Do you mean from the perspective of natural language?

From the perspective of natural language, syllabic languages work like this. Look at Japanese (or Indic), for example. The "alphabet" is learned as syllables.

We learn A B C D E, they learn syllables like ma, mi, mu, me, mo, ba, bi, bu, be, bo, etc. Words are constructed out of those building blocks. You wouldn't reverse them. The general idea is Yo-ko-ha-ma (for Yokohama). Note how the vowel follows the consonant when it's broken into its component syllables. Ta-ka-ha-shi (Takahashi).


Older languages are more "pure" than modern languages in this respect. Loanwords, occupation by other nations with a different language, all these factors tend to alter the basic rules. Korean has a regular and logical structure that was slightly changed by Chinese occupation, but it is changing much more now, with the introduction of English computer terms. Instead of translating them into Korean roots, they simply change them to Korean pronunciation, thus watering down the logical structure of their base language (it all becomes memorization).
(06-03-2019, 01:33 PM)-JKP- Wrote: Do you mean from the perspective of natural language?

No, I have a nagging feeling that we are missing something very obvious and very artificial in the way Voynichese is constructed.

The rare (sometimes extremely rare) reverse bigrams are interesting. In which context they occur, how they relate to the positional nature of glyphs in words/tokens, all very interesting. I have no answer yet but several ideas that need to be further investigated.
(06-03-2019, 01:47 PM)nablator Wrote:
(06-03-2019, 01:33 PM)-JKP- Wrote: Do you mean from the perspective of natural language?

No, I have a nagging feeling that we are missing something very obvious and very artificial in the way Voynichese is constructed.

The rare (sometimes extremely rare) reverse bigrams are interesting. In which context they occur, how they relate to the positional nature of glyphs in words/tokens, all very interesting. I have no answer yet but several ideas that need to be further investigated.

Well, that's why I said from the perspective of natural language.

I think it's unlikely that it's natural language, but if it were, then a syllabic language is one kind of language that is constructed that way (the syllables are directional, they go one way but not the other).


Yes, the bigrams are directional. I've been kinda trying to say that for a long time. It's part of what I mean when I say the glyphs are positional. I've blogged about the fact that certain glyphs are followed by __, but not preceded by __.


It's also why I've frequently mentioned numbers (including Roman numerals) as examples of one kind of thing that is positional. You don't know if it's 35 or 53 unless you have "rules" for which one comes first. Same with Roman numerals... VII is different from IIV. A symbolic or synthetic language could also be positional.


There are two ways Voynichese is positional... the groupings of glyph combinations, and where they can occur in a token.