The Voynich Ninja

A possible way to break down Voynich text
This is something that I've known for a while now, and I have kept it secret, but I think it's time to reveal it to the community. I also posted it to Stephen Bax's website and the VMS mailing list:

I have located about 26 patterns that the text breaks down into. Almost all of the Voynich text (170,000+ characters) may be formed from these 26 units. If there is no other way to explain why the manuscript does this, then this may be a clue to the solution of the manuscript.

Here are the patterns:

[Image: attachment.php?aid=525]

And here is an example using the patterns from the May Zodiac:

[Image: attachment.php?aid=526]

And here is an example from a random page, folio 9v:

[Image: attachment.php?aid=527]

Don't take my word for it: print off the "Units of the Voynich script" chart and try to break down the text into the 26 (or 27) patterns that I've described. I find that they work 79 out of 80 times (maybe the scribe made a spelling error in the exceptions!)

I believe that each unit may substitute for a Latin script letter. For example, qo = a, ar = b, y = (u)s, etc. We all know that these patterns keep re-appearing in the Voynich text, and they seem to have meanings independent of their words, or their location in words.

For example, we have daiin, but also qodaiin, chodaiin, qochodaiin, qoar, and qoaiin. It looks like these words are just composed of smaller building blocks: qo, d, aiin, ar, ch, etc.


So why am I posting this now?

1) I would like your feedback. Can these patterns be explained another way, or by other methods? Or could they really be the "alphabet" of the Manuscript?

2) If this is the key, then I'll need the community's help. Since I still don't know what language to look for, I could easily miss phonetic patterns in languages that I'm not familiar with. If the manuscript is Italian or Hebrew, for example, I might never recognize it. I have tried some Latin and German substitution and got some grammatical Latin phrases, but they may be false positives.

Also, the Units chart may not be 100% correct. There may need to be changes (I'm not 100% certain about the last patterns).
But if this is the key, let the games begin!

Thomas Coon
Savannah, GA
9/3/16
I see what you are seeing, but wouldn't that make the Voynich words even smaller?
Well, I actually think that the spaces in text are completely fake. I think that he wrote large words (ex. "vocabulary") into two words: ("vocab-ulary").

Rene Z. posted such an example on his site:
"In addition, some words that appear standing alone can also appear connected together. For example beside chol daiin also choldaiin occurs."

Actually, looking at the different forms of daiin in text proves my point:
daiin = d-aiin
qodaiin = qo-d-aiin
chodaiin = ch-od-aiin
qochodaiin = qo-ch-od-aiin
choldaiin = ch-ol-d-aiin
And there are a hundred other examples.
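
The breakdowns listed above can be reproduced mechanically. Here is a minimal Python sketch, assuming a partial unit list made up only of units named in this thread (the full chart lives in the attached image and is not reproduced here). `UNITS` and `segment` are illustrative names, and "longest unit first" is an assumed tie-breaking rule, not something the chart specifies.

```python
from functools import lru_cache

# Partial, assumed unit list: only units mentioned in the thread text.
UNITS = ("aiin", "qo", "ch", "od", "ol", "ar", "or", "d", "y")


def segment(word, units=UNITS):
    """Return one split of `word` into units, or None if no split exists."""
    # Try longer units first; back off to shorter ones when a branch dead-ends.
    ordered = sorted(units, key=len, reverse=True)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return ()
        for unit in ordered:
            if word.startswith(unit, start):
                rest = solve(start + len(unit))
                if rest is not None:
                    return (unit,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


for w in ["daiin", "qodaiin", "chodaiin", "qochodaiin", "choldaiin"]:
    print(w, "=", "-".join(segment(w)))
```

With this partial list the five daiin forms come out exactly as written above; a word containing any glyph outside the list returns `None`.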
Hello Thomas,

I just saw your message on the mailing list. This is a good analysis. While I applaud your effort, I have a few responses.
  • Your approach is interesting. Breaking the text into such bigram units is rather simple and elegant. However, what was the basis for approaching the text this way?
  • Have you done any automated tests to determine how much of the corpus fits this pattern? You claim that "Almost all of the Voynich text" fits. How much is "almost all"?
  • "If there is no other way to explain why the manuscript does this" - There have been many attempts to break down the text's predictable patterns into neat systems, and they do explain your patterns, and fit the text very well. Unfortunately, most are lost to the depths of the 90s internet so I don't have the links on hand and most of them don't work anymore anyway. Pelling mentions a few of them, and features a similar unit-based system. Pelling had another good post about how to evaluate how effective the word formation theories are, but I can't find it.
  • How did you derive these particular bigrams? Why 26? Did you start from that number?
  • A lot of the text may fit because of some vagueness of the units. It might sound good that only 26 units make up so much text, but units 6, 9, 10, 15, 16, 24 and 26 have two possibilities each, so there are really 33 units. Then when considering your claim that the units can be reversed, there are over 60! Two of the units are only one letter long, which is kind of "cheating" when it comes to getting a good fit.
  • I think you may be overstating the significance of this system. Even if this pattern holds, how can you be certain that this is the pattern (or "the key") behind the Voynich Manuscript text, rather than a corollary of some other principles? For example, you have the units od, ot, op and ok. What if the "real" (i.e. deliberate) underlying system does not rely on bigram units at all, and one of its rules is that "o" attaches to tall-looking glyphs, and your common od/ot/op/ok bigrams are simply a side-effect of this? As I mentioned, there are many competing theories about word formation systems. For each proposal, there is the golden question: Does this system merely describe the features of the text or does it really explain them? Your theory is also subject to this, and you need to answer it before thinking that it could be "the key". It is not good enough to say that the theory has only a few simple rules yet all of the manuscript's text fits. Each of the theories can claim the exact same thing. Why would your theory have a better claim than the others?
  • If the only idea is that the text is made from your units, we would expect to frequently find words like amameeqo or chchchch, but we don't. There is more to the text's patterns. It is well-known that some glyphs appear near the start or end of words (e.g. q or m), and this applies to your units too. In your theory, is there an explanation for why such phenomena occur? That would help to accept that your pattern might be the text's actual system as originally intended. 
  • "I believe that each unit may substitute for a Latin script letter". Why? In your post, there's an unexplained leap from suggesting a word formation system to suggesting a substitution cipher for a natural language.
Don't mistake all that for outright rejection. Your ideas are promising, they just need more fleshing out and justification before I can really accept them. I hope you find this criticism constructive.

Brian Cham

P.S. @davidjackson, thanks for letting us type in EVA on the forum!! That was on my wishlist for a while.
Quote:Hello Thomas,
I just saw your message on the mailing list. This is a good analysis. While I applaud your effort, I have a few responses.
  • 1. Your approach is interesting. Breaking the text into such bigram units is rather simple and elegant. However, what was the basis for approaching the text this way?
  • 2. Have you done any automated tests to determine how much of the corpus fits this pattern? You claim that "Almost all of the Voynich text" fits. How much is "almost all"?
  • 3. "If there is no other way to explain why the manuscript does this" - There have been many attempts to break down the text's predictable patterns into neat systems, and they do explain your patterns, and fit the text very well. Unfortunately, most are lost to the depths of the 90s internet so I don't have the links on hand and most of them don't work anymore anyway. Pelling mentions a few of them, and features a similar unit-based system. Pelling had another good post about how to evaluate how effective the word formation theories are, but I can't find it.
  • 4. How did you derive these particular bigrams? Why 26? Did you start from that number?
  • 5. A lot of the text may fit because of some vagueness of the units. It might sound good that only 26 units make up so much text, but units 6, 9, 10, 15, 16, 24 and 26 have two possibilities each, so there are really 33 units. Then when considering your claim that the units can be reversed, there are over 60! Two of the units are only one letter long, which is kind of "cheating" when it comes to getting a good fit.
  • 6. I think you may be overstating the significance of this system. Even if this pattern holds, how can you be certain that this is the pattern (or "the key") behind the Voynich Manuscript text, rather than a corollary of some other principles? For example, you have the units od, ot, op and ok. What if the "real" (i.e. deliberate) underlying system does not rely on bigram units at all, and one of its rules is that "o" attaches to tall-looking glyphs, and your common od/ot/op/ok bigrams are simply a side-effect of this? As I mentioned, there are many competing theories about word formation systems. For each proposal, there is the golden question: Does this system merely describe the features of the text or does it really explain them? Your theory is also subject to this, and you need to answer it before thinking that it could be "the key". It is not good enough to say that the theory has only a few simple rules yet all of the manuscript's text fits. Each of the theories can claim the exact same thing. Why would your theory have a better claim than the others?
  • 7. If the only idea is that the text is made from your units, we would expect to frequently find words like amameeqo or chchchch, but we don't. There is more to the text's patterns. It is well-known that some glyphs appear near the start or end of words (e.g. q or m), and this applies to your units too. In your theory, is there an explanation for why such phenomena occur? That would help to accept that your pattern might be the text's actual system as originally intended. 
  • 8. "I believe that each unit may substitute for a Latin script letter". Why? In your post, there's an unexplained leap from suggesting a word formation system to suggesting a substitution cipher for a natural language.
Don't mistake all that for rejection. Your ideas are promising, they just need more fleshing out. I hope you find this criticism constructive.

Brian Cham

P.S. David, thanks for letting us type in EVA on the forum!! That was on my wishlist for a while.


Dear Brian,
Thank you for your questions - These are good points which I am glad you brought up. I numbered the bullets in your post:

1. I stumbled upon this idea when looking for grammatical cases, actually. I tried to find units which repeated at the ends of words, such as or, ar, y etc., to see if the language betrayed certain grammar features. I actually made a whole list of the qo- words in this attempt:

[attachment=528]

Trying to search for more cases, I broke down the two innermost circles of text in the May Zodiac rings. Since both pages referred to the same month, I figured that would be a way to find the word for "May" or "Taurus." What I found was that many words started with oke- or ote-. Making a list of these words, I realized that the same groups of letters that appeared at the ends of these words also appeared at the beginning - which would indicate that they are independent functioning units, and my mind immediately jumped to the idea of verbose cipher. Here is the picture from my notebook - I boxed in the set of characters that first made me come to this realization:

[Image: attachment.php?aid=529]

From there it was a small cognitive jump to realize that all the repeating groups - or, ar, y, aiin, qo, ch etc. - may be independent units that stand for 1 Latin alphabet character.

2. Have you done any automated tests to determine how much of the corpus fits this pattern? You claim that "Almost all of the Voynich text" fits. How much is "almost all"?
I have not done automated tests, but only because I don't know of any program that can do those tests and I don't have the programming knowledge to write one. So far I've only tried this breakdown by hand on about 20 pages, and on those pages it works the vast majority of the time (98%); the times it does not may be spelling errors.
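
For anyone who wants to try the automated check, a rough sketch in Python might look like the following. It assumes the words have already been read from an EVA transliteration into a plain list; the word list and the partial `UNITS` list here are toy stand-ins, and `coverage` is a made-up helper name, not an existing tool.

```python
import re

# Partial, assumed unit list: only units mentioned in the thread text.
UNITS = ["aiin", "qo", "ch", "od", "ol", "ar", "or", "d", "y"]


def coverage(words, units):
    """Return (share of words built entirely from units, list of misfits)."""
    if not words:
        raise ValueError("need at least one word")
    # One or more units back to back; the regex engine backtracks as needed.
    pattern = re.compile("(?:" + "|".join(map(re.escape, units)) + ")+")
    misses = [w for w in words if not pattern.fullmatch(w)]
    return (len(words) - len(misses)) / len(words), misses


# Toy sample, not a real page of the manuscript.
sample = ["daiin", "qodaiin", "chol", "qoar", "chey"]
rate, misses = coverage(sample, UNITS)
print(f"{rate:.0%} of words fit; misfits: {misses}")
```

Run per page or per section, the misfit list is probably more useful than the percentage, since it shows which glyphs or combinations the chart does not yet account for.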

3. I was not aware of Nick Pelling's system but I will definitely take a look now - thanks!

4. I actually did not start from 26. I only realized it was that many when I began to count the patterns that I hypothesized. I arrived at 26 by doing many, MANY trials, trying to figure out which characters appear before and after each individual Voynich letter. Here are my notes on EVA y, d, and l:

[Image: attachment.php?aid=530]

As you can see, <l> is preceded by <o> so often that it is not random chance. It seemed to me that <ol> must be a combination, and there are 25 similar combinations that repeatedly appear.


5. I admit that the possibility of reversing some combinations will bring up the number of possible units, but regardless of whether it is <ar> or <ra>, for example, both should still correlate to only one plaintext Latin letter. So it's not as if I'm devising a really loose system which will spell anything I want to read into the text - I've seen the pitfalls of that in other "decryptions"!

Regarding #6 and #15, I am firmly convinced that <ch> = <ee>. I can post a lot of places in the text which support this point, including many <chh> ligatures. The same is true of <Sh> = <s h>. I'll post examples tomorrow, but I agree that at first sight this would seem like a weakness of my theory.

Regarding #16 and #26, the <m> character appears in the exact same places that we see <y>: most importantly after <d>. Anyone who has spent time with the text knows that <dy> is everywhere, but we also have <dm>. I believe they might be the same character. In the text, <m> looks like <y> with an extra loop at the top - I actually never realized this before trying to figure out what <m>'s function was.

Regarding #9 and #10, those are just ways to incorporate the gallows, which only appear inconsistently in the text (always in the first line of a paragraph) which means they are likely a variant of another character. No natural language puts certain sounds only at the beginning of a string of speech, so they must be another letter in the text just written differently.

6. "Does this system merely describe the features of the text or does it really explain them? Your theory is also subject to this, and you need to answer it before thinking that it could be "the key". It is not good enough to say that the theory has only a few simple rules yet all of the manuscript's text fits. Each of the theories can claim the exact same thing. Why would your theory have a better claim than the others?"

That is a good question. I think statistics would be the answer. If <l> is not supposed to function with <o> or <a>, why does it almost always appear with <o> or <a>? As in my graph paper notes above: 33/40 times with <o> and 7/40 with <a>. This suggests there is definitely an <ol> pattern in the text, and likely an <al> pattern also. The same is true for the other letters whose position relative to their neighbours I examined.
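
The hand tally behind those figures (33/40 is 82.5%, 7/40 is 17.5%) could be repeated over a whole transliteration with something like the sketch below. It assumes a plain list of EVA words and treats every EVA letter as one glyph, which is itself an assumption; `predecessors` is an illustrative name and the word list is a toy sample.

```python
from collections import Counter


def predecessors(words, glyph):
    """Count which character comes directly before `glyph` in each word.

    A word-initial occurrence is counted under the marker "^".
    """
    counts = Counter()
    for word in words:
        for i, ch in enumerate(word):
            if ch == glyph:
                counts[word[i - 1] if i > 0 else "^"] += 1
    return counts


# Toy sample, not real manuscript counts.
sample = ["chol", "ol", "dal", "qokal", "lchedy"]
counts = predecessors(sample, "l")
total = sum(counts.values())
for before, n in counts.most_common():
    print(f"{before}l: {n}/{total} = {n / total:.1%}")
```

Because this version only looks inside words, an <l> at the start of a word is reported as "^" instead of being paired with the last letter of the preceding word.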

To answer the question about the corollary, I am not sure. Perhaps the Voynich scribe did create some units by putting <o> near tall characters; I can't say he didn't. Whether he did or he didn't wouldn't affect the outcome, as long as the combinations still correlate to one Latin letter each.

7. I fully believe that the spaces in between words in the manuscript are fake. I don't think they mean anything: the choice to put a space before qo- words, in my mind, is completely arbitrary and just meant to mask the text.

8. "I believe that each unit may substitute for a Latin script letter". Why? In your post, there's an unexplained leap from suggesting a word formation system to suggesting a substitution cipher for a natural language.

These units repeat over and over, and the majority do not seem restricted to particular word positions (except for ones like <qo>, which you mentioned; for that, see point 7). This suggests they are not case endings as in Latin, but individual units with their own function. And that pointed me in the direction of individual letters with their own sound values.
(04-09-2016, 01:48 AM)ThomasCoon Wrote: Well, I actually think that the spaces in text are completely fake. I think that he wrote large words (ex. "vocabulary") into two words: ("vocab-ulary").

...

It works the other way, as well. Word-tokens that appear to be unique can frequently be broken into two (or occasionally three) commonly used units which means that any attempt to ascribe special meanings to "unique" vords might be a misassumption for some or all of them.





(04-09-2016, 12:16 AM)ThomasCoon Wrote: This is something that I've known for a while now, and I have kept it secret, but I think it's time to reveal it to the community.
...

Welcome to the club.

I agree in principle to this kind of breakdown (as I have also noted on Don's posts and briefly alluded to on a couple of blogs). I don't agree with 100% of your breakdowns but I don't think the deconstruction has to be perfect for it to be productive so I'm not going to niggle about small details.


And no, I'm not revealing mine to the community yet. There's this little problem of how to wrestle 2200+ pages of notes into something digestible (and of giving away too much before I've had a chance to write it up)!
I also think that this is a very promising approach, in general.
I see it as one of the best ways to explain the low entropy of the text.

Inevitably, this has been considered and tried by several people in the past.
Jim Reeds tried to create such a list using a piece of software to set it up, but as far as I know this did not produce a result that he was satisfied with.

Something rather similar was done by Robert Firth, though his suggestion of what it means is quite different.

That last point I think is important in order to advance, namely to clearly separate the analytical part (how to separate the text into basic units) and the speculative part: what it means (cipher or language?, which language?).

The most important questions to answer at the start are, in my opinion:
- How much of the text (e.g. in percent) does this really explain? It is certainly not 100%, and it is also correct that one may assume that the text has some errors. If it can't (yet) be checked by a piece of software, it should at least be checked against some relevant pages, one herbal-A, one herbal-B, one bio or one recipes page.
- How can we know that this is the only method? Really, it is not likely to be, so how do we find the 'most likely' method?

A general comment as well: I don't think that it is necessary at this stage to assume that Ch and ee (and similar examples) mean the same thing.

This is not meant to be overly critical. I do believe, as I said at the start, that this is quite promising.
I think this remark by Brian is an important one:

Quote:If the only idea is that the text is made from your units, we would expect to frequently find words like amameeqo or chchchch, but we don't.

There is more to the structure of Voynichese words, and I'm not sure if it can be explained just by word breaks. 

One interesting test would be to somehow try to encrypt a known text into Voynichese using your system, and see to what extent you manage to approach its appearance. 
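
A rough sketch of that test in Python, under heavy assumptions: only qo = a and ar = b come from the opening post, while every other pairing in the table below is an arbitrary placeholder chosen purely so the code runs. It is NOT a proposed reading, and `build_table` and `encode` are made-up helper names.

```python
import string

# qo = a and ar = b are the examples from the opening post. The order of the
# remaining units is arbitrary filler, so letters c onward are placeholders.
UNITS = ["qo", "ar", "ch", "d", "aiin", "ol", "or", "od", "ot", "ok", "op", "y"]


def build_table(units, alphabet=string.ascii_lowercase):
    """Pair plaintext letters with units in order (a -> units[0], b -> units[1], ...)."""
    return dict(zip(alphabet, units))


def encode(plaintext, table):
    """Replace each letter by its unit; spaces and punctuation are dropped."""
    out = []
    for ch in plaintext.lower():
        if not ch.isalpha():
            continue
        if ch not in table:
            raise KeyError(f"no unit assigned to {ch!r}")
        out.append(table[ch])
    return "".join(out)


table = build_table(UNITS)
print(encode("a bad cab", table))  # qoarqodchqoar
```

The output is one unbroken string, so comparing it with real Voynichese would still need some rule for where the word breaks go, and that is exactly the open question here.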
Hi Thomas,

In my opinion, the key point in approaching the Voynichese text is that we don't know the real Voynich alphabet. E.g., we don't know whether iin stands for three characters or one character, whether l stands for one character or two characters (i plus a tail modifier), etc.

Instead of simply wondering at the low character entropy of the Voynich text, a more productive approach would possibly be to try to (re)construct the Voynich alphabet in such a way that the character entropy would be normal.
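
To make that comparison concrete, here is a small Python sketch that measures first-order and conditional (next-symbol) entropy for the same text under two candidate alphabets: single EVA letters versus multi-letter units. The words and their hand splits are toy examples taken from this thread, far too small to say anything about the manuscript, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy in bits of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(tokens):
    """H(next | current) = H(current, next) - H(current), from adjacent pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    firsts = [a for a, _ in pairs]
    return entropy(pairs) - entropy(firsts)


# Toy data: four words from the thread and the splits proposed for them.
words = ["qodaiin", "chodaiin", "daiin", "choldaiin"]
hand_split = [["qo", "d", "aiin"], ["ch", "od", "aiin"],
              ["d", "aiin"], ["ch", "ol", "d", "aiin"]]

glyphs = [g for w in words for g in w]
units = [u for parts in hand_split for u in parts]

for name, stream in [("single EVA letters", glyphs), ("units", units)]:
    print(f"{name}: H1 = {entropy(stream):.2f} bits, "
          f"H(next|current) = {conditional_entropy(stream):.2f} bits")
```

On a full transliteration, the alphabet whose conditional entropy lands closest to that of ordinary texts would be the more plausible candidate, though a good entropy figure alone would not show that the units are the right ones.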

Next, about the orthography decomposition. As I believe I already noted in my review of Brian's paper, if we speak of some rules or mapping, then the rules are there to be adhered to. So if one proposes a "rule" that is followed in only 50% of the cases, that is simply not a rule. Accordingly, Brian is right about testing the validity of the pattern. Figures are needed to assess that. I'd say 85% conformity is a must to even begin any discussion, and 95% would be good enough conformity to speak of a "rule" (allowing 5% for scribal errors).

ce and ee are definitely not the same thing, since that crossbar is met not only with e, but, e.g. with y. Wladimir provided examples of that.

"Gallows coverage" (see the thread about gallows for details) and text entered in multiple passes are things to be considered. I strongly suspect that some kind of shuffling is there in place.

***

Brian, nice to see you on the forum again.
(04-09-2016, 05:09 AM)-JKP- Wrote:
(04-09-2016, 01:48 AM)ThomasCoon Wrote: Well, I actually think that the spaces in text are completely fake. I think that he wrote large words (ex. "vocabulary") into two words: ("vocab-ulary").

...

It works the other way, as well. Word-tokens that appear to be unique can frequently be broken into two (or occasionally three) commonly used units which means that any attempt to ascribe special meanings to "unique" vords might be a misassumption for some or all of them.

I'm sorry, I didn't explain clearly what I was thinking: I agree that it's a misassumption to ascribe any meaning to any unique vord. In my opinion, each line should be written without spaces and then broken down into the units I've described. I think it's possible he wrote a plaintext word (like "vocabulary") as "vocab-ulary" but also "voca bu lary" or "vo cab ula ry" etc.




Quote:I think this remark by Brian is an important one:

Quote: If the only idea is that the text is made from your units, we would expect to frequently find words like amameeqo or chchchch, but we don't.

There is more to the structure of Voynichese words, and I'm not sure if it can be explained just by word breaks. 

One interesting test would be to somehow try to encrypt a known text into Voynichese using your system, and see to what extent you manage to approach its appearance.

Brian and Koen:

I believe that many ch (ee) and other 2-letter combinations are hidden: for example, one "c" is at the end of a word while the next "c" is at the beginning of the following word. As I said above I don't believe the spaces mean anything. Here is a paragraph that supports my belief (f1v):

[Image: attachment.php?aid=533]

In blue are all the < o+l > units. In the red boxes you might think my theory breaks down because there is only <l> - but the <o> that they need is on the preceding word both times.
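
The same idea can be checked without a highlighter: drop the spaces from a line, split it into units, and list the units that sit across an original word break. A minimal Python sketch follows, assuming a partial unit list and an invented example line (it is not a transliteration of f1v); all function names are illustrative.

```python
from functools import lru_cache

# Partial, assumed unit list: only units mentioned in the thread text.
UNITS = ("aiin", "qo", "ch", "ol", "od", "d", "y")


def split_units(text, units):
    """Split `text` into units (longest tried first), or return None."""
    ordered = sorted(units, key=len, reverse=True)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()
        for unit in ordered:
            if text.startswith(unit, start):
                rest = solve(start + len(unit))
                if rest is not None:
                    return (unit,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


def straddling_units(line, units=UNITS):
    """Return (all units, units that span a word break), or None if no fit."""
    words = line.split()
    parts = split_units("".join(words), units)
    if parts is None:
        return None
    # Offsets in the joined string where a space used to be.
    breaks, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        breaks.add(pos)
    spanning, start = [], 0
    for unit in parts:
        end = start + len(unit)
        if any(start < b < end for b in breaks):
            spanning.append(unit)
        start = end
    return parts, spanning


print(straddling_units("cho ldaiin"))
```

For the invented line "cho ldaiin" the split is ch-ol-d-aiin, with <ol> reported as the unit that crosses the space, which mirrors the red-box cases described above.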

(04-09-2016, 08:44 AM)Koen Gh. Wrote: One interesting test would be to somehow try to encrypt a known text into Voynichese using your system, and see to what extent you manage to approach its appearance.


I will do that today with a Latin text and a German one.