05-11-2025, 09:25 PM
05-11-2025, 11:37 PM
(05-11-2025, 04:59 PM)Jorge_Stolfi Wrote: The most interesting result is the proximity of the Spanish "head" texts to the Portuguese "tail" ones, and vice versa. Those are essentially distinct texts, in rather different languages and quite different spellings, technically by distinct authors (Machado in the latter, and Tapía translating Machado in the former). What they had in common was the higher-level nature and style of the work (grammatical variety, clause length, predominant verbal tenses, etc.), the general topic (which determined proper names and common concepts and actions), and whatever part of the author's style could survive the translation.
I think this is interesting because it suggests the graph captures the structure of the text independently of the language. A good test would be to run a text translated into languages less closely related than Portuguese and Spanish, and see how the graph behaves. If the dots still land close together, we would have a tool to study the Voynich text independently of the alphabet or the "language".
06-11-2025, 01:02 AM
(05-11-2025, 11:37 PM)quimqu Wrote: Maybe a good test would be to pass a text translated into not so near languages like Portuguese and Spanish, and see how the graph behaves. Because then, if the dots are close, we will have a tool to study the Voynich text independently from the alphabet or the "language".
Good idea! Here are some versions of the Pentateuch (first five books of the Old Testament) in various languages. From my [link]:
- [link] Chinese (Mandarin) in GB (Guo Biao) 2312 encoding, 2 bytes per character.
- [link] Hebrew, with vowel marks, in an ad-hoc "phoneticish" encoding.
- [link] Latin (Vulgate).
- [link] Russian, KOI8-R encoding.
- [link] Vietnamese, VIQR encoding with a few changes. (You already have this one.)
There is also a file "main.wds" that has the "main.src" digested into a more uniform format, with one word or punctuation per line. The first letter identifies the entry ("a" for text word, "p" for punctuation, "s" for symbol, etc.), with the object itself after a space. This file may be easier to convert to your format than the "main.src" one.
At the top of each of these files there are a few "@chars" lines that specify the characters that can appear as parts of words (@chars alpha), of punctuation (@chars punct), and of numbers or other symbols (@chars symbol).
The files are all in the ISO-latin-1 encoding, so you must pipe them through "recode latin-1..utf8" if your software expects Unicode. And be sure to use "wget" or the "Save link as..." button of your browser, rather than opening the file in the browser and copy-pasting the contents.
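If the recode tool is not available, the same conversion can be done with a few lines of Python (a stand-in for the `recode latin-1..utf8` pipe, not the original pipeline; paths are illustrative):

```python
# Re-encode a file from ISO-latin-1 to UTF-8, equivalent in effect to
# piping it through "recode latin-1..utf8".
def latin1_to_utf8(src_path, dst_path):
    with open(src_path, "r", encoding="latin-1") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```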
Let me know if you need help. (For instance, I think I can convert the Chinese file to phonetic pinyin. I have the recipe saved somewhere. But I don't know whether the conversion will be 100% correct...)
All the best, --stolfi
06-11-2025, 09:57 AM
To go further into the Voynich graph analysis, I created a directed graph where every Voynich "word" is a node and each link connects words that appear next to each other within the same paragraph. I avoided connecting words from one paragraph to the next. Then I looked at how this network behaves.
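A minimal sketch of such a graph in Python (plain dicts, no graph library; the paragraph lists below are illustrative):

```python
from collections import defaultdict

def build_word_graph(paragraphs):
    """Directed graph as weighted adjacency counts: the edge (A, B) is
    incremented each time word A immediately precedes word B within a
    paragraph. No edge ever crosses a paragraph boundary."""
    edges = defaultdict(int)
    for para in paragraphs:              # each paragraph: a list of words
        for a, b in zip(para, para[1:]):
            edges[(a, b)] += 1
    return edges

# Illustrative paragraphs (not actual transcription lines):
paras = [["daiin", "ol", "chedy"], ["ol", "chedy", "daiin", "ol"]]
g = build_word_graph(paras)
assert g[("ol", "chedy")] == 2   # occurs in both paragraphs
assert g[("chedy", "ol")] == 0   # never occurs within one paragraph
```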
Several things were found. Not much of it is new, but this new approach gives the same results.
- A small group of words (like ol, chedy, shedy, qokain, aiin) form the backbone of the text. They connect to many others and often appear in short loops.
- The network isn’t random. When I shuffle the connections, most of the structure disappears.
- The text seems to remember what came before. Each word depends not only on the previous one but also on the last two or three. In other words, it has short-range grammar or a sort of rhythm rather than fixed phrases.
- There are no clear long repeating cycles. Instead, the same combinations keep reappearing in slightly different orders. Some words appear again after a short cycle, as shown in this plot:
[attachment=12053]
It shows that a few words, like chedy, qokeedy, shedy, daiin, tend to come back to themselves after two or three steps, while most others quickly lose that connection.
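The return behaviour can be estimated with a small random-walk simulation on the token network. This is only a sketch of the idea (simulated walks on edge pairs), not necessarily the exact computation behind the plot:

```python
import random
from collections import defaultdict

def return_probability(pairs, k=3, trials=20000, seed=0):
    """Estimate P(a random walk returns to its start node within k steps)
    on a directed graph given as (A, B) edge pairs."""
    rng = random.Random(seed)
    succ = defaultdict(list)
    for a, b in pairs:
        succ[a].append(b)
    nodes = list(succ)                       # nodes with outgoing edges
    hits = 0
    for _ in range(trials):
        start = node = rng.choice(nodes)
        for _ in range(k):
            if not succ[node]:               # dead end: walk cannot continue
                break
            node = rng.choice(succ[node])
            if node == start:
                hits += 1
                break
    return hits / trials

# Sanity check: on a pure 2-cycle every walk returns within 2 steps.
assert return_probability([("a", "b"), ("b", "a")], k=2) == 1.0
```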
With this analysis, we can say the Voynich text behaves like a system that keeps recycling a few basic pieces under certain rules, not like plain gibberish.
According to the results, the next word is not random: it depends on the previous two or three words. This means the text follows a kind of rhythm or local grammar rather than fixed repeated phrases.
I tested it in three ways:
- Conditional entropy drops sharply when more context is known.
- Perplexity (recall that it is the exponential of the mean per-word entropy) decreases from 1-gram to 3-gram models.
- Return probability in the token network is much higher than in random graphs.
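The first two tests can be sketched as a plain empirical n-gram estimate (raw counts, no smoothing; not necessarily the exact estimator used for the plots, and base 2 rather than e):

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(words, order):
    """H(next word | previous `order` words), in bits, from raw counts."""
    ctx_counts = defaultdict(Counter)
    for i in range(order, len(words)):
        ctx = tuple(words[i - order:i])
        ctx_counts[ctx][words[i]] += 1
    total = sum(sum(c.values()) for c in ctx_counts.values())
    h = 0.0
    for counter in ctx_counts.values():
        n = sum(counter.values())
        for count in counter.values():
            p = count / n                      # P(word | context)
            h -= (count / total) * math.log2(p)
    return h

def perplexity(words, order):
    """Perplexity = 2 ** H, the (base-2) exponential of mean word entropy."""
    return 2 ** conditional_entropy(words, order)

# On a perfectly alternating toy text, one word of context removes
# all uncertainty: entropy drops from 1 bit to 0, perplexity to 1.
w = ["a", "b", "a", "b", "a", "b"]
assert conditional_entropy(w, 0) == 1.0
assert perplexity(w, 1) == 1.0
```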
[attachment=12055]
An additional plot to show the strongest repeating word transitions in the Voynich text, where a few tokens like ol, shedy, and qokedy form the main loop of tightly connected words:
[attachment=12054]
06-11-2025, 10:19 AM
(06-11-2025, 09:57 AM)quimqu Wrote: a few tokens like ol, shedy, and qokedy form the main loop of tightly connected words
I am guessing that an edge directed from A to B counts the times that A appears before B in some parag. Is that correct?
06-11-2025, 10:36 AM
(06-11-2025, 09:57 AM)quimqu Wrote: the next word is not random. it depends on the previous two or three words. ... Conditional entropy drops sharply when more context is known
So far this is a property of almost any text written in any natural language. And even of encrypted text, if each original word type is mapped to one encrypted word type. Or to a small number of types.
Because of this property, a Markov model of order 2 (that chooses the next word at random, with frequencies based on the last two generated words) can produce pseudo-English (or pseudo-Mongolian) that is pretty much indistinguishable from the real thing -- to anyone who does not know the language.
Therefore, the results of any word-level analysis of the VMS should be compared to those obtained from a sample of pseudo-Voynichese generated by a Markov model of order 2 or 3. Comparing to a simple random stream of words (as produced by a Markov model of order zero) is not very useful.
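For reference, an order-2 word-level generator of the kind described ("monkey"-style: choose the next word at random with frequencies based on the last two generated words) can be sketched in a few lines of Python. This is a generic reconstruction, not Jacques Guy's program:

```python
import random
from collections import defaultdict

def train_order2(words):
    """Record every continuation observed after each pair of words;
    repeated continuations appear repeatedly, preserving frequencies."""
    model = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)].append(c)
    return model

def generate(model, n, seed=0):
    """Emit n pseudo-words by walking the order-2 model at random."""
    rng = random.Random(seed)
    a, b = rng.choice(list(model))
    out = [a, b]
    while len(out) < n:
        choices = model.get((a, b))
        if not choices:                      # dead end: restart from a seen pair
            a, b = rng.choice(list(model))
            out += [a, b]
            continue
        c = rng.choice(choices)
        out.append(c)
        a, b = b, c
    return out[:n]
```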
A millennium ago, Jacques Guy wrote such a generator, which he called "monkey". Alas, I can find neither the program nor its output. Maybe some other old-timer kept a copy?
All the best, --stolfi
06-11-2025, 11:52 AM
(06-11-2025, 10:19 AM)Jorge_Stolfi Wrote: I am guessing that an edge directed from A to B counts the times that A appears before B in some parag. Is that correct?
Yes, exactly. In the graph, a directed edge from A to B means that word A appears immediately before word B somewhere in the same paragraph.
06-11-2025, 04:04 PM
(06-11-2025, 10:36 AM)Jorge_Stolfi Wrote: So far this is a property of almost any text written in any natural language. And even of encrypted text, if each original word type is mapped to one encrypted word type. Or to a small number of types.
Well, that would be good news for those who expect the Voynich to make some sense.
06-11-2025, 08:54 PM
Here are some additional files which you may find useful. In [link]:
Legitimate texts:
Code:
lines words bytes file
------- ------- --------- ------------
9058 99049 531241 engl/chr/true.txt English, Culpeper's Herbal, herbal section.
845 7684 45404 voyn/hea/true.txt Voynichese, Herbal-A, parags only.
359 3417 20400 voyn/heb/true.txt Voynichese, Herbal-B, parags only.
661 6804 41140 voyn/bio/true.txt Voynichese, Bio, parags only.
1111 11555 72750 voyn/str/true.txt Voynichese, Starred Parags, parags only.

Fake texts generated (mostly) from the above by a Markov of order 2:
Code:
lines words bytes file
------- ------- --------- ------------
9092 99049 532343 engl/chr/fake.txt
842 7684 45592 voyn/hea/fake.txt
346 3417 20394 voyn/heb/fake.txt
646 6804 41173 voyn/bio/fake.txt
1125 11555 72839 voyn/str/fake.txt

Same, but one word per line:
Code:
lines words bytes file
------- ------- --------- ------------
99049 99049 531241 engl/chr/true.wdp
7684 7684 45404 voyn/hea/true.wdp
3417 3417 20400 voyn/heb/true.wdp
6804 6804 41140 voyn/bio/true.wdp
11555 11555 72745 voyn/str/true.wdp

Code:
lines words bytes file
------- ------- --------- ------------
99049 99049 532343 engl/chr/fake.wdp
7684 7684 45592 voyn/hea/fake.wdp
3417 3417 20394 voyn/heb/fake.wdp
6804 6804 41172 voyn/bio/fake.wdp
11555 11555 72829 voyn/str/fake.wdp

The ".wdp" files have one word per line. The ".txt" files are the same
but filled as parags to 72 columns. In all files, the end of a parag is
denoted by a word "=" in a line by itself. The file encoding is
ISO-latin-1.
The Voynichese files are derived from a recent copy of Rene's
transcription, which uses lowercase EVA. Line breaks and plant
intrusions in the original text are not recorded. Parag breaks
were inferred from the "locators" ("P+", "P=", etc.).
Unfortunately the sections, especially Herbal-B, are
rather short.
The English files are derived from Culpeper's Herbal. Only the plant
descriptions from the Herbal section proper were taken, omitting the
"Place", "Time", and "Vertues" subsections, and the marginal notes.
Punctuation, numbers, and symbols (including "&") are omitted. The words
were all mapped to lowercase. The word characters are thus [a-z] plus
apostrophe "'", "°" for abbreviation period, and "~" for hyphen.
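Based on that character inventory, a tokenizer for these files might look like the following sketch (the actual extraction scripts were not posted; this only illustrates the stated word-character set):

```python
import re

# Word characters as described: [a-z], apostrophe "'", "°" for the
# abbreviation period, and "~" for the hyphen.
WORD_RE = re.compile(r"[a-z'°~]+")

def tokenize(line):
    """Lowercase a line and extract maximal runs of word characters;
    everything else (punctuation, digits, "&") is dropped."""
    return WORD_RE.findall(line.lower())

# Illustrative line, not a real Culpeper sentence:
assert tokenize("Agrimony groweth° up ever~green") == \
    ["agrimony", "groweth°", "up", "ever~green"]
```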
Hope it helps, --stolfi
07-11-2025, 12:46 AM
(06-11-2025, 10:36 AM)Jorge_Stolfi Wrote: A millennium ago, Jacques Guy wrote such a generator, which he called "monkey". Alas I can find neither the program nor its output. Maybe some other old-timer kept a copy?
I had a copy for quite a while, but no longer. I have also played with my own:
[link]
The 'fun' results are in Annex A.