05-11-2025, 09:25 PM
05-11-2025, 11:37 PM
(05-11-2025, 04:59 PM)Jorge_Stolfi Wrote: The most interesting result is the proximity of the Spanish "head" texts to the Portuguese "tail" ones, and vice versa. Those are essentially distinct texts, in rather different languages and quite different spellings, technically by distinct authors (Machado in the latter, and Tapía translating Machado in the former). What they had in common was the higher-level nature and style of the work (grammatical variety, clause length, predominant verbal tenses, etc.), the general topic (which determined proper names and common concepts and actions), and whatever part of the author's style could survive the translation.
I think this is interesting because it suggests the graph captures the structure of the text independently of the language. A good test would be to run a text translated into languages less closely related than Portuguese and Spanish, and see how the graph behaves. If the dots still land close together, we would have a tool to study the Voynich text independently of the alphabet or the "language".
06-11-2025, 01:02 AM
(05-11-2025, 11:37 PM)quimqu Wrote: Maybe a good test would be to pass a text translated into not so near languages like Portuguese and Spanish, and see how the graph behaves. Because then, if the dots are close, we will have a tool to study the Voynich text independently from the alphabet or the "language".
Good idea! Here are some versions of the Pentateuch (first five books of the Old Testament) in various languages. From my [link]:
- [link] Chinese (Mandarin) in GB (Guo Biao) 2312 encoding, 2 bytes per character.
- [link] Hebrew, with vowel marks, in an ad-hoc "phoneticish" encoding.
- [link] Latin (Vulgate).
- [link] Russian, KOI8-R encoding.
- [link] Vietnamese, VIQR encoding with a few changes. (You already have this one.)
There is also a file "main.wds" that has the "main.src" digested into a more uniform format, with one word or punctuation per line. The first letter identifies the entry ("a" for text word, "p" for punctuation, "s" for symbol, etc.), with the object itself after a space. This file may be easier to convert to your format than the "main.src" one.
At the top of each of these files there are a few "@chars" lines that specify the characters that can appear as parts of words (@chars alpha), of punctuation (@chars punct), and of numbers or other symbols (@chars symbol).
The files are all in the ISO-latin-1 encoding, so you must pipe them through "recode latin-1..utf8" if your software expects Unicode. And be sure to use "wget" or the "Save link as..." button of your browser, rather than opening the file in the browser and copy-pasting the contents.
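If the recode tool is not available, the same conversion can be done with a few lines of Python (a stand-in for the `recode latin-1..utf8` pipe, not the original pipeline; paths are illustrative):

```python
# Re-encode a file from ISO-latin-1 to UTF-8, equivalent in effect to
# piping it through "recode latin-1..utf8".
def latin1_to_utf8(src_path, dst_path):
    with open(src_path, "r", encoding="latin-1") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```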
Let me know if you need help. (For instance, I think I can convert the Chinese file to phonetic pinyin. I have the recipe saved somewhere. But I don't know whether the conversion will be 100% correct...)
All the best, --stolfi
06-11-2025, 09:57 AM
To go further into the Voynich graph analysis, I created a directed graph where every Voynich "word" is a node and each link connects words that appear next to each other within the same paragraph. I avoided connecting words from one paragraph to the next. Then I looked at how this network behaves.
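A minimal sketch of such a graph in Python (plain dicts, no graph library; the paragraph lists below are illustrative):

```python
from collections import defaultdict

def build_word_graph(paragraphs):
    """Directed graph as weighted adjacency counts: the edge (A, B) is
    incremented each time word A immediately precedes word B within a
    paragraph. No edge ever crosses a paragraph boundary."""
    edges = defaultdict(int)
    for para in paragraphs:              # each paragraph: a list of words
        for a, b in zip(para, para[1:]):
            edges[(a, b)] += 1
    return edges

# Illustrative paragraphs (not actual transcription lines):
paras = [["daiin", "ol", "chedy"], ["ol", "chedy", "daiin", "ol"]]
g = build_word_graph(paras)
assert g[("ol", "chedy")] == 2   # occurs in both paragraphs
assert g[("chedy", "ol")] == 0   # never occurs within one paragraph
```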
Several things were found. Not much of it is new, but this new approach gives the same results.
- A small group of words (like ol, chedy, shedy, qokain, aiin) form the backbone of the text. They connect to many others and often appear in short loops.
- The network isn’t random. When I shuffle the connections, most of the structure disappears.
- The text seems to remember what came before. Each word depends not only on the previous one but also on the last two or three. In other words, it has short-range grammar or a sort of rhythm rather than fixed phrases.
- There are no clear long repeating cycles. Instead, the same combinations keep reappearing in slightly different orders. Some words appear again after a short cycle, as shown in this plot:
[attachment=12053]
It shows that a few words, like chedy, qokeedy, shedy, daiin, tend to come back to themselves after two or three steps, while most others quickly lose that connection.
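The return behaviour can be estimated with a small random-walk simulation on the token network. This is only a sketch of the idea (simulated walks on edge pairs), not necessarily the exact computation behind the plot:

```python
import random
from collections import defaultdict

def return_probability(pairs, k=3, trials=20000, seed=0):
    """Estimate P(a random walk returns to its start node within k steps)
    on a directed graph given as (A, B) edge pairs."""
    rng = random.Random(seed)
    succ = defaultdict(list)
    for a, b in pairs:
        succ[a].append(b)
    nodes = list(succ)                       # nodes with outgoing edges
    hits = 0
    for _ in range(trials):
        start = node = rng.choice(nodes)
        for _ in range(k):
            if not succ[node]:               # dead end: walk cannot continue
                break
            node = rng.choice(succ[node])
            if node == start:
                hits += 1
                break
    return hits / trials

# Sanity check: on a pure 2-cycle every walk returns within 2 steps.
assert return_probability([("a", "b"), ("b", "a")], k=2) == 1.0
```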
With this analysis, we can say the Voynich text behaves like a system that keeps recycling a few basic pieces under certain rules, not like plain gibberish.
According to the results, the next word is not random: it depends on the previous two or three words. This means the text follows a kind of rhythm or local grammar rather than fixed repeated phrases.
I tested it in three ways:
- Conditional entropy drops sharply when more context is known.
- Perplexity (recall that it is the exponential of the mean per-word entropy) decreases from 1-gram to 3-gram models.
- Return probability in the token network is much higher than in random graphs.
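The first two tests can be sketched as a plain empirical n-gram estimate (raw counts, no smoothing; not necessarily the exact estimator used for the plots, and base 2 rather than e):

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(words, order):
    """H(next word | previous `order` words), in bits, from raw counts."""
    ctx_counts = defaultdict(Counter)
    for i in range(order, len(words)):
        ctx = tuple(words[i - order:i])
        ctx_counts[ctx][words[i]] += 1
    total = sum(sum(c.values()) for c in ctx_counts.values())
    h = 0.0
    for counter in ctx_counts.values():
        n = sum(counter.values())
        for count in counter.values():
            p = count / n                      # P(word | context)
            h -= (count / total) * math.log2(p)
    return h

def perplexity(words, order):
    """Perplexity = 2 ** H, the (base-2) exponential of mean word entropy."""
    return 2 ** conditional_entropy(words, order)

# On a perfectly alternating toy text, one word of context removes
# all uncertainty: entropy drops from 1 bit to 0, perplexity to 1.
w = ["a", "b", "a", "b", "a", "b"]
assert conditional_entropy(w, 0) == 1.0
assert perplexity(w, 1) == 1.0
```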
[attachment=12055]
An additional plot to show the strongest repeating word transitions in the Voynich text, where a few tokens like ol, shedy, and qokedy form the main loop of tightly connected words:
[attachment=12054]
06-11-2025, 10:19 AM
(06-11-2025, 09:57 AM)quimqu Wrote: a few tokens like ol, shedy, and qokedy form the main loop of tightly connected words
I am guessing that an edge directed from A to B counts the times that A appears before B in some parag. Is that correct?
06-11-2025, 10:36 AM
(06-11-2025, 09:57 AM)quimqu Wrote: the next word is not random. it depends on the previous two or three words. ... Conditional entropy drops sharply when more context is known
So far this is a property of almost any text written in any natural language. And even of encrypted text, if each original word type is mapped to one encrypted word type. Or to a small number of types.
Because of this property, a Markov model of order 2 (that chooses the next word at random, with frequencies based on the last two generated words) can produce pseudo-English (or pseudo-Mongolian) that is pretty much indistinguishable from the real thing -- to anyone who does not know the language.
Therefore, the results of any word-level analysis of the VMS should be compared to those obtained from a sample of pseudo-Voynichese generated by a Markov model of order 2 or 3. Comparing to a simple random stream of words (as produced by a Markov model of order zero) is not very useful.
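For reference, an order-2 word-level generator of the kind described ("monkey"-style: choose the next word at random with frequencies based on the last two generated words) can be sketched in a few lines of Python. This is a generic reconstruction, not Jacques Guy's program:

```python
import random
from collections import defaultdict

def train_order2(words):
    """Record every continuation observed after each pair of words;
    repeated continuations appear repeatedly, preserving frequencies."""
    model = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)].append(c)
    return model

def generate(model, n, seed=0):
    """Emit n pseudo-words by walking the order-2 model at random."""
    rng = random.Random(seed)
    a, b = rng.choice(list(model))
    out = [a, b]
    while len(out) < n:
        choices = model.get((a, b))
        if not choices:                      # dead end: restart from a seen pair
            a, b = rng.choice(list(model))
            out += [a, b]
            continue
        c = rng.choice(choices)
        out.append(c)
        a, b = b, c
    return out[:n]
```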
A millennium ago, Jacques Guy wrote such a generator, which he called "monkey". Alas, I can find neither the program nor its output. Maybe some other old-timer kept a copy?
All the best, --stolfi
06-11-2025, 11:52 AM
(06-11-2025, 10:19 AM)Jorge_Stolfi Wrote: I am guessing that an edge directed from A to B counts the times that A appears before B in some parag. Is that correct?
Yes, exactly. In the graph, a directed edge from A to B means that word A appears immediately before word B somewhere in the same paragraph.
06-11-2025, 04:04 PM
(06-11-2025, 10:36 AM)Jorge_Stolfi Wrote: So far this is a property of almost any text written in any natural language. And even of encrypted text, if each original word type is mapped to one encrypted word type. Or to a small number of types.
Well, that would be good news for those who expect the Voynich to make some sense.
06-11-2025, 08:54 PM
Here are some additional files which you may find useful. In [link]:
Legitimate texts:
Code:
lines words bytes file
------- ------- --------- ------------
9058 99049 531241 engl/chr/true.txt English, Culpeper's Herbal, herbal section.
845 7684 45404 voyn/hea/true.txt Voynichese, Herbal-A, parags only.
359 3417 20400 voyn/heb/true.txt Voynichese, Herbal-B, parags only.
661 6804 41140 voyn/bio/true.txt Voynichese, Bio, parags only.
1111 11555 72750 voyn/str/true.txt Voynichese, Starred Parags, parags only.

Fake texts generated (mostly) from the above by a Markov of order 2:
Code:
lines words bytes file
------- ------- --------- ------------
9092 99049 532343 engl/chr/fake.txt
842 7684 45592 voyn/hea/fake.txt
346 3417 20394 voyn/heb/fake.txt
646 6804 41173 voyn/bio/fake.txt
1125 11555 72839 voyn/str/fake.txt

Same, but one word per line:
Code:
lines words bytes file
------- ------- --------- ------------
99049 99049 531241 engl/chr/true.wdp
7684 7684 45404 voyn/hea/true.wdp
3417 3417 20400 voyn/heb/true.wdp
6804 6804 41140 voyn/bio/true.wdp
11555 11555 72745 voyn/str/true.wdp

Code:
lines words bytes file
------- ------- --------- ------------
99049 99049 532343 engl/chr/fake.wdp
7684 7684 45592 voyn/hea/fake.wdp
3417 3417 20394 voyn/heb/fake.wdp
6804 6804 41172 voyn/bio/fake.wdp
11555 11555 72829 voyn/str/fake.wdp

The ".wdp" files have one word per line. The ".txt" files are the same
but filled as parags to 72 columns. In all files, the end of a parag is
denoted by a word "=" in a line by itself. The file encoding is
ISO-latin-1.
The Voynichese files are derived from a recent copy of Rene's
transcription, which uses lowercase EVA. Line breaks and plant
intrusions in the original text are not recorded. Parag breaks
were inferred from the "locators" ("P+", "P=", etc.).
Unfortunately the sections, especially Herbal-B, are
rather short.
The English files are derived from Culpeper's Herbal. Only the plant
descriptions from the Herbal section proper were taken, omitting the
"Place", "Time", and "Vertues" subsections, and the marginal notes.
Punctuation, numbers, and symbols (including "&") are omitted. The words
were all mapped to lowercase. The word characters are thus [a-z] plus
apostrophe "'", "°" for abbreviation period, and "~" for hyphen.
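Based on that character inventory, a tokenizer for these files might look like the following sketch (the actual extraction scripts were not posted; this only illustrates the stated word-character set):

```python
import re

# Word characters as described: [a-z], apostrophe "'", "°" for the
# abbreviation period, and "~" for the hyphen.
WORD_RE = re.compile(r"[a-z'°~]+")

def tokenize(line):
    """Lowercase a line and extract maximal runs of word characters;
    everything else (punctuation, digits, "&") is dropped."""
    return WORD_RE.findall(line.lower())

# Illustrative line, not a real Culpeper sentence:
assert tokenize("Agrimony groweth° up ever~green") == \
    ["agrimony", "groweth°", "up", "ever~green"]
```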
Hope it helps, --stolfi
07-11-2025, 12:46 AM
(06-11-2025, 10:36 AM)Jorge_Stolfi Wrote: A millennium ago, Jacques Guy wrote such a generator, which he called "monkey". Alas I can find neither the program nor its output. Maybe some other old-timer kept a copy?
I had a copy for quite a while, but no longer. I have also played with my own:
[link]
The 'fun' results are in Annex A.