Hello Voynich researchers,
I would like to share a structural hypothesis regarding the generation of Voynichese text. Inspired by linguistic modeling and probabilistic generation, I propose that the author may have used a mental or mechanical system akin to “dice” categorized by grammatical roles—such as conjunctions, nouns, and verbs—to construct each word or phrase.
Core Idea
Each grammatical category (e.g., conjunction, noun, verb) corresponds to a pool of syllables or word fragments. The author may have randomly selected one item from each pool to form a structured word. This method would produce words with consistent internal structure while allowing for variation and creativity.
Experimental Simulation
Using a simple model, I generated words by randomly combining syllables from three pools:
Conjunctions: qo, re, da, ai, ke, zu
Nouns: ched, dain, ol, shar, tor, lem, vok
Verbs: chy, dy, kar, mor, tin, sek, ral
These words resemble Voynichese in terms of length, internal structure, and syllabic repetition.
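For reproducibility, here is a minimal sketch of such a generator in Python (the fixed conjunction + noun + verb order is only one possible assembly rule; the pools could also be weighted or made optional):

Code:
import random

# The three "dice": pools of word fragments by grammatical role.
conjunctions = ["qo", "re", "da", "ai", "ke", "zu"]
nouns        = ["ched", "dain", "ol", "shar", "tor", "lem", "vok"]
verbs        = ["chy", "dy", "kar", "mor", "tin", "sek", "ral"]

def roll_word():
    """Draw one fragment from each pool, in a fixed role order."""
    return random.choice(conjunctions) + random.choice(nouns) + random.choice(verbs)

# Print a sample "line" of ten generated words.
print(" ".join(roll_word() for _ in range(10)))

Each run yields words like qocheddy or zutorral, which share the same internal skeleton while varying freely.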
Related Research
This hypothesis aligns with and extends ideas from:
Torsten Timm’s grid-based word generation model
Luis Acedo’s Hidden Markov Model analysis
Claire Bowern’s linguistic structure studies
Implications
If valid, this model suggests that Voynichese may be a structured pseudo-language generated through a creative but systematic process. It bridges randomness and grammar, and may reflect a cognitive or artistic experiment rather than a natural language.
Note: I am Japanese and not a native English speaker. Due to language limitations, my replies may be slow or imperfect, but I will do my best to respond thoughtfully. I appreciate your understanding and welcome any feedback, questions, or suggestions.
Thank you very much.
While working on the idea of my You are not allowed to view links. Register or Login to view. I tried a small exercise. Imagine that the Voynich is a positional substitution cipher, where each position in a word is encoded independently.
What I did was to take the Voynich tokens in EVA transliteration, look at them position by position, and record the distribution of characters. But I deliberately ignored which character it was. In other words: at position 1 we might have 40% of one character, 30% of another, 20% of a third, and so on. Just the shape of the distribution, not the labels.
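In code, the "shape only" idea looks roughly like this (a minimal sketch; the distance function shown, a mean absolute difference between sorted profiles, is a simplified stand-in, not necessarily the exact metric used for the table below):

Code:
from collections import Counter

def positional_shapes(tokens, max_pos=8):
    """For each character position, the frequency profile sorted in
    descending order: the shape of the distribution without the labels."""
    shapes = []
    for i in range(max_pos):
        counts = Counter(t[i] for t in tokens if len(t) > i)
        total = sum(counts.values())
        shapes.append(sorted((c / total for c in counts.values()), reverse=True))
    return shapes

def shape_distance(a, b):
    """Mean absolute difference between two lists of sorted profiles,
    padding the shorter profile of each pair with zeros."""
    d = 0.0
    for pa, pb in zip(a, b):
        n = max(len(pa), len(pb))
        pa = pa + [0.0] * (n - len(pa))
        pb = pb + [0.0] * (n - len(pb))
        d += sum(abs(x - y) for x, y in zip(pa, pb))
    return d / len(a)

Running this on the Voynich tokens and on each comparison corpus, then taking the distance of each corpus to the Voynich, gives a ranking like the one below.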
Then I repeated the same procedure for several different languages. My working assumption is that if the corpora are large enough, the positional distributions should be similar within texts of the same language. Here is the result I got from the texts I currently have available:
Code:
Corpus                                        Distance    Tokens
Alchemical herbal (Latin)                        0.327      6,536
De Docta Ignorantia (Latin)                      0.374     37,121
Tirant lo Blanc (Catalan)                        0.395    419,309
La Reine Margot (French)                         0.396    112,803
Ambrosius Mediolanensis (Latin)                  0.402    117,734
El Lazarillo de Tormes (Spanish)                 0.403     20,060
Simplicius Simplicissimus (German)               0.415    189,804
Romeo and Juliet (English)                       0.451     24,822
The English Physician (Culpeper) (English)       0.460    135,362
So what does this mean? In this experiment the texts that came out closest to the Voynich were in Latin (especially the “Alchemical herbal” and “De Docta Ignorantia”), followed by Catalan, French, and Spanish. German and English were clearly further away.
Of course this does not prove the language of the Voynich, but it is interesting that the nearest matches are all Romance or Latin texts, and the Germanic ones sit lower down the ranking. It suggests that, at least under this positional-distribution approach, the Voynich behaves more like Romance/Latin than like Germanic languages.
Note: I used the "Alchemical herbal" transliteration from Marco Ponzi and the German Simplicius Simplicissimus version from Jorge Stolfi.
Someone sometime posted an image with a list of the 10 Arabic digits from a manuscript, which had two variants for "5", both with a tilde or macron above. Rene conjectured that the tildes could be two "v"s standing for "vel .. vel" (= "either ... or").
But on the bottom right corner of page You are not allowed to view links. Register or Login to view. there is a "mysterious symbol" that looks very much like one of those two "5"s, complete with the tilde/macron.
So perhaps those macrons are just meant to avoid confusion with the letter "y", and the "mysterious symbol" on You are not allowed to view links. Register or Login to view. is an original quire or section number, applied before the bifolios were scrambled, flipped, and re-quired.
All the best, --jorge.
[KG: I could not find the post with those digits. If this post is out-of-topic for this sub-forum, feel free to move it.]
I have published my research on script directionality for review on arXiv. I would like to thank the many people who pointed me to the Gini index paper by Winstead. I have shown how and why it doesn't apply to the Voynich, but it's quite useful nonetheless to show how far the Voynich stands from an Indo-European language.
Here is the link to the preprint article:
You are not allowed to view links. Register or Login to view.
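For readers who have not met it: the Gini index summarizes how unevenly a distribution is concentrated, from 0 (perfectly uniform) to 1 (all mass on a single item). A bare-bones sketch of the statistic itself (how it is applied to directionality is detailed in the paper; this is only the raw coefficient):

Code:
def gini(frequencies):
    """Gini coefficient of a list of non-negative counts or probabilities."""
    xs = sorted(frequencies)
    n, total = len(xs), sum(xs)
    # Standard formula: G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n
    cum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * cum / (n * total) - (n + 1) / n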
As I was running some promising yet unsuccessful attempts to decode the Voynich MS text, I ran into the question: how difficult will it be to detect being 'close to' the solution? I decided to try something I had been thinking of on several occasions:
Take a known plain text, and substitute vowels with other vowels and consonants with other consonants. Will the result be recognisable? To try this out, I took a part of Dante's Inferno, and changed the vowels and consonants from the Italian frequency distribution to the Latin frequency distribution. For these distributions, I used the ones obtained empirically on this page: You are not allowed to view links. Register or Login to view.
for the Dante and Mattioli source texts. Following is the conversion table:
Code:
E A I O U N R L T S C D M P V G H F B Q Z X J K
i e a u o t s r n m c l d p q b v f g h x y z k
Note that the Italian text had two fewer consonants than the Latin text, so I added J and K to make them equal.
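In Python, the whole conversion is a single translation table (a minimal sketch; any character outside these 24 letters passes through unchanged):

Code:
# Top row of the table above (Italian letters, by frequency) maps to the
# bottom row (Latin letters, by frequency).
table = str.maketrans("EAIOUNRLTSCDMPVGHFBQZXJK",
                      "ieauotsrnmcldpqbvfghxyzk")

def convert(text):
    return text.upper().translate(table)

print(convert("Nel mezzo del cammin di nostra vita"))
# -> "tir dixxu lir ceddat la tumnse qane"

The printed line matches the first verse of the sample quoted below.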
Surprisingly, even though Italian and Latin are closely related, the conversion completely mixes up both vowels and consonants.
The two questions I had were:
- would the text be somehow recognisable?
- would the text show any indication of being meaningful?
The first is a definite no. The second is a bit more subjective, but I would argue the answer is 'rather not'.
As the text is a fully grammatical known text with only simple substitution applied, and it largely follows a reasonable single character frequency distribution, this means that 'just looking' isn't sufficient to decide whether one is close to a solution or not.
A proper simple substitution solver (tool) should be used to test the result.
Here follows a moderately short part of the resulting text. If anyone wants a longer sample to play with, let me know.
Quote:tir dixxu lir ceddat la tumnse qane
da sansuqea pis ote mirqe umcose
cvi re lasanne qae ise mdessane
eva hoetnu e las hoer ise i cume lose
imne mirqe mirqebbae i empse i fusni
cvi tir pitmais satuqe re peose
netn i edese cvi pucu i pao dusni
de pis nsennes lir git cv a qa nsuqea
lasu li r ernsi cumi cv a q vu mcusni
au tut mu git salas cud a q atnsea
netn ise pait la muttu e hoir potnu
cvi re qiseci qae eggetlutea
de pua cv a foa er pai l ot curri baotnu
re luqi nisdateqe hoirre qerri
cvi d eqie la peose ar cus cudpotnu
boeslea at ernu i qala ri moi mperri
qimnani bae li sebba lir paetine
cvi dite lsannu ernsoa pis ubti cerri
errus fo re peose ot pucu hoine
cvi tir rebu lir cus d ise losene
re tunni cv a pemmea cut netne paine
i cudi hoia cvi cut rite effettene
When looking at the Voynich text it helps to separate different kinds of lines instead of treating the whole corpus as one block. A line at the start of a paragraph does not behave like a line in the middle, and a label line does not look like a body line. By checking the very first and last characters of each line, and comparing them with what is normal for the overall corpus, some clear patterns appear. Paragraph beginnings have their own favorite starters, body lines are more balanced, and lines outside of paragraphs follow yet another rule. What follows is a summary of these differences, line type by line type.
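The "JS" figures quoted below come from a Jensen-Shannon divergence between character distributions. A minimal sketch of that comparison, assuming lines arrive as plain EVA strings (the line collections named in the final comment are placeholders):

Code:
import math
from collections import Counter

def char_dist(lines, pos=0):
    """Distribution of the character at a position (0 = first, -1 = last)."""
    counts = Counter(line[pos] for line in lines if line)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two {char: prob} distributions."""
    keys = set(p) | set(q)
    m = {k: (p.get(k, 0.0) + q.get(k, 0.0)) / 2 for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return (kl(p) + kl(q)) / 2

# Example: first-character divergence of paragraph-initial lines vs. corpus.
# js_init = js_divergence(char_dist(par_initial_lines), char_dist(all_lines))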
Labels
These short label lines behave differently from running text.
Starts: They very often begin with o (about 50%). Compared with the overall corpus, that’s a strong tilt toward o. Against their own local word mix it’s only mildly unusual, but versus the full corpus it stands out.
Ends: They tend to end in y, with smaller bumps for m and d.
Summary: Labels have an o- opening habit and a fairly y-heavy ending, but otherwise don't diverge wildly from their own local vocabulary.
Paragraph-initial lines
This is where the strongest cue lives.
Starts: Paragraph-initial lines strongly favor p, t, k, and f. Those openings are over-represented both against their own local word stock and against the whole corpus. By contrast, starts like o, q, c, s are under-represented here.
Ends: Endings don’t separate these lines much, with the clear exception that m is noticeably more common at the very end.
Distance: The divergence at the first character is high (JS_init ≈ 0.38 vs. local, 0.52 vs. global): the very first character of these lines is much less like the rest of the text.
Summary: If you see a line starting with p, t, k, or f, odds are it's the first line of a paragraph. This entry code is the sharpest positional signal in the manuscript.
Body lines (within paragraphs)
These are ordinary running lines within the paragraph (not the first line, nor the last line).
Starts: A mixed, stable recipe: d, s, y, q, o dominate, with c notably lower than in the general word stock.
Ends: y is slightly lower than global norms, while m at line end is clearly higher than we’d expect from the local body vocabulary.
Summary: This is the baseline flow of the text: predictable starts and the familiar y, n, l, r, m mix at the end, with m a recurring tail.
Last lines of paragraphs
These are the final lines of paragraphs, broadly similar to the body lines.
Starts: Again y, s, d, o, q; c is much lower than its local/global baselines.
Ends: y and n are higher; l and r are lower. Log-odds confirm m, y, g are over-represented at the very end, while l, r, o are under-represented.
Distance: Modest at the start (JS_init ~ 0.14), low at the end (~0.048).
Summary: Another "normal text" profile; very close to the first body set (the intra-paragraph lines), with slightly stronger y at both ends.
Non-paragraph lines (outside any paragraph)
Standalone or detached lines show a different entry behavior.
Starts: A very strong bias to o- (about 66% of line starts). Against the whole corpus this is striking, though against their own local word mix it's milder.
Ends: y is high (≈ 47%), and m is also above baseline.
Distance: Start vs global is notable (JS_init ≈ 0.21), while start vs. local and ends are relatively low.
Summary: These lines have an o-anchor at the very start; the second character contributes far less than the first.
The body of the text (lines within and ending paragraphs), taken together, makes up about 72% of all words. Its behaviour is fairly steady: lines usually begin with d, s, y, q, or o, and they usually end with y, n, m, l, or r. Some small rules repeat across the text, such as d being followed by a, c, s, or o at the start of words, and l being followed by a or o at the very end of a line. These habits stay in place regardless of which type of body line you look at.
Looking at the manuscript as a whole, the sharpest divide appears right at the first character of each line. Lines that begin a paragraph usually open with a small set of markers such as p, t, k, or f. Lines that stand outside any paragraph, by contrast, almost always start with o. Once the line is underway and you move inside the word, the contrasts between line types are still there, but they are far less pronounced than the jolt that comes at the very first step.
There are significant and obvious statistical differences between the "head" lines of parags and other "body" lines, in terms of word, character, and digraph distributions.
The most obvious is the preponderance of "puffs" (the p and f gallows) on head lines as opposed to "tikes" (the t and k gallows). There are reasons to assume that this difference, in particular, is the result of some "puffing transformation" applied by the Scribe to the head lines, in order to mark them as such -- which apparently was a not uncommon scribal habit at the time. The occurrence of puffs in scattered words and phrases within parag bodies is presumably the result of the Scribe applying a similar transformation to them-- possibly to indicate emphasis, respect, proper nouns, etc. These transformations would be loosely analogous (in spirit, not in detail or use) to our habits of capitalizing most words in titles and capitalizing proper names.
These "puffing" transformations obviously imply the replacement of some tikes by puffs, but apparently not in a trivial 1-for-1 way. It has been conjectured that each puff may stand for some combination of a tike and some adjacent glyph, like e or Ch. The actual replacement rules may be non-deterministic, or depend on local context (like our rule of not capitalizing articles and prepositions), and may not be invertible (just as when we map both "rocky mountains" and "Rocky Mountains" to "Rocky Mountains").
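For concreteness, here is a toy version of such a transform in Python (the specific rules, t+e -> p and k+e -> f, are invented for illustration only, not a claim about the actual mapping):

Code:
import random

# Invented example rules: a puff stands for a tike plus an adjacent e.
RULES = {"te": "p", "ke": "f"}

def puff_word(word, prob=0.8):
    """Apply the rules left to right, each time with probability prob,
    so the transform is non-deterministic and not invertible."""
    out, i = "", 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in RULES and random.random() < prob:
            out += RULES[pair]
            i += 2
        else:
            out += word[i]
            i += 1
    return out

def puff_head_line(line):
    return " ".join(puff_word(w) for w in line.split())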
However, it seems that significant statistical differences between head and body lines persist even after one accounts for such a puffing transformation.
It has been conjectured that these residual differences are due to the fact that the paragraphs in many (if not all) sections generally tend to follow a fixed formula. For instance, a herbal paragraph is expected to start with the name(s) of the plant, then include a list of conditions that can be treated with it, where the plant grows, instructions on how to prepare it, dosages, etc., generally in a certain order. As a result, the word distribution should change significantly depending on the position within the parag. And since the frequency of a glyph or glyph group is dominated by its occurrence in the most common words, these frequencies too should be dependent on position.
I just tested this conjecture using the transcription and translation of the You are not allowed to view links. Register or Login to view. published by Marco Ponzi (@MarcoP). It has 95 herbs, all but one with a single paragraph of text, and ~6700 Latin words total (which is ~1/5 of the VMS). AFAIK that file does not mark the original line breaks, so it was not possible to identify the head lines. As a substitute, I extracted three sets of 12 words from each parag: from the beginning (including the plant's name), from around the middle, and from the end of the parag.
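The extraction itself is simple; a minimal sketch in Python (the alignment of the middle window is an arbitrary detail):

Code:
from collections import Counter

def position_subsets(paragraphs, k=12):
    """From each parag (a list of words), take k words from the beginning,
    k around the middle, and k from the end; accumulate counts per subset."""
    begin, middle, end = Counter(), Counter(), Counter()
    for words in paragraphs:
        start = max(0, len(words) // 2 - k // 2)
        begin.update(words[:k])
        middle.update(words[start:start + k])
        end.update(words[-k:])
    return begin, middle, end

def freq_table(counter, top=10):
    """Rows of (count, frequency, item), as in the tables below."""
    total = sum(counter.values())
    return [(c, c / total, w) for w, c in counter.most_common(top)]

Here are the results, with three columns (count, frequency, item) for each subset: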
Latin text:
Code:
(columns: start of parag | middle | end of parag)

== words ==
116 0.10329 herba    | 106 0.09439 et      | 115 0.10240 et
 71 0.06322 ad       |  38 0.03384 de      |  69 0.06144 in
 54 0.04809 accipe   |  30 0.02671 item    |  41 0.03651 nascitur
 45 0.04007 sanandum |  26 0.02315 herbe   |  21 0.01870 subito
 39 0.03473 de       |  25 0.02226 accipe  |  20 0.01781 de
 37 0.03295 et       |  25 0.02226 si      |  17 0.01514 per
 33 0.02939 si       |  23 0.02048 istius  |  16 0.01425 herba
 22 0.01959 ista     |  22 0.01959 ad      |  15 0.01336 est
 22 0.01959 istius   |  22 0.01959 in      |  15 0.01336 montibus
 20 0.01781 quis     |  22 0.01959 ista    |  14 0.01247 dierum

== chars ==
856 0.14142 a | 753 0.13639 e | 709 0.12101 i
676 0.11168 e | 580 0.10505 i | 620 0.10582 e
559 0.09235 i | 513 0.09292 t | 544 0.09285 t
436 0.07203 s | 501 0.09074 a | 492 0.08397 s
413 0.06823 r | 397 0.07191 u | 467 0.07971 a
362 0.05981 u | 379 0.06865 s | 452 0.07715 u
341 0.05634 m | 346 0.06267 r | 445 0.07595 r
334 0.05518 t | 313 0.05669 m | 350 0.05974 n
318 0.05254 n | 250 0.04528 n | 292 0.04984 o
261 0.04312 c | 238 0.04311 o | 259 0.04421 m

== char pairs ==
264 0.03679 a. | 212 0.03191 m. | 199 0.02850 t.
228 0.03177 m. | 199 0.02995 t. | 185 0.02650 s.
210 0.02926 er | 190 0.02860 e. | 156 0.02234 et
185 0.02578 .a | 165 0.02483 er | 151 0.02163 .e
158 0.02202 s. | 155 0.02333 .e | 147 0.02105 .s
157 0.02188 .h | 149 0.02243 et | 145 0.02077 r.
156 0.02174 e. | 134 0.02017 .s | 142 0.02034 tu
139 0.01937 rb | 128 0.01927 .i | 139 0.01991 er
139 0.01937 um | 128 0.01927 a. | 129 0.01848 ur
138 0.01923 an | 123 0.01851 s. | 125 0.01790 is
English:
Code:
(columns: start of parag | middle | end of parag)

== words ==
67 0.05852 for     | 74 0.06463 and   | 77 0.06725 and
66 0.05764 the     | 66 0.05764 the   | 71 0.06201 the
62 0.05415 take    | 43 0.03755 it    | 63 0.05502 it
45 0.03930 healing | 42 0.03668 this  | 61 0.05328 in
44 0.03843 of      | 41 0.03581 herb  | 43 0.03755 will
44 0.03843 this    | 33 0.02882 of    | 41 0.03581 be
38 0.03319 and     | 26 0.02271 will  | 39 0.03406 grows
37 0.03231 herb    | 21 0.01834 also  | 28 0.02445 healed
33 0.02882 a       | 21 0.01834 be    | 23 0.02009 they
29 0.02533 or      | 20 0.01747 in    | 22 0.01921 is

== chars ==
644 0.12485 e | 611 0.13309 e | 608 0.12528 e
513 0.09946 a | 418 0.09105 t | 437 0.09005 i
445 0.08627 o | 370 0.08059 a | 415 0.08551 t
419 0.08123 t | 369 0.08037 i | 388 0.07995 n
401 0.07774 i | 348 0.07580 o | 361 0.07439 a
371 0.07193 n | 324 0.07057 n | 332 0.06841 o
341 0.06611 r | 303 0.06600 h | 284 0.05852 r
317 0.06146 s | 242 0.05271 r | 272 0.05605 s
311 0.06029 h | 233 0.05075 l | 266 0.05481 d
248 0.04808 l | 231 0.05032 s | 252 0.05193 h

== char pairs ==
232 0.03681 .t | 224 0.03905 e. | 183 0.03051 e.
230 0.03649 e. | 200 0.03487 .t | 183 0.03051 s.
193 0.03062 he | 185 0.03225 th | 174 0.02901 .i
173 0.02745 s. | 183 0.03190 he | 174 0.02901 .t
157 0.02491 th | 161 0.02807 .a | 173 0.02884 d.
131 0.02078 r. | 149 0.02598 d. | 161 0.02684 he
125 0.01983 .a | 128 0.02232 s. | 160 0.02668 n.
121 0.01920 .h | 116 0.02022 .i | 150 0.02501 th
118 0.01872 or | 107 0.01865 an | 149 0.02484 in
113 0.01793 in |  99 0.01726 t. | 108 0.01801 .a
So it seems that, indeed, in a formulaic text like a herbal all three statistics can vary a lot between the start, middle, and end of parags.
I’m new here and excited to share a fresh perspective on the Voynich Manuscript that I’ve been developing. I’m posting in the hope of peer review, constructive critique, and discussion.
My main premise is that the Voynich is not a ciphered natural language or random nonsense. Instead, it functions as a symbolic operator system — a structured framework of actions and processes, encoded in both glyphs and imagery.
Rather than treating the text as phonetic writing, I approach it as a grammar of symbolic functions (operators). This means the glyphs are not “letters” in the traditional sense, but instructions that align with alchemical, cosmological, and Hermetic traditions. For example:
Certain glyphs consistently map to processes like dissolve, bind, seal, or circulate.
Images (plants, roots, leaves, zodiac wheels, bathing figures) provide visual overrides that reinforce or correct the operator sequence.
The manuscript follows a sevenfold cycle that mirrors the stages of the Opus Magnum (calcination, dissolution, separation, conjunction, fermentation, distillation, coagulation).
I’ve built translation rules that allow for reproducible readings, and I’ve worked examples (like folio f1r) that yield structured alchemical instructions — consistent across passes.
This approach doesn’t “solve” the manuscript as a language, but rather offers a system that reveals it as a ritual–procedural text: part laboratory manual, part spiritual allegory.
I’d love to hear your thoughts. Specifically:
Does this model resonate with parallels you’ve seen in other alchemical or emblematic manuscripts?
What weaknesses do you see in interpreting glyphs as operators instead of phonemes?
Are there particular folios you’d recommend testing this framework against?
Thanks in advance for your feedback — I look forward to the discussion!
— Rob
You are not allowed to view links. Register or Login to view. transliteration
Page 1
1. The Work begins. Fire speaks the name again and again to awaken the vessel. The quality is impressed, and the measure is taken—twofold, threefold—to ensure none stray. Sulphur and Mercury are joined. Their image mirrored in the glass.
2. Shape the vessel beneath the sign of Saturn, and govern it by Time. Open the channels and let the waters flow. Dissolve the body, wash it. At each stage of dissolution, seal the work that none may escape.
3. Count again. What has risen? What remains? Where twin natures divide—yoke them. Where they wander—bind. Where they thin—multiply. Where they thicken—fix.
4. Turn the wheel through its triple states: dissolution, conjunction, coagulation. Each turn firmer than the last. When the liquor runs clear, reflect it back upon the body. When the tincture takes—mark it. When the weight is right—set it.
5. Let the vessel breathe, then close it. Let the heat rise, then settle. Let the fixed become volatile, and the volatile become fixed. Bind opposites in a single form. Reflect the pattern across the Zodiac.
6. Beneath Aries, awaken fire. Under Cancer, cool the waters. Mercury flows from east to west; Sulphur from above to below. Join them at the point of balance, and raise the vessel upon the Earth’s stillness.
7. Silver answers to the Moon, and Iron to Mars. Bind each to its planet, and temper them by weight and breath. Filter what rises, distil what clings. Let the twins speak once more, and call the measure whole.
Yes, it's AI, but I worked on this a fair bit. I saw the thread at the top that spoke about how AI dilutes everything; well, I think the opposite. Have a look.