There are significant and obvious statistical differences between the "head" lines of parags and other "body" lines, in terms of word, character, and digraph distributions.
The most obvious is the preponderance of "puffs" (the p and f gallows) on head lines, as opposed to "tikes" (the t and k gallows). There are reasons to assume that this difference, in particular, is the result of some "puffing transformation" applied by the Scribe to the head lines in order to mark them as such -- apparently a not uncommon scribal habit at the time. The occurrence of puffs in scattered words and phrases within parag bodies is presumably the result of the Scribe applying a similar transformation to them -- possibly to indicate emphasis, respect, proper nouns, etc. These transformations would be loosely analogous (in spirit, not in detail or use) to our habits of capitalizing most words in titles and capitalizing proper names.
These "puffing" transformations obviously imply the replacement of some tikes by puffs, but apparently not in a trivial 1-for-1 way. It has been conjectured that each puff may stand for some combination of a tike and some adjacent glyph, like e or Ch. The actual replacement rules may be non-deterministic, or depend on local context (like out rule of not capitalizing articles and prepositions), and may not be invertible (just like when we map both "rocky mountains" and "Rocky Mountains" to "Rocky Mountains")
However, it seems that significant statistical differences between head and body lines persist even after one accounts for such a puffing transformation.
It has been conjectured that these residual differences are due to the fact that the paragraphs in many (if not all) sections generally tend to follow a fixed formula. For instance, a herbal paragraph is expected to start with the name(s) of the plant, then include a list of conditions that can be treated with it, where the plant grows, instructions on how to prepare it, dosages, etc., generally in a certain order. As a result, the word distribution should change significantly depending on the position within the parag. And since the frequency of a glyph or glyph group is dominated by its occurrence in the most common words, these frequencies too should be dependent on position.
I just tested this conjecture using the transcription and translation of the herbal manuscript [link] published by Marco Ponzi (@MarcoP). It has 95 herbs, all but one with a single paragraph of text, and ~6700 Latin words total (which is ~1/5 of the VMS). AFAIK that file does not mark the original line breaks, so it was not possible to identify the head lines. As a substitute, I extracted 12 words from each parag: from the beginning (including the plant's name), from around the middle, and from the end of the parag. Here are the results, with three columns (count, frequency, item) for each subset:
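In case anyone wants to replicate the counting, here is a minimal Python sketch of the procedure; `paragraphs` stands for the plain-text parags of the herbal (prepared elsewhere), and '.' marks a word boundary, as in the pair tables below:

[code]
from collections import Counter

def subsets(paragraph, k=12):
    """Return three k-word samples: start, middle, and end of a parag."""
    words = paragraph.lower().split()
    mid = len(words) // 2
    return words[:k], words[mid - k // 2 : mid + k // 2], words[-k:]

def stats(words):
    """Word, character, and character-pair counts ('.' marks a word boundary)."""
    text = "." + ".".join(words) + "."
    return (Counter(words),
            Counter(c for c in text if c != "."),
            Counter(text[i:i + 2] for i in range(len(text) - 1)))

# `paragraphs` is assumed to hold the plain-text parags of the herbal.
totals = [[Counter(), Counter(), Counter()] for _ in range(3)]  # start/mid/end
for par in paragraphs:
    for pos, sample in enumerate(subsets(par)):
        for counter, new in zip(totals[pos], stats(sample)):
            counter.update(new)
[/code]

Dividing each count by the subset's total then gives the frequency columns shown below.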
Latin text:
== words == (start | middle | end)
116 0.10329 herba | 106 0.09439 et | 115 0.10240 et
71 0.06322 ad | 38 0.03384 de | 69 0.06144 in
54 0.04809 accipe | 30 0.02671 item | 41 0.03651 nascitur
45 0.04007 sanandum | 26 0.02315 herbe | 21 0.01870 subito
39 0.03473 de | 25 0.02226 accipe | 20 0.01781 de
37 0.03295 et | 25 0.02226 si | 17 0.01514 per
33 0.02939 si | 23 0.02048 istius | 16 0.01425 herba
22 0.01959 ista | 22 0.01959 ad | 15 0.01336 est
22 0.01959 istius | 22 0.01959 in | 15 0.01336 montibus
20 0.01781 quis | 22 0.01959 ista | 14 0.01247 dierum
== chars == (start | middle | end)
856 0.14142 a | 753 0.13639 e | 709 0.12101 i
676 0.11168 e | 580 0.10505 i | 620 0.10582 e
559 0.09235 i | 513 0.09292 t | 544 0.09285 t
436 0.07203 s | 501 0.09074 a | 492 0.08397 s
413 0.06823 r | 397 0.07191 u | 467 0.07971 a
362 0.05981 u | 379 0.06865 s | 452 0.07715 u
341 0.05634 m | 346 0.06267 r | 445 0.07595 r
334 0.05518 t | 313 0.05669 m | 350 0.05974 n
318 0.05254 n | 250 0.04528 n | 292 0.04984 o
261 0.04312 c | 238 0.04311 o | 259 0.04421 m
== char pairs == (start | middle | end; '.' = word boundary)
264 0.03679 a. | 212 0.03191 m. | 199 0.02850 t.
228 0.03177 m. | 199 0.02995 t. | 185 0.02650 s.
210 0.02926 er | 190 0.02860 e. | 156 0.02234 et
185 0.02578 .a | 165 0.02483 er | 151 0.02163 .e
158 0.02202 s. | 155 0.02333 .e | 147 0.02105 .s
157 0.02188 .h | 149 0.02243 et | 145 0.02077 r.
156 0.02174 e. | 134 0.02017 .s | 142 0.02034 tu
139 0.01937 rb | 128 0.01927 .i | 139 0.01991 er
139 0.01937 um | 128 0.01927 a. | 129 0.01848 ur
138 0.01923 an | 123 0.01851 s. | 125 0.01790 is
English:
== words == (start | middle | end)
67 0.05852 for | 74 0.06463 and | 77 0.06725 and
66 0.05764 the | 66 0.05764 the | 71 0.06201 the
62 0.05415 take | 43 0.03755 it | 63 0.05502 it
45 0.03930 healing | 42 0.03668 this | 61 0.05328 in
44 0.03843 of | 41 0.03581 herb | 43 0.03755 will
44 0.03843 this | 33 0.02882 of | 41 0.03581 be
38 0.03319 and | 26 0.02271 will | 39 0.03406 grows
37 0.03231 herb | 21 0.01834 also | 28 0.02445 healed
33 0.02882 a | 21 0.01834 be | 23 0.02009 they
29 0.02533 or | 20 0.01747 in | 22 0.01921 is
== chars == (start | middle | end)
644 0.12485 e | 611 0.13309 e | 608 0.12528 e
513 0.09946 a | 418 0.09105 t | 437 0.09005 i
445 0.08627 o | 370 0.08059 a | 415 0.08551 t
419 0.08123 t | 369 0.08037 i | 388 0.07995 n
401 0.07774 i | 348 0.07580 o | 361 0.07439 a
371 0.07193 n | 324 0.07057 n | 332 0.06841 o
341 0.06611 r | 303 0.06600 h | 284 0.05852 r
317 0.06146 s | 242 0.05271 r | 272 0.05605 s
311 0.06029 h | 233 0.05075 l | 266 0.05481 d
248 0.04808 l | 231 0.05032 s | 252 0.05193 h
== char pairs == (start | middle | end; '.' = word boundary)
232 0.03681 .t | 224 0.03905 e. | 183 0.03051 e.
230 0.03649 e. | 200 0.03487 .t | 183 0.03051 s.
193 0.03062 he | 185 0.03225 th | 174 0.02901 .i
173 0.02745 s. | 183 0.03190 he | 174 0.02901 .t
157 0.02491 th | 161 0.02807 .a | 173 0.02884 d.
131 0.02078 r. | 149 0.02598 d. | 161 0.02684 he
125 0.01983 .a | 128 0.02232 s. | 160 0.02668 n.
121 0.01920 .h | 116 0.02022 .i | 150 0.02501 th
118 0.01872 or | 107 0.01865 an | 149 0.02484 in
113 0.01793 in | 99 0.01726 t. | 108 0.01801 .a
So it seems that, indeed, in a formulaic text like a herbal all three statistics can vary a lot between the start, middle, and end of parags.
I’m new here and excited to share a fresh perspective on the Voynich Manuscript that I’ve been developing. I’m posting in the hope of peer review, constructive critique, and discussion.
My main premise is that the Voynich is not a ciphered natural language or random nonsense. Instead, it functions as a symbolic operator system — a structured framework of actions and processes, encoded in both glyphs and imagery.
Rather than treating the text as phonetic writing, I approach it as a grammar of symbolic functions (operators). This means the glyphs are not “letters” in the traditional sense, but instructions that align with alchemical, cosmological, and Hermetic traditions. For example:
Certain glyphs consistently map to processes like dissolve, bind, seal, or circulate.
Images (plants, roots, leaves, zodiac wheels, bathing figures) provide visual overrides that reinforce or correct the operator sequence.
The manuscript follows a sevenfold cycle that mirrors the stages of the Opus Magnum (calcination, dissolution, separation, conjunction, fermentation, distillation, coagulation).
I’ve built translation rules that allow for reproducible readings, and I’ve worked examples (like folio f1r) that yield structured alchemical instructions — consistent across passes.
This approach doesn’t “solve” the manuscript as a language, but rather offers a system that reveals it as a ritual–procedural text: part laboratory manual, part spiritual allegory.
I’d love to hear your thoughts. Specifically:
Does this model resonate with parallels you’ve seen in other alchemical or emblematic manuscripts?
What weaknesses do you see in interpreting glyphs as operators instead of phonemes?
Are there particular folios you’d recommend testing this framework against?
Thanks in advance for your feedback — I look forward to the discussion!
— Rob
[link] transliteration
Page 1
1. The Work begins. Fire speaks the name again and again to awaken the vessel. The quality is impressed, and the measure is taken—twofold, threefold—to ensure none stray. Sulphur and Mercury are joined. Their image mirrored in the glass.
2. Shape the vessel beneath the sign of Saturn, and govern it by Time. Open the channels and let the waters flow. Dissolve the body, wash it. At each stage of dissolution, seal the work that none may escape.
3. Count again. What has risen? What remains? Where twin natures divide—yoke them. Where they wander—bind. Where they thin—multiply. Where they thicken—fix.
4. Turn the wheel through its triple states: dissolution, conjunction, coagulation. Each turn firmer than the last. When the liquor runs clear, reflect it back upon the body. When the tincture takes—mark it. When the weight is right—set it.
5. Let the vessel breathe, then close it. Let the heat rise, then settle. Let the fixed become volatile, and the volatile become fixed. Bind opposites in a single form. Reflect the pattern across the Zodiac.
6. Beneath Aries, awaken fire. Under Cancer, cool the waters. Mercury flows from east to west; Sulphur from above to below. Join them at the point of balance, and raise the vessel upon the Earth’s stillness.
7. Silver answers to the Moon, and Iron to Mars. Bind each to its planet, and temper them by weight and breath. Filter what rises, distil what clings. Let the twins speak once more, and call the measure whole.
Yes, it's AI, but I worked on this a fair bit. I saw the thread at the top about how AI dilutes everything; well, I think the opposite. Have a look.
I propose here a single-mode positional substitution cipher that
sharply reduces entropy at n=2
decays only mildly for n>2, matching a key distributional signature of Voynich text (see [link])
preserves original language token lengths
works programmatically on any language
the method is quill-and-paper friendly, so it could have been used in the 15th century (or even before)
... but it still can't decipher the text, not yet...
It consists of just a position-by-position substitution table, plus a tiny ‘residuals’ note to resolve cases where one cipher glyph covers more than one letter (for example, ‘ch’ at position 1, see table below).
I call it Positional Mimic Cipher (PM-Cipher). It is a single-mode, position-by-position substitution using the Voynich EVA alphabet, with small per-position priority lists (‘residuals’) tweaked to imitate Voynich-style statistics. In practice, on a natural language sample the cipher yields typical figures around r≈0.90 for graphemes, r≈0.50 for in-word bigrams, H₂≈2.80, and length/Zipf shapes close to VMS B* while keeping the median token length at 4-5.
My design goals were, as said, a strong entropy drop at n=2 and only a mild additional drop for n>2. Since Voynich token lengths look like natural language, I thought it would be interesting to keep the natural-language token length, and the cipher does. It should also mimic the overall grapheme/bigram shape, measured by correlations (how similarly two frequency patterns rise and fall), Jensen-Shannon divergence (how far apart two probability distributions are), and the Zipf curve (similar slope and curvature). Oh, and the last goal was a must: fully doable with period materials, just paper and something to write on it.
How to encrypt (quill & paper):
Take the plaintext as Latin letters (any language).
For the k-th letter in a word, look up its Voynich substitution in the table's column p[k] (see the table below).
Write the EVA grapheme you see there (sometimes multi-grapheme, e.g., ch).
If a table cell produces a multi-grapheme EVA that can come from different Latin letters, check a tiny “residuals” note for that position to pick the intended original (this only matters when decoding; for encoding you simply follow the table).
How to decrypt:
Read each EVA grapheme by position.
Use the same table, but in reverse (per position).
If an EVA entry maps to several Latin letters, consult the residuals note for that position to resolve it deterministically.
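To make the procedure concrete, here is a minimal Python sketch of the encode/decode loop. The table entries and residual logic below are toy placeholders of my own, not the real table:

[code]
# Toy positional table: TABLE[k][latin_letter] -> EVA grapheme at word position k+1.
# (Placeholder entries only; the real table is the one shown below.)
TABLE = [
    {"a": "ch", "c": "ch", "u": "a", "t": "qo", "e": "o"},  # position 1
    {"a": "a", "u": "e", "t": "t", "e": "y", "c": "k"},     # position 2 (and later)
]

def encrypt_word(word):
    """Encode one word; returns (EVA ciphertext, residual digits)."""
    out, res = [], []
    for k, letter in enumerate(word):
        table = TABLE[min(k, len(TABLE) - 1)]  # reuse the last column past the end
        glyph = table.get(letter, letter)      # unmapped letters pass through
        sources = [l for l, g in table.items() if g == glyph]
        # A residual digit only matters where the mapping is many-to-one.
        res.append(str(sources.index(letter)) if letter in sources else "0")
        out.append(glyph)
    return "".join(out), "".join(res)

def decrypt_word(graphemes, residuals):
    """Decode a pre-segmented list of EVA graphemes using the residuals note."""
    out = []
    for k, (glyph, digit) in enumerate(zip(graphemes, residuals)):
        table = TABLE[min(k, len(TABLE) - 1)]
        sources = [l for l, g in table.items() if g == glyph]
        out.append(sources[int(digit)] if sources else glyph)
    return "".join(out)

print(encrypt_word("aut"))                    # -> ('chet', '000')
print(decrypt_word(["ch", "e", "t"], "000"))  # -> 'aut'
[/code]

Segmenting the EVA string back into graphemes (e.g., knowing that "ch" is one unit) is assumed here; on paper, that is exactly what the table plus the residuals slip disambiguate.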
Why does it match Voynich-like stats? Positionality lets us emulate within-word structure (beginnings and ends differ) without ballooning word length. A light "residuals" re-ordering step nudges unigram/bigram mass (e.g., boosting ey, ai, ii, ke and damping de, el, ry, da) while preserving token length and Zipf shape. The net effect: a large entropy drop at n=2 but a shallow slope for n>2, close to VMS behavior.
Materials & practicality:
One printed table (below).
A small residuals slip (one or two pages): per position, a short priority list where EVA⇄Latin is many-to-one (e.g., for ch at position 1). (That "residuals booklet" could easily have existed and been lost; it's small and personal.) Interesting note: the residuals could even be hidden in the MS text itself (a small point, a small symbol)... but I'm still searching for them.
Core positional table example (from De Docta Ignorantia - Nicolaus von Kues):
Examples
latin: Auferre, trucidare, rapere falsis nominibus res publica, atque ubi solitudinem faciunt, pacem appellant
PM-Ciphered: aeryssi qotedisemy larysi qoallid shoderanyr lal shedlile akkee aka choledylidyl qoacheyrs shachyl arryndenn
Residuals: 0101220 311000030 102121 100101 002010011 111 1100000 03110 100 20003101113 1000112 10013 012100024
Some plots. Note: most of these plots depend on the original natural-language text; for example, the distribution of token lengths (as the PM-Cipher leaves token lengths almost as they are originally) and the Zipf curves. But in the following plots you can see how the entropy behaves as it does in the MS (and perplexity, where we can see the bump even better):
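If anyone wants to reproduce the entropy curve on their own texts, here is a minimal sketch of the n-gram conditional entropy estimate (`ciphertext` is a placeholder for the PM-Ciphered text, with spaces kept as word separators):

[code]
import math
from collections import Counter

def conditional_entropy(text, n):
    """Estimate H(X_n | X_1..X_{n-1}) in bits from raw n-gram counts."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    total = sum(ngrams.values())
    return -sum((c / total) * math.log2(c / contexts[g[:-1]])
                for g, c in ngrams.items())

# The Voynich-like signature: a sharp drop from n=1 to n=2, then a shallow decay.
for n in (1, 2, 3, 4):
    print(n, round(conditional_entropy(ciphertext, n), 3))
[/code]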
Limitations, advantages & next steps
Residuals booklet is required for strict decoding where EVA is multi-grapheme; historically plausible but must be posited.
Possible to hide the residuals in small marks on the glyphs (or even in different glyph forms) -- under study
Cross-text validation across languages: the code adapts the cipher table and residuals to whatever language is given as input. The goal would be to exercise as many residuals as possible by playing with different languages
As said: it does not decipher the MS, but it is an easy-to-do cipher that fulfils the requirements (entropy, token lengths, etc.)
I am aware that the first lines of paragraphs are different from the rest of the text. We could build a separate cipher table just for those lines and make more gallows appear.
Happy to share it with you and discuss tests or alternative target profiles. If you think it is interesting enough, I would write and present a paper about it.
Rather than trying to decode the VMS outright, I think we can certainly answer other questions about the MS with the help of AI. For example, we should be able to decide with high probability whether the MS has real meaning or not. And I propose a system for doing that.
My own take: what best fits the evidence is that someone commissioned the VMS to be exactly what it is: a mysterious, impressive book that nobody can read because it contains no meaning. This is highly plausible and fits with the physical evidence, linguistic evidence and cultural evidence of the time. I believe it's a pseudo-text, made to look like a language, without actual meaning, for the purpose of impressing people, and I think AI can help us decide one way or another.
The question to explore is: what kind of system is this text most like?
The idea is to generate large sets of synthetic manuscripts under different assumptions and see which "universe" the Voynich statistically belongs to. For example:
Structured pseudo-languages (rules for prefixes/suffixes, but no meaning)
Shorthand/abbreviation systems treated as full glyph sets
Then we can measure each synthetic universe against a Voynich "fingerprint panel" (word shapes, entropy, Zipf's law, affix patterns, section differences, etc.). Rather than asking "what does it say?", this approach asks "what system is it most like?" If structured pseudo-language consistently fits better than ciphered Latin or conlang universes, that's powerful evidence.
This wouldn’t solve the translation, but it would be an important step in understanding the MS and it would be one box checked off.
Does this kind of “synthetic benchmarking” sound worth trying? Has anyone attempted something like this at scale?
Anyway, here's where AI did a lot of the work in building an outline for how the experiment might go with only off-the-shelf tools. The goal is to see which universe (ciphered language, real language, conlang, structured pseudo-language, shorthand/abbreviation, etc.) best reproduces the Voynich’s full statistical “fingerprint.”
No, I don't have expertise in this kind of research. I'm only seeing where AI can point to help us check off some boxes and let those with the expertise run with it.
1) Define the universes (generate many fakes)
Make 200–2,000 synthetic manuscripts, each matched in length and page/line structure to the VM. Each fake follows one hypothesis with tunable knobs:
Rewrap real Latin/Italian texts into Voynich-like lines/paragraphs and enforce a few layout quirks (e.g., frequent “q-” line-initial tokens) to probe the effect of mise-en-page alone.
prefix ∈ {qo, q, o, ch, …}, stem ∈ Σ* with Σ = {a, i, e, o, y}, suffix ∈ {dy, n, in, ain, aiin, …}
position-dependent variants (line-initial gets more “q”)
tunable affix productivity, stem entropy, and run-lengths (see the generator sketch after this list)
Also include a human-plausible generator: Markov/HMM with simple constraints to simulate “fast fluent scribbling.”
E. Shorthand/Abbreviation Universe
Start with Latin prose; compress it using a learned set of ~30–60 breviographs/abbreviations (e.g., -us, -rum, -que), then hide the mapping (treat the breviographs as glyphs). Vary aggressiveness.
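To make Universe D concrete, here is a throwaway generator sketch along the lines above; every knob value is a placeholder to be fitted, not a claim about the VM:

[code]
import random

PREFIXES = ["qo", "q", "o", "ch", ""]            # placeholder knob values
SUFFIXES = ["dy", "n", "in", "ain", "aiin", ""]
STEM_ALPHABET = "aieoy"

def make_word(line_initial=False):
    # Position-dependent variant: line-initial slots draw "q-" prefixes more often.
    pool = ["qo", "q"] * 3 + PREFIXES if line_initial else PREFIXES
    stem = "".join(random.choice(STEM_ALPHABET)
                   for _ in range(random.randint(1, 3)))
    return random.choice(pool) + stem + random.choice(SUFFIXES)

def make_line(n_words=8):
    return " ".join(make_word(line_initial=(i == 0)) for i in range(n_words))

random.seed(0)
print(make_line())  # one synthetic Voynich-like line, meaningless by construction
[/code]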
2) Build the Voynich “fingerprint panel”
Compute the same metrics for the true VM and for every synthetic manuscript:
Token/Type structure
Zipf slope & curvature; Heaps’ law α, K
Word-length distribution; KS/AD distance vs VM
Vocabulary growth by page/quire
Local dependencies
Char bigram/trigram distributions; JS divergence
Conditional entropy H(Xₙ|Xₙ₋₁) and H(wordₙ|wordₙ₋₁)
Mutual information vs distance (Hilberg-style curve)
Morphology & segmentation
Morfessor/BPE: number of subunits, affix productivity, stem/affix ratio
Family proliferation: counts of {dain, daiin, daiiin}-type ladders
n-gram perplexity (n=3..7) trained on one half, tested on the other
Tiny Transformer perplexity (character-level) trained on each universe, tested on VM (cross-perplexity)
Clustering/embedding
UMAP/t-SNE on character n-gram vectors; silhouette vs VM cluster
Rank-order correlation (Kendall τ) of frequency lists
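As a taste of the plumbing the panel needs, here are minimal sketches of two of these metrics (Zipf slope via least squares on log-log ranks, and bigram Jensen-Shannon divergence). These are my own sketches, not a reference implementation:

[code]
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log-frequency vs log-rank (needs >1 distinct freq)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (bits) between two bigram count distributions."""
    keys = set(p_counts) | set(q_counts)
    ps, qs = sum(p_counts.values()), sum(q_counts.values())
    p = {k: p_counts.get(k, 0) / ps for k in keys}
    q = {k: q_counts.get(k, 0) / qs for k in keys}
    m = {k: (p[k] + q[k]) / 2 for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)
    return (kl(p, m) + kl(q, m)) / 2
[/code]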
3) Scoring: which universe fits best?
Use multiple, complementary criteria:
Distance aggregation: Normalize each metric to z-scores, then compute a weighted composite distance of each synthetic to the VM. Rank universes by median distance.
Model selection: Approximate Bayesian Computation (ABC): treat generator knobs as priors, accept parameter settings whose synthetic stats fall within ε of the VM. Compare posterior mass across universes (see the sketch after this list).
Held-out validation: Fit knobs on half the VM; test distances on the other half (and per section).
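A rejection-ABC loop is simple enough to sketch in a few lines; `generate`, `fingerprint`, `distance`, and `prior_draw` are the pluggable pieces each universe defines (the names are mine, for illustration):

[code]
def abc_fit(generate, fingerprint, distance, vm_stats, prior_draw,
            n_draws=5000, eps=1.0):
    """Rejection-ABC: keep knob settings whose synthetic fingerprint lands
    within eps of the VM's fingerprint panel."""
    accepted = []
    for _ in range(n_draws):
        knobs = prior_draw()          # sample generator parameters from the prior
        synthetic = generate(knobs)   # build one synthetic manuscript
        if distance(fingerprint(synthetic), vm_stats) < eps:
            accepted.append(knobs)
    return accepted                   # approximate posterior sample of knobs
[/code]

Comparing acceptance rates (posterior mass) across universes is then the model-selection readout.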
4) Robustness checks
Ablations: remove line-position rules or suffix ladders—does the fit collapse?
Overfitting guard: ensure no generator is trained directly on VM tokens (only statistics), and verify generalization across sections.
Adversarial baseline: try to force ciphered Latin to match VM—if it still lags pseudo-language on multiple metrics, that’s strong evidence.
5) Lightweight Transformers (optional)
Train a tiny character-level Transformer LM on each universe and compare cross-perplexity on the VM (the same readout as in the fingerprint panel).
6) Workflow & timeline (lean team)
Week 1–2: Data wrangling (VM EVA, Latin/Italian corpora), page/line schema, metric code scaffolding
Week 3–6: Implement generators A–E; unit tests; produce first 500 synthetics
Week 7–8: Compute full fingerprint panel; initial ranking
Week 9–10: ABC fitting per universe; robustness/ablations
Week 11–12: Write-up, plots, release code & datasets (repro pack)
7) Readouts you can trust (what “success” looks like)
A league table: per-universe composite distance to VM (with error bars)
Posterior plots: which parameter regions (e.g., high suffix productivity, low stem entropy) best match VM
Confusion matrix from a classifier trained to tell universes apart using the fingerprint; if VM gets classified as “structured pseudo-language” with high confidence, that’s decisive.
8) “Citizen-science” version (solo, laptop-friendly)
Implement Universe D (pseudo-language) and Universe A(1) (mono-substitution over Latin).
Compute a mini fingerprint: Zipf slope, word-length KS, bigram JS, compression, Morfessor affix productivity.
Generate 100 synthetics for each universe; plot distance distributions vs VM.
If pseudo beats ciphered Latin on 4/5 metrics, you’ve got a publishable note.
9) Pitfalls & how to avoid them
Layout leakage: VM line/page structure matters—always replicate it in synthetics.
Cherry-picking metrics: pre-register the metric set; report all.
Over-tuning: do ABC on one half; evaluate on the other.
Section bias: score by section and overall; the winner should be consistent.
One of the things that intrigues me most about the Voynich manuscript is the poor quality of its drawings. Compared with other manuscripts I have seen from the 15th century or even later, the clumsiness of the Voynich illustrator is striking. The drawings are made almost in a single stroke, with no attempt at shading or adding the slightest grace to the figures. From the plants to the nymphs or the zodiac signs, they look like something a child of six or seven could have drawn.
I understand that by that time art had already reached quite a high level of quality; the Renaissance was just emerging in Italy in the 15th century. So whoever produced the illustrations must have been an amateur, and quite a poor one at that.
What is even more surprising is the contrast between the complexity of the text (whether it is an actual cipher or an invented script) and the low quality of the images. One senses the ambition to depict grand ideas, such as the elaborate foldout diagrams or the roses, yet the final result feels clumsy and impoverished when compared with the artistic standards of the time. I understand that the Renaissance was not accessible to everyone, but even the humblest artistic traditions of the period offered a more faithful representation of reality than what we see in the Voynich.
It is also striking that researchers have identified different scribes at work in the text, while the illustrations — at least the thematic ones — seem to share the same hand and style. It is difficult to imagine two people independently drawing the nymphs, for example, in exactly the same (and equally unconvincing) manner with respect to human anatomy. This further reinforces the impression that the text and the images may have followed very different logics of production.
That said (and I hope not to offend, I’m a Voynich enthusiast myself!), I wonder whether any other manuscripts from the period show a similar poverty of representation, with drawings that are unrealistic and poorly proportioned.
Hi everyone,
This is a general research or strategy question. I love it when someone posts something visual up here that is a new match to a feature of the Voynich. I’ve done a lot of searches on the ninja forum to see what’s been done. I’m assuming a lot of the researchers have been combing over manuscripts quite a bit, especially when looking for things like marginalia or letters or zodiac signs, and that anything posted here has been looked over carefully. So as I venture into the websites of archives, I find it hard to figure out which manuscripts have been completely picked over for novelty and which ones haven’t been looked at yet. More importantly, I can’t figure out how to discover whether a manuscript was freshly posted or updated on one of the many different sites hosting it, or whether it’s been there for a long time. I’m not asking you to tell me every single manuscript that’s been combed over on this forum. I’m asking: “Is there a way to quickly search the various online sources and identify the range of dates manuscripts were updated or uploaded, so you can just start with the newest ones first?” and “Are there particular websites that anyone here favors as being really great with new updates?” Also: “Are there collections online that we just know haven’t been looked at yet because they just dropped?”
And also (this being entirely subjective): do all of you major researchers descend upon new uploads within a few months, or are there enough manuscripts being added to the web, if not at a steady rate, then in key large uploads at certain points?
Anyone have an awesome system for doing this? Personal preferences?
Also- I’m most interested in herbal texts specifically - but this applies to all types.
PS: I really hope this is not extremely obvious. I have -tried- to look at many manuscripts already, and with the variation of viewers I still struggle to see a quick way of finding when something was uploaded.
PPS: I have watched all the Voynich Talk videos, and the conferences and study days. I have searched and tried to learn from this forum as well, so I have a bit of general background knowledge of which texts have already been looked at for comparison.
Thank you for any advice or tips given!
Anya
PPPS: I love em dashes. Love them. I am expressive in real life and those em dashes are there to show that. So please don’t flag this as AI.
I’m Joaquim Quadrada, Quim for short (that’s the reason for my nickname, quimqu). I’m 51, from Barcelona, a native Catalan speaker. I was formerly a mechanical engineer and moved into data science two years ago, after finishing a master’s that lasted three years. Even though we covered the linguistic side of data science during the master’s, I’m not a linguist. I open this thread because I think graph methods can serve the Voynich community as a practical, transparent way to poke at structure and test ideas, and graphs are quite a new area to investigate.
By “graph” we understand a set of points and lines. The points (nodes) can be words or bits of metadata such as “section”, “Currier hand”, “writing hand”, or “position in line”. The lines (edges) connect things that co-occur or belong together. We can also attach information to the edges: direction, weights, and other attributes that characterize the relationship between the nodes. Once you cast the transliteration as one or more graphs (and yes, we can join graphs), you can ask graph-native questions: which links are unexpectedly strong once chance is controlled, which words act as bridges between otherwise separate clusters, which small patterns (A→B→C chains or tight triangles) recur at line starts, how closely word communities align with metadata nodes (sections, hands, line position), and whether any directed paths repeat often enough to count as reusable templates. None of this decides whether the text is language or cipher, but it can highlight stable regularities, quantify them, and rank hypotheses for experts to examine.
I’d like to open a brainstorming thread to push ideas that are worth trying on top of these graphs.
As a concrete example, I started with the first lines of paragraphs (what I call L1) and compared them to all other lines. Building co-occurrence graphs, the L1 network consistently comes out denser and more tightly clustered. When I switch to a small sliding window (a “micro-syntax” view), the L1 graph splits into more distinct communities, which is what you’d expect if opening material uses more fixed combinations. I also looked for actual line-start bigrams that repeat. A couple of pairs do appear at L1 and not elsewhere, but the evidence is thin; they behave more like soft habits than hard formulas. To see broader context, I built a bipartite graph that connects words to their metadata (position, section, hand). Projecting this graph shows a clear cohort of words that lean toward L1, and it also shows which sections and Currier hands share that opening behavior. All of this is descriptive and testable; nothing here presumes a linguistic reading or a cipher.
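For anyone who wants to reproduce the L1 experiment, here is a minimal sketch of the co-occurrence graph construction with networkx; `l1_lines` is assumed to be the tokenized first lines, prepared elsewhere from the transliteration:

[code]
import networkx as nx
from networkx.algorithms import community

def cooccurrence_graph(lines, window=2):
    """Words are nodes; an edge links words co-occurring within `window` tokens."""
    g = nx.Graph()
    for words in lines:
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                if w == v:
                    continue
                if g.has_edge(w, v):
                    g[w][v]["weight"] += 1
                else:
                    g.add_edge(w, v, weight=1)
    return g

# l1_lines: first lines of paragraphs, tokenized into EVA words (assumption:
# prepared elsewhere from the transliteration file)
g = cooccurrence_graph(l1_lines)
parts = community.greedy_modularity_communities(g, weight="weight")
print(g.number_of_nodes(), g.number_of_edges(), len(parts))
[/code]

Running the same code on the other lines and comparing density and community counts gives the contrast described above.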
This, for example, is the graph for the first lines of the paragraphs:
To illustrate what I mean by opening units at L1, here’s a table with the two bigrams that pass the defined thresholds: they have positive ΔPMI versus the rest of the text (ΔPMI > 0 means the bigram is more tightly associated in L1 than in the other lines) and they always occur at the start of a line. I’ve added short KWIC (Key Word in Context) snippets for context.
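For reference, ΔPMI here is simply the bigram’s pointwise mutual information computed over L1 lines minus the same quantity over all other lines; a minimal sketch:

[code]
import math
from collections import Counter

def pmi(bigram, lines):
    """PMI of an adjacent word pair, estimated within `lines` only."""
    unigrams, bigrams, n = Counter(), Counter(), 0
    for words in lines:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        n += len(words)
    w1, w2 = bigram
    if bigrams[bigram] == 0 or unigrams[w1] == 0 or unigrams[w2] == 0:
        return float("-inf")
    p_xy = bigrams[bigram] / sum(bigrams.values())
    return math.log2(p_xy / ((unigrams[w1] / n) * (unigrams[w2] / n)))

def delta_pmi(bigram, l1_lines, other_lines):
    # ΔPMI > 0: the pair is more tightly associated in first lines than elsewhere.
    return pmi(bigram, l1_lines) - pmi(bigram, other_lines)
[/code]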
What I need now are linguists’ eyes and instincts. If you can suggest a better unit than whole EVA “words” (for example, splitting gallows and benches, or collapsing known allographs), I will rebuild the graphs and quantify how the patterns change. If you have candidates for discourse-like items that might prefer line starts, I can measure their positional bias, their role in the network, and the contexts they pull in. If there are section or hand contrasts you care about, I can compare their “profiles” in the bipartite projection and report which differences are solid under shuffles and which are noise.
I’ll keep my end practical: small, readable tables; KWIC lines for anything we flag; and ready-to-open graph files. If this sounds useful, I’ll post the current outputs and code, and then iterate with your guidance on segmentation, normalization, and targets worth testing.
My only goal is to make a common playground where your expertise drives what we measure.
Egerton MS 845 (f. 21v)
Probably just a coincidence, but I thought the top-left triangle design and top-right cross design (scroll) might be noteworthy, given the general similarities between the two images. The overall layout also has quite some similarities. I can't find scans of this MS anymore, though it was linked here years ago and noted as "first half of 15c" in 2021.
On a side note: the image also shows up as part of Harley 2407, which is 15c and contains many later notes by readers, the most famous of whom were Dee and Ashmole.
The other is from Michael Scot's Liber introductorius, which is a very interesting work, by a very interesting person (if you like rabbit holes).
This is a 14c copy (around 1320) - [link]
The images concern eclipses, though I found the general way they are drawn to have some similarities with the VM images, even if the meaning is seemingly different.
You will be able to find many great images in this MS; if you browse backwards from this page you will find all sorts of great images for planets and zodiac signs, some more relatable to the VM than others.
Bonus merlons and weird sun face for Koen
I don't really have much to claim on these, just thought I'd share and not let it rot in the list of things I'll probably forget about.