Brainstorming Session: Mapping Voynich with Graphs - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Brainstorming Session: Mapping Voynich with Graphs (/thread-4916.html)
Brainstorming Session: Mapping Voynich with Graphs - quimqu - 07-09-2025

Hi everyone. I'm Joaquim Quadrada, Quim for short (that's the reason for my nickname, quimqu). I'm 51, from Barcelona, and a native Catalan speaker. I was formerly a mechanical engineer and moved into data science two years ago, finishing a three-year master's. Although we covered the linguistic side of data science during the master's, I'm not a linguist. I'm opening this thread because I think graph methods can serve the Voynich community as a practical, transparent way to probe structure and test ideas, and graphs are still quite a new area of investigation here.

By "graph" we mean a set of points and lines. The points (nodes) can be words or bits of metadata such as "section", "Currier hand", "writing hand", or "position in line". The lines (edges) connect things that co-occur or belong together. We can also attach information to the edges: direction, weight, and anything else that characterizes the relationship between the nodes. Once you cast the transliteration as one or more graphs (and yes, we can join graphs), you can ask graph-native questions: which links are unexpectedly strong once chance is controlled for, which words act as bridges between otherwise separate clusters, which small patterns (A→B→C chains or tight triangles) recur at line starts, how closely word communities align with metadata nodes (sections, hands, line position), and whether any directed paths repeat often enough to count as reusable templates. None of this decides whether the text is language or cipher, but it can highlight stable regularities, quantify them, and rank hypotheses for experts to examine. I'd like this to be a brainstorming thread for ideas worth trying on top of these graphs.

As a concrete example, I started with the first lines of paragraphs (what I call L1) and compared them to all other lines. Building co-occurrence graphs, the L1 network consistently comes out denser and more tightly clustered.
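A co-occurrence graph of this kind is easy to prototype. Below is a minimal sketch of how line-level co-occurrence edges and a simple density figure could be computed; it uses plain dictionaries rather than a graph library, and the toy EVA-like lines are invented for illustration (this is my own reconstruction, not quimqu's code):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(lines):
    """Build weighted co-occurrence edges: one edge per pair of words
    that appear together in the same line, weighted by line count."""
    edges = Counter()
    for line in lines:
        for a, b in combinations(sorted(set(line)), 2):
            edges[(a, b)] += 1
    return edges

def density(edges, lines):
    """Distinct edges present divided by edges possible for the vocabulary."""
    vocab = {w for line in lines for w in line}
    n = len(vocab)
    possible = n * (n - 1) / 2
    return len(edges) / possible if possible else 0.0

# Invented toy lines standing in for a real transliteration.
l1_lines = [["daiin", "shey", "ol"], ["daiin", "ol", "chey"], ["shey", "chey", "ol"]]
other_lines = [["qokeedy", "aiin", "dar"], ["shedy", "qokain", "ol"], ["otaiin", "dal", "dar"]]

print(density(cooccurrence_edges(l1_lines), l1_lines))
print(density(cooccurrence_edges(other_lines), other_lines))
```

On real data the two sets of lines would come from a transliteration file split by paragraph position; "denser" then means exactly this ratio coming out higher for L1.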
When I switch to a small sliding window (a "micro-syntax" view), the L1 graph splits into more distinct communities, which is what you'd expect if opening material uses more fixed combinations. I also looked for actual line-start bigrams that repeat. A couple of pairs do appear at L1 and not elsewhere, but the evidence is thin; they behave more like soft habits than hard formulas. To see broader context, I built a bipartite graph that connects words to their metadata (position, section, hand). Projecting this graph shows a clear cohort of words that lean toward L1, and it also shows which sections and Currier hands share that opening behaviour. All of this is descriptive and testable; nothing here presumes a linguistic reading or a cipher.

This, for example, is the graph for the first lines of the paragraphs:

[graph image]

To illustrate what I mean by opening units at L1, here's a table with the two bigrams that pass the defined thresholds: they have positive ΔPMI versus the rest of the text (ΔPMI > 0 means the bigram is more tightly associated in L1 than in the other lines) and they always occur at the start of a line. I've added short KWIC (Key Word in Context) snippets for context.
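For readers who want to reproduce the ΔPMI threshold, here is one way it could be computed: estimate pointwise mutual information for an adjacent pair separately over the L1 lines and over all other lines, then take the difference. This is my reading of the description above, not quimqu's actual code:

```python
import math
from collections import Counter

def pmi(bigram, lines):
    """Pointwise mutual information of an adjacent word pair,
    estimated from bigram and unigram counts over the given lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    if not bigrams[bigram]:
        return float("-inf")  # pair never occurs in these lines
    a, b = bigram
    p_ab = bigrams[bigram] / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))

def delta_pmi(bigram, l1_lines, other_lines):
    """Positive when the pair is more tightly bound in L1 than elsewhere."""
    return pmi(bigram, l1_lines) - pmi(bigram, other_lines)
```

The second condition from the post (the bigram always occurs line-initially) would be a separate positional check on top of this score.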
What I'd like now are linguists' eyes and instincts. If you can suggest a better unit than whole EVA "words" (for example, splitting gallows and benches, or collapsing known allographs), I will rebuild the graphs and quantify how the patterns change. If you have candidates for discourse-like items that might prefer line starts, I can measure their positional bias, their role in the network, and the contexts they pull in. If there are section or hand contrasts you care about, I can compare their "profiles" in the bipartite projection and report which differences hold up under shuffles and which are noise. I'll keep my end practical: small, readable tables; KWIC lines for anything we flag; and ready-to-open graph files. If this sounds useful, I'll post the current outputs and code, and then iterate with your guidance on segmentation, normalization, and targets worth testing. My only goal is to make a common playground where your expertise drives what we measure.

RE: Brainstorming Session: Mapping Voynich with Graphs - synapsomorphy - 07-09-2025

I have been doing a somewhat similar type of analysis, but with word embeddings generated with word2vec. Left: each frequent word mapped down to 2D, where closer means used around the same other words. Right: a few texts compared on how the top pair similarities of their most frequent words decline; at the far left of that plot is each text's most similar pair of frequent words, and at the far right the least similar pair. I guess the pair similarities would be similar to the strength of your edge connections:

[plots]

You can see more results [link], including the first plot for each of these texts, and all of my code [link]. However, I have a problem with this type of analysis: with so many hyperparameters to set it up, and then so much resulting data, I feel like you can find almost any pattern you want.
I tried to compare between texts instead of within a single text to mitigate this, but it doesn't help a ton: the Naibbe-encoded texts seem to always stay above the rest, but I can make the VMS line in the above plot flip above or below the natural texts by picking different hyperparameters. I think there's a chance that useful insights could come from specific data points in complex analyses like this. But we already have many such peculiarities with no real path toward combining them into some higher-level rule. This is why I'm more interested in looking at texts rather than at instances of char/word/n-gram occurrence patterns. I also think it's very important to report analyses that don't find strong correlations, alongside the successful ones.

RE: Brainstorming Session: Mapping Voynich with Graphs - Jorge_Stolfi - 07-09-2025

(07-09-2025, 04:50 PM)quimqu Wrote: As a concrete example, I started with the first lines of paragraphs (what I call L1) and compared them to all other lines. Building co-occurrence graphs, the L1 network consistently comes out denser and more tightly clustered.

Hi! This general line of inquiry is important, and I hope it yields some useful insights. Using graphs to model relationships between VMS words is not a new idea. (I seem to recall a French Ph.D. thesis that built a graph where the nodes were words and edges connected words that were similar in some metric, perhaps edit distance.) But there are always many original ways to do it.

It is not surprising that paragraph head lines (L1) come out as more similar to each other than to other lines. Throughout all sections, those lines are written with different "spelling" conventions than other lines, apparently just because they are head lines. This seems to have been a not uncommon way to highlight paragraph breaks at the time. Compare with our tradition of Capitalizing Most (But Not All) Words in Titles of Papers and Sections.
A Martian who abducted one of our books might be quite baffled by the apparently different "encryption" of those lines. The rule for that "embellishing" transformation is not yet understood, and may be partly random, partly ambiguous.

Once I mapped each page to a point in N-space according to the frequencies of certain N common words in it. [Link] shows selected 2D projections of those points; colors identify sections, and lines connect pages that are consecutive in the book. More useful would be to connect pages that are most similar, in that or some other metric, since such a graph might reveal the original order of the pages.

All the best, --jorge

RE: Brainstorming Session: Mapping Voynich with Graphs - RobGea - 08-09-2025

Hi quimqu, how about labels? They are a peculiar group. You have probably seen this already, but if not: Torsten Timm has a lot of word graphs on his GitHub --> [link]

RE: Brainstorming Session: Mapping Voynich with Graphs - quimqu - 08-09-2025

(08-09-2025, 01:31 AM)RobGea Wrote: Hi quimqu, how about Labels, they are a peculiar group.

Thank you, Rob, but it is not exactly the same. Torsten Timm used graphs to see how slight changes in words are linked. What I want is a graph that links the whole Voynich text. I will try to post examples of what we can achieve later.

RE: Brainstorming Session: Mapping Voynich with Graphs - quimqu - 08-09-2025

I post here a quick example of what can be done with the graph. I have compared words not by how they look, but by how they behave in the text: if two words tend to appear in the same contexts, the graph shows them as similar. The method produces a list of similar pairs and also groups of words (clusters). This lets me detect potential orthographic variants, scribal inconsistencies, or grammatical categories.
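The post doesn't show the implementation, but the "same contexts → similar" idea can be sketched in a few lines: profile each word by the words seen near it, then rank pairs by the cosine similarity of their profiles. The window size and the toy lines below are my own assumptions, not quimqu's settings:

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def context_profiles(lines, window=2):
    """Profile each word by counts of the words seen within
    +/- window positions of it, across all lines."""
    profiles = defaultdict(Counter)
    for line in lines:
        for i, w in enumerate(line):
            for j in range(max(0, i - window), min(len(line), i + window + 1)):
                if j != i:
                    profiles[w][line[j]] += 1
    return profiles

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = lambda c: math.sqrt(sum(x * x for x in c.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def similar_pairs(lines, top=3):
    """Rank word pairs by how alike their contexts are,
    regardless of how the words themselves look."""
    prof = context_profiles(lines)
    pairs = [(cosine(prof[a], prof[b]), a, b)
             for a, b in combinations(sorted(prof), 2)]
    return sorted(pairs, reverse=True)[:top]
```

In a toy corpus where two otherwise unrelated words always sit in the same frame, this ranks them first, which mirrors the am ~ oly example below: the words share no letters, only behaviour.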
Many pairs and clusters appear, but here I explain only the first three of each as examples. Here are the first three pairs:
- The first pair is interesting: am and oly look completely different, but the graph shows that both often appear after or or aiin and are followed by shey.
- The second pair, cthey ~ ychey, shows the classic frame daiin … chey.
- The third pair, ckhol ~ qokor, shows two forms that both sit inside daiin … daiin.

The same idea extends to clusters. The graph groups together whole families of words that share context. Here are the first three found:

- Cluster 1 is very large, with more than a hundred words. It is shaped by the common formulas with daiin, shedy, ol, chey. This cluster includes the three pairs above, which are representative of the whole family.
- Cluster 3 is small, only six words. It mixes very short forms like l and d with longer forms like laiin or lkaiin. The shared contexts are o, chey, y, chedy. It is puzzling that single-character words cluster with long ones... abbreviations, perhaps?
- Cluster 4 is also small, again with six words. It is centred on words like chol, sho, okeol. Pairs such as dor ~ ockhey or kchol ~ ykeey belong here. This cluster seems to be a family of variants with similar endings and usage.

These are only the first examples. Many more pairs and clusters can be studied in the same way, and the graph makes it possible to explore their behaviour systematically. I encourage you to post ideas for exploring the graph. For example: if you suspect that one word is a scribal misspelling of another, I can check whether the "correct" word and the "wrong" word behave differently. If they do behave differently, then it probably wasn't just a scribal error.
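The post doesn't say how the clusters are formed. One simple possibility, consistent with "groups of words that share context", is to treat the flagged similar pairs as edges and read the clusters off as connected components of that similarity graph. A minimal union-find sketch of that reading:

```python
def clusters_from_pairs(pairs):
    """Group words into clusters as the connected components of the
    similarity graph whose edges are the flagged pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in pairs:
        union(a, b)
    groups = {}
    for w in parent:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())
```

With this reading, a pair list like [("am", "oly"), ("oly", "shey"), ("cthey", "ychey")] yields two clusters, one of three words and one of two; chains of pairwise similarity are what let very short and very long forms end up in the same family.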