The oddities of the bigram "ed" pt. 2: Same as it ever was. - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: The oddities of the bigram "ed" pt. 2: Same as it ever was. (/thread-5373.html)
The oddities of the bigram "ed" pt. 2: Same as it ever was. - Dunsel - 17-02-2026

If you haven't read my first post on this, go back and read it first; it explains a lot about what I'm going to present here. I don't believe God plays dice.

In the last post I had just split the Voynich into my 0ED (Currier A) and ED+ (Currier B) pages. This post is going to be less chart-pretty and more statistics. Prepare to be bored mindless with numbers.

For the Voynich, here are the per-page statistics I compared between the two sets (a sketch of how these can be computed follows the list):

Total pages
Tokens per page (mean / median)
Unique tokens per page (mean / median)
Global hapax ratio (mean / median)
Reuse ratio (mean / median) - How often words repeat.
Variance of token length (mean / median)
Proportion of long tokens (length ≥ 7)
Unique bigrams per page (mean / median)
Bigram repetition rate
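Roughly, each of these can be computed from a per-folio token list along the following lines. This is only a minimal sketch: the exact definitions used here for "global hapax ratio" and "reuse ratio" are assumptions, and `pages` (folio ID → token list) is a placeholder, not the actual data structure behind the numbers.

Code:
# Sketch of the per-page metrics listed above (illustrative, not the original script).
from collections import Counter
from statistics import mean, median, pvariance

def page_metrics(pages):
    # pages: dict mapping folio ID -> list of word tokens
    corpus_counts = Counter(t for toks in pages.values() for t in toks)
    rows = []
    for folio, toks in pages.items():
        if not toks:
            continue
        types = set(toks)
        bigrams = [t[i:i + 2] for t in toks for i in range(len(t) - 1)]
        rows.append({
            "folio": folio,
            "tokens": len(toks),
            "unique_tokens": len(types),
            # "global" hapax: the word occurs exactly once in the whole corpus (assumption)
            "global_hapax_ratio": sum(1 for t in types if corpus_counts[t] == 1) / len(types),
            # reuse: share of tokens on the page that repeat an already-seen type (assumption)
            "reuse_ratio": 1 - len(types) / len(toks),
            "token_length_variance": pvariance([len(t) for t in toks]),
            "long_token_share": sum(1 for t in toks if len(t) >= 7) / len(toks),
            "unique_bigrams": len(set(bigrams)),
            "bigram_repetition_rate": 1 - len(set(bigrams)) / len(bigrams) if bigrams else 0.0,
        })
    return rows

def summarize(rows, key):
    vals = [r[key] for r in rows]
    return mean(vals), median(vals)   # the (mean / median) pairs above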
TLDR;

ED+ pages:
Are much longer (Balneo and Recipe influenced)
Introduce more total vocabulary
Reuse prior vocabulary more than 0ED pages
Contain longer tokens
Use more bigram types
Repeat bigrams more heavily
Have roughly the same hapax creation rate

Sometimes things have to break before you can fix them.

So, there are a lot of statistics showing that ED+ pages are significantly different from 0ED pages, and much of this has been discovered and mulled over since Currier spotted the difference. But what I'm about to show you will, I think, make you reconsider that difference.

I was working on splat repair. It occurred to me that some 2,000 splats exist in Takahashi. If the Voynich contains information, that is a lot of lost information. So I started working on a suite of repair tools, some bits stolen from OCR, some from spell checking. In that suite I did something different: I did 'leak free' testing. I would train with tokens from the herbal section and then test on the recipe section. Or I'd alternate folios: train on one, test on the other. This allowed comparisons between two sections of the Voynich without one set of data contaminating the other.

So when I spotted 0ED and ED+, the repair idea popped right up. When I trained my model on 0ED pages and tested on ED+ pages, here's what I found (a rough sketch of the test procedure follows these numbers):

OOV tokens (OOV = out of vocabulary): 10,389
Bigram-illegal tokens:
Repairability of OOV tokens: 82%

The repairability is the big one. Of those 10,389 tokens found on the ED+ pages that didn't exist on 0ED pages, 82% could be turned into a 0ED token with one simple character deletion or substitution. These two sets of pages are not that different.

Next, I reversed the test: I trained on ED+ pages and tested on 0ED pages.

OOV tokens (vocab-OOV): 1,714 - That's a huge difference. Over 10,000 tokens were seen on ED+ pages that didn't exist in 0ED, but only 1,714 tokens on 0ED were not in ED+.
Bigram-illegal tokens: 17 - Only 17 tokens in 0ED contained bigrams that do not occur in ED+.

When you compare those numbers against the total tokens in each test:
0ED → ED+
ED+ → 0ED

And 82.21% of ED+ tokens could be repaired into 0ED tokens with a single substitution or deletion.

More notes:

High-frequency backbone
In 0ED:
In ED+:

Exclusive bigrams
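Here is the promised sketch of the leak-free test. It is illustrative only: the vocabulary and bigram inventory come from one set of pages, the other set is scored against them, and the one-edit repair check allows a single deletion or substitution, matching how the 82% figure is described above. Whether OOV is counted over word types or word occurrences is an assumption here, as are all the names.

Code:
# Sketch of the leak-free OOV / bigram-illegal / one-edit repair test (illustrative only).
def word_bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def one_edit_variants(word, alphabet):
    # single deletion or single substitution, as described above
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])                 # deletion
        for c in alphabet:
            variants.add(word[:i] + c + word[i + 1:])         # substitution
    return variants

def cross_test(train_tokens, test_tokens):
    vocab = set(train_tokens)
    legal_bigrams = {bg for t in vocab for bg in word_bigrams(t)}
    alphabet = {c for t in vocab for c in t}

    oov = [t for t in set(test_tokens) if t not in vocab]
    bigram_illegal = [t for t in oov if not word_bigrams(t) <= legal_bigrams]
    repairable = [t for t in oov if one_edit_variants(t, alphabet) & vocab]

    return {
        "oov": len(oov),
        "bigram_illegal": len(bigram_illegal),
        "repairable_share": len(repairable) / len(oov) if oov else 0.0,
    }

# Run it in both directions:
#   cross_test(tokens_0ed, tokens_edplus)   # train on 0ED, test on ED+
#   cross_test(tokens_edplus, tokens_0ed)   # train on ED+, test on 0ED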
TLDR2;

0ED vocabulary is largely contained within ED+; ED+ expands well beyond 0ED.
Every high-frequency backbone token in 0ED exists in ED+.
195 of 207 0ED bigrams survive in ED+. That is 94% retention.
ED+ expands the bigram alphabet: it adds 60 new bigrams, and the bigram space grows from 207 → 255.

And here's the first chart: the vocabulary growth between 0ED and ED+. The growth rate is very smooth, which suggests there was no big shift between 0ED and ED+. Despite all of the differences above, it's still the same base "engine" chugging along with no dramatic change.

Sometimes, broken things deserve to be repaired.

In ED+ there are 4,260 unique tokens that do not exist on 0ED pages (a different count from the OOV above). If we take those tokens and measure the edit distance to the nearest 0ED token:

2,870 are edit distance 1 away.
1,123 are edit distance 2 away.
218 are edit distance 3 away.
49 are more than 3 away.

I gotta bold this to make sure it's seen: around 94% of ED+-only vocabulary is within edit distance ≤ 2 of 0ED. Let me put that another way: out of 4,260 unique tokens on ED+ pages, 3,993 can be made into a 0ED token by changing at most 2 characters.

I had to keep repairing things. I set up a chain: take everything that is edit distance 1 from a 0ED token, then compare everything still at edit distance ≥ 2 against that enlarged set, and repeat. By repeating this chain of checking and rechecking edit distance, I came to an abrupt stop at round 6.
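That chain can be sketched as a simple loop: start from the 0ED vocabulary, absorb any still-unmatched ED+-only token within edit distance 1 of something already absorbed, and repeat until a pass adds nothing. This is only a reconstruction of the procedure as described, with my own Levenshtein helper and placeholder names.

Code:
# Sketch of the chained repair loop (a reconstruction, not the original tooling).
def edit_distance(a, b):
    # standard Levenshtein distance (insert / delete / substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def chain_repair(base_vocab, extra_vocab, max_step=1):
    reachable = set(base_vocab)
    pending = set(extra_vocab) - reachable
    rounds = 0
    while pending:
        # brute force for clarity; a real run would want some indexing
        absorbed = {t for t in pending
                    if any(edit_distance(t, r) <= max_step for r in reachable)}
        if not absorbed:
            break            # the "abrupt stop": nothing else is within reach
        rounds += 1
        reachable |= absorbed
        pending -= absorbed
    return rounds, pending   # rounds completed, tokens still unreachable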
1,093 tokens were still unreachable. So I relaxed the rules a bit: I started allowing edit distance 2, and then edit distance 3. After two rounds of editing like this, I was left with 15 tokens that could not be chained back to 0ED, and every single one of those tokens was length > 8. I considered those to be possible transcription errors or joined-up words, compared them to shorter tokens, and was able to split each of them into 2 words. All of those halves were then 1 or 2 edit distance from a 0ED token or a previously repaired token.

OK, so every single token on ED+ could be chained, at an edit distance of 3 or less per step, back to a 0ED token.

I checked Zandbergen/Landini: 99.07% of the ED+-only vocabulary was absorbed within ≤ 3 edits. I had 45 tokens left over, and 45 / 45 (100%) have a split where both halves were within ≤ 3 edits of another checked token.

OK... that can't possibly be right. It means I can edit any word a few times (6 at most, maybe 7) and make every single word from one half of the book match the other half.

Well, kinda.

Latin (Caesar), plus-only unique tokens: 2,597
Dist = 1: 860
Dist = 2: 940
Dist = 3: 506
Dist > 3: 291
Within ≤ 2: 69.31%
Within ≤ 3: 88.79%

English (Dracula), plus-only unique tokens: 1,184
Dist = 1: 374
Dist = 2: 413
Dist = 3: 226
Dist > 3: 171
Within ≤ 2: 66.47%
Within ≤ 3: 85.56%

So yes, they can be repaired. BUT!

Voynich:
Within ≤ 2: ~93.7%
Within ≤ 3: ~98.8%

Voynich words have a greater similarity to each other than Latin or English words do. To old hands at Voynich research this is no huge surprise, but it does show that for what appear to be two very different sections of the book, that similarity is just a few edit distances away.
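For the control texts, the same measurement can be run on any ordinary book split in two: treat one half as the "base" vocabulary, take the word types that only occur in the other half, and bucket each by its distance to the nearest base word. A rough sketch is below; the file names and the lowercase, letters-only tokenizer are placeholders rather than the preprocessing actually behind the Caesar and Dracula numbers, and edit_distance() is the helper from the chaining sketch above.

Code:
# Sketch of the control-text comparison (Latin / English baseline), illustrative only.
import re

def tokenize(path):
    # placeholder preprocessing: lowercase, letters only
    with open(path, encoding="utf-8") as f:
        return re.findall(r"[a-z]+", f.read().lower())

def repair_profile(base_tokens, plus_tokens, max_checked=3):
    base = set(base_tokens)
    plus_only = set(plus_tokens) - base
    buckets = {d: 0 for d in range(1, max_checked + 1)}
    beyond = 0
    for t in plus_only:
        d = min(edit_distance(t, b) for b in base)   # edit_distance() from the chaining sketch
        if d <= max_checked:
            buckets[d] += 1
        else:
            beyond += 1
    n = len(plus_only)
    return {"plus_only": n,
            **{f"dist_{d}": c for d, c in buckets.items()},
            "beyond": beyond,
            "within_le2": (buckets[1] + buckets[2]) / n if n else 0.0,
            "within_le3": (n - beyond) / n if n else 0.0}

# usage with hypothetical file names:
#   first, second = tokenize("dracula_first_half.txt"), tokenize("dracula_second_half.txt")
#   print(repair_profile(first, second))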
Conclusion. I'm likely going to get beat up over this, but here goes:

So, I hope I've given enough evidence to show how these two regimes are different, yet the same underlying system. I'll be interested in hearing your thoughts. The big question now is: why does a lexical "engine" make a drastic switch like "ed" if the vocabulary isn't actually changing much? I think I can answer that, but that's for another post.

Disclaimer: I have tried to review all of these numbers and I believe them to be reasonably accurate. I may have missed something, but hopefully nothing drastic.

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - Koen G - 17-02-2026

Great post, Dunsel. I like these kinds of analyses, since they attempt to get closer to the heart of what Voynichese is. I'm not a great statistician, so just some general thoughts.

1. Some of the differences you discuss would disappear when somehow normalized for text length: unique tokens, hapax, reuse... So I wonder if your whole first section could be summarized as "Q20 has many words per page"?

2. I am very interested in ways to kind of smoothen out and "repair" things. If it's always "aiin" but then suddenly "oiin", we may wonder whether the thing that looks like "o" was actually supposed to be one. I also think splitting hapax that are compounds is a valid strategy.

Edit distance 1 is also a good way to account for scribal and transcription errors. However, edit distance 2 and 3 feels like a lot in a poor system like Voynichese, especially when doing it blindly without noting the underlying patterns.

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - Jorge_Stolfi - 17-02-2026

(17-02-2026, 05:22 AM)Dunsel Wrote: Why does a lexical "engine" make a drastic switch like "ed" if the vocabulary isn't actually changing much?

You may have seen [link]. I think it is evidence that "language A" and "language B" are the same language with two different "spellings" -- but not completely different. I don't know whether one could obtain a similar plot if the languages were truly different, even for closely related languages like Spanish and Portuguese. I wonder if such a reduction to a common core would work for Norwegian in the Nynorsk and Bokmål spellings, or for Portuguese before and after the 1943 spelling reform.

All the best, --stolfi

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - oshfdk - 17-02-2026

In a work of fiction this can easily happen if a new character is introduced with a very peculiar, foreign-sounding name. For example, the chart below shows the number of times "ka" appears in Winnie the Pooh. The first match near the beginning is the chapter index, then there is nothing for half of the book until Roo's mom shows up. But this can happen in non-fiction too, say a history book that doesn't talk about Xerxes until the relevant part of the history is discussed.

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - nablator - 17-02-2026

(17-02-2026, 01:46 PM)oshfdk Wrote: But this can happen in non-fiction too, say a history book that doesn't talk about Xerxes until the relevant part of the history is discussed.

And 10% to 30% of all "e" are replaced with "xe".
RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - oshfdk - 17-02-2026

(17-02-2026, 02:59 PM)nablator Wrote: And then 10% to 30% of all "e" are replaced with "xe".

I'm not sure what you mean exactly, but I also don't think the vocabulary is the right explanation of this shift in the Voynich MS anyway. I'm in the cipher camp; I can explain away any evolution of the text with "new key", "new cipher table", "adjustments in encoding", etc. Life is simple until someone asks how the cipher works specifically.

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - nablator - 17-02-2026

(17-02-2026, 03:09 PM)oshfdk Wrote: I'm not sure what you mean exactly

I mean it's not just one new word with an unusual bigram (like xe in Xerxes) and it's not all words (like a change of spelling): not all chey or chdy have been replaced by chedy in Currier language B. If some of them have been replaced, many remain: it's not a systematic replacement of one short pattern by another short pattern including ed.

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - Aga Tentakulus - 17-02-2026

I'm a bit confused. You mention 'c8'. Is that a word or a symbol for some kind of statistics? If it's a word, why not just say what it is?

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - Jorge_Stolfi - 18-02-2026

(17-02-2026, 03:33 PM)nablator Wrote: I mean it's not just one new word with an unusual bigram (like xe in Xerxes) and it's not all words (like a change of spelling): not all chey or chdy have been replaced by chedy in Currier language B. If some of them have been replaced, many remain: it's not a systematic replacement of one short pattern by another short pattern including ed.

But it could be a change of spelling. Imagine that someone is writing in German and realizes that writing both sounds of "ch" the same way is a bad idea. Then he changes the spelling of all the words that previously were written with "ch" to use "ch" or "kh" depending on the sound. And, since he is at it, he starts writing "sh" instead of "sch", and "c" instead of "ck". That would change the spelling of a large number of words, but many words would not be affected. The frequency of "ch" would drop, but not to zero. And some words with "c" and/or with "h" would not be affected...

All the best, --stolfi

RE: The oddities of the bigram "ed" pt. 2: Same as it ever was. - Dunsel - 18-02-2026

(17-02-2026, 10:46 AM)Koen G Wrote: Great post, Dunsel. I like these kinds of analyses, since they attempt to get closer to the heart of what Voynichese is.

Thank you!

(17-02-2026, 10:46 AM)Koen G Wrote: 1. Some of the differences you discuss would disappear when somehow normalized for text length: unique tokens, hapax, reuse... So I wonder if your whole first section could be summarized as "Q20 has many words per page"?

Nope. I can completely shuffle any of the pages in the Voynich and get those results. And, to be honest, if I really applied that chaining to a natural language, I'd probably get the same results eventually. The standout point is just how close these pages are to each other. Not in my next post, but in part 4, I'm really going to dig into repair.
(17-02-2026, 10:46 AM)Koen G Wrote: 2. I am very interested in ways to kind of smoothen out and "repair" things. If it's always "aiin" but then suddenly "oiin", we may wonder whether the thing that looks like "o" was actually supposed to be one. I also think splitting hapax that are compounds is a valid strategy.

I'm thinking those were scribal choices. I did choose to split them, because it's well known that transcription errors occur. Most of the time I consider it just statistical noise; in this case, I dug into the noise just to see if I could fix them.

(17-02-2026, 10:46 AM)Koen G Wrote: Edit distance 1 is also a good way to account for scribal and transcription errors. However, edit distance 2 and 3 feels like a lot in a poor system like Voynichese. Especially when doing it blindly without noting the underlying patterns.

You are correct. If you have a word of length 3 and an edit distance of 3, then you're changing the entire word. In this case I didn't worry about that, because in another post I'm going to show how edit distance > 3 becomes insignificant. Right now, I'm just demonstrating the surface features.