If you haven't read my first post on this, go read it first.
It'll explain a lot about what I'm going to present.
I don't believe God plays dice.
In the last post I had just split the Voynich into my 0ED (Currier A) and ED+ (Currier B) pages. This post is going to be less chart-pretty and more statistics-heavy. Prepare to be bored mindless with numbers.
The Voynich

Total pages
- 0ED pages: 104
- ED+ pages: 121

Tokens per page (mean / median)
- 0ED: 84.6 / 80
- ED+: 211.2 / 145

Unique tokens per page (mean / median)
- 0ED: 66.8 / 65
- ED+: 140.4 / 110

Global hapax ratio (mean / median)
- 0ED: 0.146 / 0.142
- ED+: 0.144 / 0.131

Reuse ratio (mean / median), i.e. how often words repeat
- 0ED: 0.725 / 0.755
- ED+: 0.784 / 0.800

Variance of token length (mean / median)
- 0ED: 2.69 / 2.64
- ED+: 2.77 / 2.68

Proportion of long tokens (length ≥ 7)

Unique bigrams per page (mean / median)
- 0ED: 60.5 / 61
- ED+: 80.9 / 79

Bigram repetition rate (1 − unique_bigrams / total_bigrams)
- 0ED: 0.791
- ED+: 0.861
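If you want to sanity-check the numbers above, a minimal Python sketch of this kind of per-page computation looks roughly like the following. Treat the definitions in the comments as working assumptions (in particular, that bigrams are adjacent character pairs inside a token and that "reuse" means a token already seen on an earlier page), not an exact spec of my pipeline.

```python
from collections import Counter

def per_page_stats(pages):
    """pages: list of token lists, in manuscript order.

    Assumed definitions (a reconstruction, not an exact spec):
      - global hapax ratio     = share of a page's tokens that occur exactly
                                 once in the whole corpus
      - reuse ratio            = share of a page's tokens already seen on an
                                 earlier page
      - bigrams                = adjacent character pairs inside each token
      - bigram repetition rate = 1 - unique_bigrams / total_bigrams
    """
    global_counts = Counter(tok for page in pages for tok in page)
    seen_before, stats = set(), []
    for page in pages:
        n = len(page)
        bigrams = [t[i:i + 2] for t in page for i in range(len(t) - 1)]
        stats.append({
            "tokens": n,
            "unique_tokens": len(set(page)),
            "global_hapax_ratio": sum(global_counts[t] == 1 for t in page) / n,
            "reuse_ratio": sum(t in seen_before for t in page) / n,
            "long_token_share": sum(len(t) >= 7 for t in page) / n,
            "unique_bigrams": len(set(bigrams)),
            "bigram_repetition_rate": 1 - len(set(bigrams)) / max(len(bigrams), 1),
        })
        seen_before.update(page)
    return stats
```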
TLDR;

ED+ pages:
- Are much longer (Balneo and Recipe influenced)
- Introduce more total vocabulary
- Reuse prior vocabulary more than 0ED pages do
- Contain longer tokens
- Use more bigram types
- Repeat bigrams more heavily
- Have roughly the same hapax creation rate
Sometimes things have to break before you can fix them.
So, there are a lot of statistics that can show that ED+ pages are significantly different from 0ED pages, and much of this has been discovered and mulled over since Currier spotted the difference. But what I'm about to show you will, I think, make you reconsider that difference.
I was working on splat repair. It occurred to me that some 2,000 splats exist in Takahashi. If the Voynich contains information, that is a lot of lost information. So I started working on a suite of repair tools, with some bits borrowed from OCR and some from spell checking. In that suite I did something different: I did 'leak free' testing. I would train on tokens from the herbal pages and then test on the recipe pages. Or I'd alternate folios: train on one, test on the other. This allowed comparisons between two sections of the Voynich without one set of data contaminating the other. So when I spotted 0ED and ED+, the repair approach popped right up.
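To make the 'leak free' idea concrete, here's a tiny sketch of the alternating-folio version of the split (the data layout here is a placeholder, not my actual file format):

```python
def alternating_folio_split(folios):
    """folios: ordered list of (folio_id, token_list) pairs.

    Train and test folios never overlap, so nothing learned from the
    training half can leak into the evaluation half.  The same idea works
    with section-based splits (e.g. train on herbal, test on recipe).
    """
    train = [tok for i, (_, toks) in enumerate(folios) if i % 2 == 0 for tok in toks]
    test  = [tok for i, (_, toks) in enumerate(folios) if i % 2 == 1 for tok in toks]
    return train, test
```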
When I trained my model on 0ED pages and tested it on ED+ pages, here's what I found.
OOV tokens (OOV = out of vocabulary):
- 10,389: the number of tokens seen on ED+ pages that were never seen on 0ED pages. A fairly large number.

Bigram-illegal tokens:
- 170: the number of tokens on ED+ pages containing a bigram that never appears on 0ED pages.

Repairability of OOV tokens:
- SUB or DEL (token-level): 82.15%
- SUB or DEL (type-level): 66.49%
- SUB/DEL/INS (token-level): 82.69%
- SUB/DEL/INS (type-level): 67.33%
This is the big one. Of those 10,389 tokens found on ED+ pages that don't exist on 0ED pages, 82% could be turned into a 0ED token with one simple character deletion or substitution. These two sets of pages are not that different.
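As a sketch of what "repairable" means here: an OOV token counts as repairable when a single substitution, deletion, or insertion turns it into a token the model saw in training. Something like the following (the alphabet is a stand-in for whatever transliteration alphabet is in use):

```python
ALPHABET = "acdefghiklmnopqrstxy"   # placeholder EVA-style alphabet

def one_edit_variants(token):
    """All strings reachable from `token` by one deletion, substitution,
    or insertion."""
    for i in range(len(token)):
        yield token[:i] + token[i + 1:]                    # deletion
        for ch in ALPHABET:
            if ch != token[i]:
                yield token[:i] + ch + token[i + 1:]       # substitution
    for i in range(len(token) + 1):
        for ch in ALPHABET:
            yield token[:i] + ch + token[i:]               # insertion

def repair_rate(oov_tokens, train_vocab):
    """Share of OOV tokens that become a training token with a single edit.
    Pass every occurrence for the token-level number, or the set of
    distinct words for the type-level number."""
    vocab = set(train_vocab)
    fixable = sum(any(v in vocab for v in one_edit_variants(t)) for t in oov_tokens)
    return fixable / len(oov_tokens)
```

Drop the insertion loop to get the SUB-or-DEL-only figures.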
Next, I reversed the test: I trained on ED+ pages and tested on 0ED pages.
OOV tokens (vocab-OOV):
- 1,714: that's a huge difference. Over 10,000 tokens appeared on ED+ pages but not on 0ED pages, yet only 1,714 tokens appeared on 0ED pages but not on ED+ pages.

Bigram-illegal tokens:
- 17: only 17 tokens on 0ED pages contained a bigram not seen on ED+ pages.
When you compare those numbers against the total tokens in each test:

0ED → ED+
- OOV tokens: 10,389
- Test tokens: 25,554
- OOV ratio: 40.66%

ED+ → 0ED
- OOV tokens: 1,714
- Test tokens: 8,797
- OOV ratio: 19.48%
And 82.21% of those ED+ tokens could be repaired into 0ED tokens with a single substitution or deletion.
More notes:

High-frequency backbone

In 0ED:
- Top 10 most frequent tokens → 100% shared with ED+
- Top 20 → 100% shared
- Top 50 → 100% shared

In ED+:
- Top 10 → 70% shared
- Top 20 → 75% shared
- Top 50 → 78% shared

Exclusive bigrams
- 0ED-only bigrams: 12
- ED+-only bigrams: 60
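Both the backbone overlap and the exclusive-bigram counts above are plain set comparisons; a rough sketch, again assuming character bigrams inside tokens:

```python
from collections import Counter

def top_n_shared(tokens_a, tokens_b, n):
    """Share of the n most frequent tokens in A that occur anywhere in B."""
    top_a = [tok for tok, _ in Counter(tokens_a).most_common(n)]
    vocab_b = set(tokens_b)
    return sum(tok in vocab_b for tok in top_a) / n

def exclusive_bigrams(tokens_a, tokens_b):
    """Character bigrams that occur somewhere in A but nowhere in B."""
    bigrams = lambda toks: {t[i:i + 2] for t in toks for i in range(len(t) - 1)}
    return bigrams(tokens_a) - bigrams(tokens_b)
```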
TLDR2;
- 0ED vocabulary is largely contained within ED+.
- ED+ expands well beyond 0ED.
- Every high-frequency backbone token in 0ED exists in ED+.
- 195 of 207 0ED bigrams survive in ED+. That is 94% retention.
- ED+ expands the bigram alphabet: it adds 60 new bigrams, growing the bigram space from 207 to 255.
And here's the first chart. This is the vocabulary growth between 0ED and ED+. The growth rate is very smooth, which suggests there was no big shift between 0ED and ED+. Despite all of those differences above, it's still the same base "engine" chugging along with no dramatic change.
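The curve itself is just cumulative vocabulary size as the pages are read in order; something like this (plotting left out):

```python
def vocab_growth(pages):
    """pages: token lists in manuscript order.
    Returns the cumulative count of distinct tokens after each page."""
    seen, curve = set(), []
    for page in pages:
        seen.update(page)
        curve.append(len(seen))
    return curve
```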
Sometimes, broken things deserve to be repaired.
In ED+ there are 4,260 unique tokens that do not exist on 0ED pages (a different count from the OOV figures above). If we take those tokens and find the minimum edit distance from each to a 0ED token:
- Edit distance 1: 2,870
- Edit distance 2: 1,123
- Edit distance 3: 218
- Edit distance >3: 49
I gotta bold this to make sure it's seen
Around 94% of ED+-only vocabulary is within edit distance ≤2 of 0ED.
Let me put that another way.
Out of 4,260 unique tokens on ED+ pages,
3,993 can be turned into a 0ED token by changing at most 2 characters.
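Those buckets come from the minimum edit distance between each ED+-only type and the 0ED vocabulary. A brute-force sketch (slow, but it makes the measure explicit):

```python
from collections import Counter

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (sub/del/ins each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def distance_buckets(plus_only_types, base_vocab):
    """Count ED+-only types by distance to their nearest 0ED token."""
    buckets = Counter()
    for t in plus_only_types:
        d = min(levenshtein(t, v) for v in base_vocab)
        buckets[d if d <= 3 else ">3"] += 1
    return buckets
```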
I had to keep repairing things...
I set up a chain: I took all of the tokens that were edit distance 1 from a 0ED token, then compared all of the remaining tokens (edit distance >= 2) against them.
By repeating this checking and rechecking of edit distances, I came to an abrupt stop at gen 6 (see the sketch after the generation counts below):
- Gen 1: 2,337
- Gen 2: 630
- Gen 3: 160
- Gen 4: 30
- Gen 5: 9
- Gen 6: 1
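One way to read that procedure is as a transitive closure over edit distance 1: anything within one edit of the already-accepted set gets accepted, and the check repeats until nothing new comes in. A sketch under that reading, reusing the `levenshtein` helper above:

```python
def chain_generations(plus_only_types, base_vocab, max_dist=1):
    """Grow the accepted set generation by generation: a token is accepted
    once it is within `max_dist` edits of an already-accepted token.
    Returns the per-generation counts and whatever is left unreachable."""
    accepted = set(base_vocab)
    remaining = set(plus_only_types)
    generation_sizes = []
    while True:
        newly = {t for t in remaining
                 if any(levenshtein(t, a) <= max_dist for a in accepted)}
        if not newly:
            break
        generation_sizes.append(len(newly))
        accepted |= newly
        remaining -= newly
    return generation_sizes, remaining
```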
1,093 tokens were still unreachable. So I relaxed the rules a bit: I started allowing edit distance 2, and then edit distance 3.
After 2 rounds of relaxing like this, I was left with 15 tokens that could not be chained back to 0ED, and every single one of them was longer than 8 characters. I figured those might be transcription errors or two words run together. I compared them to shorter tokens and was able to split each of them into two words, and all of those halves were then within edit distance 1 or 2 of a 0ED token or a previously repaired token.
Ok, so every single token on the ED+ pages could be chained back to a 0ED token with edits of distance 3 or less at each step.
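For the long leftovers, the split test is just: try every cut point and ask whether both halves land close to something already accepted. Roughly:

```python
def splittable(token, accepted_vocab, max_dist=2, min_len=2):
    """True if the token can be cut into two halves that are each within
    `max_dist` edits of some already-accepted token."""
    near = lambda part: any(levenshtein(part, v) <= max_dist for v in accepted_vocab)
    return any(near(token[:i]) and near(token[i:])
               for i in range(min_len, len(token) - min_len + 1))
```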
I checked Zandbergen/Landini as well.
99.07% of the ED+-only vocabulary was absorbed within ≤3 edits.
I had 45 tokens left over. 45 of 45 (100%) have a split where both halves are within ≤3 edits of another checked token.
Ok... that can't possibly be right. It means I can edit any word a few times (ok, 6 at most, maybe 7) and make every single word from one half of the book match the other half.
Well, kinda
Latin (Caesar)
- Plus-only unique tokens: 2,597
- Dist=1: 860
- Dist=2: 940
- Dist=3: 506
- Dist>3: 291
- Within ≤2: 69.31%
- Within ≤3: 88.79%

English (Dracula)
- Plus-only unique tokens: 1,184
- Dist=1: 374
- Dist=2: 413
- Dist=3: 226
- Dist>3: 171
- Within ≤2: 66.47%
- Within ≤3: 85.56%
So yes, they can be repaired. BUT!
Voynich
- Within ≤2: ~93.7%
- Within ≤3: ~98.8%
Voynich words are more similar to each other than Latin or English words are. To old hands at the Voynich this is no huge surprise. But it does show that what appear to be two very different sections of the book are really only a few edit distances apart.
Conclusion.
I'm likely going to get beat up over this but, here goes:
- Currier Languages A and B are not distinct languages, and Currier himself noted as much.
- 0ED pages were likely created prior to the ED+ pages. I said likely! I don't have solid proof but the difference in vocabulary and bigrams suggests it.
- 0ED and ED+ are not behaving like normal text. Well, the whole Voynich doesn't behave like normal text so no surprise there.
- 0ED and ED+ look like two regimes, but not two vocabularies: ED+ is almost entirely built from 0ED by tiny edits. The same "engine", different settings.
So, I hope I've given enough evidence to show how these two regimes differ on the surface but share the same underlying system. I'll be interested in hearing your thoughts.
The big question now is:
Why does a lexical "engine" make a drastic switch like "ed" if the vocabulary isn’t actually changing much?
I think I can answer that. But that's for another post.
Disclaimer: I have tried to review all of these numbers and I believe them to be reasonably accurate. I may have missed something, but hopefully nothing drastic.