The Voynich Ninja

The oddities of the bigram "ed" pt. 4 : The Chunking of Scribes
I decided to start another thread on this. While it ties directly into my work on the copy/mutate ledger, this deserves to be tied into my previous work on the bigram <ed> and to have its own post because of the... "Huh?" factor. This may already be known, but I'm just discovering it, so pardon me if I'm stepping on someone else's work.

Here are my three previous posts on that subject for reference (links not shown in this view).

In those posts I try to describe the differences between Currier A, which is Scribe 1, and Currier B, which is Scribe 2+.  They are dominated by the difference between the bigrams <ho> and <ed>, which I defined as the ED0 and ED+ regimes.  My work on the generator focuses on Scribe 1 because of this apparent "regime shift", in order to try to gain stability in generator production.

Today, I decided to explore that a little further to see what effects I would need to anticipate should I decide to model Scribe 2+, and here are the interesting results I got.

First, I had codex create a basic machine learning script that would scan through the pages created by specific scribes (excluding any pages/sheets classified as being from 2 different scribes) and I had it look for syllabic chunks, things that resembled syllables.  After creating 5 data files, one for each scribe, I had codex create a second script that would combine and compare those results to see how different scribes used different chunks.  Here's the result:
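For anyone who wants to try something similar: the script itself isn't posted, so what follows is only a minimal sketch of the general idea, assuming plain n-gram counting stands in for the ML syllable detection. The function names (`ngrams`, `chunk_counts`, `compare`) and the toy EVA word lists are made up for illustration.

```python
from collections import Counter

def ngrams(word, n_min=2, n_max=4):
    """Yield every substring of length n_min..n_max as a candidate chunk."""
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            yield word[i:i + n]

def chunk_counts(words, top=150):
    """Count candidate chunks in a word list and keep the `top` most frequent."""
    counts = Counter(chunk for w in words for chunk in ngrams(w))
    return dict(counts.most_common(top))

def compare(per_scribe):
    """Build {chunk: {scribe: count}} over the union of all inventories."""
    all_chunks = set().union(*(c.keys() for c in per_scribe.values()))
    return {ch: {s: c.get(ch, 0) for s, c in per_scribe.items()}
            for ch in sorted(all_chunks)}

# Hypothetical toy word lists in EVA; real input would be each scribe's pages.
pages = {
    "scribe1": ["chol", "chor", "daiin", "shol"],
    "scribe2": ["chedy", "shedy", "qokedy", "daiin"],
}
per_scribe = {s: chunk_counts(ws) for s, ws in pages.items()}
table = compare(per_scribe)
print(table["ch"])  # {'scribe1': 2, 'scribe2': 1}
print(table["ed"])  # {'scribe1': 0, 'scribe2': 3}
```

The cap of 150 mirrors the cap mentioned later in the thread; everything else here is a stand-in.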

Both charts are sorted by Scribe 1 counts for comparison.

Total count top chunks used by each scribe:

[attachment=15807]

Normalized count of top chunks used by each scribe:

[attachment=15808]
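The post doesn't say how the normalized counts were computed. One common choice, assumed here, is each chunk's share of that scribe's total chunk count, which makes scribes with very different amounts of text comparable. The counts below are invented.

```python
def relative_freq(counts):
    """Convert raw chunk counts to shares that sum to 1 for one scribe."""
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

# Hypothetical raw counts: scribe_b has far less text than scribe_a.
scribe_a = {"ch": 100, "dy": 50, "ol": 50}
scribe_b = {"ch": 10, "dy": 20, "ol": 10}

share_a = relative_freq(scribe_a)
share_b = relative_freq(scribe_b)
print(share_a["ch"], share_b["ch"])  # 0.5 0.25
print(share_a["dy"], share_b["dy"])  # 0.25 0.5
```

In raw counts scribe_a uses every chunk more; in shares, scribe_b clearly favors `dy`.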

Here's my interpretation of what I'm seeing:

To me, this looks cumulative and directional.  The later scribes are not replacing the earlier ecology, they are amplifying parts of it and reducing usage of other parts.

And it's monotonic. Usage is consistently moving in one direction without reversing by scribe.

[attachment=15810]

[attachment=15811]
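A simple way to check the "monotonic" claim chunk by chunk is to look at the sign of every step between consecutive scribes. This is a sketch with invented shares for three placeholder chunks, assuming the scribes are taken in order 1 to 5.

```python
def direction(values):
    """Classify an ordered series as 'increasing', 'decreasing', or 'mixed'."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d > 0 for d in diffs):
        return "increasing"
    if all(d < 0 for d in diffs):
        return "decreasing"
    return "mixed"

# Hypothetical normalized shares for scribes 1..5 (not measured values).
shares = {
    "chunk_a": [0.090, 0.070, 0.060, 0.040, 0.030],
    "chunk_b": [0.010, 0.020, 0.035, 0.050, 0.080],
    "chunk_c": [0.040, 0.060, 0.030, 0.050, 0.045],
}
for chunk, series in shares.items():
    print(chunk, direction(series))
```

Running it over every chunk and counting how many come out "increasing" or "decreasing" gives a rough number to put next to the charts.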

Taking this method of ML chunking one step further, I had codex examine these syllables by quire and sheet and create a PCA chart.  In this chart, a dot represents a sheet. The number above the dot represents the quire number.

[attachment=15813]
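For reference, here is roughly what a per-sheet PCA does, written without any libraries so it runs anywhere. It is a sketch, not the script behind the chart: each row is assumed to be one sheet's chunk shares, the numbers are invented, and the axes are found by power iteration (a library routine such as an SVD would normally be used instead).

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract column means so the axes describe variation around the average sheet."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def leading_axis(rows, iters=200, seed=0):
    """Dominant principal axis of centered rows via power iteration on X^T X."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in rows[0]]
    for _ in range(iters):
        scores = [dot(row, v) for row in rows]        # X v
        w = [dot(scores, col) for col in zip(*rows)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project each row onto the first two principal axes."""
    x = center(rows)
    pc1 = leading_axis(x)
    # Deflate: remove the PC1 direction, then find the next strongest axis.
    resid = [[a - dot(row, pc1) * p for a, p in zip(row, pc1)] for row in x]
    pc2 = leading_axis(resid, seed=1)
    return [(dot(row, pc1), dot(row, pc2)) for row in x]

# Hypothetical sheets: rows are chunk shares, two sheets per scribe.
labels = ["scribe1", "scribe1", "scribe2", "scribe2"]
sheets = [[0.50, 0.30, 0.20], [0.48, 0.32, 0.20],
          [0.20, 0.30, 0.50], [0.22, 0.28, 0.50]]
coords = pca_2d(sheets)
for label, (a, b) in zip(labels, coords):
    print(label, round(a, 3), round(b, 3))
```

The sign of each axis is arbitrary; what matters is that sheets with similar chunk shares land near each other on the first axis.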

Possible explanations, since this does not look like random drift:
  • Source-pool inheritance: Different scribes may have preferentially copied from different quire/sheet pools, which matches my proposed sheet-source pool idea for page creation.  Furthermore, if pages were created using source sheets, the chunk ecology may suggest that later scribes preferentially reused sheets already created by earlier scribes and may have preferred the most recent scribe's work. (sorry, had to list this one first because of my working theory, not that it has any other specific weight.)
  • Regime reweighting: The scribes may have used slightly different weighting systems for selecting or constructing words while still operating within the same broader production framework.
  • Progressive stabilization: Over time, productive word families may have become increasingly dominant, with prefixes and endings becoming more specialized and certain chunks becoming structurally “sticky.”
  • Scribal training lineage: Later scribes may have learned from already-shifted examples. Scribe 5 inherits an ecology already shaped by Scribe 4, Scribe 4 inherits one shaped by Scribe 3, and so on.
  • Section/topic dependence: I have not specifically tested this yet, but different scribes are known to work in specific sections, so some chunk ecology differences could reflect section-specific production behavior rather than purely scribal differences.

And there are certainly other plausible explanations as well. What I'm seeing though is that the chunk ecology appears structured and directional rather than random.

EDIT:  To test whether this n-gram chunking might be section-specific, I reproduced the PCA chart with Herbal-only pages.  It's still showing a clean separation between scribes.

[attachment=15816]
I don't think I understand what these graphs show. What is "Chunk Count" in the first graph, is it the total count per corpus for particular hand? Why do all five scribes have comparable overall chunk count?
(28-05-2026, 06:22 PM)oshfdk Wrote: I don't think I understand what these graphs show. What is "Chunk Count" in the first graph, is it the total count per corpus for particular hand? Why do all five scribes have comparable overall chunk count?

Basically, the ML script looked at the Voynich text for each scribe.  It then tried to identify syllables, not just bigrams and trigrams. So you have a mixture of n-grams like <ch>, <co>, <che>, <aiin>.  Then it counted and weighted the occurrences of each of those n-grams (chunks) in each scribe's pages.  So, in the first 2 charts you have the n-gram on the x axis and the count on the y axis.

The second set of charts takes specific n-grams that increase or decrease by scribe usage and plots those by count.  What that is showing is that specific n-gram usage is increasing or decreasing linearly by scribe.  Not random ups and downs, a smooth progression of n-gram usage.

What I expected to see was some randomness, since each of these scribes is generally section-specific.  That's not what I got. Their n-gram usage is linear.  What I'm seeing instead is that these n-grams increase or decrease by scribe, and it's not like Scribe 2 preferred <ch> more than Scribe 5.  It's sequential.  Scribe 1 used <ch>.  Scribe 2 used it less.  Scribe 3 even less... etc.

I know you are no fan of the copy/mutate idea but imagine this:  Scribe 1 writes a page.  Scribe 2 gets that page and does a copy/mutate and in the process, uses some n-grams a little less, some a little more. Then Scribe 3 gets Scribe 2's page and they do a copy and mutate.  Because some n-grams are used less and some more, Scribe 3 amplifies this slightly.  Now those same n-grams are less or more on the Scribe 3 page than Scribe 2 or 1.  Now Scribe 4 gets that page and does a copy/mutate.... etc.
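The chain described above can be simulated in a few lines. This is a toy sketch, assuming every copying pass applies the same small multiplicative bias to each chunk and then renormalizes; the starting shares and the bias values are invented, not measured.

```python
def normalize(weights):
    total = sum(weights.values())
    return {ch: w / total for ch, w in weights.items()}

def copy_mutate(weights, bias):
    """One copying pass: scale each chunk's share by its bias, then renormalize."""
    return normalize({ch: w * bias.get(ch, 1.0) for ch, w in weights.items()})

# Hypothetical starting shares for Scribe 1 and per-pass biases.
start = normalize({"ch": 40, "ol": 30, "dy": 20, "ed": 10})
bias = {"ch": 0.9, "ol": 0.95, "dy": 1.1, "ed": 1.3}

history = [start]
for _ in range(4):  # scribes 2..5, each copying from the previous one
    history.append(copy_mutate(history[-1], bias))

for n, shares in enumerate(history, start=1):
    print(n, {ch: round(x, 3) for ch, x in shares.items()})
```

Under these assumptions the chunk with the lowest bias (`ch`) shrinks at every step and the one with the highest bias (`ed`) grows at every step. Chunks with in-between biases are not guaranteed to move in one direction, because renormalization depends on what the other chunks are doing.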

I'm not saying that's definitely the explanation but I think it's one possible explanation.
Maybe I'm mistaken, but I think there are a lot of hand 1 pages in the manuscript, quite a number of hand 2 pages and only a few pages each for the rest of the scribes. The first graph appears to look as if the count of chunks is higher for scribe 5 than scribe 1. How is this possible?

Edit: ok, you said "counted and weighted", now I see, however do these absolute values make any sense after counting and weighting?
(28-05-2026, 06:56 PM)oshfdk Wrote: Maybe I'm mistaken, but I think there are a lot of hand 1 pages in the manuscript, quite a number of hand 2 pages and only a few pages each for the rest of the scribes. The first graph appears to look as if the count of chunks is higher for scribe 5 than scribe 1. How is this possible?

Edit: ok, you said "counted and weighted", now I see, however do these absolute values make any sense after counting and weighting?

Weighted as in machine learning scoring:

Code:
Voynich Scribe Chunk Comparison v1
============================================

Scribe 1: 150 chunks
Top 20:
  ch        count=1099  pos=middle start=0.3813 mid=0.6115 end=0.0064
  iin        count=256    pos=end start=0.0 mid=0.0391 end=0.9609
  ol        count=482    pos=end start=0.1577 mid=0.3797 end=0.4606
  od        count=407    pos=middle start=0.0713 mid=0.7838 end=0.1425
  sh        count=381    pos=middle start=0.4751 mid=0.5013 end=0.021
  ee        count=414    pos=middle start=0.0314 mid=0.9565 end=0.0121
  ok        count=331    pos=middle start=0.3656 mid=0.6163 end=0.0151
  or        count=329    pos=end start=0.0821 mid=0.3161 end=0.5988
  cth        count=139    pos=start start=0.5036 mid=0.4676 end=0.0216
  ckh        count=115    pos=middle start=0.3217 mid=0.6609 end=0.0174
  ot        count=259    pos=middle start=0.4672 mid=0.5097 end=0.0193
  che        count=286    pos=middle start=0.3671 mid=0.6294 end=0.0035
  ar        count=208    pos=end start=0.0481 mid=0.3125 end=0.6346
  aiin      count=194    pos=end start=0.0 mid=0.0464 end=0.9485
  al        count=195    pos=end start=0.0462 mid=0.4308 end=0.5179
  ai        count=363    pos=middle start=0.0386 mid=0.9587 end=0.0028
  cho        count=435    pos=middle start=0.4161 mid=0.5195 end=0.0621
  dy        count=265    pos=end start=0.0491 mid=0.0151 end=0.9321
  eo        count=433    pos=middle start=0.0023 mid=0.8868 end=0.1109
  ey        count=234    pos=end start=0.0 mid=0.0385 end=0.9615

Scribe 2: 150 chunks
Top 20:
  che        count=266    pos=middle start=0.3759 mid=0.6241 end=0.0
  she        count=184    pos=middle start=0.4674 mid=0.5272 end=0.0
  dy        count=404    pos=end start=0.0322 mid=0.0223 end=0.9431
  ol        count=348    pos=middle start=0.3161 mid=0.3477 end=0.3333
  ch        count=551    pos=middle start=0.3793 mid=0.6189 end=0.0
  iin        count=112    pos=end start=0.0 mid=0.0357 end=0.9643
  qo        count=214    pos=start start=0.9813 mid=0.0047 end=0.0093
  ee        count=280    pos=middle start=0.0357 mid=0.95 end=0.0143
  al        count=182    pos=end start=0.1044 mid=0.4396 end=0.4505
  ar        count=152    pos=end start=0.0724 mid=0.2368 end=0.6842
  ckh        count=66    pos=middle start=0.1061 mid=0.8788 end=0.0152
  sh        count=292    pos=middle start=0.476 mid=0.5171 end=0.0034
  or        count=121    pos=end start=0.1736 mid=0.1653 end=0.6529
  ke        count=176    pos=middle start=0.1364 mid=0.8636 end=0.0
  cth        count=52    pos=middle start=0.25 mid=0.75 end=0.0
  ai        count=186    pos=middle start=0.0484 mid=0.9462 end=0.0054
  ey        count=187    pos=end start=0.0 mid=0.0267 end=0.9733
  ok        count=182    pos=middle start=0.4451 mid=0.544 end=0.0055
  edy        count=205    pos=end start=0.0 mid=0.0195 end=0.9805
  aiin      count=85    pos=end start=0.0 mid=0.0235 end=0.9647

Scribe 3: 150 chunks
Top 20:
  che        count=514    pos=middle start=0.3268 mid=0.6693 end=0.0019
  ch        count=1074  pos=middle start=0.3613 mid=0.635 end=0.0028
  aii        count=260    pos=middle start=0.0769 mid=0.9231 end=0.0
  she        count=216    pos=middle start=0.4444 mid=0.537 end=0.0139
  ee        count=617    pos=middle start=0.0146 mid=0.9789 end=0.0065
  dy        count=432    pos=end start=0.0162 mid=0.0093 end=0.9722
  ol        count=391    pos=end start=0.3095 mid=0.3223 end=0.3657
  al        count=392    pos=end start=0.1403 mid=0.4184 end=0.4388
  ar        count=339    pos=end start=0.0737 mid=0.236 end=0.6873
  ok        count=331    pos=middle start=0.4169 mid=0.574 end=0.0091
  ot        count=267    pos=start start=0.5506 mid=0.4419 end=0.0037
  sh        count=412    pos=start start=0.5 mid=0.4951 end=0.0049
  ai        count=460    pos=middle start=0.0913 mid=0.9087 end=0.0
  or        count=197    pos=end start=0.1269 mid=0.198 end=0.6701
  aiin      count=189    pos=end start=0.0 mid=0.0106 end=0.9841
  ckh        count=75    pos=middle start=0.1467 mid=0.8533 end=0.0
  od        count=253    pos=middle start=0.0711 mid=0.8103 end=0.1186
  ey        count=267    pos=end start=0.0 mid=0.0225 end=0.9775
  ed        count=422    pos=middle start=0.0 mid=0.8957 end=0.1043
  eo        count=408    pos=middle start=0.0 mid=0.8529 end=0.1471

Scribe 4: 150 chunks
Top 20:
  ch        count=505    pos=middle start=0.4792 mid=0.5149 end=0.004
  eo        count=291    pos=middle start=0.0069 mid=0.8969 end=0.0962
  al        count=247    pos=middle start=0.1174 mid=0.4696 end=0.4089
  ot        count=230    pos=start start=0.787 mid=0.2 end=0.0087
  ok        count=229    pos=start start=0.6638 mid=0.3319 end=0.0044
  ee        count=305    pos=middle start=0.0328 mid=0.9639 end=0.0033
  aiin      count=69    pos=end start=0.0145 mid=0.0145 end=0.9565
  dy        count=157    pos=end start=0.0064 mid=0.0255 end=0.9618
  ol        count=198    pos=middle start=0.1465 mid=0.4899 end=0.3586
  sh        count=149    pos=start start=0.6309 mid=0.3624 end=0.0
  che        count=193    pos=start start=0.5285 mid=0.4663 end=0.0
  ar        count=146    pos=end start=0.089 mid=0.2877 end=0.6164
  ey        count=165    pos=end start=0.0 mid=0.0727 end=0.9212
  ai        count=156    pos=middle start=0.0833 mid=0.9167 end=0.0
  cho        count=150    pos=start start=0.5667 mid=0.38 end=0.0467
  or        count=88    pos=end start=0.2045 mid=0.1932 end=0.5909
  yk        count=69    pos=start start=0.7971 mid=0.1884 end=0.0145
  cth        count=29    pos=middle start=0.2759 mid=0.7241 end=0.0
  oke        count=106    pos=start start=0.6509 mid=0.3491 end=0.0
  od        count=147    pos=middle start=0.0816 mid=0.8639 end=0.0544

Scribe 5: 118 chunks
Top 20:
  che        count=84    pos=middle start=0.4048 mid=0.5952 end=0.0
  dy        count=119    pos=end start=0.0084 mid=0.0168 end=0.9664
  she        count=47    pos=start start=0.6383 mid=0.3404 end=0.0
  ch        count=161    pos=middle start=0.4099 mid=0.5839 end=0.0062
  ke        count=68    pos=middle start=0.1176 mid=0.8824 end=0.0
  ol        count=57    pos=end start=0.1579 mid=0.2807 end=0.5439
  qo        count=56    pos=start start=1.0 mid=0.0 end=0.0
  aiin      count=20    pos=end start=0.0 mid=0.0 end=0.95
  al        count=49    pos=end start=0.102 mid=0.3878 end=0.4898
  sh        count=76    pos=start start=0.6579 mid=0.3158 end=0.0132
  eo        count=73    pos=middle start=0.0 mid=0.8493 end=0.1507
  ar        count=34    pos=end start=0.0 mid=0.2647 end=0.7059
  ot        count=45    pos=start start=0.5778 mid=0.4222 end=0.0
  ok        count=66    pos=middle start=0.4091 mid=0.5455 end=0.0303
  ee        count=66    pos=middle start=0.0 mid=1.0 end=0.0
  ai        count=44    pos=middle start=0.0682 mid=0.9318 end=0.0
  qok        count=30    pos=start start=0.9667 mid=0.0 end=0.0
  cth        count=12    pos=middle start=0.3333 mid=0.6667 end=0.0
  ey        count=41    pos=end start=0.0 mid=0.0244 end=0.9756
  od        count=55    pos=middle start=0.0545 mid=0.7818 end=0.1455

Total unique chunks across inventories: 266

And yes, there are different page counts for different scribes.  ML weights them based on position and occurrence on a 0-1 scale.  Essentially a percentage.  Yes, Scribe 5 has a lot less content but it's still enough to get a basic score for the chunks.
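The `start`, `mid` and `end` figures in the listing read like shares of a chunk's occurrences by position in the word. Here is a minimal sketch of how such shares could be computed, with toy EVA words. One assumption to flag: the listing's three shares don't always sum to 1 (Scribe 5's `aiin` sums to 0.95), and this sketch puts matches that are the whole word into a fourth bucket, which is only a guess at where the remainder goes.

```python
from collections import Counter

def position_profile(words, chunk):
    """Share of a chunk's occurrences at word start, middle, end, or as the whole word."""
    pos = Counter()
    for w in words:
        i = w.find(chunk)
        while i != -1:
            at_start = i == 0
            at_end = i + len(chunk) == len(w)
            if at_start and at_end:
                pos["whole"] += 1
            elif at_start:
                pos["start"] += 1
            elif at_end:
                pos["end"] += 1
            else:
                pos["mid"] += 1
            i = w.find(chunk, i + 1)
    total = sum(pos.values())
    if not total:
        return {}
    return {k: round(pos[k] / total, 4) for k in ("start", "mid", "end", "whole")}

# Hypothetical toy words in EVA.
words = ["chol", "chedy", "otchy", "daiin", "aiin", "okaiin"]
print(position_profile(words, "ch"))
print(position_profile(words, "aiin"))
```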

It was capped at the top 150 'chunks', so Scribe 5 with 118 didn't quite have that many, but that's still enough to get measurements.
It's perhaps worth looking at scribe 2 under one more aspect.
He is responsible for two different parts of the MS: the 'herbal-B' part and the 'biological' part (quire 13).
These two parts have quite distinct properties, which should also show up on the 'chunk' level.
(They are dark blue and magenta in the plots on a linked page; link not shown in this view.)
On behalf of the scribes, I do not think they would much enjoy being chunked
(29-05-2026, 12:21 AM)ReneZ Wrote: It's perhaps worth looking at scribe 2 under one more aspect.
He is responsible for two different parts of the MS: the 'herbal-B' part and the 'biological' part (quire 13).
These two parts have quite distinct properties, which should also show up on the 'chunk' level.
(They are dark blue and magenta in the plots on a linked page; link not shown in this view.)

Excellent suggestion.  Here are the results.

This compares when a chunk starts a word.  Middle and terminal chunk charts show similar variation.

[attachment=15821]

And this chart basically asks which chunks are more important in which section.

[attachment=15822]
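How "importance" was scored isn't spelled out in the post. One simple stand-in, assumed here, is the log ratio of a chunk's share in one section to its share in the other, with a little smoothing so chunks missing from one side don't blow up. The counts are invented.

```python
import math

def log_ratio(counts_a, counts_b, smooth=0.5):
    """log2 of each chunk's share in A over its share in B, with additive smoothing."""
    keys = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smooth * len(keys)
    total_b = sum(counts_b.values()) + smooth * len(keys)
    return {
        k: math.log2(((counts_a.get(k, 0) + smooth) / total_a)
                     / ((counts_b.get(k, 0) + smooth) / total_b))
        for k in keys
    }

# Hypothetical counts for one scribe in two sections.
section_a = {"qo": 30, "ch": 10}
section_b = {"qo": 10, "ch": 30}
scores = log_ratio(section_a, section_b)
for chunk in sorted(scores, key=scores.get, reverse=True):
    print(chunk, round(scores[chunk], 3))
```

Positive scores mean the chunk is favored in the first section, negative in the second, and values near zero mean no real preference.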

And here's the PCA chart for Scribe 2 both sections.

[attachment=15823]

There's one other section that has 2 distinct scribes, Pharma which has Scribe 1 and 3.   I figured, since the data was there and I was spitting out charts, why not.  It's not as dense sheet-count wise but I think there's enough there to shed some light.

Chunk starting a word chart

[attachment=15824]

Chunk importance chart

[attachment=15825]

And a PCA for Pharma.  It's a bit scarce on data but I think we can still say the 2 scribes have... different chunk usage.

[attachment=15826]

And, so that we're comparing apples to apples, I ran the 2 line charts for Herbal as well.  Keep in mind, Scribe 5 only has 1 sheet in Herbal, so its data there is pretty limited.

Chunk starting a word chart

[attachment=15829]

And chunk importance chart

[attachment=15828]

And just for my own curiosity, here's my ED0 and ED+, which is essentially Scribe 1 vs all the other scribes combined.  What's really interesting is that because <ed> never (or almost never, I can't remember) appears as a prefix, my syllable ML script never picked it out as a syllable.  So what we're seeing here is essentially the weighting difference between n-grams without the huge explosion of <ed>.  So, this makes me wonder if all this <ed> isn't just a weighting change of other n-grams like... che -> dy.  All of the scribes used che. And they all used dy.  Scribe 1 never combined those 2 specific chunks (except maybe in 1 or 2 hapax tokens).

[attachment=15830]
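The che -> dy idea is easy to test directly: joining `che` and `dy` gives `chedy`, and the <ed> sits right at the seam. A minimal sketch, with invented word lists standing in for the ED0 and ED+ texts:

```python
def pair_rate(words, first, second):
    """Fraction of words containing `first` that have `second` immediately after it."""
    with_first = [w for w in words if first in w]
    joined = [w for w in with_first if first + second in w]
    return len(joined) / len(with_first) if with_first else 0.0

# Hypothetical toy word lists in EVA (not real page text).
ed0_words = ["cheol", "cheor", "chey", "dy", "chody"]
edplus_words = ["chedy", "shedy", "cheol", "qokchedy"]

print("ed" in "che" + "dy")                              # True
print(pair_rate(ed0_words, "che", "dy"))                 # 0.0
print(round(pair_rate(edplus_words, "che", "dy"), 3))    # 0.667
```

Comparing this rate per scribe, for each chunk ending in `e` followed by each chunk starting with `d`, would show how much of the <ed> count comes from such joins.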

So my quick eyeball interpretation: The strongest signal appears to be chunk reweighting. The chunk inventory remains pretty much the same across scribes and sections, and many chunks retain the same positional preferences. What changes most consistently is how heavily particular chunks are favored/used by different scribes even within the same section.
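To put numbers on "same inventory, different weights", two simple measures can be set side by side: Jaccard overlap of the chunk inventories, and total variation distance between the chunk shares. This is a sketch with invented shares; neither measure is claimed to be what the charts use.

```python
def jaccard(a, b):
    """Overlap of two chunk inventories: 1.0 means identical sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def total_variation(p, q):
    """Half the summed absolute difference in shares: 0 is identical, 1 is disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical chunk shares for two scribes in the same section.
scribe_x = {"ch": 0.5, "ol": 0.3, "dy": 0.2}
scribe_y = {"ch": 0.3, "ol": 0.2, "dy": 0.5}

print(jaccard(scribe_x, scribe_y))                     # 1.0
print(round(total_variation(scribe_x, scribe_y), 3))   # 0.3
```

High overlap together with a clearly non-zero distance is the reweighting pattern described above.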

And that brings me back to ED0 and ED+.  What I'm seeing here, it's not just a regime shift caused by Scribes or Currier hands, it's a combination of Scribes and section.  The section difference makes sense.  The scribe difference within the same section, on the surface, does not.

Thanks for the suggestion Rene.
(29-05-2026, 03:53 AM)Dunsel Wrote: Pharma which has Scribe 1 and 3.

I only have scribe 1 for pharma. Could you check?
(29-05-2026, 05:06 AM)ReneZ Wrote: I only have scribe 1 for pharma. Could you check?

Ok, I think I found the confusion in my data. What I attempted to do was merge the data from your site regarding quires/sheets/Currier A & B with her scribal hands.  The traditional quire method from you and Currier has 20 quires.  Davis reduces it down to 18.  So, I copied her scribe assignments onto the quire data in your quire sequential order but not her quire order.  And that's where a mis-labeling happened.  Two sheets are missing from that section which would account for the missing quire numbers q16 and q18 in the traditional numbering.  She's assigning those missing sheets to Q16.

So technically, using the 20-quire method and her scribal hands, the chart is correct.  She has Scribe 3 creating f94 and f95.  Using her quire numbering system, it's mislabeled and should be q15 and q16.  And to make things even more confusing, she listed those quires as Botanical and not Pharma.
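The general pitfall here is joining two tables on a quire label when the two sources number quires differently. A sketch of the difference, using placeholder folio names and made-up labels (none of these rows are real assignments): joining on the folio stays correct, joining on the quire label does not.

```python
# Hypothetical rows in scheme A: (folio, quire label).
scheme_a = [("fol_a", "q15"), ("fol_b", "q17"), ("fol_c", "q19")]
# Hypothetical rows in scheme B: (folio, quire label, scribe).
scheme_b = [("fol_a", "q15", 1), ("fol_b", "q16", 3), ("fol_c", "q17", 2)]

def join_by_folio(a_rows, b_rows):
    """Robust join: the folio is the same object in both schemes."""
    scribe = {folio: s for folio, _, s in b_rows}
    return {folio: (quire, scribe.get(folio)) for folio, quire in a_rows}

def join_by_quire_label(a_rows, b_rows):
    """Fragile join: the same label can mean different quires in each scheme."""
    scribe = {quire: s for _, quire, s in b_rows}
    return {folio: (quire, scribe.get(quire)) for folio, quire in a_rows}

print(join_by_folio(scheme_a, scheme_b))
print(join_by_quire_label(scheme_a, scheme_b))
```

In the second result `fol_b` picks up the wrong scribe and `fol_c` gets none at all, which is the kind of silent mislabeling that only shows up when someone else checks the chart.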

I really have not worked with quires and scribes past the Herbal section, so I didn't notice this till now, and thankfully it doesn't affect any of my other work.  So, going forward, I'm going to stick with her quire numbering just so I have consistency, but I'll make a note of it for future reference.

Thanks, you had me worried I really screwed something up.

Quote:Lisa Fagin Davis, How Many Glyphs and How Many Scribes? Digital Paleography and the Voynich Manuscript (2020)

[attachment=15834]