Wonderful, I'm looking forward to subsequent papers. Some remarks:
1) This one is not aimed at the paper but rather at general practice:
"The Recipes section is distinguished by pages with paragraphs of text separated by assortments of labelled herbs, leaves, or roots."
--> this is one of the two reasons why I find the traditional section names problematic: the name "recipes section" is usually reserved for Q20 (stars), but sometimes people understandably apply it to the preceding section instead (traditionally known as "pharma section").
The other reason is that especially to newcomers, the traditional section names may appear as an actual assessment of their contents rather than just names.
2) The colored overview of section - hand - language on p.6 is very handy.
3) Why did you use character set size (h0) instead of h1? If I understand correctly, h1 is the more telling value since it also takes frequency into account. And isn't the exact h0 of a handwritten manuscript almost impossible to determine? Capitals, ligatures, abbreviations, positional variants... should all be counted, since we cannot eliminate these in Voynichese either.
4) Some time ago I mentioned my blog post on "improving" h2 by merging frequent n-grams. This post contained a mistake in the numbers, but I recently corrected and expanded upon it.
Basically, I take this, as you phrase it in your paper, to the extreme: "If certain glyphs occur primarily in a particular sequence, this may be evidence that the sequence of glyphs represents a single character." It is interesting to see how different VM sections react differently.
I cannot stress enough that my entropy posts are meant as experiments and not as proposed solutions or transcription systems. And I do agree with your statement that "one cannot simply assume that the low character entropy is due to our over-splitting of characters; even when they are grouped together, Voynichese is still unusual compared to other language samples."
Still, I think it must be noted that Voynichese's h2 can be lifted higher than is the case for any of the standard transliterations, without losing information and, importantly, while keeping h1 in check.
5) I am very happy that you explain why abjad solutions are a bad idea.
Having read the entire paper, I can only say that I certainly agree with the conclusions, and it should be mandatory reading for anyone interested in Voynichese statistics.
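For anyone who wants to experiment with remarks 3) and 4), here is a minimal Python sketch of the quantities involved. It is not the method from the paper or from my blog post, just an illustration under some assumptions: h0 is taken as log2 of the character set size, h1 as the usual frequency-based entropy, and h2 as the conditional entropy of a glyph given the preceding one. The function names and `merge_pair` are my own hypothetical helpers, and the sample word is only an illustrative EVA-style string.

```python
from collections import Counter
from math import log2


def h0(tokens):
    """log2 of the number of distinct tokens (ignores frequency)."""
    return log2(len(set(tokens)))


def h1(tokens):
    """First-order entropy: takes token frequencies into account."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * log2(c / n) for c in counts.values())


def h2(tokens):
    """Conditional entropy of a token given the previous one.

    Assumption: computed as H(pairs) - H(first element of each pair).
    """
    pairs = list(zip(tokens, tokens[1:]))
    return h1(pairs) - h1(tokens[:-1])


def merge_pair(tokens, a, b):
    """Replace every adjacent occurrence of a, b by one merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Illustrative input only; a real test needs a full transliteration.
    glyphs = list("qokedyqokeedyqokain")
    merged = merge_pair(glyphs, "q", "o")
    for label, seq in (("split", glyphs), ("merged", merged)):
        print(label, round(h0(seq), 3), round(h1(seq), 3), round(h2(seq), 3))
```

Working on a list of tokens rather than a string is what makes the merging experiment possible: a merged n-gram simply becomes one token, and h0, h1 and h2 can be compared before and after. On a toy input like the one above the numbers mean nothing, so any comparison should be run on a whole section of text.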