I think I found a way to make the result more universal by removing all parameters. This was inspired by Mauro's Nbit grammars, Rene's comment about many possible combinations, and the BPE encoding used to tokenize texts for machine learning tasks.
The idea is to perform a deterministic transformation that identifies the most frequent tokens of each text (independent of the writing system or any delimiters) and then compare the pair statistics for the top 15 most frequent tokens. So there are no central characters and no arbitrary splitting algorithm; there is nothing to fine-tune.
This works as follows: initially the text is treated as a sequence of unicode characters (one character is one token).
At each step the algorithm identifies the most frequent pair of characters, assigns a new token value to this pair, replaces this pair everywhere in the text with the new token value, and then computes the MDL (minimum description length) of the new text (this part was suggested by Gemini; I asked if something similar to Mauro's Nbit metric could be used here). If this new MDL is smaller than the previous MDL, the new token value is accepted and the whole step is repeated. If not, the update is reverted and the algorithm proceeds to the next most frequent pair, checking whether replacing it with a new token would reduce the MDL. If no pair reduces the MDL at all, the algorithm stops, having identified the single best pair encoding of this text according to this criterion.
In other words, the algorithm tries to compress the text by replacing pairs of tokens with new tokens until it reaches the point where there is no way to further reduce the total size of the representation (token sequence + token vocabulary), coming to a single most compact BPE representation of the text.
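To make the procedure concrete, here is a minimal Python sketch of the loop described above. The exact MDL formula is my assumption (an entropy-coded token sequence plus a rough fixed cost per merge rule in the vocabulary); the real weighting of the vocabulary cost may differ, but the structure of the search is the same: keep merging the most frequent pair that still reduces the total description length, and stop when no pair does.

```python
import math
from collections import Counter

# Rough per-merge-rule storage cost in bits; an assumption, not a value from the post.
RULE_COST_BITS = 32.0

def mdl_bits(tokens, num_rules):
    # Description length = entropy-coded token sequence + vocabulary cost.
    counts = Counter(tokens)
    n = len(tokens)
    seq_bits = -sum(c * math.log2(c / n) for c in counts.values())
    return seq_bits + num_rules * RULE_COST_BITS

def merge_pair(tokens, pair, new_token):
    # Replace every non-overlapping occurrence of the pair with the new token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def compress(text):
    tokens = list(text)          # one Unicode character = one initial token
    rules = {}                   # new token -> (left, right)
    best = mdl_bits(tokens, 0)
    while True:
        pairs = Counter(zip(tokens, tokens[1:]))
        improved = False
        for pair, _ in pairs.most_common():   # try the most frequent pair first
            new_token = "<" + pair[0] + pair[1] + ">"
            candidate = merge_pair(tokens, pair, new_token)
            cand_mdl = mdl_bits(candidate, len(rules) + 1)
            if cand_mdl < best:               # accept only if the total size shrinks
                tokens, best = candidate, cand_mdl
                rules[new_token] = pair
                improved = True
                break                         # restart with the updated text
        if not improved:
            return tokens, rules              # no pair reduces MDL: done
```

Running compress(text) returns the fully compressed token sequence plus the merge rules; the top 15 tokens are then simply the most frequent entries of the resulting sequence.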
This lets us find a comparable text representation regardless of the language and the writing system. One strong piece of evidence that it works correctly: when running this algorithm on the pinyin and the Chinese-character versions of the Bencao, the resulting most frequent tokens largely overlap - the Chinese sequences up to three characters long and the corresponding pinyin sequences up to three characters long both made it into the top 15.
After completing this unique tokenization we take the top 15 tokens and compute the expected and actual counts of their combinations; that is, for tokens 'da' and 'in' we count the actual number of 'dain' occurrences and the expected number given the counts of 'da' and 'in' in the text. Then we produce the same actual/expected charts as before.
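A short sketch of how such a table can be computed, assuming the expected count of a pair is taken under independence of the two tokens (the exact normalization used for the charts is my assumption):

```python
from collections import Counter

def pair_statistics(tokens, top_n=15):
    counts = Counter(tokens)
    top = [t for t, _ in counts.most_common(top_n)]
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    stats = {}
    for a in top:
        for b in top:
            actual = pair_counts[(a, b)]
            # If tokens occurred independently, P(a then b) ~= P(a) * P(b).
            expected = (counts[a] / n) * (counts[b] / n) * (n - 1)
            stats[(a, b)] = (actual, expected)
    return stats
```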
The results shown are for English, Latin, Arabic, Chinese characters, Chinese pinyin and Voynichese, plus three special charts at the bottom, explained below. All texts are of different sizes, between 20 and 800 kilobytes in UTF-8 form; the algorithm is not very sensitive to the size of the text. Voynichese is a clear outlier, with many more token pairs that appear close to the number of times they would appear if the selection of tokens were made independently of one another. Among the texts, Arabic looks the most similar to Voynichese, but is still very far from it.
For clarity, spaces are represented as mid-height dots in the tables (to make them visible), and I also replaced the Voynichese spaces (. and ,) with the same symbol.
We can try other languages and texts, but I suspect that none of them would reach the level of uniformity of Voynichese.
Given that a possible explanation for various peculiarities of the text is a lot of errors made in preparing it, I also computed the chart for "mangled English", where each character was replaced, with 15% probability, by a similar-looking or similar-sounding letter, producing a result like this: "soil for the supdort of the planf. The raot, therefare, fulfils a". The resulting chart falls somewhere between English and Voynichese; however, I'm not sure we can treat this result as valid, since what happened here is a nearly perfect machine randomization using a modern, statistically stable algorithm, something unlikely to happen when creating the actual MS. I don't think actual mistakes made by humans would be this random, and there would still be many more patterns. In any case, to reach the same state as the Voynichese chart, about 30% of the characters have to be mangled, which makes the English text completely unreadable. If we are talking about this number of mistakes, the text is essentially lost.
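For reference, a minimal sketch of the mangling procedure described above; the confusion table here is only a small illustrative subset (the table I actually used covered more letter pairs):

```python
import random

# Illustrative subset of a look-alike / sound-alike confusion table (an assumption).
CONFUSIONS = {'t': 'f', 'p': 'd', 'o': 'a', 'e': 'a', 'b': 'd', 'n': 'm', 's': 'z'}

def mangle(text, p=0.15, seed=0):
    rng = random.Random(seed)
    return ''.join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < p else ch
        for ch in text
    )
```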
To me all this is a very strong indication that Voynichese is not a straightforward representation of any natural language, be it phonetic or logographic, faithful or with some errors (but still readable).
I also tried a nomenclator - assigning each distinct word in an English text a unique decimal or Roman numeral. While Roman numerals look more like ordinary language, decimal numerals absolutely overshoot the mark and look even more uniform than Voynichese, something I think Rene hinted at previously in this thread.
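The substitution itself is trivial; something along these lines (a sketch, with a small helper for the Roman numeral variant):

```python
def to_roman(n):
    vals = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'),
            (90, 'XC'), (50, 'L'), (40, 'XL'), (10, 'X'), (9, 'IX'),
            (5, 'V'), (4, 'IV'), (1, 'I')]
    out = []
    for v, s in vals:
        while n >= v:
            out.append(s)
            n -= v
    return ''.join(out)

def nomenclator(words, roman=False):
    # Each distinct word gets the next running numeral; the text is then rewritten.
    mapping, out = {}, []
    for w in words:
        mapping.setdefault(w, len(mapping) + 1)
        n = mapping[w]
        out.append(to_roman(n) if roman else str(n))
    return ' '.join(out)
```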
(One note about reading the charts, because this looked confusing to me at first: the chart for Voynichese lists the actual count of d + aiin as 0. This is not a mistake: in the best compressed BPE, daiin is encoded as a separate token and not as the combination d + aiin.)