The Voynich Ninja

About word length distribution
(05-05-2025, 08:56 AM)oshfdk Wrote: I tried writing Voynichese to scale (just with a pencil). I really struggled initially, because most strokes are 1-2 mm long. I think the majority of people studying the manuscript imagine the writing about 2-3 times larger than it actually is.

Thanks for pointing out that the writing is smaller than I had imagined.

Duplicating the text with a pencil (or with a ballpoint pen) is never going to be as simple as writing in ink. With those you need to press into the page to make the stroke marks. With ink (modern fountain pen or quill) you just need to glide the instrument without much pressure and the ink will flow by itself. So writing smaller is easier with ink, and less tiring on the hand.
(05-05-2025, 09:05 AM)MarcoP Wrote: The details of the binomial distribution of course depend on the transliteration system. We don't know what makes a Voynich character (EVA:a, iin, aiin?).

This feels like an important point to me. Shouldn't the starting point always be that we don't know the word length distribution?
(05-05-2025, 08:06 AM)dashstofsk Wrote: This seems very unlikely. If you examine the writing in the manuscript you will notice that there is a fluency to it. There doesn't appear to be much stop-start in the text. The words in each line broadly keep to the same baseline. The writer isn't putting the pen down after each word to roll dice (nor, for that matter, to consult any book cypher or perform any mathematical computation to determine what the next word should be). In my opinion the writing of each page was done in one sitting, in one uninterrupted rush of effort.

The script in principle allows this fluent type of writing. This is crucially important, because in that respect it differs from essentially all old ciphers based on invented alphabets, whether mono-alphabetic or poly-alphabetic.

However, in practice it is rarely written in a fluent manner. The baseline jumps up and down irregularly, and the baseline of individual words is not always very straight either. I am sure that the quality varies throughout the MS.

To see what I mean, just look closely at the words in the first two lines here:
[link]

Many other examples (good and bad) can be found.
More specifically on the topic of word length distribution: it is important to distinguish between the word type distribution (each distinct word counted once) and the word token distribution (every occurrence counted).

The binomial case is for word type lengths, meaning that the approach of throwing dice when writing out the text does not work.
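
As a minimal sketch of the type/token distinction (in Python, with a toy token list standing in for a real transliteration):

from collections import Counter

def length_distributions(tokens):
    # token lengths count every occurrence; type lengths count each distinct word once
    token_lengths = Counter(len(w) for w in tokens)
    type_lengths = Counter(len(w) for w in set(tokens))
    return type_lengths, token_lengths

tokens = 'daiin chol chol shedy daiin daiin qokain'.split()
type_counts, token_counts = length_distributions(tokens)
print(type_counts)    # e.g. Counter({5: 2, 4: 1, 6: 1}): daiin/shedy, chol, qokain
print(token_counts)   # e.g. Counter({5: 4, 4: 2, 6: 1})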

For more information see here: [link]
specifically under section 4.6, which is mainly an introduction to a copy of Stolfi's analysis.
(If the link to Stolfi's article fails, refresh the page first because I just had to fix the link).
While we're at it:

You can also discard reading and writing with a hole template (grille). The alignment markers would have to be visible, and it would be difficult to produce one without a chequered or lined background. The word spacing and the varying line heights make it even more difficult.
(05-05-2025, 11:44 AM)ReneZ Wrote: More specifically on the topic of word length distribution: it is important to distinguish between the word type distribution and the word token distribution. The binomial case is for word type lengths, meaning that the approach of throwing dice when writing out the text does not work.

Correct: the distinct words of a given length follow the binomial distribution exactly; the token length distribution follows it only approximately (less so for labels).
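
(Here "binomial" means that, up to a scale factor, the number of distinct words of length k follows the binomial probability mass function

P(\ell = k) = \binom{n}{k} p^k (1 - p)^{n - k}

with n and p fitted to the transliteration; for p near 1/2 this gives a symmetric, single-peaked curve.)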
(06-05-2025, 06:07 AM)tikonen Wrote: Correct: the distinct words of a given length follow the binomial distribution exactly; the token length distribution follows it only approximately (less so for labels).

The distinct words follow a binomial distribution exactly if you make the specific choices about what constitutes a single glyph that Stolfi did, i.e. ligatures and ligatured gallows count as single characters but the word-final ai*[x] combinations count as multiple characters. For instance, from his data file:

437 shedy {Sh}{e}{d}{y} 4
403 chol {Ch}{o}{l} 3
[...]
219 qokain {q}{o}{k}{a}{i}{n} 6
[...]
115 cthy {CTh}{y} 2

Whether that is saying something about the correct decomposition of the glyphs is an interesting question...
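
As a side note, lines in that count/word/decomposition/length format are easy to check mechanically. A short Python sketch (the exact file format beyond the examples above is an assumption):

import re

def parse_line(line):
    # '437 shedy {Sh}{e}{d}{y} 4' -> (437, 'shedy', ['Sh', 'e', 'd', 'y'])
    count, word, decomp, length = line.split()
    glyphs = re.findall(r'\{([^}]*)\}', decomp)
    assert len(glyphs) == int(length)   # the last field is the glyph count
    return int(count), word, glyphs

print(parse_line('115 cthy {CTh}{y} 2'))   # (115, 'cthy', ['CTh', 'y'])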
(04-05-2025, 09:51 PM)Koen G Wrote: Those are some cool dice!

My question is: why would they go through this effort?

I don't really know, and I'm not willing to speculate at this point; I still have a very limited understanding of all the aspects of the VM.

I'm approaching the VM with the assumption that it was generated with tooling that produces its telltale characteristics, like the binomial distribution. In my previous job in the gaming industry I worked with pseudorandom systems (like loot drops) and pseudo-language generation for MMORPG NPC (non-player character) and location names. The idea was not to produce something random but something that *looked* plausibly real, at least to a Western audience.

When I apply these lessons to the VM alphabet I get very similar output and token distributions.

For example, assume a slot machine that has 3 wheels with 6 positions each. Each position holds a unique token or syllable; some positions may be blank. Spin each wheel and concatenate the results in order to construct a word: the first wheel always supplies the start, the second the middle, and the third the ending. Here is a toy example:

wheel1 = ['a', 'b', 'aa', '', 'qo', 'abc']
wheel2 = ['a', 'b', 'aa', 'ab', 'aaa', 'abc']
wheel3 = ['a', 'b', 'aa', 'ab', 'in', 'abc']

Words constructed from these wheels look like:
abcabcb qoabab aabin abcaaab abcaaaa aba aaain qoaaab aaab qoaaaaa qoabaa aaaaa baaain abcaaaabc aaaaaabc abcabab aaabc qoba aabcin abcbab

The distribution matches the VM (for the number of unique tokens of a given length), as long as one generates enough words, at least several thousand.
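
A minimal sketch of that three-wheel generator in Python (the wheel contents are the toy values above; the uniform spin and the word count are assumptions):

import random
from collections import Counter

wheel1 = ['a', 'b', 'aa', '', 'qo', 'abc']
wheel2 = ['a', 'b', 'aa', 'ab', 'aaa', 'abc']
wheel3 = ['a', 'b', 'aa', 'ab', 'in', 'abc']

def spin_word():
    # one uniform spin per wheel: start + middle + ending
    return random.choice(wheel1) + random.choice(wheel2) + random.choice(wheel3)

words = [spin_word() for _ in range(10_000)]
unique_lengths = Counter(len(w) for w in set(words))   # unique tokens per length
print(sorted(unique_lengths.items()))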
Ah, I understand.

I don't know nearly as much about statistics as most of the other guys here, but isn't Zipf's law specifically something that shows up even when no apparent "Zippifying" method underlies the data?
(06-05-2025, 09:01 AM)Koen G Wrote: Ah, I understand.

I don't know nearly as much about statistics as most of the other guys here, but isn't Zipf's law specifically something that shows up even when no apparent "Zippifying" method underlies the data?

Koen,

I'd phrase it somewhat differently, as there are processes that will generate pseudo-text whose vocabulary shows a Zipf's-Law-like frequency distribution, but yes. And per tikonen's comment [link] (and as Rugg pointed out regarding his grille method) there are ways of generating pseudo-text whose vocabulary has a binomial length distribution.
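
The classic demonstration here is Miller's "monkey at a typewriter" result: even uniformly random letters with a random space key give a roughly Zipfian rank-frequency curve. A quick sketch (the alphabet size and text length are arbitrary choices):

import random
from collections import Counter

random.seed(0)
alphabet = 'abcde '   # five letters plus space, all equally likely
text = ''.join(random.choice(alphabet) for _ in range(200_000))
freqs = Counter(text.split())

# for Zipf-like data, log(frequency) falls roughly linearly with log(rank)
ranked = sorted(freqs.values(), reverse=True)
for rank in (1, 2, 4, 8, 16, 32):
    print(rank, ranked[rank - 1])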

As for your earlier question in the thread, "why would they go through this effort?": without drilling down into details (if I had gotten my preliminary experiments to a publishable point I would have written them up more formally), in the context of verbose cipher theories:

* an obvious issue with a verbose cipher (and an obvious tell) is that the enciphered words are longer than the corresponding plaintext words;

* while a mechanism like rule-based insertion of spaces (e.g., always before EVA 'q' or after EVA "ain"/"aiin"/etc. sequences) helps, it doesn't help enough;

* so let's cut the Gordian knot and assume some probabilistic process, like the sum of multiple dice or coin flips, is used to break the text up;

* in that context, the potted history of frequency analysis you'll find in many textbooks is wrong: frequency analysis of dice throws goes back at least as far as Dante (so it is plausible for the early 15th century);

* combining (a) a verbose cipher that uses the obvious verbose Voynich glyph combinations as some of the elements mapped to (say) Latin by relative frequency ranking with (b) a sum-of-dice/coin-flips method for breaking the text up shows promise for replicating all of: (1) a binomial-like vocabulary word length distribution, (2) a Voynich-like type/token ratio and hapax fraction, and (3) a vocabulary showing Zipfian frequency behavior with a log-log slope fit similar to the Voynich text's (a toy sketch of this combination follows below).
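
A toy sketch of that combination (the glyph-group table below is invented for illustration, not a proposed decipherment, and 2d3 stands in for whatever dice/coin process one prefers):

import random

# hypothetical verbose table: each plaintext letter -> a multi-glyph group
table = {'a': 'aiin', 'e': 'ol', 'i': 'chey', 'o': 'dy', 'u': 'qok',
         't': 'ched', 's': 'or', 'r': 'ar', 'n': 'dar', 'l': 'shey'}

def encipher(plaintext):
    # verbose substitution first (unmapped letters are simply skipped here) ...
    stream = ''.join(table.get(c, '') for c in plaintext.lower())
    # ... then break the glyph stream into "words" with dice-sum chunk lengths
    words, i = [], 0
    while i < len(stream):
        step = random.randint(1, 3) + random.randint(1, 3)   # 2d3: lengths 2-6, peaked at 4
        words.append(stream[i:i + step])
        i += step
    return ' '.join(words)

print(encipher('latinosentencetotest'))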

Beyond some modifications to my preliminary experiments to try to improve the results, it remains an open question whether such generated text reproduces the distribution of distances between similar words seen in the Voynich text.