The Voynich Ninja

Full Version: An Essay on Entropy: what is it, and why is it so important?
Pages: 1 2 3 4 5
I think the entropy of text rich in Roman numerals would be extremely interesting. Surely there must be more books and letters dealing with accounting and financial transactions.

Another idea: have you tried using programming languages like C, Java, or Python for comparison?
Computer code has many Voynichese-like properties, such as "words" and glyph combinations specific to line and paragraph position, repetitions, and many similar expressions with slight differences.
(01-05-2022, 09:43 PM)Koen G Wrote: They would probably need to be transcribed first. RenegadeHealer, do you have a link to where you read about this?

Hi, Koen:

I'm going to jump in here with a link (and section of the blog post) for you.  Blog post, specifically § 2, the paragraph that starts "In terms of precision . . ." (note the coincidence of the section abbreviation) and the accompanying figure.

The example provided by Patrick may or may not be sufficiently long to test.  Since he owns the manuscript, maybe he would have an opinion about whether there are sufficient entries to be transcribed and analyzed.  Do you have an estimate of how long a text needs to be to get decent entropy stats?

Michelle
(04-05-2022, 04:29 PM)MichelleL11 Wrote: Do you have an estimate of how long a text needs to be to get decent entropy stats?

This depends.
For H1, which is generally not too interesting, 1000 characters already give an indication.
For H2, which is the quantity most looked at for the Voynich MS, at least 10,000 would be needed.
It really depends on the rarer bigrams, which in a short text occur just 0 or 1 times, while the true expectation is somewhere in between.

H3 is generally ignored, but this is where languages can be most easily separated, even different authors in the same language. However, it requires several hundred thousand characters to get a good estimate.
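For readers following along, these quantities can be estimated directly from character counts. Below is a minimal Python sketch (my own illustration, not nablator's Java code or any program actually used in this thread); here h2 and h3 are the conditional entropies, i.e. the bigram and trigram block entropies minus the lower-order term:

```python
# Sketch: estimating h0..h3 from raw text.
# h0 = log2(alphabet size), h1 = single-character entropy,
# h2 = H(bigram) - h1  (entropy of a character given its predecessor),
# h3 = H(trigram) - H(bigram).
import math
from collections import Counter

def shannon(counter):
    """Shannon entropy (bits) of an empirical distribution of counts."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def entropies(text):
    uni = Counter(text)
    big = Counter(zip(text, text[1:]))
    tri = Counter(zip(text, text[1:], text[2:]))
    h0 = math.log2(len(uni))
    h1 = shannon(uni)
    h2 = shannon(big) - h1
    h3 = shannon(tri) - shannon(big)
    return h0, h1, h2, h3
```

On a short text the rare bigrams show up 0 or 1 times, which is exactly the small-sample problem described above: the empirical counts understate or overstate their true probabilities, and the h2 estimate drifts accordingly.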
(04-05-2022, 04:29 PM)MichelleL11 Wrote: I'm going to jump in here with a link (and section of the blog post) for you. Blog post, specifically § 2, the paragraph that starts "In terms of precision . . ." (note the coincidence of the section abbreviation) and the accompanying figure.

The example provided by Patrick may or may not be sufficiently long to test.  Since he owns the manuscript, maybe he would have an opinion about whether there are sufficient entries to be transcribed and analyzed.

I'm away from home at the moment, but as I recall, that particular manuscript is around a dozen pages long (although it was partially burned in a fire at some point and so is missing most of several lines at the top of each page).  I don't think there's anything especially unusual about it, though, and I chose it as an example mainly because it seemed very ordinary.

That said, the point I was trying to make with it is a little different from the one now under discussion.  I was reflecting on my sense of the "precision" of the writing in the VM -- how carefully and distinctly the individual characters seem to have been written.  I'll admit that this is all very impressionistic on my part, and it's limited by the kinds of documents I'm used to looking at.  But the VM writing seems more precise than typical handwritten linguistic text of the period (where someone who didn't understand the language would have a very difficult time preparing a transcription) and less precise than typical ciphertexts (which -- from what little experience I have with them -- seem mostly to be written as sequences of fairly discrete, disconnected, individually formed glyphs).  It's in this sense that I thought the VM writing feels as though it has a similar rhythm to the contemporaneous writing of Roman numerals, somewhere between those two other extremes.

Of course that's not to say this couldn't reflect other similarities with Roman numerals besides, such as matters of entropy.

I'd be happy to transcribe the sums of money given in Roman numerals from that document (or others I have) if anyone wants to analyze them, but it might be just as easy to generate sums randomly and convert them into the same notation.  Basically:

__ L __ s __ d

Where the second slot will never exceed 19 (because 20s=1L) and the third slot will never exceed 11 (because 12d=1s); where final [i] is written [j]; and where each unit is optional (a given sum might consist solely of L, s, or d, or any combination thereof).
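The random-generation idea is easy to simulate. A hypothetical Python sketch of the scheme just described (the additive medieval numeral forms and the 0-99 pound range are my own assumptions, not Patrick's):

```python
# Generate random sums of money in medieval-style Roman numeral notation:
# additive forms only (so 4 = iiij), with a final i written as j,
# and each of the L / s / d units optional.
import random

def roman(n):
    """Additive (non-subtractive) lowercase Roman numeral, final i -> j."""
    vals = [(100, "c"), (50, "l"), (10, "x"), (5, "v"), (1, "i")]
    out = ""
    for v, sym in vals:
        out += sym * (n // v)
        n %= v
    if out.endswith("i"):          # iij, viij, xiiij, ...
        out = out[:-1] + "j"
    return out

def random_sum():
    pounds = random.randint(0, 99)
    shillings = random.randint(0, 19)   # 20s = 1L
    pence = random.randint(0, 11)       # 12d = 1s
    parts = []
    if pounds:
        parts.append(roman(pounds) + " L")
    if shillings:
        parts.append(roman(shillings) + " s")
    if pence:
        parts.append(roman(pence) + " d")
    return " ".join(parts) if parts else "nil"
```

Concatenating a few thousand such sums would give a corpus of "accounting text" whose entropy could be compared against the VM figures discussed above.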
I noticed this when doing some testing and thought this thread would be as good a place as any to put it.

'Green Eggs and Ham' by Dr. Seuss, checked with nablator's Java code.

h0 = 4.459431618637297   ::   h1 = 3.8063371795355936   ::   h2 = 2.283586994311863   ::   h1-h2 = 1.5227501852237304

Well within VMS range according to the results in the paper by D. Stallings (though the file size is a bit short -- 3238 chars).
Similar to the comment above about "Green Eggs and Ham", I wondered about the small words and the apparent artificiality of the character system, and thought it would be interesting to compare with a modern equivalent, Toki Pona. I ran an analysis of an intermediate level text written in Toki Pona and got the following:

This used Toki Pona's alphabetic writing system rather than the glyph system. (Both easier to analyze, and more likely to be relevant for comparison to Voynichese.) The result compares quite well with Voynichese:

Character Entropy (h1): 3.483557
Character Combo Entropy: 5.566203
Delta (h2): 2.082646

An interesting point about the graphics I have seen showing the probability of each letter combo is that the vowels in alphabetic systems really stand out, because there tend to be fewer of them, and they pair well against the more common consonants. This is especially true in my analysis of my sample Toki Pona text, but similar patterns in the diagrams created from Voynich text only show lines popping about as much as vowels in Latin/English/German, etc.

Looking at some of the graphics on this topic created around Voynich text, it looks like EVA o, e, y, a, s, and g are decent candidates. I'm inclined to eliminate e, because it matches a lot of consonant patterns. a and o are really the strongest candidates.

It's worth noting that some consonants that pair well with other consonants will sometimes stand out too, inconsistently across languages. These can generally be distinguished because they tend to pair with both vowels AND consonants. (Hence my assertion that I would be inclined to eliminate e, due to it matching a lot of consonant patterns.)

Anyhow, I would say that the analysis suggests candidate vowels, and that e is likely a consonant that likes to pair with other consonants as the first letter. (Analogous to the pattern I see for n in English, but I wouldn't go so far as to propose that EVA e is English n.)

I would also say that as a low-information-density writing system, the analysis is further consistent with the language itself being artificial, or perhaps, an artificially constrained application of a contemporary (to early 1400s) language.
(02-12-2024, 06:33 PM)seanmcox Wrote: The result compares quite well with Voynichese:

A reduced alphabet (14 letters), a reduced vocabulary (almost like Green Eggs and Ham) and strict CV alternation certainly help, but a short sample also helps... Where did you find a long enough sample of Toki Pona text for h2?


Edit:
Just found this Toki Pona Bible project.

It's not pure Toki Pona because there are many names, but they have been modified to look like proper Toki Pona.

I fixed a few things (not really needed, because there are few mistakes): one Betlehem changed to Petelen, one Juda to Juta, one Jerusalen to Jelusalen. I removed all punctuation, Arabic numerals, and a few English words (choose, God, sent for, angel), changed #YHWH to Jawe, then converted everything to lowercase.

Final concatenated file length: 130 KB (only a few books have been translated, and often only partially)

h0 = 3.907 = log2(15) : 14 letters and space
h1 = 3.501
h2 = 2.345


(04-05-2022, 05:44 PM)ReneZ Wrote: For H2, which is the quantity most looked at for the Voynich MS, at least 10,000 would be needed.

First 10,000:
h1 = 3.482
h2 = 2.294

First 20,000:
h1 = 3.481
h2 = 2.309

First 30,000:
h1 = 3.486
h2 = 2.334

First 40,000:
h1 = 3.489
h2 = 2.337

First 50,000:
h1 = 3.492
h2 = 2.342
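The prefix runs above are straightforward to reproduce. A self-contained Python sketch (my own illustration, not the actual script used for these numbers):

```python
# Recompute h1 and h2 on growing prefixes of a text to watch the
# estimates converge as more characters become available.
import math
from collections import Counter

def h1_h2(text):
    def H(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log2(c / total) for c in counter.values())
    h1 = H(Counter(text))
    h2 = H(Counter(zip(text, text[1:]))) - h1  # conditional entropy H(X2|X1)
    return h1, h2

def convergence(text, step=10_000):
    for end in range(step, len(text) + 1, step):
        h1, h2 = h1_h2(text[:end])
        print(f"First {end}: h1 = {h1:.3f}, h2 = {h2:.3f}")
```

As the numbers above show, h1 stabilizes almost immediately while h2 keeps creeping upward, since longer prefixes keep surfacing new rare bigrams.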
It looks to me like the biggest difference is in the treatment of space characters (i.e., word boundaries). I was interested in creating diagrams similar to the ones in the linked thread.

My identification of possible vowels is based on the figures there, as I don't yet have a long enough text, normalized in Voynichese, to do my own analysis (I've done shorter analyses, but I can only reproduce the most obvious patterns; as you say, some substantial text is needed). I'm working on it, though: going through some EVA text others have created and trying to figure out how to reasonably handle the discrepancies and uncertainties. I'm hoping to test the plausibility of some hypotheses about potential abbreviations.

When I update my handling of spaces to match what you are describing, I get the same results you are describing. (I strip everything but spaces and alphabetic characters, and I normalize text to lower case.)
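That normalization can be sketched in a few lines of Python (an illustrative equivalent, not the exact code used):

```python
# Keep only alphabetic characters and spaces, lowercase everything,
# and collapse runs of whitespace into single spaces.
import re

def normalize(text):
    text = text.lower()
    text = re.sub(r"[^a-z ]+", " ", text)       # drop punctuation, digits, etc.
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace
```

Note that replacing stripped characters with a space (rather than deleting them outright) preserves word boundaries, which matters for any statistic that treats the space as a character.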

If this was the procedure followed in the post at the top of this thread, then this is more similar to the non-EVA Voynich results. (I tend to think that's more plausible as the intended character system.)

For Toki Pona I am using a concatenation of intermediate-level Toki Pona texts from a linked collection.

Considering how far these results are from the norm, and how much more similar they are to Voynichese, I was thinking it might be interesting to try a pidgin to see how that changes things. After all, while the broad statistics of Toki Pona look promising, the resulting diagram I get has a very different feel to it than what I get looking at English or Latin. I haven't really sat down to think through why the letter-combination frequencies, when visualized, might look so boxy and compartmentalized in Toki Pona, but Voynichese doesn't. It looks much more like a natural language from this perspective. (If I've been unclear, the diagrams I am talking about are the grids showing leading letters, and following letters, along with a color on the grid indicating how frequently they appear together.)
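The grids I mean can be approximated in plain text; a minimal Python sketch that prints pair counts instead of colors (illustrative only, not my actual tooling):

```python
# Build a leading-letter x following-letter grid of bigram counts,
# a text-mode stand-in for the color-coded frequency diagrams.
from collections import Counter

def bigram_grid(text):
    counts = Counter(zip(text, text[1:]))
    alphabet = sorted(set(text))
    rows = ["  " + " ".join(alphabet)]               # header: following letters
    for a in alphabet:                               # one row per leading letter
        row = [a] + [str(counts.get((a, b), 0)) for b in alphabet]
        rows.append(" ".join(row))
    return "\n".join(rows)
```

In such a grid, a vowel-like character shows up as a dense row and column cutting across many consonants, while the "boxy" Toki Pona pattern appears as blocks of allowed and forbidden pairs.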