The Voynich Ninja

Full Version: An Essay on Entropy: what is it, and why is it so important?
(26-04-2022, 05:40 PM)davidjackson Wrote: I would suggest that in your tea example, the Voynich is drinking coffee  Big Grin

I'm more of a coffee guy anyways. Wink

Searcher: this is a good question. If the same words are repeated over and over, this could cause certain glyph combinations (which are common in those words) to become prevalent. One thing I can say is that over large windows of text, for example the whole of Herbal A or Q20, Voynichese's type-token ratio (TTR) is normal. This means that the variation in its vocabulary is well within expected bounds.

We can actually test the impact of duplicate words: take a VM text, remove all duplicate words so that one token per type remains, and compare entropy values before and after. (This gives it better odds than normal, since a TTR of 1 is unusual.)

I did this quickly with unmodified EVA; that's all I have time for right now. The results are stronger than I expected -- though do keep in mind that this is an unusual situation where each token is a new type. And it is still below the "magical limit" of h2=3.

[attachment=6461]
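For anyone who wants to replicate the experiment, here is a minimal sketch in Python. The token list is made up for illustration, not real VM data; h2 is computed as the conditional entropy of the next character given the previous one:

```python
from collections import Counter
from math import log2

def entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of a frequency table."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def h1_h2(text: str) -> tuple[float, float]:
    """h1 = unigram entropy; h2 = conditional entropy of the next
    character given the previous one (H(bigram) - H(conditioning char))."""
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter()
    for (a, _), c in bigrams.items():
        firsts[a] += c
    return entropy(Counter(text)), entropy(bigrams) - entropy(firsts)

def dedup_tokens(tokens: list[str]) -> list[str]:
    """Keep only the first occurrence of each word type, so TTR becomes 1."""
    seen: set[str] = set()
    return [t for t in tokens if not (t in seen or seen.add(t))]

# Made-up tokens standing in for a real transcription
tokens = "daiin ol chedy daiin chedy ol daiin qokeedy".split()
h_before = h1_h2(".".join(tokens))             # '.' marks word breaks
h_after = h1_h2(".".join(dedup_tokens(tokens)))
```

On a real transcription you would read the tokens from an EVA file instead of the toy list.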
I have analyzed the similarity relations between words of the VMS. To do so, each word can be represented as a node in a graph. Starting with the most frequent token, one can recursively search for other words differing by just a single glyph and connect each new node with an edge.

On GitHub it is possible to find the resulting network graphs for individual [link], [link], [link].

The main network connects 6796 out of 8026 word types (84.67%). The longest path within this network has a length of 21 steps, substantiating its surprisingly high connectivity. Another feature is that high-frequency tokens also tend to have high numbers of similar words, whereas isolated words (i.e. unconnected nodes in the graph) usually appear just once in the entire VMS (see [link]).
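A minimal sketch of the graph construction, assuming "differing by a single glyph" means edit distance 1 (one substitution, insertion, or deletion); the naive all-pairs check is O(types²), which is still workable for ~8000 types:

```python
from collections import Counter, deque

def differ_by_one(a: str, b: str) -> bool:
    """True if a and b are one substitution, insertion, or deletion apart."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):  # one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # one insertion: deleting some position of long_ yields short
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def similarity_network(tokens: list[str]) -> dict[str, set[str]]:
    """BFS from the most frequent type, linking words one glyph apart."""
    freq = Counter(tokens)
    types = set(freq)
    start = freq.most_common(1)[0][0]
    graph: dict[str, set[str]] = {start: set()}
    queue = deque([start])
    while queue:
        w = queue.popleft()
        for other in types:
            if differ_by_one(w, other):
                graph[w].add(other)
                if other not in graph:
                    graph[other] = {w}
                    queue.append(other)
                else:
                    graph[other].add(w)
    return graph
```

Word types never reached by the search remain outside the graph, matching the "isolated words" above.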

Consequently, high-frequency tokens share their glyph order with many other frequently used tokens. For instance, the most frequently used word in the VMS is "daiin". Overall, the glyph following "d" is "a" in 31.4% of cases (and "y" in 52.8%). In Currier A, however, "d" is followed by "a" in 48.3% of cases and by "y" in only 25.4%, whereas in Currier B "d" is followed by "y" in 64.4% of cases and by "a" in only 23.9%. This happens because "chedy" is the most frequently used word in Currier B but an exception in Currier A. Consequently, words similar to "chedy", especially words ending in "edy", occur far more frequently in Currier B than in Currier A.
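Those conditional percentages are easy to recompute; a sketch (the two mini-samples below are invented stand-ins, not real Currier A/B text):

```python
from collections import Counter

def follower_distribution(tokens: list[str], glyph: str) -> dict[str, float]:
    """Relative frequency of each glyph that directly follows `glyph`
    within words (word-internal bigrams only)."""
    followers = Counter()
    for word in tokens:
        for a, b in zip(word, word[1:]):
            if a == glyph:
                followers[b] += 1
    total = sum(followers.values())
    return {g: c / total for g, c in followers.items()}

# Hypothetical mini-samples standing in for Currier A and Currier B text
currier_a = "daiin chody daiin dar dam".split()
currier_b = "chedy qokeedy chedy dy dal".split()
```

Running it on full transliterations of the two Currier languages should reproduce the a/y split described above.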
(26-04-2022, 02:48 PM)Koen G Wrote: As Rene also noticed, however, the "rewriting n-grams" method has a significant drawback: it makes words really short. As you can see in the "Voynich- Vvovyvnvivcvhv" example above, verbose encoding has the effect of lengthening words, and Voynichese words aren't excessively long to begin with.

I've been beating this drum for the better part of two decades, but for a variety of reasons (probably including lack of formal publication) I'm not sure I've gotten much traction -- the existence of pairs of glyphs that occur straddling spaces with high frequency but occur within "words" with very low frequency and/or very limited contexts is pretty compelling evidence that a huge chunk of spaces were inserted mechanically according to some set of rules. (More precisely, it is evidence consistent with that theory, although I would argue that any theory about the text needs to address the general issue [and specific behavior I discuss below regarding Currier '9']). While "with very low frequency" may seem like weasel words, the issues of both scribal and transcriber error need to be kept in mind.

As an example (useful because I shouldn't have to translate Currier '9' and '4' into other transcription alphabets, plus it covers a huge fraction of spaces in both Herbal A and Bio B), look at the one piece of seriousness in my Apr. 1 posting ([link]):

"Looking at the Herbal A 'language' pages in the D’Imperio transcription, 67% of the time a ‘9’ is followed by a space (1097 occurrences); 11% of the time it is followed by the end of a line (203 occurrences). It is *never* directly followed by ‘4’ without an intervening space -- the only glyphs that follow it within a "word" more than a single-digit number of times are ‘F’ (103), ‘P’ (92), ‘8’ (52), ‘S’ (51), and ‘B’ (16) (‘Z’ just misses the cutoff at 8 occurrences)."

*22%* of spaces in BioB are straddled by '9' and '4' -- 1266 occurrences -- with just 10 occurrences of "94" inside a word. While "9 4" specifically is less common in Herbal A (just 4% of spaces), '9' is still the character before a space 32% of the time and the line/paragraph final character 34% of the time....
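The counting itself is trivial to reproduce on any transcription; a sketch (the toy lines below are invented, with words in the Currier alphabet and plain spaces as word breaks):

```python
from collections import Counter

def space_straddle_stats(lines: list[str], left: str, right: str):
    """Count how often `left` + space + `right` straddles a word break,
    versus `left` + `right` occurring inside a word, plus total spaces."""
    straddling = internal = spaces = 0
    for line in lines:
        tokens = line.split()
        spaces += max(len(tokens) - 1, 0)
        for t1, t2 in zip(tokens, tokens[1:]):
            if t1.endswith(left) and t2.startswith(right):
                straddling += 1
        for t in tokens:
            internal += t.count(left + right)
    return straddling, internal, spaces

# Invented Currier-style lines, just to exercise the counter
lines = ["SC9 4OFAM ZC9 4OFAN", "8AM 4OE"]
```

Dividing `straddling` by `spaces` gives the percentage-of-spaces figures quoted above.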

I cringe every time I read a paper that takes for granted that spaces are word separators without explicitly foregrounding that assumption. I cringe even more when I read a paper that throws Herbal A and Bio B folios into the same statistical meat grinder without recognizing/acknowledging the magnitude of the difference in their statistics.

And, yeah, I know, I should be more up to date in the transcription I use for casual analysis, especially since I have scripts to convert both EVA and V101 into Currier, although in this specific case there are bigger issues with transcription problems. Comparing v101 and EVA (put into the common transcription alphabet of your choice) on f1r-f57r & f99r-f102v1 -- purely because that happened to be a chunk of folios I was looking at for another reason, and recognizing there may be differences between sections and/or scribes:

Looking at the 9753 places where either transcriber saw a full space:
* 94% of the time, both saw a full space
* 4.5% of the time, one saw a half space
* 1.5% of the time, one saw no space

Looking at the 1178 places where either transcriber saw a half space:
* 17% of the time, both saw a half space
* 38% of the time, one saw a full space
* 45% of the time, one saw no space

In addition, v101 is more likely to have a full space rather than a half space, or a full or half space rather than no space, than the other way around. So...consistency of transcriber judgements regarding where spaces are and aren't isn't necessarily great. To paraphrase an old saying about clocks, a Voynich researcher who has one transcription knows what it says; a Voynich researcher who has multiple transcriptions is never sure.
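If anyone wants to tally agreement numbers like those for themselves, here is a minimal sketch. It assumes the two transcriptions have already been aligned character-for-character, with '/' marking a full space, ',' a half space, and '.' as a made-up no-space filler for this example:

```python
from collections import Counter

SPACE_MARKS = {"/": "full", ",": "half"}

def space_agreement(a: str, b: str) -> Counter:
    """For two aligned transcription strings of equal length, tally how the
    transcribers classify each position where either saw a space."""
    assert len(a) == len(b), "inputs must be pre-aligned"
    tally = Counter()
    for x, y in zip(a, b):
        if x in SPACE_MARKS or y in SPACE_MARKS:
            kx = SPACE_MARKS.get(x, "none")
            ky = SPACE_MARKS.get(y, "none")
            tally[tuple(sorted((kx, ky)))] += 1
    return tally
```

The hard part in practice is the alignment itself, not the tally; the real comparison also has to cope with glyph-level disagreements shifting positions.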

By the way, I am *very* disappointed my Blackadder theory didn't get a higher profile :-(. Where were *my* 15 minutes of fame? Where were the initially credulous articles in Wired UK and Ars Technica, followed by walk backs where they quote Lisa Fagin Davis or Kevin Knight saying something along the lines of, "What do I think? Um...Let me put it this way -- have you ever driven through farm country during a heat wave on a muggy, sunny day..."?
(26-04-2022, 10:09 PM)kckluge Wrote: have you ever driven through farm country during a heat wave on a muggy, sunny day..."?

Being in Thailand during the hot season, either side of the 100 (ugh) deg.F....
I know what you are saying.

Please post more, Karl!
(26-04-2022, 03:59 PM)Ruby Novacna Wrote: I was naive enough to believe that statistics in general and entropy in particular were means, among others, to achieve a purpose, which would be common to all Voynich fans: understanding the text.

It's not an issue of ends/purpose, it's an issue of methodology. Quoting a comment I made recently on Voynich Revisionist (and making clear here, as I did there, that I am not singling out the theory in question to pick on it, I am merely using it as an example):

"Darius claims that the underlying plaintext is in a 12th cent. CE dialect of Persian ([link]) which has been converted from the normal script used for Persian to Voynich glyphs in accordance with the table at [link] and then obfuscated with nulls that can be stripped out using the rules at [link].

"So…if you take some set of 12th cent. Persian texts and map them to Voynich glyphs using his theory, how do the entropy stats of those texts compare to the entropy stats of the Voynich text after you strip out the nulls? How similar are the glyph frequency distributions of the de-nulled Voynich sample and the converted-to-Voynich-glyphs Persian text samples? For that matter, where do the h1 and h2 values of the de-nulled Voynich text using Darius’ transcription scheme fall relative to known natural languages period? Those are questions that have nothing whatsoever to do with my (or Rene’s or Glen’s or anyone else’s) interpretation of the glyphs or any existing transcription of the mss. They are purely questions about what happens when you apply an empirical test _to his theory using his assumptions_."

Someone might object, "Why bother to do that? Why not just proceed with the work of extracting Persian plaintext from the manuscript?" One answer is that if the theory can't pass that test, then the history of proposed decipherments/translations of MS 408 suggests very strongly that the probability is close to zero that the project of extracting substantial amounts of continuous, coherent Persian plaintext will succeed. Another answer is that if you do that, analyzing the discrepancies might productively help identify problems with the theory and suggest potential fixes to those problems.

We're talking about entropy here because the thread is specifically talking about entropy, but there are a variety of statistical properties and structural characteristics of the MS 408 text that ought to arise naturally from any given theory for it to be credible -- and if they don't, the odds that the deciphered/translated text produced by that theory will turn out to be plausible from a linguistic perspective are fairly slender.
(27-04-2022, 12:44 AM)ReneZ Wrote: Please post more, Karl!

From a priorities perspective, I shouldn't...but I probably will...but I really shouldn't...

A couple quick (and thread-relevant) things:

* With regard to the whole are "words" _words_ issue -- I've seen lots of papers talk about Voynich "words" and Zipf's (word frequency vs word frequency rank) Law. There is a different Zipf's Law relating word length to word frequency rank -- more frequent words are also shorter (a ref for anyone wanting to pull on that thread is [link]) -- has anyone ever fit that to the Voynich vocabulary? Based on the 32 most-common BioB "words" from D'Imperio's transcription:

 33 OFCC89     33 ZCC89      34 SCC9      35 2AM
 35 SX9        40 4OFC9      40 4OPCC89   41 OFAN
 43 8AN        44 89         44 OFC89     46 4OPC89
 47 4OFAR      50 OPC89      53 8AE       56 8AR
 56 ESC89      57 OR         58 4OF9      75 SC9
 77 8AM        78 ZC9        85 4OFAM     85 4OFCC9
 87 4OE       106 4OFAE     150 4OFAN    152 4OFCC89
161 4OFC89    178 OE        194 SC89     213 ZC89


I'm not sure how well that will fit. Recognizing that word length is a function of the transcription alphabet, it would be an interesting thing to check if it hasn't been done already.
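As a quick sanity check of the sort I have in mind, one could rank the table above by frequency and look for a positive correlation between rank and word length (Zipf's length law predicts more frequent = shorter). A sketch, using only the six most frequent entries from the table:

```python
def length_vs_rank(freq_words: list[tuple[int, str]]) -> list[tuple[int, int]]:
    """Sort (count, word) pairs by descending count; return (rank, length) pairs."""
    ranked = sorted(freq_words, key=lambda p: -p[0])
    return [(rank + 1, len(word)) for rank, (_, word) in enumerate(ranked)]

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Six most frequent entries from the D'Imperio BioB table above
top6 = [(213, "ZC89"), (194, "SC89"), (178, "OE"),
        (161, "4OFC89"), (152, "4OFCC89"), (150, "4OFAN")]
```

With only 32 "words" this is suggestive at best; a proper fit would use the whole vocabulary and a rank-based correlation.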

* With regard to verbose ciphers, the way Koen put things in his thread starter ("A verbose cipher is basically a cipher that obfuscates by adding unnecessary stuff. In a very simple example, I could verbosely obfuscate the word "Voynich" by adding a v after every letter: "Vvovyvnvivcvhv". If Voynichese is the result of a verbose cipher, I could try to reverse this by rewriting common glyph clusters (bigrams, trigrams) as single glyphs. For example, I could replace "dy" by "&" and run the entropy test again to see what changed.") kind of half brings out a distinction between two different types of verbose cipher theories:

In the first type, like with his "Vvovyvnvivcvhv" example, nulls are being inserted in a structured way. In this type, it might be the case that some Currier P's have O's in front of them not because OP represents a different plaintext letter than P, but because there is some rule that puts O's in as nulls before some P's. Without "showing my work", if you take Herbal A and:

replace "4O" with 'o'
replace "CC" with 'c'
then get rid of A's before [ERNM], O's before [ERFP], and C's before [89]

you also wind up with a transformed text whose h1 and h2 move closer to natural-language values. Having looked for common strings shared between pages, I don't know that it's a very promising possibility, but it is one.
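The three-step recipe above is mechanical enough to sketch in a few lines of Python over a Currier-alphabet string (the lowercase 'o'/'c' contraction symbols and function name are just my choices for this sketch):

```python
import re

def strip_hypothesised_nulls(currier_text: str) -> str:
    """Apply the Herbal A transformation sketched above: contract 4O and CC,
    then drop hypothesised null A/O/C glyphs before certain followers."""
    t = currier_text.replace("4O", "o").replace("CC", "c")
    t = re.sub(r"A(?=[ERNM])", "", t)   # drop A before E, R, N, M
    t = re.sub(r"O(?=[ERFP])", "", t)   # drop O before E, R, F, P
    t = re.sub(r"C(?=[89])", "", t)     # drop C before 8, 9
    return t
```

Running the earlier h1/h2 calculation on the output of this function before and after is how the "getting more natural language-like" claim can be checked.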

The second type is where some plaintext letters map to a single glyph, while others map to 2- (or 3- or more-) glyph combos -- so plaintext 'C' might map to Currier "OP", while plaintext 'K' might map to Currier "P". A straddling checkerboard is a (from the point of view of the C-14 date somewhat anachronistic) cipher of this type ([link] shows examples). The problem this raises is that if you have "CLOCK" as a plaintext word, then the "CK" at the end would result in "OPP" in the Voynich text given the hypothetical mappings -- and except for Currier 'C', repeated glyphs are vanishingly rare -- out of 35483 digrams in (D'Imperio's) BioB, there are 15 that are doubled glyphs other than "CC". So either there just miraculously happens to be a natural language whose letter contact stats allow a mapping that avoids that problem, or there is some mechanism in the cipher that hides such doubled glyphs -- perhaps the difference between Currier "OF" and "OP" isn't that they map to different plaintext letters; perhaps "OF" is what you write where "OPP" would otherwise occur. Or maybe the verbose cipher theory is just fundamentally wrong...
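The doubled-glyph count is another statistic anyone can recompute; a sketch (the toy token list is invented):

```python
from collections import Counter

def doubled_digrams(tokens: list[str]) -> Counter:
    """Count word-internal digrams consisting of the same glyph twice."""
    doubles = Counter()
    for word in tokens:
        for a, b in zip(word, word[1:]):
            if a == b:
                doubles[a + b] += 1
    return doubles
```

On real BioB text, everything other than "CC" should come out vanishingly small, per the figures above.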
(27-04-2022, 12:44 AM)kckluge Wrote: if you take some set of 12th cent. Persian texts and map them to Voynich glyphs using his theory
Why should you map Persian texts into Voynich glyphs? Can't you calculate their entropy directly?

(27-04-2022, 12:44 AM)kckluge Wrote: it's an issue of methodology
Speaking of methodology, the translation you mention, which I have just looked at, has not matured enough to be subject to testing: I have not even found matches between the transcription (in EVA, for example) and the proposed translation, unless such a page exists and I have not seen it.
Karl uses the 'good old' Currier alphabet.
(26-04-2022, 03:29 PM)LisaFaginDavis Wrote: I have also noticed that there are errors in the standard EVA transcription that we all depend on, likely because that was originally based on lower-resolution images. I am beginning a very long-term project of proof-reading the transcription to make corrections and to resolve many of the uncertain readings. Might take as long as a year, but I think it will be well worth the effort. I have made ten corrections on f. 1r alone, so I suspect there will be well over a thousand changes to be made, which may very well impact computational and linguistic analyses.

Let me add my kudos to Koen's. If there had been more room, I'm sure the Sermon on the Mount would have included "Blessed are the transcribers, for they shall be shown the solution."

You probably have more and better tools for comparing different transcriptions, but if you don't I have Unix scripts/code for doing it. If (and it's a big if) the places you're seeing errors turn out to be a subset of the places where v101 and LZ disagree, that could save you an enormous amount of time.

For the running text part (maybe not labels or things like the Zodiac folios, depending on how much work Rene has put into massaging the v101 transcription), it's a fairly quick process to generate the comparison. As an example of the output, here are the first few lines of f1r:

Top: fachys/y,kal/ar/otaiin/shol/shory/  ~ses/y/kor/sholdy -
Bot: fachys/y kal/ar/ataiin/shol/shory/cthres/y/kor/sholdy*-
Mrg: fachys/y.kal/ar/+taiin/shol/shory/++++es/y/kor/sholdy*-

Top: sory/ckhar/or,y/kaer/chtaiin/shar/ois/cthar/cthar/dan-
Bot: sory/ckhar/or,y/kair/chtaiin/shar/ase/cthar/cthar,dan-
Mrg: sory/ckhar/or,y/ka+r/chtaiin/shar/+++/cthar/cthar_dan-

Top: syaiir/sheky/or/ykaiin/shod/ctho,ary/cthes/dar,aiin/sy-
Bot: syaiir/sheky/or/ykaiin/shod/ctho ary/cthes/dar aiin/sy-
Mrg: syaiir/sheky/or/ykaiin/shod/ctho.ary/cthes/dar.aiin/sy-

[...]

In this case, "Top" is v101 converted to EVA via Currier; "Bot" is LZ converted to Currier (Basic EVA only, first of multiple readings chosen) and back to EVA. "Mrg" is the min edit distance combination of the two -- '*' is something written off as an untranscribeable "weirdo" in the underlying transcription, '~' is something with no Currier equivalent in the underlying transcription, '+' is a glyph disagreement/insertion/deletion, and there are characters corresponding to different combinations of how the two transcriptions see spaces in a given location. I have script tools to pull out the merged transcription and add folio/line number references back in.
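For anyone curious, the merge step is a standard minimum-edit-distance alignment. A stripped-down sketch (it only emits '+' for disagreements/insertions/deletions; the '*', '~', and space-combination handling in my actual scripts is omitted):

```python
def merge_lines(top: str, bot: str) -> str:
    """Align two transcription lines by minimum edit distance; agreed
    characters pass through, any disagreement becomes '+'."""
    n, m = len(top), len(bot)
    # dp[i][j] = edit distance between top[:i] and bot[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if top[i - 1] == bot[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    # backtrack, emitting the agreed glyph or '+' per aligned position
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if top[i - 1] == bot[j - 1] else 1):
            out.append(top[i - 1] if top[i - 1] == bot[j - 1] else "+")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out.append("+")   # deletion relative to top
            i -= 1
        else:
            out.append("+")   # insertion relative to top
            j -= 1
    return "".join(reversed(out))
```

Applied to the f1r examples above, this reproduces merges like "ka+r" from "kaer" vs "kair".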

If this would be useful, I would be happy to create the merged version of the running text lines and send it to you.
(26-04-2022, 04:59 PM)bi3mw Wrote: A verbose cipher can also be created with a procedure that does not necessarily create longer words

Would binary code be an example of something similar? All "words" would be eight characters long. If that is too long, you could also split them and mark the ending. For example, 10101010 would become 1010 10102. As long as there is a marker, you could add spaces at will.
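A sketch of that idea, under the made-up scheme described above (8-bit codes per letter, split into 4-character chunks, with '2' marking a word's final chunk so spaces carry no information):

```python
def encode_fixed(word: str, chunk: int = 4, marker: str = "2") -> str:
    """Encode a word's letters as 8-bit codes, split the bit stream into
    fixed-size chunks, and append a marker glyph to the final chunk."""
    bits = "".join(format(ord(ch), "08b") for ch in word)
    chunks = [bits[i:i + chunk] for i in range(0, len(bits), chunk)]
    chunks[-1] += marker          # word boundary is carried by the marker
    return " ".join(chunks)
```

Because the word boundary lives in the marker rather than the spacing, a decoder could rejoin chunks until it sees a '2' regardless of where spaces fall.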