The Voynich Ninja

Full Version: Word Entropy
(20-09-2019, 11:59 AM)MarcoP Wrote: ...Timm's generated text will again be close to Q13. The higher TTR and entropy of the whole VMS are likely due to the fact that it is made of different sections, with apparently different subjects and "languages" (in Currier's sense), while Timm's text is uniform (it was created by a single execution of his software).

Ah, a hypothesis! :)
I already had some files cut down to 6500 words so I used this as a base. I included a Spanish text, an English text and Pliny, making sure that all of them looked normal without too much weird stuff going on. Then I added the first 6500 words of Herbal A, Q13, Q20 and Timm's text.

[attachment=3337]

[continued below]
(I accidentally pressed post instead of preview)

The top line is shuffled, the bottom is the original. Quire 13 is indeed close to Timm's text, but the unshuffled version has a much lower % of max h2. This may be logical, because of the many patterns in Q13.

English and Spanish behave entirely as expected.

Herbal A has a very small difference between shuffled and original, which is not so good.
For Q20 the difference is larger, but it is close to Pliny with a much lower TTR. This is also not good.


The difference between shuffled and original:

[attachment=3338]
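For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. The file name and tokenization are placeholders, and conditional word entropy is estimated here as H(word bigrams) - H(words); the normalization behind "% of max h2" may differ from whatever was used for the plot.

import math, random
from collections import Counter

def H(items):
    # Shannon entropy (bits) of the frequency distribution of 'items'
    counts = Counter(items)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def h2(tokens):
    # conditional word entropy, estimated as H(word bigrams) - H(words)
    return H(list(zip(tokens, tokens[1:]))) - H(tokens)

tokens = open("herbal_a_words.txt").read().split()[:6500]   # placeholder file name
shuffled = tokens[:]
random.shuffle(shuffled)

print("TTR:", len(set(tokens)) / len(tokens))   # type-token ratio
print("h2 original:", h2(tokens))
print("h2 shuffled:", h2(shuffled))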
Thank you, Koen.
The differences between the various sections are so large that it is difficult to make a comparison with other languages. Considering the measures in your last graph, the overall distance between Timm's results and Q13 (or Herbal A) seems smaller than the difference between Q13 and Q20.
From the Bowern and Lindemann papers, I understand that word stats in the VMS appear to be closer to written languages than character stats. I recently had another look at what Koen did with word entropy.
Using his corpus, I successfully produced a graph similar to the one he posted (word entropy vs conditional word entropy). A minor difference from what he did is that I considered the first 10k words for each file (Koen used the first 5k). The green diamonds are Voynich samples.

[attachment=5067]

I then observed that plotting word-entropy vs bigram (biword?) entropy isolates the Voynich samples even more. For a given biword-entropy value, Voynich samples have lower word entropy than language files from Koen's corpus. For instance, compare the Takahashi transcription (04.TT) with the Latin Aelredus file:
04.TT_ivtf_v0a H1: 9.73244 H2: 13.05730
Lat_ Aelredus H1: 10.22625 H2: 13.04832


The values for H2 are nearly identical, but H1 is much lower for the VMS (plot "zoomed" on the relevant interval).

[attachment=5069]
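For reference, H1 and H2 can be computed directly from word and biword counts. A minimal sketch, assuming plain whitespace tokenization, base-2 entropy and the same 10k-word cut (the file name is a placeholder, so the numbers will not exactly match any specific transcription):

import math
from collections import Counter

def entropy(counts):
    # Shannon entropy (bits) of a frequency table
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

words = open("lat_aelredus.txt").read().split()[:10000]   # placeholder file name

h1 = entropy(Counter(words))                     # word (unigram) entropy
h2 = entropy(Counter(zip(words, words[1:])))     # biword (bigram) entropy
print("H1:", round(h1, 5), "H2:", round(h2, 5))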

How can one transform the Latin file so that biword entropy is basically unaffected while word entropy drops to values closer to the VMS? A possible way to achieve this is a many-to-one mapping of some word types, so that the lexicon is reduced (hence word entropy decreases) but most biwords are still distinct (e.g. because only one of the two words in a biword is typically affected).
I ran several experiments and something that appears to work decently in this case is this regex transformation:

sed -e 's/\([a-z]\)[a-z][a-z][a-z][a-z][a-z][a-z]*\([a-z][a-z]\)/\1q\2/'

The transformation only affects words that are at least 8 characters long. The first character and the last two characters of long words are preserved, while the "core" of the word is replaced by an arbitrary fixed character ('q' in this case).

These are the first 20 words in the original file:
caput primum libri hujus scribendi occasio cum adhuc puer essem in scholis et sociorum meorum me gratia plurimum delectaret et

And the transformed version:
caput primum libri hujus sqdi occasio cum adhuc puer essem in scholis et squm meorum me gratia pqum dqet et

For instance, in the text, words like 'delectaret', 'dulcesceret', 'disputaret' and 'displicet' are all mapped to 'dqet'.
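To see how much the lexicon shrinks while the biwords mostly survive, one can count distinct word types and biword types before and after the substitution. A minimal sketch (the file name is a placeholder; unlike the sed command, which rewrites only the first match per line, this version is applied word by word):

import re

def compress(word):
    # words of 8+ letters keep the first letter and the last two, with 'q' in between
    return re.sub(r'^([a-z])[a-z]{5,}([a-z][a-z])$', r'\1q\2', word)

words = open("lat_aelredus.txt").read().lower().split()[:10000]   # placeholder file name
mapped = [compress(w) for w in words]

print("word types before/after:  ", len(set(words)), len(set(mapped)))
print("biword types before/after:", len(set(zip(words, words[1:]))),
      len(set(zip(mapped, mapped[1:]))))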

As the plot shows, values for the transformed Aelredus file are quite close to the curve joining the different VMS samples. These are the measures for word-entropy and biword entropy on the transformed file:
04.TT_ivtf_v0a H1: 9.73244 H2: 13.05730
Lat_ Aelredus H1: 10.22625 H2: 13.04832
aelr.q.transf. H1: 9.56466 H2: 13.00754



This is not a good model for Voynichese, since I don't think it can result in a particularly low character entropy. But in my opinion this transformation could support the idea that Voynichese includes 'homophones' (or homographs?), i.e. conceptually distinct words that are written identically.
An interesting experiment, Marco. So basically long words are abbreviated in a way that creates some homographs? (Homograph is the general category that includes homonym, so I think it's better here.)

I wonder, if you were to do something like this to an English text, whether we would still be able to understand its meaning. It probably depends on the way the words are abbreviated.

Let's take the sentence: I am typing on my keyboard. Then only "keyboard" would be abbreviated.
* I am typing on my kbrd 
* I am typing on my kqrd
* I am typing on my k~rd

Depending on the way it is done, and given sufficient context, it might actually work without considerable loss of information.

Something like this might also explain the issues we have with word length variation?
I'm loving these adventures in abbreviation, guys.

I'm reminded of a viral email I received ~15y ago, which blew my mind:
Quote:I cnduo't bvleiee taht I culod aulaclty uesdtannrd waht I was rdnaieg. Unisg the icndeblire pweor of the hmuan mnid, aocdcrnig to rseecrah at Cmabrigde Uinervtisy, it dseno't mttaer in waht oderr the lterets in a wrod are, the olny irpoamtnt tihng is taht the frsit and lsat ltteer be in the rhgit pclae. The rset can be a taotl mses and you can sitll raed it whoutit a pboerlm. Tihs is bucseae the huamn mnid deos not raed ervey ltteer by istlef, but the wrod as a wlohe. Aaznmig, huh? Yaeh and I awlyas tghhuot slelinpg was ipmorantt! See if yuor fdreins can raed tihs too.

It would appear that keeping the first and last letters of a word in their proper places goes a long way to preserving the readability of a jumbled word. Marco and Koen, the algorithms both of you just described for shortening long words (plus or minus jumbling them) adhere to this same rule. I'm skeptical that a similar abbreviating algorithm was used to create the VMs text, because the rigidity of character placement is particularly strong at the ends of vords. About two thirds of tokens end with y. Plus, reviewing the forum discussion of Hidden Markov Modeling and its application to the VMs text, I'm reminded of Reddy and Knight's conclusion that the final character of a vord appears to be generated by a different process than the one that generates the rest of the vord. This seems more compatible with an abbreviation algorithm that levels word endings, rather than preserving them.

As critics of abbreviated Latin VMs theories (most memorably Stephen Carlson) have pointed out, an abbreviation method that effaces word endings is problematic for a plaintext language that carries a lot of its information in word endings, like Latin. I haven't tried chopping off or leveling the endings of long Latin words to see if readability is preserved, and my knowledge of Latin isn't good enough for me to do this experiment well. But I think it's worth considering that if used on a plaintext in a highly inflected language, this method of lossy compression could be hard to reliably decompress / reëxpand.
Thanks, Marco & Koen, for sharing these results. It is definitely interesting what manipulations get Latin to behave statistically more like Voynichese (and vice versa, for Koen’s entropy experiments).

The last thing I want to do is to come across as expecting work to be done, but I was wondering what impact your manipulation had on other basic measures of the text. Does it still follow Zipf’s Law, for example, or does such wholesale substitution of the middle of words cause a loss of that quality?

Does this kick “q” up to a very high level in the character stats or is the impact something that would be otherwise hidden?

I know you are very interested in the reduplication stats and I would guess that this manipulation didn’t produce any - maybe we need some sort of re-ordering of the words to get the repetition?

I hope you don’t mind these questions and thanks again for sharing these results!
Hi Koen,
the method I discussed for the Latin Aelredus file does not work for English. In general, English is rather far from Voynichese, according to these two measures. The closest sample in your corpus appears to be Eng_Polandrellovepoems, but that file only has 949 words that are long enough to be transformed by my regex, while the Latin file has 2863 (in both cases, I only considered the first 10k words). The English transformed file (marked 'q' in the plot) might be largely understandable, since less than 10% of the words are altered.

[attachment=5075]

In order to transform the English file into something comparable with Voynichese, I used two different approaches. 

Method 1

A more aggressive version of what I did for Latin. I used this regex:

sed -e 's/\([a-z]\)[a-z][a-z][a-z][a-z][a-z]*\([a-z]\)/\1z\2/'

Which "compresses" all words 6 or more character long, only preserving the first and last letter. This results in the  point marked 'z', which is shifted to the left, but also has a low biword entropy. I then applied a second step where I prefixed character 'x' to 20% of the words: this increases entropy and the result 'z1' is comparable with the Voynichese samples.

Original:
wher is this prynce that conquered his right within ingland master of all his foon and after fraunce be

q:
wher is this prynce that cqed his right within ingland master of all his foon and after fraunce be

z:
wher is this pze that czd his right wzn izd mzr of all his foon and after fze be

z1:
wher is this xpze that czd his right xwzn izd mzr of all xhis foon and after fze xbe
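A minimal sketch of Method 1 in Python, for anyone who wants to play with it (the 20% prefixing here is done at random, which may not be exactly how the 'z1' file was produced):

import re, random

def method1(text, prefix_rate=0.2):
    # Step 1: words of 6+ letters keep only the first and last letter, with 'z' in between
    words = [re.sub(r'^([a-z])[a-z]{4,}([a-z])$', r'\1z\2', w) for w in text.lower().split()]
    # Step 2: prefix 'x' to roughly 20% of the words to push word entropy back up
    return ' '.join(('x' + w) if random.random() < prefix_rate else w for w in words)

line = "wher is this prynce that conquered his right within ingland master of all his foon and after fraunce be"
print(method1(line))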


Method 2

This time I did not alter words, so H1 is unchanged. I modified word order by taking groups of 4 consecutive words and rearranging them randomly. This obviously increases biword entropy.

sorted:
wher is prynce this right his that conquered of within master ingland foon his all and after very be fraunce
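Method 2 is just as easy to sketch: shuffle the word order inside consecutive windows of 4 words, which leaves H1 untouched but destroys most of the original biwords.

import random

def method2(text, window=4):
    words = text.split()
    out = []
    for i in range(0, len(words), window):
        chunk = words[i:i + window]
        random.shuffle(chunk)          # rearrange each group of 4 consecutive words
        out.extend(chunk)
    return ' '.join(out)

line = "wher is this prynce that conquered his right within ingland master of all his foon and after fraunce be"
print(method2(line))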



Both methods result in unreadable text. Personally, I find something like Method 1 much more likely. All medieval manuscripts contain inconsistencies that likely result in higher entropy with respect to a modern edition: this could result in something similar to the second step. The first step, resulting in the production of homographs, is more difficult to explain.

Bowern and Lindemann pointed out that "systematic conflation of phonemic distinctions, such as conflating all vowels to a single character" results in lower character entropy, similar to Voynichese; this should also result in an increased number of homographs (e.g. by collapsing t, th, d and all vowels, "time", "theme", "tome", "demo", "dime", "dome" could all be written "twmw"). I guess that transformations like this could preserve readability, but this clearly depends on how many different sounds are conflated. Anyway, I agree that readability is important and that the relevance of these results is quite limited.
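As an illustration of the conflation idea (this is just one of many possible schemes, not something taken from Bowern and Lindemann): collapsing t/th/d to a single character and all vowels to another one already produces plenty of homographs.

import re

def conflate(word):
    word = re.sub(r'th|[td]', 't', word)   # collapse th, t, d to 't'
    return re.sub(r'[aeiou]', 'w', word)   # collapse all vowels to 'w'

for w in ["time", "theme", "tome", "demo", "dime", "dome"]:
    print(w, "->", conflate(w))   # all six come out as 'twmw'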

The method discussed by RenegadeHealer is impressive in that it shows a complex transformation that can be solved effortlessly by the reader. Unluckily it results in higher entropy at both the character and the word level, so it does not seem to be similar to what happens with the VMS.


(29-12-2020, 12:55 PM)MichelleL11 Wrote: Does it still follow Zipf’s Law, for example, or does such wholesale substitution of the middle of words cause a loss of that quality?

Hi Michelle, since in the case of Latin I only altered long words, I expect that Zipf's Law is not significantly affected. But there are other Voynichese features that cannot be explained by the method I applied here (e.g. character entropy or line effects).

(29-12-2020, 12:55 PM)MichelleL11 Wrote: Does this kick “q” up to a very high level in the character stats or is the impact something that would be otherwise hidden?

I expect that the frequency of 'q' is considerably bumped. This is one of many reasons that make it clear that this is not how Voynichese was written. These experiments are just an attempt to understand more of what word entropy values mean.

(29-12-2020, 12:55 PM)MichelleL11 Wrote: I know you are very interested in the reduplication stats and I would guess that this manipulation didn’t produce any - maybe we need some sort of re-ordering of the words to get the repetition?

My opinion is that reduplication originates in the underlying text. If this is true, either the underlying language is not European or the text is highly anomalous in this respect. But here I am speculating. Anyway, a systematic re-ordering that generates repetitions would likely result in a lower biword entropy and Voynichese shows the opposite, when compared with ordinary European texts.

(29-12-2020, 12:55 PM)MichelleL11 Wrote: I hope you don’t mind these questions and thanks again for sharing these results!

Questions are always welcome! Whenever I try something I seem to end up with more questions than answers: this is what makes this hobby so addictive! Of course, in order to put together something like a plausible theory, it is necessary to take everything into account. But I think that also exploring single features in isolation can be instructive.
Thanks, Marco! I agree that the results in English are too hard to read. I think with abbreviations, you need to keep more consonants for it to work. Unless the abbreviations are frequent enough to be understood by the individual.

One example, closer to a practical explanation, is the way students will abbreviate common words after a while when taking notes. For example at university in some courses, I often had to write the word "maatschappij" (society) which I just abbreviated to "mij" with a line instead of dots. Now, since I was lazy taking notes, I often ended up photocopying girls' notes before an exam. They also used their own abbreviations, which would be impossible to expand in isolation. But I had no problem with them, because I actually knew the context and the expected vocabulary.

It is like when Helmut suggested that [daiin] could be "d'aui" for "according to Avicenna".

Assuming an extreme case of such abbreviations would allow the text to remain useful for those who know the context and vocabulary. If the abbreviations are chosen in a way that reduces the lexicon, this might "fix" word entropy. This would be what you did in the first experiment, right?

But, as you say, this is more likely to increase character entropy than to reduce it, which is our issue with so many tests. 

I'm not sure if this is correct, but right now I feel like character entropy is the more fundamental problem. Most operations on the level of the lexicon (like turning certain words into homographs, omitting words, randomly duplicating words and creating duplication patterns) should hardly impact character entropy. 

Conversely, doing something like ordering words' letters alphabetically would reduce character entropy and affect word entropy as well. (Although if something was done that involves shuffling letters, our efforts of decoding it are certainly doomed).
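A quick way to check that hunch would be to sort the letters inside each word and compare conditional character entropy before and after. A minimal sketch (placeholder file name, whitespace tokenization, h2 estimated as H(character bigrams) - H(characters)):

import math
from collections import Counter

def H(items):
    counts = Counter(items)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def char_h2(text):
    # conditional character entropy: H(character bigrams) - H(characters)
    return H(list(zip(text, text[1:]))) - H(list(text))

def sort_letters(text):
    return ' '.join(''.join(sorted(w)) for w in text.split())

sample = open("english_sample.txt").read().lower()   # placeholder file name
print("h2 original:", char_h2(sample))
print("h2 with letters sorted within words:", char_h2(sort_letters(sample)))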
(29-12-2020, 10:15 PM)Koen G Wrote: Conversely, doing something like ordering words' letters alphabetically would reduce character entropy and affect word entropy as well. (Although if something was done that involves shuffling letters, our efforts of decoding it are certainly doomed).

I don't see shuffling letters as a big problem, because I think that any pattern, if used with some sort of consistency, can be figured out with enough time and computer power. I think the Zodiac 340 crack is an example of this -- because that's all it came down to in the end: a transposition, with a few irregularities thrown in, that took over 50 years and clever coding to figure out.

What I'm afraid of are letters gone, as this seems (especially if taken to extremes) like something that might be unrecoverable.

It also makes me nervous that this was medieval "standard practice" -- just dropping letters and substituting inconsistent markers (granted, they were relatively consistent for a particular scribe, but when you add in the "optional" spelling, etc., it can get pretty bad). By relying on context and "tradition" as to what is abbreviated to carry the sense, a fine line is being walked. I understand there are some highly abbreviated manuscript texts that will never be known for sure . . .

So I do get concerned that without that grounding context (or similarity to traditional practice), just like what Koen discusses above for his class notes, things could get really, really problematic.

In my opinion, the Zodiac killer didn't realize how hard he was making his cipher. This was due to ignorance. In a similar way, I get concerned that the VM authors also did not know how difficult they were making things. Now they have a bit more of an excuse, because they could very well have been making a cipher of this general type for the very first time (certainly not a claim the Zodiac killer can make!). But by not understanding how difficult getting the sense back out would be, the VM could prove, in the long run, to also be the only use of that cipher.

But let me emphasize, I'm not negative, just realistic, and definitely have more motivation to keep trying. Our "period-19"* data could be discovered tomorrow or maybe is even in this string -- just have to keep thinking.

*see Zodiac 340 string for more details