VM TTR values - Printable Version


RE: VM TTR values - MarcoP - 15-02-2023

Hi Yulia,
it's great to read from you!
The hypothesis you discuss addresses one of the major problems of Voynichese and it has been put forward in the past (e.g. by [link], but I am sure by others as well).

A first problem I see is that repetition is quite pervasive in the VMS. Where do we stop throwing stuff away? Your example includes a repetition with an edit distance of 2 (from9 9from). Is that the maximum edit distance for irrelevant words? Edit distance (which I am using here to keep things simple) is a totally anachronistic way to decide what to keep and what to dump; what should be used instead? In a sequence of similar words, do we keep the first one, the last one, or is there some different criterion?

Example:
<f82v.8,+P0> qokain.sheol.qoteedy.chedy.qokey.qokedy.qokol.chedy.chedy.lchy

EVA edit distances between consecutive words:
qokain sheol 6
sheol qoteedy 6
qoteedy chedy 4
chedy qokey 4
qokey qokedy 1
qokedy qokol 3
qokol chedy 5
chedy chedy 0
chedy lchy 3

Let's say we remove everything with an edit-distance of 3 or less. I am keeping the first word of each sequence:

qokey for qokey.qokedy.qokol
chedy for chedy.chedy.lchy

We dump 4/10 of the words and are left with:
qokain sheol qoteedy chedy qokey chedy

These 6 words still seem to show a rigid structure (low character entropy) and repetitions (though they now alternate: 3 qo words + 3 bench-e words, with chedy occurring twice and 4 consecutive -y words).
Also, subtler repetitions like [link] will probably be almost unaffected by dumping consecutive similar words (unless the threshold is so high that you dump most of the manuscript).
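
For anyone who wants to repeat the f82v.8 experiment with other lines or thresholds, here is a minimal Python sketch. It assumes plain Levenshtein distance on the EVA strings and the rule used above: a word is dropped when it is within the threshold of the word immediately before it, so the first word of each run survives. The function names are only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def drop_similar(words, threshold=3):
    """Keep a word only if it is farther than `threshold` from the word before it."""
    kept = []
    for i, word in enumerate(words):
        if i == 0 or levenshtein(words[i - 1], word) > threshold:
            kept.append(word)
    return kept


line = "qokain.sheol.qoteedy.chedy.qokey.qokedy.qokol.chedy.chedy.lchy".split(".")
distances = [levenshtein(a, b) for a, b in zip(line, line[1:])]
kept = drop_similar(line)

print(distances)       # [6, 6, 4, 4, 1, 3, 5, 0, 3]
print(" ".join(kept))  # qokain sheol qoteedy chedy qokey chedy
```

Comparing each word with the last kept word instead of the previous word, or keeping the last word of a run, are equally defensible variants and give different results on other lines.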

To summarize, two of the problems I see are:

1. There are countless ways of deciding what to dump and what to keep of the similar consecutive words (see how different what Pardis does is from the edit-distance method above).
2. Whatever we do, I am afraid we throw away much of the text and the result still has all the problems of Voynichese, minus the consecutive repetition of the same word.

RE: VM TTR values - nablator - 15-02-2023

(10-02-2023, 04:03 PM)MarcoP Wrote: I am aware that perfect reduplication is a special case of a more general phenomenon.

I would like to use a metric that works when word boundaries are unknown, because they often are, if they are not completely fictitious (a possibility).
Could the (1) local diversity and (2) dissemblance between chunks of text be measured by the compression ratio of zip (or better algorithm)? (For comparison between different pages/sections only.)
(1) compressed_size(chunk)/uncompressed_size(chunk)
(2) compressed_size(chunk1+chunk2)/(compressed_size(chunk1)+compressed_size(chunk2))
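
A rough way to try the two ratios is sketched below in Python. It assumes zlib (DEFLATE) stands in for "zip" and that chunks are plain transliteration strings; the function names and the sample strings are illustrative only. On very short chunks the fixed zlib header and checksum dominate the result, so real comparisons would need chunks of similar and much larger size.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text at maximum compression level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def local_diversity(chunk: str) -> float:
    """Ratio (1): compressed size over uncompressed size (chunk must be non-empty)."""
    return compressed_size(chunk) / len(chunk.encode("utf-8"))


def dissemblance(chunk1: str, chunk2: str) -> float:
    """Ratio (2): size of the compressed concatenation over the sum of the separate sizes."""
    return compressed_size(chunk1 + chunk2) / (
        compressed_size(chunk1) + compressed_size(chunk2)
    )


voynich = "qokain.sheol.qoteedy.chedy.qokey.qokedy.qokol.chedy.chedy.lchy"
english = "When Tamerlin returned home from Babiloni, he sent word to all in his land"

print(local_diversity(voynich), local_diversity(english))
print(dissemblance(voynich, voynich), dissemblance(voynich, english))
```

A chunk compared with itself gives a low value for (2) because the second copy compresses to almost nothing, while unrelated chunks give a value close to 1.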

RE: VM TTR values - Searcher - 15-02-2023

(15-02-2023, 10:49 AM)MarcoP Wrote: Hi Yulia,
it's great to read from you!
The hypothesis you discuss addresses one of the major problems of Voynichese and it has been put forward in the past (e.g. by [link], but I am sure by others as well).

A first problem I see is that repetition is quite pervasive in the VMS. Where do we stop throwing stuff away? Your example includes a repetition with an edit distance of 2 (from9 9from). Is that the maximum edit distance for irrelevant words? Edit distance (which I am using here to keep things simple) is a totally anachronistic way to decide what to keep and what to dump; what should be used instead? In a sequence of similar words, do we keep the first one, the last one, or is there some different criterion?

Hi Marco!
I'm afraid that I meant something slightly different, well, at least it's definitely not about edit distance. My idea is not so smart. Everything is simple here: if one word is repeated several times in a row in a continuous sequence, then all its duplicates are garbage. The "9" attached to the words in my example was optional, I just added it as a null (or garbage) to more plausibly convey the situation in the Voynich text. In fact, I could make it like this:
When Tämerlin returned home home from from Babiloni, he he he sent word to all in his land that they were were to be ready in four months, as he wanted to go into Lesser India, distant from from from from his capital a four months’ journey. When the time came, he went into Lesser India with four hundred thousand men men, and crossed a desert...
or like this:
When Tämerlin re9turned home 9home from from Babiloni, he he he sent9 9word9 to all in9 his land that they were were to be ready in four 9mon9ths, as he w9anted to go into 9Lesser India, distant from from from9 9from his capital9 a four months’ journey. When the9 time9 came, he went into 9Lesser In9dia with four hundred thousand men9 men, and crossed a desert...

By using 9 as a null in my example I don't mean that the Voynich y is a null; it's just an example. In fact, as you know, I consider q to be the null.
My approach is based on simply reducing each chain of duplicated words to one word. That is, if the string contains the sequence "chedy chedy qokedy otedy chol shol okedy qokedy qokedy", then it can probably be reduced to "chedy qokedy otedy chol shol okedy qokedy", or to "chedy okedy otedy chol shol okedy" if we assume that "q" is a null. Obviously, if we take away only exact duplications, it doesn't change the characteristics of the text too much.
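
A minimal Python sketch of this reduction, assuming words are separated by spaces and that "null" means the character is simply deleted wherever it occurs. The function names are illustrative, and the input includes shol, which both reduced forms contain.

```python
from itertools import groupby


def collapse_repeats(words):
    """Reduce each run of identical consecutive words to a single word."""
    return [word for word, _run in groupby(words)]


def strip_null(words, null="q"):
    """Delete a presumed null character from every word."""
    return [word.replace(null, "") for word in words]


sequence = "chedy chedy qokedy otedy chol shol okedy qokedy qokedy".split()

reduced = collapse_repeats(sequence)
reduced_without_q = collapse_repeats(strip_null(sequence))

print(" ".join(reduced))            # chedy qokedy otedy chol shol okedy qokedy
print(" ".join(reduced_without_q))  # chedy okedy otedy chol shol okedy
```

The null has to be stripped before collapsing, otherwise "okedy qokedy qokedy" would not merge into a single word.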
For example (the last paragraph of f75r):

polshy dal shedy qokain das chsdy shedy qokar shedy ldy

qokeey lshedy qol chedy qokain chcthedy ltedy darom
solkedy okal dar oty otar otar ol kain olkedy
qokain sheety qokain dar dar shedy qokar ol dy
sol keedy qokeedy qokey okar otar dar dar dy
qokedy dy sheety qokedy qokchdy qokechdy lol
qokeedy qokeedy qokedy qokedy qokeedy ldy
yshedy qokeedy qokchdy olkeedy otey koldy
dar shedy qokain shedy dal keedy rshedy
sokeedy qokeedy oteedy qoky dykeedy sy
dshedy qokedy c qoteey qoteedy dar

We can remove every vord that duplicates the one before it. In this case, the text will be reduced, but not by much. The rest is a matter of experiments and observations (q = null? e = ee = eee? i = ii = iii? etc.)
The solution I proposed will certainly not change the low entropy of words. For this I have several other assumptions, one of which suggests that some words could be broken into two or three parts: I think that sholchol, for example, will have higher entropy than shol and chol. I'm just wondering how much the removal of duplicate words and experimenting with "q" and the "e"s would affect the test results.
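
To get a feel for the effect on TTR (type-token ratio, the number of distinct words divided by the total number of words), here is a small Python sketch. It assumes the simplest reading of each idea: q is deleted, runs of e or i are merged into one character, and exact consecutive duplicates are collapsed afterwards. The names are illustrative, and a single line is far too short to say anything about the manuscript as a whole.

```python
import re
from itertools import groupby


def normalize(word, drop_q=False, merge_e=False, merge_i=False):
    """Apply the optional substitutions q = null, e = ee = eee, i = ii = iii."""
    if drop_q:
        word = word.replace("q", "")
    if merge_e:
        word = re.sub(r"e+", "e", word)
    if merge_i:
        word = re.sub(r"i+", "i", word)
    return word


def collapse_repeats(words):
    """Reduce each run of identical consecutive words to a single word."""
    return [word for word, _run in groupby(words)]


def ttr(words):
    """Type-token ratio: distinct words divided by total words."""
    return len(set(words)) / len(words)


line = "qokeedy qokeedy qokedy qokedy qokeedy ldy".split()  # one line of f75r

collapsed = collapse_repeats(line)
merged = collapse_repeats([normalize(w, merge_e=True) for w in line])

print(ttr(line))       # 0.5
print(ttr(collapsed))  # 0.75
print(ttr(merged))     # 1.0
```

TTR also depends on the length of the sample, so before and after values should be compared on samples of equal size.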
One more example:

the sun is shin ing ing ing bright bring ing a wa ving light but it is is still cold.

In my view, this phrase reflects all the characteristics of the VMS:
1. short vords;
2. duplicating vords;
3. similar vords that differ with only 1 or 2 characters;
4. literary consonance and assonance (repetition of the same letter in a group of nearby words in a phrase).
It is a question whether such devices can be used across a large volume of text; perhaps the author chose his/her vocabulary not randomly but carefully, using a creative approach and poetic devices.

RE: VM TTR values - Addsamuels - 15-02-2023

(15-02-2023, 08:59 PM)Searcher Wrote: abbreviated.

I think you are getting closer, but I don't agree about the character Q. For example, it's common knowledge that the next character after Q is an o; in other words, Qo pairs far outstrip other Q? pairs, where ? stands for any single character other than o.
I also have more info.
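
The claim about Qo pairs is easy to check on any transliteration by counting which character follows q. A minimal Python sketch, assuming lowercase EVA words separated by spaces; the function name is illustrative, and the ten-word sample only shows how the count works and is not evidence by itself.

```python
from collections import Counter


def followers(words, char="q"):
    """Count the characters that immediately follow `char` inside each word."""
    counts = Counter()
    for word in words:
        for current, following in zip(word, word[1:]):
            if current == char:
                counts[following] += 1
    return counts


sample = "qokain sheol qoteedy chedy qokey qokedy qokol chedy chedy lchy".split()
q_next = followers(sample)

print(q_next)  # Counter({'o': 5})
```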

RE: VM TTR values - Addsamuels - 15-02-2023

I can't add my "proofs" (or more info) because there is no native image support. I'm trying to create an account with an image-sharing site to rectify this.

Regards,

RE: VM TTR values - Koen G - 17-02-2023

I agree with Marco's explanation. It is very tempting to "trim the fat" off of Voynichese, and see what's left. But there are many ways to do this, and the result will be that you're still left with the same problem, only on a different level.

It might help to compare with another language. Let's say we don't know what Latin is, we find a Latin text and suspect someone tampered with it by adding all the frequent endings. We slash aggressively, removing everything that resembles a suffix. Just a quick example with Pliny, where I removed common endings as if I didn't know the language:

"Ipsa quae nunc dicetur herbarum claritas, medicinae tantum gignente eas Tellure, in admirationem curae priscorum diligentiaeque animum agit. nihil ergo intemptatum inexpertumque illis fuit, nihil deinde occultatum quodque non prodesse posteris vellent. at nos elaborata iis abscondere ac supprimere cupimus et fraudare vitam etiam alienis bonis."

Becomes something like:

"Ips qu nunc dic herb clar, medicin tant gigne eas Tel, in admir cur prisc dilig anim agi. nihil erg intempt inexpert il fu, nihil deind occult quod non prod post vel. at nos elabor i abscond ac supprim cup et fraud vit et ali bon."

Now it is important to understand that by doing this, I removed linguistic information from Latin; the endings are not there for show. But still, what is left might pass for a badly mangled form of some kind of Romance language. Some forms like "tant" simply become French. The full phoneme inventory is still there, and we still have the words' roots.

Now if we do the same to Voynichese, the scenario will be completely different. We will reduce our phoneme inventory even further, because "iin"-clusters and "q" will need to disappear entirely. And keep in mind that the apparent phoneme inventory of Voynichese is already smaller than most people realize, because several characters are infrequent, so it operates on a relatively small core of characters. Now of course one could slash material until benched gallows etc become relatively frequent, and then recombine what is left into words.

In short, my main concern with this kind of approach is what your phoneme inventory would end up looking like, and how you're going to get around 20 characters.

RE: VM TTR values - Searcher - 18-02-2023

(17-02-2023, 10:07 AM)Koen G Wrote: I agree with Marco's explanation. It is very tempting to "trim the fat" off of Voynichese, and see what's left. But there are many ways to do this, and the result will be that you're still left with the same problem, only on a different level.

Hi Koen!
You conducted so many experiments trying to bring the text parameters back to normal. Didn't a single experiment help to achieve characteristics at least similar to those of a normal text? What conclusions did you draw from this?
I really admire all the statistical tests, calculations and programming tricks, but I'm really interested in what conclusions they lead to and whether these conclusions are correct. Are these conclusions about the language or about a possible encryption method? One part of the tests can show that the text is rubbish, while another part can confirm that the text is not random and may make sense. What do such conflicting tests say? Is there one conclusion, or are several different conclusions possible?
I understand that nobody will reveal all his/her assumptions about the results of his/her tests, but precisely because of this, it is not entirely clear whether his/her conclusions reflect all the possibilities. This is not meant as criticism; it is about brainstorming.

Quote:It might help to compare with another language. Let's say we don't know what Latin is, we find a Latin text and suspect someone tampered with it by adding all the frequent endings. We slash aggressively, removing everything that resembles a suffix.

I think most people understand that thoughtlessly removing all common endings will not lead to anything good, since all we get is frequent beginnings. Why not continue to search for the rules? Is it really a hopeless exercise that does not lead to any positive results? So what are we doing here?

Quote:Now if we do the same to Voynichese, the scenario will be completely different. We will reduce our phoneme inventory even further, because "iin"-clusters and "q" will need to disappear entirely.

Why "-iin", but not "-dy"? Note that "-n" and "-dy" are similar in that in most cases they are preceded by a character that is often repeated (e, ee, eee, eeee, i, ii, iii, iiii).

Quote:And keep in mind that the apparent phoneme inventory of Voynichese is already smaller than most people realize, because several characters are infrequent, so it operates on a relatively small core of characters. Now of course one could slash material until benched gallows etc become relatively frequent, and then recombine what is left into words.

In short, my main concern with this kind of approach is what your phoneme inventory would end up looking like, and how you're going to get around 20 characters.

I think that without q and sh, I can get 21 - 22 letters in the normal ratio for an ordinary language. Of course, in this case, it will be necessary to make certain substitutions, but only under certain conditions. I'm still working on it (unfortunately, I can't devote much time to research right now), but I hope to finish it in the near future. Of course it doesn't mean that my set of steps will lead to a positive result in deciphering, but the fact that the percentage of letters in the text is approaching the norm makes me happy. It remains to be seen how these substitutions affect the n-gram ratio. I need to note that these steps are not unique; some of them have been discussed quite often on the forum. But, whatever one may say, it turns out that with these methods some Voynich vords will have to be combined as parts of one word. It may turn out that some labels don't contain full words, but abbreviated ones.

RE: VM TTR values - nablator - 18-02-2023

(15-02-2023, 10:57 PM)Addsamuels Wrote: I can't add my ''proofs'' (or more info) due to there being no native image support.

There is an "attachments" box where you can upload an image when you post a new reply. Or you can use a free image hosting website like goopics.net, no account needed.

RE: VM TTR values - Koen G - 18-02-2023

(18-02-2023, 11:51 AM)Searcher Wrote: I think most people understand that thoughtlessly removing all common endings will not lead to anything good, since all we get is frequent beginnings. Why not continue to search for the rules? Is it really a hopeless exercise that does not lead to any positive results? So what are we doing here?

My point was kind of the opposite. If we thoughtlessly remove all common endings and beginnings from Latin, what we are left with is still relatively good. It won't be a real language and information will be lost, but in almost all metrics you can imagine it will behave like a normal language. Word length might start looking abnormal, as an abnormal amount of words will be very short. But most words in this text will still look viable. With Voynichese you cannot do anything like that.

It's true I've been looking for ways to optimize certain stats, but I did so without taking the "workability" of the system into account. What I did was check if the information within voynichese words could be expressed in such a way that entropy becomes normal (it barely does). But this was a purely statistical exercise that did not aim to discover a workable encryption system.
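
For readers who want to see what "entropy" refers to in practice, here is a minimal Python sketch of two common character-level measures. Which measure is meant above is an assumption on my part: h1 is the plain entropy of single characters, and h2 is the conditional entropy of a character given the one before it. Whether spaces are counted as characters is also a choice the user has to make, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def h1(text):
    """First-order character entropy."""
    return entropy(Counter(text))


def h2(text):
    """Conditional entropy of a character given the previous one: H(pair) - H(first)."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    return entropy(pairs) - entropy(firsts)


voynich = "qokain sheol qoteedy chedy qokey qokedy qokol chedy chedy lchy"
latin = "ipsa quae nunc dicetur herbarum claritas medicinae tantum gignente eas tellure"

print(h1(voynich), h2(voynich))
print(h1(latin), h2(latin))
```

Samples this short give unstable values, so any real comparison needs long texts of equal length.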

When looking at slashing redundant or common elements in Voynichese, there will also come a point where we are basically saying that it is generated filler a la Torsten Timm, with just a bit of information sprinkled inside. I don't see why this can't be possible, but if this is the case, part of our approach must be to ask "where is the actual information" and "how can this system function?"

For example, a workable version might be that each Voynichese word stands for one plaintext letter that has been padded in a way to make it look like a word. So if one knows how to read the relevant letters, one can easily ignore the padding and retrieve the meaning.
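
As a toy illustration of such a system (purely hypothetical, not a proposal for how the VMS works): every plaintext letter is wrapped in a two-character prefix and a suffix taken in rotation from made-up lists, and the reader recovers the letter from the third position of each word. All names and filler lists are invented for the example, and the plaintext is assumed to contain no spaces.

```python
from itertools import cycle

# Hypothetical filler material; every prefix is exactly two characters long.
PREFIXES = ["qo", "ch", "sh", "ot"]
SUFFIXES = ["edy", "ol", "ain", "y"]


def encode(plaintext):
    """Turn each plaintext letter into a padded word: prefix + letter + suffix."""
    prefixes, suffixes = cycle(PREFIXES), cycle(SUFFIXES)
    return " ".join(next(prefixes) + letter + next(suffixes) for letter in plaintext)


def decode(ciphertext):
    """Read the third character of every word and ignore the padding."""
    return "".join(word[2] for word in ciphertext.split())


cipher = encode("herba")
print(cipher)          # qohedy cheol shrain otby qoaedy
print(decode(cipher))  # herba
```

In a scheme like this almost all of the visible text is padding, which is why the question of where the actual information sits matters so much.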

RE: VM TTR values - Searcher - 18-02-2023

(18-02-2023, 05:24 PM)Koen G Wrote:
(18-02-2023, 11:51 AM)Searcher Wrote: I think most people understand that thoughtlessly removing all common endings will not lead to anything good, since all we get is frequent beginnings. Why not continue to search for the rules? Is it really a hopeless exercise that does not lead to any positive results? So what are we doing here?

My point was kind of the opposite. If we thoughtlessly remove all common endings and beginnings from Latin, what we are left with is still relatively good. It won't be a real language and information will be lost, but in almost all metrics you can imagine it will behave like a normal language. Word length might start looking abnormal, as an abnormal amount of words will be very short. But most words in this text will still look viable. With Voynichese you cannot do anything like that.

Oh, it wasn't an objection to your Latin example at all. Maybe I didn't place the quotes well. Rather, I was confirming your statement in the next quote and adding my own observation, as I've also tried deleting "-dy", "-edy" or "-iin" and saw that the result was "many frequent beginnings".

Quote:It's true I've been looking for ways to optimize certain stats, but I did so without taking the "workability" of the system into account. What I did was check if the information within voynichese words could be expressed in such a way that entropy becomes normal (it barely does). But this was a purely statistical exercise that did not aim to discover a workable encryption system.

For example, a workable version might be that each Voynichese word stands for one plaintext letter that has been padded in a way to make it look like a word. So if one knows how to read the relevant letters, one can easily ignore the padding and retrieve the meaning.

I'm trying to do this, too. I want to refine my experiment further, and I'll share it later. Maybe it's just an exercise, but still... who knows where it will lead.