Mapping between Currier A and B

RE: Mapping between Currier A and B - Koen G - 09-12-2019

I was thinking about this in the shower (as one does), and I wondered: both Currier languages generally correspond to two hands, right? It doesn't really matter whether you think these are two scribes or one scribe at different times, there is a clear divide between both hands.

But the divide between the two "languages" is less clear. It is clear that one hand does different things than the other, but how do they behave on the in-between pages? Marco pointed to some pages that appear 50/50: which hand wrote those? Do both hands stray, or is one "pure" and the other variable?

If we can understand this, we may be able to decide which pages can best be compared to each other for the purpose of this thread?

RE: Mapping between Currier A and B - Torsten - 10-12-2019

(09-12-2019, 02:26 PM)MarcoP Wrote:
Assuming I have not made major errors, one can see at least four different patterns:

the two ch-words that are most frequent in HerbalA (chol, chor) have smaller and smaller frequencies as you move towards B;

symmetrically, there are words that are rare in A and progressively more frequent in B (cheey, chckhy);

there are words that do not appear in A and are frequent in B (chedy, chdy); this asymmetry could be useful in choosing the direction of the mapping A->B or B->A;

chey is somehow constant across sections.

If you would build lists for common initial glyphs like <d->, <s->, <o->, or <qo-> you would get similar results. The same is true for typical word final glyphs like <-in>, <-l>, <-r>, or <-y>.

The following table lists the five most frequent <-in>-words for different sections:
Herbal A      daiin   dain     chaiin   aiin     otaiin
Pharma A      daiin   aiin     dain     olaiin   saiin
Astro         daiin   aiin     dain     odaiin   oteodaiin
Cosmo         aiin    daiin    qokaiin  ytaiin   ykaiin
Herbal B      aiin    daiin    okaiin   qokaiin  saiin
Stars B       aiin    daiin    qokaiin  okaiin   otaiin
Biological B  qokain  qokaiin  daiin    dain     okain
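
A list like the one above can be produced with a few lines of code. The following is only a minimal sketch: the function name `top_words_with_suffix` and the toy token list are made up for illustration and are not taken from the post or from any real transliteration file.

```python
from collections import Counter

def top_words_with_suffix(words, suffix, n=5):
    """Return the n most frequent word types that end in `suffix`."""
    counts = Counter(w for w in words if w.endswith(suffix))
    return [w for w, _ in counts.most_common(n)]

# Toy token list (illustrative only, not real Voynich data).
sample = "daiin chol daiin dain aiin daiin chor dain otaiin".split()
print(top_words_with_suffix(sample, "in", n=2))
```

Running the same function once per section (with the real word lists) would give one row per section.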

The top words occur with the following frequencies:
Section          daiin   dain   aiin  odaiin  okaiin  otaiin  qokaiin  qokain  word count
---------------  -----  -----  -----  ------  ------  ------  -------  ------  ----------
Herbal (A)         403     80     33      20      28      28       15       1       8,087
Pharma (A)          99     13     30       4       6       3        2       1       2,529
Astro *          12, 4, 11, 3, 1                                                    2,136
Cosmo               36      3     56       7       9      14       18       6       2,691
Herbal (B)          72     11     72       4      31       8       20       5       3,233
Stars (B)          122     53    193      17      94      74      114     105      10,673
Biological (B)      84     47     32       1      34      12       88     159       6,911

* Some cells in the Astro row are empty; only the non-empty counts are listed, in column order.

The two tables illustrate the usage of word pairs like <daiin/dain>, <daiin/aiin>, <daiin/odaiin>, <okaiin/otaiin> or <qokaiin/qokain>.
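
Because the sections differ a lot in size, the raw counts are easier to compare after normalising by the word count. The sketch below copies the <daiin> and <aiin> counts and the word counts from the table above, using only the rows where all counts are present. The helper name `per_thousand` is hypothetical.

```python
# (daiin count, aiin count, section word count), copied from the table above.
table = {
    "Herbal (A)":     (403, 33, 8087),
    "Pharma (A)":     (99, 30, 2529),
    "Cosmo":          (36, 56, 2691),
    "Herbal (B)":     (72, 72, 3233),
    "Stars (B)":      (122, 193, 10673),
    "Biological (B)": (84, 32, 6911),
}

def per_thousand(count, total):
    """Occurrences per 1000 words."""
    return 1000.0 * count / total

rates = {s: (per_thousand(d, n), per_thousand(a, n)) for s, (d, a, n) in table.items()}
for section, (daiin, aiin) in rates.items():
    print(f"{section:15s} daiin {daiin:5.1f}  aiin {aiin:5.1f}  daiin/aiin {daiin / aiin:5.2f}")
```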

The reason for this observation is that "all pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear. The exact co-occurrences may vary: there are pages where <daiin> is paired with <dain>, but also pages where it is frequently used together with <aiin> (f41v, f46r, f55v, f89v2, f105v, and f114r) or <saiin> (f2r, f16r, and f90r2)" ([link], p. 3). Or, to put it in René's words: "A given word pattern is just about as likely to start with <o-> as with <qo->, or with <ch-> vs. <sh->, or contain <k> vs. <t>, or <p> vs <f>, or end with <-y> vs. <-dy> or <-in> vs. <-iin>" ([link]).

RE: Mapping between Currier A and B - MarcoP - 10-12-2019

Thanks everybody! Enough ideas have been mentioned to keep me busy for months :)

(09-12-2019, 03:12 PM)ReneZ Wrote: Another point I had been wondering about is whether the B language could be seen as A language with additional words. The fact that B-language pages tend to have much more text than A-language pages could be just an effect of this 'adding words'.
This was suggested (probably in quite a cryptic manner) by the last bullet above 'Suggestions for further study' on this page.

Hi Rene,
the fact that Currier B adds new word types to Currier A is certainly central to the whole phenomenon. I don't think that it can explain everything, since there are fluctuations in the frequencies of words that appear everywhere that must have another explanation.
For instance the occurrences of chol/cheey vary from 192/15=12.8 in Herbal_A to 12/36=0.3 in Bio. The frequency of chol in Herbal_A is 2.5%: comparable with the frequency of the most frequent word in English. Even if tokens belonging to the new B-types are frequent, I don't think they are enough to explain how a word drops from 2.5% to 0.2%.

(09-12-2019, 06:52 PM)ReneZ Wrote: An alternative to the verbose cipher would be a number theory. If the Voynich MS words are like a numbering or enumeration system, a similar progression could be expected. Just compare it with Roman numerals. D only starts appearing after 500 words and M only after 1000.

In such a case, there is no mapping, but there is a 'generating algorithm' that explains the dialects.

[link] you proposed is (together with Timm & Schinner) one of the very rare solid explanations for quasi-reduplication. But (as pointed out by Torsten) one does not expect new function words to appear after several thousand words. Can 'chedy' be something like MI (1001)? With a progressive numeric system, MI would correspond to a word (or character, syllable, anything) that never appears in the first several pages of the text: certainly not a frequent item. The frequency of 'chedy' in Bio is close to that of the English conjunction 'and' or the article 'the', or a character of average frequency like 'm': these items are everywhere, they do not first appear after several pages. A similar argument applies to other frequent B-words.

Here I compare the top 30 most frequent words in different sections of the VMS and Latin texts, spanning different subjects and a long time period. In Latin, there is a considerable intersection between the most frequent words (though there also are considerable differences).

[link]
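
For readers who want to repeat this kind of comparison, here is a minimal sketch of how the intersection between two "top 30" lists can be counted. The function names and the toy word lists are hypothetical; the actual comparison in the post was presumably made on full section texts.

```python
from collections import Counter

def top_types(words, n=30):
    """Set of the n most frequent word types."""
    return {w for w, _ in Counter(words).most_common(n)}

def top_overlap(words_a, words_b, n=30):
    """Number of word types shared by the two top-n lists."""
    return len(top_types(words_a, n) & top_types(words_b, n))

# Toy example (illustrative only).
a = "chol chol chor daiin daiin daiin chey chor".split()
b = "chedy chedy qokeedy daiin daiin chey chey".split()
print(top_overlap(a, b, n=3))
```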

I only see these two options:

'chedy' and other frequent B-words are meaningless (i.e. Currier B is Currier A + meaningless stuff, where CurrierA can be meaningless stuff in its turn);
'chedy' represents something (a character, a syllable, a word, part of the encoding of one of these, etc.) that is represented differently in Currier A: some kind of (likely unrecoverable) mapping must exist.

Of course, I would be interested in different suggestions.

RE: Mapping between Currier A and B - ReneZ - 10-12-2019

Hi Marco, yes, the observation that new words all of a sudden become very frequent is something that requires a better explanation.
This is one of those cases where it is not difficult to come up with quite a number of possibilities, especially taking into account combinations, but it is of course extremely difficult to judge how 'likely' any of these is.

Some examples would be
- Change of dialect by itself
- Change of dialect in combination with some encoding
- Change of scribe
- Change of handwriting in a source document in combination with some encoding
- Minor change of a rule in combination with some encoding
- They are null words

A numbering system in combination with a change of dialect could result in the observed effect.

The strong impression I have is that the evolution in the direction A -> B seems easier to explain than in the other direction.

RE: Mapping between Currier A and B - -JKP- - 10-12-2019

(10-12-2019, 04:18 PM)ReneZ Wrote:
...
- They are null words
...

If they are nulls, I'm not sure we can call them words. ;)

I can see this one

"- Minor change of a rule in combination with some encoding "

possibly being a major factor.

RE: Mapping between Currier A and B - ReneZ - 10-12-2019

No problem JKP, but I just used that to distinguish them from the more standard use of the term null.

One could imagine that the sequence ed is a null. This would normally mean that only these two characters are meaningless and one should read qokeedy as qokey.

Alternatively, this character pair could indicate a null word, in which case the entire word qokeedy should be ignored.

Either one of the two options could bring B language closer to A language. I never tried. I think there is more to it, and the number of possibilities that one could test are rather large.
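
The two readings of "null" described above can be written down as two small text transforms. This is only an illustrative sketch with hypothetical function names; it has not been run against the actual transliteration.

```python
def strip_null_sequence(words, seq="ed"):
    """Option 1: treat `seq` as a null inside words and delete it."""
    return [w.replace(seq, "") for w in words]

def drop_null_words(words, marker="ed"):
    """Option 2: treat any word containing `marker` as a null word and skip it."""
    return [w for w in words if marker not in w]

line = ["qokeedy", "chol", "chedy", "daiin"]
print(strip_null_sequence(line))  # qokeedy is read as qokey
print(drop_null_words(line))      # qokeedy and chedy are ignored entirely
```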

RE: Mapping between Currier A and B - Torsten - 11-12-2019

(10-12-2019, 03:32 PM)MarcoP Wrote: the fact that Currier B adds new word types to Currier A is certainly central to the whole phenomenon. I don't think that it can explain everything, since there are fluctuations in the frequencies of words that appear everywhere that must have another explanation.
For instance the occurrences of chol/cheey vary from 192/15=12.8 in Herbal_A to 12/36=0.3 in Bio. The frequency of chol in Herbal_A is 2.5%: comparable with the frequency of the most frequent word in English. Even if tokens belonging to the new B-types are frequent, I don't think they are enough to explain how a word drops from 2.5% to 0.2%.

Indeed, a token dominating a paragraph, page or section might be rare or missing on the next one.

An interesting example of this pattern is the usage of <[link]> (see [link]). Jorge Stolfi has described the usage of <qokeey> within the stars section in terms of three distributions:
"If a paragraph contains the word <qokeey>, there is a 38% chance that the next paragraph will contain the word <qokeey>.
If a paragraph contains the word <qokeey>, there is a 40% chance that <qokeey> occurs more than once.
If the current word is <qokeey>, there is a 6% chance that the next word will be <qokeey>" ([link]).
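
A minimal sketch of how three such conditional frequencies could be counted is given below. It is not Stolfi's code: the function name, the input format (a list of paragraphs, each a list of words) and the choice to count word pairs only within a paragraph are all assumptions made for illustration.

```python
def ratio(num, den):
    return num / den if den else float("nan")

def paragraph_stats(paragraphs, word):
    """Return the three conditional frequencies for `word`."""
    has = [word in p for p in paragraphs]
    # P(next paragraph contains word | this paragraph contains word)
    followers = sum(1 for a, b in zip(has, has[1:]) if a and b)
    starters = sum(has[:-1])
    # P(word occurs more than once | paragraph contains word)
    repeated = sum(1 for p in paragraphs if p.count(word) > 1)
    # P(next word is word | current word is word), pairs inside one paragraph only
    pairs = sum(1 for p in paragraphs for a, b in zip(p, p[1:]) if a == word == b)
    positions = sum(p[:-1].count(word) for p in paragraphs)
    return ratio(followers, starters), ratio(repeated, sum(has)), ratio(pairs, positions)

# Toy paragraphs (illustrative only).
toy = [["qokeey", "qokeey", "chedy"], ["qokeey", "daiin"], ["chol", "shedy"]]
print(paragraph_stats(toy, "qokeey"))
```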

The general rule behind this pattern is that "high-frequency tokens also tend to have high numbers of similar words." (Timm & Schinner 2019, p. 6). In other words, <[link]> occurs preferably in close vicinity of tokens with high structural similarity (see Timm & Schinner 2019, p. 3):

Section           okey  okeey  qokeey  qokeedy  word count
---------------  -----  -----  ------  -------  ----------
Herbal (A) *     4, 7, 12                            8,087
Pharma (A) *     11, 24, 21                          2,529
Astro *          4, 6, 4                             2,136
Cosmo                1      6       6        4       2,691
Herbal (B)           6      8       9        9       3,233
Stars (B)           16     96     159      137      10,673
Biological (B)      12     19      89      153       6,911

* One cell in this row is empty; only the non-empty counts are listed, in column order.

RE: Mapping between Currier A and B - nablator - 11-12-2019

(09-12-2019, 03:12 PM)ReneZ Wrote: Another point I had been wondering about is whether the B language could be seen as A language with additional words.

I have been wondering if EVA-lk, the second-best discriminator bigram for A/B after EVA-ed, should be split with a space. It looks like omitting the space becomes increasingly popular in Currier-B, which explains the increasing frequency of lk... maybe.
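
One rough way to look at this would be to count, per section, how often l and k are written together versus separated by a space. The sketch below assumes a transliteration string in which '.' marks a word space; the function name and the toy string are hypothetical.

```python
def lk_counts(text):
    """Return (joined, split): 'lk' inside a word vs. word-final l + word-initial k."""
    joined = text.count("lk")
    split = text.count("l.k")
    return joined, split

# Toy string (illustrative only).
toy_text = "ol.kedy.olkedy.chol.daiin.olkeedy"
joined, split = lk_counts(toy_text)
print(joined, split, joined / (joined + split))
```

If the share of joined forms rises from A to B sections while the combined count stays roughly stable, that would be consistent with the idea of an omitted space.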

RE: Mapping between Currier A and B - MarcoP - 13-12-2019

I have run a first set of experiments, according to the simple approach described in the first post. In particular:

I matched A (HerbalA+Pharma) vs B (HerbalB+Stars+Bio), ignoring the intermediate sections Astro / Cosmo / Zodiac (as suggested by Nick [link]);
I used three different transliteration systems: EVA, Cuva, Currier, all based on the Zandbergen-Landini EVA file, ignoring uncertain spaces;
I searched for rewrite rules making A closer to B;
as a difference measure, I considered the difference between word histograms (absolute value of the % frequency difference between each word type); this measure falls in the 0-1 range; the distance between the two unmodified sets is about 0.68, for all transliterations;
each set A,B was treated as a single string, with '.' representing a space between words and '|' a line-break;
for each set A,B the 100 most frequent character sequences of length 1-4 were considered; for each combination, the variation in distance was measured and the highest negative differences were selected.
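
The procedure listed above can be sketched roughly as follows. This is not the code used for the experiment. In particular, halving the summed absolute differences (so that the measure stays in the 0-1 range) and taking source sequences from A and target sequences from B are assumptions, and all function names are hypothetical.

```python
from collections import Counter

def histogram(text):
    """Relative word frequencies; '.' separates words and '|' marks a line break."""
    words = [w for w in text.replace("|", ".").split(".") if w]
    return {w: c / len(words) for w, c in Counter(words).items()}

def distance(text_a, text_b):
    """Half the summed absolute frequency differences (assumed normalisation)."""
    ha, hb = histogram(text_a), histogram(text_b)
    return 0.5 * sum(abs(ha.get(w, 0.0) - hb.get(w, 0.0)) for w in set(ha) | set(hb))

def top_sequences(text, max_len=4, n=100):
    """The n most frequent character sequences of length 1..max_len."""
    grams = Counter(text[i:i + k] for k in range(1, max_len + 1)
                    for i in range(len(text) - k + 1))
    return [g for g, _ in grams.most_common(n)]

def best_rules(text_a, text_b, n=100, keep=20):
    """Try each single rewrite src -> dst on A; rank by change in distance to B."""
    base = distance(text_a, text_b)
    results = []
    for src in top_sequences(text_a, n=n):
        for dst in top_sequences(text_b, n=n):
            if src != dst:
                delta = distance(text_a.replace(src, dst), text_b) - base
                results.append((delta, src, dst))
    return sorted(results)[:keep]  # most negative differences first

# Toy strings (illustrative only).
print(best_rules("chor.chol.dor", "chedy.chol.dedy", keep=3))
```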

Given this procedure, obviously the transliteration system only has a minor effect (e.g. Cuva/Currier sequences of length 4 often correspond to longer Eva sequences - Currier:SC89 corresponds to EVA:chedy which is too long to enter the scope of this experiment).

These are the top 20 results for the three systems.

[link]

There clearly is a lot of redundancy and most results convert various A suffixes into B:-edy. The reason for this is quite obvious.

It is possibly less obvious that
or -> edy.
appears to be more effective than
or. -> edy.
The difference between the two is that the first one can break an A word e.g.
dorchaiin -> dedy.chaiin
where (in this case) both resulting words appear in B. The second rule only applies to the subset of the scope where -or is word final.

Another substitution I did not expect is:
y.d -> dy.
this results in rewrites like:
shey.dair -> shedy.air
chy.dam -> chdy.am
which indeed transform A-word-sequences into valid B-words.
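
The rewrites quoted above are plain string substitutions on the text with '.' as the word separator, so they can be reproduced directly. The helper name `apply_rule` is hypothetical.

```python
def apply_rule(text, src, dst):
    """Rewrite every occurrence of src; '.' stands for a word space."""
    return text.replace(src, dst)

print(apply_rule("dorchaiin", "or", "edy."))    # breaks the word: dedy.chaiin
print(apply_rule("dorchaiin", "or.", "edy."))   # no match, -or is not word-final here
print(apply_rule("shey.dair", "y.d", "dy."))    # shedy.air
print(apply_rule("chy.dam", "y.d", "dy."))      # chdy.am
```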

There could be other substitutions worth commenting upon, but my main interest now is understanding how to proceed. A possibility could be re-formulating this search into something like "simulated annealing", rather than this simple brute-force approach. Also, a distance measure that considers the overlap between word-sequences (instead of word frequencies alone) is something I am curious about.

RE: Mapping between Currier A and B - nablator - 13-12-2019

(13-12-2019, 11:38 AM)MarcoP Wrote: There could be other substitutions worth commenting upon, but my main interest now is understanding how to proceed. A possibility could be re-formulating this search into something like "simulated annealing", rather than this simple brute-force approach.

You need to investigate multiple substitutions at the same time. One substitution at a time may not be enough to detect an improvement in your metric. This of course makes the search space huge. I would try to "hill climb" first. Some problems are well suited to the hill climbing algorithm (just swap the target part of two randomly selected rules or enable a rule and disable the other and retest: if the result is worse, backtrack). If you are out of luck you get a different sub-optimal "solution" each time, and you need a better algorithm.
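
As a rough illustration of the enable/disable variant of hill climbing, here is a minimal sketch. The distance function (half the summed absolute word-frequency differences), the rule list and all names are assumptions for the example, not a tested method for the real texts.

```python
import random
from collections import Counter

def distance(text_a, text_b):
    """Half the summed absolute word-frequency differences ('.' separates words)."""
    ca, cb = Counter(text_a.split(".")), Counter(text_b.split("."))
    na, nb = sum(ca.values()), sum(cb.values())
    return 0.5 * sum(abs(ca[w] / na - cb[w] / nb) for w in set(ca) | set(cb))

def apply_rules(text, rules, enabled):
    """Apply the enabled rules in list order."""
    for (src, dst), on in zip(rules, enabled):
        if on:
            text = text.replace(src, dst)
    return text

def hill_climb(text_a, text_b, rules, steps=200, seed=0):
    """Toggle one random rule per step; undo the toggle if the distance gets worse."""
    rng = random.Random(seed)
    enabled = [False] * len(rules)
    best = distance(text_a, text_b)
    for _ in range(steps):
        i = rng.randrange(len(rules))
        enabled[i] = not enabled[i]
        score = distance(apply_rules(text_a, rules, enabled), text_b)
        if score <= best:
            best = score
        else:
            enabled[i] = not enabled[i]  # backtrack
    return enabled, best

# Toy example (illustrative only).
toy_rules = [("or", "edy"), ("ol", "am")]
print(hill_climb("chor.chol.dor", "chedy.chol.dedy", toy_rules))
```

Different seeds may end in different local optima on real data, which is the "different sub-optimal solution each time" case mentioned above.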

To reduce the set of possible sets of rules to apply together, this can be taken into consideration: a rule like "or -> edy" cannot possibly be right without another rule something -> "or" because "or" must not be eliminated from Currier-B. So you need a set of rules in cycles: pattern 1 -> pattern 2 -> pattern 3 -> ... -> pattern n -> pattern 1.
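
This constraint is easy to use as a filter on candidate rule sets. The sketch below only checks the necessary condition stated above (every pattern that is rewritten away is also produced by some rule); it does not verify that the rules form a single cycle. Function names are hypothetical.

```python
def missing_sources(rules):
    """Patterns that some rule rewrites away but no rule produces again."""
    sources = {src for src, _ in rules}
    targets = {dst for _, dst in rules}
    return sources - targets

def is_closed(rules):
    """True if every rewritten pattern is also the target of some rule."""
    return not missing_sources(rules)

print(missing_sources([("or", "edy")]))                        # 'or' would vanish
print(is_closed([("or", "edy"), ("edy", "ol"), ("ol", "or")]))  # a 3-cycle
```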