04-10-2016, 06:55 AM
I've recently repurposed my genetic algorithm code to use EVA rather than Voyn_101. The GA seems to do better with EVA, and I'd like to report an interesting result using Latin as a base language for f3r (a folio I picked at random).
The way this works is that the GA reads in the EVA transcription for the given folio(s), line by line and word by word, and as it does so it creates frequency tables of all the ngrams it finds. Right now it uses ngrams up to 3 glyphs long.
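The ngram counting step described above might look something like the following. This is only a minimal sketch, not the actual GA code: the function name ngram_counts is made up for illustration, and it assumes every EVA character counts as one glyph and that ngrams never cross word boundaries.

```python
from collections import Counter

def ngram_counts(lines, max_n=3):
    """Count every glyph ngram of length 1..max_n found inside each word."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            for n in range(1, max_n + 1):
                # Slide a window of width n across the word
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1
    return counts

# First two EVA lines of f3r as quoted later in this post
eva_lines = [
    "tsheos qopal chol cthol daimm",
    "ycheor chor dam qotcham cham",
]
table = ngram_counts(eva_lines)
print(table.most_common(5))
```

The keys of the resulting table are the EVA ngrams that later get paired with Latin units.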
It then reads in a very large Latin word list, to use as a validation dictionary.
It then prepares a set of Latin letters, nulls and scribal abbreviations, currently numbering around 60 items in total.
Then it randomly pairs each EVA ngram with one of the Latin letters, nulls or abbreviations, and using that pairing (called a chromosome in the jargon), applies it to all lines and words in the EVA, so as to produce new words in plaintext. Each plaintext word is checked for validity in the Latin dictionary, and scored. If the word is valid, it gets a high score. If the word is long, it gets a higher score. All the word scores are summed. If a consecutive sequence of valid Latin words appears, that causes the overall score of the chromosome to increase according to the length of the sequence. The idea here is to reward chromosomes that produce sequences of valid, long Latin words.
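To make the scoring idea concrete, here is a rough sketch under several assumptions that are not stated in the post: words are split into ngrams by greedy longest match, unmapped glyphs are simply dropped, and the weights valid_bonus and run_bonus are invented numbers. The toy pairing and three-word dictionary are also hypothetical and are not the chromosome reported below.

```python
def decode_word(word, pairing, max_n=3):
    """Greedy longest-match substitution (an assumed segmentation rule)."""
    out, i = [], 0
    while i < len(word):
        for n in range(min(max_n, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in pairing:
                out.append(pairing[chunk])  # a null is just the empty string
                i += n
                break
        else:
            i += 1  # no mapping for this glyph: skip it (assumption)
    return "".join(out)

def score(lines, pairing, dictionary, valid_bonus=10, run_bonus=5):
    """Sum word scores, plus a bonus that grows with each run of valid words."""
    total, run = 0, 0
    for line in lines:
        for word in line.split():
            plain = decode_word(word, pairing)
            if plain in dictionary:
                total += valid_bonus + len(plain)  # valid, and longer is better
                run += 1
                total += run_bonus * (run - 1)     # reward consecutive valid words
            else:
                run = 0                            # an invalid word ends the run
    return total

# Hypothetical toy data, for illustration only
toy_pairing = {"ch": "ra", "o": "t", "l": "is", "r": "um", "d": "c", "a": "u", "m": "m"}
toy_dictionary = {"ratis", "ratum", "cum"}
toy_score = score(["chol chor dam"], toy_pairing, toy_dictionary)
```

Note that the run counter is not reset at line ends, so a sequence of valid words can span several folio lines.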
This random process continues over many pairings/chromosomes and many generations, using selection between each generation to refine the pairings (I'll spare you the details!).
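Since the details are not given, the following is only a generic illustration of what such a generational loop can look like, not the method actually used here. It assumes truncation selection with elitism, uniform crossover and per-gene mutation; all function names, parameters and the toy fitness function are hypothetical.

```python
import random

def random_chromosome(ngrams, units, rng):
    """Pair every EVA ngram with a randomly chosen Latin unit."""
    return {g: rng.choice(units) for g in ngrams}

def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return {g: (a[g] if rng.random() < 0.5 else b[g]) for g in a}

def mutate(chrom, units, rng, rate=0.1):
    """Re-roll each gene with probability `rate`; returns a new dict."""
    return {g: (rng.choice(units) if rng.random() < rate else u)
            for g, u in chrom.items()}

def evolve(ngrams, units, fitness, pop_size=50, generations=100, keep=10, seed=0):
    rng = random.Random(seed)
    pop = [random_chromosome(ngrams, units, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:keep]  # the best chromosomes survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), units, rng)
            for _ in range(pop_size - keep)
        ]
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness for illustration: count genes that match a made-up target pairing
toy_ngrams = ["ch", "o", "l"]
toy_units = ["ra", "t", "is", "um", ""]
toy_target = {"ch": "ra", "o": "t", "l": "is"}

def toy_fitness(chrom):
    return sum(chrom[g] == toy_target[g] for g in toy_target)

best = evolve(toy_ngrams, toy_units, toy_fitness, generations=30)
```

In a real run the fitness function would be the dictionary-based score rather than a match against a known target.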
Here are details for one of the better results (with a score of over 22000):
A) The list of letters, nulls and abbreviations used is as follows:
'a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'x', 'y', 'z',
' ', ' ', ' ', (nulls)
'qu',
'ra', 're', 'ca', 'ci', 'co', 'us', 'os', 'is', 'ur', 'um', 'er', 'in', 'im', 'nt', 'nd',
'quo', 'cum', 'con', 'cun', 'cus', 'cre', 'car', 'cer', 'cri', 'cis',
'ent', 'est', 'rum', 'tis', 'tum', 'tur', 'ter', 'mum',
'ntum', 'quon', 'eius', 'etam'
B) The best chromosome, pairing VM glyphs with the Latin ngrams in A), includes the following:
[font=voynich] a = r[/font]
[font=voynich] 8 = t[/font]
[font=voynich] c = re[/font]
[font=voynich] h = ur[/font]
[font=voynich] o = er[/font]
[font=voynich] y = tum[/font]
[font=voynich] s = u[/font]
[font=voynich] k = [font=Arial]est[/font][/font]
[font=voynich] 9 = um[/font]
[font=voynich] 8a = c[/font]
[font=voynich] co = m[/font]
[font=voynich] ii = <null>[/font]
[font=voynich] 4o = in[/font]
(The remaining pairs are omitted for brevity.)
I found the 9 = um equivalence that the GA discovered to be striking (Brumbaugh claimed this equivalence in his solution), but I suppose it's sort of obvious.
B) The best pairing translates the following valid Latin words on f3r:
'ratis', 'carus', 'ratum', 'cum', 'inque', 'cercis', 'erratum', 'interim', 'da', 'carum', 'uterum', 'certis', 'ra', 'ius', 'caro', 'pratis', 'inda', 'is', 'pratum', 'us', 'istis', 'sus', 'sum', 'corda', 'iratum', 'irent', 'inest', 'iterum', 'tergum', 'istum', 'peius', 'creo', 'irem'
(All these words appear in the Latin word list I'm using.)
C) The longest sequence of valid words is 16 words long (spanning folio lines 2, 3, 4 and 5):
"ratum cum inque cercis erratum interim ratis da carum uterum ratum certis cercis cercis ra ius"
D) The translation, using this ngram pairing, of the first few lines of f3r:
tsheos qopal chol cthol daimm
estseru inquonrtis ratis carus dciscis
ycheor chor dam qotcham cham
umterim ratum cum inque cercis
ochor qocheor chol daiin cthy
erratum interim ratis da carum
schey chor chal cham cham cho
uterum ratum certis cercis cercis ra
qokol chololy s cham cthol
ius ratisusum u cercis carus
ychtaiin chor cthom otal dam
umturestcarum ratum caro prtis cum
otchol qodaiin chom shom damo
pratis inda racis iscis cumer
ysheor chor chol oky damo
umsim ratum ratis coum cumer
I expect the Latin above makes no sense at all, but I find the "look and feel" of the word lengths and the vocabulary size encouraging.
I'd welcome suggestions of Latin abbreviations, prefixes and suffixes that I could include in (or remove from) the list in A) above (which I gleaned mostly from d'Imperio's summary of Cappelli).
Julian