Interesting result for f3r with an ngram Genetic Algorithm

julian > 04-10-2016, 06:55 AM

I've recently repurposed my genetic algorithm code to use EVA rather than Voyn_101. The GA seems to do better with EVA, and I'd like to report an interesting result using Latin as a base language for f3r (a folio I picked at random).

The way this works is that the GA reads in the EVA transcription for the given folio(s), line by line and word by word, and as it does so it creates frequency tables of all the ngrams it finds. Right now it uses ngrams up to 3 glyphs long.
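A minimal sketch of the frequency-table step, assuming each character of the EVA transcription counts as one glyph and words are separated by whitespace. The function name `ngram_counts` and the sample line are illustrative, not taken from the actual GA code.

```python
from collections import Counter


def ngram_counts(lines, max_n=3):
    """Count every ngram of 1..max_n glyphs inside each word."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            for n in range(1, max_n + 1):
                # Slide a window of width n across the word.
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    table = ngram_counts(["ochor qocheor chol daiin cthy"])
    print(table.most_common(5))
```

The keys of the resulting table are the EVA ngrams that a chromosome would then have to pair with Latin items.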

It then reads in a very large Latin word list, to use as a validation dictionary.

It then prepares a set of Latin letters, nulls and scribal abbreviations, currently numbering around 60 items in total.

Then it randomly pairs each EVA ngram with one of the Latin letters, nulls or abbreviations, and using that pairing (called a chromosome in the jargon), applies it to all lines and words in the EVA, so as to produce new words in plaintext. Each plaintext word is checked for validity in the Latin dictionary, and scored. If the word is valid, it gets a high score. If the word is long, it gets a higher score. All the word scores are summed. If a consecutive sequence of valid Latin words appears, that causes the overall score of the chromosome to increase according to the length of the sequence. The idea here is to reward chromosomes that produce sequences of valid, long Latin words.
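The scoring rule described above could look roughly like this. The weights (`valid_bonus`, `run_weight`) and the exact way length and run bonuses combine are assumptions made for illustration; the post does not give the real formula.

```python
def score_translation(words, lexicon, valid_bonus=10, run_weight=5):
    """Return (score, longest_run) for a list of plaintext words.

    A valid word earns valid_bonus plus its length (longer is better).
    Each run of two or more consecutive valid words earns an extra
    run_weight * run_length. All weights are hypothetical.
    """
    total = 0
    run = 0
    longest = 0
    for word in list(words) + [None]:  # None is a sentinel that closes the last run
        if word is not None and word in lexicon:
            total += valid_bonus + len(word)
            run += 1
            longest = max(longest, run)
        else:
            if run >= 2:
                total += run_weight * run
            run = 0
    return total, longest


if __name__ == "__main__":
    lexicon = {"ratum", "cum", "inque", "cercis"}
    line = "umterim ratum cum inque cercis".split()
    print(score_translation(line, lexicon))
```

With these toy weights the four valid words score 15 + 13 + 15 + 16 = 59, and the run of four adds 20.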

This random process continues over many pairings/chromosomes and many generations, using selection between each generation to refine the pairings (I'll spare you the details!).

Here are details for one of the better results (with a score of over 22000):

A) The list of letters, nulls and abbreviations used is as follows:

'a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'x', 'y', 'z',
' ', ' ', ' ', (nulls)
'qu',
'ra', 're', 'ca', 'ci', 'co', 'us', 'os', 'is', 'ur', 'um', 'er', 'in', 'im', 'nt', 'nd'
'quo', 'cum', 'con', 'cun', 'cus', 'cre', 'car', 'cer', 'cri', 'cis',
'ent', 'est', 'rum', 'tis', 'tum', 'tur', 'ter', 'mum',
'ntum', 'quon', 'eius', 'etam'

B) The best chromosome of VM glyph pairing to the Latin ngrams in A) includes the following:

a = r
8 = t
c = re
h = ur
o = er
y = tum
s = u
k = est
9 = um
8a = c
co = m
ii = <null>
4o = in

(The remaining pairs are omitted for brevity.)

I found the 9 = um equivalence that the GA discovered to be striking (Brumbaugh claimed this equivalence in his solution), but I suppose it's sort of obvious.

C) The best pairing translates the following valid Latin words on f3r:

'ratis', 'carus', 'ratum', 'cum', 'inque', 'cercis', 'erratum', 'interim', 'da', 'carum', 'uterum', 'certis', 'ra', 'ius', 'caro', 'pratis', 'inda', 'is', 'pratum', 'us', 'istis', 'sus', 'sum', 'corda', 'iratum', 'irent', 'inest', 'iterum', 'tergum', 'istum', 'peius', 'creo', 'irem'

(All these words appear in the Latin word list I'm using.)

D) The longest sequence of valid words is 16 (spanning folio lines 2, 3, 4 and 5):

"ratum cum inque cercis erratum interim ratis da carum uterum ratum certis cercis cercis ra ius"

E) The translation, using this ngram pairing, of the first few lines of f3r:

tsheos qopal chol cthol daimm
estseru inquonrtis ratis carus dciscis

ycheor chor dam qotcham cham
umterim ratum cum inque cercis

ochor qocheor chol daiin cthy
erratum interim ratis da carum

schey chor chal cham cham cho
uterum ratum certis cercis cercis ra

qokol chololy s cham cthol
ius ratisusum u cercis carus

ychtaiin chor cthom otal dam
umturestcarum ratum caro prtis cum

otchol qodaiin chom shom damo
pratis inda racis iscis cumer

ysheor chor chol oky damo
umsim ratum ratis coum cumer

I expect the Latin above makes no sense at all, but the "look and feel" of the word lengths and the vocabulary size I find encouraging.

I'd welcome suggestions of Latin abbreviations, prefixes and suffixes that I could include in (or remove from) the list in A) above (which I gleaned mostly from d'Imperio's summary of Cappelli).

Julian
RE: Interesting result for f3r with an ngram Genetic Algorithm

davidjackson > 04-10-2016, 08:24 AM

An interesting concept - translation by brute force, eh?
My concern is that there doesn't seem to be consistency in your translation. For example, in the first three lines above the suffix "os" appears as -ru, -im and -tum, and "he" appears as -se-, -ter- and -erra-.

Quote: This random process continues over many pairings/chromosomes and many generations, using selection between each generation to refine the pairings (I'll spare you the details!).
That's fine, but only if it builds on the previous generation to produce a coherent translation across all words. It should be discarding unmatched chromosomes and keeping the favourable ones that link together, but it seems to be doing this only on a word-by-word basis - not across the whole transcription.

Quote: [3] erratum interim ratis da carum

Meanwhile, the error was that the raft was expensive?
RE: Interesting result for f3r with an ngram Genetic Algorithm

-JKP- > 04-10-2016, 08:32 AM

I'm always interested in genetic algorithms.

Unfortunately, I don't have time to comment overall right now (my workday hasn't ended yet), but wanted to point out that the Latin 9 abbreviation (EVA-y) that is usually -um or -us at the end of a word means com- or con- when it's at the beginning of a word.

Also, the EVA-j shape is actually three different abbreviations in Latin... -cis, -ris, and -tis. The first part of the shape (straight or curved) determines which one it is in Latin. The -cis is usually pretty clear, the -ris and -tis are sometimes less clear, depending on the scribe, and sometimes distinguished by context.
RE: Interesting result for f3r with an ngram Genetic Algorithm

MarcoP > 04-10-2016, 10:44 AM

(04-10-2016, 06:55 AM)julian Wrote: I'd welcome suggestions of Latin abbreviations, prefixes and suffixes that I could include in (or remove from) the list in A) above (which I gleaned mostly from d'Imperio's summary of Cappelli).

Hello Julian, have you considered having a look at the full Cappelli book?
RE: Interesting result for f3r with an ngram Genetic Algorithm

ThomasCoon > 04-10-2016, 02:33 PM

This is wonderful work, Julian! Much praise!

I just have one question: when your program checks a Latin dictionary for word validity, does the program also account for the fact that nouns may be in different grammatical cases? (Ex: in a dictionary only the form femina might be listed for "woman", but that noun can also appear as feminae, feminam, feminis, feminas, feminarum in a Latin text). It looks like you have accounted for that, but I just wanted to ask.
RE: Interesting result for f3r with an ngram Genetic Algorithm

julian > 04-10-2016, 05:53 PM

(04-10-2016, 08:24 AM)davidjackson Wrote: An interesting concept - translation by brute force, eh?
My concern is that there doesn't seem to be consistency in your translation. For example, in the first three lines above the suffix "os" appears as -ru, -im and -tum, and "he" appears as -se-, -ter- and -erra-.

Quote: This random process continues over many pairings/chromosomes and many generations, using selection between each generation to refine the pairings (I'll spare you the details!).
That's fine, but only if it builds on the previous generation to produce a coherent translation across all words. It should be discarding unmatched chromosomes and keeping the favourable ones that link together, but it seems to be doing this only on a word-by-word basis - not across the whole transcription.

Quote: [3] erratum interim ratis da carum

Meanwhile, the error was that the raft was expensive?

To follow the translation, you really need to see the full mapping of EVA ngrams to Latin, of which I only posted a selection. The translation of os depends on which other glyphs abut the o and the s, and on whether those sequences are in the pairing table. Here is the full list:

Substitution {'eo': 'm', 'ch': 'tur', 'ai': 'mum', 'am': 'e', 'ii': ' ', 'ey': 'nt', 'tch': 'qu', 'in': 'a', 'eol': 'eius', 'qok': 'i', 'ct': 'ent', 'cha': 'cer', 'iin': 'con', 'ham': 'h', 'p': 'quon', 'c': 'z', 'hol': 'q', 'hom': 'b', 'cho': 'ra', 'r': 'tum', 'hor': ' ', 'th': 'l', 'heo': 'cri', 'tc': 'ci', 't': 'est', 'dam': 'cum', 'dai': 'd', 'da': 'c', 'che': 'ter', 'oke': 'cre', 'ho': 'cus', 'ok': 'co', 'ha': 'ndquo', 'he': 'os', 'a': 'r', 'om': 'o', 'ol': 'us', 'e': 're', 'd': 't', 'ke': 'cun', 'i': 'y', 'h': 'n', 'k': 'ur', 'od': 'g', 'm': 'cis', 'l': 'tis', 'o': 'er', 'n': 'rum', 'q': 'ntum', 'eor': 'x', 's': 'u', 'sh': ' ', 'aii': 'ca', 'she': 's', 'y': 'um', 'ot': 'p', 'cth': 'car', 'or': 'im', 'qo': 'in', 'sho': 'is'}

(This is a dict with each key being the EVA glyph(s) and each value being the Latin translation.)
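The posted lines can be reproduced from this dict if one assumes a greedy, left-to-right lookup that always tries the longest ngram first (3 glyphs, then 2, then 1). That matching rule is an assumption; the post does not say how the GA tokenises a word. The names `decode_word` and `decode_line` are illustrative, and passing unmapped glyphs through unchanged is also an assumption.

```python
# Mapping copied from the post above (EVA ngram -> Latin ngram).
# A value of ' ' is a null.
SUBSTITUTION = {
    'eo': 'm', 'ch': 'tur', 'ai': 'mum', 'am': 'e', 'ii': ' ', 'ey': 'nt',
    'tch': 'qu', 'in': 'a', 'eol': 'eius', 'qok': 'i', 'ct': 'ent',
    'cha': 'cer', 'iin': 'con', 'ham': 'h', 'p': 'quon', 'c': 'z',
    'hol': 'q', 'hom': 'b', 'cho': 'ra', 'r': 'tum', 'hor': ' ', 'th': 'l',
    'heo': 'cri', 'tc': 'ci', 't': 'est', 'dam': 'cum', 'dai': 'd',
    'da': 'c', 'che': 'ter', 'oke': 'cre', 'ho': 'cus', 'ok': 'co',
    'ha': 'ndquo', 'he': 'os', 'a': 'r', 'om': 'o', 'ol': 'us', 'e': 're',
    'd': 't', 'ke': 'cun', 'i': 'y', 'h': 'n', 'k': 'ur', 'od': 'g',
    'm': 'cis', 'l': 'tis', 'o': 'er', 'n': 'rum', 'q': 'ntum', 'eor': 'x',
    's': 'u', 'sh': ' ', 'aii': 'ca', 'she': 's', 'y': 'um', 'ot': 'p',
    'cth': 'car', 'or': 'im', 'qo': 'in', 'sho': 'is',
}


def decode_word(eva_word, table=SUBSTITUTION, max_n=3):
    """Greedy longest-match decoding of one EVA word (assumed rule)."""
    out = []
    i = 0
    while i < len(eva_word):
        for n in range(min(max_n, len(eva_word) - i), 0, -1):
            chunk = eva_word[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No ngram starts here: keep the glyph as-is (assumption).
            out.append(eva_word[i])
            i += 1
    return "".join(out)


def decode_line(eva_line, table=SUBSTITUTION):
    return " ".join(decode_word(w, table) for w in eva_line.split())


if __name__ == "__main__":
    print(decode_line("tsheos qopal chol cthol daimm"))
    # prints: estseru inquonrtis ratis carus dciscis
```

Under this rule "os" in tsheos is read as o + s (er + u) because no longer ngram covers it, whereas in ycheor the "or" pair is matched as a unit, which is why the same glyphs come out differently from word to word.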

Regarding the generational learning: at each epoch the chromosomes are ordered by score, and the top half retained. These are mixed/mated/mutated to generate a new half for the next generation. So, the best chromosomes are always kept between epochs. The scoring is done for the whole of the EVA input provided, so each chromosome's score is for all words on f3r in this case.
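One epoch of this scheme might be sketched as follows. A chromosome is modelled here as a dict from EVA ngram to Latin item. Uniform crossover, the per-gene `mutation_rate` and the function name `next_generation` are assumptions for illustration; the post only says the top half is kept and then mixed/mated/mutated.

```python
import random


def next_generation(population, fitness, targets, rng, mutation_rate=0.05):
    """Keep the top half by fitness, refill the rest with mutated offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    children = []
    while len(survivors) + len(children) < len(population):
        mum, dad = rng.sample(survivors, 2)
        # Uniform crossover: each gene comes from one parent at random.
        child = {k: (mum[k] if rng.random() < 0.5 else dad[k]) for k in mum}
        # Mutation: occasionally re-pair an ngram with a random Latin item.
        for k in child:
            if rng.random() < mutation_rate:
                child[k] = rng.choice(targets)
        children.append(child)
    return survivors + children


if __name__ == "__main__":
    rng = random.Random(0)
    targets = ["a", "um", "tis", " "]
    ngrams = ["o", "y", "ch"]
    population = [{g: rng.choice(targets) for g in ngrams} for _ in range(8)]

    def toy_fitness(chromosome):
        # Toy stand-in for the real score over the whole folio.
        return sum(v == "um" for v in chromosome.values())

    for _ in range(20):
        population = next_generation(population, toy_fitness, targets, rng)
    print(toy_fitness(population[0]))
```

Because the survivors are copied across unchanged, the best score in the population never decreases from one epoch to the next.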

So, this isn't really brute force, it's more of an iterative refinement based on lots of initial random guesses. The number of possible combinations is of course huge, so there need to be many chromosomes in the population, and many learning epochs (I'm currently using 2000 chromosomes per run, each of length around 60.)

Thanks for the feedback!

(04-10-2016, 02:33 PM)ThomasCoon Wrote: This is wonderful work, Julian! Much praise!

I just have one question: when your program checks a Latin dictionary for word validity, does the program also account for the fact that nouns may be in different grammatical cases? (Ex: in a dictionary only the form femina might be listed for "woman", but that noun can also appear as feminae, feminam, feminis, feminas, feminarum in a Latin text). It looks like you have accounted for that, but I just wanted to ask.

Hi Thomas,

No, it's not that sophisticated. It simply uses a big Latin wordlist to check the validity of each word.

If there was some way of computing whether the translated Latin makes sense grammatically, that would be extremely useful: the chromosome's score could be boosted if it produced Latin that made sense.

(04-10-2016, 08:32 AM)-JKP- Wrote:

I'm always interested in genetic algorithms.

Unfortunately, I don't have time to comment overall right now (my workday hasn't ended yet), but wanted to point out that the Latin 9 abbreviation (EVA-y) that is usually -um or -us at the end of a word means com- or con- when it's at the beginning of a word.

Also, the EVA-j shape is actually three different abbreviations in Latin... -cis, -ris, and -tis. The first part of the shape (straight or curved) determines which one it is in Latin. The -cis is usually pretty clear, the -ris and -tis are sometimes less clear, depending on the scribe, and sometimes distinguished by context.

The way the algorithm is set up allows only one mapping from a single EVA glyph to a Latin ngram. I think it might be possible to have two entries for EVA-y in the chromosome: one that maps to um and one that maps to com, for example. I'll look into it.
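One way the "two entries" idea might be tried is to give the decoder a second, smaller table that is consulted only at the start of a word. This is a sketch under that assumption; `decode_word_positional` and `initial_table` are hypothetical names, and the toy tables below are not the GA's real chromosome.

```python
def decode_word_positional(word, table, initial_table, max_n=3):
    """Greedy longest-match decoding with word-initial overrides."""
    out = []
    i = 0
    while i < len(word):
        # At the first glyph, the word-initial table takes priority.
        tables = (initial_table, table) if i == 0 else (table,)
        for n in range(min(max_n, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            hit = next((t[chunk] for t in tables if chunk in t), None)
            if hit is not None:
                out.append(hit)
                i += n
                break
        else:
            out.append(word[i])  # unmapped glyph kept as-is (assumption)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    table = {"y": "um", "che": "ter", "or": "im", "cth": "car"}
    initial_table = {"y": "com"}
    print(decode_word_positional("ycheor", table, initial_table))  # comterim
    print(decode_word_positional("cthy", table, initial_table))    # carum
```

In GA terms this would add one extra gene per position-sensitive glyph, so the chromosome length grows only slightly.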

Regarding EVA-j - I can't easily handle it if that is in fact more than one glyph. Previously I used Glenn Caston's transcription, which was far too verbose (it went in the other direction), and I think he saw multiple glyph shapes where there was in fact only one, but written differently by the scribe(s) according to how much mead they'd been drinking.
RE: Interesting result for f3r with an ngram Genetic Algorithm

R. Sale > 04-10-2016, 08:27 PM

Among other inconsistencies, 'ratis' and 'ratum' originate from multiple VMs words.

Is it possible to shift to triglyphs alone? And pump up the Latin vocabulary? Or to add quadglyphs like Latin '-orum'? Seems to me that the more complex elements that are frequently found in Latin are the ones to be searching for, while the monoglyphs and diglyphs create noise that obscures those potential patterns.
RE: Interesting result for f3r with an ngram Genetic Algorithm

Anton > 04-10-2016, 08:41 PM

Quote:If there was some way of computing whether the translated Latin makes sense grammatically, that would be extremely useful: the chromosome's score could be boosted if it produced Latin that made sense.

There's [link not shown] there, maybe you could accommodate it somehow.
RE: Interesting result for f3r with an ngram Genetic Algorithm

julian > 04-10-2016, 09:20 PM

(04-10-2016, 08:27 PM)R. Sale Wrote: Among other inconsistencies, 'ratis' and 'ratum' originate from multiple VMs words.

Is it possible to shift to triglyphs alone? And pump up the Latin vocabulary? Or to add quadglyphs like Latin '-orum'? Seems to me that the more complex elements that are frequently found in Latin are the ones to be searching for, while the monoglyphs and diglyphs create noise that obscures those potential patterns.

My Latin wordlist contains 14,000 words ... I'd love to get hold of more, but it's hard finding decent resources online: I've already plundered the obvious ones.

I made sure to include in the wordlist as many Latin herb/plant names and star names as I could find.

I don't think using just triglyphs would work at all.

There are a few quadglyphs in the list already: ntum, quon, eius and etam. I'll add orum - any other suggestions?
RE: Interesting result for f3r with an ngram Genetic Algorithm

R. Sale > 05-10-2016, 12:19 AM

So, here's my method - fast and dirty. I took a Latin dictionary, and made a note where I happened to see a whole column of words starting with the same four letters.

Results: circ, coll, comm, comp, conc, conf, cong, cons, cont, conv, disp, diss, ibus, inve, perp, pers, pert, prae, proc, prop, pros, quad, quat, quin, semi

A few with five letters: inter, super, trans

Some additional, more common, three letter possibilities: acc, des, dis, exc, exp, exs, ill, inc, ins, per, pro, rec, rep, res, sub, tri