The Voynich Ninja

Full Version: Interesting result for f3r with an ngram Genetic Algorithm
Pages: 1 2 3
(05-10-2016, 12:19 AM)R. Sale Wrote: So, here's my method - fast and dirty. I took a Latin dictionary, and made a note where I happened to see a whole column of words starting with the same four letters.

Results: circ, coll, comm, comp, conc, conf, cong, cons, cont, conv, disp, diss, ibus, inve, perp, pers, pert, prae, proc, prop, pros, quad, quat, quin, semi

A few with five letters: inter, super, trans

Some additional, more common, three letter possibilities: acc, des, dis, exc, exp, exs, ill, inc, ins, per, pro, rec, rep, res, sub, tri

Thanks. That method is quite easy to program, and it was indeed the technique I used in the previous versions of the GA: take the plaintext corpus, derive frequency counts for all the ngrams in it, and then use the most frequent of them in the pairings/chromosomes. The trouble is that those frequent ngrams may not include the ones medieval scribes were accustomed to using, which is why I switched to lists such as Cappelli's.
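For anyone who wants to try the frequency-count approach described above, here is a minimal sketch in Python. The function names, the 2-to-4 letter range and the tiny sample word list are assumptions for illustration only; this is not the code used in the GA.

```python
from collections import Counter


def ngram_counts(words, n_min=2, n_max=4):
    """Count every letter n-gram (n_min..n_max letters) across a word list."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for n in range(n_min, n_max + 1):
            # Slide a window of width n over the word.
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    return counts


def top_ngrams(words, k=10, n_min=2, n_max=4):
    """Return the k most frequent n-grams, most frequent first."""
    return [g for g, _ in ngram_counts(words, n_min, n_max).most_common(k)]


# Hypothetical sample; a real run would load a full Latin corpus instead.
sample = "contra consul confer super superbus".split()
counts = ngram_counts(sample)
print(counts["con"], counts["supe"])
```

As noted in the post, the most frequent n-grams of a plaintext corpus are not necessarily the ones scribes abbreviated, so a list like this is only a starting point.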
(04-10-2016, 09:20 PM)julian Wrote: My Latin wordlist contains 14,000 words ... I'd love to get hold of more, but it's hard finding decent resources online: I've already plundered the obvious ones.
William Whitaker's wordlist is a good resource: more than a million words.

Curiously it lacks some words that are found in all dictionaries. No idea why.

I am looking for medieval Latin texts (correctly OCR'd, otherwise they are useless) to extract non-classical vocabulary from them.
(30-11-2017, 04:13 PM)nablator Wrote: Curiously it lacks some words that are found in all dictionaries. No idea why.

Which ones for example?
I believe some words formed by prefixation and suffixation may be missing (otherwise the list would be much bigger).
Also Whitaker has assigned a code to each word of his dictionary to distinguish time period, which is pretty useful.
There is also (among others) a medieval word list by Lynn Nelson; I didn't explore it thoroughly.
(30-11-2017, 04:44 PM)farmerjohn Wrote:
(30-11-2017, 04:13 PM)nablator Wrote: Curiously it lacks some words that are found in all dictionaries. No idea why.

Which ones for example?

eclipsis is missing
meridionalis is missing
inrationabilis is present but rationabilis is missing... that's unreasonable. :)

Also, some verb tenses (and supines) may be missing, as well as some comparatives (for example: superiores).
One of the biggest problems with expanding certain glyphs based on Latin abbreviation conventions is that every expansion is a valid syllable by definition (the algorithm is programmed with a list of valid expansions), and adding a valid expansion (like que, -um or con-) to any short syllable (and even some medium ones) has a high likelihood of producing a valid Latin word.
I am looking for a way to reward small improvements for the GA to chew on without selecting too much fake Latin. The idea is not to make a lorem ipsum generator. So I was thinking about something like this:
- count actual Latin words as correct with score = word length
- count common Latin bigrams and trigrams (a selection of 20-50 for example) as correct with score = a small percentage of their total length
Another scoring mechanism I used was to (considerably) boost the score when the GA found a sequence of two or more Latin words that matched a phrase found in Latin texts ...
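To make the scoring ideas above concrete, here is a rough Python sketch that combines them: full credit for dictionary words, a small fraction of the length for common n-grams, and a boost for word pairs that match a known phrase. The function name, the weights (`ngram_weight`, `phrase_bonus`) and the toy word lists are all assumptions for illustration, not the values or code used in the GA.

```python
def score_candidate(words, lexicon, ngrams, phrases,
                    ngram_weight=0.1, phrase_bonus=2.0):
    """Score a candidate decoding (list of words); higher is better."""
    score = 0.0
    for w in words:
        if w in lexicon:
            # Actual Latin word: score = word length.
            score += len(w)
        else:
            # Partial credit: a small percentage of the letters covered
            # by common n-grams, capped at the word length.
            covered = sum(len(g) for g in ngrams if g in w)
            score += ngram_weight * min(covered, len(w))
    # Boost adjacent word pairs that match a phrase seen in Latin texts.
    for first, second in zip(words, words[1:]):
        if (first, second) in phrases:
            score += phrase_bonus * (len(first) + len(second))
    return score


# Hypothetical toy data, for illustration only.
lexicon = {"in", "principio", "erat"}
ngrams = {"qu", "um", "con"}
phrases = {("in", "principio")}

print(score_candidate(["in", "principio", "conum"], lexicon, ngrams, phrases))
```

Keeping `ngram_weight` small is what stops the GA from preferring strings of plausible-looking fragments over real words; the right value would have to be tuned by experiment.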
(30-11-2017, 06:12 PM)nablator Wrote: I am looking for a way to reward small improvements for the GA to chew on without selecting too much fake Latin. The idea is not to make a lorem ipsum generator. So I was thinking about something like this:
- count actual Latin words as correct with score = word length
- count common Latin bigrams and trigrams (a selection of 20-50 for example) as correct with score = a small percentage of their total length

The statistics of Voynichese make clear that it cannot be a simple substitution code for Latin. Independently of the specific language, I think it could be a good idea to consider frequency as well:
autem and super should have higher scores than seror and loris.
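One possible way to fold frequency into the score, sketched in Python: multiply the word length by the logarithm of the word's corpus count, so that very common words score higher than rare ones without swamping everything else. The log weighting and the counts below are assumptions made up for illustration; real counts would come from a Latin corpus.

```python
import math


def frequency_weighted_score(words, freq):
    """Sum of word length times log(1 + corpus count); unknown words score 0."""
    return sum(len(w) * math.log1p(freq.get(w, 0)) for w in words)


# Placeholder counts, invented for this example (not measured from any corpus).
freq = {"autem": 500, "super": 400, "seror": 1, "loris": 2}

common = frequency_weighted_score(["autem", "super"], freq)
rare = frequency_weighted_score(["seror", "loris"], freq)
print(common > rare)
```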
(04-10-2016, 10:44 AM)MarcoP Wrote:
(04-10-2016, 06:55 AM)julian Wrote: I'd welcome suggestions of Latin abbreviations, prefixes and suffixes that I could include in (or remove from) the list in A) above (which I gleaned mostly from d'Imperio's summary of Cappelli).

Hello Julian, have you considered having a look at the full Cappelli book?

I can only agree with Marco
(30-11-2017, 08:07 PM)MarcoP Wrote:
(30-11-2017, 06:12 PM)nablator Wrote: I am looking for a way to reward small improvements for the GA to chew on without selecting too much fake Latin. The idea is not to make a lorem ipsum generator. So I was thinking about something like this:
- count actual Latin words as correct with score = word length
- count common Latin bigrams and trigrams (a selection of 20-50 for example) as correct with score = a small percentage of their total length

The statistics of Voynichese make clear that it cannot be a simple substitution code for Latin. Independently of the specific language, I think it could be a good idea to consider frequency as well:
autem and super should have higher scores than seror and loris.

I think the statistics will look very different for an n-gram substitution scheme with nulls, as opposed to a simple substitution scheme which is clearly discounted.
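To show what I mean by an n-gram substitution scheme with nulls, here is a toy sketch in Python. Each symbol either stands for a short letter group or is a null that is dropped. The symbols, the table and the function name are invented purely for illustration and are not a proposed reading of any Voynich glyph.

```python
def decode(symbols, table, nulls):
    """Expand each symbol to its letter group, skipping nulls."""
    out = []
    for s in symbols:
        if s in nulls:
            continue  # nulls carry no plaintext
        out.append(table.get(s, "?"))  # "?" marks an unmapped symbol
    return "".join(out)


# Invented mapping: one symbol may yield one, two or three letters.
table = {"A": "con", "B": "t", "C": "ra"}
nulls = {"X"}

print(decode(list("AXBXC"), table, nulls))
```

Because one symbol can expand to several letters and some symbols vanish altogether, symbol counts and word lengths in the ciphertext need not resemble those of the plaintext, which is why the statistics differ from the simple-substitution case.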