-JKP- > 30-11-2017, 09:11 PM
Quote:Julian:
I think the statistics will look very different for an n-gram substitution scheme with nulls, as opposed to a simple substitution scheme which is clearly discounted.
farmerjohn > 30-11-2017, 09:34 PM
(30-11-2017, 04:13 PM)nablator Wrote: Curiously it lacks some words that are found in all dictionaries. No idea why.
(30-11-2017, 04:44 PM)farmerjohn Wrote: Which ones for example?
(30-11-2017, 05:38 PM)nablator Wrote:
eclipsis is missing
meridionalis is missing
inrationabilis is present but rationabilis is missing... that's unreasonable.
Also, some verb tenses (and the supine) may be missing, as well as some comparatives (for example: superiores).
MarcoP > 01-12-2017, 01:55 PM
(30-11-2017, 06:12 PM)nablator Wrote: I am looking for a way to reward small improvements for the GA to chew on without selecting too much fake Latin. The idea is not to make a lorem ipsum generator. So I was thinking about something like this:
- count actual Latin words as correct with score = word length
- count common Latin bigrams and trigrams (a selection of 20-50 for example) as correct with score = a small percentage of their total length
(30-11-2017, 08:07 PM)MarcoP Wrote: The statistics of Voynichese make clear that it cannot be a simple substitution code for Latin. Regardless of the specific language, I think it could be a good idea to consider frequency as well:
autem and super should have higher scores than seror and loris.
(30-11-2017, 09:06 PM)julian Wrote: I think the statistics will look very different for an n-gram substitution scheme with nulls, as opposed to a simple substitution scheme which is clearly discounted.
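The scoring idea quoted above (word length for real Latin words, a small fraction of the length for common bigrams and trigrams, and higher scores for frequent words) could be sketched roughly as follows. This is only an illustration, not nablator's actual GA code: the function name `fitness`, the `ngram_weight` value, and the per-word frequency weights are all assumptions made up for the example.

```python
def fitness(candidate_words, lexicon_weights, common_ngrams, ngram_weight=0.1):
    """Score a candidate decipherment given as a list of words.

    lexicon_weights: dict mapping a known Latin word to a frequency weight
                     (hypothetical values; frequent words get larger weights).
    common_ngrams:   set of frequent Latin bigrams and trigrams.
    ngram_weight:    assumed "small percentage" credited per matched n-gram letter.
    """
    score = 0.0
    for word in candidate_words:
        if word in lexicon_weights:
            # A real word scores its length, scaled by its frequency weight.
            score += len(word) * lexicon_weights[word]
        else:
            # Otherwise give partial credit for each common bigram/trigram found.
            for n in (2, 3):
                for i in range(len(word) - n + 1):
                    if word[i:i + n] in common_ngrams:
                        score += ngram_weight * n
    return score


# Toy inputs (weights are invented for illustration, not measured frequencies).
lexicon = {"autem": 2.0, "super": 2.0, "seror": 1.0, "loris": 1.0}
ngrams = {"qu", "um", "us", "que"}
print(fitness(["autem", "quxum"], lexicon, ngrams))
```

Keeping the n-gram credit small relative to the full-word score is what stops the GA from drifting towards lorem-ipsum-like fake Latin; the right ratio would have to be tuned by experiment.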
MarcoP > 01-12-2017, 05:57 PM
julian > 01-12-2017, 10:27 PM
(01-12-2017, 05:57 PM)MarcoP Wrote: The two attached histograms are based on the length of word occurrences (not word "types" as in Rene's graph discussed above).
Latin has a spike for words 2 characters long. These correspond to a number of hugely frequent particles like prepositions (in, ex, ab, de...), conjunctions (et, ac...), pronouns (id, tu, me...).
Latin also has a long tail of long words, which of course become less frequent as length increases. In the Vulgate, 9% of the words are 10 or more characters long.
Voynichese has a simpler distribution, with a single peak corresponding to length 5 (corresponding to the central, but secondary, peak for Latin). Long words are rarer: less than 1% are 10 or more characters long.
The mapping from Voynichese to Latin should stretch the histogram in two opposite directions, possibly by making short words shorter and long words longer.
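For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of a word-length histogram counted over occurrences (tokens) rather than types, plus the share of long words. The function names and the tiny sample are assumptions for illustration; the 9% and 1% figures above come from the full texts, not from this snippet.

```python
from collections import Counter


def length_histogram(tokens):
    """Relative frequency of each word length, counted over occurrences."""
    counts = Counter(len(t) for t in tokens)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


def share_at_least(tokens, min_len=10):
    """Fraction of occurrences that are min_len or more characters long."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) >= min_len) / len(tokens)


# Toy sample; a real run would use a whole tokenised text.
sample = "et dixit deus fiat lux et facta est lux".split()
by_tokens = length_histogram(sample)       # every occurrence counts
by_types = length_histogram(set(sample))   # each distinct word counts once
print(by_tokens)
print(by_types)
```

Passing `set(sample)` instead of the full token list gives the "types" version, which is why the two kinds of graph can look quite different for the same text.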
MarcoP > 02-12-2017, 11:04 AM
(01-12-2017, 10:27 PM)julian Wrote:
Thanks, Marco.
Based on comparisons of Voynich word frequencies in Herbal A and Herbal B, the following pops out:
- EVA-d is a null. Or optional.
This substitution places EVA-aiin at the top of the Herbal word frequency list for language A and B.
It looks like the Language A scribe always wrote EVA-daiin whereas the Language B scribe sometimes wrote EVA-daiin and sometimes EVA-aiin.
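The effect of treating EVA-d as a null on a frequency list can be checked with a few lines. This is a sketch under the assumption that "null" means deleting every EVA-d from every word; the helper names and the toy word list are invented and are not real Herbal A or B counts.

```python
from collections import Counter


def strip_null(tokens, null="d"):
    """Delete every occurrence of the assumed null glyph from each word."""
    stripped = (t.replace(null, "") for t in tokens)
    # Drop words that consisted only of the null glyph.
    return [t for t in stripped if t]


def top_words(tokens, n=3):
    """Most frequent words with their counts."""
    return Counter(tokens).most_common(n)


# Toy transcription, for illustration only.
toy = ["daiin", "aiin", "chol", "daiin", "chol", "dar"]
print(top_words(toy))
print(top_words(strip_null(toy)))
```

On real data the two lists would be built separately for each language (A and B) from a full EVA transcription, so the comparison between scribes can be made directly.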
julian > 02-12-2017, 08:01 PM
(02-12-2017, 11:04 AM)MarcoP Wrote:
Hi Julian,
as I wrote, if you treat 'd' (or any other character) as a null, you get shorter words. I don't think this makes the Voynichese histogram closer to the bimodal distribution of Latin word lengths. It certainly makes the matching of the frequent long Latin words even harder.
About daiin / aiin, a few months ago I analysed some aspects of Voynich reduplication (the repeated occurrence of the same words, like daiin.daiin). These graphs compare actual reduplicated occurrences of words with what would be statistically expected on the basis of word frequency (separately for Currier A and B).
It turns out that a major difference between aiin and daiin is that daiin reduplicates less than expected but consistently (in both A and B). On the other hand, aiin never does: it is the most frequent word exhibiting this peculiar behaviour. The examples of reduplication I have seen in other languages are semantic (they often imply the intensification of a concept, e.g. "very very good"). The fact that the two words have different reduplication patterns suggests that they might have different meanings and possibly belong to different word classes (random example: one is an adjective, the other a conjunction).
[I am very grateful to Emma May Smith for discussing the aiin / daiin stats with me: my opinions on the subject are largely dependent on those exchanges]
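One simple way to compare observed and expected reduplication is sketched below. It assumes that words are drawn independently, so a word with relative frequency p is expected to follow itself about (N - 1) * p^2 times in a text of N words. That baseline is an assumption for this example and may differ from how the graphs mentioned above were computed; the function name and toy sequence are also invented, and line or paragraph breaks are ignored.

```python
from collections import Counter


def reduplication_stats(tokens):
    """Return {word: (observed, expected)} counts of immediate repetitions.

    expected uses an independence baseline: (N - 1) * p_word ** 2.
    """
    n = len(tokens)
    freq = Counter(tokens)
    # Count adjacent pairs where the same word occurs twice in a row.
    observed = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)
    pairs = max(n - 1, 0)
    return {w: (observed[w], pairs * (c / n) ** 2) for w, c in freq.items()}


# Toy word sequence, not real Voynich data.
toy = ["daiin", "daiin", "aiin", "chol", "aiin", "daiin", "chol", "aiin"]
for word, (obs, exp) in reduplication_stats(toy).items():
    print(word, obs, round(exp, 3))
```

Running this separately on Currier A and Currier B word sequences would give the per-language comparison described above.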
MarcoP > 02-12-2017, 08:42 PM
(02-12-2017, 08:01 PM)julian Wrote: That's very interesting, Marco. Of course, if EVA-d is a null, then EVA-aiin would reduplicate more often.
ReneZ > 04-12-2017, 02:20 PM