Anton, should we perhaps make a separate thread about the search for "and"? The advantage of focusing on one function word is that once we have a few candidates, they can be accurately compared to the distribution in a corpus of any language we want.
(30-03-2018, 05:23 PM)Anton Wrote: But ol is something highly sequentially repetitive, ol ol is not rare, and there is one instance of ol ol ol.
Almost 50 occurrences of olol and if you include the ones that cross spaces, there are almost 150.
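For anyone who wants to reproduce counts like these, here is a minimal sketch. It assumes a plain-text transliteration with one manuscript line per text line and spaces between vords; the sample string is made up for illustration and is not real transcription data. Overlapping matches are counted (so ololol counts twice), which is my assumption, not necessarily how the figures above were obtained.

```python
import re

def count_sequence(text, seq):
    """Count occurrences of seq (overlaps allowed).

    Returns (within_vords, ignoring_spaces):
      within_vords    - matches that lie entirely inside one space-delimited vord
      ignoring_spaces - matches found after removing spaces within each line,
                        so this total also includes matches that cross spaces
    """
    # A lookahead lets the regex engine report overlapping matches.
    pattern = re.compile("(?=" + re.escape(seq) + ")")
    within = sum(len(pattern.findall(vord)) for vord in text.split())
    ignoring = 0
    for line in text.splitlines():
        ignoring += len(pattern.findall("".join(line.split())))
    return within, ignoring

# Hypothetical sample, NOT taken from any real transcription file.
sample = "ol olol qokol ol\nolol ol"
within, ignoring = count_sequence(sample, "olol")
print(within, ignoring)
```

The difference between the two numbers is the count of matches that only exist if spaces are ignored.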
(30-03-2018, 06:05 PM)-JKP- Wrote: Almost 50 occurrences of olol and if you include the ones that cross spaces, there are almost 150.
That's not as significant as the presence of occurrences of ol ol. In some languages a word might be written as two consecutive "and"s, just as Latin quoquo is written as two consecutive quo's, but two "and"s in succession divided by a space would be strange in any language.
Quote:Anton, should we perhaps make a separate thread about the search for "and"? The advantage of focusing on one function word is that once we have a few candidates, they can be accurately compared to the distribution in a corpus of any language we want.
I don't think right now there are definite results to discuss... if they are obtained, one could open a separate thread of course. I was focusing more on methodology. Likewise, the same approach would be valid for "or". The word "or" is something which one normally would not expect to be paragraph-final as well - although, probably, it's not as frequent as "and".
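The paragraph-final test could be sketched roughly like this. It assumes paragraphs are separated by blank lines and that words are whitespace/punctuation delimited; the function name and the sample text are invented for illustration.

```python
import re

def paragraph_final_share(text, word):
    """Share of all occurrences of `word` that are the last word of a paragraph.

    Paragraphs are assumed to be separated by one or more blank lines.
    Returns 0.0 if the word never occurs.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    total = 0
    final = 0
    for paragraph in paragraphs:
        tokens = re.findall(r"\w+", paragraph.lower())
        total += tokens.count(word)
        if tokens and tokens[-1] == word:
            final += 1
    return final / total if total else 0.0

# Made-up sample with three paragraphs, for illustration only.
sample = "bread and wine\n\nsalt and pepper and\n\noil or vinegar"
print(paragraph_final_share(sample, "and"))
print(paragraph_final_share(sample, "or"))
```

For a real "and" or "or" candidate one would expect this share to be close to zero; a frequent vord with a clearly higher share would be a poor candidate under this approach.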
If we assume a direct mapping, then taking a Romance example (Italian) at once brings up annoying counterexamples that make a mockery of attempting to map functors in this fashion, i.e. how to distinguish between
Quote:e = and
è = is/are (third person singular of essere: to be)
To further complicate matters, when followed by a vowel-initial word it's ed, not e. And ad, not a. Etc.
Similar to a/an in English.
So we need to take into account the following word when creating our mappings. Duplication is unlikely to account for this as this is purely a spoken preference which has worked its way into grammar.
But when we consider the last word of a paragraph, there is no following word. I guess I may misunderstand your point?!
It would be good if we could construct a baseline that cuts out all the clutter.
I'm thinking we could locate the most common form for "and" in a range of languages, European but also some plausible others like Turkic and Arabic. Collect a corpus for each language. Calculate frequency for most common form of "and".
This will give us a range of frequencies. In a corpus of a decent size, we expect a word for "and" to appear between x and y times.
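A rough sketch of that baseline, assuming each corpus is available as a plain string and that we already know the most common "and" form per language. The tiny sample sentences are placeholders I made up, so the numbers they produce mean nothing; real corpora would be needed to get a usable x and y.

```python
import re
from collections import Counter

def relative_frequency(text, form):
    """Occurrences of `form` as a standalone word, divided by total word count."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[form] / len(tokens)

def and_frequencies(corpora):
    """corpora maps language name -> (corpus text, most common 'and' form)."""
    return {lang: relative_frequency(text, form)
            for lang, (text, form) in corpora.items()}

# Placeholder samples, far too small to be meaningful.
samples = {
    "Latin": ("veni et vidi et vici", "et"),
    "Italian": ("il pane è buono e il vino è buono", "e"),
    "English": ("bread and wine and oil", "and"),
}
freqs = and_frequencies(samples)
low, high = min(freqs.values()), max(freqs.values())
print(freqs)
print("range:", low, "to", high)
```

Note that the tokenizer keeps accented letters, so Italian è is not counted as e here.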
Because we don't know if VMS spaces are word boundaries, you also have to calculate how often the "and" word (in any given language) occurs as a syllable.
For example, Latin "et" occurs as "and" and also occurs as a syllable that has nothing to do with "and". Also, in many languages the conjunction "and" is attached to words (does not appear separately).
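To separate the two cases, something like the following could be run on each corpus. It is only a sketch: it treats any substring match inside a longer word as an "embedded" occurrence, which is a crude stand-in for real syllable boundaries, and the Latin sample is just an illustrative string.

```python
import re

def word_vs_embedded(text, form):
    """Return (standalone, embedded) counts for `form`.

    standalone - tokens equal to `form`
    embedded   - non-overlapping occurrences of `form` inside longer tokens
                 (a substring match, not a true syllable analysis)
    """
    tokens = re.findall(r"\w+", text.lower())
    standalone = sum(1 for t in tokens if t == form)
    embedded = sum(t.count(form) for t in tokens if t != form)
    return standalone, embedded

# Illustrative string only.
print(word_vs_embedded("et petit et habet", "et"))
```

If spaces are not trusted as word boundaries, the sum of the two counts is the figure to compare against.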
And, as I've mentioned previously, "and" is often expressed as one character (ampersand) in the middle ages, in many languages. This possibility must also be considered. Keep in mind that the ampersand character was also sometimes used to designate "et" as a syllable (usually at the beginnings of words in Latin-based languages) even when it does not mean "and".
Quote:I'm thinking we could locate the most common form for "and" in a range of languages, European but also some plausible others like Turkic and Arabic. Collect a corpus for each language. Calculate frequency for most common form of "and".
I briefly checked major European languages (modern), and in all of them "and" is at least in the top twelve most frequent words. For some languages, notably Russian, "and" is the most frequent word.
Quote:Because we don't know if VMS spaces are word boundaries, you also have to calculate how often the "and" word (in any given language) occurs as a syllable.
That will only increase the frequency rank (in the respective language), not decrease it.
If we mix this up with the space problem, it will be a nightmare, since instead of top-N vords we'll have to consider top-N Voynichese n-grams (not knowing which value of n to look at specifically).
So I'd suggest assuming beforehand that spaces are spaces.
Actually, let's maybe create a dedicated thread indeed.
My point was that when contemplating scenarios, we need to bear in mind that small functors may be confused with homographs and syllables.
So when examining Italian, we need to have some clear rules about the word e. We need to understand that it is always delimited by spaces, is almost never sentence-final, and cannot be modified by a diacritic. Furthermore, it can be substituted by a different word (ed) when preceding a vowel-initial word.
Without having the rules in place, any attempt to extract semiotic meaning from e would be doomed to failure.
David: the second problem (e vs. ed) can be navigated by focusing on the most common form. This is what we'd be doing in the VM as well, after all.