DonaldFisk > 11-05-2018, 10:19 PM
(09-05-2018, 03:15 PM)MarcoP Wrote: I think it could be interesting to perform more structured experiments along these lines, adding more quantitative indexes that could help measure distance between languages. Also, there are other corpora that one could try with this or similar approaches.
The two lowest-entropy languages (Vai and Korean) are both written in non-Latin scripts. Both scripts appear to be syllabic.
Koen G > 11-05-2018, 10:30 PM
DonaldFisk > 11-05-2018, 10:39 PM
(09-05-2018, 05:49 PM)MarcoP Wrote: (09-05-2018, 04:45 PM)ReneZ Wrote: Very interesting Marco!
Have you done any pre-processing on the text files to 'clean them up' (as far as that is needed)?
Thank you, Rene!
No, I haven't done any pre-processing. I didn't convert upper-case / lower-case, nor remove punctuation. A difficulty is that several of the files are not written in the Latin alphabet, so a generic pre-processing system is not easy to define. I only removed maybe half a dozen files that seemed wrong (too short to be an encoding of the text).
There certainly is some noise, and the samples are short (about 10,000 characters). I checked these rough entropy values against those on your site, and they do look comparable, but of course they should be taken as purely indicative (values from your site on the left):
DonaldFisk > 11-05-2018, 10:47 PM
(11-05-2018, 10:30 PM)Koen Gh. Wrote: Well it depends which question we want to answer. If we want to compare Korean script to Voynichese we have to take each glyph at face value. If we want to see whether the Korean language could underlie Voynichese, we must hypothesize that Voynichese is written alphabetically and hence find transliterated Hangul more suitable.
At the time the VM was written, Koreans still wrote in Chinese script though. Still it's interesting to see that Korean script produces such an outlier value. I still don't understand why
ReneZ > 12-05-2018, 07:35 AM
MarcoP > 12-05-2018, 07:15 PM
(11-05-2018, 10:30 PM)Koen Gh. Wrote: Still it's interesting to see that Korean script produces such an outlier value. I still don't understand why
(11-05-2018, 06:58 PM)ReneZ Wrote: Total characters: 3344
Nr of different characters: 303
H1 = 7.09
H2 = 9.01
Difference: 1.92 (conditional entropy)
...
What is happening here is that a very large number of syllables occur exactly once in this relatively short text.
That means that each of these is followed by exactly one other syllable, meaning that this second syllable (in this text) is completely predictable. The average information carried by these second characters is low, and this is precisely the meaning of the conditional entropy.
Effectively, character pairs are under-sampled, or in other words, the text is too short to give a good measure of the conditional entropy.
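To make the under-sampling effect concrete, here is a minimal Python sketch. It is not the script that produced the figures quoted above, and the function names `entropy` and `conditional_entropy` are my own. One assumption to flag: it takes the conditional entropy as the pair entropy minus the entropy of the first character of each pair. That is very close to, but not exactly, H2 - H1 over the whole text, because the last character of the text starts no pair.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a frequency table (a Counter)."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def conditional_entropy(text):
    """Entropy of a character given the character before it, in bits.

    Computed as H(pair) - H(first character of the pair), with both
    distributions estimated from the same list of adjacent pairs.
    """
    pairs = list(zip(text, text[1:]))
    h_pairs = entropy(Counter(pairs))
    h_first = entropy(Counter(first for first, _ in pairs))
    return h_pairs - h_first


# Every symbol occurs exactly once, so each one has a single observed
# successor and the second character of every pair is fully predictable.
print(conditional_entropy("abcdefgh"))

# Here 'a' and 'b' are each followed by 'a' or 'b' about equally often,
# so the successor carries close to one bit of information.
print(conditional_entropy("aabb" * 50))
```

The first call gives 0 and the second gives a value close to 1 bit. A short sample with several hundred distinct syllables sits much nearer the first case than a long text in the same script would, which is why the measured conditional entropy comes out so low.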
-JKP- > 12-05-2018, 09:59 PM
(09-05-2018, 03:15 PM)MarcoP Wrote: ...
The two lowest-entropy languages (Vai and Korean) are both written in non-Latin scripts. Both scripts appear to be syllabic.
MarcoP > 19-05-2018, 10:56 AM
(11-05-2018, 08:14 PM)MarcoP Wrote: In the next few days, I will try comparing the UDHR samples with the VMS, taking into account the different entropies.
DONJCH > 28-05-2018, 09:57 PM
MarcoP > 29-05-2018, 11:21 AM
(28-05-2018, 09:57 PM)DONJCH Wrote: Can a confidence interval be calculated for any of these figures?