Experiments with language corpora

RE: Experiments with language corpora

DonaldFisk > 11-05-2018, 10:19 PM

(09-05-2018, 03:15 PM)MarcoP Wrote: I think it could be interesting to perform more structured experiments along these lines, adding more quantitative indexes that could help measure distance between languages. Also, there are other corpora that one could try with this or similar approaches.
Neither of the two lowest-entropy languages (Vai and Korean) is written in the Latin alphabet. Both scripts appear to be syllabic.

Korean is alphabetic, though it might not look like it. The letters are grouped into syllables, each of which is a single Unicode character (3 UTF-8 bytes). For example, the word Samsung is 삼성, the syllables are 삼 (sam) and 성 (seong), and the letters are ㅅ (s), ㅏ (a), ㅁ (m), ㅓ (eo), and ㅇ (ng).

If you want to process Korean script as alphabetic, you need to transliterate the syllables into letters first.
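A minimal sketch of that transliteration step, assuming the text uses precomposed Hangul syllables (U+AC00 to U+D7A3) and relying on Unicode canonical decomposition (NFD) from Python's standard library. The function name to_jamo is made up for this illustration; it is not something used earlier in the thread.

```python
import unicodedata

# Range of precomposed Hangul syllable characters in Unicode.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def to_jamo(text):
    """Split each precomposed Hangul syllable into its letters (jamo).

    Characters outside the Hangul syllable range are passed through
    unchanged, so accented Latin letters are not decomposed.
    """
    out = []
    for ch in text:
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            # Canonical decomposition yields conjoining jamo code points.
            out.append(unicodedata.normalize("NFD", ch))
        else:
            out.append(ch)
    return "".join(out)


word = "삼성"
letters = to_jamo(word)
print(len(word), "syllables ->", len(letters), "letters")
for c in letters:
    print(f"U+{ord(c):04X}", unicodedata.name(c))
```

Note that NFD produces conjoining jamo, which use different code points for a consonant in initial and in final position. If the same letter should count as one symbol regardless of position, an extra mapping step would be needed before computing any statistics.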
RE: Experiments with language corpora

Koen G > 11-05-2018, 10:30 PM

Well it depends which question we want to answer. If we want to compare Korean script to Voynichese we have to take each glyph at face value. If we want to see whether the Korean language could underlie Voynichese, we must hypothesize that Voynichese is written alphabetically and hence find transliterated Hangul more suitable.

At the time the VM was written, Koreans still wrote in Chinese script though. Still, it's interesting to see that Korean script produces such an outlier value. I still don't understand why.
RE: Experiments with language corpora

DonaldFisk > 11-05-2018, 10:39 PM

(09-05-2018, 05:49 PM)MarcoP Wrote:
(09-05-2018, 04:45 PM)ReneZ Wrote: Very interesting Marco!

Have you done any pre-processing on the text files to 'clean them up' (as far as that is needed)?

Thank you, Rene!
No, I haven't done any pre-processing. I didn't convert upper-case / lower-case, nor remove punctuation. A difficulty is that several of the files are not written in the Latin alphabet, so a generic pre-processing system is not easy to define. I only removed maybe half a dozen files that seemed wrong (too short to be an encoding of the text).
There certainly is some noise, and the samples are short (about 10,000 characters). I checked these rough entropy values against those on your site, and they do look comparable, but of course they should be taken as purely indicative (values from your site on the left):

If it's any help, what I've been doing as preprocessing before doing PCA is to downcase (this only applies to Roman, Greek, and Cyrillic alphabets AFAIK), treat non-alphabetic characters as spaces, and treat multiple spaces as single spaces. After that, I try to identify groups of characters which act as phonetic units (for example, in Italian, gl, tt, and a, but not st because it's two sounds). To do this, you need some idea of how to pronounce the language. For non-alphabetic scripts, I either find some romanized text or give up.
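Those steps could be sketched roughly as follows. This is only an illustration: the function names, the trimming of leading and trailing spaces, and the two-entry list of Italian units are assumptions made for the example, and a real unit list has to come from knowledge of the language.

```python
import re


def preprocess(text):
    """Lowercase, turn non-alphabetic characters into spaces, collapse spaces."""
    lowered = text.lower()
    letters_only = "".join(ch if ch.isalpha() else " " for ch in lowered)
    # Collapse runs of spaces; trimming the ends is an extra assumption.
    return re.sub(r" +", " ", letters_only).strip()


def split_units(text, multi_units):
    """Greedily split text into phonetic units, longest unit first.

    multi_units lists the multi-character groups to keep together;
    every other character becomes a unit of its own.
    """
    units = sorted((u for u in multi_units if u), key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for u in units:
            if text.startswith(u, i):
                out.append(u)
                i += len(u)
                break
        else:
            out.append(text[i])
            i += 1
    return out


clean = preprocess("Tutti gli  esseri umani, nascono liberi!")
print(clean)
print(split_units(clean, ["gl", "tt"]))  # hypothetical unit list
```

One caveat: str.isalpha() is False for combining marks, so text in decomposed Unicode form would lose its diacritics here; normalizing to NFC first avoids that.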
RE: Experiments with language corpora

DonaldFisk > 11-05-2018, 10:47 PM

(11-05-2018, 10:30 PM)Koen Gh. Wrote: Well it depends which question we want to answer. If we want to compare Korean script to Voynichese we have to take each glyph at face value. If we want to see whether the Korean language could underlie Voynichese, we must hypothesize that Voynichese is written alphabetically and hence find transliterated Hangul more suitable.

At the time the VM was written, Koreans still wrote in Chinese script though. Still, it's interesting to see that Korean script produces such an outlier value. I still don't understand why.

Yes, good point.

Korean's an unlikely candidate for a variety of reasons, and the number of characters in Voynichese suggests it's either alphabetic/abjad or supposed to look like it is, with each glyph encoding a phoneme (more or less).
RE: Experiments with language corpora

ReneZ > 12-05-2018, 07:35 AM

Indeed, Korean and other 'exotic' languages are unlikely candidates.
The main interest in looking at all of them is that this has hardly been done, or at least (in line with a post I just wrote elsewhere), this is not documented.

This follows the logic of [link], which tries to visualise why the Voynich MS text is not a plain text in Latin or a similar known European language of the time, encoded with a simple substitution.

Since it isn't that, there are other options:
- it isn't meaningful text at all
- it is a straightforward encoding of some other (i.e. 'exotic') language
- it is a more complicated encoding of something.

Of course, 'exotic' doesn't equate with 'Far East'. It could be anything that is not typically found in the manuscripts written in the 15th Century.

In any case, doing the sort of bulk experiment as described by Marco (and indeed Hauer and Kondrak) should help a lot in 'ticking off' that option, or at least providing much more factual information about it.
RE: Experiments with language corpora

MarcoP > 12-05-2018, 07:15 PM

(11-05-2018, 10:30 PM)Koen Gh. Wrote: Still it's interesting to see that Korean script produces such an outlier value. I still don't understand why

We can try expanding on Rene's comment about Korean:

(11-05-2018, 06:58 PM)ReneZ Wrote: Total characters: 3344
Nr of different characters: 303
H1 = 7.09
H2 = 9.01
Difference: 1.92 (conditional entropy)
...
What is happening here is that a very large number of syllables occur exactly once in this relatively short text.
That means that each of these is followed by exactly one other syllable, meaning that this second syllable (in this text) is completely predictable. The average information carried by these second characters is low, and this is precisely the meaning of the conditional entropy.

Effectively, character pairs are under-sampled, or in other words, the text is too short to give a good measure of the conditional entropy.

Conditional entropy measures how difficult it is to "guess" the next symbol on the basis of the immediately preceding one. Typically, in a language encoded with the Latin alphabet, you can tell that the symbol after a vowel will likely be a consonant and vice versa. Things become slightly more predictable with languages that use the Latin alphabet to represent a larger set of sounds: this requires the use of digraphs to represent sounds that do not belong to Latin. For instance, in German "ch" is very common: when you see a 'c' you can predict that it will be followed by 'h' and you will be right 95% of the time (and most of the remaining 5% will be 'ck').
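For reference, the quantities in Rene's figures can be written out as below. This is the standard textbook formulation, with p(x) the relative frequency of a single character and p(x, y) the relative frequency of the character pair xy; the symbols are chosen here for illustration and are not taken from the posts.

```latex
\begin{align*}
H_1 &= -\sum_{x} p(x)\,\log_2 p(x) \\
H_2 &= -\sum_{x,y} p(x,y)\,\log_2 p(x,y) \\
H(Y \mid X) &= H_2 - H_1 = -\sum_{x,y} p(x,y)\,\log_2 p(y \mid x)
\end{align*}
```

With the Korean values quoted above, 9.01 - 7.09 = 1.92. In a finite sample the last equality is only approximate when H1 is counted over all characters, because the final character of the text does not start a pair.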

To be able to measure conditional entropy, you need a sample that includes a reasonably complete set of digraphs. In the case of an alphabet with 26 different symbols, the possible digraphs are fewer than 26*26=676 (a number of these, like 'yy' or 'fz' in English, will be absent or so rare as to be irrelevant). The typical UDHR file is about 10,000 characters long: enough to estimate digraph distribution on the Latin alphabet.

Since the Korean sample has more than 300 different syllabic symbols, the possible combinations of two symbols are close to 100,000. The syllable-encoded text is about 3000 symbols long (obviously much shorter than a text represented in the Latin alphabet, where each symbol roughly corresponds to a single sound). The length of the sample is about 1/30 of the number of possible digraphs: as Rene says, a severe under-sampling. It would be like considering a 20-character sample of (say) German. We are not even seeing the whole alphabet: the entropy of a single character appears to be close to that of a digraph. Both H1 and H2 are under-estimated, but this is more serious for H2, which works on the larger digraph space. In such a tiny sample, both characters and digraphs will mostly occur only once or never at all. Things look very predictable: some combinations can occur (all with essentially the same frequency of 1) and others cannot occur.
If you compute conditional entropy on this German fragment:
"die eltern haben ein"
you get a value below 1. You can still get some very rough idea of H1 ('e' and 'n' are the most frequent symbols), but digraph structure is lost.
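The calculation on the German fragment can be reproduced with a short script. This is a sketch under stated assumptions: conditional entropy is taken as H2 - H1, pairs are overlapping and include spaces, and the function names are invented for the example. The last lines compare sample length to the number of possible digraphs, using the figures quoted above (3344 characters and 303 symbols for Korean, about 10,000 characters and 26 letters for a Latin-alphabet file).

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())


def entropy_stats(text):
    """Return (H1, H2, H2 - H1) for single characters and overlapping pairs."""
    h1 = entropy(Counter(text))
    h2 = entropy(Counter(text[i:i + 2] for i in range(len(text) - 1)))
    return h1, h2, h2 - h1


fragment = "die eltern haben ein"
h1, h2, cond = entropy_stats(fragment)
print(f"H1={h1:.2f}  H2={h2:.2f}  conditional={cond:.2f}")

# Sample length relative to the number of possible digraphs.
korean_ratio = 3344 / 303 ** 2   # figures quoted from Rene's post
latin_ratio = 10_000 / 26 ** 2   # typical UDHR file, 26-letter alphabet
print(f"Korean: {korean_ratio:.3f} characters per possible digraph")
print(f"Latin:  {latin_ratio:.1f} characters per possible digraph")
```

On the fragment, only two of the 19 pairs repeat, so H2 is barely larger than H1 and the difference comes out at roughly 0.87.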
RE: Experiments with language corpora

-JKP- > 12-05-2018, 09:59 PM

(09-05-2018, 03:15 PM)MarcoP Wrote: ...
Neither of the two lowest-entropy languages (Vai and Korean) is written in the Latin alphabet. Both scripts appear to be syllabic.

In contrast to Chinese and Japanese Kanji, the Korean alphabet is phonetic. The syllables are combined in terms of sound rather than concept.

Korean Hangul differs from Vai in that there is a very clear visual logic to the consonant-vowel combinations. You only have to learn 25 characters to create a palette of 154 syllables, which enables you to write pretty much anything you need.

Vai takes longer to learn and requires more memorization because each consonant-vowel combination has its own shape.

Thus, they are dramatically different when you write them. In Korean, words that sound similar look similar, because they use 25 basic building blocks. Syllables like Pa, Ba, and Sa look similar because they have the "a" in common.

In Vai, words or syllables that rhyme rarely have any visual similarity because each combination has its own shape. Syllables like Pa, Ba, and Sa look nothing like one another (there are quite a few African alphabets that are constructed this way).
RE: Experiments with language corpora

MarcoP > 19-05-2018, 10:56 AM

(11-05-2018, 08:14 PM)MarcoP Wrote: In the next few days, I will try comparing the UDHR samples with the VMS taking into account the different entropies.

I attach the graphs based on H1 and conditional entropy. The plot at the bottom right is Figure 12 from [link] on Rene's site: the dots for the various languages are close, even if not perfectly identical. This is partly due to the fact that different texts were used and I didn't pre-process the corpus to remove uppercase letters and punctuation.

The detail at the bottom left highlights the closest languages. Nothing much new:

hea Hmong, Northern Qiandong
top Totonac, Papantla
cbs Cashinahua
qxu Quechua, Arequipa-La Unión
hms Hmong, Southern Qiandong
qud Quechua (Unified Quichua, old Hispanic orthography)
auc Waorani
rar Rarotongan
tca Ticuna
chj Chinantec, Ojitlán
zro Záparo
zam Zapotec, Miahuatlán
RE: Experiments with language corpora

DONJCH > 28-05-2018, 09:57 PM

Can a confidence interval be calculated for any of these figures?
RE: Experiments with language corpora

MarcoP > 29-05-2018, 11:21 AM

(28-05-2018, 09:57 PM)DONJCH Wrote: Can a confidence interval be calculated for any of these figures?

I am not an expert in statistics, but I guess the answer can only be "yes": we have a set of numbers, so they can be fed to any numerical method. The meaningfulness of the output is another matter.

Of course, it also depends on the measurement we want to evaluate. Qualitatively, from the graphs above, it seems that we can be relatively confident about Voynichese H1, while conditional entropy varies strongly depending on both the subset of the text (Currier A / B) and the transcription system.