MarcoP > 09-05-2018, 03:15 PM
Code | Language | Region | Country
rar | Rarotongan | Oceania | Cook Islands
qud | Quechua (Unified Quichua, old Hispanic orthography) | South-America | Peru
hms | Hmong, Southern Qiandong | Asia | China
cbs | Cashinahua | South-America | Peru
prq | Ashéninka Perené | South-America | Peru
mri | Maori | Oceania | New Zealand
qug | Quichua, Chimborazo Highland | South-America | Ecuador
fon | Fon | Africa | Niger
miq | Mískito | Central-America | Nicaragua
kmb | Mbundu | Africa | Angola
ReneZ > 09-05-2018, 04:45 PM
Koen G > 09-05-2018, 05:17 PM
MarcoP > 09-05-2018, 05:49 PM
(09-05-2018, 04:45 PM)ReneZ Wrote: Very interesting, Marco!
Have you done any pre-processing on the text files to 'clean them up' (as far as that is needed)?
Mattioli Latin 3.234
Pliny Latin 3.266 | 2.983 lat
Dante Italian 3.126 | 3.000 ita
Tristan German 3.039 | 2.825 deu_1901
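For the clean-up question quoted above, here is a minimal sketch of one common kind of pre-processing before counting characters. It is only an illustration: the thread as preserved here does not say what clean-up, if any, was applied to these files, and the function name clean_text and the choice of rules (lowercase, keep letters of any script, collapse whitespace) are assumptions.

```python
import re
import unicodedata


def clean_text(raw):
    """Lowercase the text, keep only letters, and collapse whitespace.

    Hypothetical helper: the rules below are an assumption, not the
    procedure used for the figures in this thread.
    """
    # Normalise to NFC so accented letters are single code points where possible
    text = unicodedata.normalize("NFC", raw).lower()
    # Keep letters (category L*) and combining marks (M*); replace the rest with a space
    kept = "".join(
        ch if unicodedata.category(ch)[0] in ("L", "M") else " "
        for ch in text
    )
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", kept).strip()


print(clean_text("Hello,  World! 123"))
```

Note that a rule like this also splits words at apostrophes and hyphens ("don't" becomes "don t"), which may or may not be what you want for a given language.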
MarcoP > 09-05-2018, 06:40 PM
(09-05-2018, 05:17 PM)Koen Gh. Wrote: Thank you, Marco, excellent work again. There's not much VM research being shared at the moment so it's nice to read something like this.
Given examples like Korean, does this mean that low entropy can be caused - in part or entirely - by the writing system, independent of the language?
(09-05-2018, 05:17 PM)Koen Gh. Wrote: Also, might the choice of text influence the results? The Declaration, a series of statements about what is or "shall be", excludes certain language forms from occurring. Past tenses, for example.
(09-05-2018, 05:17 PM)Koen Gh. Wrote: I've been looking around a bit for a readily available corpus but haven't found anything useful yet.
On this site they have the Bible in hundreds of languages: [link not preserved]
Perhaps with some planning and forum collaboration we could compile a corpus from something like this?
KING_JAMES 0.053 2.992
eng 0.094 2.981
ReneZ > 09-05-2018, 06:45 PM
MarcoP > 09-05-2018, 07:23 PM
(09-05-2018, 06:45 PM)ReneZ Wrote: I wonder if the very small entropy value for Korean isn't a consequence of processing of the individual bytes of UTF-8 encoded text.
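The byte-versus-character question quoted above can be checked with a small sketch like the one below. It computes entropy over the same text twice, once over Unicode code points and once over raw UTF-8 bytes. Everything here is an assumption for illustration: the helper names, the short Korean sample string, and the choice to show both first-order and conditional entropy (the thread as preserved does not say which measure or which tokenisation was used).

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """First-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(symbols):
    """Entropy of a symbol given the previous one: H(pairs) - H(previous)."""
    symbols = list(symbols)
    pairs = list(zip(symbols, symbols[1:]))
    if not pairs:
        return 0.0
    return shannon_entropy(pairs) - shannon_entropy(symbols[:-1])


# Illustrative sample only; a real comparison needs a much longer text
sample = "모든 인간은 태어날 때부터 자유로우며"

as_chars = list(sample)                   # one symbol per code point
as_bytes = list(sample.encode("utf-8"))   # three bytes per Hangul syllable

for name, seq in (("code points", as_chars), ("UTF-8 bytes", as_bytes)):
    print(name, len(seq), shannon_entropy(seq), conditional_entropy(seq))
```

Each Hangul syllable becomes three bytes in UTF-8, and the leading byte takes only a handful of values, so a byte-level count sees a much more predictable stream than a character-level count of the same text. On a sample this short the numbers themselves mean little; the point is only that the two tokenisations are different inputs and should not be compared as if they were the same.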
Davidsch > 10-05-2018, 01:52 PM
ReneZ > 11-05-2018, 06:58 PM
MarcoP > 11-05-2018, 08:14 PM