quimqu > 02-07-2025, 04:10 PM
oshfdk > 02-07-2025, 05:05 PM
Koen G > 02-07-2025, 05:11 PM
MarcoP > 02-07-2025, 05:32 PM
quimqu > 02-07-2025, 06:29 PM
(02-07-2025, 05:11 PM)Koen G Wrote: Hmm. Your range for regular language bigrams spans nearly an order of magnitude, from 24.6 to 3.4. Is Voynichese's 1.9 then really an outlier?
(02-07-2025, 05:32 PM)MarcoP Wrote: I wonder if there is any correlation with the number of repeating bi/trigrams in each corpus text
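MarcoP's repeating-bi/trigram count is straightforward to compute directly. Below is a minimal sketch, not taken from the thread, of one way to measure it; it assumes each corpus is a plain-text file of whitespace-separated tokens (for Voynichese, an EVA transliteration split into words), and the file name is a placeholder.

Code:
from collections import Counter

def repeated_ngram_stats(tokens, n):
    """Count n-gram types that occur more than once, and the share of
    all n-gram occurrences that belong to such repeated types."""
    grams = Counter(zip(*(tokens[i:] for i in range(n))))
    repeated_types = sum(1 for c in grams.values() if c > 1)
    repeated_occurrences = sum(c for c in grams.values() if c > 1)
    total = sum(grams.values())
    return repeated_types, (repeated_occurrences / total if total else 0.0)

# Placeholder file name; assumes whitespace-separated tokens per corpus.
tokens = open("corpus.txt", encoding="utf-8").read().split()
for n in (2, 3):
    types_rep, share = repeated_ngram_stats(tokens, n)
    print(f"{n}-grams: {types_rep} repeated types, {share:.1%} of occurrences are repeats")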
quimqu > 02-07-2025, 06:34 PM
(02-07-2025, 05:05 PM)oshfdk Wrote: For the results to be comparable, we need to establish how Voynichese words correspond to plaintext words (and if they are words and not syllables, character combinations, etc.). I think it makes sense to repeat the experiment using comparable training sizes and output sizes expressed in Shannon bits.
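One concrete reading of "sizes expressed in Shannon bits", offered here as an illustration rather than anything posted in the thread, is to multiply a per-character entropy estimate by the text length. The sketch below uses a zeroth-order (character unigram) estimate, with the bzip2-compressed size as a rough cross-check; the file names are placeholders.

Code:
import bz2
import math
from collections import Counter

def unigram_bits(text: str) -> float:
    """Total size in Shannon bits under a zeroth-order (character unigram) model."""
    counts = Counter(text)
    total = len(text)
    h0 = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h0 * total

# Placeholder file names; one corpus per plain-text file.
for name in ("voynich_eva.txt", "latin_sample.txt"):
    text = open(name, encoding="utf-8").read()
    # Compressed size is a crude upper bound on the information content of the source.
    bz2_bits = 8 * len(bz2.compress(text.encode("utf-8")))
    print(f"{name}: {len(text)} chars, {unigram_bits(text):,.0f} bits (H0), {bz2_bits:,} bits (bz2)")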
Koen G > 02-07-2025, 06:45 PM
quimqu > 02-07-2025, 07:05 PM
(02-07-2025, 06:45 PM)Koen G Wrote: Some Latin text without repetitiveness inherent to the genre or style may score very low in your system, since all the little grammatical words are expressed grammatically.
oshfdk > 02-07-2025, 07:20 PM
(02-07-2025, 06:34 PM)quimqu Wrote: So while it’s an open question what exactly Voynichese tokens represent, the GPT’s failure to learn even short sequences suggests something deeper than just a mismatch in segmentation level. It might be structural.
magnesium > 02-07-2025, 08:19 PM
(02-07-2025, 06:34 PM)quimqu Wrote: (02-07-2025, 05:05 PM)oshfdk Wrote: For the results to be comparable, we need to establish how Voynichese words correspond to plaintext words (and if they are words and not syllables, character combinations, etc.). I think it makes sense to repeat the experiment using comparable training sizes and output sizes expressed in Shannon bits.
I see your point, but I wonder if that distinction might be overemphasized. Whether the tokens represent words, syllables, or even phonemes, any genuine writing system that encodes natural language should still reflect the underlying linguistic structure — which GPT models are designed to pick up on, regardless of granularity.
For example, Chinese uses characters that are often closer to morphemes or syllables than words, and Latin is highly inflected with longer words — but both still have strong statistical patterns in their sequential structure. If Voynichese encoded a real language at any level — word, syllable, or otherwise — we’d expect some degree of learnable co-occurrence patterns.
So while it’s an open question what exactly Voynichese tokens represent, the GPT’s failure to learn even short sequences suggests something deeper than just a mismatch in segmentation level. It might be structural.
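The "learnable co-occurrence patterns" point can also be quantified without training a GPT at all. A minimal sketch, mine rather than anything from the thread, is to compare the unigram entropy of tokens with the conditional entropy given the previous token: the larger the drop, the more one token predicts the next, whatever the tokens actually encode (words, syllables, or something else). Whitespace tokenization and the file name are assumptions.

Code:
import math
from collections import Counter

def unigram_entropy(tokens):
    """H(token) in bits, from maximum-likelihood unigram estimates."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_bigram_entropy(tokens):
    """H(next token | current token) in bits, from raw bigram counts.
    Unsmoothed estimates are biased low on small corpora; treat as indicative only."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    n = len(tokens) - 1
    h = 0.0
    for (a, b), c in bigrams.items():
        p_ab = c / n                   # joint probability of the pair
        p_b_given_a = c / contexts[a]  # conditional probability of b after a
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Placeholder file name; assumes one whitespace-tokenized corpus per file.
tokens = open("corpus.txt", encoding="utf-8").read().split()
h1 = unigram_entropy(tokens)
h1_cond = conditional_bigram_entropy(tokens)
print(f"H(token) = {h1:.2f} bits, H(token | previous) = {h1_cond:.2f} bits, drop = {h1 - h1_cond:.2f} bits")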