Interchangeable ch/sh and k/t characters

RE: Interchangeable ch/sh and k/t characters

Jorge_Stolfi > 11 hours ago

(Today, 06:26 AM) MarcoP Wrote: That’s an interesting parallel and a great opportunity to remember how peculiar Voynichese is. First, character entropy tells us that Voynichese isn’t a phonetic rendering of a European language: we know that Voynichese glyphs do not behave like vowels and/or consonants in French, English, German, Latin, Greek, Italian etc.

Character entropy is totally dependent on the encoding of phonemes. The character entropy of Italian would increase noticeably if one replaced "ch"->"k", "gn"->"ñ", "gl"->"ł", "sc"->"š", etc. Even more if one replaced every open "e" by "ɛ", every open "o" by "ɔ", and marked every stressed vowel with a diacritic.

Likewise, the character entropy of Voynichese depends on what you count as a character. It will be very low if you consider each EVA letter as a character. It will be higher if you treat Ch, Cth etc. as single characters. It will be even higher if you consider each "element" of my word model as a character -- in particular, if you count Che, She, ee, eee, CThe, CKhe, in, iin, iiin as single characters.
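To make the dependence on encoding concrete, here is a minimal Python sketch. It computes the first-order character entropy of a text before and after merging digraphs into single symbols. The helper names (`char_entropy`, `recode`), the sample phrase, and the choice of first-order entropy are assumptions made for illustration; nothing here is taken from an actual analysis of the VMS.

```python
from collections import Counter
from math import log2

def char_entropy(text: str) -> float:
    """First-order entropy in bits per character, ignoring whitespace."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return sum(n / total * log2(total / n) for n in counts.values())

def recode(text: str, table: dict[str, str]) -> str:
    """Replace each multi-letter unit by one symbol, longest units first."""
    for unit in sorted(table, key=len, reverse=True):
        text = text.replace(unit, table[unit])
    return text

# Hypothetical re-spelling table, following the substitutions named above.
ITALIAN_DIGRAPHS = {"ch": "k", "gn": "ñ", "gl": "ł", "sc": "š"}

sample = "che cosa significa scegliere gli gnocchi"  # toy sample, far too short for real statistics
merged = recode(sample, ITALIAN_DIGRAPHS)
print(merged)
print(round(char_entropy(sample), 3), round(char_entropy(merged), 3))
```

The same two functions could be applied to an EVA transcription, with a table that maps Ch, Sh, CTh and so on to single symbols, to see how much the figure moves with the choice of "character".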

Word entropy is less dependent on the encoding. But even that is affected by orthography, e.g. by word splitting and joining. The word entropy of Italian would be lower if oblique pronouns were split from the verb ("ditemelo" -> "dite me lo") and compounds were split into components ("automobile" -> "auto mobile", "solamente" -> "sola mente", etc.). And it is lower also if the text is heavily abbreviated or in shorthand, so that many word types are merged ("pasto","pesto","posta" -> "pst" etc.)
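The merging effect can be illustrated with a toy Python sketch. Dropping vowels stands in for abbreviation or shorthand here; that rule and the four-token sample are assumptions chosen only to keep the arithmetic visible.

```python
from collections import Counter
from math import log2

def word_entropy(tokens):
    """Entropy in bits per word token of the word-type distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum(n / total * log2(total / n) for n in counts.values())

def strip_vowels(word):
    """Crude stand-in for shorthand: drop the five plain vowels."""
    return "".join(ch for ch in word if ch not in "aeiou")

tokens = ["pasto", "pesto", "posta", "pasto"]
abbreviated = [strip_vowels(w) for w in tokens]

print(word_entropy(tokens))       # three word types
print(word_entropy(abbreviated))  # every token has become "pst"
```

With three distinct types the toy sample has 1.5 bits per word; once all of them collapse to "pst" the entropy is zero.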

Anyway, IIRC the word entropy of Voynichese was about 10 bits per word, which was well within the range of European languages.

Quote:Second, while in ordinary European languages it is possible that an initial character can be replaced with a different initial, this is much rarer than in Voynichese (where the phenomenon is systematic).

True. But do you know what happens if you drop the "European"?

All the best, --stolfi
RE: Interchangeable ch/sh and k/t characters

dashstofsk > 11 hours ago

(Yesterday, 05:29 PM) ololololo Wrote: What do you think?

This sort of thing seems to be a feature of the VMS. Many prefixes can be exchanged to form valid words. In particular, an initial s can be replaced by d, or d by s, or the initial letter can be removed. Each replacement gives a frequent word. See the attached output.

Earlier I also tried to show that gallows words are formed from common stems and endings that are independent of each other [link unavailable].

These observations seem to suggest that the writer was employing some sort of regular construction, forming words from common segments, and is one reason why I believe the writing to be artificial and meaningless.
RE: Interchangeable ch/sh and k/t characters

Ruby Novacna > 10 hours ago

(Today, 06:26 AM) MarcoP Wrote: ...we know that Voynichese glyphs do not behave like vowels and/or consonants in French, English, German, Latin, Greek, Italian etc.
... the differences with European natural languages are so huge that I think you can get the idea in any case.

Marco, I'm not trying to convince you; I have no counter-argument to offer, especially when you're comparing the text of a single manuscript with a language like French.

That said, I agree that the EVA glyphs t and k are sometimes difficult to distinguish, so the scribe could have confused them if he didn't truly master the language.
Personally, even in dictionaries, I struggle to distinguish between Coptic n and p, Greek omega and pi, sigma and stigma, and I rarely recognize the subscript iota, etc.
If you ask me to copy a manuscript, I'm likely to make many mistakes.
RE: Interchangeable ch/sh and k/t characters

ololololo > 10 hours ago

(Today, 06:26 AM) MarcoP Wrote:
(Yesterday, 06:55 PM) Ruby Novacna Wrote: We are dealing with words that differ by only one consonant, such as, for example, in French the words "bateau" (boat), "château" (castle), "gâteau" (cake), and "râteau" (rake).
Do they all mean the same thing?
And what about the verbs: "disait" (was saying), "gisait" (was lying), "lisait" (was reading), "misait" (was betting), "visait" (was aiming)?

That’s an interesting parallel and a great opportunity to remember how peculiar Voynichese is. First, character entropy tells us that Voynichese isn’t a phonetic rendering of a European language: we know that Voynichese glyphs do not behave like vowels and/or consonants in French, English, German, Latin, Greek, Italian etc.

Second, while in ordinary European languages it is possible that an initial character can be replaced with a different initial, this is much rarer than in Voynichese (where the phenomenon is systematic). In [link unavailable], I only find 190 pairs that differ only in the first character, while I find 1004 among the top 1500 Voynichese words.
For instance, daiin has these variants among the top 1500 word types:
daiin 850 / saiin 127
daiin 850 / kaiin 79
daiin 850 / raiin 64
daiin 850 / taiin 45
daiin 850 / oaiin 25
daiin 850 / laiin 14
daiin 850 / paiin 7
daiin 850 / yaiin 5
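A count like this can be reproduced with a short Python sketch. It groups word types by everything after the first character and counts unordered pairs within each group. Whether the figures above were obtained exactly this way is not stated, so the pair definition and the name `first_char_pairs` are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def first_char_pairs(words):
    """Unordered pairs of distinct word types that differ only in the first character."""
    by_tail = defaultdict(set)
    for w in set(words):
        if w:
            by_tail[w[1:]].add(w)  # same tail means same word apart from the first character
    pairs = []
    for group in by_tail.values():
        pairs.extend(combinations(sorted(group), 2))
    return sorted(pairs)

# Tiny illustrative list, mixing words quoted in this thread.
words = ["daiin", "saiin", "kaiin", "chedy", "shedy", "ton", "bon"]
for a, b in first_char_pairs(words):
    print(a, "/", b)
```

Note that in EVA the pair chedy/shedy also counts here, because ch and sh differ only in their first EVA letter.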

Third, word frequencies for matching pairs in French often differ by a whole order of magnitude.
E.g. t/b matches:
ton 15513 / bon 11483
tout 47221 / bout 4571
tete 11999 / bete 1706

This of course makes sense, since the words in a pair typically are semantically unrelated and the match is "coincidental". This also happens for most of the 'daiin' examples above, but it usually doesn't happen for ch/sh.

Also, in most cases the t/b replacement doesn’t work in French:
trouver 16833 brouver?
temps 16785 bemps?
toujours 14336 boujours?
…

Compare with the top ten Voynichese ch/sh matches:
chedy 507 / shedy 434
chol 395 / shol 185
chey 352 / shey 278
chor 211 / shor 95
cheey 185 / sheey 149
cheol 173 / sheol 108
chy 164 / shy 98
chdy 146 / shdy 45
chckhy 140 / shckhy 60
cheor 93 / sheor 47

Here the replacement almost always works and typically results in comparable frequencies (max/min < 3).
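The max/min criterion can be checked directly against the counts quoted above. The Python sketch below uses only the numbers listed in this post; the function name `ratios` is an illustrative choice.

```python
# Counts copied from the lists above: (word, count, counterpart, count).
CH_SH = [
    ("chedy", 507, "shedy", 434), ("chol", 395, "shol", 185),
    ("chey", 352, "shey", 278), ("chor", 211, "shor", 95),
    ("cheey", 185, "sheey", 149), ("cheol", 173, "sheol", 108),
    ("chy", 164, "shy", 98), ("chdy", 146, "shdy", 45),
    ("chckhy", 140, "shckhy", 60), ("cheor", 93, "sheor", 47),
]
FRENCH_T_B = [
    ("ton", 15513, "bon", 11483),
    ("tout", 47221, "bout", 4571),
    ("tete", 11999, "bete", 1706),
]

def ratios(pairs):
    """Frequency ratio max/min for each matched pair."""
    return [max(a, b) / min(a, b) for _, a, _, b in pairs]

for name, data in (("ch/sh", CH_SH), ("French t/b", FRENCH_T_B)):
    rs = ratios(data)
    below = sum(r < 3 for r in rs)
    print(f"{name}: {below} of {len(rs)} pairs have max/min < 3")
```

On these figures nine of the ten ch/sh pairs stay below a ratio of 3 (the exception is chdy/shdy, at about 3.2), against one of the three French t/b pairs.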

EDIT: as always, it's possible that I made errors, but the differences with European natural languages are so huge that I think you can get the idea in any case.

In natural languages, this kind of letter substitution is not universal, although it occurs in almost all of them. It applies only to some words.
RE: Interchangeable ch/sh and k/t characters

ReneZ > 7 hours ago

(11 hours ago) Jorge_Stolfi Wrote: Character entropy is totally dependent on the encoding of phonemes. The character entropy of Italian would increase noticeably if one replaced "ch"->"k", "gn"->"ñ", "gl"->"ł", "sc"->"š", etc. Even more if one replaced every open "e" by "ɛ", every open "o" by "ɔ", and marked every stressed vowel with a diacritic.

A couple of observations related to that...
First, the dependency is likely to be less than one might expect. It is certainly a lot less than the difference between Italian and Voynichese in any of the common transliteration alphabets.
Secondly, making such substitutions means it is no longer Italian.
Thirdly, the fact that such changes show up in the character entropies (single-character and bigram) makes this a useful measure IMHO.

(11 hours ago) Jorge_Stolfi Wrote: Word entropy is less dependent on the encoding. But even that is affected by orthography, e.g. by word splitting and joining. The word entropy of Italian would be lower if oblique pronouns were split from the verb ("ditemelo" -> "dite me lo") and compounds were split into components ("automobile" -> "auto mobile", "solamente" -> "sola mente", etc.). And it is lower also if the text is heavily abbreviated or in shorthand, so that many word types are merged ("pasto","pesto","posta" -> "pst" etc.)
Anyway, IIRC the word entropy of Voynichese was about 10 bits per word, which was well within the range of European languages.

Word entropy can barely be measured for the Voynich MS text. Word pair entropy is completely out of reach. For single-word entropy, the uncertainty about the alphabet, the handwriting itself and the word spacing means that it cannot be estimated reliably. As an example, one will get a highly variable number of hapax legomena from the text, depending on which transliteration (of the same text) one uses.
There are some figures related to that in my 2022 conference paper.
RE: Interchangeable ch/sh and k/t characters

Jorge_Stolfi > 2 hours ago
(7 hours ago) ReneZ Wrote: Secondly, making such [digraph-for-letter] substitutions means it is no longer Italian.

Considering that the VMS script is original, any statistical comparisons with other languages must take into account that the "encoding" may not be a simple one-to-one substitution cipher of the official spelling.

Besides, when languages go through spelling reforms, people normally don't say that they became a different language. Turkish changed its spelling radically from an Arabic-like script to a (remarkably phonetic) Roman-based one in the early 1900s; I don't think anyone would say that it is no longer Turkish.

Quote:Word entropy can barely be measured for the Voynich MS text. Word pair entropy is completely out of reach. For single-word entropy, the uncertainty of the alphabet, the handwriting itself and the word spacing makes that this cannot be estimated reliably. As an example, one will get a highly variable number of hapax from the text, depending on which transliteration (of the same text) one uses.

But the numbers are interesting anyway:
- Count of word types, count of tokens, and word entropy (bits), considering only word types that occur at least M times:
+---+---------------------+---------------------+---------------------+
| M | Shennong (py)       | Voyn SPS (wp)       | Voyn SPS (wc)       |
|   | types tokens   bits | types tokens   bits | types tokens   bits |
+---+---------------------+---------------------+---------------------+
| 1 |   621  13266   7.53 |  3323   9892  10.08 |  2850  11205   9.56 |
| 2 |   489  13134   7.46 |   923   7492   8.65 |   941   9296   8.49 |
| 3 |   413  12982   7.38 |   567   6780   8.17 |   608   8630   8.10 |
+---+---------------------+---------------------+---------------------+
- Shennong = Shennong Bencao Jing in Mandarin pinyin, derived from files at the Chinese Text Project and Chinese Wikisource, with some bug fixes ("in/2026-06-27-bencao-py.utf").
- Voyn SPS = Starred Parags section of VMS (f103r.1--f116r.30), from a recent transcription by J.Stolfi, with all weirdos mapped to '?', with commas ignored ("in/2026-06-29-starps-wp.ivp") or treated as spaces ("in/2026-06-29-starps-wc.ivp").
The files are available at [link unavailable].
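For anyone who wants to recompute rows like these from a token list, here is a minimal Python sketch. It assumes that word types occurring fewer than M times are dropped together with their tokens and that the entropy is taken over the remaining tokens only. That reading is an assumption about how the table was produced, and `stats_at_least` is a made-up name.

```python
from collections import Counter
from math import log2

def stats_at_least(tokens, min_count):
    """Return (word types, tokens, entropy in bits) over types seen at least min_count times."""
    counts = {w: n for w, n in Counter(tokens).items() if n >= min_count}
    total = sum(counts.values())
    entropy = sum(n / total * log2(total / n) for n in counts.values())
    return len(counts), total, entropy

# Toy input; a real run would read the tokens from a transcription file.
tokens = "a a b b c".split()
for m in (1, 2, 3):
    types, kept, bits = stats_at_least(tokens, m)
    print(f"M={m}: {types} types, {kept} tokens, {bits:.2f} bits")
```

Under this reading, each step from M to M+1 removes exactly the tokens of the types that occur M times, which is consistent with the Shennong column (621 - 489 = 132 types and 13266 - 13134 = 132 tokens between M=1 and M=2).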

So my problem is that the word entropy of Voynichese seems to be too *high*. Even considering all commas as spaces and excluding words that occur only once or twice. I have several possible explanations, but that belongs to another thread. Let me just say that, by my estimate, there are still about 3000 word breaks that were completely omitted either by the Scribe or by the transcriber [me].

Anyway, let me insist again that statistics (glyph and word frequencies, Zipfness, entropies, correlations etc) are not properties of languages, but of texts. In any language one can have texts that have much lower or much higher word entropy than a typical novel or philosophical treatise.

All the best, --stolfi
RE: Interchangeable ch/sh and k/t characters

Jorge_Stolfi > 1 hour ago

(2 hours ago) Jorge_Stolfi Wrote: Let me just say that, by my estimate, there are still about 3000 word breaks that were completely omitted either by the Scribe or by the transcriber [me].

I take that back. Bad reasoning. There must be many missing spaces and bogus spaces in the transcription file; but probably a lot less than 3000.

All the best, --stolfi
RE: Interchangeable ch/sh and k/t characters

Mauro > 1 hour ago

(2 hours ago) Jorge_Stolfi Wrote: Anyway, let me insist again that statistics (glyph and word frequencies, Zipfness, entropies, correlations etc) are not properties of languages, but of texts.

You are strictly right, of course, but it's also true that texts from the same language have, save rare exceptions, similar statistics.

Indeed it's almost always possible to determine the language of an unknown text just by comparing basic statistics: for example, this is the output of a program I wrote some time ago which categorizes texts according to their statistics. Here I'm comparing a book in Italian ("L'amore di Loredana") with a panel of texts in different languages, by calculating the root-mean-square distance of just the bigram distributions:

As you can see, "Amore di Loredana", written in 1908 by a certain L. Zuccoli from Canton Ticino, clusters with the Italian texts (which have dates ranging from the late Middle Ages to contemporary, with different genres including poetry, and include a specialized text: 'Appunti di anatomia', notes from anatomy lessons), and not with texts in other languages.

Note: DHR means 'declaration of human rights' (a very short text prone to statistical quirks)
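For readers who want to try the idea, here is a minimal Python sketch of an RMS distance between bigram distributions. It is not the program mentioned above. Taking bigrams within words only, lowercasing, and averaging the squared differences over the union of observed bigrams are all assumptions, and the function names are made up for the example.

```python
from collections import Counter
from math import sqrt

def bigram_freqs(text):
    """Relative frequencies of letter bigrams, taken within words only."""
    counts = Counter()
    for word in text.lower().split():
        word = "".join(ch for ch in word if ch.isalpha())
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}

def rms_distance(p, q):
    """Root-mean-square difference over the union of bigrams seen in p or q."""
    keys = set(p) | set(q)
    if not keys:
        return 0.0
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys) / len(keys))

def closest(unknown, references):
    """Label of the reference text whose bigram profile is nearest to the unknown text."""
    target = bigram_freqs(unknown)
    return min(references, key=lambda label: rms_distance(target, bigram_freqs(references[label])))

# Toy reference panel; real use needs texts of many thousands of characters.
references = {
    "it": "la bella casa della mamma",
    "en": "the house of the grandmother",
}
print(closest("la casa della nonna", references))
```

Samples this short are only good for checking that the code runs. The distances become meaningful only with book-length texts, and the result will still depend on the orthography of the reference panel.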
RE: Interchangeable ch/sh and k/t characters

Jorge_Stolfi > 32 minutes ago

(1 hour ago) Mauro Wrote: Indeed it's almost always possible to determine the language of an unknown text just by comparing basic statistics: for example, this is the output of a program I wrote some time ago which categorizes texts according to their statistics. Here I'm comparing a book in Italian ("L'amore di Loredana") with a panel of texts in different languages, by calculating the root-mean-square distance of just the bigram distributions.

Generally true, ... provided that the unknown text is written in the "official" orthography, and contains a sufficiently large fraction of "normal" prose text. In that case, even the letter frequencies could distinguish Italian from English.

But the VMS is definitely not written in the official orthography of any language. Even if it were in un-encrypted Italian, it would be in an orthography that is not a simple letter-by-letter mapping of the "official" one. It would use its own alphabet and digraphs.

And it would probably have its own quirks of spelling, like attaching the articles to nouns, detaching oblique pronouns from the verbs, marking stressed vowels in a different way, making heavy use of abbreviations (like "cãtaaŕ" for "cantare") etc.

And the text may make heavy use of certain words that are rare in typical novels and contain unusual digraphs. Like, a Medieval Italian herbal may have an excess of occurrences of the digraph "rb", and an excess of word-initial "h", if it repeats the old word "herba" often enough.

All the best, --stolfi
RE: Interchangeable ch/sh and k/t characters

Mauro > 8 minutes ago

(32 minutes ago) Jorge_Stolfi Wrote:
(1 hour ago) Mauro Wrote: Indeed it's almost always possible to determine the language of an unknown text just by comparing basic statistics: for example, this is the output of a program I wrote some time ago which categorizes texts according to their statistics. Here I'm comparing a book in Italian ("L'amore di Loredana") with a panel of texts in different languages, by calculating the root-mean-square distance of just the bigram distributions.

Generally true, ... provided that the unknown text is written in the "official" orthography, and contains a sufficiently large fraction of "normal" prose text. In that case, even the letter frequencies could distinguish Italian from English.

But the VMS is definitely not written in the official orthography of any language. Even if it were in un-encrypted Italian, it would be in an orthography that is not a simple letter-by-letter mapping of the "official" one. It would use its own alphabet and digraphs.

And it would probably have its own quirks of spelling, like attaching the articles to nouns, detaching oblique pronouns from the verbs, marking stressed vowels in a different way, making heavy use of abbreviations (like "cãtaaŕ" for "cantare") etc.

And the text may make heavy use of certain words that are rare in typical novels and contain unusual digraphs. Like, a Medieval Italian herbal may have an excess of occurrences of the digraph "rb", and an excess of word-initial "h", if it repeats the old word "herba" often enough.

All the best, --stolfi

Yes of course, but what you're saying now contradicts your previous position:

Quote:Anyway, let me insist again that statistics (glyph and word frequencies, Zipfness, entropies, correlations etc) are not properties of languages, but of texts.

If I can summarize: what you say is strictly true, but in practice the statistics of a text are (barring rare outliers, such as the famous English novel written without a single 'e') mostly governed by the underlying language (which includes, of course, its specific orthography) and not by the text. Once a language has been chosen, it's usually possible to say whether a given text is written in that language just by checking the most basic statistics. The text itself has little relevance, even when comparing very diverse texts as in my example of [link unavailable]. A Medieval Italian herbal might have an excess of "rb" and some orthographical quirks, but I bet it'd be recognized as Italian nonetheless. If you have a transcription in .txt format or the like I'd be glad to check.