(04-01-2026, 10:17 PM)guidoperez Wrote: Segment definition: I used individual lines as a first approximation. While shorter segments tend towards lower J-scores due to sampling noise, the gap between VMS (J≈0.08) and Latin baselines (J≈0.30) remains massive (d=2.45). ... I segmented Latin texts by sentence.
For Latin, do you mean "sentence" or "paragraph"? Either way, that would explain the very high J scores for the control texts (30% of the words that appear in either or both of two consecutive "segments" occur in both of them). I doubt you can get anywhere close to that level if you take 10 words and the next 10 words (which would be like the Voynichese paragraph text).
And, did you use all the text of the VMS (including labels), or only the paragraph text? Label lines are usually a single word, and their J score is zero. They may be one of the reasons why the VMS J score is so low.
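Just to be explicit about what is being computed: the J score of two consecutive segments is the Jaccard index of their vocabularies. A minimal sketch (my own illustration, with made-up Voynichese-looking tokens, not anyone's actual pipeline):

Code:
def jaccard(seg_a, seg_b):
    """J = |A ∩ B| / |A ∪ B|, where A and B are the sets of distinct tokens."""
    a, b = set(seg_a), set(seg_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# A one-word label "segment" rarely shares its word with the next segment,
# so its J score is usually zero:
print(jaccard(["otedy"], ["daiin", "chedy", "qokeedy", "shedy"]))   # -> 0.0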
Quote:Control texts: Even when truncating Latin strings to match VMS line lengths, the J-index remains significantly higher. I will include a more detailed sensitivity analysis on segment length in the next iteration.
I am looking forward to that. And be sure to consider only paragraph text in the VMS analysis.
But if you reformat the control texts into "segments" of 10 tokens or less, you will get low J scores that are mostly sampling noise. I suggest that instead you ignore line breaks (in the VMS and in the control texts), and split each paragraph into segments with a fixed number N of tokens (at least 50), discarding the trailing segment at the end of each paragraph if it has fewer than N tokens.
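Something along these lines (a rough sketch only, assuming each text is already tokenized into a list of paragraphs, each a list of tokens; N = 50 and all names are just for illustration):

Code:
def jaccard(a, b):
    # same J as in the sketch above, repeated so this snippet runs on its own
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def fixed_segments(paragraphs, n=50):
    """Ignore line breaks; cut each paragraph into consecutive n-token segments,
    discarding the leftover tokens (fewer than n) at the end of each paragraph."""
    segs = []
    for para in paragraphs:
        for i in range(0, len(para) - n + 1, n):
            segs.append(para[i:i + n])
    return segs

def mean_consecutive_j(segs):
    """Average J over all pairs of consecutive segments."""
    scores = [jaccard(a, b) for a, b in zip(segs, segs[1:])]
    return sum(scores) / len(scores) if scores else 0.0

That way the VMS and the control texts are compared on equal footing, and the effect of segment length can be checked simply by varying n.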
Quote:Column C: I used the transcription layer that normalizes certain glyphs to reduce orthographic noise while keeping the token structure intact.
So you mean "the 'C' transcription". That is one of the oldest transcriptions, made from B&W low-resolution photographs and thus probably more contaminated by reading errors. You should use a more modern one. For your analysis it does not matter whether a glyph is encoded in one letter (as in the C transcription) or several letters (as in EVA).
Quote:Hapax legomena: I included all tokens. Potential noise from transcription errors actually works against my hypothesis by lowering J across the board. If I filtered rare tokens, the J-index for natural language would rise faster than for the VMS, making the divergence even more striking.
That is not obvious at all. We don't know how many single-use words there are in the Latin text. In the VMS transcription I have at hand, there are 7171 lexemes, and 4803 of them occur only once in the book. Many of these are likely to be due to errors -- by the Author, the Scribe, or the transcribers -- including omission of word spaces or insertion of bogus spaces.
If you delete the VMS hapaxes from A and B before computing J, that will decrease |A ∪ B| but not change |A ∩ B|, thus increasing the J score.
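Here is a toy example (invented tokens and numbers, just to make the arithmetic visible): a word that occurs only once in the whole book can appear in at most one of the two segments, so it sits in A ∪ B but never in A ∩ B.

Code:
A = {"daiin", "chedy", "qokal", "rare1"}   # "rare1" occurs nowhere else in the book
B = {"daiin", "shedy", "qokal", "rare2"}   # "rare2" likewise
hapaxes = {"rare1", "rare2"}

print(len(A & B) / len(A | B))                      # 2/6 ≈ 0.33
print(len((A - hapaxes) & (B - hapaxes)) /
      len((A - hapaxes) | (B - hapaxes)))           # 2/4 = 0.50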
Quote: the J≈0.08 signature is a hurdle that any linguistic theory must explain.
... provided this index is properly computed for the VMS, and the control texts are compared after reformatting them into segments with the same average number of tokens.
And I insist that you must look at the sets A ∩ B for the VMS and the control texts, and figure out what kinds of words are contributing most to their J scores. If (as I suspect) the control texts get high J scores because of shared articles, prepositions, auxiliary verbs (including "to be"), and other function words, then you must add control texts in languages that do not have such grammatical features, like Turkish, Tibetan, Chinese ...
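A quick way to check that (again only a sketch, reusing the fixed-length segments from above; the function name is mine) is to tally which word types show up in the intersections of consecutive segments:

Code:
from collections import Counter

def intersection_profile(segs):
    """Count how many consecutive-segment intersections each word type appears in."""
    tally = Counter()
    for a, b in zip(segs, segs[1:]):
        tally.update(set(a) & set(b))
    return tally

# For a Latin or English control text, I would expect the top of
# intersection_profile(segs).most_common(20) to be dominated by function words
# ("et", "in", "the", "of", ...); for the VMS, whatever tops that list is what
# is actually driving its J score.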
Quote:I use modern computational tools to assist in data processing and formalizing my findings
"Modern" does not mean "reliable". You cannot trust anything that is produced by an LLM. No matter how you arrived at your algorithms, you are expected to
thoroughly understand what they are computing.
All the best, --stolfi