(04-01-2026, 10:17 PM)guidoperez Wrote: Segment definition: I used individual lines as a first approximation. While shorter segments tend towards lower J-scores due to sampling noise, the gap between VMS (J≈0.08) and Latin baselines (J≈0.30) remains massive (d=2.45). ... I segmented Latin texts by sentence.
For Latin, do you mean "sentence" or "paragraph"? Either way, that would explain the very high J scores for the control texts (30% of the words that appear in either or both of two consecutive "segments" occur in both of them). I doubt you can get anywhere close to that level if you take 10 words and the next 10 words (which would be like the Voynichese paragraph text).
And, did you use all the text of the VMS (including labels), or only the paragraph text? Label lines are usually a single word, and their J score is zero. They may be one of the reasons why the VMS J score is so low.
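Just to be explicit about what is being computed: the J score of two consecutive segments is the Jaccard index of their vocabularies. A minimal sketch (my own illustration, with made-up Voynichese-looking tokens, not anyone's actual pipeline):

Code:
def jaccard(seg_a, seg_b):
    """J = |A ∩ B| / |A ∪ B|, where A and B are the sets of distinct tokens."""
    a, b = set(seg_a), set(seg_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# A one-word label "segment" rarely shares its word with the next segment,
# so its J score is usually zero:
print(jaccard(["otedy"], ["daiin", "chedy", "qokeedy", "shedy"]))   # -> 0.0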
Quote:Control texts: Even when truncating Latin strings to match VMS line lengths, the J-index remains significantly higher. I will include a more detailed sensitivity analysis on segment length in the next iteration.
I am looking forward to that. And be sure to consider only paragraph text in the VMS analysis.
But if you reformat the control texts into "segments" of 10 tokens or less, you will get low J scores that are mostly sampling noise. I suggest that instead you ignore line breaks (in the VMS and in the control texts), and split each paragraph into segments with a fixed number N of tokens (at least 50), discarding the trailing segment at the end of each paragraph if it has fewer than N tokens.
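Something along these lines (a rough sketch only, assuming each text is already tokenized into a list of paragraphs, each a list of tokens; N = 50 and all names are just for illustration):

Code:
def jaccard(a, b):
    # same J as in the sketch above, repeated so this snippet runs on its own
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def fixed_segments(paragraphs, n=50):
    """Ignore line breaks; cut each paragraph into consecutive n-token segments,
    discarding the leftover tokens (fewer than n) at the end of each paragraph."""
    segs = []
    for para in paragraphs:
        for i in range(0, len(para) - n + 1, n):
            segs.append(para[i:i + n])
    return segs

def mean_consecutive_j(segs):
    """Average J over all pairs of consecutive segments."""
    scores = [jaccard(a, b) for a, b in zip(segs, segs[1:])]
    return sum(scores) / len(scores) if scores else 0.0

That way the VMS and the control texts are compared on equal footing, and the effect of segment length can be checked simply by varying n.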
Quote:Column C: I used the transcription layer that normalizes certain glyphs to reduce orthographic noise while keeping the token structure intact.
So you mean "the 'C' transcription". That is one of the oldest transcriptions, made from B&W low-resolution photographs and thus probably more contaminated by reading errors. You should use a more modern one. For your analysis it does not matter whether a glyph is encoded in one letter (as in the C transcription) or several letters (as in EVA).
Quote:Hapax legomena: I included all tokens. Potential noise from transcription errors actually works against my hypothesis by lowering J across the board. If I filtered rare tokens, the J-index for natural language would rise faster than for the VMS, making the divergence even more striking.
That is not obvious at all. We don't know how many single-use words there are in the Latin text. In the VMS transcription I have at hand, there are 7171 lexemes, and 4803 of them occur only once in the book. Many of these are likely to be due to errors -- by the Author, the Scribe, or the transcribers -- including omission of word spaces or insertion of bogus spaces.
If you delete the VMS hapaxes from A and B before computing J, that will decrease |A ∪ B| but not change |A ∩ B|, thus increasing the J score.
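Here is a toy example (invented tokens and numbers, just to make the arithmetic visible): a word that occurs only once in the whole book can appear in at most one of the two segments, so it sits in A ∪ B but never in A ∩ B.

Code:
A = {"daiin", "chedy", "qokal", "rare1"}   # "rare1" occurs nowhere else in the book
B = {"daiin", "shedy", "qokal", "rare2"}   # "rare2" likewise
hapaxes = {"rare1", "rare2"}

print(len(A & B) / len(A | B))                      # 2/6 ≈ 0.33
print(len((A - hapaxes) & (B - hapaxes)) /
      len((A - hapaxes) | (B - hapaxes)))           # 2/4 = 0.50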
Quote: the J≈0.08 signature is a hurdle that any linguistic theory must explain.
... provided this index is properly computed for the VMS, and the control texts are compared after reformatting them into segments with the same average number of tokens.
And I insist that you must look at the sets A ∩ B for the VMS and the control texts, and figure out what kinds of words are contributing most to their J scores. If (as I suspect) the control texts get high J scores because of shared articles, prepositions, auxiliary verbs (including "to be"), and other function words, then you must add control texts in languages that do not have such grammatical features, like Turkish, Tibetan, Chinese ...
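A quick way to check that (again only a sketch, reusing the fixed-length segments from above; the function name is mine) is to tally which word types show up in the intersections of consecutive segments:

Code:
from collections import Counter

def intersection_profile(segs):
    """Count how many consecutive-segment intersections each word type appears in."""
    tally = Counter()
    for a, b in zip(segs, segs[1:]):
        tally.update(set(a) & set(b))
    return tally

# For a Latin or English control text, I would expect the top of
# intersection_profile(segs).most_common(20) to be dominated by function words
# ("et", "in", "the", "of", ...); for the VMS, whatever tops that list is what
# is actually driving its J score.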
Quote:I use modern computational tools to assist in data processing and formalizing my findings
"Modern" does not mean "reliable". You cannot trust anything that is produced by an LLM. No matter how you arrived at your algorithms, you are expected to
thoroughly understand what they are computing.
All the best, --stolfi