New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - Printable Version

The Voynich Ninja (https://www.voynich.ninja), Thread: /thread-5209.html
New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - guidoperez - 04-01-2026

Hi everyone,

I’ve just published a preprint on Zenodo regarding a structural analysis of the Voynich Manuscript. Instead of a traditional linguistic approach, I analyze the MS as a Diagrammatic Information System (DIS).

My main finding is a systematic suppression of lexical continuity between adjacent segments. Using a full-manuscript quantitative analysis, I found a mean Jaccard similarity of 0.0807. This indicates that lexical resets are the global structural norm, with 67% of all comparisons falling below 0.10.

I’ve compared these values against medieval Latin technical baselines (such as De Materia Medica and Herbarius), finding an effect size (Cohen's d) of 2.45. This suggests a modular, non-linear information architecture where text functions as locally bound parameters rather than continuous prose.

The study operates at a global structural scale and does not attempt to model local text generation mechanisms, but it provides empirical constraints for future models.

You can find the full paper and data here: zenodo.org/records/18147517

I look forward to your thoughts and technical feedback.

Best regards,
Guido Javier Pérez

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - tavie - 04-01-2026

This looks like it has been written by an LLM. I don't understand what you are saying in the paper. Do you?

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - Koen G - 04-01-2026

Also the paper does this weird thing where it has a numbered list of references but doesn't actually refer to them. The abstract is almost as long as the paper itself. I don't understand a single point that's being made.

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - Jorge_Stolfi - 04-01-2026

(04-01-2026, 08:57 PM) guidoperez Wrote: I’ve just published a preprint on Zenodo regarding a structural analysis of the Voynich Manuscript. Instead of a traditional linguistic approach, I analyze the MS as a Diagrammatic Information System (DIS). You can find the full paper and data here: zenodo.org/records/18147517

Could you please define "segment" more clearly? Do you really mean a single line? Did you use all the text (including labels -- each a separate segment?), or only the multiline paragraphs?

Each VMS paragraph line has ~10 tokens on average, so the similarity index between two lines is likely to be very low, down in the sampling noise level; and lexical resets should happen at the majority of line breaks.

What is "transcription column C"?

Every transcription contains some fraction of errors that create new lexemes that occur only once or twice in the whole book. How do you handle that? Do you take all words, or only the words that occur at least N times?

Shouldn't you use a metric of similarity that takes into account the number of occurrences of each lexeme in each segment?

How did you define the segments in the two control texts (Appendix)? How many tokens per segment did they have? You are aware, I hope, that the longer the segments are, the higher their J similarity scores will be.

Have you looked at the sets A \cup B of these control texts, to see what kind of words contributed most to their high J index? If they are articles or other function words, you should include control texts in languages that do not have such grammatical features (which we know are absent in Voynichese).

All the best, --stolfi

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - guidoperez - 04-01-2026

Thank you all for the feedback, especially to Jorge Stolfi for the detailed technical critique. To address the main points:

1) On Methodology:
3) Regarding the writing style and format: I am an independent researcher and I use modern computational tools to assist in data processing and formalizing my findings. I appreciate the observation regarding the bibliography; the missing in-text citations for the numbered references will be corrected in a new version.

I don’t claim to have "solved" the manuscript. I am presenting a reproducible statistical constraint. I invite anyone to download the transcription, run a Jaccard analysis with different segmentations, and share if they find a cohesion level comparable to natural language.

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - nablator - 04-01-2026

Hi,

I guess the C layer is the Currier part of the transliteration.

Quote: # C: Currier's transcription plus new additions from members of the

"C lines" would be more understandable than "transcription column C", because they are found in specific lines of the text file. However, the N column (number of lines?) doesn't match the number of lines in the LSI file, so I don't know what N or C is either...

The analysis would make more sense if the length of lines was normalized somehow (lines could be truncated or wrapped at xx columns), instead of kept as it is in the transliteration file. Of course the "mean J" results depend a lot on the length of the lines: you can't compare sections that have a very different average (and distribution) of line lengths. In the Moving Average Type Token Ratio statistics, for example, the size of the rolling window matters.

Comparing the mean J value of sections of the VMS to "Medieval Latin Baselines" makes no sense because there is no baseline, and Voynichese is not Latin. Every manuscript is different; the length of the lines is not a constant. Some scribes write big letters, some write small letters. Herbals sometimes use the available space around the illustrations, sometimes the full width of the page, like the VMS.
Some scribes write words more abbreviated than others. The number of words per line varies.

Edit after reading your last post:

Quote: Even when truncating Latin strings to match VMS line lengths, the J-index remains significantly higher. I will include a more detailed sensitivity analysis on segment length in the next iteration.

Yes, that would be the first thing to do. Then try other languages if you want to claim that the VMS is outside the range of natural languages.

Your statistical results do not require a "reset" at each line, whatever that means; this is completely unjustified. "DIS" or "modular architecture" is not a hypothesis without some precise definition and an explanation of why you think that it would explain anything.

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - tavie - 05-01-2026

(04-01-2026, 10:17 PM) guidoperez Wrote: 3) Regarding the writing style and format: I am an independent researcher and I use modern computational tools to assist in data processing and formalizing my findings.

If by this you mean LLMs, work assisted by them is prohibited here. The current models produce nonsense that no one can understand, especially those prompting them. Your paper looks suspiciously like all the past ones we've seen, and all your responses here are written by an LLM.

Can you explain your paper in your own words, in simple and non-technical terms?

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - guidoperez - 05-01-2026

(05-01-2026, 12:06 AM) tavie Wrote: (04-01-2026, 10:17 PM) guidoperez Wrote: 3) Regarding the writing style and format: I am an independent researcher and I use modern computational tools to assist in data processing and formalizing my findings.

I compared the shared words between lines. In normal books around 30% of the words are shared; in Voynich it's around 8%, so Voynich lines are very independent.

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - nablator - 05-01-2026

(05-01-2026, 12:24 AM) guidoperez Wrote: In normal books

What is a normal book? A printed book, pocket or larger format? Why would it be relevant for comparison to a manuscript?

RE: New Research: Quantitative analysis of lexical independence in MS 408 (DIS Model) - guidoperez - 05-01-2026

(05-01-2026, 12:44 AM) nablator Wrote: (05-01-2026, 12:24 AM) guidoperez Wrote: In normal books

I meant books that in principle are "similar": same period, technical, herbal/medicinal structure, with different entries that have coherent text. I understand the numbers shouldn't be so different if Voynich behaved in the same way. In Voynich each line has new "words" at a much higher rate than in "similar" technical books.
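
For anyone who wants to try the comparison discussed in this thread, here is a minimal Python sketch. It is not the code from the preprint. It assumes that each segment is simply a list of whitespace-separated tokens and that similarity is the plain type-based Jaccard index, and all function names (`jaccard`, `adjacent_jaccard`, `resegment`, `summarize`, `cohens_d`) are hypothetical. The `resegment` helper is there to test the point raised about segment length: it cuts the token stream into equal-size windows so that two texts can be compared at the same number of tokens per segment. The toy data is made up and is not taken from any transliteration.

```python
from statistics import mean, stdev


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| over word types (token counts are ignored)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def adjacent_jaccard(segments):
    """Jaccard score for every pair of consecutive segments."""
    return [jaccard(x, y) for x, y in zip(segments, segments[1:])]


def resegment(segments, size):
    """Flatten the segments and cut the token stream into windows of `size` tokens.

    A trailing partial window is dropped so that all windows have equal length.
    """
    tokens = [tok for seg in segments for tok in seg]
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def summarize(scores, threshold=0.10):
    """Mean score and the share of comparisons below `threshold`."""
    return mean(scores), sum(s < threshold for s in scores) / len(scores)


def cohens_d(xs, ys):
    """Cohen's d with a pooled standard deviation (each sample needs >= 2 values)."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / pooled_var ** 0.5


# Made-up toy segments (assumption: one list of tokens per line).
toy = [
    ["a", "b", "c", "d"],
    ["c", "a", "e", "f"],
    ["g", "h", "b", "i"],
]

line_scores = adjacent_jaccard(toy)                  # one score per line break
window_scores = adjacent_jaccard(resegment(toy, 6))  # same text, 6-token windows
print(summarize(line_scores))
print(window_scores)
```

To compare two texts, one would compute `adjacent_jaccard` on each after calling `resegment` with the same `size`, then pass the two score lists to `cohens_d`. Repeating this for several window sizes, and for control texts in more than one language, would show how much of any difference is due to segment length alone. Note that this sketch ignores word frequencies, so it does not address the question of a count-weighted similarity metric.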