The Voynich Ninja - Linguistic Patterns Before Decipherment: A Key to Understanding Unknown Texts


(29-04-2025, 10:34 AM)oshfdk Wrote: Sounds fine. Are you looking for some specific feedback?

Yup, in a few. I’m currently preprocessing the EVA text and a control text to obtain thoroughly cleaned corpora. After that, I will extract the cleaned tokens from each and build the vocabularies for further analysis. Then I will double-check against the Takahashi transcription. Then entropy, classification, permutation and ablation...

(29-04-2025, 10:51 AM)Urtx13 Wrote: Actually, I opened a new thread because, although the aim is the same, the approaches are different!

Well, if there are any specific things to discuss, let's hear them. Otherwise, I have a manuscript to try to decode.

STEP 1 - Preprocessing

Following Nab’s comments in the last thread, I decided to preprocess the ZL3a-n transcription again using a new custom script. I find the result quite interesting, and I plan to publish all the steps of the process here.

Earlier studies estimated around 35,000 to 38,000 tokens.
In my case, after strict cleaning and tokenization (keeping only tokens that match [a-z]{1,10}, with internal comments removed), I obtain 46,675 tokens and 8,421 unique tokens.

This is not an error as such; it mainly reflects differences in the transcription version (ZL3a-n is more complete), as well as stricter token extraction and the inclusion of all transcribed material, including marginalia and previously excluded elements.

I’ll continue from this clean base to explore distributions and internal structure, following a step-by-step reproducible pipeline.
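The extraction described above can be sketched roughly as follows. This is a minimal illustration and not the actual script: the function name `extract_tokens` and the sample lines are hypothetical, and it assumes locus identifiers and any other markup have already been removed from each line.

```python
import re
from collections import Counter

COMMENT_RE = re.compile(r"<![^>]*>")   # inline comments of the form <! ... >
TOKEN_RE = re.compile(r"[a-z]{1,10}")  # runs of 1 to 10 lowercase letters

def extract_tokens(lines):
    """Strip <! ... > comments, then collect every [a-z]{1,10} match."""
    tokens = []
    for line in lines:
        cleaned = COMMENT_RE.sub("", line)
        tokens.extend(TOKEN_RE.findall(cleaned))
    return tokens

# Hypothetical sample lines, for illustration only.
sample = ["fachys.ykal.ar<!a comment>.ataiin", "shol.shory"]
tokens = extract_tokens(sample)
vocab = Counter(tokens)
print(len(tokens), len(vocab))  # total tokens, unique tokens
```

Note that a pattern like this matches lowercase letters wherever they occur on the line, so anything that is not stripped beforehand ends up in the counts.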

(29-04-2025, 12:10 PM)Urtx13 Wrote: Earlier studies estimated around 35,000 to 38,000 tokens.
In my case, after strict cleaning and tokenization (keeping only tokens that match [a-z]{1,10}, with internal comments removed), I obtain 46,675 tokens and 8,421 unique tokens.

The ZL transliteration has uncertain word spaces (","), extended EVA ("@") codes, alternative readings ("[:]"), apostrophes (" ' ") and illegible characters ("?") that are not generally accepted as word separators.

RF1a-n has 38510 words with "," interpreted as a space, 37851 words without.
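To see why the treatment of "," alone shifts the word count, here is a small sketch. It assumes "." marks a certain word space and "," an uncertain one; the helper `count_words` and the sample line are hypothetical and only illustrate the two counting policies.

```python
def count_words(text, comma_is_space):
    """Count '.'-separated words, optionally treating ',' as a separator too."""
    if comma_is_space:
        text = text.replace(",", ".")   # uncertain space counts as a space
    else:
        text = text.replace(",", "")    # uncertain space is ignored
    return len([w for w in text.split(".") if w])

# Hypothetical line, for illustration only.
line = "daiin,shey.qokeedy.ol,chedy"
print(count_words(line, comma_is_space=True))   # 5
print(count_words(line, comma_is_space=False))  # 3
```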

(29-04-2025, 12:29 PM)nablator Wrote:
(29-04-2025, 12:10 PM)Urtx13 Wrote: Earlier studies estimated around 35,000 to 38,000 tokens.
In my case, after strict cleaning and tokenization (keeping only tokens that match [a-z]{1,10}, with internal comments removed), I obtain 46,675 tokens and 8,421 unique tokens.

The ZL transliteration has uncertain word spaces ",", extended EVA "@" codes and illegible characters "?" that are not generally accepted as word separators.

RF1a-n has 38510 words with "," interpreted as a space, 37851 words without.

Nab! Thank you, as always, for your kind feedback.

My preprocessing does not treat commas (,) as word separators, nor does it attempt to recover uncertain or illegible characters (@, ?).

The code strictly extracts only valid [a-z]{1,10} tokens, after removing internal comments (<! ... >) from the lines.

This conservative and reproducible approach means:
-No “guesses” about ambiguous spaces.
-No special handling of uncertain or damaged transcriptions.

As a result, the token count is naturally higher because the ZL3a-n transcription includes more material than earlier versions (such as RF1a-n), and no tokens are artificially excluded unless they fall outside strict [a-z]{1,10} matching.

What do you think?

(29-04-2025, 12:39 PM)Urtx13 Wrote: What do you think?

"a@123;b" is not a valid [a-z]{1,10} token so this word will be split in two: "a" and "b".

Also alternative readings ("[:]") and apostrophes (" ' ") will split some words.
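The splitting is easy to reproduce. In this small sketch, `naive_split` is a hypothetical helper name, and the second and third sample words are made up to show the effect of an alternative reading and an apostrophe.

```python
import re

TOKEN_RE = re.compile(r"[a-z]{1,10}")

def naive_split(word):
    """Return every [a-z]{1,10} run found inside a single transcribed word."""
    return TOKEN_RE.findall(word)

for word in ["a@123;b", "ch[a:o]l", "o'r"]:
    print(word, "->", naive_split(word))
# a@123;b -> ['a', 'b']
# ch[a:o]l -> ['ch', 'a', 'o', 'l']
# o'r -> ['o', 'r']
```

In each case one word in the transliteration becomes two or more tokens, which inflates the total count.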

(29-04-2025, 12:39 PM)Urtx13 Wrote: because the ZL3a-n transcription includes more material than earlier versions (such as RF1a-n),

This is not correct. ZL3a existed before RF1a. It was an input to the creation of RF1.
Both cover exactly the same amount of transcribed text. Only GC has slightly less.

This very much looks like a parsing issue, and I can only advise using ivtt to help with such issues.

(29-04-2025, 12:51 PM)nablator Wrote: "a@123;b" is not a valid [a-z]{1,10} token so this word will be split in two: "a" and "b".

To @urtx13:

Note that @123; is just a low-ASCII way to describe a single character with ASCII value 123 (decimal), which is a rare character shape.
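One way to keep such a word in one piece is to turn each code into a single character before tokenizing. The sketch below is only an illustration: `decode_extended` is a hypothetical helper, it assumes every code has the form "@" + decimal digits + ";", and using `chr()` is just one possible one-to-one mapping.

```python
import re

EXT_RE = re.compile(r"@(\d+);")  # assumed shape of an extended code: @123;

def decode_extended(word):
    """Replace each @ddd; code with the single character whose code is ddd."""
    return EXT_RE.sub(lambda m: chr(int(m.group(1))), word)

decoded = decode_extended("a@123;b")
print(decoded, len(decoded))  # a{b 3
```

The result is one three-character word, so a tokenizer that accepts the extra character no longer splits it into "a" and "b".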

(29-04-2025, 01:07 PM)ReneZ Wrote:
(29-04-2025, 12:51 PM)nablator Wrote: "a@123;b" is not a valid [a-z]{1,10} token so this word will be split in two: "a" and "b".

To @urtx13:

Note that @123; is just a low-ASCII way to describe a single character with ASCII value 123 (decimal), which is a rare character shape.

Oh, true! Thank you very much for your clarification regarding the relationship between ZL3a and RF1a. I completely understand now. You are right: ZL3a is the original input, and the amount of transcribed text is the same for both versions, with only GC being slightly reduced. I appreciate you pointing this out so clearly.

Regarding the parsing issue:
You are also correct that @123 represents a single character (ASCII 123) in the transcription.
In my case, my processing pipeline deliberately applies a rigorous filter ([a-z]{1,10}) to isolate only “regular” tokens, which are composed of standard lowercase letters, for a specific type of linguistic analysis focused on text rhythm and internal structure.

The Token ID numbers we assign (0, 1, 2, etc.) are entirely artificial and have no internal meaning in the original text.
They are simply keys that allow us to:
-Reference the words easily,
-Reconstruct the text,
-Build statistical models.

What matters for me in finding real patterns (macroblocks related to the three unities) is the order of the words or tokens, not the number we assign to them.
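The role of the token IDs can be sketched like this. The names `build_vocab`, `ids` and `restored` are hypothetical and the token list is made up; the point is only that the IDs are arbitrary keys and that the sequence order survives the round trip.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

# Hypothetical token sequence, for illustration only.
tokens = ["daiin", "ol", "daiin", "chedy"]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]               # encoded sequence, order kept
inverse = {i: t for t, i in vocab.items()}
restored = [inverse[i] for i in ids]           # decoded back to the tokens
print(ids)                 # [0, 1, 0, 2]
print(restored == tokens)  # True
```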

Of course, this means that characters like @123 get split or discarded, by design, to simplify the analysis of letter-only word patterns, rather than fully reconstructing the text in its exact encoded form. And it is pretty interesting.

Thank you again for your advice about using IVTT. I was already using it!

What about alternative readings ("[:]")? For example, f1 line 1 contains [cth:oto].

As nablator says, with a regex of [a-z]{1,10} you will get two words, 'cth' and 'oto', and one of those does not exist.

Similarly with @ letters and words containing '?': a word like "@130;tol" reduces to "tol", which is not the original word and inflates your count of "tol", which would then be wrong.

If you are using words and counts to find patterns, as with TF-IDF, then your results will contain statistical artefacts that are not present in the VMS.
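A small sketch of how the counts drift apart. The `normalize` helper is hypothetical, and its policy (keep the first alternative reading, drop any word containing '?', leave @ codes attached to their word) is an assumption for illustration, not an agreed convention; "d?iin" is a made-up example.

```python
import re
from collections import Counter

ALT_RE = re.compile(r"\[([^:\]]*):[^\]]*\]")  # [first:second...] alternatives

def normalize(word):
    """Return the word with first readings kept, or None if it contains '?'."""
    if "?" in word:
        return None
    return ALT_RE.sub(r"\1", word)

words = ["[cth:oto]", "tol", "@130;tol", "d?iin"]
naive = Counter(t for w in words for t in re.findall(r"[a-z]{1,10}", w))
careful = Counter(n for n in map(normalize, words) if n is not None)
print(naive["tol"], careful["tol"])  # 2 1
print(naive["oto"], careful["oto"])  # 1 0
```

The naive pass counts "tol" twice and invents an "oto", while the more careful pass keeps "@130;tol" as its own word type.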

If you've been using Takahashi, then parsing ZL is a world of pain away.

There's a whole PDF about it,
and a brief explanation here: "voynich<dot>nu/transcr.html -> Common transliteration file format".
