Linguistic Patterns Before Decipherment: A Key to Understanding Unknown Texts

Urtx13 > 9 hours ago

STEP 1.2

After feedback, I’ve revised the preprocessing step to reflect the structure and meaning of the ZL3a transcription more accurately. The new pipeline fixes several issues that could have introduced spurious artifacts into token-based analyses like TF-IDF or topic modeling.

Key Fixes Implemented
1. Special ASCII codes like @123;
These represent single glyphs and are not actual words.
Fix: They are now completely ignored during tokenization.
2. Alternative readings like [cth:oto]
These indicate transcription uncertainty between multiple candidates.
Fix: Only the first token (e.g., cth) is retained; the rest is ignored.
3. Tokens with ? characters
These represent uncertain or corrupted readings.
Fix: Entire tokens containing ? are excluded from the vocabulary.
4. Inline comments like <!...>
These were already being filtered out, and continue to be excluded correctly.

Resulting Token Statistics (after filtering):
-Total number of tokens (cleaned EVA): 45,037
-Unique tokens in final vocabulary: 8,542

nablator > 8 hours ago

(9 hours ago) Urtx13 Wrote: -Total number of tokens (cleaned EVA): 45,037

Too many, something is wrong.

Urtx13 > 8 hours ago

(8 hours ago) nablator Wrote:
(9 hours ago) Urtx13 Wrote: -Total number of tokens (cleaned EVA): 45,037

Too many, something is wrong.

Interesting!

Thanks for your input. As always, I appreciate nice collaborative answers.

I’m a bit confused and would love to understand your comment better. You mentioned that 45,037 tokens (after cleaning the EVA transcription) seem too high and suggested “something is wrong.” Could I ask what token count you’re expecting instead? Are you perhaps referring to the ~38,000 figure often cited by Stolfi for the number of tokens in the Voynich manuscript?

Let me briefly explain what my preprocessing code does, just in case there’s a misunderstanding:

What the code does during EVA preprocessing:
1. Reads from the ZL3a-n.txt transcription (not RF, not Takahashi).
2. Removes inline comments like <!...>.
3. Handles alternate readings like [cth:oto] → we keep only the first option (cth).
4. Removes corrupted/uncertain tokens containing ?.
5. Ignores ASCII codes like @123; (these are filtered out).
6. Uses a regex pattern to extract only lowercase words of 1 to 10 letters (no punctuation or numbers).
7. Generates:
-A tokenized version per folio.
-A vocabulary with unique tokens and IDs.
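
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the actual code behind the numbers in this thread: the regular expressions, the function names (`clean_line`, `tokenize_pages`, `build_vocabulary`), the choice to split on both `.` and `,`, and the choice to strip an `@123;` code while keeping the rest of the word are all assumptions. The sample lines are hypothetical except for the f1r.19 line, which is quoted elsewhere in this thread.

```python
import re
from collections import defaultdict

# Patterns are assumptions modelled on the examples quoted in this thread.
LOCUS = re.compile(r"^<(f[^.>]+)\.[^>]*>")          # <f1r.19,+P0> -> folio "f1r"
COMMENT = re.compile(r"<![^>]*>")                   # step 2: inline comments <!...>
MARKUP = re.compile(r"<[^>]*>")                     # other tags such as <%>
ALTERNATIVE = re.compile(r"\[([^:\]]*):[^\]]*\]")   # step 3: [cth:oto] -> cth
GLYPH_CODE = re.compile(r"@\d+;")                   # step 5: codes such as @123;
WORD = re.compile(r"[a-z]{1,10}")                   # step 6: 1 to 10 lowercase letters


def clean_line(text):
    """Return the cleaned tokens of one locus (the text after the locus tag)."""
    text = COMMENT.sub("", text)
    text = MARKUP.sub("", text)
    text = ALTERNATIVE.sub(r"\1", text)
    tokens = []
    for raw in re.split(r"[.,\s]+", text):  # assumption: '.' and ',' both split words
        if not raw or "?" in raw:           # step 4: drop uncertain tokens
            continue
        raw = GLYPH_CODE.sub("", raw)       # assumption: drop the code, keep the rest
        if WORD.fullmatch(raw):             # anything else (e.g. {ct}) is dropped
            tokens.append(raw)
    return tokens


def tokenize_pages(lines):
    """Step 7a: map each folio to its list of cleaned tokens."""
    pages = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # full comment lines
            continue
        match = LOCUS.match(line)
        if match is None:                      # page headers, anything without a locus
            continue
        pages[match.group(1)].extend(clean_line(line[match.end():]))
    return dict(pages)


def build_vocabulary(pages):
    """Step 7b: assign an integer ID to each unique token, in order of appearance."""
    vocab = {}
    for tokens in pages.values():
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


SAMPLE = [
    "# hypothetical full-line comment",
    "<f1r>  <! hypothetical page header>",
    "<f1r.19,+P0> dchar.shcthaiin.okaiir.chey.@192;chy.@130;tol.cthols.dlo{ct}o",
    "<f1v.1,@P0> daiin.o?al.[cth:oto]res.daiin",   # hypothetical line
]

if __name__ == "__main__":
    pages = tokenize_pages(SAMPLE)
    vocab = build_vocabulary(pages)
    print("total tokens:", sum(len(t) for t in pages.values()))
    print("unique tokens:", len(vocab))
```

Counting only what ends up in `pages` keeps the total independent of any other dataset held in memory, and printing the per-folio counts next to the total makes it easier to spot where an unexpected surplus comes from.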

After applying these filters, the number of remaining tokens is exactly:

Total tokens (cleaned EVA): 42,856 (edited; previously 45,037)
Unique tokens: 8,446 (edited; previously 5,067)
This feels reasonable to me (we’re removing only ~2.6% of the raw tokens), but maybe I’m missing something?

Urtx13 > 1 hour ago

STEP 2

Until I get an answer about the total number of tokens, I've constructed three structured random control datasets. These controls mimic specific properties of the EVA text while removing its internal structure and semantic content.

CONTROL2: Randomized Vocabulary

A new vocabulary is generated by creating random strings that match the exact lengths of the original EVA tokens. For example, if an EVA word has six characters, its corresponding CONTROL2 token will also be six characters, but composed of randomly chosen lowercase letters. Each TokenID from the original EVA vocabulary is mapped one-to-one to one of these artificial tokens. The resulting pages preserve the same page structure and token IDs per page, but with completely random token forms.
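
A minimal sketch of how such a control could be built is shown below. The function name `make_control2`, the data layout (a `vocab` dict from token to TokenID and a `pages` dict from folio to a list of TokenIDs), the fixed seed, and the rejection loop that keeps the mapping one-to-one are all assumptions made for illustration.

```python
import random
import string


def make_control2(vocab, pages, seed=0):
    """Replace every token by a random lowercase string of the same length.

    vocab: dict mapping token -> TokenID (hypothetical layout)
    pages: dict mapping folio -> list of TokenIDs (hypothetical layout)
    """
    rng = random.Random(seed)          # fixed seed so the control is reproducible
    random_form = {}                   # TokenID -> artificial token
    used = set()
    for token, token_id in vocab.items():
        # Redraw on collision so two different TokenIDs never share a form.
        while True:
            candidate = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(len(token))
            )
            if candidate not in used:
                break
        used.add(candidate)
        random_form[token_id] = candidate
    # Same pages, same TokenID sequence, new surface forms.
    return {folio: [random_form[i] for i in ids] for folio, ids in pages.items()}


if __name__ == "__main__":
    demo_vocab = {"daiin": 0, "chy": 1, "ol": 2}          # hypothetical data
    demo_pages = {"f1r": [0, 1, 0, 2], "f1v": [2, 2, 1]}  # hypothetical data
    print(make_control2(demo_vocab, demo_pages))
```

Because only the surface forms change, any analysis that works on TokenIDs alone (such as a bag-of-words topic model) would see this control as identical to EVA; it differs only for analyses that look inside the tokens.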

CONTROL3: Generic Token Pages

This control retains the same number of tokens per page as in the EVA dataset. However, instead of using any EVA vocabulary, it uses a generic pool of 2,000 artificial words (e.g., w1, w2, …, w2000). Tokens are randomly drawn from this pool with replacement, ensuring that the length and granularity of each page match EVA, but the vocabulary is entirely independent and non-semantic.
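
As a rough sketch, assuming the draw is uniform over the pool (the description above does not say which distribution is used) and that page lengths are available as a dict from folio to token count, the control could be generated like this. The name `make_control3` and the seed are illustrative.

```python
import random


def make_control3(page_lengths, pool_size=2000, seed=0):
    """Fill each page with tokens drawn with replacement from w1..w{pool_size}.

    page_lengths: dict mapping folio -> number of tokens on that page in EVA
    (hypothetical layout).
    """
    rng = random.Random(seed)
    pool = [f"w{i}" for i in range(1, pool_size + 1)]
    # random.Random.choices samples with replacement; uniform weights are an assumption.
    return {folio: rng.choices(pool, k=n) for folio, n in page_lengths.items()}


if __name__ == "__main__":
    demo_lengths = {"f1r": 5, "f1v": 3}   # hypothetical page lengths
    print(make_control3(demo_lengths))
```

A uniform draw gives a flat frequency profile, unlike the skewed word frequencies of a natural text, so this control tests page length and granularity only.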

CONTROL4: Shuffled TokenIDs

This dataset preserves the overall frequency distribution of the EVA vocabulary. We extract all TokenIDs used in EVA and shuffle them. Then we reassign them to pages using the same token count per page as in EVA. In other words, the “words” remain the same, and their frequencies across the dataset are preserved, but their positions and associations across pages are randomized.
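
A minimal sketch of this shuffle, assuming the pages are held as a dict from folio to a list of TokenIDs (the name `make_control4` and the seed are illustrative):

```python
import random


def make_control4(pages, seed=0):
    """Shuffle all TokenIDs across the corpus, keeping each page's token count.

    pages: dict mapping folio -> list of TokenIDs (hypothetical layout).
    """
    rng = random.Random(seed)
    all_ids = [token_id for ids in pages.values() for token_id in ids]
    rng.shuffle(all_ids)                 # global frequencies are untouched
    shuffled, start = {}, 0
    for folio, ids in pages.items():
        # Deal the shuffled IDs back out in page-sized slices.
        shuffled[folio] = all_ids[start:start + len(ids)]
        start += len(ids)
    return shuffled


if __name__ == "__main__":
    demo_pages = {"f1r": [0, 1, 0, 2], "f1v": [2, 2, 1]}   # hypothetical data
    print(make_control4(demo_pages))
```

Since every TokenID occurrence is kept exactly once, the corpus-wide frequency distribution is identical to EVA by construction; only the assignment of tokens to pages changes.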

Why This Matters

These structured controls allow us to isolate which properties of the EVA text drive the observed structure. Each control preserves different features:
-CONTROL2 keeps token lengths but randomizes content
-CONTROL3 matches page lengths but uses unrelated vocabulary
-CONTROL4 preserves vocabulary and frequencies but randomizes order

If the EVA dataset continues to show consistent and significant structure under these comparisons, that supports the hypothesis that the Voynich Manuscript contains a non-random, internally coherent system, possibly thematic or linguistic in nature.

If so, what?

Next Step

The next step will be to apply topic modeling (LDA) to each of these controls, using the same number of topics (4), which matches the prior analysis on EVA. By extracting latent topic distributions per page and comparing them to the EVA results, we will assess whether similar thematic structures emerge in the controls. If they do not, and the structure only appears in EVA, this would be strong evidence of internal organization unique to the original manuscript.

As always, I'll appreciate opinions!

ReneZ > 42 minutes ago

(8 hours ago) nablator Wrote:
(9 hours ago) Urtx13 Wrote: -Total number of tokens (cleaned EVA): 45,037

Too many, something is wrong.

I agree.

Note that:
- reliable removal of <!..> comments can be done by ivtt
- reliable selection of the first option of [ | ] can be done by ivtt
- reliable removal of words including ? can be done by ivtt

Not having to write your own code for that means saving time and removing the risk of bugs.

Handling @123; is a bit trickier. ivtt can convert them to single high-ascii bytes for you, but then you need to include these in your selection of characters. By the way, you can remove 0-9 from your selection of characters.

Alternatively, you can use bitrans to convert them to 'nearest basic Eva'.
Step 1: convert full Eva to STA (can be skipped if you use the STA file in the first place)
Step 2: convert STA to basic Eva (irreversible) using the file "Beva.bit" which is provided

If you start with STA, you can use any of the files, and ZL, GC, RF should give similar statistics.
Some of these statistics are also provided at my web site, in Table 13.

Mauro > 32 minutes ago

(8 hours ago) Urtx13 Wrote: This feels reasonable to me (we’re removing only ~2.6% of the raw tokens), but maybe I’m missing something?

I agree 42852 tokens are too many. Hard to say from here what is not working properly.

Did you check if the [cth:oto] removal procedure recovers the correct tokens? You should get a 'cthres' from line 1
<f1r.1,@P0> <%>fachys.ykal.ar.ataiin.shol.shory.[cth:oto]res.y.kor.sholdy<!doodle: @254;>

.. and an 'oteos' from line four:
<f1r.4,+P0> soiin.oteey.oteo[s:r],roloty.cthiar,daiin.okaiin.or.okan

What does your processing do in cases where curly brackets are found, e.g.:
<f1r.17,+P0> ycho.tchey.chekain.sheo,pshol.dydyd.cthy.dai[{cto}: @194;]y
<f1r.19,+P0> dchar.shcthaiin.okaiir.chey.@192;chy.@130;tol.cthols.dlo{ct}o

Then, I'd check manually a few pages of the cleaned text vs. the original.
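
A check of this kind could be automated along these lines. This is only a sketch: `words_from_locus` is a hypothetical helper, and treating both `.` and `,` as word separators is an assumption. It asserts the two expected words from the lines quoted above and prints what happens to the curly-bracket case, which this sketch deliberately leaves untouched so that it stays visible for inspection.

```python
import re

LINE_1 = "<f1r.1,@P0> <%>fachys.ykal.ar.ataiin.shol.shory.[cth:oto]res.y.kor.sholdy<!doodle: @254;>"
LINE_4 = "<f1r.4,+P0> soiin.oteey.oteo[s:r],roloty.cthiar,daiin.okaiin.or.okan"
LINE_17 = "<f1r.17,+P0> ycho.tchey.chekain.sheo,pshol.dydyd.cthy.dai[{cto}: @194;]y"


def words_from_locus(line):
    """Strip tags, keep the first alternative reading, split into words."""
    text = re.sub(r"<[^>]*>", "", line)                  # locus tag, <%>, <!...>
    text = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", text)  # [a:b] -> a
    return [w for w in re.split(r"[.,\s]+", text) if w]


# Expected words named in the post above.
assert "cthres" in words_from_locus(LINE_1)
assert "oteos" in words_from_locus(LINE_4)

if __name__ == "__main__":
    # Curly brackets are not resolved here; the last word shows what is left over.
    print(words_from_locus(LINE_17)[-1])
```

Keeping such lines as a small regression test means that any later change to the cleaner is checked against known-good words before token counts are recomputed.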

You chose quite a complicated transcription format for your work. Why did you not use RF1a-n, which is simpler to manage (just remove non-word characters and tokens including a '?')? The less time you spend on coding the text cleaner, the more time you'll have for your actual research.

ReneZ > 26 minutes ago

First guess: are you removing lines that start with # (full comment lines)?

Urtx13 > 17 minutes ago

(26 minutes ago) ReneZ Wrote: First guess: are you removing lines that start with # (full comment lines)?

I had already removed the hash and other artifacts. But I just realized the real issue: some words from the control text were accidentally being included in the total token count.

IVTT is truly impressive. In my case, it’s just a bit of healthy stubbornness, like someone trying to crack a hard riddle for fun. I want to build a pipeline that helps me figure it out on my own.

Now, after correcting everything, I’m getting exactly:
-Total tokens: 39,876
-Unique tokens: 7,886

Urtx13 > 12 minutes ago

(32 minutes ago) Mauro Wrote:
You chose quite a complicated transcription format for your work. Why did you not use RF1a-n, which is simpler to manage (just remove non-word characters and tokens including a '?')? The less time you spend on coding the text cleaner, the more time you'll have for your actual research.

Hola!

Thanks for your thoughts. You’re right that RF1a-n would have made things simpler for preprocessing. I agree from a software engineering perspective.

But in this project, I chose to work with ZL3a-n (EVA full) intentionally, not out of convenience, but because:
1. EVA is the community standard for most statistical and linguistic analyses of the Voynich Manuscript, especially in computational papers. Since I’m applying topic modeling, entropy, permutation tests, and other statistical tools, I wanted to remain compatible with existing literature.
2. I wanted complete control over the preprocessing pipeline so that I could experiment with variant handling, bracketed alternatives, special glyphs like @123;, and non-alphabetic tokens. Writing my own parser (now IVTT-like) gave me insight into how different cleaning strategies affect downstream structure.
3. Reproducibility. One of my goals is to offer an end-to-end, reproducible pipeline from raw EVA to statistical validation. Having precise control over every transformation step, even the messy ones, is part of the methodology.
4. Personal challenge. Part of this is pedagogical: this is a research project with a strong exploratory dimension. I want to see how each decision (e.g., what to remove, what to retain) shifts the result.

Still, your advice is spot-on for production-quality workflows. I may eventually re-run the pipeline with RF1a-n or the IVTT output to compare token counts and structural consistency.

Thank you always for your kind answers!