(5 hours ago) nablator Wrote:
(6 hours ago) Urtx13 Wrote: Total number of tokens (cleaned EVA): 45,037
Too many, something is wrong.
Interesting!
Thanks for your input. As always, I appreciate nice collaborative answers.
I’m a bit confused and would love to understand your comment better. You mentioned that 45,037 tokens (after cleaning the EVA transcription) seem too high and suggested “something is wrong.” Could I ask what token count you’re expecting instead? Are you perhaps referring to the ~38,000 figure often cited by Stolfi for the number of tokens in the Voynich manuscript?
Let me briefly explain what my preprocessing code does, just in case there’s a misunderstanding:
What the code does during EVA preprocessing (a simplified sketch follows this list):
1. Reads from the ZL3a-n.txt transcription (not RF, not Takahashi).
2. Removes inline comments like <!...>.
3. Handles alternate readings like [cth:oto] → we keep only the first option (cth).
4. Removes corrupted/uncertain tokens containing ?.
5. Ignores ASCII codes like @123; (these are filtered out).
6. Uses a regex pattern to extract only lowercase words of 1 to 10 letters (no punctuation or numbers).
7. Generates:
- A tokenized version per folio.
- A vocabulary with unique tokens and IDs.
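In case the filtering itself is where we disagree, here is a simplified sketch of what the preprocessing does. The regexes, the folio-locator handling, and the default file name are illustrative approximations of my script, not the exact code I run:

[code]
import re
from collections import OrderedDict

def clean_eva(path="ZL3a-n.txt"):
    """Simplified sketch of the filtering steps listed above."""
    tokens_per_folio = {}
    folio = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Track the current folio from locator tags such as <f1r.P.1;H>
            # (simplified: only the page part is kept).
            loc = re.match(r"<(f\d+[rv]\d*)", line)
            if loc:
                folio = loc.group(1)
            # 2. Remove inline comments like <!...>
            line = re.sub(r"<![^>]*>", "", line)
            # 3. Alternate readings [cth:oto] -> keep only the first option
            line = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", line)
            # 5. Drop ASCII codes like @123;
            line = re.sub(r"@\d+;", "", line)
            # Split on EVA word separators and whitespace.
            for tok in re.split(r"[.,\s]+", line):
                # 4. Skip corrupted/uncertain tokens containing '?'
                if "?" in tok:
                    continue
                # 6. Keep only lowercase words of 1 to 10 letters
                if re.fullmatch(r"[a-z]{1,10}", tok):
                    tokens_per_folio.setdefault(folio, []).append(tok)
    # 7. Vocabulary: unique tokens mapped to integer IDs
    vocab = OrderedDict()
    for toks in tokens_per_folio.values():
        for tok in toks:
            vocab.setdefault(tok, len(vocab))
    return tokens_per_folio, vocab

tokens_per_folio, vocab = clean_eva()
print(sum(len(t) for t in tokens_per_folio.values()), "tokens,", len(vocab), "unique")
[/code]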
After applying these filters, the remaining counts are exactly:

Total tokens (cleaned EVA): 45,037 / 42,856
Unique tokens: 5,067 / 8,446
This feels reasonable to me (we’re removing only ~2.6% of the raw tokens), but maybe I’m missing something?
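For reference, the ~2.6% figure is just (raw − cleaned) / raw. Since I didn't quote the raw pre-filter count above, the snippet below only back-calculates what that raw count would have to be; the implied value is derived from the percentage, not a separately measured number:

[code]
cleaned = 45_037                      # cleaned token count reported above
removed_fraction = 0.026              # the ~2.6% removal claim
implied_raw = cleaned / (1 - removed_fraction)
print(round(implied_raw))             # ~46,239 raw tokens implied by the claim
[/code]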