(Yesterday, 01:07 PM)ReneZ Wrote: (Yesterday, 12:51 PM)nablator Wrote: "a@123;b" is not a valid [a-z]{1,10} token so this word will be split in two: "a" and "b".
To @urtx13:
Note that @123; is just a low-Ascii way to describe a single character with Ascii value 123 (decimal), which is a rare character shape.
Oh, true! Thank you very much for your clarification regarding the relationship between ZL3a and RF1a. I completely understand now. You are right: ZL3a is the original input, and the amount of transcribed text is the same for both versions, with only GC being slightly reduced. I appreciate you pointing this out so clearly.
Regarding the parsing issue:
You are also correct that @123; represents a single character (ASCII value 123) in the transcription.
In my case, my processing pipeline deliberately applies a rigorous filter ([a-z]{1,10}) to isolate only “regular” tokens, which are composed of standard lowercase letters, for a specific type of linguistic analysis focused on text rhythm and internal structure.
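To make the effect of this filter concrete, here is a minimal Python sketch. It assumes the filter is applied as a regex search over each line; the name regular_tokens and the sample inputs are hypothetical, and my actual pipeline may differ in detail.

```python
import re

# The letter-only filter discussed above: runs of 1 to 10 lowercase letters.
TOKEN_RE = re.compile(r"[a-z]{1,10}")

def regular_tokens(line):
    """Return every maximal run of lowercase letters, capped at 10 characters."""
    return TOKEN_RE.findall(line)

# The example from the quoted post: the @123; code is not a letter,
# so the word breaks into two tokens around it.
print(regular_tokens("a@123;b"))              # ['a', 'b']

# Illustrative input with "." as the word separator.
print(regular_tokens("daiin.a@123;b.chedy"))  # ['daiin', 'a', 'b', 'chedy']

# Side effect of the {1,10} cap: a longer run is cut into chunks.
print(regular_tokens("abcdefghijkl"))         # ['abcdefghij', 'kl']
```

Note that with this approach nothing marks "a" and "b" as fragments of one original word, which is exactly the splitting described in the quote.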
The Token ID numbers we assign (0, 1, 2, etc.) are entirely artificial and have no internal meaning in the original text.
They are simply keys that allow us to:
-Reference the words easily,
-Reconstruct the text,
-Build statistical models.
What matters for me in finding real patterns, such as macroblocks related to the three unities, is the order of the words or tokens, not the numbers we assign to them.
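As a rough illustration of why the ID numbers carry no meaning, here is a small Python sketch. The function names (build_vocab, encode, decode) and the sample words are hypothetical, and IDs are assumed to be assigned in order of first appearance.

```python
def build_vocab(tokens):
    """Assign each distinct token an arbitrary ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its ID, preserving the sequence order."""
    return [vocab[tok] for tok in tokens]

def decode(ids, vocab):
    """Reconstruct the token sequence from IDs using the inverse mapping."""
    inverse = {i: tok for tok, i in vocab.items()}
    return [inverse[i] for i in ids]

tokens = ["daiin", "chedy", "daiin", "ol"]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)                          # [0, 1, 0, 2]
print(decode(ids, vocab) == tokens) # True

# Renumbering the IDs changes the labels but not the sequence structure:
# the repeated word still repeats at positions 0 and 2.
relabel = {0: 7, 1: 3, 2: 5}
print([relabel[i] for i in ids])    # [7, 3, 7, 5]
```

Any one-to-one relabeling gives the same pattern of repeats and the same word order, so order-based statistics are unaffected by which numbers are chosen.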
Of course, this means that characters like @123; cause words to be split, and the code itself is discarded. This is by design: the goal is to simplify the analysis of letter-only word patterns, not to reconstruct the text in its exact encoded form. And it is pretty interesting.
Thank you again for your advice about using IVTT. I was already using it!