quimqu > Yesterday, 08:31 AM
(Yesterday, 07:38 AM)stopsquark Wrote: Did the Latin text use scribal abbreviations (and if so, how were they tokenized?), or was it first expanded into plaintext? Many folks on this site have pointed out similarities between Voynichese letters and Latin scribal abbreviations, and I'd be very curious whether a Latin text with the sigla tokenized as distinct characters behaves like Voynichese. This would effectively be a verbose substitution cipher, but one leaning strongly on historical convention.
Great work on this! I'm planning to tinker with something similar to try embedding-space alignment across the Currier hands. I'd love to look at your repository, if it's public!
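A minimal sketch of the tokenization stopsquark describes, keeping sigla as distinct symbols instead of expanding them. The sigla inventory, Unicode code points, and token names below are illustrative placeholders, not an established scheme:

Code:
# Keep each scribal abbreviation as a single token instead of expanding
# it, so the tokenized Latin relates to its expanded reading the way a
# verbose substitution relates to plaintext. Three illustrative sigla:
SIGLA = {
    "\u204a": "<ET>",      # Tironian et
    "\u0304": "<MACRON>",  # combining macron (omitted m/n)
    "\ua75d": "<RUM>",     # "rum" abbreviation (code point illustrative)
}

def tokenize(text: str) -> list[str]:
    """Character-level tokens, with each siglum mapped to its own symbol."""
    return [SIGLA.get(ch, ch) for ch in text if not ch.isspace()]

# "dominu" + macron keeps the mark as one token; the expanded "dominum"
# would spell it out, changing the n-gram statistics the model sees.
print(tokenize("dominu\u0304"))  # ['d', 'o', 'm', 'i', 'n', 'u', '<MACRON>']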
quimqu > Yesterday, 09:33 AM
(Yesterday, 01:30 AM)ReneZ Wrote: Interesting!
You may want to try this on a sample of Torsten's generated text...
Working on Torsten Timm generated
10832 words identified.
Training model for Torsten Timm generated
number of parameters: 1.08M
Random initial word: sharo
Entropy of word distribution for 'Torsten Timm generated': 8.7303 bits
Top 10 words by entropy contribution:
word     count   prob       entropy (bits)
chedy     17     0.016983   0.099856
cheedy    16     0.015984   0.095380
char      15     0.014985   0.090814
daiin     15     0.014985   0.090814
ar        11     0.010989   0.071514
dain      11     0.010989   0.071514
air       11     0.010989   0.071514
aiin      10     0.009990   0.066387
ol        10     0.009990   0.066387
ain       10     0.009990   0.066387
✔️ % of 2-grams found in original text: 18.20% (182/1000)
✔️ % of 3-grams found in original text: 0.10% (1/999)
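The statistics above are consistent with plain Shannon entropy over the generated word distribution (each word's contribution being -p·log2(p)) and exact n-gram containment checks. A minimal reconstruction, reflecting my reading of the output rather than quimqu's actual script:

Code:
# Assumes: word probabilities are counts over the ~1001 generated words,
# "entropy contribution" is -p*log2(p), and a generated n-gram counts as
# "found" if the exact word sequence occurs anywhere in the original text.
import math
from collections import Counter

def word_entropy(words: list[str]):
    """Total entropy in bits plus per-word (word, count, prob, contribution) rows."""
    counts = Counter(words)
    n = len(words)
    rows = []
    for w, c in counts.most_common():
        p = c / n
        rows.append((w, c, p, -p * math.log2(p)))
    return sum(r[3] for r in rows), rows

def ngram_overlap(generated: list[str], original: list[str], k: int):
    """How many of the generated k-grams appear verbatim in the original."""
    orig = {tuple(original[i:i + k]) for i in range(len(original) - k + 1)}
    gen = [tuple(generated[i:i + k]) for i in range(len(generated) - k + 1)]
    return sum(g in orig for g in gen), len(gen)

# 1001 generated words yield 1000 bigrams and 999 trigrams, matching the
# denominators in 182/1000 and 1/999 above.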
quimqu > Yesterday, 12:06 PM
(Yesterday, 03:59 AM)Jorge_Stolfi Wrote: It would be interesting to see what your network would do if you mapped every a to o, every s to r, every ch to ee...
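A sketch of that merge as a preprocessing step, taking the suggestion literally. Note that a literal s -> r would also touch the EVA digraph sh, so a real run might want to protect multi-character glyphs first:

Code:
# Collapse suspected allograph pairs before training, so the model sees
# one symbol per class. The pair list is Jorge_Stolfi's suggestion;
# multi-character rules must run before single-character ones.
MERGES = [("ch", "ee"), ("a", "o"), ("s", "r")]

def merge_glyphs(eva_text: str) -> str:
    for src, dst in MERGES:
        eva_text = eva_text.replace(src, dst)
    return eva_text

print(merge_glyphs("daiin.chedy.shol"))  # -> "doiin.eeedy.rhol"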
Jorge_Stolfi > Yesterday, 02:34 PM
(Yesterday, 12:06 PM)quimqu Wrote: In fact, I trained several models with different block sizes (from 1 to 32). Block size sets how many tokens of context the model uses for learning and prediction. As expected, the results are better with block size 1, where the model just echoes bigrams. But with larger blocks (e.g., 8 or more), it has to rely on broader context, and that's where Voynichese starts to diverge from natural languages.
Quote: Here's a sample of matching bigrams and trigrams for block size = 8, for several texts including Voynichese:
[Google Drive link]
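For anyone reproducing the block-size comparison quimqu describes in the quote above: in nanoGPT, the context length is the block_size config value, which can be overridden from the command line. The config file and output directories below are placeholders for the actual setup:

Code:
# Hypothetical sweep over nanoGPT context lengths. block_size caps how
# many preceding tokens each next-token prediction may condition on.
import subprocess

for bs in (1, 2, 4, 8, 16, 32):
    subprocess.run(
        [
            "python", "train.py",
            "config/train_voynich_char.py",   # placeholder config file
            f"--block_size={bs}",
            f"--out_dir=out-voynich-bs{bs}",  # placeholder output dir
        ],
        check=True,
    )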
quimqu > Yesterday, 02:43 PM
(Yesterday, 02:34 PM)Jorge_Stolfi Wrote: Google Drive says I don't have permission to read that link.
From the Voynichese output that you quoted, it seems that many word spaces are missing. I suppose they were marked with a comma ("possible space") instead of a period in the original file. (Or, sadly, not marked at all: it looks like the Scribe sometimes ran words together by accident.) Would you consider trying again with both periods and commas treated as word separators?
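A sketch of the separator handling Jorge suggests, splitting the transliteration on both period and comma. The sample line and its comma placement are illustrative:

Code:
import re

def eva_words(line: str) -> list[str]:
    # Treat "." (word space) and "," ("possible space") alike, collapse
    # runs, and drop empty strings at the line edges.
    return [w for w in re.split(r"[.,]+", line.strip()) if w]

print(eva_words("fachys,ykal.ar.ataiin"))  # ['fachys', 'ykal', 'ar', 'ataiin']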
obelus > Yesterday, 07:24 PM
(02-07-2025, 04:10 PM)quimqu Wrote: I trained several nanoGPT models