quimqu > 03-07-2025, 08:31 AM
(03-07-2025, 07:38 AM)stopsquark Wrote: Did the Latin text use scribal abbreviations (and if so, how were they tokenized?), or was it first expanded into plaintext? Many folks on this site have pointed out the similarities between Voynichese letters and Latin scribal abbreviations, and I'd be very curious whether a Latin text with the sigla tokenized as distinct characters behaves like Voynichese. This would effectively be a verbose substitution cipher, but one leaning strongly on historical convention.
Great work on this! I'm planning to tinker with something similar in order to try embedding-space alignment across Currier hands. I'd love to look at your repository, if it's public!
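For what stopsquark describes, the key choice is whether each siglum stays a single token or gets expanded before tokenization. A minimal sketch of the two options, where the sigla-to-expansion table is purely illustrative (not a full MUFI inventory, and the readings should be checked against the actual transcription conventions used):

```python
# Hypothetical sigla table: each scribal abbreviation mark is one
# character, with an illustrative plaintext expansion.
SIGLA = {
    "\ua751": "per",  # p with stroke through descender
    "\ua753": "pro",  # p with flourish
    "\ua770": "us",   # modifier letter "-us"
    "\u204a": "et",   # Tironian et
}

def tokenize_with_sigla(text):
    """Option 1: every siglum counts as a single distinct token."""
    return list(text)

def expand(text):
    """Option 2: expand every siglum into plaintext before tokenizing."""
    for siglum, reading in SIGLA.items():
        text = text.replace(siglum, reading)
    return text

word = "\ua751sona"              # abbreviated form of "persona"
print(tokenize_with_sigla(word))  # 5 tokens, siglum kept distinct
print(expand(word))               # "persona", 7 plaintext characters
```

Training the same model on both variants of the corpus would show whether siglum-level tokenization moves the statistics closer to Voynichese, as the verbose-substitution idea predicts.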
quimqu > 03-07-2025, 09:33 AM
(03-07-2025, 01:30 AM)ReneZ Wrote: Interesting!
You may want to try this on a sample of Torsten's generated text...
Working on Torsten Timm generated
10832 words identified.
Training model for Torsten Timm generated
number of parameters: 1.08M
Random initial word: sharo
Entropy of word distribution for 'Torsten Timm generated': 8.7303 bits
Top 10 words by entropy contribution:
   word  count      prob   entropy
  chedy     17  0.016983  0.099856
 cheedy     16  0.015984  0.095380
   char     15  0.014985  0.090814
  daiin     15  0.014985  0.090814
     ar     11  0.010989  0.071514
   dain     11  0.010989  0.071514
    air     11  0.010989  0.071514
   aiin     10  0.009990  0.066387
     ol     10  0.009990  0.066387
    ain     10  0.009990  0.066387
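The per-word figures in the table are consistent with contributions of the form -p*log2(p): for chedy, -0.016983 * log2(0.016983) ≈ 0.0999, matching the entropy column. A minimal sketch of that computation (not quimqu's actual script):

```python
import math
from collections import Counter

def entropy_report(words, top=10):
    """Shannon entropy of the word distribution in bits, plus the
    top contributors, each contributing -p * log2(p)."""
    counts = Counter(words)
    n = len(words)
    rows = [(w, c, c / n, -(c / n) * math.log2(c / n))
            for w, c in counts.most_common(top)]
    total = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return total, rows
```

On a four-word sample like ["a", "a", "b", "b"] this gives a total of 1.0 bit with each word type contributing 0.5, which is a quick sanity check for the formula.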
✔️ % of 2-grams found in original text: 18.20% (182/1000)
✔️ % of 3-grams found in original text: 0.10% (1/999)
quimqu > 03-07-2025, 12:06 PM
(03-07-2025, 03:59 AM)Jorge_Stolfi Wrote: It would be interesting to see what your network would do if you mapped every a to o, every s to r, every ch to ee...
![w4mYK5M.png](https://i.imgur.com/w4mYK5M.png)
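Jorge's suggested preprocessing is a straightforward glyph merge applied before training. A sketch of that mapping, restricted to the three pairs he names explicitly (his "..." implies further merges that he would need to specify):

```python
# Merge EVA glyphs that Jorge suggests may be variants of one another.
# Order matters: the digraph "ch" must be replaced before single letters.
MERGES = [("ch", "ee"), ("a", "o"), ("s", "r")]

def merge_glyphs(word):
    """Apply each source->target merge to an EVA word, in order."""
    for src, dst in MERGES:
        word = word.replace(src, dst)
    return word

print(merge_glyphs("chedy"))  # -> "eeedy"
print(merge_glyphs("daiin"))  # -> "doiin"
```

Running the whole corpus through this before tokenization would shrink the alphabet and collapse the merged word pairs, which is presumably the effect Jorge wants the network to be tested against.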
Jorge_Stolfi > 03-07-2025, 02:34 PM
(03-07-2025, 12:06 PM)quimqu Wrote: In fact, I trained several models with different block sizes (from 1 to 32). Block size tells how many tokens are used for learning and predicting. As expected, with block size 1 the results are better — the model just echoes bigrams. But with larger blocks (e.g., 8 or more), it needs to rely on broader context, and that’s where Voynichese starts to diverge from natural languages.
Quote: Here’s a sample of matching bigrams and trigrams for block size = 8, for several texts including Voynichese:
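The block-size effect quimqu describes can be illustrated with a plain word-level Markov sketch (not the nanoGPT setup itself): with order 1, every generated bigram is copied from the training data by construction, so near-perfect bigram overlap is expected; longer contexts make verbatim n-gram reuse less automatic.

```python
import random
from collections import defaultdict

def train_markov(words, order):
    """Order-k model: map each k-word context to its observed successors."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, seed, n, rng):
    """Extend the seed by sampling successors; stop on unseen contexts."""
    order = len(seed)
    out = list(seed)
    for _ in range(n):
        ctx = tuple(out[-order:])
        if ctx not in model:
            break
        out.append(rng.choice(model[ctx]))
    return out
```

With order=1, checking the generated text's bigrams against the source (as in the ✔️ lines above) trivially gives 100%; raising the order is the Markov analogue of raising block_size.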
quimqu > 03-07-2025, 02:43 PM
(03-07-2025, 02:34 PM)Jorge_Stolfi Wrote: Google Drive says I don't have permission to read that link.
From the Voynichese output that you quoted, it seems that many word spaces are missing. I suppose they were marked with a comma ("possible space") instead of a period in the original file. (Or, sadly, not marked at all: it looks like the Scribe sometimes ran words together by accident.) Would you consider trying again with periods and commas treated as word separators?
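The change Jorge asks for is a one-line tweak to the word splitter, assuming the transcription uses "." for a word space and "," for an uncertain one. A minimal sketch:

```python
import re

def eva_words(line):
    """Split an EVA transcription line into words, treating both the
    period (word space) and the comma (uncertain space) as separators."""
    return [w for w in re.split(r"[.,]", line) if w]

print(eva_words("daiin.chedy,qokeedy"))  # -> ['daiin', 'chedy', 'qokeedy']
```

Splitting only on periods would instead yield "chedy,qokeedy" as one token, which is the run-together effect visible in the quoted output.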
obelus > 03-07-2025, 07:24 PM
(02-07-2025, 04:10 PM)quimqu Wrote: I trained several nanoGPT models
Jorge_Stolfi > 04-07-2025, 11:59 PM
srjskam > 05-07-2025, 04:35 PM