The Voynich Ninja

Full Version: Why not positional variation?
I'm not sure if I understand correctly what this does exactly. It says for "Relative probability of ld{?}" that "g" is first with 0.85.
(12-05-2025, 03:41 PM)Koen G Wrote: I'm not sure if I understand correctly what this does exactly. It says for "Relative probability of ld{?}" that "g" is first with 0.85.

Intuitively this means that out of all possible characters after ld the one least likely to appear only by accident is g. I cannot look up the numbers now, but this probably means that ldg appears a few times in the ms (say, 3-4 times) and given that both ld and g are relatively rare, this combination is less likely to just appear by chance. In absolute numbers probably lde or lda appear more often, but e and a are frequent characters, so these combinations are more likely to appear by chance.

Edit: the model compares the actual count of a combination like ldg or ldc with the expected count given the total number of ld's and g's or c's in the ms. The reported probability delta is the difference between the actual and the expected, divided by twice the mean of the two (that is, by their sum). This number can be meaningless when both actual and expected are very small; I'm not sure if Claude considered this.
Maybe for clarity I can ask Claude to add a tooltip that would be visible for a specific character with mouse over on a desktop and with a long press on mobile, that would show the actual and the expected count.
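For anyone who wants to check the arithmetic, here is a minimal sketch of the comparison described in the edit above. The function names are made up for illustration, and the independence formula for the expected count is an assumption about how the script works, not something confirmed from its code.

```python
def expected_count(n_context, n_char, n_total):
    """Expected count of context+char if characters were drawn independently.

    Assumption: expected = (count of the context) * (share of char among
    all characters). The actual script may compute this differently.
    """
    return n_context * n_char / n_total


def probability_delta(actual, expected):
    """(actual - expected) divided by twice their mean, i.e. by their sum."""
    denom = actual + expected
    return 0.0 if denom == 0 else (actual - expected) / denom


# 4 observed vs. 0.36 expected are the ldg numbers quoted later in the thread.
print(round(probability_delta(4, 0.36), 3))   # 0.835
# A single hit against a tiny expectation scores even higher (illustrative numbers).
print(round(probability_delta(1, 0.01), 3))   # 0.98
```

The second call shows the small-count problem: one occurrence against a near-zero expectation gets a delta close to 1, although a single occurrence tells us very little.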
So it's like putting all the letters (tokens) in the MS in a big bag and drawing them at random: what are your chances of drawing "g"? And this is then compared to the actual occurrence of "g" before and after your prompt?

I see how this is very useful and instructive in most cases, but like you say, the small numbers cause counterintuitive results. That is to say, from the perspective of "g" it might be interesting that it occurs twice after "ld". But in the story of "ld", the "g" doesn't really matter: out of the ca. 450 occurrences of "ld", "g" accounts for only a tiny percentage.
Interestingly, in the transcription file used by the script there are 4 ldg's. This looks quite high. Could this be a parsing mistake, with some uncertain [d:g] parsed as dg?
I just did a quick check on voynichese.com where I got 2 instances. Might just be the different file used.

First one matches, that's f14v. 
On f57v, Voynichese gives okoldm vs. okaldg in your file.
On f68v3, opcholdg matches.
Finally, on [link] we have Sheoldg vs. Sheol.dg

So you get a 100% increase depending on the transliteration file used.
I've persuaded Claude to make a few changes:

1) instead of dividing by (actual + expected) / 2, it will divide by (actual + expected + 2) / 2, effectively downgrading extremely low count events
2) low count cells are also marked with '?' (fewer than 7 instances) and '??' (fewer than 4)
3) there is a tooltip that shows the actual and the expected, should activate by hovering the mouse on a desktop and via a long press on touchscreens. The tooltip doesn't work very well on touchscreens, but I'm afraid if I start arguing with Claude on this one, it will only make it worse.
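A minimal sketch of changes 1) and 2), under stated assumptions. Point 1 writes the denominator as (actual + expected + 2) / 2, while the earlier description divides by twice the mean, i.e. by actual + expected. The two differ only by a constant factor of 2, which does not affect the ranking. The sketch uses the un-halved form so that values stay between -1 and 1. It also assumes that the '?' and '??' thresholds apply to the actual count. Function names are hypothetical.

```python
def smoothed_delta(actual, expected, pseudo=2):
    """Delta with `pseudo` added to the denominator to damp low-count cells.

    Assumption: un-halved denominator (actual + expected + pseudo).
    """
    return (actual - expected) / (actual + expected + pseudo)


def low_count_flag(actual):
    """'??' for fewer than 4 instances, '?' for fewer than 7, else ''.

    Assumption: the thresholds are applied to the actual (observed) count.
    """
    if actual < 4:
        return "??"
    if actual < 7:
        return "?"
    return ""


# ldg with 4 observed and 0.36 expected: score drops and the cell is flagged.
print(round(smoothed_delta(4, 0.36), 3), low_count_flag(4))
# A single hit against a tiny expectation no longer scores near 1.
print(round(smoothed_delta(1, 0.01), 3), low_count_flag(1))
```

With the extra 2 in the denominator, a cell with one observed instance can no longer outrank a cell with hundreds of instances just because its expected count is close to zero.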

By default the page is cached locally, so you may need to refresh the page a couple of times before you get the latest version. All past statistics will not be recalculated automatically, so they probably should be deleted and recomputed if needed.

Edit: Claude also made a few attempts at fixing the problem with wrong counts when looking for characters that are already present in the search string; I accepted the version that I found the most plausible. It reduces the count of available characters by the count of the character in the search string times the count of the search string in the text. The exception is that the character is not counted if it is the last one in the search string (for the preceding-characters mapping) or the first one (for the following-characters mapping). It probably sounds complicated, but it's the cleanest that Claude offered and it probably works fine in most cases.
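My reading of that correction, as a rough sketch (the names are hypothetical and this is not the script's actual code): a character inside an occurrence of the search string cannot also be the character that follows or precedes another occurrence, unless it sits at the edge that can overlap.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    return sum(text.startswith(pattern, i) for i in range(len(text)))


def available_count(char, pattern, text, direction):
    """Characters `char` still available to appear next to `pattern`.

    For the following-characters mapping the first character of the pattern
    is not subtracted (it can follow another occurrence, as in 'ldld');
    for the preceding-characters mapping the last one is not subtracted.
    """
    if direction == "following":
        blocked = pattern[1:]
    elif direction == "preceding":
        blocked = pattern[:-1]
    else:
        raise ValueError("direction must be 'following' or 'preceding'")
    return text.count(char) - blocked.count(char) * count_occurrences(text, pattern)


sample = "ldld"  # toy text, not real transliteration data
print(available_count("d", "ld", sample, "following"))  # 0: both d's are inside an ld
print(available_count("l", "ld", sample, "following"))  # 2
```

In the toy text every d is already the second half of an ld, so none is left over to follow one, while an l can follow an ld and is therefore not subtracted.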
The combination (Eva) "ld" is really quite common, and arises e.g. from the combination of ol - dy, al - dy or ol - daiin (etc.).
As line-final dy (occasionally to rarely) changes into g or dg, both lg and ldg will appear occasionally.

Since we are talking about only a small number of g in the MS, I am sceptical that the probability of ldg can be tested in any significant way.
(13-05-2025, 12:29 AM)ReneZ Wrote: The combination (Eva) "ld" is really quite common, and arises e.g. from the combination of ol - dy, al - dy or ol - daiin (etc.).
As line-final dy (occasionally to rarely) changes into g or dg, both lg and ldg will appear occasionally.

Since we are talking about only a small number of g in the MS, I am sceptical that the probability of ldg can be tested in any significant way.

These 4 below appear to be actual ldg's. To me, 4 events against an expected count of 0.36 do look statistically significant. However, there is a simple explanation: the model, which randomly mixes letters, doesn't account for curve-line preferences. So it would expect ~7 times more ldm's than ldg's, because m is ~7 times more prevalent than g. But there is not a single ldm in the transliteration, and this is expected given CLS. To put this differently, if we treat g as the version of m that goes after d, then the expected number of ldg's would be ~3, which is in very good agreement with the observed 4.

[attachment=10620]
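To put rough numbers on this, here is a small sketch that treats the count of ldg's as a Poisson variable. The Poisson model is my assumption for illustration, not what the page computes. The pooled expectation simply adds the expected ldm's (about 7 times 0.36) to the expected ldg's.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


expected_ldg = 0.36                          # figure quoted above
expected_pooled = expected_ldg * (1 + 7)     # g treated as the post-d form of m: ~2.9

print(poisson_tail(4, expected_ldg))     # chance of 4 or more by accident
print(poisson_tail(4, expected_pooled))  # same, if g and m are pooled
```

Under these assumptions, 4 or more ldg's against an expectation of 0.36 has a probability of roughly 0.0005, whereas against the pooled expectation of about 2.9 it is roughly one in three, i.e. nothing unusual.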

Edit: I'm not sure, but maybe this logic can be used to show that the CLS is more about adjusting glyph shapes to fit the preceding shapes rather than about which glyphs are allowed after which other glyphs.
It is always important to note whether a symbol can also stand alone; I have already explained enough about (8g).
Example: If (8) stands alone, it has the value 2 and almost certainly stands for (de) and perhaps also for (di).
So there are the variants for (discussed symbol).
Alone (est) or with (8), (dest) or (d'est). Variants with (t) possible.
These small differences make up the text, or not. A transcription does not recognise these small differences. And now no AI can help you.