The Voynich Ninja

Full Version: Need advice for testing of hypotheses related to the self-citation method
(02-07-2025, 08:49 AM)nablator Wrote: An evolution rule like "ey." → "eo." is unlikely, 

But that's exactly what it does.
It follows the system: ey, eo, ty, etc.

The rule: follow the breadcrumbs.
[attachment=10940]
(02-07-2025, 09:20 AM)nablator Wrote: What do you mean exactly by "initialisation problem"?

If the "initialisation problem" is not knowing what/where the seeds are, I don't see why it is a problem.

One problem is exactly that we don't know.

A more theoretical problem is that it has never been described and nobody seems to care about it.

The whole approach does not work without it.
It is an integral part of the method.
EDIT: the results in this post don't seem to break down under random permutation of words in the MS, so likely they don't mean anything. 

I'm not a big fan of the self-citation hypothesis, not because I think it's right or wrong, but because it's largely irrelevant to my line of investigation. Self-citation is compatible with one-to-many ciphers, and the extent of its use largely depends on how bored the encoder is or how much of a hurry they are in; it tells little about the encoding itself. So treat the "proof" of self-citation below with a grain of salt: it's just a 20-minute effort using OpenAI to create a Python script, with not much independent verification on my part. It is highly likely that I'm just repeating a computation that has already been published before.

The basic idea: we take the EVA transliteration, remove all spaces, ligature marks, rare character marks and paragraph marks, and collapse all alternatives to the first option. Then from the resulting text we take the 10,000 rarest short substrings between 4 and 12 characters long (not crossing new lines). For each of these substrings we find in the text the nearest preceding and nearest following substring with Levenshtein distance <= 1 (so it is either the exact same substring or a variation with one character added, removed or replaced). We limit the minimum substring length to 4 because shorter strings have too many matching candidates (for example, the length-2 substring AB will match any nearest A or B, and the length-3 substring ABC will match any AB, AC or BC). Since we ignore the spaces, gluing two sequences together or splitting them apart has an edit distance of 0.
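In sketch form, the core matching step looks roughly like this (a simplified reconstruction, not the attached script; it ignores the no-newline-crossing detail and the helper names are made up):

Code:
# Rough sketch of the matching step (not the attached script; names are made up).
# `text` is assumed to be the EVA transliteration with spaces and marks removed.

def within_one_edit(a, b):
    """True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal) string
    i = j = 0
    edited = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif edited:
            return False                 # second mismatch -> distance > 1
        else:
            edited = True
            if len(a) == len(b):
                i += 1                   # substitution
            j += 1                       # insertion/deletion
    return True

def nearest_similar(text, start, length, direction):
    """Distance in characters to the nearest window within one edit of
    text[start:start+length], scanning forward (+1) or backward (-1)."""
    target = text[start:start + length]
    pos = start + direction
    while 0 <= pos < len(text):
        for w in (length - 1, length, length + 1):
            window = text[pos:pos + w]
            if len(window) == w and within_one_edit(target, window):
                return abs(pos - start)
        pos += direction
    return None                          # nothing similar in this direction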

Then we'll plot three charts - the distribution of the nearest preceding distances, the distribution of the nearest following distances and the distribution of the closest distances (which is the min of the first two for each instance).
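The plotting itself is nothing special; roughly like this (a toy sketch with made-up distance lists, where None means no match was found):

Code:
import matplotlib.pyplot as plt

# Made-up example distances; the real lists come from the matching step above.
preceding = [120, 430, None, 980, 75]
following = [300, None, 150, 60, 2100]

closest = [min(v for v in (p, f) if v is not None)
           for p, f in zip(preceding, following)
           if p is not None or f is not None]

datasets = [([p for p in preceding if p is not None], "nearest preceding"),
            ([f for f in following if f is not None], "nearest following"),
            (closest, "closest of the two")]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (data, title) in zip(axes, datasets):
    ax.hist(data, bins=50)
    ax.set_title(title)
    ax.set_xlabel("distance (characters)")
plt.tight_layout()
plt.show()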

And we get the following picture:


[attachment=10941]

What is remarkable here? For many rare substrings there can be no similar substring all the way forward to the end of the manuscript, or all the way back to the beginning of the manuscript. The tails on the preceding and following graphs are substantial.

However, there is almost always a similar substring within a couple of thousand characters if we look both directions. Given that it's likely the folios have been rearranged, I'd say most rare substrings have a doppelgänger within a couple of pages at most, and the majority on the same page.

This seems to show that there is a lot of short range pairwise similarity in the text, as if many rare sequences in the text are tied to only one other nearby locus.

(Edit: please see the run with randomly shuffled strings below.)

Python script: [link]
CSV with results for individual substrings: [link]
(02-07-2025, 09:53 AM)ReneZ Wrote: One problem is exactly that we don't know.

A more theoretical problem is that it has never been described and nobody seems to care about it.

The whole approach does not work without it.
It is an integral part of the method.

I don't see why it is a problem to pick a few random words in a stack of already written folios, enough to generate (maybe) a line per paragraph, or less. Why would it not work? It doesn't have to be a deterministic process. It's not necessary to reverse engineer 100% of the method, including every random choice, to evaluate it, find constraints that improve it, and verify that it is compatible (or not) with what we know of the VM. So what could be the possible problem in not knowing and not caring about the initialization?

For example, the chromosome theory of heredity (Walter Sutton, early 20th century) doesn't specify which chromosome is inherited from the mother and which one from the father (except the Y chromosome of course). A random selection can be part of the theory without being a problem.
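To be concrete about what I mean by picking a few random words in a stack of already written folios, here is a toy sketch (the word lists are made up, and this is not Torsten's actual app):

Code:
import random

# Toy sketch only: seed a new paragraph with a few words picked at random
# from folios that have already been written. The word lists are made up.
written_folios = [
    ["daiin", "chedy", "qokeedy", "shol"],
    ["okaiin", "chol", "qokain", "dar"],
]
all_words = [w for folio in written_folios for w in folio]
seed_words = random.sample(all_words, k=3)   # enough to start one paragraph
print(" ".join(seed_words))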
(02-07-2025, 10:26 AM)oshfdk Wrote: So treat the "proof" of self-citation below with a grain of salt: it's just a 20-minute effort using OpenAI to create a Python script, with not much independent verification on my part.

Well, I followed my own advice and produced the graphs after shuffling words in the whole MS with:

Code:
import sys
import random
import re

# Read entire stdin
text = sys.stdin.read()

# Split on both newlines and dots
chunks = re.split(r'[\n\.]', text)

# Remove empty or whitespace-only chunks
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

# Shuffle the chunks
random.shuffle(chunks)

# Output the result
print('.'.join(chunks))

And they show the same effect. So I retract my post ;-) No idea what property of edit distances and matching two sets instead of one causes this effect. Maybe the script doesn't do what ChatGPT promised it would.

[attachment=10942]
(02-07-2025, 11:11 AM)oshfdk Wrote: And they show the same effect. So I retract my post ;-) No idea what property of edit distances and matching two sets instead of one causes this effect. Maybe the script doesn't do what ChatGPT promised it would.

So the VM and the randomly scrambled VM "chunks" behave in the same way? Does this pattern carry over to texts we know have meaning? I know people have used the KJ Bible before to test for "meaningful" text but maybe a VM-contemporary herbarium or scientific text would be better suited. No idea if there are any readily available transcriptions on the internet though.
(02-07-2025, 11:28 AM)davidma Wrote: I know people have used the KJ Bible before to test for "meaningful" text but maybe a VM-contemporary herbarium or scientific text would be better suited. No idea if there are any readily available transcriptions on the internet though.

There are many... not always easy to find. I had to OCR or transcribe them myself from manuscripts in some instances.

This reminds me of the graph of average Levenshtein distance between words of lines, and the control on Dioscorides with Mattioli's commentary. See this thread: [link]
(02-07-2025, 11:28 AM)davidma Wrote: So the VM and the randomly scrambled VM "chunks" behave in the same way? Does this pattern carry over to texts we know have meaning? I know people have used the KJ Bible before to test for "meaningful" text but maybe a VM-contemporary herbarium or scientific text would be better suited. No idea if there are any readily available transcriptions on the internet though.

My go-to lab rat is Opus Majus, and it shows the same pattern. Maybe it has something to do with how the probability of encountering a single-edit substring grows with the number of candidates. I don't really want to spend much time on this; I thought I had an interesting result, but it turned out to be not very interesting. Maybe the metric I have chosen - the distribution of distances to the closest element - is flawed. In any case, I have attached the source above, and it begins with a detailed high-level and low-level description of what the script does (generated by the same ChatGPT), so you can just drop it into an AI bot and ask it to make any needed changes. In most cases, when a file contains both the source itself and a detailed natural-language description, AIs can adapt and debug it in all kinds of ways.

Edit: essentially, the question is this: there are two indexed sets of random numbers with roughly the same mean and variance but an unknown distribution shape. What would be the mean and variance of the set where, for each index, you pick the minimum of the two numbers from the original sets? My original intuition was that the mean would be something like 0.5-0.7 of the original mean, but now I understand that it could be quite different if the two sets are not independent. It could be anything.
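A quick toy simulation (made-up distributions, nothing to do with the actual distance data) shows how much the dependence matters:

Code:
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Independent case: two sets with the same mean
x = rng.exponential(1000, n)
y = rng.exponential(1000, n)
print("independent:", np.minimum(x, y).mean() / x.mean())     # about 0.5

# Strongly dependent case: y is x plus a little noise
y_dep = x + rng.normal(0, 50, n)
print("dependent:  ", np.minimum(x, y_dep).mean() / x.mean())  # close to 1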

I also overlooked the fact that there might be no single-edit twin for some of the substrings at all. I don't even know what the script would do in this case.
Sorry for temporarily hijacking the thread, but just some concluding remarks. I asked ChatGPT to change the script by capping the distance at 1/4 the size of the MS, so if no following or preceding string can be found within 1/4 of the text, the distance is set to this value.
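(The cap itself is just something like this, assuming the distance is None when nothing was found within the search range:)

Code:
def capped_distance(dist, text_len):
    # If no similar substring was found, or it lies beyond a quarter of the
    # text, report the cap value instead of the real distance.
    cap = text_len // 4
    return cap if dist is None or dist > cap else dist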

And I switched the graphs to a log scale. Now I see no stark difference between two-way and one-way matching, nor between EVA, shuffled EVA and Opus Majus. For Opus Majus I used an initial portion the size of the Voynich MS, to allow for easy comparison. 

So I think this was a nothing-burger.

[attachment=10943]
(02-07-2025, 10:48 AM)nablator Wrote: I don't see why it is a problem to pick a few random words in a stack of already written folios, enough to generate (maybe) a line per paragraph, or less. Why would it not work?

1. That does not work for the very beginning, which is my real interest
1a. It is still open whether there should be only a single initialisation or one per page or one per paragraph

2. Torsten's app seems to do only one global initialisation (I never checked this myself)

3. In later comments, Torsten suggested that paragraph-initial lines are taken from earlier paragraph initial lines, in order to cause the top line effect (f and p)

Whatever comes out, we need to keep in mind that this was not done by some modern automaton but by a medieval person who had no idea about several concepts that we always have in our heads (letter frequency, word frequency, word entropy), let alone worried about avoiding patterns.