The Voynich Ninja
Similar words in the same position - Printable Version


Similar words in the same position - bi3mw - 19-06-2018

I've been thinking about how repetitive the text in the VMS really is, so I had the idea for the following experiment:

If you compare two folios word by word, how likely is it that similar words turn up in the same position? Even if one ignores the very short words and allows a maximum distance of 2, the result is remarkable. Compared to "De Balneo", the hit rate is high. When one VMS folio is compared to another VMS folio, the result is similar.

The weak point here is that the word boundaries in the VMS are not always clear.
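
The comparison described above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not the script actually used: similarity is taken to be Levenshtein distance, and the function names and the `min_len` cut-off for "very short words" are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def positional_hits(words_a, words_b, max_dist=2, min_len=3):
    """Compare two word lists position by position.

    Returns (position, word_a, word_b, distance) for every position where
    both words have at least min_len characters (hypothetical cut-off)
    and their edit distance is at most max_dist.
    """
    hits = []
    for pos, (wa, wb) in enumerate(zip(words_a, words_b), start=1):
        if min(len(wa), len(wb)) < min_len:
            continue  # skip very short words
        d = levenshtein(wa, wb)
        if d <= max_dist:
            hits.append((pos, wa, wb, d))
    return hits
```

Each folio would be passed in as a list of words in reading order. `zip` stops at the end of the shorter list, so surplus words on the longer page are simply not compared.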


RE: Similar words in the same position - Emma May Smith - 19-06-2018

Surely this is a problem with defining 'similar'? The word structure in the Voynich text is relatively limited and so similar words occur more often than in other texts.


RE: Similar words in the same position - bi3mw - 21-06-2018

As for the similarity, you could restrict the selection to a maximum distance of 1. That would mean very similar words. The result is even clearer: for "De Balneo" only two (!) word pairs remain, "balneum / balneum" and "et / ex". The hit rate in the VMS is still very high.

I do not know of any kind of "free text" that would show such a repetition rate, apart from tables.

[Image: vms_f78r_vs_f84v_lev_1.png]
[Image: vms_f84r_vs_f84v_lev_1.png]


RE: Similar words in the same position - MarcoP - 21-06-2018

Hi bi3mw,
I find your idea interesting, but I don't understand how to read your graphs. Could you please explain?
Also, I think you could present the data as a table listing the matching pairs and their positions.

Finally, could you please tell us more of the De Balneo tests you are mentioning? In particular, which text are you using exactly? Which matches did you find?


RE: Similar words in the same position - bi3mw - 21-06-2018

Hi Marco,

The graphics are read from the outside inwards. A point on the outer circle means the distance is 1; no point means the distance is 0. This representation works better at larger distances, where there are several circles. The tables with the word positions (numbers) are here:

[link hidden]

[link hidden]

[link hidden]

From the chapter "De Balneo" in the "Regimen Sanitatis" I simply took three pages (Volume 2, Chapter 4, pages 168 - 170) and compared them to each other. All punctuation marks and comments were removed beforehand. The text lengths are comparable to the selection in the VMS. The link to the text:

[link hidden]


RE: Similar words in the same position - Koen G - 21-06-2018

I'm also just starting to understand how to read these diagrams. So basically the more spokes there are on the wheel, the more pairs you have found in a similar position between pages. And the more outward the dot, the more distance between pairs?

So if I'm understanding this all correctly, then I wonder: isn't this just a result of LAAFU effects and low entropy?


RE: Similar words in the same position - MarcoP - 22-06-2018

Thank you, bi3mw, things are much clearer now!
I guess that an additional test you could run is randomly scrambling word order in one of the two pages (without altering words) and see how much the results change. This would allow you to measure the impact of word order.
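
A shuffle test of this kind could look roughly like the sketch below. It is only an illustration under assumptions: it counts exact matches only, averages over many shuffles instead of a single one, and the function names, the number of trials and the fixed seed are arbitrary choices.

```python
import random


def exact_positional_matches(words_a, words_b):
    """Count positions at which both word lists hold the identical word."""
    return sum(1 for wa, wb in zip(words_a, words_b) if wa == wb)


def shuffled_baseline(words_a, words_b, trials=1000, seed=0):
    """Average number of exact positional matches when words_b is shuffled.

    The words themselves are not altered, only their order, so the
    difference from the unshuffled count reflects the effect of word order.
    """
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    shuffled = list(words_b)    # copy, the caller's list stays untouched
    total = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        total += exact_positional_matches(words_a, shuffled)
    return total / trials
```

Comparing `exact_positional_matches(a, b)` with `shuffled_baseline(a, b)` for the same two pages gives the observed count next to the count that survives when word order is destroyed.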

Here are some considerations that ignore word order, but I am not sure they are correct:

Word counts:
  • 362 [link hidden]
  • 337 [link hidden]
Words common to the two files:
153 (with 58 different word types: most shared words appear multiple times in both files)
On average, each common word type repeats 153/58 = 2.64 times

For each word in f84v, the probability that it repeats somewhere in [link hidden] is:
153/362 = 42%

The probability that you find that specific repeating word is:
2.64 * 1/58 = 4.5%

The joint probability of finding a repeating word and that it is the correct one is:
42% * 4.5% = 1.9%

The expected number of perfect matches is 1.9% of the 337 words in f84v:
337 * 1.9% = 6.4


You found 9 exact co-occurrences versus an expected number of 6.4.
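
The arithmetic above can be reproduced in a few lines. This sketch only restates the calculation as posted, using the same four counts; it does not check whether the probabilistic reasoning itself holds, and the function and parameter names are made up.

```python
def expected_exact_matches(n_other, n_target, shared_tokens, shared_types):
    """Expected number of exact same-position matches, following the steps above."""
    p_repeat = shared_tokens / n_other           # 153/362, about 42%
    mean_repeats = shared_tokens / shared_types  # 153/58, about 2.64
    p_specific = mean_repeats / shared_types     # about 4.5%
    return n_target * p_repeat * p_specific


expected = expected_exact_matches(n_other=362, n_target=337,
                                  shared_tokens=153, shared_types=58)
print(round(expected, 2))
```

Without rounding the intermediate percentages the result comes out at about 6.48 rather than 6.4, which does not change the comparison with the 9 observed matches.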


Extracts of the same size from the "De Balneo" chapter share 94 words (vs 153 in the VMS). This tells us that a lower number of matches is expected.


RE: Similar words in the same position - bi3mw - 22-06-2018

(22-06-2018, 08:27 AM)MarcoP Wrote:
...
I guess that an additional test you could run is randomly scrambling word order in one of the two pages (without altering words) and see how much the results change.
...

Thanks Marco,

Your suggestion to test shuffled text against original text is very interesting. I repeated the experiment twice and the result was the same every time: the number of exact and similar hits is halved.

[Image: vms_f78r_vs_f84v_shuffled.png]


RE: Similar words in the same position - MarcoP - 22-06-2018

(22-06-2018, 10:30 AM)bi3mw Wrote:
Thanks Marco,

Your suggestion to test shuffled text against original text is very interesting. I repeated the experiment twice and the result was the same every time: the number of exact and similar hits is halved.

Thank you, bi3mw, very interesting again!
If this figure is confirmed in a number of observations, your analysis can proceed in several directions.
One could think about how to discriminate between the 50% of matches that appear to be random and those that must be explained. A simple idea would be to ignore the matches that involve frequent words (those are more likely to be coincidental).
Given the uncertainty in position (due to uncertainty in word spaces) one could also allow some tolerance in position (e.g. considering matches with up to a 5% difference in position). In my opinion, this could be more informative than allowing word-variation via Levenshtein: I think that focussing on perfect matches could be useful as a first step.
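
One possible reading of the position tolerance is sketched below. It assumes that position is measured as a fraction of the page length and that only exact word matches count; both are assumptions, and the function name and the returned tuple layout are invented for the example.

```python
def tolerant_matches(words_a, words_b, tolerance=0.05):
    """Exact word matches whose relative page positions differ by at most `tolerance`.

    Relative position = word index / number of words on the page, so a
    tolerance of 0.05 corresponds to a 5% difference in position.
    Returns (index_a, index_b, word) tuples.
    """
    hits = []
    n_a, n_b = len(words_a), len(words_b)
    for i, wa in enumerate(words_a):
        rel_a = i / n_a
        for j, wb in enumerate(words_b):
            if wa == wb and abs(rel_a - j / n_b) <= tolerance:
                hits.append((i, j, wa))
    return hits
```

A frequent word can produce several hits inside the same window, so the counts are not directly comparable with strict same-position counts.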

In general, if the phenomenon turns out not to be illusory, it could be an effect of local concentration of words inside pages: e.g. some words tend to occur in the top half of a page and others in the bottom half. Ideas like this can of course be tested independently by measuring the average page position of each word through the whole manuscript (or through consistent subsections of the manuscript).
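
Measuring the average page position of each word might be sketched like this. The sketch assumes each page is already available as a list of words in reading order; the 0 to 1 scaling and the `min_count` filter are illustrative choices, not part of the proposal.

```python
from collections import defaultdict


def mean_relative_positions(pages, min_count=2):
    """Average relative position of each word over a collection of pages.

    0.0 means the first word of a page, 1.0 the last. Words seen fewer
    than min_count times are dropped, since their average says little.
    """
    positions = defaultdict(list)
    for words in pages:
        n = len(words)
        for i, w in enumerate(words):
            # a one-word page gets the midpoint to avoid dividing by zero
            positions[w].append(i / (n - 1) if n > 1 else 0.5)
    return {w: sum(p) / len(p)
            for w, p in positions.items() if len(p) >= min_count}
```

Words whose averages sit far from 0.5 would be candidates for a top-of-page or bottom-of-page preference, provided they occur often enough for the average to be stable.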

I will certainly follow the developments with attention!


RE: Similar words in the same position - bi3mw - 22-06-2018

I mostly use Bash scripts for searches. Finding strings across word boundaries is possible, but it's difficult to test the hits against the percentage ratio inside / outside the word boundaries. I will think about it.