The Voynich Ninja
Common vocabulary between pages




RE: Common vocabulary between pages - ReneZ - 01-09-2019

I did basically the same thing many years ago, and it is described on [link].

The relevant figure is similar to the one posted earlier by nablator:

[Image: la_corr.gif]

One problem is how to normalise the results. The text on pages can have many different lengths.
Obviously, one can get more matches on longer pages.


RE: Common vocabulary between pages - -JKP- - 01-09-2019

I was particularly interested in finding patterns.

For example, would wet and dry plants have a pattern of similar tokens?

Would plants that share the same ruling planet have a pattern of similar tokens?

Would plants and pool-page share certain patterns?

Would plants that were used together in common medicinal recipes share similar patterns?

Would star pages and zodiac-figure pages share certain patterns?

Would the patterns of "base" tokens have consistent pre- and post-pends (and was this a valid way of looking at them)?

Was there a progression of patterns?

Were there cycles of patterns?

I could go on for several pages. I was hoping some of these patterns would become more visible (or something along these lines; it didn't have to be these specific patterns, as long as they were patterns one might reasonably expect in the Middle Ages). The color-coded transcript was an important part of this. You can see by glancing at the folios whether there are patterns, and then look in the concordance and/or the database to see where else the pattern might appear.


RE: Common vocabulary between pages - Koen G - 01-09-2019

(01-09-2019, 12:58 AM)nablator Wrote: With the right color gradient (I'm sure it can be improved) the pages with the most common vocabulary should be easy to find in this Excel sheet. It displays the percentage of common vocabulary between any two pages for ZL2.

Nice! Is this like a zoomed-in version of Rene's graphs?


RE: Common vocabulary between pages - Koen G - 01-09-2019

(01-09-2019, 05:49 AM)ReneZ Wrote: One problem is how to normalise the results. The text on pages can have many different lengths.
Obviously, one can get more matches on longer pages.

Hmm yes that's right. 
What if this exercise were done only with rare words? Then you would specifically look for themes common to certain pages.


RE: Common vocabulary between pages - nablator - 01-09-2019

(01-09-2019, 10:26 AM)Koen G Wrote: Nice! Is this like a zoomed-in version of Rene's graphs?

Basically yes but the data is different: I used unmodified ZL2 with all special characters in the Excel sheet and "my" transcription for the image (it doesn't matter much, the differences are small).

I don't know how to normalize the results. Of course, a short page can only share a small percentage of the total vocabulary of two pages when the other page is much longer. A non-symmetrical version, with the percentage of the vocabulary of page i that can be found in page j, would suffer from the same issue when there is a large difference in size between the pages.
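To make the non-symmetrical idea concrete, here is a minimal Python sketch. The function name `containment` and the toy token lists are made up for illustration; they are not taken from the Excel sheet or from any transcription.

```python
def containment(page_i, page_j):
    """Share of page_i's distinct word types that also occur on page_j."""
    types_i, types_j = set(page_i), set(page_j)
    if not types_i:
        return 0.0  # an empty page shares nothing
    return len(types_i & types_j) / len(types_i)


# Hypothetical toy pages (lists of tokens), not real transcription data.
short_page = ["daiin", "chol", "dam"]
long_page = ["daiin", "chol", "dam", "shol", "chor",
             "otaim", "alam", "qokeey", "okal", "dar"]

short_in_long = containment(short_page, long_page)  # 3 of 3 types found
long_in_short = containment(long_page, short_page)  # 3 of 10 types found
print(short_in_long, long_in_short)
```

The two directions give very different numbers for the same pair of pages, which is the size effect described above: the short page looks fully "covered" while the long page looks barely related.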

Quote:What if this exercise were done only with rare words? Then you would specifically look for themes common to certain pages.

I would be very surprised if vocabulary (rare or not) has anything to do with themes and meaning. If it did, the many attempts to correlate vords with themes and illustrations, and to find grammatical structures, would have shown better results than basically nothing. Currier was right: vords are not words.

FWIW my interpretation of what these tables and also MATTR (moving-average type-token ratio) show is that the author(s) strove, in most Currier A "language" pages, to maximize finesse and elegance. The only exceptions (where MATTR drops) are where they were doing "challenges": avoiding some common pattern(s) or repeating them as much as possible. In Q13, and to a lesser extent Q20, they finally got bored with the relatively slow process of generating nearly optimal vords (optimized for size and variable constraints) and relied much more on the reuse of vords and patterns (a.k.a. autocopy) to finish the work more quickly.
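For readers who have not met MATTR: it is the type-token ratio averaged over every sliding window of a fixed length, which makes it less sensitive to text length than a plain type-token ratio. A minimal sketch follows; the default window of 500 tokens is an assumption for illustration, not necessarily the setting used for the results discussed here.

```python
def mattr(tokens, window=500):
    """Moving-average type-token ratio over all windows of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(tokens)
    if n == 0:
        return 0.0
    if n <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / n
    total = 0.0
    for start in range(n - window + 1):
        # Distinct types in this window, divided by the window length.
        total += len(set(tokens[start:start + window])) / window
    return total / (n - window + 1)


# Toy example with a tiny window: heavy local repetition lowers the value.
print(mattr(["a", "b", "a", "b"], window=2))
print(mattr(["a", "a", "b", "b"], window=2))
```

Both toy texts have the same overall type-token ratio, but the second repeats tokens locally, so its windowed average is lower. That is the kind of drop the reuse of vords would produce.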


RE: Common vocabulary between pages - Koen G - 01-09-2019

(01-09-2019, 12:09 PM)nablator Wrote: I would be very surprised if vocabulary (rare or not) has anything to do with themes and meaning. Many attempts to correlate vords with themes and illustrations and find grammatical structures would have showed better results than basically nothing. Currier was right: vords are not words.

Could be... I remain agnostic about what the text is, but I prefer to design my tests "as if it were" language. (But let's not get into the whole "proving a negative" thing here, there's a thread for that).

I'm also taking into account the possibility that we simply haven't found the best way yet to classify pages based on imagery (in preparation for subsequent vocabulary testing).


RE: Common vocabulary between pages - MarcoP - 01-09-2019

(01-09-2019, 05:49 AM)ReneZ Wrote: One problem is how to normalise the results. The text on pages can have many different lengths.
Obviously, one can get more matches on longer pages.

With some googling, I found that the Jaccard index was designed to address this problem.
It is quite simple and I expect it should work in our case:

Quote:The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|

For any page A, longer pages B are more likely to get more matches, but a longer B will also result in a larger denominator (the union, i.e. the total number of different types in A and B), compensating for the larger numerator (the intersection, i.e. the matches).
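A minimal Python sketch of the coefficient, applied to made-up vocabularies (the page contents and the `x0`..`x8` filler types are hypothetical, chosen only to show the arithmetic):

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections of word types."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty pages
    return len(set_a & set_b) / len(union)


# Hypothetical vocabularies, not real transcription data.
page_a = {"daiin", "chol", "dam", "alam"}
page_b_short = {"daiin", "chol", "shol", "chor"}                 # 2 matches, 4 types
page_b_long = {"daiin", "chol", "dam"} | {f"x{i}" for i in range(9)}  # 3 matches, 12 types

print(jaccard(page_a, page_b_short))  # intersection 2, union 6
print(jaccard(page_a, page_b_long))   # intersection 3, union 13
```

In this toy case the longer page gets more matches (3 against 2) but a lower coefficient, because its extra types enlarge the union.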


RE: Common vocabulary between pages - nablator - 01-09-2019

(01-09-2019, 12:55 PM)MarcoP Wrote: With some googling, I found that the Jaccard index was designed to address this problem.

It is quite simple and I expect it should work in our case:

I didn't know the name of this index but it is exactly how I calculated my percentages.


RE: Common vocabulary between pages - RobGea - 01-09-2019

The simplest method is probably just to eliminate the 10-20 most common words.
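A minimal sketch of that filtering step, assuming the pages are held as a dict mapping folio name to a list of tokens (the function name `drop_most_common` and the toy data are hypothetical):

```python
from collections import Counter


def drop_most_common(pages, n=10):
    """Remove the n most frequent tokens (counted over all pages) from every page."""
    counts = Counter(token for tokens in pages.values() for token in tokens)
    stop_words = {word for word, _ in counts.most_common(n)}
    return {
        folio: [t for t in tokens if t not in stop_words]
        for folio, tokens in pages.items()
    }


# Hypothetical toy pages, not real transcription data.
pages = {
    "p1": ["daiin", "daiin", "chol", "otaim"],
    "p2": ["daiin", "chol", "alam"],
}
filtered = drop_most_common(pages, n=1)  # drops only "daiin" here
print(filtered)
```

Any page-to-page comparison can then be run on the filtered pages instead of the originals.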

It's very much dependent on exactly what it is that you want from the data.

Take folio f65r, which has only 3 words: 'otaim dam alam'.
otaim occurs in 1 other folio -> f111v [recipes].
dam occurs in 67 other folios -> occurs in all sections.
alam occurs in 7 other folios -> 5 times in recipes [f104r, f107r, f108r, f111r, f111v], 2 in text only [f58r, f58v]

shared words:
(2, 'f58v', {'dam', 'alam'}), (2, 'f111r', {'dam', 'alam'}) (2, 'f104r', {'dam', 'alam'})
(2, 'f111v', {'otaim', 'alam'})

Which of these results is the most useful ?
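One possible way to rank these results, offered only as a sketch: weight each shared word by how few other folios contain it, so that a match on 'otaim' counts for more than a match on 'dam'. The 1/count weighting is an assumption made for illustration, not something used elsewhere in this thread. The data below is limited to the figures and shared-word sets listed above.

```python
# Number of other folios each f65r word occurs in, as listed above.
other_folio_counts = {"otaim": 1, "dam": 67, "alam": 7}

# Shared-word sets as listed above (only the overlap with f65r is used).
shared = {
    "f58v": {"dam", "alam"},
    "f111r": {"dam", "alam"},
    "f104r": {"dam", "alam"},
    "f111v": {"otaim", "alam"},
}


def rarity_score(words, counts):
    """Sum of 1/count: words found in few folios contribute more (assumed weighting)."""
    return sum(1.0 / counts[w] for w in words)


# Highest score first; ties fall back to the folio name.
ranked = sorted(
    ((rarity_score(words, other_folio_counts), folio) for folio, words in shared.items()),
    reverse=True,
)
for score, folio in ranked:
    print(f"{folio}: {score:.3f}")
```

All four folios share two words with f65r, but under this weighting f111v comes out ahead because 'otaim' occurs in only one other folio.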


RE: Common vocabulary between pages - Koen G - 01-09-2019

I think folios with very little text are simply not very useful for this kind of exercise and can best be omitted.

Nablator: I isolated all Herbal B folios from your file and averaged the numbers per column. (3-word folio omitted)
Light green are [link] f41r [link] f46v
Dark green are [link] and [link]
Their word counts are 90, 98, 68, 87, 45 = average 77.6
Average for remaining B folios is 110.6

Doesn't that look like it's not normalized?