The Voynich Ninja
Common vocabulary between pages - Printable Version


Common vocabulary between pages - Koen G - 31-08-2019

Has anyone collected data on which pages share most vocabulary? Specifically, I'd like to know when I select a page (for example f1v), which other folios share the most different word types with it?


RE: Common vocabulary between pages - Anton - 31-08-2019

Wladimir did that for some folios, check out his threads.

Also here (in Russian): [link not available]

I consider this approach a very promising avenue, something that should be automated and a full data set built for examination.


RE: Common vocabulary between pages - nablator - 31-08-2019

   


RE: Common vocabulary between pages - Koen G - 31-08-2019

What I'd like to be able to do is select a plant page and see which pages it "connects" to the most, vocabulary wise. I tried to do this manually but soon gave up, there are so many variables to keep track of.


RE: Common vocabulary between pages - RobGea - 01-09-2019

I have done something similar to group Voynich sections by folios; maybe the code can be adapted.
Meanwhile, over at Voynich Attacks, there is a grouping of folios by glyphs rather than words: [link not available]


RE: Common vocabulary between pages - nablator - 01-09-2019

(31-08-2019, 11:43 PM)Koen G Wrote: What I'd like to be able to do is select a plant page and see which pages it "connects" to the most, vocabulary wise. I tried to do this manually but soon gave up, there are so many variables to keep track of.

With the right color gradient (I'm sure it can be improved) the pages with the most common vocabulary should be easy to find in this Excel sheet. It displays the percentage of common vocabulary between any two pages for ZL2.
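A pairwise table of this kind can be sketched in a few lines of Python. This is only an illustration, not the formula behind the Excel sheet: it assumes "percentage of common vocabulary" means shared word types divided by the union of word types on the two pages (the Jaccard index), and the page names and words are toy placeholders rather than real transcription data.

```python
from itertools import combinations

def shared_vocab_percent(words_a, words_b):
    """Percentage of word types common to both pages.

    Assumption: shared types / union of types (Jaccard index) x 100.
    Other denominators (e.g. the smaller page) are equally defensible.
    """
    types_a, types_b = set(words_a), set(words_b)
    union = types_a | types_b
    if not union:
        return 0.0
    return 100.0 * len(types_a & types_b) / len(union)

def similarity_table(pages):
    """Map each unordered pair of page names to its shared-vocabulary percentage."""
    return {
        (a, b): shared_vocab_percent(pages[a], pages[b])
        for a, b in combinations(sorted(pages), 2)
    }

# Toy data only; page names and word lists are hypothetical.
pages = {
    "pageA": ["daiin", "chol", "shol", "daiin"],
    "pageB": ["daiin", "chol", "qokeedy"],
    "pageC": ["qokeedy", "okaiin"],
}

table = similarity_table(pages)
# List the pairs from most to least shared vocabulary.
for pair, pct in sorted(table.items(), key=lambda kv: -kv[1]):
    print(pair, f"{pct:.1f}%")
```

Dividing by the union penalises long pages less than dividing by the total of both vocabularies, but it still lets page length influence the result, so the choice of denominator is worth stating alongside any such table.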


RE: Common vocabulary between pages - -JKP- - 01-09-2019

(31-08-2019, 11:43 PM)Koen G Wrote: What I'd like to be able to do is select a plant page and see which pages it "connects" to the most, vocabulary wise. I tried to do this manually but soon gave up, there are so many variables to keep track of.

That's what I was trying to do with my VMS Concordance. It took me years. Years.

I went through the entire manuscript and recorded which tokens occurred in which sections and with what frequency, and how they connected throughout the manuscript—a network of tokens. It was a HUGE project with minimal returns.


This is the summary page for the two-glyph token lk. It documents where the token occurs, how many times it occurs on each folio, the sections in which it occurs, variations that include glyphs before and after the token, and comments on anomalies or unusual patterns. And that's just the summary page. The token is also color-coded within the transcript, and the data is set up as charts and frequency distribution lists. This is just a peephole into all the information I collected. I did this for every token in the manuscript:

[Image: VMSConcordanceLK.png]

PS, I didn't use folio numbers, I used mnemonics (whatever gave me a quick picture of the plant or pool folio in my head, it didn't have to be a correct ID, it preferably had to be funny or goofy to make it easier to remember). This allows me to scan through the tokens quickly and see where they occur. I cross-referenced each one with the color-coded transcript to get the folio numbers, so this is only one small part of all the data and only a small part of the visualization aids.

SP stands for "small-plants" section.

You can see, for example, that lk occurs on 6 of the zodiac-figure pages, all of the pool pages, all of the rota on the rosettes folio except rotum 3. It seems to be preferential to the pool pages (doesn't occur on as many plant pages as some tokens). It is rather sparse on the starred-text pages, but when it occurs, it occurs quite a few times on individual folios.
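The kind of per-token record described above (where a glyph sequence occurs, how often per folio and per section, and which longer words contain it) can be approximated with a short Python sketch. This is not the concordance itself, which relies on a private transcript and database. It assumes a plain-text transliteration in which each glyph is one character, and the folio names, section labels, and words below are hypothetical toy values.

```python
from collections import Counter

def concordance(pages, sections, token):
    """Summarise where a glyph sequence occurs.

    pages:    dict mapping folio name -> list of words on that folio
    sections: dict mapping folio name -> section label
    token:    glyph sequence to look for inside words

    Returns three Counters: occurrences per folio, occurrences per
    section, and the number of times each containing word was seen.
    """
    per_folio = Counter()
    per_section = Counter()
    variants = Counter()
    for folio, words in pages.items():
        for word in words:
            # str.count counts non-overlapping occurrences within the word
            n = word.count(token)
            if n:
                per_folio[folio] += n
                per_section[sections[folio]] += n
                variants[word] += 1
    return per_folio, per_section, variants

# Toy data only; none of these values come from a real transcript.
pages = {
    "folio1": ["olkeedy", "lkaiin", "daiin"],
    "folio2": ["qolkeedy", "olkeedy"],
    "folio3": ["chol"],
}
sections = {"folio1": "pool", "folio2": "pool", "folio3": "plant"}

per_folio, per_section, variants = concordance(pages, sections, "lk")
print(dict(per_folio))
print(dict(per_section))
print(variants.most_common())
```

Running this for every token of interest gives a rough network of tokens across folios, though without any of the manual checking or annotation of anomalies described in the post.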

I haven't released this information because it became too big, too complex, and needs to be double-checked (which would probably take a year of dedicated attention), plus I'm still trying to figure out what it means. It combines several applications. This uses my own fonts, my own transcript, my own database, and a couple of other utilities. It's not something one can simply hand to someone else. Also, even if there were some practical way to do it... it's almost impossible to release raw data to an academic audience if it hasn't been double-checked: the responses tend to be scathing and focused on small mistakes or omissions rather than all the good information that can potentially be gleaned from it. Plus... the documentation that goes with it is more than 1100 pages long (not including the transcript and the database). Not exactly the sort of thing one can easily submit for publication.


RE: Common vocabulary between pages - RobGea - 01-09-2019

OK, got some code working in Python 3.6. If you want, I'll post it, but it's rough and ready and not thoroughly tested.
It uses the Takahashi transcription (IVTFF Eva- 1.5).
Some truncated sample output:
folio: f1v  contains 65 individual words. 
Matches ::  Folio
(20, 'f8r') 
(20, 'f89v1')

folio: f65r  contains 3 individual words
Matches  :: Folio
(2, 'f58v')
(2, 'f111v')

folio: f100r  contains 92 individual words
Matches  :: Folio
(26, 'f101r')
(25, 'f113r')
(25, 'f111r')
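For readers who want to reproduce this kind of listing, here is a minimal Python sketch of the counting step. It is not RobGea's code, which was not posted here. It assumes the transcription has already been loaded into a dictionary from folio name to word list (parsing the IVTFF file is left out), and the names `rank_by_shared_types` and `pages`, along with the toy data, are hypothetical.

```python
def rank_by_shared_types(pages, target, top=3):
    """Rank the other folios by how many distinct words they share with `target`.

    Returns (number of distinct words on target, list of (count, folio) tuples),
    with the best matches first. Ties keep the input order of `pages`.
    """
    target_types = set(pages[target])
    matches = [
        (len(target_types & set(words)), folio)
        for folio, words in pages.items()
        if folio != target
    ]
    matches.sort(key=lambda m: -m[0])  # stable sort, highest count first
    return len(target_types), matches[:top]

# Toy data only; folio names and words are placeholders.
pages = {
    "f_a": ["daiin", "chol", "shol", "chol"],
    "f_b": ["daiin", "chol", "okaiin"],
    "f_c": ["shol"],
    "f_d": ["qokeedy"],
}

total, best = rank_by_shared_types(pages, "f_a", top=2)
print(f"folio: f_a  contains {total} individual words")
print("Matches :: Folio")
for match in best:
    print(match)
```

Raw counts favour long folios, since a page with many words has more chances to share some of them, so dividing by the size of one or both vocabularies may give a fairer ranking.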


RE: Common vocabulary between pages - -JKP- - 01-09-2019

The data is only as good as the transcript.

When I first started studying the VMS, I downloaded the Takahashi transcript. It only took a few minutes of checking the early pages to realize it was flawed and there was no point in using it. That's why I created my own.

That's not to say it is bad. It's reputed to be one of the better ones. But I didn't feel it was good enough. Many spaces are ignored. Some characters are ignored (like cccc patterns). The variety of endings in "daiin" is frequently ignored. Daiin is not as homogeneous as the transcripts make it look.


There are still plenty of things that can be done with a flawed transcript as long as the proportion of problems is small... but people should be aware that it's flawed when they examine the data (it also depends exactly what one is trying to get out of it... a rough overview, or information about details).


RE: Common vocabulary between pages - -JKP- - 01-09-2019

(01-09-2019, 02:26 AM)RobGea Wrote: OK, got some code working in Python 3.6. If you want, I'll post it, but it's rough and ready and not thoroughly tested.
...

folio: f100r  contains 92 individual words
Matches  :: Folio
(26, 'f101r')
(25, 'f113r')
(25, 'f111r')

This is why I used mnemonics instead of folio numbers (I also have the folio numbers listed in the background). It's much easier to quickly grasp relationships if, for example, one scans through and sees that the same tokens occurring on Water Lily and Viola occur on Rotum 9 on the map folio (Rotum 9 looks to me a bit like a garden, so this is why I chose it as an example).

It's tiresome and time-consuming to look up folio numbers. It's easy to code in mnemonics of the researcher's preference to go with the folio numbers. I highly recommend it.
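Attaching mnemonics to folio numbers takes only a small lookup table. The Python sketch below shows one possible way to do it. The folio keys `fA` and `fB` are placeholders, since the post does not say which folio numbers the Water Lily and Viola mnemonics belong to, and the function names are hypothetical.

```python
# Placeholder keys: the real folio numbers for these mnemonics are not given
# in the thread, so "fA" and "fB" stand in for them.
MNEMONICS = {
    "fA": "Water Lily",
    "fB": "Viola",
}

def label(folio, mnemonics=MNEMONICS):
    """Return 'Mnemonic (folio)' when a mnemonic exists, else the folio alone."""
    name = mnemonics.get(folio)
    return f"{name} ({folio})" if name else folio

def relabel(matches, mnemonics=MNEMONICS):
    """Apply mnemonics to a list of (count, folio) tuples."""
    return [(count, label(folio, mnemonics)) for count, folio in matches]

# Toy match list in the same (count, folio) shape as the sample output above.
labelled = relabel([(26, "fA"), (25, "fZ")])
for row in labelled:
    print(row)
```

Keeping the folio number in parentheses means the mnemonic can be as goofy as the researcher likes without losing the reference needed to look the page up.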