The Voynich Ninja
Common vocabulary between pages - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Common vocabulary between pages (/thread-2914.html)



RE: Common vocabulary between pages - RobGea - 19-10-2019

VMS Folio text matching.
Using cosine similarity on the Takahashi transcript with the top 20 commonest words removed.

Cosine algorithm taken from here: (link available to registered members)

Explanation of cosine similarity here: (link available to registered members)

Cosine similarity on sample text generated by text-generator.jar with default values.
Using T. Timm's tool I created sample text, then edited it to match the set of words per Voynich folio,
again with the top 20 commonest words removed.
(link available to registered members)

In the T. Timm image, the high values on the diagonal are data artefacts created when the text was split into folios.
Where a page is compared to itself, the value is set to zero.
For visual reference (link available to registered members): the folio with only 3 words is page 114.
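The pipeline described above can be sketched in a few lines. This is my reconstruction of the general technique, not RobGea's actual script: bag-of-words vectors per folio, the top-N commonest words removed, L2-normalised counts, and a cosine similarity matrix with the diagonal zeroed out as the post describes.

```python
# Sketch of the folio-similarity pipeline (reconstruction, not the original code).
from collections import Counter
import math

def folio_similarity(folios, drop_top=20):
    """folios: list of word lists, one per folio."""
    total = Counter(w for f in folios for w in f)
    stop = {w for w, _ in total.most_common(drop_top)}  # top-N commonest words
    vocab = sorted(set(total) - stop)
    index = {w: i for i, w in enumerate(vocab)}

    def vector(words):
        v = [0.0] * len(vocab)
        for w in words:
            if w in index:
                v[index[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # L2 normalisation
        return [x / norm for x in v]

    vecs = [vector(f) for f in folios]
    n = len(folios)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # self-comparison set to zero, as in the plots
                sim[i][j] = sum(a * b for a, b in zip(vecs[i], vecs[j]))
    return sim
```

With L2-normalised vectors the dot product *is* the cosine similarity, so each off-diagonal cell lies between 0 and 1.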
       


RE: Common vocabulary between pages - Koen G - 19-10-2019

Interesting. How would you read these graphs?
In the Timm example, it seems that the color is only affected by word count per folio (as we would expect)?

In the VM, there are clearer patterns. Are those caused by Currier A and B?


RE: Common vocabulary between pages - RobGea - 19-10-2019

Oh wow! Caveat: never done anything like this before.
a) All errors and omissions are mine.
b) I am a hobbyist with no skills nor formal knowledge; indeed, this is my introduction to data science.
c) There is an awesome amount of data in these plots, and there is a lot going on.


1. Cosine similarity works at the word level, and since there are words that occur only in Currier A or Currier B,
  the Currier languages will be present in this data.
  Rene explains this in depth here:
  (link available to registered members)
  (link available to registered members)

2. The cosine similarity algorithm I used applies L2 normalisation to counter word frequency effects;
  however, the word density in the balneo and recipes sections is very high, and as such
  I don't think L2 normalisation can remove that effect completely (see Random image).

3. In the T. Timm image there is a subtle underlying grid layout throughout; I think this is created
  by the parameters in the auto-copying theory governing when and how to modify the stream of words.

4. For comparison, I did the same thing to Dante's Divine Comedy in Italian,
  divided into Voynich-style folios and with the top 20 words removed.
  It shows the same word-frequency bias as the other 3 images, but its distribution is much smoother
  than the Timm or Voynich images and seems to increase gradually from left to right / top to bottom.
  Interestingly, it does not show the similarity links one would expect given its structure;
  i.e. the division into 3 'cantiche' is not obvious (see Dante image).
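Point 2 above is worth a toy illustration (my own toy numbers, nothing from the VMS data): L2 normalisation fixes each vector's *length*, but a word-dense folio still overlaps more of the vocabulary, so its cosine scores against everything else stay elevated.

```python
# Toy demo: L2 normalisation alone does not remove word-density bias.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

sparse = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # folio using only 2 word types
dense  = [1] * 10                        # word-dense folio using all 10 types
other  = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # some unrelated folio

# The dense folio is guaranteed some overlap with everything,
# so its similarity to `other` is higher despite normalisation.
assert cosine(dense, other) > cosine(sparse, other)
```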
       


RE: Common vocabulary between pages - Koen G - 20-10-2019

But VM, Timm and Dante all show the pink squares bottom right. Wouldn't this imply that any grid you see is due to text length per page? Would it be clearer if you somehow normalize for this? 

One way to do this would be to limit the number of words to x per page and omit those pages that don't have enough words, but that may be a lot of work.
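The fixed-length idea suggested above is actually little work to implement. A minimal sketch (a hypothetical helper, not code from this thread): sample exactly x words from each page and drop the pages that fall short.

```python
# Equalise page lengths by subsampling, dropping pages with fewer than x words.
import random

def equalise_pages(pages, x, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    kept = []
    for words in pages:
        if len(words) >= x:
            kept.append(rng.sample(words, x))  # sample without replacement
    return kept
```

Sampling without replacement keeps the word-type distribution of each page roughly intact while removing the length confound.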


RE: Common vocabulary between pages - Davidsch - 05-11-2019

@DonaldFisk, concerning your excellent work 'A Principal Component Analysis..', some questions:

1.
If you plotted against vectors that represent all words and their repeated counts, why did (link available to registered members) cause problems with these 3 words, and did you leave them out?

2. 
I never did a comparison like figure 2, between the Currier types, and would be very interested in knowing the words from the top quadrants that are entirely green and entirely red, because I suspect there is only a slight internal difference between these individual words.

3. Regarding figure 3: "We can now see that each section is a cluster in its own right, but without gaps between them."
Why do you think that is, or what does it suggest?

To explain myself: I am not looking for differences per se, but for similarities among the deviating words.

4. Are those graphs made in Python, from tables imported from CSV?
I would like to do something similar on other texts; perhaps you could help me there.
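For what it's worth, the kind of pipeline asked about in question 4 can be done entirely in Python. A hedged sketch (my guess at a workflow, not DonaldFisk's actual code): read a word-count table from CSV, centre it, and extract the first principal component by power iteration, giving each page a score to plot.

```python
# PCA-from-CSV sketch: first principal component via power iteration.
import csv, io, math

def first_component(rows, iters=200):
    """rows: equal-length numeric lists, one row per page. Returns PC1 scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # centre the data
    v = [1.0] * d
    for _ in range(iters):
        # multiply v by the (unnormalised) covariance matrix X^T X
        s = [sum(X[i][j] * v[j] for j in range(d)) for i in range(n)]
        v = [sum(X[i][k] * s[i] for i in range(n)) for k in range(d)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    # project each page onto the component
    return [sum(X[i][j] * v[j] for j in range(d)) for i in range(n)]

# Toy CSV: header row of (made-up) word columns, one row per page.
data = "daiin,chedy,ol\n5,0,2\n4,1,2\n0,6,1\n1,5,0\n"
reader = csv.reader(io.StringIO(data))
header = next(reader)
rows = [[float(x) for x in row] for row in reader]
scores = first_component(rows)
```

On these toy numbers the first two pages land on one side of the component and the last two on the other, which is exactly the kind of separation a Currier A/B plot would show.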

NB: you have a great sense of humor linking to "series of tubes", haha