Oh wow! Caveat: I've never done anything like this before.
a) All errors and omissions are mine.
b) I am a hobbyist with no skills or formal knowledge; indeed, this is my introduction to data science.
c) There is an awesome amount of data in these plots, and there is a lot going on.
1. Cosine similarity works at the word level, and since some words occur only in Currier A or Currier B,
the Currier languages will be present in this data.
Rene explains this in depth on his site.
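For anyone following along, here is a minimal sketch of the core comparison (not my exact pipeline): cosine similarity between two folios represented as word-count vectors over their shared vocabulary. The example folio words are made up for illustration.

```python
# Minimal sketch: cosine similarity between two folios as word-count vectors.
from collections import Counter
import math

def cosine(a_words, b_words):
    a, b = Counter(a_words), Counter(b_words)
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Words unique to one Currier language contribute nothing to the dot
# product, so folios in different languages score lower.
folio_a = "daiin chol daiin shol".split()       # illustrative words
folio_b = "daiin chol qokedy qokeedy".split()   # illustrative words
print(cosine(folio_a, folio_b))
```

You can see why the A/B split shows up: vocabulary that never co-occurs drags the similarity down regardless of anything else.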
2. The cosine similarity code I used applies L2 normalization to counter word-frequency effects.
However, the word density in the balneo and recipes sections is very high, and as such
I don't think L2 normalization can remove that effect completely (see the Random image).
3. In the T. Timm image there is a subtle grid layout running throughout the image, which I think is created
by the parameters in the auto-copying theory governing when and how the stream of words is modified.
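A toy illustration of my reading of the auto-copying idea (this is not Timm's actual generator; the look-back window, mutation rate, and letter pool are all hypothetical): each new word is a copy of a recently written word, sometimes with a one-character change. Fixed parameters like these would impose the kind of periodic self-similarity that could show up as a grid in a similarity plot.

```python
# Toy auto-copying text generator (hypothetical parameters throughout).
import random

def auto_copy(seed_words, n_words, window=8, p_modify=0.3, rng=None):
    rng = rng or random.Random(0)
    text = list(seed_words)
    alphabet = "odaiinchekyqlrs"  # illustrative EVA-like letter pool
    while len(text) < n_words:
        src = rng.choice(text[-window:])     # copy from a sliding window
        if rng.random() < p_modify and src:  # sometimes mutate one letter
            i = rng.randrange(len(src))
            src = src[:i] + rng.choice(alphabet) + src[i + 1:]
        text.append(src)
    return text

print(" ".join(auto_copy(["daiin", "chol"], 20)))
```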
4. For comparison, I did the same thing to Dante's Divine Comedy in Italian,
divided into Voynich-style folios with the top 20 words removed.
It shows the same word-frequency bias as the other three images, but its distribution is much smoother
than the Timm or Voynich images and seems to increase gradually from left to right/top to bottom.
Interestingly, it does not show the similarity links one would expect given its structure;
i.e. the division into three 'cantiche' is not obvious (see the Dante image).
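For reference, the Dante preprocessing can be sketched like this (my reconstruction from the description above; the folio size of 250 words is an assumption, as the post does not state it):

```python
# Split a text into fixed-size "folios" and drop the top-N frequent words.
from collections import Counter

def make_folios(text, folio_size=250, top_n=20):
    words = text.lower().split()
    top = {w for w, _ in Counter(words).most_common(top_n)}
    kept = [w for w in words if w not in top]
    return [kept[i:i + folio_size] for i in range(0, len(kept), folio_size)]

# Tiny demo with repeated text and small parameters.
folios = make_folios("nel mezzo del cammin di nostra vita " * 100,
                     folio_size=50, top_n=3)
print(len(folios), len(folios[0]))
```

Removing the top 20 words strips most function words, which is what lets the content-word distribution (and hence the smooth gradient) dominate the image.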