Categorizing the text-only pages

RE: Categorizing the text-only pages

MarcoP > 27-07-2020, 07:33 AM

Hi JKP,
If I understand correctly, light blue corresponds to Scribe1. Herbal pages (big plants) do not have a specific colour; they are just marked by scribe colours. So pages 26r, 95v2, 95r1 and 45r at the bottom left border are all Herbal pages, by three different scribes.
Dots corresponding to all other sections have two colours: a smaller dot for the scribe and a larger circle or connected cluster for the section.

@RobGea: what does "cosine matched" mean exactly? Are you using cosine-similarity on the N-dimensional vectors of token counts (where N is the total number of word-types)?
If so, this seems to me a significant improvement with respect to what Julian described on voynichattacks (his system only uses word-types, ignoring token counts).
I don't know how easy it is to do, but I would be curious to see the same plot based on PCA instead of whatever system LinLog uses: I suspect that the closeness among page couples across the A/B boundary (e.g. 65v / 88v) is an artefact of this particular plotting system. But of course a PCA plot will be much less easy to read.
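
As a rough illustration of the PCA alternative suggested above, here is a minimal sketch in plain Python that projects a set of vectors onto their first two principal components. Everything in it is an assumption made for illustration: the function names (`pca_2d`, `dominant_eigenpair`), the use of power iteration with deflation, and the toy `rows` data, which stand in for one vector per page. It is not what LinLog does and not what anyone in this thread ran.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract each column's mean so the data is centred on the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(ci, cj) / (n - 1) for cj in cols] for ci in cols]

def dominant_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration on a symmetric matrix; seeded so runs are repeatable."""
    rng = random.Random(seed)
    vec = [rng.gauss(0, 1) for _ in matrix]
    norm = math.sqrt(dot(vec, vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = dot(vec, [dot(row, vec) for row in matrix])
    return vec, eigenvalue

def pca_2d(rows):
    """Return (PC1, PC2) coordinates for every row."""
    centered = center(rows)
    cov = covariance(centered)
    pc1, ev1 = dominant_eigenpair(cov)
    size = len(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - ev1 * pc1[i] * pc1[j] for j in range(size)]
                for i in range(size)]
    pc2, _ = dominant_eigenpair(deflated, seed=1)
    return [(dot(r, pc1), dot(r, pc2)) for r in centered]

# Toy data (hypothetical): four "pages" described by two features each.
rows = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
coords = pca_2d(rows)
```

Unlike a layout started from random positions, a PCA projection of the same data gives the same picture every time (up to the sign of each axis). For real page vectors with thousands of word-types, a numerical library would be the practical choice; the pure-Python version above is only meant to show the steps.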

I think that what you are doing is of the greatest interest. Further exploring Julian's experiments with LinLog seems very promising: I am looking forward to reading more of what you find!
You make me want to play with this software myself, but as a first step I should probably read some of the papers you linked...
RE: Categorizing the text-only pages

RobGea > 27-07-2020, 04:20 PM

Hi all,
MarcoP has succinctly described what's going on in the graphic.
A dot that is only a single color on the white background is a Herbal (big plants) page, and that single color refers to the Scribe who wrote it.
My apologies for not making it clearer.

MarcoP asked "Are you using cosine-similarity on the N-dimensional vectors of token counts (where N is the total number of word-types)?"
If I have understood what I'm doing correctly, then yes.
I used this: [link hidden]
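
To make the distinction in MarcoP's question concrete, here is a minimal sketch of cosine similarity computed two ways: on token-count vectors, and on word-type presence only. The function name `cosine_similarity` and the two toy pages are assumptions for illustration; this is not the hidden linked tool and not necessarily how the plot in this thread was produced.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {word: weight}."""
    shared = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return shared / (norm_a * norm_b)

# Two hypothetical pages, already split into word tokens.
page_a = "daiin chol daiin chor daiin".split()
page_b = "daiin chol shedy qokeedy".split()

# Token counts: each dimension is a word-type, each value is how often it occurs.
counts_a = Counter(page_a)
counts_b = Counter(page_b)
sim_counts = cosine_similarity(counts_a, counts_b)

# Word-types only: every word that occurs gets weight 1, counts are ignored.
types_a = dict.fromkeys(counts_a, 1)
types_b = dict.fromkeys(counts_b, 1)
sim_types = cosine_similarity(types_a, types_b)
```

The two scores differ because the count-based version lets a word repeated three times on one page pull the vector towards that word, while the type-based version treats every word that occurs as equally important.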

Using term frequency–inverse document frequency (tf-idf) in addition to cosine-similarity would be an improvement.
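
For readers unfamiliar with tf-idf, here is a minimal sketch of one common variant (raw term count multiplied by log(N / document frequency)). The function name `tfidf_vectors` and the three toy pages are assumptions for illustration; other weightings and smoothing schemes exist, and nothing here has been run on the actual transliteration.

```python
import math
from collections import Counter

def tfidf_vectors(pages):
    """pages: {page_name: [tokens]}. Returns {page_name: {word: tf-idf weight}}."""
    n_pages = len(pages)
    counts = {name: Counter(tokens) for name, tokens in pages.items()}

    # Document frequency: on how many pages does each word-type occur?
    doc_freq = Counter()
    for page_counts in counts.values():
        doc_freq.update(page_counts.keys())

    # Words found on every page get idf = log(1) = 0 and stop influencing similarity.
    idf = {word: math.log(n_pages / df) for word, df in doc_freq.items()}

    return {
        name: {word: tf * idf[word] for word, tf in page_counts.items()}
        for name, page_counts in counts.items()
    }

# Three hypothetical pages, already split into word tokens.
pages = {
    "p1": "daiin chol daiin".split(),
    "p2": "daiin shedy".split(),
    "p3": "daiin chol qokeedy".split(),
}
vectors = tfidf_vectors(pages)
```

The resulting weighted vectors can be compared with cosine similarity in place of the raw counts, so that very common words contribute little and rarer, page-specific words contribute more.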

Regarding [link hidden]:
It is now an old (in internet terms) tool from 2009. It is very easy to use, but the layout it creates starts from a random seed, so it is not deterministic.
If you run it several times with the same data, you will get slightly different layouts.
Overall, LinLogLayout will produce similar big groupings on multiple runs, but the specific neighbors of a folio may well differ.
(I have added a note about this to my previous post.)

If you have the Java JDK, you can edit the source code and recompile to modify this behaviour (I have not done this): [link hidden]
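
The general idea of making a randomly initialised layout repeatable can be sketched as follows. This is not LinLogLayout's code (which is Java and which I have not modified); the function `initial_positions` and the folio list are hypothetical, and the sketch only shows why fixing the seed of the random number generator gives identical starting positions on every run.

```python
import random

def initial_positions(nodes, seed=None):
    """Assign each node a random 2-D starting position.

    With seed=None the generator is seeded from system entropy, so every
    run differs; with a fixed seed the positions are the same every time.
    """
    rng = random.Random(seed)
    return {node: (rng.random(), rng.random()) for node in nodes}

# Hypothetical folio names used as graph nodes.
folios = ["65v", "88v", "26r"]

run_1 = initial_positions(folios, seed=42)
run_2 = initial_positions(folios, seed=42)   # identical to run_1
unseeded = initial_positions(folios)         # differs from run to run
```

An iterative layout that starts from the same positions and performs the same deterministic updates will end in the same picture, which is what would make individual folio neighbourhoods comparable between runs.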

From my very limited understanding, the algorithms for the Linear Logarithmic layout are now widespread and can be found in many data science tools.
One of these is [link hidden]
RE: Categorizing the text-only pages

-JKP- > 27-07-2020, 04:21 PM

Thanks, Marco.

Rob, this topic interests me, but I'll have to come back to it... duty calls (work day).