(01-09-2019, 03:54 PM)Koen G Wrote: Doesn't that look like it's not normalized?
It's not.
a = number of word types on one page
b = number of word types on another page
c = number of word types the two pages have in common
pct = 100*c/(a+b-c)
Maybe averaging c/a and c/b would give a better indicator of how much the pages have in common, giving equal importance to both pages?
pct = 50*(c/a+c/b)
EDIT: no, the results are visually very close.
[attachment=3235]
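For anyone who wants to try the two formulas outside the spreadsheet, here is a minimal Python sketch. The function names and the example counts (a=60, b=40, c=20) are made up for illustration and are not taken from the sheets. The first formula is the Jaccard index expressed as a percentage, since a+b-c is the size of the union of the two pages' types.

```python
def jaccard_pct(a, b, c):
    """First formula: common types as a percentage of all distinct types on both pages."""
    return 100 * c / (a + b - c)


def mean_share_pct(a, b, c):
    """Second formula: average of c/a and c/b, as a percentage."""
    return 50 * (c / a + c / b)


# Hypothetical counts, for illustration only.
a, b, c = 60, 40, 20
print(jaccard_pct(a, b, c))     # 100*20/80 = 25.0
print(mean_share_pct(a, b, c))  # 50*(20/60 + 20/40), roughly 41.7
```

Both functions assume a and b are greater than zero; a page with no types would need special handling.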
I have a really hard time wrapping my head around how this could work.
When one page has 50 types and the other has 50 as well, c can be at most 50.
But when one page has 90 types and the other has 10, c can be at most 10.
So ideally I think we should strive for a situation where the maximum possible c value always results in the same score.
I did some tests with both formulas and your second formula comes closer to the desired numbers.
[attachment=3236]
Edit: Nablator, I only just saw your edit. Even though the colors are close, your second attempt should still be better. But it's not yet optimal.
Is there a way to derive a formula from a bunch of examples and desired outcome?
To get the values of the DESIRED column, you want 100 * c / min(a, b). It's a good formula.
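As a quick check of why this formula meets the "same score at max c" requirement, here is a small Python sketch. The function name is hypothetical, and the page sizes are the 50/50 and 90/10 cases from the post above. c / min(a, b) is commonly known as the overlap coefficient, and since c can never exceed min(a, b), the score reaches 100 exactly when the smaller page is fully contained in the larger one.

```python
def overlap_pct(a, b, c):
    """c as a percentage of the largest value c could take, which is min(a, b)."""
    return 100 * c / min(a, b)


# The two cases discussed above, each evaluated at its maximum possible c.
for a, b in [(50, 50), (90, 10)]:
    c_max = min(a, b)
    print(a, b, c_max, overlap_pct(a, b, c_max))  # score is 100.0 in both cases
```

For comparison, at a=90, b=10, c=10 the first formula gives 100*10/90 (about 11) and the averaged formula gives 50*(10/90 + 10/10) (about 56), so neither of those reaches 100 for unequal page sizes.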
(01-09-2019, 07:31 PM)nablator Wrote: To get the values of the DESIRED column, you want 100 * c / min(a, b). It's a good formula.
Could you make the sheet with that one?
(01-09-2019, 07:47 PM)Koen G Wrote: Could you make the sheet with that one?
Is it better? They all look nearly the same to me.
[attachment=3237]
[attachment=3238]
[attachment=3239]
It does look the same. Still, I think this formula normalizes the data as well as we can? Basically it expresses c as a percentage of the maximum possible c.
It does change the order of folios when you sort them though. For example, if I sort by f40r, the first sheet gives as highest hits: [link] f94r [link]
For the new one it is: [link] f39r [link]
So nothing major, just some shuffling. But it's better.
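To illustrate how the choice of formula can reshuffle the ranking, here is a toy Python sketch. Everything in it is invented for illustration: the page names, the word types ("w1", "x1", ...) and the helper names are placeholders, not data from the manuscript or the sheets. It assumes each page is available as a non-empty set of word types.

```python
def jaccard_pct(x, y):
    """100 * c / (a + b - c), computed from two sets of word types."""
    return 100 * len(x & y) / len(x | y)


def overlap_pct(x, y):
    """100 * c / min(a, b), computed from two sets of word types."""
    return 100 * len(x & y) / min(len(x), len(y))


def rank(target, pages, score):
    """Names of all other pages, sorted from highest to lowest score against target."""
    others = [name for name in pages if name != target]
    return sorted(others, key=lambda name: score(pages[target], pages[name]), reverse=True)


# Invented toy pages: "small" is fully contained in "target", "large" shares more
# types in absolute terms but also has many types of its own.
pages = {
    "target": {"w1", "w2", "w3", "w4", "w5", "w6"},
    "small": {"w1", "w2"},
    "large": {"w1", "w2", "w3", "w4", "x1", "x2", "x3", "x4"},
}

print(rank("target", pages, jaccard_pct))  # ['large', 'small']
print(rank("target", pages, overlap_pct))  # ['small', 'large']
```

With the first formula the large page comes out on top (4 of 10 distinct types shared versus 2 of 6), while with c / min(a, b) the small page wins because all of its types occur on the target page.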
This may be of use: [link]
(31-08-2019, 09:23 PM)Koen G Wrote: Has anyone collected data on which pages share most vocabulary? Specifically, I'd like to know when I select a page (for example f1v), which other folios share the most different word types with it?
I did this using Principal Component Analysis for the whole manuscript in [link], and for herbal, text/recipe, and biological pages in [link].