ReneZ > 18-10-2025, 12:52 AM
MarcoP > 18-10-2025, 06:03 AM
u/Miseryy Wrote: Your UMAP is massively overfit. In general, tight strings that curve around are just indicative of a very small # of neighbors used and too small a distance threshold. You can replicate this effect with ~any dataset.
Also, I'm of the opinion you should never do clustering on UMAP, ever. Furthermore "UMAP clustering" isn't a noun that exists. UMAP can be used as an initial preprocessing step, and then a standard clustering algorithm can be used. But again, I think it's terrible methodology, since you can ~always tune UMAP to achieve the clusters you want in the first place. People do it though, no denying that.
I'd suggest going with the default parameters unless you really know what you're doing and have a good justification (read: mathematical reason) to adjust them. The parameters affect the math.
Don't focus too much on meaning. If you want meaning, use PCA, and look at the vectors. The only interpretable meaning of UMAP is relative positioning. And even that is sketchy. What you really should be taking away is: there are groups that can be visually separated and appear to be distinct. UMAP is not proof of anything.
I would recommend that you just do clustering on your data. Not "UMAP clustering". How about starting with a simple hierarchical clustering and then looking at what you get? You can cluster either across genes or samples, and observe what falls into what group.
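As an illustration of the suggestion above, here is a minimal sketch of hierarchical clustering applied directly to the data, with no UMAP step. The synthetic `points`, the choice of average linkage, and the function name `average_linkage` are assumptions made for the example, not anything from the quoted comment; in practice a library routine such as `scipy.cluster.hierarchy.linkage` would normally be used instead of this naive loop.

```python
import math
import random

def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage (only for small data)."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        # Mean pairwise Euclidean distance between two clusters of indices
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (i < j) and merge j into i
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical data: two well-separated synthetic groups of 10 samples each
rng = random.Random(0)
points = [[rng.gauss(0.0, 0.5) for _ in range(3)] for _ in range(10)]
points += [[rng.gauss(5.0, 0.5) for _ in range(3)] for _ in range(10)]

clusters = average_linkage(points, n_clusters=2)
for members in clusters:
    print(sorted(members))
```

Each printed list holds the sample indices that fell into one group, which is the "observe what falls into what group" step. Rows could equally be features instead of samples if the matrix is transposed first.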
quimqu > 18-10-2025, 11:37 AM
MarcoP > 19-10-2025, 09:04 AM
quimqu > 19-10-2025, 09:53 AM
(19-10-2025, 09:04 AM)MarcoP Wrote: I am not sure I understand correctly, but do the PCA K=2 plots show a single outlier at the top that makes the plots basically useless? Is that so? Or maybe it's a set of samples that generate identical outlier dots?
Anyway, if that actually is the situation, investigating and removing, or fixing, that sample (or samples) could improve the quality of the overall analysis.
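One way to check both possibilities raised in the quote (a single outlier, or several samples landing on the same spot) is sketched below. It assumes the 2-D PCA coordinates are already available as a list of tuples; the values in `coords`, the function names, and the robust z-score cutoff of 3.5 are hypothetical choices for the example, not taken from the actual plots.

```python
from collections import Counter
from statistics import median

def flag_outliers(coords, threshold=3.5):
    """Return indices whose robust (median/MAD) z-score on any axis exceeds threshold."""
    flagged = set()
    for axis in range(len(coords[0])):
        values = [c[axis] for c in coords]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread on this axis, so the score is undefined
        for i, v in enumerate(values):
            if 0.6745 * abs(v - med) / mad > threshold:
                flagged.add(i)
    return sorted(flagged)

def duplicate_points(coords):
    """Return coordinates shared by more than one sample, with their counts."""
    return {pt: n for pt, n in Counter(coords).items() if n > 1}

# Hypothetical (PC1, PC2) values: six ordinary samples plus two identical far-away dots
coords = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.1), (0.0, 0.0),
          (-0.1, 0.3), (0.3, -0.2), (0.1, 25.0), (0.1, 25.0)]

print(flag_outliers(coords))     # indices of suspicious samples
print(duplicate_points(coords))  # points that several samples share
```

If the flagged indices all map to one duplicated coordinate, the "single dot" is really several samples, and those are the ones to inspect before re-running the PCA without them.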
quimqu > 19-10-2025, 10:16 AM
quimqu > 19-10-2025, 09:31 PM
MarcoP > 20-10-2025, 06:29 AM
Quote: Most of the first 1/3 of the herbal is almost pure A. But then, gradually, paragraphs that mix in B topic words appear, and appear quite strongly. They are not as continuous as the strong topic A paragraphs, but they appear suddenly after the first 1/3 of the herbal paragraphs. Are the herbs starting to be described partially with topic B?
We have known what is happening here since Currier’s analysis half a century ago: herbal bifolios by different scribes were mixed and bound together. This is detailed in [link] (Table 1). All pages from f1 to f25 (“the first 1/3 of the herbal”) were created by scribe 1. After that point, scribe 2 pages begin to be intermixed with scribe 1 pages. If we take the illustrations as indicative of a topic, the change in statistics does not appear to be due to a different topic, but to a different scribe.
quimqu > 20-10-2025, 10:24 AM
(20-10-2025, 06:29 AM)MarcoP Wrote: If we take the illustrations as indicative of a topic, the change in statistics does not appear to be due to a different topic, but to a different scribe.
(20-10-2025, 06:29 AM)MarcoP Wrote: Hi quimqu, as I pointed out [link], this appears to overlap with the ongoing research by Lisa Fagin Davis and Colin Layfield. We know that quires were put together in an at least partly arbitrary way; trying to understand more of the order in which the ms was created is extremely interesting.
(20-10-2025, 06:29 AM)MarcoP Wrote: If I understand correctly, Lisa and Colin are going even more in-depth, also considering the stains that affect many of the pages; they are working on bifolio-level reordering, rather than section-level. I expect that their paper will be a major step forward in our understanding of the structure of the text.
(20-10-2025, 06:29 AM)MarcoP Wrote: We have known what is happening here since Currier’s analysis half a century ago: herbal bifolios by different scribes were mixed and bound together. This is detailed in [link] (Table 1). All pages from f1 to f25 (“the first 1/3 of the herbal”) were created by scribe 1. After that point, scribe 2 pages begin to be intermixed with scribe 1 pages. If we take the illustrations as indicative of a topic, the change in statistics does not appear to be due to a different topic, but to a different scribe.
(20-10-2025, 06:29 AM)MarcoP Wrote: The CUVA bigram plots [link] (or before) are a good match for your topic line. See bottom of the page.
E.g. the plot for ‘ed’ shows how the zodiac section (gray) appears to gradually shift from Currier A (bottom) towards Currier B (top). It also shows that Quire13 Bio is more strongly “B” than Quire20 Star-Paragraphs. It also shows that Pharma (yellow) is comparable with Herbal A pages (both HA and Pharma are attributed to Scribe 1).
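For readers who want to reproduce this kind of per-page statistic, the following is a rough sketch of a bigram rate computed over the words of each page. How the CUVA plots actually define and normalise their values is not stated here, so the definition below (occurrences of the bigram divided by all within-word bigrams) is an assumption, and the page names and word lists are toy placeholders, not real transcription data.

```python
def bigram_rate(words, bigram):
    """Fraction of within-word character bigrams that equal `bigram`."""
    total = 0
    hits = 0
    for word in words:
        for k in range(len(word) - 1):
            total += 1
            if word[k:k + 2] == bigram:
                hits += 1
    return hits / total if total else 0.0

# Toy placeholder pages (hypothetical labels and contents)
pages = {
    "page_a": ["daiin", "chol", "chor"],
    "page_b": ["qokeedy", "chedy", "daiin"],
}

rates = {name: bigram_rate(words, "ed") for name, words in pages.items()}
for name, rate in rates.items():
    print(name, round(rate, 3))
```

Plotting such a rate against page or quire order would give a curve comparable in spirit to the one described above, with low values on the A side and high values on the B side.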
(20-10-2025, 06:29 AM)MarcoP Wrote: EDIT: another point that I think Rene mentioned in the past. Results based on only a few samples are noisier and less reliable than results based on larger sets. This could play a role in the fact that Bio/Q13 paragraphs get more consistent results than the much shorter Stars/Q20 paragraphs.
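The sample-size point can be made concrete with the standard error of an estimated proportion, sqrt(p(1-p)/n). The sketch below assumes a simple binomial model in which each token independently has probability p of being a "B-type" word; that model, the value p = 0.10, and the paragraph lengths of 50 and 500 tokens are illustrative assumptions, not measurements from the manuscript.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of an observed proportion under a binomial model."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical paragraph lengths: a short one and one ten times longer
short_se = proportion_standard_error(0.10, 50)
long_se = proportion_standard_error(0.10, 500)

print(f"short paragraph: +/- {short_se:.3f}")
print(f"long paragraph:  +/- {long_se:.3f}")
print(f"ratio: {short_se / long_se:.2f}")
```

Under these assumptions a tenfold increase in length shrinks the noise by a factor of sqrt(10), so shorter paragraphs would be expected to scatter more even if the underlying language were identical.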
Jorge_Stolfi > 20-10-2025, 12:00 PM
(20-10-2025, 06:29 AM)MarcoP Wrote: this information can be used to reorder Voynich sections, so that the text begins with Strong-A (HA) and ends with Strong-B (Bio). If I understand correctly, Lisa and Colin are going even more in-depth, also considering the stains that affect many of the pages; they are working on bifolio-level reordering, rather than section-level.