quimqu > 21-09-2025, 10:05 PM
(21-09-2025, 09:45 PM)Jorge_Stolfi Wrote: So you are trying several K-clustering algorithms on the same set of data points, where each point is a bag-of-words -- essentially an N-vector where N is the number of distinct word types. And the algorithms do not assign each parag to a single cluster (topic), but give it a "belonging" or "mixing" score for each cluster. Is that correct?
Can we interpret those scores as Bayesian probabilities of the parag belonging to each topic?
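A minimal sketch of the setup described in the question, assuming toy tokenized parags and made-up mixing scores (none of this is quimqu's actual code or data). It builds the N-vector for each parag and rescales one parag's scores so they sum to 1. Rescaling only makes the scores look like a probability distribution; whether they can be read as Bayesian posteriors depends on the model that produced them.

```python
from collections import Counter

def bag_of_words(parags):
    """Turn tokenized parags into count vectors over a shared vocabulary."""
    vocab = sorted({w for p in parags for w in p})
    vectors = []
    for p in parags:
        counts = Counter(p)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def normalize(scores):
    """Rescale non-negative mixing scores so that they sum to 1."""
    total = sum(scores)
    if total == 0:
        # No evidence for any cluster: fall back to a uniform split.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

# Toy input (hypothetical): two parags, already split into word tokens.
parags = [["daiin", "chedy", "daiin"], ["qokeedy", "chedy"]]
vocab, vectors = bag_of_words(parags)

# Hypothetical raw mixing scores of one parag over three clusters.
mix = normalize([3.0, 1.0, 0.0])
```

Here N is `len(vocab)`, and `mix` holds one weight per cluster for a single parag.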
quimqu > 22-09-2025, 08:44 AM
Jorge_Stolfi > 22-09-2025, 03:52 PM
(22-09-2025, 08:44 AM)quimqu Wrote: That pattern looks like shared structure running through the book, not several unrelated streams of invented text.
Rafal > 22-09-2025, 07:28 PM
Jorge_Stolfi > 22-09-2025, 08:24 PM
(22-09-2025, 07:28 PM)Rafal Wrote: If there are patterns then we should be able to see something in the text - beginnings and ends of sentences, repeated phrases, nouns and verbs, "and" word, numbers and so on. But we can't.
quimqu > 22-09-2025, 09:40 PM
Jorge_Stolfi > 22-09-2025, 09:50 PM
(22-09-2025, 08:44 AM)quimqu Wrote: recurring “themes” in the text and to say how strongly each paragraph uses each theme. With that setup, the model settled on six themes that line up with the well-known sections: Herbal, Pharmaceutical, Biological (balneological), Astronomical/Zodiac, Marginal Stars, and a Text/Cosmological connector. When the manuscript switches section, the dominant theme usually flips at the same boundary.
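The claim that the dominant theme flips at section boundaries can be checked with something like the sketch below. It assumes a per-paragraph list of theme weights in reading order and a list of paragraph indices where a new section starts; the numbers are invented for illustration and the function names are hypothetical.

```python
def dominant_themes(weights):
    """Index of the strongest theme for each paragraph."""
    return [max(range(len(row)), key=row.__getitem__) for row in weights]

def flip_positions(dominant):
    """Paragraph indices where the dominant theme differs from the previous one."""
    return [i for i in range(1, len(dominant)) if dominant[i] != dominant[i - 1]]

def boundary_agreement(flips, section_starts):
    """Fraction of theme flips that coincide with a known section start."""
    starts = set(section_starts)
    if not flips:
        return 0.0
    return sum(1 for i in flips if i in starts) / len(flips)

# Toy weights (hypothetical): six paragraphs, two themes.
weights = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
           [0.2, 0.8], [0.1, 0.9], [0.6, 0.4]]
section_starts = [3]  # assumed: a new section begins at paragraph 3

dominant = dominant_themes(weights)
flips = flip_positions(dominant)
agreement = boundary_agreement(flips, section_starts)
```

In this toy case one of the two flips falls on the section start, so the agreement is 0.5. A value near 1 on real data would support the "flips at the same boundary" reading.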
Jorge_Stolfi > 22-09-2025, 10:04 PM
(22-09-2025, 09:40 PM)quimqu Wrote: Now I’ve run into a different issue. I’m drilling down by section, starting with Herbal, by filtering to those paragraphs only. Inside Herbal I expected at least a couple of thematic clusters (think Culpeper-style: when to harvest, flowering stages, preparations, etc.). Instead, the splits I’m getting are almost entirely language A vs B, with little evidence of cross-language themes. That’s surprising. Herbal is big enough that, ideally, we should see some topic diversity across languages, but we don't; the fact that I’m mostly seeing language separation makes me wonder why...
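One way to put a number on "almost entirely language A vs B" is to cross-tabulate cluster labels against language labels and compute cluster purity. This is only a sketch with invented labels; purity is a measure swapped in here for illustration, not something the post says was used.

```python
from collections import Counter

def crosstab(clusters, languages):
    """Count paragraphs for each (cluster, language) pair."""
    return Counter(zip(clusters, languages))

def purity(clusters, languages):
    """Share of paragraphs that belong to the majority language of their cluster."""
    best = {}
    for (cluster, _lang), n in crosstab(clusters, languages).items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / len(clusters)

# Toy labels (hypothetical): six Herbal paragraphs, two clusters.
clusters = [0, 0, 0, 1, 1, 1]
languages = ["A", "A", "B", "B", "B", "B"]

table = crosstab(clusters, languages)
score = purity(clusters, languages)
```

A purity close to 1 means the clusters mostly restate the A/B split; a value near the share of the larger language means the clusters cut across languages.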
quimqu > 22-09-2025, 10:11 PM
(22-09-2025, 03:52 PM)Jorge_Stolfi Wrote: (22-09-2025, 08:44 AM)quimqu Wrote: That pattern looks like shared structure running through the book, not several unrelated streams of invented text.
In one of the few talks [link not shown], back in the late 2nd millennium CE (at the IMPA Brazilian Mathematics Colloquium), I mentioned the use of word distribution similarity to infer the original nesting and folding of the Biology bifolios. The relevant part of that file starts at slide/page 25. The basic idea is to solve a Traveling Salesman Problem in N-space, by exhaustive enumeration of all possible page orders. It is feasible because the physical pairing of pages and folios puts strong constraints on these orders.
I did not give the result in that talk because (IIRC) it was rather ambiguous and I did not have time to discuss it. Anyway the transcription quality must have improved enormously since then, and the page distance I used probably was not ideal. I think it would be better to redo it from scratch. You seem to have all the required boring parts programmed already...
All the best, --jorge
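A minimal sketch of the enumeration idea from the post above, under loudly stated assumptions: a single quire in which every bifolio is nested inside the previous one, each bifolio free to be folded either way, and cosine distance between page word vectors. The original talk's distance measure and data are not given in the thread, so the distance, the page ids, and the toy vectors here are all hypothetical.

```python
from itertools import permutations, product
from math import cos, radians, sin, sqrt

def cosine_distance(u, v):
    """1 - cosine similarity; an assumed stand-in for the page distance."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def page_sequence(nesting, flips):
    """Reading order of pages for bifolios listed from outermost to innermost.

    Each bifolio is (leaf_a, leaf_b); each leaf is (recto, verso).
    A flip folds the bifolio the other way, so leaf_b comes first.
    """
    front, back = [], []
    for (leaf_a, leaf_b), flip in zip(nesting, flips):
        first, last = (leaf_b, leaf_a) if flip else (leaf_a, leaf_b)
        front.extend(first)
        back = list(last) + back
    return front + back

def best_order(bifolios, vectors):
    """Try every nesting order and fold; keep the cheapest page path."""
    best = None
    for nesting in permutations(bifolios):
        for flips in product([False, True], repeat=len(bifolios)):
            seq = page_sequence(nesting, flips)
            cost = sum(cosine_distance(vectors[p], vectors[q])
                       for p, q in zip(seq, seq[1:]))
            if best is None or cost < best[0]:
                best = (cost, seq)
    return best

# Toy data (hypothetical): 8 pages whose vectors drift smoothly from p0 to p7.
vectors = {f"p{i}": (cos(radians(10 * i)), sin(radians(10 * i))) for i in range(8)}
# Two bifolios given in scrambled order, the first one also folded the wrong way.
bifolios = [(("p4", "p5"), ("p2", "p3")), (("p0", "p1"), ("p6", "p7"))]

cost, order = best_order(bifolios, vectors)
```

With k nested bifolios this tries k! × 2^k arrangements (3840 for k = 5), which is why the physical pairing makes exhaustive search practical where a free ordering of all pages would not be.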
quimqu > 22-09-2025, 10:27 PM
(22-09-2025, 09:50 PM)Jorge_Stolfi Wrote: Consider the hypothesis
"in each page (with maybe a couple of exceptions), all parags have the same 'topic'".
Are your results compatible with this hypothesis? In other words, how many pages did you find which have two or more parags with clearly distinct topics?
If there are such pages, is the break between topics consistent with a transition between two "sections" that falls in the middle of a page? That is, do you see only transitions like [AA][AAA][AAB][BBB][BB][BAA][AA], or are there transitions like [AA][BA][BBB][BAB][AA] etc.?
I would not join labels to nearby parags. Label distributions are quite different from parag word distributions. If a page has two parags and you join a bunch of labels to the second parag, they will probably come out as having different topics, just for that reason.
All the best, --jorge
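The two questions in the post above (how many pages mix topics, and whether the mixed pages look like a section boundary falling mid-page) can be sketched as below. It assumes each page is given as a string of per-parag topic letters in reading order, as in the bracket notation of the post; the "clean boundary" rule is one possible reading of that notation, not an established criterion.

```python
def mixed_pages(page_topics):
    """Indices of pages whose parags carry more than one topic."""
    return [i for i, topics in enumerate(page_topics) if len(set(topics)) > 1]

def is_clean_boundary(page_topics, i):
    """True if page i looks like one section ending and the next one starting.

    Assumed rule: the topic changes exactly once within the page, the previous
    page ends with the page's first topic, and the next page starts with its
    last topic.
    """
    topics = page_topics[i]
    changes = sum(1 for a, b in zip(topics, topics[1:]) if a != b)
    if changes != 1:
        return False
    prev_ok = i == 0 or page_topics[i - 1][-1] == topics[0]
    next_ok = i == len(page_topics) - 1 or page_topics[i + 1][0] == topics[-1]
    return prev_ok and next_ok

# The two example patterns from the post, one string per page.
consistent = ["AA", "AAA", "AAB", "BBB", "BB", "BAA", "AA"]
interleaved = ["AA", "BA", "BBB", "BAB", "AA"]

consistent_mixed = mixed_pages(consistent)
interleaved_mixed = mixed_pages(interleaved)
consistent_clean = [is_clean_boundary(consistent, i) for i in consistent_mixed]
interleaved_clean = [is_clean_boundary(interleaved, i) for i in interleaved_mixed]
```

On the first pattern both mixed pages pass as clean boundaries; on the second neither does, which is the distinction the question is after.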