quimqu > 22-08-2025, 09:28 AM
(22-08-2025, 08:42 AM)oshfdk Wrote: (22-08-2025, 07:53 AM)quimqu Wrote: The null hypothesis is:
H₀: topics are independent of language/hand.
I'm not sure this null hypothesis is valid, at least not when it's used for the languages. Since languages were initially defined using the properties of the text (relative abundance of various words and symbol combinations), and topic modeling uses the same properties, the null hypothesis states an a priori impossible situation, so its p-value might not make any sense. I'm not a scientist though and my experience with p-values is nearly zero.
For hands this is more interesting, but as far as I know, some correlation between hands and languages does exist? This is not really my area.
Mauro > 22-08-2025, 10:38 AM
(22-08-2025, 07:53 AM)quimqu Wrote: The p-value is the probability of obtaining a chi-square statistic at least as extreme as the observed one, assuming the null hypothesis is true.
A low p-value (e.g., < 0.05) means we reject the null hypothesis, suggesting that the distribution of topics depends on the language or scribal hand.
A high p-value means we cannot reject the null, suggesting no evidence of dependence.
In both cases, we have a very low p-value (lowest when we analyze the topics vs. hands), rejecting the null hypothesis that topics are independent of language/hand.
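As a minimal sketch of the test described above (not the script used for the results in this thread), the chi-square statistic and its p-value can be computed from a contingency table of counts. The table below is invented for illustration, the function name `chi_square_independence` is hypothetical, and the p-value uses a closed form that only holds for an even number of degrees of freedom.

```python
import math

def chi_square_independence(table):
    """Return (chi2, dof, p) for a contingency table given as a list of rows.

    Assumes every row and every column contains at least one count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    if dof <= 0 or dof % 2:
        raise ValueError("this sketch only handles an even, positive number of degrees of freedom")

    # Chi-square survival function for even dof:
    # P(X >= x) = exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
    half = chi2 / 2
    p = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return chi2, dof, p

# Invented counts: rows = Currier language A / B, columns = 3 topics
table = [[30, 10, 20],
         [10, 30, 20]]
chi2, dof, p = chi_square_independence(table)
print(chi2, dof, p)
```

For tables with an odd number of degrees of freedom, a statistics library is the safer choice than extending this closed form.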
quimqu > 22-08-2025, 01:01 PM
(22-08-2025, 10:38 AM)Mauro Wrote: (*) but this does not remove my doubts about the p-values, they're really too small.
magnesium > 22-08-2025, 04:11 PM
(21-08-2025, 08:43 PM)davidma Wrote: (21-08-2025, 08:19 PM)quimqu Wrote:
These findings suggest that topic modeling not only helps cluster content by lexical features, but also reflects deeper structural patterns of authorship and writing practices in the manuscript. It supports the idea that different scribes may have introduced or emphasized different "topics", even when writing in the same Currier language.
I'd be very interested to hear your interpretations or see comparisons with other modeling approaches.
Just thinking of the Naibbe cipher, could it be that each scribe had its own encryption table? What is the scribal frequency? Would it match the 5-3-1-1 Naibbe distribution? Regardless, I think these are extremely interesting results.
quimqu > 22-08-2025, 10:10 PM
(22-08-2025, 04:11 PM)magnesium Wrote: Just to add here, Naibbe ciphertexts exhibit language-dependent and topic-dependent variation in word frequencies, for the simple reason that alphabet letter and plaintext bigram frequencies can vary by language and topic. For example, because it uses the word “herba” very often, a herbal plaintext might have more Bs than an astrology text, holding the language constant across both texts. As an exercise, I recommend that people do NMF on my reference Naibbe ciphertexts, using 1000-5000 tokens as the “document” subdivision. Each of those reference texts is equal parts of Dante’s Divine Comedy, Book 16 of Pliny’s Natural History, Grosseteste’s De sphaera, and the Latin alchemical herbal. The Divine Comedy portion will cleanly separate out, as will the alchemical herbal section. De sphaera and Natural History will tend to resemble each other a bit more.
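The "document" subdivision suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `chunk_tokens`, the chunk size of 2000 (picked from within the suggested 1000-5000 range), and the placeholder tokens are all made up here, and whitespace-separated tokens are assumed.

```python
def chunk_tokens(tokens, chunk_size=2000):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Placeholder tokens standing in for one ciphertext file read with text.split()
tokens = [f"w{i % 50}" for i in range(4500)]

chunks = chunk_tokens(tokens, chunk_size=2000)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # the last chunk is shorter when the length is not a multiple

# Each chunk, joined back into a string, is one "document" for the topic model
documents = [" ".join(chunk) for chunk in chunks]
```

The trailing short chunk can be kept or dropped depending on how much the model is affected by document length. The resulting `documents` list is the kind of input a document-term matrix and an NMF implementation would typically take.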
quimqu > 22-08-2025, 10:36 PM
(21-08-2025, 08:38 PM)LisaFaginDavis Wrote: Claire Bowern and two of her PhD students also addressed the question of topic modeling in the VMS here:
magnesium > 22-08-2025, 11:11 PM
(22-08-2025, 10:10 PM)quimqu Wrote: Hello Magnesium,
I have some doubts about how to process your reference Naibbe ciphertexts.
- Should I process all of them one after the other? If so, in which order: the alphabetical order of the files?
- How should I divide the paragraphs or pages? Or should I treat each file as a "page"? For topic processing I need to separate the texts so the program can find the topics of the texts (one topic per text). How should I separate the texts?
Once I have the pages (or the chunks of text that should each be given a topic), I can show you the results and we can see if they are coherent. I don't know which real text is in each file.
Thank you
oshfdk > 22-08-2025, 11:19 PM
quimqu > 23-08-2025, 11:28 PM
(22-08-2025, 11:19 PM)oshfdk Wrote: Maybe in order to get some meaningful results, we need a less restrictive or more specific null hypothesis. For example, H0 is "within a single language topic assignment doesn't depend on hands" (that is, separate tests for language A and language B).
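One way to set up the separate per-language tests proposed above is to build one hand-by-topic contingency table for each language. The sketch below assumes each page has already been assigned a language, a hand, and a single dominant topic. The records are invented and do not reflect real hand or language attributions, and `tables_by_language` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def tables_by_language(records):
    """Group (language, hand, topic) records into one hand-by-topic table per language.

    Returns {language: (hands, topics, table)}, where table[i][j] is the number
    of pages written by hands[i] whose dominant topic is topics[j].
    """
    counts = defaultdict(Counter)
    for language, hand, topic in records:
        counts[language][(hand, topic)] += 1

    tables = {}
    for language, cell_counts in counts.items():
        hands = sorted({hand for hand, _ in cell_counts})
        topics = sorted({topic for _, topic in cell_counts})
        # Missing (hand, topic) pairs count as 0 because Counter returns 0 for them
        table = [[cell_counts[(hand, topic)] for topic in topics] for hand in hands]
        tables[language] = (hands, topics, table)
    return tables

# Invented page records: (Currier language, scribal hand, dominant topic)
records = [
    ("A", 1, 0), ("A", 1, 0), ("A", 1, 2), ("A", 4, 2),
    ("B", 2, 1), ("B", 2, 1), ("B", 3, 1), ("B", 3, 5),
]
tables = tables_by_language(records)
for language, (hands, topics, table) in sorted(tables.items()):
    print(language, hands, topics, table)
```

Each per-language table can then be tested for independence on its own, so that any hand effect is not simply inherited from the language split.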
obelus > 24-08-2025, 07:24 AM
(21-08-2025, 08:19 PM)quimqu Wrote: A lower p-value (approaching 0) indicates stronger statistical association between the topic distribution and the given variable.
It is good practice to report the probability that a statistical inference is due to chance. According to your calculations, this risk is astronomically low for correlations between the 11 "optimal" topics and either Currier language or scribal hand. But the p value only quantifies confidence that a correlation exists, not the strength of correlation. The interesting part of a statistically significant result is its effect size, which is measured by other means.
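To illustrate the distinction between significance and effect size, here is a small sketch using Cramér's V, one common effect-size measure for contingency tables. The post does not name a specific measure, so this choice is an assumption, as are the invented counts. Multiplying every count by 10 makes the chi-square statistic 10 times larger (and the p-value far smaller), while V stays the same.

```python
import math

def chi_square_statistic(table):
    """Return (chi2, n) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1."""
    chi2, n = chi_square_statistic(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Invented counts: same proportions, ten times more observations in `large`
small = [[30, 10, 20],
         [10, 30, 20]]
large = [[count * 10 for count in row] for row in small]

chi2_small, _ = chi_square_statistic(small)
chi2_large, _ = chi_square_statistic(large)
v_small = cramers_v(small)
v_large = cramers_v(large)
print(chi2_small, chi2_large)  # the statistic grows with sample size
print(v_small, v_large)        # the effect size does not
```

This is why a very small p-value on a large corpus says little by itself about how strongly topics track language or hand.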