kckluge > 10-10-2025, 09:26 PM
quimqu > 10-10-2025, 09:37 PM
(10-10-2025, 09:26 PM)kckluge Wrote: The thing that makes PCA based cluster analysis (in particular) of sub-dialects tricky is
kckluge > 10-10-2025, 10:42 PM
(10-10-2025, 09:37 PM)quimqu Wrote: (10-10-2025, 09:26 PM)kckluge Wrote: The thing that makes PCA based cluster analysis (in particular) of sub-dialects tricky is
As there is no prior mention of PCA cluster analysis in the post, I imagine you think I did cluster analysis, but that is not the case. My analysis is topic analysis using Latent Dirichlet Allocation (LDA), which uses neither PCA nor k-means.
LDA is a probabilistic model used to discover hidden topics in a collection of documents. It is language-agnostic and assumes that each document is a mixture of a fixed number of topics (K), and that each topic is a distribution over words. By analyzing the patterns of word co-occurrence across documents, LDA infers which topics are present and how strongly each document is associated with them. The model is unsupervised and typically estimated using Bayesian inference methods such as variational inference or Gibbs sampling. Its main goal is to uncover the underlying thematic structure of text corpora. In my case, the units treated as documents were the paragraphs.
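To make the description above concrete, here is a minimal collapsed Gibbs sampler for LDA in plain Python. It is only a sketch of the general technique and not necessarily the implementation used for the analysis in this thread. The function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy paragraphs are all assumptions made for the example.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    docs: list of token lists (here, one list per paragraph).
    Returns (doc_topic, topic_word_counts): smoothed topic proportions
    per document and raw word counts per topic.
    """
    rng = random.Random(seed)
    n_vocab = len({w for doc in docs for w in doc})

    # Count tables maintained by the sampler.
    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [Counter() for _ in range(n_topics)]
    topic_totals = [0] * n_topics

    # Start from a random topic for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic_counts[d][z] += 1
            topic_word_counts[z][w] += 1
            topic_totals[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic_counts[d][z] -= 1
                topic_word_counts[z][w] -= 1
                topic_totals[z] -= 1

                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][w] + beta)
                    / (topic_totals[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_word_counts[z][w] += 1
                topic_totals[z] += 1

    # Smoothed topic proportions per document (each row sums to 1).
    doc_topic = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        doc_topic.append([(c + alpha) / denom for c in doc_topic_counts[d]])
    return doc_topic, topic_word_counts

# Toy input: four made-up "paragraphs" (an assumption, not real transcription data).
paragraphs = [
    "daiin chol chor daiin chol".split(),
    "chor chol daiin chor".split(),
    "qokeedy shedy qokedy shedy qokeedy".split(),
    "shedy qokeedy qokedy shedy".split(),
]
doc_topic, topic_words = lda_gibbs(paragraphs, n_topics=2)
for row in doc_topic:
    print([round(p, 2) for p in row])
```

Each printed row is one paragraph's estimated topic mixture. Nothing in the sampler depends on what the tokens mean, which is the sense in which the model is language-agnostic.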
quimqu > 11-10-2025, 08:37 AM
Jorge_Stolfi > 11-10-2025, 04:13 PM
(11-10-2025, 08:37 AM)quimqu Wrote: Distinguishing whether that reflects stylistic, dialectal, or linguistic variation is much less straightforward.
quimqu > 11-10-2025, 04:25 PM
(11-10-2025, 04:13 PM)Jorge_Stolfi Wrote: Perhaps one can answer that question by looking at frequencies of pairs of words in consecutive positions. Preferably excluding the first and last words of each line, and the first line of each parag.
If the A/B difference is due to spelling or dialect difference, there should be a roughly 1-1 mapping W -> m(W) between word types. Then it should be possible to match most of the most common word pairs in a way that is consistent with that mapping.
That is, if "W1 W2" is a common pair in language A, then "m(W1) m(W2)" should be common in language B.
Since daiin is the most common word in both A and B, we can start by guessing that m(daiin) = daiin. Then we should look for the most common word pairs "daiin W" and "W daiin", and see whether we can guess m(W) for some of those words.
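The counting step of this proposal could be sketched as follows. This is a minimal illustration and not a tested pipeline: it assumes the text is plain text with one manuscript line per text line, words separated by whitespace, and parags separated by blank lines. The function names `inner_bigrams` and `pairs_with` are hypothetical.

```python
import re
from collections import Counter

def inner_bigrams(text):
    """Count consecutive word pairs, skipping the first line of each
    parag and the first and last word of every remaining line."""
    counts = Counter()
    for parag in re.split(r"\n\s*\n", text):
        lines = [ln.split() for ln in parag.splitlines() if ln.strip()]
        for words in lines[1:]:      # drop the first line of the parag
            inner = words[1:-1]      # drop the first and last word of the line
            counts.update(zip(inner, inner[1:]))
    return counts

def pairs_with(counts, word, n=10):
    """Most common pairs that contain `word` in either position."""
    hits = Counter({pair: c for pair, c in counts.items() if word in pair})
    return hits.most_common(n)
```

Running `pairs_with(inner_bigrams(text_a), "daiin")` and the same for B would give the two lists of "daiin W" and "W daiin" pairs to compare when guessing m(W).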
quimqu > 11-10-2025, 04:28 PM
(11-10-2025, 04:13 PM)Jorge_Stolfi Wrote: [link] are four texts that could be used for that purpose. The files are in UTF-8 encoding. They are non-overlapping extracts from the same novel, with three different spelling systems and translated into a different but closely related language. Parags are separated by blank lines. More details are in [link]. For better comparison with Voynichese, you may want to ignore all lines starting with "#", delete the string "\emph", map everything to lower case, map all letters with diacritics to ASCII letters without them, and replace all punctuation by spaces. You may also want to delete all parags that are too short (mostly dialog lines).
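The cleanup steps listed in the quote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script actually used: the function name `clean_paragraphs` is hypothetical, and the `min_words` cutoff of 20 is an arbitrary placeholder, since the quote does not say how short "too short" is.

```python
import re
import unicodedata

def clean_paragraphs(raw, min_words=20):
    """Normalise the text as suggested and return a list of parags,
    each as a list of lower-case ASCII words."""
    # Ignore comment lines starting with "#".
    lines = [ln for ln in raw.splitlines() if not ln.startswith("#")]
    text = "\n".join(lines)
    text = text.replace("\\emph", "")   # delete the markup string
    text = text.lower()
    # Decompose accented letters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    paragraphs = []
    for parag in re.split(r"\n\s*\n", text):   # parags are separated by blank lines
        # Replace everything that is not an ASCII letter or digit by a space.
        # Letters with no ASCII decomposition would also be dropped here.
        words = re.sub(r"[^a-z0-9]+", " ", parag).split()
        if len(words) >= min_words:            # min_words is an assumed cutoff
            paragraphs.append(words)
    return paragraphs
```

Reading a file with `open(path, encoding="utf-8").read()` and passing the result to `clean_paragraphs` would give one token list per retained parag.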
quimqu > 12-10-2025, 11:17 AM
(11-10-2025, 04:28 PM)quimqu Wrote: (11-10-2025, 04:13 PM)Jorge_Stolfi Wrote: [link] are four texts that could be used for that purpose. The files are in UTF-8 encoding. They are non-overlapping extracts from the same novel, with three different spelling systems and translated into a different but closely related language. Parags are separated by blank lines. More details are in [link]. For better comparison with Voynichese, you may want to ignore all lines starting with "#", delete the string "\emph", map everything to lower case, map all letters with diacritics to ASCII letters without them, and replace all punctuation by spaces. You may also want to delete all parags that are too short (mostly dialog lines).
I will give it a try and let you know.
Jorge_Stolfi > 12-10-2025, 11:54 AM
(12-10-2025, 11:17 AM)quimqu Wrote: Hi Jorge, I ran NMF topic modeling on the four Dom Casmurro files (DC1–DC4). ... The NMF results show a very strong match between the learned topics and the file sections, with the best performance at K = 3 (AMI = 0.81, p ≈ 0). This means that the model clearly identifies three main linguistic clusters across the corpus. Given the setup, that outcome makes perfect sense: the two Portuguese spellings are grouped together, the phonetic respelling is kept separate, and the Spanish text also stands apart as its own distinct group.
Quote: You can see that a small part of the paragraphs is given a wrong topic, but I assume this can happen due to the proximity of the three languages.
Quote: Finally, here you can see the top weighted words per topic that the model has found:
=== Main topic 0 ===
Dialect (main topic words):
nao, ci, os, la, um, di, mi, nau, con, un, el, do, com, si, en, em, ao, que, las, lhe, los, us, du, para, lhi, pra, olhos, in, as, pero, me, mas, mae, su, au, disse, os olhos, dos, menti, voce
=== Main topic 1 ===
Dialect (main topic words):
ci, di, nau, de, que, me, us, du, pra, lhi, la, in, nao, si, mais, se, no, au, menti, comu, os, um, eli, nu, el, por, para, min, mai, mas, pur, en, como, joze, padri, do, com, veis, ci nau, tamben
=== Main topic 2 ===
Dialect (main topic words):
la, el, en, los, las, no, de, lo, pero, su, una, del, de la, le, al, es, sus, en la, con, por, madre, fue, en el, de mi, yo, mi madre, que, ojos, de los, ci, solo, habia, ni, tenia, nao, un, mi, despues, sin, lo que
quimqu > 12-10-2025, 12:04 PM
(12-10-2025, 11:54 AM)Jorge_Stolfi Wrote: This is bizarre. The words "nao" ("não") and "os" (for instance) should be strictly Portuguese in 1899 or 1999 spellings (DC1 and DC2), "nau" ("nãu") and "ci" strictly from the phonetic spelling (DC3), and "los" and "pero" strictly from Spanish (DC4). But they appear across those three topics. That said, topic 2 seems to be mostly Spanish, while topics 0 and 1 are mostly a mix of official and phonetic Portuguese words.
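That expectation can be checked directly, without any topic model, by counting in how many paragraphs of each file a given word occurs. A minimal sketch, assuming each file has already been tokenised into paragraphs; the function name `word_presence` and the tiny corpus below are made up for illustration and are not the real DC1–DC4 data.

```python
from collections import Counter

def word_presence(corpus, words):
    """corpus: dict mapping a file label (e.g. "DC1") to a list of
    tokenised paragraphs. Returns {word: {label: number of paragraphs
    containing the word}}, listing only labels with a non-zero count."""
    wanted = set(words)
    table = {w: {} for w in words}
    for label, paragraphs in corpus.items():
        seen = Counter()
        for parag in paragraphs:
            seen.update(set(parag) & wanted)   # count each word once per paragraph
        for w, n in seen.items():
            table[w][label] = n
    return table

# Toy corpus (an assumption, not the real files).
corpus = {
    "DC1": [["nao", "os", "olhos"], ["nao", "que"]],
    "DC3": [["nau", "ci", "us"]],
    "DC4": [["los", "pero", "que"]],
}
for word, where in word_presence(corpus, ["nao", "nau", "los", "que"]).items():
    print(word, where)
```

If a word such as "nau" turns out to occur under a single file label and still carries weight in several topics, that would suggest looking at the model or its feature weighting and not at the texts themselves.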