MarcoP > 21-10-2025, 07:44 AM
(20-10-2025, 10:24 AM)quimqu Wrote:
(20-10-2025, 06:29 AM)MarcoP Wrote: EDIT: another point that I think Rene mentioned in the past. Results based on only a few samples are more noisy and unreliable than results based on larger sets. This could play a role in the fact that Bio/Q13 paragraphs get more consistent results than the much shorter Stars/Q20 paragraphs.
Even though I agree in general terms, I don’t fully agree regarding topic modelling. Topic modelling is intended to be applied at the sentence level. The shorter Stars paragraphs are actually the perfect size, whereas the longer herbal or biological paragraphs might be too long. I suppose (though I can’t confirm it) that the longer paragraphs should contain internal sentences, but since there is no punctuation, we can’t recognize them yet.
Quote: Our data are 16,000 documents from a subset of the TREC AP corpus (Harman, 1992). After removing a standard list of stop words, we used the EM algorithm described in Section 5.3 to find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model.
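The point quoted above, that results from only a few samples are noisier than results from larger sets, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the paragraph lengths (30 and 300 tokens), the 5% word rate, and the function name `estimate_spread` are hypothetical and are not taken from the Voynich transliteration or from anyone's actual experiment in this thread. It does not run topic modelling; it only shows how much a single word-frequency estimate scatters from paragraph to paragraph when every paragraph is drawn from the same source.

```python
import random
import statistics


def estimate_spread(paragraph_len, true_rate=0.05, n_paragraphs=2000, seed=0):
    """Standard deviation of one word's estimated frequency across simulated paragraphs.

    Every paragraph is drawn from the same source (same true_rate), so any
    scatter in the estimates is pure sampling noise, not a change of "topic".
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_paragraphs):
        # Count how often the word occurs in one simulated paragraph.
        hits = sum(rng.random() < true_rate for _ in range(paragraph_len))
        estimates.append(hits / paragraph_len)
    return statistics.pstdev(estimates)


# Hypothetical lengths: a "short" and a "long" paragraph, in tokens.
short_spread = estimate_spread(paragraph_len=30)
long_spread = estimate_spread(paragraph_len=300)

print(f"short paragraphs: spread of estimated frequency = {short_spread:.4f}")
print(f"long paragraphs:  spread of estimated frequency = {long_spread:.4f}")
```

With ten times more tokens per paragraph, the scatter should shrink by roughly the square root of ten, which is the usual behaviour of a binomial proportion. Whether this kind of noise is actually what separates the Bio/Q13 and Stars/Q20 results is a separate question that the simulation cannot answer.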
Koen G > 21-10-2025, 08:15 AM
MarcoP > 21-10-2025, 08:30 AM
(21-10-2025, 08:15 AM)Koen G Wrote: I wonder how/if Colin Layfield got around the scarcity of data problem. Restoring folio order based on a method similar to this sounds like it should be impossible, but from what we've seen in Lisa's talk, it looks like they may have gotten some useful results.
quimqu > 21-10-2025, 02:53 PM
(21-10-2025, 07:44 AM)MarcoP Wrote: Hi quimqu, I am afraid that what can be said of topic modelling in general cannot always be said of its application to Voynichese.
For instance, Blei et al. mention a data set that is orders of magnitude larger than the Voynich corpus.
(21-10-2025, 08:30 AM)MarcoP Wrote: they work on whole pages rather than paragraphs, and they might even consider whole folios (recto+verso) or bifolios.
(21-10-2025, 07:44 AM)MarcoP Wrote: But, more importantly, we don’t know if the differences we see are due to different topics, different scribes, different languages or dialects, different cipher mechanisms such as different usage of nulls, different spelling, different preferences in spontaneous gibberish etc. We might even be classifying based on "potato" and "potahto".