02-06-2026, 02:31 PM
(02-06-2026, 01:18 PM)nablator Wrote: (01-06-2026, 12:55 PM)quimqu Wrote: So I started wondering if the opposite approach might actually be more informative.
Or you could use a balanced approach where rare terms are upweighted to reflect their relative importance.
Quote: In addition to creating the A matrix as described, which uses straight TF values, two weighting schemes are also employed to modify the values contained in A. The two schemes applied are Term Frequency-Inverse Document Frequency (TF-IDF) and Log-Entropy (LE).
I don’t think that Lisa's work is really the same thing as what I’m doing (it is similar to my automatic topic analysis). What they describe there is basically a weighting scheme inside an LSA/topic-modelling framework. TF-IDF increases the influence of rarer words and reduces the effect of very common ones. That is standard NLP.
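To make the weighting concrete, here is a minimal sketch of TF-IDF on toy data. It assumes the textbook form (raw term count times log(N / document frequency)); the quoted paper may use a different variant, and the function name `tf_idf`, the page labels and the token lists are all invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping a page label to its list of tokens.

    Assumed variant: weight = raw count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of pages that contain each token.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for label, tokens in docs.items():
        tf = Counter(tokens)
        weights[label] = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return weights

# Toy data, invented for illustration only.
toy = {
    "p1": ["daiin", "daiin", "chedy", "qokal"],
    "p2": ["daiin", "chedy", "otol"],
    "p3": ["daiin", "shedy"],
}
w = tf_idf(toy)
```

In this toy set the token that occurs on every page gets weight zero, while the token that occurs on a single page gets the highest weight. The rare words are upweighted, but the result is still a weighted page-by-term matrix that then goes into the latent-space step.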
But their goal is still global semantic structure and document similarity through latent-space methods. They are trying to see whether nearby text behaves coherently, whether sections cluster, whether discourse segmentation exists, etc.
My approach is much more explicit and local. I am not embedding the text into a latent semantic space or reducing dimensions. I am literally tracing the actual low-frequency tokens themselves across folios and sections, then building direct overlap networks from those tokens. In practice this changes the interpretation a lot.
A TF-IDF weighted LSA model may tell you that two pages are “similar” in some abstract statistical sense (so do other topic models). What I’m looking at is closer to: “this exact semi-rare token appears in these particular folios and creates this concrete bridge between these sections”. So when [link] or some Marginal Stars folios emerge as hubs, I can inspect the exact tokens creating those links page by page. That is a different scale of analysis.
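A rough sketch of the kind of rare-token overlap graph described above, so the contrast with a weighted matrix is visible. This is not the exact pipeline: the thresholds `min_count` and `max_count`, the function names, the folio labels and the tokens are all hypothetical, and "semi-rare" is assumed here to mean a total corpus count inside a fixed range.

```python
from collections import Counter, defaultdict
from itertools import combinations

def rare_token_graph(folios, min_count=2, max_count=3):
    """folios: dict mapping a folio label to its list of tokens.

    Returns a dict mapping (folio_a, folio_b) to the set of rare
    tokens the two folios share. Thresholds are assumed, not tuned.
    """
    # Total corpus count of every token.
    counts = Counter(t for tokens in folios.values() for t in tokens)
    rare = {t for t, c in counts.items() if min_count <= c <= max_count}
    # For each rare token, the set of folios it occurs on.
    where = defaultdict(set)
    for label, tokens in folios.items():
        for t in tokens:
            if t in rare:
                where[t].add(label)
    # One edge per folio pair, labelled with the tokens that create it.
    edges = defaultdict(set)
    for t, labels in where.items():
        for a, b in combinations(sorted(labels), 2):
            edges[(a, b)].add(t)
    return edges

def degree(edges):
    """Number of distinct folios each folio is linked to."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

# Toy data, invented for illustration only.
toy = {
    "A": ["daiin", "x1", "x2", "x3"],
    "B": ["daiin", "x1"],
    "C": ["daiin", "x2"],
    "D": ["daiin", "x3", "x3"],
}
edges = rare_token_graph(toy)
deg = degree(edges)
```

Each edge keeps the actual tokens that produced it, so when a folio comes out as a hub (here the toy folio "A") the links can be read off token by token instead of being interpreted through a similarity score.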
Also, interestingly, their own paper actually supports part of what I’m seeing. They found that Biological, Pharmaceutical and Stars sections show unusually high internal cohesion compared to other sections. That is not very far from the kind of structure emerging in my rare-token graphs.
The important difference is that I’m not measuring coherence of adjacent text windows. I’m measuring selective reuse of uncommon vocabulary across distant folios.