Dunsel > Yesterday, 05:40 AM
Jorge_Stolfi > Yesterday, 07:51 AM
(Yesterday, 03:55 AM) Dunsel Wrote: I typically strip all punctuation, convert to lower case, and then split the text into 100-word groups, something close to herbal page sizes.
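For anyone who wants to reproduce the preprocessing described in that quote, here is a minimal Python sketch. It is not Dunsel's actual script: the function name `chunk_words`, the use of ASCII `string.punctuation`, and whitespace splitting are all assumptions, and a transliteration file would likely need its own rules for what counts as punctuation.

```python
import string


def chunk_words(text, size=100):
    """Strip punctuation, lowercase, and split into groups of `size` words.

    Hypothetical helper: assumes ASCII punctuation and whitespace-separated
    words. The last group may be shorter than `size`.
    """
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    # Slice the word list into consecutive groups of `size` words.
    return [words[i:i + size] for i in range(0, len(words), size)]


if __name__ == "__main__":
    sample = "Hello, World! " * 150  # 300 words after cleaning
    groups = chunk_words(sample)
    print(len(groups), len(groups[0]))
```

A fixed group size keeps the "pages" comparable in length, at the cost of ignoring the real page boundaries.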
davidma > Yesterday, 11:39 AM
Skoove > Yesterday, 11:48 AM
(Yesterday, 11:39 AM) davidma Wrote: Very interesting. It always struck me that "ed" is the only bigram that appears by itself in the central "star" of f69r, together with y, d, o, l, s.
ReneZ > Yesterday, 12:46 PM
Dunsel > 3 hours ago
(Yesterday, 07:51 AM) Jorge_Stolfi Wrote: Thus if a digram occurs N times overall, you should divide the text into, say, N/2 pages. Then, if the digram is evenly distributed, one expects that it will occur in (1-1/e^2) = ~86% of the pages. But if it is concentrated in some sections, the percentage of pages with the bigram will be much lower than that.
All the best, --stolfi
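The arithmetic in that quote can be checked with a short Python sketch. The assumption here is the usual Poisson approximation: with N occurrences spread at random over N/2 pages, the mean is 2 per page, so the chance that a page has at least one is 1 - e^-2. The function names and the toy page list are hypothetical, and "page" here just means one string of text per chunk.

```python
import math


def expected_page_fraction(n_occurrences, n_pages):
    """Expected fraction of pages with at least one occurrence.

    Poisson approximation for occurrences scattered uniformly at random.
    """
    mean_per_page = n_occurrences / n_pages
    return 1.0 - math.exp(-mean_per_page)


def observed_page_fraction(pages, digram):
    """Fraction of pages (strings) that contain the digram at least once."""
    hits = sum(1 for page in pages if digram in page)
    return hits / len(pages)


if __name__ == "__main__":
    # N occurrences over N/2 pages gives a mean of 2 per page.
    print(round(expected_page_fraction(200, 100), 4))
    # Toy data only, not real page contents.
    toy_pages = ["qokedy", "daiin", "chedy", "ol"]
    print(observed_page_fraction(toy_pages, "ed"))
```

If the observed fraction comes out well below the expected one, that points to the digram being concentrated in some sections rather than evenly spread.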
Dunsel > 3 hours ago
Koen G > 2 hours ago
Dunsel > 1 hour ago
(2 hours ago) Koen G Wrote: Nice graphs! I think I can see edy in that last one.
(2 hours ago) Koen G Wrote: Makes one wonder what these "edy" clusters are. It would be so neat if they corresponded to "ain" clusters, but your very first graph shows nicely how that cannot be the case. There should be more outliers then.
(2 hours ago) Koen G Wrote: So "edy" is new. Unless the thing it corresponds to in A-pages also occurs a bit on most B-pages. Then it wouldn't register as an outlier.
Dunsel > 1 hour ago
(Yesterday, 12:46 PM) ReneZ Wrote: But then what about 'eed'?