Jorge_Stolfi > 21-04-2026, 11:52 PM
Jorge_Stolfi > 22-04-2026, 02:46 AM
(21-04-2026, 12:29 AM)pfeaster Wrote: How could the tendency of the second words of lines to begin disproportionately with [Sh] and [s] be explained as a byproduct of an algorithm for splitting lines based on word length?
Jorge_Stolfi > 22-04-2026, 03:05 AM
pfeaster > 22-04-2026, 11:36 PM
(22-04-2026, 02:46 AM)Jorge_Stolfi Wrote: In the case you mention, it must be that words starting with sh are more likely to occur after a longer word than a shorter one.
I did a simulation of this effect by creating two token streams....
Jorge_Stolfi > 23-04-2026, 11:14 AM
(22-04-2026, 11:36 PM)pfeaster Wrote: That's an interesting conjecture, but I believe we could test it more directly than via simulation.
If it's true that words starting with [Sh] are more likely to occur after longer words than shorter ones, then -- in the absence of meaningful line-based patterning -- we should see this effect elsewhere too. If we were to exclude the beginnings and ends of lines from consideration and calculate the average length of mid-line words that precede words starting with [Sh], it should differ significantly from the norm.
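The mid-line test described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual code from this thread: the function name `preceding_lengths`, the input format (a list of lines, each a list of word strings), and the toy words are all hypothetical. Word length is counted in transcription characters, which may not match glyph counts, and the `prefix` string has to match however [Sh] is written in the transcription used.

```python
from statistics import mean

def preceding_lengths(lines, prefix="Sh"):
    """Split lengths of mid-line words by whether the NEXT word starts with `prefix`.

    `lines` is a list of lines, each a list of word strings (assumed format).
    Any pair that touches the first or last word of a line is skipped.
    """
    before_prefix, before_other = [], []
    for words in lines:
        # i runs over 1 .. len-3, so words[i] and words[i + 1] are both
        # strictly inside the line (neither first nor last).
        for i in range(1, len(words) - 2):
            prev, nxt = words[i], words[i + 1]
            if nxt.startswith(prefix):
                before_prefix.append(len(prev))
            else:
                before_other.append(len(prev))
    return before_prefix, before_other

# Toy data only; these are not real transcription lines.
toy_lines = [
    ["daiin", "Shol", "qokeedy", "Shey", "ol", "dar"],
    ["sor", "ol", "chedy", "Shedy", "qokain", "ar"],
]
sh_prev, other_prev = preceding_lengths(toy_lines)
print(mean(sh_prev), mean(other_prev))
```

On a real transcription the two means would still need a significance test (for example a permutation test on the labels) before concluding that the difference is more than noise.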
Jorge_Stolfi > 23-04-2026, 02:38 PM
quimqu > 23-04-2026, 08:39 PM
(23-04-2026, 02:38 PM)Jorge_Stolfi Wrote: I am sorry to say this, but I believe that this line of investigation is and will be a big waste of time.
Rafal > 23-04-2026, 10:33 PM
Quote: The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.
Jorge_Stolfi > 24-04-2026, 06:10 AM
(23-04-2026, 08:39 PM)quimqu Wrote: According to the data, and as a summary of my coding in this thread, the lines seem to have some strong initial tokens, some strong final tokens, and a middle region. But what I find more interesting is that each line seems independent of the next or previous one. The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.
oeesordy > 24-04-2026, 06:35 AM
(23-04-2026, 08:39 PM)quimqu Wrote: (23-04-2026, 02:38 PM)Jorge_Stolfi Wrote: I am sorry to say this, but I believe that this line of investigation is and will be a big waste of time.
I don't agree, Jorge.
According to the data, and as a summary of my coding in this thread, the lines seem to have some strong initial tokens, some strong final tokens, and a middle region. But what I find more interesting is that each line seems independent of the next or previous one. The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.
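One way the line-independence claim could be checked is to compare how strongly adjacent words are associated within a line versus across a line break. The sketch below is an assumption about how such a comparison might look, not the code actually used in this thread: the helper names and the input format (a list of lines, each a list of word strings, all from one paragraph) are hypothetical, and mutual information is just one possible association measure.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of mutual information (in bits) between pair members."""
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # Sum of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

def within_line_pairs(lines):
    """Adjacent word pairs that do not cross a line break."""
    for words in lines:
        yield from zip(words, words[1:])

def cross_line_pairs(lines):
    """(last word of a line, first word of the next line) pairs."""
    for cur, nxt in zip(lines, lines[1:]):
        if cur and nxt:
            yield cur[-1], nxt[0]
```

The plug-in estimate is biased upward on small samples, and there are far fewer cross-line pairs than within-line pairs, so the two raw values are not directly comparable. A fairer comparison would measure each against a baseline obtained by shuffling the second members of the pairs.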