ReneZ > 17-04-2026, 08:26 AM
quimqu > 17-04-2026, 09:44 AM
| Dataset | Mean real score | Mean nearby | Mean random | Diff vs nearby | AUC |
|---|---|---|---|---|---|
| Same sections (Herbal/Biological) | 2.62 | -3.07 | -3.24 | +5.69 | ~0.93 |
| Other sections | 2.95 | -2.71 | -3.32 | +5.65 | ~0.92 |

| Version | Dataset | Mean real score | Mean nearby score | Diff real-nearby | AUC vs nearby |
|---|---|---|---|---|---|
| No spaces | Held-out Herbal + Biological | 2.62 | -3.07 | +5.69 | ~0.93 |
| Spaces kept as characters | Held-out Herbal + Biological | 3.46 | -3.31 | +6.77 | ~0.94-0.95 |
| No spaces | Other sections | 2.95 | -2.71 | +5.65 | ~0.92 |
| Spaces kept as characters | Other sections | 3.93 | -3.01 | +6.94 | ~0.94-0.95 |
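For readers who want to see how the "Diff real-nearby" and "AUC vs nearby" columns relate to raw scores, here is a minimal sketch. It assumes each real position and each nearby position has a single numeric score, and it computes AUC as the probability that a randomly chosen real score beats a randomly chosen nearby score. The names `auc` and `summarize` and the toy numbers are made up for illustration; this is not quimqu's code, which is not shown in the thread.

```python
from statistics import mean


def auc(real_scores, other_scores):
    """Pairwise AUC: chance that a real score exceeds a non-real score.

    Ties count as half a win. This is the rank-based (Mann-Whitney) form,
    written as a plain double loop for clarity rather than speed.
    """
    wins = 0.0
    for r in real_scores:
        for o in other_scores:
            if r > o:
                wins += 1.0
            elif r == o:
                wins += 0.5
    return wins / (len(real_scores) * len(other_scores))


def summarize(real_scores, nearby_scores):
    """Return the quantities reported in the tables for one dataset."""
    mean_real = mean(real_scores)
    mean_nearby = mean(nearby_scores)
    return {
        "mean_real": mean_real,
        "mean_nearby": mean_nearby,
        "diff": mean_real - mean_nearby,
        "auc": auc(real_scores, nearby_scores),
    }


# Toy scores invented for this example, NOT values from the thread.
real = [3.0, 2.0, 4.0, -1.0]
nearby = [-3.0, -2.0, 0.0, -4.0]
print(summarize(real, nearby))
```

An AUC of 0.5 would mean the scores cannot tell real positions from nearby ones; values near 1.0 mean real positions almost always score higher.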
nablator > 17-04-2026, 12:50 PM
(17-04-2026, 09:44 AM)quimqu Wrote: Removing spaces does not kill the effect, and keeping them improves the model, which suggests that the spacing, even if imperfect, still reflects some real structure in the text.
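To make the two versions in the table concrete, here is one possible reading of "no spaces" versus "spaces kept as characters", written as a hedged sketch. It assumes the text is flattened into a single character stream and that the true line-break offsets are recorded separately so a model can be scored at those positions. The function name `to_stream`, the choice to join lines with a single space in the spaces-kept version, and the sample lines are all assumptions, not details confirmed in the thread.

```python
def to_stream(lines, keep_spaces):
    """Flatten text lines into one character stream.

    Returns (stream, breaks), where breaks holds the stream offset at which
    each line after the first begins. Assumption: when spaces are kept, a
    line break is rendered as a single space, like any other word gap.
    """
    parts = []
    breaks = []
    pos = 0
    for i, line in enumerate(lines):
        text = " ".join(line.split())  # collapse runs of whitespace
        if not keep_spaces:
            text = text.replace(" ", "")
        if i > 0:
            if keep_spaces:
                parts.append(" ")
                pos += 1
            breaks.append(pos)  # the next line starts at this offset
        parts.append(text)
        pos += len(text)
    return "".join(parts), breaks


# Placeholder lines invented for this example.
sample = ["abc de", "fgh i"]
print(to_stream(sample, keep_spaces=False))
print(to_stream(sample, keep_spaces=True))
```

In both versions the stream itself carries no explicit line-break symbol, so any ability to score the recorded offsets higher than other offsets has to come from the surrounding characters.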
quimqu > 17-04-2026, 06:37 PM
nablator > 17-04-2026, 07:40 PM
(17-04-2026, 06:37 PM)quimqu Wrote: I can make all my code available if someone wants to check it.
quimqu > 17-04-2026, 11:09 PM
(17-04-2026, 07:40 PM)nablator Wrote: I would love to see your code, not only to check it but to learn from it. I don't understand what you are talking about exactly: what kind of model you train, how you score alternatives, etc.
quimqu > 17-04-2026, 11:12 PM
Jorge_Stolfi > 18-04-2026, 01:36 AM
(17-04-2026, 09:44 AM)quimqu Wrote: So even after removing all spaces, the signal is still there, and it is strong.
Quote:Maybe what we call "words" are not real words. [... but] Removing spaces does not kill the effect, and keeping them improves the model, which suggests that the spacing, even if imperfect, still reflects some real structure in the text.
quimqu > 18-04-2026, 07:13 PM
Jorge_Stolfi > 18-04-2026, 11:24 PM
(18-04-2026, 07:13 PM)quimqu Wrote: Even then, the model can still detect line starts and ends quite clearly from character patterns alone. That’s the part I find hard to explain purely as a side effect of line breaking on tokens.