Urtx13 > 9 hours ago
Urtx13 > 8 hours ago
(8 hours ago) nablator Wrote: (9 hours ago) Urtx13 Wrote: - Total number of tokens (cleaned EVA): 45,037
Too many, something is wrong.
Urtx13 > 1 hour ago
ReneZ > 42 minutes ago
(8 hours ago) nablator Wrote: (9 hours ago) Urtx13 Wrote: - Total number of tokens (cleaned EVA): 45,037
Too many, something is wrong.
Mauro > 32 minutes ago
(8 hours ago) Urtx13 Wrote: This feels reasonable to me (we’re removing only ~2.6% of the raw tokens), but maybe I’m missing something?
ReneZ > 26 minutes ago
Urtx13 > 17 minutes ago
(26 minutes ago) ReneZ Wrote: First guess: are you removing lines that start with # (full comment lines)?
Urtx13 > 12 minutes ago
(32 minutes ago) Mauro Wrote: (8 hours ago) Urtx13 Wrote: This feels reasonable to me (we’re removing only ~2.6% of the raw tokens), but maybe I’m missing something?
I agree 42852 tokens are too many. Hard to say from here what is not working properly.
Did you check whether the [cth:oto] removal procedure recovers the correct tokens? You should get a 'cthres' from line 1:
<f1r.1,@P0> <%>fachys.ykal.ar.ataiin.shol.shory.[cth:oto]res.y.kor.sholdy<!doodle: @254;>
... and an 'oteos' from line 4:
<f1r.4,+P0> soiin.oteey.oteo[s:r],roloty.cthiar,daiin.okaiin.or.okan
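A minimal sketch of that check, in Python. It assumes that an alternate reading [a:b] is resolved by keeping the first option, that both '.' and ',' separate tokens, and that everything inside <...> can be dropped. The function names are hypothetical and this is not necessarily how the cleaner under discussion works.

```python
import re

# [a:b] alternate reading; group 1 is the first option (assumed preferred).
ALT = re.compile(r"\[([^\[\]:]*):[^\[\]]*\]")

def first_reading(text):
    """Replace every [a:b] with its first alternative, a."""
    return ALT.sub(r"\1", text)

def tokens(line):
    """Drop <...> tags and comments, resolve alternates, split into tokens."""
    body = re.sub(r"<[^>]*>", "", line).strip()
    body = first_reading(body)
    # Assumption: ',' (uncertain space) is treated like '.' (word break).
    return [t for t in re.split(r"[.,]", body) if t]

line1 = "<f1r.1,@P0> <%>fachys.ykal.ar.ataiin.shol.shory.[cth:oto]res.y.kor.sholdy<!doodle: @254;>"
line4 = "<f1r.4,+P0> soiin.oteey.oteo[s:r],roloty.cthiar,daiin.okaiin.or.okan"

print(tokens(line1)[6])  # cthres
print(tokens(line4)[2])  # oteos
```

If either print shows something else (for example 'res' or 'oteo'), the alternate-reading step is the first place to look.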
What does your processing do in cases where curly brackets are found, for example:
<f1r.17,+P0> ycho.tchey.chekain.sheo,pshol.dydyd.cthy.dai[{cto}: @194;]y
<f1r.19,+P0> dchar.shcthaiin.okaiir.chey.@192;chy.@130;tol.cthols.dlo{ct}o
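One way to see what a cleaner does with these two lines is to resolve the brackets explicitly and then list every token that still contains something other than plain lowercase letters. The sketch below is only an illustration under stated assumptions: it keeps the first alternative of [a:b], unwraps {..} to its contents (or drops it when keep_ligatures is False), treats '.' and ',' as separators, and leaves @nnn; codes in place so that they show up in the report. All function names are hypothetical.

```python
import re

def strip_tags(line):
    """Remove the locus tag and any other <...> markup."""
    return re.sub(r"<[^>]*>", "", line).strip()

def resolve(body, keep_ligatures=True):
    """Resolve [a:b] to a, then unwrap or drop {..} groups."""
    body = re.sub(r"\[([^\[\]:]*):[^\[\]]*\]", r"\1", body)
    return re.sub(r"\{([^{}]*)\}", r"\1" if keep_ligatures else "", body)

def split_tokens(body):
    """Split on '.' and ',' (assumed separators), dropping empty pieces."""
    return [t for t in re.split(r"[.,]", body) if t]

def suspicious(tokens):
    """Tokens that still contain anything other than a-z after cleaning."""
    return [t for t in tokens if not re.fullmatch(r"[a-z]+", t)]

line17 = "<f1r.17,+P0> ycho.tchey.chekain.sheo,pshol.dydyd.cthy.dai[{cto}: @194;]y"
line19 = "<f1r.19,+P0> dchar.shcthaiin.okaiir.chey.@192;chy.@130;tol.cthols.dlo{ct}o"

for line in (line17, line19):
    toks = split_tokens(resolve(strip_tags(line)))
    print(toks[-1], suspicious(toks))
# daictoy []
# dlocto ['@192;chy', '@130;tol']
```

Whether '@192;chy' should be kept, dropped, or mapped to a placeholder is a decision the sketch deliberately leaves open; it only makes such tokens visible.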
Then, I'd check manually a few pages of the cleaned text vs. the original.
You chose quite a complicated transcription format for your work. Why did you not use RF1a-n, which is simpler to manage? (Just remove non-word characters and tokens that include a '?'.) The less time you spend coding the text cleaner, the more time you'll have for your actual research.
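As a rough illustration of how little code that rule needs, here is a sketch that drops tokens containing '?' and strips non-letter characters from the rest. It assumes the locus tag has already been removed and that '.' is the only token separator. The sample string is invented for the demonstration and is not taken from the RF1a-n file.

```python
import re

def simple_clean(body):
    """Drop tokens containing '?', then strip non-letter characters."""
    kept = []
    for token in body.split("."):
        if "?" in token:
            continue  # uncertain reading: discard the whole token
        token = re.sub(r"[^A-Za-z]", "", token)
        if token:
            kept.append(token)
    return kept

# Hypothetical input line, for illustration only.
print(simple_clean("daiin.o?aiin.chol-"))  # ['daiin', 'chol']
```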