Hello Marco,
You make some very good points!
I did the calculations that we wrote up in the paper (I think perhaps the Wikipedia corpus has been modified since the article was published, which might explain why your results are slightly different). You are correct that the structure of the Wikipedia articles may inflate the reduplication rate. The tool I used to extract Wikipedia articles does not distinguish title from content text. Also, some Wikipedia versions, especially for minority languages with very few speakers, have a very short average article length and more formulaic writing, which also inflates the reduplication rate. I'm hoping to look closer at the reduplication rates in the Wikipedia Corpus at some point to separate out reduplication that is a result of article structure and look at the different rates of grammatical reduplication in separate languages.
In the Historical text corpus, the rate ranges from 0-0.16%. So from that perspective Voynich is a clear outlier, and we haven't yet found any historical texts with reduplication rates as high as Voynich. So it's definitely something we're interested in looking into.
Best,
Luke
I have checked the cleaning script. The unusually high number of word repeats is obviously the result of an incomplete cleaning process.
For instance, one Wikipedia page contains the following code fragment:
Code:
<table border="1" align="right" cellpadding="4" cellspacing="0" width="300" style="margin: 0 0 1em 1em; background: #f9f9f9; border: 1px #aaaaaa solid; border-collapse: collapse; font-size: 95%;">
The raw files are cleaned using the following script:
Code:
delete_uncommon_chars(doc) # Delete characters with freq < .0001
replace (border)? align ((left)|(right)|(center)) with ''
replace (cellpadding) with ''
replace (cellspacing) with ''
replace (width style margin em em background) with ''
replace (border collapse collapse) with ''
The result is
Code:
f f f border aaaaaa solid font size
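To make the mechanism concrete, here is a minimal Python sketch that runs the tag and the replacement rules quoted above through a letters-only filter. It is not the actual cleaning script: `letters_only`, `apply_rules` and `RULES` are hypothetical names, the first pattern is written in a balanced form that is only a guess, and the assumption that every non-letter character is dropped is mine. The real script evidently has further rules (it also removes `table` and `px`), so the residue below is slightly longer than the one shown above.

```python
import re

# The table tag quoted above, copied verbatim.
TAG = ('<table border="1" align="right" cellpadding="4" cellspacing="0" '
       'width="300" style="margin: 0 0 1em 1em; background: #f9f9f9; '
       'border: 1px #aaaaaa solid; border-collapse: collapse; font-size: 95%;">')

# Hypothetical stand-ins for the five replacement rules listed above.
RULES = [
    r"(border )?align (left|right|center)",
    r"cellpadding",
    r"cellspacing",
    r"width style margin em em background",
    r"border collapse collapse",
]

def letters_only(text):
    """Assumption: digits and punctuation are dropped, leaving lowercase words."""
    return " ".join(re.sub(r"[^a-z]+", " ", text.lower()).split())

def apply_rules(text, rules=RULES):
    """Apply each rule in order, then collapse the leftover whitespace."""
    for pattern in rules:
        text = re.sub(pattern, " ", text)
    return " ".join(text.split())

residue = apply_rules(letters_only(TAG))
print(residue)
```

The colour code `#f9f9f9` is what turns into `f f f` once the digits are gone, and nothing in the listed rules catches it.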
Thanks to lurker for looking into the code! I guess these issues with formatting tags can be fixed, but perhaps others have already developed and shared more effective cleaning software?
(17-01-2021, 10:33 PM) Luke Lindemann wrote: In the Historical text corpus, the rate ranges from 0-0.16%. So from that perspective Voynich is a clear outlier, and we haven't yet found any historical texts with reduplication rates as high as Voynich. So it's definitely something we're interested in looking into.
Hi Luke,
thank you very much for your kind reply! It's great to have you on the forum.
Figures from the Historical corpus are closer to what I expected, yet there seem to be a few problems in those files too.
The Historical text with the highest reduplication rate appears to be the English Secretum Secretorum by Copland. That file has issues too and I doubt it can be regarded as correct English.
There is a different online transcription that appears to be better.
For instance, this fragment from github:
i have dyscovered to the the thynges that ben to be hyd
is rendered in this other way at umich:
I haue dyscouered to ye the thynges that bē to be hyd
It seems clear that in this case "to ye" is correct and "to the" is not. I would be curious to see scans of the actual 1528 edition, but I have been unable to find them.
I appreciate that collecting a reliable corpus of reference texts (both historical and modern) is a huge effort. I also believe that the Yale Corpora will be a precious resource for people interested in the language side of Voynich research. Thanks again to you and Claire for starting this project and sharing it with everybody!
This is a simple test for scanning smaller files for string duplicates:
Code:
grep -Eo '(\b.+) \1\b' filename | sort | uniq -u
This is the outcome for the Zhuang text sample:
Code:
#grep -Eo '(\b.+) \1\b' Zhuang | sort | uniq -u
aen aen
aen gauqbaiq neix aen gauqbaiq neix
alfabeto de esperanto alfabeto de esperanto
b b b b b b b b b b
bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar
borassus flabellifer borassus flabellifer
bouxcuengh bouxcuengh
breinigerberg breinigerberg
byamanh caethwk nduk byamanh caethwk nduk
c c
ch
chinese only chinese only
cm cm
cm kwk cm kwk
d d
d d d d
daeuj soebsoeb daeuj soebsoeb
deborah read read deborah read read
diet diet
documents documents
doengz doengz
dwg dwg
eiqceiq eiqceiq
em em
feihcouh feihcouh
fouz fouz
geux satsat geux satsat
ginghciyoz veizgvanh ginghciyoz veizgvanh
grasslands grasslands
guinea guinea
gwn duh ceuj gwn duh ceuj
gyoepsuenq gyoepsuenq
h h
h h h h h h h h
hanciuz hanciuz
hoi dou lu vei hoi dou lu vei
hojceh hojceh
hungzgvanh ginghciyoz hungzgvanh ginghciyoz
ij dawz ij dawz
in netherlands in netherlands
isbn isbn isbn isbn isbn isbn
isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn
kbs kbs
km h km h km h km h
km km km km
km straße gelände km straße gelände
km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km²
l l
liuzgiuzgoz liuzgiuzgoz
male on maldives male on maldives
mauz cwzdungh mauz cwzdungh
mbouj naih law mbouj naih law
mingzciuz mingzciuz
mm mm mm mm
mm mm mm mm wannenpanze
monte aguila inauguracion nueva plaza de monte aguila inauguracion nueva plaza de
mwngz gvai ha mwngz gvai ha
n n n n n n n n n n
nantucket nantucket
nanz lai
nienz nyied hauh nienz
nienz nyied nienz nyied
no no
ouhcouh ouhcouh
penicillus penicillus
ps ps ps ps
ps t ps t
ps t ps t ps t ps t
right right
sanskrit sanskrit
saw doj saw doj
swiq swiq
t t t t
the supernova the supernova
tomislav nikoli tomislav nikoli
tundra tundra
v v
vangz yenfanh vangz yenfanh
vaƅ vaƅ
voltaire voltaire
whitefield whitefield whitefield whitefield
yahvaiz yahvaiz
yesu yesu
yienghndang yienghndang
yienzbit yienzbit
zylinder ottomotor zylinder dieselmotor zylinder ottomotor zylinder dieselmotor
This outcome illustrates that most of the duplicated strings do not belong to the Zhuang language.
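For anyone who prefers Python to grep, a rough equivalent can be sketched as follows. It only looks at single-token repeats (the grep pattern above also catches repeated multi-word strings), and `repeat_rate` is my own working definition, the share of tokens identical to the token just before them, which is not necessarily how the paper computes its reduplication rate. The sample string is pieced together from two lines of the output above.

```python
from itertools import groupby

def repeated_runs(tokens):
    """Return (token, run_length) for each run of two or more identical adjacent tokens."""
    runs = []
    for token, group in groupby(tokens):
        length = sum(1 for _ in group)
        if length > 1:
            runs.append((token, length))
    return runs

def repeat_rate(tokens):
    """Assumed definition: share of tokens equal to the immediately preceding token."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / len(tokens)

# Sample assembled from two lines of the Zhuang output above.
sample = "mm mm mm mm wannenpanze aen aen".split()

print(repeated_runs(sample))
print(round(repeat_rate(sample), 3))
```

Listing the runs together with their lengths makes it easy to spot which repeats are unit abbreviations or stray letters and which might be genuine reduplication.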
Hello Marco!
You make some really good points!
I believe we have modified the Wikipedia Corpus slightly since publication, which may explain why your answers are a little different.
Yes, the structure of the text in the Wikipedia Corpus inflates the reduplication rate. The tool I used to compile the texts does not distinguish between title text and content text, as you demonstrated. It also includes a lot of metadata, which I tried to clean as much as possible using a series of regular expressions to capture the most common Wikipedia code snippets, but as Lurker shows there are some I wasn't able to get rid of. These issues are especially relevant for Wikipedia language versions that a) have a small number of articles in total, b) have articles which are short on average, and c) are written in the Latin script (because for other scripts I can just filter out the Latin script metadata). This particularly affects minority languages like Cree and Piedmontese, which also have very basic, formulaic entries.
The Historical Corpus, by contrast, has a much smaller range of reduplication rates, from 0.0 to 0.16%, so Voynich is a clear outlier among the historical manuscripts we have. But there may be texts we haven't found that have higher rates of reduplication, either because they're in certain genres (e.g. magical incantations) or because the grammar of the language itself uses reduplication more extensively.
All of this is to say that reduplication is an interesting topic that warrants a lot more examination than we were able to give to it in the Review article. Thank you for bringing it up!
Luke Lindemann
The reduplications are caused by the cleaning process. For instance, one Wikipedia page contains a table about tanks. One row of the table says that the 1st tank can drive 210 km, the 2nd tank 465 km, the 3rd tank 210 km and the 4th tank 225 km. After deleting all the table metadata and all the numbers, the only thing left is "km km km km". In this way the cleaning process itself creates reduplications.
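A small Python sketch of the tank example, under the assumption that the row looked roughly like the hypothetical `row` string below (the real table markup is not reproduced here, only the four ranges mentioned above). It counts adjacent repeats with and without the numbers, to show that the repeats only appear once the numbers are deleted.

```python
import re

# Hypothetical table row built from the four ranges mentioned above.
row = "| 210 km | 465 km | 210 km | 225 km |"

def tokens_keeping_numbers(text):
    """Tokenise into runs of digits or runs of ASCII letters."""
    return re.findall(r"[0-9]+|[a-z]+", text.lower())

def tokens_dropping_numbers(text):
    """Tokenise into runs of ASCII letters only, so digits vanish."""
    return re.findall(r"[a-z]+", text.lower())

def adjacent_repeats(tokens):
    """Count tokens that are identical to the token immediately before them."""
    return sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)

before = tokens_keeping_numbers(row)
after = tokens_dropping_numbers(row)

print(adjacent_repeats(before))   # 0
print(" ".join(after))            # km km km km
print(adjacent_repeats(after))    # 3
```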