The Voynich Ninja


Hello Marco,

You make some very good points!

I did the calculations that we wrote up in the paper (I think perhaps the Wikipedia corpus has been modified since the article was published, which might explain why your results are slightly different). You are correct that the structure of the Wikipedia articles may inflate the reduplication rate. The tool I used to extract Wikipedia articles does not distinguish title from content text. Also, some Wikipedia versions, especially for minority languages with very few speakers, have a very short average article length and more formulaic writing, which also inflates the reduplication rate. I'm hoping to look closer at the reduplication rates in the Wikipedia Corpus at some point to separate out reduplication that is a result of article structure and look at the different rates of grammatical reduplication in separate languages.

In the Historical text corpus, the rate ranges from 0-0.16%. So from that perspective Voynich is a clear outlier, and we haven't yet found any historical texts with reduplication rates as high as Voynich. So it's definitely something we're interested in looking into.

Best,

Luke

I have checked the cleaning script. The unusually high number of word repeats is obviously the result of an incomplete cleaning process.

For instance, one Wikipedia page contains the following code fragment:

Code:
<table border="1" align="right" cellpadding="4" cellspacing="0" width="300" style="margin: 0 0 1em 1em; background: #f9f9f9; border: 1px #aaaaaa solid; border-collapse: collapse; font-size: 95%;">

The raw files are cleaned using the following script:

Code:
delete_uncommon_chars(doc) # Delete characters with freq < .0001

replace ((border))? align ((left)|(right)|(center)) with ''

replace (cellpadding) with ''

replace (cellspacing) with ''

replace (width style margin em em background) with ''

replace (border collapse collapse) with ''

The result is

Code:
f f f border aaaaaa solid font size
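The effect can be reproduced with a minimal Python sketch. It is not the actual cleaning script: the character filter below (keep only the letters a-z, turn everything else into a space) is an assumed stand-in for `delete_uncommon_chars`, the name `clean_fragment` is made up, and only the replace rules quoted above are applied.

```python
import re

# The table tag quoted above, copied verbatim.
FRAGMENT = (
    '<table border="1" align="right" cellpadding="4" cellspacing="0" '
    'width="300" style="margin: 0 0 1em 1em; background: #f9f9f9; '
    'border: 1px #aaaaaa solid; border-collapse: collapse; font-size: 95%;">'
)

# Only the replace rules quoted above, written as Python regexes.
RULES = [
    r"(border )?align (left|right|center)",
    r"cellpadding",
    r"cellspacing",
    r"width style margin em em background",
    r"border collapse collapse",
]


def clean_fragment(fragment: str) -> str:
    # Assumption: digits and punctuation count as "uncommon" characters
    # and are replaced by a single space.
    text = re.sub(r"[^a-z]+", " ", fragment.lower())
    # Apply each quoted rule in order.
    for rule in RULES:
        text = re.sub(rule, " ", text)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split())


print(clean_fragment(FRAGMENT))
```

Under these assumptions the output still contains the tokens `f f f`, which is all that remains of the colour code `#f9f9f9` once the digits and the `#` are gone.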


Thanks to lurker for looking into the code! I guess these issues with formatting tags can be fixed, but perhaps others have already developed and shared more effective cleaning software?

(17-01-2021, 10:33 PM) Luke Lindemann wrote: In the Historical text corpus, the rate ranges from 0-0.16%. So from that perspective Voynich is a clear outlier, and we haven't yet found any historical texts with reduplication rates as high as Voynich. So it's definitely something we're interested in looking into.

Hi Luke,
thank you very much for your kind reply! It's great to have you on the forum :)

Figures from the Historical corpus are closer to what I expected, yet there seem to be a few problems in those files too.

The Historical text with the highest reduplication rate appears to be the English Secretum Secretorum by Copland. That file has issues too, and I doubt it can be regarded as correct English.
There is a different online transcription (at umich) that appears to be better.

For instance, this fragment from github:

i have dyscovered to the the thynges that ben to be hyd

is rendered in this other way at umich:

I haue dyscouered to ye the thynges that bē to be hyd

It seems clear that in this case "to ye" is correct and "to the" is not. I would be curious to see scans of the actual 1528 edition, but I have been unable to find them.

I appreciate that collecting a reliable corpus of reference texts (both historical and modern) is a huge effort. I also believe that the Yale Corpora will be a precious resource for people interested in the language side of Voynich research. Thanks again to you and Claire for starting this project and sharing it with everybody!

This is a simple test for scanning smaller files for string duplicates:

Code:
grep -Eo '(\b.+) \1\b' filename | sort | uniq -u

This is the outcome for the Zhuang text sample:

Code:
#grep -Eo '(\b.+) \1\b' Zhuang | sort | uniq -u

aen aen

aen gauqbaiq neix aen gauqbaiq neix

alfabeto de esperanto alfabeto de esperanto

b b b b b b b b b b

bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar

borassus flabellifer borassus flabellifer

bouxcuengh bouxcuengh

breinigerberg breinigerberg

byamanh caethwk nduk byamanh caethwk nduk

c c

ch

chinese only chinese only

cm cm

cm kwk cm kwk

d d

d d d d

daeuj soebsoeb daeuj soebsoeb

deborah read read deborah read read

diet diet

documents documents

doengz doengz

dwg dwg

eiqceiq eiqceiq

em em

feihcouh feihcouh

fouz fouz

geux satsat geux satsat

ginghciyoz veizgvanh ginghciyoz veizgvanh

grasslands grasslands

guinea guinea

gwn duh ceuj gwn duh ceuj

gyoepsuenq gyoepsuenq

h h

h h h h h h h h

hanciuz hanciuz

hoi dou lu vei hoi dou lu vei

hojceh hojceh

hungzgvanh ginghciyoz hungzgvanh ginghciyoz

ij dawz ij dawz

in netherlands in netherlands

isbn isbn isbn isbn isbn isbn

isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn isbn

kbs kbs

km h km h km h km h

km km km km

km straße gelände km straße gelände

km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km² km²

l l

liuzgiuzgoz liuzgiuzgoz

male on maldives male on maldives

mauz cwzdungh mauz cwzdungh

mbouj naih law mbouj naih law

mingzciuz mingzciuz

mm mm mm mm

mm mm mm mm wannenpanze

monte aguila inauguracion nueva plaza de monte aguila inauguracion nueva plaza de

mwngz gvai ha mwngz gvai ha

n n n n n n n n n n

nantucket nantucket

nanz lai

nienz nyied hauh nienz

nienz nyied nienz nyied

no no

ouhcouh ouhcouh

penicillus penicillus

ps ps ps ps

ps t ps t

ps t ps t ps t ps t

right right

sanskrit sanskrit

saw doj saw doj

swiq swiq

t t t t

the supernova the supernova

tomislav nikoli tomislav nikoli

tundra tundra

v v

vangz yenfanh vangz yenfanh

vaƅ vaƅ

voltaire voltaire

whitefield whitefield whitefield whitefield

yahvaiz yahvaiz

yesu yesu

yienghndang yienghndang

yienzbit yienzbit

zylinder ottomotor zylinder dieselmotor zylinder ottomotor zylinder dieselmotor

This outcome illustrates that most duplicated strings did not belong to the Zhuang language.
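A rough Python counterpart of the grep test, for anyone who prefers to count the repeats. This is a sketch only: the function name `find_repeats` and the sample text are made up for illustration. One difference from the pipeline above is that `uniq -u` prints only lines that occur exactly once in the sorted list, so a repeat that is found several times drops out of the grep output, whereas the counter below keeps it together with its count.

```python
import re
from collections import Counter

# Same idea as the grep pattern: some text, a space, then the same text again.
REPEAT = re.compile(r"\b(.+) \1\b")


def find_repeats(text: str) -> Counter:
    """Count every repeated string found, scanning line by line like grep."""
    counts = Counter()
    for line in text.splitlines():
        for match in REPEAT.finditer(line):
            counts[match.group(0)] += 1
    return counts


# Invented sample text, not taken from the corpus.
sample = "the km km km km road\nyesu yesu\nyesu yesu\nno repeat here"
for repeat, n in sorted(find_repeats(sample).items()):
    print(n, repeat)
```

Like the grep version, the backreference pattern is slow on long lines, so it is best suited to smaller files.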

Hello Marco!

You make some really good points!

I believe we have modified the Wikipedia Corpus slightly since publication, which may explain why your answers are a little different.

Yes, the structure of the text in the Wikipedia Corpus inflates the reduplication rate. The tool I used to compile the texts does not distinguish between title text and content text, as you demonstrated. It also includes a lot of metadata, which I tried to clean as much as possible using a series of regular expressions to capture the most common Wikipedia code snippets, but as Lurker shows there are some I wasn't able to get rid of. These issues are especially relevant for Wikipedia language versions that a) have a small number of articles in total, b) have articles which are short on average, and c) are written in the Latin script (because for other scripts I can just filter out the Latin script metadata). This particularly affects minority languages like Cree and Piedmontese, which also have very basic, formulaic entries.

The Historical Corpus, by contrast, has a much narrower range of reduplication rates (0.0-0.16%), so Voynich is a clear outlier among the historical manuscripts we have. But there may be texts we haven't found that have higher rates of reduplication, either because they belong to certain genres (e.g. magical incantations) or because the grammar of the language itself uses reduplication more extensively.

All of this is to say that reduplication is an interesting topic that warrants a lot more examination than we were able to give to it in the Review article. Thank you for bringing it up!

Luke Lindemann

The reduplications are caused by the cleaning process. For instance, one Wikipedia page contains a table about tanks. One row of the table says that the 1st tank can drive 210 km, the 2nd tank 465 km, the 3rd tank 210 km and the 4th tank 225 km. After deleting all the table metadata and all the numbers, the only thing left is "km km km km". In this way the cleaning process itself creates reduplications.
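The tank example can be replayed in a few lines of Python. This is a sketch under stated assumptions: the helper names are invented, the row is reduced to its numbers and units, and the rate used here (tokens identical to the token right before them, divided by the total number of tokens) is an assumed definition that may differ from the one used in the paper.

```python
import re


def strip_numbers(text: str) -> str:
    # Delete every run of digits, then collapse the leftover whitespace.
    return " ".join(re.sub(r"\d+", " ", text).split())


def immediate_repeat_rate(tokens: list[str]) -> float:
    # Assumed definition: share of tokens equal to their predecessor.
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / len(tokens)


# The ranges from the tank table, as they might look once the markup is gone.
row = "210 km 465 km 210 km 225 km"
cleaned = strip_numbers(row)

print(cleaned)
print(immediate_repeat_rate(row.split()))      # before removing the numbers
print(immediate_repeat_rate(cleaned.split()))  # after removing the numbers
```

Before the numbers are removed no token repeats its neighbour. Afterwards three of the four remaining tokens do, although the source table contains no reduplication at all.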
