[Article] The Linguistics of the Voynich Manuscript (Bowern et al. 2020)
The Voynich Ninja (https://www.voynich.ninja), Forum: News (https://www.voynich.ninja/forum-25.html), Thread: /thread-3344.html
RE: The Linguistics of the Voynich Manuscript (Bowern et al. 2020) - Luke Lindemann - 17-01-2021

Hello Marco,

You make some very good points! I did the calculations that we wrote up in the paper (I think perhaps the Wikipedia corpus has been modified since the article was published, which might explain why your results are slightly different).

You are correct that the structure of the Wikipedia articles may inflate the reduplication rate. The tool I used to extract Wikipedia articles does not distinguish title from content text. Also, some Wikipedia versions, especially for minority languages with very few speakers, have a very short average article length and more formulaic writing, which also inflates the reduplication rate.

I'm hoping to look closer at the reduplication rates in the Wikipedia Corpus at some point, to separate out reduplication that is a result of article structure and to look at the different rates of grammatical reduplication in separate languages.

In the Historical text corpus, the rate ranges from 0-0.16%. So from that perspective Voynich is a clear outlier, and we haven't yet found any historical texts with reduplication rates as high as Voynich. So it's definitely something we're interested in looking into.

Best,
Luke

RE: The Linguistics of the Voynich Manuscript (Bowern et al. 2020) - lurker - 18-01-2021

I have checked the cleaning script. The unusually high number of word repeats is obviously the result of an incomplete cleaning process. For instance, the Wikipedia page [link] contains the following code fragment:

Code:
<table border="1" align="right" cellpadding="4" cellspacing="0" width="300" style="margin: 0 0 1em 1em; background: #f9f9f9; border: 1px #aaaaaa solid; border-collapse: collapse; font-size: 95%;">

The raw files are cleaned by using the following script [link]:
Code:
delete_uncommon_chars(doc) # Delete characters with freq < .0001

The result is

Code:
f f f border aaaaaa solid font size

see [link]

RE: The Linguistics of the Voynich Manuscript (Bowern et al. 2020) - MarcoP - 18-01-2021

Thanks to lurker for looking into the code! I guess that these issues with formatting tags can be fixed, but likely others have already developed and shared more effective cleaning software?

(17-01-2021, 10:33 PM) Luke Lindemann Wrote:
In the Historical text corpus, the rate ranges from 0-0.16%. So from that perspective Voynich is a clear outlier, and we haven't yet found any historical texts with reduplication rates as high as Voynich. So it's definitely something we're interested in looking into.

Hi Luke,
thank you very much for your kind reply! It's great to have you on the forum.

Figures from the Historical corpus are closer to what I expected, yet there seem to be a few problems in those files too (see also [link]). The Historical text with the highest reduplication rate appears to be the English Secretum Secretorum by Copland. That file has issues too, and I doubt it can be regarded as correct English. There is a different online transcription ([link]) that appears to be better. For instance, this fragment from github:

i have dyscovered to the the thynges that ben to be hyd

is rendered in this other way at umich:

I haue dyscouered to ye the thynges that bē to be hyd

It seems clear that in this case "to ye" is correct and "to the" is not. I would be curious to see scans of the actual 1528 edition, but I have been unable to find them.

I appreciate that collecting a reliable corpus of reference texts (both historical and modern) is a huge effort.
I also believe that the Yale Corpora will be a precious resource for people interested in the language side of Voynich research. Thanks again to you and Claire for starting this project and sharing it with everybody!

RE: The Linguistics of the Voynich Manuscript (Bowern et al. 2020) - lurker - 18-01-2021

This is a simple test for scanning smaller files for string duplicates:

Code:
grep -Eo '(\b.+) \1\b' filename | sort | uniq -u

This is the outcome for the Zhuang text sample:

Code:
#grep -Eo '(\b.+) \1\b' Zhuang | sort | uniq -u

This outcome illustrates that most duplicated strings did not belong to the Zhuang language.

RE: The Linguistics of the Voynich Manuscript (Bowern et al. 2020) - Luke Lindemann - 18-01-2021

Hello Marco!

You make some really good points! I believe we have modified the Wikipedia Corpus slightly since publication, which may explain why your answers are a little different.

Yes, the structure of the text in the Wikipedia Corpus inflates the reduplication rate. The tool I used to compile the texts does not distinguish between title text and content text, as you demonstrated. It also includes a lot of metadata, which I tried to clean as much as possible using a series of regular expressions to capture the most common Wikipedia code snippets, but as Lurker shows there are some I wasn't able to get rid of.

These issues are especially relevant for Wikipedia language versions that a) have a small number of articles in total, b) have articles which are short on average, and c) are written in the Latin script (because for other scripts I can just filter out the Latin script metadata). This particularly affects minority languages like Cree and Piedmontese, which also have very basic, formulaic entries.

The Historical Corpus, by contrast, has a much smaller reduplication rate range, from 0.0-0.16%, so Voynich is a clear outlier among the historical manuscripts we have.
But there may be texts we haven't found that have higher rates of reduplication, either because they're in certain genres (e.g. magical incantations) or because the grammar of the language itself uses reduplication more extensively.

All of this is to say that reduplication is an interesting topic that warrants a lot more examination than we were able to give to it in the Review article. Thank you for bringing it up!

Luke Lindemann

RE: The Linguistics of the Voynich Manuscript (Bowern et al. 2020) - lurker - 18-01-2021

The reduplications are caused by the cleaning process. For instance, this Wikipedia page [link] contains a table about tanks. The table also contains a row saying that the 1st tank can drive 210 km, the 2nd tank 465 km, the 3rd tank 210 km and the 4th tank 225 km. After deleting all the metadata for the table and also all the numbers, the only thing left is "km km km km". In this way the cleaning process itself creates reduplications.
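The "km km km km" effect is easy to reproduce. The Python below is only a rough sketch, not the cleaning script used for the Yale corpora: `naive_clean` is a hypothetical stand-in that drops tags and every non-letter character, the table row is made up from the four ranges quoted above, and `reduplication_rate` assumes the rate is the share of tokens identical to the token immediately before them, which may differ from the definition used in the paper.

```python
import re


def naive_clean(raw: str) -> str:
    """Hypothetical cleaner: drop markup tags, then turn every run of
    non-letters (digits, punctuation) into a single space."""
    no_tags = re.sub(r"<[^>]*>", " ", raw)
    letters_only = re.sub(r"[^A-Za-z]+", " ", no_tags)
    return " ".join(letters_only.lower().split())


def reduplication_rate(text: str) -> float:
    """Assumed definition: fraction of tokens equal to the preceding token."""
    tokens = text.split()
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / len(tokens)


# Made-up table row using the ranges mentioned above (210, 465, 210, 225 km).
row = "<tr><td>210 km</td><td>465 km</td><td>210 km</td><td>225 km</td></tr>"

cleaned = naive_clean(row)
print(cleaned)                      # km km km km
print(reduplication_rate(cleaned))  # 0.75

# Stripping digits from a colour code gives the "f f f" seen earlier.
print(naive_clean("#f9f9f9"))       # f f f
```

A cleaner that removed the whole table cell, or kept the numbers as tokens, would not produce these repeats, so the measured rate depends heavily on the order and scope of the cleaning steps.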