If anyone wants to do their own research, here are the necessary text files. I downloaded the HTM files from Takeshi Takahashi's website (with wget) and then converted them to text files (with html2text). If you want to work with Stylo, you have to copy the files into a folder called "corpus" and set its parent folder as the working directory.
[attachment=7869]
edit: When converting to text files, additional line breaks were sometimes inserted. However, this is irrelevant for an analysis.
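For anyone who prefers to script the folder setup, here is a minimal sketch in Python. The function name `build_corpus` and both folder arguments are my own placeholders, not anything Stylo requires; the only thing taken from the description above is that the text files end up in a subfolder called "corpus" of the working directory.

```python
import shutil
from pathlib import Path


def build_corpus(source_dir, work_dir):
    """Copy every .txt file from source_dir into work_dir/corpus.

    Returns the sorted list of copied file names. Both arguments are
    hypothetical paths chosen by the caller.
    """
    corpus = Path(work_dir) / "corpus"
    # Create the working directory and the "corpus" subfolder if needed.
    corpus.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(source_dir).glob("*.txt")):
        shutil.copy2(src, corpus / src.name)
        copied.append(src.name)
    return copied
```

After running it, `work_dir` is the folder to set as the working directory.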
On closer inspection I noticed that there are some comments in the Takahashi files. I will remove them and post the files again as soon as I have time.
Or does someone have a clean corpus he / she wants to share here?
(06-11-2023, 10:25 PM)bi3mw Wrote: Or does someone have a clean corpus he / she wants to share here?
That would not be difficult. In what form would you like to have this?
- One file per page? (zipped)
- Which alphabet (Basic Eva?)
- Only paragraph text, or also labels etc.?
- Any preferred transliteration file?
(07-11-2023, 09:54 AM)ReneZ Wrote: That would not be difficult. In what form would you like to have this?
- One file per page (zipped)
- Basic EVA
- Just plain text without comments, special characters, labels, etc. (if possible, use spaces as word separators)
- Transliteration does not matter.
Thanks in advance
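To make the request concrete, this is roughly the kind of cleanup I have in mind, as a Python sketch. Everything about the markup is an assumption on my part and would need adjusting to the actual transliteration file: lines starting with "#" are treated as comments, anything in angle brackets or curly braces is treated as a locus header or inline comment, and "." and "," are treated as word separators. The function name `clean_line` is hypothetical.

```python
import re


def clean_line(line: str) -> str:
    """Reduce one transliteration line to space-separated lowercase words.

    All markup conventions handled here are assumptions, not a
    description of any particular transliteration file format.
    """
    # Assumed: a line starting with '#' is a comment and carries no text.
    if line.lstrip().startswith("#"):
        return ""
    # Assumed: <...> holds locus headers and inline comments.
    line = re.sub(r"<[^>]*>", " ", line)
    # Assumed: {...} holds comments as well.
    line = re.sub(r"\{[^}]*\}", " ", line)
    # Assumed: '.' and ',' mark word boundaries.
    line = re.sub(r"[.,]", " ", line)
    # Drop every remaining character that is not a lowercase letter or space.
    line = re.sub(r"[^a-z ]", "", line)
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())
```

Note that dropping special characters silently joins the letters on either side of them, which may or may not be what one wants for uncertain readings.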
Just to be sure: should circular text, radial text be included?
For the terminology, see here: [link]
With labels I meant star labels, zodiac labels etc...
(08-11-2023, 03:48 AM)ReneZ Wrote: Just to be sure: should circular text, radial text be included?
Yes, this kind of text should be included.
(08-11-2023, 03:48 AM)ReneZ Wrote: With labels I meant star labels, zodiac labels etc...
I see, yes with labels.
The zip file should be available via this link: [link]
I spent (wasted) some time trying to set up an FTP repository, but have not yet been successful.
The file includes 227 short text files. The file names may appear mysterious at first.
I will add some explanations in the transliteration thread.
Since you recently succeeded in setting up and running bitrans, you may want to do the same with ivtt; then you will be able to repeat this yourself.
Edit: this was based on the RF transliteration, file: RF1a-n.txt (-n stands for native alphabet, i.e. Eva).
I don't know how robust the PCA diagram can be, since many pages don't contain much text, but I would say this shows the large variance in "language" even for individual scribes. This is particularly clear for Scribe2's Quire13 (high Y=PC2) vs Herbal (low Y) pages; it's also noteworthy that Scribe3's Q20 basically falls between the two clusters by Scribe2.
You can see that scribe 1 falls largely in Currier language "A" while the remaining scribes are spread across "B" and Unknown.
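On the robustness question, one way to check how much the short pages matter is to build the word-frequency table yourself and drop pages below a token threshold before running the PCA. The sketch below is in Python rather than R and is not what Stylo does internally; it only illustrates the general idea of a table of relative frequencies for the most frequent words. The function name `frequency_table` and the parameters `n_words` and `min_tokens` are hypothetical.

```python
from collections import Counter


def frequency_table(pages, n_words=100, min_tokens=0):
    """Return (vocabulary, {page: [relative frequency per vocabulary word]}).

    pages maps a page name to its cleaned, space-separated text.
    Pages with fewer than min_tokens words (or no words) are skipped.
    """
    tokens = {name: text.split() for name, text in pages.items()}
    # Skip pages with too little text to give stable frequencies.
    tokens = {n: t for n, t in tokens.items() if len(t) >= max(min_tokens, 1)}

    # Count words over all remaining pages to pick the most frequent ones.
    totals = Counter()
    for t in tokens.values():
        totals.update(t)
    # Sort by count (descending), then alphabetically so ties are stable.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:n_words]]

    table = {}
    for name, t in tokens.items():
        counts = Counter(t)
        table[name] = [counts[w] / len(t) for w in vocab]
    return vocab, table
```

Running the PCA once on all pages and once with, say, a higher `min_tokens` would show whether the clusters survive without the shortest pages.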