(11-04-2019, 11:00 PM)bi3mw Wrote: @Anton: Sorry, [link] again the corpus.
I did an entropy calculation. If spaces are counted as characters, I get:
Code:
single-char: 4.014
bigram: 7.256
conditional: 3.242
If spaces are just treated as separators, and only characters within words are considered, then:
Code:
single-char: 3.992
bigram: 7.101
conditional: 3.109
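For anyone who wants to reproduce figures of this kind, here is a minimal sketch in Python. It is not the script used for the numbers above. It assumes that "conditional" means bigram entropy minus single-character entropy (the posted values are consistent with that: 7.256 - 4.014 = 3.242 and 7.101 - 3.992 = 3.109), and that "spaces as separators" means no bigram is allowed to span a word boundary. The function names are made up for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a frequency table (a Counter)."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def char_entropies(text, spaces_as_chars=True):
    """Return (single-char, bigram, conditional) entropy in bits.

    The conditional value is estimated as bigram entropy minus
    single-character entropy (an assumption about the method).
    """
    if spaces_as_chars:
        # One continuous stream; runs of whitespace collapse to one space.
        units = [" ".join(text.split())]
    else:
        # Each word is its own stream, so no bigram crosses a word boundary.
        units = text.split()
    singles = Counter(ch for u in units for ch in u)
    bigrams = Counter(u[i:i + 2] for u in units for i in range(len(u) - 1))
    h_single = entropy(singles)
    h_bigram = entropy(bigrams)
    return h_single, h_bigram, h_bigram - h_single
```

Calling `char_entropies(corpus)` and `char_entropies(corpus, spaces_as_chars=False)` on the same string gives the two variants. On very short strings the difference estimate is unreliable, so it only makes sense on samples of thousands of characters.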
Looking at Table 2 on this page [link], the above values match almost exactly those for the Latin texts of Pliny and Mattioli.
My calculations:
h0 = 4.70
h1 = 4.01
h2 = 3.24
Same values as Rene's
h0 - h1 = 0.69
h1 - h2 = 0.77
For the VMS (Herbal), h1 - h2 is 1.5 ... 1.9; see here: [link]
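The arithmetic behind those differences can be checked in a few lines. This is only a sketch: it assumes h0 is log2 of the alphabet size and that the alphabet has 26 symbols, which is consistent with h0 = 4.70 but is not stated in the post. `entropy_drops` is a made-up helper name.

```python
import math


def entropy_drops(alphabet_size, h1, h2):
    """Return (h0, h0 - h1, h1 - h2), each rounded to two decimals."""
    h0 = math.log2(alphabet_size)  # zero-order entropy: all symbols equally likely
    return round(h0, 2), round(h0 - h1, 2), round(h1 - h2, 2)


# Values quoted in the post; the alphabet size of 26 is an assumption.
print(entropy_drops(26, 4.01, 3.24))  # (4.7, 0.69, 0.77)
```

The second drop, h1 - h2, is the one being compared: 0.77 here against the 1.5 ... 1.9 quoted for the VMS (Herbal).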
@JKP
Actually, it makes sense to take any contemporary text from a similar field (herbal, pharma, alchemy...) and make a transcription of ten thousand characters, to have at least some benchmark to begin with.
In principle, it should be possible to find a decent OCR of an early printed book with abbreviations. Of course, the OCR will not correctly transcribe the abbreviations, but it could work on "regular" characters [one should edit the abbreviations, hopefully with a semi-automatic process]. The quality of the [link] is terrible, but maybe there are other texts with better scans and better results.
[attachment=2808]
archive.org OCR:
vero cadebbtl^ pnacccebat birccte in alm mfenus bns trespcdcs.£t farfsj bSiat fcjrcalamos Cijrcdietes
Ocbaftili.tresci: pno latcre-^r eres ei allo^d^ furiusobUq.^uoufcB Etmgcret ad alttmdxnebafhCJln
baftiU to quamwz qpbi infUr mios.l^uos aUd pom^s oicnt.5lta ® boo apbi pofm coira almfa#
No, I think that's not suitable. A printed book would be only moderately abbreviated, and we can produce virtually the same result by taking e.g. bi3mw's text and introducing some common abbreviations ourselves. I have done that before, and apart from a slight increase of entropy there was nothing interesting.
What we need is some heavily abbreviated stuff, which would be a manuscript.
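To make the "introduce abbreviations ourselves" experiment concrete, here is a rough sketch of how it could be done. The rule list is hypothetical: the Latin patterns are just a few common candidates, and the single-character stand-ins are arbitrary placeholders, not an encoding of real scribal signs. It is not the procedure actually used in the earlier test.

```python
import re

# Hypothetical rules, applied in order. Each stand-in is one arbitrary
# character so that an abbreviation counts as a single symbol.
ABBREVIATIONS = [
    (r"\bet\b", "&"),  # the whole word "et"
    (r"que\b", "1"),   # word-final "-que"
    (r"rum\b", "4"),   # word-final "-rum"
    (r"us\b", "9"),    # word-final "-us"
    (r"\bper", "2"),   # word-initial "per-"
]


def abbreviate(text):
    """Apply every substitution rule to a lowercase plain-text string."""
    for pattern, sign in ABBREVIATIONS:
        text = re.sub(pattern, sign, text)
    return text


print(abbreviate("arma virumque cano et perpetuus"))
# arma virum1 cano & 2petu9
```

The abbreviated string can then go through the same entropy calculation as the plain one, and the two sets of values compared.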
(12-04-2019, 09:13 AM)MarcoP Wrote: ...
archive.org OCR:
vero cadebbtl^ pnacccebat birccte in alm mfenus bns trespcdcs.£t farfsj bSiat fcjrcalamos Cijrcdietes
Ocbaftili.tresci: pno latcre-^r eres ei allo^d^ furiusobUq.^uoufcB Etmgcret ad alttmdxnebafhCJln
baftiU to quamwz qpbi infUr mios.l^uos aUd pom^s oicnt.5lta ® boo apbi pofm coira almfa#
That is pretty bad OCR, and it's not even handwritten text.
vero cadebbtl^ pnacccebat birccte in alm mfenus bns trespcdcs.£t farfsj bSiat fcjrcalamos Cijrcdietes
vero ca'delabri q' procedebat directe in altu' inferius h'ns tres pedes. Et sursu[m] he'b at sex calamos egredie'tes
vero candelabri qui procedebat directe, etc.
Heavily abbreviated text has a lot of variation in the characters.
Most texts weren't heavily abbreviated (I'm not sure all the scribes even knew all the abbreviations). They used the abbreviations I listed above and sometimes only half of those.
I'm not sure it has to be heavily abbreviated, but I'm definitely interested in hearing what Anton has in mind.
(12-04-2019, 09:47 AM)Anton Wrote: What we need is some heavily abbreviated stuff, which would be a manuscript.
Any Pseudo-Apuleius that would suit you?
How much text do you feel it should be (this is a question for everyone) to be useful in terms of comparison?
A page? 10 pages? 20? 50? 100?
I also quickly made the bigram plot for the "Corpus_2" text:
[attachment=2814]
It is very similar to the other ones for Latin on [link].
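For reference, the data behind a bigram plot of this kind is just a table of pair counts. The sketch below is an assumption about the general approach, not the code used for the attachment: it counts only pairs inside words (ignoring spaces) and returns a square matrix indexed by first and second character. `bigram_matrix` is a made-up name.

```python
from collections import Counter


def bigram_matrix(text):
    """Return (alphabet, matrix) where matrix[i][j] is the number of times
    alphabet[i] is immediately followed by alphabet[j] inside a word."""
    words = text.split()
    alphabet = sorted(set("".join(words)))
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    # A Counter returns 0 for pairs that never occur.
    matrix = [[counts[a + b] for b in alphabet] for a in alphabet]
    return alphabet, matrix


alphabet, matrix = bigram_matrix("abba ab")
print(alphabet)  # ['a', 'b']
print(matrix)    # [[0, 2], [1, 1]]
```

The matrix can be handed to any plotting tool as a heat map. Sorting the alphabet by frequency instead of alphabetically makes plots of different texts easier to compare by eye.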
Quote:Any Pseudo-Apuleius that would suit you?
If you ask my personal preference, then I'd be interested in some High German text, not in a Latin one. Something like this: [link]. This one is only moderately abbreviated, but I have seen another MS on e-codices recently, very abbreviated, but can't recall what it was.
Quote:How much text do you feel it should be (this is a question for everyone) to be useful in terms of comparison?
A page? 10 pages? 20? 50? 100?
That's not a question of the number of pages, but rather of the number of characters. I think ten thousand would be quite enough. Several tens of thousands are definitely enough.
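If the sample is to be fixed by character count rather than by pages, a transcription can simply be cut at a character budget. A small sketch, with a made-up helper name; whether spaces count toward the budget is left as an option, since the thread uses both conventions.

```python
def sample_by_chars(text, n_chars=10_000, count_spaces=False):
    """Return the longest run of whole words from the start of the text
    that fits within n_chars characters."""
    kept, used = [], 0
    for word in text.split():
        # A separating space is charged only if spaces are being counted
        # and this is not the first word.
        cost = len(word) + (1 if count_spaces and kept else 0)
        if used + cost > n_chars:
            break
        kept.append(word)
        used += cost
    return " ".join(kept)
```

Cutting at whole words avoids a truncated final word, which would otherwise add a spurious word ending to the statistics.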