RobGea > 30-05-2019, 03:19 PM
Hi all,
It would be most helpful if there were a basic stats summary somewhere,
just basic word counts and the like for individual transcriptions.
Something like this:
Transcription:
TT_ivtff_v0a
IVTFF Eva- 1.5
# Extracted from LSI_ivtff_0d.txt
# Version v0a of 26/08/2017
Notes:
words with question marks rejected.
'<->' replaced with space
Stats:
Total word count: 37759
distinct words: 8026
Hapax legomena: 5527
words occurring twice or more: 2499 (anywhere in the manuscript)
Words occurring 2 or more times, by word length:
length   how many
11       0
10       6
9        57
8        204
7        456
6        638
5        599
4        328
3        145
2        48
1        18
EOF
This kind of thing would be a great help for baselining/calibrating any code.
Yes, it's all been done before, but finding the actual data is a pain.
An easy-to-find reference would be good, esp. if several folks could agree on the numbers.
If anyone knows where to find such a thing, that would be great.
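
For anyone who wants to compare against numbers like these, here is a minimal Python sketch of how such a summary could be computed. It is not the code that produced the figures above, and the tokenisation rules are assumptions: lines starting with '#' are treated as comments, '<->' is replaced with a space, words are split on '.' and whitespace, and any word containing '?' is rejected. It also assumes locus tags and other inline markup have already been stripped from the input, and it measures word length as the number of characters in the transliteration. The function name basic_stats and the sample lines are made up for illustration.

```python
import re
from collections import Counter


def basic_stats(lines):
    """Return basic word statistics for an iterable of transcription lines.

    Assumed rules (not a full IVTFF parser):
    - lines starting with '#' are comments and are skipped
    - '<->' is replaced with a space
    - words are separated by '.' or whitespace
    - words containing '?' are rejected
    """
    counts = Counter()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue
        line = line.replace("<->", " ")
        for word in re.split(r"[.\s]+", line):
            if word and "?" not in word:
                counts[word] += 1

    # Words seen at least twice anywhere in the input.
    repeated = [w for w, n in counts.items() if n >= 2]
    # Number of repeated distinct words for each word length.
    by_length = Counter(len(w) for w in repeated)

    return {
        "total": sum(counts.values()),
        "distinct": len(counts),
        "hapax": sum(1 for n in counts.values() if n == 1),
        "repeated": len(repeated),
        "repeated_by_length": dict(by_length),
    }


# Tiny made-up sample, only to show the call.
sample = [
    "# comment line, ignored",
    "daiin.chol<->daiin.shol",
    "chol.d?in.qokeedy",
]
stats = basic_stats(sample)
print(stats)
```

Two quick sanity checks for any such summary: distinct words should equal hapax plus repeated words, and the per-length counts should add up to the repeated-word total. The figures posted above pass both (5527 + 2499 = 8026, and the per-length counts sum to 2499).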