The Voynich Ninja

Full Version: Basic stats summary
Hi all,
It would be most helpful if there were a basic stats summary somewhere,
just for basic word counts and such from individual transcriptions.
Something like this:

Transcription:
    TT_ivtff_v0a
    IVTFF Eva- 1.5
# Extracted from LSI_ivtff_0d.txt
# Version v0a of 26/08/2017

Notes:
    words with question marks rejected.
    '<->' replaced with space
Stats:
    Total word count: 37759
    distinct words:    8026
    Hapax legomena:    5527
    words occurring twice or more: 2499  (anywhere in the manuscript)

    Words occurring twice or more, by word length:
        Length 11:    0
        Length 10:    6
        Length  9:   57
        Length  8:  204
        Length  7:  456
        Length  6:  638
        Length  5:  599
        Length  4:  328
        Length  3:  145
        Length  2:   48
        Length  1:   18
EOF

This kind of thing would be a great boon for baselining or calibrating any code.
Yes, it's all been done before, but finding the actual data is a pain.
An easy-to-find reference would be good, esp. if several folks could agree on the numbers.

If anyone knows where to find such a thing, that would be great.
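For anyone who wants to reproduce a summary in this layout, here is a minimal Python sketch. It is not the code that produced the numbers above. The helper names `tokenize` and `basic_stats` are made up for illustration. It assumes the input has already had locus identifiers and inline comments stripped, and that words are separated by '.' or whitespace. It applies the two rules from the Notes: '<->' becomes a space, and any word containing '?' is rejected.

```python
from collections import Counter


def tokenize(text):
    """Split transliteration text into words (hypothetical helper).

    Assumes locus identifiers and inline comments were removed beforehand,
    and that '.' or whitespace separates words.
    """
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        line = line.replace("<->", " ")       # note: '<->' replaced with space
        for word in line.replace(".", " ").split():
            if "?" in word:                   # note: words with '?' rejected
                continue
            words.append(word)
    return words


def basic_stats(words):
    """Return the counts shown in the summary, as a dictionary."""
    counts = Counter(words)
    hapax = sum(1 for c in counts.values() if c == 1)
    repeated = [w for w, c in counts.items() if c >= 2]
    return {
        "total": len(words),                   # total word count (tokens)
        "distinct": len(counts),               # distinct words (types)
        "hapax": hapax,                        # words occurring exactly once
        "repeated": len(repeated),             # words occurring twice or more
        "repeated_by_length": dict(Counter(len(w) for w in repeated)),
    }


if __name__ == "__main__":
    sample = "# toy input, not real data\ndaiin.chol<->daiin.o?l\nchol.shedy"
    print(basic_stats(tokenize(sample)))
```

As a sanity check on any such output, distinct words should equal hapax legomena plus words occurring twice or more (8026 = 5527 + 2499 in the summary above), and the per-length counts should add up to the latter figure.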
"...    words occurring twice or more:"

It needs to be made clear whether this is words occurring twice or more (anywhere in the manuscript) or words occurring twice or more in a row (since repetition is often discussed). A slight modification of the wording might be enough to clarify this.
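To make the two readings concrete, here is a small Python sketch that computes both. The function names are hypothetical. The second function treats the word list as one continuous sequence, so a repeat that straddles a line or page break would also count; whether that is wanted is a separate decision.

```python
from collections import Counter


def repeated_anywhere(words):
    """Distinct words that occur twice or more anywhere in the sequence."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= 2}


def repeated_in_a_row(words):
    """Distinct words that occur at least once immediately after themselves."""
    return {a for a, b in zip(words, words[1:]) if a == b}


if __name__ == "__main__":
    toy = ["daiin", "daiin", "chol", "shedy", "chol"]  # toy input, not real data
    print(sorted(repeated_anywhere(toy)))
    print(sorted(repeated_in_a_row(toy)))
```

On the toy list, "chol" is counted by the first function only, since its two occurrences are not adjacent, while "daiin" is counted by both.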