The Voynich Ninja
Basic stats summary - Printable Version


Basic stats summary - RobGea - 30-05-2019

Hi all,
It would be most helpful if there were a basic stats summary somewhere,
just basic word counts and such for individual transcriptions.
Something like this:

Transcription:
    TT_ivtff_v0a
    IVTFF Eva- 1.5
# Extracted from LSI_ivtff_0d.txt
# Version v0a of 26/08/2017

Notes:
    words with question marks rejected.
    '<->' replaced with space
Stats:
    Total word count: 37759
    distinct words:    8026
    Hapax legomena:    5527
    words occurring twice or more: 2499  (anywhere in the manuscript)

    Words occurring twice or more, by word length:
    Length  How many
        11         0
        10         6
         9        57
         8       204
         7       456
         6       638
         5       599
         4       328
         3       145
         2        48
         1        18
EOF

This kind of thing would be a great boon to baseline/calibrate any code.
Yes, it's all been done before, but finding the actual data is a pain.
An easy-to-find reference would be good, especially if several folks could agree on the numbers.

If anyone knows where to find such a thing, that would be great.
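
For anyone wanting to check their own numbers against a summary like the one above, here is a minimal Python sketch of how such counts could be produced. It is not the code that generated the figures in this post. The function names `tokenize` and `basic_stats` are made up for illustration, and the sketch assumes the locus tags and comment lines have already been stripped, leaving only text with words separated by '.' or whitespace. It follows the two notes above: '<->' is replaced with a space and words containing a question mark are rejected.

```python
import re
from collections import Counter


def tokenize(text):
    """Split already-cleaned transcription text into words.

    Assumption: locus tags and '#' comment lines were removed beforehand,
    and words are separated by '.' or whitespace.
    """
    text = text.replace("<->", " ")                      # '<->' replaced with space
    words = [w for w in re.split(r"[.\s]+", text) if w]  # drop empty pieces
    return [w for w in words if "?" not in w]            # reject uncertain words


def basic_stats(words):
    """Return word count, distinct words, hapax count and repeated-word counts."""
    counts = Counter(words)
    hapax = sum(1 for c in counts.values() if c == 1)
    # Distinct words seen twice or more anywhere in the text, grouped by length.
    by_length = Counter(len(w) for w, c in counts.items() if c >= 2)
    return {
        "total": len(words),
        "distinct": len(counts),
        "hapax": hapax,
        "repeated": len(counts) - hapax,
        "repeated_by_length": dict(by_length),
    }


# Tiny made-up sample, not real manuscript statistics.
sample = "daiin.chol<->daiin.ol.ch?l.chol.qokeedy"
stats = basic_stats(tokenize(sample))
print(stats["total"], stats["distinct"], stats["hapax"], stats["repeated"])
# prints: 6 4 2 2
```

A useful sanity check on any such summary is that distinct words minus hapax legomena equals the number of words occurring twice or more (here 8026 - 5527 = 2499), and that the per-length counts add up to the same figure.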


RE: Basic stats summary - -JKP- - 30-05-2019

"...    words occurring twice or more:"

It needs to be made clear whether this is words occurring twice or more (anywhere in the manuscript) or words occurring twice or more in a row (since repetition is often discussed). A slight modification of the wording might be enough to clarify this.
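
To make the distinction concrete, here is a small Python sketch that computes both readings side by side. The function name `repeat_sets` is hypothetical, and the sketch assumes the words are supplied as one flat list in reading order, so a word ending one line and the same word starting the next line would count as "in a row". Whether that is the desired behaviour is a choice any summary would need to state.

```python
from collections import Counter


def repeat_sets(words):
    """Return (words occurring twice or more anywhere, words repeated in a row)."""
    counts = Counter(words)
    anywhere = {w for w, c in counts.items() if c >= 2}
    # Compare each word with its immediate successor in reading order.
    in_a_row = {a for a, b in zip(words, words[1:]) if a == b}
    return anywhere, in_a_row


# Made-up sample: 'daiin' repeats back to back, 'chol' repeats only at a distance.
words = ["daiin", "daiin", "chol", "ol", "chol"]
anywhere, in_a_row = repeat_sets(words)
print(sorted(anywhere))  # prints: ['chol', 'daiin']
print(sorted(in_a_row))  # prints: ['daiin']
```

Every word repeated in a row is also repeated anywhere, so the second set is always a subset of the first, and the two counts can differ considerably.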