As gratifying as it has been to see the read count on this thread grow after my update earlier this month (which will probably lead to the release of the first several chapters of _The (Duh!) Voynich Code_ next April Fools' Day), I'd ideally like to be remembered for more substantial contributions. So I thought I would take advantage of this thread's recent visibility to point people to the open-source "Swiss Army knife" text analysis tool, written in Awk, that I posted earlier.

If the low download count reflects a lack of access to an awk interpreter, or problems with the documentation (which lives in the comments and in the file containing examples of the tool's output), I'm happy to add information on getting awk working under Windows or to clarify the command-line options.

The formatting below is going to get mangled (at least in part because there is no fixed-width font option), but here's a taste of the information the tool can generate from an input text sample. Note that while the examples use Currier from the D'Imperio transcription, you can use the transcription scheme and text of your choice, or analyze natural-language texts for comparison:
Example 2: Comparing 20 most-frequent contexts of word-initial '9' (two
glyphs before the space and one glyph after the '9') between Herbal A and
Biological B.
bash-3.1$ cat HerbA.txt | sed 's/^...... //g' | awk -f ekg2Awk.txt k=5 hz=1 TE=20 SC='-#/' RE='../9.'
# 33702 char 'VAS92/9FAE...E/SOE/8AM#'
# Input alphabet (40 'letters'): 'VAS92/FERPMZOQ8-XN*DUC$TWY#I3HJB0,4K67LG'
# Vowels found by Sukhotin's method excluding digrams containing characters
# in '-#/': O A 9 C 0 6
# RE = '../9.'
# XRE = '^$'
# MinCt = 0 (12417 types, 33698 tokens)
#  k   NTyp   NTok  Hk (bits)   PctFreq1  Typ/Tok  k-grams
#  5  12417  33698  12.409070   0.636225  0.36848  all
#  5    121    285   6.179331  65.289256  0.42456  in RE & not in XRE & ct >= MinCt
# Max possible H5: all tokens = 13.600029 bits, RE tokens = 6.918863 bits
# Selected = 0.9745 pct of types, 0.8457 pct of tokens
kgram:     OR/9P   OR/9F   OE/9F   AM/9F   AM/9P   OE/9P   S9/9P   OR/9/   C9/9P   C9/9F
Rank:          1       2       3       4       5       6       7       8       9      10
Count:        22      15      14      12      11      10       9       8       8       6
AllFreq:  0.0007  0.0004  0.0004  0.0004  0.0003  0.0003  0.0003  0.0002  0.0002  0.0002
REFreq:   0.0772  0.0526  0.0491  0.0421  0.0386  0.0351  0.0316  0.0281  0.0281  0.0211
RECmFrq:  1.0000  0.9228  0.8702  0.8211  0.7789  0.7404  0.7053  0.6737  0.6456  0.6175
kgram:     AR/9F   89/9P   S9/9F   S9/98   C2/9F   AR/9P   AN/9F   ZO/9P   ZO/9F   Q9/9P
Rank:         11      12      13      14      15      16      17      18      19      20
Count:         6       6       5       4       4       4       4       3       3       3
AllFreq:  0.0002  0.0002  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
REFreq:   0.0211  0.0211  0.0175  0.0140  0.0140  0.0140  0.0140  0.0105  0.0105  0.0105
RECmFrq:  0.5965  0.5754  0.5544  0.5368  0.5228  0.5088  0.4947  0.4807  0.4702  0.4596
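
For readers without an awk interpreter to hand, the kind of figures shown in the run above can be approximated with a short Python sketch. This is not the tool's code: the function names are made up for illustration, and it assumes that k-grams are overlapping windows over the whole character stream (which agrees with NTok = 33702 - 5 + 1 = 33698), that RE and XRE are applied as unanchored regex searches the way awk's `~` operator would apply them, and that Hk is the Shannon entropy of the k-gram distribution. The "Max possible" figures in the header agree with log2 of the type count (log2(12417) is about 13.6000 and log2(121) is about 6.9189), and PctFreq1 in the RE row looks like the percentage of types that occur exactly once (65.289256% of 121 types is 79).

```python
import math
import re
from collections import Counter


def kgram_counts(text, k):
    """Count every overlapping k-character window in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def select(counts, include, exclude=None, min_count=0):
    """Keep k-grams that match `include`, do not match `exclude`, and occur
    at least `min_count` times (the assumed meaning of RE, XRE and MinCt)."""
    inc = re.compile(include)
    exc = re.compile(exclude) if exclude else None
    return Counter({
        gram: n for gram, n in counts.items()
        if n >= min_count and inc.search(gram)
        and not (exc and exc.search(gram))
    })


def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by the counts."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def summary(counts):
    """Header-style statistics for a non-empty table of k-gram counts."""
    n_types, n_tokens = len(counts), sum(counts.values())
    hapax = sum(1 for n in counts.values() if n == 1)
    return {
        "types": n_types,
        "tokens": n_tokens,
        "entropy_bits": entropy_bits(counts),
        "max_entropy_bits": math.log2(n_types),  # uniform over the types
        "pct_hapax": 100.0 * hapax / n_types,
        "type_token_ratio": n_types / n_tokens,
    }


if __name__ == "__main__":
    sample = "OR/9P/OR/9P/AM/9F"  # toy input, not real transcription data
    chosen = select(kgram_counts(sample, 5), r"../9.")
    total = sum(chosen.values())
    for rank, (gram, n) in enumerate(chosen.most_common(), start=1):
        print(rank, gram, n, round(n / total, 4))
```

Dividing each count by the selected token total gives the REFreq row, and RECmFrq appears to be one minus the summed REFreq of all higher-ranked k-grams (for rank 2 above, 1 - 0.0772 = 0.9228).
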
bash-3.1$ cat BioB.txt | sed 's/^...... //g' | awk -f ekg2Awk.txt k=5 hz=1 TE=20 SC='-#/' RE='../9.'
# 35485 char 'VSC89FAR9/...OPOE/SC89-'
# Input alphabet (34 'letters'): 'VSC89FAR/O4Z-NE2JMPXBQD*TULYW36G5H'
# Vowels found by Sukhotin's method excluding digrams containing characters
# in '-#/': C O A 9 V L Y
# RE = '../9.'
# XRE = '^$'
# MinCt = 0 ( 7966 types, 35481 tokens)
#  k   NTyp   NTok  Hk (bits)   PctFreq1  Typ/Tok  k-grams
#  5   7966  35481  10.703298   0.464474  0.22451  all
#  5     50     79   5.268988  74.000000  0.63291  in RE & not in XRE & ct >= MinCt
# Max possible H5: all tokens = 12.959640 bits, RE tokens = 5.643856 bits
# Selected = 0.6277 pct of types, 0.2227 pct of tokens
kgram:     89/9F   89/9P   C9/9P   C9/9F   AR/9P   AN/9P   P9/9F   C8/9F   AR/9F   AN/9F
Rank:          1       2       3       4       5       6       7       8       9      10
Count:        10       6       3       3       3       3       2       2       2       2
AllFreq:  0.0003  0.0002  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
REFreq:   0.1266  0.0759  0.0380  0.0380  0.0380  0.0380  0.0253  0.0253  0.0253  0.0253
RECmFrq:  1.0000  0.8734  0.7975  0.7595  0.7215  0.6835  0.6456  0.6203  0.5949  0.5696
kgram:     AN/98   AM/98   AE/9Z   S9/9F   S2/98   Q9/9P   Q9/9O   Q9/9F   OR/9Z   OR/9P
Rank:         11      12      13      14      15      16      17      18      19      20
Count:         2       2       2       1       1       1       1       1       1       1
AllFreq:  0.0001  0.0001  0.0001  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
REFreq:   0.0253  0.0253  0.0253  0.0127  0.0127  0.0127  0.0127  0.0127  0.0127  0.0127
RECmFrq:  0.5443  0.5190  0.4937  0.4684  0.4557  0.4430  0.4304  0.4177  0.4051  0.3924
bash-3.1$
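
Both runs above report a "Vowels found by Sukhotin's method" line, and the two sections give different lists (O A 9 C 0 6 for Herbal A, C O A 9 V L Y for Biological B). For anyone unfamiliar with the method, here is a minimal Python sketch of Sukhotin's algorithm as it is usually described: count how often each pair of distinct letters is adjacent, then repeatedly promote the letter with the largest remaining adjacency sum to "vowel" and subtract twice its adjacency count from every other letter's sum, stopping when no sum is positive. The function name, the `exclude` parameter (standing in for what the tool presumably takes via SC) and the alphabetical tie-break are assumptions for illustration; the awk tool may differ in such details.

```python
from collections import defaultdict


def sukhotin_vowels(text, exclude=""):
    """Return the letters classified as vowels, in the order they are found.

    Digrams that contain any character from `exclude` are ignored, as are
    doubled letters (the diagonal of the adjacency matrix is zero).
    """
    adj = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        if a == b or a in exclude or b in exclude:
            continue
        adj[a][b] += 1  # symmetric adjacency counts
        adj[b][a] += 1

    # Every letter starts out as a consonant with its row sum.
    sums = {ch: sum(row.values()) for ch, row in adj.items()}
    vowels = []
    while sums:
        # Largest remaining sum; ties go to the alphabetically first letter.
        best = max(sorted(sums), key=sums.get)
        if sums[best] <= 0:
            break
        vowels.append(best)
        del sums[best]
        for ch in sums:
            sums[ch] -= 2 * adj[ch][best]
    return vowels


if __name__ == "__main__":
    print(sukhotin_vowels("banana"))
```

On a real transcription you would pass the separator characters as `exclude`, for example `sukhotin_vowels(text, exclude="-#/")`, so that word and line breaks do not count as neighbours.
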
Example 3: Print the 20 most-frequent Herbal A word types that
a) match Zattera's slot model for "regular" (not "separable") words [1], but
b) do not match Tiltman's prefix/suffix model [2]
[1]: strictly speaking, the '[8DERJNGTKMHUL3105]' part of the regex should
be modified to something like '((I?I?I?8)|[DERJNGTKMHUL3105])', but in
practice it doesn't matter...
[2]: assuming both prefix and suffix have to be non-empty
bash-3.1$ cat HerbA.txt | sed 's/^...... //g' | sed 's.[/#-]. .g' | awk -f ekg2Awk.txt hz=2 WL=1 XRE='^((4?O[FVBP])|[SZ82])((A[DNM3RTU0EGH1])|(O[ER])|(CC?C?8?9))$' RE='^[428]?[O9]?[ER]?[PFBV]?[SZ]?[QXWY]?(CC?C?)?[28]?[OA]?[8DERJNGTKMHUL3105]?9?$'
# Input alphabet (37 'letters'): 'VAS92FERPMZOQ8XN*DUC$TWYI3HJB0,4K67LG'
# RE = '^[428]?[O9]?[ER]?[PFBV]?[SZ]?[QXWY]?(CC?C?)?[28]?[OA]?[8DERJNGTKMHUL3105]?9?$'
# XRE = '^((4?O[FVBP])|[SZ82])((A[DNM3RTU0EGH1])|(O[ER])|(CC?C?8?9))$'
# MinCt = 0 ( 2224 types, 7121 tokens)
#
# Word frequency and length histograms: all words
#
# TypeCount 2224 TokenCount 7121 AvgTypeLen 4.72 AvgTokLen 3.73
# Number of words with given number of occurances:
# NOcc :     1     2     3     4     5     6     7     8     9    10    11    12    13 (>=14)
# NWord:  1562   253   114    58    42    26    22    10    16    10    10    12     8     81
#
# Length:     1     2     3     4     5     6     7     8     9    10    11    12    13    14
# % Type:   0.7   3.9  14.4  25.0  29.0  17.1   7.4   1.9   0.4   0.1   0.0   0.0   0.0   0.0
# % Tok :   3.3  12.0  35.0  22.3  16.9   7.3   2.5   0.6   0.1   0.0   0.0   0.0   0.0   0.0
# Type length: mean 4.717 mode 5; token length: mean 3.732 mode 3
#
# Word frequency and length histograms: selected words
#
# TypeCount 1055 TokenCount 3817 AvgTypeLen 4.16 AvgTokLen 3.44
# Number of words with given number of occurances:
# NOcc :     1     2     3     4     5     6     7     8     9    10    11    12    13 (>=14)
# NWord:   610   162    78    43    28    18    18     6    15     9     7     7     6     48
#
# Length:     1     2     3     4     5     6     7     8     9    10    11    12    13    14
# % Type:   1.2   6.8  20.5  32.8  26.1  10.2   2.0   0.4   0.0   0.0   0.0   0.0   0.0   0.0
# % Tok :   5.7  21.8  23.7  27.5  15.7   5.0   0.7   0.1   0.0   0.0   0.0   0.0   0.0   0.0
# Type length: mean 4.162 mode 4; token length: mean 3.438 mode 4
#
# Zipf's Law fit to 25 most-frequent selected words:
# ln(freq) = -0.569149 * ln(rank) + 5.050078
# Average residual -0.0000000 (SD 0.1139755), RMSE = 0.1139755
#
#   NTyp   NTok   H (bits)   PctFreq1  Typ/Tok  Words
#   2224   7121   9.350571  70.233813  0.31232  all
#   1055   3817   8.693055  57.819905  0.27640  in RE & not in XRE & ct >= MinCt
# Selected = 47.4371 pct of types, 53.6020 pct of tokens
kgram:        89      S9       2      Q9      ZO      Z9      OE   4OPS9      OR     QOE
Rank:          1       2       3       4       5       6       7       8       9      10
Count:       113     102      94      88      83      52      49      47      45      44
AllFreq:  0.0159  0.0143  0.0132  0.0124  0.0117  0.0073  0.0069  0.0066  0.0063  0.0062
REFreq:   0.0296  0.0267  0.0246  0.0231  0.0217  0.0136  0.0128  0.0123  0.0118  0.0115
RECmFrq:  1.0000  0.9704  0.9437  0.9190  0.8960  0.8742  0.8606  0.8478  0.8355  0.8237
kgram:       QOR     OP9       9      SO   4OFS9     OF9    OPS9    SCOR     8AJ    SO89
Rank:         11      12      13      14      15      16      17      18      19      20
Count:        43      41      40      37      34      31      30      29      29      28
AllFreq:  0.0060  0.0058  0.0056  0.0052  0.0048  0.0044  0.0042  0.0041  0.0041  0.0039
REFreq:   0.0113  0.0107  0.0105  0.0097  0.0089  0.0081  0.0079  0.0076  0.0076  0.0073
RECmFrq:  0.8122  0.8009  0.7901  0.7797  0.7700  0.7611  0.7529  0.7451  0.7375  0.7299
bash-3.1$
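
The selection and the Zipf fit in Example 3 can be sketched in Python along the following lines. The two patterns are copied from the command line above; everything else (the function names, the use of `re.search` to stand in for awk's `~`, and an ordinary least-squares fit of ln(count) against ln(rank)) is an assumption about what the tool does rather than a transcription of it. The reported intercept of about 5.05 is at least consistent with fitting raw counts rather than relative frequencies, since exp(5.05) is roughly 156 and the top count is 113.

```python
import math
import re
from collections import Counter

# Patterns copied verbatim from the Example 3 command line.
SLOT_RE = (r"^[428]?[O9]?[ER]?[PFBV]?[SZ]?[QXWY]?(CC?C?)?[28]?[OA]?"
           r"[8DERJNGTKMHUL3105]?9?$")
TILTMAN_XRE = r"^((4?O[FVBP])|[SZ82])((A[DNM3RTU0EGH1])|(O[ER])|(CC?C?8?9))$"


def select_words(words, include, exclude):
    """Count the words that match `include` but do not match `exclude`."""
    inc, exc = re.compile(include), re.compile(exclude)
    return Counter(w for w in words if inc.search(w) and not exc.search(w))


def zipf_fit(counts, top=25):
    """Least-squares fit of ln(count) = slope * ln(rank) + intercept over
    the `top` largest counts. Returns (slope, intercept, rmse)."""
    ranked = sorted(counts, reverse=True)[:top]
    if len(ranked) < 2:
        raise ValueError("need at least two counts to fit a line")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sq_err = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, math.sqrt(sq_err / n)


if __name__ == "__main__":
    # A handful of word types for illustration only, not a real sample.
    words = "89 SOE 8AM 4OPS9 89 OE AA".split()
    chosen = select_words(words, SLOT_RE, TILTMAN_XRE)
    print(chosen.most_common())
```

Feeding `zipf_fit` the 20 counts printed in the table above will not reproduce the reported line exactly, because the tool fitted the 25 most-frequent selected words and only 20 of them are shown.
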