The Voynich Ninja - Binomial distribution in VMS

Note on the code in post #25:

If the removed string consists of only one letter and the next extension needs more than one letter, unfavorable effects occur: character strings such as "iiiiii" are formed. However, it is easy to extend the code so that every string ending in more than two consecutive identical letters is simply truncated to two. The fit to the binomial distribution is then no longer perfect, but still very good.

It makes sense to shorten these character strings anyway, as the extra letters no longer carry any usable information. Once a letter has been output twice, a reader would immediately recognize that only repetitions follow.
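
To make the effect concrete, here is a minimal sketch (not the posted script itself) of how a one-letter remainder produces a long run, and how shortening runs to two letters removes it. The helper names `extend_with` and `collapse_runs` and the word "da" are made up for illustration. Note that this regex shortens runs anywhere in the word, whereas `trim_repeated_chars` in the code below only looks at the end of the word.

```python
import re

def extend_with(word, remainder, target_length):
    """Pad word up to target_length by repeating remainder (hypothetical helper)."""
    needed = target_length - len(word)
    if needed <= 0 or not remainder:
        return word
    repeats = needed // len(remainder) + 1
    return word + (remainder * repeats)[:needed]

def collapse_runs(word):
    """Shorten every run of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

extended = extend_with("da", "i", 8)   # one-letter remainder, long extension
print(extended)                         # daiiiiii
print(collapse_runs(extended))          # daii
```

The shortened word no longer has the length that was drawn for it, which is why the fit to the target distribution becomes slightly worse.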

Repeating chars in VMS (TT)
#Word Types: 20
#Word Tokens: 9487

1 4325 ee
2 4321 ii
3 395 eee
4 166 iii
5 84 hh
6 83 oo
7 31 ss
8 28 ll
9 23 dd
10 14 eeee
11 3 cc
12 3 yy
13 2 aa
14 2 iiii
15 2 mm
16 1 hhh
17 1 nn
18 1 ooooooooo
19 1 rr
20 1 rrr

Repeating chars in Speculum humanae salvationis (modified)
#Word Types: 23
#Word Tokens: 5303

1 981 ss
2 621 tt
3 580 mm
4 480 ll
5 426 ii
6 361 aa
7 350 ee
8 297 rr
9 267 cc
10 235 uu
11 212 nn
12 178 oo
13 131 ff
14 77 dd
15 60 pp
16 16 xx
17 9 bb
18 8 iii
19 8 xxx
20 2 hh
21 2 vv
22 1 nnn
23 1 sss
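
How the two tables were produced is not stated in the post. As a rough sketch, a similar count can be obtained by collecting every run of two or more identical letters; the function name `count_repeats` and the sample string are assumptions for illustration, not an excerpt from a transcription.

```python
import re
from collections import Counter

def count_repeats(text):
    """Count every run of two or more identical letters in text."""
    return Counter(m.group(0) for m in re.finditer(r"([a-z])\1+", text.lower()))

sample = "qokeedy daiin cheeey ollaiin"   # made-up sample text
counts = count_repeats(sample)
print("#Word Types:", len(counts))
print("#Word Tokens:", sum(counts.values()))
for rank, (run, n) in enumerate(counts.most_common(), start=1):
    print(rank, n, run)
```

Each maximal run is counted once, so "eee" is counted as "eee" and not additionally as "ee".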

[attachment=9076]

Code:
import sys

import numpy as np
from scipy.special import comb

def calculate_binomial_distribution(n, max_length):
    """Compute the binomial target distribution for word lengths 1..max_length."""
    k_values = np.arange(1, max_length + 1)
    # comb(n, k) is 0 for k > n, so lengths above n get probability 0.
    probabilities = np.array([comb(n, k) * (0.5 ** n) for k in k_values])
    # Renormalise so the probabilities sum to 1 (k = 0 is excluded).
    probabilities /= np.sum(probabilities)
    return probabilities

def trim_repeated_chars(s):
    """Cut the string back to two characters if it ends in three or more identical characters."""
    if len(s) >= 3 and s[-1] == s[-2] == s[-3]:
        # Find the start of the trailing run of repeated characters
        char = s[-1]
        i = len(s) - 1
        while i >= 0 and s[i] == char:
            i -= 1
        # Keep only the first two characters of the run
        return s[:i+1] + char * 2
    return s

def adjust_word_lengths(words, target_distribution, last_truncated_part):
    """Shorten or extend words so that their lengths follow the target distribution."""
    adjusted_words = []
    max_word_length = len(target_distribution)
    length_bins = np.arange(1, max_word_length + 1)
    length_probs = np.array(target_distribution)
    new_last_truncated_part = last_truncated_part
    for word in words:
        current_length = len(word)
        target_length = np.random.choice(length_bins, p=length_probs)
        if target_length < current_length:
            # Remember the part that was cut off
            new_last_truncated_part = word[target_length:]
            adjusted_word = word[:target_length]
            adjusted_words.append(adjusted_word)
        elif target_length > current_length:
            if new_last_truncated_part:
                # Number of characters needed for the extension
                needed_length = target_length - current_length
                # Build the extension by repeating the last cut-off part
                repeated_part = (new_last_truncated_part * ((needed_length // len(new_last_truncated_part)) + 1))[:needed_length]
                # Trim the extended word if it now ends in a run of repeated characters
                extended_word = word + repeated_part
                adjusted_word = trim_repeated_chars(extended_word)
            else:
                # No cut-off part available yet: extend the word with fallback characters
                extended_word = word + "_" * (target_length - current_length)
                adjusted_word = trim_repeated_chars(extended_word)
            adjusted_words.append(adjusted_word)
        else:
            adjusted_words.append(word)  # Length already matches the target, word stays unchanged
    return adjusted_words, new_last_truncated_part

def process_text(file_path, output_path, target_distribution):
    """Read the text from the file, adjust the word lengths and write the modified text to an output file."""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()
    except FileNotFoundError:
        print(f"Error: the file {file_path} was not found.")
        sys.exit(1)
    except IOError as e:
        print(f"Error: an error occurred while reading the file: {e}")
        sys.exit(1)
    # The last cut-off part is carried over from line to line
    last_truncated_part = ""
    adjusted_lines = []
    for line in lines:
        words = line.split()
        adjusted_words, last_truncated_part = adjust_word_lengths(words, target_distribution, last_truncated_part)
        adjusted_lines.append(' '.join(adjusted_words))
    # Write the modified text to the output file
    try:
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write('\n'.join(adjusted_lines))
        print(f"Modified text was written to {output_path}.")
    except IOError as e:
        print(f"Error: an error occurred while writing the file: {e}")
        sys.exit(1)

def main():
    if len(sys.argv) != 3:
        print("Usage: python adjust_word_length.py <input_filename> <output_filename>")
        sys.exit(1)
    input_file_path = sys.argv[1]
    output_file_path = sys.argv[2]
    max_word_length = 15
    n = 10
    # Compute the binomial distribution
    target_distribution = calculate_binomial_distribution(n, max_word_length)
    # Process the text and write it to the output file
    process_text(input_file_path, output_file_path, target_distribution)

if __name__ == "__main__":
    main()
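
To check how closely a modified text follows the target, one can compare the two word-length distributions directly. The following is only a sketch using the standard library; the function names and the placeholder word list are assumptions, and in practice the words would be read from the script's output file. With n = 10 the weights peak at length 5, and lengths 11 to 15 get probability zero.

```python
from collections import Counter
from math import comb

def target_distribution(n, max_length):
    """C(n, k) for k = 1..max_length, normalised to sum to 1.

    The factor 0.5 ** n from the script cancels in the normalisation.
    """
    weights = [comb(n, k) for k in range(1, max_length + 1)]  # 0 for k > n
    total = sum(weights)
    return [w / total for w in weights]

def observed_distribution(words, max_length):
    """Relative frequency of the word lengths 1..max_length."""
    lengths = Counter(len(w) for w in words if 1 <= len(w) <= max_length)
    total = sum(lengths.values())
    return [lengths[k] / total for k in range(1, max_length + 1)]

target = target_distribution(10, 15)
# Placeholder words; in practice, read them from the script's output file.
words = "ab abcde abcdef abcd abcde abcdefg".split()
observed = observed_distribution(words, 15)
for k, (t, o) in enumerate(zip(target, observed), start=1):
    print(f"{k:2d}  target={t:.4f}  observed={o:.4f}")
```

Because `trim_repeated_chars` shortens some words after their length has been drawn, small deviations from the target are expected.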

[attachment=9077]

Here is a direct comparison between the VMS and the modified Speculum humanae salvationis. In principle, any text file can be modified in this way:

[attachment=9078]

Here is an Excel file containing the original and adapted words. This makes it easy to understand the system described.

[attachment=9089]

Using bi3mws' code from post #19, I graphed all the texts selected by the illustration-type page variable in the ZL3a-n transcription file, plus the labels from the 'L' locus and the complete text.

A = Astronomical (excluding zodiac)
B = Biological
C = Cosmological
H = Herbal
P = Pharmaceutical
S = Marginal stars only
T = Text-only page (no illustrations)
Z = Zodiac
Lb - Labels Only
All - Complete text from transcription file

Preparation:

Note: n = 9 is used here for a better fit.
Graphs:
[attachment=9102]

Graph of All text from ZL3a-n:
[attachment=9103]

ZLlabelsOnly with n = 10, slightly better fit.
[attachment=9104]
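
Choosing between n = 9 and n = 10 amounts to asking which binomial curve lies closest to the observed word-length histogram. How the code from post #19 defines its curve is not shown here, so the sketch below assumes the same form as `calculate_binomial_distribution` in the script earlier in this thread, and uses the sum of squared differences as the distance. The names `binomial_shape` and `best_n` and the candidate range are hypothetical.

```python
from math import comb

def binomial_shape(n, max_length):
    """C(n, k) for k = 1..max_length, normalised to sum to 1."""
    weights = [comb(n, k) for k in range(1, max_length + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def best_n(observed, candidates=range(5, 16)):
    """Return the n whose curve has the smallest sum of squared differences."""
    def sse(n):
        model = binomial_shape(n, len(observed))
        return sum((o - m) ** 2 for o, m in zip(observed, model))
    return min(candidates, key=sse)

# Self-check with synthetic input: a perfect n = 9 curve is recovered as 9.
print(best_n(binomial_shape(9, 15)))
```

Here `observed` is a list of relative frequencies for word lengths 1, 2, 3 and so on, for example computed per section.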

(25-08-2024, 06:11 PM)RobGea Wrote:
A = Astronomical (excluding zodiac)
B = Biological
C = Cosmological
H = Herbal
P = Pharmaceutical
S = Marginal stars only
T = Text-only page (no illustrations)
Z = Zodiac
Lb - Labels Only
All - Complete text from transcription file

Naming the sections can lead to confusion. Can you write the folio numbers (from - to) after the sections?

I know, for example, the following classification:

Herbal section (f. 1r-66v)
Astronomical section (f. 67r-73v)
Anatomical-balneological section (Biological?) (f. 75r-84v)
Cosmological section (f. 85r-86v)
Pharmaceutical section (f. 87r-102v)
Recipes (f. 103r-116r)

Zodiac (f. 70v2-73v ?)

The sections in the above graphs are defined by the illustration type page variable, which is part of the format of an IVTFF file. [1]

Illustration type page variables
page_variable = [ '$I=A', '$I=B', '$I=C', '$I=H', '$I=P', '$I=S', '$I=T', '$I=Z' ]

Examples:

Counts table:
Name  Description                        IVTT variable  Number of folios
A     Astronomical (excluding zodiac)    $I=A             8
B     Biological                         $I=B            19
C     Cosmological                       $I=C            11
H     Herbal                             $I=H           129
P     Pharmaceutical                     $I=P            17
S     Marginal stars only                $I=S            25
T     Text-only page (no illustrations)  $I=T             7
Z     Zodiac                             $I=Z            12
Total number of folios: 228

Folio listing:
Page variable: $I=A :: How many folios: 8
['f67r1', 'f67r2', 'f67v1', 'f68r1', 'f68r2', 'f68r3', 'f68v2', 'f68v1']

Page variable: $I=B :: How many folios: 19
['f75r', 'f75v', 'f76v', 'f77r', 'f77v', 'f78r', 'f78v', 'f79r', 'f79v', 'f80r', 'f80v', 'f81r', 'f81v', 'f82r', 'f82v', 'f83r', 'f83v', 'f84r', 'f84v']

Page variable: $I=C :: How many folios: 11
['f57v', 'f67v2', 'f68v3', 'f69r', 'f69v', 'f70r1', 'f70r2', 'f85r2', 'fRos', 'f86v4', 'f86v3']

Page variable: $I=H :: How many folios: 129
['f1v', 'f2r', 'f2v', 'f3r', 'f3v', 'f4r', 'f4v', 'f5r', 'f5v', 'f6r', 'f6v', 'f7r', 'f7v', 'f8r', 'f8v', 'f9r', 'f9v', 'f10r', 'f10v', 'f11r', 'f11v', 'f13r', 'f13v', 'f14r', 'f14v', 'f15r', 'f15v', 'f16r', 'f16v', 'f17r', 'f17v', 'f18r', 'f18v', 'f19r', 'f19v', 'f20r', 'f20v', 'f21r', 'f21v', 'f22r', 'f22v', 'f23r', 'f23v', 'f24r', 'f24v', 'f25r', 'f25v', 'f26r', 'f26v', 'f27r', 'f27v', 'f28r', 'f28v', 'f29r', 'f29v', 'f30r', 'f30v', 'f31r', 'f31v', 'f32r', 'f32v', 'f33r', 'f33v', 'f34r', 'f34v', 'f35r', 'f35v', 'f36r', 'f36v', 'f37r', 'f37v', 'f38r', 'f38v', 'f39r', 'f39v', 'f40r', 'f40v', 'f41r', 'f41v', 'f42r', 'f42v', 'f43r', 'f43v', 'f44r', 'f44v', 'f45r', 'f45v', 'f46r', 'f46v', 'f47r', 'f47v', 'f48r', 'f48v', 'f49r', 'f49v', 'f50r', 'f50v', 'f51r', 'f51v', 'f52r', 'f52v', 'f53r', 'f53v', 'f54r', 'f54v', 'f55r', 'f55v', 'f56r', 'f56v', 'f57r', 'f65r', 'f65v', 'f66v', 'f87r', 'f87v', 'f90r1', 'f90r2', 'f90v2', 'f90v1', 'f93r', 'f93v', 'f94r', 'f94v', 'f95r1', 'f95r2', 'f95v2', 'f95v1', 'f96r', 'f96v']

Page variable: $I=P :: How many folios: 17
['f88r', 'f88v', 'f89r1', 'f89r2', 'f89v2', 'f89v1', 'f99r', 'f99v', 'f100r', 'f100v', 'f101r', 'f101v', 'f101r2', 'f102r1', 'f102r2', 'f102v2', 'f102v1']

Page variable: $I=S :: How many folios: 25
['f58r', 'f58v', 'f103r', 'f103v', 'f104r', 'f104v', 'f105r', 'f105v', 'f106r', 'f106v', 'f107r', 'f107v', 'f108r', 'f108v', 'f111r', 'f111v', 'f112r', 'f112v', 'f113r', 'f113v', 'f114r', 'f114v', 'f115r', 'f115v', 'f116r']

Page variable: $I=T :: How many folios: 7
['f1r', 'f66r', 'f76r', 'f85r1', 'f86v6', 'f86v5', 'f116v']

Page variable: $I=Z :: How many folios: 12
['f70v2', 'f70v1', 'f71r', 'f71v', 'f72r1', 'f72r2', 'f72r3', 'f72v3', 'f72v2', 'f72v1', 'f73r', 'f73v']

Note: 'f101r2' is included in the $I=P list.

total nbr of folios: 228
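
A listing like the one above can be rebuilt by grouping the page headers of the transcription file by their `$I` value. The sketch below assumes that a page header looks like `<f1r>  <! ... $I=T ...>`, which is my reading of the IVTFF format [1] and should be checked against the format definition; the sample lines are hand-written, not copied from ZL3a-n.txt, and the function name is made up.

```python
import re
from collections import defaultdict

# Assumed IVTFF page-header layout: "<f1r>  <! $Q=A $P=A ... $I=T ...>"
PAGE_HEADER = re.compile(r"^<(f\w+)>\s+<!(.*)>")
ILLUSTRATION = re.compile(r"\$I=(\w)")

def folios_by_illustration(lines):
    """Map each '$I=X' page variable to the list of pages that carry it."""
    groups = defaultdict(list)
    for line in lines:
        header = PAGE_HEADER.match(line)
        if not header:
            continue  # locus lines and comments are skipped
        var = ILLUSTRATION.search(header.group(2))
        if var:
            groups["$I=" + var.group(1)].append(header.group(1))
    return dict(groups)

sample = [
    "<f1r>      <! $Q=A $I=T>",
    "<f1r.1,@P0>       placeholder text",
    "<f1v>      <! $Q=A $I=H>",
    "<f2r>      <! $Q=A $I=H>",
]
for variable, folios in folios_by_illustration(sample).items():
    print(f"Page variable: {variable} :: How many folios: {len(folios)}")
    print(folios)
```

To run it on a real file, pass the opened file object (for example `open("ZL3a-n.txt", encoding="utf-8")`) instead of `sample`.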

DATA SHEET ATTACHMENT
[attachment=9123]

Labels (annotated as 'Lb' in post #34) are defined by the IVTFF locus type 'L'. [1]
Number of labels: 1033
Number of folios with labels: 58

Here is a Python-formatted list of tuples, each containing a folio number and the number of labels on that folio, e.g. (f1r, 3):
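
Such a list can be produced by counting the loci whose type starts with 'L' on each page. This is only a sketch under an assumption about the locus layout, namely `<page.number,XTT>` with one position character X followed by a two-character locus type TT; please check this against the IVTFF definition [1]. The sample lines and the name `labels_per_folio` are invented for illustration.

```python
import re
from collections import Counter

# Assumed locus layout: "<f67r1.3,@Lc>text" = page, locus number, one position
# character, then a two-character locus type whose first letter is 'L' for labels.
LOCUS = re.compile(r"^<(f\w+)\.\d+,.([A-Z]\w)>")

def labels_per_folio(lines):
    """Return (folio, number of label loci) tuples in order of first appearance."""
    counts = Counter()
    for line in lines:
        m = LOCUS.match(line)
        if m and m.group(2).startswith("L"):
            counts[m.group(1)] += 1
    return list(counts.items())

sample = [
    "<f1r.1,@P0>    placeholder",
    "<f67r1.1,@Lc>  placeholder",
    "<f67r1.2,&Lc>  placeholder",
    "<f68r1.1,@Ls>  placeholder",
]
pairs = labels_per_folio(sample)
print(pairs)
print("Number of labels:", sum(n for _, n in pairs))
print("Number of folios with labels:", len(pairs))
```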

[1] R. Zandbergen, IVTFF – Intermediate Voynich MS Transliteration File Format, file format version 2.0, document issue 2.0, 02/02/2023 (IVTFF format 2.0 definition).

Thank you @RobGea for your explanations. I find it a bit difficult to run "IVTT" so that it writes, for example, all herbal folios into a text file (without line markers etc.). Can you give an example of starting IVTT on the command line with the right parameters?

This is what I used to extract the herbal text: ivtt.exe -x7 +IH ZL3a-n.txt ZL_Herbal.txt

I'm on Windows, so this is run from the command prompt.

         [1]      [2] [3] [4]        [5]
C:\Users>ivtt.exe -x7 +IH ZL3a-n.txt ZL_Herbal.txt

[1] Executable name

[2] -x7 {Turn into a text‐only Ascii file, preserving uncertain spaces}

[3] +IH {+ keep} {(2 character locus type), I = Illustration type, H = herbal}

[4] Input file
[5] Output file
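
If all sections are needed, the same command can be generated once per illustration type. The sketch below only builds and prints the command lines; it assumes that `+IA`, `+IB` and so on work by analogy with `+IH`, that the executable is reachable as `ivtt`, and that output names of the form `ZL_<letter>.txt` are acceptable (all three are assumptions, not taken from the IVTT manual).

```python
import subprocess

SECTIONS = ["A", "B", "C", "H", "P", "S", "T", "Z"]

def build_ivtt_command(section, transcription="ZL3a-n.txt", exe="ivtt"):
    """Build the argument list for one section; the output name is hypothetical."""
    output = f"ZL_{section}.txt"
    return [exe, "-x7", f"+I{section}", transcription, output]

def extract_all(run=False):
    """Print every command; execute it only when run=True and ivtt is installed."""
    commands = [build_ivtt_command(s) for s in SECTIONS]
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            subprocess.run(cmd, check=True)
    return commands

extract_all()
```

On Windows, pass `exe="ivtt.exe"` to `build_ivtt_command`; calling `extract_all(run=True)` actually starts the tool.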

Link to IVTT User manual can be found on this page, subsection 'Transliteration File processing tools'

Oh, there is also this thread, IVTT recipes.
If you find command lines that do what you want, it would be nice to note them there for reference.
Actually, I will repost this to that thread.
Command-line switches can be such a pain.

Yes, it also works without problems under Linux in the bash shell (you have to leave out the ".exe" extension, of course).

Sweet!