22-08-2024, 07:59 PM
Note on the code in post #25:
If the removed string consists of a single letter and the next extension needs more than one letter, an unwanted effect occurs: character strings such as "iiiiii" are formed. However, the code is easy to extend so that every string that would contain more than two consecutive identical letters is simply truncated. The word lengths then no longer match the binomial distribution exactly, but the fit is still very good.
It makes sense to shorten these strings anyway, as the extra letters carry no usable information. Because the letter already appears twice, a reader would immediately recognize that only repetitions follow.
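A minimal sketch of the truncation idea, assuming "truncated" means that every run of three or more identical characters is cut down to two, wherever it occurs in the word. `collapse_runs` is a hypothetical helper for illustration and is not part of the posted code, whose `trim_repeated_chars` only looks at the end of the string.

```python
import re

# Hypothetical helper (not from the posted code): a character followed by
# two or more copies of itself is a run of three or more.
_RUN = re.compile(r"(.)\1{2,}")


def collapse_runs(word):
    """Shorten every run of 3+ identical characters to exactly 2."""
    return _RUN.sub(r"\1\1", word)


print(collapse_runs("qokiiiiii"))  # qokii
```

A regular expression is used here only because it handles runs in the middle of a word as well as at the end; whether that is the desired behaviour is an assumption.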
Repeating chars in VMS (TT)
#Word Types: 20
#Word Tokens: 9487
1 4325 ee
2 4321 ii
3 395 eee
4 166 iii
5 84 hh
6 83 oo
7 31 ss
8 28 ll
9 23 dd
10 14 eeee
11 3 cc
12 3 yy
13 2 aa
14 2 iiii
15 2 mm
16 1 hhh
17 1 nn
18 1 ooooooooo
19 1 rr
20 1 rrr
Repeating chars in Speculum humanae salvationis (modified)
#Word Types: 23
#Word Tokens: 5303
1 981 ss
2 621 tt
3 580 mm
4 480 ll
5 426 ii
6 361 aa
7 350 ee
8 297 rr
9 267 cc
10 235 uu
11 212 nn
12 178 oo
13 131 ff
14 77 dd
15 60 pp
16 16 xx
17 9 bb
18 8 iii
19 8 xxx
20 2 hh
21 2 vv
22 1 nnn
23 1 sss
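Counts like those in the two lists above could be gathered with a short script. This is only a sketch: it assumes that each entry is a maximal run of two or more identical letters, that "#Word Types" is the number of distinct runs and "#Word Tokens" the total number of runs. `count_repeated_runs` and the sample string are hypothetical and not taken from the tool that produced the lists.

```python
import re
from collections import Counter


def count_repeated_runs(text):
    """Count maximal runs of two or more identical letters."""
    # [^\W\d_] matches a single letter; \1+ greedily extends the match over
    # all following copies, so each match is a maximal run.
    return Counter(m.group(0) for m in re.finditer(r"([^\W\d_])\1+", text))


# Made-up sample text, for illustration only.
sample = "qokeedy chee oteee daiin ii"
counts = count_repeated_runs(sample)

print("#Word Types:", len(counts))
print("#Word Tokens:", sum(counts.values()))
for rank, (run, freq) in enumerate(counts.most_common(), start=1):
    print(rank, freq, run)
```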
[attachment=9076]
[attachment=9077]
Code:
import sys
import numpy as np
from scipy.special import comb
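def calculate_binomial_distribution(n, max_length):
    """Compute the binomial target distribution for word lengths 1..max_length."""
    k_values = np.arange(1, max_length + 1)
    # Build an array so that the normalisation below is an element-wise division
    probabilities = np.array([comb(n, k) * (0.5 ** n) for k in k_values])
    # Renormalise: k = 0 is excluded, and comb(n, k) is 0 for k > n
    probabilities /= np.sum(probabilities)
    return probabilities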
def trim_repeated_chars(s):
    """Cut the string back to two characters if it ends in three or more identical characters."""
    if len(s) >= 3 and s[-1] == s[-2] == s[-3]:
        # Find the start of the trailing run of identical characters
        char = s[-1]
        i = len(s) - 1
        while i >= 0 and s[i] == char:
            i -= 1
        # Keep the prefix plus exactly two of the repeated characters
        return s[:i + 1] + char * 2
    return s
def adjust_word_lengths(words, target_distribution, last_truncated_part):
    """Shorten or lengthen words so that their lengths follow the target distribution."""
    adjusted_words = []
    max_word_length = len(target_distribution)
    length_bins = np.arange(1, max_word_length + 1)
    length_probs = np.array(target_distribution)
    new_last_truncated_part = last_truncated_part
    for word in words:
        current_length = len(word)
        # Draw a target length; convert the NumPy integer to a plain int
        target_length = int(np.random.choice(length_bins, p=length_probs))
        if target_length < current_length:
            # Remember the part that is cut off
            new_last_truncated_part = word[target_length:]
            adjusted_word = word[:target_length]
            adjusted_words.append(adjusted_word)
        elif target_length > current_length:
            if new_last_truncated_part:
                # Number of characters needed for the extension
                needed_length = target_length - current_length
                # Build the extension by repeating the last truncated part
                repeated_part = (new_last_truncated_part * ((needed_length // len(new_last_truncated_part)) + 1))[:needed_length]
                # Trim the extended word if it now ends in repeated characters
                extended_word = word + repeated_part
                adjusted_word = trim_repeated_chars(extended_word)
            else:
                # No truncated part available yet: pad the word with a fallback character
                extended_word = word + "_" * (target_length - current_length)
                adjusted_word = trim_repeated_chars(extended_word)
            adjusted_words.append(adjusted_word)
        else:
            adjusted_words.append(word)  # Length already matches the target, word stays unchanged
    return adjusted_words, new_last_truncated_part
def process_text(file_path, output_path, target_distribution):
    """Read the text from the file, adjust the word lengths and write the modified text to an output file."""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()
    except FileNotFoundError:
        print(f"Error: the file {file_path} was not found.")
        sys.exit(1)
    except IOError as e:
        print(f"Error: a problem occurred while reading the file: {e}")
        sys.exit(1)
    last_truncated_part = ""
    adjusted_lines = []
    for line in lines:
        words = line.split()
        adjusted_words, last_truncated_part = adjust_word_lengths(words, target_distribution, last_truncated_part)
        adjusted_lines.append(' '.join(adjusted_words))
    # Write the modified text to the output file
    try:
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write('\n'.join(adjusted_lines))
        print(f"Modified text was written to {output_path}.")
    except IOError as e:
        print(f"Error: a problem occurred while writing the file: {e}")
        sys.exit(1)
def main():
    if len(sys.argv) != 3:
        print("Usage: python adjust_word_length.py <input_filename> <output_filename>")
        sys.exit(1)
    input_file_path = sys.argv[1]
    output_file_path = sys.argv[2]
    max_word_length = 15
    n = 10
    # Compute the binomial target distribution
    target_distribution = calculate_binomial_distribution(n, max_word_length)
    # Process the text and write it to the output file
    process_text(input_file_path, output_file_path, target_distribution)
if __name__ == "__main__":
    main()
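To check how far the word lengths of an output file drift from the binomial target once repeated characters are trimmed, the two distributions can be compared directly. The following is a rough sketch using only the standard library. The helper names and the use of total variation distance as the measure are assumptions made for illustration; `n = 10` and a maximum length of 15 are the values used in `main()` above.

```python
from collections import Counter
from math import comb


def binomial_target(n, max_length):
    """Same target as the posted code: C(n, k) for k = 1..max_length, renormalised."""
    # The common factor 0.5 ** n cancels in the normalisation and is omitted.
    weights = [comb(n, k) for k in range(1, max_length + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def length_distribution(words, max_length):
    """Relative frequency of word lengths 1..max_length."""
    if not words:
        raise ValueError("no words given")
    counts = Counter(len(w) for w in words)
    return [counts[k] / len(words) for k in range(1, max_length + 1)]


def total_variation(p, q):
    """Half the sum of absolute differences; 0 means identical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up word list; in practice, read the words from the output file.
words = "ab abcd abcde abcdef abcde".split()
target = binomial_target(10, 15)
observed = length_distribution(words, 15)
print(round(total_variation(observed, target), 3))
```

A value close to 0 indicates a close fit. With a sample as small as the one here the number says little, so the comparison only becomes meaningful on a full text.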