The Voynich Ninja

Full Version: Syllables at the end of a line in VMS
Hi Matthias,
if I understand correctly, you are processing the input file line by line, applying Sukhotin's algorithm to each line individually. Since that is a statistical method, it must be applied to the whole text to produce a meaningful result.

When I [link not available] in 2018, I used this Python code by Mans Hulden [link not available]. It includes an implementation of Sukhotin based on Guy's paper mentioned by Rene.
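
For orientation, here is a minimal sketch of Sukhotin's vowel-identification procedure as it is usually described, run over one word list taken from the whole text. It is not Mans Hulden's code; the function name `sukhotin_vowels` and the sample words are made up for illustration.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the letters that Sukhotin's procedure classifies as vowels."""
    # Symmetric table of how often two different letters stand next to each other.
    adjacency = defaultdict(lambda: defaultdict(int))
    for word in words:
        for left, right in zip(word, word[1:]):
            if left != right:  # doubled letters are ignored (zero diagonal)
                adjacency[left][right] += 1
                adjacency[right][left] += 1

    # Every letter starts as a consonant; its score is its row sum.
    scores = {letter: sum(row.values()) for letter, row in adjacency.items()}
    vowels = set()
    while True:
        consonants = sorted(letter for letter in scores if letter not in vowels)
        if not consonants:
            break
        best = max(consonants, key=lambda letter: scores[letter])
        if scores[best] <= 0:
            break
        vowels.add(best)
        # Each remaining consonant loses twice its adjacency count with the new vowel.
        for letter in consonants:
            if letter != best:
                scores[letter] -= 2 * adjacency[letter][best]
    return vowels


# The word list should come from the whole text, not from a single line.
sample_words = "papa mama banana".split()
print(sorted(sukhotin_vowels(sample_words)))
```

Because the scores are sums over all adjacent letter pairs, a single line gives far too few pairs for the classification to be stable, which is why the counts have to be collected over the full text first.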


For a well-reasoned approach to Voynich syllables, you can also check [link not available].
Thanks Rene and Marco for the very constructive hints. I will think about the change of the alphabet. First of all, regarding the intended syllable detection:

The code I presented, with the implemented Sukhotin algorithm, is faulty (unfortunately I am not a programmer either). The result written to the text file looks like this:

CO,MME,DI,A
A,l,ig,h,ie,r
I,NF,ER,NO
I
vi,ta
os,cu,ra
sm,ar,r,it,a
du,ra
fo,rt,e
p,au,r,a
mo,rt,e
tr,ov,ai
sc,or,te
v’,in,tr,a,i
pu,nt,o
ab,b,an,do,n,ai
gi,un,to
va,lle
c,om,pu,nt,o
sp,al,l,e
p,i,an,et
ca,lle
qu,et,a
d,ur,at
pi,et,a
af,f,an,n,at,a
ri,va
gu,at,a
fu,ggi,va
......

The easiest solution to output the syllables correctly with Python is to use the library "pyphen". Here is the output:

COM,ME,DIA
Ali,ghie,ri
IN,FER,NO
I
vi,ta
oscu,ra
smar,ri,ta
du,ra
for,te
pau,ra
mor,te
tro,vai
scor,te
v’in,trai
pun,to
ab,ban,do,nai
giun,to
val,le
com,pun,to
spal,le
pia,ne,ta
cal,le
que,ta
du,ra,ta
pie,ta
af,fan,na,ta
ri,va
gua,ta
fug,gi,va
......

The syllable-separated last words of all lines in the VMS are here (for demonstration only):
[link not available]

The implementation of "pyphen" in the code looks like this:
Code:
#!/usr/bin/env python
import re
import sys
import pyphen

def calculate_syllable_statistics(text, dic):
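    total_letters = len(re.sub(r'[^\w\s]', '', text))  # Count word characters and whitespace (punctuation removed)
    # Hyphenate word by word: splitting the whole text on '-' would merge the last
    # syllable of one word with the first syllable of the next and undercount.
    total_syllables = sum(len([s for s in dic.inserted(word).split('-') if s]) for word in text.split())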

    lines = text.split('\n')
    last_word_syllables = 0
    total_lines = sum(1 for line in lines if line.strip())  # Count only non-empty lines

    for line in lines:
        words = line.split()
        if words:
            last_word = words[-1]
            last_word_syllables += len([s for s in dic.inserted(last_word).split('-') if s])

    total_words = sum(len(line.split()) for line in lines)
    average_syllables_per_word = total_syllables / total_words if total_words > 0 else 0
    average_syllables_per_line = last_word_syllables / total_lines if total_lines > 0 else 0

    # Calculate the percentage using the rule of three
    syllable_percentage = (total_syllables * 100) / total_letters if total_letters > 0 else 0

    return total_syllables, total_letters, syllable_percentage, average_syllables_per_word, last_word_syllables, average_syllables_per_line

def detect_syllables_in_file(filename):
    print("Processing, please wait...")  # Status message
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            text = file.read()
            dic = pyphen.Pyphen(lang='it_IT')
            total_syllables, total_letters, syllable_percentage, average_syllables_per_word, last_word_syllables, average_syllables_per_line = calculate_syllable_statistics(text, dic)

            # Console output
            print(f"Total letters and spaces: {total_letters}")
            print(f"Total syllables: {total_syllables}")
            print(f"Syllable percentage: {syllable_percentage:.9f}%")
            print(f"Average syllables per word: {average_syllables_per_word:.9f}")
            print(f"Total syllables in last words of lines: {last_word_syllables}")
            print(f"Average syllables per line (last words): {average_syllables_per_line:.9f}")

            # Create the output file for the syllables of the last words
            output_filename = "last_word_syllables.txt"
            with open(output_filename, 'w', encoding='utf-8') as output_file:
                lines = text.split('\n')

                for line in lines:
                    words = line.split()
                    if words:
                        last_word = words[-1]
                        last_word_syllables = [s for s in dic.inserted(last_word).split('-') if s]
                        # Write the syllables of the last word, separated by commas
                        last_word_syllables_str = ','.join(last_word_syllables)
                        output_file.write(last_word_syllables_str + '\n')

    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python syllable_detection.py <text_file>")
    else:
        filename = sys.argv[1]
        detect_syllables_in_file(filename)
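
A note on counting: if a multi-word string is hyphenated and then split on '-', the last syllable of one word and the first syllable of the next end up in the same piece, so the total comes out too low. A small sketch of the difference, using three words hyphenated as in the pyphen output above (the helper name `count_syllables` is made up for illustration):

```python
def count_syllables(hyphenated_words):
    """Count syllables in an iterable of words already marked with '-'."""
    return sum(len([s for s in word.split('-') if s]) for word in hyphenated_words)


line = "vi-ta du-ra for-te"               # hyphenation taken from the sample output
naive = len(line.split('-'))              # pieces: 'vi', 'ta du', 'ra for', 'te'
per_word = count_syllables(line.split())  # 2 + 2 + 2

print(naive, per_word)
```

Counting word by word gives 6 syllables here, while splitting the whole line gives only 4 pieces.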

This line needs to be adjusted for Latin (I am still searching in the documentation):

dic = pyphen.Pyphen(lang='it_IT')

LANGUAGE_ALIASES:

    'af': 'af_ZA',
    'bg': 'bg_BG',
    'cs': 'cs_CZ',
    'da': 'da_DK',
    'de': 'de_DE',
    'el': 'el_GR',
    'en': 'en_US',
    'en_Latn_GB': 'en_GB',
    'en_Latn_US': 'en_US',
    'et': 'et_EE',
    'hr': 'hr_HR',
    'hu': 'hu_HU',
    'it': 'it_IT',
    'lt': 'lt_LT',
    'lv': 'lv_LV',
    'nb': 'nb_NO',
    'nl': 'nl_NL',
    'nn': 'nn_NO',
    'pl': 'pl_PL',
    'pt': 'pt_PT',
    'pt_Latn_BR': 'pt_BR',
    'pt_Latn_PT': 'pt_PT',
    'ro': 'ro_RO',
    'ru': 'ru_RU',
    'sk': 'sk_SK',
    'sl': 'sl_SI',
    'te': 'te_IN',
    'uk': 'uk_UA',
    'zu': 'zu_ZA',

[link not available]

Edit: Unfortunately I can't find any support for Latin. I have contacted the developers.
(14-10-2023, 02:02 AM)ReneZ Wrote: For this, one should not use Eva. I would recommend something along the lines of the FSG alphabet, or this:
[link not available]
These two options tend to generate quite similar statistics.

@Rene: If I understood correctly, I should convert my text file (EVA) to CUVA. I cleaned the file so that everything that is not pure text was removed. Is bitrans the right tool? How exactly should I proceed?

You could patch together your own "sed" script, but bitrans is easier, once you know how to use it.

You need a command-line environment, which I only know how to use on Linux/Unix.
Get the source from here:
[link not available]

Build it using:  cc -o bitrans bitrans.c

Get the translation table from here:
[link not available]

And then run the command:

Code:
bitrans -f Eva-Cuva.bit Voynich_full_clean01.txt >Voynich_Cuva.txt

This assumes that all files are in the same directory.

Let me know if you run into a problem.

Edit:
If file loading is blocked due to security warnings, best go up one directory:
[link not available]
and take it from there, or else improvise....
(The manual is also there).

You might want to move the executable bitrans somewhere that is part of your PATH.
(15-10-2023, 01:27 PM)ReneZ Wrote: Let me know if you run into a problem.

Thanks Rene, it worked on the first try!

The only thing is that the output file contains 157 lowercase "c" and 198 lowercase "h".

Could it be because my source file is based on Basic EVA as transcription, but the transcriber is Takeshi Takahashi? Would it be better to use another source file?

edit: With this source file the result is about the same:
[link not available]

These would not have a great impact on the statistics, but I also prefer to have clean results.
You could make your local copy of the bitrans table, and add the two lines:

Code:
c  E
h  E

This will treat these 'leftover'  pieces the same as if they had been Eva-e in the input file.
As the substitution is greedy, it will not affect the larger groups 'ch', 'ckh' etc, but only the leftover ones.
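
To illustrate the point about greedy substitution, here is a small sketch. It assumes that "greedy" means the longest matching group is tried first at each position; this is not the bitrans source, and the table below is a made-up fragment, not the real Eva-Cuva.bit mapping.

```python
def greedy_translate(text, table):
    """Replace the longest matching key at each position; copy anything else as-is."""
    keys = sorted(table, key=len, reverse=True)  # longest groups are tried first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no rule applies to this character
            i += 1
    return "".join(out)


# Hypothetical table: the bracketed targets are placeholders, not Cuva symbols.
table = {"ckh": "<CKH>", "ch": "<CH>", "e": "E", "c": "E", "h": "E"}

print(greedy_translate("chec", table))   # 'ch' is consumed as a group, then 'e', then a leftover 'c'
print(greedy_translate("ckhh", table))   # 'ckh' is consumed as a group, then a leftover 'h'
```

Under that assumption, the single-letter rules for "c" and "h" can only ever match characters that no longer group fit.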
During a trial with Dante Alighieri's "Divina Commedia" I noticed that using the wrong dictionary has a considerable effect on syllable recognition: the correct dictionary recognizes the greatest number of syllables. Since I still assume that the last words of a line in the VMS could be filler words, it is conceivable that the writer "fell back" into familiar patterns when creating these words (the number of syllables would not be noticeably lower in this scenario, see earlier post). This raises the question of whether it would be useful to test the words, or rather their syllable counts, against the languages supported by "Pyphen". If so, which languages should be considered and which excluded from the outset (e.g. Afrikaans)?
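
A minimal sketch of such a comparison: the same word list is run through several hyphenators and the languages are ranked by total syllable count. The function name `syllable_totals` and the two toy hyphenators are made up for illustration; they only stand in for real dictionaries.

```python
def syllable_totals(words, hyphenators):
    """Rank hyphenators by the total number of syllables they find in the same words."""
    totals = {}
    for name, hyphenate in hyphenators.items():
        totals[name] = sum(
            len([s for s in hyphenate(word).split('-') if s]) for word in words
        )
    # Highest total first, following the idea that the best-fitting dictionary finds most.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for real dictionaries (hypothetical, for demonstration only).
def toy_no_split(word):
    return word


def toy_split_after_a(word):
    return word.replace('a', 'a-')


ranking = syllable_totals(
    ["paura", "smarrita"],
    {"toy_no_split": toy_no_split, "toy_split_after_a": toy_split_after_a},
)
print(ranking)
```

With pyphen installed, the callables could be something like `pyphen.Pyphen(lang='cs_CZ').inserted`, one per language code listed in `pyphen.LANGUAGES`. A higher count alone does not prove a better fit, since some dictionaries simply split more aggressively.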

[link not available]
The best result is achieved with Czech.
For what it's worth, here's the table of syllables ( made with Czech dictionary ):

[attachment=7780]

A few "O" stand alone. You can correct that by hand if necessary.


edit: It should be noted that the sum of syllables in the last words of a line in the VMS is significantly lower than in the comparison texts (Commedia, Bible, see opening post). This holds even with the best-fitting dictionary.
To do the crosscheck, I substituted the letters in the alphabet of the corpus as follows:

O Y A S D K E L R T H U Z M C N P J I F G X V
I E A U T S R N O M C L P D B Q G V F H X Y Z
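
The substitution can be sketched as follows, assuming that each letter in the top row is replaced by the letter directly below it. The replacement has to happen in one pass, because many letters occur in both rows and a chain of sequential replacements would overwrite earlier results.

```python
SOURCE = "OYASDKELRTHUZMCNPJIFGXV"  # top row of the table
TARGET = "IEAUTSRNOMCLPDBQGVFHXYZ"  # bottom row of the table
TABLE = str.maketrans(SOURCE, TARGET)


def substitute(text):
    """Apply the whole mapping in a single pass; unmapped characters stay unchanged."""
    return text.translate(TABLE)


print(substitute("OYAS DOL"))
```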

The result is that the ranking of languages has changed. Czech, for example, is now only in 7th place instead of 1st, but it is remarkable that Slovak is in 2nd place in both rankings.


[link not available]

[link not available]

Syllable list with Slovak dictionary:
[attachment=7785]
[attachment=7786]

edit: The good result of Serbian is misleading, because it often outputs syllables according to the following format: ABI-O-DEFG
The question arises whether a comparison with a modern Slovak dictionary is possible at all. For this to be the case, there would have to be a certain linguistic continuity back to the late Middle Ages.

In the 15th century, the territory of today's Slovakia belonged to the Kingdom of Hungary, which was ruled by King Sigismund of Luxembourg from 1387 to 1437. Among other languages, Middle Slovak (in various dialects) was spoken.
At that time, the Slovak language was written in the Latin alphabet. The writing systems for Slavic languages based on the Cyrillic alphabet, such as those used further east, were not in use in the historical territory of present-day Slovakia.

The development of a uniform orthography for the Slovak language began only in the 18th and 19th centuries (see [link not available] and [link not available]), and the modern Slovak alphabet as it is known today was standardized even later.

It is therefore very questionable whether a comparison with a modern dictionary is permissible. This problem arises in principle with the use of modern dictionaries, but since the standardization of Slovak began at the earliest in the 18th century, one is simply too late here.