The Voynich Ninja - Discussion of "A possible generating algorithm of the Voynich manuscript"


(07-06-2019, 09:11 PM)nablator Wrote: I will try other Latin texts that are much more repetitive.

This is a weird one from the MATTR thread: [link]
Aristoteles Latinus II 1-2, [link]
([cap. xx] titles and empty lines removed)

[attachment=3002]

If I were to generate a sample text, would it be possible to create a similar graph from it?

What I have in mind is to take a meaningful text and to present it in three different ways.

Would it be a problem if it were presented as one word per line?

(10-06-2019, 08:53 AM)ReneZ Wrote: If I were to generate a sample text, would it be possible to create a similar graph from it?

Sure.

Quote:What I have in mind is to take a meaningful text and to present it in three different ways.

Would it be a problem if it were presented as one word per line?

No. More words per line just smooth the curve. I can also regex-replace most line feeds to make the lines longer.
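For anyone who prefers not to use a regex for this, here is a minimal sketch that packs a one-word-per-line text into longer lines. The class name Rewrap and the greedy fill rule are assumptions made for illustration, not necessarily what the regex replacement mentioned above does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Rewrap {

    // Greedily packs words into lines of at most maxCols characters,
    // separated by single spaces. A word longer than maxCols is put
    // on a line of its own rather than being split.
    static List<String> wrap(List<String> words, int maxCols) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : words) {
            if (word.isEmpty())
                continue;
            // Start a new line if adding " word" would exceed the limit.
            if (current.length() > 0 && current.length() + 1 + word.length() > maxCols) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0)
                current.append(' ');
            current.append(word);
        }
        if (current.length() > 0)
            lines.add(current.toString());
        return lines;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "during", "the", "whole", "of", "a", "dull", "dark", "and", "soundless", "day");
        for (String line : wrap(words, 20))
            System.out.println(line);
    }
}
```

With maxCols set to 80 this should give lines comparable to an 80-column layout, although the exact line breaks depend on the fill rule.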

I will tweak my encoder to give a probability nudge to words that are similar to recent words to try and replicate the effect.

Thanks!

I started with a relatively short text in English, namely Poe's "The Fall of the House of Usher".

Here is the plain text, one word per line: [link]

Here is the same text, with a word-for-word substitution, where each word is replaced by a Roman numeral, following some rule that I can explain later: [link]

This is just a first experiment and I should be able to have more interesting examples soon, also with a normal layout of many words per line.

(10-06-2019, 12:52 PM)ReneZ Wrote: Thanks!

I started with a relatively short text in English, namely Poe's "The Fall of the House of Usher".

Here is the plain text, one word per line: [link]

Here is the same text, with a word-for-word substitution, where each word is replaced by a Roman numeral, following some rule that I can explain later: [link]

Obviously each word type was replaced by a unique Roman numeral.

The most frequent are:
CCLV = the
CLXXIII = of
XV = and
CVIII = i
I = a
CXIV = in
CCLXXV = to
CCCLXXI = which
CII = his
CCLIV = was
...

usher_norm.txt (one word per line):

[attachment=3004]

usher_norm_80_cols.txt (512 lines of 80 columns max.):

[attachment=3005]

usher_mod1.txt (one word per line):

[attachment=3006]

usher_mod1_80_cols.txt (512 lines of 80 columns max.):

[attachment=3007]

This last curve shows the edit distance reduced by 0.2 within 60 lines; the reduction is 0.3 in the VMS Q20. Well done!

Many thanks!

The two graphs for the same text (looking only at the 80-column case) are very different, but the text is equally meaningful. The only difference is a word-based substitution.

The average drops from 5.28 to 4.2.

The increase over the range of 0 to 60 lines grows from 0.06 to 0.15.

The latter is similar to the increase seen for the Voynich MS.
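To make the two figures concrete, here is a minimal sketch of how they could be read off a curve. It assumes that "average" means the mean of all points on the curve and that "increase" means the value at line distance 60 minus the value at line distance 1; neither definition is spelled out in the thread, and the class name CurveSummary and the toy values are made up for illustration.

```java
public class CurveSummary {

    // Mean of all points on the curve; curve[0] is line distance 1.
    static double mean(double[] curve) {
        double sum = 0;
        for (double v : curve)
            sum += v;
        return sum / curve.length;
    }

    // Assumed definition of the "increase": value at line distance upTo
    // minus value at line distance 1.
    static double rise(double[] curve, int upTo) {
        return curve[upTo - 1] - curve[0];
    }

    public static void main(String[] args) {
        // Made-up toy curve, NOT data from the graphs in this thread.
        double[] toy = {4.0, 4.5, 5.0};
        System.out.println("mean = " + mean(toy));
        System.out.println("rise over 3 lines = " + rise(toy, 3));
    }
}
```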

Edit: incorrect memory deleted :-)

This is an important statistic for assessing the quality of pseudo-Voynichese, so here is the Java code in case anyone else wants to try.

Code:
/*
 * Edit distance calculation shamelessly borrowed from https://rosettacode.org/wiki/Levenshtein_distance#Java
 * Example of use: java MDistBetweenLines text.txt 250 > text_dist.txt
 */
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class MDistBetweenLines {
    static int maxLineDistance;
    static ArrayList<ArrayList<String>> lineList = new ArrayList<>();

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java MDistBetweenLines <text file> <max line distance>");
            return;
        }
        try {
            maxLineDistance = Integer.parseInt(args[1]);
            loadFile(new File(args[0]));
            processLines();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Reads the file as UTF-8 and stores every non-empty line as a list of lowercase words.
    static void loadFile(File f) throws Exception {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    // Punctuation becomes a space; digits, spaces and tabs separate words.
                    String[] words = line.toLowerCase().replaceAll("\\p{P}", " ").split("[0-9 \t]+");
                    ArrayList<String> wordList = new ArrayList<String>();
                    for (String word : words) {
                        if (!word.isEmpty())
                            wordList.add(word);
                    }
                    lineList.add(wordList);
                }
            }
        }
        System.err.println("Loaded " + lineList.size() + " lines");
    }

    // For each line distance d, prints d and the mean edit distance over all word pairs
    // taken from lines i and i + d. Prints NaN if no such pair exists.
    static void processLines() {
        for (int lineDistance = 1; lineDistance <= maxLineDistance; lineDistance++) {
            long nSum = 0, totalDist = 0;  // long, so that large texts cannot overflow
            for (int iLine = 0; iLine < lineList.size() - lineDistance; iLine++) {
                ArrayList<String> wordList1 = lineList.get(iLine);
                ArrayList<String> wordList2 = lineList.get(iLine + lineDistance);
                for (String word1 : wordList1) {
                    for (String word2 : wordList2) {
                        totalDist += distance(word1, word2);
                        nSum++;
                    }
                }
            }
            System.out.println(lineDistance + "\t" + (double)totalDist/nSum);
        }
    }

    // Levenshtein distance, computed with a single row of costs.
    public static int distance(String a, String b) {
        int[] costs = new int[b.length() + 1];
        for (int j = 0; j < costs.length; j++)
            costs[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            costs[0] = i;
            int nw = i - 1;  // cost of the diagonal (north-west) cell
            for (int j = 1; j <= b.length(); j++) {
                int cj = Math.min(1 + Math.min(costs[j], costs[j - 1]), a.charAt(i - 1) == b.charAt(j - 1) ? nw : nw + 1);
                nw = costs[j];
                costs[j] = cj;
            }
        }
        return costs[b.length()];
    }
}

(10-06-2019, 12:52 PM)ReneZ Wrote: Here is the same text, with a word-for-word substitution, where each word is replaced by a Roman numeral, following some rule that I can explain later: [link]

I wonder how you did it. Did you choose the Roman numerals somehow to maximize the effect?

It is based on the scenario of a 'dictionary'.

The Voynich author would have made a list of plain text words and corresponding code words. In this case I just replace the code word by the Roman numeral for the index of the word.

Now assume that he started off with a list of common words. In my case there were 400 common words, sorted alphabetically. The first entry was the indefinite article 'a', which ended up as the Roman numeral 'I'.
Then, as he started translating, he would encounter 'new words' not yet on the list, and he would assign these words the next free number.

This results in a situation where similar words appear near each other.
This is not so pronounced for a prose text like Poe's story, but would give a better result for a text with a changing subject matter.

I selected the 400 common words as the 400 most common words in the first half of the text.
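The scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program actually used: the class name RomanDictionary, the alphabetical tie-break at the frequency cut-off, and the handling of numbers above 3999 (simply more M's) are all guesses.

```java
import java.util.*;

public class RomanDictionary {

    // Standard Roman numeral; above 3999 this just keeps adding M's (an assumption).
    static String toRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // Builds the word -> index dictionary: the nCommon most frequent words of the
    // first half of the text, sorted alphabetically, get indices 1..nCommon; every
    // other word gets the next free index when it first appears in the text.
    static Map<String, Integer> buildDictionary(List<String> words, int nCommon) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words.subList(0, words.size() / 2))
            counts.merge(w, 1, Integer::sum);
        // Most frequent first; ties broken alphabetically (assumed tie-break).
        List<String> byFreq = new ArrayList<>(counts.keySet());
        byFreq.sort((x, y) -> {
            int c = Integer.compare(counts.get(y), counts.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        List<String> common = new ArrayList<>(byFreq.subList(0, Math.min(nCommon, byFreq.size())));
        Collections.sort(common);
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String w : common)
            index.put(w, index.size() + 1);
        for (String w : words)
            index.putIfAbsent(w, index.size() + 1);
        return index;
    }

    // Replaces each word by the Roman numeral of its dictionary index.
    static List<String> encode(List<String> words, int nCommon) {
        Map<String, Integer> index = buildDictionary(words, nCommon);
        List<String> out = new ArrayList<>();
        for (String w : words)
            out.add(toRoman(index.get(w)));
        return out;
    }

    public static void main(String[] args) {
        // Toy input, not the Usher text.
        List<String> words = Arrays.asList("b", "a", "b", "c", "a", "d");
        System.out.println(encode(words, 2));
    }
}
```

In this sketch, words first met close together in the text receive consecutive indices, and consecutive Roman numerals often differ by only a character or two, which is where the low edit distance between nearby lines would come from.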

Clearly, lots of other scenarios could be used.
I'm still playing along with this, but won't have much more today.
