The Voynich Ninja

Full Version: [split] (lack of) word groups
Pages: 1 2 3 4
(01-07-2019, 05:35 PM)Torsten Wrote: The usage of [link] and therefore for phrases is indeed typical for any natural language: see [link]
Of course, but that doesn't necessarily mean that every text in a natural language will contain many identical strings of the same four (or more) words, especially in relatively synthetic (as opposed to analytic) languages like Latin. For example, "I am a man" in English would only be a two-word string in Latin.
(01-07-2019, 05:57 PM)Koen G Wrote:
(01-07-2019, 05:35 PM)Torsten Wrote: The usage of [link] and therefore for phrases is indeed typical for any natural language: see [link]
Of course, but that doesn't necessarily mean that every text in a natural language will contain many identical strings of the same four (or more) words, especially in relatively synthetic (as opposed to analytic) languages like Latin. For example, "I am a man" in English would only be a two-word string in Latin.

See [link]
(01-07-2019, 05:26 PM)Koen G Wrote: I now have a sizable corpus of medieval texts so it could be interesting to test this. Do texts really need to contain many repeating "more than three words" sequences in order to be real?

A sizable corpus will make things interesting.
Obviously, an inflected language like Latin does not have the same positional constraints as English.
In English you must say "I drink coffee" (subject, verb, object), but in Latin "Claudius aquam bibet", "aquam Claudius bibet", etc. are all possible, since suffixes specify the grammatical function of each word.

Also, one should consider how much languages change with genre/individual style and time-period. For instance, Chaucer could write:
Ful many a riche contree hadde he wonne

Finally, spelling variation will also have an impact. Unfortunately, almost all transcribed texts normalize spelling in some way.
Torsten, those are phrases as in "fixed expressions"; there is no reason why any one of them must occur many times in a text...

Marco: yes, as I said in my previous post, synthetic languages like Latin will give us better odds than analytical ones like English. The word order freedom you mention is an additional argument.

I guess essentially I'd need code that computes TTR for x-word chunks.
(01-07-2019, 06:27 PM)Koen G Wrote: Torsten, those are phrases as in "fixed expressions"; there is no reason why any one of them must occur many times in a text...

See [link] in "Antiqui rhetores latini"
(01-07-2019, 07:06 PM)Torsten Wrote: See [link] in "Antiqui rhetores latini"

It's likely that most texts in Latin will indeed contain more such phrases than the VM. But we can't simply say that this is the case for all text types in all natural languages without many more tests.
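
One way to run such a test is to count, for each text, how many distinct n-word sequences occur more than once. The following is only a minimal sketch under my own assumptions: the class and method names (RepeatedNgramSketch, words, ngramCounts, repeatedTypes) are hypothetical, the tokenization is a simple lowercase split on punctuation, digits and whitespace, and the sample sentence is made up for illustration.

```java
import java.util.*;

public class RepeatedNgramSketch
{
    // Splits text into lowercase words; punctuation, digits and whitespace act as separators.
    static List<String> words(String text)
    {
        List<String> out = new ArrayList<String>();
        for (String w : text.toLowerCase().replaceAll("\\p{P}", " ").split("[0-9\\s]+"))
            if (!w.isEmpty())
                out.add(w);
        return out;
    }

    // Counts how often each sequence of n consecutive words occurs.
    static Map<String, Integer> ngramCounts(List<String> words, int n)
    {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + n <= words.size(); i++)
        {
            String key = String.join(" ", words.subList(i, i + n));
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Number of distinct n-word sequences that occur at least twice.
    static int repeatedTypes(Map<String, Integer> counts)
    {
        int repeated = 0;
        for (int c : counts.values())
            if (c > 1)
                repeated++;
        return repeated;
    }

    public static void main(String[] args)
    {
        // Made-up sample; a real test would read each text of the corpus instead.
        String sample = "in the name of the father and in the name of the son";
        Map<String, Integer> counts = ngramCounts(words(sample), 4);
        System.out.println(counts.size() + " distinct 4-word sequences, "
            + repeatedTypes(counts) + " of them repeated");
    }
}
```

Comparing the repeated count across texts of similar length, and across languages and genres, would show whether a low number of repeated four-word sequences is really unusual.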
(01-07-2019, 07:12 PM)Koen G Wrote:
(01-07-2019, 07:06 PM)Torsten Wrote: See [link] in "Antiqui rhetores latini"

It's likely that most texts in Latin will indeed contain more such phrases than the VM. But we can't simply say that this is the case for all text types in all natural languages without many more tests.

See [link] and [link]
(01-07-2019, 06:27 PM)Koen G Wrote: I guess essentially I'd need code that computes TTR for x-word chunks.

Your wish is my command. :)

I just concatenated the last x words with a "." separator. The program takes three parameters: the file or folder to process, the MATTR window size, and, third and last, x.

Code:
import java.io.*;
import java.util.*;

public class xMATTR
{
    static int window;
    static List<String> tokens = new ArrayList<String>();
    static HashMap<String, Integer> types;
    static int nTokens;
    static long sumTypes; // long: the sum over all windows can exceed the int range on a large corpus
    static int nLastWords;
    static int nLastWordsMinusOne;
    static String[] lastWords;

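    // Usage: java xMATTR <file-or-folder> <window> <x>
    public static void main(String args[])
    {
        if (args.length != 3)
        {
            System.err.println("Usage: java xMATTR <file-or-folder> <window> <x>");
            return;
        }
        try
        {
            window = Integer.parseInt(args[1]);     // MATTR window size, in tokens
            nLastWords = Integer.parseInt(args[2]); // x: number of words per chunk
            nLastWordsMinusOne = nLastWords - 1;
            lastWords = new String[nLastWords];
            processFileOrFolder(new File(args[0]));
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }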

    public static void processFileOrFolder(File f) throws Exception
    {
        if (f.exists())
        {
            if (f.isDirectory())
            {
                String[] children = f.list();
                if (children != null)
                {
                    for (String filename : children)
                    {
                        String pathname = f.getPath()+File.separator+filename;
                        processFileOrFolder(new File(pathname));
                    }
                }
            }
            else
            {
                loadFile(f);
                System.out.println(f.getPath()+"\t"+computeMATTR());
            }
        }
    }

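    static void loadFile(File f) throws Exception
    {
        tokens.clear();
        // Forget the words of the previous file, so that chunks never span two files.
        Arrays.fill(lastWords, null);
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8")))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                // Lowercase, turn punctuation into spaces, then split on digits, spaces and tabs.
                String[] words = line.toLowerCase().replaceAll("\\p{P}", " ").split("[0-9 \t]+");
                for (String word : words)
                {
                    if (word.isEmpty())
                        continue;
                    // Shift the buffer left by one and append the new word.
                    System.arraycopy(lastWords, 1, lastWords, 0, nLastWordsMinusOne);
                    lastWords[nLastWordsMinusOne] = word;
                    // Once the buffer holds x words, emit them as one token.
                    if (lastWords[0] != null)
                    {
                        StringBuilder lastWordsAsOne = new StringBuilder();
                        for (String lastWord : lastWords)
                            lastWordsAsOne.append(lastWord).append('.');
                        tokens.add(lastWordsAsOne.toString());
                    }
                }
            }
        }
    }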

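    // Moving-average TTR: the mean, over every run of `window` consecutive tokens,
    // of (distinct tokens in the run) / window.
    static double computeMATTR()
    {
        if (window <= 0 || tokens.size() < window)
            return 0;
        nTokens = 0;
        sumTypes = 0;
        types = new HashMap<String, Integer>(window);
        for (String token: tokens)
            addToken(token);
        // sumTypes now holds the type counts summed over all (tokens.size() - window + 1) windows.
        double m = (double)sumTypes / window / (tokens.size() - window + 1);
        return m;
    }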

    static void addToken(String word)
    {
        incType(word);
        nTokens++;

        if (nTokens > window)
        {
            decType(tokens.get(nTokens - window - 1));
            sumTypes += types.size();
        }
        else if (nTokens == window)
        {
            sumTypes = types.size();
        }
    }

    static void incType(String word)
    {
        Integer count = types.get(word);
        if (count == null)
            types.put(word, 1);
        else
            types.put(word, count + 1);
    }

    static void decType(String word)
    {
        Integer count = types.get(word);
        if (count == 1)
            types.remove(word);
        else
            types.put(word, count - 1);
    }
}
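
To make the "x last words" idea concrete, here is a small sketch of what the chunk tokens look like and how a plain TTR behaves on them as x grows. It is not the program above: it computes an ordinary type-token ratio over the whole input rather than a moving average, and the class name ChunkTtrSketch, the method names and the toy input are all assumptions made for illustration.

```java
import java.util.*;

public class ChunkTtrSketch
{
    // Builds overlapping x-word chunks, each word followed by "." as in the post above.
    static List<String> chunks(String[] words, int x)
    {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + x <= words.length; i++)
        {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + x; j++)
                sb.append(words[j]).append('.');
            out.add(sb.toString());
        }
        return out;
    }

    // Plain type-token ratio: distinct chunks divided by all chunks (0 for empty input).
    static double ttr(List<String> tokens)
    {
        if (tokens.isEmpty())
            return 0;
        return (double) new HashSet<String>(tokens).size() / tokens.size();
    }

    public static void main(String[] args)
    {
        // Toy input with heavy repetition; real input would be the words of a text.
        String[] words = "a b a b a b c".split(" ");
        for (int x = 1; x <= 3; x++)
        {
            List<String> c = chunks(words, x);
            System.out.println(x + "\t" + c + "\t" + ttr(c));
        }
    }
}
```

With x = 2 the toy input yields the six chunks a.b., b.a., a.b., b.a., a.b., b.c., of which three are distinct, so the ratio is 0.5. A text with few repeated word sequences will approach 1.0 much faster as x increases.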
I was hoping this would happen :D Thanks, man. I'll give it a try as soon as I'm home.
(01-07-2019, 06:13 PM)MarcoP Wrote:
(01-07-2019, 05:26 PM)Koen G Wrote: I now have a sizable corpus of medieval texts so it could be interesting to test this. Do texts really need to contain many repeating "more than three words" sequences in order to be real?

A sizable corpus will make things interesting.
Obviously, an inflected language like Latin does not have the same positional constraints as English.
In English you must say "I drink coffee" (subject, verb, object), but in Latin "Claudius aquam bibet", "aquam Claudius bibet", etc. are all possible, since suffixes specify the grammatical function of each word.

Also, one should consider how much languages change with genre/individual style and time-period. For instance, Chaucer could write:
Ful many a riche contree hadde he wonne

Finally, spelling variation will also have an impact. Unfortunately, almost all transcribed texts normalize spelling in some way.

Hi Marco,

Exactly: English requires the kind of syntax Rugg mentions. Latin doesn't.

It might also be worth pointing out that Latin metrical verse relies on this positional flexibility. You could put the words from the following line in any order you like:

quadrupedante putrem sonitu quatit ungula campum

and there will still only be a single grammatically correct meaning.

I'd be curious as to whether a text such as De Balneis Puteolanis, which I think is written in elegiac couplets, exhibits the 'regularities in word order' which Rugg claims 'real' languages have. My suspicion is that, with a few exceptions such as prepositions immediately preceding the noun they govern, it may well not.

[I'm not arguing against Rugg's position that the VMs text isn't written in a natural language, and there is plenty of statistical analysis demonstrating why it cannot be. But explaining this in terms of 'regularities of word order', as if English syntax were the be-all and end-all of such things, is misleading at best and incorrect at worst. And that is unfortunate in an article written by a university lecturer complaining about bad science.]