Koen G > 01-07-2019, 05:57 PM
(01-07-2019, 05:35 PM)Torsten Wrote: The usage of [link] and therefore for phrases is indeed typical for any natural language: see [link]
Of course, but that doesn't necessarily mean that every text in a natural language will contain many identical strings of the same four words (or longer), especially in relatively synthetic (as opposed to analytic) languages like Latin. For example, "I am a man" in English would only be a two-word string in Latin.
Torsten > 01-07-2019, 06:04 PM
(01-07-2019, 05:57 PM)Koen G Wrote: (01-07-2019, 05:35 PM)Torsten Wrote: The usage of [link] and therefore for phrases is indeed typical for any natural language: see [link]
Of course, but that doesn't necessarily mean that every text in a natural language will contain many identical strings of the same four words (or longer). Especially in relatively synthetic (as opposed to analytic) languages like Latin. For example "I am a man" in English would only be a two-word string in Latin.
MarcoP > 01-07-2019, 06:13 PM
(01-07-2019, 05:26 PM)Koen G Wrote: I now have a sizable corpus of medieval texts so it could be interesting to test this. Do texts really need to contain many repeating "more than three words" sequences in order to be real?
Koen G > 01-07-2019, 06:27 PM
Torsten > 01-07-2019, 07:06 PM
(01-07-2019, 06:27 PM)Koen G Wrote: Torsten, those are phrases as in "fixed expressions"; there is no reason why one must occur many times in a text...
Koen G > 01-07-2019, 07:12 PM
(01-07-2019, 07:06 PM)Torsten Wrote: See [link] in "Antiqui rhetores latini"
Torsten > 01-07-2019, 07:25 PM
(01-07-2019, 07:12 PM)Koen G Wrote: (01-07-2019, 07:06 PM)Torsten Wrote: See [link] in "Antiqui rhetores latini"
It's likely that most texts in Latin will indeed contain more such phrases than the VM. But we can't simply say that this is the case for all text types in all natural languages without many more tests.
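One simple way to run such a test on a corpus would be to count, for each text, the n-word sequences that occur more than once. Below is a minimal Java sketch under stated assumptions: the text is plain and whitespace-separated, and the class and method names (RepeatedNgrams, tokenize, repeated) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class RepeatedNgrams {

    // Lower-cases the text, turns punctuation into spaces and splits on
    // digits and whitespace.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().replaceAll("\\p{P}", " ").split("[0-9\\s]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Maps every n-word sequence that occurs at least twice to its count,
    // in order of first appearance.
    static Map<String, Integer> repeated(List<String> words, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            counts.merge(String.join(" ", words.subList(i, i + n)), 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < 2); // drop sequences seen only once
        return counts;
    }

    public static void main(String[] args) {
        String sample = "In the name of the father, and in the name of the son.";
        // prints {in the name of=2, the name of the=2}
        System.out.println(repeated(tokenize(sample), 4));
    }
}
```

Running this per text and per value of n would give raw counts that can be compared across languages and text types, although longer texts will naturally score higher, so some normalization by length is probably needed.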
nablator > 01-07-2019, 09:03 PM
(01-07-2019, 06:27 PM)Koen G Wrote: I guess essentially I'd need a code that does TTR for x-word chunks.
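import java.io.*;
import java.util.*;

// MATTR (moving-average type-token ratio) computed over chunks of n consecutive words.
// Usage: java xMATTR <file or folder> <window size in chunks> <n words per chunk>
public class xMATTR
{
    static int window;                      // number of chunks per window
    static List<String> tokens = new ArrayList<String>();   // all n-word chunks of the current file
    static HashMap<String, Integer> types;  // chunk -> occurrences inside the current window
    static int nTokens;                     // chunks fed to addToken so far
    static long sumTypes;                   // sum of distinct-chunk counts over all windows (long avoids int overflow)
    static int nLastWords;                  // n, the chunk length in words
    static int nLastWordsMinusOne;
    static String[] lastWords;              // sliding buffer of the n most recent words

    public static void main(String args[])
    {
        try
        {
            if (args.length == 3)
            {
                nLastWords = Integer.parseInt(args[2]);
                nLastWordsMinusOne = nLastWords - 1;
                lastWords = new String[nLastWords];
                window = Integer.parseInt(args[1]);
                processFileOrFolder(new File(args[0]));
            }
            else
            {
                System.err.println("Usage: java xMATTR <file or folder> <window> <n>");
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    // Recurses into folders; prints one "path <tab> MATTR" line per file.
    public static void processFileOrFolder(File f) throws Exception
    {
        if (f.exists())
        {
            if (f.isDirectory())
            {
                String[] children = f.list();
                if (children != null)
                {
                    for (String filename : children)
                    {
                        String pathname = f.getPath()+File.separator+filename;
                        processFileOrFolder(new File(pathname));
                    }
                }
            }
            else
            {
                loadFile(f);
                System.out.println(f.getPath()+"\t"+computeMATTR());
            }
        }
    }

    // Reads a UTF-8 text file and fills tokens with its overlapping n-word chunks.
    static void loadFile(File f) throws Exception
    {
        tokens.clear();
        // Forget the words of the previous file so that no chunk spans two files.
        Arrays.fill(lastWords, null);
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8")))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                if (!line.isEmpty())
                {
                    // Lower-case, turn punctuation into spaces, split on digits and whitespace.
                    String[] words = line.toLowerCase().replaceAll("\\p{P}", " ").split("[0-9 \t]+");
                    for (String word: words)
                    {
                        if (!word.isEmpty())
                        {
                            // Shift the buffer left by one and append the new word.
                            System.arraycopy(lastWords, 1, lastWords, 0, nLastWordsMinusOne);
                            lastWords[nLastWordsMinusOne] = word;
                            // The buffer is full once its oldest slot is set.
                            if (lastWords[0] != null)
                            {
                                StringBuilder lastWordsAsOne = new StringBuilder();
                                for (String lastWord: lastWords)
                                    lastWordsAsOne.append(lastWord).append('.');
                                tokens.add(lastWordsAsOne.toString());
                            }
                        }
                    }
                }
            }
        }
    }

    // Mean of (distinct chunks / window) over every window position; 0 if the text is too short.
    static double computeMATTR()
    {
        if (window <= 0 || tokens.size() < window)
            return 0;
        nTokens = 0;
        types = new HashMap<String, Integer>(window);
        for (String token: tokens)
            addToken(token);
        double m = (double)sumTypes / window / (tokens.size() - window + 1);
        return m;
    }

    // Slides the window forward by one chunk and accumulates the number of distinct chunks.
    static void addToken(String word)
    {
        incType(word);
        nTokens++;
        if (nTokens > window)
        {
            // Drop the chunk that has just left the window.
            decType(tokens.get(nTokens - window - 1));
            sumTypes += types.size();
        }
        else if (nTokens == window)
        {
            // First complete window: this also resets the sum left over from a previous file.
            sumTypes = types.size();
        }
    }

    static void incType(String word)
    {
        Integer count = types.get(word);
        if (count == null)
            types.put(word, 1);
        else
            types.put(word, count + 1);
    }

    static void decType(String word)
    {
        Integer count = types.get(word);
        if (count == 1)
            types.remove(word);
        else
            types.put(word, count - 1);
    }
}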
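To sanity-check the sliding-window bookkeeping in xMATTR on small inputs, the same quantity can be recomputed the slow way, rebuilding the set of distinct chunks for every window. This is only a sketch: the class NaiveChunkMattr and its methods are hypothetical and not part of the program above, and it takes a ready-made word list instead of reading files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical cross-check, not part of xMATTR.
final class NaiveChunkMattr {

    // Joins every run of n consecutive words into one chunk token.
    static List<String> chunks(List<String> words, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            out.add(String.join(".", words.subList(i, i + n)));
        }
        return out;
    }

    // Mean type-token ratio over all windows of `window` consecutive chunks;
    // returns 0 when there are fewer chunks than one window.
    static double mattr(List<String> chunkTokens, int window) {
        if (window <= 0 || chunkTokens.size() < window) {
            return 0;
        }
        int windows = chunkTokens.size() - window + 1;
        double sum = 0;
        for (int start = 0; start < windows; start++) {
            int distinct = new HashSet<>(chunkTokens.subList(start, start + window)).size();
            sum += distinct / (double) window;
        }
        return sum / windows;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "b", "a", "b");
        // Every window of three 2-word chunks holds two distinct chunks,
        // so the result is about 0.667.
        System.out.println(mattr(chunks(words, 2), 3));
    }
}
```

This version costs roughly window times more work per text, so it is only meant for checking short samples against the faster program.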
Koen G > 02-07-2019, 09:16 AM
Hubert Dale > 02-07-2019, 09:55 AM
(01-07-2019, 06:13 PM)MarcoP Wrote: (01-07-2019, 05:26 PM)Koen G Wrote: I now have a sizable corpus of medieval texts so it could be interesting to test this. Do texts really need to contain many repeating "more than three words" sequences in order to be real?
A sizable corpus will make things interesting.
Obviously, an inflected language like Latin does not have the same positional constraints as English.
In English you must say "I drink coffee" (subject, verb, object), but in Latin "Claudius aquam bibet", "aquam Claudius bibet", etc. are all possible, since suffixes specify the grammatical function of each word.
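To gauge how much of the difference comes from word order alone, one could also count chunks with the order ignored, so that both Latin variants above map to the same key. A minimal Java sketch, assuming lower-cased words; the class OrderFreeChunks is hypothetical and only meant as an illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class OrderFreeChunks {

    // Builds a key for a chunk that ignores word order, by sorting the
    // words before joining them.
    static String key(List<String> chunk) {
        String[] sorted = chunk.toArray(new String[0]);
        Arrays.sort(sorted);
        return String.join(".", sorted);
    }

    public static void main(String[] args) {
        String a = key(Arrays.asList("claudius", "aquam", "bibet"));
        String b = key(Arrays.asList("aquam", "claudius", "bibet"));
        System.out.println(a.equals(b)); // prints true
    }
}
```

Comparing repeat counts with and without this normalization would separate the effect of free word order from the effect of inflection itself.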
Also, one should consider how much languages change with genre/individual style and time-period. For instance, Chaucer could write:
Ful many a riche contree hadde he wonne
Finally, spelling variation will also have an impact. Unfortunately, almost all transcribed texts normalize spelling in some way.