(23-04-2020, 09:30 PM)RenegadeHealer Wrote: If I were an idle rich geek, I'd hold a contest, modeled on engineers' egg drop and load-bearing bridge building contests, called the Build-a-Vord Challenge. Each entrant would have a set amount of time (a couple of months, maybe) to design an algorithm that generates Voynichese vords. Each one would be modeled and run for 38,000 cycles with the same hardware and software. The entrant whose algorithm output had the highest ratio of types actually found in the VMS to types not found in the original would get a large donation made by me to a charity of their choice, or a scholarship, or something like that.
Hi RenegadeHealer,
with the rules as stated, the contest would find that this script is unbeatable (every word type it outputs actually occurs in the VMS):
# print the most frequent Voynich word 38,000 times
for i in range(38000):
    print("daiin")
Maybe you mean that we should compare word frequencies in the output with the actual word frequencies in the manuscript (so that daiin should occur about 850 times, each of ol, chedy, aiin about 500, and so on). Something very similar can be done with the grammar that Stolfi built 20 years ago: unlike most grammars (e.g. what Thomas posted at the start of this thread), Stolfi's includes numerical weights for each rule. So, while in Thomas' model 'k' and 'f' are totally equivalent, Stolfi also models the fact that 'k' is about 30 times more frequent than 'f' (in each row, the first number is the count, the second the relative frequency, the third the cumulative frequency):
G:
  5858  0.34755  0.34755  t
  1243  0.07375  0.42130  p
  9423  0.55906  0.98036  k
   331  0.01964  1.00000  f
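To make the weights concrete, here is a minimal sketch (my own toy code, not Stolfi's) of how one such weighted rule can be sampled. I call it the gallows rule because t, p, k and f are the four EVA gallows characters; the counts become probabilities, so 'k' comes out roughly 30 times more often than 'f'.

import random
from collections import Counter

# One Stolfi-style weighted rule: each alternative carries its count from the rule above.
gallows_rule = {"t": 5858, "p": 1243, "k": 9423, "f": 331}

def sample(rule):
    # Draw one symbol with probability proportional to its weight.
    symbols = list(rule)
    weights = list(rule.values())
    return random.choices(symbols, weights=weights)[0]

# In a full generator every slot of the word would have its own weighted rule;
# here we only sample this one rule 38,000 times and count the outcomes.
print(Counter(sample(gallows_rule) for _ in range(38000)))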
There is no doubt that Stolfi's model, good as it is, can be improved; but is getting a better fit for word frequencies really the most promising task on which to spend our money (or time)?
Another participant in your contest could be Timm and Schinner's algorithm (see their paper). Like Stolfi's model, their algorithm contains several numerical parameters, and one could tweak them to get a better fit for word frequencies. But they have chosen to follow a different line, investigating other properties of the text rather than focusing on word structure. For instance, their algorithm reproduces these phenomena:
- the progressive drift in word frequencies through the text (what was initially seen as two different "languages", Currier A and B);
- reduplication and quasi-reduplication (words repeating consecutively, either identically or with minimal changes);
- line effects: words at the beginning or end of a line behave differently from other words.
Though I don't think that Timm and Schinner come closer to actual word frequencies than Stolfi, their work marks a significant step forward, building on Stolfi's grammar by integrating word structure with other parts of the larger picture.
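To give a flavour of the kind of mechanism involved, here is a toy copy-and-modify sketch written by me; it is a drastic simplification and not Timm and Schinner's actual algorithm, but it shows how copying recently written words with small changes naturally produces quasi-reduplication and a slow drift in word frequencies.

import random

# Toy copy-and-modify generator (my own simplification, not Timm and Schinner's code):
# each new word is usually a lightly modified copy of a word written shortly before.
seed_words = ["daiin", "chedy", "qokeedy", "ol", "shedy"]

def modify(word):
    # Apply one small random change: swap the ending or the beginning of the word.
    if random.random() < 0.5 and len(word) > 3:
        return word[:-1] + random.choice(["y", "dy", "in", "iin", "aiin"])
    return random.choice(["", "q", "o", "ch"]) + word[1:]

def generate(n_words):
    text = list(seed_words)
    for _ in range(n_words):
        source = random.choice(text[-20:])          # copy from recently written words
        new_word = modify(source) if random.random() < 0.8 else source
        text.append(new_word)
    return text[len(seed_words):]

print(" ".join(generate(50)))

Exact repetitions appear whenever a word is copied unchanged, near-repetitions when it is modified, and because sources are drawn from the last few words written, the vocabulary slowly changes as the text grows.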
Another recent "generative system" that adds to the field, without addressing the area of word frequencies is You are not allowed to view links.
Register or
Login to view..
Personally, I would not be terribly interested in a complex piece of software that produces a perfect word histogram but tells us nothing about dialects/language drift, reduplication, first-last combinations (the influence of the last character of a word on the first character of the following word), the relationship between labelese and paragraph text, etc. Not only do I believe that all these features should be explained together (and Timm and Schinner have done the most extensive work in this direction), but I am sure there are many more features and patterns that have not been discovered yet (see Lisa Fagin Davis' ongoing research).