(23-04-2020, 09:30 PM)RenegadeHealer Wrote: If I were an idle rich geek, I'd hold a contest, modeled on engineers' egg drop and load-bearing bridge building contests, called the Build-a-Vord Challenge. Each entrant would have a set amount of time (a couple of months, maybe) to design an algorithm that generates Voynichese vords. Each one would be modeled and run for 38,000 cycles with the same hardware and software. The entrant whose algorithm output had the highest ratio of types actually found in the VMS to types not found in the original would get a large donation made by me to a charity of their choice, or a scholarship, or something like that.
Hi RenegadeHealer,
with the rules as stated, the contest would find that this script is unbeatable (every word type it outputs actually occurs in the VMS):
# print the most frequent Voynich word 38,000 times
for i in range(38000):
    print("daiin")
Maybe you mean that we should compare word frequencies in the output with the actual word frequencies in the manuscript (so that daiin should occur about 850 times, each of ol, chedy, aiin about 500, and so on). Something very similar can be done with the grammar that Stolfi built 20 years ago: unlike most grammars (e.g. what Thomas posted at the start of this thread), Stolfi's includes numerical weights for each rule. So, while in Thomas' model 'k' and 'f' are totally equivalent, Stolfi also models the fact that 'k' is about 30 times more frequent than 'f' (in each row, the first number is the count, the second the relative frequency, the third the cumulative frequency):
G:
  5858  0.34755  0.34755  t
  1243  0.07375  0.42130  p
  9423  0.55906  0.98036  k
   331  0.01964  1.00000  f
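To make the weights concrete, here is a minimal sketch (my own toy code, not Stolfi's) of how one such weighted rule can be sampled. I call it the gallows rule because t, p, k and f are the four EVA gallows characters; the counts become probabilities, so 'k' comes out roughly 30 times more often than 'f'.

import random
from collections import Counter

# One Stolfi-style weighted rule: each alternative carries its count from the rule above.
gallows_rule = {"t": 5858, "p": 1243, "k": 9423, "f": 331}

def sample(rule):
    # Draw one symbol with probability proportional to its weight.
    symbols = list(rule)
    weights = list(rule.values())
    return random.choices(symbols, weights=weights)[0]

# In a full generator every slot of the word would have its own weighted rule;
# here we only sample this one rule 38,000 times and count the outcomes.
print(Counter(sample(gallows_rule) for _ in range(38000)))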
There is no doubt that Stolfi's model, good as it is, can be improved; but is getting a better fit for word frequencies really the most promising task on which to spend our money (or time)?
Another participant in your contest could be Timm and Schinner's algorithm (see their paper). Like Stolfi's model, their algorithm contains several numerical parameters, and one could tweak them to get a better fit for word frequencies. But they have chosen to follow a different line, investigating other properties of the text rather than focusing on word structure. For instance, their algorithm reproduces these phenomena:
- the progressive drift in word frequencies through the text (what was initially seen as two different "languages", Currier A and B);
- reduplication and quasi-reduplication (words repeating consecutively, either identically or with minimal changes);
- line effects: words at the beginning or end of a line behave differently from other words.
Though I don't think that Timm and Schinner come closer to actual word frequencies than Stolfi, their work marks a significant step forward, building on Stolfi's grammar by integrating word structure with other parts of the larger picture.
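To give a flavour of the kind of mechanism involved, here is a toy copy-and-modify sketch written by me; it is a drastic simplification and not Timm and Schinner's actual algorithm, but it shows how copying recently written words with small changes naturally produces quasi-reduplication and a slow drift in word frequencies.

import random

# Toy copy-and-modify generator (my own simplification, not Timm and Schinner's code):
# each new word is usually a lightly modified copy of a word written shortly before.
seed_words = ["daiin", "chedy", "qokeedy", "ol", "shedy"]

def modify(word):
    # Apply one small random change: swap the ending or the beginning of the word.
    if random.random() < 0.5 and len(word) > 3:
        return word[:-1] + random.choice(["y", "dy", "in", "iin", "aiin"])
    return random.choice(["", "q", "o", "ch"]) + word[1:]

def generate(n_words):
    text = list(seed_words)
    for _ in range(n_words):
        source = random.choice(text[-20:])          # copy from recently written words
        new_word = modify(source) if random.random() < 0.8 else source
        text.append(new_word)
    return text[len(seed_words):]

print(" ".join(generate(50)))

Exact repetitions appear whenever a word is copied unchanged, near-repetitions when it is modified, and because sources are drawn from the last few words written, the vocabulary slowly changes as the text grows.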
Another recent "generative system" that adds to the field, without addressing the area of word frequencies is You are not allowed to view links.
Register or
Login to view..
Personally, I would not be terribly interested in a complex piece of software that produces a perfect word histogram but tells us nothing about dialects/language drift, reduplication, first-last combinations (the influence of the last character of a word on the first character of the following word), the relationship between labelese and paragraph text, etc. Not only do I believe that all these features should be explained together (and Timm and Schinner have done the most extensive work in this direction), but I am sure there are many more features and patterns that have not been discovered yet (see Lisa Fagin Davis' ongoing research).