The Voynich Ninja
Character-Limited Patterns? - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Character-Limited Patterns? (/thread-436.html)



Character-Limited Patterns? - Emma May Smith - 29-02-2016

The Voynich manuscript contains only a small number of repeated phrases. This is often (and rightly) held up as a mark against the text being a natural language. But the search for repeating phrases is based on finding exact, or fairly close, matches. If we altered the terms of the search to be less strict, would the outcome be any different?

What I am thinking is this: were we to consider only some characters (say [k, t, d, l, r, s]) to be important, would we find that those characters alone were patterned? For example, the phrase [chedy qokeedy lolsaiin qokain] would thus have the pattern [d k d l l s k], and would match the phrase [daiin kedy ol ols qokeeo]. Such character-limited patterns may be much more common and could bring some insight where they exist.
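The reduction described above can be sketched in a few lines of Python; the character subset is simply the illustrative one from the post, not a claim about which characters actually matter:

```python
# Reduce a phrase to its "character-limited pattern": keep only the
# characters deemed important (the illustrative subset from the post).
IMPORTANT = set("ktdlrs")

def pattern(phrase):
    """Return the important characters of the phrase, in reading order."""
    return [c for c in phrase if c in IMPORTANT]

# The two example phrases reduce to the same pattern:
p1 = pattern("chedy qokeedy lolsaiin qokain")
p2 = pattern("daiin kedy ol ols qokeeo")
assert p1 == p2 == list("dkdllsk")
```

A search for repeats would then compare these reduced patterns rather than the raw phrases.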

Now, the above is only an example, but would it be possible to find out a set of characters (or important 'bits' of textual information whatever they may be) which show the maximum amount of patterning in the text? And if so, what would it teach us?


RE: Character-Limited Patterns? - ReneZ - 01-03-2016

In my opinion, this is a very good point.
There are several things that could disturb (or destroy) repeated patterns in the text, among them:
- irregular spelling
- errors
- whatever it is that causes the first word in each line to be different
- insertion of meaningless characters ('nulls') as suggested above
- some other degree of freedom the author had when 'translating' his plain text into the Voynich script


The first two bullets should not be expected to have such a big impact (though this could be tested numerically with a simulation).
The third one is real, but should still leave a text with more short repeated sequences than we're seeing.
The last two points are the interesting ones.
Trouble is: the number of possibilities is prohibitively large.


RE: Character-Limited Patterns? - Sam G - 01-03-2016

I posted a few examples of this kind of thing in another thread.

I think there's far more of this type of structure in the text than has been documented.

It would be an interesting project, for someone with better programming/UNIX skills than I possess, to sort the VMS words into different types of "equivalence classes", then come up with sequences of equivalence classes and see how many matches they find in the text, and report on any that show up much more frequently (or much less frequently) than chance would predict. (I would of course be happy to "advise" on this, if someone wants to do the actual coding.)
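A minimal sketch of the pipeline described here might look like the following; the equivalence classes themselves are made up purely for illustration and are not a proposal for how VMS words should actually be grouped:

```python
from collections import Counter

# Hypothetical equivalence classes; these rules are illustrative only.
def word_class(word):
    if word.startswith("qo"):
        return "QO"
    if word.endswith("dy"):
        return "DY"
    return "OTHER"

def class_ngrams(text, n=3):
    """Map each word to its class, then count all n-long class sequences."""
    classes = [word_class(w) for w in text.split()]
    return Counter(tuple(classes[i:i + n]) for i in range(len(classes) - n + 1))
```

Class sequences occurring far more (or far less) often than chance predicts could then be flagged by comparing these counts against a shuffled-text baseline.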

As far as why there are no exact repeating phrases in the text, I think the "writing style" of the VMS may play a role here. It's possible that the VMS author simply consciously avoided repeating the exact same phrases.


RE: Character-Limited Patterns? - -Job- - 01-03-2016

It's possible to determine the subset of characters which maximizes repeated word sequences, but the right questions would need to be asked.

For example, we can expect that a smaller subset will typically result in more repeated sequences, so it would be preferable to ask which subset of n characters yields the most repetition, starting with large n.

There is a trade-off between subset size and the number of repeated sequences: the former should not be too small and the latter should not be too large.

It's not clear how the results would be interpreted. I suspect the brute-force search would be the easy part.
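A brute-force search of the kind described might be sketched as follows; the alphabet and the repetition score (here, the number of n-word sequences occurring more than once after reduction) are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

def repeated_ngrams(words, n=4):
    """Number of distinct n-word sequences that occur more than once."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sum(1 for c in counts.values() if c > 1)

def best_subset(text, alphabet, k, n=4):
    """Try every k-character subset to keep; return the subset that
    maximizes repeated n-word sequences in the reduced text."""
    words = text.split()
    best_score, best_keep = -1, None
    for subset in combinations(sorted(alphabet), k):
        keep = set(subset)
        reduced = [''.join(c for c in w if c in keep) for w in words]
        score = repeated_ngrams(reduced, n)
        if score > best_score:
            best_score, best_keep = score, keep
    return best_keep, best_score
```

As noted above, smaller subsets inflate repetition trivially, so in practice one would start with large k and watch how the score changes as k shrinks.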


RE: Character-Limited Patterns? - Davidsch - 01-03-2016

In my research I spent about six months comparing the possibility that letters can vary or are interchangeable.

I tried:
* any of the letters is a space
* any of the letters is any other letter
* any of the letters is a null character (but I quickly stopped there because at some point almost no text is left)

I do not see how I can model errors or spelling variation.


Quote:ReneZ:  - whatever it is that causes the first word in each line to be different



Hm, that is very difficult. I did not yet dare to go there, but what if a predecessor/codeword determines what will come after it?
ex1.  Chody -> use table A

Something like that ?

If anyone has more suggestions, please do write them down!


RE: Character-Limited Patterns? - Emma May Smith - 01-03-2016

(01-03-2016, 09:16 AM)-Job- Wrote: It's possible to determine the subset of characters which maximizes repeated word sequences, but the right questions would need to be asked.

For example we can expect that a smaller subset will typically result in more repeated sequences, so it would be preferable to ask which subset of n characters yields the most repetition, starting with large n.

There is a trade off between subset size and the number of repeated sequences - the former should not be too small and the latter should not be too large.

It's not clear how the results would be interpreted. I suspect the brute-force search would be the easy part.

Indeed. I expect that, whatever the subset used, there must also be some kind of guiding principle behind that selection of characters (or characteristics). For example, we could consider two similar characters (say [k, t]) to be important or unimportant, so long as they were treated the same way. Yet to count one as important and the other unimportant would need explaining beforehand, given their similarity. Thus the total number of possible subsets shrinks to the total number of plausible subsets.

However, as you say, the results would still need interpretation. Were we to measure the patterns of a great many subsets, it would be hard to choose between them as to which is best and what it ultimately means. I suppose you could tackle the problem from the other direction, using the measurement to test the principles by which the subset of characters itself was selected.

Though I can foresee that the features which make Voynich words so odd--replaceable letters and rigid character order--would mean that key characters are actually both hugely important semantically and a bar to more word patterning.


RE: Character-Limited Patterns? - -Job- - 02-03-2016

As an example, the VM contains 33319 distinct four-word sequences when paragraph boundaries are respected. If characters d and e are discarded, that number goes down to 33263, a difference of only 56.

Other character pairs yield a higher number of distinct four-word sequences, but not by a significant amount, so d and e are not really special in this regard.

As we remove additional characters, an increasing number of words begin to disappear and the results become less meaningful. Empty words were treated as non-empty to enable comparison with the original number of four-word sequences. Otherwise, discarding d and y is slightly more effective than discarding d and e given that dy occurs as a word 271 times.

It's not particularly revealing and not exactly an in-depth analysis, but it suggests that the low number of repeated four-word sequences is not simply a result of random insertions of one or two null characters.
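The counting procedure described above can be sketched as follows. The toy data here is invented; the post's figures come from the full transcription, which is not reproduced:

```python
def four_word_stats(paragraphs, discard=frozenset(), n=4):
    """Count total and distinct n-word sequences, respecting paragraph
    boundaries. Discarded characters are stripped from every word; words
    that become empty are kept as empty tokens, as in the post."""
    seen, total = set(), 0
    for para in paragraphs:
        words = [''.join(c for c in w if c not in discard) for w in para.split()]
        for i in range(len(words) - n + 1):
            seen.add(tuple(words[i:i + n]))
            total += 1
    return total, len(seen)

# Toy illustration: discarding a character can merge previously
# distinct sequences.
paras = ["ab cb ab cb ab", "ab cb abd cb ab"]
print(four_word_stats(paras))                 # -> (4, 4)
print(four_word_stats(paras, discard={'d'}))  # -> (4, 2)
```

The gap between the total and the distinct count is the number of repeated occurrences, which is the quantity compared above.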


RE: Character-Limited Patterns? - ReneZ - 02-03-2016

That's an interesting statistic. How many words were in the text used? I guess you counted all overlapping sequences, so that the total number of 4-word sequences is N(word)-3.

The next question would be what would be a 'normal' number for a known plain text.
And the next... since such a text would be edited and spell-checked, what would happen if one introduces arbitrary errors into this text, e.g. one arbitrary substitution every 80, 40 or 20 characters?

I do agree with your conclusion, but it would be interesting to see the magnitude of the problem.


RE: Character-Limited Patterns? - -Job- - 03-03-2016

(02-03-2016, 11:22 AM)ReneZ Wrote: That's an interesting statistic. How many words were in the text used? I guess you counted all overlapping sequences, so that the total number of 4-word sequences is N(word)-3.

The full text was used. The number of four-word sequences is somewhat less than n-3 because it excludes sequences that overlap paragraph boundaries. The total number of non-distinct four-word sequences is 33349. The number of distinct four-word sequences is 33319, so 30 are repeated. Most of these are in f57v, and are essentially sequences of four single-character words.

If we exclude sequences consisting of only single-character words, then there are 33283 four-word sequences, of which 33281 are unique. So there are really only two repeated four-word sequences in the VM.

(02-03-2016, 11:22 AM)ReneZ Wrote: The next question would be what would be a 'normal' number for a known plain text.

A sample of Pliny of slightly smaller size has 110 repeated four-word sequences (none of which consist of single-character words). In order to maximize the number of repeated four-word sequences in Pliny, the best characters to discard are i and x. However, this only raises the number of repeated four-word sequences by 17.

(02-03-2016, 11:22 AM)ReneZ Wrote: And the next... since such a text would be edited and spell-checked, what would happen if one introduces arbitrary errors in this text, e.g. one arbitrary substitution every 80, 40 or 20 characters?

Given that the number of repeated sequences is not extremely large even in a typical spellchecked text, my guess is that the number of errors would need to be fairly high to reduce it even further. For example, the Pliny sample I used has a total of 35125 words. The 110 repeated four-word sequences, together with their 110 first occurrences, account for (110*2)*4 = 880 words. That's 2.5% of the text, assuming that each sequence is repeated only once.

In order to reduce the 110 repeated four-word sequences by half, we'd need errors to affect at least 55 of them. If the errors are random, then it's possible to compute the expected total number of errors needed to achieve this.

The expected number of errors necessary to eliminate any one of the 110 sequences is 1/(880/35125) = 39.9. In order to eliminate another sequence after that we'd need an additional 1/((880-8)/(35125-8)) = 40.27 errors.

For the full 55 sequences we'd need a total of 3002 word errors, which corresponds to a per-word error rate of about 8.5%. That's not unreasonable. However, in order to eliminate all but 2 sequences, we'd need 16298 errors, a 46% per-word error rate.
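The arithmetic above can be checked directly; the numbers are those given in the post (35125 words, 110 repeated sequences, 8 vulnerable words removed per eliminated sequence):

```python
def expected_errors(to_eliminate, total=35125, vulnerable=880, per_seq=8):
    """Expected number of random single-word errors needed to break
    `to_eliminate` repeated sequences. Each elimination removes per_seq
    vulnerable words (4 in each of the two copies) from both pools."""
    e = 0.0
    for k in range(to_eliminate):
        e += (total - per_seq * k) / (vulnerable - per_seq * k)
    return e

print(expected_errors(1))    # first sequence: ~39.9 errors
print(expected_errors(55))   # half of them: ~3002 errors (~8.5% of words)
print(expected_errors(108))  # all but 2: ~16298 errors (~46% of words)
```

This reproduces the 39.9, 3002, and 16298 figures quoted above.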

Let me know if you see an error in my thinking.


RE: Character-Limited Patterns? - ReneZ - 03-03-2016

Thanks very much!
The number of repeated sequences in Pliny is certainly less than I was expecting, and the magnitude of our problem has just dropped a bit in my mind :-)

With this small number of repeated sequences, your estimated values should be reasonable. The only problem I see is that, among the 55 changed sequences at an 8.5% error rate, there will eventually be more than one change in the same sequence, so one would need a slightly higher error rate.

An 8.5% error rate may seem a bit large, but it is worth keeping in mind that there was some process at work in the generation of the text that changed the first word in each line. This already accounts for more than 5% 'disturbance' of word sequences.
I don't recall the average number of words per line, but it should be less than 20.