The Voynich Ninja

Let's denote probability of "daiin" to occur as "p".

Now, assuming that there are no interdependencies between words, i.e. each next word is independent of the preceding one, the probability would be p*(1-p)*p*(1-p).

The value of p may be taken from some assumption (e.g. Zipf's law), or calculated from the actual count of "daiin". If from the actual count, then p = 845/37718 = 2.240e-2, hence the pattern's probability would be 4.80e-4, which, multiplied by the ~37718 patterns in total, yields an expected count of 18.1 versus your 18.9.
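The arithmetic above can be checked with a minimal sketch. The function name is hypothetical, and it assumes, as the post does, that the number of four-word windows is roughly equal to the total word count.

```python
def expected_alternation_count(word_count, total_words):
    """Expected number of 'X any X any' patterns if words were independent."""
    p = word_count / total_words
    # X, then not-X, then X, then not-X
    pattern_probability = p * (1 - p) * p * (1 - p)
    # Assumption: about total_words four-word windows in the text
    return pattern_probability * total_words


# "daiin": 845 occurrences out of 37718 words
print(round(expected_alternation_count(845, 37718), 1))
```

With the figures from the post this gives about 18.1.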

Assuming there are interdependencies (which one normally would expect from a meaningful text), we need to somehow estimate them.

(30-03-2018, 10:49 PM)davidjackson Wrote: Biword doesn't exist. The technical name would be the reduplicant.

Any chance for it to become Oxford's 2018 word of the year?

Thank you again, Rene and Anton! Once again, discussing things on the forum has been worth the effort.

These are updated graphs for X Y X Z, where Y and Z are different from X but may be identical (so X Y X Y is included here).
I computed the expected occurrences as described by Anton:
p*(1-p)*p*(1-p)
using word frequency as an estimate of 'p'.
My idea was comparing with a randomly independent arrangement, so I wouldn't go into anything more complex at the moment.
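A rough sketch of how such a comparison could be computed is given here. This is not the script actually used for the graphs; the function name is hypothetical, and it assumes the text is a flat list of words (line and paragraph breaks ignored) with overlapping windows each counted.

```python
from collections import Counter


def alternation_counts(words):
    """Return {X: (actual, expected)} for the pattern X Y X Z with Y != X and Z != X."""
    n = len(words)
    freq = Counter(words)
    actual = Counter()
    # Slide a four-word window over the text
    for i in range(n - 3):
        x, y, x2, z = words[i:i + 4]
        if x == x2 and y != x and z != x:
            actual[x] += 1
    result = {}
    for word, count in freq.items():
        p = count / n  # word frequency as the estimate of p
        # Assumption: about n windows in total
        expected = p * (1 - p) * p * (1 - p) * n
        result[word] = (actual[word], expected)
    return result


sample = "a poco a poco si va".split()
for word, (actual, expected) in alternation_counts(sample).items():
    print(word, actual, round(expected, 3))
```

Y and Z may be identical here, so X Y X Y is included in the actual count.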

It turned out that the major distortion I introduced was on the "actual" count, where I removed more aggressively than I realized occurrences of the X X' X kind, where X' is similar to X, or of the X Y X Y' kind. Here I have applied no such filtering, and the VMS figures are very different, closer to what Rene said.
The impact of this is considerably more limited on the other languages.

I have added a blue bar, where I display the occurrences of XYXY (a subset of those counted in the red bar "actual"). As you can see, this pattern does not have a huge impact in any graph. "Word-couple repetition" never occurs in Latin but is present in Dante ("a poco a poco", slowly, "a mille a mille", thousands and thousands).
If I haven't introduced more significant errors, Voynichese seems to have a much more pervasive presence of the alternating pattern than I initially thought. The pattern does not seem to be restricted to dedicated function words as in Mattioli's Latin.
I will look more into this in the future and I hope that others will make independent analyses on the subject.

[attachment=2042]
___
[attachment=2043]
___
[attachment=2044]

PS: these are the total occurrences of the pattern in the examined texts. Quire13 is about 1/5 as long as the other texts. The totals are different, but mostly comparable. As the graphs show, a difference is that, in the comparison languages, most occurrences involve the conjunction 'e'/'et'/'and', while in the VMS several words exhibit a preference for this arrangement. This also happens in Mattioli, but the difference is that "alternating words" in the VMS also occur in other arrangements, while Mattioli's "tum" etc. regularly appear alternating with other words.

Code:
Dante           153
K.James_Genesis 227
Pliny           109
Mattioli        177
VMS             244
VMS_Quire13      84

Let's theorize about what it means when, given your methodology, the actual pattern count notably differs from the "expected" value. As we see, it can differ in both directions, although in most cases the actual value is larger than expected.

Since the actual single-word frequency is already accounted for in the methodology, differences in both directions would suggest something.

Most generally, if the actual count is higher than expected, then the vord in question tends to exhibit positive affinity (attraction) to itself, while if it is lower than expected, the vord tends to exhibit negative affinity (repulsion) to itself. I think that similar graphs for reduplications would be useful to test this explanation, i.e. for XX patterns like daiin daiin. The "expected" value in our terms would be simply p^2 in this case. If the "affinity" idea is valid, then the situation with XX patterns will follow the situation with "X any X any" patterns.
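The XX test could be sketched as follows. The function name and the actual/expected ratio used as an "affinity" indicator are assumptions for illustration (a ratio above 1 suggesting attraction, below 1 repulsion); the text is again treated as a flat list of words.

```python
from collections import Counter


def reduplication_affinity(words):
    """Return {X: (actual, expected, ratio)} for immediate repetitions X X."""
    n = len(words)
    freq = Counter(words)
    # Count adjacent identical pairs
    actual = Counter(a for a, b in zip(words, words[1:]) if a == b)
    result = {}
    for word, count in freq.items():
        p = count / n
        # Expected count under independence: p^2 times about n adjacent pairs
        expected = p ** 2 * n
        result[word] = (actual[word], expected, actual[word] / expected)
    return result


sample = "daiin daiin ol daiin chey".split()
for word, (actual, expected, ratio) in reduplication_affinity(sample).items():
    print(word, actual, round(expected, 2), round(ratio, 2))
```

Comparing these ratios with the ones for "X any X any" word by word would show whether the two patterns move together.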

Any more ideas?

I have a vague impression that somehow we can kinda sort that out.

See, for Latin "et any et any" is higher than expected. Since it's opposite to what we see in Dante and the Bible, I suspect this is due to the stylistics of Latin, where instead of a simple list (such as would be separated with commas nowadays), the elements of a list are consistently interleaved with "et". Like instead of "dogs, cats, birds, fish and lions" you'd say "dogs and cats and birds and fish and lions".

I'll bet that "et et", though, would turn out to be lower than expected, since it's basically an invalid biword (sorry David!) in Latin.

(31-03-2018, 05:40 PM)Anton Wrote: ...

I'll bet that "et et", though, would turn out to be lower than expected, since it's basically an invalid biword (sorry David!) in Latin.

I don't want to throw too many wrenches in this but...

In Latin, yes, one would not expect to find et et, but in medieval scribal Latin it can certainly occur as et et if there is a tail on the second "t" so that it looks like EVA-r. Then it becomes "eter", an abbreviation for "eternum/aeternum".

If there is a tail or macron on the e and a tail on the t, then "et et" is read as Latin or French "et entre".

Ligatures and abbreviations were the norm in the middle ages.

I think it's worthwhile to run the statistical tests. One never knows which ones might yield helpful information... but assumptions that apply to modern languages that are based on alphabets may not work on medieval languages in which the scribal abbreviations were as much a part of the "alphabet" as individual letters.

(31-03-2018, 05:23 PM)Anton Wrote: Let's theorize about what it means when, given your methodology, the actual pattern count notably differs from the "expected" value. As we see, it can differ in both directions, although in most cases the actual value is larger than expected.

Since the actual single-word frequency is already accounted for in the methodology, differences in both directions would suggest something.

Most generally, if the actual count is higher than expected, then the vord in question tends to exhibit positive affinity (attraction) to itself, while if it is lower than expected, the vord tends to exhibit negative affinity (repulsion) to itself. I think that similar graphs for reduplications would be useful to test this explanation, i.e. for XX patterns like daiin daiin. The "expected" value in our terms would be simply p^2 in this case. If the "affinity" idea is valid, then the situation with XX patterns will follow the situation with "X any X any" patterns.

Any more ideas?

Hi Anton,
a while ago I posted reduplication graphs (sequential word repetition). Since they were split for Currier A and B, I have recomputed them, for the whole VMS and Quire13 only. The graphs I previously posted in this thread were all sorted by decreasing actual counts. The new graphs are sorted by Expected occurrences, so that the Alternation/Reduplication graphs have words in the same order (easier to compare). I hope the numbers are reasonably correct this time.

shey / chey are two words that seem to behave differently: they appear much more than expected in the X.Y.X.Z configuration, but only once as X.X (actually chey.chey).
aiin also does not reduplicate as aiin.aiin, but appears in the X.Y.X.Z alternating pattern.

I don't have time to go deeper into this analysis right now, but I think the subject is interesting and I will come back to it in a few days.

Say, for daiin the repetition count is 20 (correcting one mistaken entry in Voynich Reader), all of those are two-vord repetitions, so the actual pattern count is 20. The "expected" count is 18.9, which is pretty much the same (6% difference).

By the way, why did daiin disappear from the corrected Q13 graph?

Let's examine shedy, which exhibits strong positive affinity in the "X any X any" pattern. Voynich Reader gives 9 repetitions of shedy (all two-word ones). The frequency of shedy is 425. The total count in reader is 38045, so I will use that. p^2 = 1.248e-4, hence the "expected" count of "X X" for shedy is 4.7. So, strong positive affinity, same as for "X any X any".
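The shedy figures can be re-derived in a few lines. The variable names are hypothetical; the counts are the ones quoted above from Voynich Reader.

```python
shedy_count = 425    # occurrences of shedy
total_words = 38045  # total word count reported by Voynich Reader
observed_xx = 9      # observed "shedy shedy" repetitions

p = shedy_count / total_words
# Expected "X X" count under independence: p^2 times the total word count
expected_xx = p ** 2 * total_words

print(f"p^2 = {p ** 2:.3e}")
print(f"expected = {expected_xx:.1f}, observed/expected = {observed_xx / expected_xx:.1f}")
```

The observed count comes out at roughly twice the expected one, in line with the "strong positive affinity" reading.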

UPDATE: Marco already posted while I was composing this post.

(31-03-2018, 06:53 PM)Anton Wrote: By the way, why did daiin disappear from the corrected Q13 graph?

The graphs are sorted by actual counts: the 20 words with the highest count are included. There are a number of words with actual=1; which of them makes it to the graph is not something I control.

Quote: In Latin, yes, one would not expect to find et et, but in medieval scribal Latin it can certainly occur as et et if there is a tail on the second "t" so that it looks like EVA-r. Then it becomes "eter", an abbreviation for "eternum/aeternum".

Yes, but Marco's been running his tests on expanded versions, not on abbreviated ones.

Anton

Anton

MarcoP

Anton

Anton

-JKP-

MarcoP

Anton

MarcoP

Anton