The Voynich Ninja

Full Version: Opinions on: line as a functional unit
(13-11-2025, 11:48 PM)bi3mw Wrote: As mentioned, “edy” is also listed among the suffixes in my heatmap (11th from the left). By clicking on the vertical fields, you can see all possible combinations.

Hello,

Just to explain: my algorithm detects affixes depending on thresholds. In the last data I posted, the thresholds were adjusted to detect a higher number of stems. If I adjust the thresholds so that "-edy" is taken as a suffix, the number of stems drops drastically (the "e" from "edy", which could be part of a stem in the last run, is now part of a suffix). I really don't know which thresholds are right, but my intuition says that if we have prefixes and suffixes, there should also be stems, so I tried to get the maximum number of them.
I too tried to parse Voynich words into 'chunks'. In case you're interested, the method is described here: [link]. There's also a thread here on ninja, [link], but it's quite long and you'd better jump directly to post #78.

Just for reference, this is one of the possible chunk decompositions I found (but see the end of the referenced thread for caveats and considerations):

[attachment=12348]
Going back to the topic of this thread: I am still confused about what exactly LAAFU means. I see that there has been a huge amount of discussion about it, and what I have read only left me more confused.

Is there a short summary of what is known somewhere?

Meanwhile, here are some suggestions that may help understanding those line-position anomalies:

1. Do not try to analyze the whole VMS, all together or even separately by section.  Imagine taking five texts on different topics, some in German and some in Swedish, and trying to uncover grammatical structures by running statistics on them.  Instead, focus on just one sizable homogeneous section -- Herbal-A, Herbal-B, Bio, or Stars -- and forget the rest.  Whatever insights are obtained from that section may not apply exactly to the other sections, but can guide their analysis.

2. Exclude the head lines of paragraphs, labels, and titles. Forget them. Whatever happens in them will be different from, and more complicated than, what happens in "body" lines. We must understand the latter first before we tackle the former.

3. Whenever possible, try to run your statistics on "control" manuscript texts with known languages and similar structure.  That presumably means working with Herbal-A or Herbal-B, and using as control some Medieval Latin herbal with roughly the same amount of text for each plant, with the line breaks in the manuscript  preserved in the transcription.   This last point is important because one possible contributing cause for the start-of-line anomalies is the bias towards longer words in that position caused by the normal line-breaking algorithm used by scribes.  I won't be surprised if the control text too shows (false) evidence of LAAFU.

4. Consider mapping VMS glyphs so as to reduce the effect of transcription errors and ambiguities.  Namely, map {a,o,y} to o, {r,s} to r, {k,t,p,f} to k.  Maybe even Ch to ee and Sh to re.  If the anomalies persist, they will be easier to understand. If the anomalies disappear, that by itself would be an important discovery that will need separate analysis.

5. For the same reason, try to avoid statistics that depend on word spaces.  Unless of course the analysis is about word frequencies, lengths, etc. 
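
The glyph merging in point 4 could be sketched roughly like this. It is only a sketch: it assumes the text is in lowercase EVA, with the benches written as the digraphs "ch" and "sh", and the names `normalize` and `merge_benches` are made up for the illustration. The benches are tokenized first so that the "s" inside "sh" is not treated as the glyph s.

```python
import re

# Single-glyph merges from point 4: {a,o,y} -> o, {r,s} -> r, {k,t,p,f} -> k.
SINGLE = {"a": "o", "o": "o", "y": "o",
          "r": "r", "s": "r",
          "k": "k", "t": "k", "p": "k", "f": "k"}

# Optional bench merges ("maybe even"): Ch -> ee, Sh -> re.
# Assumption: the transcription writes them as lowercase "ch" and "sh".
BENCHES = {"ch": "ee", "sh": "re"}

# Match the bench digraphs before falling back to single characters.
TOKEN = re.compile(r"ch|sh|.")

def normalize(word, merge_benches=True):
    """Map a word onto the reduced glyph set."""
    out = []
    for tok in TOKEN.findall(word):
        if tok in BENCHES:
            out.append(BENCHES[tok] if merge_benches else tok)
        else:
            out.append(SINGLE.get(tok, tok))
    return "".join(out)

for w in ["qokeedy", "otaiin", "shedy", "chol"]:
    print(w, "->", normalize(w))
```

Running the same statistic on the raw and on the normalized text then shows whether a given anomaly survives the merge.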

Makes sense?

All the best, --stolfi
And also:

6. Exclude any "weirdo" that occurs only a few times in the section in  question.  Exclude words that contain it, and lines that start with it.

7. Try to express frequencies as fractions (or percentages), rather than raw counts. I think the former are easier to reason about.

8. Be aware that frequencies f = k/n obtained by counting (like "frequency of d at start of line") have a sampling error σ that depends on the number of "trials" n (in that case, count of lines in the sample), the number of "successes" k (in that case, count of lines that start with d), as well as on the a priori expectations about f (which could be derived, for example, from the frequency of d in word-initial position).  I owe you the formula for σ.  But the point is that "discrepancies" in the frequencies are significant only if they are 3 times σ or more.

9.  Be aware also that if you compute a large number of frequencies (like the frequencies of 20 glyphs), some of them will come out anomalous (above 3σ)  just by chance.  Some of these false anomalies may be detected by splitting the sample text in half and running the analysis on each half separately.  Those false discrepancies are unlikely to occur on both halves.
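
Pending the formula for σ, one common choice (an assumption here, not necessarily the formula meant in point 8) is the binomial standard error under the a priori frequency p: σ = sqrt(p(1-p)/n). The sketch below uses it to express a discrepancy in units of σ, and also estimates how likely at least one chance "3σ" hit is when many frequencies are tested, assuming independent tests and a normal approximation. All counts are invented for the illustration.

```python
import math

def sigma_expected(n, p_expected):
    """Binomial standard error of k/n if the true frequency were p_expected."""
    return math.sqrt(p_expected * (1.0 - p_expected) / n)

def z_score(k, n, p_expected):
    """Discrepancy between observed k/n and p_expected, in units of sigma."""
    return (k / n - p_expected) / sigma_expected(n, p_expected)

def chance_of_false_alarm(num_tests, threshold=3.0):
    """Probability that at least one of num_tests independent statistics
    exceeds the threshold (two-sided) purely by chance."""
    p_single = math.erfc(threshold / math.sqrt(2.0))
    return 1.0 - (1.0 - p_single) ** num_tests

# Hypothetical numbers: 1000 lines, 20 of them start with d,
# while d is word-initial in 10% of all words.
z = z_score(k=20, n=1000, p_expected=0.10)
print("discrepancy in sigmas:", round(z, 2))
print("chance of a false 3-sigma hit in 20 tests:",
      round(chance_of_false_alarm(20), 3))
```

The normal approximation gets poor when n·p is small, so rare glyphs need more care than this.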

All the best, --stolfi
(14-11-2025, 12:33 PM)Jorge_Stolfi Wrote: 9.  Be aware also that if you compute a large number of frequencies (like the frequencies of 20 glyphs), some of them will come out anomalous (above 3σ)  just by chance.  Some of these false anomalies may be detected by splitting the sample text in half and running the analysis on each half separately.  Those false discrepancies are unlikely to occur on both halves.

Is it important how the split is made? For example, randomly assigning lines to A or B vs assigning the first half of the lines to A and the second to B.
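
For concreteness, the two kinds of split in the question could look like this. It is only a sketch with made-up toy lines, and the function names are hypothetical. It does not settle which split is better; it just shows that the sequential split keeps each half contiguous, while the random one mixes lines from the whole sample.

```python
import random

def split_sequential(lines):
    """First half of the lines goes to A, second half to B."""
    mid = len(lines) // 2
    return lines[:mid], lines[mid:]

def split_random(lines, seed=0):
    """Lines are assigned to A or B at random (reproducible via seed)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def line_initial_freq(lines, glyph):
    """Fraction of non-empty lines that start with the given glyph."""
    nonempty = [ln for ln in lines if ln]
    return sum(ln.startswith(glyph) for ln in nonempty) / len(nonempty)

# Toy sample, not real transcription data.
lines = ["dain ol", "dar shey", "ol dain", "qokeedy dal",
         "dal qokal", "sor chey", "ychey dar", "okal dy"]

for name, (a, b) in [("sequential", split_sequential(lines)),
                     ("random", split_random(lines))]:
    print(name, line_initial_freq(a, "d"), line_initial_freq(b, "d"))
```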
(14-11-2025, 11:26 AM)Jorge_Stolfi Wrote: Going back to the topic of this thread: I am still confused about what exactly LAAFU means. I see that there has been a huge amount of discussion about it, and what I have read only left me more confused.

Since I am a noob, I don't know what is known. But for me it means that a paragraph is not a continuous text and that every line starts anew.

It could be a poem, it could be only recipes that work line by line, it could be a logbook, or something else that works line by line.

But none of this makes sense if we look at the paragraph ends, which can be short. That would not be normal for line-by-line writing: we would then expect the length to vary from line to line, but the lines in a paragraph have similar lengths.

So maybe there are other explanations for why we find some words exclusively at the line start. Maybe something is added to the words at the line start; maybe this explains why we find so many unique words at the line start?
[link]

[Edit] This is the main bit; it took a second to find because I was looking for "line as", not "line is".

The Line Is a Functional Entity. 
In addition to my findings about ‘‘languages’’ and hands, there are two other points that I’d like to touch on very briefly. Neither of these has, I think, been discussed by anyone else before. The first point is that the line is a functional entity in the manuscript on all those pages where the text is presented linearly. There are three things about the lines that make me believe the line itself is a functional unit. The frequency counts of the beginnings and endings of lines are markedly different from the counts of the same characters internally. There are, for instance, some characters that may not occur initially in a line. There are others whose occurrence as the initial syllable of the first ‘‘word’’ of a line is about one hundredth of the expected. This by the way, is based on large samples (the biggest sample is 15,000 ‘‘words’’), so that I consider the sample to be big enough so that these statistics are significant. The ends of the lines contain what seem to be, in many cases, meaningless symbols: little groups of letters which don’t occur anywhere else, and just look as if they were added to fill out the line to the margin. Although this isn’t always true, it frequently happens. There is, for instance, one symbol that, while it does occur elsewhere, occurs at the end of the last ‘‘words’’ of lines 85% of the time. One more fact: I have three computer runs of the herbal material and of the biological material. In all of that, which is almost 25,000 ‘‘words,’’ there is not one single case of a repeat going over the end of a line to the beginning of the next; not one. This is a large sample, too. These three findings have convinced me that the line is a functional entity, (what its function is, I don’t know), and that the occurrence of certain symbols is governed by the position of a ‘‘word’’ in a line. 
For instance, there is a particular symbol which almost never occurs as the first letter of a ‘‘word’’ in a line except when it is followed by the letter that looks like ‘‘o.’’

- Point 4 is the "TLDR", but he refers to point 3 within point 4, so I included it as well.

3. The effect of word-final symbols on the initial symbol of the following ‘word’ This ‘word-final effect’ first became evident in a study of the Biol. B index wherein it was noted that the final symbol of ‘words’ preceding ‘words’ with an initial ‘4O’ was restricted pretty largely to ‘9’; and that initial ‘S / Z’ was preceded much more frequently than expected by finals of the ‘M’ series and the ‘E’ series. Additionally, ‘words’ with initial ‘S / Z’ occur in line-initial position far less frequently than expected, which perhaps might be construed as being preceded by an ‘initial nil.’ This phenomenon occurs in other sections of the Manuscript, especially in those ‘written’ in Language B, but in no case with quite the same definity as in Biological B. Language A texts are fairly close to expected in this respect. I can think of no interpretation of this phenomenon, linguistic or otherwise. Inflexional endings would certainly not have this effect nor would any other grammatical feature that I know of if we assume that we are dealing with words. If, however, these word-appearing elements are something else, syllables, letters, even digits, restrictions of this sort might well occur.

4. The line as functional entity As mentioned in para. 3 above, ‘words’ with initial ‘S / Z’ are unexpectedly low in line initial position (on average about .1 of expected); other ‘words’ occur in this position far more frequently than expected, particularly ‘words’ with initial ‘8S,’ ‘9S’ etc., which have the appearance of ‘S’-initial ‘words’ suitably modified for line-initial use. Symbol groups at the ends of lines are frequently of a character unlike those appearing in the body of the text sometimes having the appearance of fillers. Further, in only one instance so far noted has a repeated sequence (of ‘words’) extended beyond the end of one line into the beginning of the next. All in all it is difficult not to assume that the line, on those pages on which the text has a linear arrangement, is a self-contained unit with a function yet to be discovered.
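
As a rough illustration of the "observed vs expected" comparison in the quoted passage: the sketch below takes the expected line-initial frequency of a glyph to be its overall word-initial frequency, which is an assumption on my part (the quote does not say how "expected" was computed). The toy lines are invented and the function name is hypothetical.

```python
def line_initial_ratio(lines, glyph):
    """Ratio of observed to expected line-initial frequency for words
    starting with `glyph`. `lines` is a list of lines, each a list of words.

    Assumption: 'expected' = share of ALL words that start with the glyph;
    'observed' = share of line-first words that start with it."""
    all_words = [w for line in lines for w in line]
    first_words = [line[0] for line in lines if line]
    expected = sum(w.startswith(glyph) for w in all_words) / len(all_words)
    observed = sum(w.startswith(glyph) for w in first_words) / len(first_words)
    return observed / expected

# Toy data, not a real transcription.
lines = [["dain", "sol", "chey"],
         ["qokal", "sar", "dy"],
         ["ol", "shey", "sain"],
         ["sor", "dal", "ol"]]

print(line_initial_ratio(lines, "s"))
```

A ratio well below 1 is the kind of deficit described for 'S / Z'-initial 'words'; whether it is significant depends on the sample size.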

I don't think specific examples are given, but here is one of my own that I think shows what he is alluding to, at least in regard to the "appearance of fillers" at the ends of lines.

[attachment=12350]

If we assume it is not language: you see this sort of thing in some (later) ciphers, such as Francis Bacon's.

"This decodes in groups of five as
baaab(S) baaba(T) aabaa(E) aabba(G) aaaaa(A) abbaa(N) abbab(O) aabba(G) baaaa(R) aaaaa(A) abbba(P) aabbb(H) babba(Y) bbaaa bbaab bbbbb
where the last three groups, being unintelligible, are assumed not to form part of the message."
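
The decoding in that quote can be checked with a few lines of code. This is a sketch assuming the 24-letter form of Bacon's alphabet (I/J and U/V share a code), which is what the letters given in the quote are consistent with; the function name is made up. Groups whose value falls outside the alphabet come out as None, which is the "filler" at the end.

```python
# 24-letter Baconian alphabet: no separate J or V (assumption, see above).
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"

def decode_group(group):
    """Decode one five-symbol a/b group; return None if it has no letter."""
    value = int(group.replace("a", "0").replace("b", "1"), 2)
    return ALPHABET[value] if value < len(ALPHABET) else None

groups = ("baaab baaba aabaa aabba aaaaa abbaa abbab aabba "
          "baaaa aaaaa abbba aabbb babba bbaaa bbaab bbbbb").split()

decoded = [decode_group(g) for g in groups]
message = "".join(c for c in decoded if c is not None)
print(message)        # the intelligible part
print(decoded[-3:])   # the trailing groups that decode to nothing
```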
Quote: Going back to the topic of this thread: I am still confused about what exactly LAAFU means. I see that there has been a huge amount of discussion about it, and what I have read only left me more confused.

To me, LAAFU sounds more important and "prouder" than it really is, like some serious, general law concerning the Voynich Manuscript.

In practice it just means that, statistically, some words appear more often at the beginnings and ends of lines than in the middle.
Words that often appear at the end of a line themselves end with the letter "m", so it's a regularity of this letter rather than of these words.

We don't have any convincing explanation for it. Maybe the most convincing one is that "m" is actually a glyph of another letter which takes this form at the endings. Think of "long s" and "short s" in old texts. Long s appeared at the start or in the middle of words, and short s appeared as the final letter.
[link]

And that's basically all. Also, if you take a random line from the manuscript, there probably won't be any LAAFU effect in it, and nothing special about it. LAAFU applies only to a minority of lines.

There are also some small effects with word length.
[Tagged this comment onto the original to make it less confusing.]
Just want to add another thing I found. 

cheey
ckheey
ctheey

are very often the second word in a line. That also gives a hint that cz, cTz are the same words and that there are symbols injected into the words that are not part of the word.
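
A quick way to check a claim like "very often the second word in a line" is to tabulate the positions at which the words occur. The sketch below does that on invented toy lines (not real transcription data), with a hypothetical function name; positions are 1-based.

```python
from collections import Counter

def position_counts(lines, targets):
    """Count, for each 1-based position in a line, how often a word from
    `targets` occurs there. `lines` is a list of lines, each a list of words."""
    counts = Counter()
    for line in lines:
        for pos, word in enumerate(line, start=1):
            if word in targets:
                counts[pos] += 1
    return counts

# Toy lines for illustration only.
lines = [["dain", "cheey", "ol"],
         ["sor", "ckheey", "dal", "cheey"],
         ["qokal", "ctheey"],
         ["cheey", "dar"]]

counts = position_counts(lines, {"cheey", "ckheey", "ctheey"})
for pos in sorted(counts):
    print("position", pos, ":", counts[pos])
```

To judge whether position 2 is really over-represented, the counts would have to be compared with how often any word at all sits in each position.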