The Voynich Ninja
Vord frequency histogram as an indicator of the text category - Printable Version




RE: Vord frequency histogram as an indicator of the text category - Anton - 09-04-2016

Quote:So far I have checked the botanical folios up to 10v (I will check them all shortly) and found that, with all paragraphs taken together, their first vords exhibit only 50% uniqueness. This means that if we exclude the first paragraphs, the second and later paragraphs would exhibit a quite low degree of uniqueness in their first vords, in contrast to the first paragraphs, whose first vords are at the same time the first vords of the folios.

To sum things up:

1) 49% of the first vords of all paragraphs of botanical folios are unique.

2) As already indicated in the title post, 66% of the first vords of first paragraphs (or, which is the same thing, the first vords of folios) of botanical folios are unique.

3) 31% of the first vords of non-first (second, third, etc.) paragraphs of botanical folios are unique.

Therefore, if position matters, then the high degree of uniqueness of first vords of botanical folios (as compared to the second, third or last vords of folios) is not due to their position in the paragraph, but rather due to their position in the folio.
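
For anyone who wants to recompute the three percentages above on their own transcription, here is a minimal sketch in Python. It assumes that "unique" means a vord occurring exactly once in the whole corpus, and that each folio is supplied as a list of paragraphs, each paragraph being a list of vords. The function name `first_vord_uniqueness` and the toy data are hypothetical placeholders, not taken from any transcription.

```python
from collections import Counter

def first_vord_uniqueness(folios):
    """Share of unique first vords for all, first, and non-first paragraphs.

    `folios` maps a folio name to a list of paragraphs; each paragraph is a
    list of vords. ASSUMPTION: a vord is "unique" if it occurs exactly once
    in the whole corpus passed in.
    """
    counts = Counter(v for paras in folios.values() for p in paras for v in p)

    def share(vords):
        # Fraction of the given vords that are unique; None for an empty list.
        if not vords:
            return None
        return sum(counts[v] == 1 for v in vords) / len(vords)

    # First vord of the first paragraph of each folio (= first vord of the folio).
    first = [paras[0][0] for paras in folios.values() if paras and paras[0]]
    # First vords of the second, third, etc. paragraphs.
    later = [p[0] for paras in folios.values() for p in paras[1:] if p]
    return {
        "all": share(first + later),
        "first": share(first),
        "non_first": share(later),
    }

# Toy data, invented for illustration only (not real transcription).
toy = {
    "folio_A": [["qq", "ab", "cd"], ["ab", "ef", "cd"]],
    "folio_B": [["rr", "cd", "ab"], ["ss", "ef", "tt"]],
}
result = first_vord_uniqueness(toy)
print(result)
```

Comparing the "first" and "non_first" shares is what separates position in the folio from position in the paragraph.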


RE: Vord frequency histogram as an indicator of the text category - Wladimir D - 10-04-2016

I want to remind everyone of one feature of the German language.
It is the existence of verbs with separable prefixes, where the prefix is put at the end of the sentence. Excellent candidates for this role are the words "ar", "or", "ai", "oi", "am", "om", "dar", "dor", "dai", "doi", "dam", "dom".

This can significantly affect the statistics.


RE: Vord frequency histogram as an indicator of the text category - Anton - 10-04-2016

I'm not sure that this was a feature of the German language as it was in the 15th century. It was significantly different back then from what it is now.


RE: Vord frequency histogram as an indicator of the text category - -JKP- - 27-12-2016

A very high proportion of the "unique" words in the VMS are not unique if you split them into two components (which is why I mentioned that they are only unique if you take the spaces as real).

In other words, if you see ABCDEFGH, it will almost always break into ABCD EFGH or ABCDE FGH or something similar. The individual components are often common vords. On occasion they will split into three, but many are made up of two.
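
The ABCD EFGH idea can be tested mechanically. The following is only a rough sketch in Python, assuming a vord counts as "common" once it occurs at least `min_count` times in the corpus (that threshold, the function name `two_part_splits`, and the toy corpus are all made up for illustration). It lists every cut point at which both halves are common vords in their own right.

```python
from collections import Counter

def two_part_splits(vord, counts, min_count=2):
    """List every way to cut `vord` into two pieces that both occur as
    stand-alone vords at least `min_count` times (ASSUMED threshold)."""
    return [
        (vord[:i], vord[i:])
        for i in range(1, len(vord))
        if counts[vord[:i]] >= min_count and counts[vord[i:]] >= min_count
    ]

# Toy corpus, invented for illustration; letters stand in for glyphs.
corpus = ["abcd", "efgh", "abcd", "efgh", "abcdefgh",
          "abcde", "fgh", "abcde", "fgh"]
counts = Counter(corpus)

splits = two_part_splits("abcdefgh", counts)
print(splits)
```

A vord with more than one valid cut, as in this toy case, shows why the result depends on which components are taken as real.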


RE: Vord frequency histogram as an indicator of the text category - Koen G - 27-12-2016

So this means that they are unique because they combine chunks in a way that happens to be unique. But there must also be words that really do unique stuff? Would it help to study those as a separate category?


RE: Vord frequency histogram as an indicator of the text category - -JKP- - 27-12-2016

(27-12-2016, 02:36 PM)Koen Gh. Wrote: So this means that they are unique because they combine chunks in a way that happens to be unique. But there must also be words that really do unique stuff? Would it help to study those as a separate category?


Yes, the combinations are unique, the components frequently are not.

Sometimes the two combined vords appear to be vords in the way we think of words and sometimes the combined units appear to be word + affix (most often a suffix). So, if EFG is a common suffix (or what appears to behave like a suffix, showing up mostly at the ends of different vords), you might see unique vord ABCDEFG as well as unique vord HIJKEFG, with EFG also showing up at the ends of other vords (to create combinations that are not unique).

There's another group of "unique" words that I haven't figured out yet, such as dydyd. I've spent time studying the other patterns, but not ones constructed like this, out of repeated but sometimes incomplete smaller units, but they appear less common than the other forms.



I've been compiling some statistics on unique vords. Depending on which transcription system you use (I use my own), there are about 30 unique vords on the first page, but the following plant pages tend to have about 6 to 12 unique vords, depending on how much text is on that page. If it's a long page (e.g., three paragraphs), there might be as many as 20, but the proportion is fairly consistent.
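
A count like this is easy to reproduce for any transcription. Below is a minimal sketch in Python, assuming each page is a flat list of vords and that "unique" means occurring exactly once across all pages supplied. The function name `unique_per_page` and the toy pages are hypothetical, and the real numbers will depend on the transcription system used.

```python
from collections import Counter

def unique_per_page(pages):
    """Return {page: (number of unique vords, share of the page's vords)}.

    ASSUMPTION: "unique" = occurs exactly once across all pages supplied.
    """
    counts = Counter(v for vords in pages.values() for v in vords)
    report = {}
    for page, vords in pages.items():
        n_unique = sum(counts[v] == 1 for v in vords)
        share = n_unique / len(vords) if vords else 0.0
        report[page] = (n_unique, share)
    return report

# Toy pages, invented for illustration: a short page and a longer one.
pages = {
    "p1": ["xx", "aa", "yy", "bb"],
    "p2": ["aa", "bb", "zz", "aa", "ww", "bb", "vv", "uu"],
}
report = unique_per_page(pages)
print(report)
```

In the toy data the longer page has twice as many unique vords but the same share, which is the kind of consistency in proportion described above.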


It has been said that maybe the unique vords in the big-plants section are the plant names, but they are not always the first vord and many will break into smaller components. For example, the second vord in the "water lily" plant is unique, unless you break it into components... then it becomes two common components with EVA-P in between and since EVA-P MAY function differently from other glyphs (possibly as a capitulum or some other marker or modifier), this would make logical sense (it's just one way to look at it, but there seems to be some support in the rest of the text for this interpretation).


I could go on for about 20 pages describing just this aspect of the text alone. It's hard to encapsulate it in one post.


RE: Vord frequency histogram as an indicator of the text category - ThomasCoon - 27-12-2016

(27-12-2016, 03:31 PM)-JKP- Wrote: I've been compiling some statistics on unique vords. Depending on which transcription system you use (I use my own), there are about 30 unique vords on the first page, but the following plant pages tend to have about 6 to 12 unique vords, depending on how much text is on that page. If it's a long page (e.g., three paragraphs), there might be as many as 20, but the proportion is fairly consistent.

That is highly interesting - I wonder why that is. Do you think it might be a clue into the workings of the encoding mechanism, JKP?


RE: Vord frequency histogram as an indicator of the text category - -JKP- - 27-12-2016

(27-12-2016, 03:47 PM)ThomasCoon Wrote:
(27-12-2016, 03:31 PM)-JKP- Wrote: I've been compiling some statistics on unique vords. Depending on which transcription system you use (I use my own), there are about 30 unique vords on the first page, but the following plant pages tend to have about 6 to 12 unique vords, depending on how much text is on that page. If it's a long page (e.g., three paragraphs), there might be as many as 20, but the proportion is fairly consistent.

That is highly interesting - I wonder why that is. Do you think it might be a clue into the workings of the encoding mechanism, JKP?


Thomas, there are a number of possible interpretations, but it seems to me that the two most likely reasons are:

1. That it's a clue to the encoding mechanism, or
2. that it's a reflection of the content.

So, I started looking at how unique Vords were distributed and their positions relative to the rest of the text (not just on the same page, but in the manuscript overall) in the hopes that this might offer some answers.

It's not too hard to describe page position, but when you factor in their breakdown components and their relationships to the rest of the manuscript, it becomes a mountain of a project, which is why I haven't finished writing it all up yet.


Without making this post wayyyy too long, I can share some general observations. To keep it short, I'll restrict it to Vords that have not been broken into their components...

So here's a JKP nutshell version of the behavior of the unique VMS word-tokens (Vords), with emphasis on the large-plants section to keep it short enough to fit in a forum post...

As an example, in the large-plants section, the following patterns are evident:

1. Unique Vords have a somewhat regular distribution within the text. They do not tend to fall next to each other.
2. Unique Vords appear more frequently, but not always, at the beginnings of paragraphs. <-- [See point 8.]
3. Unique Vords less commonly fall at the ends of paragraphs than at the beginning (but it does happen).
4. When unique Vords fall at the ends of lines, they are often suffixed by those peculiar constructions that are more frequent at the ends of lines, such as EVA-aj, -oj, or j or EVA-d with straight leg rather than a full figure-8 curve, or those common at the ends of words (e.g., EVA-y, -dy or -ar). <-- [Points worth noting since these are general patterns of the text and apparently not restricted to unique Vords.]
5. Unique Vords are roughly the same length as common Vords. In the VMS, vord length is not necessarily an indication of rarity or uniqueness. Sometimes unique vords are short and sometimes common vords are long.
6. Most of the time, a paragraph contains about 4 to 9 unique Vords, and they are somewhat evenly distributed in the sense of being proportional to the rest of the text. Very short paragraphs will sometimes only have a couple of unique Vords.
7. Unique Vords at the beginnings of paragraphs are often prefaced by gallows characters.
8. Important: unique Vords will often break down into two components that show up elsewhere in the text. These atomic units combine in more than one way (some appear to behave like words, some appear to behave like affixes, frequently suffixes). Some are common vords. <-- [This is also worth noting because it raises the question, "Are they compound words as in natural language, are they structural units as in a synthetic language, or are the spaces contrived?".]
9. When the unique Vord is at the beginning of a paragraph and prefaced by gallows characters, removing the gallows will often result in a common component or a combination of two common components.
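
Points 7 to 9 suggest a simple check: strip the leading gallows and see whether what remains is already known. Here is a hedged sketch in Python, assuming an EVA transcription in which the gallows are the single characters t, k, p and f, and assuming "common" means occurring at least `min_count` times. The threshold, the helper names `strip_gallows` and `explained_by_gallows`, and the toy corpus are all invented for illustration.

```python
from collections import Counter

# EVA gallows glyphs, treated here as single leading characters (ASSUMPTION:
# benched gallows and other multi-character forms are ignored).
GALLOWS = ("t", "k", "p", "f")

def strip_gallows(vord):
    """Remove one leading gallows glyph, if present."""
    return vord[1:] if vord.startswith(GALLOWS) else vord

def explained_by_gallows(vord, counts, min_count=2):
    """True if the vord minus its leading gallows is a common vord, or
    splits into two common components (ASSUMED threshold `min_count`)."""
    rest = strip_gallows(vord)
    if rest == vord or not rest:
        return False  # no gallows to strip, or nothing left afterwards
    if counts[rest] >= min_count:
        return True
    return any(
        counts[rest[:i]] >= min_count and counts[rest[i:]] >= min_count
        for i in range(1, len(rest))
    )

# Toy corpus, invented for illustration; not real transcription.
corpus = ["pabc", "abc", "abc", "de", "de", "tabcde", "pzz", "qabc"]
counts = Counter(corpus)

checks = {v: explained_by_gallows(v, counts) for v in ["pabc", "tabcde", "pzz", "qabc"]}
print(checks)
```

If most paragraph-initial unique vords come back `True` under a check like this, that would be consistent with the gallows acting as a separate marker rather than as part of the vord.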


Some things I have noticed while studying medieval herbals, compared to the patterns in the distribution of unique Vords...

1. In medieval herbals, there are frequently lists of plant names in a variety of languages. If the unique Vords in the VMS big-plants section were lists of plant names, and if the VMS text followed conventional patterns, then one would expect a higher proportion of unique Vords and they would probably be closer together, rather than being more evenly distributed. Also, unique Vords are not always at the beginning of paragraphs so, even if one name were listed rather than several, either the name consists of common words (e.g., in English, the components water and lily might show up as a plant name, but also as separate words in other sections) or the text in each section does not follow a rigid model or... the text may have nothing to do with plants (or be meaningless). For the record, ancient and medieval plant names tended to be based on unique words rather than compound words (e.g., afodille, androsaema, corcodrillo, pitythalmos, etc.).
2. As examples of pages that are somewhat (although not greatly) different, Plant 4v and 17r have more end-of-line unique Vords, and 20v has fewer unique vords.
3. Following up Point 9, it's important to consider that the preponderance of unique Vords at the beginning of paragraphs might be an artifact. If it turns out that gallows characters are capitula, markers, or modifiers, and are evaluated separately from the following glyphs, then the following glyphs are often not unique. For example, if you have EVA-Pxxxxx at the beginning of the paragraph and you remove the P, the rest of the vord is often found elsewhere. This lends support to the possibility that gallows-P behaves differently from other glyphs and also that the Vords at the beginnings of paragraphs are not necessarily unique.


I have to run and that's more than enough for one post.


RE: Vord frequency histogram as an indicator of the text category - ThomasCoon - 29-12-2016

Thanks for the very thorough reply JKP!

(27-12-2016, 10:35 PM)-JKP- Wrote: Thomas, there are a number of possible interpretations, but it seems to me that the two most likely reasons are:

1. That it's a clue to the encoding mechanism, or
2. that it's a reflection of the content.

That's a great point - It never occurred to me that it could also reflect the content - that's certainly a viable explanation too.

Quote:It's not too hard to describe page position, but when you factor in their breakdown components and their relationships to the rest of the manuscript, it becomes a mountain of a project, which is why I haven't finished writing it all up yet.

It seems you have several monolithic projects going on. I commend you for always thinking big-picture and comprehensively, even when it's a strenuous road to walk.

Quote:As an example, in the large-plants section, the following patterns are evident:

1. Unique Vords have a somewhat regular distribution within the text. They do not tend to fall next to each other.
2. Unique Vords appear more frequently, but not always, at the beginnings of paragraphs. <-- [See point 8.]
3. Unique Vords less commonly fall at the ends of paragraphs than at the beginning (but it does happen).
4. When unique Vords fall at the ends of lines, they are often suffixed by those peculiar constructions that are more frequent at the ends of lines, such as EVA-aj, -oj, or j or EVA-d with straight leg rather than a full figure-8 curve, or those common at the ends of words (e.g., EVA-y, -dy or -ar). <-- [Points worth noting since these are general patterns of the text and apparently not restricted to unique Vords.]
5. Unique Vords are roughly the same length as common Vords. In the VMS, vord length is not necessarily an indication of rarity or uniqueness. Sometimes unique vords are short and sometimes common vords are long.
6. Most of the time, a paragraph contains about 4 to 9 unique Vords, and they are somewhat evenly distributed in the sense of being proportional to the rest of the text. Very short paragraphs will sometimes only have a couple of unique Vords.
7. Unique Vords at the beginnings of paragraphs are often prefaced by gallows characters.
8. Important: unique Vords will often break down into two components that show up elsewhere in the text. These atomic units combine in more than one way (some appear to behave like words, some appear to behave like affixes, frequently suffixes). Some are common vords. <-- [This is also worth noting because it raises the question, "Are they compound words as in natural language, are they structural units as in a synthetic language, or are the spaces contrived?".]
9. When the unique Vord is at the beginning of a paragraph and prefaced by gallows characters, removing the gallows will often result in a common component or a combination of two common components.

Wow - this is very enlightening! Thanks JKP - I definitely agree with points 2, 7 and 8. Anyone who works with the text long enough will notice how odd paragraph-initial words are (especially those weird gallows constructions). Trying to figure out why they are EVA p-initial or f-initial has always been a thorn in my side...

Point #4 is highly intriguing because (please forgive me if I misinterpreted) the elements that normally appear at the end of a line still appear at the end of a line regardless of whether or not the vord is unique - which may be evidence for some paragraph / line structure that is at least partially independent of the words in that line. I'm not saying I'm advocating that theory (I have no idea) but this observation might support it.

Quote:1. In medieval herbals, there are frequently lists of plant names in a variety of languages. If the unique Vords in the VMS big-plants section were lists of plant names, and if the VMS text followed conventional patterns, then one would expect a higher proportion of unique Vords and they would probably be closer together, rather than being more evenly distributed. Also, unique Vords are not always at the beginning of paragraphs so, even if one name were listed rather than several, either the name consists of common words (e.g., in English, the components water and lily might show up as a plant name, but also as separate words in other sections) or the text in each section does not follow a rigid model or... the text may have nothing to do with plants (or be meaningless). For the record, ancient and medieval plant names tended to be based on unique words rather than compound words (e.g., afodille, androsaema, corcodrillo, pitythalmos, etc.).
2. As examples of pages that are somewhat (although not greatly) different, Plant 4v and 17r have more end-of-line unique Vords, and 20v has fewer unique vords.
3. Following up Point 9, it's important to consider that the preponderance of unique Vords at the beginning of paragraphs might be an artifact. If it turns out that gallows characters are capitula, markers, or modifiers, and are evaluated separately from the following glyphs, then the following glyphs are often not unique. For example, if you have EVA-Pxxxxx at the beginning of the paragraph and you remove the P, the rest of the vord is often found elsewhere. This lends support to the possibility that gallows-P behaves differently from other glyphs and also that the Vords at the beginnings of paragraphs are not necessarily unique.

All three of these are great points - I definitely agree with #3; that has occurred to me also. And as far as #1, I'm glad you understand the norms of medieval plant naming, because that fact would've passed over most of our heads.


RE: Vord frequency histogram as an indicator of the text category - Anton - 29-12-2016

Back in summer I published the [link not preserved] in which, among other things, uniqueness of vords with respect to their position in the folio was explored (see in particular Section 6). The focus was on botanical folios. I came to the following conclusions:

1. First vords of botanical folios exhibit behaviour different from vords occupying other positions, demonstrating not only a high degree of uniqueness, but also a low count of those vords which are non-unique.

2. This peculiar behaviour of first vords of botanical folios cannot be attributed to their position, but rather to some other reason.

3. The first vord of a botanical folio is never repeated again in the same folio.

4. First vords of balneological and recipe folios also exhibit a high degree of uniqueness.