09-01-2019, 12:28 PM
(30-03-2018, 01:49 PM)Koen G Wrote: I like this avenue of investigation, to focus on finding "and", but depending on the language this may be more difficult - even if it's not a clitic. For example, in modern Turkish, "and" is expressed by ve, de, da, ile... There are also constructions with ister and ya, where those words also replace English and. And that's just after a superficial google search. This means that the relative frequency of each of these words will be lower than that of "and" in English.
(30-03-2018, 05:05 PM)Anton Wrote: That's not good. Why would one need so many different words to express the same logical operator?
Hi Anton,
The simplest answer is: "and" is sometimes not a logical operator.
Example:
- This is a secret between you and me.
It is easy to imagine that, when using simple dictionaries or translation engines, you will get many different translations, because the dictionary or engine cannot tell exactly which "and" you mean.
Besides, there are also other kinds of "and":
- My brother and I were going to talk with our father. (one-side connector, simultaneous)
- I'm going to travel to the United States and Mexico. (one-side connector, sequential)
- I'll go and check what happened. (checking is the purpose of going)
- …, and so on. (one-side connector, but the whole phrase may be replaced by another, like "etc.")
- It’s not their fault, but yours.
- It’s not their fault, and it’s yours.
In Chinese, if I want to say "relationship between countries", I would say "国与国之间的关系". The "与" here may also mean "and" in some other contexts. This is one of countless examples where different languages handle logical operators differently.
I think these might be some of the reasons why Koen found many translations for one word.
19-01-2020, 07:30 PM
I have computed Zipf's law graphs for the individual sections of the VMS (ZL transcription, ignoring uncertain spaces, only considering Paragraph text). As always, keep in mind that I could have made errors somewhere in the process.
According to Zipf's law, word frequencies should follow the distribution:
P(r) = C / r^alpha
Where:
- P(r) is the frequency of the word with rank r;
- C is a constant close to P(1), the frequency of the most frequent word; in English this is typically about 10% (the frequency of 'the');
- in the case of English, the exponent alpha is close to 1, so the second word is about half as frequent as the first word, the third word about 1/3 as frequent as the first word, etc.
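As a reference for what the numbers below mean, here is a minimal sketch (in Python) of one way to obtain them; the log-log least-squares fit and the normalization of the RMSD by the observed frequency range are assumptions about the procedure, and other fitting choices would give slightly different values:
[code]
import math
from collections import Counter

def zipf_fit(tokens):
    """Fit P(r) = C / r**alpha to the rank/frequency data of a token list.

    The fit is an ordinary least-squares line in log-log space:
    log P(r) = log C - alpha * log r. NRMSD is the root-mean-square
    deviation between observed and modelled frequencies, normalized
    by the range of the observed frequencies.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    freqs = sorted((c / total for c in counts.values()), reverse=True)

    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha = -slope
    C = math.exp(my - slope * mx)

    model = [C / r ** alpha for r in range(1, n + 1)]
    rmsd = math.sqrt(sum((f - m) ** 2 for f, m in zip(freqs, model)) / n)
    nrmsd = rmsd / (max(freqs) - min(freqs))
    return C, alpha, nrmsd
[/code]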
For comparison, I include a Latin example (the beginning of Virgil's Aeneid) and an English one (the beginning of Genesis from the King James Bible). AstCZ includes the Astrological, Cosmological and Zodiac sections: these have very little paragraph text, so the measures could be particularly unreliable. Sections are ordered as described in the "Rearranging the pages" paragraph of [link].
Graphs:
These are the results I get:
virgil_ NRMSD:0.008 C:0.024 alpha:0.696
genesis NRMSD:0.017 C:0.117 alpha:0.921
1_HerbA NRMSD:0.003 C:0.048 alpha:0.773
2_Pharm NRMSD:0.011 C:0.039 alpha:0.667
3_AstCZ NRMSD:0.027 C:0.026 alpha:0.571
4_HerbB NRMSD:0.022 C:0.030 alpha:0.617
5_Stars NRMSD:0.018 C:0.026 alpha:0.630
6_Bio__ NRMSD:0.020 C:0.046 alpha:0.686
Voynichese has low values for C and alpha (like Latin). The Latin text has a small deviation from the model. King James Genesis has a poor fit (we know from Koen's TTR research that this text is anomalously repetitive).
An interesting (though probably not surprising) result is that the Voynichese sections vary a lot, both in the values of C and alpha and in the quality of the fit. HerbalA is very close to the model, with a deviation that is only half the already small value of the Latin text. The other Currier-A section (Pharma) also fits very well.
All the Currier-B sections have high deviations.
The frequencies of the most frequent words in each sample help in understanding what is happening (this can also be observed in the graphs):
virgil_ et:3.21% in:1.04% atque:0.55% per:0.55% aut:0.53%
genesis and:10.71% the:9.71% of:4.56% god:1.69% earth:1.40%
1_HerbA daiin:4.91% chol:2.56% chor:1.84% s:1.36% shol:1.25%
2_Pharm daiin:4.06% chol:1.75% cheol:1.52% okeol:1.43% ol:1.38%
3_AstCZ aiin:1.82% or:1.76% ar:1.76% daiin:1.65% ol:1.20%
4_HerbB daiin:2.17% chedy:1.74% or:1.71% chdy:1.50% aiin:1.50%
5_Stars chedy:1.77% qokeey:1.55% qokeedy:1.32% aiin:1.23% qokaiin:1.18%
6_Bio__ shedy:3.31% chedy:2.82% qokedy:2.54% qokain:2.54% ol:2.50%
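These lines are just relative token frequencies; for reproducibility, a minimal sketch of how such a list can be computed (the sample tokens are only a placeholder):
[code]
from collections import Counter

def top_words(tokens, n=5):
    """Return the n most frequent word types with their frequency in percent."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(w, 100 * c / total) for w, c in counts.most_common(n)]

# toy usage; with the real Bio tokens this should reproduce shedy:3.31% etc.
sample = "shedy chedy qokedy shedy ol".split()
print(' '.join('%s:%.2f%%' % (w, f) for w, f in top_words(sample)))
[/code]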
The Bio section is particularly clear, showing that the frequencies of the five most common words are almost "flat": the frequency of "shedy" is comparable with that of "et" in the Latin text, but the fifth word "ol" is about five times as frequent as the fifth Latin word "aut". This non-Zipf-like flatness of the left side of the curve can also be appreciated visually in the bottom four graphs.
Even more puzzling, the five most frequent words are completely different between A and B. In principle, one would expect all of the five most frequent words to be function words, but in Voynichese this does not seem to be the case (this problem has often been mentioned by Torsten).
If Voynichese is meaningful and its words correspond to words in a natural language, I can only think of these possibilities:
- the underlying language has no function words (I am not sure this is really possible);
- the same function words in the "source" language are represented by different words in the different sections.
19-01-2020, 08:11 PM
(19-01-2020, 07:30 PM)MarcoP Wrote: If Voynichese is meaningful and its words correspond to words in a natural language, I can only think of these possibilities:
- the underlying language has no function words (I am not sure this is really possible);
- the same function words in the "source" language are represented by different words in the different sections.
How about function words represented by prefixes or suffixes?
19-01-2020, 08:16 PM
Another thing which recently came to my attention through that stuff of Tranchedino (or whatever his name is) is that one-to-many mapping should not be treated as something impossible.
If one word can be mapped to two or three vords, that completely muddles all our word frequency statistics.
19-01-2020, 09:37 PM
(19-01-2020, 08:11 PM)Anton Wrote: How about function words represented by prefixes or suffixes?
If one drops the idea that Voynichese words correspond to words in a natural language, other possibilities open up and this is one of them.
One problem I see is that Voynichese words are not particularly long and (because of the low entropy) plausible affixes are difficult to identify.
For instance, a common function word easily has a frequency of about 2%. A candidate prefix with a similar frequency could be oka-. But we know that a- can only be followed by a very limited number of symbols: so oka- cannot stand for a function word that can occur before a word starting with any symbol.
This subject has been discussed by Emma in [link]. I think that her arguments are solid and could apply to many other frequent prefixes / suffixes.
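To make the ~2% figure concrete, this is the kind of count involved; a minimal sketch, where the function and the toy token list are only illustrative (nothing here is taken from Emma's post):
[code]
def prefix_token_share(tokens, prefix):
    """Fraction of word tokens that begin with (but are not equal to) prefix."""
    hits = sum(1 for t in tokens if t.startswith(prefix) and t != prefix)
    return hits / len(tokens)

# toy check; with a real transliteration one would test candidates like 'oka'
sample = "okaiin okal daiin okar chedy".split()
print(prefix_token_share(sample, "oka"))  # 0.6 for this toy list
[/code]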
(19-01-2020, 08:16 PM)Anton Wrote: Another thing which recently came to my attention through that stuff of Tranchedino (or whatever his name is) is that one-to-many mapping should not be treated as something impossible.
If one word can be mapped to two or three vords, that completely muddles all our word frequency statistics.
This idea would preserve the assumption that each Voynichese word corresponds to a "source" word. It could also fit my second suggestion:
'the same function words in the "source" language are represented by different words in the different sections'.
Here, one of the problems we face is that Voynichese TTR values are only slightly higher than "normal", while, in this scenario, we would expect a Type-Token Ratio very close to 1. But if mappings are only switched between sections, TTR would not be affected.
We know that similar Voynichese words tend to behave similarly; similar Voynichese characters also behave similarly (and, in [link], similar cipher characters often represent the same source character). But the problem is that a diplomatic cipher easily uses about 100 different symbols, while the Voynich alphabet has about 15-30 symbols: if the two benches are equivalent, the gallows are equivalent, and so on, we end up with half a dozen symbols.
Finally, we often discussed evidence of the opposite problem: a single Voynichese word that appears to have multiple meanings (homographs). If we must think of a many-to-many mapping from source to cipher words, I am not sure we end up with something manageable.
While I don't think that something like a diplomatic cipher is possible, a nomenclator with different cipher words for each "source" word could partially explain what we observe here. I think this is what nablator meant in [link].
19-01-2020, 11:05 PM
(19-01-2020, 09:37 PM)MarcoP Wrote: While I don't think that something like a diplomatic cipher is possible, a nomenclator with different cipher words for each "source" word could partially explain what we observe here. I think this is what nablator meant in [link].
I imagine that the Voynichese generating system is more flexible and easy to use than a large nomenclator. It can produce many possible strings for each string of cleartext, most of which are never used because of some (variable) constraints and preferences. A possible ambiguity can exist if the system is stateful, i.e. what precedes matters, so the same Voynichese string may be used to encipher several (I have worked with 2) different strings of cleartext.
I know how it sounds, comp-sci-ish and unlikely for the 15th century, but I am considering only a practical implementation that is close in complexity to the simplest board games, without any physical action (like rewriting or moving physical objects) or boring computation (mathematical or list look-up).
20-01-2020, 11:26 AM
Thank you, Nablator! I was not aware of this line in your research. I am looking forward to reading more about it!
16-02-2021, 04:16 PM
I recently read an interesting paper about function words: [link], by Tony C. Smith & Ian H. Witten, 1993.
The largely quantitative approach of that paper made me curious to experiment with function words and the MATTR technique that Koen brought to the attention of the Voynich community.
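For reference, MATTR (Moving-Average Type-Token Ratio) simply averages the type-token ratio over every window of a fixed length; a minimal sketch, assuming a plain list of word tokens:
[code]
def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio: mean TTR over all sliding windows."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)
[/code]
Unlike plain TTR, this value is much less dependent on text length, which is what makes it comparable across samples of different sizes.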
I considered the first 1000 word tokens from a number of texts and manually partitioned all word types into the two categories of FunctionWords and ContentWords.
These are the part of speech classes I treated as Function Words:
- articles (e.g. the)
- prepositions (e.g. in)
- pronouns (e.g. you)
- possessive determiners (e.g. your)
- conjunctions (e.g. and)
- auxiliary verbs (e.g. have)
- adverbs of negation (e.g. not)
These are the 10 texts I examined (I only provide links for those that are not easy to find):
- en_gen King James Genesis, English, XVII Century
- en_herb [link], English, XVI Century
- gr_gen Greek Genesis, III Century BCE
- oc_thes [link], Occitan, XIV or XV Century?
- it_mach Machiavelli Il Principe, Italian, XVI Century
- lat_bon_n [link], Latin, XIII Century, normalized transcription
- lat_bon_d Bonaventura, Soliloquium, Latin, diplomatic transcription
- lat_gen Latin Vulgate Genesis, IV Century
- lat_matt [link], Latin, XVI Century
- lat_vir Virgil, Aeneid, Latin, I Century
This table shows the 20 most frequent word types in each text with my tentative Function (green) / Content (yellow) classification.
[attachment=5295]
The interesting result is that there is a significant positive correlation (0.91) between MATTR and the % of content tokens. After I ran these experiments, I found that this percentage is known as Lexical Density.
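A minimal sketch of both measures, assuming function_words is the manually built set of function-word types described above (the helper names are mine; the Pearson formula is the standard one):
[code]
def lexical_density(tokens, function_words):
    """Share of content tokens: 1 - (function tokens / all tokens)."""
    content = sum(1 for t in tokens if t not in function_words)
    return content / len(tokens)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# something like pearson(mattr_values, density_values) would yield
# the 0.91 quoted above
[/code]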
The correlation is possibly explained by the fact that inflected languages result in higher MATTR (since inflection produces more word types from each stem); these languages use fewer prepositions, since such information is sometimes carried by noun cases. Latin is about as inflected as Greek but additionally it has no articles, producing an even higher Density.
If one assumes that Voynichese words correspond to words in a language that is somehow comparable with the European languages I examined, it is possible to infer a rough estimate for Lexical Density in the VMS. MATTR (I used a 500-word window) has some uncertainty, due both to the differences between Currier A and B and to the presence of uncertain spaces (I used the Zandbergen-Landini transliteration). Some additional uncertainty is due to the fact that the correlation with Density is not perfect, e.g. the Italian and Occitan texts have similar MATTR values but Lexical Density values about 12% apart. Overall, I would say that Density for the VMS could fall into the 55-75% range, i.e. 25-45% of word tokens can be expected to be function words.
Of course, it is possible that Voynichese words do not correspond to words in any language, or that the underlying language is so different from the languages I considered that it does not agree with the correlation discussed here.
NB: VMS dots in the plot show actual MATTR values, while I manually selected Y values to fall near the regression line.
16-02-2021, 05:48 PM
Very interesting, Marco. I agree that the correlation arises because the kinds of languages that have a high TTR will tend to absorb function words into content words, resulting in fewer function words overall. Or, the other way around, languages that are highly analytical will only have a few forms per content word (like English book-books) but will have to compensate for this lack of inflectional possibilities with more function words.
I am not sure to what extent this impacts our supposed ability to detect function words. Maybe the consistency of function words in a high-TTR text should make them easier to spot, if anything?