The Voynich Ninja
"The Currier languages revisited" revisited - Printable Version


RE: "The Currier languages revisited" revisited - Rafal - 22-12-2025

Quote:Maybe the texts of Herbal-A and Herbal-B were taken from sources in two different dialects, or two very similar languages.  Like Northwest Lower West Bavarian and Southwest Lower West Bavarian...

The problem is that they don't have to be languages at all. They may be two different algorithms or heuristics of generating gibberish.

Or the same algorithm running with a different "seed".

Personally I am still struggling with understanding statistical research of VM. These guys often claim that Voynichese behaves similarly to real languages. 
I wonder what kind of results they have. Is it only PCA and finding clusters corresponding to quires and images, or something more?

Because if it is only clusters then it doesn't prove similarity to real languages.


RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 22-12-2025

(22-12-2025, 08:41 PM)Rafal Wrote: Personally I am still struggling with understanding statistical research of VM. These guys often claim that Voynichese behaves similarly to real languages. ... Because if it is only clusters then it doesn't prove similarity to real languages.

Indeed, there is no convincing proof yet that it *IS* natural language.  We only have many statistical properties that it has and that natural languages also have, but which some forms of gibberish would not have.  

For example, imagine an algorithm that made up each word by combining a prefix, core, and suffix from three given sets (as in one of the many proposed "word paradigms") of sizes p,q,r, with each component drawn uniformly at random.  This would generate a text where each word type would appear with the same frequency 1/(p q r), hence with a flat Zipf plot (of frequency x rank).  But Voynichese has a Zipf plot that is similar to that of other languages.

One could have a more complicated algorithm where each of the three components is randomly chosen from its set according to unequal frequencies.  Like, by stuffing the "prefix" bag with four copies of qo, two copies of cho, etc.  This would generate words with unequal frequencies, and the result could approximate the Zipf law better.  But to get a good fit, as we see, would require careful tuning of the contents of each bag -- and awareness that such tuning was needed...
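The two generators described above can be tried out with a minimal sketch like the one below. The component sets and the "bag" weights are made up for illustration (they are not any published word paradigm), and the ratio between the most and least frequent word type is used only as a crude stand-in for the shape of the Zipf plot.

```python
import random
from collections import Counter

# Hypothetical component sets; real "word paradigms" are more elaborate.
PREFIXES = ["qo", "cho", "o", "sh"]
CORES = ["k", "t", "ke", "te", "l"]
SUFFIXES = ["y", "dy", "aiin", "ol"]

def generate(n_words, weights=None, seed=0):
    """Draw n_words as prefix+core+suffix; weights=None means uniform draws."""
    rng = random.Random(seed)
    w = weights or (None, None, None)
    parts = [rng.choices(s, weights=wi, k=n_words)
             for s, wi in zip((PREFIXES, CORES, SUFFIXES), w)]
    return ["".join(t) for t in zip(*parts)]

def rank_frequencies(words):
    """Relative frequencies of word types, from most to least common."""
    counts = Counter(words)
    return sorted((c / len(words) for c in counts.values()), reverse=True)

# Uniform draws: every one of the p*q*r types should be about equally common.
flat = rank_frequencies(generate(200_000))

# "Stuffed bags": unequal weights, chosen arbitrarily for this sketch.
bags = ([8, 4, 2, 1], [16, 8, 4, 2, 1], [8, 4, 2, 1])
skewed = rank_frequencies(generate(200_000, weights=bags))

print("word types:", len(flat))
print("uniform  top/bottom frequency ratio:", round(flat[0] / flat[-1], 2))
print("weighted top/bottom frequency ratio:", round(skewed[0] / skewed[-1], 2))
```

Plotting `flat` and `skewed` against rank on log-log axes would show how far either is from a Zipf-like line; getting a close fit would still depend on how the weights are tuned.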

Or imagine an algorithm that generates random words with a Zipf-like distribution, but generates each word independently from the previous ones.  Then the frequency of a word pair "W1 W2" would be just the product of the overall frequencies of W1 and W2.  But that is not the case in natural languages: there are always forbidden pairs and pairs that occur more often than expected from their individual freqs.  And that is also the case of Voynichese...
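One way to check the independence property described above is to compare each observed pair count with the count expected from the product of the two word frequencies. The sketch below does that for a toy vocabulary; the word names, the 1/rank weights, and the "no immediate repetition" rule are all assumptions made up for the example.

```python
import random
from collections import Counter

def pair_ratios(words, min_expected=5.0):
    """Observed / expected count for every ordered pair of word types.

    'Expected' assumes each word is chosen independently of the previous one.
    Pairs whose expected count is below min_expected are skipped as too noisy.
    """
    n = len(words)
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    ratios = {}
    for w1 in uni:
        for w2 in uni:
            expected = (n - 1) * (uni[w1] / n) * (uni[w2] / n)
            if expected >= min_expected:
                ratios[(w1, w2)] = bi[(w1, w2)] / expected
    return ratios

rng = random.Random(1)
vocab = [f"w{i}" for i in range(1, 11)]
weights = [1 / i for i in range(1, 11)]      # Zipf-like, by assumption
independent = rng.choices(vocab, weights=weights, k=200_000)

# Toy "syntax": the same word may never occur twice in a row.
constrained = [independent[0]]
for w in independent[1:]:
    if w != constrained[-1]:
        constrained.append(w)

r_ind = pair_ratios(independent)
r_con = pair_ratios(constrained)
print("independent text, ratio range:", min(r_ind.values()), max(r_ind.values()))
print("constrained text, ratio for the forbidden pair w1 w1:", r_con[("w1", "w1")])
```

For the independent text the ratios should stay close to 1; in the constrained text the forbidden pairs drop to 0 and the remaining pairs rise above 1 to compensate.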

People have noticed that the first word on a line, or after an obstruction like a plant stem, is somewhat longer than average; while the last word or two are somewhat shorter.  This has been taken as a sign that Voynichese is not a natural language and/or that the encoding algorithm is sensitive to line breaks -- the "LAAFU" hypothesis.  However, the same deviations in word length have been seen to occur when text in any language is written within fixed margins with the banal line breaking algorithm ("write the next word on the same line if it fits, otherwise break the line before it").  This result not only could explain away the LAAFU data, but, if so, it would be evidence that the text was copied from a draft, with the Scribe ignoring the line breaks of the latter and inserting new breaks as needed.  
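The "banal" line breaking rule quoted above is easy to simulate. The sketch below is only a toy model: word lengths are drawn uniformly from 1 to 10 and the line width of 40 characters is arbitrary, so it says nothing about how closely the effect matches the VMS. It prints the mean length of line-initial and line-final words next to the overall mean, so the same check can be repeated on any real word list.

```python
import random
from statistics import mean

def wrap(words, width):
    """Greedy line breaking: put the next word on the current line if it fits,
    otherwise break the line before it."""
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, needed = [], len(w)
        current.append(w)
        used = needed
    if current:
        lines.append(current)
    return lines

rng = random.Random(0)
# Toy "text": only the word lengths matter here, not the letters.
words = ["x" * rng.randint(1, 10) for _ in range(50_000)]
lines = wrap(words, width=40)[:-1]        # drop the final, possibly short, line

overall = mean(len(w) for w in words)
first = mean(len(line[0]) for line in lines)
last = mean(len(line[-1]) for line in lines)
print(f"overall {overall:.2f}  line-initial {first:.2f}  line-final {last:.2f}")
```

In this model a long word is more likely than a short one to be the word that did not fit on the previous line, which is where the line-initial bias comes from.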

And so on. AFAIK, no one has found a statistical property of Voynichese that does not occur also in some natural language.

All the best, --stolfi


RE: "The Currier languages revisited" revisited - nablator - 22-12-2025

(22-12-2025, 10:05 PM)Jorge_Stolfi Wrote: And so on. AFAIK, no one has found a statistical property of Voynichese that does not occur also in some natural language.

AFAIK, no one has found a natural language that explains any of these statistical properties:
- Major inconsistencies in basic glyph and glyph-bigram, trigram statistics between pages, sections.
- Currier "language" drift and "dialects",
- Word pairs statistics: [link] [link]
- Frequent local similarities (reduplication and almost-reduplication) including insanely high levels of clustering of k/t gallows especially in Currier B, [link]
- Patterns across word breaks, by Emma M.S. and Marco P. [link]
- "Vertical pairs" by Tavie [link]
- Patrick Feaster's several statistical discoveries are yet to be explained by a quantitative study of any language,
- p/f gallows presence mostly on the first line of paragraphs,
- The absence of (function) words that are common in every section,
- I'm certainly forgetting others...


RE: "The Currier languages revisited" revisited - kckluge - 23-12-2025

(22-12-2025, 08:49 AM)MarcoP Wrote:
(20-12-2025, 08:35 AM)MarcoP Wrote: I would be curious to see a version of the plot colored by Scribe/Hand. 

As always, it’s possible I made errors. I used the [link] shared by Karl here to re-create the plot in [link], colored by hand and adding labels.

It seems the two clusters match the hands very well.
Scribe4 has pages in the two different clusters. The two pages that end up in the B cluster on the right are the Rosettes page I think (f85v1, I could never understand exactly page numbering for the large foldout) and the last zodiac page: Sagittarius (f73v).
Scribe3 has a single “A” page: f58r. The position of f58v, just across the gap, is also interesting. See also Karl's comment #37 just above.

[link], the zodiac pages show a drift from “close to A” to “close to B” (all in the range of Rene’s C/Cosmo intermediate language). [link] shows that the drift takes place both in circle and label text.

At first glance that looks right. I used "f85v1" (which, IIRC, is the lower left panel of the "rose" side of the big foldout) rather than Rene's "fRos" for reasons irrelevant to the discussion. Since the Zodiac pages have an ordering, I plotted the "no labels" coordinates for the 1st two eigenvectors with lines connecting each page to the next one (the last one, Sagittarius, is the point at the bottom at x ~= 0.025):

   

I don't see any coherent trend in that. If I'm counting correctly, the point sticking out to the right at y ~= 0.045 is Leo,
which lines up to some degree with Leo & Sagittarius being peaks in your plot of % EVA "ed" in the non-label circle text, but it's still solidly on the A language side of the gap in the side 3-D view ([link]). 

I have to acknowledge that the non-label Sagittarius text lands firmly on (if at the edge of) the B language side of the A/B gap, and I have to acknowledge that "Zodiac labelese" is statistically intermediate between the A & B non-label vocabularies. Having said that, I'm still worried about making assumptions about the relationship between label and non-label text on the same page, and I'm still not convinced the overall bigram frequency stats for the non-label text on the Zodiac pages support them being any kind of continuous transitional form between A & B. (I'm happy to entertain the possibility/likelihood that "Zodiac labelese" played a role in inspiring the construction of Herbal B, however.)


RE: "The Currier languages revisited" revisited - ReneZ - 23-12-2025

I am not sure I understand the proposed points for or against PCA analysis.

PCA is only used to find the largest dimensions in a multi-dimensional cloud of points.
The result can be used for visualisation. On a piece of paper or a screen, one can only show two dimensions.
To do statistics (e.g. clustering) one can use all dimensions.

When a multi-dimensional cloud of points is projected onto two dimensions, all coordinates other than these two are set to zero. This means that a lot of data is eliminated or lost. 
The problem is that in such projections, points that look very close to each other may be separated in the invisible dimensions. 
To avoid or reduce the impact of that, one can show several different projections next to each other. When this becomes more than 3, the human mind isn't really capable of fully grasping it.
However, it still allows one to see whether any grouping in each of the projections remains a grouping. 

There are other types of non-linear projections, which try to preserve the distances. These cannot preserve the scales of course. (I forget what these are called).


RE: "The Currier languages revisited" revisited - kckluge - 23-12-2025

(22-12-2025, 07:49 PM)Jorge_Stolfi Wrote:
(20-12-2025, 04:10 PM)oshfdk Wrote: then I see no way to meaningfully interpret the results of PCA at all.

If we knew nothing else about the pages in question -- if they were just text -- I'd agree with you (oshfdk). We do know more, however, because of the illustrations. The relative bigram frequency statistics (which PCA is just a method for visualizing) tell us a couple things:

* the qualitative differences in illustration type correspond in most cases to quantitatively distinct clusters in that space (the cosmological and zodiac page non-label text clusters together, the "Rose" sheet pages cluster in with Herbal B, and the Pharma pages cluster together in an interesting way with a subset of the Herbal A pages), and 

* within the set of pages with a single large plant drawing on them, the text falls into three quantitatively distinct clusters: (Herbal A, low folio numbers), (Herbal B), and (Herbal A, high folio numbers), and

* the cluster containing the high folio number Herbal A pages also contains the Pharma pages, which are also physically close to them in the binding.

Those aren't minor things to know; they constrain possible explanations of the text in substantial non-trivial ways.

Quote:* Chances of getting a useful insight from these analyses will improve if one uses fewer data so as to reduce the number of factors that affect them. Like comparing Herbal-A and Herbal-B only, thus hopefully eliminating the "topic" factor.  Then maybe one can figure out whether the difference is a change of spelling, or something else.  Once one gets some insight on Herbal-A vs. Herbal-B, one can then consider what is happening in other sections.

When using PCA to visualize things, absolutely. Post #5 shows this with plotting just the A language sections and just the B language sections.

Quote:* It is not surprising that word frequencies are different in each section.  Even for the most common words, which may or may not be "function" words like "much", "is", "find", "good"; not to mention "content" words like "herb", "star", "blood", etc.  

* If word frequencies change, digraph frequencies will change too, since they are determined by the digraphs that appear in the most common words.  As I mentioned before, "rb" is probably much more frequent in an herbal text (Latin or English) than in a text about astrology.  (Unless the latter talks a lot about "orbits"...)

* What is surprising is that (IIUC) Herbal-A differs from Herbal-B noticeably more than either differs from Bio or Stars. Thus, besides the difference of topic, we indeed have a difference of language or spelling (or encryption).  Maybe the texts of Herbal-A and Herbal-B were taken from sources in two different dialects, or two very similar languages.  Like Northwest Lower West Bavarian and Southwest Lower West Bavarian...

[...]

* And the chances of obtaining useful insights will improve a lot if one uses the good old "scientific method": make an hypothesis, then devise the simplest and most effective way to test for it, and do that.

All the best, --stolfi

...and the way to do that is to take (say) a half dozen languages from the early 15th century, and within each language take a set of samples with different authors and subjects (ideally the cross product of both of those), and then look at how the quantitative differences in relative bigram frequency space between those factors compare to the quantitative differences between sections in the Voynich Mss. That can be done by

* looking at the frequency data as points in an N-dimensional Euclidean space and using PCA and/or cluster analysis to visualize their distribution

* looking at them as vectors in an N-dimensional space and using cosine similarity, using heat maps or single-linkage hierarchical clustering to visualize their distribution

* looking at them as discrete probability distributions and using a Chi^2 value to measure similarity, visualizing as with cosine similarity
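The three ways of comparing samples listed above all start from the same relative bigram frequencies. A minimal sketch of the distance computations follows; the sample sentences are made up, the function names are hypothetical, and the symmetric chi-squared distance used here is only one of several variants that "a Chi^2 value" could refer to.

```python
import math
from collections import Counter

def bigram_freqs(text):
    """Relative frequencies of character bigrams inside words (spaces break bigrams)."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def euclidean_distance(p, q):
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def cosine_similarity(p, q):
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def chi2_distance(p, q):
    """Symmetric chi-squared distance: sum of (p - q)^2 / (p + q) over all bigrams."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 / (p.get(k, 0.0) + q.get(k, 0.0))
               for k in keys)

# Made-up sample sentences, far too short for any real conclusion.
samples = {
    "herbal_1": "the herb is boiled and the root is dried",
    "herbal_2": "boil the dried herb and the bitter root",
    "stars_1": "the star rises when the moon is in orbit",
}
freqs = {name: bigram_freqs(text) for name, text in samples.items()}
for a in samples:
    for b in samples:
        if a < b:
            print(a, b,
                  round(euclidean_distance(freqs[a], freqs[b]), 3),
                  round(cosine_similarity(freqs[a], freqs[b]), 3),
                  round(chi2_distance(freqs[a], freqs[b]), 3))
```

The resulting pairwise matrices are what would feed a heat map or a hierarchical clustering; with real data each sample would be a section or page rather than a single sentence.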

I haven't done that; I don't think anyone else has done that. If someone does, I'm very confident where I'm betting my chips (and I don't think it's the same place you'd bet yours).


RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 23-12-2025

(22-12-2025, 10:27 PM)nablator Wrote: AFAIK, no one has found a natural language that explains any of these statistical properties:

In some of those examples there has been no serious attempt at checking whether those statistical properties occur or not among natural languages.  The proponents just assumed that they "obviously" do not.

Quote:Major inconsistencies in basic glyph and glyph-bigram statistics between pages (some pages full of "or", "ol", "in" etc., some pages totally missing "e", "n", etc.)

Can you give examples of these "major inconsistencies" in running text of pages within the same section?  

Quote:Currier "language" drifts and dialects

Natural languages do have dialects.  And spellings may vary.  And word frequencies (hence bigram frequencies) will depend strongly on topic.

Quote:Word pairs statistics: [link] [link]

The first paper considers only Indo-European languages.  Not even Basque, Hungarian, Finnish, Estonian. Not Semitic languages (Arabic, Hebrew, Aramaic, Coptic, Ge'ez, Berber), not Kartvelian languages (Georgian, Mingrelian, Laz), not Turkish ...

... and not any East Asian languages, of course.  

Quote:Frequent local similarities (reduplication and almost-reduplication) including insanely high levels of clustering of k/t gallows especially in Currier B, [link]

This is one case where the proponents did not even bother to check whether such "insanely high levels of clustering" occur in English, much less in other languages -- they just assumed it would not.  But there are a number of reasons why such clustering could occur in a natural language:
  • Again, statistics of characters and bigrams are determined by their occurrences in the most frequent words -- or, in this case, consecutive word pairs. If the most frequent word pairs happen to have a certain letter in common, then that letter will have anomalously high duplication frequency.  For example if you extract the sequence of vowels of English (after mapping "oo" and "ee" to single letters), every occurrence of "it is" or "if it" or "if I" or "in it" or "if in" or "I did" etc will generate an "ii" pair.  When all word pairs are considered, "ii" is unlikely to come out just as common as predicted by the frequency of "i"s.  Indeed, if Voynichese did not have anomalous duplication, it would be evidence that it was not natural language, and would suggest that each word was generated independently by a random process.
  • The same argument above applies if certain topic-specific word pairs occur with significant frequency in a given section, like "hot tea" or "this star".
  • The definite article in Arabic is "al-", unless the next word starts with certain consonants such as "r", "s", or "z", in which case it becomes "ar-", "as-", or "az-".  Each occurrence of this rule enhances the frequency of "rr", "ss", and "zz".
  • Hungarian and Turkish have this thing called "vowel harmony". The vowels are divided in two sets, and all syllables of a word must use vowels from the same set. Thus the Turkish plural suffix is "-ler" or "-lar"; the plural of house "ev" is "evler", while that of  car "araba" is "arabalar".  These rules enhance the frequency of "ee" and "aa" (and other pairs) in the vowel sequence.  If these languages were written with each morpheme as a separate "word", this enhancement  would stand out even across multiple successive "words".
  • And you surely do not want to know about "tone sandhi"...
Maybe the "anomalous" duplication frequencies of natural languages are not as extreme as those of Voynichese.  Maybe they are even more extreme.  Either way, the proponents should have verified that...
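The check that the proponents are being asked for can be sketched in a few lines: extract a symbol sequence, count the doubled symbols, and compare with the count expected if symbols were placed independently. The code below does this for the English vowel example; collapsing "oo" and "ee" to "o" and "e" is one possible reading of "mapping to single letters", and the sample sentence is made up and far too short to prove anything.

```python
from collections import Counter

VOWELS = set("aeiou")

def vowel_sequence(text):
    """Vowels of the text in order, with 'oo' and 'ee' collapsed to one letter (an assumption)."""
    text = text.lower().replace("oo", "o").replace("ee", "e")
    return [c for c in text if c in VOWELS]

def doubling_ratio(seq):
    """For each symbol x: observed number of adjacent 'xx' pairs divided by the
    number expected if symbols were independent, (n - 1) * freq(x)^2."""
    n = len(seq)
    freq = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    return {s: pairs[(s, s)] / ((n - 1) * (c / n) ** 2) for s, c in freq.items()}

sample = "it is said that if it is in the tea then I did see it"
ratios = doubling_ratio(vowel_sequence(sample))
for vowel, ratio in sorted(ratios.items()):
    print(vowel, round(ratio, 2))
```

A ratio well above 1 for some symbol is the "anomalous duplication" under discussion. Running the same function on the gallows sequence of a transcription and on comparable texts in several languages would put the VMS figures in context.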

Quote:Patterns across word breaks, by Emma M.S. and Marco P. [link]
See the above answer, especially the first point.  Namely, the frequency of a character pair in this statistic is determined by its occurrence in the most common consecutive word pairs.  Every occurrence of "it is" in English increases the frequency of "t-i", and so on. Again, if Voynichese did not have anomalous frequencies of bigrams across word breaks, it would be evidence that it was not a natural language.

And the other points above also apply, mutatis mutandis. As people have immediately pointed out, even in English there is the rule for "a" or "an" depending on the first phoneme of the next word.

Quote:"Vertical pairs" by Tavie [link]
One "anomaly" discussed in that thread is that the first word (only) of a line is longer on average, while the last 1-3 words are shorter.  As I explained in the previous post, this sort of anomaly is a guaranteed result of the trivial line-breaking algorithm.  Does it explain precisely the length anomaly of the VMS? I don't know; but until this explanation is tested, the anomaly cannot be used as evidence of LAAFU and/or that Voynichese is not natural language.

The other anomaly discussed in that thread is the distribution of bigrams in the sequence that one gets by taking the first character of every line.  In that 301-line table there are many pairs whose frequencies do not match the numbers expected by the formula fr(XY) = fr(X) fr(Y).  However in most cases the numbers are small so it is hard to tell whether the discrepancies are significant.  If you throw 4000 balls into an array of 20x20=400 bins, perfectly at random, there will be some bins with highly anomalous counts, that deviate a lot from that formula.  
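The balls-and-bins remark above is easy to try. The sketch below throws 4000 balls uniformly at random into 400 bins and reports how far the fullest and emptiest bins stray from the expected 10 per bin; the thresholds of half and 1.5 times the expected count are arbitrary choices for the illustration.

```python
import random
from collections import Counter

rng = random.Random(0)
n_balls, n_bins = 4000, 400

counts = Counter(rng.randrange(n_bins) for _ in range(n_balls))
per_bin = [counts[b] for b in range(n_bins)]   # bins never hit count as 0
expected = n_balls / n_bins

# Arbitrary notion of "looks anomalous": at most half, or at least 1.5x, the expectation.
odd_bins = sum(c <= expected / 2 or c >= 1.5 * expected for c in per_bin)

print("expected per bin:", expected)
print("min / max observed:", min(per_bin), max(per_bin))
print("bins that look anomalous by eye:", odd_bins)
```

None of these deviations carries any meaning, which is why individual cells of a sparse bigram table need a significance test before they are read as anomalies.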

Eyeballing that table, the anomalies that seem statistically significant are the q-o, q-q, o-q, and o-o pairs.  That by itself says nothing about the language, but only about the formatting of the text.  

Here is one of many possible explanations for those anomalous pairs.  Check [link] (Swiss, paper, 410 pages, ~1430).  Note that, on this particular section, the scribes did not separate paragraphs as we do today.  Instead they seem to have marked the start of each sentence with a vertical slash through the first letter, and the start of each paragraph by underlining the first 1-3 words and placing a paragraph marker on the left margin, all in red ink.  Now suppose that the VMS scribe used a similar scheme: each sentence or clause in a paragraph may start anywhere in a line, and is marked there somehow; but then a q is also placed at the start of that line, to make it easier for the reader to find it.  Then, if each clause is two or more lines long, there will be no q-q pairs in the line initial sequence.

I can think of other explanations based on interactions between the grammar of the language, the average length of a sentence, and the initial-word-length anomaly discussed above.   But it is not worth detailing them here.  The point, overall, is that those q-o anomalies do not really imply LAAFU, much less "it is not a natural language".

Quote:Patrick Feaster's several statistical discoveries are yet to be explained by a quantitative study of any language
These anomalies all seem to be largely consequences of the anomalous distribution of words in line-initial and line-final position, discussed above.   There may be additional perturbations due to uncertain word spaces, which are expected to vary with position along a line.

And let me repeat again some advice for those doing statistical studies of the text: (1) don't try to analyze the whole book at the same time: if possible, limit the analysis to one or two sections with the same "topic" and to a single specific type of text, such as multiline paragraphs excluding the head lines; (2) formulate an hypothesis and then collect the statistics that would best prove or disprove it; and (3) check whether the "anomaly" occurs in natural languages, including non-"European" ones.

All the best, --stolfi


RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 23-12-2025

(23-12-2025, 12:55 AM)ReneZ Wrote: PCA is only used to find the largest dimensions in a multi-dimensional cloud of points. The result can be used for visualisation.

Consider for example two populations of points A and B.  Point set A has a distribution in 3-space that looks like an ellipsoid aligned with the X, Y, and Z axes, with diameters proportional to 4, 2, and 1 along these axes, respectively.  Point set B has a similar distribution, but displaced in Z and Y by +2 units each.

PCA should give three axes which are roughly u = X = (+1,0,0) , v = (0,+1,-1), and w = (0,+1,+1), with u having the largest eigenvalue.  Projecting the points on the uv (or uw) plane will show the two sets as overlapping.  But in fact they are well-separated by a horizontal plane.  To see this fact, the best axes are Y and any axis orthogonal to it.  Or (in this case) the v and w axes, skipping u.

The point is that the first 2 or 3 PCA axes are the directions along which the whole set has maximum extent. They are not necessarily the directions along which the centroids of the clusters of interest are maximally far apart, and do not necessarily include the direction(s) along which the clusters can be linearly separated.
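The general point can be illustrated numerically. The sketch below uses its own made-up numbers, not the ones in the example above: two Gaussian clouds with spreads 4, 2, and 0.1 along X, Y, and Z, the second shifted by 1 unit in Z only. PCA is computed with NumPy's SVD, and the "separation" score (gap between cluster means divided by the pooled within-cluster standard deviation) is just one convenient measure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
scales = np.array([4.0, 2.0, 0.1])       # assumed spreads along X, Y, Z
A = rng.normal(size=(n, 3)) * scales
B = rng.normal(size=(n, 3)) * scales + np.array([0.0, 0.0, 1.0])  # shifted in Z only

X = np.vstack([A, B])
Xc = X - X.mean(axis=0)

# PCA via SVD: rows of Vt are the principal axes, ordered by decreasing variance.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                        # coordinates of every point on PC1..PC3

def separation(col):
    """Gap between the two cluster means along one axis, in pooled std units."""
    a, b = col[:n], col[n:]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

sep = [separation(scores[:, k]) for k in range(3)]
print("variance along PC1..PC3:", np.round(s**2 / (len(X) - 1), 3))
print("cluster separation along PC1..PC3:", np.round(sep, 2))
```

With these numbers the two clouds overlap in the PC1-PC2 plot and only separate along the axis with the smallest variance, which a plot of the "first two components" never shows.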

All the best, stolfi.


RE: "The Currier languages revisited" revisited - ReneZ - 23-12-2025

All theoretically possible, but this is not what we are looking at here.
To be on the safe side I created plots involving six different 'base' vectors, which showed nothing of interest - just the cloud getting narrower and narrower.


RE: "The Currier languages revisited" revisited - kckluge - 23-12-2025

(20-12-2025, 04:26 PM)MarcoP Wrote: I find the subject of an A/B switch vs a progressive drift very interesting. It would be great if this problem could get a clear answer.

I hadn't previously realized that Scribe 4 apparently played a major role in the A/B transition or switch.

As always, it is likely I made errors, but I tried computing the frequency of the bigram ‘ed’ in each zodiac page. It’s lucky that the order of these pages is known. I processed Circle and Label text both separately and together. As a reference, the average frequency of 'ed' in Currier B is ~4%. 

After posting #44 earlier, I realized more clearly where our disagreement is coming from. Your plot above is addressing the question "to what extent is non-label word usage correlated with label word usage on a given Zodiac page as measured by % of words containing 'ed'?" And the plot does, in fact, show some level of correlation. The question I've been asking (perhaps not explicitly enough) is "to what extent are the non-label words drawn from the same underlying distribution as label words on a given Zodiac page as measured by overall bigram statistics?" If the answer to that question was yes, all three graphs on your plot would fall close to on top of each other. 
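For concreteness, the per-page statistic being discussed ("% of words containing 'ed'", computed separately for circle text, label text, and both together) could be computed along these lines. The word lists are toy examples in EVA-like spelling, not taken from any transcription, and the function name is hypothetical.

```python
def fraction_with_bigram(words, bigram="ed"):
    """Share of word tokens that contain `bigram` at least once."""
    if not words:
        return 0.0
    return sum(bigram in w for w in words) / len(words)

# Toy data for a single page; NOT real transcription data.
page = {
    "circle": ["otedy", "okal", "shedy", "daiin", "otal", "chedy"],
    "label": ["otaly", "okeedy", "otal", "okary"],
}

result = {kind: fraction_with_bigram(words) for kind, words in page.items()}
result["all"] = fraction_with_bigram(page["circle"] + page["label"])

for kind, value in result.items():
    print(f"{kind:6s} {100 * value:.0f}% of words contain 'ed'")
```

Note that this counts words containing the bigram, which is not the same number as the frequency of "ed" among all bigrams; the two should not be compared directly.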

The thing that's puzzling given your plot is why the Zodiac pages plot so differently in the larger bigram space with and without labels given that sample sizes make the "circles" and "all" graphs on your chart look very similar (at least for the 'ed' bigram). Best guess is that there are other bigrams where the difference is more marked or it reflects a larger pattern of difference over multiple bigrams.

It's possible to make an argument along the lines of: labels are statistically different from the other words on the page because they're nouns or proper names, and those have a different distribution than words in general. The problem with an explanation like that is it undercuts arguing for Zodiac labelese as a bridge between A & B non-label text because it's not an apples-to-apples comparison.