The Voynich Ninja
Can LAAFU effects be modeled? - Printable Version



RE: Can LAAFU effects be modeled? - RadioFM - 04-09-2025

(04-09-2025, 07:45 PM)quimqu Wrote: No no, I meant that this does not look good for Voynichese being a language.

Well it's certainly not any straightforward cipher, that's evident.
We know that the encoding is a 1-to-many mapping; the ubiquitous ain, aiin, aiiin sequences are replaceable (f27v doesn't use them); the scribe has a certain amount of freedom to choose how to encode a piece of text, which can lower n-gram counts. We also know that the 'echo' (reduplication and quasi-reduplication) is deliberate and that spelling is sometimes inconsistent: prefix an o with a q and you can still get a valid vord. Except for labels, we are certain a vord doesn't encode a whole plaintext word, so there's a chunking strategy that could potentially give the scribe even more degrees of freedom. And that's not even accounting for nulls: if all there is to LAAFU is prepending single letters at the line start and padding null sequences at the end, that too can mess up n-gram counts.


RE: Can LAAFU effects be modeled? - Jorge_Stolfi - 04-09-2025

(04-09-2025, 02:23 PM)nablator Wrote:
Quote:IIRC, the first word of each paragraph generally does not occur elsewhere in the book, not even on the same page.
It does happen:
[three links, hidden in this view]

Thanks! So the "rule" is more complex.  Looking at those three words:


pShol

  According to that (great!) browser, there are five occurrences of pShol.  Four are as the first word of the first line of a parag; of which only one in Herbal-A (as a separate word), two in Bio, and one in Pharma (all three possibly as part of a word):
 
    hea f8r.1     =pShol.Chor.otShal.~~~

    bio f80v.27   =pShol,kain.olkar.~~~
    bio f84r.20   =pShol,pChCFhdy.qokeedy.~~~

    pha f89r1.19  =pShol,Sheo.qoaIThy.~~~

  The fifth and most exceptional occurrence is on f1r, in the middle of a parag:

    unk f1r.17    -~~~.Sheo,pShol.dydyd.~~~
   
  This occurrence too may be just part of the word SheopShol.  It is not clear whether the Sheo was meant to be attached to the pShol, or the hook of the p intruded into what should be a word space.  Maybe the p is a "honorific capital" (to indicate a proper name?), or maybe a foreign phoneme.

pChor

  That website lists 12 occurrences, all on parag head lines. Six are in Herbal-A, of which three are definitely separate words; three as the first word of the line, the other two as the 2nd or 3rd word:
 
    hea f2v.1     =kooiin.Cheo,pChor.otaiin,o,dain.~~~
    hea f9v.5     =pChor,ypChChy.qotor.~~~
    hea f19v.1    =pChor.qodChy.dy.~~~
    hea f21r.1    =pChor.o,eeoCKhy.o,fyChey.~~~
    hea f45r.5    =kol,Sho.pChor.kChey.~~~
    hea f52v.1    =pChor.ChCPhol.CPhaiiin.~~~
   
  Three more occurrences are in Bio or in the two text-only pages at end of Bio; two are separate words at start of line, one (which actually has a weirdo instead of p) may be part of a word, 2nd or 3rd in the line:
   
    bio f79r.1    =torain.Shedy,pChor.or.~~~
    bio f83r.9    =pChor.CheCPhedy.~~~
    unk f86v5.27  =pChor.ypChor.aiin.~~~
   
  And the other three are in the Starred (Recipes) section. One is a separate word at start of line, the other two are in the middle of the line, and one of them may be part of a word:
   
    str f103v.12  =~~~.l,ol,Shedy.pChor,pChedy.~~~
    str f105v.18  =pChor.Chedaiin.okaiin.~~~
    str f113r.42  =~~~.tCheor.qokChedy.pChor.aral=

pChodaiin

  That site lists six occurrences, all but two on parag head lines.  Three are in Herbal-A, of which two are parag-initial; but only one is definitely a single isolated word:

    hea f14r.1    =pCho,daiin.Chopol.~~~
    hea f20r.10  -~~~.Chor.sody,pChodaiin.Chetody.~~~
    hea f47v.7    =pChodaiin.dair.dCThy-
   
  Another occurrence is in the Cosmo section, in the short phrase labeling the two rays that are joined together at 01:30, also as the first word:
   
    cos f67v1.19  =pChodaiin.otCh.oekeeo,dy=
   
  One is in the Pharma section.  It is the last word of the next-to-last line. Maybe it was meant to be the first word of the next line (which starts with a tooin that is mis-aligned and seems retraced), but it was written in that position because there was not enough space for the full last line:
   
    pha f102v1.12 =~~~.CKhey.pChodaiin=
   
  And the last occurrence is in the Starred (Recipes) section, again at the start of a parag head line:
   
    str f106v.34  =pChodaiin.kCheeor,al.ky.~~~
 


There may be many possible explanations for these occurrence patterns, but anyway they seem compatible with the VMS being what it seems to be.  They do not seem compatible with the text being the output of a gibberish generation algorithm, or with the illustrations being only a decoy for unrelated text.

So, can we salvage the claim that "parag head words with puffs do not occur elsewhere"?  Perhaps pShol and pChor are not the full names of the plants, but qualifiers like "wild", "red", "sweet", "mountain", "grass" etc -- the second more common than the first.  Then the claim may still hold for the full name of the plant, like "sweet cicuta" or "turtlegrass"; except that the name may appear again in Pharma and Starred.  Maybe...

All the best, --jorge

PS. Sorry for the spurious blanks in the EVA text -- it is the MyBB editor doing its thing again...


RE: Can LAAFU effects be modeled? - quimqu - 04-09-2025

First of all, thank you for this thread.

As a Data Scientist, I wanted to check for myself everything that has been said, especially René's thoughts in his post, which were very clearly explained and have given me a lot to think about. So I have prepared a 9-step pipeline on Kaggle to check all this and draw my own conclusions. I summarize it here:

Phase 1 – First glyphs in a line
I looked at the first symbols of each line.
I found, as René said, that the first line of a paragraph almost always starts with the set
{p, t, k, f}, while the other lines often start with {d, s, y, o, q}.
The difference is huge (about 84% with the rest of the glyphs) and highly significant. In other words, paragraph starts follow one rule, and the rest of the lines follow another. That’s very different from what we’d expect in a natural language.
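
A minimal sketch of how a Phase 1 tally could be computed; it is not the code from the Kaggle pipeline. It assumes the transcription has already been split into paragraphs, each a list of line strings, and that one character stands for one glyph (a simplification for EVA). The names first_glyph_profile and share are illustrative.

```python
from collections import Counter

def first_glyph_profile(paragraphs):
    """Count line-initial glyphs, split by paragraph-first vs. other lines.

    `paragraphs` is a list of paragraphs, each a list of line strings
    (assumed input format). Empty lines are skipped.
    """
    head, rest = Counter(), Counter()
    for lines in paragraphs:
        for i, line in enumerate(lines):
            line = line.strip()
            if not line:
                continue
            # i == 0 marks the first line of the paragraph
            (head if i == 0 else rest)[line[0]] += 1
    return head, rest

def share(counter, glyphs):
    """Fraction of counted lines whose first glyph is in `glyphs`."""
    total = sum(counter.values())
    return sum(counter[g] for g in glyphs) / total if total else 0.0

# Toy data, not real Voynich lines
toy = [["pchor.daiin", "daiin.chol", "sho.kaiin"], ["tchor.ol", "qokeedy.dal"]]
head, rest = first_glyph_profile(toy)
print(share(head, "ptkf"), share(rest, "ptkf"))
```

Comparing share(head, "ptkf") with share(rest, "ptkf") on the real transcription would give the kind of contrast described above.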

Phase 2 – Word endings
I checked how often words end with m or g.
Globally, I saw that m and g are much more common at the end of a line than in the middle. For example, around 15% of line-final words end in m, compared with only 1% in the middle of a line.
This looks like an artificial line-final decoration, not something natural.
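
A rough sketch of the Phase 2 comparison, under the assumption that lines are strings with words separated by "." (as in the EVA excerpts earlier in the thread). The function name final_glyph_rates is hypothetical and this is not the pipeline's own code.

```python
def final_glyph_rates(lines, glyphs=("m", "g"), sep="."):
    """Return (line_final_rate, mid_line_rate) of words ending in `glyphs`.

    A word is 'line-final' if it is the last word of its line; every other
    word counts as mid-line (this includes line-initial words).
    """
    end_hits = end_total = mid_hits = mid_total = 0
    for line in lines:
        words = [w for w in line.strip().split(sep) if w]
        if not words:
            continue
        for w in words[:-1]:
            mid_total += 1
            mid_hits += w.endswith(glyphs)   # str.endswith accepts a tuple
        end_total += 1
        end_hits += words[-1].endswith(glyphs)
    end_rate = end_hits / end_total if end_total else 0.0
    mid_rate = mid_hits / mid_total if mid_total else 0.0
    return end_rate, mid_rate

# Toy data, not real Voynich lines
print(final_glyph_rates(["daiin.chol.dam", "qokeedy.dar", "sho.otam.daiin"]))
```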

Phase 3 – Short lines and labels
I checked short lines (3 words or fewer).
The same line-start and line-end rules still apply to short lines. But for labels (the single-word captions next to drawings), the rules disappear. Labels behave differently: they don’t follow the line-start conventions, and they never end with m or g.
That makes sense if labels were generated differently from the running text.

Phase 4 – Vertical alternation of q
I looked at how often lines starting with q appear one after another.
If q starts happened at random, I would expect them to cluster sometimes. But instead, I found the opposite: q lines actively avoid following each other. Statistically, two q lines in a row are only about a quarter as common as expected.
It looks like the text was arranged to prevent such repetitions, as if someone wanted variety for the sake of appearance.
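
One way to make "about a quarter as common as expected" concrete, sketched under stated assumptions: lines are grouped into blocks (pages or paragraphs), adjacency is only counted inside a block, and the expectation treats each line as starting with q independently with the overall q-rate. The pipeline may define the baseline differently; q_adjacency is an illustrative name.

```python
def q_adjacency(blocks, glyph="q"):
    """Return (observed, expected) counts of consecutive q-initial lines.

    `blocks` is a list of blocks, each a list of line strings. The expected
    count assumes independence: n_adjacent_pairs * p**2, where p is the
    overall share of lines starting with `glyph`.
    """
    n_lines = n_q = n_pairs = observed = 0
    for lines in blocks:
        flags = [ln.startswith(glyph) for ln in lines]
        n_lines += len(flags)
        n_q += sum(flags)
        n_pairs += max(len(flags) - 1, 0)
        observed += sum(a and b for a, b in zip(flags, flags[1:]))
    p = n_q / n_lines if n_lines else 0.0
    return observed, n_pairs * p * p

# Toy data, not real Voynich lines
obs, exp = q_adjacency([["qa", "da", "qb", "qc"], ["sa", "qd"]])
print(obs, round(exp, 3))
```

The ratio observed / expected is then the figure to compare with the value quoted above.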

Phase 5 – Word length alternation
I measured how the length of a word relates to the length of the next word.
I had initially written that "in real languages, the correlation is usually slightly positive", but that was too strong. What I actually used was the standard Pearson correlation of word length vs. the next word’s length, which gave me a slightly negative value in the Voynich (around –0.07). That means it tends toward alternation (long-short, short-long).
Nablator pointed out that there is another way to define autocorrelation, by binarizing words into short or long relative to the mean, and then measuring how often pairs are concordant (short-short, long-long) vs. discordant (short-long, long-short). That measure and Pearson correlation are not the same, so it makes sense that they give different values. Both highlight different aspects of the rhythm.
From what I have seen in the literature, natural languages also tend to show slightly negative lag-1 autocorrelation (short words followed by long ones, and vice versa). At longer lags the correlation becomes slightly positive. So in this sense, the Voynich’s negative short-range autocorrelation is not anomalous: it actually resembles the short-term behavior of natural languages.
That said, when I scrambled the words within each line, the alternation became even stronger. This suggests that the original text has some structure that reduces alternation, meaning it is not purely random. In other words, the Voynich text does not behave like natural languages, but neither does it look entirely artificial.
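
The two measures discussed in this phase, and the within-line scramble, can be sketched as follows. Assumptions: words are separated by ".", pairs are only formed inside a line, and "long" means strictly longer than the mean word length. All function names are illustrative, and lag1_pearson will divide by zero if all lengths are equal.

```python
import random
from math import sqrt

def length_pairs(lines, sep="."):
    """(len(word), len(next word)) for adjacent words within each line."""
    pairs = []
    for line in lines:
        lens = [len(w) for w in line.split(sep) if w]
        pairs.extend(zip(lens, lens[1:]))
    return pairs

def lag1_pearson(lines, sep="."):
    """Pearson correlation between a word's length and the next word's."""
    pairs = length_pairs(lines, sep)
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    vx = sum((a - mx) ** 2 for a, _ in pairs)
    vy = sum((b - my) ** 2 for _, b in pairs)
    return cov / sqrt(vx * vy)

def concordance(lines, sep="."):
    """Share of adjacent pairs that are short-short or long-long."""
    lens = [len(w) for line in lines for w in line.split(sep) if w]
    mean = sum(lens) / len(lens)
    pairs = length_pairs(lines, sep)
    return sum((a > mean) == (b > mean) for a, b in pairs) / len(pairs)

def shuffle_within_lines(lines, rng, sep="."):
    """Scramble word order inside each line, keeping line membership."""
    out = []
    for line in lines:
        words = [w for w in line.split(sep) if w]
        rng.shuffle(words)
        out.append(sep.join(words))
    return out

toy = ["aa.bbbb.cc.dddd"]            # strictly alternating toy line
print(lag1_pearson(toy), concordance(toy))
print(shuffle_within_lines(toy, random.Random(0)))
```

Running lag1_pearson on the original lines and on many shuffled copies gives the comparison described in the last paragraph.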

Phase 6 – Repetitions across line breaks
I looked for repeated word sequences (bigrams or trigrams) that happen right at a line break.
When they repeat elsewhere, the break falls in the same position again. That means the line break itself is part of the "rule" — it isn’t just a visual convenience, it shapes the text generation.
In natural writing, repeats would not respect line breaks in this way.
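
A possible way to tabulate this, as a sketch only: for every word bigram, count how often it straddles a line break and how often it occurs inside a line. It assumes "."-separated words and treats an empty line as a paragraph boundary; break_vs_inline_bigrams is a hypothetical name and the pipeline may have used a different definition of "repeat".

```python
from collections import Counter

def break_vs_inline_bigrams(lines, sep="."):
    """Return (across, inline) Counters of word bigrams.

    `across` counts (last word of a line, first word of the next line);
    `inline` counts adjacent word pairs inside a line. An empty line
    resets the chain, so no bigram is formed across it.
    """
    across, inline = Counter(), Counter()
    prev_last = None
    for line in lines:
        words = [w for w in line.split(sep) if w]
        if not words:
            prev_last = None
            continue
        if prev_last is not None:
            across[(prev_last, words[0])] += 1
        inline.update(zip(words, words[1:]))
        prev_last = words[-1]
    return across, inline

# Toy data: the bigram (c, d) straddles a break twice and occurs inline once
across, inline = break_vs_inline_bigrams(["a.b.c", "d.e", "x.c", "d.y", "c.d.z"])
print(across[("c", "d")], inline[("c", "d")])
```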

Phase 7 – Steep vs. flat distributions
I measured how symbol frequencies drop off from most common to least common.
In first lines of paragraphs, the distribution falls off steeply: a few symbols dominate. In other lines, the curve is flatter: symbols are more evenly spread. So paragraph starts have a very different "profile".
This separation again suggests that the text is constrained by position in the layout, not by meaning.
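
"Steep" versus "flat" can be quantified in several ways; the post does not say which one was used. As an illustration only, the sketch below uses Shannon entropy (lower means steeper) and the share of the k most common symbols, treating each character as one symbol and ignoring separators.

```python
from collections import Counter
from math import log2

def glyph_counts(text, ignore=". ,"):
    """Count symbols, skipping word separators (assumed to be . , space)."""
    return Counter(ch for ch in text if ch not in ignore)

def glyph_entropy(text, ignore=". ,"):
    """Shannon entropy in bits of the symbol distribution."""
    counts = glyph_counts(text, ignore)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def top_k_share(text, k=3, ignore=". ,"):
    """Fraction of all symbols accounted for by the k most common ones."""
    counts = glyph_counts(text, ignore)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total

print(glyph_entropy("ab.cd"), glyph_entropy("aaaa"))
print(top_k_share("aaab.c", k=1))
```

Applying both functions to the concatenated paragraph-first lines and to the remaining lines would give two numbers per group to compare.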

Phase 8 – Stratified by scribe, Currier, and section
I repeated the earlier tests for different scribes, for Currier A vs. B, and for manuscript sections.
The same rules appear everywhere, but their strength varies. For example, Herbal A shows stronger constraints than Balneo.
This means the overall system is consistent, but each scribe or section had its own "dialect" of the rules.

Phase 9 – Methodological checks
I ran controls to make sure the patterns weren’t artifacts.
I tried collapsing special glyphs like sh, excluding very short lines, analyzing labels separately, and permuting lines to estimate chance. The patterns still held up under these tests.
That gave me confidence that what I’m seeing is real, not just a quirk of preprocessing.
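
The "permuting lines to estimate chance" control can be illustrated with a small permutation test. This is a generic sketch, not the pipeline's code: it shuffles line order, recomputes a statistic (here the number of adjacent q-initial lines), and reports how often the shuffled value is as low as the observed one. The names and the add-one correction are choices made for this example.

```python
import random

def adjacent_q_pairs(lines, glyph="q"):
    """Number of consecutive line pairs that both start with `glyph`."""
    flags = [ln.startswith(glyph) for ln in lines]
    return sum(a and b for a, b in zip(flags, flags[1:]))

def permutation_p_low(lines, stat, n_perm=1000, seed=0):
    """One-sided permutation test for an unusually LOW statistic.

    Returns (observed, p), where p is the add-one-corrected share of
    random line orders whose statistic is <= the observed value.
    """
    rng = random.Random(seed)
    observed = stat(lines)
    pool = list(lines)          # copy, so the caller's list is untouched
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        if stat(pool) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: q-lines strictly alternate with other lines
toy = ["qa", "da", "qb", "sa", "qc", "ya", "qd", "oa"]
print(permutation_p_low(toy, adjacent_q_pairs))
```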

In short, I have tried to summarize the rules described earlier and show how they appear in the data.

If anyone would like to check the details, I can share the full Python code so the results can be verified independently.

Note: To bring more clarity to this topic — which can be complex to explain, especially in English since it is not my native language — I have used ChatGPT to help summarize the concepts and make them easier to understand.


RE: Can LAAFU effects be modeled? - Jorge_Stolfi - 05-09-2025

(04-09-2025, 02:00 PM)Jorge_Stolfi Wrote: These are the most common two- and three-word sentences in the Herbal section of Culpeper's Herbal (17th century English). Here is the same for the novel I Promessi Sposi (modern Italian).  And here is the same for the Vulgate Genesis (Latin)

For comparison, here are the analogous lists for a Chinese text. The source is a set of transcripts of the Voice of America radio broadcasts in Mandarin Chinese, 1996-1998.  As before, for each N in {2,3,4}, the first 10'000 N-phrases (sequences of N consecutive words) were extracted, then the occurrences of each distinct N-phrase were counted, and the top 20 entries for each N were listed.  Here "word" means a single Chinese letter (= syllable).  Punctuation was ignored, but phrases spanning parag breaks were excluded. 

The Chinese letters in the original file were represented in the old GB2312 encoding.  They were transcoded to Unicode by a Linux utility (autogb).  The Pinyin and English equivalents were obtained through Google Translate.  They are very approximate since the N-phrases often contain only half of a two-character (two-syllable) compound word, whose meaning may be quite unrelated to that of its individual characters.
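
For readers who want to reproduce this kind of list, here is a minimal sketch of the procedure as described above. It is not the script that produced these tables: the transcoding uses Python's built-in gb2312 codec rather than autogb, punctuation is detected through Unicode categories, and the function names are made up for the example.

```python
from collections import Counter
from itertools import islice
import unicodedata

def decode_gb2312(raw):
    """Transcode GB2312 bytes to a Unicode string (stand-in for autogb)."""
    return raw.decode("gb2312")

def nphrases(paragraphs, n):
    """Yield N-phrases of consecutive characters, paragraph by paragraph.

    Each character is one 'word'. Whitespace and punctuation are dropped
    before forming phrases, so a phrase may span a comma, but phrases
    never span a paragraph break.
    """
    for parag in paragraphs:
        chars = [c for c in parag
                 if not c.isspace()
                 and not unicodedata.category(c).startswith("P")]
        for i in range(len(chars) - n + 1):
            yield ".".join(chars[i:i + n])

def top_nphrases(paragraphs, n, limit=10_000, top=20):
    """Count the first `limit` N-phrases and return the `top` most common."""
    counts = Counter(islice(nphrases(paragraphs, n), limit))
    return counts.most_common(top)

# Toy input, two paragraphs
toy = ["中国政府，中国", "美国"]
print(top_nphrases(toy, 2, top=3))
```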

N = 2:

   191 中.国        | Zhōngguó              |    China.
   115 美.国        | Měiguó                |    US.                                     
    61 台.湾        | Táiwān                |    Taiwan.                                 
    41 报.道        | Bàodào                |    Report.                                 
    40 日.本        | Rìběn                 |    Japan.                                 
    38 表.示        | Biǎoshì               |    Said.                                   
    36 国.的        | Guó.de                |    Chinese.               
    33 政.府        | Zhèngfǔ               |    Government.
    27 关.系        | Guānxì                |    Relations.                             
    27 问.题        | Wèntí                 |    Issue.                                 
    24 制.裁        | Zhìcái                |    Sanctions.                             
    24 星.期        | Xīngqí                |    This week.                             
    23 但.是        | Dànshì                |    However.
    23 钓.鱼        | Diàoyú                |    Diaoyu.                       
    23 鱼.岛        | Yú.dǎo                |    Islands.                             
    22 记.者        | Jìzhě                 |    Reporter.   
    21 北.京        | Běijīng               |    Beijing.                             
    21 国.家        | Guójiā                |    National.                             
    21 国.政        | Guózhèng              |    State.                               
    21 进.行        | Jìnxíng               |    Proceed.                             

N = 3:

    24 钓.鱼.岛     | Diàoyúdǎo             |    Diaoyu Islands.                         
    20 中.国.的     | Zhōngguó.de           |    China.                                 
    19 国.政.府     | Guó.zhèngfǔ           |    Government.                             
    18 李.登.辉     | Lǐdēnghuī             |    Lee Teng-hui.                           
    18 的.报.道     | De.bàodào             |    Report.                                 
    16 中.国.政     | Zhōngguó.zhèng        |    Chinese.                               
    16 国.之.音     | Guózhī.yīn            |    Voice of.
    16 美.国.之     | Měiguózhī             |    America.                       
    14 发.言.人     | Fāyán.rén             |    A spokesperson.                         
    14 对.中.国     | Duì.zhōngguó          |    On the China.
    14 问.题.上     | Wèntí.shàng           |    Question question.                     
    13 中.国.外     | Zhōngguó.wài          |    China.                                 
    13 台.湾.的     | Táiwān.de             |    Taiwan.                                 
    13 外.交.部     | Wàijiāo.bù            |    Ministry of Foreign Affairs.           
    13 报.道.说     | Bàodào.shuō           |    Report.                                 
    13 的.时.候     | De.shíhòu             |    When.                                   
    12 位.听.众     | Wèi.tīngzhòng         |    Listeners.                             
    12 各.位.听     | Gèwèi.tīng            |    Listen.                                 
    12 说.中.国     | Shuō.zhōngguó         |    Talking about Chinese.                 
    12 领.导.人     | Lǐngdǎo.rén           |    Leaders.                               

N = 4:

    16 美.国.之.音  | Měiguó.zhī.yīn        |    Voice of America.                       
    14 中.国.政.府  | Zhōngguó.zhèngfǔ      |    Chinese government.                     
    13 各.位.听.众  | Gèwèi.tīngzhòng       |    Listeners.                             
    10 下.面.请.听  | Xiàmiàn.qǐng.tīng     |    Now, please listen.                     
    10 两.岸.关.系  | Liǎng'àn.guānxì       |    Cross-strait relations.                 
    10 之.音.记.者  | Zhī.yīn.jìzhě         |    Voice of America reporter.             
    10 听.美.国.之  | Tīng.měiguó.zhī       |    Listen to America.                     
    10 国.之.音.记  | Guó.zhī.yīn.jì        |    Voice of America reporter.             
    10 请.听.美.国  | Qǐng.tīng.měiguó      |    Please listen to America.               
    10 面.请.听.美  | Miàn.qǐng.tīng.měi    |    Please listen to America.               
     9 中.国.外.交  | Zhōngguó.wàijiāo      |    Chinese Foreign Ministry.               
     9 国.外.交.部  | Guó.wàijiāo.bù        |    Ministry of Foreign Affairs.           
     9 外.交.部.发  | Wàijiāo.bù.fā         |    Ministry of Foreign Affairs.           
     8 中.国.大.陆  | Zhōngguó.dàlù         |    Mainland China.                         
     8 交.部.发.言  | Jiāo.bù.fāyán         |    Ministry of Foreign Affairs statement. 
     8 部.发.言.人  | Bù.fāyán.rén          |    Ministry spokesperson.                 
     7 发.来.的.报  | Fā.lái.de.bào         |    Report from.                           
     7 李.登.辉.的  | Lǐdēnghuī.de          |    Lee Teng-hui.                           
     7 来.的.报.道  | Lái.de.bàodào         |    Report from.                           
     6 出.口.银.行  | Chūkǒu.yínháng        |    Export-Import Bank.

I will comment on these results in the next post.


RE: Can LAAFU effects be modeled? - Jorge_Stolfi - 05-09-2025

(05-09-2025, 06:17 PM)Jorge_Stolfi Wrote: For comparison, here are the analogous lists for a Chinese text.

And here is the same analysis for another Mandarin Chinese text.  The source is the 18th-century novel Dream of the Red Mansion by Cáo Xuěqín. As before, for each N in {2,3,4}, the first 10'000 N-phrases (sequences of N consecutive words) were extracted, then the occurrences of each distinct N-phrase were counted, and the top 20 for each N were listed.  Here "word" means a single Chinese letter (= syllable).  Punctuation was ignored, but phrases spanning parag breaks were excluded. 

The Chinese letters in the original file were represented in the old GB2312 encoding.  They were transcoded to Unicode by a Linux utility (autogb).  The Pinyin and English equivalents were obtained through Google Translate.  They are very approximate since the N-phrases often contain only half of a two-character (two-syllable) compound word, whose meaning may be quite unrelated to that of its individual characters.

N = 2:

    52 雨.村       | Yǔcūn              | Yucun                         
    46 士.隐       | Shì.yǐn            | Shiyin                         
    24 笑.道       | Xiào.dào           | Laughed                       
    19 道.人       | Dàoren             | Taoist                         
    17 子.兴       | Zǐ.xìng            | Zixing                         
    16 不.知       | Bùzhī              | I don't know                   
    14 了.一       | Le.yī              | One                           
    14 如.今       | Rújīn              | As now
    13 封.肃       | Fēng.sù            | Fengsu                         
    13 有.一       | Yǒuyī              | There is                       
    13 那.僧       | Nà.sēng            | That monk                     
    12 不.能       | Bùnéng             | Can't                         
    11 原.来       | Yuánlái            | Originally                     
    10 一.个       | Yīgè               | One                           
    10 不.过       | Bùguò              | But                           
    10 两.个       | Liǎng.gè           | Two                           
    10 二.人       | Èr.rén             | Two                           
    10 听.了       | Tīng.le            | After hearing it
    10 心.中       | Xīnzhōng           | In his heart 
   
N = 3:   

     6 士.隐.听    | Shì.yǐn.tīng       | Shiyin listened               
     6 子.兴.道    | Zǐ.xìng.dào        | Zixing said                   
     6 那.道.人    | Nà.dàoren          | That Taoist                   
     5 女.学.生    | Nǚ.xuéshēng        | Female student                 
     5 段.故.事    | Duàn.gùshì         | A story                       
     5 笑.道.你    | Xiào.dào.nǐ        | Laughed                       
     5 道.人.道    | Dàoren.dào         | The Taoist said               
     5 隐.听.了    | Yǐn.tīngle         | Shiyin listened               
     4 世.人.都    | Shìrén.dōu         | Everyone knows                 
     4 个.儿.子    | Gè'er.zǐ           | A son                         
     4 了.士.隐    | Le.shì.yǐn         | Shiyin                         
     4 了.雨.村    | Le.yǔcūn           | Yucun                         
     4 人.都.晓    | Rén.dōu.xiǎo       | Everyone knows                 
     4 原.来.是    | Yuánlái.shì        | It turned out to be           
     4 家.娘.子    | Jiā.niángzǐ        | The lady of the family         
     4 忘.不.了    | Wàng.bùliǎo        | Unforgettable                 
     4 政.老.爷    | Zhèng.lǎoyé        | Master Zheng                   
     4 晓.神.仙    | Xiǎo.shénxiān      | Knowing the immortals         
     4 林.如.海    | Línrúhǎi           | Lin Ruhai                     
     4 甄.家.娘    | Zhēnjiāniáng       | Mother of the Zhen family 
     
N = 4:     

     5 士.隐.听.了  | Shì.yǐn.tīngle     | Shiyin listened               
     4 世.人.都.晓  | Shìrén.dōu.xiǎo    | Everyone knows                 
     4 人.都.晓.神  | Rén.dōu.xiǎo.shén  | Everyone knows about gods     
     4 晓.神.仙.好  | Xiǎo.shénxiān.hǎo  | They know that gods are good   
     4 甄.家.娘.子  | Zhēn.jiā.niángzǐ   | The Zhen family lady           
     4 空.空.道.人  | Kōngkōng.dào.ren   | The Taoist Kongkong           
     4 那.道.人.道  | Nà.dào.ren.dào     | The Taoist said               
     4 都.晓.神.仙  | Dōu.xiǎo.shénxiān  | Everyone knows about gods     
     4 雨.村.笑.道  | Yǔcūn.xiào.dào     | Yucun laughed                 
     3 一.僧.一.道  | Yī.sēng.yīdào      | A monk and a Taoist           
     3 一.段.故.事  | Yīduàn.gùshì       | A story                       
     3 两.个.儿.子  | Liǎng.gè.er.zǐ     | Two sons                       
     3 了.世.人.都  | Le.shìrén.dōu      | Everyone knows about gods     
     3 了.两.个.儿  | Le.liǎng.gè.er     | They have two sons             
     3 仙.好.只.有  | Xiān.hǎo.zhǐyǒu    | Only gods are good             
     3 子.兴.笑.道  | Zǐ.xìng.xiào.dào   | Zixing laughed                 
     3 朝.代.年.纪  | Cháodài.niánjì     | Dynasty and age               
     3 灵.秀.之.气  | Língxiù.zhī.qì     | A spiritual aura               
     3 神.仙.好.只  | Shénxiān.hǎo.zhǐ   | Only gods are good             
     3 那.僧.笑.道  | Nà.sēng.xiào.dào   | The monk laughed

Comments:

(1) The big difference between these two samples (VoA transcripts and novel) shows that these statistics, like many others, are a property of the text more than of the language.

(2) Nevertheless, the nature of the repeating phrases is strikingly different.  For N=2, the English list is dominated by pairs of "function" words in the general sense (136 of.the, 73 in.the, 43 it.is, 30 and.the, 26 for.the, ...), and even in Latin most entries have at least one "function" word.  But Mandarin has few "words" that are strictly "function words".  Most "words" (characters) that could be translated into English function words are also elements of dozens of compounds with different grammatical functions. Mandarin does not have articles, often dispenses with the verb "to be", etc.  Thus, in retrospect it is not surprising that the most common repeating phrases consist of "content" words.

All the best, --jorge