The Voynich Ninja
Inside the Voynich Network: graph analysis - Printable Version



RE: Inside the Voynich Network: graph analysis - Bernd - 02-11-2025

Ok, thank you!
I was afraid that the text size of more distinct Voynich 'dialects' would be too small.


RE: Inside the Voynich Network: graph analysis - quimqu - 03-11-2025

(01-11-2025, 06:58 PM)Jorge_Stolfi Wrote: You are sure that the other special characters [µß^~+`'°] are not deleted, remapped, or interpreted as separators at some point?

Hello Jorge,

I have checked, and the set of characters in the text I used is: ["'", '+', '-', '.', '^', '`', 'a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'x', 'y', 'z', '~', '°', 'µ', 'ß']
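A check like this can be sketched in a few lines of Python. This is only an illustration, not the script actually used in the thread; the function names and the sample string are made up for the example.

```python
def charset(text: str) -> list[str]:
    """Return the sorted distinct characters of text, ignoring whitespace."""
    return sorted({ch for ch in text if not ch.isspace()})


def missing_chars(text: str, expected: str) -> list[str]:
    """Return the characters of `expected` that never occur in text."""
    return sorted(set(expected) - set(text))


# Hypothetical snippet, not a real transcription line.
sample = "daiin.chedy-qokeey ßol"
print(charset(sample))
# Special characters from the question above that the sample does not contain.
print(missing_chars(sample, "µß^~+`'°"))
```

If `missing_chars` returns an empty list for the full input, none of the special characters was dropped or remapped before the analysis.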

Regards


RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 04-11-2025

New Portuguese and Spanish samples for @Quimqu: [link]

All texts are derived from the classical 1899 novel "Dom Casmurro" by Brazilian author Machado de Assis.

The "text" folder has the full texts for reference -- more or less as published, with capitals, punctuation, etc.  The "full", "head", and "tail" folders have the same texts, but converted to lowercase, with all punctuation (including apostrophe and hyphen) converted to blanks, and with numbers, symbols, and foreign-language words replaced by " * " words. Paragraphs are separated by blank lines. Words in each paragraph are separated by blanks and newlines.
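For readers who want to prepare their own samples in a similar way, the mechanical part of this normalization might look roughly like the sketch below. It is an assumption-laden approximation, not the procedure actually used: the function name is invented, symbols are simply turned into blanks rather than into "*" words, and foreign-language words are not detected at all (that step needs manual tagging).

```python
import re


def normalize(text: str) -> str:
    """Lowercase, blank out punctuation, and replace number tokens by '*'."""
    text = text.lower()
    # Any whitespace-delimited token containing a digit becomes a "*" word.
    text = re.sub(r"\S*\d\S*", " * ", text)
    # Everything that is not a letter, whitespace, or "*" becomes a blank
    # (this covers apostrophes and hyphens too).
    text = re.sub(r"[^\w\s*]|_", " ", text)
    # Collapse runs of blanks inside each line; blank lines between
    # paragraphs are preserved.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(lines)


print(normalize('Em 1899, disse-me: "Não!"\n\nCapítulo II'))
```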

The "full" files have all the words of the novel, which are ~65'500 for Portuguese and ~69'000 for Spanish. The "head" and "tail" files are the same texts, trimmed to the first ~38'000 tokens and the last ~38'000 tokens, respectively.  This token count should roughly match that of the VMS.  Thus there is some overlap between each head and the corresponding tail, which is ~10'500 words for Portuguese and ~7'000 for Spanish.  The last paragraph of "head" and the first paragraph of "tail" may be truncated.
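The overlap figures follow directly from the sizes: two windows of n tokens taken from the two ends of a text of N tokens share 2n - N tokens. A small sketch, with hypothetical function names, using the approximate counts quoted above:

```python
def head_tail(tokens: list, n: int) -> tuple[list, list]:
    """Return the first n and the last n tokens (n is assumed to be > 0)."""
    return tokens[:n], tokens[-n:]


def overlap(total: int, n: int) -> int:
    """Number of tokens shared by the head and tail windows."""
    return max(0, 2 * n - total)


print(overlap(65_500, 38_000))  # Portuguese: 10500
print(overlap(69_000, 38_000))  # Spanish: 7000
```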

The files are in the Unicode UTF-8 encoding.  Besides blank and newline, the characters in the "full", "head", and "tail" files should be 

  Spanish: abcdefghijklmnopqrstuvwxyz áéíóúñüªº ãç 
  Portuguese: abcdefghijklmnopqrstuvwxyz áéíóúâêôàãõçüº 

In each folder there are four files: "port-orig.txt" uses the original 1899 spelling (apart from capitalization), "port-curr.txt" uses the "modern" spelling (as of ~1999), "port-phon.txt" uses an ersatz phonetic spelling.  The files "span-curr.txt" have the Spanish translation of the novel, with (I suppose) current spelling.

The first three files in each folder should have an almost 1:1 mapping of tokens that matches the lexemes.  The differences are mostly hyphenated or separate words in one file that are joined in the other, and a few homographs and homophones that are distinguished or merged.  The phonetic files, in particular, have the adverbial suffix "mente" as a separate word.

There is no such simple mapping between tokens of the Portuguese and Spanish files, even though most lexemes are cognates. The word order and choice of synonyms are often different, etc.  I suppose that the Spanish file has more words because the Portuguese contractions are split: "da" <--> "de la" etc.

In any structural word analysis that is independent of a lexical mapping, the three Portuguese samples in each folder should be almost coincident. 

And each Portuguese sample should plot close to the corresponding Spanish sample, because, even though they are different languages by different authors (Machado and the translator), the underlying text is the same.

On the other hand, each "head" should map close to the corresponding "tail", because they are different texts but have the same author, the same language, the same topic, and the same literary genre and style.  They can therefore be used to test how sensitive the analysis is to sampling noise.

The most interesting comparison will be between a Portuguese "head" and a Spanish "tail" (or vice-versa) because they have in common only the topic and genre, and partly the style.

Hope it helps, --stolfi


RE: Inside the Voynich Network: graph analysis - Trithemius - 04-11-2025

Hey quimqu,

Nice work-- 

I have also done some graph analysis, but slightly different. In my graph, each node is a folio and an edge connects two pages when they share one or more words that appear at least twice in the entire text and are 4 or more characters long. I then did cosine normalization to prevent pages with more words from dominating so that the connections would reflect shared content rather than just... content/word count.

   

Anyway what I found was that the manuscript seems to cluster into two clear groups. The left is (I think) pretty clearly herbal A, and the right is the balneological section and the final text-only pages. I think from this graph we can infer that the final pages concern themselves with the balneological matters, rather than the herbal or astrological ones.
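For anyone who wants to try something similar, here is one possible reading of that construction as a Python sketch. It is a guess at the method, not the code actually used: the function name and the toy word lists are invented, and "cosine normalization" is interpreted here as the cosine similarity between per-folio count vectors over the eligible words.

```python
import math
from collections import Counter
from itertools import combinations


def folio_graph(pages: dict[str, list[str]], min_len: int = 4, min_count: int = 2):
    """Return {(folio_a, folio_b): cosine weight} for folios sharing eligible words."""
    # Eligible words: long enough and occurring often enough in the whole text.
    totals = Counter(w for words in pages.values() for w in words)
    vocab = {w for w, c in totals.items() if c >= min_count and len(w) >= min_len}
    # Per-folio count vectors restricted to the eligible vocabulary.
    vectors = {f: Counter(w for w in ws if w in vocab) for f, ws in pages.items()}
    norms = {f: math.sqrt(sum(c * c for c in v.values())) for f, v in vectors.items()}
    edges = {}
    for a, b in combinations(pages, 2):
        dot = sum(c * vectors[b][w] for w, c in vectors[a].items())
        if dot > 0:  # the two folios share at least one eligible word
            edges[(a, b)] = dot / (norms[a] * norms[b])
    return edges


# Toy data only; these are not real folio contents.
pages = {
    "f1r": ["daiin", "chol", "daiin", "shol"],
    "f2r": ["daiin", "chol", "okar"],
    "f75r": ["qokeedy", "shedy", "qokeedy"],
    "f103r": ["qokeedy", "shedy", "ol"],
}
for pair, weight in folio_graph(pages).items():
    print(pair, round(weight, 3))
```

Dividing by the vector norms is what keeps long pages from dominating: a folio with many words gets a large dot product with almost everything, but also a large norm.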


RE: Inside the Voynich Network: graph analysis - quimqu - 04-11-2025

(04-11-2025, 07:32 PM)Trithemius Wrote: Hey quimqu,

Nice work-- 

I have also done some graph analysis, but slightly different. In my graph, each node is a folio and an edge connects two pages when they share one or more words that appear at least twice in the entire text and are 4 or more characters long. I then did cosine normalization to prevent pages with more words from dominating so that the connections would reflect shared content rather than just... content/word count.



Anyway what I found was that the manuscript seems to cluster into two clear groups. The left is (I think) pretty clearly herbal A, and the right is the balneological section and the final text-only pages. I think from this graph we can infer that the final pages concern themselves with the balneological matters, rather than the herbal or astrological ones.

Nice work!
Yes, you can take a look at my thread about [link]. There, it is also clear that the last pages "talk" about balneological matters, but also about herbal ones. You can see how the automatically found topics are distributed throughout the MS.


RE: Inside the Voynich Network: graph analysis - Trithemius - 04-11-2025

(04-11-2025, 08:45 PM)quimqu Wrote:
Nice work!
Yes, you can take a look at my thread about [link]. There, it is also clear that the last pages "talk" about balneological matters, but also about herbal ones. You can see how the automatically found topics are distributed throughout the MS.

Thanks! This looks like interesting work, I'll check it out.


RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 05-11-2025

(04-11-2025, 07:32 PM)Trithemius Wrote: Anyway what I found was that the manuscript seems to cluster into two clear groups. The left is (I think) pretty clearly herbal A, and the right is the balneological section and the final text-only pages. I think from this graph we can infer that the final pages concern themselves with the balneological matters, rather than the herbal or astrological ones.

Nice! That confirms several other studies that found clustering of pages by section. Like Currier's "languages", and the recent work of @QuimQu.  Or this old one of mine:

[link]

[link]

As for the clusters, while Bio and StarrdParags (SPS) overlap in that plot, they are not identical, and can be separated by other statistical criteria.

All the best, --stolfi


RE: Inside the Voynich Network: graph analysis - quimqu - 05-11-2025

Here are the plots with Jorge Stolfi's "Dom Casmurro" samples and the Manx text from Ptrarsi.

                       

The four Dom Casmurro versions (old Portuguese, modern Portuguese, phonetic Portuguese, and Spanish) behave almost the same. Their networks have very similar clustering, path length, and degree patterns. That's expected, since they express the same content and share close grammatical roots. The Spanish version is only slightly denser, likely because Spanish uses more short function words, just as Jorge predicted.

Compared with Lazarillo de Tormes, Dom Casmurro is more uniform and modern in structure. Lazarillo shows greater lexical variety and less repetition, giving it a sparser, more modular network. This may reflect the difference between current and older Spanish: 16th-century Spanish has a less standardized syntax and makes heavier use of subordinate clauses.

The Manx text is clearly different (but still very far from the Voynich). It forms a compact network with shorter paths and higher clustering. It seems to be a morphologically rich language.
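For reference, the kind of quantities compared here (density, clustering, path length) can be computed with a few lines of standard-library Python. The sketch below assumes a simple word-adjacency network, where two distinct words are linked if they ever occur next to each other; that construction and all function names are assumptions for illustration and may differ from the networks behind the plots.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_graph(tokens: list[str]) -> dict[str, set[str]]:
    """Undirected graph linking words that occur consecutively."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def density(adj) -> float:
    """Fraction of possible edges that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def avg_clustering(adj) -> float:
    """Mean local clustering coefficient (0 for nodes of degree < 2)."""
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k >= 2:
            links = sum(1 for u, v in combinations(nb, 2) if v in adj[u])
            total += 2 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def avg_path_length(adj) -> float:
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


g = build_graph("a b c a d".split())
print(density(g), avg_clustering(g), avg_path_length(g))
```

A "compact network with shorter paths and higher clustering" would show up as a lower `avg_path_length` and a higher `avg_clustering` than the other samples at comparable size.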


RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 05-11-2025

(05-11-2025, 01:42 PM)quimqu Wrote: Here are the plots with Jorge Stolfi's "Dom Casmurro's"

Great!

The most interesting result is the proximity of the Spanish "head" texts to the Portuguese "tail" ones, and vice-versa.  Those are essentially distinct texts, in rather different languages and quite different spellings, technically by distinct authors (Machado in the latter, and Tapía translating Machado in the former).  What they had in common was the higher-level nature and style of the work (grammatical variety, clause length, predominant verbal tenses, etc.), the general topic (which determined proper names and common concepts and actions) and whatever part of the author's style could survive the translation.

Quote:Compared with Lazarillo de Tormes, Dom Casmurro is more uniform and modern in structure. Lazarillo shows greater lexical variety and less repetition, giving it a sparser, more modular network. This may reflect the difference between current and older Spanish: 16th-century Spanish has a less standardized syntax and makes heavier use of subordinate clauses.
 

Is Lazarillo the work of a single author, or a collection of stories by different authors that aggregated over a long time?

(When I was a kid I used to read an Italian children's magazine that once published the adventures of Lazarillo de Tormes as a graphical summary.  The story was so popular with the readers that, after the original ran out, the magazine started to publish "New Adventures" with the same boy in similar settings and plots.  I suppose that the same must have happened right after the original appeared in Spain...)

(And some time ago I learned that the longest known poem in the world is believed to be the Tibetan [link], with 120 volumes and over a million verses.  Which of course were added by countless poets and bards over many centuries and several countries of Central Asia...)

Quote:The Manx text is clearly different (but still very far from the Voynich). It forms a compact network with shorter paths and higher clustering. It seems to be a morphologically rich language.

AFAIK Celtic languages differ from Romance and English in that they still have noun inflections for grammatical case, besides gender and number.  And  some may attach a definite article to the noun.


RE: Inside the Voynich Network: graph analysis - quimqu - 05-11-2025

(05-11-2025, 04:59 PM)Jorge_Stolfi Wrote: Is Lazarillo the work of a single author, or a collection of stories by different authors that aggregated over a long time?

The author is anonymous. Here you can find the entry in Spanish Wikipedia: [link]