The Voynich Ninja
Repetition of words - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Voynich Talk (https://www.voynich.ninja/forum-6.html)
+--- Thread: Repetition of words (/thread-4944.html)


Repetition of words - Mark Knowles - 24-09-2025

It is possible for the same word to be repeated in a text in English and, I think, in other European languages, although I think it is quite uncommon to find the same word repeated. However, it is quite common to find the same words repeated in the Voynich manuscript. Has anyone researched how often words are repeated in other contemporary manuscripts? What is the probability that the following word will be the same as the previous word?


RE: Repetition of words - Mark Knowles - 24-09-2025

I have been thinking about a simple little algorithm, which is to remove all instances of words that are ever repeated in sequence in the manuscript. So, for example, taking the sentence "I had a thought and I had had the opportunity to design this algorithm", the word "had" is repeated, so all instances of this word should be removed from the text, leaving "I a thought and I the opportunity to design this algorithm". This algorithm should be iterated until there are no words that repeat in the text. It would be intriguing to see what the properties of the resultant text are and how much they differ from the original text. Quite a lot of words should be removed, since a word that repeats just once on just one page of the manuscript should be deleted from the whole text of the manuscript. Words that only become repeats once all other repeating words have been deleted should also be deleted.
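
A minimal sketch of the procedure described above, assuming the text has already been split into one flat list of word tokens and that repeats across line or page breaks count as consecutive (both are assumptions; the name `strip_adjacent_repeaters` is hypothetical):

```python
def strip_adjacent_repeaters(tokens):
    """Delete every word type that ever occurs twice in a row, then
    repeat on the shortened text until no adjacent repeats remain.

    Returns (remaining_tokens, removed_word_types).
    """
    tokens = list(tokens)
    removed = set()
    while True:
        # Word types that currently sit next to a copy of themselves.
        repeaters = {a for a, b in zip(tokens, tokens[1:]) if a == b}
        if not repeaters:
            return tokens, removed
        removed |= repeaters
        # Drop *all* instances of those types, not just the doubled ones.
        tokens = [t for t in tokens if t not in repeaters]


sentence = "I had a thought and I had had the opportunity to design this algorithm"
remaining, removed = strip_adjacent_repeaters(sentence.split())
print(" ".join(remaining))
print(removed)
```

The loop matters because deleting one word can bring two copies of another word together: in the toy sequence `a b b a c`, removing `b` leaves `a a c`, so `a` is removed on the next pass as well.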

Of course, I think that words that repeat have a very high probability of being filler words. There may be some filler words that don't ever repeat, but hopefully this process should help to prune the text considerably of many of the filler words.


RE: Repetition of words - Mark Knowles - 24-09-2025

A question that I am keen to answer is on average how many examples of repeated words should there be in a typical Latin manuscript of the period. I am inclined to assume that the answer is very few.


RE: Repetition of words - Mauro - 25-09-2025

(24-09-2025, 09:51 PM)Mark Knowles Wrote: A question that I am keen to answer is on average how many examples of repeated words should there be in a typical Latin manuscript of the period. I am inclined to assume that the answer is very few.

If I understood exactly what you mean, I think this depends on the length of the text: the longer a text, the more probable it is that a word will appear at least twice.

I made a test with De Bello Gallico: it has 11030 word types, of which 6314 are hapax legomena. The whole text is 51503 words long, so most of the text (88%) would be culled by your procedure, but only ~43% of the vocabulary.
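
The two percentages follow from the counts quoted above. A small check, assuming "culled" here means that every token of a word type occurring more than once is removed, so that only the hapax legomena survive:

```python
word_types = 11030   # distinct word types in De Bello Gallico (from the post)
hapaxes = 6314       # word types occurring exactly once
tokens = 51503       # total running words

# Each hapax contributes exactly one token, so the surviving text is
# `hapaxes` tokens long and everything else is culled.
tokens_culled = (tokens - hapaxes) / tokens
types_culled = (word_types - hapaxes) / word_types

print(f"text culled:       {tokens_culled:.1%}")
print(f"vocabulary culled: {types_culled:.1%}")
```

This reproduces the roughly 88% of the text and 43% of the vocabulary given above.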


RE: Repetition of words - Jorge_Stolfi - 25-09-2025

(25-09-2025, 12:40 PM)Mauro Wrote: I made a test with De Bello Gallico: it has 11030 word types, of which 6314 are hapax legomena. The whole text is 51503 words long, so most of the text (88%) would be culled by your procedure, but only ~43% of the vocabulary.

I understood that Mark considers deleting words that are repeated in consecutive positions.


RE: Repetition of words - Mark Knowles - 25-09-2025

(25-09-2025, 12:52 PM)Jorge_Stolfi Wrote: I understood that Mark considers deleting words that are repeated in consecutive positions.

Exactly. The Voynich manuscript has many words that repeat consecutively, maybe twice or three times in succession. I suspect that such words are very very likely to be null or filler words, so I am curious as to what the Voynich text might look like with all such words removed.

By repeated words I don't mean words that occur more than once in the manuscript; they must be words that are written next to each other in sequence. Once some words are removed there may be other words that are repeated next to each other, so they must be removed and this process repeated until there are no words in the Voynich text that are repeated in succession. I may implement this myself, but I wanted to put it down as a thought first.


RE: Repetition of words - Mauro - 25-09-2025

(25-09-2025, 03:18 PM)Mark Knowles Wrote: Exactly. The Voynich manuscript has many words that repeat consecutively, maybe twice or three times in succession. I suspect that such words are very very likely to be null or filler words, so I am curious as to what the Voynich text might look like with all such words removed.

By repeated words I don't mean words that occur more than once in the manuscript; they must be words that are written next to each other in sequence. Once some words are removed there may be other words that are repeated next to each other, so they must be removed and this process repeated until there are no words in the Voynich text that are repeated in succession. I may implement this myself, but I wanted to put it down as a thought first.

Indeed, I was unsure. I get it now, sorry. My guess is that the text will change very little, if at all, but it's just a guess (I know there are languages where reduplication is an important feature, so it may be different with one of those, but my knowledge does not go beyond what the Wikipedia article says).


RE: Repetition of words - anyasophira - 26-09-2025

OK, a quick test by eye. It may have errors, but let’s try it. Let’s take folio 108v: there are 80 unique words that repeat at least once.

The top repeats are:
• qokeedy → 28 times
• qokeey → 21 times
• aiin → 14 times
• okeey → 13 times
• daiin → 11 times
• chey → 11 times
• ol → 9 times
• shedy → 9 times
• al → 9 times
• r → 9 times

Etc.

Here are the repeated words using Keys 

=== Page: (link not shown) ===
Total Tokens: 620
Total Keys: 2383


All Token Frequencies:
  D1A1Q1K3A2: 17
  A3G1: 16
  A1B2: 13
  A1Q1K3A2: 13
  A3B2: 12
  D1A1Q1K3B1A2: 12
  K1J1A2: 11
  B1A3G1: 9
  B2: 9
  K1J1A1B2: 9
  K1J1B1A2: 9
  A3F1: 8
  C1: 8
  D1A1Q1K2B1A2: 8
  A3C1: 6
  A1: 5
  B1A3F1: 5
  D1A1Q1J1A2: 5
  D1A1Q1J1B1A2: 5
  K1J1A3B2: 5
  L1J1B1A2: 5
  L1K3A2: 5
  A1Q2J1B1A2: 4
  A1Q2K3A2: 4
  A3B3: 4
  C2A3F1: 4
  D1A1B2: 4
  D1A1Q1K3BaA2: 4
  K1K3A2: 4
  L1J1A2: 4
  L1K2A2: 4
  A1C1: 3
  A1Q1A3B2: 3
  A1Q2K2A2: 3
  A1Q2K3B1A2: 3
  A3Ca: 3
  B2K1J1B1A2: 3
  B2Q1J1B1A2: 3
  B2Q1K2B1A2: 3
  C1A3G1: 3
  C2A3G1: 3
  D1A1Q1A3B2: 3
  D1A1Q1A3F1: 3
  D1A1Q1K2BaA2: 3
  D1A1Q2J1BaA2: 3
  K1J1U1A2: 3
  A1Q1A3F1: 2
  A1Q1A3G1: 2
  A1Q1J1A1B2: 2
  A1Q1J1A2: 2
  A1Q1J1B1A2: 2
  A1Q1K2B1A2: 2
  A1Q1K3B1A2: 2
  A1Q1K3BaA2: 2
  A1Q2A3B2: 2
  A1Q2A3G1: 2
  A3: 2
  B1A3C1: 2
  B2A1: 2
  B2K1J1A2: 2
  B2K1J1BaA2: 2
  B2Q1A2: 2
  B2Q1K3A2: 2
  D1A1: 2
  D1A1Q1A3C1: 2
  D1A1Q1J1BaA2: 2
  D1A1Q1K2A2: 2
  D1A1Q1K3A1: 2
  D1A1Q2J1A2: 2
  D1A1Q2K2B1A2: 2
  K1A1: 2
  K1A1B2: 2
  K1A3B2: 2
  K1A3C1: 2
  K1J1B1A3C1: 2
  K1J1BaA2: 2
  K1U1A2: 2
  L1J1BaA2: 2
  L1J1U1A2: 2
  Q1A3C1: 2
  Q1J1A1B2: 2
  Q1K2B1A2: 2
  Q1K2BaA2: 2
  Q1K3A3C1: 2
  Q2J1B1A2: 2

I took away all repeats, not just those in a row, because if it doesn’t work for that, I’m not sure it would work for a smaller intervention like removing only words that occur twice in a row.


pchedal.ltal.oteo.fcheey.otedar
shal.araiin.okam
ssheedal.chedalkedy.qokechedy.otedain.lol
dchedy.qokeed.arain
polaiin.otchedy.raraiin
oeeedain.lokeey.loety.kedarxy
pchedaiin.otedal.lkedeed.okedar.qoteol.otey
ysheedy.chedal.qoteedar.oty
chedam.chlal.okaldain.sheed.cthdy
leedaiin.shckhaiin.shee.ka{ith}y.chectham

tshedky.akey.teeody.tedam


pcheor.okear.ykeealkey.opsholal.ofaramoty
lshey.qokeeal.qotal.teedu
lkeed.qokeeos.qopchdy.qokeshdy.kedar.otal.raram
cheykedy.qoeedy.rair
qokeeol.olcheeey
polkeedal.lotedaiin.opchedaiin.otshedy
shol.olkeeey.oteedain.cheody.llod
oloeeedy
qokeeor.qoeey.chodaal.checthal.cheeky.lteedy
chekeek.odar.tchar.okeedaiin.oram
sheeol.kaiin.ctheo.qokeeaiin
sair.okeaiin.dlaiin.lkar.ldyr.ls
sal.lkeeey.olcheees.ykar.okalam
eees.lal.lkeeedy.otain.chol
chedeey
polshedaiin.qokeoy.chokeol.qotain
qokedain.checkhed
qokedar.olkeedy.qokam
lchs.cheedy
tolshey.ochey.cholkeedy
lchedam
shechy.chor.otalys
sheor.sheckhey.dalam
dsheol.otam
lkedain.rar.chear.olor.chedaiin
sheeal
checshal.dam
saar.okary
teokeey.chkoey
shkeeol.oteedy
keeey.tam.okain
choeear.odaiin.oloteol
chedy.cthlo
ckhoiin
tar.alkeey.chokeor.okor
pshedal
shaiin.okldaiin.otal.sor.lkar.oteal
dchey.sheo
qopar.qokear.qokey.chair
shekeeal.otar
okeey.qotar
slaral.kchey.okoeey
oky.shecthey
okydaiin.qokeeo
otaraiin
qoteeal.okaldar
pchey
shkey.dadain.olair.oky.qoked
qoteo
chedal.oky.otar.okear
qoeeey
oteo.shey
qoky.dam
qokeor.okear.qoteodal
otam
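
The filter described before the listing (dropping every word that occurs more than once on the page, not only those repeated in a row) could be sketched as follows. This is an illustration only, not the script that produced the listing above; it assumes one string per text line with words separated by ".", and `keep_hapaxes` is a hypothetical name.

```python
from collections import Counter


def keep_hapaxes(lines, sep="."):
    """Keep only words that occur exactly once across all the given lines."""
    split_lines = [[w for w in line.split(sep) if w] for line in lines]
    counts = Counter(w for words in split_lines for w in words)
    filtered = []
    for words in split_lines:
        kept = [w for w in words if counts[w] == 1]
        filtered.append(sep.join(kept))
    return filtered


# Toy input in the same dotted layout; not real transcription data.
page = ["qokeedy.shal.qokeedy.okam", "daiin.lol.daiin"]
print(keep_hapaxes(page))
```

Counting over the whole page before filtering means a word is dropped even when its two occurrences sit on different lines.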




... What is left, reading from left to right, still seems limited and repetitive. Certain letters must be in certain positions. But I welcome insight; maybe I didn’t interpret this correctly.


RE: Repetition of words - Jorge_Stolfi - 26-09-2025

(24-09-2025, 09:51 PM)Mark Knowles Wrote: A question that I am keen to answer is on average how many examples of repeated words should there be in a typical Latin manuscript of the period. I am inclined to assume that the answer is very few.

I am trying to check that on my Latin samples.  Unless I goofed, there are no consecutive repetitions in the Alchemical Herbal that Marco transcribed.  

I got a dozen or so in other samples, like 

  ... vocatur ab aristotele policernia policernia autem secundum quosdam ...

in Ockham's "Dialogus"; but they all seem to be across punctuation.  The full text of the above is 

  ... Principans autem in civitate aliquando vocatur ab Aristotele policernia. Policernia autem secundum quosdam tres habet significationes ...

Should these count?  The VMS does not seem to have any punctuation, which I gather was still not uncommon in manuscripts from the 1400s.  But Marco's herbal has no consecutive repeats even after removing the punctuation...
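
Whether such cases count comes down to whether punctuation is treated as a break. A rough sketch of both readings, assuming plain-text input, case folding, and a small fixed set of punctuation marks (all assumptions, and not the scripts used for the counts in this post; `adjacent_repeats` is a hypothetical name):

```python
import re


def adjacent_repeats(text, break_on_punctuation=True):
    """Return the words that are immediately followed by themselves.

    If break_on_punctuation is True, a repeat separated by . , ; : ! ?
    is not counted; otherwise punctuation is simply ignored.
    """
    if break_on_punctuation:
        segments = re.split(r"[.,;:!?]", text)
    else:
        segments = [re.sub(r"[.,;:!?]", " ", text)]
    repeats = []
    for segment in segments:
        words = segment.lower().split()
        repeats += [a for a, b in zip(words, words[1:]) if a == b]
    return repeats


sample = ("vocatur ab Aristotele policernia. "
          "Policernia autem secundum quosdam tres habet significationes")
print(adjacent_repeats(sample, break_on_punctuation=True))
print(adjacent_repeats(sample, break_on_punctuation=False))
```

On the Ockham excerpt the first call finds nothing and the second finds "policernia", which is exactly the ambiguity in question.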

On the other hand, consecutive repeats are quite common in my Chinese samples.  From the Mandarin novel Dream of the Red Chamber:

 count   share word
    44 0.12536 lao3.1
    32 0.09117 tai4
    21 0.05983 mei4
    17 0.04843 nai3
    14 0.03989 lao3
    12 0.03419 mu3.1
    12 0.03419 yi1
    11 0.03134 jie3
    ...

Here the first line means that the character with third-tone pinyin "lǎo", homophone 1 -- which may be  老 = "old", not sure yet -- occurs 44 times in consecutive positions, as  老老 = "old man".  That is 12.5% of all consecutive word repeats.  But again, these counts may or may not ignore punctuation -- I must check my messy scripts.
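
A table in this layout (count, share of all consecutive repeats, word) could be produced along the following lines. This is only a sketch, not the script behind the numbers above; the token list is a made-up toy input and `repeat_table` is a hypothetical name.

```python
from collections import Counter


def repeat_table(tokens):
    """Rows of (count, share, word) for words occurring twice in a row."""
    pairs = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)
    total = sum(pairs.values())
    return [(n, n / total, w) for w, n in pairs.most_common()]


# Toy pinyin-like tokens, invented for illustration.
tokens = "lao3 lao3 shuo1 tai4 tai4 he2 lao3 lao3 lai2".split()
for count, share, word in repeat_table(tokens):
    print(f"{count:6d} {share:.5f} {word}")
```

Note that this sketch counts adjacent pairs, so a word written three times in a row contributes two repeats. Read the other way, the shares in the table above imply about 351 consecutive repeats in total (44 / 0.12536).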

In my Vietnamese Pentateuch sample the word "đời" = "life" occurs 3 times as "đời đời"= "eternal".  Same caveat about punctuation applies.  However, since this sample is a translation from English, probably by Western missionaries, word doubling may have been unconsciously avoided because it is very bad by English literary standards.  I must get hold of a text written by a native...

From my Tibetan sample -- A Play of Mistaken Illusion, being some Frank Notes on the Nature of His Life written by a Certain Person Laboring under the Delusion that He Belongs to the String of Reincarnations of Jangchub Chupel [= Byang Chub Chos 'Phel], a Holder of the Throne of Je Tsongkapa by Skyabs Rje Khri Byang Rin Po Che [= Kyabje Trijang Rinpoche] (1901--1981), the junior tutor to the 14th Dalai Lama, as I am sure you really wanted to know, I get

 count   share word
    10 0.18519 KHANG
     7 0.12963 YANG
     6 0.11111 MDZES
     3 0.05556 CHUNG
     3 0.05556 DE
     2 0.03704 GNANG

So "KHANG KHANG" occurs 10 times in that sample.  Internet tools say that KHANG is ཁང = "house", and "KHANG KHANG" is ཁང་ཁང = "houses", "abodes", "accommodation", etc.  But I wouldn't put my mouse on the fire about that.

All the best, --jorge


RE: Repetition of words - quimqu - 26-09-2025

If we are talking about a fake cipher (a text with no meaning), it makes no sense to write so many identical words together. The author would take care not to repeat words excessively, since words repeated too closely could lead a reader to think the text is fake. I mean, imagine writing and inventing words at the same time; consciously or not, you’re not going to repeat the same word twice in a row.

On the other hand, there might be stylistic reasons to emphasize adjectives or words (as in that old, old man…), but this is mostly common in novels or storytelling, and the MS does not seem to be that sort of book.

Since we are talking about repetitions, I would also make a point about character repetition. It is extremely rare to find languages with such long consecutive repetitions of characters (e.g. ???ooooooooolar (if it is really a word, it looks so weird with the dots before and after...), eeee appears in eight words, hhh in five words, rrr in one word). While word repetition may sometimes be stylistic, character repetition within a word makes the word itself different. I know I could write difeeeeeeeeerent and you would still understand it as different, as if I were screaming, but this is a modern style, and I doubt it was intended that way in the 15th century.
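
Runs like these can be located with a short regular expression. A sketch, assuming each letter of the transliteration is one character and taking three or more identical characters in a row as the threshold (both are assumptions; `char_runs` is a hypothetical name):

```python
import re

RUN = re.compile(r"(.)\1{2,}")  # any character repeated 3+ times in a row


def char_runs(word):
    """Return each run of three or more identical characters in `word`."""
    return [m.group(0) for m in RUN.finditer(word)]


print(char_runs("difeeeeeeeeerent"))
print(char_runs("different"))
```

Whether a run in the transliteration corresponds to repeated glyphs on the page depends on the transliteration alphabet in use, so any counts from a check like this would need to be read with that in mind.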