The Voynich Ninja - An Artificial Construction


(13-05-2026, 03:25 PM)Stefan Wirtz_2 Wrote: I don't see how those tables allow the diagnosis of something "artificial".

Someone may assume that k and t
- could be consonants of a natural language
- are each preceded by one or two vowels, drawn from the most frequent or from all vowels of a natural language
- which may themselves follow 1-2 consonants in the initial position
- or are preceded by one or two consonants out of a limited, language-appropriate set
- and are followed by the "second" syllable in an equivalent structure.

I don't think this will produce a similar distribution. Could you name any language, any central letter and any word splitting principle at all for which there would be a similar independence of prefix and suffix? I can run the numbers if you or anyone else knows of a good example of this. I think this is just not the way natural languages work.

(13-05-2026, 04:29 PM)oshfdk Wrote: I don't think this will produce a similar distribution.

I don't understand what kind of "distribution" of "splits" you see there.

In general, there are 10+ "beginning syllables" crossed with 21+ "ending syllables".
A good third of the combinations in both tables for both characters are not set (not possible or not occurring in the text sample).

"Split" is an exaggerated word, since both characters can function as a beginning letter with a follow-up "syllable", or sit at second position with some (long) following.
As I remarked earlier, m may do some work as a T, just in final position, i.e. as a positional variant. The tables do not take that into account.
From a quick look, those "2nd syllables" consist of at least 16 different VMS characters in various combinations.
The "starter syllables" come with at least 10 different VMS characters.
It is rather safe to say that long starter syllables are not often combined with long ending syllables, but nothing more.

So the "syllables" could easily be something like "Dot-, Ot-, T-, Et-, …" (rows) and "-ukus, -an, -am, -uus, -ikus, -ar, …" (columns).

What do you expect from me now? To give you some 600-year-old text(!) where you can "calculate" the distribution of some
"dotukus, dotan, dotam, dotuus, dotikus, dotar, …
otukus, otan, otam, otuus, otikus, otar, …
tukus, tan, tam, tuus, tikus, tar, …
etukus, etan, etam, etuus, etikus, etar, …
:
: etc."??

For what? What could you prove with this?

I agree that the text clearly shows signs of artificial construction. That said, I’d like to add a historical thought. The person who created the Voynich manuscript was almost certainly a well-educated individual from the early 15th century. Back then, any serious scholar studied the quadrivium: arithmetic, geometry, music, and astronomy. Music wasn’t considered just “art” as it is today, but a proper mathematical discipline, focused on proportions, ratios, and structured repetition, very much like what we see in Gregorian chant and polyphony. Given how extremely repetitive and rhythmic Voynichese is, I wonder if the author might have drawn on his musical training when creating the system. Maybe he used ideas like a limited set of notes (the glyphs), controlled transitions between them, and some kind of internal rules of succession, similar to how neumes work. I’m not saying it’s literally Gregorian chant encoded or anything like that. I just think it’s possible that the structural inspiration came from that same intellectual and musical background.

(13-05-2026, 07:32 PM)Stefan Wirtz_2 Wrote: For what? What could you prove with this?

This can show that the way prefixes and suffixes behave in the Voynich Manuscript does not occur in natural languages. I think this may be true for any normal representation of any natural language in the history of humankind; it doesn't have to be a 600-year-old MS. I think this should work for whatever popular language theories are currently discussed on the forum, for Bavarian, for Chinese, regardless of whether it is written in pinyin or in characters. In a sense this may cast doubt on any plaintext theory.

I believe it is an essential property of any language that, if you select any character or sequence B, split the text by some separator (spaces or boundary characters), and collect all chunks of the form ABC, then A and C cannot fail to show strong dependence on one another.

But this is just my intuition for now; to confirm or reject it, one needs to run tests using different kinds of central characters and different kinds of word splitting.

(13-05-2026, 07:57 PM)oshfdk Wrote: This can show that the way prefixes and suffixes in the Voynich Manuscript behave does not occur in natural languages. I think this may be true for any normal representation of any natural language in the history of humankind.

Have you tested the East Asian monosyllabic languages - Chinese, Vietnamese, Tibetan, Thai, Burmese, Lao, Khmer, ...?

[link] is a file of Mandarin (the Beijing Chinese language) in pinyin encoding. (It is in the Unicode UTF-8 encoding. You will have to download it; our web server displays it assuming the wrong encoding, so it shows up all garbled in your browser, and cut-and-paste will fail.)

Here is a python3 fragment that will read the file and convert tone diacritics to numeric tone suffixes (e.g. "shén" to "shen2") and "ü" to "uu" (e.g. "nǚ" to "nuu3"):

Code:
#! /usr/bin/python3

import re
from sys import stdin as inp, stdout as out

def main():
  inp.reconfigure(encoding="utf-8")
  out.reconfigure(encoding="utf-8")
  out.write("# Created by {convert_pinyn_to_numeric.py} - do not edit.\n")
  out.write("# -*- coding: utf-8 -*-\n")

  # Toned vowels grouped by tone (1, 4, 2, 3), aligned with the
  # plain vowels and the tone digits that replace them.
  pinyin_vows = r"āēīōūǖ" + r"àèìòùǜ" + r"áéíóúǘ" + r"ǎěǐǒǔǚ"
  unmark_vows = r"aeiouü" * 4
  ntones_nums = r"111111" + r"444444" + r"222222" + r"333333"

  pats = []
  subs = []
  nv = len(pinyin_vows)
  for i in range(nv):
    pats.append(re.compile(pinyin_vows[i]))
    subs.append(unmark_vows[i] + ntones_nums[i])

  for line in inp:
    line = line.strip()
    # Skip blank lines and "#" comment lines.
    if re.match(r"[ ]*([#]|$)", line):
      continue
    # Split off an optional locator such as "<a1.2>" at the start of the line.
    m = re.fullmatch(r"([<][a-z][0-9.]+[>])[ ]*(.*)", line)
    if m is not None:
      loc = m.group(1)
      line = m.group(2)
    else:
      loc = ""
    # Treat brackets and ASCII punctuation as word separators.
    line = re.sub(r"[\]\[.,;:()]", " ", line)
    words = line.split()
    out.write(loc)
    for word in words:
      word = word.lower()
      # Replace each toned vowel by the plain vowel plus the tone digit.
      for i in range(nv):
        word = re.sub(pats[i], subs[i], word)
      word = re.sub(r"ü", "uu", word)
      # Move the tone digit to the end of the word.
      word = re.sub(r"^(.*)([0-9])(.*)$", r"\1\3\2", word)
      # Words without any tone mark get the neutral tone "5".
      if re.fullmatch(r"[a-z]+", word):
        word += "5"
      out.write(" ")
      out.write(word)
    out.write("\n")

main()

All the best, --stolfi

(13-05-2026, 10:29 PM)Jorge_Stolfi Wrote: Have you tested the East Asian monosyllabic languages - Chinese, Vietnamese, Tibetan, Thai, Burmese, Lao, Khmer, ...

As I said, I haven't tested anything yet. For the test I need a suggestion for the central character and the word separation algorithm. The main objection raised in recent posts here is that the property that @dashstofsk demonstrated for k words in Voynichese can occur in natural languages. I can't see how this is possible. For this experiment we need to select a central character or sequence (which would replace k in this experiment; let's call it B) and the chunking algorithm (the simplest would be splitting on spaces). I can't select these myself, because to me it's obvious that any selection would be immediately wrong. To make it clear again, the test would be: first split the text into chunks according to the chunking algorithm (just split by spaces in the simplest case), extract all chunks that contain B, treat them as ABC, where A and C are the prefix and the suffix, and compare the observed number of co-occurrences of each prefix/suffix pair with the number expected given the frequencies of the prefix and the suffix.
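The test described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code that was actually run in this thread: the function names are made up for the example, chunks are split on whitespace, each chunk is split at the first occurrence of B (the posts do not say how repeated occurrences are handled), and the sample string is a toy input rather than real transcription data.

```python
from collections import Counter

def split_around(chunk, core):
    """Split a chunk at the FIRST occurrence of core (an assumption).

    Returns (prefix, suffix), or None if the chunk does not contain core.
    """
    pos = chunk.find(core)
    if pos < 0:
        return None
    return chunk[:pos], chunk[pos + len(core):]

def observed_expected(text, core):
    """Return {(prefix, suffix): (observed, expected)} for chunks containing core.

    expected = count(prefix) * count(suffix) / total, i.e. the count we would
    see if prefix and suffix were chosen independently of one another.
    """
    pairs = [p for p in (split_around(c, core) for c in text.split()) if p is not None]
    total = len(pairs)
    pair_counts = Counter(pairs)
    prefix_counts = Counter(a for a, _ in pairs)
    suffix_counts = Counter(c for _, c in pairs)
    table = {}
    for a in prefix_counts:
        for c in suffix_counts:
            expected = prefix_counts[a] * suffix_counts[c] / total
            table[(a, c)] = (pair_counts[(a, c)], expected)
    return table

# Toy input, for illustration only.
sample = "okedy qokedy okain qokain okedy chedy"
for (a, c), (obs, exp) in sorted(observed_expected(sample, "k").items()):
    print(f"{a!r}-k-{c!r}: observed={obs} expected={exp:.2f} ratio={obs / exp:.2f}")
```

A ratio close to 1 for a pair means that it occurs about as often as independence predicts; ratios far from 1 indicate that prefix and suffix depend on each other.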

I believe this test would work for any natural representation of the language (phonetic, ideographic), just due to the way texts in natural languages are structured, and the prefix and suffix combinations won't be statistically independent for most pairs.

(13-05-2026, 10:59 PM)oshfdk Wrote: As I said, I haven't tested anything yet, for the test I need a suggestion of the central character and the word separation algorithm.

The words in that pinyin file are already separated by spaces or ASCII punctuation.

Each pinyin syllable is zero or more consonants (which include "y" or "w"), one or more vowels, and an optional final "-n", "-ng", or (rarely) "-r".

For the core, you could use the vowel that has a tone diacritic. For example:

"chuáng" = «chu-á-ng»
"liǎo" = «li-ǎ-o»
"shuǐ" = «shu-ǐ-»
"nián" = «ni-á-n»
"hǎo" = «h-ǎ-o»
"áo" = «-á-o»
"uò" = «u-ò-»
"è" = «-è-»

etc.

In that file there are a handful of words without any diacritics. I suppose you can just discard them.
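The decomposition suggested above could be sketched as follows. This is a hypothetical helper, not part of the conversion script: the function name is invented for the example, it works on the original diacritic form of the syllable (before conversion to numeric tones), and it normalizes the input to NFC on the assumption that the file may contain decomposed accents.

```python
import unicodedata

# Vowels carrying a tone diacritic, in precomposed (NFC) form.
TONED = unicodedata.normalize("NFC", "āēīōūǖàèìòùǜáéíóúǘǎěǐǒǔǚ")

def split_at_toned_vowel(syllable):
    """Return (prefix, core, suffix) around the first toned vowel.

    Returns None for syllables without any tone diacritic, which the
    caller can then discard.
    """
    syllable = unicodedata.normalize("NFC", syllable.lower())
    for i, ch in enumerate(syllable):
        if ch in TONED:
            return syllable[:i], ch, syllable[i + 1:]
    return None

for word in ["chuáng", "liǎo", "shuǐ", "nián", "hǎo", "áo", "uò", "è", "de"]:
    print(word, split_at_toned_vowel(word))
```

Words such as "de", which carry no diacritic, come back as None and can simply be skipped.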

All the best, --stolfi

(13-05-2026, 11:33 PM)Jorge_Stolfi Wrote: For the core, you could use the vowel that has a tone diacritic.

(Edit: I've updated the image to include all pairs. I think I misread the description of the computation by dashstofsk, but this doesn't affect the conclusions.)

[attachment=15550]

Here it is. I repeated @dashstofsk's computation as described in the original post. This is what I would expect from a natural language: a lot of underrepresented combinations (and a few hugely overrepresented ones). It is nothing like the Voynich MS chart, in which the upper left corner mostly consists of numbers close to one.

I suspect that for no natural language, and for no fixed B, would picking A and C independently according to their overall frequencies reproduce the distribution of ABC seen in the Voynich MS, no matter what A, B and C correspond to: characters, words or syllables.
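For reference, here is what "independently picking A and C according to their general frequency" looks like as a simulation. It is a sketch with invented prefix/suffix inventories and weights, not real data from either text; it only illustrates that under independence the observed/expected ratios of well-populated pairs stay close to one.

```python
import random
from collections import Counter

def independent_pairs(prefix_weights, suffix_weights, n, seed=0):
    """Draw n (prefix, suffix) pairs, choosing each part independently by weight."""
    rng = random.Random(seed)
    prefixes = rng.choices(list(prefix_weights), weights=list(prefix_weights.values()), k=n)
    suffixes = rng.choices(list(suffix_weights), weights=list(suffix_weights.values()), k=n)
    return list(zip(prefixes, suffixes))

def ratios(pairs):
    """Observed / expected count for every pair that occurs at least once."""
    n = len(pairs)
    prefix_counts = Counter(a for a, _ in pairs)
    suffix_counts = Counter(c for _, c in pairs)
    return {
        (a, c): k / (prefix_counts[a] * suffix_counts[c] / n)
        for (a, c), k in Counter(pairs).items()
    }

# Hypothetical inventories and weights, for illustration only.
pairs = independent_pairs({"o": 5, "qo": 3, "ch": 2}, {"edy": 5, "ain": 3, "ar": 2}, 20000)
for pair, ratio in sorted(ratios(pairs).items()):
    print(pair, f"{ratio:.2f}")
```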

Now that I have the code in place, I can repeat the test for any text and any kind of B, but I expect the result to be largely the same.

(14-05-2026, 10:06 AM)oshfdk Wrote: Here it is. I repeated @dashstofsk's computation as described in the original post. This is what I would expect from a natural language - a lot of underrepresented combinations (and a few hugely overrepresented). Nothing like the Voynich MS chart for which the upper left corner mostly consists of numbers close to one.

For comparisons, a metric for the bumpiness of the structural decomposition could be the standard deviation of the distribution of all numbers in the table, excluding the empty/null values.
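A minimal sketch of that bumpiness metric, assuming the table is a list of rows of observed/expected ratios with None marking empty cells. The function name and the choice of the population (rather than sample) standard deviation are assumptions made for the example.

```python
import statistics

def bumpiness(table):
    """Population standard deviation of all non-empty cells of a ratio table."""
    values = [v for row in table for v in row if v is not None]
    return statistics.pstdev(values)

# Two made-up tables: one with ratios near 1, one with strong over/under-representation.
flat = [[1.0, 0.9, 1.1], [1.05, None, 0.95]]
bumpy = [[0.1, 3.0, None], [2.5, 0.2, 0.0]]
print(f"flat:  {bumpiness(flat):.3f}")
print(f"bumpy: {bumpiness(bumpy):.3f}")
```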

(14-05-2026, 10:59 AM)nablator Wrote: For comparisons, a metric for the bumpiness of the structural decomposition could be standard deviation of the distribution of all cells in the table, without the empty cells of course.

There could be a 2nd metric for how asymmetric the distribution is: something like (sum over all (i,j) of abs(cell(i,j) - cell(j,i))) / (sum over all (i,j) of (abs(cell(i,j)) + abs(cell(j,i)))).
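The asymmetry formula can be written out directly. The sketch below assumes a square table in which row i and column i refer to the same item and every cell holds a number (empty cells would first have to be filled, for example with 0); neither assumption is stated in the post.

```python
def asymmetry(table):
    """Sum of |cell(i,j) - cell(j,i)| divided by the sum of |cell(i,j)| + |cell(j,i)|.

    Returns 0.0 for a symmetric table and 1.0 when every nonzero cell
    faces a zero in the mirrored position.
    """
    n = len(table)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            numerator += abs(table[i][j] - table[j][i])
            denominator += abs(table[i][j]) + abs(table[j][i])
    return numerator / denominator if denominator else 0.0

print(asymmetry([[1, 2], [2, 1]]))   # symmetric table
print(asymmetry([[0, 1], [0, 0]]))   # fully one-sided table
```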

I think this should be weighted by the expected count. There is nothing surprising about a combination that is expected to occur 3 times actually occurring 0 or 6 times; given that there are dozens of cells, this would happen very often. However, a combination that is expected to occur 800 times but occurs only 400 times is a very strong signal.
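One common way to weight a deviation by the expected count is the Pearson residual, (observed - expected) / sqrt(expected). This is a standard technique swapped in here for illustration, not necessarily the weighting the poster had in mind; it treats each cell count as roughly Poisson, so that sqrt(expected) approximates its standard deviation.

```python
import math

def pearson_residual(observed, expected):
    """Deviation from the expected count, in units of sqrt(expected)."""
    return (observed - expected) / math.sqrt(expected)

# The three cases from the post: the raw ratio is 0, 2 and 0.5,
# but only the last deviation is large relative to its expected scatter.
for observed, expected in [(0, 3), (6, 3), (400, 800)]:
    print(f"observed={observed:4d} expected={expected:4d} "
          f"ratio={observed / expected:.2f} "
          f"residual={pearson_residual(observed, expected):+.2f}")
```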

When I ran similar experiments a few years ago, I used other metrics instead of the actual/expected ratio: I tried to estimate the probability of each result happening by chance, given the total number of results. This was in pre-LLM times and I remember that the math was complicated, but I think this can now be easily reproduced with an AI assistant.

I'm not very keen on spending much time on this now. My conclusion is basically the same as what @dashstofsk says: this doesn't look like a natural language. Interestingly, I see this as a strong indication that this is a cipher, while, if I understand correctly, @dashstofsk sees it as an indication of meaningless/hoax text.
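The exact method used in those earlier experiments is not given, so the following is only one possible reading: model each cell count as Poisson with the expected count as its mean, compute the probability of a deficit at least as large as the one observed, and multiply by the number of cells as a crude (Bonferroni) correction for looking at many cells at once. All names here are made up for the example.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space to avoid underflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_lower_tail(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with lam > 0."""
    return min(1.0, sum(poisson_pmf(i, lam) for i in range(k + 1)))

def chance_of_deficit(observed, expected, n_cells):
    """Upper bound on the chance that any of n_cells shows a deficit this extreme."""
    return min(1.0, n_cells * poisson_lower_tail(observed, expected))

# With 50 cells, 0 occurrences where 3 were expected is unremarkable,
# while 400 where 800 were expected is essentially impossible by chance.
print(chance_of_deficit(0, 3.0, 50))
print(chance_of_deficit(400, 800.0, 50))
```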
