A simple substitution experiment


RE: A simple substitution experiment - ReneZ - 17-09-2025

(17-09-2025, 11:49 AM)oshfdk Wrote: Do I know how to identify the plaintext language given it's some simple substitution cipher?

Almost....
I meant recovering the plain text, based on some reasonable assumptions (for this case).
For example: the language may be Latin, Italian or German.
Of course, the question is extended to everyone else.

The text may or may not be too short. I can provide a longer sample to anyone wanting to give it a try.

RE: A simple substitution experiment - Mauro - 17-09-2025

(17-09-2025, 12:30 PM)ReneZ Wrote:
(17-09-2025, 11:49 AM)oshfdk Wrote: Do I know how to identify the plaintext language given it's some simple substitution cipher?

Almost....
I meant recovering the plain text, based on some reasonable assumptions (for this case).
For example: the language may be Latin, Italian or German.
Of course, the question is extended to everyone else.

The text may or may not be too short. I can provide a longer sample to anyone wanting to give it a try.

Deciphering a simple substitution cipher is easy if the ciphered sample is long enough and one has statistics available for the possible target languages. Calculate a statistic, for instance bigram frequencies, which work rather well for identifying a language. Then compare the bigram distribution of the ciphertext with the bigram distributions of the target languages.

For example, here I took an Italian book ("Amore di Loredana", an old book I use for tests) and ciphered it with a simple substitution (I also substituted the character 'space'):

Quote:lnqdzchzknqdc m zcdkzldcdrhlnz tsnqdzk zbnlo fmh zcdkk zkdffdq zmnudkkdzkzk' lnqdzchzknqdc m zqnl mynzchzktbh mnzyùbbnkhzlhk mnzeq sdkkhzsqdudrzdchsnqhzoqnoqhdsàzkdssdq qh zhzchqhsshzchzqhoqnctyhnmdzdzchzsq ctyhnmdzrnmnzqhrdqu shzodqzstsshzhzo drhzbnloqdrhzk zrudyh zk zmnqudfh zdzk'nk mc zlhk mnzshozsqdudrzk' lnqdzchzknqdc m zoqhl zo qsdzhzoqdmchzptdkkdzu

Then I calculate the geometric distance (root-mean-square distance) between its bigram distribution and those of samples in different languages (DHR is the Declaration of Human Rights). These are the results, where the source text is recognized as a cipher of Italian:

[Attached image: deciphering.png]
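A minimal Python sketch of this kind of comparison, not Mauro's actual code. The post does not say how the comparison copes with the fact that the cipher relabels every character, so the sketch makes one assumption: only the ranked bigram frequencies are compared, since a simple substitution changes the labels but not the frequencies themselves. The function names and the profile size of 200 are illustrative.

```python
from collections import Counter
from math import sqrt


def bigram_profile(text: str, size: int = 200) -> list[float]:
    """Relative bigram frequencies, sorted from most to least common.

    ASSUMPTION: the bigram labels are dropped and only the ranked
    frequencies are kept, so the profile is unchanged by a simple
    substitution (the space is treated as just another symbol).
    """
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    freqs = sorted((n / total for n in counts.values()), reverse=True)
    # Pad with zeros or truncate so every profile has the same length.
    return (freqs + [0.0] * size)[:size]


def rms_distance(p: list[float], q: list[float]) -> float:
    """Root-mean-square distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))


def rank_languages(ciphertext: str, samples: dict[str, str]) -> list[tuple[str, float]]:
    """Return (language, distance) pairs, closest language first."""
    target = bigram_profile(ciphertext)
    scored = [(lang, rms_distance(target, bigram_profile(s)))
              for lang, s in samples.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

With `samples` mapping language names to reference texts, the first entry of `rank_languages(ciphertext, samples)` is the best candidate. Short samples give noisy profiles, so the ranking is only as good as the amount of text on both sides.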

I did not code the next step (getting the substitution table), but it is easy to do anyway just by looking at the bigram frequency tables (plain character frequencies would do at this point). Here the frequencies of the ciphered text are shown, together with the frequencies of a corpus of 309 Italian texts:

[Attached image: bigramscomp.png]

And it's easy to see that, for instance, 'space' has been replaced by 'z', 'a' has been replaced by 'space', 'e' has been replaced by 'd'.
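The step that was not coded can be approximated with a rank-matching first guess: pair the n-th most common cipher symbol with the n-th most common symbol of a reference text. This is only a sketch with hypothetical function names. With a real reference corpus, typically only the most frequent symbols (such as the three named above) come out right, and the rest has to be corrected by hand or with bigram context.

```python
from collections import Counter


def guess_table(ciphertext: str, reference: str) -> dict[str, str]:
    """First-guess table: cipher symbol -> plaintext symbol, by frequency rank."""
    cipher_rank = [sym for sym, _ in Counter(ciphertext).most_common()]
    plain_rank = [sym for sym, _ in Counter(reference).most_common()]
    # zip stops at the shorter list; unmatched cipher symbols stay unmapped.
    return dict(zip(cipher_rank, plain_rank))


def apply_table(ciphertext: str, table: dict[str, str], unknown: str = "?") -> str:
    """Decode with the table, marking unmapped symbols with `unknown`."""
    return "".join(table.get(ch, unknown) for ch in ciphertext)
```

The space is treated as an ordinary symbol here, matching the sample above where the space itself was substituted.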

RE: A simple substitution experiment - oshfdk - 17-09-2025

(17-09-2025, 12:30 PM)ReneZ Wrote: The text may or may not be too short. I can provide a longer sample to anyone wanting to give it a try.

It is long enough to be decoded to the point where it's possible to identify the source. I ran it through an annealing pipeline that assumes U/V and I/J are encoded using the same symbol, which is probably why it struggled with V/M, couldn't fully uncover the correct text, and couldn't decide between Italian and Spanish. It did correctly reject Latin, Dutch, German, Polish, Greek, Occitan, Romansh, Catalan and French, and the result it produced was enough for me to find the original text. The best statistical match is below. The code I use ignores spaces, so I stripped them from the ciphertext before decoding.

Code:
(83072, 'test', 'test', 'test', 'test-test', 'Spanish|Italian', 'temptest', 'QUANTO x1 s309174, UANTOA x1 s309174, ODELAL x1 s309174, CHELAM x1 s309174, TOLADO x1 s309174, OLADOM x1 s309174, LADOME x1 s309174, GUARDA x1 s309174, PIANET x1 s309174, IANETA x1 s309174', './temptest/testresults-wordscore.txt', 106, ['QUANTO', 'UANTOA', 'ODELAL', 'CHELAM', 'TOLADO', 'OLADOM', 'LADOME', 'GUARDA', 'PIANET', 'IANETA']), signature: test-test-test-test-test

E => A I => E A => I S => R U => O N => T R => L T => N C => C O => U L => D P => P M => S Q => M D => B V => H B => G H => Q F => F G => Z X => X

NELBEXXODELCABBINDINOSTRAMITABIRITROMAIPERUNASELMAOSCURACHELADIRITTAMIAERASBARRITAAHIQUANTOADIRQUALERAECOSADURAESTASELMASELMAGGIAEASPRAEFORTECHENELPENSIERRINOMALAPAURATANTEABARACHEPOCOEPIUBORTEBAPERTRATTARDELZENCHIMITROMAIDIRODELALTRECOSECHIMHOSCORTEIONONSOZENRIDIRCOBIMINTRAITANTERAPIENDISONNOAQUELPUNTOCHELAMERACEMIAAZZANDONAIBAPOICHIFUIALPIEDUNCOLLEGIUNTOLADOMETERBINAMAQUELLAMALLECHEBAMEADIPAURAILCORCOBPUNTOGUARDAIINALTOEMIDILESUESPALLEMESTITEGIADERAGGIDELPIANETACHEBENADRITTOALTRUIPEROGNECALLEALLORFULAPAURAUNPOCOQUETACHENELLAGODELCORBERADURATALANOTTECHIPASSAICONTANTAPIETAECOBEQUEICHECONLENAAFFANNATA

0.0000: 

Repeats -0.1389: ASELMAx3 ITROMAIx2 LAPAURAx2 PAURAx3 CHENELx2 ETACHEx2 ACHEx4 ODELCx2 RACHEx2 CHELAx2 ADURAx2 URATAx2
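For readers who have not seen this kind of solver, here is a generic simulated-annealing sketch in Python. It is not the pipeline used above. The trigram model, the linear cooling schedule, the floor value for unseen trigrams and all names are assumptions made for illustration. The one detail taken from the post is that the reference text is folded so that I/J and U/V share a letter and spaces are dropped.

```python
import math
import random
from collections import Counter

PLAIN = "ABCDEFGHIKLMNOPQRSTUWXYZ"  # 24 letters: J folded into I, V into U


def normalize(text: str) -> str:
    """Uppercase, fold J into I and V into U, drop everything but A-Z."""
    folded = text.upper().replace("J", "I").replace("V", "U")
    return "".join(ch for ch in folded if "A" <= ch <= "Z")


def trigram_model(reference: str) -> dict[str, float]:
    """Log-probabilities of the trigrams seen in the folded reference text."""
    ref = normalize(reference)
    counts = Counter(ref[i:i + 3] for i in range(len(ref) - 2))
    total = sum(counts.values())
    return {gram: math.log(n / total) for gram, n in counts.items()}


def score(text: str, model: dict[str, float], floor: float = -12.0) -> float:
    """Sum of trigram log-probabilities; unseen trigrams get `floor`."""
    return sum(model.get(text[i:i + 3], floor) for i in range(len(text) - 2))


def anneal(cipher: str, model: dict[str, float], steps: int = 20000,
           t0: float = 2.0, seed: int = 0) -> tuple[str, float]:
    """Search for the key (cipher symbol -> plain letter) with the best score."""
    rng = random.Random(seed)
    symbols = sorted(set(cipher))
    if len(symbols) > len(PLAIN):
        raise ValueError("more cipher symbols than plaintext letters")
    key = list(PLAIN)
    rng.shuffle(key)

    def decode(k: list[str]) -> str:
        table = dict(zip(symbols, k))  # letters beyond len(symbols) are unused
        return "".join(table[ch] for ch in cipher)

    current = best = score(decode(key), model)
    best_key = key[:]
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling
        i, j = rng.sample(range(len(key)), 2)
        key[i], key[j] = key[j], key[i]  # propose swapping two letters
        cand = score(decode(key), model)
        if cand >= current or rng.random() < math.exp((cand - current) / temp):
            current = cand  # accept (always if better, sometimes if worse)
            if cand > best:
                best, best_key = cand, key[:]
        else:
            key[i], key[j] = key[j], key[i]  # reject: undo the swap
    return decode(best_key), best
```

Running `anneal` once per candidate language model and comparing the best scores is one way to rank languages. If the ciphertext actually distinguishes U from V, the folded model cannot represent that, which is in line with the V/M trouble described above.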

RE: A simple substitution experiment - RobGea - 17-09-2025

(17-09-2025, 11:45 AM)ReneZ Wrote: So do you have a specific approach to decide what my 'cipher text' above would be?

Yep, bang it into the Guballa Substitution Solver, try different languages, see that Spanish produces Italian-looking plaintext, and put that plaintext into a search engine
and bingo: "Nel mezzo del cammin di nostra vita".

RE: A simple substitution experiment - Stefan Wirtz_2 - 17-09-2025

(17-09-2025, 04:08 AM)ReneZ Wrote: [..]
The two questions I had were:
- would the text be somehow recognisable?
- would the text show any indication of being meaningful?

The first is a definite no. The second is a bit more subjective, but I would also argue that it is: 'rather not'.[..]

…which means nothing other than:
as long as one does not know the alphabet, the words and the language, it remains impossible to assign an alphabet and recognize words and language until someone discovers a text that must correspond, as an equivalent, to some VMS page.
So it is still a Rosetta problem, isn't it…?

RE: A simple substitution experiment - ReneZ - 17-09-2025

I'm intrigued by this 'plain text', which literally is still a simple substitution cipher, but with most characters correctly identified:

(17-09-2025, 02:12 PM)oshfdk Wrote:
Code:
NELBEXXODELCABBINDINOSTRAMITABIRITROMAIPERUNASELMAOSCURACHELADIRITTAMIAERASBARRITAAHIQUANTOADIRQUALERAECOSADURAESTASELMASELMAGGIAEASPRAEFORTECHENELPENSIERRINOMALAPAURATANTEABARACHEPOCOEPIUBORTEBAPERTRATTARDELZENCHIMITROMAIDIRODELALTRECOSECHIMHOSCORTEIONONSOZENRIDIRCOBIMINTRAITANTERAPIENDISONNOAQUELPUNTOCHELAMERACEMIAAZZANDONAIBAPOICHIFUIALPIEDUNCOLLEGIUNTOLADOMETERBINAMAQUELLAMALLECHEBAMEADIPAURAILCORCOBPUNTOGUARDAIINALTOEMIDILESUESPALLEMESTITEGIADERAGGIDELPIANETACHEBENADRITTOALTRUIPEROGNECALLEALLORFULAPAURAUNPOCOQUETACHENELLAGODELCORBERADURATALANOTTECHIPASSAICONTANTAPIETAECOBEQUEICHECONLENAAFFANNATA

It did not fully find the right answer, possibly because of the assumption about I/J and U/V (which is interesting in itself), but here I would say that the plain text is recognizable.

If someone came up with this as a proposed solution for the Voynich MS, it should be acceptable, and probably the final tweak could be found soon enough.

RE: A simple substitution experiment - ReneZ - 17-09-2025

(17-09-2025, 03:15 PM)RobGea Wrote:
(17-09-2025, 11:45 AM)ReneZ Wrote: So do you have a specific approach to decide what my 'cipher text' above would be?

Yep, bang it into the Guballa Substitution Solver, try different languages, see that Spanish produces Italian-looking plaintext, and put that plaintext into a search engine
and bingo: "Nel mezzo del cammin di nostra vita".

I am curious to see this Italian looking plaintext. Do you still have it?
So you found the true solution, but that was based on the fact that the plain text is known from other plain texts, which may not be the case.

RE: A simple substitution experiment - RobGea - 18-09-2025

You can just recreate it with the Guballa online tool: paste in the ciphertext, set Language -> Spanish and click 'Break Cipher'.
Key:
abcdefghijklmnopqrstuvwxyz This clear text ...
igcpafyqexwdstubmlrnohvzkj ... maps to this cipher text

Code:
nel pezzo del cappin di nostra mita

pi ritromai ber una selma oscura

che la diritta mia era sparrita

ahi quanto a dir qual era e cosa dura

esta selma selmaggia e asbra e forte

che nel bensier rinoma la baura

tant e apara che boco e biu porte

pa ber trattar del yen ch i mi tromai

diro de l altre cose ch i m ho scorte

io non so yen ridir cop i m intrai

tant era bien di sonno a quel bunto

che la merace mia ayyandonai
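The key listed above can be applied directly with Python's `str.maketrans`. The sketch below is illustrative and is not part of the Guballa tool. It also adds a small hypothetical helper, `residual_swaps`, that compares the solver's first line with the known line "nel mezzo del cammin di nostra vita" to list which letters the solver still has wrong.

```python
CLEAR = "abcdefghijklmnopqrstuvwxyz"
CIPHER = "igcpafyqexwdstubmlrnohvzkj"  # second key line from the post above

ENCRYPT = str.maketrans(CLEAR, CIPHER)
DECRYPT = str.maketrans(CIPHER, CLEAR)


def encrypt(plaintext: str) -> str:
    """Clear text -> cipher text under the solver's key; non-letters pass through."""
    return plaintext.lower().translate(ENCRYPT)


def decrypt(ciphertext: str) -> str:
    """Cipher text -> clear text under the solver's key."""
    return ciphertext.lower().translate(DECRYPT)


def residual_swaps(solver_output: str, truth: str) -> dict[str, str]:
    """Map each true letter to the letter the solver produced instead."""
    return {t: s for s, t in zip(solver_output, truth) if s != t}


swaps = residual_swaps("nel pezzo del cappin di nostra mita",
                       "nel mezzo del cammin di nostra vita")
# swaps == {"m": "p", "v": "m"}: true m came out as p, true v came out as m
```

Extending the comparison to the remaining lines would show the other confused letters in the output above, which is exactly the kind of final correction a human reader makes by eye.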

RE: A simple substitution experiment - oshfdk - 18-09-2025

With a correctly transcribed ciphertext, solving simple substitution for a single unknown language from a known list of languages is more or less trivial. It's also trivial even when there is a large number (say, 10%) of random transcription/scribal/encoding errors. However, if there are systematic errors, like always transcribing two different characters as one, or interpreting one character as several different characters, then the standard solvers might not help.

For example,

SOURCE: the voynich manuscript is an illustrated codex hand written in an unknown script referred to as voynichese the vellum on which it is written has been carbon dated to the early fifteenth century
SUBSTITUTION, NO ERRORS: vxj tzbeqlx fneuwlcqov qw ne qgguwvcnvjk lzkjr xnek scqvvje qe ne uehezse wlcqov cjijccjk vz nw tzbeqlxjwj vxj tjgguf ze sxqlx qv qw scqvvje xnw mjje lncmze knvjk vz vxj jncgb iqivjjevx ljevucb
GUBALLA DECODER: the founich manyscript is an illystrated codex hand written in an ynknown script regerred to as founichese the fellym on which it is written has been carbon dated to the earlu gigteenth centyru

Very good, a few minor issues.

Now imagine we treat ciphertext q and g as the same character (replace g with q), and imagine that we see two different versions of v (replace 50% of v with g to keep the size of the alphabet).

MISTRANSCRIBED CIPHERTEXT: gxj tzbeqlx fneuwlcqov qw ne qqquwgcnvjk lzkjr xnek scqvvje qe ne uehezse wlcqov cjijccjk gz nw tzbeqlxjwj gxj tjqquf ze sxqlx qv qw scqgvje xnw mjje lncmze knvjk gz vxj jncqb iqigjjevx ljegucb
GUBALLA DECODER: phe younich langscrimp is an iiigstraped codex hand wrippen in an gnknown scrimp referred to as younichese the yeiigl on which ip is writpen has been carbon daped to phe eariu fifteenph centgru

While it's still possible to recognize this, mistranscribing only two character types greatly reduced our ability to get the optimal outcome.

Let's add one more transcription issue, let's say we decided that ciphertext w is actually the same as vv.

MISTRANSCRIBED CIPHERTEXT: vxj tzbeqlx fneuvvlcqov qvv ne qqquvvgcnvjk lzkjr xnek scqvvje qe ne uehezse vvlcqov cjijccjk gz nvv tzbeqlxjvvj gxj tjqquf ze sxqlx qv qvv scqgvje xnvv mjje lncmze knvjk gz vxj jncqb iqigjjevx ljegucb
GUBALLA DECODER: she lognich manusscrips iss an iiiusstrased codey hand wrissen in an unknown sscrips referred to ass lognichesse the leiium on which is iss writsen hass been carbon dased to she earig fifteensh centurg

If you know what it's supposed to say, you can read it. But only three systematic mistakes make it nearly illegible.
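The three kinds of systematic error used above can be simulated with a few lines of Python, for anyone who wants to repeat the experiment on their own ciphertext. This is a sketch with hypothetical function names. The random split will not reproduce the exact strings shown above, because the post does not say which occurrences of v were replaced.

```python
import random


def merge(text: str, a: str, b: str) -> str:
    """Two cipher characters read as one: every `a` is transcribed as `b`."""
    return text.replace(a, b)


def split(text: str, src: str, alt: str, p: float = 0.5, seed: int = 0) -> str:
    """One cipher character read as two: each `src` becomes `alt` with probability p."""
    rng = random.Random(seed)
    return "".join(alt if ch == src and rng.random() < p else ch for ch in text)


def expand(text: str, src: str, seq: str) -> str:
    """One cipher character read as a sequence, e.g. w transcribed as vv."""
    return text.replace(src, seq)


cipher = "vxj tzbeqlx fneuwlcqov qw ne qgguwvcnvjk"
step1 = merge(cipher, "g", "q")    # g and q treated as the same character
step2 = split(step1, "v", "g")     # about half of the v become g
step3 = expand(step2, "w", "vv")   # w misread as vv
```

The order matters: the split is applied after the merge, so the freed-up g is reused as the second reading of v and the alphabet size stays the same, as in the post.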

But these were just simple changes. What if we misunderstood the mechanics of the cipher? Suppose it's actually meant to be read right to left, or some other reordering happens. This will kill most simple statistical decoders instantly.

ORIGINAL CIPHERTEXT, NO TRANSCRIPTION MISTAKES, JUST REVERSED: bcuvejl xvejjviqi bgcnj jxv zv kjvnk ezmcnl ejjm wnx ejvvqcs wq vq xlqxs ez fuggjt jxv jwjxlqebzt wn zv kjccjijc voqclw eszeheu en eq ejvvqcs kenx rjkzl kjvncvwuggq en wq voqclwuenf xlqebzt jxv
GUBALLA DECODER: brusted nsteesmam blroe ens is hesoh tixrod teex pon tessarc pa sa ndanc ti fulleg ens ependatbig po is herremer swardp tcitytu to ta tessarc hton vehid hesorspulla to pa swardputof ndatbig ens
GUBALLA DECODER RESULT REVERSED BACK: sne gibtadn fotupdraws ap ot allupsroseh dihev noth crasset at ot utytict pdraws remerreh si op gibtadnepe sne gelluf it cnadn as ap crasset nop xeet dorxit hoseh si sne eorlb mamseetsn detsurb

That's it, one very simple misunderstanding of how the cipher works and most standard statistical decoders are useless.
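Why reversal hurts so much can be checked in a few lines of Python. Reversing a text leaves the single-character counts untouched, but every bigram is mirrored, so a decoder that scores candidate plaintexts against normal left-to-right n-gram statistics is matching against the wrong table. The function names are illustrative.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Counts of adjacent character pairs."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def reverse_reading(text: str) -> str:
    """The same text read right to left."""
    return text[::-1]


plain = "the voynich manuscript is an illustrated codex"
rev = reverse_reading(plain)

same_letters = Counter(rev) == Counter(plain)  # unigram counts are identical
mirrored = bigrams(rev) == Counter(bg[::-1] for bg in bigrams(plain).elements())
```

Both `same_letters` and `mirrored` are true: a check based on single-letter frequencies would not notice the reversal at all, while anything based on bigrams or longer n-grams sees, for example, "eh" wherever the language model expects "he".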

RE: A simple substitution experiment - ReneZ - 18-09-2025

Many thanks!

This was one of the points that I was wondering about, and it seems that even small issues can have major effects!