The Voynich Ninja

Full Version: Speculative fraud hypothesis
(13-09-2025, 07:53 AM)Torsten Wrote: You are assuming the existence of a fixed seed text from which words are copied and modified. That is not the case. ... In the self-citation model, the Voynich text functions simultaneously as both the source and the outcome of the copying process.

No.  I explicitly wrote that the source text available for copying grows as the algorithm progresses, and that is why the word pair distribution changes and tends to the random x random limit.  But you need a seed text to start the process.  You admit as much:

Quote:The algorithm requires only a minimal seed (e.g., a single line of text) to initialize. ... In our implementation, we used line f103v.P.9 of the VMS as seed—pchal shal shorchdy okeor okain shedy pchedy qotchedy qotar ol lkar

But that seed already shows the distinctive, non-trivial, non-"European" Voynichese word structure.  How did the Author come up with that seed, and why?

If you use a short text as a seed, for the first page or so you would get only repeated fragments of that text, with a few mutations.  Did you find any part of the VMS where the text looks like that?

If the mutation probability is cranked up in order to hide that "small seed" effect, then the mutation procedure must be complicated enough to preserve the structure of Voynichese words, and tuned to produce each segment with the right probabilities.  But then the generated text would quickly lose the "repetitive" character that was supposed to justify your method.  Indeed, the algorithm would quickly become equivalent to a zero-order Markov model, with a word distribution that is an attractor of the mutation procedure M: namely, a distribution P such that P(x) = sum{ P(y)*Prob(M(y) = x) : y }.
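To see what such an attractor looks like in a toy case, here is a minimal sketch (my own toy Mutate over a two-letter alphabet, not the T&T procedure): it computes, by power iteration, the fixed-point distribution P satisfying P(x) = sum{ P(y)*Prob(M(y) = x) : y } for a Mutate that resamples one random position of a 3-letter word.

  import itertools

  # Toy illustration: on the 8 words of length 3 over the alphabet "ab", let
  # Mutate pick one of the 3 positions uniformly and rewrite it as "a" (prob 0.7)
  # or "b" (prob 0.3).  Power iteration then finds the attractor distribution P
  # with P(x) = sum_y P(y) * Prob(Mutate(y) = x), regardless of the seed word.

  words = ["".join(t) for t in itertools.product("ab", repeat=3)]
  letter_p = {"a": 0.7, "b": 0.3}

  def transition(y, x):
      """Probability that one Mutate step turns word y into word x."""
      diff = [i for i in range(3) if y[i] != x[i]]
      if len(diff) > 1:
          return 0.0
      if len(diff) == 1:                          # exactly one position changed
          return (1 / 3) * letter_p[x[diff[0]]]
      return sum((1 / 3) * letter_p[y[i]] for i in range(3))   # x == y

  P = {w: (1.0 if w == "aaa" else 0.0) for w in words}         # seed: all mass on "aaa"
  for _ in range(30):
      P = {x: sum(P[y] * transition(y, x) for y in words) for x in words}

  for w in words:
      product = letter_p[w[0]] * letter_p[w[1]] * letter_p[w[2]]
      print(f"{w}: P = {P[w]:.4f}   (product of letter probs = {product:.4f})")

The printed P matches the product of the per-letter probabilities, i.e. a "random x random" word distribution that no longer remembers the seed.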

Quote:... to generate a corpus of more than 10,000 words. The resulting text contained 7,678 Voynich words (70%) and 3,156 non-Voynich words (30%).

But surely the percentage of Voynich words was higher than 70% at the beginning (when it was mostly copies of fragments of the seed line) and less than 70% near the end (where most words were the result of multiple mutation steps). And the percentage must have been decreasing; unless the mutation procedure was complicated and finely tuned as per above.  And the word pair distribution must already have been visibly tending to that of a zero-order Markov model, namely random x random.
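One simple way to test this on any generated text would be to measure the fraction of in-lexicon words in successive blocks of the output; a sketch (where "generated.txt" and "lexicon.txt" are placeholders for the generated word sequence and the VMS word list):

  # Sketch: fraction of in-lexicon words per 1000-word block of a generated text.
  # "generated.txt" and "lexicon.txt" are placeholder file names.

  generated = open("generated.txt").read().split()
  lexicon = set(open("lexicon.txt").read().split())

  block = 1000
  for start in range(0, len(generated), block):
      chunk = generated[start:start + block]
      share = sum(w in lexicon for w in chunk) / len(chunk)
      print(f"words {start:6d}-{start + len(chunk):6d}: {100 * share:5.1f}% in lexicon")

If the self-citation account is right as stated, this percentage should be visibly higher in the first blocks than near the end.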

Here are some tests of your algorithm with a 14-word seed text in English (a bit longer than the one you used above). The mutation algorithm randomly deletes a letter, with increasing prob if the word is long; or inserts a letter chosen with the approximate English letter frequency, with increased prob if the word is short; or replaces a random letter by a loosely similar letter (vowel by vowel, stop by stop, sibilant by sibilant).  (This algorithm is not trivial and somewhat "tuned" to English, but I suppose that this is still considerably simpler and less "tuned" than the mutation procedure you used for Voynichese, correct?)  For each combination of parameters, the algorithm was used to generate N = 10000 words, and the first and last 100 were printed.
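In outline, the procedure just described looks like this (a sketch only; the exact probabilities and letter tables are illustrative, not the precise script used for the runs below):

  import random

  # Sketch of the test above: the text starts as the seed and grows by copying
  # words from earlier in itself; with probability p_reset the source pointer
  # jumps to a random earlier position, and with probability p_mutate the copied
  # word is passed through a crude English-flavoured Mutate.  All constants here
  # are illustrative.

  VOWELS, STOPS, SIBILANTS = "aeiou", "pbtdkgq", "szcxj"
  COMMON = "etaoinshrdlu"                      # rough English letter ranking

  def mutate(word, rng):
      r = rng.random()
      if len(word) > 1 and r < min(0.6, 0.1 * len(word)):       # delete (more likely if long)
          i = rng.randrange(len(word))
          return word[:i] + word[i + 1:]
      if (r < 0.8 and len(word) < 6) or len(word) == 1:          # insert (more likely if short)
          i = rng.randrange(len(word) + 1)
          return word[:i] + rng.choice(COMMON) + word[i:]
      i = rng.randrange(len(word))                               # replace by a similar letter
      for cls in (VOWELS, STOPS, SIBILANTS):
          if word[i] in cls:
              return word[:i] + rng.choice(cls) + word[i + 1:]
      return word[:i] + rng.choice(COMMON) + word[i + 1:]

  def self_cite(seed, n_words, p_reset, p_mutate, rng):
      text = list(seed)
      src = rng.randrange(len(text))
      while len(text) < n_words:
          if rng.random() < p_reset or src >= len(text):
              src = rng.randrange(len(text))                     # reset the source pointer
          word = text[src]
          src += 1
          if rng.random() < p_mutate:
              word = mutate(word, rng)
          text.append(word)
      return text

  seed = ("the native hue of resolution is sicklied over "
          "with the pale cast of thought").split()
  out = self_cite(seed, 10000, p_reset=0.1, p_mutate=0.1, rng=random.Random(0))
  print(" ".join(out[:100]) + "\n...\n" + " ".join(out[-100:]))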

[EDIT: changed slightly how the {p_mutate} parameter is used and re-created the examples.]

  seed = ['the', 'native', 'hue', 'of', 'resolution', 'is', 'sicklied', 'over', 'with', 'the', 'pale', 'cast', 'of', 'thought']

  === N = 10000  p_reset = 0.100  p_mutate = 0.100 ===

  resolution is sicklied over sicklied over with the pale cast the
  native hue oj resolution i sicklied over wigh the rpale resolution i
  sicklied over wigh the rpale sicklied over wigh oj resolution i
  sicklied over wigh the rpale resolution i sicklied over wigh the
  rpale is wigh hue oj resolution e sicklied over wigh the rpale
  resolution i sicklied oser wigh the rpale sicklied over wigh oj
  resolution i sicklied over wigh the rpale resoluetion i sicklied
  over wigh the rpale is wigh hue oj resolution e sicklied over wigh
  the rpale resolution i sicklied oser wigh the cast
  ...
  e sicklied over widh oser wigh lthe rpale sicklied over wigh oj oj
  resolution be o sicklied over wigh rpane fesolution i sicklied orer
  wih phe rpale resolutuona i sicklied is wygh the rpale is wigh wigh
  dhe rpale wogh resolution i sicklied i siklied oser wih the cst the
  native hue oj i sicklied sicklied over wtigh wiygh phe rpal wigh the
  npale wygh hue o suckliied oer sicklied wigh resolution e sickliud
  over sicklied ogver hue ij resoluion pavo cart rpale relolution i
  sicknied ozer nwigh the the i sicklied thw pale the rale over wigh
  el sicklied
 
Note that the first 100 generated words are essentially repetitions of fragments of the seed text.  As the algorithm progresses, the first few words that happened to be copied become increasingly more likely to be copied, so that the text at first becomes even more repetitive. Then, as mutations accumulate, the output becomes random tosses of variations of those few lucky words.

  === N = 10000  p_reset = 0.100  p_mutate = 0.700 ===
  resalution i sicklied over us sickhied over witj the pale cst ov hu
  otf resolution is sicklyed ovepr wizth the plale casnt of thoufght
  reasalution ogf ogf ogf oxf uxf uxj uxj uxj uxj uxj uxr uxr uxr wxr
  pale dst ol hu otf renolution is sicklyid vepr wuzth he psale cesnt
  lof thoufgh reasalutione ogf obf og yxf uxm uxj uxl uxmj us sickhiet
  ovur wtj thu pyle kst oz fu ot resolutio is vicklyed vepr wizth the
  dlale os sichied over witl ethe pane fcst af riesolution is sickvied
  ovev wits the pale casat af thought resalutio e
  ...
  dgsis snipkle seqr wuzp ie sily evnt jos ghumgh gl obj tult refti i
  liihiut ujs abn ymy wisai ht ewnj ussvv utlve fhea vbmsg l
  rsilultian wns yqr obq vqon wbes cdasa il movwen pask rejoltion em
  rkebsoluta uh vklyeqe m casa vicklyep cvepr vckloed ghofbv
  reasluione ogb qaj pan thoght rjaluio zwoj chus balbe cyat of exvwh
  ij fe k resovutio ys vcklyed nepr oizkh ut reolta is evckloed evw
  ovqery os tjhoyghd rvulutin oaje pnlae apesvtpb tmovgh asaluqone
  uigf ebov oqqd caonat om lenoliotifo iin scklii sn if vcklyeq vwh iz
  hu t resolutio is vcknyed vepr

Here, the high mutation probability eventually renders the seed irrelevant, and the output soon becomes a zero-order Markov text with word distribution defined by the mutation procedure. The output does not look like English at all, because the mutation procedure is not sufficiently complicated and tuned.

  === N = 10000  p_reset = 0.700  p_mutate = 0.100 ===
  witf cast o is resolution us resolution of thought o is resolution
  us resolution tought casp u witf is of resolution the native native
  native native native native native witf cast nativi is resolution
  resolution resolution us thought o resolution native natijve native
  resolution tought native native native pale resolution native natine
  witf resolution o is of native palw cast o is resolution of o is
  resolution nativi native cast us thought ir native witf cysp u witf
  hue o thought ntivi resolution resolution naytive natijve cast cast
  o is resolution resolutigon native resolugtion cast native witf cysp
  native is
  ...
  tought is is is resolution uf resolutigan natinde rslution witf
  resolution xo native resolution native rsolution o o ih resolution
  reslugtion is resoltion is resomution native is witf is ovwr of with
  us witf thougmt resolution cysp hought witf o is resolution witf
  mytie resolugion resolution native cast sicklied native us
  resolution natuve native palw o sicklied resolution resolution
  native o resolution of u cast is onf hought sative cast with
  resolution resolution native u native is natinde resolution ovwr
  ntive resootion native thw is natinbe resolution is wvitf witf witf
  u sicklied witf native natuve i witf reslugtion es


  === N = 10000  p_reset = 0.400  p_mutate = 0.400 ===
  of the cast hue iwith rasolution pale cast thought ast hue iwith
  sicklied pave caist of hue owith of native pae caszt thought ast
  pive kaist of vue owith o owixh of native paw caszt thought art
  owixh of nnative pae cuszt thoughb pav cast owith thought ast hue
  caw caszt bhought art native of the ast pale cast vue owith cost
  thought ast huw iwit rasolution pale cast thought asbt nnativi pae
  owith w owaxh of native pai caszt thought art owixh cast thought
  asbt nnativi owikh of cost dhought as huw iwib rasolution thoghb pav
  nntivi o owixh
  ...
  thouht cal pott thoubht raslution though caszt tfought ast sickiek
  cist iwibv hyu thyukht casx oj o sibbkeg heua abx art oh fr tougrt
  ieitv ceost dhooght thouhgt qule theuhg ptule hae itf buist huo
  natie thdee totj fbough pawlea iu owdth uf rasolution paa cuszt
  thoghb paywv cwst oiith cast oh piva casc vue dhgc ckaft ast
  gthought thighb owiph wiph hyught of hue thoghct iwib tloghqa oh fr
  of raoltaon casg cuszt s hlue iabt ast oj iwit ihf oth fr tougrt
  kqahw iwith caszt iwtexh r tvough vue csq of eh oiith l huef vasg
  sicglied pavek


All the best, --jorge
(13-09-2025, 11:27 AM)Jorge_Stolfi Wrote: But surely the percentage of Voynich words was higher than 70% at the beginning (when it was mostly copies of fragments of the seed line) and less than 70% near the end (where most words were the result of multiple mutation steps). And the percentage must have been decreasing; unless the mutation procedure was complicated and finely tuned as per above.
This argument seems fallacious: the space of results grows like a tree, so divergence from a specific outcome is expected - what should be compared is the size of the result space, which can be estimated by running the algorithm with different seeds and comparing their pairwise differences. That way, you get an estimate of the variation in result space as a function of parameter space.

In other words, the algorithm doesn't purport to be able to generate an unlimited number of Voynich-like manuscripts, nor to always give a great match to Voynich. It just allows a Voynich-length output that looks similar enough, indicating that generation of it using a similar method is plausible. That is, if the Voynich lies in the result space, then there could have been a seed (perhaps even now lost!) that generated it.

An interesting question is whether the algorithm's result space is too wide to be plausible: a thousand monkeys at typewriters will eventually generate anything, including Voynich, and that doesn't make it a plausible algorithm for generating Voynich. But the algorithm in Timm's article is pretty basic and simple, and retains the characteristics of the initial seed for the length of the VMS, so I expect the result space (at the level of Voynich length, that is) to be comparatively narrow, similar to the "gibberish after all?" article from the Voynich conference.

Also, the algorithm used for Voynichese is described in Timm's published article, so I'm not sure why you keep asking how 'tuned' it is...
(13-09-2025, 01:38 PM)dexdex Wrote: Voynich-length output that looks similar enough
The question is what "similar" means.  To someone who has never seen English text, this sentence may look "similar enough" to English:

  reasluione ogb qaj pan thoght rjaluio zwoj chus balbe cyat of exvwh
  ij fe k resovutio ys vcklyed nepr oizkh ut reolta is evckloed evw
  ovqery os tjhoyghd rvulutin oaje pnlae apesvtpb tmovgh asaluqone


The previous message pointed out that, in the output of their test run, "70% were Voynich words"; implying that the similarity criterion was just that, namely the percentage of words (word instances or word forms, not clear) that were in the VMS lexicon.  If that was the criterion, then the "divergence" (a drop in the similarity as the algorithm progresses) is a problem, because it means that a significant part of the similarity was due to the fact that the seed text had been taken from the VMS.

All the best, --jorge
(13-09-2025, 01:38 PM)dexdex Wrote: I'm not sure why you keep asking how 'tuned' it is

Sorry for my confusing language.  The "unless" did not mean that I did not know.  I was referring to the generic algorithm, where the Mutate procedure is a parameter.  The specific Mutate that they used was highly tuned to Voynichese.
(13-09-2025, 03:36 PM)Jorge_Stolfi Wrote: The previous message pointed out that, in the output of their test run, "70% were Voynich words"; implying that the similarity criterion was just that, namely the percentage of words (word instances or word forms, not clear) that were in the VMS lexicon.  If that was the criterion, then the "divergence" (a drop in the similarity as the algorithm progresses) is a problem, because it means that a significant part of the similarity was due to the fact that the seed text had been taken from the VMS.

All the best, --jorge
That was not the only criterion: various Zipfian laws as well as frequency distributions were also compared in the article.
(13-09-2025, 11:27 AM)Jorge_Stolfi Wrote: Here are some tests of your algorithm with a 14-word seed text in English (a bit longer than the one you used above). The mutation algorithm randomly deletes a letter, with increasing prob if the word is long; or inserts a letter chosen with the approximate English letter frequency, with increased prob if the word is short; or replaces a random letter by a loosely similar letter (vowel by vowel, stop by stop, sibilant by sibilant).

Keep in mind that the VMS was produced by a human scribe, not by a computer program. A 15th-century writer could not have executed an algorithm that randomly deletes or inserts letters, since neither computers nor random number generators were available to him. (Sidenote: someone from the 15th century wouldn't even understand the concept of randomness. The word "random" originated in Old French as randon, meaning "speed" or "force," and entered English around the early 14th century, referring to haste or violence. The modern statistical meaning, implying equal chances for all outcomes, emerged in the late 19th century.) Instead, the scribe relied on visual recognition and cognitive processes: scanning the text for source words and applying intuitive modifications. In such a context, it is far more natural to substitute glyphs with visually similar ones than to introduce or remove glyphs at random.

(13-09-2025, 11:27 AM)Jorge_Stolfi Wrote: (This algorithm is not trivial and somewhat "tuned" to English, but I suppose that this is still considerably simpler and less "tuned" than the mutation procedure you used for Voynichese, correct?)
Our algorithm is designed to approximate how a human scribe might have carried out the self-citation method. This is necessarily more complex than a purely mechanical procedure, since it must approximate the human ability to recognize, compare, and adapt patterns. While a human scribe can intuitively judge whether two glyphs or words appear similar, a computer program requires explicit rules to determine which glyphs are considered visually similar. 

Note: Our argument is that the self-citation method could have been executed with ease by a medieval scribe, without the aid of any additional tools. We do not claim that a computer was involved in the creation of the Voynich text, nor that our computer simulation fully captures the complexity of human behavior. Rather, our aim is to demonstrate the feasibility of generating a text as rich and complex as the VMS through the strikingly simple mechanism of the self-citation method.
(13-09-2025, 03:36 PM)Jorge_Stolfi Wrote: The previous message pointed out that, in the output of their test run, "70% were Voynich words"; implying that the similarity criterion was just that, namely the percentage of words (word instances or word forms, not clear) that were in the VMS lexicon.

The similarity criteria in our work are not based on counting how many word instances also appear in the VMS lexicon—we do not even mention this. Instead, our focus is on reproducing the manuscript’s broader key statistical properties, which we describe in detail in our paper.

(13-09-2025, 03:36 PM)Jorge_Stolfi Wrote: If that was the criterion, then the "divergence" (a drop in the similarity as the algorithm progresses) is a problem, because it means that a significant part of the similarity was due to the fact that the seed text had been taken from the VMS.

If you read our paper, you will see that we present a detailed analysis of the Voynich text. Building on this analysis, we introduce a concrete text-generation algorithm—the “self-citation” process—which could have been executed easily by a medieval scribe without any additional tools. We argue that the self-citation method offers the most concise and effective way to account for the features of the Voynich Manuscript.

An experiment by Gaskell and Bowern also strengthens our viewpoint. Gaskell and Bowern recruited volunteers to write short “gibberish” documents as a basis for a statistical comparison with the VMS and with linguistically meaningful texts. Gaskell and Bowern write: "Our results are generally consistent with the proposal of Timm and Schinner that the VMS was generated by a process of “self-citation”: that is, that the VMS scribe(s) generated the text largely by copying or modifying words appearing earlier in the same section". They write further in reference to the self-citation method: “Informal interviews and class discussions confirmed that many participants did indeed adopt this type of approach to create their texts, although they generally did so intuitively rather than by developing an explicit algorithm such as that published by Timm and Schinner” [p. 8]. In my eyes it is further noteworthy that Gaskell and Bowern also report “[...] greater biases in character placement within lines and word placement within sections [...]” as a result of their experiment.

Another paper by Bowern and Lindemann further reports the test subjects’ motivation for the word repetitions: “We tested this point in an undergraduate class and found that beyond about 100 words, the task of writing language-like non-language is very difficult. It is too easy to make local repetitions [...]” This is an important point, because it clarifies that any scribe creating language-mimicking gibberish will sooner or later replace the tedious task of inventing more and more words with the much easier reduplication of existing text (and stick with this strategy).
(13-09-2025, 03:59 PM)dexdex Wrote: various Zipfian laws as well as frequency distributions were also compared in the article.

Still, the parameters of the algorithm can be tuned to make the output similar to any meaningful text by all those criteria.  The fact that the text they created matched those features of Voynichese only means that the parameters were properly tuned to achieve that result.

Note that every output word will be a copy of one of the seed words, modified by some random number M of iterations of Mutate. The expected value Mavg of M is approximately Q log(N/S), where Q is the mutation probability, N is the number of words generated before the word in question (including the seed text), and S is the number of words in the seed text. 

In particular, for N = 5'000 (middle of the first 10'000 words) and S = 12 (as in their recent post), log(N/S) is ~7; that is, each output word will be a seed word copied ~7 times and mutated ~7 Q times, on average.  If we let the algorithm run for 1'000'000 iterations and take only the last 10'000 words, letting N be 995'000 (middle of the taken words) gives log(N/S) as ~12, meaning that each output word will be a seed word copied ~12 times and mutated ~12 Q times.
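A quick sanity check of the Q log(N/S) estimate, under the simplest copying model in which each new word copies a uniformly random earlier word, so the expected number of copy steps back to the seed grows roughly like log(N/S):

  import random, math

  # Sketch: measure the average copy depth (number of copy steps back to the
  # seed) of the last words of a run, and compare it with log(N/S).  Each new
  # word copies a uniformly random earlier word; each copy step would apply
  # Mutate with probability Q.

  def mean_depth(n_words, seed_size, rng, tail=1000):
      depth = [0] * seed_size                       # seed words have depth 0
      for n in range(seed_size, n_words):
          parent = rng.randrange(n)                 # uniformly random earlier word
          depth.append(depth[parent] + 1)
      return sum(depth[-tail:]) / tail

  rng = random.Random(0)
  S, Q = 12, 0.10
  for N in (5_000, 995_000):
      d = mean_depth(N, S, rng)
      print(f"N = {N}: copy depth ~ {d:.1f},  log(N/S) = {math.log(N / S):.1f},"
            f"  expected mutations ~ {Q * d:.1f}")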

To get a reasonable match of the FxR plot, the mutation probability Q must be large enough for most output words to be mutated at least once. (Otherwise the output would be mostly copies of the S seed words; other words would be few and would have much smaller frequencies.) Namely, for N = 10'000 and S = 12, Q must be on the order of 0.05 (5%) or more.  With ~7 copy steps per word on average, Q = 0.05 gives a chance of at least one mutation of about 1 - 0.95^7 ≈ 0.30; so at least 30% of the first 10'000 words will be mutated, which should result in ~3000 distinct words. 

Digraph frequencies

However, if Q is large enough to produce a reasonable FxR plot, the letter and digraph distributions of the generated text will be determined largely by the Mutate procedure.  These statistics will be "correct" only if the Mutate procedure is tuned to produce them.  A simple "untuned" mutate procedure, that blindly inserted, deleted, or replaced a letter at random at a random point of the word -- even with the proper letter frequencies -- would ultimately produce a flat random x random distribution of digraphs. 

For example, suppose that the Mutate procedure were simplistic and just added q at the beginning of the word with some probability P1 and deleted the first letter with some probability P2.  By adjusting P1 and P2 one could get the "correct" frequency of q in the output.  However, that procedure would create many qCh, qt, and qq pairs, which do not occur in Voynichese.  In order to avoid this problem, the Mutate procedure must add q only as a pair qo, and only in front of words that do not start with q, with suitable probability.  And a similar observation applies to every glyph pair.

Therefore, if the algorithm produces a large enough number of new words, and reproduces the letter and digraph distributions of Voynichese, it means only that the Mutate procedure was well-tuned to preserve those letter and letter pair distributions.  And indeed their Mutate procedure, IIUC, parses the word to be mutated according to a Voynichese word structure model, and changes each part in specific ways with carefully chosen probabilities.
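As a toy contrast between an "untuned" and a "tuned" rule (my own illustration, not the T&T Mutate):

  import random

  # Illustration only: two ways of adding the glyph q to an EVA word.  The
  # "blind" rule can match a target q-frequency but creates pairs such as qt or
  # qq, which Voynichese lacks; the "tuned" rule adds q only so that it forms
  # the pair qo, and never before a word that already starts with q.

  def add_q_blind(word, p_add, rng):
      return "q" + word if rng.random() < p_add else word

  def add_q_tuned(word, p_add, rng):
      if word.startswith("q") or rng.random() >= p_add:
          return word
      return "q" + word if word.startswith("o") else "qo" + word

  rng = random.Random(1)
  for w in ("okain", "chedy", "tol", "qokeedy"):
      print(w, "->", add_q_blind(w, 0.5, rng), "/", add_q_tuned(w, 0.5, rng))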

FxR plot

It is known that the output of very simple random word generators, such as a character-based Markov chain of order zero, can have a frequency x rank plot (FxR plot) that follows Zipf's law.  Thus it is not surprising that the output of the T&T algorithm can do the same, if the mutation probability is properly chosen.

As noted above, if Q is too low, most output words will be copies of the seed words.  The number L of distinct words will be only a little bigger than the seed size S. Then the FxR plot will be roughly flat at about 1/S for ranks 1 to S, and then nearly zero from S to L. 

On the other hand, if Q is too high, most output words will have been affected by multiple rounds of Mutate, and therefore will likely be unique in the output sample, or nearly so.  Among the first 10'000 output words, the number L of distinct words will be several thousand, and each of these word types will occur only a few times. The FxR plot will then be nearly flat for ranks 1 to L.

Between those two extremes there will be a sweet spot where the output text will have the right mix of seed words with varied number of Mutate rounds.  Words that have been through only one or two rounds of Mutate will often be equal.  Therefore there will be word types with quite varied frequencies, and the FxR plot will be close to the ideal Zipf plot.

To illustrate this point, here are three runs of my implementation of the T&T algorithm, with Q = 0.01, Q = 0.20, and Q = 0.99.  (The reset probability was set to 0.20, so that the source index was reset after copying 5 words on average.  The reference text "WoW" is 10'000 words extracted from Wells's novel War of the Worlds.  The seed text is the first 12 words of that sample, namely "like that of the thing called a siren in our manufacturing towns". My Mutate procedure is the "dumb" one I described previously, that is, insensitive to word structure except that when inserting a letter it uses the approximate letter frequencies of English, irrespective of the insertion point; and when replacing a letter it tries to preserve its class -- vowel, stop, or sibilant.)
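The bookkeeping behind the counts and plots below can be sketched as follows ("generated.txt" and "wow.txt" are placeholders for a generated sample and the WoW reference text; a word is counted as "valid" if it occurs anywhere in the reference):

  from collections import Counter

  # Sketch: word-type count, valid/invalid token counts, and frequency-by-rank
  # data for a Zipf (FxR) plot, comparing a generated sample against a reference
  # corpus.  File names are placeholders.

  def fxr_stats(generated, reference):
      lexicon = set(reference)
      valid = sum(1 for w in generated if w in lexicon)
      freqs = Counter(generated)
      ranked = sorted(freqs.values(), reverse=True)       # frequencies, highest first
      fxr = list(enumerate(ranked, start=1))              # (rank, frequency) for a log-log plot
      return len(freqs), valid, len(generated) - valid, fxr

  generated = open("generated.txt").read().lower().split()
  reference = open("wow.txt").read().lower().split()
  types, valid, invalid, fxr = fxr_stats(generated, reference)
  print(f"{valid} valid words,  {invalid} invalid,  {types} word types")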

Q = 0.01   
[attachment=11433]txout-200-010-2.png
  called a siren in our manufacturing a siren our manufacturing towns
  called a siren in our called called called manufacturing a our thing
  called a siren in our ... called manufacturing called our a in towns
  called a a called a a siren in called manuacturing a called
  manufacturing called our manufacturing called called

  9497 valid words,  503 invalid,  85 word types

Q = 0.20
[attachment=11434]txout-200-200-1.png 
  manufacturing towns siren manufacturing in our in our in caled in our
  manufacturing towns manufacturinge tons siren manufacturing our
  manufacturing owns manufacturinge ... manufturinge ol mamufacturng
  is owr wn fih ouor ol tynz siren manufactring usiren in trat thak
  owjs manufacturinge iz zide that ohmh like that ien the e o oz thop

  3915 valid words, 6085 invalid, 1149 word types

Q = 0.99
[attachment=11435]txout-200-990-4.png
  lie that manufactuing otowns phat ef the thiing calfled na ssiren in eur
  siren ih eur sireen ir eyr sireer er eys if our manufakturing townz liy
  that manfactuing manufactugring ... ni mnoaqueg lhiwwh thiifvc xafs
  adifi nfw mrtwgjingh ea uhplzq uz ese yazo iwtya ucy qhynp ecegr hsiwnh
  qlu kog h khianlx cyso geq odumo bsfineine wvf enl

  324 valid words,  9676 invalid, 7469 word types   

For reference, the English 10'000-word sample has 2360 word types.  As one can see, with Q = 0.2 the FxR plot matches very closely that of the English sample.  Even with that crude Mutate procedure, which almost always produces an invalid English word, almost 40% of the first 10'000 words are valid (meaning that they occur in the full text of the novel). 

The "founder effect"

The T&T algorithm has another problem that is more apparent when the seed text is very short (like a single line, as stated in that post).  Because of the random nature of the sampling, it often happens that, among the first couple dozen output words, some seed words are replicated several times more than others, just by chance.  Any such initial imbalance will soon become "frozen" as those initial words get copied and re-copied.

For example, suppose that the seed is "from the summit of these pyramids forty centuries look down upon you" (S = 12).  At this point, each of these words has frequency 1/12.  Now let's say that the algorithm starts copying at the word 'upon', and keeps copying for five iterations.  Now the text will have 17 words, and will end with "down upon you upon you upon you upon".  It will have four occurrences of "upon", three of "you", and one each for all the other words.  (Assume there are no mutations so far.)  Even if the source pointer is reset at this moment, new words will be about four times as likely to descend from "upon" as from "pyramids"; and this bias will persist for a very long time.

Because of this "founder effect", the frequencies of letters and digraphs in the output text may be significanly different than those in the seed text, and the FxR plot may deviate from the Zipf law at the high-frequency end.  Moreover, these distortions will be different in different runs of the program, even if all parameters are the same.
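The effect is easy to reproduce: run the copy process with no mutation at all, several times from the same seed, and compare the top word frequencies (a sketch, with illustrative parameters):

  import random
  from collections import Counter

  # Sketch of the founder effect: pure copying (no Mutate) from the same 12-word
  # seed.  The imbalances frozen in during the first few dozen copies persist for
  # the whole run, and differ from run to run.

  seed = "from the summit of these pyramids forty centuries look down upon you".split()

  def copy_only(seed, n_words, p_reset, rng):
      text = list(seed)
      src = rng.randrange(len(text))
      while len(text) < n_words:
          if rng.random() < p_reset or src >= len(text):
              src = rng.randrange(len(text))
          text.append(text[src])
          src += 1
      return text

  for run in range(4):
      rng = random.Random(run)
      text = copy_only(seed, 10000, p_reset=0.2, rng=rng)
      top = Counter(text).most_common(3)
      print(f"run {run}: " + ", ".join(f"{w} {c / len(text):.2f}" for w, c in top))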

Here are four runs of my implementation, all with Q = 0.20 and all other parameters fixed as above:

[attachment=11434]txout-200-200-1.png
[attachment=11436]txout-200-200-2.png
[attachment=11437]txout-200-200-3.png
[attachment=11438]txout-200-200-4.png
(25-08-2025, 07:55 PM)Mauro Wrote: One thing which could be interesting to do is to compare the copy&modify algorithm with a version of it without the copy&modify part, but with the same generation rules. The idea is to generate one word at a time, always starting from a null string and applying the rules a random number of times (a few times) for each generated word (one also needs a mechanism for creating 'separable' words: at each step generate two words and decide with a certain probability whether to add the 'join two words' rule, else keep only the first word). This could help in separating the effect of the copy&modify mechanism from the effect of the rules.

I think this is brilliant. I think it would tell us something.
(17-09-2025, 03:56 PM)Eiríkur Wrote: One thing which could be interesting to do is to compare the copy&modify algorithm with a version of it without the copy&modify part, but with the same generation rules. The idea is to generate one word at a time, always starting from a null string and applying the rules a random number of times (a few times) for each generated word.

I will let others do that for Voynichese, with the T&T Mutate procedure.  Meanwhile, just for fun, here are the results for English, using my "English-tuned" Mutate procedure.

Again, my Mutate procedure randomly chooses between "insert", "delete", and "replace", with probs proportional to the total frequency in an English corpus of words whose length is greater than, less than, or equal to the length of the given word.  The procedure then scans the word collecting the possible interventions of that type, and the merit of the result of each intervention.  Specifically:
  • For "insert", each candidate is a pair of consecutive letters XZ in the word and a letter Y, and the merit is the count of occurrences of the trigram XYZ in the corpus.
  • For "delete", each candidate is a triplet of consecutive letters XYZ in the word, and the merit is the count of occurrences of the digram "XZ" in the corpus.
  • For "replace", each candidate is a triplet of consecutive letters XYZ in the word, a letter W distinct from Y, and the merit is the count of occurrences of the trigram "XWZ" in the corpus.
Then the procedure chooses one candidate intervention with probability proportional to its merit, and executes it.  The digrams and trigrams include start-of-word and end-of-word, so the three operations above may be applied to the first and last letter as well as internal letters.
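In code, that Mutate can be sketched roughly as follows (boundary markers, weighting, and tie-breaking are illustrative choices, and "wow.txt" stands in for the corpus file):

  import random
  from collections import Counter

  # Sketch of the merit-weighted Mutate described above.  "^" and "$" mark the
  # start and end of a word, so insertions, deletions, and replacements can also
  # touch the first and last letters.

  ALPHABET = "abcdefghijklmnopqrstuvwxyz"

  def ngram_counts(words):
      di, tri = Counter(), Counter()
      for w in words:
          s = "^" + w + "$"
          di.update(s[i:i + 2] for i in range(len(s) - 1))
          tri.update(s[i:i + 3] for i in range(len(s) - 2))
      return di, tri

  def mutate(word, di, tri, len_freq, rng):
      s = "^" + word + "$"
      n = len(word)
      # operation probs ~ total corpus frequency of longer / shorter / equal-length words
      weights = [sum(c for L, c in len_freq.items() if L > n),    # insert
                 sum(c for L, c in len_freq.items() if L < n),    # delete
                 len_freq.get(n, 0)]                              # replace
      op = rng.choices(("insert", "delete", "replace"), weights=weights)[0]

      cands = []                                   # (merit, resulting word)
      if op == "insert":                           # insert Y between X = s[i] and Z = s[i+1]
          for i in range(len(s) - 1):
              for y in ALPHABET:
                  m = tri[s[i] + y + s[i + 1]]
                  if m:
                      cands.append((m, s[1:i + 1] + y + s[i + 1:-1]))
      elif op == "delete":                         # delete Y from the triplet X Y Z
          for i in range(len(s) - 2):
              m = di[s[i] + s[i + 2]]
              if m:
                  cands.append((m, s[1:i + 1] + s[i + 2:-1]))
      else:                                        # replace Y by W != Y in the triplet X Y Z
          for i in range(1, len(s) - 1):
              for w in ALPHABET:
                  if w != s[i]:
                      m = tri[s[i - 1] + w + s[i + 1]]
                      if m:
                          cands.append((m, s[1:i] + w + s[i + 1:-1]))
      if not cands:
          return word
      merits, results = zip(*cands)
      return rng.choices(results, weights=merits)[0]

  corpus = open("wow.txt").read().lower().split()
  di, tri = ngram_counts(corpus)
  len_freq = Counter(len(w) for w in corpus)
  rng = random.Random(0)
  word = "phleghm"
  for M in range(1, 7):
      word = mutate(word, di, tri, len_freq, rng)
      print(f"M = {M}: {word}")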

The corpus is the full text of Wells's novel War of the Worlds ("WoW").

The simulation starts with the word "phleghm" (which is not in the corpus), applies the Mutate procedure above M times, and outputs the result.  This is repeated 10'000 times, independently.

Results [Note: the curve labeled "T&T" in the Zipf plots is actually for my Mutate applied M times]:

M = 1

  pheghm phieghm pheghm phlegthm hleghm phleghmy hleghm paleghm
  phlleghm phleg-m pheghm pheghm hleghm pheghm phlegh hleghm
  phlighm phlegham pheghm phleghe pleghm pheghm pheghm pheleghm
  pheghm pheghm phledghm phlethm phieghm pheghm pheghm pheghm
  hleghm phreghm pheghm hleghm pheleghm phleghme phlleghm
  phlerhm pheghm pheghm pheghm phleghme phlerghm pthleghm pheghm
  phleghe hleghm hleghm ... hleghm phlegthm pheghm phlegh pheghm
  pheghm pheghm hleghm pheleghm phleghe pthleghm pheghm pheghm
  hleghm uphleghm phlighm pheghm pheghm poleghm pheghm phlegh
  hleghm pheghm pheghm hleghm pheghm phlethm pheghm pheghm
  hleghm phlighm puleghm pheghm pheghm pheghm pheghm phlerhm
  hleghm pheghm pheghm phlegh phlegham pheghm pheghm phlerhm
  hleghm pheghm phlegh pheghm phleg-m

[attachment=11443]
94 word types, 0 valid tokens, 10000 invalid

M = 2 

  phaghm phegthm hleghmy sleghm phllegh heghm pheghe hlegh
  phligham pheghe plegh pheeghm phenghm phledshm pheghm pherhm
  pieghm phreghm pheghe hlerghm phegh pheghmy phegh phegh leghm
  pheghe theghm heghm heghm pherghm heghm ptheghm peghm heghm
  hlighm phleghm phleth theghm phegh spheghm peghm phoeghm phegh
  hleg-m heghm phtegh phegthm pheghme pheghe phighm ... phegh
  heghm heghm heghm heghm phreghm phlegth theghm pthlegh hlenghm
  phligh phegh ptheghm hleghe phegh phlech heghm phlengh puleghe
  haleghm hleghim peghm heghm phegh leghm thlveghm heghm peghm
  heghm pheghm thleghm pherhm hlegh phegh leghm phleth pheghmp
  heghm heghm pheghm heghm phighm heghm phegh sleghm phegthm
  peghm uleghm hlighm upheghm

[attachment=11448]
760 word types, 0 valid tokens, 10000 invalid

M = 3

  haghm hlegthmy sledghm heg-m hegh hlighem heghe preghm
  phenighm phethm pherm pellegh phegthe hegh pheghm pheg leghe
  deghm henghm pheeighm heghim hegh hethm heaghm phlet paleg
  sphegh pedghm pheth eghm phegh phenghme pheghed hegh pegthm
  hethm philenh shegh pheg peghe popleghm theghe herghm phlug
  herhm theghme pheghe lethm pegh eghm ... peghum hegh eghm
  phlnegh ptherghm alleghm hlvegh hligh pheghm hegthm hedghm
  hegh ptheg-m thlighe paletchm eghm hegh phugh pegam phighe
  hehm highm eghm hlleghm phegh hlexhmy eghm pheighm scheghm
  pthegh pechm theghm hegom haghm phigthm hegh eghm hlethe
  theghm thleg-m threghm poughm hegum deghm eghm hregum phlext
  eghm heg-m eghm

[attachment=11446]
1848 word types, 0 valid tokens, 10000 invalid

M = 4 

  hagham haghmy hlegh hegh phighas poleg phenithm pethm pinghm
  hegthe heght hesh pegh ethm hugh highim hregh herm heilegh
  thoghm pegh heth egh hlithm pheghe hugh thaghm theg heghm phed
  pheght theg hethmy hugh theghmbe ligh teghem hegthe thethm
  pherthm whegh hengthm heg-sm pheghe leghe poleeg peam wthlegh
  theghmy hehe ... pigh slege-m eg-m hethm hege pegn peg hunghm
  hegth hegh hehe plegh egh hlege thineghm heh hegho phe heg teghm
  thegh heghe heghe peeg-m hlitim heg eg-m hilewh heghe hltegh
  hegh pehe hetchm polegat lerh pthegh hlseth pole ceg-m hlegte
  hehm ptenghm heghm upheg upegh eneghm beghe tiegh htogh heram

[attachment=11447]
3031 word types, 121 valid tokens, 9879 invalid
 
M = 5

  agham henghmy peg hligre pole phathm ther egthe hegh thug aghm
  hecanghm heghim eng-m phlet palar pregh hesh legthm pheghed
  hughe mengh beghe phe peght thee hegh alighmbe cigh hegeam
  thethe heghth hegh pe-gtum hlegher pheghe poleege pegs ileghmy
  pegum enghim haigh ege thethe lighm yeg eghm egh hllegt
  ptchega ... eathm eghme tllehim alegs hedghe pengh hedsam
  hic-m hug lewh hedg pthweghe hlere hethm heghm phers legh
  herkm mehm heckhm peghm theght hm sphedg egh peigh olergh
  peshm igh hegh hig tethe tchigh peg heredgh opughim wegh haghe
  hregs hegh phige g-m egre phege heghmy edg-m lege plege heck eghm

[attachment=11449]
4179 word types, 227 valid tokens, 9773 invalid
 
M = 6

  aghamy edg-m ery higs aghe tcher hegh hesh pach einghm ighim
  ng-m hreixegh alag hed leghm ptaghe erch prrhmy thleghle heghe
  terthms hut eche tegeam tethe herhm tengthm heg-sit peghe
  polege wthegs theghe ham ough ere theigre bene-m ghe eckhe
  hleate hege hugh hathm eg-m er-m heg aghi gers ig ... perim
  lerwhem eghe hle enum cleg hug-m eerm negme hegum eg ileg ere
  thm upeghe high gh-m hegth knegthm pere-m ber pegts egthm egre
  ig-im hatege bedg heghe aleg-m fleem heedshe ptath let aleg
  thush hegth lplegh pewh acth hlueghe sple pege thegle ejedgh
  h-ghm exhi eig egim there hethe

[attachment=11450]
5165 word types, 377 valid tokens, 9623 invalid

M = 20

  wait ionedem whenct tesp rilved arioud foui cohee t'se ovoie-e
  idin ne sph tur dun iffs lane ar asthe ale sabor tht scarn
  ache at setes sued shon whined asppt ree swen en end an te
  tiom amale bumm peong hrof the seme rine pevely feea plin juf
  ablee asei ... nif mang ive mend aber grche thim isiay brelle
  ebrhin ofire thagly yged mard hapan gr soo nifth whe menth
  ievem lige tery agro mbe ime wik sthe ullft as aght head its
  gond hup aye unses upil trre lper men sad gre tht edge bysie
  fes ache tcht elll

[attachment=11451]
6940 word types, 995 valid tokens, 9005 invalid