quimqu > 01-04-2026, 12:16 PM
| Hypothesis | Expected behavior | Observed failure |
|---|---|---|
| Random / weak structure | No stable local similarity or positional effects | Strong clustering and positional patterns persist |
| Sequential (Markov-like) | Next token predictable from previous ones | Bigram/HMM models add little or collapse |
| Copy–modify (parent-based) | Clear local derivations, strong nearest neighbor | Generative models produce too much similarity |
| Single dominant parent | One best local candidate per token | Multiple candidates with similar scores, no clear winner |
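The "Sequential (Markov-like)" row can be checked with a simple information-theoretic comparison: if the previous token really predicted the next one, the bigram conditional entropy would sit well below the unigram entropy. The sketch below is only an illustration under my own assumptions (plug-in frequency estimates, whitespace tokens, hypothetical function names); it is not the pipeline used in this thread, and plug-in estimates are biased on small samples.

```python
import math
from collections import Counter
from typing import List


def unigram_entropy(tokens: List[str]) -> float:
    """Shannon entropy (bits) of the token frequency distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def bigram_conditional_entropy(tokens: List[str]) -> float:
    """H(next | previous) in bits, from plug-in bigram counts."""
    pairs = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    n = sum(pairs.values())
    return -sum(
        (c / n) * math.log2(c / prev_counts[prev])
        for (prev, _nxt), c in pairs.items()
    )


def sequential_gain(tokens: List[str]) -> float:
    """Bits of uncertainty removed by knowing the previous token."""
    return unigram_entropy(tokens) - bigram_conditional_entropy(tokens)
```

A gain close to zero would be consistent with "bigram models add little". It should be compared against the same statistic on a shuffled copy of the text, because sparse bigram counts inflate the apparent gain.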
quimqu > 01-04-2026, 12:43 PM
| Assumption | Operational test |
|---|---|
| Candidate set constrained by line position | Predict top-k tokens from line features and measure recall |
| Local context restricts form, not identity | Build candidates by similarity and check if real token is included |
| Selection within the set is weak | Train ranking model and measure score flatness |
| Substructures drive compatibility | Use prefixes/suffixes only and test predictive power |
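Two of the operational tests above can be sketched in a few lines: building a candidate set by similarity and checking whether the real token is included, and measuring how flat a set of ranking scores is. Everything here is an assumption for illustration: the candidate rule (vocabulary words within a small Levenshtein distance of any token in the preceding window), the window size, and the flatness measure (normalized entropy of the scores). The vocabulary is taken from the whole text, which makes the recall optimistic.

```python
import math
from typing import List, Sequence, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def candidate_set(context: Sequence[str], vocab: Set[str], max_dist: int = 1) -> Set[str]:
    """Vocabulary words within max_dist edits of at least one context token."""
    return {w for w in vocab
            if any(levenshtein(w, c) <= max_dist for c in context)}


def candidate_recall(tokens: List[str], window: int = 5,
                     max_dist: int = 1) -> Tuple[float, float]:
    """Return (recall, mean pool size) over all tokens that have a context."""
    vocab = set(tokens)
    hits, pool_total, n = 0, 0, 0
    for i in range(1, len(tokens)):
        pool = candidate_set(tokens[max(0, i - window):i], vocab, max_dist)
        hits += tokens[i] in pool
        pool_total += len(pool)
        n += 1
    return (hits / n, pool_total / n) if n else (0.0, 0.0)


def score_flatness(scores: Sequence[float]) -> float:
    """Normalized entropy of non-negative scores: 1.0 = flat, 0.0 = one winner."""
    total = sum(scores)
    if len(scores) < 2 or total <= 0:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(scores))
```

High recall with a small pool would support "local context restricts form", and flatness near 1.0 for the ranking scores would support "selection within the set is weak".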
nablator > 01-04-2026, 12:50 PM
(01-04-2026, 12:16 PM)quimqu Wrote: Simple sequential models and naive copy-and-modify processes do not fit.
quimqu > 01-04-2026, 01:12 PM
(01-04-2026, 12:50 PM)nablator Wrote: I tried to propose a less naïve self-citation generation method, including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?
Such a method, if it can be shown to replicate all known properties of Voynichese, does not exclude the possibility of a cipher such as [link not shown], but unnecessary complications can usually be discarded by Occam's razor when there is no evidence specifically pointing at them.
nablator > 01-04-2026, 03:32 PM
(01-04-2026, 01:12 PM)quimqu Wrote: Most direct self-citation models tend to produce too much chaining and identifiable source–target links.
Quote: If you agree, I will try to model them and put them in my pipeline. Will let you know!
quimqu > 01-04-2026, 04:08 PM
(01-04-2026, 03:32 PM)nablator Wrote: I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links, but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread).
quimqu > 05-04-2026, 08:39 PM
| Model | Core idea | What it gets roughly right | Main failures |
|---|---|---|---|
| base | Samples tokens from global vocabulary with weak repetition penalty | Basic entropy levels, approximate repetition | Too little structure, wrong similarity, poor hapax and vocabulary control, no positional behavior |
| base_variant | Global sampling + occasional independent small edits | Repetition and some local similarity | Variation not tied to context, wrong hapax rate, no positional asymmetry, weak vocabulary structure |
| prev_morph | New token often derived from previous token (local edit process) | Short-range similarity and chaining effects | Over-relies on sequential edits, fails on repetition balance, hapax and vocabulary distribution |
| positional_prev_morph | Local pool + dependence on previous token + bias by position in line | Repetition, similarity, partial positional effects | Still poor hapax rate, vocabulary size, and incorrect start/end of line behavior |
| positional_prev_morph_new | Same as above + small probability of generating new forms (controlled novelty) | Best overall balance across repetition, similarity and entropy | Large errors remain in hapax rate, vocabulary, word-length asymmetry and positional distributions |
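To make the table concrete, here is a toy version of a prev_morph-style generator together with the kind of summary statistics the models are scored on. It is a minimal sketch, not the code behind the table. The glyph inventory, the single-edit morph rule, the `p_morph` parameter, and the definition of hapax rate (share of word types that occur exactly once) are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence

ALPHABET = "acdehiklnoqrsty"  # hypothetical EVA-like glyph inventory


def morph(token: str, rng: random.Random) -> str:
    """Apply one random single-character edit (substitute, insert, or delete)."""
    op = rng.choice(["sub", "ins", "del"])
    pos = rng.randrange(len(token))
    if op == "sub":
        return token[:pos] + rng.choice(ALPHABET) + token[pos + 1:]
    if op == "ins":
        return token[:pos] + rng.choice(ALPHABET) + token[pos:]
    # Delete one character, but never produce an empty token.
    return token[:pos] + token[pos + 1:] if len(token) > 1 else token


def generate_prev_morph(seed_vocab: Sequence[str], n_tokens: int,
                        p_morph: float = 0.5, seed: int = 0) -> List[str]:
    """With probability p_morph edit the previous token, else sample globally."""
    rng = random.Random(seed)
    out = [rng.choice(seed_vocab)]
    while len(out) < n_tokens:
        if rng.random() < p_morph:
            out.append(morph(out[-1], rng))
        else:
            out.append(rng.choice(seed_vocab))
    return out


def summary_stats(tokens: List[str]) -> Dict[str, float]:
    """Vocabulary size, hapax rate (per type), and adjacent exact repetition."""
    counts = Counter(tokens)
    n_types = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return {
        "vocab_size": n_types,
        "hapax_rate": hapax / n_types if n_types else 0.0,
        "exact_repetition": repeats / (len(tokens) - 1) if len(tokens) > 1 else 0.0,
    }
```

Running `summary_stats` on both the generated text and the transliteration, with identical tokenization, gives a like-for-like comparison of the failure modes listed in the last column.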
quimqu > 05-04-2026, 11:45 PM
| Aspect | Earlier models | New local compatibility model | Remaining gap |
|---|---|---|---|
| Generation mechanism | Sequential or global sampling | Selection from local compatible pool | Internal structure of the pool still unclear |
| Exact repetition | Too high or unstable | Very close to real text | Largely solved |
| Local similarity | Either too weak or too forced | Close to real behavior | Slightly too diffuse |
| Vocabulary control | Collapse or explosion | Balanced regime | Still somewhat smaller than real |
| Family persistence | Too strong (chaining) | Close to real low persistence | Mostly solved |
| Number of active families | Very low | Moderate increase | Still far below real text |
| Family distribution | Highly concentrated | Less dominated by a single family | Entropy still too low |
| Within-family coherence | Not well modeled | Partially captured | Families too loose internally |
| Overall conclusion | Sequential rules insufficient | Local selection works better | Need better structure of overlapping families |
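The family rows in this table (number of active families, family distribution, family persistence) can be measured once a family assignment is fixed. The thread does not define how families are assigned, so the sketch below uses a purely hypothetical rule, the first two characters of the token, only to show the shape of the computation. Any other assignment can be passed in through `key`.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def prefix_family(token: str, k: int = 2) -> str:
    """Hypothetical family key: the first k characters of the token."""
    return token[:k]


def family_stats(tokens: List[str],
                 key: Callable[[str], str] = prefix_family) -> Dict[str, float]:
    """Active family count, entropy of the family distribution, persistence."""
    fams = [key(t) for t in tokens]
    counts = Counter(fams)
    n = len(fams)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    # Persistence: share of adjacent token pairs that stay in the same family.
    same = sum(1 for a, b in zip(fams, fams[1:]) if a == b)
    persistence = same / (n - 1) if n > 1 else 0.0
    return {
        "active_families": len(counts),
        "entropy_bits": entropy,
        "persistence": persistence,
    }
```

Computing these per sliding window rather than over the whole text would match the "active families" framing more closely, since it asks how many families are in play locally.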
quimqu > 06-04-2026, 10:12 AM
quimqu > 06-04-2026, 09:33 PM
| Category | Finding | What it suggests |
|---|---|---|
| Robust | Local similarity is higher than in global and positional shuffles (across multiple similarity measures) | There is a real local constraint structure, not an artefact of one metric |
| Robust | Best match comes from the local window, not from the previous token (2.19 vs 3.89) | The system behaves like a local pool, not a Markov chain |
| Robust | Past vs future is symmetric (difference ≈ 0) | No strong directional generation signal |
| Robust | Within-line shuffle preserves most of the signal | Order inside the line is less important than the set of nearby tokens |
| Representation-dependent | Character windows (char4, char5) still show structure | The effect is not purely due to spaces or tokenization |
| Representation-dependent | Token bigrams destroy the signal | Not every segmentation preserves the structure |
| Weak effects | Middle of line slightly more similar and dense than start/end | There is positional influence, but it is limited |
| Weak effects | Candidate pool size is stable (~19) across positions | No strong evidence that position changes the size of the allowed set |
| Model limitation | Local model overshoots similarity (Levenshtein distance too low, density too high) | Simple local reuse is too strong and too compact |
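Several of the "Robust" rows compare a best-match distance under different contexts: previous token only, a local window, past versus future, and shuffled controls. A minimal sketch of that comparison follows, assuming the measure is the mean Levenshtein distance to the closest token in the context. The window size, the function names, and the global shuffle control are my assumptions, not a description of the actual pipeline.

```python
import random
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mean_best_match(tokens: List[str], window: int = 5, side: str = "past") -> float:
    """Mean distance from each token to its closest neighbour in the context.

    side is "past", "future", or "both"; window=1 with side="past" reduces
    to the plain previous-token distance.
    """
    dists = []
    for i, tok in enumerate(tokens):
        past = tokens[max(0, i - window):i]
        future = tokens[i + 1:i + 1 + window]
        if side == "past":
            ctx = past
        elif side == "future":
            ctx = future
        else:
            ctx = past + future
        if ctx:  # skip positions with no context at the text boundary
            dists.append(min(levenshtein(tok, c) for c in ctx))
    return sum(dists) / len(dists) if dists else float("nan")


def shuffled_copy(tokens: List[str], seed: int = 0) -> List[str]:
    """Global shuffle control: same tokens, all local order destroyed."""
    out = list(tokens)
    random.Random(seed).shuffle(out)
    return out
```

Comparing `mean_best_match(tokens, window=1)` with a wider window addresses the window-versus-previous-token row, the "past" and "future" sides address the symmetry row, and running the same calls on `shuffled_copy(tokens)` gives the global shuffle baseline. A within-line shuffle would need line boundaries, which this sketch does not model.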