The Voynich Ninja

n-grams while ignoring spaces?
I'm getting a lot of comments on my videos (many more than I can answer properly), and some of them are really good questions. What I like most is that a lot of people are thinking about the system, rather than coming up with obscure languages.

On my first video about entropy, @stoplight2554 commented:

Quote:wouldnt this suggest some sort of n-gram based system? there has to be some trade off between character length of a text block and encoded characters per length. this sort of has to assume that spaces are to be ignored though..

also, this only works if the ability to predict the next character from the last is not 'continuous' across a section of text. if it reliably fails to predict at a certain interval, then you have your n-gram length. if it never fails to predict the next character, then its too deterministic to express any meaning whatsoever (unless the meaning itself is the repeated pattern)


I like the way they think: the system does suggest n-grams as a possible part of the solution. They also recognize that heavy use of n-grams would almost certainly mean that spaces aren't really spaces.

Their experiment sounds interesting: you make a long string of characters with spaces removed, and test at which intervals entropy goes up. But would this be testable at all? You'd need to make choices for parsing (e.g. what's your initial treatment of [iin]?). And a single missing or extra character (by scribal error) would throw the system off. 

Maybe it's more useful to think in terms of entropy, which is more of an average? So, for example: how easy is it to predict the character two positions ahead when spaces are removed?

I'd also assume that consistent use of, let's say, bigrams would inflate your alphabet to such an extent that it would become impossible to compare to other texts?
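
To make the suggested test concrete: below is a minimal Python sketch (not from any of the posts; the function name and the toy input are mine) that strips word breaks and measures the conditional entropy of the character k positions ahead, for several lags k. If the text were really built from fixed-length n-grams, predictability should change noticeably at multiples of the n-gram length.

Code:
from collections import Counter
from math import log2

def conditional_entropy_at_lag(text, lag):
    # H(X[i+lag] | X[i]) over all positions i in the space-stripped text
    pairs = Counter((text[i], text[i + lag]) for i in range(len(text) - lag))
    firsts = Counter(a for a, _ in pairs.elements())
    total = sum(pairs.values())
    h = 0.0
    for (a, b), count in pairs.items():
        p_pair = count / total          # P(a, b)
        p_cond = count / firsts[a]      # P(b | a)
        h -= p_pair * log2(p_cond)
    return h

eva_line = "pykedy.olfchedy.qokedy.spchy.chedy.rol.dor.ofchedy.qokedy"
stripped = eva_line.replace(".", "")    # drop the word breaks entirely
for k in range(1, 6):
    print(k, round(conditional_entropy_at_lag(stripped, k), 3))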
Spaces do appear to have some significance as separators, but given the large number of long, unique vords that appear to be composed of other vords, and the number of ambiguous spaces, it may be useful to run the analysis without spaces (still separating by lines) and see whether any patterns emerge that line up with the use of spaces.

One of the biggest problems with variable-length n-grams is telling whether, say, "okeol" is "ok-eol" or "oke-ol". Spaces could serve simply as a way of separating out ambiguous n-grams.
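
To illustrate that ambiguity, here is a toy sketch (the chunk inventory is invented purely for illustration) that enumerates every way of cutting a vord into chunks drawn from a given n-gram inventory; with both "ok"/"eol" and "oke"/"ol" available, "okeol" has two readings.

Code:
def segmentations(word, inventory):
    # return every way to cut `word` into a sequence of chunks from `inventory`
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in inventory:
            for rest in segmentations(word[i:], inventory):
                results.append([head] + rest)
    return results

chunks = {"ok", "eol", "oke", "ol"}        # hypothetical n-gram inventory
print(segmentations("okeol", chunks))      # [['ok', 'eol'], ['oke', 'ol']]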
(20-10-2024, 04:46 PM)Koen G Wrote: Their experiment sounds interesting: you make a long string of characters with spaces removed, and test at which intervals entropy goes up. But would this be testable at all?

I don't think this would work well, because a frequent bigram like yq should signal a probable word break. Infrequent bigrams are not the only bigrams where word breaks should occur.

Parsing space-less Voynichese into "chunks" must follow some simple rule that can be easily applied by a human (not a complex grammar), such as partial ordering in a series of "slots" as Massimiliano Zattera suggested, but without the ambiguities.

Let me take words in Alemannic usage as an example.
zmorge, zmittag, znacht. The first letter stands for the word "zum". Translated literally into German, "zum Morgen", but the meaning is "zum Frühstück" (for breakfast)... likewise "zum Mittagessen" (for lunch) and "zum Abendessen" (for dinner).
Sentences like "wän gitz zmorge?" read as "When is breakfast?"; likewise "was gits zmorge?" means "What's for breakfast?".
So it happens that different meanings are attached to the noun.
Take examples like "88" or "oo": if they stand next to each other, you can already bet that the first one is an article. The frequency of "o" at the beginning of a word also points to this.

From my point of view, before I tear sentences apart and put them back together again, I first have to look around and see what was actually normal and how it was used. That presupposes that I already have to decide on a region.

Note the differences in the Lord's Prayer (Vater unser) between 1600 and 1800. And that was the official German of the region.
This discussion reminds me of an article in the 7 February issue of Science, "Whale song shows language-like statistical structure"  (Arnon et al., Science 387 (2025) 649).  It concludes that "subsequences" in humpback whale song follow a Zipfian power-law distribution.  A formal challenge along the way is to identify the boundaries between such subsequences.  Building on previous work in human infants, the authors segment whale-song recordings by identifying dips in transition probability between fundamental sound elements:  "Because words are statistically coherent, transitional probabilities within words are higher (on average) than those between words."  That the subsequences constitute a lexicon is "an assumption that cannot be reasonably applied to whales"—oh well—but their statistical structure is thought to facilitate learning and cultural transmission.

Relevant or not, the paper is entertaining, and the university has naughtily left a copy on their website.  For VMS researchers, it may suggest a new method of segmenting the text for analysis... or even calve a new theory of authorship.
Patrick Feaster is probably the researcher who has done the most work on the nature of Voynich word spaces.

If I understand correctly, his transitional probabilities for the qokeedy.qokeedy (B) and chol.daiin (A) loops could partially confirm that spaces are consistent with uncertainty in the following character.

Something that isn't clear to me about the whale-song paper by Arnon et al.: they seem to work with pure drops in transitional probabilities, but does this mean that rare symbols are taken to start new segments? For instance, EVA:ag is rare: the probability of a transition from a to g is low; would this result in g usually being taken as the start of a new sequence?
Thanks obelus, nice fun paper, nice fun pun!

Hi MarcoP, I think it works like this: "identifying dips in transition probability between fundamental sound elements".
I am not sure that "fundamental sound elements" can be equated with individual Voynichese glyphs.

In Fig. S1 of Arnon et al., the segmentation method is described as: "we cut whenever the ratio between two consecutive transitional probabilities was lower than 0.25".

EVA-ag is only one transition, so by itself it would not trigger a cut under the above method.
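
For anyone who wants to try the quoted rule on Voynichese, here is a rough sketch of how it could be applied to a plain character string. This is not the authors' code; the function name, the example string and the boundary placement are only my reading of the sentence quoted above.

Code:
from collections import Counter

def segment_by_tp_dips(seq, threshold=0.25):
    # transitional probability P(next | current), estimated from the sequence itself
    bigrams = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    tp = {pair: count / firsts[pair[0]] for pair, count in bigrams.items()}

    segments, start = [], 0
    for i in range(1, len(seq) - 1):
        prev_tp = tp[(seq[i - 1], seq[i])]
        curr_tp = tp[(seq[i], seq[i + 1])]
        if curr_tp / prev_tp < threshold:          # sharp dip -> cut after seq[i]
            segments.append(seq[start:i + 1])
            start = i + 1
    segments.append(seq[start:])
    return segments

print(segment_by_tp_dips("olshedyqokedyrshedycthdyotedykedydaldal"))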
Yeah, we know that Voynichese is "very unlikely" to be phonetic, so Arnon's research may not be relevant. But the first post suggests that we are considering a similar method here anyway?

I was mentioning EVA:ag as an example of two consecutive glyphs, in situations like chotag, okag, arag, etc.
I tried an experiment on Q13 based on the ideas discussed in this thread. As always, it is quite possible that I made errors.

I used the ZL3a-n.txt EVA transliteration, in which benched characters, benched gallows and i/e-sequences were replaced by single uppercase letters.
In the processing, all spaces are removed. For each pair of consecutive characters X and Y, the actual frequency of the XY bigram is compared with the expected frequency freq(X)*freq(Y) (similar to the method Emma and I used in our paper about word-break combinations). When this ratio drops to less than 50% of the previous value, a "break" in the sequence is inferred (the same kind of threshold discussed in the whale paper by Arnon et al.).
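
As a reference point, here is a minimal sketch of the procedure as described above (not the actual script used for the experiment): spaces and commas are stripped, each adjacent pair XY gets an observed/expected ratio, and a break is inferred when the ratio falls below 50% of the previous pair's value. The toy call at the end feeds in a single plain EVA line rather than the uppercase-substituted transliteration, so its statistics are far too sparse to be meaningful; the real run used the whole Q13 text.

Code:
from collections import Counter

def infer_breaks(text, drop=0.5):
    text = text.replace(".", "").replace(",", "")   # ignore the original word breaks
    chars = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n = len(text)

    def ratio(a, b):
        # observed bigram count vs. count expected if the characters were independent
        expected = (chars[a] / n) * (chars[b] / n) * (n - 1)
        return bigrams[(a, b)] / expected

    out, prev = [text[0]], None
    for i in range(1, len(text)):
        r = ratio(text[i - 1], text[i])
        if prev is not None and r < drop * prev:    # sharp drop -> inferred break
            out.append(" ")
        out.append(text[i])
        prev = r
    return "".join(out)

print(infer_breaks("pykedy.olfchedy.qokedy.spchy.chedy.rol.dor.ofchedy.qokedy"))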

Original EVA and script output for the first 5 lines of the page (I treated the initial weirdo as p).

pykedy.olfchedy.qokedy.spchy.chedy.rol.dor.ofchedy.qokedy
pykEdy olfC EdyqokEdys pC yCEdy rol d orofC EdyqokEdy

olshedy.qokedy.rshedy.cthdy.otedy.kedy.dal,dal.dol.oty.dal
olSEdyqokEdy rSEdy Tdy otEdy kEdy dal dal d ol ot ydal

qokedy.chety.qolshedy.okedy.dol,eesolchey.qotedy.ol,dam
qokEdy CE tyqolSEdy okEdy d ol EsolCE yqotEdy ol dam

ol,chy.lshdy.lcheckhy.ol.keedy.lcheckhey.l,olkedykain,ol
olC ylS dy lCE Ky ol kEdy lCE KEylol kEdy kaIn ol

qor.olkeey.olkain.ol.chsey.ol.cheeky.dar.okal.dal.olchedy
qo rol kEyol kaIn olC s EyolCE kydar okal dal olCEdy

A large image with plots for each character transition accompanies the original post.
The results are mixed. In several cases, Voynich spaces tend to follow the criteria discussed here, i.e. they tend to occur between two characters that are not tied by a strong sequential correlation. Results for the second line are particularly good. But many of the inferred spaces do not match the actual text; an obvious problem is yq, which is never detected as a break in the sequence, since in Q13 it is a frequent bigram when spaces are ignored.