The Voynich Ninja

Full Version: Opinions on: line as a functional unit
(15-11-2025, 12:32 AM)Bernd Wrote: On a more serious note - I wonder how a meter could explain downwardness beyond a few lines.

How about this poem? (Please note, I'm not a poet.)

Im Frühling erwacht alles neu, Morgen lichtet Schatten
Der Frühling haucht leise über dieses Morgen Wiesen
Frühling trägt sanfte Zeichen durch des Morgen Garten
Und Frühling führt uns am Morgen immer weiter

(In English, roughly: In spring everything awakens anew, morning lifts shadows / Spring breathes softly over these morning meadows / Spring carries gentle signs through the morning's garden / And spring leads us ever onward in the morning.)

The AI offers the following Latin translation in hexameter:

In ver resurget, nova omnia, mane tollit umbras
(In ver | resur | get, no | va o | mni-a | mane tollit umbras)

Ver leniter spirat super has mane pratas
(Ver le | ni-ter | spi-rat | su-per | has ma | ne pra-tas)

Ver fert lenia signa per mane hortum
(Ver fert | le-ni-a | sig-na | per ma | ne hor | tum)

Et ver ducit nos mane leniter ad auroram
(Et ver | du-cit | nos ma | ne le | ni-ter | ad au-ro-ram)
(14-11-2025, 02:24 PM)Jorge_Stolfi Wrote: I saw @tavie's presentation at the last Voynich day, but that was a lot of detail, and included speculation about the head lines...


To clarify, even though I cover top rows/head lines in both Voynich Day presentations, all the stats are separated out to avoid or minimize potential distortions from different effects. For example, the line start discrepancy stats compare i) predictions generated from a "pure" basis that excludes line starts themselves, line ends, and top rows, with ii) the actual line start results, which exclude likely paragraph starts.
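
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the general approach (not tavie's actual code; the file name and input format -- one manuscript line per text line, paragraphs separated by blank lines -- are assumptions):

Code:
from collections import Counter

# Hypothetical input: one manuscript line per text line, words space-separated,
# paragraphs separated by blank lines.
paragraphs, current = [], []
with open("transliteration.txt", encoding="utf-8") as f:
    for raw in f:
        line = raw.strip()
        if line:
            current.append(line.split())
        elif current:
            paragraphs.append(current)
            current = []
if current:
    paragraphs.append(current)

basis = Counter()        # mid-line tokens from non-top rows: the "pure" basis
line_starts = Counter()  # first tokens of lines that are not paragraph starts

for para in paragraphs:
    for row, words in enumerate(para):
        if row == 0:
            continue                 # drop top rows / likely paragraph starts
        line_starts[words[0]] += 1
        basis.update(words[1:-1])    # exclude the line start and line end

total_basis = sum(basis.values()) or 1
total_starts = sum(line_starts.values())
for w, n in line_starts.most_common(20):
    expected = basis[w] / total_basis * total_starts
    print(f"{w:12} observed {n:5}  expected-from-basis {expected:8.1f}")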

(14-11-2025, 03:21 PM)Jorge_Stolfi Wrote: Let me also remind people of the recent finding that the line-breaking algorithm used by scribes (not just on the VMS, but in any language and any epoch, even today) has the side effect of making the first word of each line longer than average, and the last few words shorter than average.  This phenomenon alone can have a significant effect on word frequencies at the start of the line...

I thought this was a theory proposed by Elmar Vogt and Ger Hungerink in 2012... but I didn't think it was proven as an explanation. I find it interesting, and it makes sense that smaller words have a greater chance of sneaking in past the line-end cutoff, while larger words have a greater chance of missing out and therefore starting the next line. But I had the impression only printed works were counted (is that wrong?), and it would be more interesting to see assessments of manuscripts, especially where longer words could in theory squeeze in at the end of the line by being abbreviated. It would also be interesting to separate out paragraph start words (we see some really long ones, as well as short ones like "pol") from the line start pack, since they are almost certainly not being wrapped round.


Another point I'd make is that while the word wrap concept of longer words not fitting at line end is a potential explanation for the word length discrepancies, I don't see it as being a comprehensive potential explanation for the glyph discrepancies.  It might have an impact, but I don't see the evidence for it being the primary explanatory factor for why word types at line start are different.  We would expect the word types or glyph clusters "missing" at line end to be excessively popular at line start.  In many cases, we do not see this.  Something else is going on.
There are other examples of circular texts with far more elaborate markers than the simple line noted in VMs Sagittarius Post #105.

Examples on White Aries (2), Cancer, the Cosmos, Central Rosette (2) and others.

The outer ring of text on White Aries is unusual among the zodiac examples because of its internal vord repetition.
(15-11-2025, 02:39 AM)tavie Wrote: I thought this was a theory proposed by Elmar Vogt and Ger Hungerink in 2012... but I didn't think it was proven as an explanation.

The effect itself is real. Someone posted histograms of first/mid/last word lengths to this forum not long ago.   And you saw my quick check on that Portuguese novel above.

Quote:But I had the impression only printed works were counted (is that wrong?), and it would be more interesting to see assessments of manuscripts, especially where longer words could in theory squeeze in at the end of the line by being abbreviated.

I don't recall whether it has been verified on the VMS, but I don't see why it would not hold for manuscripts too.   On manuscripts the scribe can squeeze the letters themselves to delay line breaking.  But even so, a break is more likely to be unavoidable before a long word than a short one.
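
The mechanism is easy to demonstrate by simulation. Here is a minimal sketch (the toy vocabulary and line width are arbitrary choices): greedily wrap a stream of words at a fixed width, then compare mean word lengths by line position. The first-word-longer / last-word-shorter effect falls out of the greedy break rule alone.

Code:
import random

# Arbitrary toy vocabulary; any real word stream behaves the same way.
random.seed(1)
vocab = ["a", "of", "the", "king", "water", "garden", "mountain", "translation"]
words = [random.choice(vocab) for _ in range(50000)]

WIDTH = 60  # line width in characters, an arbitrary choice
lines, current, used = [], [], 0
for w in words:
    need = len(w) + (1 if current else 0)   # one space before non-initial words
    if used + need > WIDTH:                 # greedy break: word doesn't fit
        lines.append(current)
        current, used = [w], len(w)
    else:
        current.append(w)
        used += need
if current:
    lines.append(current)

def mean(xs):
    return sum(xs) / len(xs)

firsts = [len(l[0]) for l in lines if len(l) > 1]
lasts = [len(l[-1]) for l in lines if len(l) > 1]
mids = [len(w) for l in lines for w in l[1:-1]]
print(f"first: {mean(firsts):.2f}  mid: {mean(mids):.2f}  last: {mean(lasts):.2f}")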

Quote:It would also be interesting to separate out paragraph start words (we see some really long ones, as well as short ones like "pol") from the line start pack since they are almost certainly not being wrapped round.

Indeed.  My advice is to exclude parag head lines and not look at them at all.  Whatever happens there is likely to be different and more complicated than what happens on the other lines.  So let's first understand the latter.  Then we can go back and study the head lines, using whatever we found about the body lines as a starting point... 

Quote:Another point I'd make is that while the word wrap concept of longer words not fitting at line end is a potential explanation for the word length discrepancies, I don't see it as being a comprehensive potential explanation for the glyph discrepancies.  It might have an impact, but I don't see the evidence for it being the primary explanatory factor for why word types at line start are different.

If the tokens at line start are longer than average, longer word types must have higher frequencies in that position than elsewhere, and the opposite must be true for shorter word types.  For instance, the frequency of qokeedy should be higher at line start than at mid-line, while the opposite should be true for ar.  But indeed it would be important to verify and quantify these differences for the VMS.  
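
Such a check is straightforward once a transliteration is at hand. A minimal sketch, assuming a plain-text file with one manuscript line per text line (the file name is hypothetical):

Code:
from collections import Counter

start, mid = Counter(), Counter()
with open("transliteration.txt", encoding="utf-8") as f:   # hypothetical file name
    for raw in f:
        words = raw.split()
        if len(words) < 3:         # need a start, a middle, and an end
            continue
        start[words[0]] += 1
        mid.update(words[1:-1])    # line-end tokens excluded as well

ts, tm = sum(start.values()) or 1, sum(mid.values()) or 1
for w in ("qokeedy", "ar"):        # the two example types from the post
    s, m = start[w] / ts, mid[w] / tm
    ratio = s / m if m else float("inf")
    print(f"{w:8} line-start {s:.4%}  mid-line {m:.4%}  start/mid ratio {ratio:.2f}")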

Quote:We would expect the word types or glyph clusters "missing" at line end to be excessively popular at line start.  In many cases, we do not see this.  Something else is going on.

Indeed, it is not certain that these differences in word type frequencies will cause differences in character frequencies, but it is certainly possible.   And even if the line breaking length bias turns out to be insufficient to explain the line-start anomalies, we would have to subtract its effects in order to understand the real anomalies and infer their causes.
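
One way to subtract the length-bias effect, sketched under the same input assumptions as above (hypothetical file name, one manuscript line per text line): compute each word type's expected line-start count if only word length mattered, by combining the observed length distribution of line-start tokens with the mid-line type shares within each length class, and compare with the observed count.

Code:
from collections import Counter, defaultdict

start, mid = Counter(), Counter()
with open("transliteration.txt", encoding="utf-8") as f:   # hypothetical file name
    for raw in f:
        words = raw.split()
        if len(words) < 3:
            continue
        start[words[0]] += 1
        mid.update(words[1:-1])

# Length distribution of actual line-start tokens.
len_dist = Counter(len(w) for w in start.elements())

# Mid-line word-type counts grouped by word length.
by_len = defaultdict(Counter)
for w, n in mid.items():
    by_len[len(w)][w] += n

for w in ("qokeedy", "daiin", "ar"):   # arbitrary example types
    L = len(w)
    class_total = sum(by_len[L].values())
    share = by_len[L][w] / class_total if class_total else 0
    expected = share * len_dist[L]     # expected line-start count under length bias alone
    print(f"{w:8} observed {start[w]:5}  expected-under-length-bias {expected:8.1f}")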

One of the things I learned in the years that I spent watching cryptocurrencies is the term FOMO = "fear of missing out".  It is the tendency of many investors to put money into new things, even if they don't understand them, for fear of missing out on the next Apple or Facebook.  Maybe something like that is happening here too?  People being reluctant to focus on just one section, to discard head lines, and to collapse similar glyphs into classes, for fear of missing out on some important discovery?

All the best, --stolfi
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Wow, Patrick's paper is quite a big meal to digest.

I'd recommend the conference paper [link] over the earlier/longer blog post.

(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote:
1. Remove the head lines of parags.  Among other things, that line is likely to have special contents (like plant names and aliases), which could well imply different word frequencies and positional patterns, and hence the same for characters and digraphs.

When I published that initial blog post, I hadn't yet worked out another kind of display I used in the conference paper, using differences of brightness and/or color to represent the frequency of particular features in different areas of lines and paragraphs.  Here are a few examples from the conference paper (which also goes into more detail about how these displays are generated).

[attachment=12463]

The first and last lines of paragraphs and the first and last words of lines are separated out into their own rows or columns, while everything in the "middle" is displayed by relative position. As you can see, in spite of some notable differences, the key patterns visible in mid-paragraph also extend into first lines.  The [Sh] / [ch] case seems due mostly to [Sh] words being especially common in second position specifically.  The [k] / [t] and [qo] / [o] cases are more spread out -- loosely separated into a "first half of line" and "second half of line."
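
For readers who want to reproduce this style of display, here is a minimal sketch of the underlying table only, without the grayscale rendering (this is not pfeaster's published code; the feature spelling and input format are assumptions): count a feature's share of word tokens in each cell of a grid whose rows are first/mid/last lines of a paragraph and whose columns are first/mid/last words of a line.

Code:
from collections import defaultdict

FEATURE = "Sh"   # glyph group to track, spelled as in one's transliteration

def row_key(i, n):   # first line, last line, or middle of the paragraph
    return "first" if i == 0 else ("last" if i == n - 1 else "mid")

def col_key(j, m):   # first word, last word, or middle of the line
    return "first" if j == 0 else ("last" if j == m - 1 else "mid")

hits = defaultdict(int)     # tokens starting with FEATURE, per cell
totals = defaultdict(int)   # all tokens, per cell

# Hypothetical input: paragraphs separated by blank lines, one manuscript
# line per text line, words space-separated.
with open("transliteration.txt", encoding="utf-8") as f:
    para = []
    for raw in f.read().splitlines() + [""]:   # trailing "" flushes last paragraph
        if raw.strip():
            para.append(raw.split())
            continue
        for i, words in enumerate(para):
            for j, w in enumerate(words):
                cell = (row_key(i, len(para)), col_key(j, len(words)))
                totals[cell] += 1
                hits[cell] += w.startswith(FEATURE)
        para = []

for cell in sorted(totals):
    print(cell, f"{hits[cell] / totals[cell]:.3%}")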

(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: One problem to watch for here is that parag breaks are sometimes not obvious.  In the Stars section, in particular, I suspect that there is a run of 5-6 parags that were joined by the Scribe (a newbie?) into a single parag, before he returned to the normal format.

I doubt there are enough truly ambiguous cases of paragraph division to make much overall impact on these statistics, but I agree that it's worth being aware of some uncertainty on this front.

(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 2. Limit the analysis to just one of the sections with substantial running text -- Herbal-A, Herbal-B, Bio, and Stars.  If the anomalies are real, they should be noticeable, and probably even stronger, in one of those sections.   If they turn out to be absent or different in other sections, that by itself would be important information.

They're still noticeable within individual sections, and of course there's some variation across sections with this as with everything else. Rather than pursuing this exhaustively on my own, I've made the code [link] for generating grayscale or multicolor charts like the ones shown above available for anyone who would like to run whatever specific comparisons/contrasts they like.

(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 3. Try to identify the words that are responsible for the anomalies.  Maybe I have misread the tables in the paper, but among the Sh/Ch word pairs, some seem to have greater positional bias than others.

I don't believe specific "words" are in fact responsible for the anomalies, in the sense I think you mean. From what I can tell, "words" generally show positional biases based on the positional patterns of their constituent parts.  If you want to predict the positional bias of any given word, your best bet seems to be to combine the positional biases of the glyphs or glyph groups that make it up -- odd as that might seem.
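
That compositional claim can itself be tested. Here is a minimal sketch of one way to do it (not pfeaster's method; treating each transliteration character as a glyph, and multiplying per-glyph ratios, are simplifying assumptions):

Code:
from collections import Counter

start_glyphs, mid_glyphs = Counter(), Counter()
start_words, mid_words = Counter(), Counter()

with open("transliteration.txt", encoding="utf-8") as f:   # hypothetical file name
    for raw in f:
        words = raw.split()
        if len(words) < 3:
            continue
        start_words[words[0]] += 1
        start_glyphs.update(words[0])      # NB: counts characters; real glyph
        for w in words[1:-1]:              # groups like [Sh] or [qo] would need
            mid_words[w] += 1              # a proper tokenizer
            mid_glyphs.update(w)

def bias(c_start, c_mid, key, t_start, t_mid):
    # Ratio of line-start share to mid-line share, with add-one smoothing.
    return ((c_start[key] + 1) / t_start) / ((c_mid[key] + 1) / t_mid)

tgs, tgm = sum(start_glyphs.values()) or 1, sum(mid_glyphs.values()) or 1
tws, twm = sum(start_words.values()) or 1, sum(mid_words.values()) or 1

for w in ("daiin", "qokeedy", "shedy", "ar"):   # arbitrary example types
    predicted = 1.0
    for g in w:                                 # combine per-glyph biases
        predicted *= bias(start_glyphs, mid_glyphs, g, tgs, tgm)
    observed = bias(start_words, mid_words, w, tws, twm)
    print(f"{w:8} predicted {predicted:7.2f}  observed {observed:7.2f}")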

(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Could it be, for example, that the most leftward member of the pair often occurs after a long word, while the other member more often occurs after a short one? That could perhaps explain the positional anomaly as a consequence of the line-breaking word-length bias.

Or maybe the two members of the pair can get fused or split at different rates in the transcription.  So that some of the Sheols are actually Sheoldy while most Cheols are indeed Cheols.  I can't think how this possible confounding factor could be addressed.  Although this may be one case...

I'd encourage you to test those hypotheses.

Still, if we generate similar displays based purely on raw glyph sequences, ignoring word breaks altogether, the skewed distribution of glyphs and glyph groups tends to recapitulate the skewed distribution of words containing those same glyphs and glyph groups -- so my sense is that what we're seeing here isn't likely to be an artifact of spacing ambiguities or the lengths of adjacent words.
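
In the same spirit, here is a minimal sketch of a raw-stream check (the file name is hypothetical and decile binning is an arbitrary choice): strip the word breaks from each line and tally where a chosen glyph group falls along the line.

Code:
from collections import Counter

GROUP = "qo"            # glyph group to locate, spelled as in the transliteration
bins = Counter()

with open("transliteration.txt", encoding="utf-8") as f:   # hypothetical file name
    for raw in f:
        stream = raw.strip().replace(" ", "")              # ignore word breaks
        n = len(stream)
        if n < len(GROUP):
            continue
        for i in range(n - len(GROUP) + 1):
            if stream[i:i + len(GROUP)] == GROUP:
                bins[min(9, i * 10 // n)] += 1             # decile of the line

total = sum(bins.values()) or 1
for b in range(10):
    print(f"decile {b}: {bins[b] / total:.2%}")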
(18-11-2025, 11:34 AM)pfeaster Wrote:
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Wow, Patrick's paper is quite a big meal to digest.

I'd recommend the conference paper [link] over the earlier/longer blog post.

The graph is very nice. It looks like a frame, as if even the last row of a page is special? And there are even changes inside the text from left to right.

The question remains: in what kinds of other texts do we see this kind of pattern?
(15-11-2025, 06:15 AM)Jorge_Stolfi Wrote:
Quote:Another point I'd make is that while the word wrap concept of longer words not fitting at line end is a potential explanation for the word length discrepancies, I don't see it as being a comprehensive potential explanation for the glyph discrepancies.  It might have an impact, but I don't see the evidence for it being the primary explanatory factor for why word types at line start are different.
If the tokens at line start are longer than average, longer word types must have higher frequencies in that position than elsewhere, and the opposite must be true for shorter word types.  For instance, the frequency of qokeedy should be higher at line start than at mid-line, while the opposite should be true for ar.  But indeed it would be important to verify and quantify these differences for the VMS.  

Quote:We would expect the word types or glyph clusters "missing" at line end to be excessively popular at line start.  In many cases, we do not see this.  Something else is going on.
Indeed, it is not certain that these differences in word type frequencies will cause differences in character frequencies, but it is certainly possible.   And even if the line breaking length bias turns out to be insufficient to explain the line-start anomalies, we would have to subtract its effects in order to understand the real anomalies and infer their causes.


Tavie is right. Something else is going on. And I have quantified it. In case you might have skipped it, I presented evidence to show that vertical pair repeats (consecutive repeats of the first character of the first word on successive lines) occur far less often than would be expected if the same first-word first-characters were placed randomly. I compared the actual numbers against numbers obtained from simulations of such random placements, to obtain a measure of the confidence in the conjecture that something significant is going on.

[link]
[link]
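
For anyone who wants to rerun that comparison, here is a minimal sketch of the general procedure (not the poster's actual code; the file name and input format are assumptions): count consecutive-line repeats of the first glyph of the first word, then shuffle those glyphs many times to estimate what random placement would produce.

Code:
import random

# Hypothetical input: one manuscript line per text line, words space-separated.
with open("transliteration.txt", encoding="utf-8") as f:
    firsts = [line.split()[0][0] for line in f if line.split()]

def vertical_repeats(seq):
    # consecutive lines whose first words begin with the same glyph
    return sum(a == b for a, b in zip(seq, seq[1:]))

observed = vertical_repeats(firsts)

# Shuffle the same multiset of first glyphs to simulate random placement.
# (A stricter version would shuffle within pages or paragraphs only.)
random.seed(0)
sims = []
for _ in range(10000):
    shuffled = firsts[:]
    random.shuffle(shuffled)
    sims.append(vertical_repeats(shuffled))

mean_sim = sum(sims) / len(sims)
p = sum(s <= observed for s in sims) / len(sims)   # one-sided: fewer repeats
print(f"observed {observed}, simulated mean {mean_sim:.1f}, p(sim <= obs) = {p:.4f}")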
I conducted a small experiment with LAAFU (line as a functional unit). The initial question is: in the absence of known vowels and consonants, is it possible to form chunks from Voynich words based purely on statistical frequency and then use them for further analysis? Here is the result.

Code:
from collections import Counter

# -----------------------------
# Settings / Configuration
# -----------------------------
INPUT_FILE = "RF1a-n-x7.txt"                # space-separated transliteration, one MS line per text line
MIN_COUNT = 2                               # a substring must occur at least this often to count as a chunk
MIN_LEN = 2                                 # shortest substring length considered
MAX_LEN = 10                                # longest substring length considered

OUTPUT_FILE_CHUNKS = "RF1a-n-x7_chunked.txt"
OUTPUT_FILE_TOP_CHUNKS = "RF1a-n-x7_top_chunks.txt"
OUTPUT_FILE_LINES = "RF1a-n-x7_line_start_end_pairs.txt"
OUTPUT_FILE_LINES_CSV = "RF1a-n-x7_line_start_end_pairs.csv"

# -----------------------------
# 1) Read all words line by line
# -----------------------------
words = []
lines = []

with open(INPUT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        line_words = line.strip().split()
        if line_words:
            lines.append(line_words)
            words.extend(line_words)

# -----------------------------
# 2) Count all substrings (potential chunks)
# -----------------------------
freq = Counter()

for w in words:
    L = len(w)
    for l in range(MIN_LEN, min(MAX_LEN, L) + 1):
        for i in range(L - l + 1):
            substring = w[i:i+l]
            freq[substring] += 1

# Keep only substrings that occur at least MIN_COUNT times
common = {s for s, c in freq.items() if c >= MIN_COUNT}

# -----------------------------
# 3) Function: Chunk a single word
# -----------------------------
def chunk_word(word):
    """
    Break a word into statistically significant chunks.
    Chunks are scored by: frequency * length
    Highest score wins.
    """
    chunks = []
    i = 0
    L = len(word)
    while i < L:
        candidates = []
        for l in range(MIN_LEN, min(MAX_LEN, L - i) + 1):
            piece = word[i:i+l]
            if piece in common:
                score = freq[piece] * l
                candidates.append((score, l, piece))
        if candidates:
            candidates.sort(key=lambda x: (-x[0], -x[1]))
            _, l, match = candidates[0]
            chunks.append(match)
            i += l
        else:
            chunks.append(word[i])
            i += 1
    return chunks

# -----------------------------
# 4) Process each line and collect positional statistics
# -----------------------------
word_start = Counter()  # first chunk in word
word_mid = Counter()    # middle chunks
word_end = Counter()    # last chunk in word

line_pairs = Counter()  # Start → End pairs per line

chunked_words = []

for line_words in lines:
    start_chunk_line = None
    end_chunk_line = None

    for idx, w in enumerate(line_words):
        chunks = chunk_word(w)
        chunked_words.append(f"{w}: {','.join(chunks)}")

        if chunks:
            # register word-start/mid/end per word
            word_start[chunks[0]] += 1
            word_end[chunks[-1]] += 1
            for c in chunks[1:-1]:
                word_mid[c] += 1

        # For the line-level start-end pair:
        if idx == 0 and chunks:
            start_chunk_line = chunks[0]
        if idx == len(line_words) - 1 and chunks:
            end_chunk_line = chunks[-1]

    if start_chunk_line and end_chunk_line:
        line_pairs[(start_chunk_line, end_chunk_line)] += 1

# -----------------------------
# 5) Save chunked words to file
# -----------------------------
with open(OUTPUT_FILE_CHUNKS, "w", encoding="utf-8") as f:
    for line in chunked_words:
        f.write(line + "\n")

# -----------------------------
# 6) Function: Print top chunks
# -----------------------------
def print_top_chunks(counter, title, top_n=12):
    total = sum(counter.values())
    if total == 0:          # nothing recorded for this position; skip cleanly
        return
    print(f"\n{title}")
    print("-" * 40)
    for chunk, count in counter.most_common(top_n):
        pct = count / total * 100
        print(f"{chunk:<6} {count:6} ({pct:5.2f}%)")

# -----------------------------
# 7) Save top chunks to file
# -----------------------------
with open(OUTPUT_FILE_TOP_CHUNKS, "w", encoding="utf-8") as f:
    for counter, title in [(word_start, "Word-Start Chunks"),
                          (word_mid, "Word-Mid Chunks"),
                          (word_end, "Word-End Chunks")]:
        f.write(f"{title}\n")
        f.write("-" * 40 + "\n")
        total = sum(counter.values()) or 1  # guard against empty counters
        for chunk, count in counter.most_common(12):
            pct = count / total * 100
            f.write(f"{chunk:<6} {count:6} ({pct:.2f}%)\n")
        f.write("\n")

# -----------------------------
# 8) Save Start/End pairs per line (TXT)
# -----------------------------
with open(OUTPUT_FILE_LINES, "w", encoding="utf-8") as f:
    f.write("Line-Start → Line-End Chunk Pairs\n")
    f.write("-" * 50 + "\n")
    for (start, end), count in line_pairs.most_common():
        f.write(f"{start:<8} → {end:<8}  {count}x\n")

# -----------------------------
# 8b) Save Start/End pairs also as CSV
# -----------------------------
with open(OUTPUT_FILE_LINES_CSV, "w", encoding="utf-8") as f:
    f.write("start_chunk,end_chunk,count\n")
    for (start, end), count in line_pairs.most_common():
        f.write(f"{start},{end},{count}\n")

# -----------------------------
# 9) Console output
# -----------------------------
print_top_chunks(word_start, "Top 12 Word-Start Chunks")
print_top_chunks(word_mid, "Top 12 Word-Mid Chunks")
print_top_chunks(word_end, "Top 12 Word-End Chunks")

print(f"\nDone! Chunked text saved in {OUTPUT_FILE_CHUNKS}")
print(f"Top chunks saved in {OUTPUT_FILE_TOP_CHUNKS}")
print(f"Line start/end pairs saved in {OUTPUT_FILE_LINES}")
print(f"CSV start/end pairs saved in {OUTPUT_FILE_LINES_CSV}")


Here are the twelve most frequent chunks (start/middle/end), each listed with its count and share of the total.
[attachment=12875]

Frequent pairs (start chunk of the line's first word, end chunk of the line's last word):
[attachment=12876]
The top 20 chunk pairs line by line
[attachment=12888]
(11-12-2025, 01:14 AM)bi3mw Wrote: The top 20 chunk pairs

Looking at the list, I guess that this applies to start-end chunks that also have something in between, correct?