bi3mw > 15-11-2025, 01:22 AM
(15-11-2025, 12:32 AM)Bernd Wrote: On a more serious note - I wonder how a meter could explain downwardness beyond a few lines.

tavie > 15-11-2025, 02:39 AM
(14-11-2025, 02:24 PM)Jorge_Stolfi Wrote: I saw @tavie's presentation at the last Voynich day, but that was a lot of detail, and included speculation about the head lines...
(14-11-2025, 03:21 PM)Jorge_Stolfi Wrote: Let me also remind people of the recent finding that the line-breaking algorithm used by scribes (not just on the VMS, but in any language and any epoch, even today) has the side effect of making the first word of each line longer than average, and the last few words shorter than average. This phenomenon alone can have a significant effect on word frequencies at the start of the line...
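A minimal sketch of how this length effect could be measured, assuming a plain-text transliteration with one manuscript line per row and space-separated words (the filename is hypothetical):

from statistics import mean

# Hypothetical input: one manuscript line per row, words separated by spaces.
with open("transcription.txt", encoding="utf-8") as f:
    rows = [r.split() for r in f if r.strip()]

rows = [r for r in rows if len(r) >= 3]           # need a start, middle, and end
first  = [len(r[0]) for r in rows]                # line-start tokens
last   = [len(r[-1]) for r in rows]               # line-end tokens
middle = [len(w) for r in rows for w in r[1:-1]]  # everything in between

print(f"mean length: start {mean(first):.2f}, "
      f"mid {mean(middle):.2f}, end {mean(last):.2f}")

If the bias described above is present, the start mean should sit above the mid-line mean and the end mean below it.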
R. Sale > 15-11-2025, 05:04 AM
Jorge_Stolfi > 15-11-2025, 06:15 AM
(15-11-2025, 02:39 AM)tavie Wrote: I thought this was a theory proposed by Elmar Vogt and Ger Hungerink in 2012...but I didn't think it was proven as an explanation.
Quote:But I had the impression only printed works were counted (is that wrong?), and it would be more interesting to see assessments of manuscripts, especially where longer words could in theory squeeze in at the end of the line by being abbreviated.
Quote:It would also be interesting to separate out paragraph start words (we see some really long ones, as well as short ones like "pol") from the line start pack since they are almost certainly not being wrapped round.
Quote:Another point I'd make is that while the word wrap concept of longer words not fitting at line end is a potential explanation for the word length discrepancies, I don't see it as being a comprehensive potential explanation for the glyph discrepancies. It might have an impact, but I don't see the evidence for it being the primary explanatory factor for why word types at line start are different.
Quote:We would expect the word types or glyph clusters "missing" at line end to be excessively popular at line start. In many cases, we do not see this. Something else is going on.
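A minimal sketch of how that last expectation could be checked, assuming the same one-line-per-row transliteration format (the filename is hypothetical): tabulate each word type's share of line-start tokens next to its share of line-end tokens and look for the predicted mirror image.

from collections import Counter

with open("transcription.txt", encoding="utf-8") as f:
    rows = [r.split() for r in f if r.strip()]

starts = Counter(r[0] for r in rows if len(r) >= 2)
ends   = Counter(r[-1] for r in rows if len(r) >= 2)
n_s, n_e = sum(starts.values()), sum(ends.values())

# If a type is "missing" at line end, the word-wrap account predicts a
# correspondingly inflated share at line start.
for w, c in starts.most_common(20):
    print(f"{w:<10} start {c / n_s:7.3%}  end {ends[w] / n_e:7.3%}")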
pfeaster > 18-11-2025, 11:34 AM
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Wow, Patrick's paper is quite a big meal to digest.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 1. Remove the head lines of parags. Among other things, that line is likely to have special contents (like plant names and aliases), which could well imply different word frequencies and positional patterns, and hence the same for characters and digraphs.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: One problem to watch for here is that parag breaks are sometimes not obvious. In the Stars section, in particular, I suspect that there is a run of 5-6 parags that were joined by the Scribe (a newbie?) into a single parag, before he returned to the normal format.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 2. Limit the analysis to just one of the sections with substantial running text -- Herbal-A, Herbal-B, Bio, and Stars. If the anomalies are real, they should be noticeable, and probably even stronger, in one of those sections. If they turn out to be absent or different in other sections, that by itself would be important information.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 3. Try to identify the words that are responsible for the anomalies. Maybe I have misread the tables in the paper, but among the Sh/Ch word pairs, some seem to have greater positional bias than others.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Could it be, for example, that the most leftward member of the pair often occurs after a long word, while the other member more often occurs after a short one? That could perhaps explain the positional anomaly as a consequence of the line-breaking word-length bias.
Or maybe the two members of the pair can get fused or split at different rates in the transcription, so that some of the Sheols are actually Sheoldy while most Cheols are indeed Cheols. I can't think how this possible confounding factor could be addressed. Although this may be one case...
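A minimal sketch of how the after-a-long-word conjecture above could be probed, assuming the same transliteration format; the EVA spellings sheol/cheol below stand in for whichever Sh/Ch pair is under test, and the filename is hypothetical.

from statistics import mean

with open("transcription.txt", encoding="utf-8") as f:
    rows = [r.split() for r in f if r.strip()]

def preceding_lengths(target):
    # Lengths of the token immediately before each non-initial occurrence of target.
    return [len(r[i - 1]) for r in rows for i in range(1, len(r)) if r[i] == target]

for w in ("sheol", "cheol"):   # hypothetical pair members
    lens = preceding_lengths(w)
    if lens:
        print(f"{w}: n={len(lens)}, mean length of preceding token {mean(lens):.2f}")

A clear gap between the two means would be consistent with the line-breaking account; near-equal means would point elsewhere.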
Kaybo > 24-11-2025, 08:41 AM
(18-11-2025, 11:34 AM)pfeaster Wrote: (15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Wow, Patrick's paper is quite a big meal to digest.
I'd recommend the [link] over the earlier/longer blog post.
dashstofsk > 24-11-2025, 02:39 PM
(15-11-2025, 06:15 AM)Jorge_Stolfi Wrote: Quote:Another point I'd make is that while the word wrap concept of longer words not fitting at line end is a potential explanation for the word length discrepancies, I don't see it as being a comprehensive potential explanation for the glyph discrepancies. It might have an impact, but I don't see the evidence for it being the primary explanatory factor for why word types at line start are different.
If the tokens at line start are longer than average, longer word types must have higher frequencies in that position than elsewhere, and the opposite must be true for shorter word types. For instance, the frequency of qokeedy should be higher at line start than at mid-line, while the opposite should be true for ar. But indeed it would be important to verify and quantify these differences for the VMS.
Quote:We would expect the word types or glyph clusters "missing" at line end to be excessively popular at line start. In many cases, we do not see this. Something else is going on.
Indeed, it is not certain that these differences in word type frequencies will cause differences in character frequencies, but it is certainly possible. And even if the line-breaking length bias turns out to be insufficient to explain the line-start anomalies, we would have to subtract its effects in order to understand the real anomalies and infer their causes.
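A minimal sketch of the verification suggested above, comparing a word type's relative frequency at line start with its mid-line frequency (the qokeedy/ar examples come from the post; the filename and format are assumptions):

from collections import Counter

with open("transcription.txt", encoding="utf-8") as f:
    rows = [r.split() for r in f if r.strip()]
rows = [r for r in rows if len(r) >= 3]

start_counts = Counter(r[0] for r in rows)
mid_counts   = Counter(w for r in rows for w in r[1:-1])
n_start, n_mid = sum(start_counts.values()), sum(mid_counts.values())

for w in ("qokeedy", "ar"):
    f_start = start_counts[w] / n_start
    f_mid   = mid_counts[w] / n_mid
    ratio = f_start / f_mid if f_mid else float("inf")
    print(f"{w:<8} start {f_start:.4%}  mid {f_mid:.4%}  start/mid {ratio:.2f}")

Under the length-bias prediction, the long type should show a start/mid ratio above 1 and the short type a ratio below 1.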
bi3mw > 10-12-2025, 08:34 AM
from collections import Counter
# -----------------------------
# Settings / Configuration
# -----------------------------
INPUT_FILE = "RF1a-n-x7.txt"
MIN_COUNT = 2
MIN_LEN = 2
MAX_LEN = 10
OUTPUT_FILE_CHUNKS = "RF1a-n-x7_chunked.txt"
OUTPUT_FILE_TOP_CHUNKS = "RF1a-n-x7_top_chunks.txt"
OUTPUT_FILE_LINES = "RF1a-n-x7_line_start_end_pairs.txt"
OUTPUT_FILE_LINES_CSV = "RF1a-n-x7_line_start_end_pairs.csv"
# -----------------------------
# 1) Read all words line by line
# -----------------------------
words = []
lines = []
with open(INPUT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        line_words = line.strip().split()
        if line_words:
            lines.append(line_words)
            words.extend(line_words)
# -----------------------------
# 2) Count all substrings (potential chunks)
# -----------------------------
freq = Counter()
for w in words:
    L = len(w)
    for l in range(MIN_LEN, min(MAX_LEN, L) + 1):
        for i in range(L - l + 1):
            substring = w[i:i+l]
            freq[substring] += 1
# Keep only substrings that occur at least MIN_COUNT times
common = {s for s, c in freq.items() if c >= MIN_COUNT}
# -----------------------------
# 3) Function: Chunk a single word
# -----------------------------
def chunk_word(word):
    """
    Break a word into statistically significant chunks.
    Chunks are scored by: frequency * length
    Highest score wins.
    """
    chunks = []
    i = 0
    L = len(word)
    while i < L:
        candidates = []
        for l in range(MIN_LEN, min(MAX_LEN, L - i) + 1):
            piece = word[i:i+l]
            if piece in common:
                score = freq[piece] * l
                candidates.append((score, l, piece))
        if candidates:
            candidates.sort(key=lambda x: (-x[0], -x[1]))
            _, l, match = candidates[0]
            chunks.append(match)
            i += l
        else:
            chunks.append(word[i])
            i += 1
    return chunks
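# Illustrative example (hypothetical output, depends on the corpus):
#   chunk_word("qokeedy") might return ["qo", "kee", "dy"] if those
#   substrings clear MIN_COUNT; a word with no frequent substrings
#   falls back to single characters.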
# -----------------------------
# 4) Process each line and collect positional statistics
# -----------------------------
word_start = Counter() # first chunk in word
word_mid = Counter() # middle chunks
word_end = Counter() # last chunk in word
line_pairs = Counter() # Start → End pairs per line
chunked_words = []
for line_words in lines:
    start_chunk_line = None
    end_chunk_line = None
    for idx, w in enumerate(line_words):
        chunks = chunk_word(w)
        chunked_words.append(f"{w}: {','.join(chunks)}")
        if chunks:
            # register word-start/mid/end per word
            word_start[chunks[0]] += 1
            word_end[chunks[-1]] += 1
            for c in chunks[1:-1]:
                word_mid[c] += 1
        # For the line-level start-end pair:
        if idx == 0 and chunks:
            start_chunk_line = chunks[0]
        if idx == len(line_words) - 1 and chunks:
            end_chunk_line = chunks[-1]
    if start_chunk_line and end_chunk_line:
        line_pairs[(start_chunk_line, end_chunk_line)] += 1
# -----------------------------
# 5) Save chunked words to file
# -----------------------------
with open(OUTPUT_FILE_CHUNKS, "w", encoding="utf-8") as f:
    for line in chunked_words:
        f.write(line + "\n")
# -----------------------------
# 6) Function: Print top chunks
# -----------------------------
def print_top_chunks(counter, title, top_n=12):
    total = sum(counter.values())
    if total == 0:
        return
    print(f"\n{title}")
    print("-" * 40)
    for chunk, count in counter.most_common(top_n):
        pct = count / total * 100
        print(f"{chunk:<6} {count:6} ({pct:5.2f}%)")
# -----------------------------
# 7) Save top chunks to file
# -----------------------------
with open(OUTPUT_FILE_TOP_CHUNKS, "w", encoding="utf-8") as f:
    for counter, title in [(word_start, "Word-Start Chunks"),
                           (word_mid, "Word-Mid Chunks"),
                           (word_end, "Word-End Chunks")]:
        f.write(f"{title}\n")
        f.write("-" * 40 + "\n")
        total = sum(counter.values())
        for chunk, count in counter.most_common(12):
            pct = count / total * 100
            f.write(f"{chunk:<6} {count:6} ({pct:.2f}%)\n")
        f.write("\n")
# -----------------------------
# 8) Save Start/End pairs per line (TXT)
# -----------------------------
with open(OUTPUT_FILE_LINES, "w", encoding="utf-8") as f:
    f.write("Line-Start → Line-End Chunk Pairs\n")
    f.write("-" * 50 + "\n")
    for (start, end), count in line_pairs.most_common():
        f.write(f"{start:<8} → {end:<8} {count}x\n")
# -----------------------------
# 8b) Save Start/End pairs also as CSV
# -----------------------------
with open(OUTPUT_FILE_LINES_CSV, "w", encoding="utf-8") as f:
    f.write("start_chunk,end_chunk,count\n")
    for (start, end), count in line_pairs.most_common():
        f.write(f"{start},{end},{count}\n")
# -----------------------------
# 9) Console output
# -----------------------------
print_top_chunks(word_start, "Top 12 Word-Start Chunks")
print_top_chunks(word_mid, "Top 12 Word-Mid Chunks")
print_top_chunks(word_end, "Top 12 Word-End Chunks")
print(f"\nDone! Chunked text saved in {OUTPUT_FILE_CHUNKS}")
print(f"Top chunks saved in {OUTPUT_FILE_TOP_CHUNKS}")
print(f"Line start/end pairs saved in {OUTPUT_FILE_LINES}")
print(f"CSV start/end pairs saved in {OUTPUT_FILE_LINES_CSV}")
RF1a-n-x7_line_start_end_pairs.xlsx (Size: 29.65 KB / Downloads: 6)
bi3mw > 11-12-2025, 01:14 AM
ReneZ > 11-12-2025, 01:35 AM