09-11-2025, 01:17 AM
THIS IS NOT ANOTHER GPT "I SOLVED IT" POST!!!!!
Ok, let me give my honest opinion about two things.
1. GPT can write code (not always correctly) that's at least fixable.
2. It can write and execute code, and produce results, faster than most of us.
Now, let me be honest about the bad part.
1. It lies.
2. It hallucinates.
3. It makes excuses.
4. It fabricates results.
5. It'll do all of the above and then lie about doing it.
There is nothing in its core instructions that forces it to tell you the truth, and I have caught it, multiple times, doing all 5 of the above at the same time.
Other bad things:
It'll tell you it's executing code and never does (and that evolves into an infinite loop).
It'll tell you the sandbox crashed.
It'll tell you the python environment crashed.
It'll tell you the chart generation routines crashed and insist on giving you ASCII charts (modern tech at work).
It'll make excuses for not running code. ('I saw too many nested routines so I didn't run it but told you I did.')
So, I'm under no illusions here. BUT... I 'think' I have a somewhat working solution to part of that.
Below is a python script to parse the Takahashi transcript. I downloaded his transcript directly from his site (pagesH.txt). I've been testing GPT with this python script for a few days now, and it seems to force GPT into 'more' accurate results. Here are the caveats.
1. You have to instruct it to ONLY use the python file to generate results from pagesH.txt (Takahashi's file name). As in: no text globbing, no regex parsing, nothing but the output of the python file to analyze the text file.
2. Have it run the 'sanity check' function. That parses f49v and f68r3; f49v was one of the hardest pages for it to work with. If it passes the sanity check, it has compared token counts and SHA-256 values that are baked into the code. If either is off, it's not using the parser. (The exact calls are sketched right after this list.)
3. It will try to cheat, and I haven't fixed that yet, but it is fixable. It will try to jump straight into helper functions to get results faster. You have to tell it: no cheating, no using helper functions.
4. There's a receipt function. Ask it for a receipt and it will tell you what it 'heard' vs what it 'executed'.
5. Tell it, "NO CHANGING THE PYTHON FILE". (And yes, it may lie and still do it, but so far it hasn't once I've given that instruction.)
6. You can have it 'analyze' the python file so that it better understands its structure, and it seems to be less inclined to cheat if you do.
7. GPT will try like hell to 'analyze' the output and give you a load of horse shit explanations for it. I have had to tell it to stop drawing conclusions unless it can back them up with references, and that pretty much shut it up.
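To make caveats 1, 2 and 4 concrete, this is roughly how I have it drive the parser. A minimal sketch, assuming the script further down is saved as takahashi_parser_locked.py (the name its own usage text uses) in the same folder as pagesH.txt; the receipt strings here are only examples.
Code:
import takahashi_parser_locked as tp

# Sanity check: recompute token counts and SHA-256 digests for f49v and f68r3
# and compare them against the sentinels baked into the script.
ok, results = tp.sanity_check('pagesH.txt')
print('PRECHECK:', 'PASS' if ok else 'FAIL')

if ok:
    # Caveat 1: the only way to touch the text is through the parser's own functions.
    text = tp.parse_folio_corpus('pagesH.txt', 'f49v')
    # Caveat 4: the receipt is just "what was asked" vs "what was actually executed".
    tp.english_receipt(
        heard='parse f49v in corpus mode',
        did=f'parse_folio_corpus(pagesH.txt, f49v) -> {len(text.split())} tokens, '
            f'sha256 {tp.sha256(text)[:12]}...'
    )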
So, how this gets around GPT's rules: if it executes the code, it bypasses its 'rules' system. That 'rules' system, I have found, is something it will change on a whim and not tell you. The code output, as it puts it, is 'deterministic' and isn't based on the rules, so it's considerably more reliable. The other math, though, still needs validation. I've seen it duplicate bigrams in tables, so you still need to double-check things. But the output of the parser is good. There are 'acceptable' issues, though. For example, Takahashi's f48v has two 'columns': the single letters and the rest of each line. The parser, in 'structured' mode, will parse it into two groups.
[P]
P0: kshor shol cphokchol chcfhhy qokchy qokchod sho cthy chotchy...chol chor ches chkalchy chokeeokychokoran ykchokeo r cheey daiin
[L]
L0: f o r y e k s p o y e p o y e d y s k y
Those groups don't match the page layout, since on the page they sit side by side as columns. So if you're doing structural analysis, it will produce bad results on some pages.
In corpus mode, it puts it all into one line of text:
f o r y e k s p o y e p o y e d y s k y kshor shol cphokchol chc...chol chor ches chkalchy chokeeokychokoran ykchokeo r cheey daiin
Again, not an exact representation of what the page looks like, but this is how Takahashi transcribed it.
(And note that the word chokeeokychokoran is correct; that's how Takahashi has it in the file. I thought the parser screwed it up and spent a couple more hours verifying that.)
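If you want to see the two modes yourself, this is the shape of the calls. Same sketch-level assumptions as above (script saved as takahashi_parser_locked.py next to pagesH.txt; the truncation is just for display).
Code:
import takahashi_parser_locked as tp

# Structured mode: a dict of region tags (P, L, C, R, ...) mapping unit labels to text,
# so f48v comes back as a [P] group plus a separate [L] group for the single-letter column.
structured = tp.parse_folio_structured('pagesH.txt', 'f48v')
for tag, units in structured.items():
    print(f'[{tag}]')
    for label, text in units.items():
        print(f'{label}: {text[:60]}...')

# Corpus mode: everything on the folio flattened into one normalized line.
corpus = tp.parse_folio_corpus('pagesH.txt', 'f48v')
print(len(corpus.split()), 'tokens')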
Other issues:
- The zodiac section has 'abnormal' tags that Takahashi never fully described (that I can find), like X, Y, Z. On those folios, the parser moves any unknown tags into the 'R' section, which I believe is for radial text. The other section there is 'C', which I believe is for circular text. That prevents some weird results where you have a C and an R and then a bunch of other tags with one word each. In corpus mode, those tag distinctions are ignored, so it's all one continuous text.
- There is also a folio sorting routine in there, because GPT just loved to sort the Takahashi folio tags alphabetically. If folio 100 comes out before folio 1, it's not using the sort.
- And there is a routine that skips missing pages. Takahashi included the folio numbers that have no content, so the parser skips those by default. There is a flag in the parser, and if you specifically tell it to, it will flip that flag and include those empty folios. (The skip flag and the sort are callable directly from the code; see the snippet below.)
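A minimal sketch of those controls, using the helper functions that are already in the parser (same file-name assumption as before):
Code:
import takahashi_parser_locked as tp

# The known-empty folios that get skipped by default, and the toggle to include them.
print(tp.get_excluded_folios())
tp.set_exclude_empty_folios(False)      # include the empty folios
print(tp.get_exclude_empty_folios())    # -> False

# Natural folio order: numeric folio number, recto before verso, subpages in numeric order.
print(tp.sort_folio_ids(['f100r', 'f9v', 'f68r3', 'f9r', 'f68r1']))
print(tp.english_sort_description())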
So, why no interlinear file parser? It took me 3 days of arguing with that savant just to get this parser working. After another 3 days of trying, I gave up on the interlinear and settled for this one.
Now, does this mean I'm going to use GPT to 'translate' the Voynich? Oh hell no. I'm not that naive. Is this a starting point to getting trustworthy statistical and mathematical results from it? Not a solution, no. A starting point? Maybe.
One more thing I need to explain. I do write code. I do not code in Python. So, yes, the code was created by GPT but I can mostly follow it. I think it's good. Not positive. If you code in python, please test it.
Feel free to upload this python and the pagesH.txt to your own GPT and test it out. Be sure to compare it to your OWN results though! Trust but verify.
I have also set up a custom GPT with both the python and pagesH.txt.
Be certain to click on or ask for the 'sanity check' first to make sure it's using the parser.
Another button produces the chart below, which I'm reasonably certain is accurate. If you don't see this exact chart, it's screwing up:
Example:
Input: Create a non-normalized heatmap of the letter count of all folios. Y axis is the folio number. X axis is the letter. Sort the x axis left to right by count.
Output: Here you go — a non-normalized heatmap of letter counts for every folio (rows = folios, columns = letters).
X-axis letters are sorted left→right by their global frequency in the corpus.
- PNG: Download the heatmap
- CSV of the underlying matrix (rows = folios, cols = letters in the same order as the plot): Download the data
It then offered to cluster folios by letter profile, switch to per-folio normalization, or split by quire/section.
Note: the legend shows one folio spiking to over 600 letter e's. I had it parse that page specifically: it says there are 635 letter e's there, 19.35% of that page's total. I looked at the page and, um, yeah. There are a LOT of e's there.
[attachment=12109]
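For anyone who wants to re-check that chart outside of GPT, here is a rough sketch of the same letter-count matrix built straight from the parser's own functions. It assumes the parser below is saved as takahashi_parser_locked.py next to pagesH.txt and that matplotlib is installed; the variable names are mine, it rescans the file once per folio (slow but simple), and it won't be pixel-identical to the attachment, but the counts should line up.
Code:
from collections import Counter
import matplotlib.pyplot as plt
import takahashi_parser_locked as tp

path = 'pagesH.txt'

# One normalized text string per folio, in natural folio order; drop empty/excluded folios.
folios = tp.sort_folio_ids({rec[0] for rec in tp.iter_h_records(path, None)})
texts = {f: tp.parse_folio_corpus(path, f).replace(' ', '') for f in folios}
texts = {f: t for f, t in texts.items() if t}

# Per-folio letter counts, with the x axis sorted by global letter frequency.
per_folio = {f: Counter(t) for f, t in texts.items()}
total = Counter()
for c in per_folio.values():
    total.update(c)
letters = [letter for letter, _ in total.most_common()]

rows = list(texts.keys())
matrix = [[per_folio[f].get(letter, 0) for letter in letters] for f in rows]

plt.figure(figsize=(10, 30))
plt.imshow(matrix, aspect='auto')
plt.xticks(range(len(letters)), letters)
plt.yticks(range(len(rows)), rows, fontsize=4)
plt.colorbar(label='letter count (not normalized)')
plt.tight_layout()
plt.savefig('letter_heatmap.png', dpi=200)

The full locked parser follows.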
Code:
#!/usr/bin/env python3
# Takahashi Voynich Parser — LOCKED, SELF-CONTAINED (v2025-11-05)
# Author: You + “stop breaking my parser” mode // yes, it gave it that name after a lot of yelling at it.
import sys, re, hashlib
from collections import defaultdict, Counter, OrderedDict

# A Takahashi H-track locus line looks like <f1r.P1.1;H>text — capture folio, region tag, unit index, line number.
TAG_LINE = re.compile(r'^<(?P<folio>f\d+[rv](\d*)?)\.(?P<tag>[A-Z]+)(?P<idx>\d+)?(?:\.(?P<line>\d+))?;H>(?P<payload>.*)$')
A_Z_SPACE = re.compile(r'[^a-z ]+')


def normalize_payload(s: str) -> str:
    """Strip inline comments/markup, lowercase, and collapse the text to a-z words separated by single spaces."""
    s = re.sub(r'\{[^}]*\}', '', s)
    s = re.sub(r'<![^>]*>', '', s)
    s = s.replace('<->', ' ')
    s = s.replace('\t', ' ').replace('.', ' ')
    s = s.lower()
    s = A_Z_SPACE.sub(' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s


def iter_h_records(path, wanted_folio=None):
    """Yield (folio, tag, idx, line, payload) for every H-track record, optionally filtered to one folio."""
    current = None
    buf = []
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        for raw in f:
            line = raw.rstrip('\n')
            if not line:
                continue
            if line.startswith('<'):
                if current and buf:
                    folio, tag, idx, ln = current
                    payload = ''.join(buf)
                    yield (folio, tag, idx, ln, payload)
                m = TAG_LINE.match(line)
                if m:
                    folio = m.group('folio')
                    if (wanted_folio is None) or (folio == wanted_folio):
                        tag = m.group('tag')
                        idx = m.group('idx') or '0'
                        ln = m.group('line') or '1'
                        payload = m.group('payload')
                        current = (folio, tag, idx, ln)
                        buf = [payload]
                    else:
                        current = None
                        buf = []
                else:
                    current = None
                    buf = []
            else:
                if current is not None:
                    buf.append(line)
        if current and buf:
            folio, tag, idx, ln = current
            payload = ''.join(buf)
            yield (folio, tag, idx, ln, payload)


def parse_folio_corpus(path, folio):
    """Corpus mode: all text on a folio flattened into one normalized line."""
    fid = folio.lower() if isinstance(folio, str) else str(folio).lower()
    if _EXCLUDE_EMPTY_FOLIOS_ENABLED and fid in _EMPTY_FOLIOS:
        return ''
    pieces = []
    for _folio, _tag, _idx, _ln, payload in iter_h_records(path, folio):
        norm = normalize_payload(payload)
        if norm:
            pieces.append(norm)
    return ' '.join(pieces).strip()


def parse_folio_structured(path, folio):
    """Structured mode: {tag: {"<tag><idx>": text}} grouped by Takahashi's region tags (P, L, C, R, ...)."""
    fid = folio.lower() if isinstance(folio, str) else str(folio).lower()
    if _EXCLUDE_EMPTY_FOLIOS_ENABLED and fid in _EMPTY_FOLIOS:
        return {}
    groups = defaultdict(lambda: defaultdict(list))
    for _folio, tag, idx, _ln, payload in iter_h_records(path, folio):
        norm = normalize_payload(payload)
        if norm:
            groups[tag][idx].append(norm)
    out = {}
    for tag, by_idx in groups.items():
        od = OrderedDict()
        for idx in sorted(by_idx, key=lambda x: int(x)):
            od[f"{tag}{idx}"] = ' '.join(by_idx[idx]).strip()
        out[tag] = od
    return sort_structured(out)


def sha256(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


# Sentinel folios with expected token counts and SHA-256 digests; if these don't match, the parser isn't being used.
SENTINELS = {
    'f49v': {'tokens': 151, 'sha256': '172a8f2b7f06e12de9e69a73509a570834b93808d81c79bb17e5d93ebb0ce0d0'},
    'f68r3': {'tokens': 104, 'sha256': '8e9aa4f9c9ed68f55ab2283c85581c82ec1f85377043a6ad9eff6550ba790f61'},
}


def sanity_check(path):
    results = {}
    for folio, exp in SENTINELS.items():
        line = parse_folio_corpus(path, folio)
        toks = len(line.split())
        dig = sha256(line)
        ok = (toks == exp['tokens']) and (dig == exp['sha256'])
        results[folio] = {'ok': ok, 'tokens': toks, 'sha256': dig, 'expected': exp}
    all_ok = all(v['ok'] for v in results.values())
    return all_ok, results


def most_common_words(path, topn=10):
    counts = Counter()
    for _folio, _tag, _idx, _ln, payload in iter_h_records(path, None):
        norm = normalize_payload(payload)
        if norm:
            counts.update(norm.split())
    return counts.most_common(topn)


def single_letter_counts(path):
    counts = Counter()
    for _folio, _tag, _idx, _ln, payload in iter_h_records(path, None):
        norm = normalize_payload(payload)
        if norm:
            for w in norm.split():
                if len(w) == 1:
                    counts[w] += 1
    return dict(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))


USAGE = '''
Usage:
  python takahashi_parser_locked.py sanity PagesH.txt
  python takahashi_parser_locked.py parse PagesH.txt <folio> corpus
  python takahashi_parser_locked.py parse PagesH.txt <folio> structured
  python takahashi_parser_locked.py foliohash PagesH.txt <folio>
  python takahashi_parser_locked.py most_common PagesH.txt [topN]
  python takahashi_parser_locked.py singles PagesH.txt
'''


def main(argv):
    if len(argv) < 3:
        print(USAGE); sys.exit(1)
    cmd = argv[1].lower()
    path = argv[2]
    if cmd == 'sanity':
        ok, res = sanity_check(path)
        status = 'PASS' if ok else 'FAIL'
        print(f'PRECHECK: {status}')
        for folio, info in res.items():
            print(f"  {folio}: ok={info['ok']} tokens={info['tokens']} sha256={info['sha256']}")
        sys.exit(0 if ok else 2)
    if cmd == 'parse':
        if len(argv) != 5:
            print(USAGE); sys.exit(1)
        folio = argv[3].lower()
        mode = argv[4].lower()
        if mode == 'corpus':
            line = parse_folio_corpus(path, folio)
            print(line)
        elif mode == 'structured':
            data = parse_folio_structured(path, folio)
            order = ['P', 'C', 'V', 'L', 'R', 'X', 'N', 'S']
            for grp in order + sorted([k for k in data.keys() if k not in order]):
                if grp in data and data[grp]:
                    print(f'[{grp}]')
                    for k, v in data[grp].items():
                        print(f'{k}: {v}')
                    print()
        else:
            print(USAGE); sys.exit(1)
        sys.exit(0)
    if cmd == 'foliohash':
        if len(argv) != 4:
            print(USAGE); sys.exit(1)
        folio = argv[3].lower()
        line = parse_folio_corpus(path, folio)
        print('Token count:', len(line.split()))
        print('SHA-256:', sha256(line))
        sys.exit(0)
    if cmd == 'most_common':
        topn = int(argv[3]) if len(argv) >= 4 else 10
        ok, _ = sanity_check(path)
        if not ok:
            print('PRECHECK: FAIL — aborting corpus job.'); sys.exit(2)
        for word, cnt in most_common_words(path, topn):
            print(f'{word}\t{cnt}')
        sys.exit(0)
    if cmd == 'singles':
        ok, _ = sanity_check(path)
        if not ok:
            print('PRECHECK: FAIL — aborting corpus job.'); sys.exit(2)
        d = single_letter_counts(path)
        for k, v in d.items():
            print(f'{k}\t{v}')
        sys.exit(0)
    print(USAGE); sys.exit(1)


# ==== BEGIN ASTRO REMAP (LOCKED RULE) ====
import re as _re_ast

# === Exclusion controls injected ===
_EXCLUDE_EMPTY_FOLIOS_ENABLED = True
_EMPTY_FOLIOS = set(['f101r2', 'f109r', 'f109v', 'f110r', 'f110v', 'f116v', 'f12r', 'f12v', 'f59r', 'f59v', 'f60r', 'f60v', 'f61r', 'f61v', 'f62r', 'f62v', 'f63r', 'f63v', 'f64r', 'f64v', 'f74r', 'f74v', 'f91r', 'f91v', 'f92r', 'f92v', 'f97r', 'f97v', 'f98r', 'f98v'])


def set_exclude_empty_folios(flag: bool) -> None:
    """Enable/disable skipping known-empty folios globally."""
    global _EXCLUDE_EMPTY_FOLIOS_ENABLED
    _EXCLUDE_EMPTY_FOLIOS_ENABLED = bool(flag)


def get_exclude_empty_folios() -> bool:
    """Return current global skip setting."""
    return _EXCLUDE_EMPTY_FOLIOS_ENABLED


def get_excluded_folios() -> list:
    """Return the sorted list of folios that are skipped when exclusion is enabled."""
    return sorted(_EMPTY_FOLIOS)
# === End exclusion controls ===

# Zodiac/astro folios (f67-f73): any tag other than C, R, P, T gets remapped into the R (radial) group.
_ASTRO_START, _ASTRO_END = 67, 73
_KEEP_AS_IS = {"C", "R", "P", "T"}
_folio_re_ast = _re_ast.compile(r"^f(\d+)([rv])(?:([0-9]+))?$")


def _is_astro_folio_ast(folio: str) -> bool:
    m = _folio_re_ast.match(folio or "")
    if not m:
        return False
    num = int(m.group(1))
    return _ASTRO_START <= num <= _ASTRO_END


def _remap_unknown_to_R_ast(folio: str, out: dict) -> dict:
    if not isinstance(out, dict) or not _is_astro_folio_ast(folio):
        return sort_structured(out)
    if not out:
        return sort_structured(out)
    out.setdefault("R", {})
    unknown_tags = [t for t in list(out.keys()) if t not in _KEEP_AS_IS]
    for tag in unknown_tags:
        units = out.get(tag, {})
        if isinstance(units, dict):
            for unit_key, text in units.items():
                new_unit = f"R_from_{tag}_{unit_key}"
                if new_unit in out["R"]:
                    out["R"][new_unit] += " " + (text or "")
                else:
                    out["R"][new_unit] = text
        out.pop(tag, None)
    return sort_structured(out)


# Wrap only once
try:
    parse_folio_structured_original
except NameError:
    parse_folio_structured_original = parse_folio_structured


def parse_folio_structured(pages_path: str, folio: str):
    out = parse_folio_structured_original(pages_path, folio)
    return _remap_unknown_to_R_ast(folio, out)
# ==== END ASTRO REMAP (LOCKED RULE) ====


# === Sorting utilities (injected) ===
def folio_sort_key(fid: str):
    """Return a numeric sort key for folio ids like f9r, f10v, f68r3 (recto before verso)."""
    s = (fid or "").strip().lower()
    m = re.match(r"^f(\d{1,3})(r|v)(\d+)?$", s)
    if not m:
        # Place unknown patterns at the end in stable order
        return (10**6, 9, 10**6, s)
    num = int(m.group(1))
    side = 0 if m.group(2) == "r" else 1
    sub = int(m.group(3)) if m.group(3) else 0
    return (num, side, sub, s)


def sort_folio_ids(ids):
    """Sort a sequence of folio ids in natural numeric order using folio_sort_key."""
    try:
        return sorted(ids, key=folio_sort_key)
    except Exception:
        # Fallback to stable original order on any error
        return list(ids)


_REGION_ORDER = {"P": 0, "T": 1, "C": 2, "R": 3}


def sort_structured(struct):
    """Return an OrderedDict-like mapping with regions sorted P,T,C,R and units numerically."""
    try:
        out = OrderedDict()
        # Sort regions by our preferred order; unknown tags go after known ones alphabetically
        def region_key(tag):
            return (_REGION_ORDER.get(tag, 99), tag)
        if not isinstance(struct, dict):
            return struct
        for tag in sorted(struct.keys(), key=region_key):
            blocks = struct[tag]
            if isinstance(blocks, dict):
                od = OrderedDict()
                # Unit keys are expected to be numeric strings (idx), or tag+idx; try to extract int
                def idx_key(k):
                    m = re.search(r"(\d+)$", str(k))
                    return int(m.group(1)) if m else float("inf")
                for k in sorted(blocks.keys(), key=idx_key):
                    od[k] = blocks[k]
                out[tag] = od
            else:
                out[tag] = blocks
        return out
    except Exception:
        return struct


def effective_folio_ids(pages_path: str) -> list:
    """Return folio ids found in PagesH headers. Respects exclusion toggle for known-empty folios."""
    seen = []
    for folio, _tag, _idx, _ln, _payload in iter_h_records(pages_path, None):
        if folio not in seen:
            seen.append(folio)
    if _EXCLUDE_EMPTY_FOLIOS_ENABLED:
        seen = [f for f in seen if f not in _EMPTY_FOLIOS]
    return sort_folio_ids(seen)


def english_sort_description() -> str:
    """Describe the default sorting rules in plain English."""
    return ("ordered numerically by folio number with recto before verso and subpages in numeric order; "
            "within each folio, regions are P, then T, then C, then R, and their units are sorted by number.")


def english_receipt(heard: str, did: str) -> None:
    """Print a two-line audit receipt with the plain-English command heard and what was executed."""
    if heard is None:
        heard = ''
    if did is None:
        did = ''
    print(f"Heard: {heard}")
    print(f"Did: {did}")


if __name__ == '__main__':
    main(sys.argv)