Here are some advances in the comparison between the Starred Parags (SPS) section and the Shennong Bencao Jing (SBJ). Recall that the files are:
- [link] The Starred Paragraphs section (SPS) from Takeshi's transcription in the 1.6e6 interlinear file, from page [link] to line 30 of f116r. With one parag per line, in the EVA encoding, with all alignment fillers and comments removed, all weirdos and missing chars mapped to '*', one "=" at start and end of each line (= parag).
- [link] The SBJ from the webpage posted by @oshfdk, minus the introduction 《上卷》 and section headers (see below), converted to pinyin by Google Translate, mapped to lowercase.
Both files are in UTF-8 encoding. Again, if you just click on those links you will see gibberish, because the server at my Univ expects plain text files to be in ISO-Latin-1 and thus messes up the formatted HTML that it sends to your browser. You will have to download the files and look at them with any text editor or viewer that understands UTF-8.
While analyzing the number of words per paragraph in the SBJ file ("bencao.pin") posted earlier, I noticed that there were several parags with only 3--5 Chinese words. It turns out that those are subsection headers. Here they are. The locus 1.X.YYY means that it is subsection X of section 1 《中卷》 starting at line YYY. The notation 2.X.YYY is analogous but for section 2 《下卷》.
Code:
1.1.001 玉石部上品 yùshí bù shàngpǐn Top grade jade
1.2.019 玉石部中品 yùshí bù zhōng pǐn Jade department middle grade
1.3.033 玉石部下品 yùshí bùxià pǐn jade subordinate product
1.4.044 草部上品 cǎo bù shàngpǐn Top grade grass
1.5.102 草部中品 cǎo bù zhōng pǐn Kusanabe middle grade
1.6.162 草部下品 cǎo bùxià pǐn The lowest grade of grass
1.7.219 木部上品 mù bù shàngpǐn Top grade wood
1.8.234 木部中品 mù bù zhōng pǐn Kibe middle grade
1.9.253 木部下品 mù bùxià pǐn Kibe inferior grade
2.1.001 蟲獸部上品 chóng shòu bù shàngpǐn Top quality insects and beasts
2.2.017 蟲獸部中品 chóng shòu bù zhōng pǐn Insect and animal department medium quality
2.3.042 蟲獸部下品 chóng shòu bùxià pǐn Insect Beast Subordinates
2.4.069 果菜部上品 guǒcài bù shàngpǐn Top quality fruits and vegetables department
2.5.080 果菜部中品 guǒcài bù zhōng pǐn Medium range of fruits and vegetables
2.6.087 果菜部下品 guǒcài bùxià pǐn Fruit and vegetable products
2.7.091 米穀部上品 mǐgǔ bù shàngpǐn Top grade rice cereals
2.8.094 米穀部中品 mǐgǔ bù zhōng pǐn Mid-grade rice
2.9.098 米穀部下品 mǐgǔ bùxià pǐn The inferior product of Rice Valley
The pinyin readings and translations are from Google Translate. I left them unedited for the lulz.
After commenting those header lines out, the shortest remaining entry seemed to be normal:
Code:
2.3.044 鼯鼠 主墮胎,令易產。 wú shǔ zhǔ duòtāi, lìng yì chǎn
Flying squirrel: causes abortion and makes childbirth easier.
(If that can be called "normal"...)
And then I noticed that the Starred Parags file ("starps.eva") too had a few anomalously short parags of ~4 Voynichese words. Those were so-called "titles", short lines with anomalous justification:
Code:
<f105r.T1.9a> =sairy.ore.daiindy.ytam=
<f105r.T2.36> =otoiis.chedaiin.otair.otaly=
<f108v.T1.52> =olchar.olchedy.lshy.otedy=
<f114r.T1.34> =ytain.olkaiin.ykar.chdar.alkam=
The title <f114r.T1.34> is a right-justified line after a parag that ends with a full line. It had been assumed to be the last line of the previous parag, which the Scribe skipped and then inserted in that non-standard position. However, the first line of the next parag <f114r.P1.35> bends down to avoid that title. Thus, if that conjecture is true, the Scribe must have realized the omission after writing the first 4 lines of <f114r.P1.35>. I have now re-interpreted <f114r.T1.34> as a title.
It is possible that other section headers were not recognized as such and were joined with adjacent parags.
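The check for anomalously short parags can be sketched roughly as below. This is only an illustration, not the script I actually used: the names `count_words` and `short_parags` are made up for this sketch, and it assumes one parag per line, words separated by "." or blanks, an optional `<locus>` prefix, and "=" delimiters as in the samples above.

```python
import re

def count_words(line):
    """Count the words in one parag line (format assumed as described above)."""
    # Drop an optional "<locus>" prefix and the "=" parag delimiters.
    text = re.sub(r"^\s*<[^>]*>\s*", "", line).strip().strip("=")
    # Words are assumed to be separated by "." (EVA) or blanks (pinyin).
    return len([w for w in re.split(r"[.\s]+", text) if w])

def short_parags(lines, max_words=5):
    """Return (line_number, word_count) for parags with at most max_words words."""
    hits = []
    for num, line in enumerate(lines, start=1):
        n = count_words(line)
        if 0 < n <= max_words:  # skip empty lines
            hits.append((num, n))
    return hits
```

The threshold of 5 words is a guess based on the titles and headers listed above; any hit would still need to be checked by eye against the page images or the Chinese text.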
After commenting out the subsection titles on both files, I counted again the number of words and parags, and basic statistics (min, max, average, and standard deviation) of the number of words per paragraph (nwp):
Code:
statistic  | bencao | starps
-----------+--------+-------
parags     |    354 |    330
words      |  10874 |  10457
min nwp    |      7 |     11
max nwp    |     76 |     72
avg nwp    |   30.8 |   31.7
dev nwp    |    8.5 |   11.2
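For anyone who wants to redo these counts, here is a minimal sketch of the computation. It is not my actual script: the names `word_counts` and `nwp_stats` are invented here, the "#" comment marker is an assumption, and so is the use of the population (rather than sample) standard deviation.

```python
import re
import statistics

def word_counts(lines):
    """Word count (nwp) of every non-empty, non-comment parag line."""
    counts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # "#" comment marker is an assumption
            continue
        # Drop an optional "<locus>" prefix and the "=" parag delimiters.
        text = re.sub(r"^<[^>]*>\s*", "", line).strip("=")
        counts.append(len([w for w in re.split(r"[.\s]+", text) if w]))
    return counts

def nwp_stats(lines):
    """Basic statistics of the number of words per parag."""
    counts = word_counts(lines)
    return {
        "parags": len(counts),
        "words": sum(counts),
        "min nwp": min(counts),
        "max nwp": max(counts),
        "avg nwp": statistics.mean(counts),
        "dev nwp": statistics.pstdev(counts),  # population deviation (assumed)
    }
```

Calling `nwp_stats(open("starps.eva", encoding="utf-8"))` should then give one column of the table, provided the file really follows the assumed layout.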
Here is the histogram of the word counts (nwp):
[attachment=10830]
At first sight the histograms are different, but there are some intriguing similarities. Note that both files have 23 entries with 27 words (the most common entry length in both files), six entries with 23 words, 8 entries with 37 words, 2 entries with 47 words, one entry with 53 words, one entry with 59 words, and one entry with 62 words. In both files, there are anomalously few entries with 23, 37, and 43 words.
Considering the missing bifolio in the SPS quire, we have 6 surprising near coincidences: number of entries, and the mode, min, max, average, and deviation of the number of words per paragraph. (The total number of words is not an extra coincidence since it is the average nwp times the number of entries.)
Compared to the SBJ, the SPS has a somewhat broader nwp histogram, as implied by the standard deviation. It has more entries with 10-20 words and 35-70 words, and fewer with 21-34 words. In particular, the SBJ has a second mode: 23 parags of 34 words, whereas the SPS has only 11.
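The bin-by-bin comparison of the two histograms could be done along these lines. Again this is just a sketch with made-up helper names (`equal_bins`, `mode`); it takes two plain lists of nwp values, one per file, however they were obtained.

```python
from collections import Counter

def equal_bins(counts_a, counts_b):
    """Return {nwp: entries} for the lengths that have the same count in both files."""
    hist_a, hist_b = Counter(counts_a), Counter(counts_b)
    # A Counter returns 0 for missing keys, so bins absent from b never match.
    return {n: hist_a[n] for n in sorted(hist_a) if hist_a[n] == hist_b[n]}

def mode(counts):
    """Most common nwp value (the first one encountered if there is a tie)."""
    return Counter(counts).most_common(1)[0][0]
```

Note that exact matches in individual bins are expected to happen by chance now and then when two histograms have similar size and shape, so a list of matching bins is suggestive at best.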
These discrepancies could be the result of some word spaces being incorrectly inserted or omitted in the SPS as it was digitized; somewhat at random, with almost the same probability.
Alternatively, some parag breaks in the SPS may be wrong, causing, for example, two consecutive parags that should have 22 and 32 words to become parags of 16 and 38 words; and two parags that should have 7 and 76 words to become parags with 13 and 70 words.
Both kinds of errors would have little effect on the average nwp, but would increase its standard deviation, as observed.
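The arithmetic of a misplaced parag break is easy to check. The sketch below uses a made-up helper `move_break` and the numbers of the first example above: the total of the pair (and hence the average) stays the same, while the spread of that pair grows.

```python
import statistics

def move_break(counts, i, k):
    """Move k words from parag i to parag i+1 (negative k moves them back)."""
    shifted = list(counts)
    shifted[i] -= k
    shifted[i + 1] += k
    return shifted

true_pair = [22, 32]
seen_pair = move_break(true_pair, 0, 6)  # becomes [16, 38]

# The total and the mean are unchanged; the deviation of this pair goes up.
print(sum(true_pair), sum(seen_pair))
print(statistics.pstdev(true_pair), statistics.pstdev(seen_pair))
```

Whether a single shifted break widens or narrows the spread depends on its direction relative to the mean, so the net effect over the whole file would have to be checked on the actual data.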
There is also the bonus coincidence of both files having originally subsection titles with ~4 words each, although the number of such titles is vastly different. More on that later.
Now for the bad news. As @oshfdk observed, there are hundreds of multiword sequences that occur many times in the SBJ. In particular, there is a 10-word phrase that occurs six times, on six consecutive lines:
Code:
久食輕身不老,延年神仙。一名
<s1.4.045> jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
<s1.4.046> jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
<s1.4.047> jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
<s1.4.048> jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
<s1.4.049> jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
<s1.4.050> jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
Eating it for a long time will make you light and immortal. It is also called
In contrast, the longest phrases that occur more than once in the SPS have only 3 words; and the most common occurs only three times:
Code:
<f103r.P1.52> chedy.qokeey.qokeey
<f108v.P1.44> chedy.qokeey.qokeey
<f112v.P1.15> chedy.qokeey.qokeey
I will discuss the implications of this difference in another post.
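In the meantime, a search for repeated multiword sequences like the ones above can be sketched as follows. The function names `ngram_counts` and `longest_repeated` are invented for this sketch, each parag is given as a list of words, and I am assuming that a phrase may not straddle a parag break.

```python
from collections import Counter

def ngram_counts(parags, n):
    """Count every n-word sequence, never crossing a parag boundary."""
    counts = Counter()
    for words in parags:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1  # overlapping occurrences count too
    return counts

def longest_repeated(parags):
    """Return (n, {phrase: count}) for the longest phrases occurring at least twice."""
    best_n, best = 0, {}
    n = 1
    while True:
        repeated = {g: c for g, c in ngram_counts(parags, n).items() if c >= 2}
        if not repeated:
            # If no n-word phrase repeats, no longer phrase can repeat either.
            return best_n, best
        best_n, best = n, repeated
        n += 1
```

Running `longest_repeated` on each file, with the parags split into words at the "." separators, should reproduce the kind of contrast described above if the files are as I described them.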