The Voynich Ninja

Full Version: Relations among pattern studies?
Following up on one facet of Voynich Manuscript Day:
(05-08-2024, 08:13 AM)Koen G Wrote: All research was well presented, for example Emma explained statistical concepts well.... But as the data was being presented in the text-focused talks, I felt myself losing the forest for the trees, wondering how things fit into the bigger picture....  I think it would be extremely valuable to the community if someone was able to write an "explain like I'm five" version of these talks, trying to focus on the bigger picture and how Emma's, tavie's and Patrick's findings relate to each other. This might be an assignment even the authors themselves struggle with, but it would be an invaluable exercise.
I'm sure I won't be able to do justice to this assignment on my own, but it's interesting enough that I didn't want to leave it unaddressed.

The main thing I think our three presentations had in common was the goal of searching for structural patterns outside words as earnestly as people have long been searching for structural patterns inside them.  We each investigated cases where the actual prevalence of a text element (glyph, word, etc.) turns out to be significantly greater or lesser in a particular context than it "should" be in a random distribution.
  • In tavie's presentation, we saw that there are many patterns specific to line starts, line ends, and top rows -- more patterns, and stronger ones, than past casual assessments have suggested.  Her detection of [link] felt especially new and exciting.
  • In Emma's presentation, we saw that there are many patterns in which features of words pair preferentially or dispreferentially with features of adjacent words -- not just word-break combinations (linking the end of one word to the start of the next word), but also pairs of successive beginnings and endings, or words that simply contain particular features anywhere within them.  Her introduction of z-scores brings valuable statistical rigor to this type of study.
  • In my presentation, I tried to show that treating glyph sequences within words and between words as parts of the same system, rather than analyzing these separately, lets us identify some interesting cyclical patterns that don't necessarily coincide with words as units but do fairly well at "predicting" longer repeating sequences.  Here are the slides and script in case anybody wants them:[attachment=8981][attachment=8982]
Emma and I were both examining features based on their positions relative to other features, and we even identified a few of the same patterns, although from different perspectives and using very different methods.  For example, we both identified [dy] as less likely than expected to be followed by [k]:
[attachment=8980]
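For anyone who'd like to tinker with this kind of adjacency statistic, here's a minimal sketch of how such a z-score could be computed -- a generic binomial approximation of my own, not necessarily Emma's exact formulation, and the toy word list is invented:
[code]
import math

def pair_z_score(words, feature_a, feature_b):
    """z-score for how often feature_a(words[i]) co-occurs with
    feature_b(words[i+1]), versus the count expected if the two
    features paired up independently."""
    pairs = list(zip(words, words[1:]))
    n = len(pairs)
    p_a = sum(feature_a(w1) for w1, _ in pairs) / n
    p_b = sum(feature_b(w2) for _, w2 in pairs) / n
    observed = sum(feature_a(w1) and feature_b(w2) for w1, w2 in pairs)
    expected = n * p_a * p_b
    sd = math.sqrt(n * p_a * p_b * (1 - p_a * p_b))  # binomial approximation
    return (observed - expected) / sd if sd else 0.0

# Toy example: does a word ending in [dy] resist a following word starting with [k]?
words = "shedy qokedy chedy kaiin dy qokeedy kar daiin".split()
print(pair_z_score(words, lambda w: w.endswith("dy"), lambda w: w.startswith("k")))
[/code]
A strongly negative score would flag a dispreferred pairing of the [dy] / [k] kind; a strongly positive one, a preferred pairing.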
As I mentioned in Q&A, I've also tried studying higher-order transitional probabilities, such as third-order [dyk>a], which is the same glyph sequence as Emma's [dy.ka], but again, approached from a different perspective and, in that case, broken up differently.  Even higher-order transitional probabilities, such as [ydaii>n] or [qokeedy>d], would similarly overlap with some of Emma's start-start and end-end pairings, especially when populated by wild-card characters as in [y***>n] or [q*****>d], except that they'd be defined by an intervening glyph count (with all the attendant uncertainty over what counts as one glyph) rather than by positions within words (with all the attendant uncertainty over word breaks).  I sense potential weaknesses in both approaches, but I'm not sure how to get around them.
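In case it's useful to anyone, here's a rough sketch of how such higher-order transitional probabilities with wild-card positions could be tallied.  It assumes the text has already been parsed into a list of glyphs (sidestepping exactly the glyph-count uncertainty just mentioned), and the function name and toy stream are invented for illustration:
[code]
from collections import Counter, defaultdict

def transition_probs(glyphs, order=1, wildcards=()):
    """Estimate P(next glyph | preceding context of `order` glyphs).

    `wildcards` lists 0-based context positions to mask, so order=4 with
    wildcards=(1, 2, 3) turns a context like [y, d, a, ii] into [y, *, *, *],
    roughly in the spirit of [y***>n]."""
    counts = defaultdict(Counter)
    for i in range(order, len(glyphs)):
        context = list(glyphs[i - order:i])
        for w in wildcards:
            context[w] = "*"
        counts[tuple(context)][glyphs[i]] += 1
    return {ctx: {g: c / sum(succ.values()) for g, c in succ.items()}
            for ctx, succ in counts.items()}

# Toy usage on a pre-parsed, space-free glyph list (one character per glyph here):
stream = list("qokeedyqokedychedyqokeedy")
print(transition_probs(stream, order=3).get(("d", "y", "q"), {}))
[/code]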

I assume there's got to be some connection between this type of glyph-by-glyph or word-by-word pattern and the other type of pattern described by tavie, centered on differences by line and paragraph position.  After all, for those two types of pattern both to be valid, they must overlap and complement each other, and I even showed a few examples of transitional probabilities that vary strongly by location within lines and paragraphs.  But how these two types of patterns interrelate with each other strikes me as still very much a mystery.  Do they just coexist?  Or do they both result from some other common factor?  Or does one level of patterning somehow cause the other level of patterning?

In Q&A, I briefly suggested that the cumulative effect of glyph-by-glyph or word-by-word patterns might be responsible for some kinds of line pattern.  If individual glyph-by-glyph or word-by-word patterns of preference are asymmetrical, tending towards certain combinations and away from others, that could perhaps account for some features being unevenly distributed within lines.  But I've never been able to demonstrate any such thing statistically, so right now it's no more than an idle guess on my part.  Another apparent tendency of certain glyphs to recur preferentially after an interval might account for the greater probability of [p] further along in lines that begin with [p], but it wouldn't offer any insight into the reasons for paragraph-initial [p] itself.  (In Emma's analysis, I believe that relationship would translate into a start-anywhere combination.)

Or it could be the other way around.  Different paragraph and line positions might cause glyph-by-glyph or word-by-word patterns to vary.

It seems there must be some relationship, but for now, I really don't know what it is, and I'm unsure how to go about trying to find out.
Hi Patrick,
 
here are some of my thoughts regarding your talk. In my eyes the properties you describe as loops are caused by the network of similar vords [see Timm & Schinner 2020, p. 4]. The general principle within this network of similar words is that "high-frequency tokens also tend to have high numbers of similar words. ... words (i.e. unconnected nodes in the graph) usually appear just once in the entire VMS while the most frequent token <daiin> (836 occurrences) has 36 counterparts with edit distance 1" [Timm & Schinner 2020, p. 6]. For this reason the most likely paths in your transitional probability matrix must result in frequently used words.

The cause for the network of similar words is "an existing deep correlation between frequency, similarity, and spatial vicinity of tokens within the VMS text" [Timm & Schinner 2020, p. 4]. Or in other words: "all pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear" [Timm & Schinner 2020, p. 3].
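As a concrete illustration of this network idea, the sketch below counts, for each word type in a toy token list, its counterparts at edit distance 1 -- plain Levenshtein distance here; Timm & Schinner's actual graph construction may differ in detail:
[code]
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def neighbour_counts(tokens):
    """For each word type, count the other types at edit distance 1 (O(V^2))."""
    types = sorted(Counter(tokens))
    return {t: sum(1 for u in types if u != t and edit_distance(t, u) == 1)
            for t in types}

tokens = "daiin dain daiin aiin saiin daiir daiin chol chor".split()
print(neighbour_counts(tokens))  # daiin's neighbours here: dain, aiin, saiin, daiir
[/code]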

One of your questions was why the repetition counts for [ol] are lower than the counts for [qokeedy] and for [chol]. Regarding this question it is interesting to look at folio 15v. On f15v not only does “oror” exist, there are also “oror or” and “or or oro r” immediately above each other on the first two lines. There are five instances of [oror], seven instances of [arar], as many as 15 instances of [olol], and two instances of [dydydy]. This means that for shorter vords like [or], [ol], and [dy] it is also possible to combine two instances of a word like [dy] into a new word like [dydy] or [dydydy].
Another observation is that for longer words most transitions result in similar words that are recognizable. For instance, the similarity between [qokedy], [qokeedy], and [okeedy] is still obvious. For shorter words the transition of a glyph automatically replaces a larger part of the word: if for instance [ol] transforms into [al], [or], or [kol], then 50 or 33 % of the glyphs are different. Therefore sequences like [or.ar.y.kar.ol.al] on f34r.P.15 or [tor.ol.dol.or] on f54r.P.10 are maybe less eye-catching compared to sequences like [qokeedy.qokeedy.chey.qokeedy.qokedy] on f108.P.37.

In my eyes an important observation about the Voynich text is its variation. Equally distributed glyphs or words don't exist in the VMS. (See also my [link] about the distribution of vords containing the sequences 'ed', 'ho', and 'in' in the VMS.)  See for instance the vord <qokeey>. It is the most common vord on [link] and the third most frequent vord in Quire 20 (see [link]). However <qokeey> is only frequently used on some of the pages in Q20. On three pages <qokeey> is even absent. You can choose any vord you like; the behavior is always the same: "No obvious rule can be deduced which words form the top-frequency tokens at a specific location, since a token dominating one page might be rare or missing on the next one" (Timm & Schinner 2019, p. 3). See also [link] @Github. Does that mean it is necessary to assume different default loops for different pages?

In the long run it is observable that if [chedy] is used more frequently in a section, "this also increases the frequency of similar words, like [shedy] or [qokeedy]. At the same time, also words using the prefix [qok-] are becoming more and more frequent, whereas words typical for Currier A like [chol] and [chor] vanish gradually" [Timm & Schinner 2020, p. 6].
This could mean that at least for each section of the text different transitional probabilities exist. Therefore it would probably be necessary to assume different default loops for each section of the text. For instance, since the group [eo] is more common in Pharma A than in Herbal A, it might be reasonable to assume [choldaiin] as the default loop for Herbal A and [cheoldaiin] as the default loop for Pharma A:
  • [choldaiin] in Herbal A -- see for instance [chotchol.daiin.cthol.doiin.daiin] on f30v.P.3
  • [cheoldaiin] in Pharma A -- see for instance [ycheol.cheol.doiir.shekcheor.sar.cheor] on f102v1.P1.6
  • [chdyaiin] in Herbal B -- see for instance [chdy.chdor.chtol.chdy] on f48v.P.10
  • [chedyqokaiin] in Bio B -- see for instance [qokaiin.shedy.chedy.qol.chedy.qokaiin.qokaiin.checkhy] on f77r.P.26
  • [qokeeyaiin] in Stars B -- see for instance [qokeey.okeoey.qokey.qokey.qokeey] on f107v.P.47
(08-08-2024, 09:53 PM)Torsten Wrote: In my eyes the properties you describe as loops are caused by the network of similar vords [see Timm & Schinner 2020, p. 4]. The general principle within this network of similar words is that "high-frequency tokens also tend to have high numbers of similar words. ... words (i.e. unconnected nodes in the graph) usually appear just once in the entire VMS while the most frequent token <daiin> (836 occurrences) has 36 counterparts with edit distance 1" [Timm & Schinner 2020, p. 6]. For this reason the most likely paths in your transitional probability matrix must result in frequently used words.

The cause for the network of similar words is "an existing deep correlation between frequency, similarity, and spatial vicinity of tokens within the VMS text" [Timm & Schinner 2020, p. 4]. Or in other words: "all pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear" [Timm & Schinner 2020, p. 3].

I definitely agree that there's a close connection between the patterns I was describing and the network you (and Schinner) have written about.  Still, I want to be careful when drawing conclusions about what's causing what.  You write here that the so-called "loop" properties are caused by the network (first paragraph), but then that the network is in turn caused by a "deep correlation between frequency, similarity, and spatial vicinity of tokens" (second paragraph).  Since I'd consider "deep correlation between frequency, similarity, and spatial vicinity of tokens" to be a reasonably good description of the system I was describing, those two paragraphs strike me as forming a little loop of their own.  So what came first, the chicken or the egg?

On the one hand, we have a network of similar words whose frequency correlates remarkably well with their degree of similarity to a few specific models.

On the other hand, we have a set of generative rules (involving glyph-by-glyph transitional probabilities) that would produce approximately that same set of words with approximately the same frequencies.  Other models (e.g., some "word paradigms") may be able to do the same.

If I understand things correctly, the self-citation hypothesis holds that the network exists due to the dynamics of copying words with minor changes, guided by the subjective / aesthetic preferences of the writer (and hence non-random).  Still, if we could identify a set of more concrete and specific rules, and had to account for its relationship with a network of words that just happened to follow those rules, I'd think Occam's razor would point to the rules causing the network rather than the network causing the rules.

From the perspective of the self-citation hypothesis, I suppose the rules (whether the ones I presented on, or any others) would then need to be interpreted as manifesting the subjective / aesthetic preferences of the writer, since they're not predictable in terms of a raw copying dynamic as such.  But even so, I'd presume any such rules would "really" have existed in the writer's mind to whatever extent grammatical rules exist in the mind of a speaker who's not consciously aware of them.

Of course the rules could take on other kinds of significance within other hypotheses.

(08-08-2024, 09:53 PM)Torsten Wrote: One of your questions was why the repetition counts for [ol] are lower than the counts for [qokeedy] and for [chol]. Regarding this question it is interesting to look at folio 15v. On f15v not only does “oror” exist, there are also “oror or” and “or or oro r” immediately above each other on the first two lines. There are five instances of [oror], seven instances of [arar], as many as 15 instances of [olol], and two instances of [dydydy]. This means that for shorter vords like [or], [ol], and [dy] it is also possible to combine two instances of a word like [dy] into a new word like [dydy] or [dydydy].
These are all impressive cases of repeating / looping.  And yes, if we ignore spacing, it looks like there are well over 100 tokens of the glyph sequence [olol], which is clearly more than the 32 tokens of the glyph sequence [cholchol], based in both cases on a very hasty count.  My observation was limited to repetitions of whole discrete words, which I'll admit is a distinction I don't make for the rest of my analysis.
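To make the two counting conventions explicit -- glyph sequences with spacing ignored versus repetitions of whole discrete words -- here's a quick sketch, with an invented sample line and '.' standing for a word break as in EVA-style transcriptions:
[code]
import re

def glyph_sequence_count(line, seq):
    """Occurrences of a glyph sequence with all word breaks ignored
    (overlapping matches included, via a lookahead)."""
    stream = line.replace(".", "")
    return len(re.findall(f"(?={re.escape(seq)})", stream))

def whole_word_repeat_count(line, word):
    """Adjacent repetitions of a whole discrete word, e.g. [ol.ol]."""
    tokens = line.split(".")
    return sum(a == word == b for a, b in zip(tokens, tokens[1:]))

line = "tor.ol.olol.or.ol.ol.dol"                 # made-up sample line
print(glyph_sequence_count(line, "olol"))          # 3: counts across breaks too
print(whole_word_repeat_count(line, "ol"))         # 1: only the discrete [ol.ol]
[/code]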
(08-08-2024, 09:53 PM)Torsten Wrote: Another observation is that for longer words most transitions result in similar words that are recognizable. For instance, the similarity between [qokedy], [qokeedy], and [okeedy] is still obvious. For shorter words the transition of a glyph automatically replaces a larger part of the word: if for instance [ol] transforms into [al], [or], or [kol], then 50 or 33 % of the glyphs are different. Therefore sequences like [or.ar.y.kar.ol.al] on f34r.P.15 or [tor.ol.dol.or] on f54r.P.10 are maybe less eye-catching compared to sequences like [qokeedy.qokeedy.chey.qokeedy.qokedy] on f108.P.37.

That's a good point, although [or.ar.y.kar.ol.al] also contains a lower proportion of "default" transitions than [qokeedy.qokeedy.chey.qokeedy.qokedy], which I'd take as indicating higher overall entropy.

(08-08-2024, 09:53 PM)Torsten Wrote: Does that mean it is necessary to assume different default loops for different pages?

To a point, yes.  Statistics for individual pages or bifolios of the VMS will often point to other default loops.  With a reduced dataset, two or more transitions will sometimes tie for top position, further complicating matters.  Like the network of related words with most of its "slots" filled in, default loops only emerge reliably from a larger-scale aggregate view.

For what it's worth, just three bifolios produce straightforward [qokeedy] loops all by themselves: 76+83, 104+115, and 108+111.  And only one bifolio produces a straightforward [choldaiin] loop by itself -- 2+7 -- although for 3+6, 28+29, and 51+54, a tie leaves [choldaiin] as one of the loops.  Other bifolio-specific loops include [chol], [daiin], [qokedy], [ol], [chedy], [oldaiin], and even [qokaiinchedy].

That said, the transitional probability matrices as a whole tend to be more stable than the loops are.  To "break" a loop, it's only necessary for one transition to overtake another somewhere, which -- depending on how "close" things are -- can just be a matter of statistical noise.  But the question of how much variation there is in the whole system of transitional probability matrices from section to section, or even page to page, would be well worth investigating, and probably more informative than a survey of the "default" loops.
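For anyone who wants to experiment with this kind of survey, here is one plausible way to extract a "default loop" from a transition table: follow the single most likely successor from a starting glyph until the walk closes on a cycle.  This is a schematic reconstruction rather than the exact procedure from the slides, and the toy stream pretends each glyph is a single character, which real EVA glyphs like [ch] aren't:
[code]
from collections import Counter, defaultdict

def most_likely_transitions(glyphs):
    """Map each glyph to its single most probable successor."""
    counts = defaultdict(Counter)
    for a, b in zip(glyphs, glyphs[1:]):
        counts[a][b] += 1
    return {a: succ.most_common(1)[0][0] for a, succ in counts.items()}

def default_loop(glyphs, start):
    """Follow the most likely transition from `start` until some glyph
    repeats; the cycle that closes is the 'default loop'."""
    step = most_likely_transitions(glyphs)
    path = [start]
    while path[-1] in step:
        nxt = step[path[-1]]
        if nxt in path:
            return path[path.index(nxt):]  # just the repeating cycle
        path.append(nxt)
    return path  # ran off the matrix without closing a loop

stream = list("choldain" * 3)
print(default_loop(stream, "c"))  # ['c', 'h', 'o', 'l', 'd', 'a', 'i', 'n']
[/code]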
(09-08-2024, 12:54 PM)pfeaster Wrote: I definitely agree that there's a close connection between the patterns I was describing and the network you (and Schinner) have written about.  Still, I want to be careful when drawing conclusions about what's causing what.  You write here that the so-called "loop" properties are caused by the network (first paragraph), but then that the network is in turn caused by a "deep correlation between frequency, similarity, and spatial vicinity of tokens" (second paragraph).  Since I'd consider "deep correlation between frequency, similarity, and spatial vicinity of tokens" to be a reasonably good description of the system I was describing, those two paragraphs strike me as forming a little loop of their own.  So what came first, the chicken or the egg?

Indeed, I argue here that the network describes the text, not that the network is causing the text: "all pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear. The exact cooccurrences may vary: there are pages where [daiin] is paired with [dain], but also pages where it is frequently used together with [aiin] (f41v, f46r, f55v, f89v2, f105v and f114r) or [saiin] (f2r, f16r, and f90r2)" [Timm & Schinner 2020, p. 3]. In other words, if for whatever reason you write an instance of [aiin] for every two instances of [daiin], this must result in an observable correlation.

However, my point is that the properties of glyph-by-glyph transitional probabilities mirror those of the network of similar words. Within this network, alongside [daiin], we also find words like [dain], [aiin], and [saiin]. Therefore, not only does [daiin] contribute to the transitional probability for the sequence [da], but all other words containing [da] do as well. Similarly, all tokens containing [ai] contribute to the transitional probability for a sequence where [a] is followed by [i]. Since the general principle in this network of similar words is that high-frequency tokens tend to have a greater number of similar words, it is unsurprising that the most likely paths in the transitional probability matrix result in frequently used words.

(09-08-2024, 12:54 PM)pfeaster Wrote: On the one hand, we have a network of similar words whose frequency correlates remarkably well with their degree of similarity to a few specific models.

On the other hand, we have a set of generative rules (involving glyph-by-glyph transitional probabilities) that would produce approximately that same set of words with approximately the same frequencies.  Other models (e.g., some "word paradigms") may be able to do the same.

To construct the glyph-by-glyph transitional probability tables, it is necessary to count the frequency with which certain glyphs follow one another. These tables are thus derived directly from the text. However, given that the Voynich text exhibits variation, it would be essential to use different transitional probabilities for different pages/sections of the manuscript. If one assumes that a device was used to generate the text, then multiple devices would be required to account for these variations.

If on the other hand someone is copying words with minor changes, they only need pattern recognition to identify words and a bit of creativity to modify them. No manual is required to explain how to use one's eyes to select a source word or how to employ creativity to alter it. Since the generated words will inherently share similarities, and because the process is recursive, this approach would naturally produce a text with statistical features resembling those found in the Voynich manuscript.
However, since words are mostly copied from the same page, the text would vary from page to page. On one page, the scribe might start with an instance of [daiin] and modify it into [dain], then copy [dain] into [daiir], and later change [daiir] back into [daiin]. On another page, the scribe might also start with [daiin] but modify it into [saiin], then change [daiin] into [dain], then use [saiin] again and modify it into [dain], and finally [dain] back into [daiin]. Even if the starting point is the same, the outcome would differ, resulting in variation from folio to folio. 
At the same time, the scribe might introduce new spelling variants. For example, he could decide to add [aiin] alongside [daiin]. This change would affect only the text generated after [aiin] was introduced, leading to observable developments in the manuscript. 
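As a deliberately crude sketch of this copy-and-modify dynamic -- uniform random choices standing in for the scribe's pattern recognition and creativity, so it only gestures at the recursion, not at the actual VMS statistics:
[code]
import random

def self_cite(seed_words, n_tokens, seed=0):
    """Generate tokens by repeatedly copying a recent word and applying
    one small edit (insert / delete / replace a single glyph)."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seed_words)))
    text = list(seed_words)
    for _ in range(n_tokens):
        source = rng.choice(text[-20:])            # favour spatially nearby words
        i = rng.randrange(len(source))
        op = rng.choice(["insert", "delete", "replace"])
        if op == "insert":
            word = source[:i] + rng.choice(alphabet) + source[i:]
        elif op == "delete" and len(source) > 1:
            word = source[:i] + source[i + 1:]
        else:                                      # replace (also the fallback)
            word = source[:i] + rng.choice(alphabet) + source[i + 1:]
        text.append(word)
    return text

print(" ".join(self_cite(["daiin"], 30)))
[/code]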
Decide for yourself whether the patterns observed in the Voynich text align with this description.
(09-08-2024, 03:36 PM)Torsten Wrote: To construct the glyph-by-glyph transitional probability tables, it is necessary to count the frequency with which certain glyphs follow one another. These tables are thus derived directly from the text. However, given that the Voynich text exhibits variation, it would be essential to use different transitional probabilities for different pages/sections of the manuscript. If one assumes that a device was used to generate the text, then multiple devices would be required to account for these variations.

Or a device that's responsive to input, or a device that builds cumulatively on its previous output.

I suspect the overall transitional probability matrices are about as stable across pages and sections as the broad distinction between Currier A and Currier B, which likewise doesn't account for the kind of local variation you're describing.

But I'll admit I don't have a good sense of the overall range of variation.  Could you list a few of the most deviant pages or bifolios that your work has shown to have the most locally distinctive word forms?  We could then take a look at the matrices for them and see how -- and how far -- they differ from the norm.
(09-08-2024, 12:54 PM)pfeaster Wrote: I definitely agree that there's a close connection between the patterns I was describing and the network you (and Schinner) have written about.  Still, I want to be careful when drawing conclusions about what's causing what.  You write here that the so-called "loop" properties are caused by the network (first paragraph), but then that the network is in turn caused by a "deep correlation between frequency, similarity, and spatial vicinity of tokens" (second paragraph).  Since I'd consider "deep correlation between frequency, similarity, and spatial vicinity of tokens" to be a reasonably good description of the system I was describing, those two paragraphs strike me as forming a little loop of their own.  So what came first, the chicken or the egg?

On the one hand, we have a network of similar words whose frequency correlates remarkably well with their degree of similarity to a few specific models.

On the other hand, we have a set of generative rules (involving glyph-by-glyph transitional probabilities) that would produce approximately that same set of words with approximately the same frequencies.  Other models (e.g., some "word paradigms") may be able to do the same.

If I understand things correctly, the self-citation hypothesis holds that the network exists due to the dynamics of copying words with minor changes, guided by the subjective / aesthetic preferences of the writer (and hence non-random).  Still, if we could identify a set of more concrete and specific rules, and had to account for its relationship with a network of words that just happened to follow those rules, I'd think Occam's razor would point to the rules causing the network rather than the network causing the rules.

I tend to agree with all of this, and realise that it is a quite condensed summary of a broad topic.

The network of similar words is real and it is of great interest. Languages tend not to work that way. Number systems work that way, but the Voynich network is far too complex to be 'just' a number system.

It may have several different causes or origins. I see three main possibilities, but there could certainly be more.

1. It just arose as a consequence of the authors generating words one at a time using the self-citation method
2. This is the vocabulary that was generated before the MS was written (using this vocabulary)
3. It is the consequence of some verbose method to convert plain text

Options 1 and 2 are similar yet opposite. The self-citation method could have been used to create a list of words before writing the text.

What may not be immediately obvious is that the network of similar words stands out even more for consistent avoidances than for the allowed modifications that create the pattern.

This concept of avoiding things was one of the points that stood out to me in the presentation by Tavie, in particular the avoidance of consecutive lines starting with the same character. Granted, human attempts to do something at random will not be truly random, but will develop certain preferred patterns; still, avoiding things has a different dimension to me.
This is more like the result of planning, or the indirect consequence of some rules.
(09-08-2024, 06:16 PM)pfeaster Wrote: Or a device that's responsive to input, or a device that builds cumulatively on its previous output.

Indeed, a simple device available in medieval times that's responsive to input (like page boundaries) and that builds on its previous output.

(09-08-2024, 06:16 PM)pfeaster Wrote: Could you list a few of the most deviant pages or bifolios that your work has shown to have the most locally distinctive word forms?  We could then take a look at the matrices for them and see how -- and how far -- they differ from the norm.

To illustrate my point I have used three different colors to mark all instances of vords containing the sequences 'ed' (plum), 'ho' (green), and 'in' (yellow): [link]

The pages are obviously not independent of one another, since pages colored in a similar way tend to be adjacent to one another in the manuscript. However, if we look into the details, the distribution of vords appears much more complicated.

There are at least two kinds of pages using herbal illustrations. There are herbal pages dominated by green + yellow (Currier A), but there are also herbal pages dominated by plum + yellow (Currier B). However, some herbal pages in Currier A also contain vords colored in plum (see [link]), and herbal pages in Currier B frequently contain vords colored in green. One page even contains a paragraph colored in green + yellow and another paragraph colored in plum + yellow (see [link]).

For the stars section it is even possible to point to pages dominated by vords colored in plum (see [link]) whereas the very next page is dominated by 'yellow' vords. Yet another page within the very same section contains an unusually high number of vords colored in green (see [link]).

Therefore I would suggest that you start by building a transition table for the stars section.

Note: The idea of using state transitions was already published by Donald Fisk back in 2017 (see [link]). For more details, see his [link] and the discussion linked [link].
Hi Patrick, I've just been rereading your slides, getting to grips with the ideas and details in them.

I have a couple of questions about spaces, and wonder if you have any thoughts?

  1. You mention the difference between Strong and Weak breakpoints, but I wonder if all breakpoints tend toward 100% in the right conditions? I mean, this should obviously be so, but I think that those conditions might be quite complex in some cases. Breakpoints aren't strictly predictable for some parts of the pattern, which is a striking contrast to the near 100% predictability of some transitions.
  2. Given this, are spaces part of the same system as the loops or something which interweaves with it?

I'm curious because sometimes the differences in token counts for words with small changes are used as evidence for the underlying system, but those counts are dependent on the presence of spaces. That is, [chol] and [daiin] and [choldaiin] are all different word types, but parts of the same loops.
(10-08-2024, 08:46 PM)Emma May Smith Wrote: I have a couple of questions about spaces, and wonder if you have any thoughts?


1. You mention the difference between Strong and Weak breakpoints, but I wonder if all breakpoints tend toward 100% in the right conditions? I mean, this should obviously be so, but I think that those conditions might be quite complex in some cases. Breakpoints aren't strictly predictable for some parts of the pattern, which is a striking contrast to the near 100% predictability of some transitions.

It might seem self-evident that for any extended sequence of glyphs, there should be one and only one correct way of breaking it into words, and that if we aren't yet able to predict the actual spacing in some situations, that's because the conditions governing those situations are more complex than we understand.

But I suspect some breakpoints may really be "weaker" -- more ambiguous or tentative -- than others.

One piece of evidence for this involves patterns among comma breaks in the ZL transcription -- that is, cases in which a well-informed and experienced transcriber wasn't sure whether a space was present or not, so that they're marked with a comma (,) instead of a period / full stop (.).

A while back, I ran some statistics and found that the glyph pairs that were often separated by commas (reflecting transcriber uncertainty) also tended to have a greater tendency to be written inconsistently -- that is, sometimes definitely with a break, but also sometimes definitely without a break, with neither option being clearly preferred.
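For anyone who'd like to replicate that comparison, the shape of the computation might be as follows, assuming a transcription already parsed into alternating glyphs and break marks ('.' definite space, ',' uncertain, '' no space) -- the parsing itself and the sample line are stand-ins:
[code]
from collections import Counter, defaultdict

def break_stats(parsed_lines):
    """Tally '.' (definite break), ',' (uncertain break) and '' (no break)
    for each glyph pair, from lines parsed into alternating glyphs and
    break marks, e.g. ['ch', '', 'o', '', 'l', ',', 'd', ...]."""
    stats = defaultdict(Counter)
    for line in parsed_lines:
        for i in range(0, len(line) - 2, 2):
            stats[(line[i], line[i + 2])][line[i + 1]] += 1
    return stats

def inconsistency(c):
    """0 when a pair's definite spellings always go one way, 0.5 when they
    are evenly split between spaced and unspaced."""
    definite = c["."] + c[""]
    return min(c["."], c[""]) / definite if definite else 0.0

def comma_rate(c):
    return c[","] / sum(c.values())
[/code]
One could then check whether comma_rate and inconsistency correlate across glyph pairs, which is roughly the comparison described above.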

I suppose it's possible that the transcriber may -- consciously or unconsciously -- have factored knowledge of which glyphs do and don't typically have spaces between them into decisions about which breaks to mark as ambiguous.  But whenever I've checked the transcription against a facsimile, the judgment call seems correct.  Cases marked (.) or without a break look definite; cases marked (,) look ambiguous.

Of course, we might suppose that the presence or absence of a space between two glyphs should depend on the particular word or words the glyphs belong to.  In English, after all, there's no space in another, but there is a space in an otter.  But in the VMS, the spaced and unspaced variants don't seem to cluster as we'd expect by analogy with that type of distinction.  The same pairs of words or word parts tend to occur written definitely together, definitely apart, and uncertainly.

So that's what led me to classify these as "weak" breakpoints.  My hypothesis is that these breakpoints are secondary and optional in a way that "strong" breakpoints aren't, and that they might reflect some kind of structural division that's convenient or helpful to reflect in writing, but not essential.

It's worth noting that some vernacular European writing traditions roughly contemporaneous with the VMS were likewise consistently inconsistent about certain spacing decisions -- I've seen French, German, and Italian examples.  And in at least some of those cases, I've encountered a (potentially) similar tendency towards ambiguous spacing, which is to say, a kind of tentative half-space used in situations where spacing is otherwise particularly inconsistent.  It can be frustrating when you're trying to transcribe an old legal document and can't resolve these ambiguities by falling back on how spacing is "usually" handled.  This often seems to involve definite articles and elements that can function both as prepositions and as prefixes.

(10-08-2024, 08:46 PM)Emma May Smith Wrote: 2. Given this, are spaces part of the same system as the loops or something which interweaves with it?


I'm curious because sometimes the differences in token counts for words with small changes are used as evidence for the underlying system, but those counts are dependent on the presence of spaces. That is, [chol] and [daiin] and [choldaiin] are all different word types, but parts of the same loops.

My hypothesis is that spaces are inserted into a continuous stream of glyphs to make it easier to parse, but that they don't add any information beyond that, much like the commas in the numeral 1,234,567.  But it's just a hypothesis, and it could certainly be tested more robustly than I've tried to do.  I have not, for example, experimented to see what would happen if a space were treated as another option in transitional probability matrices.  I also haven't tried to work out or model any "rules" for spacing that extend beyond the single glyphs to either side of each potential breakpoint.  There's a lot more that could be investigated here.
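For concreteness, treating the space as just another option in the matrix might look like the sketch below -- untested as far as this thread goes, with one character per glyph and '.' kept as an ordinary token:
[code]
from collections import Counter, defaultdict

def transitions_with_space(words):
    """First-order transition counts over a glyph stream in which the
    word break itself is kept as an ordinary token '.'."""
    stream = []
    for w in words:
        stream.extend(w)
        stream.append(".")
    counts = defaultdict(Counter)
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    return counts

counts = transitions_with_space("chol daiin chol dain".split())
print(dict(counts["l"]))   # how often [l] is followed by a break vs a glyph
print(dict(counts["."]))   # what tends to follow a break
[/code]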

In addition to strong and weak breakpoints, I suspect there may be another force at play that discourages both very short and very long words in running text, but I haven't tried to confirm that or work out any details.  (Possible example: if spacing rules would ordinarily dictate that the first word in a line should be "y," the "y" will instead be joined to the following word.)

The main case where I used discrete words as evidence was, I guess, the word counts from my stochastically generated text.  In that case, I generated the glyph sequence first and broke it into words in a second, entirely independent step (using the calculated probability of a space between each glyph pair).  What I think the results show is that a glyph sequence generated based on transitional probability matrices, chopped crudely into words at typical breakpoints, yields something reasonably close to the vocabulary and relative word frequencies of Voynichese.  That is, it seems possible that the observed word structure could arise out of this dynamic, or one similar to it.  But the reason for breaking the stream of glyphs into words here was really just to make the results compatible and comparable with word-count statistics.  I have to say I was surprised when [chedy] popped out at the top of the list -- I wasn't expecting that!
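Schematically, that two-step procedure -- generate a glyph stream from the transitional probabilities first, then insert breaks independently using per-pair space probabilities -- might look like the following.  It's a reconstruction from the description above, not the actual code behind the slides, and it again pretends each glyph is one character:
[code]
import random
from collections import Counter, defaultdict

def glyph_model(words):
    """First-order transition counts over the space-free glyph stream."""
    stream = [g for w in words for g in w]
    counts = defaultdict(Counter)
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    return counts

def space_probs(words):
    """P(break | glyph pair), estimated from where word breaks actually fall."""
    breaks, totals = Counter(), Counter()
    prev = None
    for w in words:
        for a, b in zip(w, w[1:]):
            totals[(a, b)] += 1              # pair written without a break
        if prev is not None:
            breaks[(prev[-1], w[0])] += 1    # pair written across a break
            totals[(prev[-1], w[0])] += 1
        prev = w
    return {p: breaks[p] / totals[p] for p in totals}

def generate(words, n, seed=0):
    rng = random.Random(seed)
    model, spaces = glyph_model(words), space_probs(words)
    stream = [words[0][0]]
    for _ in range(n):                       # step 1: glyphs only, no spaces
        succ = model[stream[-1]]
        if not succ:
            break
        stream.append(rng.choices(list(succ), weights=list(succ.values()))[0])
    out = stream[0]                          # step 2: insert breaks independently
    for a, b in zip(stream, stream[1:]):
        if rng.random() < spaces.get((a, b), 0.0):
            out += "."
        out += b
    return out

print(generate("chol daiin shol dain chol daiin".split(), 40))
[/code]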
I don't think that we can ever be certain about the meaning of spaces until the solution (whatever it is) has been found. 
[choldaiin] versus [chol.daiin] is just one example.
If choldaiin is a composite word, then we see something that also happens in regular languages.

Then again, there are plenty of cases of arar vs. ar ar etc.

I don't see how "ar" can be a word.
(I can see how it could be a number, see e.g. my music presentation.)

The safe approach seems to me to completely ignore spaces, but doing that we might also be ignoring useful information. Which brings me back to my very first point.