(01-12-2025, 06:57 PM)srjskam Wrote: I had done some comparisons with natural languages. The realization that certain pairs of small groups of common words are very frequent probably was the inspiration for this line of investigation. Like in German you'll very frequently have in/an/von/zu + der/die/dem. Similarly in other European languages that have prepositions and articles.
Thanks for the tables, but I don't see how to read them. Are there ": : patterns" in them?
Quote:(What's a sensible way to describe this... Cartesian product-like behaviour?
I don't know of a good name either, but in linear algebra it would be described as a "2x2 submatrix that has low determinant", or "is nearly singular", or "has a small eigenvalue".
The determinant of a 2x2 matrix M = [[a,b],[c,d]] is D = ad-bc. When D is zero we say that M is singular. This happens if and only if one row is a multiple of the other. Or, equivalently, if and only if one column is a multiple of the other. Or, equivalently, if and only if there are numbers P,Q and X,Y such that M = [[PX,PY],[QX,QY]].
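To make the determinant test concrete, here is a minimal Python sketch. The function names `det2` and `outer` are my own illustrative choices, not anything from the thread. It checks that a matrix built from row factors times column factors has D = 0, while a matrix with a strong diagonal preference does not.

```python
def det2(m):
    """Determinant ad - bc of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c


def outer(row_factors, col_factors):
    """Build the 2x2 matrix whose entry (i, j) is row_factors[i] * col_factors[j]."""
    return [[r * x for x in col_factors] for r in row_factors]


m = outer((2, 3), (5, 7))        # [[10, 14], [15, 21]]
print(det2(m))                   # 0: one row is a multiple of the other
print(det2([[4, 1], [1, 4]]))    # 15: not singular
```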
The eigenvalues of the 2x2 matrix M are the numbers L1 = T/2 + sqrt(T^2/4 - D) and L2 = T/2 - sqrt(T^2/4 - D), where T is the "trace" of the matrix, T = a + d. These may be complex numbers. However, for our purposes the order of rows and columns does not matter, so if the determinant D is positive you can swap the rows (or columns) of M before using those formulas. The swap flips the sign of D, so D will then be negative, the thing inside the sqrt() will be positive, and L1, L2 will be real numbers.
Take the absolute values of L1 and L2. The ratio R of the smaller to the larger of these two numbers is a measure of how far from singular the matrix M is: the smaller R is, the closer M is to singular.
Let the two words on the left be A1 and A2, and the two on the right be B1 and B2. The maximum value of R is 1, which occurs if the matrix of frequencies is [[a,0],[0,d]] or [[0,b],[c,0]]; that is, if A1 always pairs with B1 and A2 with B2, or vice-versa.
The minimum value of R is 0, which occurs if M is singular, that is, if it fits the ": : pattern". This means that the choice between B1 and B2 after A1 or A2 does not depend on which of these was the previous word. Or, equivalently, that the choice between A1 and A2 before B1 or B2 does not depend on which of these is the next word.
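The recipe above can be sketched in Python as follows. This is only an illustration of the steps as described, and the name `singularity_ratio` is my own invention. It assumes non-negative counts and swaps the rows when D is positive, so that the square root is always real.

```python
import math


def singularity_ratio(m):
    """Ratio R in [0, 1] of the smaller to the larger eigenvalue magnitude.

    m is a 2x2 matrix of non-negative counts [[a, b], [c, d]].
    R near 0 means nearly singular; R = 1 means strictly paired words.
    """
    (a, b), (c, d) = m
    if a * d - b * c > 0:
        # Swap the rows; this negates the determinant.
        (a, b), (c, d) = (c, d), (a, b)
    t = a + d                    # trace T
    det = a * d - b * c          # determinant D, now <= 0
    root = math.sqrt(t * t / 4 - det)
    l1, l2 = abs(t / 2 + root), abs(t / 2 - root)
    big = max(l1, l2)
    if big == 0:
        raise ValueError("both eigenvalues are zero; R is undefined")
    return min(l1, l2) / big


print(singularity_ratio([[3, 0], [0, 7]]))       # 1.0: A1 always with B1, A2 with B2
print(singularity_ratio([[10, 14], [15, 21]]))   # 0.0: rows are proportional
print(singularity_ratio([[4, 1], [1, 4]]))       # somewhere in between
```

Note that a matrix like [[0,5],[0,0]] has both eigenvalues equal to zero, so the sketch raises an error for it rather than guessing a value.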
In natural languages these ": : patterns" seem rare. Even when German grammar seems to allow "in" or "vor" to pair indifferently with "die" or "der", in any particular text you will find that "in" has a definite tendency to partner with "die" and "vor" with "der". Or vice-versa.
And you must be aware that many languages (including Classical Latin) make little or no use of prepositions and articles, and use word order or declensions instead. Articles and prepositions as separate words are a characteristic feature of Romance and Germanic languages.
And you also must try to account for the uncertainty in the frequencies of words and word pairs that comes from sampling error. I should know the formulas for that, but can't remember them now. But the point is that if the matrix of occurrence counts for {A1,A2} x {B1,B2} looks like [[2,1],[2,1]], one cannot count that as an occurrence of the ": : pattern", because those numbers are mostly sampling error noise.
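The formulas are not given above. One standard option, which is my assumption and not necessarily the one meant here, is the log odds ratio ln(ad/bc) with its usual large-sample standard error sqrt(1/a + 1/b + 1/c + 1/d). The log odds ratio is 0 exactly when ad - bc = 0, so a table supports the ": : pattern" only if the interval around 0 is narrow. The function name `log_odds_ratio_interval` and the 0.5 correction for zero cells (the Haldane-Anscombe correction) are choices made for this sketch.

```python
import math


def log_odds_ratio_interval(m, z=1.96):
    """Return (log_or, low, high) for a 2x2 table of counts [[a, b], [c, d]].

    log_or = ln(ad / bc); the interval is log_or +/- z * se, where
    se = sqrt(1/a + 1/b + 1/c + 1/d). z = 1.96 gives roughly a 95% interval.
    """
    (a, b), (c, d) = m
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction: add 0.5 to every cell to avoid dividing by zero.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or - z * se, log_or + z * se


small = log_odds_ratio_interval([[2, 1], [2, 1]])
large = log_odds_ratio_interval([[200, 100], [200, 100]])
print(small)   # centred on 0, but the interval is very wide
print(large)   # centred on 0, with a much narrower interval
```

For [[2,1],[2,1]] the standard error is sqrt(3), about 1.73, so the interval spans odds ratios from roughly 1/30 to 30. Those counts are compatible with almost any degree of dependence, which is why they cannot be counted as evidence for the pattern.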
All the best, --stolfi