I have installed the package "stylo" in R according to the instructions to find out similarities between individual pages of the VMS and to group them accordingly. Maybe I have misunderstood something, but the plot does not look like sorting to me. Maybe someone can do something with it:
[attachment=7838]
An attempt with 32 Herbal A and 32 Herbal B pages also does not really lead anywhere (I have selected "other" as the parameter for the language).
[attachment=7840]
With "stylo", the output is of course parameter-dependent. If anyone is familiar with this, I would be grateful for any suggestions.
(02-11-2023, 08:17 PM)bi3mw Wrote: I have installed the package "stylo" in R according to the instructions to find out similarities between individual pages of the VMS and to group them accordingly. Maybe I have misunderstood something, but the plot does not look like sorting to me.
Looks good to me. Why do you want to sort them anyway? Can you try something else? There must be a way to do principal component analysis, cluster analysis, and bootstrap consensus trees with this package, as advertised.
I had hoped that the cluster analysis would recognize the differences between Currier language A and B and group them accordingly. After all, these are two different "writing styles". The bootstrap consensus tree only produces garbage for me.
Principal Components Analysis:
[attachment=7842]
(03-11-2023, 02:51 PM)bi3mw Wrote: I had hoped that the cluster analysis would recognize the differences between Currier language A and B and group them accordingly. After all, these are two different "writing styles". The bootstrap consensus tree only produces garbage for me.
The binary tree in [link]: is it a "bootstrap consensus tree"?
It does seem to group A/B pages pretty well, with a few surprises: f90r2, f58v, f57r, f58r, f94r, f68r2.
Code:
A f90r2
B f86r3
B f41v
B f31v
f85v2 ???
f85v1 ???
f86v3 ???
f86r2 ???
B f86v4
B f95v1
B f46r
B f39r
B f48r
B f43r
B f33v
B f43v
B f34v
B f46v
B f34r
B f26v
B f41r
B f26r
B f31r
B f66v
B f50v
B f33r
B f95v2
B f55v
B f40v
B f94v
B f48v
B f39v
B f55r
B f50r
B f40r
B f95r1
B f95r2
B f114r
B f104r
B f114v
B f113v
B f106v
B f113r
B f105v
B f107r
B f107v
B f86v6
B f86v5
A f58v
B f115r
B f105r
B f104v
B f115v
B f106r
B f85r1
B f66r
B f111r
B f108v
B f112v
B f112r
B f108r
B f80v
B f80r
B f82v
B f79r
B f76r
B f111v
B f103v
B f103r
B f116r
B f81v
B f75v
B f78v
B f81r
B f79v
B f84r
B f78r
B f75r
B f84v
B f82r
B f77r
B f83v
B f77v
B f83r
B f76v
A f102v1
A f101v1
A f100v
A f17v
A f101r
A f102v2
A f89v2
A f101v2
A f89r2
A f99r
A f102r2
A f99v
A f89v1
A f89r1
A f88v
A f96r
A f52v
A f90v1
A f88r
A f3r
A f54v
A f90v2
A f54r
A f24r
A f93r
- f68v3
A f93v
A f53v
A f102r1
- f67v2
- f65r
A f87v
- f68r1
- f72v2
- f67v1
- f67v1
- f72v3
- f68v2
- f68v1
- f65v
B f57r
- f72r3
- f70v2
- f70v1
- f72v1
- f67r2
A f58r
- f57v
- f70r2
- f69r
- f71r
- f68r3
- f73r
- f70r1
- f73v
- f69v
- f67r1
- f72r2
- f71v
- f72r1
A f23v
A f23r
A f1v
A f6r
A f17r
A f18r
A f6v
A f35r
A f32v
A f4r
A f36r
A f7v
A f44v
A f9r
A f15r
A f45r
A f14r
A f11r
A f16v
A f14v
A f10r
A f45v
A f9v
A f37r
A f29r
A f15v
A f52r
A f36v
B f94r
A f44r
A f22v
A f18v
A f13r
A f11v
A f19r
A f51v
A f51r
A f2r
A f21r
A f42v
A f87r
A f3v
A f96v
A f37v
A f30v
A f4v
A f90r1
A f25r
A f16r
A f30r
A f100r
A f28v
A f10v
A f5v
A f53r
A f28r
A f13v
A f32r
A f21v
A f22r
A f19v
A f7r
A f38r
A f29v
A f20v
A f35v
A f25v
A f49v
A f49r
A f20r
- f68r2
A f24v
A f27v
A f5r
A f56v
A f1r
A f8r
A f56r
A f8v
A f42r
A f47r
A f27r
A f47v
A f2v
A f38v
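One rough way to spot such "surprises" mechanically is to walk the leaf order and flag every page whose label differs from both of its neighbours. The sketch below is only an illustration: the function name is made up, and the two short excerpts are copied from the listing above rather than the full tree.

```python
# Two short excerpts of the leaf order, copied from the listing above
# as (Currier label, folio) pairs.
excerpt_1 = [("B", "f86v6"), ("B", "f86v5"), ("A", "f58v"),
             ("B", "f115r"), ("B", "f105r")]
excerpt_2 = [("A", "f52r"), ("A", "f36v"), ("B", "f94r"),
             ("A", "f44r"), ("A", "f22v")]


def surprises(leaves):
    """Return folios whose label differs from the labels of both neighbours.

    The first and last leaves are skipped because they have only one neighbour.
    """
    found = []
    for i in range(1, len(leaves) - 1):
        label, folio = leaves[i]
        if label != leaves[i - 1][0] and label != leaves[i + 1][0]:
            found.append(folio)
    return found


print(surprises(excerpt_1))
print(surprises(excerpt_2))
```

On the full listing this rule would also flag a few pages sitting next to the unlabelled ("-") pages, and the leaf order of a dendrogram is not unique (branches can be flipped), so it is an aid for reading the tree, not a test.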
No, it is simply a cluster analysis (selection option 1). The result seems to be much better than I thought. There's a saying here: "I can't see the wood for the trees."

Since the parameters worked well for the cluster analysis plot, here is the corresponding output with Principal Components Analysis (PCA):
[attachment=7860]
I have run both plots (Cluster Analysis and Principal Components Analysis) again with prefixes in the file names for better differentiation.
Green = Currier language B
Red = Currier language A
Blue = unknown
Principal Components Analysis:
[attachment=7862]
Enlarged:
[attachment=7863]
Cluster Analysis:
[attachment=7865]
Enlarged:
[attachment=7864]
It clearly shows that the two different Currier languages can be distinguished by means of stylometry. Furthermore, one gets an impression of where the unknown pages might belong.
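For anyone wondering what kind of distance sits behind such a cluster plot, here is a minimal sketch of Burrows's Delta on word frequencies. As far as I know this is stylo's default distance, but which distance was actually used for the plots above is not stated, so treat that as an assumption. The page names and the tokens `w1`, `w2`, `w3` are made-up toy data, not taken from any transcription.

```python
from statistics import mean, stdev


def delta_distances(texts):
    """Burrows's Delta for every pair of texts (dict: name -> list of tokens)."""
    vocab = sorted({w for tokens in texts.values() for w in tokens})
    # Relative frequency of every word in every text.
    rel = {name: [tokens.count(w) / len(tokens) for w in vocab]
           for name, tokens in texts.items()}
    # Mean and sample standard deviation of each word across the texts.
    columns = list(zip(*rel.values()))
    stats = [(mean(col), stdev(col)) for col in columns]
    # z-scores; words with zero variance carry no information and are dropped.
    z = {name: [(f - m) / sd for f, (m, sd) in zip(freqs, stats) if sd > 0]
         for name, freqs in rel.items()}
    names = list(texts)
    distances = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Delta = mean absolute difference of the z-scores.
            distances[(a, b)] = mean([abs(p - q) for p, q in zip(z[a], z[b])])
    return distances


# Toy corpus: two "A-like" and two "B-like" pages (hypothetical).
texts = {
    "A_1": ["w1"] * 6 + ["w2"] * 2 + ["w3"] * 2,
    "A_2": ["w1"] * 5 + ["w2"] * 2 + ["w3"] * 3,
    "B_1": ["w1"] * 2 + ["w2"] * 2 + ["w3"] * 6,
    "B_2": ["w1"] * 3 + ["w2"] * 2 + ["w3"] * 5,
}
distances = delta_distances(texts)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

In this toy corpus the within-group distances come out smaller than the cross-group ones, which is the pattern a hierarchical clustering then turns into separate branches.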
The same V-shape occurs in [link] (I guess Stylo uses word frequencies?). See also [link] and [link].
Unfortunately, in the plots from the first link above, I had PC1 on the Y axis, so the diagram must be rotated 90 degrees clockwise to be comparable with Stylo's results. It must also be mirrored, since PC1 has high values for Currier A in the Stylo plot and low values in the bigram plot.
Colour-coding by sections or by scribes would probably show more patterns. In the bigram-based plot, "Pharma" (yellow) is where the two arms of the V meet, with high PC2 values. In the Stylo plot, it seems to cluster at the centre, near the origin. In both cases, PC1 seems to be driven by Currier A vs B and Pharma falls in an intermediate position, with PC1 close to 0.
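The rotation and mirroring can be written out as plain coordinate arithmetic. This is only a sketch: it assumes the bigram plot has PC2 on the X axis increasing to the right, and the sample point is hypothetical. The mirroring itself is harmless, because the sign of a principal component is arbitrary.

```python
def rotate_90_clockwise(x, y):
    """Rotate a point 90 degrees clockwise around the origin."""
    return (y, -x)


def mirror_horizontally(x, y):
    """Mirror a point left/right, i.e. flip the sign of the horizontal axis."""
    return (-x, y)


# Hypothetical page in the bigram plot: PC1 = -2.0 (drawn on Y), PC2 = 1.0 (on X).
pc1, pc2 = -2.0, 1.0
x, y = pc2, pc1

# After rotating, PC1 runs along the horizontal axis; mirroring flips its sign.
x, y = mirror_horizontally(*rotate_90_clockwise(x, y))
print(x, y)
```

Under these assumptions the horizontal coordinate ends up as -PC1 and the vertical one as -PC2, so a page with low PC1 in the bigram plot lands on the high-PC1 side, as in the Stylo plot.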
(04-11-2023, 08:51 PM)MarcoP Wrote: I guess Stylo uses word frequencies?
Yes, Stylo uses word frequencies. The number of most frequent words (MFW) to use is adjustable. It is also possible to work with character n-grams instead of words.
Culling is also possible. A culling value of 20, for example, means that only words that occur in at least 20% of the texts are analyzed, while a value of 100 means that only words that occur in all texts in the collection are taken into account.
[attachment=7866]
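To make the MFW and culling settings concrete, here is a minimal sketch of the selection step. It is not stylo's code: the function name, the toy tokens, and the order of the two filters (culling first, then the MFW cut-off) are assumptions for illustration.

```python
from collections import Counter


def select_features(texts, mfw=100, culling=0):
    """Pick the words to analyse from a dict of name -> list of tokens.

    culling: minimum percentage of texts a word must occur in (0 to 100).
    mfw: how many of the most frequent remaining words to keep.
    """
    totals = Counter()      # overall frequency of each word
    doc_counts = Counter()  # number of texts each word occurs in
    for tokens in texts.values():
        totals.update(tokens)
        doc_counts.update(set(tokens))
    n_texts = len(texts)
    kept = [word for word, _ in totals.most_common()
            if 100 * doc_counts[word] / n_texts >= culling]
    return kept[:mfw]


# Made-up toy corpus of three "pages".
texts = {
    "p1": ["a", "a", "b", "c"],
    "p2": ["a", "b", "b"],
    "p3": ["a", "a", "d", "d", "d"],
}
print(select_features(texts, culling=100))  # only words found in every text
print(select_features(texts, culling=50))   # words found in at least half
```

With a culling value of 100 only "a" survives here, because it is the only word that occurs in all three texts. With 50, "b" is kept as well, since it occurs in two of the three.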