[split] (lack of) word groups - Printable Version

RE: [split] (lack of) word groups - Koen G - 02-07-2019

Nablator, I got the following error:
source_file.java:4: error: class xMATTR is public, should be declared in a file named xMATTR.java

RE: [split] (lack of) word groups - davidjackson - 02-07-2019

You have to rename the file currently called source_file.java to xMATTR.java

RE: [split] (lack of) word groups - nablator - 02-07-2019

(02-07-2019, 10:29 AM)davidjackson Wrote: You have to rename the file currently called source_file.java to xMATTR.java

Yes.

I did a few test runs and got lower results with Latin prose than poetry, not unexpectedly. Less careful writers than poets reuse the same groups of words more often. The problem with the lack of clear phrase structure in the VMS goes deeper than just having few long patterns of words: logical/functional links between words should be detectable not only between adjacent words but at intermediate distances too. Word ordering and grouping, even in Latin, is far from random: the same author is very likely to have a consistent style of word ordering. The statistical tools described in the WPPA document are very interesting for that and would be very much worth running on your corpus of texts.

RE: [split] (lack of) word groups - Koen G - 02-07-2019

(02-07-2019, 11:42 AM)nablator Wrote: the statistical tools described in the WPPA document are very interesting for that and would be very much worth running on your corpus of texts.

Maybe it's easier if I just mail you my txt files

RE: [split] (lack of) word groups - nablator - 02-07-2019

(02-07-2019, 11:55 AM)Koen G Wrote:
(02-07-2019, 11:42 AM)nablator Wrote: the statistical tools described in the WPPA document are very interesting for that and would be very much worth running on your corpus of texts.

Maybe it's easier if I just mail you my txt files

I don't have the WPPA executable(s). I'll try to write a new Java implementation to check and better understand what Marke Fincher discovered.

Also I can write an entropy calculator for words (h0, h1, h2).
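
For reference, here is a minimal sketch of what such a word entropy calculator could look like. It is not the program mentioned in this thread. The class and method names are made up, and the definitions are one common convention that I am assuming here: h0 is log2 of the vocabulary size, h1 is the Shannon entropy of single-word frequencies, and h2 is the conditional entropy of a word given the previous word.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and definitions are assumptions, not the thread's code.
class WordEntropy {

    // Shannon entropy in bits of a table of counts.
    static double entropy(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) total += c;
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // h0: log2 of the number of distinct words.
    static double h0(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return Math.log(counts.size()) / Math.log(2);
    }

    // h1: entropy of the single-word frequency distribution.
    static double h1(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return entropy(counts);
    }

    // h2: entropy of adjacent word pairs minus entropy of the first word of
    // each pair, i.e. the conditional entropy of a word given its predecessor.
    static double h2(String[] words) {
        Map<String, Integer> pairs = new HashMap<>();
        Map<String, Integer> firsts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            pairs.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            firsts.merge(words[i], 1, Integer::sum);
        }
        return entropy(pairs) - entropy(firsts);
    }
}
```

A strictly alternating text such as "a b a b" gives h2 = 0 under this definition, because each word fully determines the next one.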

RE: Gordon Rugg, "Neither researchers nor the media can put down the world’s most..." - nablator - 02-07-2019

(01-07-2019, 01:55 PM)Anton Wrote: "But the words in the Voynich Manuscript [...] in their order."

This is an interesting statement, with reference to Currier (not sure to what place exactly).

Are there any quantitative results for second-order word entropy?

If there is any result anywhere I'd like to test the program that I put together in the last 10 minutes.

RE: [split] (lack of) word groups - Koen G - 02-07-2019

(02-07-2019, 12:08 PM)nablator Wrote: I don't have the WPPA executable(s).

Did anyone mail Fincher already? If not, I'll mail him to see if he still has the executables.

RE: [split] (lack of) word groups - Koen G - 03-07-2019

Okay I got it all working. Do I understand correctly that if I enter for example 500 3 it will use a 500-word window and calculate TTR for 3-word strings?

Example: chaucer.txt 0.9927007607274456

VM values:
01. ZL_2 fewerspace.txt 0.9987576052770742
02. ZL_2a morespace.txt 0.9986908882700465
03.VM_GC.txt 0.9871678048102964
04.TT_ivtff_v0a.txt 0.9985955011100115
05.Herbal A.txt 0.999699481865285
06.Herbal B.txt 1.0
07.Q13.txt 0.9985492083203975
08.SPlantsNoLab.txt 0.9992566180443004
09.Q20.txt 0.9996877287613176
10.lab.txt 1.0
11.noq13.txt 0.9996536355182034

Which would imply that for example the GC transcription (03) repeats more 3-word strings than Chaucer?
(just checking if I'm reading this right before moving further)

RE: [split] (lack of) word groups - nablator - 03-07-2019

(03-07-2019, 10:54 AM)Koen G Wrote: Okay I got it all working. Do I understand correctly that if I enter for example 500 3 it will use a 500-word window and calculate TTR for 3-word strings?

Yes.
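
To make the calculation concrete, here is a rough sketch of a moving-average TTR over n-word strings. It is not the actual xMATTR source. The names are made up, and I am assuming that a window of W words contains W - n + 1 overlapping n-word strings, that the window slides one word at a time, and that the per-window ratios are averaged.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the real xMATTR code, only the idea as understood here.
class NgramMattr {

    // Average, over all sliding windows, of (distinct n-word strings / n-word strings).
    static double mattr(String[] words, int window, int n) {
        if (n < 1 || window < n || words.length < window) {
            throw new IllegalArgumentException("need n >= 1, window >= n, text >= window");
        }
        int total = words.length - n + 1;   // n-word strings in the whole text
        int perWindow = window - n + 1;     // n-word strings in one window
        String[] grams = new String[total];
        for (int i = 0; i < total; i++) {
            grams[i] = String.join(" ", Arrays.copyOfRange(words, i, i + n));
        }
        // Count the strings of the first window.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < perWindow; i++) counts.merge(grams[i], 1, Integer::sum);
        double sum = (double) counts.size() / perWindow;
        int windows = 1;
        // Slide by one word: drop the oldest string, add the newest.
        for (int start = 1; start + perWindow <= total; start++) {
            counts.computeIfPresent(grams[start - 1], (k, v) -> v == 1 ? null : v - 1);
            counts.merge(grams[start + perWindow - 1], 1, Integer::sum);
            sum += (double) counts.size() / perWindow;
            windows++;
        }
        return sum / windows;
    }
}
```

With this reading, entering 500 3 corresponds to mattr(words, 500, 3), and a value of exactly 1.0 means that no 3-word string occurs twice inside any 500-word window.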

Quote:03.VM_GC.txt 0.9871678048102964

I don't know why it is so low, and I get an even lower value, 0.979. Something is wrong.

Did you remove labels or something else?

When GC is converted to nearest EVA, I get a more normal 0.998.

EDIT: I know what's wrong.

V101 uses some digits that are interpreted as word separators by split("[0-9 \t]+")

Just use split("[ \t]+") if you want to allow 0-9 in words.
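
A small illustration of the difference between the two patterns. The sample line is made up and is not real v101 text, and the class and method names are only for this example.

```java
import java.util.Arrays;

// Hypothetical demo of the two split patterns quoted above.
class SplitDemo {

    // Digits, spaces and tabs all act as word separators.
    static String[] splitOnDigitsAndSpace(String line) {
        return line.split("[0-9 \t]+");
    }

    // Only spaces and tabs separate words, so digits stay inside words.
    static String[] splitOnSpaceOnly(String line) {
        return line.split("[ \t]+");
    }

    public static void main(String[] args) {
        String line = "ab3cd ef"; // made-up tokens, not a real transcription line
        System.out.println(Arrays.toString(splitOnDigitsAndSpace(line))); // [ab, cd, ef]
        System.out.println(Arrays.toString(splitOnSpaceOnly(line)));      // [ab3cd, ef]
    }
}
```

With the first pattern a word containing a digit is cut into shorter fragments, which presumably repeat more often and pull the ratio down.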

RE: [split] (lack of) word groups - Koen G - 03-07-2019

Right, I will have to check that later. Normally I'd prefer digits to be removed since more often than not they are part of modern layout.

I collected some numbers already, you can see them here (final sheet xMATTR):
[link]

Of the texts I checked so far, here's the percentage that have no repeating strings of the following lengths:

2: 1%
3: 6%
4: 20%
5: 32%
6: 48%

So about half of these texts never repeat a string of 6 words exactly (taking into account the 1000-word window I always used, which seems reasonably large).
What may be more interesting is that 9 of my txt files never even repeat a 4-word string.
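
For anyone who wants to double-check the "never repeats" cases directly, here is a small sketch that tests whether any n-word string occurs twice inside a single window. The names are made up, and it assumes that "inside a window" means both occurrences fit completely within the same run of `window` consecutive words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of xMATTR.
class RepeatCheck {

    // True if some n-word string occurs twice within `window` consecutive words.
    static boolean repeatsWithinWindow(String[] words, int n, int window) {
        Map<String, Integer> lastStart = new HashMap<>(); // last start index of each string
        for (int i = 0; i + n <= words.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            Integer prev = lastStart.put(gram, i);
            // The span from the previous start to the end of this occurrence
            // must be no longer than the window.
            if (prev != null && (i + n) - prev <= window) return true;
        }
        return false;
    }
}
```

For example, repeatsWithinWindow(words, 4, 1000) returning false would correspond to a text that never repeats a 4-word string within the 1000-word window.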

I have to leave now, but I'll attach an example already so someone can double-check.