[split] (lack of) word groups - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: [split] (lack of) word groups (/thread-2841.html)

RE: [split] (lack of) word groups - Koen G - 02-07-2019

Nablator, I got the following error:

source_file.java:4: error: class xMATTR is public, should be declared in a file named xMATTR.java

RE: [split] (lack of) word groups - davidjackson - 02-07-2019

You have to rename the file currently called source_file.java to xMATTR.java

RE: [split] (lack of) word groups - nablator - 02-07-2019

(02-07-2019, 10:29 AM)davidjackson Wrote: You have to rename the file currently called source_file.java to xMATTR.java

Yes. I did a few test runs and got lower results with Latin prose than with poetry, not unexpectedly: writers who are less careful than poets reuse the same groups of words more often.

The problem with the lack of clear phrase structure in the VMS goes deeper than just having few long patterns of words: logical/functional links between words should be detectable not only between adjacent words but at intermediate distances too. Word ordering and grouping, even in Latin, is far from random: the same author is very likely to have a consistent style of word ordering. The statistical tools described in the WPPA document are very interesting for that and would be very much worth running on your corpus of texts.

RE: [split] (lack of) word groups - Koen G - 02-07-2019

(02-07-2019, 11:42 AM)nablator Wrote: the statistical tools described in the WPPA document are very interesting for that and would be very much worth running on your corpus of texts.

Maybe it's easier if I just mail you my txt files

RE: [split] (lack of) word groups - nablator - 02-07-2019

(02-07-2019, 11:55 AM)Koen G Wrote: (02-07-2019, 11:42 AM)nablator Wrote: the statistical tools described in the WPPA document are very interesting for that and would be very much worth running on your corpus of texts.

I don't have the WPPA executable(s). I'll try to write a new Java implementation to check and better understand what Marke Fincher discovered. Also I can write an entropy calculator for words (h0, h1, h2).

RE: Gordon Rugg, "Neither researchers nor the media can put down the world’s most..." - nablator - 02-07-2019

(01-07-2019, 01:55 PM)Anton Wrote: "But the words in the Voynich Manuscript [...] in their order."

If there is any result anywhere I'd like to test the program that I put together in the last 10 minutes.

RE: [split] (lack of) word groups - Koen G - 02-07-2019

(02-07-2019, 12:08 PM)nablator Wrote: I don't have the WPPA executable(s).

Did anyone mail Fincher already? If not, I'll mail him to see if he still has the exe's.

RE: [split] (lack of) word groups - Koen G - 03-07-2019

Okay I got it all working. Do I understand correctly that if I enter for example 500 3 it will use a 500-word window and calculate TTR for 3-word strings?

Example:
chaucer.txt 0.9927007607274456

VM values:
01. ZL_2 fewerspace.txt 0.9987576052770742
02. ZL_2a morespace.txt 0.9986908882700465
03.VM_GC.txt 0.9871678048102964
04.TT_ivtff_v0a.txt 0.9985955011100115
05.Herbal A.txt 0.999699481865285
06.Herbal B.txt 1.0
07.Q13.txt 0.9985492083203975
08.SPlantsNoLab.txt 0.9992566180443004
09.Q20.txt 0.9996877287613176
10.lab.txt 1.0
11.noq13.txt 0.9996536355182034

Which would imply that, for example, the GC transcription (03) repeats more 3-word strings than Chaucer? (Just checking if I'm reading this right before moving further.)

RE: [split] (lack of) word groups - nablator - 03-07-2019

(03-07-2019, 10:54 AM)Koen G Wrote: Okay I got it all working. Do I understand correctly that if I enter for example 500 3 it will use a 500-word window and calculate TTR for 3-word strings?

Yes.

Quote: 03.VM_GC.txt 0.9871678048102964

I don't know why it is so low, and I get an even lower value, 0.979. Something is wrong. Did you remove labels or something else? When GC is converted to nearest EVA, I get a more normal 0.998.

EDIT: I know what's wrong. V101 uses some digits that are interpreted as word separators by split("[0-9 \t]+"). Just use split("[ \t]+") if you want to allow 0-9 in words.

RE: [split] (lack of) word groups - Koen G - 03-07-2019

Right, I will have to check that later. Normally I'd prefer digits to be removed since more often than not they are part of modern layout.

I collected some numbers already, you can see them here (final sheet xMATTR): [link not shown]

Of the texts I checked so far, here's the percentage that have no repeating strings of the following lengths:

2: 1%
3: 6%
4: 20%
5: 32%
6: 48%

So about half of these texts never repeat a string of 6 words exactly (taking into account the 1000-word window I always used, which seems reasonably large). What may be more interesting is that 9 of my txt files never even repeat a 4-word string.

I have to leave now, but I'll already attach an example so someone could double check.
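
For anyone who wants to double check numbers like these by hand, here is a minimal sketch of a windowed TTR over n-word strings. It is not the xMATTR.java posted in the thread. The class name XMattrSketch, the method names, and the exact windowing are assumptions: a window of W words is taken to hold W - n + 1 overlapping n-word strings, the window advances one word at a time, and a text shorter than the window is treated as a single window. The separator regex is a parameter so that the two split patterns mentioned above can be compared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; names and windowing details are assumptions.
final class XMattrSketch {

    // Split text on the given separator regex and drop empty tokens.
    static List<String> tokenize(String text, String separatorRegex) {
        List<String> words = new ArrayList<>();
        for (String w : text.split(separatorRegex)) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Join every run of n consecutive words into one n-word string.
    static List<String> ngrams(List<String> words, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            grams.add(String.join(" ", words.subList(i, i + n)));
        }
        return grams;
    }

    // Average, over all window positions, of (distinct n-word strings / n-word strings in window).
    static double mattr(List<String> words, int windowWords, int n) {
        if (n < 1 || windowWords < n || words.size() < n) {
            throw new IllegalArgumentException("need n >= 1, window >= n, and at least n words");
        }
        List<String> grams = ngrams(words, n);
        // Assumption: a window of windowWords words holds windowWords - n + 1 n-word strings.
        int span = Math.min(windowWords - n + 1, grams.size());
        Map<String, Integer> counts = new HashMap<>();
        double sum = 0.0;
        int windows = 0;
        for (int i = 0; i < grams.size(); i++) {
            counts.merge(grams.get(i), 1, Integer::sum);
            if (i >= span) {
                // Drop the string that just slid out of the window.
                counts.computeIfPresent(grams.get(i - span), (k, c) -> c == 1 ? null : c - 1);
            }
            if (i >= span - 1) {
                sum += (double) counts.size() / span;
                windows++;
            }
        }
        return sum / windows;
    }

    public static void main(String[] args) {
        String line = "ab1c de ab1c de ab1c de";
        // Digits treated as separators: "ab1c" is split into "ab" and "c".
        System.out.println(tokenize(line, "[0-9 \t]+"));
        // Only spaces and tabs as separators: "ab1c" stays one word.
        System.out.println(tokenize(line, "[ \t]+"));
        System.out.println(mattr(tokenize(line, "[ \t]+"), 4, 2));
    }
}
```

Under these assumptions, a value of exactly 1.0 means that no window contains the same n-word string twice, which is how the "no repeating strings" percentages above can be read. Treating digits as separators changes the word list itself, so it can shift the result for transcriptions such as V101 that use digits inside words.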