The Voynich Ninja
Transliteration files and formats - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Transliteration files and formats (/thread-3004.html)

Pages: 1 2 3


Transliteration files and formats - ReneZ - 22-11-2019

Some time ago, I defined a format similar to the interlinear file format, but which can also host the GC transliteration.
I represented most common historical transliterations in this format. I also updated my own tool to process such files.

However, some things were still not easy to do.
Historically, people recorded the ends of paragraphs in these files, but it is also of interest to be able to do separate statistics for the first lines of paragraphs. I decided to introduce a new dedicated comment for this and added it to my own transliteration file.

All links can be found on these two pages:
- (link)
- (link)

Here are the most relevant ones:
- (link)
- (link)
- (link)

With this, I could finally check the real preference of p and f on top lines of paragraphs quite easily. I will post about that next. There is another quite interesting area of statistics that is now possible, which will take a bit more time.
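The kind of statistic described here can be sketched in a few lines of Python. The input representation is hypothetical (pairs of a first-line flag and EVA text, produced by whatever tool parses the transliteration file); the sample lines are invented for illustration.

```python
# Sketch: counting 'p' and 'f' separately for paragraph-initial lines
# and all other lines. The (is_first, text) representation is an
# assumption, not part of the IVTFF format itself.
from collections import Counter

def gallows_counts(lines):
    """Return separate p/f counts for first lines and other lines."""
    first, other = Counter(), Counter()
    for is_first, text in lines:
        target = first if is_first else other
        for ch in text:
            if ch in ("p", "f"):
                target[ch] += 1
    return first, other

lines = [
    (True,  "pchedy.qokeedy.shey"),   # first line of a paragraph
    (False, "daiin.chedy.qokain"),
    (True,  "fachys.ykal.ar"),
    (False, "otedy.qokeey.dal"),
]
first, other = gallows_counts(lines)
print(first["p"], first["f"], other["p"], other["f"])  # 1 1 0 0
```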


RE: Transliteration files and formats - davidjackson - 22-11-2019

I think this is a very solid and well thought out proposal for a standard.

It may not be extensible enough to make room for personal future innovations in transcriptions. For example, is there a standard way to identify potential handwriting differences within lines? Or ink colour? I understand such identifiers can only be included in the page headers (section 6.3).

There could be a marker in the locus permitting such identifiers to be included; failing that, a scheme to embed transcription-specific identifiers in the loci.


RE: Transliteration files and formats - ReneZ - 22-11-2019

This is meant to be 'intermediate' as the name says. The next step should be a database that allows full annotation, starting with the exact location of each item.
In my opinion of course.


RE: Transliteration files and formats - -JKP- - 22-11-2019

(22-11-2019, 08:32 PM)ReneZ Wrote: This is meant to be 'intermediate' as the name says. The next step should be a database that allows full annotation, starting with the exact location of each item.
In my opinion of course.

By exact location do you mean page coordinates (like a clickmap) or paragraph/line/word markers?

I already have a database that maps every token and every common subset of tokens, including common beginnings and ends that go with specific blocks, but it does so in terms of sections and subsections. I haven't had time to add actual coordinates (or p/l/w markers). I developed this in conjunction with my transcripts and my concordance.


RE: Transliteration files and formats - ReneZ - 23-11-2019

With location I mean the coordinates. We don't have any standard for that yet, but clearly the Jason Davies voyager uses one definition (which we can see) and "voynichese.com" uses another which we cannot see.

My idea would be to keep it simple and use the most original source: the digital images of the Beinecke.
Each page in the MS can be mapped to an image file, and each character in the MS has its location (X,Y) in pixels in one of these files.
At least the start point of each locus should be recorded, but preferably each character (which could be a bit complicated due to the different alphabets, though nothing that can't be solved).

Setting up such a database is too much work to do manually, so some tool should be employed.
Apart from coordinates, it should also record character size and orientation.
With this, one suddenly has access to a lot of new information, basically related to handwriting.
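One record in such a database could look like the sketch below. The field names and example values are illustrative only, not a proposed standard; the coordinates are pixels within a named image file, as suggested above.

```python
# Sketch of one per-character record. Fields follow the post above:
# location (X,Y) in pixels in an image file, plus size and orientation.
from dataclasses import dataclass

@dataclass
class GlyphRecord:
    image_file: str    # name of the scan image (hypothetical)
    page: str          # folio, e.g. "f1r"
    locus: str         # locus identifier within the page
    char: str          # transliterated character in some agreed alphabet
    x: int             # pixel X of the character's start point
    y: int             # pixel Y
    size: float        # character height in pixels
    angle: float       # orientation in degrees, 0 = upright

g = GlyphRecord("example.jpg", "f1r", "f1r.1", "p", 412, 388, 31.0, 2.5)
print(g.page, g.char, g.x, g.y)
```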


RE: Transliteration files and formats - -JKP- - 23-11-2019

If it's tied to coordinates, one would want a relative system so that if the resolutions of the Beinecke scans were ever changed, it could be applied to the new scans without a lot of reworking.

Also, it should be relative to certain defined points on the folio (rather than the edges of the scan) because the scan edges might change from version to version.
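The relative scheme suggested here can be sketched as follows: express a point as a fraction of the distance between two fixed reference points on the folio (the choice of reference points, e.g. two corners of the written area, is a hypothetical one), so the value survives a change of scan resolution or cropping.

```python
# Sketch: convert absolute pixel coordinates into coordinates relative
# to two reference points marked on the folio itself.
def to_relative(x, y, ref_top_left, ref_bottom_right):
    (x0, y0), (x1, y1) = ref_top_left, ref_bottom_right
    return (x - x0) / (x1 - x0), (y - y0) / (y1 - y0)

# A point halfway across and a quarter of the way down the reference box:
rx, ry = to_relative(300, 150, (100, 100), (500, 300))
print(rx, ry)  # 0.5 0.25
```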

(23-11-2019, 09:33 AM)ReneZ Wrote:
...
Apart from coordinates, it should also record character size and orientation.
With this, one suddenly has access to a lot of new information, basically related to handwriting.


The challenge, of course, is making a distinction between meaningful variations and those that are simply variations of the pen or the scribe.


RE: Transliteration files and formats - ReneZ - 23-11-2019

(23-11-2019, 10:57 AM)-JKP- Wrote: If it's tied to coordinates, one would want a relative system so that if the resolutions of the Beinecke scans were ever changed, it could be applied to the new scans without a lot of reworking.


Not necessarily. The new coordinates would have to be re-computed in almost all cases, and this can always be done by a piece of software.
(For example: the grouping of pages in one image may change, the margins may change).

Furthermore, as long as the original images are preserved, they can even remain the reference.
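The re-computation argued for here is simple in the common case: if a new scan differs from the old one only in resolution and margins, an affine map (a scale factor plus an offset, assumed known per image) carries every stored coordinate over in one pass.

```python
# Sketch: remap stored pixel coordinates to a new scan given a scale
# factor and margin offsets. The example values are invented.
def remap(points, scale, dx, dy):
    return [(px * scale + dx, py * scale + dy) for (px, py) in points]

old = [(412, 388), (430, 391)]
print(remap(old, 2.0, 10, -5))  # [(834.0, 771.0), (870.0, 777.0)]
```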


RE: Transliteration files and formats - ReneZ - 23-11-2019

Just as an example of the use of the IVTFF format and the ivtt tool, this is how I computed the stats in the linked post.
First I made a script including a series of ivtt commands to:
- remove unnecessary bits from the input file (ivtt -u1 -h2 -c4 <file0 >file1)
- split into paragraph loci and other loci (ivtt +@P ... and ivtt -@P ...)
- split the paragraph loci into 'first' and 'all other' (ivtt -q1 ... and ivtt -q2 ...)
- keep only the individual word tokens (for each of the three files: ivtt -f1 -s3 ...)
- count the words using standard Unix grep and wc commands.

This script can be run for the entire text (ZL.txt).
The A language part is generated by:  ivtt +LA ZL.txt ZLA.txt
The Quire-13 part is generated by:  ivtt +QM ZL.txt ZLbio.txt
(and similarly for the others).
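The final counting step (the grep and wc commands above) can equally be done in a few lines of Python on the one-token-per-line output of the last ivtt stage. The sample token data below is invented for illustration.

```python
# Python equivalent of the grep/wc step: count total and distinct word
# tokens in a stream holding one token per line, as produced by
# ivtt -f1 -s3 in the script above.
from collections import Counter
from io import StringIO

def count_tokens(stream):
    counts = Counter(tok for line in stream if (tok := line.strip()))
    return sum(counts.values()), len(counts), counts

sample = StringIO("daiin\nchedy\ndaiin\nqokeedy\n")
total, distinct, counts = count_tokens(sample)
print(total, distinct, counts["daiin"])  # 4 3 2
```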


RE: Transliteration files and formats - farmerjohn - 23-11-2019

It might be better not to put all the info in one single file, but to split it into several, just as HTML and CSS do: an "html" file for the transliteration itself, and "css" files for each of coordinates, colours, style... This would somewhat resolve the extensibility problem. To make it possible, the transliteration file should provide a convenient way to refer to words, lines, etc., and separate versioning of the transliteration and of the transliteration file format is also needed.
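The split-file idea can be sketched with plain mappings: the transliteration carries stable references (here, a locus plus a word index, which is a hypothetical keying scheme), and each extra layer (coordinates, colour, ...) is an independent mapping keyed by the same references, joined only when needed.

```python
# Sketch: "html-like" transliteration plus separate "css-like" layers,
# all keyed by (locus, word index). All data here is invented.
transliteration = {
    ("f1r.1", 0): "fachys",
    ("f1r.1", 1): "ykal",
}
colour_layer = {("f1r.1", 0): "red"}       # one independent layer file
coord_layer  = {("f1r.1", 0): (412, 388)}  # another independent layer

def annotate(key):
    """Join all layers for one word; missing layers yield None."""
    return (transliteration[key],
            colour_layer.get(key),
            coord_layer.get(key))

print(annotate(("f1r.1", 0)))  # ('fachys', 'red', (412, 388))
```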


RE: Transliteration files and formats - ReneZ - 23-11-2019

I do agree with that. However, a database would be even better.
This is of course also (in a way) a combination of files, but it already comes with a set of tools to relate the information.