The Voynich Ninja

Voynich Manuscript RESTful API
Hi everyone,

I have been working on a project for the Voynich Manuscript based on the well-known 'interlinear' file. I have built a RESTful API on top of the interlinear and am presenting it here:
[link]

Essentially it is an online, public version of the Voynich Transcription Tool that gets mentioned from time to time. There is documentation for how to use the API, plus a set of examples showing what sort of applications can be built on top of API queries. I've implemented some classic examples, e.g. word-length distribution and Sukhotin's vowel identification algorithm. I don't think these examples are of great value in themselves, but they are interesting in terms of showing a transparent methodology: a clear data set and repeatable steps (i.e. source code available).
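
To give a flavour of the Sukhotin example, here is a minimal sketch of the algorithm in TypeScript. This is an illustration of the idea, not the code running behind the API, and the sample words are just a handful of tokens quoted elsewhere in this thread:

Code:
// Minimal sketch of Sukhotin's vowel identification algorithm.
// All letters start as consonants; repeatedly promote the letter with
// the largest remaining adjacency sum to vowel status.
function sukhotinVowels(words: string[]): Set<string> {
  const letters = Array.from(new Set(words.join("").split("")));
  const index = new Map<string, number>(letters.map((ch, i) => [ch, i]));
  const n = letters.length;

  // Symmetric matrix of adjacency counts; same-letter pairs are ignored.
  const adj: number[][] = Array.from({ length: n }, () => new Array(n).fill(0));
  for (const w of words) {
    for (let i = 0; i + 1 < w.length; i++) {
      const a = index.get(w[i])!;
      const b = index.get(w[i + 1])!;
      if (a !== b) { adj[a][b]++; adj[b][a]++; }
    }
  }

  const rowSum = adj.map(row => row.reduce((s, x) => s + x, 0));
  const vowels = new Set<string>();
  for (;;) {
    let best = -1;
    for (let i = 0; i < n; i++) {
      if (!vowels.has(letters[i]) && rowSum[i] > 0 &&
          (best === -1 || rowSum[i] > rowSum[best])) {
        best = i;
      }
    }
    if (best === -1) break; // nothing positive left: the rest are consonants
    vowels.add(letters[best]);
    // Discount every letter's sum by twice its adjacency to the new vowel.
    for (let i = 0; i < n; i++) rowSum[i] -= 2 * adj[i][best];
  }
  return vowels;
}

// Sample tokens taken from the transcription examples in this thread.
console.log(sukhotinVowels(["fachys", "ykal", "ar", "otardaly", "ocfhor", "okear"]));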

An example API query could be:

[link]

This would translate as: 'fetch me page [link], Takahashi transcription, without interlinear comments, with columns for pageId, currierHand, illustrationType, unitCode and lineNumber, and using a morpheme grouping of cfh, ckh, cph, cth, eee, iii, ch, ee, ii, qo, sh'.

There are two main methods (routes): 'tokens' and 'morphemes'. 'Tokens' gets the effectively raw transcription from the interlinear, and 'morphemes' does the same thing but applies a grouping algorithm that identifies e.g. 'qo', 'sh' and 'eee'. The morpheme groups are user-configurable. I took some feedback on this (thanks, Nick Pelling) and decided that it is an open question whether we should see 'qo' as a single morpheme, or actually see 'qot' and 'qok' separately from 'qo', etc. There are many similar questions in this area of word morphology that would be better served, in my opinion, by clearer data.
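
To make that concrete, here is a minimal sketch of calling the 'morphemes' route from code. The base URL is a placeholder (links are stripped in this archive) and the name of the morpheme-list parameter is my shorthand, so please check the documentation page for the real form:

Code:
// Placeholder base URL - links are stripped in this archive.
const BASE = "https://example.org/api";

// Query page f1r in Takahashi's transcription with a user-configured
// morpheme grouping (parameter name assumed for illustration).
const params = new URLSearchParams({
  pageId: "f1r",
  transcriber: "H",
  morphemes: "cfh,ckh,cph,cth,eee,iii,ch,ee,ii,qo,sh",
});

// Node 18+ / browser fetch; one row per token, after morpheme grouping.
const response = await fetch(`${BASE}/morphemes?${params}`);
const data = await response.json();
console.log(data.tokens);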

My intention in sharing this is to enable transparency, repeatability and shareability of the experiments that people conduct on the text.

Regards,
Robin
It's very difficult for me personally.

And if I want to find a word that is composed of three EVA 'words', how do I do that?
Hi Ruby - thanks for giving it a try.

My intention is for people to use this tool when they want to present their analyses of the manuscript in a way that other people can replicate them. 

My experience with Voynich Manuscript research is that there are plenty of people who make claims of interpretation - even very basic ones - but fail to present a) their source data, and b) their methodology. 

I have a problem with people who do not present data and methodology, because I believe they are contributing to the bad reputation that surrounds the study of this manuscript.

We need to address this, and my contribution is to provide an opportunity for researchers to present their analysis in a transparent and repeatable way.


The API is based on Jorge Stolfi's interlinear file, which can easily be downloaded from his website. I present my parsing method on the documentation page. My intention is to share the parsing source code as an open-source effort so that it is clear how I handled it. Broadly, it follows the rules articulated within the file itself, plus I've added bits of interpretation from wider reading, e.g. Rene Zandbergen's website and so forth.

I accept the criticism that there is quite a high barrier to entry for using this tool. To use it, you will need to be comfortable with coding, and also comfortable with presenting that code.

I also accept the criticism that the source code for the parser and API is not currently available. I am totally happy to share, and will get round to putting it up on the internet very soon.

The samples on the website are intended to demonstrate how people can go about using the API. Therefore, my response to your comment that you want to combine words is that you should look into how you might code this and then contribute that methodology back to the research community so people can build upon your work.

Good luck with your research,
Robin
So your long message, Robin, was that you can't help me?

What a pity, I was hoping finally to find a tool that is easy to use and useful. Well, I have no choice; I'll learn coding.
Hi Ruby,

The development that Robin is speaking about is a software interface, not a tool that can be used directly by a human.
I am sorry that I did not yet take the time to look at your tool.

I fully agree about the importance of reproducible results, which requires that anyone, in any analysis, clearly states the data that has been used.

From your descriptions, it seems to be similar to a tool by Elias Schwerdtfeger:
[link]
and I wonder if you saw it.

I will certainly have a look, and report back.
Hi Robin,
Two questions: Do you have the list of transcriber codes? And which transcription is the most complete?
Hi Robin,
Another question if you will!

Is it possible to directly query for a single word using the Stolfi interlinear locator code?
I.e., I want to get the label at
<f57v.X.2>


So that's folio f57v, line X, word 2.

I would assume a call like
[link]

But I can't seem to narrow it down to the word. Removing unitCode gives me the complete list (from which I can then find my query programmatically, of course).
Hi Rene,

Yes, the tool does a very similar job to Elias Schwerdtfeger's Voynich Information Browser. I believe we've followed a very similar path to get to a very similar outcome. I doubt you will find anything new in the tool or the research I've presented. Effectively, it is just a new medium that aligns with common standards being used across the internet nowadays. The key difference is that this API creates a URL addressable reference to the source data for an analysis.

For example, to just get [link] by Takahashi, with inline comments removed, you would call:

[link]
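
(As the link above is stripped in this archive, here is a hypothetical reconstruction of that call; the base URL, the page, and the comment-stripping parameter name are all assumptions, not the API's documented form:)

Code:
// Hypothetical reconstruction: base URL, page and the comment-stripping
// parameter name are assumptions for illustration only.
const url = "https://example.org/api/tokens?" + new URLSearchParams({
  pageId: "f1r",       // the stripped link named the actual page
  transcriber: "H",    // Takahashi
  stripComments: "1",  // assumed name for 'remove inline comments'
});
console.log(url);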

Elias has made a great tool and I'm sure that many people have made good use of it. However, one thing that I am hoping to add in terms of functionality is simply to make the data consistently accessible to any web-based presentation of research. Of course, one might go to the VIB and make a query, download the result, then use that in an analysis and some presentation of that analysis - and make the source document available. The alternative with the API is simply to reference a URL; if anyone wants to know what the source was, one can point to that URL. The choice of chunking of word-parts - what I call morphemes - is a big part of the API and offers a significant point of difference from the VIB.

The API also delivers the result as JSON, which is one of the current dominant standards for online data interchange. Instead of using the format of, e.g.:

<f1r.P1.1;H> fachys.ykal.ar...

I am delivering the token information as, e.g.:

[
  {"pageId":"f1r","unitCode":"P1","lineNumber":"1","item":"fachys"},
  {"pageId":"f1r","unitCode":"P1","lineNumber":"1","item":"ykal"},
  {"pageId":"f1r","unitCode":"P1","lineNumber":"1","item":"ar"}
  ...
]

In order to parse <f1r.P1.1;H> fachys.ykal.ar... I need to pre-process it by removing the <...> locator and then splitting on '.', and so forth. By presenting the data as a JSON array, a Javascript programmer will easily see how to process it without these preliminary steps. In fact, most common programming languages have libraries for processing JSON data.
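
As a side-by-side sketch (with deliberately simplified rules, not the API's full parser):

Code:
// Old interlinear format: strip the <...> locator, then split on '.'.
const line = "<f1r.P1.1;H> fachys.ykal.ar";
const items = line.replace(/^<[^>]*>\s*/, "").split(".");
console.log(items); // ["fachys", "ykal", "ar"]

// JSON format: no pre-processing, just parse and map.
const json =
  '[{"pageId":"f1r","unitCode":"P1","lineNumber":"1","item":"fachys"}]';
const tokens: { item: string }[] = JSON.parse(json);
console.log(tokens.map(t => t.item)); // ["fachys"]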

A second thing is that when people present their findings online, the source is static. However, we know that outcomes can often differ on small and seemingly inconsequential details. For example, I attempted to replicate Stolfi's classic analysis in which he demonstrates a fit of the word-length distribution to a certain function. In his article he makes some thought-provoking comments about the language structure. But if you choose different assumptions for the experiment, you will find that the fit to the binomial function isn't actually so great. Look, I'm not trying to disparage the experiment - I like that it is objective and he clearly explains his steps. However, the outcome is different if you change the starting conditions - you can see what I am talking about here:

[link]

If you run the plot with the defaults I put in you will get a close fit to the Binom(9, k-1). The defaults are: cfh,ckh,cph,cth,ch,sh. If you check out Stolfi's original analysis, I believe he mentions these and that's why I selected them. I deliberately wanted to get the fit that he found.

Now, please run it again with this selection: cfh,ckh,cph,cth,ch,sh,qo,iin,in,ol,al. The extra 'morphemes' I've added to the list are 'qo', 'iin', 'in', 'ol' and 'al' - a selection of morphemes that are common as prefixes and suffixes. Note that the distribution is no longer such an awesome fit. Small choices in the parsing of the text can yield significantly different results.
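
For anyone wanting to reproduce the comparison locally, here is a minimal sketch. The p = 1/2 parameter is my reading of the fit, the sample is tiny, and in a real run the words would be fetched from the API and their lengths measured after morpheme grouping:

Code:
// Empirical word-length distribution vs. a shifted binomial Binom(9, k-1).
function lengthCounts(words: string[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const w of words) counts.set(w.length, (counts.get(w.length) ?? 0) + 1);
  return counts;
}

// P(X = k) for X ~ Binomial(n, p).
function binomialPmf(n: number, p: number, k: number): number {
  let c = 1;
  for (let i = 0; i < k; i++) c = (c * (n - i)) / (i + 1); // C(n, k)
  return c * Math.pow(p, k) * Math.pow(1 - p, n - k);
}

// Tiny sample for illustration; real runs would pull tokens from the API.
const words = ["fachys", "ykal", "ar", "otardaly", "ocfhor", "okear"];
const total = words.length;
for (const [len, count] of [...lengthCounts(words)].sort((a, b) => a[0] - b[0])) {
  const expected = binomialPmf(9, 0.5, len - 1); // assumed p = 1/2
  console.log(`len=${len}: observed=${(count / total).toFixed(3)}, Binom(9,k-1)=${expected.toFixed(3)}`);
}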

We can run it a few more times - try the original set of morphemes (i.e. cfh,ckh,cph,cth,ch,sh) but with Bio-B and Herbal-A. To do this you amend the query strings:

* Bio-B - transcriber=H&isWord=1&hasFiller=0&isAmbiguous=0&illustrationType=B&currierLanguage=B
* Herbal-A - transcriber=H&isWord=1&hasFiller=0&isAmbiguous=0&illustrationType=H&currierLanguage=A

And so on... 
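
In code, switching between those two corpora is just a matter of swapping the query string (placeholder base URL again, since links are stripped here):

Code:
// Shared filters from the examples above.
const common = { transcriber: "H", isWord: "1", hasFiller: "0", isAmbiguous: "0" };

const bioB = new URLSearchParams({ ...common, illustrationType: "B", currierLanguage: "B" });
const herbalA = new URLSearchParams({ ...common, illustrationType: "H", currierLanguage: "A" });

console.log(`https://example.org/api/tokens?${bioB}`);
console.log(`https://example.org/api/tokens?${herbalA}`);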

I think all the work you and others put in a few years ago is important - we need to 'carry the fire' if you take my meaning.


All the best,
Robin

(09-10-2016, 08:16 AM)davidjackson Wrote: Hi Robin,
Two questions: Do you have the list of transcriber codes? And which transcription is the most complete?

Hi David - 

The list of transcriber codes is available by this API call:

[link]

Which will give you:

{"values":["H","C","F","N","U","D","X","J","G","V","Z","R","K","Q","L","P","I","T"],"dailyRequestCounter":"23"}

This maps to the list in the interlinear:

# Transcriber codes

# -----------------
#
# [ The following transcriber codes were inherited from INTERLN.EVT: ]
#
#   C: Currier's transcription plus new additions from members of the
#      voynich list as found in the file voynich.now.
#   F: First study group's (Friedman's) transcription including various
#      items as found in the file FSG.NEW.
#   T: John Tiltman's transcription of some pages.
#   L: Don Latham's recent transcription of some pages.
#   R: Mike Roe's recent transcription of some pages.
#   K: Karl Kluge's transcription of some labels from Petersen's copies.
#   J: Jim Reed's transcription of some previously unreadable characters.
#   
# [ The following codes were added by J. Stolfi after 05 Nov 1997,
# in the unfolding of "[|]" groups:
#
#   D: second choice from [|] in "C" lines.
#   G: second choice from [|] in "F" lines, mostly from [1609|16xx].
#   I: second choice from [|] in "J" lines.
#   Q: second choice from [|] in "K" lines.
#   M: second choice from [|] in "L" lines. 
#   
# The following codes were assigned by J. Stolfi for use in 
# "new" transcriptions:
#
#   H: Takeshi Takahashi's full transcription (see f0.K).
#   N: Gabriel Landini.
#   U: Jorge Stolfi.
#   V: John Grove.
#   P: Father Th. Petersen (a few readings reported by K. Kluge).
#   X: Denis V. Mardle.
#   Z: Rene Zandbergen.
# ]

Quote: And which transcription is the most complete?

I am not sure - I haven't looked at this in as much detail as I should! The data in the API is simply a parsed version of the interlinear file, 'text16e6.txt'.

I use Takahashi's transcription all the time simply because Stolfi refers to it as 'full'.

Stolfi posted a 'majority-vote' file online which I am intending to roll into the tool. I haven't started an analysis of whether it can be included within the same data-set as the 'text16e6.txt' file.

I think this is a very important question.

Thanks,
Robin

(09-10-2016, 09:31 AM)davidjackson Wrote: Hi Robin,
Another question if you will!

Is it possible to directly query for a single word using the Stolfi interlinear locator code?
I.e., I want to get the label at
<f57v.X.2>


So that's folio f57v, line X, word 2.

I would assume a call like
[link]

But I can't seem to narrow it down to the word. Removing unitCode gives me the complete list (from which I can then find my query programmatically, of course).

Hi David,

I would use this call (for line number 3):
[link]

Which gives:

Code:
{
    "parameters": ["pageId=f57v", "lineNumber=3", "transcriber=H", "isWord=1"],
    "selectedColumns": ["pageId", "unitCode", "lineNumber", "item"],
    "tokens": [{
        "pageId": "f57v",
        "unitCode": "X",
        "lineNumber": "3",
        "item": "otardaly"
    }, {
        "pageId": "f57v",
        "unitCode": "Y",
        "lineNumber": "3",
        "item": "ocfhor"
    }, {
        "pageId": "f57v",
        "unitCode": "Y",
        "lineNumber": "3",
        "item": "okear"
    }],
    "dailyRequestCounter": "25"
}


This will give you all tokens on that line - you will have to pick out the 2nd token yourself.
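
A minimal sketch of that last step, using the field names from the response above:

Code:
// Shape of one token row in the response above.
type Token = { pageId: string; unitCode: string; lineNumber: string; item: string };

// The 'tokens' array from the JSON response shown above.
const tokens: Token[] = [
  { pageId: "f57v", unitCode: "X", lineNumber: "3", item: "otardaly" },
  { pageId: "f57v", unitCode: "Y", lineNumber: "3", item: "ocfhor" },
  { pageId: "f57v", unitCode: "Y", lineNumber: "3", item: "okear" },
];

// 0-based indexing: [1] is the 2nd token on the line.
console.log(tokens[1].item); // "ocfhor"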

Cheers,
Robin