(Today, 12:24 AM)ReneZ Wrote: Again, you are adding complications which Torsten never included in his descriptions.
Does this mean that you have already seen that the basic approach does not work?
The approach doesn't have to be basic. The VM is a complex medieval book, not the output of a short computer program. Again, I don't care what settings, limitations and simplifying assumptions Torsten Timm's "basic approach" algorithm and app have as I don't intend to copy any of them.
The improvements I propose aim to model (and minimize) the work of a human scribe on the pages. They have measurable consequences and, I hope, important ones for the statistics. Two things should be modeled: the selection of several nearby sources together, and the initialization of a page from other pages. A human would naturally prefer what is easier and faster; that is the principle of least effort. Also, the generation process should not be carried out sequentially: there is no reason to assume that pages were generated one by one (they could have been written in parallel), nor from the top to the bottom of each page, because we know that the VM's lines were not always written that way. These "complications" are all necessary. I don't know whether they will result in a better fit to the VM than the output of Torsten Timm's app; I haven't done the work yet. I hope they will: there is definitely room for improvement, and you should not reject the self-citation method because of the shortcomings of the "basic approach".
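To make the two proposed ingredients concrete, here is a toy Python sketch. It is not Torsten Timm's algorithm and not a finished model: the names `mutate` and `generate_page`, the alphabet, the three edit rules, and the parameters `run_length`, `window` and `p_mutate` are all assumptions chosen for illustration. It only shows (1) copying a short run of neighbouring source words in one go, and (2) starting a new page from an earlier page. It still fills a single page in order, so the non-sequential part is not modeled here.

```python
import random

# Hypothetical alphabet; any set of glyph codes would do for the sketch.
ALPHABET = "odyaiklrsechtqp"

def mutate(word, rng):
    """Return a copy of `word` with one small random change (assumed rules)."""
    i = rng.randrange(len(word))
    op = rng.choice(("substitute", "insert", "delete"))
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # Delete one glyph, but never produce an empty word.
    return word[:i] + word[i + 1:] if len(word) > 1 else word

def generate_page(source_pages, n_words, rng,
                  run_length=3, window=20, p_mutate=0.5):
    """Generate one page by copying short runs of adjacent words.

    source_pages: earlier pages (lists of words), used only while the
    new page is still empty.
    """
    page = []
    while len(page) < n_words:
        if page:
            # Least effort: look only at the last `window` words written.
            pool = page[-window:]
        else:
            # Initialize the page from some other, already written page.
            pool = rng.choice(source_pages)
        start = rng.randrange(len(pool))
        run = pool[start:start + run_length]  # neighbouring sources together
        for w in run:
            page.append(mutate(w, rng) if rng.random() < p_mutate else w)
    return page[:n_words]

# Illustrative usage with a made-up seed page.
rng = random.Random(0)
seed_page = ["daiin", "chedy", "qokeedy", "shol", "okaiin"]
first = generate_page([seed_page], 40, rng)
second = generate_page([seed_page, first], 40, rng)
print(" ".join(second))
```

Whether parameters like `run_length` or `window` move the statistics closer to the VM is exactly the open question; the sketch only shows that these "complications" are cheap to express and to vary.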
If the set of "seed" words was small and the rules didn't allow new words to be diverse enough on the first page (I don't know how many seed words and rules are needed, or how many generations are allowed together on the same page), the first page may have looked bad. If not, there is no problem at all. So even in the worst-case scenario, which is totally speculative since we have no idea whether it happened, we have the solution that Mauro outlined: everything evens out eventually in a multi-generation mix of generated words. We don't care how the VM started, because if the first page(s) looked very different from what we see now, they ended up in the trash. Pseudo-problem solved.