Tuesday, August 13, 2013

Language markup goes to Language mapping



The midterm evaluation at GSoC has come and gone, and I am proud to say that I passed. The passing rate is around 94%, so one might think it is not a big deal... Well, we are a bunch of dedicated and smart people :)

The midterm evaluation prompted us (mostly +Dirk Haun) to do a thorough review of the code, which revealed a few bugs that were easy to patch. It also revealed an infinite loop. When I sat down to fix it, I realized that the problem runs a bit deeper than an infinite loop. Once we sat down to think about it, the language markup (read here) revealed more and more problems.

The problems

Geeklog does some string shortening on some pages, e.g. "This text would be shorter" is rendered as "This text". This is what caused the infinite loop: the language markup algorithm relies on having matched pairs of _-start_ and _-end_ tags, and shortening can cut a string off before its _-end_ tag. The natural approach to fix this would be to find all the pairs of _-start_ and _-end_ and simply ignore any "string" which has another _-start_ before an end tag. That approach fails because Geeklog has nested strings. (When I write strings here I mean strings from $LANG variables.)
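
To illustrate (a made-up snippet, not Geeklog's actual shortening code), truncation can eat the closing tag:

    // a $LANG string wrapped in language markup
    $marked = '_-start_This text would be shorter_-end_';
    // Geeklog-style shortening (illustrative only)
    $short = substr($marked, 0, 17) . '...';
    // $short is now '_-start_This text...'; the _-end_ tag is gone,
    // so anything scanning for the matching _-end_ never finds it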

The second problem was that not all Internet users have JavaScript enabled. Before concluding that this is a small number of users, let's make one thing clear: I am thinking of web crawlers. They (as far as I know) do not execute JavaScript, and you don't want your page to be represented as
_-start_Another nifty Geeklog site_-end_.

The final problem was that in some cases element ids (on forms, for example) are set from LANG arrays. After the page is rendered, my JavaScript kicks in and the complete page is purged of all language markup, so the ids of the forms no longer match the ids the PHP script expects.
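
Roughly what that looks like (a hypothetical example; the array entry here is made up):

    // an element id built from a LANG string
    echo '<form id="' . $LANG_ADMIN['edit'] . '">...</form>';
    // with markup enabled the id renders as '_-start_Edit_-end_';
    // once the JavaScript strips the markup it becomes 'Edit',
    // which is no longer the id the PHP script is looking for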

Each of these problems could probably have been patched on its own, but they were bound to be reincarnated later on in some form.

The solution

+Dirk Haun and I bounced ideas left and right for a few days trying to figure out a new solution. At one point I even suggested an API (desperate times). However, a somewhat better (and crazier) idea came to mind and I have been on it ever since.
The solution is language_mapper.php. The logic behind it: when the plugin is installed, it traverses Geeklog's file tree and finds all the .php files (for now we are avoiding plugin folders). I am going with a recursive, depth-first approach for this, sketched below.
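
A minimal sketch of that traversal, assuming a hypothetical helper name (the real language_mapper.php may differ in the details):

    // recursively collect all .php files under $dir, depth first
    function find_php_files($dir, array &$found = array())
    {
        foreach (scandir($dir) as $entry) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            $path = $dir . '/' . $entry;
            if (is_dir($path)) {
                if (basename($path) === 'plugins') {
                    continue; // skipping plugin folders for now
                }
                find_php_files($path, $found); // go deeper first
            } elseif (substr($entry, -4) === '.php') {
                $found[] = $path;
            }
        }
        return $found;
    }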


After all the files have been found, they are "analyzed": the code compiles an array holding all the actual LANG array names and searches for them in each file's code. It saves the list of found array names as well as a list of included (required) files.
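
Something along the lines of this sketch (the function name and the include-matching regex are my illustration, not the plugin's actual code):

    // scan one file for $LANG usages and include/require targets
    function analyze_file($path, array $lang_names)
    {
        $code = file_get_contents($path);
        $used = array();
        foreach ($lang_names as $name) { // e.g. 'LANG01', 'LANG_ADMIN'
            if (strpos($code, '$' . $name) !== false) {
                $used[] = $name;
            }
        }
        // a rough pattern for include/require statements
        preg_match_all('/(?:include|require)(?:_once)?\s*\(?\s*[\'"]([^\'"]+)[\'"]/',
                       $code, $matches);
        return array('langs' => $used, 'includes' => $matches[1]);
    }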


All of this is saved to the database.
Later on, when a page is loaded, the plugin's JS file sends an AJAX request and gets the form HTML. Most of this code stayed the same; in fact, generally speaking, most of the code stayed the same. I listened to what smarter people than me said and used a certain level of abstraction in my code. It was a bit hard, or even unachievable, in some places, but I did it well enough that most of the code simply worked with the "new data" provided to it. This made me happy, as I don't have to redo all of it, and probably made +Dirk Haun happy as well, since we were not moved back to square one.

Why save the included files?

The included files (obviously) contribute to the script in some way; as far as I know, that contribution could be in the form of text or HTML code. So in order to really assemble the list of LANGs used on a page, I have to include the ones used on included pages.
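
In other words, resolving a page's full LANG list means following its includes recursively, something like this sketch (the data layout is assumed, not the plugin's actual database schema):

    // $map: file => array('langs' => ..., 'includes' => ...),
    // as produced by the analysis step and loaded from the database
    function langs_for($file, array $map, array &$seen = array())
    {
        if (isset($seen[$file])) {
            return array(); // guard against circular includes
        }
        $seen[$file] = true;
        $langs = $map[$file]['langs'];
        foreach ($map[$file]['includes'] as $inc) {
            if (isset($map[$inc])) {
                $langs = array_merge($langs, langs_for($inc, $map, $seen));
            }
        }
        return array_unique($langs);
    }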


The problem with this


The problem with this approach is that code is full of conditionals: if, switch, and so on. This means that not ALL the code will be executed ALL the time. So I end up with a list of 349 LANG elements, out of which only 149 have actually been displayed on the page. In order to keep the 'in context' translation usable, it is sort of necessary to remove this overhead. I have wrestled with it for the better part of last night, but all I got was a very slow JS function. My guess is that it is slow because of the nature of the strings: they are not fixed. They have variables in them, so I have to use regex matching. Another problem might be that most of the search terms are not present on the page, so the complete page has to be searched.
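
To give a sense of why the matching gets expensive, here is one illustrative way a variable string turns into a pattern (a hypothetical helper, written in PHP for brevity even though the slow function in question is JavaScript):

    // turn a sprintf-style LANG string into a regex pattern
    function lang_to_pattern($lang_string)
    {
        // escape literal characters, then let %s / %d match anything
        $quoted = preg_quote($lang_string, '/');
        return '/' . str_replace(array('%s', '%d'), '(.+?)', $quoted) . '/';
    }
    // e.g. 'Welcome back, %s' becomes '/Welcome back, (.+?)/', and every
    // such pattern has to be tried against the text of the whole page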

In conclusion

Although there is a problem with the approach, it was said "long ago" that this task has no "nice solution". I like how it behaves for the most part and will try to fix this "small" inconvenience.

Cheers