InvisibleXML Logo jωInvisibleXML Logo processor - HELP

An interactive SaxonJS/JavaScript workbench processor of Invisible XML - Workbench Version 1.4

Table of Contents

TODO:

Introduction

For the runtime files for the iXML processor and a sample jωiXML application see here.

This workbench runs entirely within the browser client, using SaxonJS as the top-level program and the jwiXML JavaScript library. There is no server-side processing, apart from initial delivery of necessary files.

You can input and edit an iXML grammar either ab initio or by loading from a file on your computer or one of the test-case or sample grammars from the iXML GitHub repository. The input string can also be edited or loaded from file or, in the case of a test-case or sample grammar, by selecting from one of a number of provided input strings, relevant to the grammar in question.

Local file selection is either by a conventional selector dialog or by drag-and-drop of a file onto the relevant text area. (Note that on Firefox security hurdles may preclude drag-and-drop.)

Editing grammars and inputs

iXML grammars are edited in the upper textarea, where usual keystokes are supported, but there is no 'syntax awareness' during input:

The 'format' button above the textarea will, for a valid grammar, 'pretty-print' by replacing the text with a canonical ixml rendering of the parsed grammar. This form will line up all the rules so all their names are right aligned and their definitions left-aligned. Strings should be enclosed in the quotation characters used in the original (doubling such characters within the string as necessary). For alternatives, the separator character used is the first separator character (i.e. ';' or '|') encountered whilst parsing that set of alternatives. If the serialisation of the definition of a rule will be longer than 50 characters, top-level alternatives will have a newline and appropriate indentation attached to their separators.

This means that, for instance, an original line of the form:

a: [L] | "s" ; '"', #a; bcdef ; bcdef ,"a", bcdef; bcdef, "b", bcdef.
bcdef: [N]|("1"; "2" | "3").

will be formatted to

     a: [L]| 
        "s"| 
        '"',#a| 
        bcdef| 
        bcdef,"a",bcdef| 
        bcdef,"b",bcdef. 
 bcdef: [N]| ("1";"2";"3"). 

Input strings to be parsed by that grammar are edited in the lower text-area:

The size of both of the textareas can be adjusted (at least on Chrome and Firefox) by the resizer 'chevron' at the lower right hand corner.

Parsing an input against a grammar

With the grammar and potential input string edited, clicking on the GO! button causes the following actions:

Parsing and compiling the grammar

The grammar defined by the text in the Grammar window is parsed and compiled as an iXML grammar to produce an internal object representing the compiled grammar. Assuming the grammar has valid iXML syntax, this is then displayed in the 'Grammar Details' section (which is normally hidden - just click on the bar to reveal or hide).

Here various projections of the grammar can be displayed either in XML format or an iXML textual serialisation, or a colour-highlighted iXML form. Either the original parsed grammar or the compilation (i.e. where the grammar has been reduced to a canonical form) thereof can be shown.:

If you want to copy the parsed XML-format grammar the 'select grammar' button will select the whole of the grammar XML, so a simple 'Copy' keystroke action can get it into the clipboard as text.

Parsing the input string against the grammar

Assuming the grammar has compiled and the input text string is not empty (or the 'Allow an empty string as input' option is checked), the text string is then parsed against the grammar, giving results in the Results section.

As with the parsed grammar, if you want to copy the resulting XML, the 'select result' button will select the whole of the result XML (and multiples if ambiguous or record-oriented processing was performed), so a simple 'Copy' keystroke action can get it into the clipboard as text.

Errors

There are three main categories of errors detected:

Grammar errors

Errors in the iXML grammar presented will be displayed under the 'Grammar Details' section. Such errors can be one of:

Input parsing errors

The parsing of an input string against a valid iXML grammar can fail for a number of reasons. Such errors are displayed in the 'Result' portion of the workbench, currently as, in line with the specification, an XML document with @ixml:state="failed" on the topmost element, such as:

<ixml xmlns:ixml="http://invisiblexml.org/NS" ixml:state="failed">
Failure at line 1 column 2
Given '/' (codepoint 47). Expecting one of:
   "+" {#8: sum: term, "+", term++"+".},
   "×" {#10: product: term, "×", operand.},
   [<=>≠≤≥] {#4: compare: ["<=>≠≤≥"].}
Input:
   a/b
    ^
</ixml>

where an unexpected (operator) character was encountered. In this case the processor attempts to identify what characters would have been admissable at this point in the parse, and in which rules (identified by line number and with original source) the parse failed. (This is currently not available for internally-generated rules, such as those for repetition constructs.)

In cases of high potential ambiguity, such as the grammar:

specification: "{", rule*, "}".
rule: definition*.
definition: id, "=", value.
id:[L].
value:[N].

when run with an input that can trigger such ambiguity, such as {} (possible solutions could include no rule or a potentially infinite sequence of rules each containing no definitions) the jωiXML processor can get into an infinite loop. Internally there is a limit of triggering 1000 productions on processing a character from the input string. If this limit is reached a failure will assumed:

<ixml xmlns:ixml="http://invisiblexml.org/NS" ixml:state="failed">
Probable looping processing character '{' @ line 1, column 1</ixml>

Serialisation errors

The input string may have parsed correctly, but it is still possible that the conversion (serialisation) to XML fails as the resulting tree would not be a valid XML document. In such cases the error is reported, again in the 'Result' section:

<ixml xmlns:ixml="http://invisiblexml.org/NS" ixml:state="failed">
An attribute node may not be the final parse result @input</ixml>

or

<ixml xmlns:ixml="http://invisiblexml.org/NS" ixml:state="failed">
Multiple nodes may not be the final parse result:<expression/>,@compare,<expression/></ixml>

Note that when 'Treat as records' is enabled, multiple document trees can be generated and will be serialised in sequence in the result display.

Actions

The following actions on grammars displayed in the grammar textbox are supported via simple buttons:

Action Effect
reformat Pretty-print the ixml grammar to a standard format, where all rule names are right-aligned, and non-simple (non-terminal and multiple terminals) top-level alternatives are placed on newlines
reduce Transform the given grammar into a 'reduced tree' version (which will have most single-child parent elements removed). See the section "Truncation of Deep Trees" of “Variations on an Invisible Theme" for more details

Options

The following options controlling grammar parsing, input string treatment and result display are supported via checkboxes. Where appropriate the corresponding option or invocation in the jwiXML.processor API is described:

Option Default Effect API equivalent
Show Advanced options Make the advanced (and experimental) options visible. Ordinarily these shouldn't be needed - the defaults maximise the comformity of the processor. See the table below for a description of these options.
Prohibit unreachable non-terminals When checked, all non-terminals in the grammar must be reachable through a reference path from the start (first) rule. compile() option 'unreachable'
Allow an empty string as input Normally if the input is an empty string, no attempt is made to parse - just the grammar is processed and displayed. Checking this allows processing of an empty string as input, which is probably only needed for certain test cases.
Treat as records When checked, the input is assumed to be a sequence of records separated by character sequences which match a given regular expression (for which '\n' is the default). The separator can be edited in the displayed text input, when this option is selected.
The result is a sequence of documents, each serialised in the output result area as children of the 'parent' element, which can be edited
For repetitively structured data where the repetition separators do not appear in the data 'records' this technique can be vastly more efficient than describing the repetition/separation in the iXML grammar itself.
parseRecords() 3rd argument $separator
Show only one ambiguous solution When the parse is ambiguous, with multiple possible solutions, this forces only one to be returned, which will still be marked as ambiguous. parse() option 'justOne'
Indent result When checked, the results will be displayed as a serialization of the XML tree with indentation applied. This means that whitespace-only text nodes may be altered or in some cases deleted. If your application requires strict whitespace preservation, uncheck this option.
Indent with a single space per level When checked, the results, if indented, will be displayed with a single additional space of indentation per tree-level depth (as opposed to the default 3.) This is intended to make very deep trees display 'narrower' in the serialisation (and avoid really wide scrolling displays).
Advanced Option Default Effect API equivalent
Permit missing non-terminals When checked, missing non-terminals may be referenced in the grammar (e.g. for experimentation in grammar combination). Using this during input parsing will lead to unpredictable results - usually some sort of crash. compile() option 'missing'
Tovey-Walsh rewrites When checked, f+ constructs are rewritten as f+ => f-plus. f-plus: f, f-plus| (). rather than the f-plus: f, f*. rewrite given in the spec. This is currently the default, as it seems to perform significantly quicker. compile() option 'twRewrites'
Show Parser States Displays the internal state transitions of the Earley parser operating on the input. This is NOT recommended for use with large grammars and inputs as memory overflows can be encountered.
Show all processed marks If checked, directive marks for deletion ('-'), attribute ('@') or rename aliasing ('>') serialisation of non-terminals, or deletion ('-') or insertion ('+') serialisation of quoted strings are not honoured but rather placed on the full parse tree output either as an @ixml:mark or @ixml:alias attribute or an ixml:insert or ixml:delete element.
This ony applies to marks on the original grammar and not to artefactual marks generated in compiling the grammar, such as those used for generated non-terminals implementing optionality or repetition.
parse() option 'suppressMarks''
Support iXML version 1.1 Support additional features of iXML version 1.1 (e.g. renaming)
Keep multi-character strings Multi-character terminal strings (corresponding to the quoted production of the iXML spec.) are expanded during compilation to a sequence of single (quoted) character strings, to correspond to a character-by-character processing of the Earley parser. If this option is checked such strings are retained as multiple characters and the Earley parser 'jumps-forward' to write a new state several character positions forward in its state records. compile() option 'longStrings'
Use regular expression matches Uses regular expression matching for quoted strings, inclusions and exclusions and in some cases repetition and optional forms of the same. This will only work properly when the ranges of characters matched by the following term in sequence can be guaranteed disjoint with that of the current term. Automatic determination of safe conditions to do this is a type-analysis research issue. compile() option 'regEx'
Max state loops 1000 Set the limit on the number of state-generating processing loops per input character during the parse. If this limit is exceeded, the parse will terminate. parse() option 'loopLimit'
Parse timeout/ms: 4000 Set the time limit (in milliseconds) for the complete parse. At the completion of each character of input this timeout is checked - if exceeded, the parse will terminate. Due to the single-thread nature of JavaScript, this is the only way, without significant overhaul of the entire parser architecture, to limit runaway parsing in cases of very high ambiguity. parse() option 'parseTimeout'

Loading grammars and inputs from external sources

Both grammar and input texts can be read from local filestore by using the appropriate 'Choose file' (or 'Browse') button, which permits a file to be read and its text loaded into the textarea. The name of the file loaded is displayed next to the file chooser. Files can also be 'drag-and-dropped' onto the textarea, though in Firefox security settings will probably have to be altered (it seems to work fine in Chrome).

Grammars can also be loaded from web-repositories, in particular from the InvisibleXML test-suites or sample grammars using the Grammar 'Test/sample' dropdown:

A browsable catalog of the test suite is also available.

When there are sample inputs available for one of these test or sample grammars, the 'Test/Sample' dropdown above the Input textarea will be populated.

Grammars in XML

Note that some of these test cases provide the iXML grammar in its XML serialisation form. The workbench recognises such a situation and will show and use that form, but editing in the textarea under these circumstances will have no effect on the grammar being used in parsing. Files containing iXML grammars serialised as XML loaded by other means (file selection, drag-and-drop) will be (currently) treated as simple text and currently will therefore fail to be parsed. (The jwlProcessor.xsl library contains a jwl:parseXML() function that will accept XML-serialised grammars)

Notes on the XPath/XQuery 4 grammars

I've been involved in the QT4 Community Group since its start, and have been using the grammars for XPath and XQuery as large test examples for this iXML processor. These grammars are also available to support experimentation for those involved in that community to perhaps handle 'what-if' experiments on the grammar.

The 'Test/Sample' dropdown contains (at the top) iXML grammars for the 'near-current' draft 4.0 versions of the Xpath and XQuery grammars. These are derived programmatically (with a modicum of automated patching) directly from the grammar definitions used to generate the specification grammars XPath EBNF and XQuery EBNF. (See my Balisage paper for details of how these are constructed.)

The grammars are presented in three different versions of iXML:

For both these grammars, there is also a 'reduced tree' version which truncates the very deep trees that normally result from parsing. A suitable moderately large sample expression to test parsing is included for each grammar.

Each of the (iXML) grammars contains a date-stamp in a comment near the top which can be used to verify whether it is 'up to date', by comparison with the history of the file: https://github.com/qt4cg/qtspecs/blob/master/specifications/grammar-40/xpath-grammar.xml

Whilst the grammars have been tested across the 35k expressions of the QT4 test sets, with only some 50 failures, there are known areas where the grammar fails or produces ambiguity. These notes are intended to explain.

Failures

Ambiguities

The grammars, lacking a tokenizer and the ability to look-ahead without consumption, retain a few ambiguities that are inhernet in the basic EBNF grammars (i.e. without notes)

The Set-subtraction operator

The grammars use an experimental 'set-subtraction' operator ¬ to indicate certain reserved keywords such as limitations on function names:

         FunctionCall: EQName ¬ reservedFunctionNames, ArgumentList.
reservedFunctionNames: types |
                       commands.
                types: "attribute"; "comment"; ... .
             commands: "fn"; "function"; "if"; "switch"; "typeswitch".         

Removal of the operator and its RH argument will permit the grammar to run on a 1.0 conformant iXML processor, though now comment() can be ambiguous. The operator is currently used on the productions for FunctionCall, NamedFunctionRef, DirCommentContents, CompPIConstructor, ElementContentChar, QuotAttrContentChar and AposAttrContentChar and indirectly from CompElemConstructor and CompAttrConstructor (I can arrange for such grammars to be generated - let me know if you wish so.)

Change history

Version 1.1 - 2022nov01

Version 1.3 - 2023may27

Version 1.4 - 2024feb13

Version 1.5 - 2024sep23