4. Extending Syntax
Because of Syntax’s modular design, it is pretty straightforward to create your own syntax modules. The hardest part is doing the actual tokenizing of your chosen syntax.
You can use the existing syntax modules to guide your own implementation if you wish, but note that each module must solve its own unique problems, since every syntax differs.
Your new syntax implementation should extend
Syntax::Tokenizer—this sets up a rich domain-specific language for scanning and tokenizing.
Then, all you need to implement is the
#step method, which should take no parameters. Each invocation of
#step should extract at least one token, but may extract as many as you need it to. (Fewer is generally better, though.)
Additionally, you may also implement
#setup, to perform any initialization that should occur when tokenizing begins. Similarly,
#teardown may be implemented to do any cleanup that is needed.
Within a tokenizer, you have access to a rich set of methods for scanning the text. These methods correspond to the methods of the StringScanner class (e.g., scan, check, eos?, and so forth). Additionally, subgroups of the most recent regexp match (from scan, etc.) can be obtained via subgroup, which takes as a parameter the number of the group you want to query.
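Because the scanning methods mirror Ruby's standard StringScanner API, the subgroup mechanism behaves like StringScanner's own capture access. As a rough stdlib-only illustration (using StringScanner directly rather than a tokenizer, where subgroup(1) plays the same role as scanner[1] here):

require 'strscan'

scanner = StringScanner.new("width=640")

# Scan with a regexp containing capture groups...
scanner.scan(/(\w+)=(\d+)/)

# ...then pull out individual subgroups of the most recent match.
key   = scanner[1]   # => "width"
value = scanner[2]   # => "640"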
Tokenizing proceeds as follows:
- Identify a token (using scan, etc.).
- Start a new token group (using #start_group, passing the symbol for the group and optionally any text you want to seed the group with).
- Append text to the current group, either with additional calls to #start_group using the same group, or with #append (which just takes the text to append to the current group).
- Instead of #start_group, you can also use #start_region, which begins a new region for the given group, and #end_region, which closes the region.
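To make that cycle concrete without pulling in the library itself, here is a rough stdlib-only sketch of the identify/start-group/append flow, using StringScanner in place of the tokenizer DSL. The tokens array and the start_group lambda are illustrative stand-ins, not part of the Syntax API:

require 'strscan'

# Hypothetical stand-in for the tokenizer's group machinery: starting
# a group with the same symbol as the current group appends to it.
tokens = []
start_group = lambda do |group, text|
  if tokens.last && tokens.last[0] == group
    tokens.last[1] << text      # append to the current group
  else
    tokens << [group, text]     # start a new group
  end
end

scanner = StringScanner.new("hello 15 worlds!")

# Each pass of the loop corresponds to one call to #step:
# identify a token, then start (or extend) a group for it.
until scanner.eos?
  if digits = scanner.scan(/\d+/)
    start_group.call(:digits, digits)
  elsif words = scanner.scan(/\w+/)
    start_group.call(:words, words)
  else
    start_group.call(:normal, scanner.scan(/./))
  end
end

# tokens => [[:words, "hello"], [:normal, " "], [:digits, "15"],
#            [:normal, " "], [:words, "worlds"], [:normal, "!"]]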
Here is an example of a very simple tokenizer that simply extracts words and numbers from the text:
require 'syntax'

class SimpleTokenizer < Syntax::Tokenizer
  def step
    if digits = scan(/\d+/)
      start_group :digits, digits
    elsif words = scan(/\w+/)
      start_group :words, words
    else
      start_group :normal, scan(/./)
    end
  end
end
Once you’ve written your new syntax module, you need to register it with the Syntax library so that it can be found and used by the framework. To do this, just add it to the Syntax::SYNTAX hash:
require 'syntax'

class SimpleTokenizer < Syntax::Tokenizer
  ...
end

Syntax::SYNTAX['simple'] = SimpleTokenizer
That’s it! Once you’ve done that, you can now use your syntax just by requiring the file that defines it, and then using the standard Syntax framework methods:
require 'simple-tokenizer'
require 'syntax/convertors/html'

convertor = Syntax::Convertors::HTML.for_syntax "simple"
puts convertor.convert( "hello 15 worlds!" )