Lexer Ideas
This page is essentially a scratchpad for my ideas regarding the DCGen lexer.
Token Type Definitions
- Literal: A single-quoted string obtained from the grammar rules themselves.
- Typed: A token specified by @foo, @alnum, etc.
- Fixed-length: A token of a fixed length, though not necessarily a literal.
- Variable-length: A token that could be of any length.
- Static: A token that can be entirely lexed before parsing.
- Dynamic: A token that must be lexed at parse time because it could have multiple semantic interpretations and/or could overlap with tokens of other types.
Finding lexer literals in the parse rules
The parse rules will enclose lexer literals in single quotes (a la 'literal'). When a lexer rule is desired, its name will be enclosed in single quotes and prefixed with '@' (a la '@lexerRule').
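As a rough illustration of the quoting convention above, here is a minimal Python sketch that scans one parse-rule body for quoted items and separates plain literals from '@' rule references. The name find_lexer_items and both regular expressions are hypothetical and not part of DCGen. The sketch assumes that a rule reference is '@' followed by an identifier, and that a backslash escapes the next character inside the quotes.

```python
import re

# A single-quoted item; a backslash escapes the following character (assumption).
QUOTED = re.compile(r"'((?:\\.|[^'\\])*)'")
# Assumption: a rule reference is '@' followed by an identifier.
RULE_REF = re.compile(r"@[A-Za-z_][A-Za-z0-9_]*\Z")

def find_lexer_items(rule_body):
    """Return (literals, rule_refs) found in a rule body, in order, without duplicates."""
    literals, rule_refs = [], []
    for match in QUOTED.finditer(rule_body):
        text = match.group(1)
        if RULE_REF.match(text):
            name = text[1:]  # drop the leading '@'
            if name not in rule_refs:
                rule_refs.append(name)
        elif text not in literals:
            literals.append(text)
    return literals, rule_refs

literals, rule_refs = find_lexer_items("'if' expr 'then' stmt '@id' ';' 'if'")
print(literals)   # ['if', 'then', ';']
print(rule_refs)  # ['id']
```

Under this convention a literal that itself begins with '@' and looks like an identifier cannot be told apart from a rule reference, so the real lexer may need an escape for that case.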
Output
The lexer will take a character stream as input and output a list of tokens. Each token will have a label attached, so the appearance in the token list will be [label - token, ...]; this label will either be the literal itself or the literal's rule name. So:
'literal' ===> [literal - token]
'@lexerRule' ===> [lexerRule - token]
When the parser attempts to match a literal, it will match against the label, not the token itself.
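The labelled token list and the match-by-label behaviour could look roughly like the following Python sketch. The Token type and the match_label helper are hypothetical names chosen for illustration; the actual DCGen representation may differ.

```python
from typing import List, NamedTuple, Optional

class Token(NamedTuple):
    label: str  # the literal itself, or the name of the lexer rule
    text: str   # the characters actually consumed from the input

def match_label(tokens: List[Token], pos: int, expected_label: str) -> Optional[int]:
    """Return the next position if the token at pos carries expected_label, else None."""
    if pos < len(tokens) and tokens[pos].label == expected_label:
        return pos + 1
    return None

# A literal is labelled with itself; a rule-based token with its rule name.
tokens = [Token("if", "if"), Token("id", "counter"), Token("then", "then")]

print(match_label(tokens, 1, "id"))       # 2
print(match_label(tokens, 1, "counter"))  # None
```

Matching on the label means the parser never has to inspect the token text; the text is only needed later, for example by semantic rules.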
Built-in rules
The following rules will be built in:
- @alnum, @alpha, @digit, etc.: Their equivalent values in SWI Prolog's char_type/2 predicate.
- @id: Basic identifier. Equivalent to @csymf (@csym)*
- @int: Integer type. Equivalent to (@digit)+
- @float: Floating point type. Equivalent to ('-' | '+')? @digit+ '.' @digit+. Needs to be able to handle exponent notation (e.g., 1.5e10), too.
- @empty: The epsilon / empty production. Matches the empty string, consuming no input. Perhaps it should have a hook for the FIRST and FOLLOW sets, etc.?
- @squote: A single-quoted string that handles escaped quotes. Does not allow newlines in the middle of the string (may change). 'example'
- Equivalent to @squote --> '\'' (not('\n') | '\\\'')* first('\''). (may need to change syntax to handle the escape, but this should work fine for now).
- Does not convert escape sequences (such as '\n'): they appear in the token as the individual characters ..., '\', 'n', .... However, they can be converted (e.g., '\n' to a newline) using the built-in semantic rule $convEscapedChars().
- @dquote: A double-quoted string that handles escaped quotes. Does not allow newlines in the middle of the string (may change). "example"
- Equivalent to @dquote --> '"' (not('\n') | '\\"')* first('"')..
- See note about escaped characters under @squote, above.
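To make the built-in rules above more concrete, the following Python sketch approximates a few of them with regular expressions. It is an illustration, not DCGen's implementation. The table BUILTIN_PATTERNS, the helper matches_rule, and the function conv_escaped_chars (a stand-in for what $convEscapedChars() might do) are all hypothetical. The character classes are ASCII-only approximations of char_type/2, @float deliberately omits the exponent, and the set of escapes converted is an assumption.

```python
import re

# ASCII-only approximations of the built-in rules (assumption).
BUILTIN_PATTERNS = {
    "id": r"[A-Za-z_][A-Za-z0-9_]*",   # @csymf (@csym)*
    "int": r"[0-9]+",                  # (@digit)+
    "float": r"[-+]?[0-9]+\.[0-9]+",   # no exponent yet, as in the notes
    "squote": r"'(?:\\'|[^'\n])*'",    # escaped quotes allowed, no newlines
    "dquote": r'"(?:\\"|[^"\n])*"',
}

def matches_rule(rule_name, text):
    """True if the whole of text matches the named built-in rule."""
    return re.fullmatch(BUILTIN_PATTERNS[rule_name], text) is not None

# Assumed escape table; unknown escapes are left untouched.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}

def conv_escaped_chars(text):
    """Replace backslash escapes such as \\n with the characters they denote."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(0)), text)

print(matches_rule("id", "_foo1"))          # True
print(matches_rule("float", "-3.14"))       # True
print(matches_rule("squote", r"'it\'s'"))   # True
print(matches_rule("squote", "'a\nb'"))     # False
print(conv_escaped_chars(r"a\tb") == "a\tb")  # True
```

Keeping the conversion separate from matching mirrors the notes: the lexer returns the raw characters, and the grammar author opts in to conversion through a semantic rule.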
Rule Output
Other ideas
Rule aliases
Sometimes, it is more declarative to use a parser-specific name for a token rule, rather than a built-in rule (e.g., @lit instead of @squote). This can be done simply by creating the named rule and having the desired target be the only token in the body of the rule. For example, @lit --> @squote.. Although nothing more needs to be done in this case, we can make a slight optimization by substituting @squote for each occurrence of @lit. This can be done for any token rule whose body is a single literal or a single token rule reference. Since we're going for readable code, however, this substitution should be opt-in (not done by default), so that the generated code reflects the user's grammar more closely.
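The alias substitution described above might be sketched in Python as follows. The rule representation (a dict from rule name to a list of body items), the function name resolve_aliases, and the inline_aliases flag are all assumptions made for illustration. The sketch follows alias chains but does not diagnose cyclic aliases.

```python
def resolve_aliases(rules, inline_aliases=False):
    """Optionally replace references to alias rules with their targets.

    rules maps a rule name to its body, a list of items that are either
    '@name' references or quoted literals (hypothetical representation).
    """
    if not inline_aliases:
        return dict(rules)  # opt-in: by default the grammar is left as written

    # An alias is any rule whose body is exactly one item.
    aliases = {"@" + name: body[0] for name, body in rules.items() if len(body) == 1}

    def target(item):
        seen = set()
        # Follow chains such as @a --> @b, @b --> @squote; stop on a cycle.
        while item in aliases and item not in seen:
            seen.add(item)
            item = aliases[item]
        return item

    # Drop the alias rules themselves and rewrite the remaining bodies.
    return {
        name: [target(item) for item in body]
        for name, body in rules.items()
        if "@" + name not in aliases
    }

rules = {"lit": ["@squote"], "pair": ["@lit", "':'", "@lit"]}
print(resolve_aliases(rules))
# {'lit': ['@squote'], 'pair': ['@lit', "':'", '@lit']}
print(resolve_aliases(rules, inline_aliases=True))
# {'pair': ['@squote', "':'", '@squote']}
```

Leaving the flag off by default matches the stated goal that generated code should stay close to the user's grammar.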
