Advanced Computing Environment
Hosted by SourceForge
brix-os project page

Previous: Basics ----- Up: Contents ----- Next: Parsing and Rewrite Patterns

Lexical Syntax

Notes
Features that need to be added to the page or ideas that haven't been thought out.
  • The lexer tracks whitespace to the left of every token. Compound accessors use this to enforce the no-whitespace requirement between the object and its parameter list. The dot operator likewise requires no whitespace between the object and the slot or method.
  • Are there any Unicode characters that should be in the symbol set?
  • The lexer could convert blocks, lists and arrays that contain only operator characters into an operator token wrapped in curly brackets, parentheses or square brackets. This would allow some more distinctive operator names, but would they be useful?
    	{:} -> "{:}" token
    	(+) -> "(+)" token
    	[-] -> "[-]" token
    The operator characters should all have the same shift state as the enclosing brackets or parentheses to make the operator easier to type. This would not be required, but it is highly suggested.
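
The whitespace tracking in the first note could be sketched roughly as follows. This is Python rather than the project's own code; the Token class, the simplified token pattern and the is_attached helper are all hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    space_before: bool  # True when whitespace sits directly to the left


# Hypothetical, simplified pattern: optional whitespace, then a word
# or any single non-space character.
_TOKEN = re.compile(r"(\s*)(\w+|\S)")


def lex(source):
    """Return tokens that each remember whether whitespace preceded them."""
    tokens = []
    for match in _TOKEN.finditer(source):
        whitespace, text = match.groups()
        tokens.append(Token(text, space_before=bool(whitespace)))
    return tokens


def is_attached(token):
    """True when an opening bracket or a dot touches the token to its left."""
    return token.text in ("(", "[", "{", ".") and not token.space_before
```

With this representation a later pass can tell f(x) from f (x) by looking at the second token alone, without rescanning the source text.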

White Space
White space consists of spaces, tabs, and newlines.

Single-line Comments
Single-line comments begin with a double slash and run to the end of the line. The space between the double slash and the comment text is optional.

	// single-line comment to end-of-line

Multi-line Comments
Multi-line comments can span several lines and are useful for commenting out a section of code or for large descriptive comments. No space is required before or after the comment delimiters.

	// multi-line comment
	/* multi-line
	comment */

	// inside expressions
	a = /*comment*/ 1;

	// nested comments
	/* foo
	  /* bar */
	  // baz
	*/
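
One way a lexer might skip nested comments is with a depth counter. The sketch below is in Python, not the project's own code. The function name skip_block_comment is hypothetical, and it assumes that every /* must be balanced by a matching */.

```python
def skip_block_comment(source, start):
    """Return the index just past the block comment that opens at `start`.

    Assumes nested /* ... */ pairs must balance. Raises ValueError when
    the comment is never closed.
    """
    if not source.startswith("/*", start):
        raise ValueError("no block comment at this position")
    depth = 0
    i = start
    while i < len(source):
        if source.startswith("/*", i):
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif source.startswith("*/", i):
            depth -= 1          # leaving the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")
```

A // sequence inside the block needs no special handling here, because everything up to the balancing */ is skipped.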

Blocks, Lists and Arrays
Tokens are stored one after another in a single list, but tokens inside {}, () or [] are stored in sub-lists. Blocks use curly brackets, lists use parentheses, and arrays use square brackets.
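
The nesting described above can be illustrated with a small stack-based grouping pass. This Python sketch is an assumption about one possible representation, not the project's implementation: the group function is hypothetical, and keeping the opening bracket as the first element of each sub-list is a choice made here so that blocks, lists and arrays stay distinguishable.

```python
PAIRS = {"{": "}", "(": ")", "[": "]"}
CLOSERS = set(PAIRS.values())


def group(tokens):
    """Nest a flat token sequence into sub-lists at each bracket pair."""
    root = []
    stack = [root]   # stack[-1] is the innermost list still open
    expected = []    # closing brackets still owed, innermost last
    for tok in tokens:
        if tok in PAIRS:
            sub = [tok]              # sub-list starts with its opener
            stack[-1].append(sub)
            stack.append(sub)
            expected.append(PAIRS[tok])
        elif tok in CLOSERS:
            if not expected or expected.pop() != tok:
                raise SyntaxError(f"unexpected {tok!r}")
            stack.pop()
        else:
            stack[-1].append(tok)
    if expected:
        raise SyntaxError(f"missing {expected[-1]!r}")
    return root
```

For example, the tokens f ( a , [ 1 ] ) ; come back as one top-level list holding f, a parenthesized sub-list that itself holds a square-bracket sub-list, and the semicolon.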

Semicolons and Commas
Each semicolon and comma is stored as its own token, which gives extensions flexibility in defining its meaning.

Strings
A string is a sequence of UTF-8 characters between double quotes. The \" sequence is added to the string and does not terminate it. Strings are passed to the evaluator in raw form, where escape sequences are then converted to character codes. A \R sequence at the beginning of the string keeps the string in raw form after the tag is stripped off. See the ?? Strings page for all escape sequences.

	"string" -> string
	"\"\x61; string\"" -> "a string"
	"\R\"\x61; string\"" -> \"\x61; string\"

Characters
A character is a sequence of UTF-8 characters between single quotes. The \' sequence is added to the character and does not terminate it. The character token is no different from the string token, and extensions may alter how either is translated. The default evaluator processes the sequence and returns a four-byte Unicode literal unless it is cast to another character set. See the ?? Characters page for all escape sequences.

	'\''	// '
	'a'	// a
	'\x61'	// a
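
Since string and character tokens are lexed the same way, a single scanner can cover both. The following Python sketch is illustrative only: scan_quoted is a hypothetical name, the body is returned in raw form with escape sequences left for the evaluator, and treating \R as a string-only tag is an assumption drawn from the string description above.

```python
def scan_quoted(source, start):
    """Scan a string or character literal that opens at `start`.

    Returns (body, is_raw, end): `body` is the text between the quotes with
    escape sequences left unconverted, `is_raw` is True when a string body
    began with the \\R tag (the tag itself is stripped), and `end` is the
    index just past the closing quote.
    """
    quote = source[start]
    if quote not in ('"', "'"):
        raise ValueError("expected a quote character")
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2      # keep the escaped character, so \" or \' stays in the body
        elif ch == quote:
            body = source[start + 1:i]
            is_raw = quote == '"' and body.startswith("\\R")
            if is_raw:
                body = body[2:]
            return body, is_raw, i + 1
        else:
            i += 1
    raise ValueError("unterminated literal")
```

Converting the escape sequences is deliberately left out, because the text assigns that step to the evaluator rather than the lexer.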

Numbers
Numbers cannot begin with an underscore but may contain underscores to aid readability; 0xFFFF_FFFF is a valid number. Dot tuples contain a sequence of unsigned elements and are useful for version numbers and IP addresses. A two-element sequence defaults to a float but automatically converts to a dot tuple when needed.

	unsigned binary = [0-1][0-1_]*[bB]

	unsigned octal = 0[0-7_]+

	unsigned hex = (0[xX][0-9a-fA-F][0-9a-fA-F_]*|[0-9][0-9a-fA-F_]*[hH])

	signed decimal = [0-9][0-9_]*

	float = [0-9][0-9_]*([.][0-9_]+([eE][+-]?[0-9][0-9_]*)?|[eE][+-]?[0-9][0-9_]*)

	dot tuple = [0-9]+[.][0-9]+([.][0-9]+)*
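
The patterns above can be tried out directly with Python's re module. In this sketch the function classify_number and the ordering of the table are assumptions: octal is tested before decimal, and float before dot tuple so that a two-element sequence defaults to a float as described.

```python
import re

# Patterns copied from the definitions above. The order is an assumption
# that resolves overlaps (octal before decimal, float before dot tuple).
NUMBER_PATTERNS = [
    ("binary", r"[0-1][0-1_]*[bB]"),
    ("hex", r"0[xX][0-9a-fA-F][0-9a-fA-F_]*|[0-9][0-9a-fA-F_]*[hH]"),
    ("octal", r"0[0-7_]+"),
    ("decimal", r"[0-9][0-9_]*"),
    ("float", r"[0-9][0-9_]*([.][0-9_]+([eE][+-]?[0-9][0-9_]*)?|[eE][+-]?[0-9][0-9_]*)"),
    ("dot tuple", r"[0-9]+[.][0-9]+([.][0-9]+)*"),
]


def classify_number(text):
    """Return the name of the first pattern matching all of `text`, else None."""
    for kind, pattern in NUMBER_PATTERNS:
        if re.fullmatch(pattern, text):
            return kind
    return None
```

Using a full match means a lexeme such as 1.2.3 falls through the float pattern and is reported as a dot tuple, while _1 matches nothing because no pattern allows a leading underscore.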

Identifiers
There are two sets of identifiers: the symbol set and the alphanumeric set. The symbol set consists of +*%<>~&|^$!=@:#/\.- and a symbol splits on any character not in the set. A /* sequence splits the symbol and begins a block comment, and a */ sequence generates an error. The alphanumeric set is any sequence of characters (including underscore and question mark) other than whitespace, ()[]{};,"'` or the characters of the symbol set.
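
The split between the two identifier sets can be sketched as a single pass over the characters. This Python example is a simplification under stated assumptions: split_identifiers is a hypothetical name, whitespace and the delimiter characters only separate runs, and comments, strings and numbers are not handled.

```python
SYMBOL_CHARS = set("+*%<>~&|^$!=@:#/\\.-")
DELIMITERS = set("()[]{};,\"'`")


def split_identifiers(text):
    """Split `text` into (kind, lexeme) runs of symbol or alphanumeric characters."""
    runs = []
    current, kind = "", None
    for ch in text:
        if ch.isspace() or ch in DELIMITERS:
            new_kind = None          # separators belong to no identifier
        elif ch in SYMBOL_CHARS:
            new_kind = "symbol"
        else:
            new_kind = "alnum"
        if new_kind != kind and current:
            runs.append((kind, current))   # the set changed, so the run ends
            current = ""
        kind = new_kind
        if kind is not None:
            current += ch
    if current:
        runs.append((kind, current))
    return runs
```

For the input is_empty? := a<=b this yields the alphanumeric run is_empty?, the symbol :=, then a, <= and b, which shows that no whitespace is needed between a symbol and an alphanumeric identifier.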

Flags
Flags begin with a grave accent (`) and contain alphanumeric characters and hyphens. The language uses them as context-oriented symbols and binds them to the first non-flag token to their right unless an extension consumes them.

	`const pi := 3.14 // constant
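
The binding rule for flags can be sketched as a pass over an already-lexed token list. This is a Python illustration with several assumptions: bind_flags is a hypothetical name, flag names are restricted to ASCII letters, digits and hyphens, no extension consumes any flag, and trailing flags with nothing to bind to are treated as an error.

```python
import re

# Assumed flag shape: a grave accent followed by ASCII letters, digits or hyphens.
FLAG = re.compile(r"`[A-Za-z0-9-]+")


def bind_flags(tokens):
    """Attach each run of flags to the first non-flag token to its right.

    Returns (token, flags) pairs, where `flags` lists the flag names
    without the leading grave accent.
    """
    bound = []
    pending = []
    for tok in tokens:
        if FLAG.fullmatch(tok):
            pending.append(tok[1:])      # hold the flag until a target appears
        else:
            bound.append((tok, pending))
            pending = []
    if pending:
        raise SyntaxError("flags with no token to bind to")
    return bound
```

Applied to the example above, the const flag ends up attached to pi, and the := and 3.14 tokens carry no flags.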
