Lossless parsing of Julia code with precise source mapping
Production quality error recovery, reporting and unit testing
Parser structure as similar as possible to Julia's flisp-based parser
Speedy enough for interactive editing
"Compilation as an API" to support all sorts of tooling
Grow to encompass the rest of the compiler frontend: macro expansion,
desugaring and other lowering steps.
Once mature, replace Julia's flisp-based reference frontend in Core
Design Opinions
Parser implementation should be independent of tree data structures. So
we have the ParseStream interface.
Tree data structures should be layered to balance losslessness with
abstraction and generality. So we have SyntaxNode (an AST) layered on top
of GreenNode (a lossless parse tree). We might need other tree types later.
Fancy parser generators still seem marginal for production compilers. We use
a boring but flexible recursive descent parser.
Status
The library is in a pre-0.1 stage, but it already parses all of Base correctly;
only a handful of failures remain when parsing the Base tests and the standard
library.
The tree data structures should be somewhat usable but will evolve as we try
out various use cases.
Examples
Here's what parsing of a small piece of code currently looks like in various
forms. We'll use the parseall convenience function to demonstrate, but
there's also a more flexible parsing interface with JuliaSyntax.parse().
First, a source-ordered AST with SyntaxNode (call-i in the dump here means
the call has the infix -i flag):
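For example (a sketch; the exact parseall signature and the tree dump format
may differ between JuliaSyntax versions):

```julia
using JuliaSyntax

# Parse a whole string to a source-ordered AST. Displaying the result prints
# a tree dump with byte ranges and kinds such as (call-i x + y).
tree = JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "(x + y)*z")
```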
Internally this has a full representation of all syntax trivia (whitespace and
comments), as can be seen in the rawer "green tree" representation provided by
GreenNode. Here the ranges on the left are byte ranges, and ✔ flags non-trivia
tokens. Note that the parentheses are trivia in the tree representation,
despite being important for parsing.
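A green tree can be requested in the same way (again a sketch; output elided):

```julia
# The lossless green tree for the same text. Its dump shows byte ranges and
# marks non-trivia tokens.
green = JuliaSyntax.parseall(JuliaSyntax.GreenNode, "(x + y)*z")
```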
To use JuliaSyntax as the default Julia parser (for include()ing files,
parsing code with Meta.parse(), and so on), call

```julia
julia> JuliaSyntax.enable_in_core!()
```

This causes some startup latency, so to reduce it you can create a custom
system image by running ./sysimage/compile.jl as a Julia script (or directly
from the shell, on Unix). Then start Julia with julia -J $resulting_sysimage.
Using a custom sysimage has the advantage that package precompilation will also
go through the JuliaSyntax parser.
Parser implementation
Our goal is to losslessly represent the source text with a tree; this may be
called a "lossless syntax tree". (This is sometimes called a "concrete syntax
tree", but that term has also been used for the parse tree of the full formal
grammar for a language including any grammar hacks required to solve
ambiguities, etc. So we avoid this term.)
JuliaSyntax uses a hand-written recursive descent parser which closely
follows the high-level structure of the flisp reference parser. This makes the
code familiar and reduces porting bugs. It also gives a lot of flexibility for
designing the diagnostics, tree data structures, compatibility with different
Julia versions, etc. I didn't choose a parser generator as they still seem
marginal for production compilers — for the parsing itself they don't seem
greatly more expressive and they can be less flexible for the important
"auxiliary" code which needs to be written in either case.
Lexing
We use a version of Tokenize.jl
which has been modified to better match the needs of parsing:
Newline-containing whitespace is emitted as a separate kind
Tokens inside string interpolations are emitted separately from the string
String delimiters are separate tokens, and the actual string content always has
the String kind
Additional contextual keywords (as, var, doc) have been added and
moved to a subcategory of keywords.
Nonterminal kinds were added (though these should probably be factored out again)
Various bugs fixed and additions for newer Julia versions
This copy of Tokenize lives in the JuliaSyntax source tree due to the volume
of changes required but once the churn settles down it would be good to figure
out how to un-fork the lexer in some way or other.
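For a taste of the lexer interface, here's a sketch using the upstream
Tokenize.jl API (the vendored copy in JuliaSyntax differs in the ways listed
above):

```julia
using Tokenize

# Iterate the flat token stream; each token carries a kind and a text range.
for t in Tokenize.tokenize("x = \"hi\"  # comment")
    println(Tokenize.Tokens.kind(t))
end
```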
Parsing with ParseStream
The main parser innovation is the ParseStream interface which provides a
stream-like I/O interface for writing the parser. The parser does not
depend on or produce any concrete tree data structure as part of the parsing
phase but the output spans can be post-processed into various tree data
structures as required. This is like the design of rust-analyzer though with a
simpler implementation.
Parsing proceeds by recursive descent:
The parser consumes a flat list of lexed tokens as input using peek() to
examine tokens and bump() to consume them.
The parser produces a flat list of text spans as output using bump() to
transfer tokens to the output and position()/emit() for nonterminal ranges.
Diagnostics are emitted as separate text spans
Whitespace and comments are automatically bump()ed and don't need to be
handled explicitly. The exception is syntactically relevant newlines in
space-sensitive mode.
Parser modes are passed down the call tree using ParseState.
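To illustrate the style, here's a simplified sketch (not the real JuliaSyntax
source; parse_additive_expr and parse_term are hypothetical stand-ins for real
productions):

```julia
function parse_additive_expr(ps)
    mark = position(ps)          # remember where this span starts
    parse_term(ps)               # parse the left operand
    while peek(ps) == K"+"       # inspect the next token without consuming it
        bump(ps)                 # transfer the `+` token to the output
        parse_term(ps)           # parse the right operand
        emit(ps, mark, K"call")  # emit a nonterminal span starting at `mark`
    end
end
```

Emitting inside the loop yields the left-associated nesting of spans expected
for chains of infix operators.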
The output spans track the byte range, a syntax "kind" stored as an integer
tag, and some flags. The kind tag makes the spans a sum type, but with the type
tracked explicitly outside of Julia's type system.
For lossless parsing the output spans must cover the entire input text. Using
bump(), position() and emit() in a natural way also ensures that:
Spans are cleanly nested with children contained entirely within their parents
Sibling spans are emitted in source order
Parent spans are emitted after all their children.
These properties make the output spans naturally isomorphic to a
"green tree"
in the terminology of C#'s Roslyn compiler.
Tree construction
The build_tree function performs a depth-first traversal of the ParseStream
output spans, allowing them to be assembled into a concrete tree data structure,
for example using the GreenNode data type. We further build on top of this to
define build_tree for the AST type SyntaxNode and for normal Julia Expr.
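In sketch form, using the interface described above (exact signatures may
differ between versions):

```julia
using JuliaSyntax

stream = JuliaSyntax.ParseStream("x + 1")
JuliaSyntax.parse!(stream)   # fill the stream's output spans

# Assemble whichever tree type is needed from the same parse.
green = JuliaSyntax.build_tree(JuliaSyntax.GreenNode, stream)
ex    = JuliaSyntax.build_tree(Expr, stream)
```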
Error recovery
The goal of the parser is to produce well-formed hierarchical structure from
the source text. For interactive tools we need this to work even when the
source text contains errors; it's the job of the parser to include the recovery
heuristics to make this work.
Concretely, the parser in JuliaSyntax should always produce a green tree
which is well formed in the sense that GreenNodes of a given Kind have a
well-defined layout of children. This means the GreenNode to SyntaxNode
transformation is deterministic and tools can assume they're working with a
"mostly valid" AST.
What does "mostly valid" mean? We allow the tree to contain the following types
of error nodes:
Missing tokens or nodes may be added as placeholders when they're needed
to complete a piece of syntax. For example, we could parse a + (b * as
(call-i a + (call-i b * XXX)) where XXX is a placeholder error node.
A sequence of unexpected tokens may be removed by collecting
them as children of an error node and treating them as syntax trivia during
AST construction. For example, a + b end * c could be parsed as the green
tree (call-i a + b (error-t end * c)), and turned into the AST (call + a b).
We want to encode both these cases in a way which is simplest for downstream
tools to use. This is an open question, but for now we use K"error" as the
kind, with the TRIVIA_FLAG set for unexpected syntax.
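For example (a sketch; the diagnostics field name is an assumption about the
internals):

```julia
using JuliaSyntax

# Parsing broken text still yields a full span list; problems are recorded as
# K"error" nodes plus separate diagnostics rather than thrown as exceptions.
stream = JuliaSyntax.ParseStream("a + (b *")
JuliaSyntax.parse!(stream)
stream.diagnostics  # assumed field: the diagnostics recorded during parsing
```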
Syntax trees
Julia's Expr abstract syntax tree can't store precise source locations or
deal with syntax trivia like whitespace or comments. So we need some new tree
types in JuliaSyntax.
JuliaSyntax currently deals in three types of trees:
GreenNode is a minimal lossless syntax tree where
Nodes store a kind and length in bytes, but no text
Syntax trivia are included in the list of children
Children are strictly in source order
SyntaxNode is an abstract syntax tree which has
An absolute position and pointer to the source text
Children strictly in source order
Leaf nodes store values, not text
Trivia are ignored, but there is a 1:1 mapping of non-trivia nodes to the
associated GreenNodes.
Expr is used as a conversion target for compatibility
Wherever possible, the tree structure of GreenNode/SyntaxNode is 1:1 with
Expr. There are, however, some exceptions.
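A quick sketch of inspecting a SyntaxNode (kind and children are JuliaSyntax
functions; the details are illustrative):

```julia
using JuliaSyntax

node = JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "f(x, 1)")

# Walk the AST: trivia don't appear, but every node maps back to a green node
# and hence to a precise byte range in the source.
for c in JuliaSyntax.children(node)
    println(JuliaSyntax.kind(c))
end
```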
Tree differences between GreenNode and Expr
First, GreenNode inherently stores source position, so there's no need for
the LineNumberNodes used by Expr. There are also a small number of other
differences:
Flattened generators
Flattened generators are uniquely problematic because the Julia AST doesn't
respect a key rule we normally expect: that the children of an AST node are a
contiguous range in the source text. This is because the fors in
[xy for x in xs for y in ys] are parsed in the normal order of a for loop to
mean

```julia
for x in xs
    for y in ys
        push!(xy, collection)
    end
end
```
so the xy prefix is in the body of the innermost for loop. Following this,
the standard Julia AST is like so:
```
(flatten
 (generator
  (generator
   xy
   (= y ys))
  (= x xs)))
```
However, note that if this tree were flattened, the order would be
(xy) (y in ys) (x in xs), with the x and y iterations in the opposite of their
source order.
However, our green tree is strictly source-ordered, so we must deviate from the
Julia AST. The natural representation seems to be to remove the generators and
use a flattened structure:
```
(flatten
 xy
 (= x xs)
 (= y ys))
```
Whitespace trivia inside strings
For triple quoted strings, the indentation isn't part of the string data so
should also be excluded from the string content within the green tree. That is,
it should be treated as separate whitespace trivia tokens. With this separation
things like formatting should be much easier. The same reasoning goes for
escaping newlines and following whitespace with backslashes in normal strings.
Detecting string trivia during parsing means that string content is split over
several tokens. Here we wrap these in the K"string" kind (as is already used
for interpolations). The individual chunks can then be reassembled during Expr
construction. (A possible alternative might be to reuse the K"String" and
K"CmdString" kinds for groups of string chunks (without interpolation).)
Take as an example the following Julia fragment:

```julia
x = """
    $a
    b"""
```
This is parsed as (= x (string-s a "\n" "b")), where the -s flag on string-s
means "triple quoted string".
Looking at the green tree, we see that the indentation before $a and b is
marked as trivia:
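(A sketch of producing this dump; the exact rendering is version-dependent.)

```julia
# The indentation before `$a` and `b` shows up as whitespace trivia tokens in
# the green tree.
JuliaSyntax.parseall(JuliaSyntax.GreenNode, "x = \"\"\"\n    \$a\n    b\"\"\"")
```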
Syntax kinds
We generally track the type of syntax nodes with a syntax "kind", stored
explicitly in each node as an integer tag. This effectively makes the node type
a sum type in the type system sense, but with the
type tracked explicitly outside of Julia's type system.
Managing the type explicitly brings a few benefits:
Code and data structures for manipulating syntax nodes are always concretely
typed from the point of view of the compiler.
We control the data layout and can pack the kind into very few bits along
with other flag bits, as desired.
Predicates such as is_operator can be extremely efficient, given that we
know the meaning of the kind's bits.
The kind can be applied to several different tree data structures, or
manipulated by itself.
Pattern matching code is efficient when the full set of kinds is closed and
known during compilation.
There are arguably a few downsides:
Normal Julia dispatch can't express dispatch over syntax kind. Luckily,
a pattern matching macro can provide a very elegant way of expressing such
algorithms over a non-extensible set of kinds, so this is not a big problem.
Different node kinds could come with different data fields, but a syntax
tree must have generic fields to cater for all kinds. (Consider as an analogy
the normal Julia AST QuoteNode with a single field vs Expr with generic
head and args fields.) This could be a disadvantage for code which
processes one specific kind, but for generic code processing many kinds,
having a generic but concrete data layout should be faster.
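A sketch of working with kinds (kind, children and the K"..." string macro
exist in JuliaSyntax; treat the rest as illustrative):

```julia
using JuliaSyntax
using JuliaSyntax: @K_str, kind, children  # explicit, in case they're unexported

node = JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "x + y")
ex = children(node)[1]

# A kind test is a cheap integer comparison against a closed set of tags.
kind(ex) == K"call"
```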
Differences from the flisp parser
Practically the flisp parser is not quite a classic recursive descent parser,
because it often looks back and modifies the output tree it has already
produced. We've tried to eliminate this pattern in favor of lookahead where
possible, because:
It works poorly when the parser is emitting a stream of node spans with
strict source ordering constraints.
It's confusing to reason about this kind of code
However, on occasion it seems to solve genuine ambiguities where Julia code
can't be parsed top-down with finite lookahead, e.g. the kw vs =
ambiguity within parentheses. In these cases we put up with using the
functions look_behind and reset_node!().
Code structure
Large structural changes were generally avoided while porting. In particular,
nearly all function names for parsing productions are the same, with -
replaced by _ and predicates prefixed by is_.
Some notable differences:
parse-arglist and parts of parse-paren- have been combined into a
general function parse_brackets. This function deals with all the odd
corner cases of how the AST is emitted when mixing , and ; within
parentheses. In particular with regard to:
Determining whether ; are block syntax separators or keyword parameters
Determining whether to emit parameter sections based on context
Emitting key-value pairs either as kw or = depending on context
The way that parse-resword is entered has been rearranged to avoid parsing
reserved words with parse-atom inside parse-unary-prefix. Instead, we
detect reserved words and enter parse_resword earlier.
Flisp parser bugs
Here are some behaviors which seem to be bugs. (Some of these we replicate in
the name of compatibility, perhaps with a warning.)
Macro module paths allow calls, which gives weird stateful semantics!

```julia
b() = rand() > 0.5 ? Base : Core
b().@info "hi"
```
Misplaced @ in macro module paths like A.@B.x is parsed as odd
broken-looking AST like (macrocall (. A (quote (. B @x)))). It should
probably be rejected.
Operator prefix call syntax doesn't work in cases like +(a;b,c) where
keyword parameters are separated by commas. A tuple is produced instead.
const and global allow chained assignment, but the right hand side is not
constant: below, a is const but b is not.

```julia
const a = b = 1
```
Parsing the ncat array concatenation syntax within braces gives
strange AST: {a ;; b} parses to (bracescat 2 a b) which is the same as
{2 ; a ; b}, but should probably be (bracescat (nrow 2 a b)) in analogy
to how {a b} produces (bracescat (row a b)).
export a, \n $b is rejected, but export a, \n b parses fine.
In try-catch-finally, the finally clause is allowed before the catch, but
always executes afterward. (Presumably this was a mistake? It seems pretty awful!)
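Concretely (execution order noted in comments):

```julia
try
    error("boom")
finally
    println("cleanup")  # runs second, despite being written first
catch err
    println("caught")   # runs first
end
```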
When parsing "[x \n\n ]" the flisp parser gets confused, but "[x \n ]" is
correctly parsed as Expr(:vect) (maybe fixed in 1.7?)
f(x for x in in xs) is accepted, and parsed very strangely.
Octal escape sequences saturate rather than being reported as errors. E.g.,
"\777" results in "\xff". This is inconsistent with
Base.parse(::Type{Int}, ...)
Leading dots in import paths with operator-named modules are parsed into
dotted operators rather than a relative path. I.e., we have import .⋆ parsing
to (import (. .⋆)) whereas it should be (import (. . ⋆)) for consistency
with the parsing of import .A.
Looking back in the output disregards grouping parentheses, which can lead to
odd results in some cases. For example, f(((((x=1))))) parses as a keyword
call to function f with the keyword x=1, but arguably it should be an
assignment.
Hexfloat literals can have a trailing f (for example, 0x1p1f)
but this doesn't do anything. In the flisp C code such cases are treated as
Float32 literals and this was intentional JuliaLang/julia#2925
but this has never been officially supported in Julia. It seems this bug
arises from (set! pred char-hex?) in parse-number accepting hex exponent
digits, all of which are detected as invalid except for a trailing f when
processed by isnumtok_base.
Parsing / AST oddities and warts
Questionable allowed forms
There are various allowed syntaxes which are fairly easily detected in the
parser, but which will be rejected later during lowering. For building
DSLs this is fine and good, but some such allowed syntaxes don't seem very
useful, even for DSLs:
macro (x) end is allowed but there are no anonymous macros.
abstract type A < B end and other subtype comparisons are allowed, but
only A <: B makes sense.
x where {S T} produces (where x (bracescat (row S T))). This seems pretty weird!
[x for outer x in xs] parses, but outer makes no real sense in this
context (and using this form is a lowering error)
kw and = inconsistencies
There are many apparent inconsistencies between how kw and = are used when
parsing key=val pairs inside parentheses.
Inconsistent parsing of tuple keyword args inside vs outside of dot calls:

```julia
(a=1,)   # (tuple (= a 1))
f.(a=1)  # (tuple (kw a 1))
```
Mixtures of , and ; in calls give nested parameter AST which parses
strangely, and is kind-of-horrible to use.

```julia
# (tuple (parameters (parameters e f) c d) a b)
(a,b; c,d; e,f)
```
Long-form anonymous functions have argument lists which are parsed
as tuples (or blocks!) rather than argument lists, and this mess appears to be
papered over as part of lowering. For example, in function (a;b) end the
(a;b) is parsed as a block! This leads to more inconsistency in the use of
kw for keywords.
Other oddities
Operators with suffixes don't always seem to be parsed consistently as the
same operator without a suffix. Unclear whether this is by design or mistake.
For example, [x +y] ==> (hcat x (+ y)), but [x +₁y] ==> (hcat (call +₁ x y))
global const x=1 is normalized by the parser into (const (global (= x 1))).
I suppose this is somewhat useful for AST consumers, but reversing the source
order is pretty weird and inconvenient when moving to a lossless parser.
let bindings might be stored in a block, or they might not be, depending on
special cases:
```julia
# Special cases not in a block
let x=1 ; end   ==>  (let (= x 1) (block))
let x::1 ; end  ==>  (let (:: x 1) (block))
let x ; end     ==>  (let x (block))

# In a block
let x=1,y=2 ; end  ==>  (let (block (= x 1) (= y 2)) (block))
let x+=1 ; end     ==>  (let (block (+= x 1)) (block))
```
The elseif condition is always in a block but the if condition is not,
presumably because of the need to add a line number node in the flisp parser:

```julia
if a xx elseif b yy end  ==>  (if a (block xx) (elseif (block b) (block yy)))
```
Spaces are allowed between import dots: import . .A is allowed, and
parsed the same as import ..A
import A.. produces (import (. A .)) which is arguably nonsensical, as .
can't be a normal identifier.
The raw string escaping rules are super confusing for backslashes near
the end of the string: raw"\\\\ " contains four backslashes, whereas
raw"\\\\" contains only two. However this was an intentional feature to
allow all strings to be represented and it's unclear whether the situation
can be improved.
In braces after a macro call, @S{a b} is invalid but both @S{a,b} and
@S {a b} parse. Meanwhile, @S[a b] also parses.
Macro names and invocations are post-processed from the output of
parse-atom / parse-call, which leads to some surprising and questionable
constructs which "work":
Absurdities like @(((((a))))) x ==> (macrocall @a x)
Infix macros!? @(x + y) ==> (macrocall @+ x y) (ok, kinda cute and has
some weird logic to it... but what?)
Similarly additional parentheses are allowed @(f(x)) ==> (macrocall @f x)
Allowing @ first in macro module paths (eg @A.B.x instead of A.B.@x)
seems like unnecessary variation in syntax. It makes parsing valid macro
module paths more complex and leads to oddities like
@$.x y ==> (macrocall ($ (quote x)) y), where the $ is first parsed as a
macro name but turns out to be the module name after the . is parsed. But $
can never be a valid module name in normal Julia code, so this makes no sense.
Triple quoted var"""##""" identifiers are allowed. But it's not clear these
are required or desired given that they come with the complex triple-quoted
string deindentation rules.
Deindentation of triple quoted strings with mismatched whitespace is weird
when there's nothing but whitespace. For example, we have
"\"\"\"\n \n \n \"\"\"" ==> "\n \n" so the middle line of whitespace
here isn't dedented but the other two longer lines are?? Here it seems more
consistent that either (a) the middle line should be deindented completely,
or (b) all lines should be dedented only one character, as that's the
matching prefix.
Parsing of anonymous function arguments is somewhat inconsistent.
function (xs...) \n body end parses the argument list as (... xs), whereas
function (x) \n body end parses the argument list as (tuple x).
The difference between multidimensional vs flattened iterators is subtle, and
perhaps too syntactically permissive. For example,
[(x,y) for x in 1:10, y in 1:10] is a multidimensional iterator
[(x,y) for x in 1:10 for y in 1:10] is a flattened iterator
[(x,y) for x in 1:10, y in 1:10 if y < x] is a flattened iterator
It's this last case which seems problematic (why not require the second
form as a more explicit way to indicate flattening?). It's not even
pretty-printed correctly:

```julia
julia> :([(x,y) for x in 1:10, y in 1:10 if y < x])
:([(x, y) for $(Expr(:filter, :(y < x), :(x = 1:10), :(y = 1:10)))])
```
Comparisons to other packages
Official Julia compiler
The official Julia compiler frontend lives in the Julia source tree. It's
mostly contained in just a few files:
The parser in src/julia-parser.scm
Macro expansion in src/macroexpand.scm and lowering in src/julia-syntax.scm
The flisp runtime and C extensions for Julia in src/flisp
Supporting utility functions in a few other .scm and .c files.
There are two issues with the official reference frontend which suggest a rewrite.
First, there's no support for precise source locations and the existing data
structures (bare flisp lists) can't easily be extended to add these. Fixing
this would require changes to nearly all of the code.
Second, it's written in flisp: an aesthetically pleasing, minimal but obscure
implementation of Scheme. Learning Scheme is actually a good way to appreciate
some of Julia's design inspiration, but it's quite a barrier for developers of
Julia language tooling. (Flisp has no user-level documentation but non-schemers
can refer to the Racket documentation which is
quite compatible for basic things.) In addition to the social factors, having
the embedded flisp interpreter and runtime with its own separate data
structures and FFI is complex and inefficient.
JuliaParser.jl
JuliaParser.jl
was a direct port of Julia's flisp reference parser, but it was abandoned
around Julia 0.5 or so. It doesn't support lossless parsing, and adding that
would amount to a full rewrite. Given the divergence from the flisp reference
parser since Julia 0.5, it seemed better just to start from the reference
parser instead.
Tokenize.jl
Tokenize.jl
is a fast lexer for Julia code. The code from Tokenize has been
imported and used in JuliaSyntax, with some major modifications as discussed in
the lexer implementation section.
CSTParser.jl
CSTParser.jl
is a (mostly?) lossless parser with goals quite similar to JuliaParser, and is
used extensively in the VSCode / LanguageServer / JuliaFormatter ecosystem.
CSTParser is very useful,
but I do find the implementation hard to understand and I wanted to try a fresh
approach with a focus on:
"Production readyness": Good docs, tests, diagnostics and maximum similarity
with the flisp parser, with the goal of getting the new parser into Core.
Learning from the latest ideas about composable parsing and data structures
from outside Julia. In particular the implementation of rust-analyzer is
very clean, well documented, and a great source of inspiration.
Composability of tree data structures — I feel like the trees should be
layered somehow with a really lightweight green tree at the most basic level,
similar to Roslyn or rust-analyzer. In comparison, CSTParser uses a more
heavyweight, non-layered data structure. Alternatively or additionally, have a
common tree API with many concrete task-specific implementations.
A big benefit of the JuliaSyntax parser is that it separates the parser code
from the tree data structures entirely, which should give a lot of flexibility
in experimenting with various tree representations.
I also want JuliaSyntax to tackle macro expansion and other lowering steps, and
provide APIs for this which can be used by both the core language and the
editor tooling.