• 设为首页
  • 点击收藏
  • 手机版
    手机扫一扫访问
    迪恩网络手机版
  • 关注官方公众号
    微信扫一扫关注
    迪恩网络公众号

andrewcooke/ParserCombinator.jl: A parser combinator library for Julia

原作者: [db:作者] 来自: 网络 收藏 邀请

开源软件名称:

andrewcooke/ParserCombinator.jl

开源软件地址:

https://github.com/andrewcooke/ParserCombinator.jl

开源编程语言:

Julia 100.0%

开源软件介绍:

Build status Codecov

ParserCombinator

A parser combinator library for Julia, similar to those in other languages, like Haskell's Parsec or Python's pyparsing. It can parse any iterable type (not just strings) (except for regexp matchers, of course).

ParserCombinator's main advantage is its flexible design, which separates the matchers from the evaluation strategy. This makes it easy to "plug in" memoization, or debug traces, or to restrict backtracking in a similar way to Parsec - all while using the same grammar.

It also contains pre-built parsers for Graph Modelling Language and DOT.

Example

using ParserCombinator


# the AST nodes we will construct, with evaluation via calc()

abstract type Node end
Base.:(==)(n1::Node, n2::Node) = n1.val == n2.val
calc(n::Float64) = n
struct Inv<:Node val end
calc(i::Inv) = 1.0/calc(i.val)
struct Prd<:Node val end
calc(p::Prd) = Base.prod(map(calc, p.val))
struct Neg<:Node val end
calc(n::Neg) = -calc(n.val)
struct Sum<:Node val end
calc(s::Sum) = Base.sum(map(calc, s.val))


# the grammar (the combinators!)

sum = Delayed()
val = E"(" + sum + E")" | PFloat64()

neg = Delayed()       # allow multiple (or no) negations (eg ---3)
neg.matcher = val | (E"-" + neg > Neg)

mul = E"*" + neg
div = E"/" + neg > Inv
prd = neg + (mul | div)[0:end] |> Prd

add = E"+" + prd
sub = E"-" + prd > Neg
sum.matcher = prd + (add | sub)[0:end] |> Sum

all = sum + Eos()


# and test

# this prints 2.5
calc(parse_one("1+2*3/4", all)[1])

# this prints [Sum([Prd([1.0]),Prd([2.0])])]
parse_one("1+2", all)

Some explanation of the above:

  • I used rather a lot of "syntactic sugar". You can use a more verbose, "parser combinator" style if you prefer. For example, Seq(...) instead of +, or App(...) instead of >.

  • The matcher E"xyz" matches and then discards the string "xyz".

  • Every matcher returns a list of matched values. This can be an empty list if the match succeeded but matched nothing.

  • The operator + matches the expressions to either side and appends the resulting lists. Similarly, | matches one of two alternatives.

  • The operator |> calls the function to the right, passing in the results from the matchers on the left.

  • > is similar to |> but interpolates the arguments (ie uses ...). So instead of passing a list of values, it calls the function with multiple arguments.

  • Delayed() lets you define a loop in the grammar.

  • The syntax [0:end] is a greedy repeat of the matcher to the left. An alternative would be Star(...), while [3:4] would match only 3 or 4 values.

And it supports packrat parsing too (more exactly, it can memoize results to avoid repeating matches).

Still, for large parsing tasks (eg parsing source code for a compiler) it would probably be better to use a wrapper around an external parser generator, like Anltr.

Note: There's an issue with the Compat library which means the code above (the assignment to Delayed.matcher) doesn't work with 0.3. See calc.jl for the uglier, hopefully temporary, 0.3 version.

Install

julia> Pkg.add("ParserCombinator")

Manual

Evaluation

Once you have a grammar (see below) you can evaluate it against some input in various ways:

  • parse_one() - a simple, recursive decent parser with backtracking, but no memoization. Returns a single result or throws a ParserException.

  • parse_all() - a packrat parser, with memoization, that returns an iterator (evaluated lazily) over all possible parses of the input.

  • parse_lines() - a parser in which the source is parsed line by line. Pre-4.0.0 Julia copies strings that are passed to regex, so this reduces memory use when using regular expressions.

  • parse_try() - similar to Haskell's Parsec, with backtracking only inside the Try() matcher. More info here.

  • parse_dbg() - as parse_one(), but also prints a trace of evaluation for all of the matchers that are children of a Trace() matchers. Can also be used with other matchers via the keword delegate; for example parse_dbg(...; delegate=Cache) will provide tracing of the packrat parser (parse_all() above). More info here.

These are all implemented by providing different Config subtypes. For more information see Design, types.jl and parsers.jl.

Basic Matchers

In what follows, remember that the power of parser combinators comes from how you combine these. They can all be nested, refer to each other, etc etc.

Equality

julia> parse_one("abc", Equal("ab"))
1-element Array{Any,1}:
 "ab"

julia> parse_one("abc", Equal("abx"))
ERROR: ParserCombinator.ParserException("cannot parse")

This is so common that there's a corresponding string literal (it's "e" for `Equal(), the corresponding matcher).

julia> parse_one("abc", e"ab")
1-element Array{Any,1}:
 "ab"

Sequences

Matchers return lists of values. Multiple matchers can return lists of lists, or the results can be "flattened" a level (usually more useful):

julia> parse_one("abc", Series(Equal("a"), Equal("b")))
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", Series(Equal("a"), Equal("b"); flatten=false))
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> parse_one("abc", Seq(Equal("a"), Equal("b")))
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", And(Equal("a"), Equal("b")))
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> parse_one("abc", e"a" + e"b")
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", e"a" & e"b")
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

Where Series() is implemented as Seq() or And(), depending on the value of flatten (default true).

Warning - The sugared syntax has to follow standard operator precedence, where | binds more tightly that +. This means that

   matcher1 + matcher2 | matcher3

is almost always an error because it means:

   matcher1 + (matcher2 | matcher3)

while what was intended was:

   (matcher1 + matcher2) | matcher3

Empty Values

Often, you want to match something but then discard it. An empty (or discarded) value is an empty list. This may help explain why I said flattening lists was useful above.

julia> parse_one("abc", And(Drop(Equal("a")), Equal("b")))
2-element Array{Any,1}:
 Any[]
 Any["b"]

julia> parse_one("abc", Seq(Drop(Equal("a")), Equal("b")))
1-element Array{Any,1}:
 "b"

julia> parse_one("abc", ~e"a" + e"b")
1-element Array{Any,1}:
 "b"

julia> parse_one("abc", E"a" + e"b")
1-element Array{Any,1}:
 "b"

Note the ~ (tilde / home directory) and capital E in the last two examples, respectively.

Alternates

julia> parse_one("abc", Alt(e"x", e"a"))
1-element Array{Any,1}:
 "a"

julia> parse_one("abc", e"x" | e"a")
1-element Array{Any,1}:
 "a"

Warning - The sugared syntax has to follow standard operator precedence, where | binds more tightly that +. This means that

   matcher1 + matcher2 | matcher3

is almost always an error because it means:

   matcher1 + (matcher2 | matcher3)

while what was intended was:

   (matcher1 + matcher2) | matcher3

Regular Expressions

julia> parse_one("abc", Pattern(r".b."))
1-element Array{Any,1}:
 "abc"

julia> parse_one("abc", p".b.")
1-element Array{Any,1}:
 "abc"

julia> parse_one("abc", P"." + p"b.")
1-element Array{Any,1}:
 "bc"

As with equality, a capital prefix to the string literal ("p" for "pattern" by the way) implies that the value is dropped.

Note that regular expresions do not backtrack. A typical, greedy, regular expression will match as much of the input as possible, every time that it is used. Backtracking only exists within the library matchers (which can duplicate regular expression functionality, when needed).

Repetition

julia> parse_one("abc", Repeat(p"."))
3-element Array{Any,1}:
 "a"
 "b"
 "c"

julia> parse_one("abc", Repeat(p".", 2))
2-element Array{Any,1}:
 "a"
 "b"

julia> collect(parse_all("abc", Repeat(p".", 2, 3)))
2-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]

julia> parse_one("abc", Repeat(p".", 2; flatten=false))
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> collect(parse_all("abc", Repeat(p".", 0, 3)))
4-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]
 Any["a"]
 Any[]

julia> collect(parse_all("abc", Repeat(p".", 0, 3; greedy=false)))
4-element Array{Any,1}:
 Any[]
 Any["a"]
 Any["a","b"]
 Any["a","b","c"]

You can also use Depth() and Breadth() for greedy and non-greedy repeats directly (but Repeat() is more readable, I think).

The sugared version looks like this:

julia> parse_one("abc", p"."[1:2])
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", p"."[1:2,:?])
1-element Array{Any,1}:
 "a"

julia> parse_one("abc", p"."[1:2,:&])
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> parse_one("abc", p"."[1:2,:&,:?])
1-element Array{Any,1}:
 Any["a"]

Where the :? symbol is equivalent to greedy=false and :& to flatten=false (compare with + and & above).

There are also some well-known special cases:

julia> collect(parse_all("abc", Plus(p".")))
3-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]
 Any["a"]

julia> collect(parse_all("abc", Star(p".")))
4-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]
 Any["a"]
 Any[]

Full Match

To ensure that all the input is matched, add Eos() to the end of the grammar.

julia> parse_one("abc", Equal("abc") + Eos())
1-element Array{Any,1}:
 "abc"

julia> parse_one("abc", Equal("ab") + Eos())
ERROR: ParserCombinator.ParserException("cannot parse")

Transforms

Use App() or > to pass the current results to a function (or datatype constructor) as individual values.

julia> parse_one("abc", App(Star(p"."), tuple))
1-element Array{Any,1}:
 ("a","b","c")

julia> parse_one("abc", Star(p".") > string)
1-element Array{Any,1}:
 "abc"

The action of Appl() and |> is similar, but everything is passed as a single argument (a list).

julia> type Node children end

julia> parse_one("abc", Appl(Star(p"."), Node))
1-element Array{Any,1}:
 Node(Any["a","b","c"])

julia> parse_one("abc", Star(p".") |> x -> map(uppercase, x))
3-element Array{Any,1}:
 "A"
 "B"
 "C"

Lookahead And Negation

Sometimes you can't write a clean grammar that just consumes data: you need to check ahead to avoid something. Or you need to check ahead to make sure something works a certain way.

julia> parse_one("12c", Lookahead(p"\d") + PInt() + Dot())
2-element Array{Any,1}:
 12
   'c'

julia> parse_one("12c", Not(Lookahead(p"[a-z]")) + PInt() + Dot())
2-element Array{Any,1}:
 12
   'c'

More generally, Not() replaces any match with failure, and failure with an empty match (ie the empty list).

Other

Backtracking

By default, matchers will backtrack as necessary.

In some (unusual) cases, it is useful to disable backtracking. For example, see PCRE's "possessive" matching. This can be done here on a case-by-case basis by adding backtrack=false to Repeat(), Alternatives() and Series(), or by appending ! to the matchers that those functions generate: Depth!, Breadth!, Alt!, Seq! and And!.

For example,

collect(parse_all("123abc", Seq!(p"\d"[0:end], p"[a-z]"[
                      

鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
热门推荐
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap