joshmarinacci 16 hours ago [-]
I’ve been following the Ohm project for years and it’s the best way to build parsers I’ve ever seen. I’ve used it to parse many programming languages and even markdown. I’m happy to see it get even faster.
What is great about the Ohm approach compared to typical lex/yacc/ANTLR parsers is that it avoids ambiguity by using ordered choice (the first matching rule wins), instead of requiring you to resolve conflicts explicitly. This makes working with Ohm/PEGs less painful in the initial phase of a project.
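To make "ordered choice" concrete, here's a minimal sketch in Python (hand-rolled combinators, not Ohm's actual API): each alternative is tried in order and the first success wins, so the author never has to declare precedence between overlapping rules — but reordering the alternatives silently changes the parse.

```python
# A minimal sketch (not Ohm's API) of PEG-style ordered choice:
# each parser returns the index after a match, or None on failure.

def lit(s):
    """Match a literal string at position i."""
    return lambda text, i: i + len(s) if text.startswith(s, i) else None

def choice(*parsers):
    """Ordered choice: try alternatives in order; first success wins."""
    def parse(text, i):
        for p in parsers:
            pos = p(text, i)
            if pos is not None:
                return pos          # later alternatives are never tried
        return None
    return parse

# Both alternatives could match "<=", but ordered choice commits to
# whichever is listed first -- no conflict is ever reported.
op = choice(lit("<="), lit("<"))
print(op("<= x", 0))   # -> 2: "<=" wins because it is listed first

op2 = choice(lit("<"), lit("<="))
print(op2("<= x", 0))  # -> 1: now "<" wins, silently changing the parse
```

This is why PEG grammars conventionally list longer alternatives first.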
It's also important to highlight that this makes the parsing process slower.
cakoose 5 hours ago [-]
> it avoids ambiguity by using ordered choice (the first matching rule wins)
PEG parsing tool authors often say that ordered choice solves the problem of ambiguity, but that's very misleading.
Yes, ordered choice is occasionally useful as a way to resolve grammatical overlap. But as a grammar author, it's more common for me to want to express unordered choice between two sub-grammars. A tool that supports unordered choice will then let you know when you have an unexpected ambiguity.
PEG-based tools force you to use ordered choice for everything. You may be surprised later to find out that your grammar was actually ambiguous, and the ambiguity was "resolved" somewhat arbitrarily by picking the first sub-grammar.
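A tiny illustration of that failure mode, with regexes standing in for PEG alternatives and a made-up keyword/identifier overlap (this is my own toy example, not from any real tool):

```python
import re

def parse_token(text):
    # Ordered choice, PEG style: keyword is tried first, identifier second.
    for name, pattern in [("keyword", r"if|while"), ("ident", r"[a-z]+")]:
        m = re.match(pattern, text)
        if m:
            return name, m.group()   # first match wins; no ambiguity warning
    return None

# "if" genuinely is a keyword, but "iffy" gets classified as the keyword
# "if" with leftover input "fy". A tool with unordered choice would flag
# the keyword/identifier overlap; ordered choice resolves it silently.
print(parse_token("if"))    # ('keyword', 'if')
print(parse_token("iffy"))  # ('keyword', 'if')  -- silent misparse
```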
> This makes working with Ohm/PEGs less painful in the initial phase of a project.
I do agree with this. But then what happens in the later phases? Do you switch to a tool that supports unordered choice to see if you have any ambiguities? And potentially have to change your grammar to fix them?
ogogmad [-]
Now I don't know what to think. The author's got a ton more experience than me. It seems there's a big enough market out there for people wanting non-ambiguity proofs and linear running-time proofs.
Then again, the more I think about parsing, the more I think it's a completely made-up problem. I'm pretty sure there's a middle ground between Lisp (or even worse, Forth) and Python. Fancy parsing has robbed us of the possibilities of metaprogramming and instead opened up a market for Stephen Wolfram, whose product features a homoiconic language.
I've been gorging on Formal Language Theory literature recently. I am now fully convinced that Regular Languages are a very good idea: They are precisely the string-search problems that can be solved in constant space. If you had to find occurrences of certain patterns in a massive piece of text, you would naturally try to keep the memory usage of your search program independent of the text's size. But the theory of Regular Languages only actually gets off the ground when you wonder if Regular Languages are closed under concatenation or not. It turns out that they are closed under concatenation, but this requires representing them as Non-Deterministic Finite-State Automata - which leads to the Kleene Star operation and then Regular Expressions. This is a non-obvious piece of theory which solves a well-formulated problem. So now I suspect that if history were to repeat again, Regular Expressions would still have been invented and used for the same reasons.
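The constant-space point is easy to see in code: a streaming matcher for a regular pattern keeps only a fixed-size automaton state, never the text itself. A sketch with a hand-written toy DFA (my own example, counting occurrences of the pattern "ab"):

```python
def count_ab(chars):
    """Count occurrences of "ab" with a 2-state DFA; O(1) memory
    regardless of input length -- the text can be a one-pass stream."""
    state, hits = 0, 0          # state 0: start; state 1: just saw 'a'
    for c in chars:
        if state == 0:
            state = 1 if c == "a" else 0
        else:                   # state == 1
            if c == "b":
                hits += 1
                state = 0
            else:
                state = 1 if c == "a" else 0
    return hits

# Works on any iterable of characters, including a stream read lazily.
print(count_ab(iter("xabab aab")))  # -> 3
```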
By contrast, I find Context-Free Grammars much more dubious, and LR almost offensive. The problem with LR is that I can't find a description of what it is that isn't just a gory description of how it works. And furthermore, it doesn't appear to relate to anything other than parsing. There's no description anywhere of how any of its ideas could be used in any other area.
DevelopingElk 14 hours ago [-]
The issue with Regex for parsing is that it can't handle balanced parentheses (https://en.wikipedia.org/wiki/Regular_expression). More generally, regular expressions can't handle nested structure. Context-free grammars are the most natural extension that can: they add a substitution (recursion) operator to Regex that makes it powerful enough to recognize nested structure. So Regex would be reinvented if history were rerun, but so would context-free grammars. Part of the complexity in parsing is attaching semantic meaning to the parse; Regex mostly avoids this by not caring how a string matches, just whether it matches or not.
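That substitution operator is just recursion in code form: a balanced-parentheses recognizer calls itself for each nested group, using the call stack as the unbounded memory a finite automaton lacks. A rough sketch (my own, for the grammar Dyck = ( "(" Dyck ")" )*):

```python
def balanced(s):
    """Recognize the Dyck language of balanced parentheses."""
    def parse(i):
        # Consume zero or more "(" Dyck ")" groups starting at i,
        # recursing for each level of nesting.
        while i < len(s) and s[i] == "(":
            i = parse(i + 1)
            if i >= len(s) or s[i] != ")":
                return len(s) + 1   # unmatched "(" -> fail past the end
            i += 1
        return i
    return parse(0) == len(s)

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```

No regular expression can do this, because the matcher must remember an unbounded count of open parentheses.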
Now, I do agree that LR grammars are messy. Nowadays, they have mostly fallen from favor. Instead, people use simpler parsers that work for the restricted grammars actual programming languages have.
IIRC there is some research into formalizing the type of unambiguous grammar that always uses () or [] as nesting elements, but can use Regex for lexing.
ogogmad 13 hours ago [-]
I understand what a CFG is and why Dyck's language (matching parens) is not a regular language. My point was that CFG/CFL is less motivated by a reasonable and uniquely characterising constraint - such as making memory usage independent of the size of an input string - than regex is.
Then again, you are right that CFGs are very natural. And they do admit a few easy O(n^3) parsing algorithms, like Earley and CYK.
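For what it's worth, CYK really is only a few lines once the grammar is in Chomsky normal form, and the triple loop is exactly where the O(n^3) comes from. A toy sketch recognizing nonempty balanced parentheses (the CNF encoding below is my own):

```python
from itertools import product

# CNF grammar for nonempty balanced parentheses:
#   S -> A B | A C | S S,   C -> S B,   A -> "(",   B -> ")"
unary = {"(": {"A"}, ")": {"B"}}
binary = {("A", "B"): {"S"}, ("A", "C"): {"S"},
          ("S", "S"): {"S"}, ("S", "B"): {"C"}}

def cyk(w):
    n = len(w)
    # t[i][l] = set of nonterminals deriving the substring w[i:i+l]
    t = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, c in enumerate(w):
        t[i][1] = set(unary.get(c, ()))
    for length in range(2, n + 1):           # O(n) span lengths
        for i in range(n - length + 1):      # O(n) start positions
            for k in range(1, length):       # O(n) split points
                for x, y in product(t[i][k], t[i + k][length - k]):
                    t[i][length] |= binary.get((x, y), set())
    return "S" in t[0][n] if n else False

print(cyk("(())()"))  # True
print(cyk("(()"))     # False
```

Every CFG can be converted to CNF, so this one algorithm parses any context-free language, just not quickly.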
I think your last sentence relates to Visibly Pushdown Grammars. See also Operator Precedence Grammars.
At the bottom of that page is a list: "Here are some awesome things people have built using Ohm:"
zamadatix 14 hours ago [-]
Dunno if the link was changed or something but I had to go to the main page to see the list at the bottom https://ohmjs.org/. Hope that saves someone some searching!
I have been very curious to try Ohm. I'm currently using a hand-rolled parser combinator library, but Ohm looks slick. The online editor is nice too https://ohmjs.org/editor/
wener 12 hours ago [-]
The only thing that makes me prefer peggyjs over Ohm is the runtime dependency. I think this might let Ohm AOT-compile the grammar to avoid that dependency.
quotemstr 14 hours ago [-]
Am I the only one who doesn't like PEGs and prefers EBNF-style parser generators? The order-dependence of PEG alternatives and the lack of ambiguity detection are footguns, IMHO.
https://joshondesign.com/2021/07/16/ohm_markdown_parser
Here are some awesome things people have built using Ohm:
- Shopify's theme-tools (https://github.com/Shopify/theme-tools/), used in their online code editor and the official VS code extension.
- Seymour (https://github.com/harc/seymour), a live programming environment for the classroom.
- Shadama (https://tinlizzie.org/~ohshima/shadama2/live2017/), a particle simulation language designed for high-school science.
- Slang (http://slang.kylestetz.com/), a browser-based audio programming language, and turtle.audio, a music sequencer inspired by turtle graphics programming.
- Simpletalk (https://github.com/dkrasner/Simpletalk) is an expressive, programmable authoring system inspired by HyperCard and Smalltalk. (See also the Strange Loop talk.)
- Wildcard (https://www.geoffreylitt.com/wildcard/), a browser extension that empowers anyone to modify websites to meet their own specific needs, uses Ohm for its spreadsheet formulas.
- JAMScript (https://citelab.github.io/JAMScript/) is a programming language for edge-based IoT applications.
- Bruno (https://github.com/usebruno/bruno) is an open source IDE for exploring and testing APIs.
It has also been used in a number of projects by Ink & Switch (https://inkandswitch.com), including Potluck (Dynamic documents as personal software: https://www.inkandswitch.com/potluck/), the Deja Vu project at MIT (https://sdg.csail.mit.edu/project/deja-vu/), ...