Tree-sitter, LLVM, and the future of language tooling
By Tzvi Melamed
Before I dropped out of school, I remember a professor explaining how they would catch us cheating on our Computer Science assignments. In an English class, thousands of students can be given the same prompt for writing an essay, but we wouldn't expect students to come up with the same essay. A trained eye could see right through this ploy even if one student used a thesaurus to change some words from their classmate's writing. Code, our professor said, is just like an essay.
As rigorous and formal as programming languages are, they're ultimately for people. Sure, they're not the most emotionally expressive (I wouldn't try catching up with an old friend over coffee in JavaScript or telling my wife what I feel when I gaze into her eyes with a beautiful poem in C++). Still, they are the best way we know to express a logical flow of thought.
Different (key)strokes for different folks
Different programming languages are better at expressing different kinds of thoughts. For example, Dart, the language that powers the Flutter framework, is ideally suited to UI code, with its built-in handling of asynchronous code, accessible runtime type system, and syntax for constructing deeply nested objects. Rust, on the other hand, provides deeply integrated solutions for safe concurrency and is a natural fit for applications that require both high performance and provable safety guarantees.
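To make the Rust half of that claim concrete, here's a minimal sketch of what "provable safety" buys you (my own toy example, not from either language's documentation): the compiler accepts this code only because the shared counter is wrapped in thread-safe types, and it would reject an unsynchronized version at compile time rather than letting a data race ship.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state must be explicitly wrapped for thread-safe access;
    // sharing a bare `i32` across threads would be a compile-time error.
    let counter = Arc::new(Mutex::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                // Locking is the only way to reach the data, so a data
                // race here is ruled out by the type system, not by luck.
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    println!("count = {}", *counter.lock().unwrap());
}
```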
Programming languages are complicated.
There are hundreds (thousands?) of programming languages, but making one is notoriously hard. The theory is quite approachable once you learn the tricks of the trade, but building a compiler good enough to make a new language appealing in the face of established ecosystems like JavaScript, Go, Rust, C++, and Java requires a lot of expertise.
Let's start with parsing. Almost every programming language tool begins by breaking a large block of text into something meaningful for the tool to use. This is the job of a parser. Unfortunately, parsing itself is complex. Efficient parsers may use advanced techniques like streaming iteration with back pressure, algorithms that produce useful output in the face of errors, concurrent and incremental computation, substructural memory sharing, and arena allocation.
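To ground what "breaking up a large block of text" means, here's a deliberately naive tokenizer sketch in Rust, the usual first stage of a parser. All the names are hypothetical; real implementations layer the techniques above on top of this basic idea.

```rust
// A deliberately naive tokenizer: the first stage of most parsers.
// Real implementations add incrementality, error recovery, and
// careful memory management on top of this basic loop.
#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Ident(String),
    Plus,
    LParen,
    RParen,
}

fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' | '\t' | '\n' => { chars.next(); }
            '+' => { chars.next(); tokens.push(Token::Plus); }
            '(' => { chars.next(); tokens.push(Token::LParen); }
            ')' => { chars.next(); tokens.push(Token::RParen); }
            '0'..='9' => {
                let mut n = 0i64;
                while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                    n = n * 10 + d as i64;
                    chars.next();
                }
                tokens.push(Token::Number(n));
            }
            c if c.is_alphabetic() => {
                let mut ident = String::new();
                while let Some(&c) = chars.peek() {
                    if c.is_alphanumeric() { ident.push(c); chars.next(); } else { break; }
                }
                tokens.push(Token::Ident(ident));
            }
            _ => { chars.next(); } // skip unrecognized characters
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("foo + (1 + 23)"));
}
```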
Programming languages are easy.
Despite the complexity of parsing and compilation, programming languages are not that hard. Most of the problems that come up have been solved many times over. Many of these solutions are well documented in research papers, undergraduate textbooks, and open-source projects, and code for things like parsing can even be auto-generated. Tools like lex and yacc have been around for decades, and newer ones like PEG.js continue the tradition.
Trade-offs can vary
Not all parsers are the same, though. Trade-offs must be made based on the desired outcomes. For example, the Rust programming language prides itself on its tooling and ecosystem. Rust's compiler, rustc, does a lot of work at compile time to allow high-level concepts to be abstracted away with macros at zero runtime cost. Because the overall compilation process is lengthy, parse time must be minimal to ensure it isn't the bottleneck. Additionally, rustc has some of the best error messaging you'll ever see in a programming language compiler. To facilitate this, the parser needs to figure out what the user's intention was when an error occurs, a requirement that affects the entire design of the parser.
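rustc's actual recovery machinery is far more sophisticated, but the core idea can be sketched in a few lines: when the parser hits an unexpected token, it records a diagnostic with a suggestion, pretends the expected token was there, and keeps going. A toy illustration (every name here is my own, not rustc's):

```rust
// Toy illustration of parser error recovery (not rustc's actual code):
// on an unexpected token, record a diagnostic with a suggestion,
// synthesize the missing piece, and keep parsing so later errors
// can still be reported in the same run.
#[derive(Debug)]
struct Diagnostic {
    position: usize,
    message: String,
    suggestion: Option<String>,
}

fn parse_let(tokens: &[&str], diagnostics: &mut Vec<Diagnostic>) {
    // Expect: let <name> = <value> ;
    let mut i = 0;
    let mut expect = |expected: &str, i: &mut usize| {
        if tokens.get(*i) == Some(&expected) {
            *i += 1;
        } else {
            diagnostics.push(Diagnostic {
                position: *i,
                message: format!("expected `{expected}`, found `{}`",
                                 tokens.get(*i).unwrap_or(&"<eof>")),
                suggestion: Some(format!("insert `{expected}` here")),
            });
            // Recovery: pretend the token was there and continue.
        }
    };
    expect("let", &mut i);
    i += 1; // name
    expect("=", &mut i);
    i += 1; // value
    expect(";", &mut i);
}

fn main() {
    let mut diagnostics = Vec::new();
    // The `=` is missing, but parsing continues past it.
    parse_let(&["let", "x", "42", ";"], &mut diagnostics);
    for d in &diagnostics {
        println!("error at token {}: {} ({})",
                 d.position, d.message, d.suggestion.as_deref().unwrap_or(""));
    }
}
```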
Another example is Dart's Language Server Protocol (LSP) implementation. Dart's LSP server provides semantic features like code completion in a user's IDE. For this, the parser must respond quickly to changes as the user types, and errors will almost certainly be present while the user is mid-edit. Dart's LSP parser uses an incremental approach focused on error recovery. It also uses a data structure that acts as both an AST and a linked list of tokens, making it efficient to pinpoint the part of the AST under the user's cursor in the editor.
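I won't reproduce Dart's internal types here, but the shape of that structure is easy to sketch: tokens form a flat sequence with source offsets, tree nodes reference their first and last token, and cursor lookup walks down the tree comparing offsets. A hypothetical Rust rendering:

```rust
// A sketch (not Dart's actual implementation) of a tree whose nodes
// reference a flat token sequence, so an LSP server can map a cursor
// offset to the smallest enclosing node without re-reading the source.
struct Token {
    text: String,
    offset: usize, // byte offset into the source
}

struct Node {
    kind: &'static str,
    first_token: usize, // index into the token list
    last_token: usize,
    children: Vec<Node>,
}

impl Node {
    fn span(&self, tokens: &[Token]) -> (usize, usize) {
        let start = tokens[self.first_token].offset;
        let end = tokens[self.last_token].offset + tokens[self.last_token].text.len();
        (start, end)
    }

    /// Walk down the tree to the smallest node containing `cursor`.
    fn node_at(&self, tokens: &[Token], cursor: usize) -> &Node {
        for child in &self.children {
            let (start, end) = child.span(tokens);
            if start <= cursor && cursor < end {
                return child.node_at(tokens, cursor);
            }
        }
        self
    }
}

fn main() {
    // Source: "let x = 1"
    let tokens = vec![
        Token { text: "let".into(), offset: 0 },
        Token { text: "x".into(), offset: 4 },
        Token { text: "=".into(), offset: 6 },
        Token { text: "1".into(), offset: 8 },
    ];
    let root = Node {
        kind: "let_statement",
        first_token: 0,
        last_token: 3,
        children: vec![Node { kind: "number", first_token: 3, last_token: 3, children: vec![] }],
    };
    // A cursor at offset 8 lands on the number literal.
    println!("{}", root.node_at(&tokens, 8).kind);
}
```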
Languages aren't just languages.
These examples are interesting in their own right, but the higher-level point is that programming languages aren't all about the language itself. Perhaps more important are the ecosystem and tooling around it. Various language communities focus on improving different aspects of the developer experience.
Unfortunately, the fantastic work on the user experience for errors in rustc doesn't immediately benefit the JavaScript, Go, or Swift communities. The linked-list approach in Dart's LSP doesn't do much for your typical Kotlin, C++, or Haskell developer. And the advanced tooling based on higher-order types in Coq, OCaml, and Idris has yet to influence the developers who spend their days debugging segfaults in Zig or C++.
Tree-sitter: a new era in universal language tooling
In the past few years, universal tooling has shown promise. Tree-sitter, announced by Max Brunsfeld in late 2017, is a tool for generating and running incremental parsers for any programming language. Tree-sitter has powered semantically aware tooling across languages, from the syntax highlighter in the Atom editor to the semantic code analysis built into GitHub.
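The tree-sitter runtime is usable from many languages; here is roughly what incremental parsing looks like through its Rust bindings. This assumes the tree-sitter and tree-sitter-javascript crates, whose exact signatures have shifted a little between releases:

```rust
// Incremental parsing with the tree-sitter Rust bindings.
// Assumes recent versions of the `tree-sitter` and
// `tree-sitter-javascript` crates; older releases differ slightly
// (e.g., set_language once took the language by value).
use tree_sitter::{InputEdit, Parser, Point};

fn main() {
    let mut parser = Parser::new();
    parser
        .set_language(&tree_sitter_javascript::LANGUAGE.into())
        .expect("grammar version mismatch");

    let source = "let x = 1;";
    let mut tree = parser.parse(source, None).unwrap();
    println!("{}", tree.root_node().to_sexp());

    // Tell the old tree what changed ("1" -> "42"), then re-parse.
    tree.edit(&InputEdit {
        start_byte: 8,
        old_end_byte: 9,
        new_end_byte: 10,
        start_position: Point::new(0, 8),
        old_end_position: Point::new(0, 9),
        new_end_position: Point::new(0, 10),
    });
    let new_source = "let x = 42;";
    let new_tree = parser.parse(new_source, Some(&tree)).unwrap();
    println!("{}", new_tree.root_node().to_sexp());
}
```

The second parse reuses every subtree the edit didn't touch, which is what makes re-parsing on each keystroke affordable.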
Tree-sitter has proven that building universal tools to optimize parts of the developer experience can work. The primary research behind tree-sitter isn't new at all. Much of it was developed by Tim Wagner in the late 1990s, and Tony Brooker and Jay Earley discussed concepts like universal parser generators in the 1960s.
Universal tooling makes different trade-offs.
One thing apparent in the research literature is that universal language tooling makes a very different set of trade-offs than language-specific tooling. Because the goal is to apply the tooling to many programming languages, the time spent specializing it for each individual language must be minimized. For example, Tim Wagner's research [1] mentions that certain decisions were made “in order to decrease the time and effort required to port existing language descriptions” [2]. Other authors observe, “A few recovery methods cannot be included within a parser generator, requiring the parser writer to provide a significant portion of the coding effort” [3]; “a tree-sitter grammar, with it’s declarative specification, provides a lot of neat features for free (like error recovery), but places a ceiling on possibilities for future improvement” [4]; and “Error-recovery is sometimes opaque and not conducive to precise diagnostics and debugging” [5].
Jack of all trades, master of none?
Focusing on reducing the cost of adding a new language is terrific, but limiting the expressive power of a language's definition ultimately limits what the tool can do. After years of watching tree-sitter develop, however, I've noticed that, in practice, language communities often end up tweaking the universal tool's specialized implementation for their language anyway. A tool like tree-sitter might strive to make it easy to define grammars for Rust, C++, and JavaScript. The tool's authors may even write the initial implementations for each of those languages themselves. But eventually, if the tool gains traction, JavaScript developers will start filing issues and maintaining the JavaScript grammar themselves (or spinning off several versions of it), Rustaceans will develop their own processes for improving the Rust grammar, C++ developers theirs, and so on.
The future, part 1
In this article, I propose a more extensible approach to universal language tooling: one where features like better error handling are optional and can be developed incrementally, and where trade-offs are expressed abstractly so a single language specification can generate multiple parsers and other tools, each tuned to its specific use case. I'd like to see a world in which advances in the UX of surfacing errors can be made universal, and all developers, regardless of their chosen programming language, can reap the benefits of progress across the industry.
LLVM, Rowan, and language tooling libraries
Parsing isn't the only part of a compiler that has seen universal approaches. LLVM is a C++ library and tool ecosystem that abstracts away the platform-specific parts of machine-code generation and optimization. What started as a research project is now a vital part of the compilation process for C, C++, Rust, Swift, and even Haskell. LLVM achieves this universality, though, with a different approach than tree-sitter's. While efficient parsing can be complex, it pales in comparison to generating optimized machine-dependent assembly for x86, ARM, RISC-V, and other architectures on Windows, Linux, macOS, and the toy operating system your old classmate is building on weekends in their musty, windowless basement apartment. Because of this, LLVM can get away with demanding more work from language authors. One project in the LLVM ecosystem, MLIR, takes code generation further by providing multiple levels of intermediate representation, enabling language features like built-in vectorized computing with minimal effort.
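To give a feel for the division of labor, here is a small sketch using inkwell, a community Rust wrapper around LLVM (assuming a recent release; some builder methods returned values directly rather than Results in older versions). The front end describes a function in target-independent IR and never mentions an instruction set:

```rust
// Emitting LLVM IR from Rust via the `inkwell` crate.
// A minimal sketch assuming a recent inkwell release.
use inkwell::context::Context;

fn main() {
    let context = Context::create();
    let module = context.create_module("demo");
    let builder = context.create_builder();

    // fn add(a: i64, b: i64) -> i64, expressed in target-independent IR.
    let i64_type = context.i64_type();
    let fn_type = i64_type.fn_type(&[i64_type.into(), i64_type.into()], false);
    let function = module.add_function("add", fn_type, None);
    let entry = context.append_basic_block(function, "entry");
    builder.position_at_end(entry);

    let a = function.get_nth_param(0).unwrap().into_int_value();
    let b = function.get_nth_param(1).unwrap().into_int_value();
    let sum = builder.build_int_add(a, b, "sum").unwrap();
    builder.build_return(Some(&sum)).unwrap();

    // LLVM, not the language author, handles lowering this IR to
    // x86, ARM, RISC-V, or anything else with a registered target.
    println!("{}", module.print_to_string().to_string());
}
```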
Rowan solves a problem similar to tree-sitter's: parsing for IDE use cases. Developed initially for rust-analyzer, Rowan provides an efficient lossless syntax tree implementation that is now used by other projects like rslint. Rowan takes an approach closer to LLVM's, requiring more work from its users, but it has seen significantly less adoption than tree-sitter or LLVM. Why? Likely because parsing is relatively easy compared to code generation: if a tool adds too much friction, it isn't worth the effort to adopt.
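Here is roughly what rowan's "lossless" property means in practice. This sketch invents a toy language with a handful of token kinds; projects like rust-analyzer generate this kind of boilerplate from a grammar description rather than writing it by hand:

```rust
// Building a lossless syntax tree with `rowan` for a toy language.
use rowan::{GreenNodeBuilder, Language, SyntaxKind, SyntaxNode};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
enum Kind { Root = 0, Number, Plus, Whitespace }

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
enum Toy {}

impl Language for Toy {
    type Kind = Kind;
    fn kind_from_raw(raw: SyntaxKind) -> Kind {
        match raw.0 {
            0 => Kind::Root,
            1 => Kind::Number,
            2 => Kind::Plus,
            _ => Kind::Whitespace,
        }
    }
    fn kind_to_raw(kind: Kind) -> SyntaxKind {
        SyntaxKind(kind as u16)
    }
}

fn main() {
    let mut builder = GreenNodeBuilder::new();
    builder.start_node(Toy::kind_to_raw(Kind::Root));
    builder.token(Toy::kind_to_raw(Kind::Number), "1");
    // Whitespace is kept as a token rather than thrown away.
    builder.token(Toy::kind_to_raw(Kind::Whitespace), " ");
    builder.token(Toy::kind_to_raw(Kind::Plus), "+");
    builder.token(Toy::kind_to_raw(Kind::Whitespace), " ");
    builder.token(Toy::kind_to_raw(Kind::Number), "2");
    builder.finish_node();

    let root = SyntaxNode::<Toy>::new_root(builder.finish());
    // "Lossless" means the tree reproduces the input byte-for-byte,
    // which IDE features like refactoring and formatting depend on.
    assert_eq!(root.text().to_string(), "1 + 2");
}
```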
Learning from our past
Can a solution like tree-sitter be generalized to output code for rowan? I think it can. Both rslint and rust-analyzer use scripted code generation as part of their integration with rowan. Although that code generation is specialized to each project, it suggests that a library like rowan could serve as a kind of back-end for something more general, like tree-sitter.
The future, part 2
What if tree-sitter had multiple back-ends? A single grammar could generate code optimized for different use cases: a full-fidelity incremental parser based on rowan, a slightly faster batch parser that fails on its first error, or even a parser designed for runtime environments like Scheme's REPL. Trade-offs would still be made during generation, but multiple derivations could come from the same language definition, allowing far more flexibility.
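To be explicit that this is a proposal and not an existing API, here is one hypothetical shape the idea could take; none of these types exist in tree-sitter today:

```rust
// A hypothetical multi-back-end generator. This only illustrates the
// proposal: one grammar definition, several generated parsers with
// different trade-offs.
struct Grammar {
    name: String,
    rules: Vec<String>, // stand-in for a real rule representation
}

trait Backend {
    /// Generate parser source specialized for one set of trade-offs.
    fn generate(&self, grammar: &Grammar) -> String;
}

/// Full-fidelity incremental parsing (e.g., on top of rowan).
struct IncrementalBackend;
/// Fast batch parsing that stops at the first error.
struct BatchBackend;

impl Backend for IncrementalBackend {
    fn generate(&self, grammar: &Grammar) -> String {
        format!("// incremental, error-recovering parser for {}", grammar.name)
    }
}

impl Backend for BatchBackend {
    fn generate(&self, grammar: &Grammar) -> String {
        format!("// fail-fast batch parser for {}", grammar.name)
    }
}

fn main() {
    let grammar = Grammar { name: "toy".into(), rules: vec![] };
    // One definition, multiple derivations.
    let backends: Vec<Box<dyn Backend>> =
        vec![Box::new(IncrementalBackend), Box::new(BatchBackend)];
    for backend in &backends {
        println!("{}", backend.generate(&grammar));
    }
}
```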
How far can this type of tooling go? Can we build universal type checking? LSP servers? Linters? Refactoring tools? Module trees? Package managers? Documentation? Tracing? REPLs? Runtimes? JIT compilers? Can libraries be built generically to work with more than one language using mechanisms more efficient than FFI? Can generic FFI be built with type safety?
Call to action
I hope this article inspires you to look at advances in other segments of the software industry and learn from related work. As an industry, I hope we can build more universal tooling, so each community can focus on what it does best while sharing in the benefits of what others build.
If you like this article, or if you disagree with it, please send me an email, tweet, or DM on whichever platform you prefer, and I'll be happy to chat with you. I love new perspectives, and I'm excited to see what we'll all build together.
Footnotes
1. Wagner, T. A. (1998). Practical Algorithms for Incremental Software Development Environments (Issue UCB/CSD-97-946) [PhD thesis, EECS Department, University of California, Berkeley]. http://www2.eecs.berkeley.edu/Pubs/TechRpts/1998/5885.html
2. Ibid., § 2.3.1.
3. Bottom-Up Parsing (Compiler Writing), Part 13. (n.d.). Retrieved October 21, 2022, from http://what-when-how.com/compiler-writing/bottom-up-parsing-compiler-writing-part-13/
4. Zimmerman, J. (n.d.). Is tree-sitter good enough? https://blog.jez.io/tree-sitter-limitations/
5. Semantic. (2019). Why tree-sitter? [Haskell]. GitHub. https://github.com/github/semantic/blob/793a876ae45d38a6bd1744e137db71d148081bf1/docs/why-tree-sitter.md