On Implementation of CSS Parsers

May 07, 2021 #css #internals

A web browser contains a CSS parser that parses CSS according to the CSS Syntax Module and generates the CSSOM tree. The CSSOM is combined with the DOM to build the render tree, which the browser then uses to lay out and paint the web page.

CSS syntax is relatively simple, so a CSS parser is much simpler than a parser for a general-purpose programming language. A CSS parser works in two main steps:

CSS
↓
(tokenizing by tokenizer)
↓
tokens
↓
(parsing by parser)
↓
AST
  • Tokenizing: reading the input string character by character and building a sequence of tokens. A token is a pair consisting of a token name and an optional token value, where the token name is a category of lexical unit (identifier, keyword, separator, etc.).
  • Parsing: taking the stream of tokens produced by the tokenizer and turning it into a data structure. The output can be any suitable representation of the abstract syntactic structure of the source code. A tree is the common choice, in the form of either a parse tree or an abstract syntax tree (AST).
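
To make the tokenizing step concrete, here is a minimal sketch of a character-by-character tokenizer. It is not how any particular browser or library does it: the token names (`word`, `delim`, and so on) and the `tokenize` function are made up for illustration, and the notion of a "word" is far looser than the identifier, number, and hash tokens that the CSS Syntax Module defines.

```typescript
// Hypothetical token shape: a name (category) plus an optional value.
type TokenName =
  | "word" | "whitespace" | "colon" | "semicolon" | "lbrace" | "rbrace" | "delim";

interface Token {
  name: TokenName;
  value?: string;
}

// Single-character punctuation that gets its own token name.
const PUNCTUATION = new Map<string, TokenName>([
  [":", "colon"],
  [";", "semicolon"],
  ["{", "lbrace"],
  ["}", "rbrace"],
]);

const isSpace = (ch: string): boolean => /\s/.test(ch);
// Simplification: letters, digits and a few symbols all count as "word" characters.
const isWordChar = (ch: string): boolean => /[A-Za-z0-9_#.%-]/.test(ch);

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (isSpace(ch)) {
      // A run of whitespace becomes a single whitespace token.
      while (i < input.length && isSpace(input.charAt(i))) i++;
      tokens.push({ name: "whitespace" });
      continue;
    }
    if (isWordChar(ch)) {
      const start = i;
      while (i < input.length && isWordChar(input.charAt(i))) i++;
      tokens.push({ name: "word", value: input.slice(start, i) });
      continue;
    }
    const punct = PUNCTUATION.get(ch);
    if (punct !== undefined) {
      tokens.push({ name: punct });
    } else {
      // Anything unrecognized is kept as a generic delimiter token.
      tokens.push({ name: "delim", value: ch });
    }
    i++;
  }
  return tokens;
}

const rendered = tokenize("a { color: red; }")
  .map((t) => (t.value !== undefined ? `${t.name}(${t.value})` : t.name))
  .join(" ");
console.log(rendered);
// word(a) whitespace lbrace whitespace word(color) colon whitespace word(red) semicolon whitespace rbrace
```

Comments, strings, at-keywords, and escapes are all left out here. A real tokenizer has to handle each of them.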

A parse tree is a one-to-one mapping from the grammar to tree form. An AST represents the structure of the code in a more compact way that makes analysis and further processing convenient. Parsers usually either construct ASTs directly in their actions, or first construct parse trees and then convert them to ASTs.
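
As a rough illustration of the parsing step, the sketch below builds an AST directly from a token list, without an intermediate parse tree. Everything in it is an assumption made for the example: the token names, the node types (`StylesheetNode`, `RuleNode`, `DeclarationNode`), and the tiny grammar, which only accepts a one-word selector followed by `property: value;` pairs with one-word values.

```typescript
// Hypothetical token and AST node shapes, for illustration only.
interface Token {
  name: string;
  value?: string;
}

interface DeclarationNode { type: "declaration"; property: string; value: string; }
interface RuleNode { type: "rule"; selector: string; declarations: DeclarationNode[]; }
interface StylesheetNode { type: "stylesheet"; rules: RuleNode[]; }

function parse(tokens: Token[]): StylesheetNode {
  // Simplification: whitespace carries no meaning in this toy grammar, so drop it.
  const significant = tokens.filter((t) => t.name !== "whitespace");
  let pos = 0;

  const peek = (): Token | undefined => significant[pos];
  const expect = (name: string): Token => {
    const token = significant[pos];
    if (token === undefined || token.name !== name) {
      throw new SyntaxError(`expected ${name} at token ${pos}`);
    }
    pos++;
    return token;
  };
  const word = (): string => expect("word").value ?? "";

  const rules: RuleNode[] = [];
  while (peek() !== undefined) {
    const selector = word();
    expect("lbrace");
    const declarations: DeclarationNode[] = [];
    while (peek() !== undefined && peek()?.name !== "rbrace") {
      const property = word();
      expect("colon");
      const value = word();
      expect("semicolon");
      declarations.push({ type: "declaration", property, value });
    }
    expect("rbrace");
    rules.push({ type: "rule", selector, declarations });
  }
  return { type: "stylesheet", rules };
}

// Hand-written tokens for `a { color: red; }`.
const sampleTokens: Token[] = [
  { name: "word", value: "a" }, { name: "whitespace" }, { name: "lbrace" },
  { name: "whitespace" }, { name: "word", value: "color" }, { name: "colon" },
  { name: "whitespace" }, { name: "word", value: "red" }, { name: "semicolon" },
  { name: "whitespace" }, { name: "rbrace" },
];
const sampleAst = parse(sampleTokens);
console.log(JSON.stringify(sampleAst, null, 2));
```

Note that the braces, colon, semicolon, and whitespace never appear in the resulting tree. That is the sense in which an AST is more compact than a parse tree, which would keep a node for every grammar symbol.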

CSS parsers can also be found inside CSS processors (Sass, Less, PostCSS, etc.) and toolsets such as CSSTree. These open-source parser implementations generally aim to follow the W3C specifications and focus on CSS analysis and source-to-source transformation, with the following expected features:

  • Detailed parsing with an adjustable level of detail. For example, you can disable parsing of selectors or declaration values into their component parts.
  • Error tolerance by design: the parser attempts to recover gracefully, throwing away only the minimum amount of content before returning to normal parsing.
  • Good performance and low memory consumption.
  • Syntax validation against the grammar defined by the W3C.
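
The error-tolerance point can be sketched as follows. The function below parses the body of a rule and, when a declaration is malformed, throws away only that declaration up to the next semicolon and carries on. The `parseDeclarationList` name and the result shape are hypothetical. Splitting on `;` is also a shortcut that would break on semicolons inside strings or `url()` values, which a token-based parser would handle properly.

```typescript
// Hypothetical result shape: what was kept, and what was thrown away.
interface ParsedDeclaration { property: string; value: string; }
interface DeclarationListResult {
  declarations: ParsedDeclaration[];
  discarded: string[];
}

// Rough check for a property name; the real identifier grammar is more involved.
const PROPERTY_NAME = /^[A-Za-z_-][A-Za-z0-9_-]*$/;

function parseDeclarationList(body: string): DeclarationListResult {
  const declarations: ParsedDeclaration[] = [];
  const discarded: string[] = [];
  // Naive split: each chunk is treated as one candidate declaration.
  for (const chunk of body.split(";")) {
    const text = chunk.trim();
    if (text === "") continue;
    const colon = text.indexOf(":");
    const property = colon === -1 ? "" : text.slice(0, colon).trim();
    const value = colon === -1 ? "" : text.slice(colon + 1).trim();
    if (PROPERTY_NAME.test(property) && value !== "") {
      declarations.push({ property, value });
    } else {
      // Recovery: drop only this chunk, then resume with the next one.
      discarded.push(text);
    }
  }
  return { declarations, discarded };
}

const recovered = parseDeclarationList("color: red; width 10px; margin: 0; : blue");
console.log(recovered.declarations.map((d) => d.property)); // [ 'color', 'margin' ]
console.log(recovered.discarded); // [ 'width 10px', ': blue' ]
```

The two broken declarations are discarded, while `color` and `margin` on either side of them survive.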

Custom CSS parsers can define their own output AST formats, and those formats don’t need to be compatible between parsers. The AST produced by a parser can then be transformed by plugins and serialized back into a plain CSS string by a stringifier.

CSS
↓
(tokenizing by tokenizer)
↓
tokens
↓
(parsing by parser)
↓
AST
↓
(transforming by plugins)
↓
modified AST
↓
(serializing by stringifier)
↓
new CSS
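
The last two stages of this pipeline might look roughly like the sketch below, where a plugin is simply a function from AST to AST and the stringifier walks the tree to emit CSS. The AST shape, the `AstPlugin` type, and the `prefixUserSelect` plugin are all invented for this example and do not mirror the API of PostCSS or any other tool.

```typescript
// Hypothetical AST shape and plugin signature, for illustration only.
interface DeclarationNode { property: string; value: string; }
interface RuleNode { selector: string; declarations: DeclarationNode[]; }
interface StylesheetNode { rules: RuleNode[]; }
type AstPlugin = (ast: StylesheetNode) => StylesheetNode;

// Example plugin: insert a -webkit- prefixed copy before each `user-select`.
const prefixUserSelect: AstPlugin = (ast) => {
  const rules = ast.rules.map((rule) => {
    const declarations: DeclarationNode[] = [];
    for (const d of rule.declarations) {
      if (d.property === "user-select") {
        declarations.push({ property: "-webkit-user-select", value: d.value });
      }
      declarations.push(d);
    }
    return { selector: rule.selector, declarations };
  });
  return { rules };
};

// Stringifier: serialize the (possibly modified) AST back into CSS text.
function stringify(ast: StylesheetNode): string {
  return ast.rules
    .map((rule) => {
      const body = rule.declarations
        .map((d) => `  ${d.property}: ${d.value};`)
        .join("\n");
      return `${rule.selector} {\n${body}\n}`;
    })
    .join("\n\n");
}

// Run every plugin in order, then serialize the final tree.
function transformAndStringify(ast: StylesheetNode, plugins: AstPlugin[]): string {
  const transformed = plugins.reduce<StylesheetNode>((tree, plugin) => plugin(tree), ast);
  return stringify(transformed);
}

const inputAst: StylesheetNode = {
  rules: [
    {
      selector: ".btn",
      declarations: [
        { property: "user-select", value: "none" },
        { property: "color", value: "red" },
      ],
    },
  ],
};
console.log(transformAndStringify(inputAst, [prefixUserSelect]));
```

The plugin returns a new tree instead of mutating its input. This keeps each plugin independent of the others, though in-place mutation would be an equally valid design.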

The only requirement for serialization is that it must round-trip with parsing, that is, parsing the stylesheet must produce the same data structures as parsing, serializing, and parsing again, except for consecutive whitespace tokens, which may be collapsed into a single token.
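
The round-trip requirement can be written down as a small check: parse, serialize, parse again, and compare the two data structures. The sketch below does this for a deliberately tiny subset of CSS (flat rules with `property: value` declarations). The `parseRules`, `stringifyRules`, and `roundTrips` helpers are hypothetical, and the regular-expression parser is only a stand-in for a real one.

```typescript
// Hypothetical minimal data structures for a flat stylesheet.
interface DeclarationNode { property: string; value: string; }
interface RuleNode { selector: string; declarations: DeclarationNode[]; }

// Toy parser: matches `selector { ... }` blocks without nesting.
function parseRules(css: string): RuleNode[] {
  const rules: RuleNode[] = [];
  const rulePattern = /([^{}]+)\{([^{}]*)\}/g;
  let match: RegExpExecArray | null;
  while ((match = rulePattern.exec(css)) !== null) {
    // Runs of whitespace are collapsed to a single space.
    const selector = (match[1] ?? "").trim().replace(/\s+/g, " ");
    const declarations: DeclarationNode[] = [];
    for (const chunk of (match[2] ?? "").split(";")) {
      const colon = chunk.indexOf(":");
      if (colon === -1) continue;
      declarations.push({
        property: chunk.slice(0, colon).trim(),
        value: chunk.slice(colon + 1).trim().replace(/\s+/g, " "),
      });
    }
    rules.push({ selector, declarations });
  }
  return rules;
}

function stringifyRules(rules: RuleNode[]): string {
  return rules
    .map((rule) => {
      const body = rule.declarations
        .map((d) => `  ${d.property}: ${d.value};`)
        .join("\n");
      return `${rule.selector} {\n${body}\n}`;
    })
    .join("\n\n");
}

// parse(css) must equal parse(stringify(parse(css))).
function roundTrips(css: string): boolean {
  const first = parseRules(css);
  const second = parseRules(stringifyRules(first));
  return JSON.stringify(first) === JSON.stringify(second);
}

console.log(roundTrips("a  {  color :  red ;margin:0 }\n\nb{ }")); // true
```

The serialized text does not have to match the original input character for character. Only the parsed structures have to agree.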

Being able to process and transform CSS before it runs in the browser is powerful and opens up many possibilities: using the latest CSS features, stripping unused styles, adding vendor prefixes automatically, checking accessibility, optimizing performance, and more.