cmark — Scanner System

Overview

The scanner system (scanners.h, scanners.re, scanners.c) provides fast pattern-matching functions used throughout cmark's block and inline parsers. The scanners are generated from re2c specifications and compiled into optimized C switch-statement automata. They perform context-free matching only (no backtracking, no captures beyond match length).

Architecture

Source Files

  • scanners.re — re2c source with pattern specifications
  • scanners.c — Generated C code (committed to the repository, regenerated manually)
  • scanners.h — Public declarations (macro wrappers and function prototypes)

Generation

Scanners are regenerated from re2c source via:

re2c --case-insensitive -b -i --no-generation-date --8bit -o scanners.c scanners.re

Flags:

  • --case-insensitive — Case-insensitive matching
  • -b — Use bit vectors for character classes
  • -i — Use if statements instead of switch
  • --no-generation-date — Reproducible output
  • --8bit — 8-bit character width

The generated code consists of state machines implemented as nested switch/if blocks with direct character comparisons. There are no regular expression structs, no DFA tables — the patterns are compiled directly into C control flow.

Scanner Interface

The _scan_at Wrapper

#define _scan_at(scanner, s, p) scanner(s->input.data, s->input.len, p)

All scanner functions share the signature:

bufsize_t scan_PATTERN(const unsigned char *s, bufsize_t len, bufsize_t offset);

Parameters:

  • s — Input byte string
  • len — Total length of s
  • offset — Starting position within s

Return value:

  • Length of the match (in bytes) if successful
  • 0 if no match at the given position

Common Pattern

// In blocks.c:
matched = _scan_at(&scan_thematic_break, &input, first_nonspace);

// In inlines.c:
matched = _scan_at(&scan_autolink_uri, subj, subj->pos);

Scanner Functions

Block Structure Scanners

Scanner Purpose Used In
scan_thematic_break Matches ***, ---, ___ (with optional spaces) blocks.c
scan_atx_heading_start Matches #{1,6} followed by space or EOL blocks.c
scan_setext_heading_line Matches =+ or -+ at line start blocks.c
scan_open_code_fence Matches ``` or ~~~ (3+ fence chars) | blocks.c
scan_close_code_fence Matches closing fence (≥ opening length) blocks.c
scan_html_block_start Matches HTML block type 1-5 openers blocks.c
scan_html_block_start_7 Matches HTML block type 6-7 openers blocks.c
scan_html_block_end_1 Matches </script>, </pre>, </style> blocks.c
scan_html_block_end_2 Matches --> blocks.c
scan_html_block_end_3 Matches ?> blocks.c
scan_html_block_end_4 Matches > blocks.c
scan_html_block_end_5 Matches ]]> blocks.c
scan_link_title Matches "...", '...', or (...) titles inlines.c

Inline Scanners

Scanner Purpose Used In
scan_autolink_uri Matches URI autolinks <scheme:path> inlines.c
scan_autolink_email Matches email autolinks <user@host> inlines.c
scan_html_tag Matches inline HTML tags (open, close, comment, PI, CDATA, declaration) inlines.c
scan_entity Matches HTML entities (&amp;, &#123;, &#x1F;) inlines.c
scan_dangerous_url Matches javascript:, vbscript:, file:, data: URLs html.c
scan_spacechars Matches runs of spaces and tabs inlines.c

Link/Reference Scanners

Scanner Purpose Used In
scan_link_url Matches link destinations (parenthesized or bare) inlines.c
scan_link_title Matches quoted link titles inlines.c

Scanner Patterns (from scanners.re)

Thematic Break

thematic_break = (('*' [ \t]*){3,} | ('-' [ \t]*){3,} | ('_' [ \t]*){3,}) [ \t]* [\n]

Three or more *, -, or _ characters, optionally separated by spaces/tabs.

ATX Heading

atx_heading_start = '#{1,6}' ([ \t]+ | [\n])

1-6 # characters followed by space/tab or newline.

Code Fence

open_code_fence = '`{3,}' [^`\n]* [\n] | '~{3,}' [^\n]* [\n]

Three or more backticks (not followed by backtick in info string) or three or more tildes.

HTML Block Start (Types 1-7)

The CommonMark spec defines 7 types of HTML blocks, each matched by different scanners:

  1. <script>, <pre>, <style> (case-insensitive)
  2. <!--
  3. <?
  4. <! followed by uppercase letter (declaration)
  5. <![CDATA[
  6. HTML tags from a specific set (e.g., <div>, <table>, <h1>, etc.)
  7. Complete open/close tags (not <script>, <pre>, <style>)

Autolink URI

autolink_uri = '<' scheme ':' [^\x00-\x20<>]* '>'
scheme = [A-Za-z][A-Za-z0-9+.\-]{1,31}

Autolink Email

autolink_email = '<' [A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+ '@'
                 [A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?
                 ('.' [A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)* '>'

HTML Entity

entity = '&' ('#' ('x'|'X') [0-9a-fA-F]{1,6} | '#' [0-9]{1,7} | [A-Za-z][A-Za-z0-9]{1,31}) ';'

Dangerous URL

dangerous_url = ('javascript' | 'vbscript' | 'file' | 'data'
                 (not followed by image MIME types)) ':'

Data URLs are allowed if followed by image/png, image/gif, image/jpeg, or image/webp.

HTML Tag

html_tag = open_tag | close_tag | html_comment | processing_instruction | declaration | cdata
open_tag = '<' tag_name attribute* '/' ? '>'
close_tag = '</' tag_name [ \t]* '>'
html_comment = '<!--' ...
processing_instruction = '<?' ...
declaration = '<!' [A-Z]+ ...
cdata = '<![CDATA[' ...

Generated Code Structure

The generated scanners.c contains functions like:

bufsize_t _scan_thematic_break(const unsigned char *p, bufsize_t len,
                               bufsize_t offset) {
  const unsigned char *marker = NULL;
  const unsigned char *start = p + offset;
  // ... re2c-generated state machine
  // Returns (bufsize_t)(p - start) on match, 0 on failure
}

Each function is a self-contained state machine that:

  1. Starts at p + offset
  2. Walks forward byte-by-byte through the pattern
  3. Returns the match length or 0

The generated code is typically hundreds of lines per scanner function, with deeply nested if/switch chains for the character transitions.

Performance Characteristics

  • O(n) in the length of the match — each scanner reads input exactly once
  • No backtracking — re2c generates DFA-based scanners
  • No allocation — scanners work on existing buffers, no heap allocation
  • Branch prediction friendly — the common case (no match) typically hits the first branch

Usage Example

A typical block-parsing sequence using scanners:

// Check if line starts a thematic break
if (!indented &&
    (input.data[first_nonspace] == '*' ||
     input.data[first_nonspace] == '-' ||
     input.data[first_nonspace] == '_')) {
  matched = _scan_at(&scan_thematic_break, &input, first_nonspace);
  if (matched) {
    // Create thematic break node
  }
}

The manual character check before calling the scanner is an optimization — it avoids the function call overhead when the first character can't possibly start the pattern.

Cross-References

Was this handbook page helpful?

This page is part of the Project Tick Handbook, which is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. View full license details.
Last updated: April 18, 2026