cmark — Inline Parsing

Overview

Inline parsing is Phase 2 of cmark's pipeline. Implemented in inlines.c, it processes the text content of paragraph and heading nodes, recognizing emphasis (*, _), code spans (`), links ([text](url)), images (![alt](url)), autolinks (<url>), raw HTML inline, hard line breaks, soft line breaks, and smart punctuation.

The entry point is cmark_parse_inlines(), called from process_inlines() in blocks.c after all block structure has been finalized.

The `subject` Struct

All inline parsing state is tracked in the subject struct:

typedef struct {
  cmark_mem *mem;                          // Memory allocator
  cmark_chunk input;                       // The text being parsed
  unsigned flags;                          // Skip flags for HTML constructs
  int line;                                // Source line number
  bufsize_t pos;                           // Current byte position in input
  int block_offset;                        // Column offset of the containing block
  int column_offset;                       // Adjustment for multi-line source position tracking
  cmark_reference_map *refmap;             // Link reference definitions
  delimiter *last_delim;                   // Top of delimiter stack (linked list, newest first)
  bracket *last_bracket;                   // Top of bracket stack (linked list, newest first)
  bufsize_t backticks[MAXBACKTICKS + 1];   // Cached positions of backtick sequences
  bool scanned_for_backticks;              // Whether the full input has been scanned for backticks
  bool no_link_openers;                    // Optimization: set when no link openers remain
} subject;

MAXBACKTICKS is defined as 1000. The backticks array caches the positions of backtick sequences of each length, enabling O(1) lookup once the input has been fully scanned.

Skip Flags

The flags field uses bit flags to track which HTML constructs have been confirmed absent:

#define FLAG_SKIP_HTML_CDATA        (1u << 0)
#define FLAG_SKIP_HTML_DECLARATION  (1u << 1)
#define FLAG_SKIP_HTML_PI           (1u << 2)
#define FLAG_SKIP_HTML_COMMENT      (1u << 3)

Once a scan for a particular HTML construct fails, the flag is set to avoid rescanning.

The Delimiter Stack

Emphasis and smart punctuation use a delimiter stack. Each entry is:

typedef struct delimiter {
  struct delimiter *previous;    // Link to older delimiter
  struct delimiter *next;        // Link to newer delimiter (towards top)
  cmark_node *inl_text;          // The text node created for this delimiter run
  bufsize_t position;            // Position in the input
  bufsize_t length;              // Number of delimiter characters remaining
  unsigned char delim_char;      // '*', '_', '\'', or '"'
  bool can_open;                 // Whether this run can open emphasis
  bool can_close;                // Whether this run can close emphasis
} delimiter;

The stack is a doubly-linked list with last_delim pointing to the newest entry.

The Bracket Stack

Links and images use a separate bracket stack:

typedef struct bracket {
  struct bracket *previous;      // Link to older bracket
  cmark_node *inl_text;          // The text node for '[' or '!['
  bufsize_t position;            // Position in the input
  bool image;                    // Whether this is an image opener '!['
  bool active;                   // Can still match (set to false when deactivated)
  bool bracket_after;            // Whether a '[' appeared after this bracket
} bracket;

Brackets are deactivated (set active = false) when:

A matching ] fails to produce a valid link (the opener is deactivated to prevent infinite loops)
An inner link is formed (outer brackets are deactivated per spec)

Emphasis Flanking Rules: `scan_delims()`

static int scan_delims(subject *subj, unsigned char c, bool *can_open,
                       bool *can_close);

This function determines whether a run of *, _, ', or " characters can open and/or close emphasis, following the CommonMark spec's Unicode-aware flanking rules:

The function looks at the character before the run and the character after the run.
It uses cmark_utf8proc_iterate() to decode the surrounding characters as full Unicode code points.
It classifies them using cmark_utf8proc_is_space() and cmark_utf8proc_is_punctuation_or_symbol().

The flanking rules:

Left-flanking: numdelims > 0, character after is not a space, AND (character after is not punctuation OR character before is a space or punctuation)
Right-flanking: numdelims > 0, character before is not a space, AND (character before is not punctuation OR character after is a space or punctuation)

For *: can_open = left_flanking, can_close = right_flanking

For _:

*can_open = left_flanking &&
            (!right_flanking || cmark_utf8proc_is_punctuation_or_symbol(before_char));
*can_close = right_flanking &&
             (!left_flanking || cmark_utf8proc_is_punctuation_or_symbol(after_char));

For ' and " (smart punctuation):

*can_open = left_flanking &&
     (!right_flanking || before_char == '(' || before_char == '[') &&
     before_char != ']' && before_char != ')';
*can_close = right_flanking;

The function advances subj->pos past the delimiter run and returns the number of delimiter characters consumed. For quotes, only 1 delimiter is consumed regardless of how many appear.

Emphasis Resolution: `S_insert_emph()`

static delimiter *S_insert_emph(subject *subj, delimiter *opener,
                                delimiter *closer);

When a closing delimiter is found that matches an opener on the stack, this function creates emphasis nodes:

If the opener and closer have combined length >= 2 AND both have individual length >= 2, create a CMARK_NODE_STRONG node (consuming 2 characters from each).
Otherwise, create a CMARK_NODE_EMPH node (consuming 1 character from each).
All inline nodes between the opener and closer are moved to become children of the new emphasis node.
Any delimiters between the opener and closer are removed from the stack.
If the opener is exhausted (length == 0), it's removed from the stack.
If the closer is exhausted, it's removed too; otherwise, processing continues.

Code Span Parsing: `handle_backticks()`

static cmark_node *handle_backticks(subject *subj, int options);

When a backtick is encountered:

take_while(subj, isbacktick) consumes the opening backtick run and records its length.
scan_to_closing_backticks() searches forward for a matching backtick run of the same length.

The scanning function uses the subj->backticks[] array to cache positions of backtick sequences. If subj->scanned_for_backticks is true and the cached position for the needed length is behind the current position, it immediately returns 0 (no match).

If no closing backticks are found, the opening run is emitted as literal text. If found, the content between is extracted, normalized via S_normalize_code():

static void S_normalize_code(cmark_strbuf *s) {
  // 1. Convert \r\n and \r to spaces
  // 2. Convert \n to spaces
  // 3. If content begins and ends with a space and contains non-space chars,
  //    strip one leading and one trailing space
}

Link Parsing

When ] is encountered after an opener on the bracket stack:

Inline Links: `[text](url "title")`

The parser looks for ( immediately after ], then:

Skips optional whitespace
Tries to parse a link destination (URL)
Skips optional whitespace
Optionally parses a link title (in single quotes, double quotes, or parentheses)
Expects )

Reference Links: `[text][ref]` or `[text][]` or `[text]`

If the inline link syntax doesn't match, the parser tries:

[text][ref] — explicit reference
[text][] — collapsed reference (label = text)
[text] — shortcut reference (label = text)

Reference lookup uses cmark_reference_lookup() against the parser's refmap.

URL Cleaning

unsigned char *cmark_clean_url(cmark_mem *mem, cmark_chunk *url);

Trims the URL, unescapes HTML entities, and handles angle-bracket-delimited URLs.

Autolinks

static inline cmark_node *make_autolink(subject *subj, int start_column,
                                        int end_column, cmark_chunk url,
                                        int is_email);

Autolinks (<http://example.com> or <user@example.com>) are detected via the scan_autolink_uri() and scan_autolink_email() scanner functions. Email autolinks have mailto: prepended to the URL automatically:

static unsigned char *cmark_clean_autolink(cmark_mem *mem, cmark_chunk *url,
                                           int is_email) {
  cmark_strbuf buf = CMARK_BUF_INIT(mem);
  cmark_chunk_trim(url);
  if (is_email)
    cmark_strbuf_puts(&buf, "mailto:");
  houdini_unescape_html_f(&buf, url->data, url->len);
  return cmark_strbuf_detach(&buf);
}

Smart Punctuation

When CMARK_OPT_SMART is enabled, the inline parser transforms:

static const char *EMDASH = "\xE2\x80\x94";           // —
static const char *ENDASH = "\xE2\x80\x93";           // –
static const char *ELLIPSES = "\xE2\x80\xA6";         // …
static const char *LEFTDOUBLEQUOTE = "\xE2\x80\x9C";  // "
static const char *RIGHTDOUBLEQUOTE = "\xE2\x80\x9D"; // "
static const char *LEFTSINGLEQUOTE = "\xE2\x80\x98";  // '
static const char *RIGHTSINGLEQUOTE = "\xE2\x80\x99";  // '

--- becomes em dash (—)
-- becomes en dash (–)
... becomes ellipsis (…)
' and " are converted to curly quotes using the delimiter stack (open/close logic)

Hard and Soft Line Breaks

Hard line break: Two or more spaces before a line ending, or a backslash before a line ending. Creates a CMARK_NODE_LINEBREAK node.
Soft line break: A line ending not preceded by spaces or backslash. Creates a CMARK_NODE_SOFTBREAK node.

Special Character Dispatch

static bufsize_t subject_find_special_char(subject *subj, int options);

This function scans forward from subj->pos looking for the next special character that needs inline processing. Special characters include:

Line endings (\r, \n)
Backtick (`)
Backslash (\)
Ampersand (&)
Less-than (<)
Open bracket ([)
Close bracket (])
Exclamation mark (!)
Emphasis characters (*, _)

Any text between special characters is collected as a CMARK_NODE_TEXT node.

Source Position Tracking

static void adjust_subj_node_newlines(subject *subj, cmark_node *node,
                                     int matchlen, int extra, int options);

When CMARK_OPT_SOURCEPOS is enabled, this function adjusts source positions for multi-line inline constructs. It counts newlines in the just-matched span and updates:

subj->line — incremented by the number of newlines
node->end_line — adjusted for multi-line spans
node->end_column — set to characters after the last newline
subj->column_offset — adjusted for correct subsequent position calculations

Inline Node Factory Functions

The inline parser uses efficient factory functions:

// Macros for simple nodes
#define make_linebreak(mem) make_simple(mem, CMARK_NODE_LINEBREAK)
#define make_softbreak(mem) make_simple(mem, CMARK_NODE_SOFTBREAK)
#define make_emph(mem) make_simple(mem, CMARK_NODE_EMPH)
#define make_strong(mem) make_simple(mem, CMARK_NODE_STRONG)

// Fast child appending (bypasses S_can_contain validation)
static void append_child(cmark_node *node, cmark_node *child) {
  cmark_node *old_last_child = node->last_child;
  child->next = NULL;
  child->prev = old_last_child;
  child->parent = node;
  node->last_child = child;
  if (old_last_child) {
    old_last_child->next = child;
  } else {
    node->first_child = child;
  }
}

This append_child() is a simplified version of the public cmark_node_append_child(), skipping containership validation since the inline parser always produces valid structures.

The Main Parse Loop

void cmark_parse_inlines(cmark_mem *mem, cmark_node *parent,
                         cmark_reference_map *refmap, int options);

This function initializes a subject from the parent node's data field, then repeatedly calls parse_inline() until the input is exhausted. Each call to parse_inline() finds the next special character, emits any preceding text as a CMARK_NODE_TEXT, and dispatches to the appropriate handler.

After all characters are processed, the delimiter stack is processed to resolve any remaining emphasis, and then cleaned up.

Cross-References

inlines.c — Full implementation
inlines.h — Internal API declarations
block-parsing.md — Phase 1 that produces the input for inline parsing
reference-system.md — How link references are stored and looked up
scanner-system.md — Scanner functions for HTML tags, autolinks, etc.
utf8-handling.md — Unicode character classification for flanking rules