cmark — Architecture
High-Level Design
cmark implements a two-phase parsing pipeline that converts CommonMark Markdown into an Abstract Syntax Tree (AST), which can then be rendered into multiple output formats. The design separates concerns cleanly: block-level structure is identified first, then inline content is parsed within the appropriate blocks.
Input Text (UTF-8)
│
▼
┌────────────────────┐
│ S_parser_feed │ Split input into lines (blocks.c)
│ │ Handle UTF-8 BOM, CR/LF normalization
└────────┬───────────┘
│
▼
┌────────────────────┐
│ S_process_line │ Line-by-line block structure analysis (blocks.c)
│ │ Open/close containers, detect leaf blocks
└────────┬───────────┘
│
▼
┌────────────────────┐
│ finalize_document │ Close all open blocks (blocks.c)
│ │ Resolve reference link definitions
└────────┬───────────┘
│
▼
┌────────────────────┐
│ process_inlines │ Parse inline content in paragraphs/headings (blocks.c → inlines.c)
│ │ Delimiter stack algorithm for emphasis
│ │ Bracket stack for links/images
└────────┬───────────┘
│
▼
┌─────────────────────────────┐
│ AST (cmark_node tree) │
└────────┬────────────────────┘
│
▼
┌───────────────────┐
│ Renderer │ Iterator-driven traversal
│ (html/xml/ │ Enter/Exit events per node
│ latex/man/cm) │
└───────────────────┘
│
▼
Output String
Module Dependency Graph
The internal header dependencies reveal the layered architecture:
cmark.h (public API — types, enums, function declarations)
├── cmark_export.h (generated — DLL export macros)
└── cmark_version.h (generated — version constants)
node.h (internal — struct cmark_node)
├── cmark.h
└── buffer.h
parser.h (internal — struct cmark_parser)
├── references.h
├── node.h
└── buffer.h
iterator.h (internal — struct cmark_iter)
└── cmark.h
render.h (internal — struct cmark_renderer)
└── buffer.h
buffer.h (internal — cmark_strbuf)
└── cmark.h
chunk.h (internal — cmark_chunk)
├── cmark.h
├── buffer.h
└── cmark_ctype.h
references.h (internal — cmark_reference_map)
└── chunk.h
inlines.h (internal — inline parsing API)
├── chunk.h
└── references.h
scanners.h (internal — scanner function declarations)
├── cmark.h
└── chunk.h
houdini.h (internal — HTML/URL escaping)
└── buffer.h
cmark_ctype.h (internal — locale-independent char classification)
(no cmark dependencies)
utf8.h (internal — UTF-8 processing)
└── buffer.h
Phase 1: Block Structure (blocks.c)
Block parsing operates on a state machine maintained in the cmark_parser struct (defined in parser.h):
struct cmark_parser {
struct cmark_mem *mem; // Memory allocator
struct cmark_reference_map *refmap; // Link reference definitions
struct cmark_node *root; // Document root node
struct cmark_node *current; // Deepest open block
int line_number; // Current line being processed
bufsize_t offset; // Byte position in current line
bufsize_t column; // Virtual column (tabs expanded)
bufsize_t first_nonspace; // Position of first non-whitespace
bufsize_t first_nonspace_column; // Column of first non-whitespace
bufsize_t thematic_break_kill_pos; // Optimization for thematic break scanning
int indent; // Indentation level (first_nonspace_column - column)
bool blank; // Whether current line is blank
bool partially_consumed_tab; // Tab only partially used for indentation
cmark_strbuf curline; // Current line being processed
bufsize_t last_line_length; // Length of previous line (for end_column)
cmark_strbuf linebuf; // Buffer for accumulating partial lines across feeds
cmark_strbuf content; // Accumulated content for the current open block
int options; // Option flags
bool last_buffer_ended_with_cr; // For CR/LF handling across buffer boundaries
unsigned int total_size; // Total bytes fed (for reference expansion limiting)
};
Line Processing Flow
For each line, S_process_line() does the following:
- Increment line number, store current line in
parser->curline. - Check open blocks (
check_open_blocks()): Walk through the tree from root to the deepest open node. For each open container node, try to match the expected line prefix:- Block quote: expect
>(optionally preceded by up to 3 spaces) - List item: expect indentation matching
marker_offset + padding - Code block (fenced): check for closing fence or skip fence offset spaces
- Code block (indented): expect 4+ spaces of indentation
- HTML block: check type-specific continuation rules
- Block quote: expect
- Try new container starts: If not all open blocks matched, check if the current line starts a new container (block quote, list item).
- Try new leaf blocks: If the line doesn't continue an existing block or start a new container, check for:
- ATX heading (lines starting with 1-6
#characters) - Setext heading (underlines of
=or-following a paragraph) - Thematic break (3+
*,-, or_on a line by themselves) - Fenced code block (3+ backticks or tildes)
- HTML block (7 different start patterns)
- Indented code block (4+ spaces of indentation)
- ATX heading (lines starting with 1-6
- Add line content: For blocks that accept lines (paragraph, heading, code block), append the line content to
parser->content. - Handle lazy continuation: Paragraphs support lazy continuation where a non-blank line can continue a paragraph even without matching container prefixes.
Finalization
When a block is closed (either explicitly or because a new block replaces it), finalize() is called:
- Paragraphs: Reference link definitions at the start are extracted and stored in
parser->refmap. If only references remain, the paragraph node is deleted. - Code blocks (fenced): The first line becomes the info string; remaining content becomes the code body.
- Code blocks (indented): Trailing blank lines are removed.
- Lists: Tight/loose status is determined by checking for blank lines between items and their children.
Phase 2: Inline Parsing (inlines.c)
After all block structure is finalized, process_inlines() walks the AST with an iterator and calls cmark_parse_inlines() for every node whose type contains_inlines() — specifically, CMARK_NODE_PARAGRAPH and CMARK_NODE_HEADING.
The inline parser uses a subject struct that tracks:
typedef struct {
cmark_mem *mem;
cmark_chunk input; // The text to parse
unsigned flags; // Skip flags for HTML constructs
int line; // Source line number
bufsize_t pos; // Current position in input
int block_offset; // Column offset of containing block
int column_offset; // Adjustment for multi-line inlines
cmark_reference_map *refmap; // Reference definitions
delimiter *last_delim; // Top of delimiter stack
bracket *last_bracket; // Top of bracket stack
bufsize_t backticks[MAXBACKTICKS + 1]; // Cache of backtick positions
bool scanned_for_backticks; // Whether full backtick scan done
bool no_link_openers; // Optimization flag
} subject;
Delimiter Stack Algorithm
Emphasis (*, _) and smart quotes (', ") use a delimiter stack. When a run of delimiter characters is found:
scan_delims()determines whether the run can open and/or close emphasis, based on Unicode-aware flanking rules (checking whether surrounding characters are spaces or punctuation usingcmark_utf8proc_is_space()andcmark_utf8proc_is_punctuation_or_symbol()).- The delimiter is pushed onto the stack as a
delimiterstruct. - When a closing delimiter is found, the stack is scanned backwards for a matching opener, and
S_insert_emph()createsCMARK_NODE_EMPHorCMARK_NODE_STRONGnodes.
Bracket Stack Algorithm
Links and images use a separate bracket stack:
[pushes a bracket entry;![pushes one marked asimage = true.- When
]is encountered, the bracket stack is searched for a matching opener. - If found, the parser looks for
(url "title")or[ref]after the]. - For reference-style links,
cmark_reference_lookup()is called against the parser'srefmap.
Phase 3: AST Rendering
All renderers traverse the AST using the iterator system. There are two rendering architectures:
Direct Renderers (no framework)
- HTML (
html.c): Usescmark_strbufdirectly. TheS_render_node()function handles enter/exit events in a large switch statement. HTML escaping is done viahoudini_escape_html(). - XML (
xml.c): Similar direct approach with XML-specific escaping and indentation tracking.
Framework Renderers (via render.c)
- LaTeX (
latex.c), man (man.c), CommonMark (commonmark.c): These use thecmark_render()generic framework, which provides:- Line wrapping at a configurable width
- Prefix management for indented output (block quotes, list items)
- Breakpoint tracking for intelligent line breaking
- Escape dispatch via function pointers (
outc)
The framework signature:
char *cmark_render(cmark_node *root, int options, int width,
void (*outc)(cmark_renderer *, cmark_escaping, int32_t, unsigned char),
int (*render_node)(cmark_renderer *, cmark_node *,
cmark_event_type, int));
Each format-specific renderer supplies its own outc (character-level escaping) and render_node (node-level output) callback functions.
Key Design Decisions
Owning vs. Non-Owning Strings
cmark uses two string types:
cmark_strbuf(buffer.h): Owning, growable byte buffer. Used for accumulating output and parser state. Memory is managed via thecmark_memallocator.cmark_chunk(chunk.h): Non-owning slice (pointer + length). Used for referencing substrings of the input during parsing without copying.
Node Memory Layout
Every cmark_node uses a discriminated union (node->as) to store type-specific data without separate allocations:
union {
cmark_list list; // list marker, start, tight, delimiter
cmark_code code; // info string, fence char/length/offset
cmark_heading heading; // level, setext flag, internal_offset
cmark_link link; // url, title
cmark_custom custom; // on_enter, on_exit
int html_block_type; // HTML block type (1-7)
} as;
Open Block Tracking
During block parsing, open blocks are tracked via the CMARK_NODE__OPEN flag in node->flags. The parser maintains a current pointer to the deepest open block. When new blocks are created, they're added as children of the appropriate open container. When blocks are finalized (closed), control returns to the parent.
Reference Expansion Limiting
To prevent superlinear growth from adversarial reference definitions, parser->total_size tracks total bytes fed. After finalization, parser->refmap->max_ref_size is set to MAX(total_size, 100000), and each reference lookup deducts the reference's size from the available budget.
Error Handling
cmark follows a defensive programming model:
- NULL checks on all public API entry points (return 0 or NULL for invalid arguments)
assert()for internal invariants (only active in debug builds with-DCMARK_DEBUG_NODES)- Abort-on-allocation-failure in the default memory allocator
- No exceptions (pure C99)
- Invalid UTF-8 sequences are replaced with U+FFFD (when
CMARK_OPT_VALIDATE_UTF8is set)
Thread Safety
cmark is not thread-safe for concurrent access to the same parser or node tree. However, separate parser instances and separate node trees can be used in parallel from different threads, as there is no global mutable state (the DEFAULT_MEM_ALLOCATOR is read-only after initialization).
Cross-References
- block-parsing.md — Detailed block-level parsing logic
- inline-parsing.md — Delimiter and bracket stack algorithms
- ast-node-system.md — Node struct internals
- render-framework.md — Generic render engine
- memory-management.md — Allocator and buffer details
- iterator-system.md — AST traversal mechanics