cmark — Block Parsing
Overview
Block parsing is Phase 1 of cmark's two-phase parsing pipeline. Implemented in blocks.c, it processes the input line-by-line, identifying block-level document structure: paragraphs, headings, code blocks, block quotes, lists, thematic breaks, and HTML blocks. The result is a tree of cmark_node block nodes with accumulated text content. Inline parsing occurs in Phase 2.
The algorithm follows the CommonMark specification's description at http://spec.commonmark.org/0.24/#phase-1-block-structure.
Key Constants
#define CODE_INDENT 4 // Spaces required for indented code block
#define TAB_STOP 4 // Tab stop width for column calculation
Parser State
The parser state is maintained in the cmark_parser struct (from parser.h). During line processing, these fields track the current position:
offset— byte position in the current linecolumn— virtual column number (tabs expanded toTAB_STOPboundaries)first_nonspace— byte position of first non-whitespace characterfirst_nonspace_column— column of first non-whitespace characterindent— the differencefirst_nonspace_column - column, representing effective indentationblank— whether the line is blank (only whitespace before line end)partially_consumed_tab— set when a tab is only partially used for indentation
Input Feeding: S_parser_feed()
The entry point for input is S_parser_feed(), which splits raw input into lines:
static void S_parser_feed(cmark_parser *parser, const unsigned char *buffer,
size_t len, bool eof);
Line Splitting Logic
The function scans for line-ending characters (\n, \r) and processes complete lines via S_process_line(). Partial lines are accumulated in parser->linebuf.
Key handling:
- UTF-8 BOM: Skipped if found at the start of the first line (3-byte sequence
0xEF 0xBB 0xBF). - CR/LF across buffer boundaries: If the previous buffer ended with
\rand the next starts with\n, the\nis skipped. - NULL bytes: Replaced with the UTF-8 replacement character (U+FFFD,
0xEF 0xBF 0xBD). - Total size tracking:
parser->total_sizeaccumulates bytes fed, capped atUINT_MAX, used later for reference expansion limiting.
Line Termination
Each line is terminated at \n, \r, or \r\n. The line content passed to S_process_line() does NOT include the line-ending characters themselves.
Line Processing: S_process_line()
The main per-line processing function. For each line, it:
- Stores the line in
parser->curline - Creates a
cmark_chunkwrapper for the line data - Increments
parser->line_number - Calls
check_open_blocks()to match existing containers - Attempts to open new containers and leaf blocks
- Adds line content to the appropriate block
Step 1: Check Open Blocks
static cmark_node *check_open_blocks(cmark_parser *parser, cmark_chunk *input,
bool *all_matched);
Starting from the document root, this walks through the tree of open blocks (following last_child pointers). For each open container, it tries to match the expected line prefix.
The matching rules for each container type:
Block Quote
static bool parse_block_quote_prefix(cmark_parser *parser, cmark_chunk *input);
Expects > preceded by up to 3 spaces of indentation. After matching the >, optionally consumes one space or tab after it.
List Item
static bool parse_node_item_prefix(cmark_parser *parser, cmark_chunk *input,
cmark_node *container);
Expects indentation of at least marker_offset + padding characters. If the line is blank and the item has at least one child, the item continues (lazy continuation).
Fenced Code Block
static bool parse_code_block_prefix(cmark_parser *parser, cmark_chunk *input,
cmark_node *container, bool *should_continue);
For fenced code blocks: checks if the line is a closing fence (same fence char, length >= opening fence length, preceded by up to 3 spaces). If it is, the block is finalized. Otherwise, skips up to fence_offset spaces and continues.
For indented code blocks: requires 4+ spaces of indentation, or a blank line.
HTML Block
static bool parse_html_block_prefix(cmark_parser *parser, cmark_node *container);
HTML block types 1-5 accept blank lines (continue until end condition is met). Types 6-7 do NOT accept blank lines.
Step 2: New Container Starts
If not all open blocks were matched (!all_matched), the parser checks if the unmatched portion of the line starts a new container:
- Block quote: Line starts with
>(preceded by up to 3 spaces) - List item: Line starts with a list marker (bullet character or ordered number + delimiter)
Step 3: New Leaf Blocks
The parser checks for new leaf block starts using scanner functions:
- ATX heading:
scan_atx_heading_start()— lines starting with 1-6#characters - Fenced code block:
scan_open_code_fence()— 3+ backticks or tildes - HTML block:
scan_html_block_start()andscan_html_block_start_7()— 7 different HTML start patterns - Setext heading:
scan_setext_heading_line()— underlines of=or-(only when following a paragraph) - Thematic break:
S_scan_thematic_break()— 3+*,-, or_characters
Step 4: Content Accumulation
For blocks that accept lines (accepts_lines() returns true for paragraphs, headings, and code blocks), the line content is appended to parser->content via add_line():
static void add_line(cmark_chunk *ch, cmark_parser *parser) {
int chars_to_tab;
int i;
if (parser->partially_consumed_tab) {
parser->offset += 1; // skip over tab
chars_to_tab = TAB_STOP - (parser->column % TAB_STOP);
for (i = 0; i < chars_to_tab; i++) {
cmark_strbuf_putc(&parser->content, ' ');
}
}
cmark_strbuf_put(&parser->content, ch->data + parser->offset,
ch->len - parser->offset);
}
When a tab is only partially consumed (e.g., the tab represents 4 columns but only 1 was needed for indentation), the remaining columns are emitted as spaces.
Adding Child Blocks
static cmark_node *add_child(cmark_parser *parser, cmark_node *parent,
cmark_node_type block_type, int start_column);
When a new block is detected, add_child() creates it:
- If the parent can't contain the new block type (checked via
can_contain()), the parent is finalized and the function moves up the tree until it finds a suitable ancestor. - A new node is created with
make_block()(which setsCMARK_NODE__OPEN). - The node is linked as the last child of the parent.
Container Acceptance Rules
static inline bool can_contain(cmark_node_type parent_type,
cmark_node_type child_type) {
return (parent_type == CMARK_NODE_DOCUMENT ||
parent_type == CMARK_NODE_BLOCK_QUOTE ||
parent_type == CMARK_NODE_ITEM ||
(parent_type == CMARK_NODE_LIST && child_type == CMARK_NODE_ITEM));
}
Only documents, block quotes, list items, and lists (for items only) can contain other blocks.
List Item Parsing
static bufsize_t parse_list_marker(cmark_mem *mem, cmark_chunk *input,
bufsize_t pos, bool interrupts_paragraph,
cmark_list **dataptr);
This function detects list markers:
Bullet markers: *, -, or + followed by whitespace.
Ordered markers: Up to 9 digits followed by . or ) and whitespace. The 9-digit limit prevents integer overflow (max value ~999,999,999 fits in a 32-bit int).
Paragraph interruption rules: When interrupts_paragraph is true (the marker would interrupt a preceding paragraph):
- Bullet markers require non-blank content after them
- Ordered markers must start at 1
List Matching
static int lists_match(cmark_list *list_data, cmark_list *item_data) {
return (list_data->list_type == item_data->list_type &&
list_data->delimiter == item_data->delimiter &&
list_data->bullet_char == item_data->bullet_char);
}
Two list items belong to the same list only if they share the same list type, delimiter style, and bullet character. This means - item and * item create separate lists.
Offset Advancement
static void S_advance_offset(cmark_parser *parser, cmark_chunk *input,
bufsize_t count, bool columns);
This function advances parser->offset and parser->column. The columns parameter determines whether count measures bytes or virtual columns. Tab expansion is handled here:
- When counting columns and a tab appears,
chars_to_tab = TAB_STOP - (column % TAB_STOP)determines how many columns the tab represents - If only part of the tab is consumed (advancing fewer columns than the tab provides),
parser->partially_consumed_tabis set
Finding First Non-Space
static void S_find_first_nonspace(cmark_parser *parser, cmark_chunk *input);
Scans from parser->offset forward, setting:
parser->first_nonspace— byte positionparser->first_nonspace_column— column of first non-whitespaceparser->indent—first_nonspace_column - columnparser->blank— whether the line is blank
This function is idempotent — it won't re-scan if first_nonspace > offset.
Thematic Break Detection
static int S_scan_thematic_break(cmark_parser *parser, cmark_chunk *input,
bufsize_t offset);
Checks for 3 or more *, _, or - characters (optionally separated by spaces/tabs) on a line by themselves. Uses parser->thematic_break_kill_pos as an optimization to avoid re-scanning positions that already failed.
ATX Heading Trailing Hash Removal
static void chop_trailing_hashtags(cmark_chunk *ch);
After an ATX heading line is identified, trailing # characters are removed from the content if they're preceded by a space. This implements the CommonMark rule that ## Heading ## renders as "Heading" without trailing # marks.
Block Finalization
static cmark_node *finalize(cmark_parser *parser, cmark_node *b);
When a block is closed (no longer accepting content), finalize() processes its accumulated content:
Paragraph Finalization
Reference link definitions at the start are extracted:
static bool resolve_reference_link_definitions(cmark_parser *parser);
This repeatedly calls cmark_parse_reference_inline() from inlines.c to parse reference definitions like [label]: url "title". If the paragraph becomes empty after extracting all references, the paragraph node is deleted.
Code Block Finalization
- Fenced: The first line becomes the info string (after HTML unescaping and trimming). Remaining content becomes the code body.
- Indented: Trailing blank lines are removed, and a final newline is appended.
Heading and HTML Block Finalization
Content is simply detached from the parser's content buffer and stored in data.
List Finalization
Determines tight/loose status by checking:
- Non-final, non-empty list items ending with a blank line → loose
- Children of list items that end with blank lines (checked recursively via
S_ends_with_blank_line()) → loose - Otherwise → tight
Document Finalization
static cmark_node *finalize_document(cmark_parser *parser);
Called by cmark_parser_finish():
- All open blocks are finalized by walking from
parser->currentup toparser->root. - The root document is finalized.
- Reference expansion limit is set:
refmap->max_ref_size = MAX(parser->total_size, 100000). process_inlines()is called, which uses an iterator to find all nodes that contain inlines (paragraphs and headings) and callscmark_parse_inlines()on each.- After inline parsing, the content buffer of each processed node is freed.
Inline Content Detection
static inline bool contains_inlines(cmark_node_type block_type) {
return (block_type == CMARK_NODE_PARAGRAPH ||
block_type == CMARK_NODE_HEADING);
}
Only paragraphs and headings have their string content parsed for inline elements. Code blocks, HTML blocks, and other leaf nodes preserve their content as-is.
Lazy Continuation Lines
The CommonMark spec defines "lazy continuation lines" — lines that continue a paragraph without matching all container prefixes. For example:
> This is a block quote
with a lazy continuation line
The second line doesn't start with > but still belongs to the paragraph inside the block quote. The parser handles this by checking whether the line could be added to an existing open paragraph rather than closing and starting a new one.
Cross-References
- parser.h — Parser struct definition
- blocks.c — Full implementation
- inline-parsing.md — Phase 2 parsing
- scanner-system.md — Scanner functions used for block detection
- reference-system.md — How reference definitions are extracted
- ast-node-system.md — Node creation and tree structure