cmark — AST Node System
Overview
The AST (Abstract Syntax Tree) node system is defined across node.h (internal struct definitions) and node.c (node creation, destruction, accessor functions, and tree manipulation). Every element in a parsed CommonMark document is represented as a cmark_node. Nodes form a tree via parent/child/sibling pointers, with type-specific data stored in a discriminated union.
The cmark_node Struct
Defined in node.h, this is the central data structure of the entire library:
struct cmark_node {
cmark_mem *mem; // Memory allocator used for this node
struct cmark_node *next; // Next sibling
struct cmark_node *prev; // Previous sibling
struct cmark_node *parent; // Parent node
struct cmark_node *first_child; // First child
struct cmark_node *last_child; // Last child
void *user_data; // Arbitrary user-attached data
unsigned char *data; // String content (for text, code, HTML)
bufsize_t len; // Length of data
int start_line; // Source position: starting line (1-based)
int start_column; // Source position: starting column (1-based)
int end_line; // Source position: ending line
int end_column; // Source position: ending column
uint16_t type; // Node type (cmark_node_type enum value)
uint16_t flags; // Internal flags (open, last-line-blank, etc.)
union {
cmark_list list; // List-specific data
cmark_code code; // Code block-specific data
cmark_heading heading; // Heading-specific data
cmark_link link; // Link/image-specific data
cmark_custom custom; // Custom block/inline data
int html_block_type; // HTML block type (1-7)
} as;
};
The union as means each node only occupies memory for one type-specific payload, keeping the struct compact. The largest union member determines the node's size.
Type-Specific Structs
cmark_list — List Properties
typedef struct {
int marker_offset; // Indentation of list marker from left margin
int padding; // Total indentation (marker + content offset)
int start; // Starting number for ordered lists (0 for bullet)
unsigned char list_type; // CMARK_BULLET_LIST or CMARK_ORDERED_LIST
unsigned char delimiter; // CMARK_PERIOD_DELIM, CMARK_PAREN_DELIM, or CMARK_NO_DELIM
unsigned char bullet_char;// '*', '-', or '+' for bullet lists
bool tight; // Whether the list is tight (no blank lines between items)
} cmark_list;
marker_offset and padding are used during block parsing to track indentation levels for list continuation. The tight flag is determined during block finalization by checking whether blank lines appear between list items or their children.
cmark_code — Code Block Properties
typedef struct {
unsigned char *info; // Info string (language hint, e.g., "python")
uint8_t fence_length; // Length of opening fence (3+ backticks or tildes)
uint8_t fence_offset; // Indentation of fence from left margin
unsigned char fence_char; // '`' or '~'
int8_t fenced; // Whether this is a fenced code block (vs. indented)
} cmark_code;
For indented code blocks, fenced is 0, and info, fence_length, fence_char, and fence_offset are unused. For fenced code blocks, info is extracted from the first line of the opening fence and stored as a separately allocated string.
cmark_heading — Heading Properties
typedef struct {
int internal_offset; // Internal offset within the heading content
int8_t level; // Heading level (1-6)
bool setext; // Whether this is a setext-style heading (underlined)
} cmark_heading;
ATX headings (# Heading) have setext = false. Setext headings (underlined with = or -) have setext = true. The level field is shared and defaults to 1 when a heading node is created.
cmark_link — Link and Image Properties
typedef struct {
unsigned char *url; // Destination URL (separately allocated)
unsigned char *title; // Optional title text (separately allocated)
} cmark_link;
Both url and title are separately allocated strings that must be freed when the node is destroyed. This struct is used for both CMARK_NODE_LINK and CMARK_NODE_IMAGE.
cmark_custom — Custom Block/Inline Properties
typedef struct {
unsigned char *on_enter; // Literal text rendered when entering the node
unsigned char *on_exit; // Literal text rendered when leaving the node
} cmark_custom;
Custom nodes allow embedding arbitrary content in the AST for extensions. Both strings are separately allocated.
Internal Flags
The flags field uses bit flags defined in the cmark_node__internal_flags enum:
enum cmark_node__internal_flags {
CMARK_NODE__OPEN = (1 << 0), // Block is still open (accepting content)
CMARK_NODE__LAST_LINE_BLANK = (1 << 1), // Last line of this block was blank
CMARK_NODE__LAST_LINE_CHECKED = (1 << 2), // blank-line status has been computed
CMARK_NODE__LIST_LAST_LINE_BLANK = (1 << 3), // (unused/reserved)
};
CMARK_NODE__OPEN: Set when a block is created during parsing. Cleared byfinalize()when the block is closed. The parser'scurrentpointer always points to a node with this flag set.CMARK_NODE__LAST_LINE_BLANK: Set/cleared byS_set_last_line_blank()inblocks.cto track whether the most recent line added to this block was blank. Used for determining list tightness.CMARK_NODE__LAST_LINE_CHECKED: Prevents redundant traversal when checkingS_ends_with_blank_line(), which recursively descends into list items.
Node Creation
cmark_node_new_with_mem()
The primary creation function (in node.c):
cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem) {
cmark_node *node = (cmark_node *)mem->calloc(1, sizeof(*node));
node->mem = mem;
node->type = (uint16_t)type;
switch (node->type) {
case CMARK_NODE_HEADING:
node->as.heading.level = 1;
break;
case CMARK_NODE_LIST: {
cmark_list *list = &node->as.list;
list->list_type = CMARK_BULLET_LIST;
list->start = 0;
list->tight = false;
break;
}
default:
break;
}
return node;
}
The calloc() zeroes all fields, so pointers start as NULL and numeric fields as 0. Only heading and list nodes need explicit default initialization.
make_block() — Parser-Internal Creation
During block parsing, make_block() in blocks.c creates nodes with source position and the CMARK_NODE__OPEN flag:
static cmark_node *make_block(cmark_mem *mem, cmark_node_type tag,
int start_line, int start_column) {
cmark_node *e;
e = (cmark_node *)mem->calloc(1, sizeof(*e));
e->mem = mem;
e->type = (uint16_t)tag;
e->flags = CMARK_NODE__OPEN;
e->start_line = start_line;
e->start_column = start_column;
e->end_line = start_line;
return e;
}
Inline Node Creation
The inline parser in inlines.c uses two factory functions:
// Create an inline with string content (text, code, HTML)
static inline cmark_node *make_literal(subject *subj, cmark_node_type t,
int start_column, int end_column) {
cmark_node *e = (cmark_node *)subj->mem->calloc(1, sizeof(*e));
e->mem = subj->mem;
e->type = (uint16_t)t;
e->start_line = e->end_line = subj->line;
e->start_column = start_column + 1 + subj->column_offset + subj->block_offset;
e->end_column = end_column + 1 + subj->column_offset + subj->block_offset;
return e;
}
// Create an inline with no value (emphasis, strong, etc.)
static inline cmark_node *make_simple(cmark_mem *mem, cmark_node_type t) {
cmark_node *e = (cmark_node *)mem->calloc(1, sizeof(*e));
e->mem = mem;
e->type = t;
return e;
}
Node Destruction
S_free_nodes() — Iterative Subtree Freeing
The S_free_nodes() function in node.c avoids recursion by splicing children into a flat linked list:
static void S_free_nodes(cmark_node *e) {
cmark_mem *mem = e->mem;
cmark_node *next;
while (e != NULL) {
switch (e->type) {
case CMARK_NODE_CODE_BLOCK:
mem->free(e->data);
mem->free(e->as.code.info);
break;
case CMARK_NODE_TEXT:
case CMARK_NODE_HTML_INLINE:
case CMARK_NODE_CODE:
case CMARK_NODE_HTML_BLOCK:
mem->free(e->data);
break;
case CMARK_NODE_LINK:
case CMARK_NODE_IMAGE:
mem->free(e->as.link.url);
mem->free(e->as.link.title);
break;
case CMARK_NODE_CUSTOM_BLOCK:
case CMARK_NODE_CUSTOM_INLINE:
mem->free(e->as.custom.on_enter);
mem->free(e->as.custom.on_exit);
break;
default:
break;
}
if (e->last_child) {
// Splice children into list for flat iteration
e->last_child->next = e->next;
e->next = e->first_child;
}
next = e->next;
mem->free(e);
e = next;
}
}
This splicing technique converts the tree into a flat list, allowing O(n) iterative freeing without a recursion stack. For each node with children, the children are prepended to the remaining list by connecting last_child->next to e->next and e->next to first_child.
Containership Rules
The S_can_contain() function in node.c enforces which node types can contain which children:
static bool S_can_contain(cmark_node *node, cmark_node *child) {
// Ancestor loop detection
if (child->first_child != NULL) {
cmark_node *cur = node->parent;
while (cur != NULL) {
if (cur == child) return false;
cur = cur->parent;
}
}
// Documents cannot be children
if (child->type == CMARK_NODE_DOCUMENT) return false;
switch (node->type) {
case CMARK_NODE_DOCUMENT:
case CMARK_NODE_BLOCK_QUOTE:
case CMARK_NODE_ITEM:
return cmark_node_is_block(child) && child->type != CMARK_NODE_ITEM;
case CMARK_NODE_LIST:
return child->type == CMARK_NODE_ITEM;
case CMARK_NODE_CUSTOM_BLOCK:
return true; // Custom blocks can contain anything
case CMARK_NODE_PARAGRAPH:
case CMARK_NODE_HEADING:
case CMARK_NODE_EMPH:
case CMARK_NODE_STRONG:
case CMARK_NODE_LINK:
case CMARK_NODE_IMAGE:
case CMARK_NODE_CUSTOM_INLINE:
return cmark_node_is_inline(child);
default:
break;
}
return false;
}
Key rules:
- Document, block quote, list item: Can contain any block except items
- List: Can only contain items
- Custom block: Can contain anything (no restrictions)
- Paragraph, heading, emphasis, strong, link, image, custom inline: Can only contain inline nodes
- Leaf blocks (thematic break, code block, HTML block): Cannot contain anything
Tree Manipulation
Unlinking
The internal S_node_unlink() function detaches a node from its parent and siblings:
static void S_node_unlink(cmark_node *node) {
if (node->prev) {
node->prev->next = node->next;
}
if (node->next) {
node->next->prev = node->prev;
}
// Update parent's first_child / last_child pointers
if (node->parent) {
if (node->parent->first_child == node)
node->parent->first_child = node->next;
if (node->parent->last_child == node)
node->parent->last_child = node->prev;
}
node->next = NULL;
node->prev = NULL;
node->parent = NULL;
}
String Setting Helper
The cmark_set_cstr() function manages string assignment with proper memory handling:
static bufsize_t cmark_set_cstr(cmark_mem *mem, unsigned char **dst,
const char *src) {
unsigned char *old = *dst;
bufsize_t len;
if (src && src[0]) {
len = (bufsize_t)strlen(src);
*dst = (unsigned char *)mem->realloc(NULL, len + 1);
memcpy(*dst, src, len + 1);
} else {
len = 0;
*dst = NULL;
}
if (old) {
mem->free(old);
}
return len;
}
This function allocates a new copy of the source string, assigns it, then frees the old value — ensuring no memory leaks even when overwriting existing data.
Node Data Storage Pattern
Nodes store their text content in two ways depending on type:
-
Direct storage (
data+len): Used byCMARK_NODE_TEXT,CMARK_NODE_CODE,CMARK_NODE_CODE_BLOCK,CMARK_NODE_HTML_BLOCK, andCMARK_NODE_HTML_INLINE. Thedatafield points to a separately allocated buffer containing the text content. -
Union storage (
as.*): Used by lists, code blocks (for the info string), headings, links/images, and custom nodes. These store structured data rather than raw text. -
Hybrid:
CMARK_NODE_CODE_BLOCKuses both —datafor the code content andas.code.infofor the info string.
The cmark_node_check() Function
For debug builds, cmark_node_check() validates the structural integrity of the tree. It checks that parent/child/sibling pointers are consistent and that the tree forms a valid structure. It returns the number of errors found and prints details to the provided FILE*.
Cross-References
- node.h — Struct definitions
- node.c — Implementation
- iterator-system.md — How nodes are traversed
- block-parsing.md — How block nodes are created during parsing
- inline-parsing.md — How inline nodes are created
- memory-management.md — Allocator integration