cmark — AST Node System

Overview

The AST (Abstract Syntax Tree) node system is defined across node.h (internal struct definitions) and node.c (node creation, destruction, accessor functions, and tree manipulation). Every element in a parsed CommonMark document is represented as a cmark_node. Nodes form a tree via parent/child/sibling pointers, with type-specific data stored in a discriminated union.

The cmark_node Struct

Defined in node.h, this is the central data structure of the entire library:

struct cmark_node {
  cmark_mem *mem;                   // Memory allocator used for this node

  struct cmark_node *next;          // Next sibling
  struct cmark_node *prev;          // Previous sibling
  struct cmark_node *parent;        // Parent node
  struct cmark_node *first_child;   // First child
  struct cmark_node *last_child;    // Last child

  void *user_data;                  // Arbitrary user-attached data

  unsigned char *data;              // String content (for text, code, HTML)
  bufsize_t len;                    // Length of data

  int start_line;                   // Source position: starting line (1-based)
  int start_column;                 // Source position: starting column (1-based)
  int end_line;                     // Source position: ending line
  int end_column;                   // Source position: ending column
  uint16_t type;                    // Node type (cmark_node_type enum value)
  uint16_t flags;                   // Internal flags (open, last-line-blank, etc.)

  union {
    cmark_list list;                // List-specific data
    cmark_code code;                // Code block-specific data
    cmark_heading heading;          // Heading-specific data
    cmark_link link;                // Link/image-specific data
    cmark_custom custom;            // Custom block/inline data
    int html_block_type;            // HTML block type (1-7)
  } as;
};

The union as means each node only occupies memory for one type-specific payload, keeping the struct compact. The largest union member determines the node's size.

Type-Specific Structs

cmark_list — List Properties

typedef struct {
  int marker_offset;        // Indentation of list marker from left margin
  int padding;              // Total indentation (marker + content offset)
  int start;                // Starting number for ordered lists (0 for bullet)
  unsigned char list_type;  // CMARK_BULLET_LIST or CMARK_ORDERED_LIST
  unsigned char delimiter;  // CMARK_PERIOD_DELIM, CMARK_PAREN_DELIM, or CMARK_NO_DELIM
  unsigned char bullet_char;// '*', '-', or '+' for bullet lists
  bool tight;               // Whether the list is tight (no blank lines between items)
} cmark_list;

marker_offset and padding are used during block parsing to track indentation levels for list continuation. The tight flag is determined during block finalization by checking whether blank lines appear between list items or their children.

cmark_code — Code Block Properties

typedef struct {
  unsigned char *info;      // Info string (language hint, e.g., "python")
  uint8_t fence_length;     // Length of opening fence (3+ backticks or tildes)
  uint8_t fence_offset;     // Indentation of fence from left margin
  unsigned char fence_char; // '`' or '~'
  int8_t fenced;            // Whether this is a fenced code block (vs. indented)
} cmark_code;

For indented code blocks, fenced is 0, and info, fence_length, fence_char, and fence_offset are unused. For fenced code blocks, info is extracted from the first line of the opening fence and stored as a separately allocated string.

cmark_heading — Heading Properties

typedef struct {
  int internal_offset;      // Internal offset within the heading content
  int8_t level;             // Heading level (1-6)
  bool setext;              // Whether this is a setext-style heading (underlined)
} cmark_heading;

ATX headings (# Heading) have setext = false. Setext headings (underlined with = or -) have setext = true. The level field is shared and defaults to 1 when a heading node is created.

cmark_link — Link and Image Properties

typedef struct {
  unsigned char *url;       // Destination URL (separately allocated)
  unsigned char *title;     // Optional title text (separately allocated)
} cmark_link;

Both url and title are separately allocated strings that must be freed when the node is destroyed. This struct is used for both CMARK_NODE_LINK and CMARK_NODE_IMAGE.

cmark_custom — Custom Block/Inline Properties

typedef struct {
  unsigned char *on_enter;  // Literal text rendered when entering the node
  unsigned char *on_exit;   // Literal text rendered when leaving the node
} cmark_custom;

Custom nodes allow embedding arbitrary content in the AST for extensions. Both strings are separately allocated.

Internal Flags

The flags field uses bit flags defined in the cmark_node__internal_flags enum:

enum cmark_node__internal_flags {
  CMARK_NODE__OPEN              = (1 << 0),  // Block is still open (accepting content)
  CMARK_NODE__LAST_LINE_BLANK   = (1 << 1),  // Last line of this block was blank
  CMARK_NODE__LAST_LINE_CHECKED = (1 << 2),  // blank-line status has been computed
  CMARK_NODE__LIST_LAST_LINE_BLANK = (1 << 3), // (unused/reserved)
};
  • CMARK_NODE__OPEN: Set when a block is created during parsing. Cleared by finalize() when the block is closed. The parser's current pointer always points to a node with this flag set.
  • CMARK_NODE__LAST_LINE_BLANK: Set/cleared by S_set_last_line_blank() in blocks.c to track whether the most recent line added to this block was blank. Used for determining list tightness.
  • CMARK_NODE__LAST_LINE_CHECKED: Prevents redundant traversal when checking S_ends_with_blank_line(), which recursively descends into list items.

Node Creation

cmark_node_new_with_mem()

The primary creation function (in node.c):

cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem) {
  cmark_node *node = (cmark_node *)mem->calloc(1, sizeof(*node));
  node->mem = mem;
  node->type = (uint16_t)type;

  switch (node->type) {
  case CMARK_NODE_HEADING:
    node->as.heading.level = 1;
    break;
  case CMARK_NODE_LIST: {
    cmark_list *list = &node->as.list;
    list->list_type = CMARK_BULLET_LIST;
    list->start = 0;
    list->tight = false;
    break;
  }
  default:
    break;
  }

  return node;
}

The calloc() zeroes all fields, so pointers start as NULL and numeric fields as 0. Only heading and list nodes need explicit default initialization.

make_block() — Parser-Internal Creation

During block parsing, make_block() in blocks.c creates nodes with source position and the CMARK_NODE__OPEN flag:

static cmark_node *make_block(cmark_mem *mem, cmark_node_type tag,
                              int start_line, int start_column) {
  cmark_node *e;
  e = (cmark_node *)mem->calloc(1, sizeof(*e));
  e->mem = mem;
  e->type = (uint16_t)tag;
  e->flags = CMARK_NODE__OPEN;
  e->start_line = start_line;
  e->start_column = start_column;
  e->end_line = start_line;
  return e;
}

Inline Node Creation

The inline parser in inlines.c uses two factory functions:

// Create an inline with string content (text, code, HTML)
static inline cmark_node *make_literal(subject *subj, cmark_node_type t,
                                       int start_column, int end_column) {
  cmark_node *e = (cmark_node *)subj->mem->calloc(1, sizeof(*e));
  e->mem = subj->mem;
  e->type = (uint16_t)t;
  e->start_line = e->end_line = subj->line;
  e->start_column = start_column + 1 + subj->column_offset + subj->block_offset;
  e->end_column = end_column + 1 + subj->column_offset + subj->block_offset;
  return e;
}

// Create an inline with no value (emphasis, strong, etc.)
static inline cmark_node *make_simple(cmark_mem *mem, cmark_node_type t) {
  cmark_node *e = (cmark_node *)mem->calloc(1, sizeof(*e));
  e->mem = mem;
  e->type = t;
  return e;
}

Node Destruction

S_free_nodes() — Iterative Subtree Freeing

The S_free_nodes() function in node.c avoids recursion by splicing children into a flat linked list:

static void S_free_nodes(cmark_node *e) {
  cmark_mem *mem = e->mem;
  cmark_node *next;
  while (e != NULL) {
    switch (e->type) {
    case CMARK_NODE_CODE_BLOCK:
      mem->free(e->data);
      mem->free(e->as.code.info);
      break;
    case CMARK_NODE_TEXT:
    case CMARK_NODE_HTML_INLINE:
    case CMARK_NODE_CODE:
    case CMARK_NODE_HTML_BLOCK:
      mem->free(e->data);
      break;
    case CMARK_NODE_LINK:
    case CMARK_NODE_IMAGE:
      mem->free(e->as.link.url);
      mem->free(e->as.link.title);
      break;
    case CMARK_NODE_CUSTOM_BLOCK:
    case CMARK_NODE_CUSTOM_INLINE:
      mem->free(e->as.custom.on_enter);
      mem->free(e->as.custom.on_exit);
      break;
    default:
      break;
    }
    if (e->last_child) {
      // Splice children into list for flat iteration
      e->last_child->next = e->next;
      e->next = e->first_child;
    }
    next = e->next;
    mem->free(e);
    e = next;
  }
}

This splicing technique converts the tree into a flat list, allowing O(n) iterative freeing without a recursion stack. For each node with children, the children are prepended to the remaining list by connecting last_child->next to e->next and e->next to first_child.

Containership Rules

The S_can_contain() function in node.c enforces which node types can contain which children:

static bool S_can_contain(cmark_node *node, cmark_node *child) {
  // Ancestor loop detection
  if (child->first_child != NULL) {
    cmark_node *cur = node->parent;
    while (cur != NULL) {
      if (cur == child) return false;
      cur = cur->parent;
    }
  }

  // Documents cannot be children
  if (child->type == CMARK_NODE_DOCUMENT) return false;

  switch (node->type) {
  case CMARK_NODE_DOCUMENT:
  case CMARK_NODE_BLOCK_QUOTE:
  case CMARK_NODE_ITEM:
    return cmark_node_is_block(child) && child->type != CMARK_NODE_ITEM;

  case CMARK_NODE_LIST:
    return child->type == CMARK_NODE_ITEM;

  case CMARK_NODE_CUSTOM_BLOCK:
    return true;  // Custom blocks can contain anything

  case CMARK_NODE_PARAGRAPH:
  case CMARK_NODE_HEADING:
  case CMARK_NODE_EMPH:
  case CMARK_NODE_STRONG:
  case CMARK_NODE_LINK:
  case CMARK_NODE_IMAGE:
  case CMARK_NODE_CUSTOM_INLINE:
    return cmark_node_is_inline(child);

  default:
    break;
  }
  return false;
}

Key rules:

  • Document, block quote, list item: Can contain any block except items
  • List: Can only contain items
  • Custom block: Can contain anything (no restrictions)
  • Paragraph, heading, emphasis, strong, link, image, custom inline: Can only contain inline nodes
  • Leaf blocks (thematic break, code block, HTML block): Cannot contain anything

Tree Manipulation

Unlinking

The internal S_node_unlink() function detaches a node from its parent and siblings:

static void S_node_unlink(cmark_node *node) {
  if (node->prev) {
    node->prev->next = node->next;
  }
  if (node->next) {
    node->next->prev = node->prev;
  }
  // Update parent's first_child / last_child pointers
  if (node->parent) {
    if (node->parent->first_child == node)
      node->parent->first_child = node->next;
    if (node->parent->last_child == node)
      node->parent->last_child = node->prev;
  }
  node->next = NULL;
  node->prev = NULL;
  node->parent = NULL;
}

String Setting Helper

The cmark_set_cstr() function manages string assignment with proper memory handling:

static bufsize_t cmark_set_cstr(cmark_mem *mem, unsigned char **dst,
                                const char *src) {
  unsigned char *old = *dst;
  bufsize_t len;
  if (src && src[0]) {
    len = (bufsize_t)strlen(src);
    *dst = (unsigned char *)mem->realloc(NULL, len + 1);
    memcpy(*dst, src, len + 1);
  } else {
    len = 0;
    *dst = NULL;
  }
  if (old) {
    mem->free(old);
  }
  return len;
}

This function allocates a new copy of the source string, assigns it, then frees the old value — ensuring no memory leaks even when overwriting existing data.

Node Data Storage Pattern

Nodes store their text content in two ways depending on type:

  1. Direct storage (data + len): Used by CMARK_NODE_TEXT, CMARK_NODE_CODE, CMARK_NODE_CODE_BLOCK, CMARK_NODE_HTML_BLOCK, and CMARK_NODE_HTML_INLINE. The data field points to a separately allocated buffer containing the text content.

  2. Union storage (as.*): Used by lists, code blocks (for the info string), headings, links/images, and custom nodes. These store structured data rather than raw text.

  3. Hybrid: CMARK_NODE_CODE_BLOCK uses both — data for the code content and as.code.info for the info string.

The cmark_node_check() Function

For debug builds, cmark_node_check() validates the structural integrity of the tree. It checks that parent/child/sibling pointers are consistent and that the tree forms a valid structure. It returns the number of errors found and prints details to the provided FILE*.

Cross-References

Was this handbook page helpful?

This page is part of the Project Tick Handbook, which is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. View full license details.
Last updated: April 18, 2026