cmark — Overview

What Is cmark?

cmark is a C library and command-line tool for parsing and rendering CommonMark (standardized Markdown). Written in C99, it implements a two-phase parsing architecture — block structure recognition followed by inline content parsing — producing an Abstract Syntax Tree (AST) that can be traversed, manipulated, and rendered into multiple output formats.

Language: C (C99)
Build System: CMake (minimum version 3.14)
Project Version: 0.31.2
License: BSD-2-Clause
Authors: John MacFarlane, Vicent Marti, Kārlis Gaņģis, Nick Wellnhofer

Core Architecture Summary

cmark's processing pipeline follows this sequence:

  1. Input — UTF-8 text is fed to the parser, either all at once or incrementally via a streaming API.
  2. Block Parsing (blocks.c) — The input is scanned line-by-line to identify block-level structures (paragraphs, headings, code blocks, lists, block quotes, thematic breaks, HTML blocks).
  3. Inline Parsing (inlines.c) — Within paragraph and heading blocks, inline elements are parsed (emphasis, links, images, code spans, HTML inline, line breaks).
  4. AST Construction — A tree of cmark_node structures is built, with each node representing a document element.
  5. Rendering — The AST is traversed using an iterator and rendered to one of five output formats: HTML, XML, LaTeX, man (groff), or CommonMark.

Source File Map

The cmark/src/ directory contains the following source files, organized by responsibility:

Public API

File      Purpose
cmark.h   Public API header: all exported types, enums, and function declarations
cmark.c   Core glue: cmark_markdown_to_html(), default memory allocator, version info
main.c    CLI entry point: argument parsing, file I/O, format dispatch

AST Node System

File      Purpose
node.h    Internal node struct definition, type-specific structs held in a union (cmark_list, cmark_code, cmark_heading, cmark_link, cmark_custom), internal flags
node.c    Node creation/destruction, accessor functions, tree manipulation (insert, append, unlink, replace)

Parsing

File        Purpose
parser.h    Internal cmark_parser struct definition (parser state: line number, offset, column, indent, reference map)
blocks.c    Block-level parsing: line-by-line analysis, open/close block logic, list item detection, finalization
inlines.c   Inline-level parsing: emphasis/strong via delimiter stack, backtick code spans, links/images via bracket stack, autolinks, HTML inline
inlines.h   Internal API: cmark_parse_inlines(), cmark_parse_reference_inline(), cmark_clean_url(), cmark_clean_title()

Traversal

File         Purpose
iterator.h   Internal cmark_iter struct with cmark_iter_state (current + next event/node pairs)
iterator.c   Iterator implementation: cmark_iter_new(), cmark_iter_next(), cmark_iter_reset(), cmark_consolidate_text_nodes()
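
The "current + next event/node pairs" idea can be pictured with a small stand-alone sketch. This is not cmark's code: demo_node, demo_iter, and the demo_* functions are hypothetical names, and the sketch treats any node without children as a leaf (cmark decides leaf status by node type instead). It only illustrates how a walker that remembers the next event can visit a tree without recursion, emitting an enter event for every node and an exit event for containers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature tree node; not cmark_node. */
typedef struct demo_node {
  const char *name;
  struct demo_node *parent, *first_child, *next;
} demo_node;

typedef enum {
  DEMO_EVENT_NONE, DEMO_EVENT_DONE, DEMO_EVENT_ENTER, DEMO_EVENT_EXIT
} demo_event;

/* Iterator state: the pair just returned and the pair to return next. */
typedef struct {
  demo_node *root;
  demo_event cur_ev;
  demo_node *cur_node;
  demo_event next_ev;
  demo_node *next_node;
} demo_iter;

void demo_iter_init(demo_iter *it, demo_node *root) {
  assert(root != NULL);
  it->root = root;
  it->cur_ev = DEMO_EVENT_NONE;
  it->cur_node = NULL;
  it->next_ev = DEMO_EVENT_ENTER;
  it->next_node = root;
}

/* Returns the pending event and precomputes the one after it. */
demo_event demo_iter_next(demo_iter *it) {
  demo_event ev = it->next_ev;
  demo_node *node = it->next_node;

  it->cur_ev = ev;
  it->cur_node = node;
  if (ev == DEMO_EVENT_DONE)
    return ev;

  if (ev == DEMO_EVENT_ENTER && node->first_child) {
    /* Entering a container: descend to its first child. */
    it->next_ev = DEMO_EVENT_ENTER;
    it->next_node = node->first_child;
  } else if (node == it->root) {
    /* Leaving the root (or the root has no children): finished. */
    it->next_ev = DEMO_EVENT_DONE;
    it->next_node = NULL;
  } else if (node->next) {
    /* Move sideways to the next sibling. */
    it->next_ev = DEMO_EVENT_ENTER;
    it->next_node = node->next;
  } else {
    /* No sibling left: climb back up and exit the parent. */
    it->next_ev = DEMO_EVENT_EXIT;
    it->next_node = node->parent;
  }
  return ev;
}

/* Writes a trace such as "+doc +para -para -doc" into out. */
void demo_trace(demo_node *root, char *out, size_t out_size) {
  demo_iter it;
  demo_event ev;
  size_t used = 0;

  assert(out_size > 0);
  out[0] = '\0';
  demo_iter_init(&it, root);
  while ((ev = demo_iter_next(&it)) != DEMO_EVENT_DONE) {
    int n = snprintf(out + used, out_size - used, "%s%c%s",
                     used ? " " : "",
                     ev == DEMO_EVENT_ENTER ? '+' : '-',
                     it.cur_node->name);
    if (n < 0 || (size_t)n >= out_size - used)
      break; /* out of room: stop with a truncated trace */
    used += (size_t)n;
  }
}
```

With the real library, the corresponding loop calls cmark_iter_next() until it returns CMARK_EVENT_DONE and reads the current node with cmark_iter_get_node().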

Renderers

File           Purpose
render.h       cmark_renderer struct, cmark_escaping enum (LITERAL, NORMAL, TITLE, URL)
render.c       Generic render framework: line wrapping, prefix management, cmark_render() dispatch loop
html.c         HTML renderer: cmark_render_html(), direct strbuf-based output, no render framework
xml.c          XML renderer: cmark_render_xml(), direct strbuf-based output with CommonMark DTD
latex.c        LaTeX renderer: cmark_render_latex(), uses render framework
man.c          groff man renderer: cmark_render_man(), uses render framework
commonmark.c   CommonMark renderer: cmark_render_commonmark(), uses render framework

Text Processing and Utilities

File                            Purpose
buffer.h / buffer.c             cmark_strbuf: growable byte buffer with amortized O(1) append
chunk.h                         cmark_chunk: lightweight non-owning string slice (pointer + length)
utf8.h / utf8.c                 UTF-8 iteration, validation, encoding, case folding, Unicode property queries
references.h / references.c     Link reference definition storage and lookup (sorted array with binary search)
scanners.h / scanners.c         re2c-generated scanner functions for recognizing Markdown syntax patterns
scanners.re                     re2c source for scanner generation
cmark_ctype.h / cmark_ctype.c   Locale-independent cmark_isspace(), cmark_ispunct(), cmark_isdigit(), cmark_isalpha()
houdini.h                       HTML/URL escaping and unescaping API
houdini_html_e.c                HTML entity escaping
houdini_html_u.c                HTML entity unescaping
houdini_href_e.c                URL/href percent-encoding
entities.inc                    HTML entity name-to-codepoint lookup table
case_fold.inc                   Unicode case folding table for reference normalization
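
The difference between an owning growable buffer and a non-owning slice is easier to see in code. The sketch below is a generic illustration, not the cmark_strbuf or cmark_chunk implementation: demo_buf, demo_slice, and the demo_* functions are hypothetical, and the capacity-doubling policy is an assumption chosen because it is the textbook way to get amortized O(1) appends.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Owning buffer: holds its own heap allocation. */
typedef struct {
  char *ptr;    /* NUL-terminated contents, or NULL when empty */
  size_t size;  /* bytes in use, excluding the terminator */
  size_t asize; /* bytes allocated */
} demo_buf;

/* Non-owning slice: just a view into memory owned elsewhere. */
typedef struct {
  const char *data;
  size_t len;
} demo_slice;

void demo_buf_init(demo_buf *b) {
  b->ptr = NULL;
  b->size = 0;
  b->asize = 0;
}

/* Appends len bytes, doubling the capacity whenever it runs out. */
void demo_buf_put(demo_buf *b, const char *data, size_t len) {
  size_t needed;

  assert(b != NULL && data != NULL);
  needed = b->size + len + 1; /* +1 for the terminator */
  if (needed > b->asize) {
    size_t new_asize = b->asize ? b->asize : 16;
    char *p;
    while (new_asize < needed)
      new_asize *= 2;
    p = realloc(b->ptr, new_asize);
    if (!p)
      abort();
    b->ptr = p;
    b->asize = new_asize;
  }
  memcpy(b->ptr + b->size, data, len);
  b->size += len;
  b->ptr[b->size] = '\0';
}

void demo_buf_free(demo_buf *b) {
  free(b->ptr);
  demo_buf_init(b);
}

/* Returns a view of len bytes starting at start; no copy is made. */
demo_slice demo_slice_of(const demo_buf *b, size_t start, size_t len) {
  demo_slice s;
  assert(b->ptr != NULL && start + len <= b->size);
  s.data = b->ptr + start;
  s.len = len;
  return s;
}
```

A slice stays valid only while the memory it points into is untouched; in this sketch a later demo_buf_put() may reallocate and invalidate it.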

The Simple Interface

The simplest way to use cmark is a single function call defined in cmark.c:

char *cmark_markdown_to_html(const char *text, size_t len, int options);

Internally, this calls cmark_parse_document() to build the AST, then cmark_render_html() to produce the output, and finally frees the whole document tree. The caller owns the returned string and must release it; because the document is built with the default allocator, free() is the matching call.

The implementation in cmark.c:

char *cmark_markdown_to_html(const char *text, size_t len, int options) {
  cmark_node *doc;
  char *result;

  doc = cmark_parse_document(text, len, options);
  result = cmark_render_html(doc, options);
  cmark_node_free(doc);

  return result;
}

The Streaming Interface

For large documents or streaming input, cmark provides an incremental parsing API:

#include <stdio.h>
#include <stdlib.h>
#include <cmark.h>

int main(void) {
    char buffer[4096];
    size_t bytes;
    FILE *fp = stdin;  // any readable stream works here

    cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);

    // Feed chunks of data as they arrive
    while ((bytes = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
        cmark_parser_feed(parser, buffer, bytes);
    }

    // Finalize and get the AST
    cmark_node *document = cmark_parser_finish(parser);
    cmark_parser_free(parser);

    // Render to any format
    char *html = cmark_render_html(document, CMARK_OPT_DEFAULT);
    char *xml  = cmark_render_xml(document, CMARK_OPT_DEFAULT);
    char *man  = cmark_render_man(document, CMARK_OPT_DEFAULT, 72);
    char *tex  = cmark_render_latex(document, CMARK_OPT_DEFAULT, 80);
    char *cm   = cmark_render_commonmark(document, CMARK_OPT_DEFAULT, 0);

    fputs(html, stdout);

    // Cleanup: rendered strings belong to the caller
    free(html); free(xml); free(man);
    free(tex); free(cm);
    cmark_node_free(document);
    return 0;
}

The parser accumulates input in an internal line buffer (parser->linebuf) and processes complete lines as they become available. The S_parser_feed() function in blocks.c scans for line-ending characters (\n, \r) and dispatches each complete line to S_process_line().
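
The buffering behaviour described above can be sketched in isolation. The following is a minimal stand-alone illustration, not the code in blocks.c: demo_feeder and the demo_* functions are hypothetical names, the fixed 256-byte buffer is a simplification, and complete lines are only counted and remembered rather than parsed. It shows why a partial line must be kept between calls and why a "\r\n" pair split across two chunks still counts as one line ending.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_LINE_MAX 256

typedef struct {
  char partial[DEMO_LINE_MAX];   /* bytes of a line not yet terminated */
  size_t partial_len;
  int last_was_cr;               /* previous byte was '\r' */
  int lines_seen;                /* complete lines dispatched so far */
  char last_line[DEMO_LINE_MAX]; /* most recent line, without terminator */
} demo_feeder;

void demo_init(demo_feeder *f) {
  assert(f != NULL);
  memset(f, 0, sizeof(*f));
}

/* Stand-in for per-line processing: record the line and reset the buffer. */
static void demo_dispatch(demo_feeder *f) {
  memcpy(f->last_line, f->partial, f->partial_len);
  f->last_line[f->partial_len] = '\0';
  f->partial_len = 0;
  f->lines_seen++;
}

/* Accepts an arbitrary chunk; dispatches every line it completes. */
void demo_feed(demo_feeder *f, const char *data, size_t len) {
  size_t i;
  assert(f != NULL && data != NULL);
  for (i = 0; i < len; i++) {
    char c = data[i];
    if (f->last_was_cr) {
      f->last_was_cr = 0;
      if (c == '\n')
        continue; /* second half of "\r\n": line already dispatched */
    }
    if (c == '\n' || c == '\r') {
      demo_dispatch(f);
      f->last_was_cr = (c == '\r');
    } else if (f->partial_len < DEMO_LINE_MAX - 1) {
      f->partial[f->partial_len++] = c; /* overlong lines are truncated */
    }
  }
}

/* Flushes a final line that has no terminator. */
void demo_finish(demo_feeder *f) {
  assert(f != NULL);
  if (f->partial_len > 0)
    demo_dispatch(f);
}
```

Feeding "# Ti", then "tle\r", then "\nbody" and calling demo_finish() yields two lines, "# Title" and "body", regardless of where the chunk boundaries fall.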

Node Type Taxonomy

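The cmark_node_type enum defines 21 values: CMARK_NODE_NONE, which signals an error or a missing node, plus the 10 block types and 10 inline types listed below.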

Block Nodes (container and leaf)

Enum Value                 Type String       Container?        Accepts Lines?  Contains Inlines?
CMARK_NODE_DOCUMENT        "document"        Yes               No              No
CMARK_NODE_BLOCK_QUOTE     "block_quote"     Yes               No              No
CMARK_NODE_LIST            "list"            Yes (items only)  No              No
CMARK_NODE_ITEM            "item"            Yes               No              No
CMARK_NODE_CODE_BLOCK      "code_block"      No (leaf)         Yes             No
CMARK_NODE_HTML_BLOCK      "html_block"      No (leaf)         No              No
CMARK_NODE_CUSTOM_BLOCK    "custom_block"    Yes               No              No
CMARK_NODE_PARAGRAPH       "paragraph"       No                Yes             Yes
CMARK_NODE_HEADING         "heading"         No                Yes             Yes
CMARK_NODE_THEMATIC_BREAK  "thematic_break"  No (leaf)         No              No

Inline Nodes

Enum Value                Type String      Leaf?
CMARK_NODE_TEXT           "text"           Yes
CMARK_NODE_SOFTBREAK      "softbreak"      Yes
CMARK_NODE_LINEBREAK      "linebreak"      Yes
CMARK_NODE_CODE           "code"           Yes
CMARK_NODE_HTML_INLINE    "html_inline"    Yes
CMARK_NODE_CUSTOM_INLINE  "custom_inline"  No
CMARK_NODE_EMPH           "emph"           No
CMARK_NODE_STRONG         "strong"         No
CMARK_NODE_LINK           "link"           No
CMARK_NODE_IMAGE          "image"          No

Range sentinels are also defined for classification:

  • CMARK_NODE_FIRST_BLOCK = CMARK_NODE_DOCUMENT
  • CMARK_NODE_LAST_BLOCK = CMARK_NODE_THEMATIC_BREAK
  • CMARK_NODE_FIRST_INLINE = CMARK_NODE_TEXT
  • CMARK_NODE_LAST_INLINE = CMARK_NODE_IMAGE
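
Because the enum values are declared in table order, the sentinels turn block/inline classification into two range comparisons. The sketch below assumes exactly that ordering and uses a local mirror of the enum (all DEMO_* and demo_* names are hypothetical) so that it compiles without cmark.h; real code should include cmark.h and compare against the CMARK_NODE_FIRST_* and CMARK_NODE_LAST_* values directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Local mirror of the ordering shown in the tables above. */
typedef enum {
  DEMO_NODE_NONE,
  /* block nodes */
  DEMO_NODE_DOCUMENT,
  DEMO_NODE_BLOCK_QUOTE,
  DEMO_NODE_LIST,
  DEMO_NODE_ITEM,
  DEMO_NODE_CODE_BLOCK,
  DEMO_NODE_HTML_BLOCK,
  DEMO_NODE_CUSTOM_BLOCK,
  DEMO_NODE_PARAGRAPH,
  DEMO_NODE_HEADING,
  DEMO_NODE_THEMATIC_BREAK,
  /* inline nodes */
  DEMO_NODE_TEXT,
  DEMO_NODE_SOFTBREAK,
  DEMO_NODE_LINEBREAK,
  DEMO_NODE_CODE,
  DEMO_NODE_HTML_INLINE,
  DEMO_NODE_CUSTOM_INLINE,
  DEMO_NODE_EMPH,
  DEMO_NODE_STRONG,
  DEMO_NODE_LINK,
  DEMO_NODE_IMAGE
} demo_node_type;

#define DEMO_NODE_FIRST_BLOCK  DEMO_NODE_DOCUMENT
#define DEMO_NODE_LAST_BLOCK   DEMO_NODE_THEMATIC_BREAK
#define DEMO_NODE_FIRST_INLINE DEMO_NODE_TEXT
#define DEMO_NODE_LAST_INLINE  DEMO_NODE_IMAGE

/* True for any value between the two block sentinels, inclusive. */
bool demo_is_block(demo_node_type t) {
  assert(t <= DEMO_NODE_LAST_INLINE);
  return t >= DEMO_NODE_FIRST_BLOCK && t <= DEMO_NODE_LAST_BLOCK;
}

/* True for any value between the two inline sentinels, inclusive. */
bool demo_is_inline(demo_node_type t) {
  assert(t <= DEMO_NODE_LAST_INLINE);
  return t >= DEMO_NODE_FIRST_INLINE && t <= DEMO_NODE_LAST_INLINE;
}
```

The value outside both ranges (the "none" entry) is classified as neither block nor inline.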

Option Flags

Options are passed as a bitmask integer to parsing and rendering functions:

#define CMARK_OPT_DEFAULT       0
#define CMARK_OPT_SOURCEPOS     (1 << 1)   // Add data-sourcepos attributes
#define CMARK_OPT_HARDBREAKS    (1 << 2)   // Render softbreaks as hard breaks
#define CMARK_OPT_SAFE          (1 << 3)   // Legacy (now default behavior)
#define CMARK_OPT_NOBREAKS      (1 << 4)   // Render softbreaks as spaces
#define CMARK_OPT_NORMALIZE     (1 << 8)   // Legacy (no effect)
#define CMARK_OPT_VALIDATE_UTF8 (1 << 9)   // Validate UTF-8 input
#define CMARK_OPT_SMART         (1 << 10)  // Smart quotes and dashes
#define CMARK_OPT_UNSAFE        (1 << 17)  // Allow raw HTML and dangerous URLs
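
Since each option occupies its own bit, flags are combined with bitwise OR and tested with bitwise AND. The sketch below copies a few values from the listing above (guarded so they do not clash with cmark.h if it is included); the demo_* helper names are hypothetical and are not part of the cmark API.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the listing above so the sketch compiles on its own. */
#ifndef CMARK_OPT_DEFAULT
#define CMARK_OPT_DEFAULT   0
#define CMARK_OPT_SOURCEPOS (1 << 1)
#define CMARK_OPT_SMART     (1 << 10)
#define CMARK_OPT_UNSAFE    (1 << 17)
#endif

/* True when every bit of flag is set in options. */
bool demo_has_option(int options, int flag) {
  assert(flag != 0); /* CMARK_OPT_DEFAULT carries no bits to test */
  return (options & flag) == flag;
}

/* Returns options with flag switched on. */
int demo_set_option(int options, int flag) {
  return options | flag;
}

/* Returns options with flag switched off. */
int demo_clear_option(int options, int flag) {
  return options & ~flag;
}
```

A typical call site simply writes the OR inline, for example passing CMARK_OPT_SMART | CMARK_OPT_SOURCEPOS as the options argument.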

Memory Management Model

cmark uses a pluggable memory allocator defined by the cmark_mem struct:

typedef struct cmark_mem {
  void *(*calloc)(size_t, size_t);
  void *(*realloc)(void *, size_t);
  void (*free)(void *);
} cmark_mem;

The default allocator in cmark.c wraps standard calloc/realloc/free with abort-on-NULL safety checks (xcalloc, xrealloc). Every node stores a pointer to the allocator it was created with (node->mem), ensuring consistent allocation/deallocation throughout the tree.
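
The pattern can be sketched without the library. In the code below, demo_mem has the same shape as the cmark_mem struct shown above but is a local stand-in, and demo_obj is a hypothetical object that plays the role of a node: it remembers the allocator it came from and uses the same one to free itself. The wrapper functions are an illustration of the abort-on-NULL idea, not a copy of the functions in cmark.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Same three function pointers as cmark_mem, under a local name. */
typedef struct demo_mem {
  void *(*calloc)(size_t, size_t);
  void *(*realloc)(void *, size_t);
  void (*free)(void *);
} demo_mem;

/* calloc wrapper: never returns NULL, aborts instead. */
static void *demo_xcalloc(size_t nmem, size_t size) {
  void *ptr = calloc(nmem, size);
  if (!ptr) {
    fprintf(stderr, "[demo] calloc returned null pointer, aborting\n");
    abort();
  }
  return ptr;
}

/* realloc wrapper: never returns NULL, aborts instead. */
static void *demo_xrealloc(void *ptr, size_t size) {
  void *new_ptr = realloc(ptr, size);
  if (!new_ptr) {
    fprintf(stderr, "[demo] realloc returned null pointer, aborting\n");
    abort();
  }
  return new_ptr;
}

demo_mem demo_default_allocator = {demo_xcalloc, demo_xrealloc, free};

/* Hypothetical object that stores its allocator, as a node stores node->mem. */
typedef struct demo_obj {
  demo_mem *mem;
  char *data;
} demo_obj;

demo_obj *demo_obj_new(demo_mem *mem, size_t n) {
  demo_obj *o;
  assert(mem != NULL && n > 0);
  o = mem->calloc(1, sizeof(*o));
  o->mem = mem;
  o->data = mem->calloc(n, 1); /* zero-filled payload */
  return o;
}

/* Frees with the allocator recorded at creation time. */
void demo_obj_free(demo_obj *o) {
  demo_mem *mem;
  assert(o != NULL);
  mem = o->mem;
  mem->free(o->data);
  mem->free(o);
}
```

Callers never need to check the result of demo_obj_new() for NULL, which is the practical benefit of the abort-on-NULL wrappers.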

Version Information

Runtime version queries:

int cmark_version(void);               // Returns CMARK_VERSION as integer (0xMMmmpp)
const char *cmark_version_string(void); // Returns CMARK_VERSION_STRING

The version is encoded as a 24-bit integer where bits 16–23 are major, 8–15 are minor, and 0–7 are patch. For example, 0x001F02 represents version 0.31.2.
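
The encoding can be unpacked with shifts and masks. The helpers below are a small sketch with hypothetical demo_* names; they operate on any integer laid out as described above and do not call into cmark.

```c
#include <assert.h>

/* Bits 16-23 hold the major version. */
int demo_version_major(int v) { return (v >> 16) & 0xFF; }

/* Bits 8-15 hold the minor version. */
int demo_version_minor(int v) { return (v >> 8) & 0xFF; }

/* Bits 0-7 hold the patch level. */
int demo_version_patch(int v) { return v & 0xFF; }

/* Packs three components, each 0..255, into the 0xMMmmpp layout. */
int demo_version_pack(int major, int minor, int patch) {
  assert(major >= 0 && major <= 255);
  assert(minor >= 0 && minor <= 255);
  assert(patch >= 0 && patch <= 255);
  return (major << 16) | (minor << 8) | patch;
}
```

In a program linked against cmark, the argument would come from cmark_version(), which makes a check such as "minor is at least 31" a simple integer comparison.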

Backwards Compatibility Aliases

For code written against older cmark API versions, these aliases are provided:

#define CMARK_NODE_HEADER      CMARK_NODE_HEADING
#define CMARK_NODE_HRULE       CMARK_NODE_THEMATIC_BREAK
#define CMARK_NODE_HTML        CMARK_NODE_HTML_BLOCK
#define CMARK_NODE_INLINE_HTML CMARK_NODE_HTML_INLINE

Short-name aliases (without the CMARK_ prefix) are also available unless CMARK_NO_SHORT_NAMES is defined:

#define NODE_DOCUMENT    CMARK_NODE_DOCUMENT
#define NODE_PARAGRAPH   CMARK_NODE_PARAGRAPH
#define BULLET_LIST      CMARK_BULLET_LIST
// ... and many more

This page is part of the Project Tick Handbook, which is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Last updated: April 18, 2026