cmark — Overview
What Is cmark?
cmark is a C library and command-line tool for parsing and rendering CommonMark (standardized Markdown). Written in C99, it implements a two-phase parsing architecture — block structure recognition followed by inline content parsing — producing an Abstract Syntax Tree (AST) that can be traversed, manipulated, and rendered into multiple output formats.
Language: C (C99)
Build System: CMake (minimum version 3.14)
Project Version: 0.31.2
License: BSD-2-Clause
Authors: John MacFarlane, Vicent Marti, Kārlis Gaņģis, Nick Wellnhofer
Core Architecture Summary
cmark's processing pipeline follows this sequence:
- Input — UTF-8 text is fed to the parser, either all at once or incrementally via a streaming API.
- Block Parsing (
blocks.c) — The input is scanned line-by-line to identify block-level structures (paragraphs, headings, code blocks, lists, block quotes, thematic breaks, HTML blocks). - Inline Parsing (
inlines.c) — Within paragraph and heading blocks, inline elements are parsed (emphasis, links, images, code spans, HTML inline, line breaks). - AST Construction — A tree of
cmark_nodestructures is built, with each node representing a document element. - Rendering — The AST is traversed using an iterator and rendered to one of five output formats: HTML, XML, LaTeX, man (groff), or CommonMark.
Source File Map
The cmark/src/ directory contains the following source files, organized by responsibility:
Public API
| File | Purpose |
|---|---|
cmark.h |
Public API header — all exported types, enums, and function declarations |
cmark.c |
Core glue — cmark_markdown_to_html(), default memory allocator, version info |
main.c |
CLI entry point — argument parsing, file I/O, format dispatch |
AST Node System
| File | Purpose |
|---|---|
node.h |
Internal node struct definition, type-specific unions (cmark_list, cmark_code, cmark_heading, cmark_link, cmark_custom), internal flags |
node.c |
Node creation/destruction, accessor functions, tree manipulation (insert, append, unlink, replace) |
Parsing
| File | Purpose |
|---|---|
parser.h |
Internal cmark_parser struct definition (parser state: line number, offset, column, indent, reference map) |
blocks.c |
Block-level parsing — line-by-line analysis, open/close block logic, list item detection, finalization |
inlines.c |
Inline-level parsing — emphasis/strong via delimiter stack, backtick code spans, links/images via bracket stack, autolinks, HTML inline |
inlines.h |
Internal API: cmark_parse_inlines(), cmark_parse_reference_inline(), cmark_clean_url(), cmark_clean_title() |
Traversal
| File | Purpose |
|---|---|
iterator.h |
Internal cmark_iter struct with cmark_iter_state (current + next event/node pairs) |
iterator.c |
Iterator implementation — cmark_iter_new(), cmark_iter_next(), cmark_iter_reset(), cmark_consolidate_text_nodes() |
Renderers
| File | Purpose |
|---|---|
render.h |
cmark_renderer struct, cmark_escaping enum (LITERAL, NORMAL, TITLE, URL) |
render.c |
Generic render framework — line wrapping, prefix management, cmark_render() dispatch loop |
html.c |
HTML renderer — cmark_render_html(), direct strbuf-based output, no render framework |
xml.c |
XML renderer — cmark_render_xml(), direct strbuf-based output with CommonMark DTD |
latex.c |
LaTeX renderer — cmark_render_latex(), uses render framework |
man.c |
groff man renderer — cmark_render_man(), uses render framework |
commonmark.c |
CommonMark renderer — cmark_render_commonmark(), uses render framework |
Text Processing and Utilities
| File | Purpose |
|---|---|
buffer.h / buffer.c |
cmark_strbuf — growable byte buffer with amortized O(1) append |
chunk.h |
cmark_chunk — lightweight non-owning string slice (pointer + length) |
utf8.h / utf8.c |
UTF-8 iteration, validation, encoding, case folding, Unicode property queries |
references.h / references.c |
Link reference definition storage and lookup (sorted array with binary search) |
scanners.h / scanners.c |
re2c-generated scanner functions for recognizing Markdown syntax patterns |
scanners.re |
re2c source for scanner generation |
cmark_ctype.h / cmark_ctype.c |
Locale-independent cmark_isspace(), cmark_ispunct(), cmark_isdigit(), cmark_isalpha() |
houdini.h |
HTML/URL escaping and unescaping API |
houdini_html_e.c |
HTML entity escaping |
houdini_html_u.c |
HTML entity unescaping |
houdini_href_e.c |
URL/href percent-encoding |
entities.inc |
HTML entity name-to-codepoint lookup table |
case_fold.inc |
Unicode case folding table for reference normalization |
The Simple Interface
The simplest way to use cmark is a single function call defined in cmark.c:
char *cmark_markdown_to_html(const char *text, size_t len, int options);
Internally, this calls cmark_parse_document() to build the AST, then cmark_render_html() to produce the output, and finally frees the document node. The caller is responsible for freeing the returned string.
The implementation in cmark.c:
char *cmark_markdown_to_html(const char *text, size_t len, int options) {
cmark_node *doc;
char *result;
doc = cmark_parse_document(text, len, options);
result = cmark_render_html(doc, options);
cmark_node_free(doc);
return result;
}
The Streaming Interface
For large documents or streaming input, cmark provides an incremental parsing API:
cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);
// Feed chunks of data as they arrive
while ((bytes = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
cmark_parser_feed(parser, buffer, bytes);
}
// Finalize and get the AST
cmark_node *document = cmark_parser_finish(parser);
cmark_parser_free(parser);
// Render to any format
char *html = cmark_render_html(document, CMARK_OPT_DEFAULT);
char *xml = cmark_render_xml(document, CMARK_OPT_DEFAULT);
char *man = cmark_render_man(document, CMARK_OPT_DEFAULT, 72);
char *tex = cmark_render_latex(document, CMARK_OPT_DEFAULT, 80);
char *cm = cmark_render_commonmark(document, CMARK_OPT_DEFAULT, 0);
// Cleanup
cmark_node_free(document);
The parser accumulates input in an internal line buffer (parser->linebuf) and processes complete lines as they become available. The S_parser_feed() function in blocks.c scans for line-ending characters (\n, \r) and dispatches each complete line to S_process_line().
Node Type Taxonomy
cmark defines 21 node types in the cmark_node_type enum:
Block Nodes (container and leaf)
| Enum Value | Type String | Container? | Accepts Lines? | Contains Inlines? |
|---|---|---|---|---|
CMARK_NODE_DOCUMENT |
"document" | Yes | No | No |
CMARK_NODE_BLOCK_QUOTE |
"block_quote" | Yes | No | No |
CMARK_NODE_LIST |
"list" | Yes (items only) | No | No |
CMARK_NODE_ITEM |
"item" | Yes | No | No |
CMARK_NODE_CODE_BLOCK |
"code_block" | No (leaf) | Yes | No |
CMARK_NODE_HTML_BLOCK |
"html_block" | No (leaf) | No | No |
CMARK_NODE_CUSTOM_BLOCK |
"custom_block" | Yes | No | No |
CMARK_NODE_PARAGRAPH |
"paragraph" | No | Yes | Yes |
CMARK_NODE_HEADING |
"heading" | No | Yes | Yes |
CMARK_NODE_THEMATIC_BREAK |
"thematic_break" | No (leaf) | No | No |
Inline Nodes
| Enum Value | Type String | Leaf? |
|---|---|---|
CMARK_NODE_TEXT |
"text" | Yes |
CMARK_NODE_SOFTBREAK |
"softbreak" | Yes |
CMARK_NODE_LINEBREAK |
"linebreak" | Yes |
CMARK_NODE_CODE |
"code" | Yes |
CMARK_NODE_HTML_INLINE |
"html_inline" | Yes |
CMARK_NODE_CUSTOM_INLINE |
"custom_inline" | No |
CMARK_NODE_EMPH |
"emph" | No |
CMARK_NODE_STRONG |
"strong" | No |
CMARK_NODE_LINK |
"link" | No |
CMARK_NODE_IMAGE |
"image" | No |
Range sentinels are also defined for classification:
CMARK_NODE_FIRST_BLOCK = CMARK_NODE_DOCUMENTCMARK_NODE_LAST_BLOCK = CMARK_NODE_THEMATIC_BREAKCMARK_NODE_FIRST_INLINE = CMARK_NODE_TEXTCMARK_NODE_LAST_INLINE = CMARK_NODE_IMAGE
Option Flags
Options are passed as a bitmask integer to parsing and rendering functions:
#define CMARK_OPT_DEFAULT 0
#define CMARK_OPT_SOURCEPOS (1 << 1) // Add data-sourcepos attributes
#define CMARK_OPT_HARDBREAKS (1 << 2) // Render softbreaks as hard breaks
#define CMARK_OPT_SAFE (1 << 3) // Legacy (now default behavior)
#define CMARK_OPT_NOBREAKS (1 << 4) // Render softbreaks as spaces
#define CMARK_OPT_NORMALIZE (1 << 8) // Legacy (no effect)
#define CMARK_OPT_VALIDATE_UTF8 (1 << 9) // Validate UTF-8 input
#define CMARK_OPT_SMART (1 << 10) // Smart quotes and dashes
#define CMARK_OPT_UNSAFE (1 << 17) // Allow raw HTML and dangerous URLs
Memory Management Model
cmark uses a pluggable memory allocator defined by the cmark_mem struct:
typedef struct cmark_mem {
void *(*calloc)(size_t, size_t);
void *(*realloc)(void *, size_t);
void (*free)(void *);
} cmark_mem;
The default allocator in cmark.c wraps standard calloc/realloc/free with abort-on-NULL safety checks (xcalloc, xrealloc). Every node stores a pointer to the allocator it was created with (node->mem), ensuring consistent allocation/deallocation throughout the tree.
Version Information
Runtime version queries:
int cmark_version(void); // Returns CMARK_VERSION as integer (0xMMmmpp)
const char *cmark_version_string(void); // Returns CMARK_VERSION_STRING
The version is encoded as a 24-bit integer where bits 16–23 are major, 8–15 are minor, and 0–7 are patch. For example, 0x001F02 represents version 0.31.2.
Backwards Compatibility Aliases
For code written against older cmark API versions, these aliases are provided:
#define CMARK_NODE_HEADER CMARK_NODE_HEADING
#define CMARK_NODE_HRULE CMARK_NODE_THEMATIC_BREAK
#define CMARK_NODE_HTML CMARK_NODE_HTML_BLOCK
#define CMARK_NODE_INLINE_HTML CMARK_NODE_HTML_INLINE
Short-name aliases (without the CMARK_ prefix) are also available unless CMARK_NO_SHORT_NAMES is defined:
#define NODE_DOCUMENT CMARK_NODE_DOCUMENT
#define NODE_PARAGRAPH CMARK_NODE_PARAGRAPH
#define BULLET_LIST CMARK_BULLET_LIST
// ... and many more
Cross-References
- architecture.md — Detailed two-phase parsing pipeline, module dependency graph
- public-api.md — Complete public API reference with all function signatures
- ast-node-system.md — Internal
cmark_nodestruct, type-specific unions, tree operations - block-parsing.md —
blocks.cline-by-line analysis, open block tracking, finalization - inline-parsing.md —
inlines.cdelimiter algorithm, bracket stack, backtick scanning - iterator-system.md — AST traversal with enter/exit events
- html-renderer.md — HTML output with escaping and source position
- xml-renderer.md — XML output with CommonMark DTD
- latex-renderer.md — LaTeX output via render framework
- man-renderer.md — groff man page output
- commonmark-renderer.md — Round-trip CommonMark output
- render-framework.md — Shared
cmark_render()engine for text-based renderers - memory-management.md — Allocator model, buffer growth, node freeing
- utf8-handling.md — UTF-8 validation, iteration, case folding
- reference-system.md — Link reference definitions storage and resolution
- scanner-system.md — re2c-generated pattern matching
- building.md — CMake build configuration and options
- cli-usage.md — Command-line tool usage
- testing.md — Test infrastructure (spec tests, API tests, fuzzing)
- code-style.md — Coding conventions and naming patterns