Import Upstream version 0.4.3

openKylinBot 2022-06-02 17:41:00 +08:00
commit e73391db57
47 changed files with 29103 additions and 0 deletions

.travis.yml Normal file

@ -0,0 +1,34 @@
# YAML definition for travis-ci.com continuous integration.
# See https://docs.travis-ci.com/user/languages/c
language: c
dist: bionic
compiler:
- gcc
addons:
apt:
packages:
- python3 # for running tests
- lcov # for generating code coverage report
before_script:
- mkdir build
- cd build
# We enforce -Wdeclaration-after-statement because the Qt project needs to
# build MD4C with the Integrity compiler, which chokes whenever a declaration
# is not at the beginning of a block.
- CFLAGS='--coverage -g -O0 -Wall -Wdeclaration-after-statement -Werror' cmake -DCMAKE_BUILD_TYPE=Debug -G 'Unix Makefiles' ..
script:
- make VERBOSE=1
after_success:
- ../scripts/run-tests.sh
# Creating report
- lcov --directory . --capture --output-file coverage.info # capture coverage info
- lcov --remove coverage.info '/usr/*' --output-file coverage.info # filter out system
- lcov --list coverage.info # debug info
# Uploading report to CodeCov
- bash <(curl -s https://codecov.io/bash) || echo "Codecov did not collect coverage reports"

CHANGELOG.md Normal file

@ -0,0 +1,268 @@
# MD4C Change Log
## Version 0.4.3
New features:
* With `MD_FLAG_UNDERLINE`, spans enclosed in underscore (`_foo_`) are seen
as underline (`MD_SPAN_UNDERLINE`) rather than an ordinary emphasis or
strong emphasis.
Changes:
* The implementation of wiki-links extension (with `MD_FLAG_WIKILINKS`) has
been simplified.
- A noticeable increase of MD4C's memory footprint introduced by the
extension implementation in 0.4.0 has been removed.
- The priority handling towards other inline elements has been unified.
(This affects an obscure case where image syntax in place of the
wiki-link destination made the wiki-link invalid. Now *all* inline spans
in the wiki-link destination, including images, are suppressed.)
- The length limitation of 100 characters now always applies to wiki-link
destination.
* Recognition of strike-through spans (with the flag `MD_FLAG_STRIKETHROUGH`)
has become much stricter and, arguably, more reasonable.
- Only single tildes (`~`) and double tildes (`~~`) are recognized as
strike-through marks. Longer ones are not anymore.
- The lengths of the opener and closer marks have to be the same.
- The tildes cannot open a strike-through span if a whitespace follows.
- The tildes cannot close a strike-through span if a whitespace precedes.
This change follows the changes of behavior in cmark-gfm some time ago, so
it is also beneficial from a compatibility point of view.
* When building MD4C by hand instead of using its CMake-based build, the UTF-8
support was by default disabled, unless explicitly asked for by defining
a preprocessor macro `MD4C_USE_UTF8`.
This has been changed and the UTF-8 mode now becomes the default, no matter
how `md4c.c` is compiled. If you need to disable it and use the ASCII-only
mode, you have to explicitly define the macro `MD4C_USE_ASCII` when compiling it.
(The CMake-based build as provided in our repository explicitly asked for
the UTF-8 support with `-DMD4C_USE_UTF8`. I.e. if you are using MD4C library
built with our vanilla `CMakeLists.txt` files, this change should not affect
you.)
Fixes:
* Fixed some string length handling in the special `MD4C_USE_UTF16` build.
(This does not affect you unless you are on Windows and explicitly define
the macro when building MD4C.)
* [#100](https://github.com/mity/md4c/issues/100):
Fixed an off-by-one error in the maximal length limit of some segments
of e-mail addresses used in autolinks.
* [#107](https://github.com/mity/md4c/issues/107):
Fix mis-detection of asterisk-encoded emphasis in some corner cases when
length of the opener and closer differs, as in `***foo *bar baz***`.
## Version 0.4.2
Fixes:
* [#98](https://github.com/mity/md4c/issues/98):
Fix mis-detection of asterisk-encoded emphasis in some corner cases when
length of the opener and closer differs, as in `**a *b c** d*`.
## Version 0.4.1
Unfortunately, 0.4.0 has been released with badly updated ChangeLog. Fixing
this is the only change on 0.4.1.
## Version 0.4.0
New features:
* With `MD_FLAG_LATEXMATHSPANS`, LaTeX math spans (`$...$`) and LaTeX display
math spans (`$$...$$`) are now recognized. (Note though that the HTML
renderer outputs them verbatim in a custom `<x-equation>` tag.)
Contributed by [Tilman Roeder](https://github.com/dyedgreen).
* With `MD_FLAG_WIKILINKS`, Wiki-style links (`[[...]]`) are now recognized.
(Note though that the HTML renderer renders them as a custom `<x-wikilink>`
tag.)
Contributed by [Nils Blomqvist](https://github.com/niblo).
Changes:
* Parsing of tables (with `MD_FLAG_TABLES`) is now closer to the way how
cmark-gfm parses tables as we do not require every row of the table to
contain a pipe `|` anymore.
As a consequence, paragraphs now cannot interrupt tables. A paragraph which
follows the table has to be delimited with a blank line.
Fixes:
* [#94](https://github.com/mity/md4c/issues/94):
`md_build_ref_def_hashtable()`: Do not allocate more memory than strictly
needed.
* [#95](https://github.com/mity/md4c/issues/95):
`md_is_container_mark()`: Ordered list mark requires at least one digit.
* [#96](https://github.com/mity/md4c/issues/96):
Some fixes for link label comparison.
## Version 0.3.4
Changes:
* Make Unicode-specific code compliant to Unicode 12.1.
* Structure `MD_BLOCK_CODE_DETAIL` got new member `fenced_char`. Application
can use it to detect character used to form the block fences (`` ` `` or
`~`). In the case of indented code block, it is set to zero.
Fixes:
* [#77](https://github.com/mity/md4c/issues/77):
Fix maximal count of digits for numerical character references, as requested
by CommonMark specification 0.29.
* [#78](https://github.com/mity/md4c/issues/78):
Fix link reference definition label matching for Unicode characters where
the folding mapping leads to multiple codepoints, as e.g. in `ẞ` -> `SS`.
* [#83](https://github.com/mity/md4c/issues/83):
Fix recognition of an empty blockquote which interrupts a paragraph.
## Version 0.3.3
Changes:
* Make permissive URL autolink and permissive WWW autolink extensions stricter.
This brings the behavior closer to GFM and mitigates risk of false positives.
In particular, the domain has to contain at least one dot and parenthesis
can be part of the link destination only if `(` and `)` are balanced.
Fixes:
* [#73](https://github.com/mity/md4c/issues/73):
Some raw HTML inputs could lead to quadratic parsing times.
* [#74](https://github.com/mity/md4c/issues/74):
Fix input leading to a crash. Found by fuzzing.
* [#76](https://github.com/mity/md4c/issues/76):
Fix handling of parenthesis in some corner cases of permissive URL autolink
and permissive WWW autolink extensions.
## Version 0.3.2
Changes:
* Changes mandated by CommonMark specification 0.29.
Most importantly, the white-space trimming rules for code spans have changed.
At most one space/newline is trimmed from beginning/end of the code span
(if the code span contains some non-space contents, and if it begins and
ends with space at the same time). In all other cases the spaces in the code
span are now left intact.
Other changes in behavior are in corner cases only. Refer to [CommonMark
0.29 notes](https://github.com/commonmark/commonmark-spec/releases/tag/0.29)
for more info.
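The trimming rule described above can be sketched in C roughly as follows. This is only an illustration of the rule as worded in this entry, not the code used in `md4c.c`; the names `is_space_or_nl()` and `trim_code_span()` are hypothetical, and the sketch assumes that only `' '` and `'\n'` count as space.

```c
#include <stddef.h>

/* Hypothetical helper: treat only ' ' and '\n' as space (an assumption). */
int is_space_or_nl(char ch)
{
    return ch == ' ' || ch == '\n';
}

/* Hypothetical helper: apply the "at most one space on each side" rule to
 * the code span contents text[0..*len). Returns the new start of the
 * contents and updates *len. The contents are never copied or modified. */
const char* trim_code_span(const char* text, size_t* len)
{
    size_t i;
    int has_non_space = 0;

    /* Trimming needs a space at both ends, hence at least two characters. */
    if (*len < 2)
        return text;
    if (!is_space_or_nl(text[0]) || !is_space_or_nl(text[*len - 1]))
        return text;

    /* A span made only of spaces is left intact. */
    for (i = 0; i < *len; i++) {
        if (!is_space_or_nl(text[i])) {
            has_non_space = 1;
            break;
        }
    }
    if (!has_non_space)
        return text;

    /* Strip exactly one space from each side. */
    *len -= 2;
    return text + 1;
}
```

With this sketch, a span containing two spaces, `foo` and two more spaces keeps one space on each side, while a span of spaces only, or one with a space on a single side, is returned unchanged.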
Fixes:
* [#68](https://github.com/mity/md4c/issues/68):
Some specific HTML blocks were not recognized when EOF follows without any
end-of-line character.
* [#69](https://github.com/mity/md4c/issues/69):
Strike-through span not working correctly when its opener mark is directly
followed by other opener mark; or when other closer mark directly precedes
its closer mark.
## Version 0.3.1
Fixes:
* [#58](https://github.com/mity/md4c/issues/58),
[#59](https://github.com/mity/md4c/issues/59),
[#60](https://github.com/mity/md4c/issues/60),
[#63](https://github.com/mity/md4c/issues/63),
[#66](https://github.com/mity/md4c/issues/66):
Some inputs could lead to quadratic parsing times. Thanks to Anders Kaseorg
for finding all those issues.
* [#61](https://github.com/mity/md4c/issues/61):
Flag `MD_FLAG_NOHTMLSPANS` also erroneously affected recognition of
CommonMark autolinks.
## Version 0.3.0
New features:
* Add extension for GitHub-style task lists:
```
* [x] foo
* [x] bar
* [ ] baz
```
(It has to be explicitly enabled with `MD_FLAG_TASKLISTS`.)
* Added support for building as a shared library. On non-Windows platforms,
this is now default behavior; on Windows static library is still the default.
The CMake option `BUILD_SHARED_LIBS` can be used to request one or the other
explicitly.
Contributed by Lisandro Damián Nicanor Pérez Meyer.
* Renamed structure `MD_RENDERER` to `MD_PARSER` and refactored its contents
a little bit. Note this is source-level incompatible and initialization code
in apps may need to be updated.
The aim of the change is to be more friendly for long-term ABI compatibility
we shall maintain, starting with this release.
* Added `CHANGELOG.md` (this file).
* Make sure `md_process_table_row()` reports the same count of table cells for
all table rows, no matter how broken the input is. The cell count is derived
from table underline line. Bogus cells in other rows are silently ignored.
Missing cells in other rows are reported as empty ones.
Fixes:
* CID 1475544:
Calling `md_free_attribute()` on uninitialized data.
* [#47](https://github.com/mity/md4c/issues/47):
Using bad offsets in `md_is_entity_str()`, in some cases leading to buffer
overflow.
* [#51](https://github.com/mity/md4c/issues/51):
Segfault in `md_process_table_cell()`.
* [#53](https://github.com/mity/md4c/issues/53):
With `MD_FLAG_PERMISSIVEURLAUTOLINKS` or `MD_FLAG_PERMISSIVEWWWAUTOLINKS`
we could generate bad output for ordinary Markdown links, if a non-space
character immediately follows like e.g. in `[link](http://github.com)X`.
## Version 0.2.7
This was the last version before the changelog was added.

CMakeLists.txt Normal file

@ -0,0 +1,56 @@
cmake_minimum_required(VERSION 3.4)
project(MD4C C)
set(MD_VERSION_MAJOR 0)
set(MD_VERSION_MINOR 4)
set(MD_VERSION_RELEASE 3)
set(MD_VERSION "${MD_VERSION_MAJOR}.${MD_VERSION_MINOR}.${MD_VERSION_RELEASE}")
if(WIN32)
# On Windows, given there is no standard lib install dir etc., we rather
# by default build static lib.
option(BUILD_SHARED_LIBS "help string describing option" OFF)
else()
# On Linux, MD4C is slowly being added to some distros which prefer
# shared lib.
option(BUILD_SHARED_LIBS "help string describing option" ON)
endif()
add_definitions(
-DMD_VERSION_MAJOR=${MD_VERSION_MAJOR}
-DMD_VERSION_MINOR=${MD_VERSION_MINOR}
-DMD_VERSION_RELEASE=${MD_VERSION_RELEASE}
)
set(CMAKE_CONFIGURATION_TYPES Debug Release RelWithDebInfo MinSizeRel)
if("${CMAKE_BUILD_TYPE}" STREQUAL "")
set(CMAKE_BUILD_TYPE $ENV{CMAKE_BUILD_TYPE})
if("${CMAKE_BUILD_TYPE}" STREQUAL "")
set(CMAKE_BUILD_TYPE "Release")
endif()
endif()
if(${CMAKE_C_COMPILER_ID} MATCHES GNU|Clang)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wall")
elseif(MSVC)
# Disable warnings about the so-called unsecured functions:
add_definitions(/D_CRT_SECURE_NO_WARNINGS)
# Specify proper C runtime library:
string(REGEX REPLACE "/M[DT]d?" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")
string(REGEX REPLACE "/M[DT]d?" "" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "/M[DT]d?" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "/M[DT]d?" "" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /MTd")
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /MT")
set(CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO} /MT")
set(CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL} /MT")
endif()
include(GNUInstallDirs)
add_subdirectory(md4c)
add_subdirectory(md2html)

LICENSE.md Normal file

@ -0,0 +1,22 @@
# The MIT License (MIT)
Copyright © 2016-2020 Martin Mitáš
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the “Software”),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.

README.md Normal file

@ -0,0 +1,286 @@
[![Linux Build Status (travis-ci.com)](https://img.shields.io/travis/mity/md4c/master.svg?logo=linux&label=linux%20build)](https://travis-ci.org/mity/md4c)
[![Windows Build Status (appveyor.com)](https://img.shields.io/appveyor/ci/mity/md4c/master.svg?logo=windows&label=windows%20build)](https://ci.appveyor.com/project/mity/md4c/branch/master)
[![Code Coverage Status (codecov.io)](https://img.shields.io/codecov/c/github/mity/md4c/master.svg?logo=codecov&label=code%20coverage)](https://codecov.io/github/mity/md4c)
[![Coverity Scan Status](https://img.shields.io/coverity/scan/mity-md4c.svg?label=coverity%20scan)](https://scan.coverity.com/projects/mity-md4c)
# MD4C Readme
* Home: http://github.com/mity/md4c
* Wiki: http://github.com/mity/md4c/wiki
* Issue tracker: http://github.com/mity/md4c/issues
MD4C stands for "Markdown for C" and that's exactly what this project is about.
## What is Markdown
In short, Markdown is the markup language this `README.md` file is written in.
The following resources can explain more if you are unfamiliar with it:
* [Wikipedia article](http://en.wikipedia.org/wiki/Markdown)
* [CommonMark site](http://commonmark.org)
## What is MD4C
MD4C is a C Markdown parser with the following features:
* **Compliance:** Generally MD4C aims to be compliant to the latest version of
[CommonMark specification](http://spec.commonmark.org/). Currently, we are
fully compliant to CommonMark 0.29.
* **Extensions:** MD4C supports some commonly requested and accepted extensions.
See below.
* **Compactness:** MD4C is implemented in one source file and one header file.
There are no dependencies other than the standard C library.
* **Embedding:** MD4C is easy to reuse in other projects, its API is very
straightforward: There is actually just one function, `md_parse()`.
* **Push model:** MD4C parses the complete document and calls a few callback
functions provided by the application to inform it about a start/end of
every block, a start/end of every span, and with any textual contents.
* **Portability:** MD4C builds and works on Windows and POSIX-compliant OSes.
(It should be simple to make it run also on most other platforms, at least as
long as the platform provides C standard library, including a heap memory
management.)
* **Encoding:** MD4C can be compiled to recognize ASCII-only control characters,
UTF-8 and, on Windows, also UTF-16 (i.e. what is on Windows commonly called
just "Unicode"). See more details below.
* **Permissive license:** MD4C is available under the MIT license.
* **Performance:** MD4C is [very fast](https://talk.commonmark.org/t/2520).
## Using MD4C
An application has to include the header `md4c.h` and link against MD4C library;
or alternatively it may include `md4c.h` and `md4c.c` directly into its source
base as the parser is only implemented in the single C source file.
The main provided function is `md_parse()`. It takes a text in the Markdown
syntax and a pointer to a structure which provides pointers to several callback
functions.
As `md_parse()` processes the input, it calls the callbacks (when entering or
leaving any Markdown block or span; and when outputting any textual content of
the document), allowing application to convert it into another format or render
it onto the screen.
An example implementation of simple renderer is available in the `md2html`
directory which implements a conversion utility from Markdown to HTML.
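To make the push model more concrete, here is a minimal, self-contained sketch in C. It is deliberately a toy and does not use the real MD4C API: the names `toy_callbacks`, `toy_parse()`, `toy_stats` and the `stats_*` functions are hypothetical, and the "parser" only treats every non-empty line as one paragraph. Consult `md4c.h` for the actual `MD_PARSER` structure and `md_parse()` prototype.

```c
#include <stddef.h>

/* Hypothetical callback table for this toy; NOT the structure from md4c.h. */
typedef struct toy_callbacks {
    int (*enter_paragraph)(void* userdata);
    int (*leave_paragraph)(void* userdata);
    int (*text)(const char* text, size_t size, void* userdata);
} toy_callbacks;

/* Toy push parser: every non-empty line is reported as one paragraph.
 * A non-zero return value from any callback aborts the parsing and is
 * propagated to the caller (a design choice of this sketch). */
int toy_parse(const char* input, size_t size, const toy_callbacks* cb, void* userdata)
{
    size_t off = 0;
    int ret;

    while (off < size) {
        size_t beg;

        /* Skip line breaks between paragraphs. */
        while (off < size && input[off] == '\n')
            off++;
        if (off >= size)
            break;

        /* Find the end of the current line. */
        beg = off;
        while (off < size && input[off] != '\n')
            off++;

        /* Push the events to the application. */
        if ((ret = cb->enter_paragraph(userdata)) != 0) return ret;
        if ((ret = cb->text(input + beg, off - beg, userdata)) != 0) return ret;
        if ((ret = cb->leave_paragraph(userdata)) != 0) return ret;
    }
    return 0;
}

/* Application side: count the paragraphs and the bytes of text seen. */
typedef struct toy_stats {
    int paragraphs;
    size_t bytes;
} toy_stats;

int stats_enter(void* userdata)
{
    ((toy_stats*)userdata)->paragraphs++;
    return 0;
}

int stats_leave(void* userdata)
{
    (void)userdata;
    return 0;
}

int stats_text(const char* text, size_t size, void* userdata)
{
    (void)text;
    ((toy_stats*)userdata)->bytes += size;
    return 0;
}
```

The point of the design is that the parser never builds a document tree: the application decides what to keep, here just two counters passed through the opaque `userdata` pointer.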
## Markdown Extensions
The default behavior is to recognize only Markdown syntax defined by the
[CommonMark specification](http://spec.commonmark.org/).
However with appropriate flags, the behavior can be tuned to enable some
additional extensions:
* With the flag `MD_FLAG_COLLAPSEWHITESPACE`, a non-trivial whitespace is
collapsed into a single space.
* With the flag `MD_FLAG_TABLES`, GitHub-style tables are supported.
* With the flag `MD_FLAG_TASKLISTS`, GitHub-style task lists are supported.
* With the flag `MD_FLAG_STRIKETHROUGH`, strike-through spans are enabled
(text enclosed in tilde marks, e.g. `~foo bar~`).
* With the flag `MD_FLAG_PERMISSIVEURLAUTOLINKS` permissive URL autolinks
(not enclosed in `<` and `>`) are supported.
* With the flag `MD_FLAG_PERMISSIVEEMAILAUTOLINKS`, permissive e-mail
autolinks (not enclosed in `<` and `>`) are supported.
* With the flag `MD_FLAG_PERMISSIVEWWWAUTOLINKS` permissive WWW autolinks
without any scheme specified (e.g. `www.example.com`) are supported. MD4C
then assumes `http:` scheme.
* With the flag `MD_FLAG_LATEXMATHSPANS` LaTeX math spans (`$...$`) and
LaTeX display math spans (`$$...$$`) are supported. (Note though that the
HTML renderer outputs them verbatim in a custom tag `<x-equation>`.)
* With the flag `MD_FLAG_WIKILINKS`, wiki-style links (`[[link label]]` and
`[[target article|link label]]`) are supported. (Note that the HTML renderer
outputs them in a custom tag `<x-wikilink>`.)
* With the flag `MD_FLAG_UNDERLINE`, underscore (`_`) denotes an underline
instead of an ordinary emphasis or strong emphasis.
A few features of CommonMark (those some people see as mis-features) may be
disabled:
* With the flag `MD_FLAG_NOHTMLSPANS` or `MD_FLAG_NOHTMLBLOCKS`, raw inline
HTML or raw HTML blocks respectively are disabled.
* With the flag `MD_FLAG_NOINDENTEDCODEBLOCKS`, indented code blocks are
disabled.
## Input/Output Encoding
The CommonMark specification generally assumes UTF-8 input, but under closer
inspection, Unicode plays a role in only a few very specific situations when parsing
Markdown documents:
1. For detection of word boundaries when processing emphasis and strong
emphasis, some classification of Unicode characters (whether it is
a whitespace or a punctuation) is needed.
2. For (case-insensitive) matching of a link reference label with the
corresponding link reference definition, Unicode case folding is used.
3. For translating HTML entities (e.g. `&amp;`) and numeric character
references (e.g. `&#35;` or `&#xcab;`) into their Unicode equivalents.
However, MD4C leaves this translation to the renderer/application, as the
renderer is supposed to really know output encoding and whether it really
needs to perform this kind of translation. (For example, when the renderer
outputs HTML, it may leave the entities untranslated and defer the work to
a web browser.)
MD4C relies on this property of the CommonMark and the implementation is, to
a large degree, encoding-agnostic. Most of MD4C code only assumes that the
encoding of your choice is compatible with ASCII, i.e. that the codepoints
below 128 have the same numeric values as ASCII.
Any input MD4C does not understand is simply seen as part of the document text
and sent to the renderer's callback functions unchanged.
The two situations (word boundary detection and link reference matching) where
MD4C has to understand Unicode are handled as specified by the following rules:
* If preprocessor macro `MD4C_USE_UTF8` is defined, MD4C assumes UTF-8 for the
word boundary detection and for the case-insensitive matching of link labels.
When none of these macros is explicitly used, this is the default behavior.
* On Windows, if preprocessor macro `MD4C_USE_UTF16` is defined, MD4C uses
`WCHAR` instead of `char` and assumes UTF-16 encoding in those situations.
(UTF-16 is what Windows developers usually call just "Unicode" and what
Win32API generally works with.)
Note that because this macro affects also the types in `md4c.h`, you have
to define the macro both when building MD4C as well as when including
`md4c.h`.
Also note this is only supported in the parser (`md4c.[hc]`). The HTML
renderer does not support this and you will have to write your own custom
renderer to use this feature.
* If preprocessor macro `MD4C_USE_ASCII` is defined, MD4C assumes nothing but
an ASCII input.
That effectively means that non-ASCII whitespace or punctuation characters
won't be recognized as such and that link reference matching will work in
a case-insensitive way only for ASCII letters (`[a-zA-Z]`).
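As an illustration of what ASCII-only case-insensitive matching means, the following C sketch compares two labels while folding only the letters `[a-zA-Z]`. It is not the code from `md4c.c`: the function names `ascii_fold()` and `ascii_label_eq()` are hypothetical, and the sketch ignores the other normalization real link label matching needs (such as whitespace collapsing).

```c
#include <stddef.h>

/* Fold an ASCII upper-case letter to lower case; leave any other byte alone. */
unsigned char ascii_fold(unsigned char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? (unsigned char)(ch - 'A' + 'a') : ch;
}

/* Return 1 if both labels are equal when only [a-zA-Z] is compared
 * case-insensitively, 0 otherwise. All other bytes must match exactly. */
int ascii_label_eq(const char* a, size_t a_len, const char* b, size_t b_len)
{
    size_t i;

    if (a_len != b_len)
        return 0;
    for (i = 0; i < a_len; i++) {
        if (ascii_fold((unsigned char)a[i]) != ascii_fold((unsigned char)b[i]))
            return 0;
    }
    return 1;
}
```

Under this scheme `Foo` matches `fOO`, but the UTF-8 encodings of `É` and `é` do not match each other, because their bytes lie outside the ASCII range and are compared verbatim.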
## Documentation
The API is quite well documented in the comments in the `md4c.h` header.
There is also [project wiki](http://github.com/mity/md4c/wiki) which provides
some more comprehensive documentation. However note it is incomplete and some
details may be a little bit outdated.
## FAQ
**Q: In my code, I need to convert Markdown to HTML. How?**
**A:** Indeed the API, as provided by `md4c.h`, is just a SAX-like Markdown
parser. Nothing more and nothing less.
That said, there is a complete HTML generator built on top of the parser in the
directory `md2html` (the files `render_html.[hc]` and `entity.[hc]`). At this
time, you have to directly reuse that code in your project.
There is [some discussion](https://github.com/mity/md4c/issues/82) whether this
should be changed (and how) in the future.
**Q: How does MD4C compare to a parser XY?**
**A:** Some other implementations combine Markdown parser and HTML generator
into a single entangled code hidden behind an interface which just allows the
conversion from Markdown to HTML, and they are unusable if you want to process
the input in any other way.
Even when the parsing is available as a standalone feature, most parsers (if
not all of them; at least within the scope of C/C++ language) are full DOM-like
parsers: They construct abstract syntax tree (AST) representation of the whole
Markdown document. That takes time and it leads to bigger memory footprint.
It's completely fine as long as you really need it. If you don't need the full
AST, there is very high chance that using MD4C will be faster and much less
memory-hungry.
Last but not least, some Markdown parsers are implemented in a naive way. When
fed with a [smartly crafted input pattern](test/pathological_tests.py), they
may exhibit quadratic (or even worse) parsing times. What MD4C can still parse
in a fraction of second may turn into long minutes or possibly hours with them.
Hence, when such a naive parser is used to process an input from an untrusted
source, the possibility of denial-of-service attacks becomes a real danger.
A lot of our effort went into providing linear parsing times no matter what
kind of crazy input MD4C parser is fed with. (If you encounter an input pattern
which leads to super-linear parsing times, please do not hesitate to report it
as a bug.)
**Q: Does MD4C perform any input validation?**
**A:** No.
CommonMark specification declares that any sequence of (Unicode) characters is
a valid Markdown document; i.e. that it does not matter whether some Markdown
syntax is in some way broken or not. If it is broken, it will simply not be
recognized and the parser should see the broken syntax construction just as a
verbatim text.
MD4C takes this a step further. It sees any sequence of bytes as a valid input,
following completely the GIGO philosophy (garbage in, garbage out).
If you need to validate that the input is, say, a valid UTF-8 document, you
have to do it on your own. You can simply validate the whole Markdown document
before passing it to the MD4C parser.
Alternatively, you may perform the validation on the fly during the parsing,
in the `MD_PARSER::text()` callback. (Given how MD4C works internally, it will
never break a sequence of bytes into multiple calls of `MD_PARSER::text()`,
unless that sequence is already broken to multiple pieces in the input by some
whitespace, new line character(s) and/or any Markdown syntax construction.)
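For the first approach (validating the whole document up front), a check along the following lines could be used. This is a sketch, not part of MD4C: the function name `is_valid_utf8()` is hypothetical, and it assumes you want to reject overlong encodings, surrogate code points and anything above U+10FFFF.

```c
#include <stddef.h>

/* Return 1 if buf[0..size) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char* buf, size_t size)
{
    size_t i = 0;

    while (i < size) {
        unsigned char b0 = buf[i];
        size_t n;           /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */
        size_t k;

        if (b0 < 0x80) {
            i++;            /* plain ASCII byte */
            continue;
        } else if ((b0 & 0xE0) == 0xC0) {
            n = 1; cp = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            n = 2; cp = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            n = 3; cp = b0 & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (size - i <= n)
            return 0;       /* sequence is truncated by the end of input */

        for (k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range code points. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += n + 1;
    }
    return 1;
}
```

An application would call such a function on the whole input buffer and only hand the buffer over to the parser when it returns 1.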
## License
MD4C is covered by the MIT license, see the file `LICENSE.md`.
## Links to Related Projects
Ports and bindings to other languages:
* [commonmark-d](https://github.com/AuburnSounds/commonmark-d):
Port of MD4C to D language.
* [markdown-wasm](https://github.com/rsms/markdown-wasm):
Markdown parser and HTML generator for WebAssembly, based on MD4C.
Software using MD4C:
* [Qt](https://www.qt.io/):
Cross-platform C++ GUI framework.
* [Textosaurus](https://github.com/martinrotter/textosaurus):
Cross-platform text editor based on Qt and Scintilla.
* [8th](https://8th-dev.com/):
Cross-platform concatenative programming language.

appveyor.yml Normal file

@ -0,0 +1,29 @@
# YAML definition for Appveyor.com continuous integration.
# See http://www.appveyor.com/docs/appveyor-yml
version: '{branch}-{build}'
before_build:
- 'cmake --version'
- 'if "%PLATFORM%"=="x64" cmake -G "Visual Studio 12 Win64" .'
- 'if not "%PLATFORM%"=="x64" cmake -G "Visual Studio 12" .'
build:
project: md4c.sln
verbosity: detailed
skip_tags: true
os:
- Windows Server 2012 R2
configuration:
- Debug
- Release
platform:
- x64 # 64-bit build
- win32 # 32-bit build
artifacts:
- path: $(configuration)/md2html/md2html.exe

codecov.yml Normal file

@ -0,0 +1,4 @@
# YAML definition for codecov.io code coverage reports.
ignore:
- "md2html"

md2html/CMakeLists.txt Normal file

@ -0,0 +1,15 @@
include_directories("${PROJECT_SOURCE_DIR}/md4c")
add_executable(md2html cmdline.c cmdline.h entity.c entity.h md2html.c render_html.c render_html.h)
target_link_libraries(md2html md4c)
install(
TARGETS md2html
ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
install(FILES "md2html.1" DESTINATION "${CMAKE_INSTALL_MANDIR}/man1")

md2html/cmdline.c Normal file

@ -0,0 +1,296 @@
/* cmdline.c: a reentrant version of getopt(). Written 2006 by Brian
* Raiter. This code is in the public domain.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "cmdline.h"
#define docallback(opt, val) \
do { if ((r = callback(opt, val, data)) != 0) return r; } while (0)
/* Parse the given cmdline arguments.
*/
int readoptions(option const* list, int argc, char **argv,
int (*callback)(int, char const*, void*), void *data)
{
char argstring[] = "--";
option const *opt;
char const *val;
char const *p;
int stop = 0;
int argi, len, r;
if (!list || !callback)
return -1;
for (argi = 1 ; argi < argc ; ++argi)
{
/* First, check for "--", which forces all remaining arguments
* to be treated as non-options.
*/
if (!stop && argv[argi][0] == '-' && argv[argi][1] == '-'
&& argv[argi][2] == '\0') {
stop = 1;
continue;
}
/* Arguments that do not begin with '-' (or are only "-") are
* not options.
*/
if (stop || argv[argi][0] != '-' || argv[argi][1] == '\0') {
docallback(0, argv[argi]);
continue;
}
if (argv[argi][1] == '-')
{
/* Arguments that begin with a double-dash are long
* options.
*/
p = argv[argi] + 2;
val = strchr(p, '=');
if (val)
len = val++ - p;
else
len = strlen(p);
/* Is it on the list of valid options? If so, does it
* expect a parameter?
*/
for (opt = list ; opt->optval ; ++opt)
if (opt->name && !strncmp(p, opt->name, len)
&& !opt->name[len])
break;
if (!opt->optval) {
docallback('?', argv[argi]);
} else if (!val && opt->arg == 1) {
docallback(':', argv[argi]);
} else if (val && opt->arg == 0) {
docallback('=', argv[argi]);
} else {
docallback(opt->optval, val);
}
}
else
{
/* Arguments that begin with a single dash contain one or
* more short options. Each character in the argument is
* examined in turn, unless a parameter consumes the rest
* of the argument (or possibly even the following
* argument).
*/
for (p = argv[argi] + 1 ; *p ; ++p) {
for (opt = list ; opt->optval ; ++opt)
if (opt->chname == *p)
break;
if (!opt->optval) {
argstring[1] = *p;
docallback('?', argstring);
continue;
} else if (opt->arg == 0) {
docallback(opt->optval, NULL);
continue;
} else if (p[1]) {
docallback(opt->optval, p + 1);
break;
} else if (argi + 1 < argc && strcmp(argv[argi + 1], "--")) {
++argi;
docallback(opt->optval, argv[argi]);
break;
} else if (opt->arg == 2) {
docallback(opt->optval, NULL);
continue;
} else {
argstring[1] = *p;
docallback(':', argstring);
break;
}
}
}
}
return 0;
}
/* Verify that str points to an ASCII zero or one (optionally with
* whitespace) and return the value present, or -1 if str's contents
* are anything else.
*/
static int readboolvalue(char const *str)
{
char d;
while (isspace(*str))
++str;
if (!*str)
return -1;
d = *str++;
while (isspace(*str))
++str;
if (*str)
return -1;
if (d == '0')
return 0;
else if (d == '1')
return 1;
else
return -1;
}
/* Parse a configuration file.
*/
int readcfgfile(option const* list, FILE *fp,
int (*callback)(int, char const*, void*), void *data)
{
char buf[1024];
option const *opt;
char *name, *val, *p;
int len, f, r;
while (fgets(buf, sizeof buf, fp) != NULL)
{
/* Strip off the trailing newline and any leading whitespace.
* If the line begins with a hash sign, skip it entirely.
*/
len = strlen(buf);
if (len && buf[len - 1] == '\n')
buf[--len] = '\0';
for (p = buf ; isspace(*p) ; ++p) ;
if (!*p || *p == '#')
continue;
/* Find the end of the option's name and the beginning of the
* parameter, if any.
*/
for (name = p ; *p && *p != '=' && !isspace(*p) ; ++p) ;
len = p - name;
for ( ; *p == '=' || isspace(*p) ; ++p) ;
val = p;
/* Is it on the list of valid options? Does it take a
* full parameter, or just an optional boolean?
*/
for (opt = list ; opt->optval ; ++opt)
if (opt->name && !strncmp(name, opt->name, len)
&& !opt->name[len])
break;
if (!opt->optval) {
docallback('?', name);
} else if (!*val && opt->arg == 1) {
docallback(':', name);
} else if (*val && opt->arg == 0) {
f = readboolvalue(val);
if (f < 0)
docallback('=', name);
else if (f == 1)
docallback(opt->optval, NULL);
} else {
docallback(opt->optval, val);
}
}
return ferror(fp) ? -1 : 0;
}
/* Turn a string containing a cmdline into an argc-argv pair.
*/
int makecmdline(char const *cmdline, int *argcp, char ***argvp)
{
char **argv;
int argc;
char const *s;
int n, quoted;
if (!cmdline)
return 0;
/* Calculate argc by counting the number of "clumps" of non-spaces.
*/
for (s = cmdline ; isspace(*s) ; ++s) ;
if (!*s) {
*argcp = 1;
if (argvp) {
*argvp = malloc(2 * sizeof(char*));
if (!*argvp)
return 0;
(*argvp)[0] = NULL;
(*argvp)[1] = NULL;
}
return 1;
}
for (argc = 2, quoted = 0 ; *s ; ++s) {
if (quoted == '"') {
if (*s == '"')
quoted = 0;
else if (*s == '\\' && s[1])
++s;
} else if (quoted == '\'') {
if (*s == '\'')
quoted = 0;
} else {
if (isspace(*s)) {
for ( ; isspace(s[1]) ; ++s) ;
if (!s[1])
break;
++argc;
} else if (*s == '"' || *s == '\'') {
quoted = *s;
}
}
}
*argcp = argc;
if (!argvp)
return 1;
/* Allocate space for all the arguments and their pointers.
*/
argv = malloc((argc + 1) * sizeof(char*) + strlen(cmdline) + 1);
*argvp = argv;
if (!argv)
return 0;
argv[0] = NULL;
argv[1] = (char*)(argv + argc + 1);
/* Copy the string into the allocated memory immediately after the
* argv array. Where a space immediately follows a nonspace,
* replace it with a \0. Where a nonspace immediately follows
* spaces, store a pointer to it. (Except, of course, when the
* space-nonspace transitions occur within quotes.)
*/
for (s = cmdline ; isspace(*s) ; ++s) ;
for (argc = 1, n = 0, quoted = 0 ; *s ; ++s) {
if (quoted == '"') {
if (*s == '"') {
quoted = 0;
} else {
if (*s == '\\' && s[1])
++s;
argv[argc][n++] = *s;
}
} else if (quoted == '\'') {
if (*s == '\'')
quoted = 0;
else
argv[argc][n++] = *s;
} else {
if (isspace(*s)) {
argv[argc][n] = '\0';
for ( ; isspace(s[1]) ; ++s) ;
if (!s[1])
break;
argv[argc + 1] = argv[argc] + n + 1;
++argc;
n = 0;
} else {
if (*s == '"' || *s == '\'')
quoted = *s;
else
argv[argc][n++] = *s;
}
}
}
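/* Terminate the final argument. The loop above only writes a \0
* when whitespace follows an argument, and malloc() does not zero
* the block.
*/
argv[argc][n] = '\0';
argv[argc + 1] = NULL;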
return 1;
}

86
md2html/cmdline.h Normal file

@ -0,0 +1,86 @@
/* cmdline.h: a reentrant version of getopt(). Written 2006 by Brian
* Raiter. This code is in the public domain.
*/
#ifndef _cmdline_h_
#define _cmdline_h_
#include <stdio.h> /* for FILE, used by readcfgfile() below */
/* The information specifying a single cmdline option.
*/
typedef struct option {
char const *name; /* the option's long name, or "" if none */
char chname; /* a single-char name, or zero if none */
int optval; /* a unique value representing this option */
int arg; /* 0 = no arg, 1 = arg req'd, 2 = optional */
} option;
/* Parse the given cmdline arguments. list is an array of option
* structs, each entry specifying a valid option. The last struct in
* the array must have name set to NULL. argc and argv give the
* cmdline to parse. callback is the function to call for each option
* and non-option found on the cmdline. data is a pointer that is
* passed to each invocation of callback. The return value of callback
* should be zero to continue processing the cmdline, or any other
* value to abort. The return value of readoptions() is the value
* returned from the last callback, or zero if no arguments were
* found, or -1 if an error occurred.
*
* When readoptions() encounters a regular cmdline argument (i.e. a
* non-option argument), callback() is invoked with opt equal to zero
* and val pointing to the argument. When an option is found,
* callback() is invoked with opt equal to the optval field in the
* option struct corresponding to that option, and val points to the
* option's parameter, or is NULL if the option does not take a
* parameter. If readoptions() finds an option that does not appear in
* the list of valid options, callback() is invoked with opt equal to
* '?'. If readoptions() encounters an option that is missing its
* required parameter, callback() is invoked with opt equal to ':'. If
* readoptions() finds a parameter on a long option that does not
* admit a parameter, callback() is invoked with opt equal to '='. In
* each of these cases, val will point to the erroneous option
* argument.
*/
extern int readoptions(option const* list, int argc, char **argv,
int (*callback)(int opt, char const *val, void *data),
void *data);
/* Parse the given file. list is an array of option structs, in the
* same form as taken by readoptions(). fp is a pointer to an open
* text file. callback is the function to call for each line found in
* the configuration file. data is a pointer that is passed to each
* invocation of callback. The return value of readcfgfile() is the
* value returned from the last callback, or zero if no arguments were
* found, or -1 if an error occurred while reading the file.
*
* The function will ignore lines that contain only whitespace, or
* lines that begin with a hash sign. All other lines should be of the
* form "OPTION=VALUE", where OPTION is one of the long options in
* list. Whitespace around the equal sign is permitted. An option that
* takes no arguments can either have a VALUE of 0 or 1, or omit the
* "=VALUE" entirely. (A VALUE of 0 will behave the same as if the
* line was not present.)
*/
extern int readcfgfile(option const* list, FILE *fp,
int (*callback)(int opt, char const *val, void *data),
void *data);
/* Create an argc-argv pair from a string containing a command line.
* cmdline is the string to be parsed. argcp points to the variable to
* receive the argc value, and argvp points to the variable to receive
* the argv value. argvp can be NULL if the caller just wants to get
* argc. Zero is returned on failure. This function allocates memory
* on behalf of the caller. The memory is allocated as a single block,
* so it is sufficient to simply free() the pointer returned through
* argvp. Note that argv[0] will always be initialized to NULL; the
* first argument will be stored in argv[1]. The string is parsed by
* separating arguments on whitespace boundaries. Space within
* substrings enclosed in single-quotes is ignored. A substring
* enclosed in double-quotes is treated the same, except that the
* backslash is recognized as an escape character within such a
* substring. Enclosing quotes and escaping backslashes are not copied
* into the argv values.
*/
extern int makecmdline(char const *cmdline, int *argcp, char ***argvp);
#endif

2190
md2html/entity.c Normal file

File diff suppressed because it is too large

42
md2html/entity.h Normal file

@ -0,0 +1,42 @@
/*
* MD4C: Markdown parser for C
* (http://github.com/mity/md4c)
*
* Copyright (c) 2016-2017 Martin Mitas
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#ifndef MD2HTML_ENTITY_H
#define MD2HTML_ENTITY_H
#include <stdlib.h>
/* Most entities consist of a single Unicode codepoint; a few consist of two.
* Single-codepoint entities have codepoints[1] set to zero. */
struct entity {
const char* name;
unsigned codepoints[2];
};
const struct entity* entity_lookup(const char* name, size_t name_size);
#endif /* MD2HTML_ENTITY_H */

113
md2html/md2html.1 Normal file

@ -0,0 +1,113 @@
.TH MD2HTML 1 "June 2019" "" "General Commands Manual"
.nh
.ad l
.
.SH NAME
.
md2html \- convert Markdown to HTML
.
.SH SYNOPSIS
.
.B md2html
.RI [ OPTION ]...\&
.RI [ FILE ]
.
.SH OPTIONS
.
.SS General options:
.
.TP
.BR -o ", " --output= \fIOUTFILE\fR
Write output to \fIOUTFILE\fR instead of \fBstdout\fR(3)
.
.TP
.BR -f ", " --full-html
Generate full HTML document, including header
.
.TP
.BR -s ", " --stat
Measure time of input parsing
.
.TP
.BR -h ", " --help
Display help and exit
.
.TP
.BR -v ", " --version
Display version and exit
.
.SS Markdown dialect options:
.
.TP
.B --commonmark
CommonMark (the default)
.
.TP
.B --github
GitHub Flavored Markdown
.
.PP
Note: dialect options are equivalent to some combination of flags below.
.
.SS Markdown extension options:
.
.TP
.B --fcollapse-whitespace
Collapse non-trivial whitespace
.
.TP
.B --fverbatim-entities
Do not translate entities
.
.TP
.B --fpermissive-atx-headers
Allow ATX headers without delimiting space
.
.TP
.B --fpermissive-url-autolinks
Allow URL autolinks without "<" and ">" delimiters
.
.TP
.B --fpermissive-www-autolinks
Allow WWW autolinks without any scheme (e.g. "www.example.com")
.
.TP
.B --fpermissive-email-autolinks
Allow e-mail autolinks without "<", ">" and "mailto:"
.
.TP
.B --fpermissive-autolinks
Enable all 3 of the above permissive autolinks options
.
.TP
.B --fno-indented-code
Disable indented code blocks
.
.TP
.B --fno-html-blocks
Disable raw HTML blocks
.
.TP
.B --fno-html-spans
Disable raw HTML spans
.
.TP
.B --fno-html
Same as \fB--fno-html-blocks --fno-html-spans\fR
.
.TP
.B --ftables
Enable tables
.
.TP
.B --fstrikethrough
Enable strikethrough spans
.
.TP
.B --ftasklists
Enable task lists
.
.TP
.B --flatex-math
Enable LaTeX style mathematics spans
.
.TP
.B --funderline
Enable underline spans
.
.TP
.B --fwiki-links
Enable wiki links
.
.SH SEE ALSO
.
https://github.com/mity/md4c
.

371
md2html/md2html.c Normal file

@ -0,0 +1,371 @@
/*
* MD4C: Markdown parser for C
* (http://github.com/mity/md4c)
*
* Copyright (c) 2016-2017 Martin Mitas
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "render_html.h"
#include "cmdline.h"
/* Global options. */
static unsigned parser_flags = 0;
static unsigned renderer_flags = MD_RENDER_FLAG_DEBUG;
static int want_fullhtml = 0;
static int want_stat = 0;
/*********************************
*** Simple grow-able buffer ***
*********************************/
/* We render to a memory buffer instead of writing the rendered document
* directly. This lets the --stat option measure the time spent in the
* parser alone, without the cost of I/O.
*/
struct membuffer {
char* data;
size_t asize;
size_t size;
};
static void
membuf_init(struct membuffer* buf, MD_SIZE new_asize)
{
buf->size = 0;
buf->asize = new_asize;
buf->data = malloc(buf->asize);
if(buf->data == NULL) {
fprintf(stderr, "membuf_init: malloc() failed.\n");
exit(1);
}
}
static void
membuf_fini(struct membuffer* buf)
{
if(buf->data)
free(buf->data);
}
static void
membuf_grow(struct membuffer* buf, size_t new_asize)
{
buf->data = realloc(buf->data, new_asize);
if(buf->data == NULL) {
fprintf(stderr, "membuf_grow: realloc() failed.\n");
exit(1);
}
buf->asize = new_asize;
}
static void
membuf_append(struct membuffer* buf, const char* data, MD_SIZE size)
{
if(buf->asize < buf->size + size)
membuf_grow(buf, buf->size + buf->size / 2 + size);
memcpy(buf->data + buf->size, data, size);
buf->size += size;
}
/**********************
*** Main program ***
**********************/
static void
process_output(const MD_CHAR* text, MD_SIZE size, void* userdata)
{
membuf_append((struct membuffer*) userdata, text, size);
}
static int
process_file(FILE* in, FILE* out)
{
MD_SIZE n;
struct membuffer buf_in = {0};
struct membuffer buf_out = {0};
int ret = -1;
clock_t t0, t1;
membuf_init(&buf_in, 32 * 1024);
/* Read the input file into a buffer. */
while(1) {
if(buf_in.size >= buf_in.asize)
membuf_grow(&buf_in, buf_in.asize + buf_in.asize / 2);
n = fread(buf_in.data + buf_in.size, 1, buf_in.asize - buf_in.size, in);
if(n == 0)
break;
buf_in.size += n;
}
/* The input size is a good estimate of the output size. Add some reserve
* for the HTML tags. */
membuf_init(&buf_out, buf_in.size + buf_in.size/8 + 64);
/* Parse the document. md_render_html() hands each piece of generated
* HTML to process_output(), which appends it to buf_out. */
t0 = clock();
ret = md_render_html(buf_in.data, buf_in.size, process_output,
(void*) &buf_out, parser_flags, renderer_flags);
t1 = clock();
if(ret != 0) {
fprintf(stderr, "Parsing failed.\n");
goto out;
}
/* Write out the document in HTML format. */
if(want_fullhtml) {
fprintf(out, "<html>\n");
fprintf(out, "<head>\n");
fprintf(out, "<title></title>\n");
fprintf(out, "<meta name=\"generator\" content=\"md2html\">\n");
fprintf(out, "</head>\n");
fprintf(out, "<body>\n");
}
fwrite(buf_out.data, 1, buf_out.size, out);
if(want_fullhtml) {
fprintf(out, "</body>\n");
fprintf(out, "</html>\n");
}
if(want_stat) {
if(t0 != (clock_t)-1 && t1 != (clock_t)-1) {
double elapsed = (double)(t1 - t0) / CLOCKS_PER_SEC;
if (elapsed < 1)
fprintf(stderr, "Time spent on parsing: %7.2f ms.\n", elapsed*1e3);
else
fprintf(stderr, "Time spent on parsing: %6.3f s.\n", elapsed);
}
}
/* Success if we have reached here. */
ret = 0;
out:
membuf_fini(&buf_in);
membuf_fini(&buf_out);
return ret;
}
#define OPTION_ARG_NONE 0
#define OPTION_ARG_REQUIRED 1
#define OPTION_ARG_OPTIONAL 2
static const option cmdline_options[] = {
{ "output", 'o', 'o', OPTION_ARG_REQUIRED },
{ "full-html", 'f', 'f', OPTION_ARG_NONE },
{ "stat", 's', 's', OPTION_ARG_NONE },
{ "help", 'h', 'h', OPTION_ARG_NONE },
{ "version", 'v', 'v', OPTION_ARG_NONE },
{ "commonmark", 0, 'c', OPTION_ARG_NONE },
{ "github", 0, 'g', OPTION_ARG_NONE },
{ "fcollapse-whitespace", 0, 'W', OPTION_ARG_NONE },
{ "flatex-math", 0, 'L', OPTION_ARG_NONE },
{ "fpermissive-atx-headers", 0, 'A', OPTION_ARG_NONE },
{ "fpermissive-autolinks", 0, 'V', OPTION_ARG_NONE },
{ "fpermissive-email-autolinks", 0, '@', OPTION_ARG_NONE },
{ "fpermissive-url-autolinks", 0, 'U', OPTION_ARG_NONE },
{ "fpermissive-www-autolinks", 0, '.', OPTION_ARG_NONE },
{ "fstrikethrough", 0, 'S', OPTION_ARG_NONE },
{ "ftables", 0, 'T', OPTION_ARG_NONE },
{ "ftasklists", 0, 'X', OPTION_ARG_NONE },
{ "funderline", 0, '_', OPTION_ARG_NONE },
{ "fverbatim-entities", 0, 'E', OPTION_ARG_NONE },
{ "fwiki-links", 0, 'K', OPTION_ARG_NONE },
{ "fno-html-blocks", 0, 'F', OPTION_ARG_NONE },
{ "fno-html-spans", 0, 'G', OPTION_ARG_NONE },
{ "fno-html", 0, 'H', OPTION_ARG_NONE },
{ "fno-indented-code", 0, 'I', OPTION_ARG_NONE },
{ 0 }
};
static void
usage(void)
{
printf(
"Usage: md2html [OPTION]... [FILE]\n"
"Convert input FILE (or standard input) in Markdown format to HTML.\n"
"\n"
"General options:\n"
" -o, --output=FILE Output file (default is standard output)\n"
" -f, --full-html Generate full HTML document, including header\n"
" -s, --stat Measure time of input parsing\n"
" -h, --help Display this help and exit\n"
" -v, --version Display version and exit\n"
"\n"
"Markdown dialect options:\n"
"(note these are equivalent to some combinations of the flags below)\n"
" --commonmark CommonMark (this is the default)\n"
" --github GitHub Flavored Markdown\n"
"\n"
"Markdown extension options:\n"
" --fcollapse-whitespace\n"
" Collapse non-trivial whitespace\n"
" --flatex-math Enable LaTeX style mathematics spans\n"
" --fpermissive-atx-headers\n"
" Allow ATX headers without delimiting space\n"
" --fpermissive-url-autolinks\n"
" Allow URL autolinks without '<', '>'\n"
" --fpermissive-www-autolinks\n"
" Allow WWW autolinks without any scheme (e.g. 'www.example.com')\n"
" --fpermissive-email-autolinks\n"
" Allow e-mail autolinks without '<', '>' and 'mailto:'\n"
" --fpermissive-autolinks\n"
" Same as --fpermissive-url-autolinks --fpermissive-www-autolinks\n"
" --fpermissive-email-autolinks\n"
" --fstrikethrough Enable strike-through spans\n"
" --ftables Enable tables\n"
" --ftasklists Enable task lists\n"
" --funderline Enable underline spans\n"
" --fwiki-links Enable wiki links\n"
"\n"
"Markdown suppression options:\n"
" --fno-html-blocks\n"
" Disable raw HTML blocks\n"
" --fno-html-spans\n"
" Disable raw HTML spans\n"
" --fno-html Same as --fno-html-blocks --fno-html-spans\n"
" --fno-indented-code\n"
" Disable indented code blocks\n"
"\n"
"HTML generator options:\n"
" --fverbatim-entities\n"
" Do not translate entities\n"
"\n"
);
}
static void
version(void)
{
printf("%d.%d.%d\n", MD_VERSION_MAJOR, MD_VERSION_MINOR, MD_VERSION_RELEASE);
}
static const char* input_path = NULL;
static const char* output_path = NULL;
static int
cmdline_callback(int opt, char const* value, void* data)
{
switch(opt) {
case 0:
if(input_path) {
fprintf(stderr, "Too many arguments. Only one input file can be specified.\n");
fprintf(stderr, "Use --help for more info.\n");
exit(1);
}
input_path = value;
break;
case 'o': output_path = value; break;
case 'f': want_fullhtml = 1; break;
case 's': want_stat = 1; break;
case 'h': usage(); exit(0); break;
case 'v': version(); exit(0); break;
case 'c': parser_flags = MD_DIALECT_COMMONMARK; break;
case 'g': parser_flags = MD_DIALECT_GITHUB; break;
case 'E': renderer_flags |= MD_RENDER_FLAG_VERBATIM_ENTITIES; break;
case 'A': parser_flags |= MD_FLAG_PERMISSIVEATXHEADERS; break;
case 'I': parser_flags |= MD_FLAG_NOINDENTEDCODEBLOCKS; break;
case 'F': parser_flags |= MD_FLAG_NOHTMLBLOCKS; break;
case 'G': parser_flags |= MD_FLAG_NOHTMLSPANS; break;
case 'H': parser_flags |= MD_FLAG_NOHTML; break;
case 'W': parser_flags |= MD_FLAG_COLLAPSEWHITESPACE; break;
case 'U': parser_flags |= MD_FLAG_PERMISSIVEURLAUTOLINKS; break;
case '.': parser_flags |= MD_FLAG_PERMISSIVEWWWAUTOLINKS; break;
case '@': parser_flags |= MD_FLAG_PERMISSIVEEMAILAUTOLINKS; break;
case 'V': parser_flags |= MD_FLAG_PERMISSIVEAUTOLINKS; break;
case 'T': parser_flags |= MD_FLAG_TABLES; break;
case 'S': parser_flags |= MD_FLAG_STRIKETHROUGH; break;
case 'L': parser_flags |= MD_FLAG_LATEXMATHSPANS; break;
case 'K': parser_flags |= MD_FLAG_WIKILINKS; break;
case 'X': parser_flags |= MD_FLAG_TASKLISTS; break;
case '_': parser_flags |= MD_FLAG_UNDERLINE; break;
default:
fprintf(stderr, "Illegal option: %s\n", value);
fprintf(stderr, "Use --help for more info.\n");
exit(1);
break;
}
return 0;
}
int
main(int argc, char** argv)
{
FILE* in = stdin;
FILE* out = stdout;
int ret = 0;
if(readoptions(cmdline_options, argc, argv, cmdline_callback, NULL) < 0) {
usage();
exit(1);
}
if(input_path != NULL && strcmp(input_path, "-") != 0) {
in = fopen(input_path, "rb");
if(in == NULL) {
fprintf(stderr, "Cannot open %s.\n", input_path);
exit(1);
}
}
if(output_path != NULL && strcmp(output_path, "-") != 0) {
out = fopen(output_path, "wt");
if(out == NULL) {
fprintf(stderr, "Cannot open %s.\n", output_path);
exit(1);
}
}
ret = process_file(in, out);
if(in != stdin)
fclose(in);
if(out != stdout)
fclose(out);
return ret;
}

561
md2html/render_html.c Normal file

@ -0,0 +1,561 @@
/*
* MD4C: Markdown parser for C
* (http://github.com/mity/md4c)
*
* Copyright (c) 2016-2019 Martin Mitas
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#include <stdio.h>
#include <string.h>
#include "render_html.h"
#include "entity.h"
#if !defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L
/* C89/90 or old compilers in general may not understand "inline". */
#if defined __GNUC__
#define inline __inline__
#elif defined _MSC_VER
#define inline __inline
#else
#define inline
#endif
#endif
#ifdef _WIN32
#define snprintf _snprintf
#endif
typedef struct MD_RENDER_HTML_tag MD_RENDER_HTML;
struct MD_RENDER_HTML_tag {
void (*process_output)(const MD_CHAR*, MD_SIZE, void*);
void* userdata;
unsigned flags;
int image_nesting_level;
char escape_map[256];
};
#define NEED_HTML_ESC_FLAG 0x1
#define NEED_URL_ESC_FLAG 0x2
/*****************************************
*** HTML rendering helper functions ***
*****************************************/
#define ISDIGIT(ch) ('0' <= (ch) && (ch) <= '9')
#define ISLOWER(ch) ('a' <= (ch) && (ch) <= 'z')
#define ISUPPER(ch) ('A' <= (ch) && (ch) <= 'Z')
#define ISALNUM(ch) (ISLOWER(ch) || ISUPPER(ch) || ISDIGIT(ch))
static inline void
render_verbatim(MD_RENDER_HTML* r, const MD_CHAR* text, MD_SIZE size)
{
r->process_output(text, size, r->userdata);
}
/* Keep this as a macro. Most compilers should then be smart enough to replace
* the strlen() call with a compile-time constant if the string is a C literal. */
#define RENDER_VERBATIM(r, verbatim) \
render_verbatim((r), (verbatim), (MD_SIZE) (strlen(verbatim)))
static void
render_html_escaped(MD_RENDER_HTML* r, const MD_CHAR* data, MD_SIZE size)
{
MD_OFFSET beg = 0;
MD_OFFSET off = 0;
/* Some characters need to be escaped in normal HTML text. */
#define NEED_HTML_ESC(ch) (r->escape_map[(unsigned char)(ch)] & NEED_HTML_ESC_FLAG)
while(1) {
/* Optimization: Use some loop unrolling. */
while(off + 3 < size && !NEED_HTML_ESC(data[off+0]) && !NEED_HTML_ESC(data[off+1])
&& !NEED_HTML_ESC(data[off+2]) && !NEED_HTML_ESC(data[off+3]))
off += 4;
while(off < size && !NEED_HTML_ESC(data[off]))
off++;
if(off > beg)
render_verbatim(r, data + beg, off - beg);
if(off < size) {
switch(data[off]) {
case '&': RENDER_VERBATIM(r, "&amp;"); break;
case '<': RENDER_VERBATIM(r, "&lt;"); break;
case '>': RENDER_VERBATIM(r, "&gt;"); break;
case '"': RENDER_VERBATIM(r, "&quot;"); break;
}
off++;
} else {
break;
}
beg = off;
}
}
static void
render_url_escaped(MD_RENDER_HTML* r, const MD_CHAR* data, MD_SIZE size)
{
static const MD_CHAR hex_chars[] = "0123456789ABCDEF";
MD_OFFSET beg = 0;
MD_OFFSET off = 0;
/* Some characters need to be escaped in URL attributes. */
#define NEED_URL_ESC(ch) (r->escape_map[(unsigned char)(ch)] & NEED_URL_ESC_FLAG)
while(1) {
while(off < size && !NEED_URL_ESC(data[off]))
off++;
if(off > beg)
render_verbatim(r, data + beg, off - beg);
if(off < size) {
char hex[3];
switch(data[off]) {
case '&': RENDER_VERBATIM(r, "&amp;"); break;
default:
hex[0] = '%';
hex[1] = hex_chars[((unsigned)data[off] >> 4) & 0xf];
hex[2] = hex_chars[((unsigned)data[off] >> 0) & 0xf];
render_verbatim(r, hex, 3);
break;
}
off++;
} else {
break;
}
beg = off;
}
}
static unsigned
hex_val(char ch)
{
if('0' <= ch && ch <= '9')
return ch - '0';
if('A' <= ch && ch <= 'Z')
return ch - 'A' + 10;
else
return ch - 'a' + 10;
}
static void
render_utf8_codepoint(MD_RENDER_HTML* r, unsigned codepoint,
void (*fn_append)(MD_RENDER_HTML*, const MD_CHAR*, MD_SIZE))
{
static const MD_CHAR utf8_replacement_char[] = { 0xef, 0xbf, 0xbd };
unsigned char utf8[4];
size_t n;
if(codepoint <= 0x7f) {
n = 1;
utf8[0] = codepoint;
} else if(codepoint <= 0x7ff) {
n = 2;
utf8[0] = 0xc0 | ((codepoint >> 6) & 0x1f);
utf8[1] = 0x80 + ((codepoint >> 0) & 0x3f);
} else if(codepoint <= 0xffff) {
n = 3;
utf8[0] = 0xe0 | ((codepoint >> 12) & 0xf);
utf8[1] = 0x80 + ((codepoint >> 6) & 0x3f);
utf8[2] = 0x80 + ((codepoint >> 0) & 0x3f);
} else {
n = 4;
utf8[0] = 0xf0 | ((codepoint >> 18) & 0x7);
utf8[1] = 0x80 + ((codepoint >> 12) & 0x3f);
utf8[2] = 0x80 + ((codepoint >> 6) & 0x3f);
utf8[3] = 0x80 + ((codepoint >> 0) & 0x3f);
}
if(0 < codepoint && codepoint <= 0x10ffff)
fn_append(r, (char*)utf8, n);
else
fn_append(r, utf8_replacement_char, 3);
}
/* Translate entity to its UTF-8 equivalent, or output the verbatim one
* if such entity is unknown (or if the translation is disabled). */
static void
render_entity(MD_RENDER_HTML* r, const MD_CHAR* text, MD_SIZE size,
void (*fn_append)(MD_RENDER_HTML*, const MD_CHAR*, MD_SIZE))
{
if(r->flags & MD_RENDER_FLAG_VERBATIM_ENTITIES) {
fn_append(r, text, size);
return;
}
/* We assume UTF-8 output is what is desired. */
if(size > 3 && text[1] == '#') {
unsigned codepoint = 0;
if(text[2] == 'x' || text[2] == 'X') {
/* Hexadecimal entity (e.g. "&#x1234abcd;"). */
MD_SIZE i;
for(i = 3; i < size-1; i++)
codepoint = 16 * codepoint + hex_val(text[i]);
} else {
/* Decimal entity (e.g. "&#1234;"). */
MD_SIZE i;
for(i = 2; i < size-1; i++)
codepoint = 10 * codepoint + (text[i] - '0');
}
render_utf8_codepoint(r, codepoint, fn_append);
return;
} else {
/* Named entity (e.g. "&nbsp;"). */
const struct entity* ent;
ent = entity_lookup(text, size);
if(ent != NULL) {
render_utf8_codepoint(r, ent->codepoints[0], fn_append);
if(ent->codepoints[1])
render_utf8_codepoint(r, ent->codepoints[1], fn_append);
return;
}
}
fn_append(r, text, size);
}
static void
render_attribute(MD_RENDER_HTML* r, const MD_ATTRIBUTE* attr,
void (*fn_append)(MD_RENDER_HTML*, const MD_CHAR*, MD_SIZE))
{
int i;
for(i = 0; attr->substr_offsets[i] < attr->size; i++) {
MD_TEXTTYPE type = attr->substr_types[i];
MD_OFFSET off = attr->substr_offsets[i];
MD_SIZE size = attr->substr_offsets[i+1] - off;
const MD_CHAR* text = attr->text + off;
switch(type) {
case MD_TEXT_NULLCHAR: render_utf8_codepoint(r, 0x0000, render_verbatim); break;
case MD_TEXT_ENTITY: render_entity(r, text, size, fn_append); break;
default: fn_append(r, text, size); break;
}
}
}
static void
render_open_ol_block(MD_RENDER_HTML* r, const MD_BLOCK_OL_DETAIL* det)
{
char buf[64];
if(det->start == 1) {
RENDER_VERBATIM(r, "<ol>\n");
return;
}
snprintf(buf, sizeof(buf), "<ol start=\"%u\">\n", det->start);
RENDER_VERBATIM(r, buf);
}
static void
render_open_li_block(MD_RENDER_HTML* r, const MD_BLOCK_LI_DETAIL* det)
{
if(det->is_task) {
RENDER_VERBATIM(r, "<li class=\"task-list-item\">"
"<input type=\"checkbox\" class=\"task-list-item-checkbox\" disabled");
if(det->task_mark == 'x' || det->task_mark == 'X')
RENDER_VERBATIM(r, " checked");
RENDER_VERBATIM(r, ">");
} else {
RENDER_VERBATIM(r, "<li>");
}
}
static void
render_open_code_block(MD_RENDER_HTML* r, const MD_BLOCK_CODE_DETAIL* det)
{
RENDER_VERBATIM(r, "<pre><code");
/* If known, output the HTML 5 attribute class="language-LANGNAME". */
if(det->lang.text != NULL) {
RENDER_VERBATIM(r, " class=\"language-");
render_attribute(r, &det->lang, render_html_escaped);
RENDER_VERBATIM(r, "\"");
}
RENDER_VERBATIM(r, ">");
}
static void
render_open_td_block(MD_RENDER_HTML* r, const MD_CHAR* cell_type, const MD_BLOCK_TD_DETAIL* det)
{
RENDER_VERBATIM(r, "<");
RENDER_VERBATIM(r, cell_type);
switch(det->align) {
case MD_ALIGN_LEFT: RENDER_VERBATIM(r, " align=\"left\">"); break;
case MD_ALIGN_CENTER: RENDER_VERBATIM(r, " align=\"center\">"); break;
case MD_ALIGN_RIGHT: RENDER_VERBATIM(r, " align=\"right\">"); break;
default: RENDER_VERBATIM(r, ">"); break;
}
}
static void
render_open_a_span(MD_RENDER_HTML* r, const MD_SPAN_A_DETAIL* det)
{
RENDER_VERBATIM(r, "<a href=\"");
render_attribute(r, &det->href, render_url_escaped);
if(det->title.text != NULL) {
RENDER_VERBATIM(r, "\" title=\"");
render_attribute(r, &det->title, render_html_escaped);
}
RENDER_VERBATIM(r, "\">");
}
static void
render_open_img_span(MD_RENDER_HTML* r, const MD_SPAN_IMG_DETAIL* det)
{
RENDER_VERBATIM(r, "<img src=\"");
render_attribute(r, &det->src, render_url_escaped);
RENDER_VERBATIM(r, "\" alt=\"");
r->image_nesting_level++;
}
static void
render_close_img_span(MD_RENDER_HTML* r, const MD_SPAN_IMG_DETAIL* det)
{
if(det->title.text != NULL) {
RENDER_VERBATIM(r, "\" title=\"");
render_attribute(r, &det->title, render_html_escaped);
}
RENDER_VERBATIM(r, "\">");
r->image_nesting_level--;
}
static void
render_open_wikilink_span(MD_RENDER_HTML* r, const MD_SPAN_WIKILINK_DETAIL* det)
{
RENDER_VERBATIM(r, "<x-wikilink data-target=\"");
render_attribute(r, &det->target, render_html_escaped);
RENDER_VERBATIM(r, "\">");
}
/**************************************
*** HTML renderer implementation ***
**************************************/
static int
enter_block_callback(MD_BLOCKTYPE type, void* detail, void* userdata)
{
static const MD_CHAR* head[6] = { "<h1>", "<h2>", "<h3>", "<h4>", "<h5>", "<h6>" };
MD_RENDER_HTML* r = (MD_RENDER_HTML*) userdata;
switch(type) {
case MD_BLOCK_DOC: /* noop */ break;
case MD_BLOCK_QUOTE: RENDER_VERBATIM(r, "<blockquote>\n"); break;
case MD_BLOCK_UL: RENDER_VERBATIM(r, "<ul>\n"); break;
case MD_BLOCK_OL: render_open_ol_block(r, (const MD_BLOCK_OL_DETAIL*)detail); break;
case MD_BLOCK_LI: render_open_li_block(r, (const MD_BLOCK_LI_DETAIL*)detail); break;
case MD_BLOCK_HR: RENDER_VERBATIM(r, "<hr>\n"); break;
case MD_BLOCK_H: RENDER_VERBATIM(r, head[((MD_BLOCK_H_DETAIL*)detail)->level - 1]); break;
case MD_BLOCK_CODE: render_open_code_block(r, (const MD_BLOCK_CODE_DETAIL*) detail); break;
case MD_BLOCK_HTML: /* noop */ break;
case MD_BLOCK_P: RENDER_VERBATIM(r, "<p>"); break;
case MD_BLOCK_TABLE: RENDER_VERBATIM(r, "<table>\n"); break;
case MD_BLOCK_THEAD: RENDER_VERBATIM(r, "<thead>\n"); break;
case MD_BLOCK_TBODY: RENDER_VERBATIM(r, "<tbody>\n"); break;
case MD_BLOCK_TR: RENDER_VERBATIM(r, "<tr>\n"); break;
case MD_BLOCK_TH: render_open_td_block(r, "th", (MD_BLOCK_TD_DETAIL*)detail); break;
case MD_BLOCK_TD: render_open_td_block(r, "td", (MD_BLOCK_TD_DETAIL*)detail); break;
}
return 0;
}
static int
leave_block_callback(MD_BLOCKTYPE type, void* detail, void* userdata)
{
static const MD_CHAR* head[6] = { "</h1>\n", "</h2>\n", "</h3>\n", "</h4>\n", "</h5>\n", "</h6>\n" };
MD_RENDER_HTML* r = (MD_RENDER_HTML*) userdata;
switch(type) {
case MD_BLOCK_DOC: /*noop*/ break;
case MD_BLOCK_QUOTE: RENDER_VERBATIM(r, "</blockquote>\n"); break;
case MD_BLOCK_UL: RENDER_VERBATIM(r, "</ul>\n"); break;
case MD_BLOCK_OL: RENDER_VERBATIM(r, "</ol>\n"); break;
case MD_BLOCK_LI: RENDER_VERBATIM(r, "</li>\n"); break;
case MD_BLOCK_HR: /*noop*/ break;
case MD_BLOCK_H: RENDER_VERBATIM(r, head[((MD_BLOCK_H_DETAIL*)detail)->level - 1]); break;
case MD_BLOCK_CODE: RENDER_VERBATIM(r, "</code></pre>\n"); break;
case MD_BLOCK_HTML: /* noop */ break;
case MD_BLOCK_P: RENDER_VERBATIM(r, "</p>\n"); break;
case MD_BLOCK_TABLE: RENDER_VERBATIM(r, "</table>\n"); break;
case MD_BLOCK_THEAD: RENDER_VERBATIM(r, "</thead>\n"); break;
case MD_BLOCK_TBODY: RENDER_VERBATIM(r, "</tbody>\n"); break;
case MD_BLOCK_TR: RENDER_VERBATIM(r, "</tr>\n"); break;
case MD_BLOCK_TH: RENDER_VERBATIM(r, "</th>\n"); break;
case MD_BLOCK_TD: RENDER_VERBATIM(r, "</td>\n"); break;
}
return 0;
}
static int
enter_span_callback(MD_SPANTYPE type, void* detail, void* userdata)
{
MD_RENDER_HTML* r = (MD_RENDER_HTML*) userdata;
if(r->image_nesting_level > 0) {
/* We are inside a Markdown image label. Markdown allows emphasis and
* other rich content there, just as in any link label.
*
* However, unlike the case of links (where that content becomes the
* content of the <a>...</a> tag), the content of an image label is
* supposed to go into the alt attribute: <img alt="...">.
*
* We naturally cannot output nested HTML tags in that context, so we
* suppress them and output only the plain text (i.e. what arrives via
* the text() callback).
*
* This make-it-plain-text approach is the practice recommended by the
* CommonMark specification (for HTML output).
*/
return 0;
}
switch(type) {
case MD_SPAN_EM: RENDER_VERBATIM(r, "<em>"); break;
case MD_SPAN_STRONG: RENDER_VERBATIM(r, "<strong>"); break;
case MD_SPAN_U: RENDER_VERBATIM(r, "<u>"); break;
case MD_SPAN_A: render_open_a_span(r, (MD_SPAN_A_DETAIL*) detail); break;
case MD_SPAN_IMG: render_open_img_span(r, (MD_SPAN_IMG_DETAIL*) detail); break;
case MD_SPAN_CODE: RENDER_VERBATIM(r, "<code>"); break;
case MD_SPAN_DEL: RENDER_VERBATIM(r, "<del>"); break;
case MD_SPAN_LATEXMATH: RENDER_VERBATIM(r, "<x-equation>"); break;
case MD_SPAN_LATEXMATH_DISPLAY: RENDER_VERBATIM(r, "<x-equation type=\"display\">"); break;
case MD_SPAN_WIKILINK: render_open_wikilink_span(r, (MD_SPAN_WIKILINK_DETAIL*) detail); break;
}
return 0;
}
static int
leave_span_callback(MD_SPANTYPE type, void* detail, void* userdata)
{
MD_RENDER_HTML* r = (MD_RENDER_HTML*) userdata;
if(r->image_nesting_level > 0) {
/* Ditto as in enter_span_callback(), except we have to allow the
* end of the <img> tag. */
if(r->image_nesting_level == 1 && type == MD_SPAN_IMG)
render_close_img_span(r, (MD_SPAN_IMG_DETAIL*) detail);
return 0;
}
switch(type) {
case MD_SPAN_EM: RENDER_VERBATIM(r, "</em>"); break;
case MD_SPAN_STRONG: RENDER_VERBATIM(r, "</strong>"); break;
case MD_SPAN_U: RENDER_VERBATIM(r, "</u>"); break;
case MD_SPAN_A: RENDER_VERBATIM(r, "</a>"); break;
case MD_SPAN_IMG: /*noop, handled above*/ break;
case MD_SPAN_CODE: RENDER_VERBATIM(r, "</code>"); break;
case MD_SPAN_DEL: RENDER_VERBATIM(r, "</del>"); break;
case MD_SPAN_LATEXMATH: /*fall through*/
case MD_SPAN_LATEXMATH_DISPLAY: RENDER_VERBATIM(r, "</x-equation>"); break;
case MD_SPAN_WIKILINK: RENDER_VERBATIM(r, "</x-wikilink>"); break;
}
return 0;
}
static int
text_callback(MD_TEXTTYPE type, const MD_CHAR* text, MD_SIZE size, void* userdata)
{
MD_RENDER_HTML* r = (MD_RENDER_HTML*) userdata;
switch(type) {
case MD_TEXT_NULLCHAR: render_utf8_codepoint(r, 0x0000, render_verbatim); break;
case MD_TEXT_BR: RENDER_VERBATIM(r, (r->image_nesting_level == 0 ? "<br>\n" : " ")); break;
case MD_TEXT_SOFTBR: RENDER_VERBATIM(r, (r->image_nesting_level == 0 ? "\n" : " ")); break;
case MD_TEXT_HTML: render_verbatim(r, text, size); break;
case MD_TEXT_ENTITY: render_entity(r, text, size, render_html_escaped); break;
default: render_html_escaped(r, text, size); break;
}
return 0;
}
static void
debug_log_callback(const char* msg, void* userdata)
{
MD_RENDER_HTML* r = (MD_RENDER_HTML*) userdata;
if(r->flags & MD_RENDER_FLAG_DEBUG)
fprintf(stderr, "MD4C: %s\n", msg);
}
int
md_render_html(const MD_CHAR* input, MD_SIZE input_size,
void (*process_output)(const MD_CHAR*, MD_SIZE, void*),
void* userdata, unsigned parser_flags, unsigned renderer_flags)
{
MD_RENDER_HTML render = { process_output, userdata, renderer_flags, 0, { 0 } };
int i;
MD_PARSER parser = {
0,
parser_flags,
enter_block_callback,
leave_block_callback,
enter_span_callback,
leave_span_callback,
text_callback,
debug_log_callback,
NULL
};
/* Build map of characters which need escaping. */
for(i = 0; i < 256; i++) {
unsigned char ch = (unsigned char) i;
if(strchr("\"&<>", ch) != NULL)
render.escape_map[i] |= NEED_HTML_ESC_FLAG;
if(!ISALNUM(ch) && strchr("-_.+!*(),%#@?=;:/,+$", ch) == NULL)
render.escape_map[i] |= NEED_URL_ESC_FLAG;
}
return md_parse(input, input_size, &parser, (void*) &render);
}
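
The escape-map loop at the end of md_render_html() can be mirrored outside C to see which bytes end up flagged. The Python sketch below is an illustration only. The numeric flag values and the ASCII-only reading of ISALNUM are assumptions (their definitions are not part of this excerpt), and the helper names are hypothetical.

```python
# Assumed values; the real constants are defined outside this excerpt.
NEED_HTML_ESC_FLAG = 0x1
NEED_URL_ESC_FLAG = 0x2

HTML_SPECIAL = b'"&<>'
URL_SAFE_PUNCT = b"-_.+!*(),%#@?=;:/,+$"


def is_ascii_alnum(ch):
    """Assumed behaviour of ISALNUM: ASCII digits and letters only."""
    return 0x30 <= ch <= 0x39 or 0x41 <= ch <= 0x5A or 0x61 <= ch <= 0x7A


def c_strchr_found(chars, ch):
    """Emulate `strchr(chars, ch) != NULL`; C's strchr() also matches the
    terminating NUL, so byte 0 always counts as found."""
    return ch == 0 or ch in chars


def build_escape_map():
    """Return a 256-entry list of flag bitmasks, one per byte value."""
    escape_map = [0] * 256
    for i in range(256):
        if c_strchr_found(HTML_SPECIAL, i):
            escape_map[i] |= NEED_HTML_ESC_FLAG
        if not is_ascii_alnum(i) and not c_strchr_found(URL_SAFE_PUNCT, i):
            escape_map[i] |= NEED_URL_ESC_FLAG
    return escape_map
```

Under these assumptions a letter such as `a` carries no flag, a space carries only the URL flag, and `&` carries both. Byte 0 is the odd case: because strchr() matches the string terminator, it is flagged for HTML escaping but not for URL escaping.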

66
md2html/render_html.h Normal file

@ -0,0 +1,66 @@
/*
* MD4C: Markdown parser for C
* (http://github.com/mity/md4c)
*
* Copyright (c) 2016-2017 Martin Mitas
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#ifndef MD4C_RENDER_HTML_H
#define MD4C_RENDER_HTML_H
#include "md4c.h"
#ifdef __cplusplus
extern "C" {
#endif
/* If set, debug output from md_parse() is sent to stderr. */
#define MD_RENDER_FLAG_DEBUG 0x0001
#define MD_RENDER_FLAG_VERBATIM_ENTITIES 0x0002
/* Render Markdown into HTML.
*
* Note that only the contents of the <body> tag are generated. The caller
* must generate the HTML header/footer manually before/after calling
* md_render_html().
*
* Params input and input_size specify the Markdown input.
* Callback process_output() gets called with chunks of HTML output.
* (A typical implementation may just write the bytes to a file or append
* them to some buffer.)
* Param userdata is just propagated back to the process_output() callback.
* Param parser_flags are flags from md4c.h propagated to md_parse().
* Param renderer_flags is a bitmask of MD_RENDER_FLAG_xxxx.
*
* Returns -1 on error (if md_parse() fails).
* Returns 0 on success.
*/
int md_render_html(const MD_CHAR* input, MD_SIZE input_size,
void (*process_output)(const MD_CHAR*, MD_SIZE, void*),
void* userdata, unsigned parser_flags, unsigned renderer_flags);
#ifdef __cplusplus
} /* extern "C" { */
#endif
#endif /* MD4C_RENDER_HTML_H */
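
The process_output() contract described above (chunks of output that are not zero-terminated, collected by the caller) can be sketched with Python's ctypes, in the same spirit as test/cmark.py loads libcmark. This is a sketch under stated assumptions, not something shipped in this commit: it assumes the default build where MD_CHAR is char, and it assumes you have compiled render_html.c into a shared library of your own, since the CMake file in this commit only puts md4c.c into the md4c library. The names `make_collector` and `render_html` are hypothetical.

```python
import ctypes

# Flag values copied from md4c.h in this commit.
MD_FLAG_PERMISSIVEURLAUTOLINKS = 0x0004
MD_FLAG_PERMISSIVEEMAILAUTOLINKS = 0x0008
MD_FLAG_TABLES = 0x0100
MD_FLAG_STRIKETHROUGH = 0x0200
MD_FLAG_PERMISSIVEWWWAUTOLINKS = 0x0400
MD_FLAG_TASKLISTS = 0x0800
MD_DIALECT_GITHUB = (MD_FLAG_PERMISSIVEURLAUTOLINKS
                     | MD_FLAG_PERMISSIVEEMAILAUTOLINKS
                     | MD_FLAG_PERMISSIVEWWWAUTOLINKS
                     | MD_FLAG_TABLES | MD_FLAG_STRIKETHROUGH
                     | MD_FLAG_TASKLISTS)

# void (*process_output)(const MD_CHAR*, MD_SIZE, void*)
PROCESS_OUTPUT = ctypes.CFUNCTYPE(None, ctypes.POINTER(ctypes.c_char),
                                  ctypes.c_uint, ctypes.c_void_p)


def make_collector():
    """Return (chunks, callback); the callback appends each chunk to chunks."""
    chunks = []

    def on_output(data, size, userdata):
        # Chunks are not zero-terminated, so copy exactly `size` bytes.
        chunks.append(ctypes.string_at(data, size))

    return chunks, PROCESS_OUTPUT(on_output)


def render_html(lib, markdown_text, parser_flags=MD_DIALECT_GITHUB,
                renderer_flags=0):
    """`lib` is a ctypes.CDLL handle exporting md_render_html (hypothetical)."""
    fn = lib.md_render_html
    fn.restype = ctypes.c_int
    fn.argtypes = [ctypes.c_char_p, ctypes.c_uint, PROCESS_OUTPUT,
                   ctypes.c_void_p, ctypes.c_uint, ctypes.c_uint]
    data = markdown_text.encode("utf-8")
    chunks, callback = make_collector()
    if fn(data, len(data), callback, None, parser_flags, renderer_flags) != 0:
        raise RuntimeError("md_render_html() reported an error")
    return b"".join(chunks).decode("utf-8")
```

The collector keeps a reference to the callback object for as long as the call runs, which matters with ctypes because a garbage-collected callback would leave the C side with a dangling function pointer.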

32
md4c/CMakeLists.txt Normal file

@ -0,0 +1,32 @@
# Be sure to export all symbols in Windows.
set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS 1)
set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -DDEBUG")
set(md4c_src
md4c.c
)
add_library(md4c ${md4c_src})
set_target_properties(md4c PROPERTIES
VERSION ${MD_VERSION}
SOVERSION ${MD_VERSION_MAJOR}
PUBLIC_HEADER md4c.h
)
install(
TARGETS md4c
EXPORT md4cConfig
ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
# Create a pkg-config file
configure_file(md4c.pc.in md4c.pc @ONLY)
install(FILES ${CMAKE_BINARY_DIR}/md4c/md4c.pc DESTINATION ${CMAKE_INSTALL_LIBDIR}/pkgconfig)
# And a CMake file
install(EXPORT md4cConfig DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/md4c/)

6309
md4c/md4c.c Normal file

File diff suppressed because it is too large

388
md4c/md4c.h Normal file

@ -0,0 +1,388 @@
/*
* MD4C: Markdown parser for C
* (http://github.com/mity/md4c)
*
* Copyright (c) 2016-2020 Martin Mitas
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#ifndef MD4C_MARKDOWN_H
#define MD4C_MARKDOWN_H
#ifdef __cplusplus
extern "C" {
#endif
#if defined MD4C_USE_UTF16
/* Magic to support UTF-16. Note that in order to use it, you have to define
* the macro MD4C_USE_UTF16 both when building MD4C and when
* including this header in your code. */
#ifdef _WIN32
#include <windows.h>
typedef WCHAR MD_CHAR;
#else
#error MD4C_USE_UTF16 is only supported on Windows.
#endif
#else
typedef char MD_CHAR;
#endif
typedef unsigned MD_SIZE;
typedef unsigned MD_OFFSET;
/* Block represents a part of document hierarchy structure like a paragraph
* or list item.
*/
typedef enum MD_BLOCKTYPE {
/* <body>...</body> */
MD_BLOCK_DOC = 0,
/* <blockquote>...</blockquote> */
MD_BLOCK_QUOTE,
/* <ul>...</ul>
* Detail: Structure MD_BLOCK_UL_DETAIL. */
MD_BLOCK_UL,
/* <ol>...</ol>
* Detail: Structure MD_BLOCK_OL_DETAIL. */
MD_BLOCK_OL,
/* <li>...</li>
* Detail: Structure MD_BLOCK_LI_DETAIL. */
MD_BLOCK_LI,
/* <hr> */
MD_BLOCK_HR,
/* <h1>...</h1> (for levels up to 6)
* Detail: Structure MD_BLOCK_H_DETAIL. */
MD_BLOCK_H,
/* <pre><code>...</code></pre>
* Note the text lines within code blocks are terminated with '\n'
* instead of explicit MD_TEXT_BR. */
MD_BLOCK_CODE,
/* Raw HTML block. This itself does not correspond to any particular HTML
* tag. The contents of it _is_ raw HTML source intended to be put
* in verbatim form to the HTML output. */
MD_BLOCK_HTML,
/* <p>...</p> */
MD_BLOCK_P,
/* <table>...</table> and its contents.
* Detail: Structure MD_BLOCK_TD_DETAIL (used with MD_BLOCK_TH and MD_BLOCK_TD)
* Note all of these are used only if extension MD_FLAG_TABLES is enabled. */
MD_BLOCK_TABLE,
MD_BLOCK_THEAD,
MD_BLOCK_TBODY,
MD_BLOCK_TR,
MD_BLOCK_TH,
MD_BLOCK_TD
} MD_BLOCKTYPE;
/* Span represents an in-line piece of a document which should be rendered with
* the same font, color and other attributes. A sequence of spans forms a block
* like paragraph or list item. */
typedef enum MD_SPANTYPE {
/* <em>...</em> */
MD_SPAN_EM,
/* <strong>...</strong> */
MD_SPAN_STRONG,
/* <a href="xxx">...</a>
* Detail: Structure MD_SPAN_A_DETAIL. */
MD_SPAN_A,
/* <img src="xxx" alt="...">
* Detail: Structure MD_SPAN_IMG_DETAIL.
* Note: Image text can contain nested spans and even nested images.
* If rendered into the ALT attribute of the HTML <IMG> tag, it is the
* responsibility of the renderer to deal with it.
*/
MD_SPAN_IMG,
/* <code>...</code> */
MD_SPAN_CODE,
/* <del>...</del>
* Note: Recognized only when MD_FLAG_STRIKETHROUGH is enabled.
*/
MD_SPAN_DEL,
/* For recognizing inline ($) and display ($$) equations
* Note: Recognized only when MD_FLAG_LATEXMATHSPANS is enabled.
*/
MD_SPAN_LATEXMATH,
MD_SPAN_LATEXMATH_DISPLAY,
/* Wiki links
* Note: Recognized only when MD_FLAG_WIKILINKS is enabled.
*/
MD_SPAN_WIKILINK,
/* <u>...</u>
* Note: Recognized only when MD_FLAG_UNDERLINE is enabled. */
MD_SPAN_U
} MD_SPANTYPE;
/* Text is the actual textual contents of span. */
typedef enum MD_TEXTTYPE {
/* Normal text. */
MD_TEXT_NORMAL = 0,
/* NULL character. CommonMark requires replacing NULL character with
* the replacement char U+FFFD, so this allows caller to do that easily. */
MD_TEXT_NULLCHAR,
/* Line breaks.
* Note these are not sent from blocks with verbatim output (MD_BLOCK_CODE
* or MD_BLOCK_HTML). In such cases, '\n' is part of the text itself. */
MD_TEXT_BR, /* <br> (hard break) */
MD_TEXT_SOFTBR, /* '\n' in source text where it is not semantically meaningful (soft break) */
/* Entity.
* (a) Named entity, e.g. &nbsp;
* (Note MD4C does not have a list of known entities.
* Anything matching the regexp /&[A-Za-z][A-Za-z0-9]{1,47};/ is
* treated as a named entity.)
* (b) Numerical entity, e.g. &#1234;
* (c) Hexadecimal entity, e.g. &#x12AB;
*
* As MD4C is mostly encoding agnostic, the application gets the verbatim
* entity text in the MD_PARSER::text() callback. */
MD_TEXT_ENTITY,
/* Text in a code block (inside MD_BLOCK_CODE) or inlined code (`code`).
* If it is inside MD_BLOCK_CODE, it includes spaces for indentation and
* '\n' for new lines. MD_TEXT_BR and MD_TEXT_SOFTBR are not sent for this
* kind of text. */
MD_TEXT_CODE,
/* Text is a raw HTML. If it is contents of a raw HTML block (i.e. not
* an inline raw HTML), then MD_TEXT_BR and MD_TEXT_SOFTBR are not used.
* The text contains verbatim '\n' for the new lines. */
MD_TEXT_HTML,
/* Text is inside an equation. This is processed the same way as inlined code
* spans (`code`). */
MD_TEXT_LATEXMATH
} MD_TEXTTYPE;
/* Alignment enumeration. */
typedef enum MD_ALIGN {
MD_ALIGN_DEFAULT = 0, /* When unspecified. */
MD_ALIGN_LEFT,
MD_ALIGN_CENTER,
MD_ALIGN_RIGHT
} MD_ALIGN;
/* String attribute.
*
* This wraps strings which are outside of a normal text flow and which are
* propagated within various detailed structures, but which still may contain
* string portions of different types like e.g. entities.
*
* So, for example, let's consider an image with a title attribute string
* set to "foo &quot; bar". (Note the string size is 14.)
*
* Then the attribute MD_SPAN_IMG_DETAIL::title shall provide the following:
* -- [0]: "foo " (substr_types[0] == MD_TEXT_NORMAL; substr_offsets[0] == 0)
* -- [1]: "&quot;" (substr_types[1] == MD_TEXT_ENTITY; substr_offsets[1] == 4)
* -- [2]: " bar" (substr_types[2] == MD_TEXT_NORMAL; substr_offsets[2] == 10)
* -- [3]: (n/a) (n/a ; substr_offsets[3] == 14)
*
* Note that these conditions are guaranteed:
* -- substr_offsets[0] == 0
* -- substr_offsets[LAST+1] == size
* -- Only MD_TEXT_NORMAL, MD_TEXT_ENTITY, MD_TEXT_NULLCHAR substrings can appear.
*/
typedef struct MD_ATTRIBUTE {
const MD_CHAR* text;
MD_SIZE size;
const MD_TEXTTYPE* substr_types;
const MD_OFFSET* substr_offsets;
} MD_ATTRIBUTE;
/* Detailed info for MD_BLOCK_UL. */
typedef struct MD_BLOCK_UL_DETAIL {
int is_tight; /* Non-zero if tight list, zero if loose. */
MD_CHAR mark; /* Item bullet character in MarkDown source of the list, e.g. '-', '+', '*'. */
} MD_BLOCK_UL_DETAIL;
/* Detailed info for MD_BLOCK_OL. */
typedef struct MD_BLOCK_OL_DETAIL {
unsigned start; /* Start index of the ordered list. */
int is_tight; /* Non-zero if tight list, zero if loose. */
MD_CHAR mark_delimiter; /* Character delimiting the item marks in MarkDown source, e.g. '.' or ')' */
} MD_BLOCK_OL_DETAIL;
/* Detailed info for MD_BLOCK_LI. */
typedef struct MD_BLOCK_LI_DETAIL {
int is_task; /* Can be non-zero only with MD_FLAG_TASKLISTS */
MD_CHAR task_mark; /* If is_task, then one of 'x', 'X' or ' '. Undefined otherwise. */
MD_OFFSET task_mark_offset; /* If is_task, then offset in the input of the char between '[' and ']'. */
} MD_BLOCK_LI_DETAIL;
/* Detailed info for MD_BLOCK_H. */
typedef struct MD_BLOCK_H_DETAIL {
unsigned level; /* Header level (1 - 6) */
} MD_BLOCK_H_DETAIL;
/* Detailed info for MD_BLOCK_CODE. */
typedef struct MD_BLOCK_CODE_DETAIL {
MD_ATTRIBUTE info;
MD_ATTRIBUTE lang;
MD_CHAR fence_char; /* The character used for fenced code block; or zero for indented code block. */
} MD_BLOCK_CODE_DETAIL;
/* Detailed info for MD_BLOCK_TH and MD_BLOCK_TD. */
typedef struct MD_BLOCK_TD_DETAIL {
MD_ALIGN align;
} MD_BLOCK_TD_DETAIL;
/* Detailed info for MD_SPAN_A. */
typedef struct MD_SPAN_A_DETAIL {
MD_ATTRIBUTE href;
MD_ATTRIBUTE title;
} MD_SPAN_A_DETAIL;
/* Detailed info for MD_SPAN_IMG. */
typedef struct MD_SPAN_IMG_DETAIL {
MD_ATTRIBUTE src;
MD_ATTRIBUTE title;
} MD_SPAN_IMG_DETAIL;
/* Detailed info for MD_SPAN_WIKILINK. */
typedef struct MD_SPAN_WIKILINK {
MD_ATTRIBUTE target;
} MD_SPAN_WIKILINK_DETAIL;
/* Flags specifying extensions/deviations from CommonMark specification.
*
* By default (when MD_PARSER::flags == 0), we follow CommonMark specification.
* The following flags may allow some extensions or deviations from it.
*/
#define MD_FLAG_COLLAPSEWHITESPACE 0x0001 /* In MD_TEXT_NORMAL, collapse non-trivial whitespace into single ' ' */
#define MD_FLAG_PERMISSIVEATXHEADERS 0x0002 /* Do not require space in ATX headers ( ###header ) */
#define MD_FLAG_PERMISSIVEURLAUTOLINKS 0x0004 /* Recognize URLs as autolinks even without '<', '>' */
#define MD_FLAG_PERMISSIVEEMAILAUTOLINKS 0x0008 /* Recognize e-mails as autolinks even without '<', '>' and 'mailto:' */
#define MD_FLAG_NOINDENTEDCODEBLOCKS 0x0010 /* Disable indented code blocks. (Only fenced code works.) */
#define MD_FLAG_NOHTMLBLOCKS 0x0020 /* Disable raw HTML blocks. */
#define MD_FLAG_NOHTMLSPANS 0x0040 /* Disable raw HTML (inline). */
#define MD_FLAG_TABLES 0x0100 /* Enable tables extension. */
#define MD_FLAG_STRIKETHROUGH 0x0200 /* Enable strikethrough extension. */
#define MD_FLAG_PERMISSIVEWWWAUTOLINKS 0x0400 /* Enable WWW autolinks (even without any scheme prefix, if they begin with 'www.') */
#define MD_FLAG_TASKLISTS 0x0800 /* Enable task list extension. */
#define MD_FLAG_LATEXMATHSPANS 0x1000 /* Enable $ and $$ containing LaTeX equations. */
#define MD_FLAG_WIKILINKS 0x2000 /* Enable wiki links extension. */
#define MD_FLAG_UNDERLINE 0x4000 /* Enable underline extension (and disables '_' for normal emphasis). */
#define MD_FLAG_PERMISSIVEAUTOLINKS (MD_FLAG_PERMISSIVEEMAILAUTOLINKS | MD_FLAG_PERMISSIVEURLAUTOLINKS | MD_FLAG_PERMISSIVEWWWAUTOLINKS)
#define MD_FLAG_NOHTML (MD_FLAG_NOHTMLBLOCKS | MD_FLAG_NOHTMLSPANS)
/* Convenient sets of flags corresponding to well-known Markdown dialects.
*
* Note we may only support subset of features of the referred dialect.
* The constant just enables those extensions which bring us as close as
* possible given what features we implement.
*
* ABI compatibility note: Meaning of these can change in time as new
* extensions, bringing the dialect closer to the original, are implemented.
*/
#define MD_DIALECT_COMMONMARK 0
#define MD_DIALECT_GITHUB (MD_FLAG_PERMISSIVEAUTOLINKS | MD_FLAG_TABLES | MD_FLAG_STRIKETHROUGH | MD_FLAG_TASKLISTS)
/* Parser structure.
*/
typedef struct MD_PARSER {
/* Reserved. Set to zero.
*/
unsigned abi_version;
/* Dialect options. Bitmask of MD_FLAG_xxxx values.
*/
unsigned flags;
/* Caller-provided rendering callbacks.
*
* For some block/span types, more detailed information is provided in a
* type-specific structure pointed by the argument 'detail'.
*
* The last argument of all callbacks, 'userdata', is just propagated from
* md_parse() and is available for any use by the application.
*
* Note that any strings provided to the callbacks as their arguments or as
* members of any detail structure are generally not zero-terminated.
* The application has to take the respective size information into account.
*
* Callbacks may abort further parsing of the document by returning non-zero.
*/
int (*enter_block)(MD_BLOCKTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*leave_block)(MD_BLOCKTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*enter_span)(MD_SPANTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*leave_span)(MD_SPANTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*text)(MD_TEXTTYPE /*type*/, const MD_CHAR* /*text*/, MD_SIZE /*size*/, void* /*userdata*/);
/* Debug callback. Optional (may be NULL).
*
* If provided and something goes wrong, this function gets called.
* This is intended for debugging and problem diagnosis for developers;
* it is not intended to provide any errors suitable for displaying to an
* end user.
*/
void (*debug_log)(const char* /*msg*/, void* /*userdata*/);
/* Reserved. Set to NULL.
*/
void (*syntax)(void);
} MD_PARSER;
/* For backward compatibility. Do not use in new code. */
typedef MD_PARSER MD_RENDERER;
/* Parse the Markdown document stored in the string 'text' of size 'size'.
* The parser structure provides callbacks to be called during the parsing
* so the caller can render the document on the screen or convert the
* Markdown to another format.
*
* Zero is returned on success. If a runtime error occurs (e.g. a memory
* allocation fails), -1 is returned. If the processing is aborted due to
* any callback returning non-zero, the return value of that callback is
* returned.
*/
int md_parse(const MD_CHAR* text, MD_SIZE size, const MD_PARSER* parser, void* userdata);
#ifdef __cplusplus
} /* extern "C" { */
#endif
#endif /* MD4C_MARKDOWN_H */
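
The MD_ATTRIBUTE layout documented in md4c.h (a text buffer plus parallel substr_types and substr_offsets arrays, with a final sentinel offset equal to size) can be illustrated with the header's own "foo &quot; bar" example. The Python sketch below only models the indexing; the function name `split_attribute` is hypothetical, and the enum values are taken from the declaration order of MD_TEXTTYPE in the header.

```python
# Values follow the declaration order of MD_TEXTTYPE in md4c.h.
MD_TEXT_NORMAL, MD_TEXT_NULLCHAR, MD_TEXT_BR, MD_TEXT_SOFTBR, MD_TEXT_ENTITY = range(5)


def split_attribute(text, substr_types, substr_offsets):
    """Split an attribute into (type, substring) pairs.

    There is no explicit substring count: iteration stops at the sentinel
    offset, which the header guarantees to be equal to the attribute size.
    """
    size = len(text)
    pieces = []
    i = 0
    while substr_offsets[i] < size:
        start, end = substr_offsets[i], substr_offsets[i + 1]
        pieces.append((substr_types[i], text[start:end]))
        i += 1
    return pieces


title = "foo &quot; bar"
pieces = split_attribute(
    title,
    [MD_TEXT_NORMAL, MD_TEXT_ENTITY, MD_TEXT_NORMAL],
    [0, 4, 10, 14],
)
```

A renderer written in C would walk the arrays the same way, escaping the MD_TEXT_NORMAL pieces and handling the MD_TEXT_ENTITY pieces separately.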

12
md4c/md4c.pc.in Normal file

@ -0,0 +1,12 @@
prefix=@CMAKE_INSTALL_PREFIX@
exec_prefix=@CMAKE_INSTALL_PREFIX@
libdir=${exec_prefix}/@CMAKE_INSTALL_LIBDIR@
includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: @PROJECT_NAME@
Description: @PROJECT_DESCRIPTION@
Version: @PROJECT_VERSION@
Requires:
Libs: -L${libdir} -lmd4c
Cflags: -I${includedir}


@ -0,0 +1,118 @@
#!/usr/bin/env python3
import os
import sys
import textwrap
self_path = os.path.dirname(os.path.realpath(__file__));
f = open(self_path + "/unicode/CaseFolding.txt", "r")
status_list = [ "C", "F" ]
folding_list = [ dict(), dict(), dict() ]
# Filter the foldings for "full" folding.
for line in f:
    comment_off = line.find("#")
    if comment_off >= 0:
        line = line[:comment_off]
    line = line.strip()
    if not line:
        continue
    raw_codepoint, status, raw_mapping, ignored_tail = line.split(";", 3)
    if not status.strip() in status_list:
        continue
    codepoint = int(raw_codepoint.strip(), 16)
    mapping = [int(it, 16) for it in raw_mapping.strip().split(" ")]
    mapping_len = len(mapping)
    if mapping_len in range(1, 4):
        folding_list[mapping_len-1][codepoint] = mapping
    else:
        assert(False)
f.close()
# Assuming that the codepoints at index0 ... index-1 already form a range,
# check that the codepoint at index is compatible with it too.
#
# We can handle ranges which:
#
# (1) either form a consecutive sequence of codepoints and which map that
#     range to another consecutive range of codepoints;
#
# (2) or form a sequence of codepoints with step 2 where each codepoint
#     CP is mapped to the next codepoint CP+1
#     (e.g. 0x1234 -> 0x1235; 0x1236 -> 0x1237; ...).
#
# (If the mappings have multiple codepoints, only the 1st mapped codepoint is
# considered and all the other ones have to be the same for the whole range.)
def is_range_compatible(folding, codepoint_list, index0, index):
    N = index - index0
    codepoint0 = codepoint_list[index0]
    codepoint1 = codepoint_list[index0+1]
    codepointN = codepoint_list[index]
    mapping0 = folding[codepoint0]
    mapping1 = folding[codepoint1]
    mappingN = folding[codepointN]
    # Check the range type (1):
    if codepoint1 - codepoint0 == 1 and codepointN - codepoint0 == N \
       and mapping1[0] - mapping0[0] == 1 and mapping1[1:] == mapping0[1:] \
       and mappingN[0] - mapping0[0] == N and mappingN[1:] == mapping0[1:]:
        return True
    # Check the range type (2):
    if codepoint1 - codepoint0 == 2 and codepointN - codepoint0 == 2 * N \
       and mapping0[0] - codepoint0 == 1 \
       and mapping1[0] - codepoint1 == 1 and mapping1[1:] == mapping0[1:] \
       and mappingN[0] - codepointN == 1 and mappingN[1:] == mapping0[1:]:
        return True
    return False
def mapping_str(list, mapping):
    return ",".join("0x{:04x}".format(x) for x in mapping)
for mapping_len in range(1, 4):
    folding = folding_list[mapping_len-1]
    codepoint_list = list(folding)
    index0 = 0
    count = len(folding)
    records = list()
    data_records = list()
    while index0 < count:
        index1 = index0 + 1
        while index1 < count and is_range_compatible(folding, codepoint_list, index0, index1):
            index1 += 1
        if index1 - index0 > 2:
            # Range of codepoints
            records.append("R(0x{:04x},0x{:04x})".format(codepoint_list[index0], codepoint_list[index1-1]))
            data_records.append(mapping_str(data_records, folding[codepoint_list[index0]]))
            data_records.append(mapping_str(data_records, folding[codepoint_list[index1-1]]))
        else:
            # Single codepoint
            records.append("S(0x{:04x})".format(codepoint_list[index0]))
            data_records.append(mapping_str(data_records, folding[codepoint_list[index0]]))
        index0 = index1
    sys.stdout.write("static const unsigned FOLD_MAP_{}[] = {{\n".format(mapping_len))
    sys.stdout.write("\n".join(textwrap.wrap(", ".join(records), 110,
                     initial_indent = " ", subsequent_indent=" ")))
    sys.stdout.write("\n};\n")
    sys.stdout.write("static const unsigned FOLD_MAP_{}_DATA[] = {{\n".format(mapping_len))
    sys.stdout.write("\n".join(textwrap.wrap(", ".join(data_records), 110,
                     initial_indent = " ", subsequent_indent=" ")))
    sys.stdout.write("\n};\n")
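
To see what the two range types described in the script's comments encode, the sketch below expands a range record back into individual foldings. It is an illustration only and is not taken from md4c.c, whose diff is suppressed in this view. The function name `expand_range` is hypothetical, and telling the two types apart by checking whether the first codepoint folds to its direct successor is an assumption of this sketch.

```python
def expand_range(first_cp, last_cp, first_map):
    """Expand a range record into {codepoint: mapping}.

    first_cp, last_cp: first and last codepoint of the range (inclusive).
    first_map: mapping (list of codepoints) of first_cp.
    """
    tail = first_map[1:]  # shared by every codepoint of the range
    if first_map[0] == first_cp + 1:
        # Type (2): every second codepoint, each folding to its successor.
        return {cp: [cp + 1] + tail for cp in range(first_cp, last_cp + 1, 2)}
    # Type (1): consecutive codepoints folding to a consecutive target range.
    return {cp: [first_map[0] + (cp - first_cp)] + tail
            for cp in range(first_cp, last_cp + 1)}


# Type (1): 'A'..'Z' fold to 'a'..'z'.
upper = expand_range(0x41, 0x5A, [0x61])
# Type (2): 0x100 -> 0x101, 0x102 -> 0x103, 0x104 -> 0x105.
alternating = expand_range(0x100, 0x104, [0x101])
```

Storing only the first and last mapping per range is what keeps the generated tables small: a whole alphabet collapses into a single record.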


@ -0,0 +1,66 @@
#!/usr/bin/env python3
import os
import sys
import textwrap
self_path = os.path.dirname(os.path.realpath(__file__));
f = open(self_path + "/unicode/DerivedGeneralCategory.txt", "r")
codepoint_list = []
category_list = [ "Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps" ]
# Filter codepoints falling in the right category:
for line in f:
    comment_off = line.find("#")
    if comment_off >= 0:
        line = line[:comment_off]
    line = line.strip()
    if not line:
        continue
    char_range, category = line.split(";")
    char_range = char_range.strip()
    category = category.strip()
    if not category in category_list:
        continue
    delim_off = char_range.find("..")
    if delim_off >= 0:
        codepoint0 = int(char_range[:delim_off], 16)
        codepoint1 = int(char_range[delim_off+2:], 16)
        for codepoint in range(codepoint0, codepoint1 + 1):
            codepoint_list.append(codepoint)
    else:
        codepoint = int(char_range, 16)
        codepoint_list.append(codepoint)
f.close()
codepoint_list.sort()
index0 = 0
count = len(codepoint_list)
records = list()
while index0 < count:
    index1 = index0 + 1
    while index1 < count and codepoint_list[index1] == codepoint_list[index1-1] + 1:
        index1 += 1
    if index1 - index0 > 1:
        # Range of codepoints
        records.append("R(0x{:04x},0x{:04x})".format(codepoint_list[index0], codepoint_list[index1-1]))
    else:
        # Single codepoint
        records.append("S(0x{:04x})".format(codepoint_list[index0]))
    index0 = index1
sys.stdout.write("static const unsigned PUNCT_MAP[] = {\n")
sys.stdout.write("\n".join(textwrap.wrap(", ".join(records), 110,
                 initial_indent = " ", subsequent_indent=" ")))
sys.stdout.write("\n};\n\n")
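
The grouping loop above can be tried on a tiny input by wrapping it in a function. The sketch below repeats the same logic; only the function name `compress` and the sample codepoints are made up for illustration.

```python
def compress(codepoints):
    """Group sorted codepoints into R(first,last) and S(single) records."""
    codepoints = sorted(codepoints)
    count = len(codepoints)
    records = []
    index0 = 0
    while index0 < count:
        index1 = index0 + 1
        # Extend the run while the next codepoint directly follows the previous one.
        while index1 < count and codepoints[index1] == codepoints[index1 - 1] + 1:
            index1 += 1
        if index1 - index0 > 1:
            records.append("R(0x{:04x},0x{:04x})".format(codepoints[index0],
                                                         codepoints[index1 - 1]))
        else:
            records.append("S(0x{:04x})".format(codepoints[index0]))
        index0 = index1
    return records


sample = compress([0x21, 0x22, 0x23, 0x25])
```

Here `sample` holds one range record for 0x21..0x23 and one single record for 0x25.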


@ -0,0 +1,66 @@
#!/usr/bin/env python3
import os
import sys
import textwrap
self_path = os.path.dirname(os.path.realpath(__file__));
f = open(self_path + "/unicode/DerivedGeneralCategory.txt", "r")
codepoint_list = []
category_list = [ "Zs" ]
# Filter codepoints falling in the right category:
for line in f:
    comment_off = line.find("#")
    if comment_off >= 0:
        line = line[:comment_off]
    line = line.strip()
    if not line:
        continue
    char_range, category = line.split(";")
    char_range = char_range.strip()
    category = category.strip()
    if not category in category_list:
        continue
    delim_off = char_range.find("..")
    if delim_off >= 0:
        codepoint0 = int(char_range[:delim_off], 16)
        codepoint1 = int(char_range[delim_off+2:], 16)
        for codepoint in range(codepoint0, codepoint1 + 1):
            codepoint_list.append(codepoint)
    else:
        codepoint = int(char_range, 16)
        codepoint_list.append(codepoint)
f.close()
codepoint_list.sort()
index0 = 0
count = len(codepoint_list)
records = list()
while index0 < count:
    index1 = index0 + 1
    while index1 < count and codepoint_list[index1] == codepoint_list[index1-1] + 1:
        index1 += 1
    if index1 - index0 > 1:
        # Range of codepoints
        records.append("R(0x{:04x},0x{:04x})".format(codepoint_list[index0], codepoint_list[index1-1]))
    else:
        # Single codepoint
        records.append("S(0x{:04x})".format(codepoint_list[index0]))
    index0 = index1
sys.stdout.write("static const unsigned WHITESPACE_MAP[] = {\n")
sys.stdout.write("\n".join(textwrap.wrap(", ".join(records), 110,
                 initial_indent = " ", subsequent_indent=" ")))
sys.stdout.write("\n};\n\n")
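
Because the records are emitted in sorted order, a table of this shape lends itself to binary search. The sketch below shows one way a membership test could look. It is an assumption about how such a table might be consumed, not a description of the lookup in md4c.c, and the ranges are a short hand-written sample of space separators rather than the generated table.

```python
import bisect

# Hand-written sample: (first, last) pairs, sorted and non-overlapping.
# A single codepoint is stored as a range of length one.
SAMPLE_RANGES = [
    (0x0020, 0x0020),
    (0x00A0, 0x00A0),
    (0x2000, 0x200A),
    (0x3000, 0x3000),
]
RANGE_STARTS = [first for first, _ in SAMPLE_RANGES]


def in_sample_map(codepoint):
    """Return True if codepoint falls into one of the sample ranges."""
    # Find the last range starting at or before the codepoint.
    i = bisect.bisect_right(RANGE_STARTS, codepoint) - 1
    return i >= 0 and codepoint <= SAMPLE_RANGES[i][1]
```

The lookup costs a logarithmic number of comparisons in the number of records, which is the practical payoff of merging consecutive codepoints into ranges first.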

70
scripts/coverity.sh Executable file

@ -0,0 +1,70 @@
#!/bin/sh
#
# This script attempts to build the project via the cov-build utility and to
# prepare a package for uploading to the Coverity Scan service.
#
# (See http://scan.coverity.com for more info.)
set -e
# Check presence of coverity static analyzer.
if ! which cov-build; then
echo "Utility cov-build not found in PATH."
exit 1
fi
# Choose a build system (ninja or GNU make).
if which ninja; then
BUILD_TOOL=ninja
GENERATOR=Ninja
elif which make; then
BUILD_TOOL=make
GENERATOR="MSYS Makefiles"
else
echo "No suitable build system found."
exit 1
fi
# Choose a zip tool.
if which 7za; then
MKZIP="7za a -r -mx9"
elif which 7z; then
MKZIP="7z a -r -mx9"
elif which zip; then
MKZIP="zip -r"
else
echo "No suitable zip utility found"
exit 1
fi
# Change dir to project root.
cd `dirname "$0"`/..
CWD=`pwd`
ROOT_DIR="$CWD"
BUILD_DIR="$CWD/coverity"
OUTPUT="$CWD/cov-int.zip"
# Sanity checks.
if [ ! -x "$ROOT_DIR/scripts/coverity.sh" ]; then
echo "There is some path mismatch."
exit 1
fi
if [ -e "$BUILD_DIR" ]; then
echo "Path $BUILD_DIR already exists. Delete it and retry."
exit 1
fi
if [ -e "$OUTPUT" ]; then
echo "Path $OUTPUT already exists. Delete it and retry."
exit 1
fi
# Build the project with the Coverity analysis enabled.
mkdir -p "$BUILD_DIR"
cd "$BUILD_DIR"
cmake -G "$GENERATOR" "$ROOT_DIR"
cov-build --dir cov-int "$BUILD_TOOL"
$MKZIP "$OUTPUT" "cov-int"
cd "$ROOT_DIR"
rm -rf "$BUILD_DIR"

75
scripts/run-tests.sh Executable file

@ -0,0 +1,75 @@
#!/bin/sh
#
# Run this script from build directory.
#set -e
SELF_DIR=`dirname $0`
PROJECT_DIR="$SELF_DIR/.."
TEST_DIR="$PROJECT_DIR/test"
PROGRAM="md2html/md2html"
if [ ! -x "$PROGRAM" ]; then
echo "Cannot find the $PROGRAM." >&2
echo "You have to run this script from the build directory." >&2
exit 1
fi
if which py >>/dev/null 2>&1; then
PYTHON=py
elif which python3 >>/dev/null 2>&1; then
PYTHON=python3
elif which python >>/dev/null 2>&1; then
if [ `python --version 2>&1 | awk '{print $2}' | cut -d. -f1` -ge 3 ]; then
PYTHON=python
fi
fi
echo
echo "CommonMark specification:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/spec.txt" -p "$PROGRAM"
echo
echo "Code coverage & regressions:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/coverage.txt" -p "$PROGRAM"
echo
echo "Permissive e-mail autolinks extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/permissive-email-autolinks.txt" -p "$PROGRAM --fpermissive-email-autolinks"
echo
echo "Permissive URL autolinks extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/permissive-url-autolinks.txt" -p "$PROGRAM --fpermissive-url-autolinks"
echo
echo "WWW autolinks extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/permissive-www-autolinks.txt" -p "$PROGRAM --fpermissive-www-autolinks"
echo
echo "Tables extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/tables.txt" -p "$PROGRAM --ftables"
echo
echo "Strikethrough extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/strikethrough.txt" -p "$PROGRAM --fstrikethrough"
echo
echo "Task lists extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/tasklists.txt" -p "$PROGRAM --ftasklists"
echo
echo "LaTeX extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/latex-math.txt" -p "$PROGRAM --flatex-math"
echo
echo "Wiki links extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/wiki-links.txt" -p "$PROGRAM --fwiki-links --ftables"
echo
echo "Underline extension:"
$PYTHON "$TEST_DIR/spec_tests.py" -s "$TEST_DIR/underline.txt" -p "$PROGRAM --funderline"
echo
echo "Pathological input:"
$PYTHON "$TEST_DIR/pathological_tests.py" -p "$PROGRAM"
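
All spec-based suites above follow one pattern: the same spec_tests.py runner, a spec file, and the md2html program with optional flags. A table-driven sketch of that pattern is shown below. The file names and flags are copied from the script; the function name `build_commands` and the tuple layout are hypothetical.

```python
# (title, spec file, extra md2html flags), as used in run-tests.sh.
SUITES = [
    ("CommonMark specification", "spec.txt", ""),
    ("Code coverage & regressions", "coverage.txt", ""),
    ("Permissive e-mail autolinks extension", "permissive-email-autolinks.txt",
     "--fpermissive-email-autolinks"),
    ("Permissive URL autolinks extension", "permissive-url-autolinks.txt",
     "--fpermissive-url-autolinks"),
    ("WWW autolinks extension", "permissive-www-autolinks.txt",
     "--fpermissive-www-autolinks"),
    ("Tables extension", "tables.txt", "--ftables"),
    ("Strikethrough extension", "strikethrough.txt", "--fstrikethrough"),
    ("Task lists extension", "tasklists.txt", "--ftasklists"),
    ("LaTeX extension", "latex-math.txt", "--flatex-math"),
    ("Wiki links extension", "wiki-links.txt", "--fwiki-links --ftables"),
    ("Underline extension", "underline.txt", "--funderline"),
]


def build_commands(python, test_dir, program):
    """Return a list of (title, argv) pairs, one per spec-based suite."""
    commands = []
    for title, spec, flags in SUITES:
        # The program and its flags travel as ONE argument after -p,
        # exactly as the shell script quotes them.
        prog = (program + " " + flags).strip()
        argv = [python, test_dir + "/spec_tests.py",
                "-s", test_dir + "/" + spec, "-p", prog]
        commands.append((title, argv))
    return commands
```

Each argv could then be passed to `subprocess.run()`. The pathological-input run is left out because it uses a different runner script.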

File diff suppressed because it is too large

File diff suppressed because it is too large

64
test/LICENSE Normal file

@ -0,0 +1,64 @@
The CommonMark spec (spec.txt) and DTD (CommonMark.dtd) are
Copyright (C) 2014-16 John MacFarlane
Released under the Creative Commons CC-BY-SA 4.0 license:
<http://creativecommons.org/licenses/by-sa/4.0/>.
---
The test software in test/ and the programs in tools/ are
Copyright (c) 2014, John MacFarlane
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
---
The normalization code in runtests.py was derived from the
markdowntest project, Copyright 2013 Karl Dubost:
The MIT License (MIT)
Copyright (c) 2013 Karl Dubost
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

40
test/cmark.py Executable file

@ -0,0 +1,40 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from ctypes import CDLL, c_char_p, c_int, c_size_t
from subprocess import Popen, PIPE
import platform
import os

def pipe_through_prog(prog, text):
    # Run the program, feed it the Markdown on stdin and collect its output.
    p1 = Popen(prog.split(), stdout=PIPE, stdin=PIPE, stderr=PIPE)
    [result, err] = p1.communicate(input=text.encode('utf-8'))
    return [p1.returncode, result.decode('utf-8'), err]

def use_library(lib, text):
    # Call the library entry point directly; the return code is always 0.
    textbytes = text.encode('utf-8')
    textlen = len(textbytes)
    return [0, lib(textbytes, textlen, 0).decode('utf-8'), '']

class CMark:
    def __init__(self, prog=None, library_dir=None):
        self.prog = prog
        if prog:
            self.to_html = lambda x: pipe_through_prog(prog, x)
        else:
            sysname = platform.system()
            if sysname == 'Darwin':
                libname = "libcmark.dylib"
            elif sysname == 'Windows':
                libname = "cmark.dll"
            else:
                libname = "libcmark.so"
            if library_dir:
                libpath = os.path.join(library_dir, libname)
            else:
                libpath = os.path.join("build", "src", libname)
            cmark = CDLL(libpath)
            markdown = cmark.cmark_markdown_to_html
            markdown.restype = c_char_p
            # char *cmark_markdown_to_html(const char *text, size_t len, int options)
            markdown.argtypes = [c_char_p, c_size_t, c_int]
            self.to_html = lambda x: use_library(markdown, x)

464
test/coverage.txt Normal file

@ -0,0 +1,464 @@
# Coverage
This file is a collection of unit tests not covered elsewhere.
Regression tests, tests improving code coverage, and other useful cases
belong here.
(However any tests requiring any additional command line option, like enabling
an extension, must be included in their respective files.)
## GitHub Issues
### [Issue 2](https://github.com/mity/md4c/issues/2)
Raw HTML block:
```````````````````````````````` example
<gi att1=tok1 att2=tok2>
.
<gi att1=tok1 att2=tok2>
````````````````````````````````
Inline:
```````````````````````````````` example
foo <gi att1=tok1 att2=tok2> bar
.
<p>foo <gi att1=tok1 att2=tok2> bar</p>
````````````````````````````````
Inline with a line break:
```````````````````````````````` example
foo <gi att1=tok1
att2=tok2> bar
.
<p>foo <gi att1=tok1
att2=tok2> bar</p>
````````````````````````````````
### [Issue 4](https://github.com/mity/md4c/issues/4)
```````````````````````````````` example
![alt text with *entity* &copy;](img.png 'title')
.
<p><img src="img.png" alt="alt text with entity ©" title="title"></p>
````````````````````````````````
### [Issue 9](https://github.com/mity/md4c/issues/9)
```````````````````````````````` example
> [foo
> bar]: /url
>
> [foo bar]
.
<blockquote>
<p><a href="/url">foo
bar</a></p>
</blockquote>
````````````````````````````````
### [Issue 10](https://github.com/mity/md4c/issues/10)
```````````````````````````````` example
[x]:
x
- <?
x
.
<ul>
<li><?
x
</li>
</ul>
````````````````````````````````
### [Issue 11](https://github.com/mity/md4c/issues/11)
```````````````````````````````` example
x [link](/url "foo &ndash; bar") x
.
<p>x <a href="/url" title="foo bar">link</a> x</p>
````````````````````````````````
### [Issue 14](https://github.com/mity/md4c/issues/14)
```````````````````````````````` example
a***b* c*
.
<p>a*<em><em>b</em> c</em></p>
````````````````````````````````
### [Issue 15](https://github.com/mity/md4c/issues/15)
```````````````````````````````` example
***b* c*
.
<p>*<em><em>b</em> c</em></p>
````````````````````````````````
### [Issue 21](https://github.com/mity/md4c/issues/21)
```````````````````````````````` example
a*b**c*
.
<p>a<em>b**c</em></p>
````````````````````````````````
### [Issue 33](https://github.com/mity/md4c/issues/33)
```````````````````````````````` example
```&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;
.
<pre><code class="language-&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;"></code></pre>
````````````````````````````````
### [Issue 36](https://github.com/mity/md4c/issues/36)
```````````````````````````````` example
__x_ _x___
.
<p><em><em>x</em> <em>x</em></em>_</p>
````````````````````````````````
### [Issue 39](https://github.com/mity/md4c/issues/39)
```````````````````````````````` example
[\\]: x
.
````````````````````````````````
### [Issue 40](https://github.com/mity/md4c/issues/40)
```````````````````````````````` example
[x](url
'title'
)x
.
<p><a href="url" title="title">x</a>x</p>
````````````````````````````````
### [Issue 65](https://github.com/mity/md4c/issues/65)
```````````````````````````````` example
`
.
<p>`</p>
````````````````````````````````
### [Issue 74](https://github.com/mity/md4c/issues/74)
```````````````````````````````` example
[f]:
-
xx
-
.
<pre><code>xx
</code></pre>
<ul>
<li></li>
</ul>
````````````````````````````````
### [Issue 78](https://github.com/mity/md4c/issues/78)
```````````````````````````````` example
[SS ẞ]: /url
[ẞ SS]
.
<p><a href="/url">ẞ SS</a></p>
````````````````````````````````
### [Issue 83](https://github.com/mity/md4c/issues/83)
```````````````````````````````` example
foo
>
.
<p>foo</p>
<blockquote>
</blockquote>
````````````````````````````````
### [Issue 95](https://github.com/mity/md4c/issues/95)
```````````````````````````````` example
. foo
.
<p>. foo</p>
````````````````````````````````
### [Issue 96](https://github.com/mity/md4c/issues/96)
```````````````````````````````` example
[ab]: /foo
[a] [ab] [abc]
.
<p>[a] <a href="/foo">ab</a> [abc]</p>
````````````````````````````````
```````````````````````````````` example
[a b]: /foo
[a b]
.
<p><a href="/foo">a b</a></p>
````````````````````````````````
### [Issue 97](https://github.com/mity/md4c/issues/97)
```````````````````````````````` example
*a **b c* d**
.
<p><em>a <em><em>b c</em> d</em></em></p>
````````````````````````````````
### [Issue 100](https://github.com/mity/md4c/issues/100)
```````````````````````````````` example
<foo@123456789012345678901234567890123456789012345678901234567890123.123456789012345678901234567890123456789012345678901234567890123>
.
<p><a href="mailto:foo@123456789012345678901234567890123456789012345678901234567890123.123456789012345678901234567890123456789012345678901234567890123">foo@123456789012345678901234567890123456789012345678901234567890123.123456789012345678901234567890123456789012345678901234567890123</a></p>
````````````````````````````````
```````````````````````````````` example
<foo@123456789012345678901234567890123456789012345678901234567890123x.123456789012345678901234567890123456789012345678901234567890123>
.
<p>&lt;foo@123456789012345678901234567890123456789012345678901234567890123x.123456789012345678901234567890123456789012345678901234567890123&gt;</p>
````````````````````````````````
(Note the `x` here which turns it over the max. allowed length limit.)
### [Issue 107](https://github.com/mity/md4c/issues/107)
```````````````````````````````` example
***foo *bar baz***
.
<p>*<strong>foo <em>bar baz</em></strong></p>
````````````````````````````````
## Code coverage
### `md_is_unicode_whitespace__()`
Unicode whitespace (here U+2000) forms a word boundary, so the asterisks cannot
be resolved as an emphasis span: there is no closer mark.
```````````````````````````````` example
*foo *bar
.
<p>*foo *bar</p>
````````````````````````````````
### `md_is_unicode_punct__()`
Ditto for Unicode punctuation (here U+00A1).
```````````````````````````````` example
*foo¡*bar
.
<p>*foo¡*bar</p>
````````````````````````````````
### `md_get_unicode_fold_info()`
```````````````````````````````` example
[Příliš žluťoučký kůň úpěl ďábelské ódy.]
[PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.]: /url
.
<p><a href="/url">Příliš žluťoučký kůň úpěl ďábelské ódy.</a></p>
````````````````````````````````
### `md_decode_utf8__()` and `md_decode_utf8_before__()`
```````````````````````````````` example
á*Á (U+00E1, i.e. two byte UTF-8 sequence)
 *  (U+2000, i.e. three byte UTF-8 sequence)
.
<p>á*Á (U+00E1, i.e. two byte UTF-8 sequence)
* (U+2000, i.e. three byte UTF-8 sequence)</p>
````````````````````````````````
### `md_is_link_destination_A()`
```````````````````````````````` example
[link](</url\.with\.escape>)
.
<p><a href="/url.with.escape">link</a></p>
````````````````````````````````
### `md_link_label_eq()`
```````````````````````````````` example
[foo bar]
[foo bar]: /url
.
<p><a href="/url">foo bar</a></p>
````````````````````````````````
### `md_is_inline_link_spec()`
```````````````````````````````` example
> [link](/url 'foo
> bar')
.
<blockquote>
<p><a href="/url" title="foo
bar">link</a></p>
</blockquote>
````````````````````````````````
### `md_build_ref_def_hashtable()`
All link labels in the following example have the same FNV1a hash (after
normalization of the label, which means after converting to a vector of Unicode
codepoints and lowercase folding).
So the example triggers quite complex code paths which are not otherwise easily
tested.
```````````````````````````````` example
[foo]: /foo
[qnptgbh]: /qnptgbh
[abgbrwcv]: /abgbrwcv
[abgbrwcv]: /abgbrwcv2
[abgbrwcv]: /abgbrwcv3
[abgbrwcv]: /abgbrwcv4
[alqadfgn]: /alqadfgn
[foo]
[qnptgbh]
[abgbrwcv]
[alqadfgn]
[axgydtdu]
.
<p><a href="/foo">foo</a>
<a href="/qnptgbh">qnptgbh</a>
<a href="/abgbrwcv">abgbrwcv</a>
<a href="/alqadfgn">alqadfgn</a>
[axgydtdu]</p>
````````````````````````````````
For the sake of completeness, the following C program was used to find the hash
collisions by brute force:
~~~
#include <stdio.h>
#include <string.h>

static unsigned etalon;

#define MD_FNV1A_BASE       2166136261
#define MD_FNV1A_PRIME      16777619

static inline unsigned
fnv1a(unsigned base, const void* data, size_t n)
{
    const unsigned char* buf = (const unsigned char*) data;
    unsigned hash = base;
    size_t i;

    for(i = 0; i < n; i++) {
        hash ^= buf[i];
        hash *= MD_FNV1A_PRIME;
    }

    return hash;
}

static unsigned
unicode_hash(const char* data, size_t n)
{
    unsigned value;
    unsigned hash = MD_FNV1A_BASE;
    int i;

    for(i = 0; i < n; i++) {
        value = data[i];
        hash = fnv1a(hash, &value, sizeof(unsigned));
    }

    return hash;
}

static void
recurse(char* buffer, size_t off, size_t len)
{
    int ch;

    if(off < len - 1) {
        for(ch = 'a'; ch <= 'z'; ch++) {
            buffer[off] = ch;
            recurse(buffer, off+1, len);
        }
    } else {
        for(ch = 'a'; ch <= 'z'; ch++) {
            buffer[off] = ch;
            if(unicode_hash(buffer, len) == etalon) {
                printf("Dup: %.*s\n", (int)len, buffer);
            }
        }
    }
}

int
main(int argc, char** argv)
{
    char buffer[32];
    int len;

    if(argc < 2)
        etalon = unicode_hash("foo", 3);
    else
        etalon = unicode_hash(argv[1], strlen(argv[1]));

    for(len = 1; len <= sizeof(buffer); len++)
        recurse(buffer, 0, len);

    return 0;
}
~~~
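To check a handful of labels without re-running the brute-force search, the
same hash can be sketched in Python. This is only an illustration mirroring the
C program above, not MD4C's own code: the names `fnv1a` and `label_hash` are
made up here, and it assumes a 4-byte little-endian `unsigned`, as on the
machine where the collisions were presumably found.

```python
import struct

FNV1A_BASE = 2166136261
FNV1A_PRIME = 16777619

def fnv1a(base, data):
    """32-bit FNV-1a over a bytes object, starting from `base`."""
    h = base
    for byte in data:
        h ^= byte
        h = (h * FNV1A_PRIME) & 0xFFFFFFFF  # emulate 32-bit unsigned overflow
    return h

def label_hash(label):
    """Hash every code point as a 4-byte little-endian unsigned (assumption)."""
    h = FNV1A_BASE
    for ch in label:
        h = fnv1a(h, struct.pack("<I", ord(ch)))
    return h

if __name__ == "__main__":
    # Print the hash of each label used in the example above.
    for label in ("foo", "qnptgbh", "abgbrwcv", "alqadfgn", "axgydtdu"):
        print("%-10s 0x%08x" % (label, label_hash(label)))
```

If the assumptions hold, all printed hashes should be identical.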


@ -0,0 +1,41 @@
# h1
## h2
### h3
#### h4
##### h5
###### h6
h1
==
h2
--
--------------------
indented code
```
fenced code
```
<tag attr='val' attr2="val2">
> quote
* list item
1. list item
[ref]: /url
paragraph
&copy; &#1234; &#xabcd;
`code`
*emph* **strong** ***strong emph***
_emph_ __strong__ ___strong emph___
[ref] [ref][] [link](/url)
![ref] ![ref][] ![img](/url)
<http://example.com> <doe@example.com>
www.example.com doe@example.com
\\ \* \. \` \

8
test/fuzz-input/gfm.md Normal file

@ -0,0 +1,8 @@
* [ ] unchecked
* [x] checked
A | B | C
---|--:|:-:
aaa|bbb|ccc
~del~ ~~del~~


@ -0,0 +1 @@
$a^2+b^2=c^2$ $$a^2+b^2=c^2$$

1
test/fuzz-input/wiki.md Normal file

@ -0,0 +1 @@
[[wiki]] [[wiki|label]]

39
test/latex-math.txt Normal file

@ -0,0 +1,39 @@
# LaTeX Math
With the flag `MD_FLAG_LATEXMATHSPANS`, MD4C enables an extension that
recognizes LaTeX-style math spans.
A math span is any text wrapped in single or double dollar signs (`$...$` or
`$$...$$`).
```````````````````````````````` example
$a+b=c$ Hello, world!
.
<p><x-equation>a+b=c</x-equation> Hello, world!</p>
````````````````````````````````
If the double dollar sign is used, the math span is a display math span.
```````````````````````````````` example
This is a display equation: $$\int_a^b x dx$$.
.
<p>This is a display equation: <x-equation type="display">\int_a^b x dx</x-equation>.</p>
````````````````````````````````
Math spans may span multiple lines as they are normal spans:
```````````````````````````````` example
$$
\int_a^b
f(x) dx
$$
.
<p><x-equation type="display">\int_a^b f(x) dx </x-equation></p>
````````````````````````````````
Note, though, that many (simple) renderers may output math spans as
verbatim text. (This includes the HTML renderer used by the `md2html` utility.)
Only advanced renderers which implement LaTeX math syntax can be expected to
provide better results.
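As a rough illustration of the span syntax described above, the following
Python sketch wraps `$...$` and `$$...$$` spans in `<x-equation>` tags. It is a
deliberately naive regex-based approximation, not how MD4C resolves spans: it
ignores escapes, code spans and all other inline rules, and the name
`render_math_spans` is made up for this example.

```python
import re

# `$$...$$` is tried before `$...$` so a display span is not split in two.
MATH_SPAN = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)

def render_math_spans(text):
    """Replace math spans with <x-equation> elements (naive sketch)."""
    def repl(match):
        display, inline = match.group(1), match.group(2)
        if display is not None:
            return '<x-equation type="display">' + display + "</x-equation>"
        return "<x-equation>" + inline + "</x-equation>"
    return MATH_SPAN.sub(repl, text)

if __name__ == "__main__":
    print(render_math_spans("$a+b=c$ Hello, world!"))
    print(render_math_spans(r"This is a display equation: $$\int_a^b x dx$$."))
```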

194
test/normalize.py Executable file

@ -0,0 +1,194 @@
# -*- coding: utf-8 -*-
from html.parser import HTMLParser
from html import escape as html_escape  # cgi.escape() was removed in Python 3.8
import urllib.parse

try:
    from html.parser import HTMLParseError
except ImportError:
    # HTMLParseError was removed in Python 3.5. It could never be
    # thrown, so we define a placeholder instead.
    class HTMLParseError(Exception):
        pass

from html.entities import name2codepoint
import sys
import re

# Normalization code, adapted from
# https://github.com/karlcow/markdown-testsuite/
significant_attrs = ["alt", "href", "src", "title"]
whitespace_re = re.compile(r'\s+')

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.convert_charrefs = False
        self.last = "starttag"
        self.in_pre = False
        self.output = ""
        self.last_tag = ""

    def handle_data(self, data):
        after_tag = self.last == "endtag" or self.last == "starttag"
        after_block_tag = after_tag and self.is_block_tag(self.last_tag)
        if after_tag and self.last_tag == "br":
            data = data.lstrip('\n')
        if not self.in_pre:
            data = whitespace_re.sub(' ', data)
        if after_block_tag and not self.in_pre:
            if self.last == "starttag":
                data = data.lstrip()
            elif self.last == "endtag":
                data = data.strip()
        self.output += data
        self.last = "data"

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False
        elif self.is_block_tag(tag):
            self.output = self.output.rstrip()
        self.output += "</" + tag + ">"
        self.last_tag = tag
        self.last = "endtag"

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
        if self.is_block_tag(tag):
            self.output = self.output.rstrip()
        self.output += "<" + tag
        # For now we don't strip out 'extra' attributes, because of
        # raw HTML test cases.
        # attrs = filter(lambda attr: attr[0] in significant_attrs, attrs)
        if attrs:
            attrs.sort()
            for (k,v) in attrs:
                self.output += " " + k
                # NOTE: this tests the attribute *value*; the check may have
                # been meant for the key `k`. Kept as-is so that normalization
                # results do not change.
                if v in ['href','src']:
                    self.output += ("=" + '"' +
                            urllib.parse.quote(urllib.parse.unquote(v), safe='/') + '"')
                elif v is not None:
                    self.output += ("=" + '"' + html_escape(v,quote=True) + '"')
        self.output += ">"
        self.last_tag = tag
        self.last = "starttag"

    def handle_startendtag(self, tag, attrs):
        """Ignore closing tag for self-closing """
        self.handle_starttag(tag, attrs)
        self.last_tag = tag
        self.last = "endtag"

    def handle_comment(self, data):
        self.output += '<!--' + data + '-->'
        self.last = "comment"

    def handle_decl(self, data):
        self.output += '<!' + data + '>'
        self.last = "decl"

    def unknown_decl(self, data):
        self.output += '<!' + data + '>'
        self.last = "decl"

    def handle_pi(self,data):
        self.output += '<?' + data + '>'
        self.last = "pi"

    def handle_entityref(self, name):
        try:
            c = chr(name2codepoint[name])
        except KeyError:
            c = None
        self.output_char(c, '&' + name + ';')
        self.last = "ref"

    def handle_charref(self, name):
        try:
            if name.startswith("x"):
                c = chr(int(name[1:], 16))
            else:
                c = chr(int(name))
        except ValueError:
            c = None
        self.output_char(c, '&' + name + ';')
        self.last = "ref"

    # Helpers.
    def output_char(self, c, fallback):
        if c == '<':
            self.output += "&lt;"
        elif c == '>':
            self.output += "&gt;"
        elif c == '&':
            self.output += "&amp;"
        elif c == '"':
            self.output += "&quot;"
        elif c is None:
            self.output += fallback
        else:
            self.output += c

    def is_block_tag(self,tag):
        return (tag in ['article', 'header', 'aside', 'hgroup', 'blockquote',
            'hr', 'iframe', 'body', 'li', 'map', 'button', 'object', 'canvas',
            'ol', 'caption', 'output', 'col', 'p', 'colgroup', 'pre', 'dd',
            'progress', 'div', 'section', 'dl', 'table', 'td', 'dt',
            'tbody', 'embed', 'textarea', 'fieldset', 'tfoot', 'figcaption',
            'th', 'figure', 'thead', 'footer', 'tr', 'form', 'ul',
            'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'video', 'script', 'style'])

def normalize_html(html):
r"""
Return normalized form of HTML which ignores insignificant output
differences:
Multiple inner whitespaces are collapsed to a single space (except
in pre tags):
>>> normalize_html("<p>a \t b</p>")
'<p>a b</p>'
>>> normalize_html("<p>a \t\nb</p>")
'<p>a b</p>'
* Whitespace surrounding block-level tags is removed.
>>> normalize_html("<p>a b</p>")
'<p>a b</p>'
>>> normalize_html(" <p>a b</p>")
'<p>a b</p>'
>>> normalize_html("<p>a b</p> ")
'<p>a b</p>'
>>> normalize_html("\n\t<p>\n\t\ta b\t\t</p>\n\t")
'<p>a b</p>'
>>> normalize_html("<i>a b</i> ")
'<i>a b</i> '
* Self-closing tags are converted to open tags.
>>> normalize_html("<br />")
'<br>'
* Attributes are sorted and lowercased.
>>> normalize_html('<a title="bar" HREF="foo">x</a>')
'<a href="foo" title="bar">x</a>'
* References are converted to unicode, except that '<', '>', '&', and
'"' are rendered using entities.
>>> normalize_html("&forall;&amp;&gt;&lt;&quot;")
'\u2200&amp;&gt;&lt;&quot;'
"""
html_chunk_re = re.compile("(\<!\[CDATA\[.*?\]\]\>|\<[^>]*\>|[^<]+)")
try:
parser = MyHTMLParser()
# We work around HTMLParser's limitations parsing CDATA
# by breaking the input into chunks and passing CDATA chunks
# through verbatim.
for chunk in re.finditer(html_chunk_re, html):
if chunk.group(0)[:8] == "<![CDATA":
parser.output += chunk.group(0)
else:
parser.feed(chunk.group(0))
parser.close()
return parser.output
except HTMLParseError as e:
sys.stderr.write("Normalization error: " + e.msg + "\n")
return html # on error, return unnormalized HTML

122
test/pathological_tests.py Executable file

@ -0,0 +1,122 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import argparse
import sys
import platform
from cmark import CMark
from timeit import default_timer as timer
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Run cmark tests.')
parser.add_argument('-p', '--program', dest='program', nargs='?', default=None,
help='program to test')
parser.add_argument('--library-dir', dest='library_dir', nargs='?',
default=None, help='directory containing dynamic library')
args = parser.parse_args(sys.argv[1:])
cmark = CMark(prog=args.program, library_dir=args.library_dir)
# list of pairs consisting of input and a regex that must match the output.
pathological = {
# note - some pythons have limit of 65535 for {num-matches} in re.
"nested strong emph":
(("*a **a " * 65000) + "b" + (" a** a*" * 65000),
re.compile("(<em>a <strong>a ){65000}b( a</strong> a</em>){65000}")),
"many emph closers with no openers":
(("a_ " * 65000),
re.compile("(a[_] ){64999}a_")),
"many emph openers with no closers":
(("_a " * 65000),
re.compile("(_a ){64999}_a")),
"many 3-emph openers with no closers":
(("a***" * 65000),
re.compile("(a<em><strong>a</strong></em>){32500}")),
"many link closers with no openers":
(("a]" * 65000),
re.compile("(a\]){65000}")),
"many link openers with no closers":
(("[a" * 65000),
re.compile("(\[a){65000}")),
"mismatched openers and closers":
(("*a_ " * 50000),
re.compile("([*]a[_] ){49999}[*]a_")),
"openers and closers multiple of 3":
(("a**b" + ("c* " * 50000)),
re.compile("a[*][*]b(c[*] ){49999}c[*]")),
"link openers and emph closers":
(("[ a_" * 50000),
re.compile("(\[ a_){50000}")),
"hard link/emph case":
("**x [a*b**c*](d)",
re.compile("\\*\\*x <a href=\"d\">a<em>b\\*\\*c</em></a>")),
"nested brackets":
(("[" * 50000) + "a" + ("]" * 50000),
re.compile("\[{50000}a\]{50000}")),
"nested block quotes":
((("> " * 50000) + "a"),
re.compile("(<blockquote>\r?\n){50000}")),
"U+0000 in input":
("abc\u0000de\u0000",
re.compile("abc\ufffd?de\ufffd?")),
"backticks":
("".join(map(lambda x: ("e" + "`" * x), range(1,1000))),
re.compile("^<p>[e`]*</p>\r?\n$")),
"many links":
("[t](/u) " * 50000,
re.compile("(<a href=\"/u\">t</a> ?){50000}")),
"many references":
("".join(map(lambda x: ("[" + str(x) + "]: u\n"), range(1,20000 * 16))) + "[0] " * 20000,
re.compile("(\[0\] ){19999}")),
"deeply nested lists":
("".join(map(lambda x: (" " * x + "* a\n"), range(0,1000))),
re.compile("<ul>\r?\n(<li>a<ul>\r?\n){999}<li>a</li>\r?\n</ul>\r?\n(</li>\r?\n</ul>\r?\n){999}")),
"many html openers and closers":
(("<>" * 50000),
re.compile("(&lt;&gt;){50000}")),
"many html proc. inst. openers":
(("x" + "<?" * 50000),
re.compile("x(&lt;\\?){50000}")),
"many html CDATA openers":
(("x" + "<![CDATA[" * 50000),
re.compile("x(&lt;!\\[CDATA\\[){50000}")),
"many backticks and escapes":
(("\\``" * 50000),
re.compile("(``){50000}")),
"many broken link titles":
(("[ (](" * 50000),
re.compile("(\[ \(\]\(){50000}")),
"broken thematic break":
(("* " * 50000 + "a"),
re.compile("<ul>\r?\n(<li><ul>\r?\n){49999}<li>a</li>\r?\n</ul>\r?\n(</li>\r?\n</ul>\r?\n){49999}"))
}
whitespace_re = re.compile(r'\s+')
passed = 0
errored = 0
failed = 0
#print("Testing pathological cases:")
for description in pathological:
(inp, regex) = pathological[description]
start = timer()
[rc, actual, err] = cmark.to_html(inp)
end = timer()
if rc != 0:
errored += 1
print('{:35} [ERRORED (return code {:d})]'.format(description, rc))
print(err)
elif regex.search(actual):
print('{:35} [PASSED] {:.3f} secs'.format(description, end-start))
passed += 1
else:
print('{:35} [FAILED]'.format(description))
print(repr(actual))
failed += 1
print("%d passed, %d failed, %d errored" % (passed, failed, errored))
if (failed == 0 and errored == 0):
exit(0)
else:
exit(1)


@ -0,0 +1,50 @@
# Permissive E-mail Autolinks
With the flag `MD_FLAG_PERMISSIVEEMAILAUTOLINKS`, MD4C enables more permissive
recognition of e-mail addresses and transforms them to autolinks, even if they
do not exactly follow the autolink syntax defined in the CommonMark
specification.
This is a standard CommonMark e-mail autolink:
```````````````````````````````` example
E-mail: <mailto:john.doe@gmail.com>
.
<p>E-mail: <a href="mailto:john.doe@gmail.com">mailto:john.doe@gmail.com</a></p>
````````````````````````````````
With the permissive autolinks enabled, this is sufficient:
```````````````````````````````` example
E-mail: john.doe@gmail.com
.
<p>E-mail: <a href="mailto:john.doe@gmail.com">john.doe@gmail.com</a></p>
````````````````````````````````
`+` can occur before the `@`, but not after.
```````````````````````````````` example
hello@mail+xyz.example isn't valid, but hello+xyz@mail.example is.
.
<p>hello@mail+xyz.example isn't valid, but <a href="mailto:hello+xyz@mail.example">hello+xyz@mail.example</a> is.</p>
````````````````````````````````
`.`, `-`, and `_` can occur on both sides of the `@`, but only `.` may occur at
the end of the email address, in which case it will not be considered part of
the address:
```````````````````````````````` example
a.b-c_d@a.b
a.b-c_d@a.b.
a.b-c_d@a.b-
a.b-c_d@a.b_
.
<p><a href="mailto:a.b-c_d@a.b">a.b-c_d@a.b</a></p>
<p><a href="mailto:a.b-c_d@a.b">a.b-c_d@a.b</a>.</p>
<p>a.b-c_d@a.b-</p>
<p>a.b-c_d@a.b_</p>
````````````````````````````````
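The rules above can be approximated in a few lines of Python. This is a
simplified sketch for illustration, not MD4C's implementation: the function
name `email_autolinks` is hypothetical, only ASCII letters and digits are
treated as alphanumeric, and requiring a period in the domain part is an
assumption made here.

```python
import re

# Candidate: local part may contain `+`, the domain part may not.
CANDIDATE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9._-]+")

def email_autolinks(text):
    """Return the e-mail addresses in `text` that would become autolinks."""
    found = []
    for match in CANDIDATE.finditer(text):
        addr = match.group(0).rstrip(".")   # a trailing `.` is not part of the address
        domain = addr.split("@", 1)[1]
        if not domain or "." not in domain: # assumption: domain needs a period
            continue
        if addr[-1] in "-_":                # `-` or `_` at the end invalidates it
            continue
        found.append(addr)
    return found

if __name__ == "__main__":
    for line in ("a.b-c_d@a.b", "a.b-c_d@a.b.", "a.b-c_d@a.b-", "a.b-c_d@a.b_"):
        print(repr(line), "->", email_autolinks(line))
```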


@ -0,0 +1,92 @@
# Permissive URL Autolinks
With the flag `MD_FLAG_PERMISSIVEURLAUTOLINKS`, MD4C enables more permissive recognition
of URLs and transforms them to autolinks, even if they do not exactly follow the
autolink syntax defined in the CommonMark specification.
This is a standard CommonMark autolink:
```````````````````````````````` example
Homepage: <https://github.com/mity/md4c>
.
<p>Homepage: <a href="https://github.com/mity/md4c">https://github.com/mity/md4c</a></p>
````````````````````````````````
With the permissive autolinks enabled, this is sufficient:
```````````````````````````````` example
Homepage: https://github.com/mity/md4c
.
<p>Homepage: <a href="https://github.com/mity/md4c">https://github.com/mity/md4c</a></p>
````````````````````````````````
But this permissive autolink feature can work only for very widely used URL
schemes, in alphabetical order `ftp:`, `http:`, `https:`.
That's why this is not a permissive autolink:
```````````````````````````````` example
ssh://root@example.com
.
<p>ssh://root@example.com</p>
````````````````````````````````
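The scheme restriction can be pictured as a simple prefix check. The sketch
below is only illustrative: `starts_permissive_url` is a made-up helper, and
requiring `//` right after the scheme is an assumption, not something this
document states.

```python
# Schemes recognized by the permissive URL autolink extension (from the text above).
PERMISSIVE_SCHEMES = ("ftp:", "http:", "https:")

def starts_permissive_url(text):
    """True if `text` begins with one of the supported schemes followed by `//`."""
    return text.startswith(tuple(scheme + "//" for scheme in PERMISSIVE_SCHEMES))

if __name__ == "__main__":
    print(starts_permissive_url("https://github.com/mity/md4c"))
    print(starts_permissive_url("ssh://root@example.com"))
```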
The same rules for path validation as for permissive WWW autolinks apply.
Therefore the final question mark here is not part of the autolink:
```````````````````````````````` example
Have you ever visited http://www.zombo.com?
.
<p>Have you ever visited <a href="http://www.zombo.com">http://www.zombo.com</a>?</p>
````````````````````````````````
But in contrast, in this example it is:
```````````````````````````````` example
http://www.bing.com/search?q=md4c
.
<p><a href="http://www.bing.com/search?q=md4c">http://www.bing.com/search?q=md4c</a></p>
````````````````````````````````
And finally one complex example:
```````````````````````````````` example
http://commonmark.org
(Visit https://encrypted.google.com/search?q=Markup+(business))
Anonymous FTP is available at ftp://foo.bar.baz.
.
<p><a href="http://commonmark.org">http://commonmark.org</a></p>
<p>(Visit <a href="https://encrypted.google.com/search?q=Markup+(business)">https://encrypted.google.com/search?q=Markup+(business)</a>)</p>
<p>Anonymous FTP is available at <a href="ftp://foo.bar.baz">ftp://foo.bar.baz</a>.</p>
````````````````````````````````
## GitHub Issues
### [Issue 53](https://github.com/mity/md4c/issues/53)
```````````````````````````````` example
This is [link](http://github.com/).
.
<p>This is <a href="http://github.com/">link</a>.</p>
````````````````````````````````
```````````````````````````````` example
This is [link](http://github.com/)X
.
<p>This is <a href="http://github.com/">link</a>X</p>
````````````````````````````````
### [Issue 76](https://github.com/mity/md4c/issues/76)
```````````````````````````````` example
*(http://example.com)*
.
<p><em>(<a href="http://example.com">http://example.com</a>)</em></p>
````````````````````````````````


@ -0,0 +1,107 @@
# Permissive WWW Autolinks
With the flag `MD_FLAG_PERMISSIVEWWWAUTOLINKS`, MD4C enables recognition of
autolinks starting with `www.`, even if they do not exactly follow the autolink
syntax defined in the CommonMark specification.
These do not have to be enclosed in `<` and `>`, and they do not even need
any preceding scheme specification.
The WWW autolink will be recognized when a valid domain is found.
A valid domain consists of the text `www.`, followed by alphanumeric characters,
underscores (`_`), hyphens (`-`) and periods (`.`). There must be at least one
period, and no underscores may be present in the last two segments of the domain.
The scheme `http` will be inserted automatically:
```````````````````````````````` example
www.commonmark.org
.
<p><a href="http://www.commonmark.org">www.commonmark.org</a></p>
````````````````````````````````
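The domain rule stated above can be sketched as follows. This is an
illustrative approximation only: `is_valid_www_domain` is a hypothetical name,
and only ASCII letters and digits are treated as alphanumeric here.

```python
import re

DOMAIN_CHARS = re.compile(r"[A-Za-z0-9_.-]+")

def is_valid_www_domain(domain):
    """Check the `www.` domain rule: allowed characters, and no underscore
    in the last two period-separated segments."""
    if not domain.startswith("www.") or len(domain) <= len("www."):
        return False
    if DOMAIN_CHARS.fullmatch(domain) is None:
        return False
    segments = domain.split(".")
    return all("_" not in segment for segment in segments[-2:])

if __name__ == "__main__":
    for candidate in ("www.commonmark.org", "www.common_mark.org", "commonmark.org"):
        print(candidate, is_valid_www_domain(candidate))
```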
After a valid domain, zero or more non-space non-`<` characters may follow:
```````````````````````````````` example
Visit www.commonmark.org/help for more information.
.
<p>Visit <a href="http://www.commonmark.org/help">www.commonmark.org/help</a> for more information.</p>
````````````````````````````````
We then apply extended autolink path validation as follows:
Trailing punctuation (specifically, `?`, `!`, `.`, `,`, `:`, `*`, `_`, and `~`)
will not be considered part of the autolink, though they may be included in the
interior of the link:
```````````````````````````````` example
Visit www.commonmark.org.
Visit www.commonmark.org/a.b.
.
<p>Visit <a href="http://www.commonmark.org">www.commonmark.org</a>.</p>
<p>Visit <a href="http://www.commonmark.org/a.b">www.commonmark.org/a.b</a>.</p>
````````````````````````````````
When an autolink ends in `)`, we scan the entire autolink for the total number
of parentheses. If there is a greater number of closing parentheses than
opening ones, we don't consider the last character part of the autolink, in
order to facilitate including an autolink inside a parenthesis:
```````````````````````````````` example
www.google.com/search?q=Markup+(business)
(www.google.com/search?q=Markup+(business))
.
<p><a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a></p>
<p>(<a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a>)</p>
````````````````````````````````
This check is only done when the link ends in a closing parenthesis `)`, so if
the only parentheses are in the interior of the autolink, no special rules are
applied:
```````````````````````````````` example
www.google.com/search?q=(business)+ok
.
<p><a href="http://www.google.com/search?q=(business)+ok">www.google.com/search?q=(business)+ok</a></p>
````````````````````````````````
If an autolink ends in a semicolon (`;`), we check to see if it appears to
resemble an [entity reference][entity references]; if the preceding text is `&`
followed by one or more alphanumeric characters. If so, it is excluded from
the autolink:
```````````````````````````````` example
www.google.com/search?q=commonmark&hl=en
www.google.com/search?q=commonmark&hl;
.
<p><a href="http://www.google.com/search?q=commonmark&amp;hl=en">www.google.com/search?q=commonmark&amp;hl=en</a></p>
<p><a href="http://www.google.com/search?q=commonmark">www.google.com/search?q=commonmark</a>&amp;hl;</p>
````````````````````````````````
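Taken together, the trailing-punctuation, parenthesis and entity rules amount
to repeatedly trimming the end of a candidate link. A minimal Python sketch of
that idea follows; it is not MD4C's code, `trim_autolink` is a made-up name,
and it assumes the candidate has already been cut at the first space or `<`.

```python
import re

TRAILING_PUNCT = "?!.,:*_~"
ENTITY_TAIL = re.compile(r"&[A-Za-z0-9]+;$")

def trim_autolink(link):
    """Drop trailing characters that are not considered part of the autolink."""
    while link:
        last = link[-1]
        if last in TRAILING_PUNCT:
            link = link[:-1]
        elif last == ")" and link.count(")") > link.count("("):
            link = link[:-1]              # unbalanced closing parenthesis
        elif last == ";":
            tail = ENTITY_TAIL.search(link)
            if tail is None:
                break                     # plain `;` stays part of the link
            link = link[:tail.start()]    # looks like an entity reference
        else:
            break
    return link

if __name__ == "__main__":
    print(trim_autolink("www.commonmark.org/a.b."))
    print(trim_autolink("www.google.com/search?q=Markup+(business))"))
    print(trim_autolink("www.google.com/search?q=commonmark&hl;"))
```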
`<` immediately ends an autolink.
```````````````````````````````` example
www.commonmark.org/he<lp
.
<p><a href="http://www.commonmark.org/he">www.commonmark.org/he</a>&lt;lp</p>
````````````````````````````````
## GitHub Issues
### [Issue 53](https://github.com/mity/md4c/issues/53)
```````````````````````````````` example
This is [link](www.github.com/).
.
<p>This is <a href="www.github.com/">link</a>.</p>
````````````````````````````````
```````````````````````````````` example
This is [link](www.github.com/)X
.
<p>This is <a href="www.github.com/">link</a>X</p>
````````````````````````````````

9709
test/spec.txt Normal file

File diff suppressed because it is too large

144
test/spec_tests.py Executable file

@ -0,0 +1,144 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys
from difflib import unified_diff
import argparse
import re
import json
from cmark import CMark
from normalize import normalize_html

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Run cmark tests.')
    parser.add_argument('-p', '--program', dest='program', nargs='?', default=None,
            help='program to test')
    parser.add_argument('-s', '--spec', dest='spec', nargs='?', default='spec.txt',
            help='path to spec')
    parser.add_argument('-P', '--pattern', dest='pattern', nargs='?',
            default=None, help='limit to sections matching regex pattern')
    parser.add_argument('--library-dir', dest='library_dir', nargs='?',
            default=None, help='directory containing dynamic library')
    parser.add_argument('--no-normalize', dest='normalize',
            action='store_const', const=False, default=True,
            help='do not normalize HTML')
    parser.add_argument('-d', '--dump-tests', dest='dump_tests',
            action='store_const', const=True, default=False,
            help='dump tests in JSON format')
    parser.add_argument('--debug-normalization', dest='debug_normalization',
            action='store_const', const=True,
            default=False, help='filter stdin through normalizer for testing')
    parser.add_argument('-n', '--number', type=int, default=None,
            help='only consider the test with the given number')
    args = parser.parse_args(sys.argv[1:])

def out(str):
    sys.stdout.buffer.write(str.encode('utf-8'))

def print_test_header(headertext, example_number, start_line, end_line):
    out("Example %d (lines %d-%d) %s\n" % (example_number,start_line,end_line,headertext))

def do_test(test, normalize, result_counts):
    [retcode, actual_html, err] = cmark.to_html(test['markdown'])
    if retcode == 0:
        expected_html = test['html']
        unicode_error = None
        if normalize:
            try:
                passed = normalize_html(actual_html) == normalize_html(expected_html)
            except UnicodeDecodeError as e:
                unicode_error = e
                passed = False
        else:
            passed = actual_html == expected_html
        if passed:
            result_counts['pass'] += 1
        else:
            print_test_header(test['section'], test['example'], test['start_line'], test['end_line'])
            out(test['markdown'] + '\n')
            if unicode_error:
                out("Unicode error: " + str(unicode_error) + '\n')
                out("Expected: " + repr(expected_html) + '\n')
                out("Got: " + repr(actual_html) + '\n')
            else:
                expected_html_lines = expected_html.splitlines(True)
                actual_html_lines = actual_html.splitlines(True)
                for diffline in unified_diff(expected_html_lines, actual_html_lines,
                                "expected HTML", "actual HTML"):
                    out(diffline)
            out('\n')
            result_counts['fail'] += 1
    else:
        print_test_header(test['section'], test['example'], test['start_line'], test['end_line'])
        out("program returned error code %d\n" % retcode)
        sys.stdout.buffer.write(err)
        result_counts['error'] += 1

def get_tests(specfile):
    line_number = 0
    start_line = 0
    end_line = 0
    example_number = 0
    markdown_lines = []
    html_lines = []
    state = 0  # 0 regular text, 1 markdown example, 2 html output
    headertext = ''
    tests = []

    header_re = re.compile('#+ ')

    with open(specfile, 'r', encoding='utf-8', newline='\n') as specf:
        for line in specf:
            line_number = line_number + 1
            l = line.strip()
            #if l == "`" * 32 + " example":
            if re.match("`{32} example( [a-z]{1,})?", l):
                state = 1
            elif state == 2 and l == "`" * 32:
                state = 0
                example_number = example_number + 1
                end_line = line_number
                tests.append({
                    "markdown":''.join(markdown_lines).replace('→',"\t"),
                    "html":''.join(html_lines).replace('→',"\t"),
                    "example": example_number,
                    "start_line": start_line,
                    "end_line": end_line,
                    "section": headertext})
                start_line = 0
                markdown_lines = []
                html_lines = []
            elif l == ".":
                state = 2
            elif state == 1:
                if start_line == 0:
                    start_line = line_number - 1
                markdown_lines.append(line)
            elif state == 2:
                html_lines.append(line)
            elif state == 0 and re.match(header_re, line):
                headertext = header_re.sub('', line).strip()
    return tests

if __name__ == "__main__":
    if args.debug_normalization:
        out(normalize_html(sys.stdin.read()))
        exit(0)

    all_tests = get_tests(args.spec)
    if args.pattern:
        pattern_re = re.compile(args.pattern, re.IGNORECASE)
    else:
        pattern_re = re.compile('.')
    tests = [ test for test in all_tests if re.search(pattern_re, test['section']) and (not args.number or test['example'] == args.number) ]
    if args.dump_tests:
        out(json.dumps(tests, ensure_ascii=False, indent=2))
        exit(0)
    else:
        skipped = len(all_tests) - len(tests)
        cmark = CMark(prog=args.program, library_dir=args.library_dir)
        result_counts = {'pass': 0, 'fail': 0, 'error': 0, 'skip': skipped}
        for test in tests:
            do_test(test, args.normalize, result_counts)
        out("{pass} passed, {fail} failed, {error} errored, {skip} skipped\n".format(**result_counts))
        exit(result_counts['fail'] + result_counts['error'])

75
test/strikethrough.txt Normal file

@ -0,0 +1,75 @@
# Strike-Through
With the flag `MD_FLAG_STRIKETHROUGH`, MD4C enables an extension for recognizing
strike-through spans.
Strike-through text is any text wrapped in one or two tildes (`~`).
```````````````````````````````` example
~Hi~ Hello, world!
.
<p><del>Hi</del> Hello, world!</p>
````````````````````````````````
If the lengths of the opener and the closer don't match, the strike-through is
not recognized.
```````````````````````````````` example
This ~text~~ is curious.
.
<p>This ~text~~ is curious.</p>
````````````````````````````````
A tilde sequence that is too long won't be recognized:
```````````````````````````````` example
foo ~~~bar~~~
.
<p>foo ~~~bar~~~</p>
````````````````````````````````
Also note that the markers cannot open a strike-through span if they are
followed by whitespace; similarly, they cannot close the span if they are
preceded by whitespace:
```````````````````````````````` example
~foo ~bar
.
<p>~foo ~bar</p>
````````````````````````````````
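The opener and closer rules described so far can be sketched in a few lines of
Python. This is only an illustration of the rules as this file states them, not
MD4C's C implementation: the helper names `tilde_runs` and `first_strike_span`
are hypothetical, and the sketch looks at the text of a single paragraph and
ignores interactions with other inline spans.

```python
"""Sketch only: hypothetical helpers, not part of MD4C."""
import re


def tilde_runs(text):
    """Return (offset, length, can_open, can_close) for every run of tildes."""
    runs = []
    for m in re.finditer(r"~+", text):
        length = m.end() - m.start()
        # Treat the start and the end of the text like whitespace.
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        usable = length in (1, 2)  # longer runs are never markers
        runs.append((m.start(), length,
                     usable and not after.isspace(),
                     usable and not before.isspace()))
    return runs


def first_strike_span(text):
    """Return (opener_offset, closer_offset) of the first matched pair, or None."""
    runs = tilde_runs(text)
    for i, (o_off, o_len, can_open, _) in enumerate(runs):
        if not can_open:
            continue
        for c_off, c_len, _, can_close in runs[i + 1:]:
            # Opener and closer must have the same length.
            if can_close and c_len == o_len:
                return (o_off, c_off)
    return None
```

With these assumptions, `first_strike_span("~Hi~ Hello, world!")` finds a pair,
while the mismatched, too-long and whitespace-flanked inputs from the examples
above yield `None`.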
As with regular emphasis delimiters, a new paragraph will cause the cessation
of parsing a strike-through:
```````````````````````````````` example
This ~~has a

new paragraph~~.
.
<p>This ~~has a</p>
<p>new paragraph~~.</p>
````````````````````````````````
## GitHub Issues
### [Issue 69](https://github.com/mity/md4c/issues/69)
```````````````````````````````` example
~`foo`~
.
<p><del><code>foo</code></del></p>
````````````````````````````````
```````````````````````````````` example
~*foo*~
.
<p><del><em>foo</em></del></p>
````````````````````````````````
```````````````````````````````` example
*~foo~*
.
<p><em><del>foo</del></em></p>
````````````````````````````````

363
test/tables.txt Normal file

@ -0,0 +1,363 @@
# Tables
With the flag `MD_FLAG_TABLES`, MD4C enables an extension for recognizing
tables.
A basic table with two columns and three rows (not counting the header) looks
as follows:
```````````````````````````````` example
| Column 1 | Column 2 |
|----------|----------|
| foo | bar |
| baz | qux |
| quux | quuz |
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
The leading and trailing pipe characters (`|`) on each line are optional:
```````````````````````````````` example
Column 1 | Column 2 |
---------|--------- |
foo | bar |
baz | qux |
quux | quuz |
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
```````````````````````````````` example
| Column 1 | Column 2
|----------|---------
| foo | bar
| baz | qux
| quux | quuz
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
```````````````````````````````` example
Column 1 | Column 2
---------|---------
foo | bar
baz | qux
quux | quuz
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
However, for a one-column table, at least one pipe has to be used in the table
header underline; otherwise it would be parsed as a Setext header followed by
a paragraph.
```````````````````````````````` example
Column 1
--------
foo
baz
quux
.
<h2>Column 1</h2>
<p>foo
baz
quux</p>
````````````````````````````````
Leading and trailing whitespace in a table cell is ignored and the columns do
not need to be aligned.
```````````````````````````````` example
Column 1 |Column 2
---|---
foo | bar
baz| qux
quux|quuz
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
The table cannot interrupt a paragraph.
```````````````````````````````` example
Lorem ipsum dolor sit amet.
| Column 1 | Column 2
| ---------|---------
| foo | bar
| baz | qux
| quux | quuz
.
<p>Lorem ipsum dolor sit amet.
| Column 1 | Column 2
| ---------|---------
| foo | bar
| baz | qux
| quux | quuz</p>
````````````````````````````````
Similarly, a paragraph cannot interrupt a table:
```````````````````````````````` example
Column 1 | Column 2
---------|---------
foo | bar
baz | qux
quux | quuz
Lorem ipsum dolor sit amet.
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
<tr><td>Lorem ipsum dolor sit amet.</td><td></td></tr>
</tbody>
</table>
````````````````````````````````
The underline of the table is crucial for recognizing the table, the count of
its columns and their alignment: the line has to contain at least one pipe,
and it has to provide at least three dash (`-`) characters for every column in
the table.
Thus this is not a table because there are too few dashes for Column 2.
```````````````````````````````` example
| Column 1 | Column 2
| ---------|--
| foo | bar
| baz | qux
| quux | quuz
.
<p>| Column 1 | Column 2
| ---------|--
| foo | bar
| baz | qux
| quux | quuz</p>
````````````````````````````````
The first, the last or both the first and the last dash in each column
underline can be replaced with a colon (`:`) to request left, right or middle
alignment of the respective column:
```````````````````````````````` example
| Column 1 | Column 2 | Column 3 | Column 4 |
|----------|:---------|:--------:|---------:|
| default | left | center | right |
.
<table>
<thead>
<tr><th>Column 1</th><th align="left">Column 2</th><th align="center">Column 3</th><th align="right">Column 4</th></tr>
</thead>
<tbody>
<tr><td>default</td><td align="left">left</td><td align="center">center</td><td align="right">right</td></tr>
</tbody>
</table>
````````````````````````````````
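The underline rules above (at least one pipe, at least three characters per
column, optional colons for alignment) can be sketched as follows. The function
name `parse_table_underline` is hypothetical and the code only mirrors the rules
as this file describes them; it is not MD4C's C implementation.

```python
"""Sketch only: hypothetical helper, not part of MD4C."""
import re

CELL_RE = re.compile(r":?-+:?")


def parse_table_underline(line):
    """Return one alignment per column (None, 'left', 'center', 'right'),
    or None if the line is not a valid table underline."""
    stripped = line.strip()
    if "|" not in stripped:
        return None  # at least one pipe is required
    # The leading and trailing pipes are optional.
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    alignments = []
    for cell in stripped.split("|"):
        cell = cell.strip()
        # Dashes with an optional colon at either end, three characters minimum.
        if len(cell) < 3 or CELL_RE.fullmatch(cell) is None:
            return None
        left = cell.startswith(":")
        right = cell.endswith(":")
        if left and right:
            alignments.append("center")
        elif left:
            alignments.append("left")
        elif right:
            alignments.append("right")
        else:
            alignments.append(None)
    return alignments
```

The length of the returned list is the column count of the table, which is why
an underline with too few dashes in any column disqualifies the whole table.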
To include a literal pipe character in any cell, it has to be escaped.
```````````````````````````````` example
Column 1 | Column 2
---------|---------
foo | bar
baz | qux \| xyzzy
quux | quuz
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td>foo</td><td>bar</td></tr>
<tr><td>baz</td><td>qux | xyzzy</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
The contents of each cell are parsed as inline text, which may contain any
inline Markdown spans like emphasis, strong emphasis, links etc.
```````````````````````````````` example
Column 1 | Column 2
---------|---------
*foo* | bar
**baz** | [qux]
quux | [quuz](/url2)

[qux]: /url
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td><em>foo</em></td><td>bar</td></tr>
<tr><td><strong>baz</strong></td><td><a href="/url">qux</a></td></tr>
<tr><td>quux</td><td><a href="/url2">quuz</a></td></tr>
</tbody>
</table>
````````````````````````````````
However, pipes inside a code span are not recognized as cell boundaries.
```````````````````````````````` example
Column 1 | Column 2
---------|---------
`foo | bar`
baz | qux
quux | quuz
.
<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th></tr>
</thead>
<tbody>
<tr><td><code>foo | bar</code></td><td></td></tr>
<tr><td>baz</td><td>qux</td></tr>
<tr><td>quux</td><td>quuz</td></tr>
</tbody>
</table>
````````````````````````````````
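How a row is cut into cells, given the escaping and code span rules above, can
be sketched like this. The names `split_row` and `pad_cells` are hypothetical,
and the code span handling is a deliberate simplification (it toggles on every
backtick and does not handle backtick runs of different lengths or unterminated
spans), so treat it as an illustration rather than MD4C's actual algorithm.

```python
"""Sketch only: hypothetical helpers, not part of MD4C."""


def split_row(line):
    """Split one table row into raw cell strings (escapes are kept as-is)."""
    text = line.strip()
    cells, current = [], []
    in_code = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and text[i + 1:i + 2] == "|":
            current.append("\\|")  # escaped pipe stays inside the cell
            i += 2
            continue
        if ch == "`":
            in_code = not in_code  # simplified code span tracking
        if ch == "|" and not in_code:
            cells.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    cells.append("".join(current).strip())
    # Optional leading and trailing pipes produce empty edge cells; drop them.
    if len(cells) > 1 and cells[0] == "":
        cells = cells[1:]
    if len(cells) > 1 and cells[-1] == "":
        cells = cells[:-1]
    return cells


def pad_cells(cells, columns):
    """Pad a short row with empty cells up to the table's column count."""
    return cells + [""] * max(0, columns - len(cells))
```

Padding explains the empty second cell in the code span example above: the row
yields a single cell, and the missing one is filled in as empty.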
## GitHub Issues
### [Issue 41](https://github.com/mity/md4c/issues/41)
```````````````````````````````` example
* x|x
---|---
.
<ul>
<li>x|x
---|---</li>
</ul>
````````````````````````````````
(Not a table, because the underline has wrong indentation and is not part of the
list item.)
```````````````````````````````` example
* x|x
  ---|---
x|x
.
<ul>
<li><table>
<thead>
<tr>
<th>x</th>
<th>x</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</li>
</ul>
<p>x|x</p>
````````````````````````````````
(Here the underline has the right indentation so the table is detected.
But the last line is not part of it due to its indentation.)
### [Issue 42](https://github.com/mity/md4c/issues/42)
```````````````````````````````` example
] http://x.x *x*

|x|x|
|---|---|
|x|
.
<p>] http://x.x <em>x</em></p>
<table>
<thead>
<tr>
<th>x</th>
<th>x</th>
</tr>
</thead>
<tbody>
<tr>
<td>x</td>
<td></td>
</tr>
</tbody>
</table>
````````````````````````````````
### [Issue 104](https://github.com/mity/md4c/issues/104)
```````````````````````````````` example
A | B
--- | ---
[x](url)
.
<table>
<thead>
<tr>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="url">x</a></td>
<td></td>
</tr>
</tbody>
</table>
````````````````````````````````

117
test/tasklists.txt Normal file

@ -0,0 +1,117 @@
# Tasklists
With the flag `MD_FLAG_TASKLISTS`, MD4C enables an extension for recognizing
task lists.
A basic task list may look as follows:
```````````````````````````````` example
* [x] foo
* [X] bar
* [ ] baz
.
<ul>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li>
</ul>
````````````````````````````````
Task lists can also be in ordered lists:
```````````````````````````````` example
1. [x] foo
2. [X] bar
3. [ ] baz
.
<ol>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li>
</ol>
````````````````````````````````
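The marker syntax used in the two examples above can be sketched with a single
regular expression. `parse_task_item` is a hypothetical helper, and the
requirement of whitespace after the closing bracket is an assumption made for
this sketch; MD4C's own parser works on list items, not on raw lines.

```python
"""Sketch only: hypothetical helper, not part of MD4C."""
import re

TASK_RE = re.compile(r"^\s*(?:[-+*]|\d{1,9}[.)])\s+\[([ xX])\]\s+(.*)$")


def parse_task_item(line):
    """Return (checked, text) for a task list item, or None for anything else."""
    m = TASK_RE.match(line)
    if m is None:
        return None
    # Both "x" and "X" mark the item as checked; a space leaves it unchecked.
    return (m.group(1) in "xX", m.group(2))
```

An ordinary list item such as `* foo` returns `None`, which matches how ordinary
and task items can be mixed across nesting levels.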
Task lists can also be nested in ordinary lists:
```````````````````````````````` example
* xxx:
  * [x] foo
  * [x] bar
  * [ ] baz
* yyy:
  * [ ] qux
  * [x] quux
  * [ ] quuz
.
<ul>
<li>xxx:
<ul>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li>
</ul></li>
<li>yyy:
<ul>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>qux</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>quux</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>quuz</li>
</ul></li>
</ul>
````````````````````````````````
Or in a parent task list:
```````````````````````````````` example
1. [x] xxx:
   * [x] foo
   * [x] bar
   * [ ] baz
2. [ ] yyy:
   * [ ] qux
   * [x] quux
   * [ ] quuz
.
<ol>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>xxx:
<ul>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li>
</ul></li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>yyy:
<ul>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>qux</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>quux</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>quuz</li>
</ul></li>
</ol>
````````````````````````````````
Also, ordinary lists can be nested in the task lists.
```````````````````````````````` example
* [x] xxx:
  * foo
  * bar
  * baz
* [ ] yyy:
  * qux
  * quux
  * quuz
.
<ul>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>xxx:
<ul>
<li>foo</li>
<li>bar</li>
<li>baz</li>
</ul></li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>yyy:
<ul>
<li>qux</li>
<li>quux</li>
<li>quuz</li>
</ul></li>
</ul>
````````````````````````````````

39
test/underline.txt Normal file

@ -0,0 +1,39 @@
# Underline
With the flag `MD_FLAG_UNDERLINE`, MD4C treats the underscore `_` as a mark
denoting an underlined span rather than an ordinary emphasis (or a strong
emphasis).
```````````````````````````````` example
_foo_
.
<p><u>foo</u></p>
````````````````````````````````
In sequences of multiple underscores, each single one translates into an
underline span mark.
```````````````````````````````` example
___foo___
.
<p><u><u><u>foo</u></u></u></p>
````````````````````````````````
Intra-word underscores are not recognized as underline marks:
```````````````````````````````` example
foo_bar_baz
.
<p>foo_bar_baz</p>
````````````````````````````````
The parser also follows the standard rules for when an underscore can or
cannot open or close a span. Therefore there is no underline in the following
example, because no underscore can be seen as a closing mark.
```````````````````````````````` example
_foo _bar
.
<p>_foo _bar</p>
````````````````````````````````

232
test/wiki-links.txt Normal file

@ -0,0 +1,232 @@
# Wiki Links
With the flag `MD_FLAG_WIKILINKS`, MD4C recognizes wiki links.
The simplest wiki-link is a wiki-link destination enclosed between `[[` and
`]]`.
```````````````````````````````` example
[[foo]]
.
<p><x-wikilink data-target="foo">foo</x-wikilink></p>
````````````````````````````````
However, a wiki-link may contain an explicit label, separated from the
destination with `|`.
```````````````````````````````` example
[[foo|bar]]
.
<p><x-wikilink data-target="foo">bar</x-wikilink></p>
````````````````````````````````
A wiki-link destination cannot be empty.
```````````````````````````````` example
[[]]
.
<p>[[]]</p>
````````````````````````````````
```````````````````````````````` example
[[|foo]]
.
<p>[[|foo]]</p>
````````````````````````````````
The wiki-link destination cannot contain a new line.
```````````````````````````````` example
[[foo
bar]]
.
<p>[[foo
bar]]</p>
````````````````````````````````
```````````````````````````````` example
[[foo
bar|baz]]
.
<p>[[foo
bar|baz]]</p>
````````````````````````````````
The wiki-link destination is rendered verbatim; inline markup in it is not
recognized.
```````````````````````````````` example
[[*foo*]]
.
<p><x-wikilink data-target="*foo*">*foo*</x-wikilink></p>
````````````````````````````````
```````````````````````````````` example
[[foo|![bar](bar.jpg)]]
.
<p><x-wikilink data-target="foo"><img src="bar.jpg" alt="bar"></x-wikilink></p>
````````````````````````````````
With multiple `|` delimiters, only the first one is recognized and the other
ones are part of the label.
```````````````````````````````` example
[[foo|bar|baz]]
.
<p><x-wikilink data-target="foo">bar|baz</x-wikilink></p>
````````````````````````````````
However, the delimiter `|` can be escaped with `\`.
```````````````````````````````` example
[[foo\|bar|baz]]
.
<p><x-wikilink data-target="foo|bar">baz</x-wikilink></p>
````````````````````````````````
The label can contain inline elements.
```````````````````````````````` example
[[foo|*bar*]]
.
<p><x-wikilink data-target="foo"><em>bar</em></x-wikilink></p>
````````````````````````````````
Empty explicit label is the same as using the implicit label; i.e. the verbatim
destination string is used as the label.
```````````````````````````````` example
[[foo|]]
.
<p><x-wikilink data-target="foo">foo</x-wikilink></p>
````````````````````````````````
The label can span multiple lines.
```````````````````````````````` example
[[foo|foo
bar
baz]]
.
<p><x-wikilink data-target="foo">foo
bar
baz</x-wikilink></p>
````````````````````````````````
Wiki-links have higher priority than links.
```````````````````````````````` example
[[foo]](foo.jpg)
.
<p><x-wikilink data-target="foo">foo</x-wikilink>(foo.jpg)</p>
````````````````````````````````
```````````````````````````````` example
[foo]: /url
[[foo]]
.
<p><x-wikilink data-target="foo">foo</x-wikilink></p>
````````````````````````````````
Wiki links can be inlined in tables.
```````````````````````````````` example
| A | B |
|------------------|-----|
| [[foo|*bar*]] | baz |
.
<table>
<thead>
<tr>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<td><x-wikilink data-target="foo"><em>bar</em></x-wikilink></td>
<td>baz</td>
</tr>
</tbody>
</table>
````````````````````````````````
Wiki-links are not prioritized over images.
```````````````````````````````` example
![[foo]](foo.jpg)
.
<p><img src="foo.jpg" alt="[foo]"></p>
````````````````````````````````
Something that may look like a wiki-link at first, but turns out not to be,
is recognized as a normal link.
```````````````````````````````` example
[[foo]

[foo]: /url
.
<p>[<a href="/url">foo</a></p>
````````````````````````````````
Escaping the opening `[` escapes only that one character, not the whole `[[`
opener:
```````````````````````````````` example
\[[foo]]

[foo]: /url
.
<p>[<a href="/url">foo</a>]</p>
````````````````````````````````
Like with other inline links, the innermost wiki-link is preferred.
```````````````````````````````` example
[[foo[[bar]]]]
.
<p>[[foo<x-wikilink data-target="bar">bar</x-wikilink>]]</p>
````````````````````````````````
There is a limit of 100 characters for the wiki-link destination.
```````````````````````````````` example
[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901]]
[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901|foo]]
.
<p>[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901]]
[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901|foo]]</p>
````````````````````````````````
A wiki-link destination of exactly 100 characters still works.
```````````````````````````````` example
[[1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890]]
[[1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890|foo]]
.
<p><x-wikilink data-target="1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890">1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890</x-wikilink>
<x-wikilink data-target="1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890">foo</x-wikilink></p>
````````````````````````````````
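The destination and label rules above can be summarized in a short Python
sketch. The function `parse_wikilink` is hypothetical and takes the text found
between `[[` and `]]`; whether the 100-character limit is counted before or
after unescaping is an assumption here (the sketch counts the unescaped
destination), and inline parsing of the label is left out entirely.

```python
"""Sketch only: hypothetical helper, not part of MD4C."""


def parse_wikilink(inner, max_target=100):
    """Return (target, label) for the text between [[ and ]], or None."""
    # Find the first pipe that is not preceded by a backslash escape.
    split_at = None
    i = 0
    while i < len(inner):
        if inner[i] == "\\" and i + 1 < len(inner):
            i += 2  # skip the escaped character
            continue
        if inner[i] == "|":
            split_at = i
            break
        i += 1
    raw_target = inner if split_at is None else inner[:split_at]
    label = "" if split_at is None else inner[split_at + 1:]
    target = raw_target.replace("\\|", "|")
    # The destination must be non-empty, on a single line and within the limit.
    if not target or "\n" in target or len(target) > max_target:
        return None
    # An empty or missing label falls back to the destination text.
    return (target, label or target)
```

Only the first unescaped `|` splits the text, so `foo|bar|baz` keeps `bar|baz`
as its label, and a label may contain line breaks while the destination may not.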
The limit on link content does not include any characters belonging to a block
quote, if the label spans multiple lines contained in a block quote.
```````````````````````````````` example
> [[12345678901234567890123456789012345678901234567890|1234567890
> 1234567890
> 1234567890
> 1234567890
> 123456789]]
.
<blockquote>
<p><x-wikilink data-target="12345678901234567890123456789012345678901234567890">1234567890
1234567890
1234567890
1234567890
123456789</x-wikilink></p>
</blockquote>
````````````````````````````````