I have wanted to start this blog for some time now, but never got around to it. Now that I finally have, I want to talk about the software I use.
Naturally, I wrote my own blog parser in C. I do want to acknowledge that there are a lot of superb blog parsers out there already. For really minimal setups there is stuff like Saait, and for elaborate pages there are Hugo and of course WordPress. My motivation for this is really just the coding itself.
One thing I do miss with most tools, however, is the ability to create static math content that works with simpler browsers (bonus points for Gemini). My motivation for this is sustainable and accessible digitalization. Hardware like the Pinebook is used to help economically disadvantaged people take part in the digital world. In my hometown an organization refurbishes ~15-20 year old laptops for students. Unfortunately, those devices really struggle with the modern web. I’d like to write software and content that can be used by everyone. If you want to know more about this topic, the German Bits & Bäume community does a lot of great work.
My way of ensuring this accessibility is targeting browsers like NetSurf and Dillo. This means no JavaScript and only limited modern HTML and CSS features. For sciency stuff it also means no MathML. The solution I found for this is utf8 math. The library libtexprintf reads LaTeX-style math expressions and outputs them as utf8 symbols.
              T
     ⎛x   x  ⎞         ⎛x   x  ⎞
⌠    ⎜ 11  12⎟    ⌠    ⎜ 11  21⎟
⎮ dr ⎜x   x  ⎟  = ⎮ dr ⎜x   x  ⎟
⌡    ⎝ 21  22⎠    ⌡    ⎝ 12  22⎠

 GGA   ⌠
E    = ⎮ dr f [ n(r), ∇n(r) ]
 xc    ⌡
Since it’s only utf8 symbols, it is trivially compatible with all browsers and Gemini. I still need to test its viability with screen readers, however…
I wanted to parse a folder structure like this:
My_Blog/
    header.html
    footer.html
    stylesheet.css
    index.md
    Entries/
        2024-04-21.md
        2024-04-28.md
The output should look like this:
Output/
    stylesheet.css
    index.html
    Entries/
        2024-04-21.html
        2024-04-28.html
The final html pages should have header.html and footer.html attached before and after the content. This means I want to copy header.html to Output/index.html, append the rendered output of index.md and then append footer.html. Entries should work the same way. Additionally, the index.html file should include automatically generated links to all entries.
I used the Discount library, an implementation of John Gruber’s original Markdown specification. For my distribution (Debian) it is packaged as libmarkdown2.
The library provides the header <mkdio.h> and the functions mkd_in(FILE*, options) and markdown(MMIOT*, FILE*, options).
In order to parse an input file and write the resulting document to an output file, your program would look something like this.
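#include <stdio.h>
#include <mkdio.h>

int main(void) {
    FILE* input = fopen("...", "r");
    if ( !input ) {
        fprintf(stderr, "Error while opening input file\n");
        return 1;
    }
    // Read the markdown source into an MMIOT document
    MMIOT* text_obj = mkd_in(input, 0);
    FILE* output = fopen("...", "w");
    if ( !output ) {
        fprintf(stderr, "Error while opening output file\n");
        return 1;
    }
    // Render the document as html and write it to output
    markdown(text_obj, output, 0);
    fclose(input);
    fclose(output);
}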
An object of type MMIOT* stores an intermediary representation of the markdown text.
The function markdown then does the final rendering.
Somehow I never opened folders in a C program before this project.
It’s quite similar to how C handles files, but a bit more convoluted.
On Unix-like systems we need the headers <sys/types.h>, <sys/stat.h> and <dirent.h>.
We get a type DIR* and the function opendir(char*).
DIR* entries = opendir("Entries/");
if ( !entries ) {
    fprintf(stderr, "Error while opening directory\n");
    exit(1);
}
The error handling is also analogous to files. If there was an error while opening the directory, we get a NULL pointer.
In order to read individual entries in this directory, we can use the function readdir(DIR*).
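// C does not allow a declaration inside the loop condition
struct dirent* a;
while ( (a = readdir(entries)) != NULL ) {
    if ( ismd(a) ) {
        // AppendVec has to copy the name, readdir may reuse its buffer
        AppendVec(files, a->d_name);
    }
}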
The header dirent.h gives us access to the type struct dirent*.
For our purposes here, we need to know if the current entry is a file and what name it has.
We can write a function int ismd(struct dirent*) that checks this.
int ismd(struct dirent* file) {
    // DT_REG (8) marks a regular file, DT_DIR (4) a directory
    if ( file->d_type != DT_REG ) {
        return 0;
    }
    size_t len = strlen(file->d_name);
    // Names shorter than ".md" cannot match
    if ( len < 3 ) {
        return 0;
    }
    if ( strncmp(file->d_name+len-3, ".md", 3) == 0 ) {
        return 1;
    }
    return 0;
}
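One caveat with d_type: not every filesystem fills it in, and readdir may then report DT_UNKNOWN for every entry. A minimal sketch of a fallback that asks stat() instead could look like this. The helpers has_md_suffix and is_md_file are hypothetical names for illustration and not part of the parser described here.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

// Hypothetical helper: returns 1 if name ends in ".md", 0 otherwise
int has_md_suffix(const char* name) {
    size_t len = strlen(name);
    if ( len < 3 ) {
        return 0;
    }
    return strcmp(name + len - 3, ".md") == 0;
}

// Hypothetical helper: returns 1 if dir/name is a regular file
// whose name ends in ".md", 0 otherwise
int is_md_file(const char* dir, const char* name) {
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if ( n < 0 || (size_t)n >= sizeof path ) {
        return 0; // path does not fit into the buffer
    }
    struct stat st;
    if ( stat(path, &st) != 0 ) {
        return 0; // entry vanished or is not accessible
    }
    return S_ISREG(st.st_mode) && has_md_suffix(name);
}
```

The extra stat() call costs one system call per entry, so it probably makes sense to use it only when d_type is DT_UNKNOWN.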
Finally we want to close the directory using the closedir(DIR*) function.
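Putting opendir, readdir and closedir together, a self-contained sketch could look like the following. The function list_md_files and the fixed-size name array are assumptions for illustration, the parser itself uses a vector-like structure. Since readdir returns entries in no particular order, the sketch also sorts the names, which for date-based file names is the same as sorting by date. For brevity it only checks the suffix and not the file type.

```c
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAME 256

// Comparison function for qsort over rows of a char[][MAX_NAME] array
static int cmp_names(const void* a, const void* b) {
    return strcmp((const char*)a, (const char*)b);
}

// Hypothetical helper: stores up to max names ending in ".md" from
// dir_path in names, sorted alphabetically. Returns the number of
// stored names, or -1 if the directory could not be opened.
int list_md_files(const char* dir_path, char names[][MAX_NAME], int max) {
    DIR* dir = opendir(dir_path);
    if ( !dir ) {
        return -1;
    }
    int count = 0;
    struct dirent* entry;
    while ( count < max && (entry = readdir(dir)) != NULL ) {
        size_t len = strlen(entry->d_name);
        // Skip names that are too short or too long for our buffer
        if ( len < 3 || len >= MAX_NAME ) {
            continue;
        }
        if ( strcmp(entry->d_name + len - 3, ".md") != 0 ) {
            continue;
        }
        // Copy the name, readdir may overwrite it on the next call
        strcpy(names[count], entry->d_name);
        count += 1;
    }
    closedir(dir);
    qsort(names, (size_t)count, MAX_NAME, cmp_names);
    return count;
}
```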
Handling header and footer is easy.
Just read each file into a buffer and write it to the same output file as markdown().
FILE* header_file = fopen("header.html", "r");
if ( !header_file ) {
    fprintf(stderr, "Error while opening header.html\n");
    exit(1);
}
// Get length
fseek(header_file, 0, SEEK_END);
size_t len_header = ftell(header_file);
fseek(header_file, 0, SEEK_SET);
// Read the whole file into a buffer
char* header = malloc(len_header);
fread(header, 1, len_header, header_file);
fclose(header_file);
... same for footer ...
FILE* output_file = ...
// Header first, then the rendered markdown, then the footer
fwrite(header, 1, len_header, output_file);
markdown(text_obj, output_file, 0);
fwrite(footer, 1, len_footer, output_file);
And we’re done!
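Since header and footer are read the same way, the steps above could be wrapped in one function. This is only a sketch: read_file is a hypothetical helper, not something the parser or Discount provides, and it adds the error checks that the snippet above leaves out.

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: reads the whole file at path into a freshly
// allocated, null-terminated buffer and stores its length in *len.
// Returns NULL on any error. The caller has to free the result.
char* read_file(const char* path, size_t* len) {
    FILE* file = fopen(path, "rb");
    if ( !file ) {
        return NULL;
    }
    // Get length: ftell returns a long and -1 on error
    long size = -1;
    if ( fseek(file, 0, SEEK_END) == 0 ) {
        size = ftell(file);
    }
    if ( size < 0 || fseek(file, 0, SEEK_SET) != 0 ) {
        fclose(file);
        return NULL;
    }
    char* data = malloc((size_t)size + 1);
    if ( !data ) {
        fclose(file);
        return NULL;
    }
    size_t got = fread(data, 1, (size_t)size, file);
    fclose(file);
    if ( got != (size_t)size ) {
        free(data);
        return NULL;
    }
    data[size] = 0;
    *len = (size_t)size;
    return data;
}
```

With this, reading the header would shrink to something like char* header = read_file("header.html", &len_header); followed by a NULL check.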
Reading the headlines is also quite simple. I chose to use the first line of the entry files as headlines. So we just need to read those and trim any syntax elements.
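char BUFF[256];
// Treat an empty or unreadable file as an empty headline
if ( fgets(BUFF, 256, entry_file) == NULL ) {
    BUFF[0] = 0;
}
// Skip the markdown heading syntax, the headline starts at BUFF+start
size_t start = 0;
while ( BUFF[start] == '#' || BUFF[start] == ' ' ) {
    start += 1;
}
// Handle the line break
size_t len = strlen(BUFF);
if ( len > 0 && BUFF[len-1] == '\n' ) {
    BUFF[len-1] = 0;
}
... append headline ...
// Reset file pointer to start
rewind(entry_file);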
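The trimming can also be written as a small function that is easy to test on its own. A minimal sketch, assuming a hypothetical helper extract_headline that works on a line that has already been read:

```c
#include <string.h>

// Hypothetical helper: copies line into out without the leading '#'
// and space characters and without the trailing line break.
// The result is truncated if it does not fit into out_size bytes.
void extract_headline(const char* line, char* out, size_t out_size) {
    if ( out_size == 0 ) {
        return;
    }
    size_t start = 0;
    while ( line[start] == '#' || line[start] == ' ' ) {
        start += 1;
    }
    // Length of the headline up to the first '\r' or '\n'
    size_t len = strcspn(line + start, "\r\n");
    if ( len >= out_size ) {
        len = out_size - 1;
    }
    memcpy(out, line + start, len);
    out[len] = 0;
}
```

Calling extract_headline("# My first entry\n", buf, sizeof buf) leaves "My first entry" in buf.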
Now we can write those links to the index.html file.
for ( int i = 0; i < headlines->len; i += 1 ) {
    // Link destination
    char fstring[100];
    // I saved the filenames in a vector-like datastructure
    // Replace .md with .html: print the name without its last 3 chars
    int stem = (int)strlen(files->data[i]) - 3;
    // snprintf never writes past the end of fstring
    snprintf(fstring, sizeof fstring, "Entries/%.*s.html", stem, files->data[i]);
    fprintf(main_file, "<p><a href=\"%s\">%s</a></p>\n", fstring, headlines->data[i]);
}
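The link destination is the one place where a long file name could overflow a fixed buffer, so here is a sketch of that step as a bounds-checked function. The name entry_href is hypothetical and only used for illustration.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes "Entries/<name>.html" for a file name
// ending in ".md" into out. Returns 0 on success and -1 if the name
// has no ".md" suffix or the result does not fit into out.
int entry_href(const char* md_name, char* out, size_t out_size) {
    size_t len = strlen(md_name);
    if ( len < 3 || strcmp(md_name + len - 3, ".md") != 0 ) {
        return -1;
    }
    // %.*s prints only the first len-3 characters, dropping ".md"
    int n = snprintf(out, out_size, "Entries/%.*s.html",
                     (int)(len - 3), md_name);
    if ( n < 0 || (size_t)n >= out_size ) {
        return -1;
    }
    return 0;
}
```

For "2024-04-21.md" this produces "Entries/2024-04-21.html".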
Now we have all of the ingredients for a basic blog parser. It’s quite fast, too. On my i5-7200u laptop it takes 0.007s to parse my blog.
My goal here was to use common markdown syntax and preprocess the document. This block
$$ x^2 $$
should be processed to this
x²
A single indent and empty lines before and after a block denote a preformatted block in markdown.
This means the text will be wrapped in a <pre><code> environment in html.
Also, since larger operators like an integral sign take multiple lines, the line height should match the font size.
⌠
⎮
⌡
My approach was to wrap math in a <pre class=math><code> environment to enable special CSS settings for it.
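Okay, so what do we need to do here? We need to find every block delimited by $$, render its content and write the rendered output in place of the block.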
The easiest approach here is to build a second string for the processed input.
// Assumed setup: out is an empty string (out = calloc(1, 1)), out_len = 1
size_t text_len = strlen(text);
for ( size_t idx = 0; idx < text_len; idx += 1 ) {
    if ( strncmp(text+idx, "\n$$", 3) == 0 ) {
        size_t len_math = 0;
        // Move by strlen("\n$$")
        size_t i = 3;
        // Stop at the closing $$ or at the end of the text
        while ( text[idx+i] != 0 && strncmp(text+idx+i, "$$", 2) != 0 ) {
            len_math += 1;
            i += 1;
        }
        char* buff = malloc(len_math+1);
        strncpy(buff, text+idx+3, len_math);
        // strncpy does not terminate the string here
        buff[len_math] = 0;
        // Math render function from library
        char* render = texstring(buff);
        free(buff);
        // The new output string
        out_len += strlen(render) + strlen("\n\n\n\n");
        out = realloc(out, out_len);
        strcat(out, "\n\n");
        strcat(out, render);
        strcat(out, "\n\n");
        free(render);
        // Move loop to position after math expression
        idx += len_math + 3 + 2;
        // Nothing left to copy if the block ended the text
        if ( idx >= text_len ) {
            break;
        }
    }
    // Copy the current character and keep out null-terminated
    out_len += 1;
    out = realloc(out, out_len);
    out[out_len-2] = text[idx];
    out[out_len-1] = 0;
}
This was a lot…
We start by using strncmp to search for “\n$$”, so $$ at the start of a line in the text.
The function strncmp compares up to n characters at the supplied position in memory, so strncmp(text+idx, "\n$$", 3) starts reading at position idx in the text and compares 3 = strlen("\n$$") chars.
If this is found, we want to get the length of the math block.
To do this, we start 3 = strlen("\n$$") chars after the current idx and compare the text with $$, which denotes the end of the block.
After that we can render the text, realloc() our string to the right size and copy the math expression.
If we don’t find a math block at the current position, we only copy the character at position idx instead.
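The search itself can also be expressed with strstr from the standard library instead of a character-by-character loop. This is an alternative sketch and not the approach described above. The helper find_math_block is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: looks for the next "\n$$ ... $$" block at or
// after text+from (from must not be larger than strlen(text)).
// On success it stores the offset of the first character after the
// opening "$$" in *start, the length of the expression in *len and
// returns 1. Returns 0 if there is no complete block.
int find_math_block(const char* text, size_t from,
                    size_t* start, size_t* len) {
    const char* opening = strstr(text + from, "\n$$");
    if ( !opening ) {
        return 0;
    }
    // Move by strlen("\n$$")
    const char* body = opening + 3;
    const char* closing = strstr(body, "$$");
    if ( !closing ) {
        return 0; // unterminated block, leave the text untouched
    }
    *start = (size_t)(body - text);
    *len = (size_t)(closing - body);
    return 1;
}
```

Everything between two blocks could then be copied in one go, which also avoids reallocating per character.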
One thing we are missing here is the indent to indicate a preformatted block. I did this by iterating over the string render and additionally writing a tab every time a newline is encountered. Reallocating the output string for every character is quite bad for performance, so this is something I want to revisit in the future. Rendering my blog with math parsing takes 0.012 seconds.
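One common way to avoid a realloc per character is a buffer that doubles its capacity whenever it runs full. The following is a sketch of that idea together with the tab indentation. StrBuf and both functions are hypothetical and not taken from the parser.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical growable string, len excludes the terminating 0
typedef struct {
    char* data;
    size_t len;
    size_t cap;
} StrBuf;

// Appends n bytes from src, doubling the capacity when needed.
// Returns 0 on success, -1 if the allocation failed.
int strbuf_append(StrBuf* buf, const char* src, size_t n) {
    if ( buf->len + n + 1 > buf->cap ) {
        size_t cap = buf->cap ? buf->cap : 64;
        while ( cap < buf->len + n + 1 ) {
            cap *= 2;
        }
        char* data = realloc(buf->data, cap);
        if ( !data ) {
            return -1;
        }
        buf->data = data;
        buf->cap = cap;
    }
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
    buf->data[buf->len] = 0;
    return 0;
}

// Appends render with a tab at the start of every line
int strbuf_append_indented(StrBuf* buf, const char* render) {
    if ( strbuf_append(buf, "\t", 1) != 0 ) {
        return -1;
    }
    for ( const char* c = render; *c; c += 1 ) {
        if ( strbuf_append(buf, c, 1) != 0 ) {
            return -1;
        }
        // Start the next line with a tab, unless the text ends here
        if ( *c == '\n' && c[1] != 0 ) {
            if ( strbuf_append(buf, "\t", 1) != 0 ) {
                return -1;
            }
        }
    }
    return 0;
}
```

A StrBuf starts out zeroed (StrBuf out = {0};) and its data is released with free(out.data) at the end.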
The parser is still quite barebones, especially since the original markdown specification didn’t include many features. Something I definitely want to add is support for tables. If it is not too much work, I’d also like to include syntax highlighting for my favorite languages. Additionally, Gemini output would be cool.
Partial support is already implemented, meaning that .gmi files are written with parsed math, but the markdown source is not parsed. If you just want to use the parser to generate an automatic link list and some math, it already works. I’d like to convert the markdown syntax to Gemtext as far as possible, though. Let’s see how far I can make it.
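Links are a good example of where the two formats differ: Gemtext has no inline links, every link sits on its own line that starts with "=>", followed by the URL and an optional label. A minimal sketch of such a conversion, assuming a line that consists of nothing but a markdown link, could look like this. The function md_link_to_gemtext is hypothetical.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: converts a line of the form "[label](url)"
// into the Gemtext link line "=> url label". Returns 0 on success,
// -1 if the line does not have that form or out is too small.
int md_link_to_gemtext(const char* line, char* out, size_t out_size) {
    if ( line[0] != '[' ) {
        return -1;
    }
    const char* label_end = strstr(line, "](");
    if ( !label_end ) {
        return -1;
    }
    const char* url = label_end + 2;
    const char* url_end = strchr(url, ')');
    if ( !url_end ) {
        return -1;
    }
    // %.*s prints a substring of the given length
    int n = snprintf(out, out_size, "=> %.*s %.*s",
                     (int)(url_end - url), url,
                     (int)(label_end - line - 1), line + 1);
    if ( n < 0 || (size_t)n >= out_size ) {
        return -1;
    }
    return 0;
}
```

Links inside running text would need more thought, for example collecting them and printing them after the paragraph.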
If you want to try this parser, a static build for amd64 linux is hosted on my server.
I implemented link and preformatted block parsing for Gemtext. Additionally, I added a custom syntax for tables for both html and Gemtext output.
table
Entry1 Entry2 Entry3
abc abc abc
table
The entries are separated by tab characters. For tables and math preprocessing and html + gemini output, the execution time for this blog is 0.034s.
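To illustrate the idea of tab-separated entries, here is a sketch of how a single row could be turned into an html table row. It is not the code used by the parser: the function table_row_html is hypothetical, and it does not escape characters like < or & inside the cells.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes one tab-separated row as
// "<tr><td>..</td>..</tr>\n" into out. Returns 0 on success
// and -1 if the result does not fit into out.
int table_row_html(const char* row, char* out, size_t out_size) {
    size_t used = 0;
    int n = snprintf(out, out_size, "<tr>");
    if ( n < 0 || (size_t)n >= out_size ) {
        return -1;
    }
    used += (size_t)n;
    const char* cell = row;
    for (;;) {
        // A cell ends at the next tab, line break or end of string
        size_t len = strcspn(cell, "\t\n");
        n = snprintf(out + used, out_size - used,
                     "<td>%.*s</td>", (int)len, cell);
        if ( n < 0 || (size_t)n >= out_size - used ) {
            return -1;
        }
        used += (size_t)n;
        if ( cell[len] != '\t' ) {
            break;
        }
        cell += len + 1;
    }
    n = snprintf(out + used, out_size - used, "</tr>\n");
    if ( n < 0 || (size_t)n >= out_size - used ) {
        return -1;
    }
    return 0;
}
```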
This site uses the wonderful Catppuccin color scheme.