My boyfriend and i decided to start last years advent of code today. The first exercise can be found on their webpage.
In short: Parse a text file and extract the first and last digit in a line to form a two digit number. For example:
1abc2
pqr3stu8vwx
a1b2c3d4e5f
treb7uchet
should give 12
, 38
, 15
and 77
.
A short and fun exercise.
I wanted to use this opportunity and compare the standard file input functions in Hare and C. First i’ll look at a simple unbuffered read of the whole file, second i’ll compare buffered line reads.
In C i used the fread()
function to perform the read operation.
Since i needed the amount of bytes to read for the call, i used ftell
for that first.
// Get length of file in bytes
fseek(fhandle, 0, SEEK_END);
u32 len = ftell(fhandle);
// Set file pointer to beginning
rewind(fhandle);
// Read whole file
char *textfile = malloc(len+1);
fread(textfile, 1, len, fhandle);
fclose(fhandle);
Next i iterated over the textfile and extracted the first and last digits.
int first = -1;
int last = 0;
for ( u32 i = 0; i < len; i += 1 ) {
// Search individual lines
if ( textfile[i] == '\n' ) {
u32 num = (textfile[first]-48)*10 + textfile[last]-48;
AppendVec(digits, num);
first = -1;
last = 0;
}
if ( textfile[i] > 47 && textfile[i] < 58 ) {
last = i;
first = first == -1 ? i : first;
}
}
The variables first
and last
save the first and last appearance of ASCII-digits in each line.
The first appearance is distinguished by setting first to -1
- something the iterator i
should never be - and only updating the value if first == -1
.
If a linebreak is encountered, the two digits are evaluated - digit 0 is value 48 in ASCII - and appending the value to an array.
I wrote a custom vector struct for this purpose.
Then first
and last
are reset to their default values.
For Hare i used io::drain
for the reading and the same logic for the loop.
let textfile = io::drain(fhandle) as []u8;
let digits: []u32 = [];
let first = -1;
let last = 0;
for ( let i = 0u32; i < len(textfile); i += 1 ) {
if ( textfile[i] == '\n' ) {
// No number in line
if ( first == -1 ) {
continue;
};
let num = (textfile[first]-48)*10 + textfile[last]-48;
append(digits, num);
first = -1;
last = 0;
};
if ( textfile[i] > 47 && textfile[i] < 58 ) {
if ( first == -1 ) { first = i: i32; };
last = i: i32;
};
};
In C i used the fgets()
function with a heap-buffer and the same vector type as above.
u32 buff_size = 5000000;
char *BUFF = malloc( buff_size );
u32 first = 0;
u32 last = 0;
Vec digits = NewVec();
while ( 1 ) {
// Check for EOF
if ( !fgets(BUFF, buff_size, fhandle) ) {
break;
}
// Check for empty lines
if ( BUFF[0] == '\n' ) {
continue;
}
// Search first digit
for ( u32 i = 0; i < strlen(BUFF); i += 1 ) {
if ( BUFF[i] > 47 && BUFF[i] < 58 ) {
first = i;
break;
}
}
// Search last digit
for ( u32 i = first; i < strlen(BUFF); i += 1 ) {
if ( BUFF[i] > 47 && BUFF[i] < 58 ) {
last = i;
}
}
AppendVec(digits, ((BUFF[first]-48)*10 + BUFF[last]-48) );
}
free(BUFF);
The buffer size is quite large, because i tested with long automatically generated files.
For Hare i used the bufio::read_line()
function as described in the documentation.
let digits: []u32 = [];
const scan = bufio::newscanner(fhandle, types::SIZE_MAX);
for (true) {
const line = match (bufio::read_line(&scan)!) {
case io::EOF =>
break;
case let line: []u8 =>
yield line;
};
let first = 0u32;
let last = 0u32;
// Find first
for ( let i=0u32; i < len(line); i += 1 ) {
// Search for ascii digits
if ( line[i] < 58 && line[i] > 47 ) {
first = i;
break;
};
};
// Find last
for ( let i=first; i < len(line); i += 1 ) {
// Search for ascii digits;
// This works for utf8 as well, since no 7-bit
// codepoints appear in multibyte characters
if ( line[i] < 58 && line[i] > 47 ) {
// Convert ascii digits to digit values
last = i;
};
};
// Use first found digit as
let num = (line[first]-48)*10 + (line[last]-48);
append(digits, num);
};
To benchmark the different solution, i generated a text file with 10000 lines, random line length between 20 and 20020 chars and random ASCII characters.
for ( size_t i = 0; i < 10000; i += 1 ) {
size_t len = 20 + rand()%20000;
char *buff = malloc(len);
for ( size_t k = 0; k < len-1; k += 1 ) {
buff[k] = rand()%92 + 33;
}
buff[len-1] = 52;
printf("%s\n", buff);
free(buff);
}
The resulting file is 96 MiB large and should provide a reasonable testing ground for the different solutions. The last entry is always set to 52 to ensure that at least one digit is always present.
I tested gcc
with and without optimization using -O3
, tcc
and hare build -R
.
To gain better insights, i measured the execution time of the I/O operations, the extraction loop and the linux execution time.
All values are given in seconds.
gcc | -O3 | tcc | hare | |
BUFF | ||||
time | 7.273 | 0.103 | 7.242 | 168.658 |
IO | 0.024 | 0.021 | 0.021 | 168.168 |
Loop | 7.243 | 0.077 | 7.212 | - |
UNBUFF | ||||
time | 0.443 | 0.191 | 0.440 | 0.590 |
IO | 0.044 | 0.035 | 0.035 | 0.093 |
Loop | 0.392 | 0.149 | 0.399 | 0.492 |
The significantly slower operations in the buffered I/O package in Hare is quite a surprise.
But the jump from gcc/tcc to gcc -O3
is also quite impressive.
The unbuffered IO solutions perform better due to the better loop optimization.
For me the somewhat slower IO in C and slower time all around using -O3
is also quite surprising.
I’m honestly not too sure what to make of all of this, but it was quite fun!
This site uses the wonderful Catppuccin color scheme.
My RSS Feed.
Mail: mia.schambeck@mailbox.org
XMPP: mia.schambeck@mailbox.org
Bluesky: @miajara.bsky.social
Mastodon: @miasb.sueden.social