My boyfriend and i decided to start last years advent of code today. The first exercise can be found on their webpage.
In short: Parse a text file and extract the first and last digit in a line to form a two digit number. For example:
1abc2
pqr3stu8vwx
a1b2c3d4e5f
treb7uchet
should give 12
, 38
, 15
and 77
.
A short and fun exercise.
I wanted to use this opportunity and compare the standard file input functions in Hare and C. First i’ll look at a simple unbuffered read of the whole file, second i’ll compare buffered line reads.
In C i used the fread()
function to perform the read operation.
Since i needed the amount of bytes to read for the call, i used ftell
for that first.
// Get length of file in bytes
fseek(fhandle, 0, SEEK_END);
u32 len = ftell(fhandle);
// Set file pointer to beginning
rewind(fhandle);
// Read whole file
char *textfile = malloc(len+1);
fread(textfile, 1, len, fhandle);
fclose(fhandle);
Next i iterated over the textfile and extracted the first and last digits.
int first = -1;
int last = 0;
for ( u32 i = 0; i < len; i += 1 ) {
// Search individual lines
if ( textfile[i] == '\n' ) {
u32 num = (textfile[first]-48)*10 + textfile[last]-48;
AppendVec(digits, num);
first = -1;
last = 0;
}
if ( textfile[i] > 47 && textfile[i] < 58 ) {
last = i;
first = first == -1 ? i : first;
}
}
The variables first
and last
save the first and last appearance of ASCII-digits in each line.
The first appearance is distinguished by setting first to -1
- something the iterator i
should never be - and only updating the value if first == -1
.
If a linebreak is encountered, the two digits are evaluated - digit 0 is value 48 in ASCII - and appending the value to an array.
I wrote a custom vector struct for this purpose.
Then first
and last
are reset to their default values.
For Hare i used io::drain
for the reading and the same logic for the loop.
let textfile = io::drain(fhandle) as []u8;
let digits: []u32 = [];
let first = -1;
let last = 0;
for ( let i = 0u32; i < len(textfile); i += 1 ) {
if ( textfile[i] == '\n' ) {
// No number in line
if ( first == -1 ) {
continue;
};
let num = (textfile[first]-48)*10 + textfile[last]-48;
append(digits, num);
first = -1;
last = 0;
};
if ( textfile[i] > 47 && textfile[i] < 58 ) {
if ( first == -1 ) { first = i: i32; };
last = i: i32;
};
};
In C i used the fgets()
function with a heap-buffer and the same vector type as above.
u32 buff_size = 5000000;
char *BUFF = malloc( buff_size );
u32 first = 0;
u32 last = 0;
Vec digits = NewVec();
while ( 1 ) {
// Check for EOF
if ( !fgets(BUFF, buff_size, fhandle) ) {
break;
}
// Check for empty lines
if ( BUFF[0] == '\n' ) {
continue;
}
// Search first digit
for ( u32 i = 0; i < strlen(BUFF); i += 1 ) {
if ( BUFF[i] > 47 && BUFF[i] < 58 ) {
first = i;
break;
}
}
// Search last digit
for ( u32 i = first; i < strlen(BUFF); i += 1 ) {
if ( BUFF[i] > 47 && BUFF[i] < 58 ) {
last = i;
}
}
AppendVec(digits, ((BUFF[first]-48)*10 + BUFF[last]-48) );
}
free(BUFF);
The buffer size is quite large, because i tested with long automatically generated files.
For Hare i used the bufio::read_line()
function as described in the documentation.
let digits: []u32 = [];
const scan = bufio::newscanner(fhandle, types::SIZE_MAX);
for (true) {
const line = match (bufio::read_line(&scan)!) {
case io::EOF =>
break;
case let line: []u8 =>
yield line;
};
let first = 0u32;
let last = 0u32;
// Find first
for ( let i=0u32; i < len(line); i += 1 ) {
// Search for ascii digits
if ( line[i] < 58 && line[i] > 47 ) {
first = i;
break;
};
};
// Find last
for ( let i=first; i < len(line); i += 1 ) {
// Search for ascii digits;
// This works for utf8 as well, since no 7-bit
// codepoints appear in multibyte characters
if ( line[i] < 58 && line[i] > 47 ) {
// Convert ascii digits to digit values
last = i;
};
};
// Use first found digit as
let num = (line[first]-48)*10 + (line[last]-48);
append(digits, num);
};
To benchmark the different solution, i generated a text file with 10000 lines, random line length between 20 and 20020 chars and random ASCII characters.
for ( size_t i = 0; i < 10000; i += 1 ) {
size_t len = 20 + rand()%20000;
char *buff = malloc(len);
for ( size_t k = 0; k < len-1; k += 1 ) {
buff[k] = rand()%92 + 33;
}
buff[len-1] = 52;
printf("%s\n", buff);
free(buff);
}
The resulting file is 96 MiB large and should provide a reasonable testing ground for the different solutions. The last entry is always set to 52 to ensure that at least one digit is always present.
I tested gcc
with and without optimization using -O3
, tcc
and hare build -R
.
To gain better insights, i measured the execution time of the I/O operations, the extraction loop and the linux execution time.
All values are given in seconds.
gcc | -O3 | tcc | hare | |
BUFF | ||||
time | 7.273 | 0.103 | 7.242 | 168.658 |
IO | 0.024 | 0.021 | 0.021 | 168.168 |
Loop | 7.243 | 0.077 | 7.212 | - |
UNBUFF | ||||
time | 0.443 | 0.191 | 0.440 | 0.590 |
IO | 0.044 | 0.035 | 0.035 | 0.093 |
Loop | 0.392 | 0.149 | 0.399 | 0.492 |
The significantly slower operations in the buffered I/O package in Hare is quite a surprise.
But the jump from gcc/tcc to gcc -O3
is also quite impressive.
The unbuffered IO solutions perform better due to the better loop optimization.
For me the somewhat slower IO in C and slower time all around using -O3
is also quite surprising.
I’m honestly not too sure what to make of all of this, but it was quite fun!
This site uses the wonderful Catppuccin color scheme.