Hello and welcome to my little nook of the internet. Here I have my blog, which mainly contains posts about tech, the outdoors, my journey as a PostDoc, cooking, and sometimes mead brewing.

A bit of background information before you step into my world of crazy. I am Lars, a.k.a. looopTools, a Software Engineer and PostDoc living in East Jutland, Denmark. My research focuses on the application of generalised deduplication in real-world storage systems. This includes the design, implementation, and evaluation of prototypes and more complex systems. I have 10+ years of experience from industry, mainly through student positions, but also as a self-employed consultant and as a full-time employee. I mainly work in low-level user space, high-level kernel space, and storage systems in general. Besides research and software development, I also love the outdoors and try to go out as often as possible (not enough at the moment), and I am an aspiring author currently working on a few different novels. I also dabble in being a more advanced home cook and baker, which you may see some posts about. Finally, I like the ancient art of brewing mead, a.k.a. honey wine, and experiment with different flavour combinations and ageing times.

Introducing QueryC++

14 April 2021

I love C++, and I love PostgreSQL. What I do not like is writing raw SQL queries using strings in C++. First of all, I do not like to see SQL query strings in C++, and second, developers (myself included) sometimes make syntax errors that we cannot catch at compile time. Therefore, to reduce the potential for errors, I have been looking for libraries to solve this. Sadly, the libraries I found were either out of date or tricky to set up correctly, and were too complex for simply generating queries. For this reason, I decided to create a new library called QueryC++ to provide a bare-bones, simple library for generating SQL queries in C++. Additionally, the purpose is to provide a modern library that takes advantage of the latest features in C++17 and C++20.

The library is open source under the BSD license and is available on GitHub: looopTools/querycpp. The project is still in its infancy and not ready for production usage, but I wanted to share information about it so that people can follow along in the development process and give me input.

Additionally, the project serves as a learning experience for me, as I plan to include build chains for CMake and a Conan package, where I have limited experience. Besides CMake, build scripts will also be provided for waf.io and standard Make to accommodate most macOS, Linux, and BSD users (sorry, Windows, you will have to wait a bit).

ROAD MAP: The current road map for the project is to provide full support for PostgreSQL SQL syntax, supporting all our favourite commands. The first step will be to get basic SELECT statements off the ground and build up from there. Hopefully, we will see an alpha release of the project in late April or early May.

Plan for other databases: The benefit of most relational databases is that their languages are very similar, and queries (most of the time) work across SQL dialects. But some (quite a few) special cases exist, and we will have to address them to make QueryC++ work with other databases, and this is the plan. I just decided to start with PostgreSQL because it is the database I favour working with. The plan for adopting other databases is that when I am adequately satisfied with PostgreSQL compatibility, I will start implementing support for MariaDB, which should also cover a large part of MySQL.

What is this project not? This project is not meant as a replacement for jtv/libpqxx but rather to work with PQXX, where QueryC++ will make it easy to build your queries and PQXX will make it easy to connect to a PostgreSQL database and execute them.

./Lars

[C++] std::string contains a character or a substring

14 April 2021

I am a little late to this game (and will explain why later), but I often find that students need to identify if a string contains either a single character or a sub-string. In most languages this is pretty easy, as they provide a contains() function on strings which returns a boolean telling us if the string contains the character or sub-string. However, C++ does not provide such a function (yet), and using Boost to get this functionality is a bit excessive. But is it easy to do in C++? Well, based on the solutions I have seen from students, the answer is no! The main reason for this is that they do not know modern C++ or, more specifically, std::string.

Before I show how I would address this issue, I will show some solutions that students have shown me. One solution is to find a single character, and one is to find a sub-string, with both being based on for-loops. For finding a single character, the students would iterate over the string to check for each index if the element at that index matches the character and return true if it does.

bool contains(const std::string& str, const char sub)
{
    for (size_t i = 0; i < str.size(); ++i)
    {
        if (str.at(i) == sub)
        {
            return true;
        }
    }
    return false; 
}

This solution is not bad or incorrect. But a question I often ask the students is: “If you do not need the index, why use it?” Next, I will suggest that if they insist on using loops, then they should use a for-each loop, as it reduces the risk of index errors. This morphs the code above into the one below. This solution is essentially no different from the students’ solution; it is just “safer” to use. But we will make it much better.

bool contains(const std::string& str, const char sub)
{
    for (const auto& elm : str)
    {
        if (elm == sub)
        {
            return true; 
        }
    }
    return false; 
}

Now for identifying if a string contains a sub-string, it often looks similar to the code below. This code has the same risk of indexing errors if we are not careful, but it is also much more challenging to change this from index-based loops to for-each. Additionally, we have to remember to break the loop and return if we find the information we are looking for. All in all, not nice compared to, for instance, Python, where we can simply use the in operator.

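bool contains (const std::string& str, const std::string& sub)
{
    // An empty sub-string is trivially contained, and sub.at(0) below would throw
    if (sub.empty())
    {
        return true;
    }
    // Only start where the whole sub-string still fits, so str.at(i + j) stays in range
    for (size_t i = 0; i + sub.size() <= str.size(); ++i)
    {
        if (str.at(i) == sub.at(0))
        {
            bool found = true;
            for (size_t j = 0; j < sub.size(); ++j)
            {
                if (str.at(i + j) != sub.at(j))
                {
                    found = false;
                    break;
                }
            }
            if (found)
            {
                return found; 
            }
        }
    }
    return false; 
}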

But how do we make this code simpler and safer to use? Well, let us take a look at basic_string. What we will see is that string has a function called find, which returns a size_type that can be compared to std::string::npos. A cool thing here is that find works with both a char and a string input, so instead of having different functions for contains, we can easily define a single function.

#include <string>

template<typename T>
bool contains(const std::string& str, const T sub)
{
    return str.find(sub) != std::string::npos;
}

The benefit of this solution is that it completely removes the need for loops (loops exposed to us) and indexing. Additionally, it is super easy to read. If you want to test the function compile this code:

#include <string>

#include <iostream>


template<typename T>
bool contains(const std::string& str, const T sub)
{
    return str.find(sub) != std::string::npos;
}


int main(void)
{

    std::string base = "abba"; 
    std::string sub = "ba";
    bool _contains = contains(base, sub);
    std::cout << _contains << "\n";

    sub = "da"; 
    _contains = contains(base, sub);
    std::cout << _contains << "\n";

    char csub = 'a'; 
    _contains = contains(base, csub);
    std::cout << _contains << "\n";

    csub = 'f'; 
    _contains = contains(base, csub);
    std::cout << _contains << "\n";
}

Now, this is the best solution I know of in C++11/14/17/20. But remember that I wrote C++ does not have a contains function yet? Well, with C++23, this will change as the contains function will be added with the new standard, and if you follow the link, you will see that what we have implemented above is the same as what is intended to be used in C++23.

./Lars

Developing a Cake recipe trial by trial - Part 1

11 January 2021

Baking cakes is one of the easiest things you can do in a kitchen, and often it is just as fast as shake and bake but tastier. However, for quite a while I have struggled with baking chocolate cakes: the flavour has been great, but the texture has not been on par. I have for a long while sworn by this (Danish) Alletiders mumse CHOKOLADEKAGE recipe, as it got closest to what I wanted, but it was still lacking something.

In this post and the following posts, I will take you through the development of my own recipe, which is based on the recipe above.

Step 1: The base recipe

Let us start on common ground with the base ingredient list and recipe in English

Ingredient Amount
Cocoa 36g
All purpose flour 250g
Sugar 300g
Baking soda 5g
Baking powder 5g
Salt a pinch
Egg 2
Water 2.5 dl.

A few comments on this: the cocoa, baking soda, and baking powder are given in tablespoons and teaspoons in the original recipe. I have converted them to grams here based on standard conversion tables I have from a Danish cookbook.

Sift the dry ingredients together and melt the butter. Mix in butter, eggs, and water. Put the mixture in a 23 by 23 cm pan and bake in a preheated oven at 170 °C on the lowest rack for 35 min using conventional heat. Now the recipe does not call for buttering the pan, so I am assuming the original author wants us to use baking paper. Also, the author states that the cake is freezer friendly, which is totally true, and in my opinion this cake is best served “luke” or cold and not piping hot.

What is wrong with this Recipe?

Nothing! I really love it. Then why the hell do I want to change it? There are a few reasons: I like an airy cake, which I do not think the original recipe provides; it has a very monotone flavour; and I like to mix it up.

So let us get nutty!

Step 2: First prototype

So my thoughts behind the first changes were: what do I want to change? I want a more airy cake and I want to modify the flavour a bit. Also I hate melting butter for cakes, for some random reason, so why not cream it?

Making the cake more airy can be fixed by whipping the egg whites and gently folding them in. This also gives us a bit more control when adding the yolks, as we can “cream” them in with the butter.

For the fun of it I add 125g of dark chocolate chips as well.

Changing the flavour: chocolate and cocoa go well with dark berries, and I positively love boysenberries (a Frankenstein berry cultivated by crossing European raspberries, blackberries, American dewberries, and loganberries), and jam made from them is pretty readily available in most supermarkets in Denmark. I decided to go with 100g, without changing other ingredients. Now let us move on to the modified recipe!

In a bowl, cream together the sugar and butter until fully combined and white. Separate the egg yolks and whites. Put the whites in a cool bowl (if you have a metal one use that, copper even better) and the yolks in a separate dish. In a separate bowl, sift the flour, cocoa, salt, baking soda and baking powder together. Now put one egg yolk and one tablespoon of the flour mix into the butter and sugar. Mix until fully combined and then repeat with the next egg yolk. Now mix in the water and the rest of the flour mixture until fully combined. Mix in the jam and the chocolate chips. Set aside for a bit while you whip the egg whites to stiff peaks. If you do not have dexterity problems, then there is no reason to use an electric whisk for this, just learn a proper whisking technique. First fold in half of the egg whites until fully combined. Then gently fold in the other half of the whites. Rub a baking tin or springform pan (I use the latter) with butter and sprinkle with sugar, and gently pour in the mixture. Increase the baking time to 60 minutes.

Comments and things to change

The cake did become more airy due to the whipping of the whites and gently folding them in. The chocolate chips made the cake too rich (which is a thing I never thought I would say). Finally, the test people and I still thought some texture was missing and the boysenberry flavour was not powerful enough.

  1. The reason the cooking time increases is the added amount of liquid from the jam.
  2. The amount of jam was too little; the flavour of the boysenberries was not as powerful as I wanted.
  3. The chocolate chips made the cake too rich.
  4. The reason I sprinkled the form with sugar as well is that it creates a caramelised outside on the cake, which gives an extra layer of texture.
  5. What if we add a marzipan wrapping? This came about from again thinking the flavour was too monotone.

Step 3: Second prototype (currently the newest)

So for this bake, I removed the chocolate chips and increased the jam to 300g (an entire jar), and after the bake I rolled out marzipan and wrapped the cake. The rest of the mixing process was the same. We kept the same baking temperature, but increased the baking time to 90 minutes.

Comments and things to change

First off, the cake was nice, lovely and airy (these cakes never get light), and tasted of both chocolate and boysenberry. But a few things: 1) Due to the heavily increased amount of sugar added by both the jam and marzipan, the cake got way too sweet. This needs to be rectified and I will comment on how I wanna try to do this. 2) The marzipan did add a lovely almond flavour (for the love of the gods buy good quality), but the almond flavour was not pronounced enough. 3) Marzipan clearly works best if the cake is served “luke” / cold. I still have a few pieces in my freezer and the ones I have tried from there had a much stronger marzipan flavour.

Now let us talk about 1) in more detail. I am not sure I want to keep the marzipan, but I do want to keep the jam level. The solution to reduce the sweetness is therefore to reduce the sugar level. So I am thinking about cutting as much as 125 grams of sugar. But this will have to be trial and error, as it is very hard to judge.

Next 2) Adding more almond flavour can also help us address 1). A friend said that some almond cream in between two layers of this cake would be nice. This got me thinking: lots of chocolate cakes use coffee to cut through the richness and sweetness of a cake using the bitterness of the coffee. Therefore the next version will feature an espresso cream with Amaretto (Italian almond liqueur).

Finally 3) I am a bit torn here, either I keep the marzipan or I find another way to put in more almond flavour (besides Amaretto). But I do not like chopped almonds inside of cake. I could use essence of almond, but I kind of see that as cheating. So I have to come up with ideas for how to fix this.

This is where this first post drops off, but I will keep you posted as I change the recipe. In the end I will add the final recipe to my recipe collection, which you can find on my GitHub page!

What keyboard and mouse do I use?

29 December 2020

As a developer and a PhD Fellow I spend a lot of time in front of a monitor, mouse, and keyboard and have done so for quite a while. To put it into perspective, when I was fifteen I earned money for the first website I made for someone else, and by the time I was 18 I coded Java for companies in my spare time. During my summer holiday at 18 I coded 8-16 hours a day for a start-up company, and basically in 2 months I shattered my wrists. It got really bad, but I thought I was young and powered through, resulting in RSI pains in my wrists. I had to dramatically decrease my working hours and focus on studying (though I did work anyway), and through exercises I got to a point where I could type for 5-6 hours, which is decent even for an everyday working developer. But prolonged use of standard keyboards still gives me pain, so I had to come up with a solution to this problem. In this post I will cover the hardware I use to achieve an ergonomic desk setup and share links to some of the stuff. Bear in mind none of it is sponsored; it is purely my recommendation.

For years I used a standard Apple full-size DK layout keyboard, the one in aluminium with white keys, and it actually served me well. In my opinion it is one of the best membrane keyboards out there, as it causes me minimal pain to use. I have had mine since 2008 and used it every day until late 2014, so a six-year run. On occasion I will still whip it out, and I will explain why later. But what happened in 2014? I got hired by Steinwurf ApS and saw the CTO's work setup: on his desk sat one of the craziest keyboards I had ever seen. It was a Kinesis Advantage 2 LF, which is a keyboard built for ergonomics instead of looking pretty, and to be honest it is amazing. In the time I have used it (roughly five and a half years now) I haven’t really experienced any wrist pains unless I work for more than 8 hours straight (which does happen on occasion). Last year I went abroad for 6 months without my beloved keyboard and the pains returned; when I returned to the Kinesis keyboard the pain disappeared quickly. Now the keyboard does have a few drawbacks: it takes a while to get used to typing on and will reduce your productivity for a while. Additionally, colleagues who have to help you at your desk will have a bit of a hard time using your keyboard. I actually keep an auxiliary keyboard at hand for that. But those are about all the disadvantages. I wholeheartedly recommend this keyboard; I recommend it so much that I have two myself. If you want to see a more in-depth review of this, Linus Tech Tips have a good one: The Most BIZARRE Keyboard…

Now back to why I sometimes whip out my Apple keyboard and others. Sometimes I get a weird pain in my shoulder (due to a bike crash a few years back), and the only way I have found to lessen the pain while I work is to switch between different keyboards. At home it is the Apple keyboard, and at work I have a Das Keyboard 4 Professional and a Vortex Pok3r. I actually got the Pok3r first on the recommendation of a friend, who loves it, but the only positive thing I have to say about that keyboard is that it can be used as a blunt weapon in an emergency situation. I can, best case, use it for 45 minutes before I have so much pain in my wrists that I need to take a 30 minute break, and my shoulders will start aching pretty quickly after. This is of course without having pain in the shoulder before. Multiple people have claimed that this is just because I am not used to it and that my fingers are not used to Cherry MX Blue switches. Well, the shoulder pain maybe, but I love MX Blue switches, and although my Kinesis is not with MX Blues I would love it to be, and I actually have had a Das Keyboard 4 Professional with MX Blues at a former employer, which was amazing. The Das Keyboard I have at work now has MX Browns, just so my colleagues do not kill me when I type on it, and it helps a lot with the shoulder pains I feel from time to time. I think it is simply down to the switch of position, and it is also the keyboard I use as an auxiliary keyboard.

Next, mouse! What type of mouse do I use? I am an ergonomics lover and therefore I love trackball mice. At work I use a Logitech M570, which Logitech no longer makes; it has been replaced by the mouse I have at home, which is the Ergo M575. Both mice are amazing and I have no complaints. Well, I have one, but I am not sure if that is the M570 or my Mac being annoying. At work I mainly use a Dell XPS 15” (model 9560 I think) running (surprise) Fedora, which the M570 just works with, no problem whatsoever. But if I connect the M570 to my MacBook Pro the connection is often dropped, and I have been unable to figure out why. Finally, at home I also use an Apple Magic Trackpad (gen 1) and I have done so for years; I have had one since it became available in Denmark. The reason is that moving windows between spaces on macOS with the keyboard is just a pain in the arse and none of the tools I found did a good job. So for that I use the trackpad and nothing else really. I have had a Magic Mouse (gen 1) as well and that is a mouse sent from hell; it is even worse than the Pok3r at causing wrist pain. I much prefer the old Apple Mighty Mouse, which I still have a few of and use to play Warcraft.

So this is the hardware I use and what I have tried.

TL;DR

I use a Kinesis Advantage 2 LF keyboard and Logitech M570 and M575 trackball mice.

Using multiple bytea in Postgres queries with libpqxx

18 November 2020

In deduplication one of the things we use a lot is data fingerprints, where we use some partial function to generate a unique fingerprint for a data chunk. We use this fingerprint for a variety of things: comparing chunks, referencing chunks, and sometimes as file names for the chunks. For one of the systems I work on, I need to store these fingerprints in a database, and to be honest this has caused me a bit of concern, which I will go into in a bit. But first let me describe how we generate the fingerprints and why.

In the system we have our chunks represented as vectors of bytes in the form of std::vector<uint8_t>. These vary in size depending on the chunk size we are using for the experiments, but the most common I use is 4 kB. Comparing chunks of 4 kB may not seem expensive, but you are still comparing 32768 bits for each chunk you compare against. To reduce the processing time we generate fingerprints, and in our system we use SHA-1. To do this we use Harpocrates [1], which is a wrapper around OpenSSL’s implementation that makes it super easy to use. Using SHA-1 gives us a 20 byte fingerprint, so only 160 bits are now compared, which is a reduction of roughly 99.51%, which is pretty cool.

Now that is all good and cool but how do we store this in a database? This has caused me some concern. First I was thinking (and have done this [2]) that we could store the fingerprints in the database as “safe strings” by converting them to hexadecimal representation, using the following conversion method:

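#include <cstdint>
#include <string>

#include <vector>

#include <iomanip>
#include <sstream>

std::string fingerprint_to_string(const std::vector<uint8_t>& fingerprint)
{
    std::stringstream ss; 
    for (const auto& elm : fingerprint)
    {
        // Cast to int so the byte is printed as a number and not as a character,
        // and pad to two digits so every byte becomes exactly two hex characters
        ss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(elm); 
    }
    return ss.str();
}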

My main concern with this approach is that it increases the length of the fingerprint from 20 to 40 bytes, admittedly not a lot but enough that I was annoyed. Also, when it is stored in the database a couple of bytes might be added. Some would say, why not just use blobs then? Well, I would like to be able to compare the fingerprints, so using a blob is not always an option, and blobs are more for large amounts of data. Luckily we are using PostgreSQL, which has a data type bytea [3] (short for byte array, or binary string), which costs the length of the binary string plus 1 or 4 bytes of overhead. As far as I can tell, it is 1 byte for short values and 4 bytes for longer ones. This was intriguing to me, as libpqxx (pqxx) [4] actually does support bytea with the data type pqxx::binarystring. But I ran into a problem. I like using prepared statements with pqxx, so I tried the following:

#include <pqxx/pqxx>

... 
// Prepare statement on connection 
m_conn.prepare("seen_fingerprints", "SELECT fingerprint FROM basis WHERE fingerprint IN $1");
...

std::vector<std::vector<uint8_t>> seen_fingerprints(const std::vector<std::vector<uint8_t>>& fingerprints)
{
    std::vector<pqxx::binarystring> b_fingerprints(fingerprints.size()); 
    
    for (size_t i = 0; i < fingerprints.size(); ++i)
    {
        auto fingerprint = pqxx::binarystring(fingerprints.at(i).data(), fingerprints.at(i).size()); 
        b_fingerprints.at(i) = fingerprint; 
    }
    
    // start transaction
    pqxx::work worker{m_conn}; 
    pqxx::result res {worker.exec_prepared("seen_fingerprints", b_fingerprints)};
    // handle result
    ...
    // Commit transaction 
    worker.commit(); 
    
    return result; 
}

But I ran into this error: 'pqxx::string_traits<pqxx::binarystring>::size_buffer' is not defined, and the only solution I could find was to do it for each fingerprint individually. Those who program know that sounds like a dumb idea, and I was aware of this. So I ended up making an issue on pqxx’s GitHub page [5] and, with the help of some of the people there, actually found a solution. Below is a simplified insert function using this solution.

#include <cstdint>
#include <string>
#include <vector>

#include <fmt/core.h>
#include <pqxx/pqxx>

void register_fingerprints(const std::vector<std::vector<uint8_t>>& fingerprints)
{
    // Nothing to insert, and an empty VALUES list would be invalid SQL
    if (fingerprints.empty())
    {
        return;
    }

    std::vector<pqxx::binarystring> params;
    params.reserve(fingerprints.size());
    
    std::string params_list = ""; 
    
    // PQXX parameter lists are 1-indexed
    size_t param_index = 1;
    for (const auto& fingerprint : fingerprints)
    {
        // One binarystring per fingerprint, matching placeholder $param_index
        params.emplace_back(fingerprint.data(), fingerprint.size());
        params_list = params_list + fmt::format("(${}, 1),", param_index); 
        param_index = param_index + 1; 
    }
    
    // substr with size() - 1 because we remove the last ,
    std::string query = fmt::format("INSERT INTO basis (fingerprint, reference) VALUES {}",
                                    params_list.substr(0, params_list.size() - 1)); 
                                                        
    pqxx::work worker(m_conn); 
    worker.exec_params(query, pqxx::prepare::make_dynamic_params(params)); 
    worker.commit(); 
}

Basically we create a query with a number of parameters equal to the number of fingerprints, construct an insert statement from this, and pass the params vector as a dynamic parameter, which basically solved my problems.

Now one thing I have yet to test is how the underlying libpq [6] actually handles the pqxx::binarystring and whether it converts it to a hexadecimal string. Based on the answers I got in [5] I think it does, in which case the storage savings are none. But I still believe I am using a more appropriate data type, so that is a win.

Final note: some may ask “what is fmt and where does it come from?” fmt [7] is a string formatting library for C++ by Victor Zverovich [8], which brings easy string formatting to C++, and it feels and looks a lot like what is available by default in Python. I was turned on to this library by Jason Turner (lefticus) [9], the host of C++ Weekly [10] (which I recommend that you follow). fmt is integrated into C++20 as std::format from the header <format> [11]. It is a nifty “little” library and I recommend that you give it a spin; it has made my life so much easier.