In deduplication one of the things we use a lot is data fingerprints, where we use some partial function to generate a unique fingerprint for a data chunk. We use this fingerprint for a variety of things, comparing chunks, referencing chunks, and sometimes as file names for the chunks. For one of the systems I work on, I need to store these fingerprints in a database and to be honest this have caused me a bit of concern, which I will go into in a bit. But first let me describe how we generate the fingerprints and why.

In the system we have our chunks represented as vector of bytes in the form of std::vector<uint8_t>. These varies in size depending on the chunk size we are using for the experiments, but the most common I use is 4kB. Comparing chunks of 4kB may not seem expensive, but you are still comparing 32768 bits for each chunk you compare against. To reduce the processing time we generate fingerprints and in our system we use SHA-1 to do this we use Harpocrates [1] which is a wrapper around openSSL’s implementation which makes it super easy to use. Using SHA-1 gives us a 20 byte fingerprint so only 160 bites are now compared, which is a reduction of roughly 99.51% which is pretty cool.

Now that is all good and cool but how do we store this in a database? This has caused me some concern. First I was thinking (and have done this [2]) that we could store the fingerprints in the database as “safe strings” by converting them to hexadecimal representation, using the following conversion method:

#include <cstdint>
#include <string>

#include <vector>

#include <sstream>

std::string fingerprint_to_string(const std::vector<uint8_t>& fingerprint)
{
    std::stringstream ss; 
    for (const auto& elm : fingerprint)
    {
        ss << std::hex << elm; 
    }
    return ss.str();
}

My main concern with this approach is that it increases the length of the fingerprint from 20 to 40 bytes, admittedly not a lot but enough that I was annoyed. Also when it is stored in the database a couple of bytes might be added. Some would say, why not just use blobs then? Well I would like to be able to compare the fingerprints, so using a blob is not always an option, and blobs are more for large amounts of data. Luckily we are using PostgreSQL which has a data type bytea [3] (short for bytearray or binary string), which costs the length of the binary string plus 1 to 4 bytes. Not quite sure why it is 1 to 4 and not one or the other. This was intriguing to me as libpqxx (pqxx) [4] actually does support bytea with the data type pqxx::binarystring. But I ran into a problem, I like using prepared statements with pqxx, so I trieded the following

#include <pqxx/pqxx>

... 
// Prepare statment on connection 
m_conn.prepare("seen_fingerprints", "SELECT fingerprint FROM basis WHERE fingerprint IN $1");
...

std::vector<std::vector<uint8_t>> seen_fingerprints(const std::vector<std::vector<uint8_t>>& fingerprints)
{
    std::vector<pqxx::binarystring> b_fingerprints(fingerprints.size()); 
    
    for (size_t i = 0; i < fingerprints.size() ++i)
    {
        auto fingerprint = pqxx::binarystring(fingerprints.at(i).data(), fingerprints.at(i).size()); 
        b_fingerprints.at(i) = fingerprint; 
    }
    
    // start transaction
    pqxx::work worker{m_conn}; 
    pqxx::results res {worker.exec_prepared("seen_fingerprints", b_fingerprints)};
    // handled result
    ...
    // Commit transaction 
    worker.commit(); 
    
    return result; 
}

But I ran into this error: 'pqxx::string_traits<pqxx::binarystring>::size_buffer' is not defined and the only solution I could find was to do it for each fingerprint individually. Those who program knows that sounds like a dumb idea and I was aware of this. So I ended up making a issue on pqxx’s GitHub page [5] and with the help of some of the people of there actually found a solution. Below is a simplified insert function using this solution.

#include <fmt/core.h>

void register_fingerprints(const std::vector<std::Vector<uint8_t>>& fingerprints)
{
    std::vector<pqxx::binarystring> params(fingerprints.size());
    
    std::string params_list = ""; 
    
    // PQXX 1 index parameter list
    size_t param_index = 1;
    for (const auto& fingerprint : fingerprints)
    {
        auto b_fingerprint = pqxx::binarystring(fingerprint.dat(), fingerprint.size());
        params.at(param_index - 1) = b_fingerprints; 
        params_list = params_list + fmt::format("(${}, 1),", param_index); 
        param_index = param_index + 1; 
    }
    
    // substring - 1 because we reomve th elast ,
    std::string query = fmt::format("INSERT INTO basis (fingerprint, reference) VALUES {}",
                                                        param_list.substr(0, param_list.size() - 1)); 
                                                        
    pqxx::work worker(m_conn); 
    worker.exec_params(query, pqxx::prepare::make_dynamic_params(params)); 
    worker.commit(); 
}

Basically we create a query with a an amount of parameters equal to the number of fingerprints and then we construct and insert statement from this, and parse the params vector as a dynamic parameter, which basically solved my problems.

Now one thing I have yet to test is how the underlying libpq [6] actually handles the pqxx::binarystring and if it converts it to a hexadecimal string, based on the answers I got in [5] I think it does in which case the storage savings are none. But I still believe I am using a more appropriate data type so that is a win.

Final note, some may ask “what is fmt and where does it come from?” fmt [7] is a string formatting library for C++ by Victor Zverovich [8], which brings easy string formatting to C++ and it feels and looks a lot like what is available by default in Python. I was turned on to this library by Jason Turner (lefticus) [9] the host of C++ weekly [10] (which I recommend that you follow). fmt will be integrated in to C++20 as std::format from the header <format> [11]. It is a nifty “little” library and I recommend that you give it a spin, it has made my life so much easier.