I have found myself crossing paths with threadpools quite a lot over the last few weeks. For my first encounter, I found myself poring over the FastAPI docs in an attempt to better understand how it dispatches functions to its external threadpool (outside the main event loop). Not long afterwards (during my orderbook rewrite[1]), I began to think about matching-engines again, specifically, high-throughput message handling once an order has been executed (at 100k's of messages per second). As a matching-engine operates on orders sequentially (resource allocation is focused on processing incoming orders as fast as possible), post-trade execution requires a pragmatic approach to handling those trade messages reliably outside of the core matching-engine logic (at least that's how I've always interpreted it). This is where threadpools began to show their relevance once more (frequency illusion at its finest). What this post entails is a brief overview of threadpools, how I went about implementing one and some of the interesting things I learned along the way 🕳️🐇.
Using multiple threads (of execution) allows multiple tasks to be scheduled within a single process in such a way that they can be handled concurrently. What this means is that rather than waiting for a single function to execute sequentially before moving to the next, a process can try to optimize its pipeline by scheduling tasks so as to minimize idle time between instructions (e.g. executing instructions in another function while waiting for an asynchronous call to complete). An important caveat here is that instantiating a thread is an expensive endeavour. A thread runtime requires its own stack, along with copying over any relevant state from the parent thread before beginning execution, all of which requires kernel involvement for scheduling and memory allocation.
Threadpool overview: Each task (T) is enqueued before being executed by a thread (note the fixed number of threads in the pool)
This is where threadpools come into play. Threadpools can offer an efficient implementation of the traditional multi-threaded model by providing a fixed set of long-running threads of execution (the pool), rather than going through unnecessary and repeated creation and teardown of threads as work arises. Limiting the number of threads being created can also help mitigate potential resource contention, resource consumption and excessive context-switching. The important thing to note when using threadpools is that they are rather inelastic; by that I mean we want to have a relatively good idea of our average workload before deciding to use a threadpool. After all, this inelasticity is what can make a threadpool more efficient than spawning threads on demand.
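To make the creation cost concrete, here is a minimal sketch (assuming POSIX threads; this is illustrative and not thool's code) of the spawn-per-task model that a threadpool avoids: every `pthread_create` call sets up a fresh stack and a new schedulable entity just to run a single function, and `pthread_join` tears it all back down.

```c
#include <pthread.h>

/* The task: square the integer passed in through the void* argument. */
static void *square_task(void *arg) {
    int *n = arg;
    *n = *n * *n;   /* write the result back through the argument */
    return NULL;
}

/* Spawn one thread per call, then block until it finishes.
 * Doing this per task is exactly the overhead a threadpool amortizes. */
int square_in_thread(int n) {
    pthread_t tid;
    pthread_create(&tid, NULL, square_task, &n);
    pthread_join(tid, NULL);  /* join also reclaims the thread's resources */
    return n;
}
```

With a pool, the create/join pair happens once per worker rather than once per task.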
I'll try not to get into the details of threadpool implementations too much, as I believe the design considerations offer a better insight into how they operate. The program was written in C and turned out to be far simpler and more compact than I anticipated (which I'm super happy about). For those curious about the technical details, I'll provide a link to the source code of thool below[2]. For now, here are some of the considerations and common patterns that I found interesting while writing thool.
Taskqueue: a task object comprises a function and an argument property. A worker will pull a task from the queue and execute the function using the given arguments (if any were given). There appeared to be two common approaches to implementing a taskqueue: a linked-list or a circular buffer. Using a linked-list offers more flexibility through an unbounded queue, i.e., it can grow indefinitely, opening up the possibility of adopting a truly lock-free, unbounded queueing mechanism. However, with these benefits comes the cost of frequent malloc and free usage, an expensive set of calls that I would like to avoid. The circular buffer, on the other hand, is the antithesis of the linked-list. While it is bounded in how many tasks it can accommodate, we can drastically minimize the number of allocations required. Along with the circular buffer, mutexes become a good default option, as I find them easier to reason about and there is potential to be blocked by the queue size anyway.
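As a rough illustration of the circular-buffer approach (a sketch under my own naming; `task_t`, `taskqueue_t` and the capacity are assumptions, not thool's actual definitions), a bounded ring of pre-allocated slots needs no malloc per task:

```c
#include <stddef.h>

/* A task pairs a function pointer with its argument. */
typedef struct {
    void (*fn)(void *);
    void *arg;
} task_t;

#define QUEUE_CAP 8  /* bounded: the buffer never grows or reallocates */

/* Fixed-size ring: head is the next slot to pop, tail the next to push. */
typedef struct {
    task_t slots[QUEUE_CAP];
    size_t head, tail, len;
} taskqueue_t;

/* Returns 0 on success, -1 if the queue is full (the bounded trade-off). */
int taskqueue_push(taskqueue_t *q, void (*fn)(void *), void *arg) {
    if (q->len == QUEUE_CAP) return -1;
    q->slots[q->tail] = (task_t){fn, arg};
    q->tail = (q->tail + 1) % QUEUE_CAP;  /* wrap around */
    q->len++;
    return 0;
}

/* Returns 0 on success, -1 if the queue is empty. */
int taskqueue_pop(taskqueue_t *q, task_t *out) {
    if (q->len == 0) return -1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->len--;
    return 0;
}

/* Tiny self-check: push two tasks, confirm they pop in FIFO order. */
static void noop(void *arg) { (void)arg; }
int taskqueue_demo(void) {
    taskqueue_t q = {0};
    taskqueue_push(&q, noop, (void *)1);
    taskqueue_push(&q, noop, (void *)2);
    task_t t;
    taskqueue_pop(&q, &t);
    return (int)(size_t)t.arg;  /* first task pushed comes out first */
}
```

The entire queue lives in one allocation (or on the stack), which is the whole point versus a malloc-per-node linked-list.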
Condition variables: these allow a thread to wait when a given scenario is met (e.g. no tasks to process) and be notified or signalled once that state of the system changes. This allows us to avoid the cost of constantly spinning threads, like you would see in a spin-lock for example, and helps to save CPU cycles. In thool, you will find a single cond-var used to signal a thread to wait while the taskqueue is empty and to awaken when there are tasks pending. I believe a binary semaphore could also be used similarly here, but it's not something I am familiar with and it didn't feel entirely appropriate compared to using condition variables.
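The wait/signal pattern can be sketched like so (simplified to a single worker and a single "pending task" flag; this is not thool's code). The key details are that `pthread_cond_wait` atomically releases the mutex while the thread sleeps, and that the predicate is rechecked in a loop because POSIX permits spurious wakeups:

```c
#include <pthread.h>

/* One mutex + one condition variable coordinate a worker that sleeps
 * until work is published, mirroring the single cond-var described above. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  task_ready = PTHREAD_COND_INITIALIZER;
static int pending = 0;   /* stand-in for "taskqueue length" */
static int result  = 0;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    /* Always wait in a loop: recheck the predicate on every wakeup. */
    while (pending == 0)
        pthread_cond_wait(&task_ready, &lock);  /* sleeps with lock released */
    result = pending * 2;  /* "execute" the task */
    pending = 0;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Publish one task, wake the sleeping worker, and wait for it to finish. */
int run_one_task(int value) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_mutex_lock(&lock);
    pending = value;
    pthread_cond_signal(&task_ready);  /* wake one waiter, if any */
    pthread_mutex_unlock(&lock);
    pthread_join(tid, NULL);
    return result;
}
```

No thread ever burns CPU polling the queue; the worker is parked by the kernel until it is signalled.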
Mutexes: two mutexes govern thool's state: thread_count, which was responsible for managing the number of alive and active threads in the threadpool, and taskqueue_lock, which governed which thread had control over the taskqueue's read-write privilege. Fortunately, this synchronisation is mutually exclusive, which means a small number of mutexes can be used to govern threadpool state effectively.
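A hedged sketch of the thread_count idea (the code is illustrative, not thool's): without the mutex, concurrent increments of a shared counter from multiple workers could race and lose updates.

```c
#include <pthread.h>

/* A shared counter like thool's thread count must be guarded by a mutex
 * so that increments from different threads never interleave. */
static int thread_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

static void *register_worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&count_lock);
    thread_count++;               /* critical section: one thread at a time */
    pthread_mutex_unlock(&count_lock);
    return NULL;
}

/* Spawn n threads that each bump the counter, join them, report the total. */
int count_workers(int n) {
    pthread_t tids[64];
    if (n > 64) n = 64;           /* keep the demo's thread array bounded */
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, register_worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return thread_count;
}
```

With the lock in place the final count always equals the number of threads spawned, regardless of scheduling order.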
    
While I'm happy with the finished product, I think there are some interesting areas that I would like to explore with thool. I would like to consider adding a force_add option when queueing tasks onto the threadpool. While this would require a malloc to grow the queue, growing it geometrically would hopefully keep the number of allocations logarithmic in the number of tasks. Finally, I think it would be nice to expand the thool library such that it could be used on Windows machines. This would require writing some kind of wrapper around the pthread library, as it is not available on Windows runtimes. Anyways, I hope someone found this useful in some way, I really enjoyed the process 🤙.