🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 180 (from laksa185)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE · ✅ CRAWLED (5 days ago) · 🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
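The filter chain above is a plain conjunction of per-field checks. A minimal C++ sketch of how such a decision could be combined — the `PageRecord` struct and `is_indexable` function are illustrative assumptions, not the inspector's actual schema:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical flattened page record; field names mirror the filter table
// above, but the struct itself is an illustration, not the real schema.
struct PageRecord {
    int download_http_code = 0;
    double age_months = 0.0;                        // months since download_stamp
    std::optional<std::string> history_drop_reason; // empty optional == isNull(...)
    int fh_dont_index = 0;
    int ml_spam_score = 0;
    std::string meta_canonical;                     // empty string == not set
    std::string src_unparsed;                       // the page's own URL
};

// All five filters must pass for the page to be considered indexable.
bool is_indexable(const PageRecord &p) {
    bool http_ok      = p.download_http_code == 200;
    bool age_ok       = p.age_months < 6.0;
    bool history_ok   = !p.history_drop_reason.has_value();
    bool spam_ok      = p.fh_dont_index != 1 && p.ml_spam_score == 0;
    bool canonical_ok = p.meta_canonical.empty()
                     || p.meta_canonical == p.src_unparsed;
    return http_ok && age_ok && history_ok && spam_ok && canonical_ok;
}
```

For the inspected page (HTTP 200, crawled 0.2 months ago, no drop reason, spam score 0, canonical not set) every check passes, matching the PASS column above.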

Page Details

| Property | Value |
|---|---|
| URL | https://morestina.net/1400/minimalistic-blocking-bounded-queue-for-c |
| Last Crawled | 2026-04-04 18:24:54 (5 days ago) |
| First Indexed | 2025-08-06 14:58:16 (8 months ago) |
| HTTP Status Code | 200 |
| Meta Title | Minimalistic blocking bounded queue in C++ \| More Stina Blog! |
| Meta Description | null |
| Meta Canonical | null |
Boilerpipe Text
While working on speeding up a Python extension written in C++, I needed a queue to distribute work among threads, and possibly to gather their results. What I had in mind was something akin to Python’s `queue` module – more specifically:

- a blocking MPMC queue with fixed capacity;
- that supports `push`, `try_push`, `pop`, and `try_pop`;
- that depends only on the C++11 standard library;
- and has a simple and robust header-only implementation.

Since I could choose the size of the unit of work arbitrarily, the performance of the queue as measured by microbenchmarks wasn’t a priority, as long as it wasn’t abysmal. Likewise, the queue capacity would be fairly small, on the order of the number of cores on the system, so the queue wouldn’t need to be particularly efficient at allocation – while a ring buffer would be optimal for storage, a linked list or deque would work as well.

Other non-requirements included:

- strong exception safety – while certainly an indicator of quality, the work queue did not require it because its payload would consist of smart pointers, either `std::unique_ptr<T>` or `std::shared_ptr<T>`, neither of which throws on move/copy;
- lock-free/wait-free mutation – again, most definitely a plus (so long as the queue can fall back to blocking when necessary), but not a requirement because the workers would spend the majority of their time doing actual work;
- timed versions of blocking operations – useful in general, but I didn’t need them.

Googling C++ bounded blocking MPMC queue gives quite a few results, but surprisingly, most of them don’t really meet the requirements set out at the beginning. For example, this queue is written by Dmitry Vyukov and endorsed by Martin Thompson, so it’s definitely of the highest quality, and way beyond something I (or most people) would be able to come up with on my own. However, it’s a lock-free queue that by design doesn’t expose any kind of blocking operation.
According to the author, waiting for events such as the queue no longer being empty or full is a concern that should be handled outside the queue. This position has a lot of merit, especially in more complex scenarios where you might need to wait on multiple queues at once, but it complicates the use in many simple cases, such as the work distribution I was implementing. In these use cases a thread communicates with only one queue at a time. When the queue is unavailable due to being empty or full, that is a signal of backpressure and there is nothing else the thread can do but sleep until the situation changes. If done right, this sleep should neither hog the CPU nor introduce unnecessary latency, so it is at least convenient that it be implemented as part of the queue.

The next implementation that looks really promising is Erik Rigtorp’s MPMCQueue. It has a small implementation, supports both blocking and non-blocking variants of enqueue and dequeue operations, and covers inserting elements by move, copy, and emplace. The author claims that it’s battle-tested in several high-profile EA games, as well as in a low-latency trading platform. However, a closer look at `push()` and `pop()` betrays a problem with the blocking operations. For example, `pop()` contains the following loop:

```
while (turn(tail) * 2 + 1 != slot.turn.load(std::memory_order_acquire))
    ;
```

Waiting is implemented with a busy loop: the waiting thread is not put to sleep when unable to pop, but instead spins until popping becomes possible. In practice that means that if the producers stall for any reason, such as reading new data from a slow network disk, consumers will angrily pounce on the CPU waiting for new items to appear in the queue. And if the producers are faster than the consumers, then it’s the producers who will spend CPU waiting for free space to appear in the queue.
In the latter case, the busy-looping producers will actually take CPU cycles away from the workers doing useful work, prolonging the wait. This is not to say that Erik’s queue is not of high quality, just that it doesn’t fit this use case. I suspect that the applications the queue was designed for very rarely invoke the blocking versions of these operations, and when they do, they do so in scenarios where they are confident that the blockage is about to clear up.

Then there are a lot of queue implementations on Stack Overflow and various personal blogs that seem like they could be used, but don’t stand up to scrutiny. For example, this queue comes pretty high in search results, definitely supports blocking, and appears to meet the stated requirements. Compiling it does produce a nasty warning, though:

```
warning: operation on 'm_popIndex' may be undefined [-Wsequence-point]
```

Looking at the source, the assignment `m_popIndex = ++m_popIndex % m_size` is an infamous case of undefined behavior in C and C++. The author probably thought he had found a way to avoid the parentheses in `m_popIndex = (m_popIndex + 1) % m_size`, but C++ doesn’t work like that. The problem is easily fixed, but it leaves a bad overall impression.

Another issue is that it requires a semaphore implementation, using Boost’s inter-process semaphore by default. The page does provide a replacement semaphore, but the two in combination become unreasonably heavyweight: the queue contains no fewer than three mutexes, two condition variables, and two counters. Simple queues implemented with mutexes and condition variables tend to have a single mutex and one or two condition variables, depending on whether the queue is bounded. It was at this point that it occurred to me that it would actually be more productive to just take a simple and well-tested classic queue and port it to C++ than to review and correct examples found by googling.
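The intent behind the flagged expression is simply to advance a ring-buffer index. A sketch of the unambiguous form (the `advance` helper is my illustration, not the blog's code):

```cpp
#include <cassert>
#include <cstddef>

// Unambiguous ring-buffer index advance: the index is modified exactly once,
// by the caller's assignment, so no sequencing question arises.
std::size_t advance(std::size_t index, std::size_t size) {
    return (index + 1) % size;
}
```

Writing `m_popIndex = advance(m_popIndex, m_size);` keeps the increment and the wrap-around in a pure function and leaves a single write to the index per statement.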
I started from Python’s queue implementation and ended up with the following:

```
template<typename T>
class queue {
    std::deque<T> content;
    size_t capacity;

    std::mutex mutex;
    std::condition_variable not_empty;
    std::condition_variable not_full;

    queue(const queue &) = delete;
    queue(queue &&) = delete;
    queue &operator = (const queue &) = delete;
    queue &operator = (queue &&) = delete;

public:
    queue(size_t capacity): capacity(capacity) {}

    void push(T &&item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            not_full.wait(lk, [this]() { return content.size() < capacity; });
            content.push_back(std::move(item));
        }
        not_empty.notify_one();
    }

    bool try_push(T &&item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            if (content.size() == capacity)
                return false;
            content.push_back(std::move(item));
        }
        not_empty.notify_one();
        return true;
    }

    void pop(T &item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            not_empty.wait(lk, [this]() { return !content.empty(); });
            item = std::move(content.front());
            content.pop_front();
        }
        not_full.notify_one();
    }

    bool try_pop(T &item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            if (content.empty())
                return false;
            item = std::move(content.front());
            content.pop_front();
        }
        not_full.notify_one();
        return true;
    }
};
```

Notes on the implementation:

- `push()` and `try_push()` only accept rvalue references, so elements get moved into the queue, and if you have them in a variable, you might need to use `queue.push(std::move(el))`. This property allows the use of `std::unique_ptr<T>` as the queue element. If you have copyable types that you want to copy into the queue, you can always use `queue.push(T(el))`.
- `pop()` and `try_pop()` accept a reference to the item rather than returning the item. This provides the strong exception guarantee – if moving the element from the front of the queue to `item` throws, the queue remains unchanged.
Yes, I know I said I didn’t care for exception guarantees in my use case, but this design is shared by most C++ queues, and the exception guarantee follows from it quite naturally, so it’s a good idea to embrace it. I would have preferred `pop()` to just return an item, but that would require the C++17 `std::optional` (or equivalent) for the return type of `try_pop()`, and I couldn’t use C++17 in the project. The downside of accepting a reference is that to pop an item you must first be able to construct an empty one. Of course, that’s not a problem if you use smart pointers, which default-construct to hold `nullptr`, but it’s worth mentioning. Again, the same design is used by other C++ queues, so it’s apparently an acceptable choice.

The use of a mutex and condition variables makes this queue utterly boring to people excited about parallel and lock-free programming. Still, those primitives are implemented very efficiently in modern C++, without the need for system calls in the non-contended case. In benchmarks against fancier queues I was never able to measure the difference on the workload of the application I was using. Is this queue as battle-tested as the ones by Mr Vyukov and Mr Rigtorp? Certainly not! But it does work fine in production, it has passed code review by in-house experts, and most importantly, the code is easy to follow and understand.

The remaining question is – is this a classic example of NIH? Is there really no other way than to implement your own queue? How do others do it? To shed some light on this, I invite the reader to read Dmitry Vyukov’s classification of queues. In short, he categorizes queues across several dimensions: whether they support multiple producers, multiple consumers, or both; the underlying data structure; maximum size; overflow behavior; support for priorities; ordering guarantees; and many, many others. Differences in these choices vastly affect the implementation, and there is no one class that fits all use cases.
If you need extremely low latency, definitely look into lock-free queues like the ones from Vyukov or Rigtorp. If your needs are matched by the “classic” blocking queues, then a simple implementation like the one shown in this article might be the right choice. I would still prefer to have one good implementation written by experts that tries to cover the middle ground, for example a C++ equivalent of Rust’s crossbeam channels. But that kind of thing doesn’t appear to exist for C++ yet, at least not as a free-standing class.

EDIT: As pointed out on reddit, the last paragraph is not completely correct. There is a queue implementation that satisfies the requirements and has most of the crossbeam channels functionality: the moodycamel queue. I did stumble on that queue when originally researching queues, but missed it when writing the article later, since it didn’t come up on the first page of results when googling for C++ bounded blocking MPMC queue. That queue was too complex to drop into the project I was working on, since its size would have dwarfed the rest of the code. But if size doesn’t scare you and you’re looking for a full-featured, high-quality queue implementation, do consider that one.
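To make the notes above concrete, here is a minimal usage sketch of the extracted article's queue with a move-only payload. The condensed queue copy, the single-producer driver, the `nullptr` sentinel, and `sum_via_queue` are my illustration, not part of the original post:

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>

// Condensed copy of the article's queue (push/pop only), so the sketch
// is self-contained.
template<typename T>
class queue {
    std::deque<T> content;
    size_t capacity;
    std::mutex mutex;
    std::condition_variable not_empty, not_full;
public:
    explicit queue(size_t capacity): capacity(capacity) {}
    queue(const queue &) = delete;
    queue &operator=(const queue &) = delete;
    void push(T &&item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            not_full.wait(lk, [this] { return content.size() < capacity; });
            content.push_back(std::move(item));
        }
        not_empty.notify_one();
    }
    void pop(T &item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            not_empty.wait(lk, [this] { return !content.empty(); });
            item = std::move(content.front());
            content.pop_front();
        }
        not_full.notify_one();
    }
};

// One producer, one consumer, move-only payload; the capacity (4) is far
// smaller than the number of items, so both sides block at some point
// and sleep on the condition variables instead of spinning.
int sum_via_queue(int n) {
    queue<std::unique_ptr<int>> q(4);
    int sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i)
            q.push(std::make_unique<int>(i));  // rvalue: moved into the queue
        q.push(nullptr);                       // sentinel: no more work
    });
    std::unique_ptr<int> item;                 // default-constructs to nullptr
    for (;;) {
        q.pop(item);
        if (!item) break;
        sum += *item;
    }
    producer.join();
    return sum;
}
```

The consumer's `item` default-constructs to hold `nullptr`, which illustrates the "must be able to construct an empty one" point from the notes, and the `std::unique_ptr<int>` element type exercises the move-only `push(T &&)` interface.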
Readable Markdown
While working on speeding up a Python extension written in C++, I needed a queue to distribute work among threads, and possibly to gather their results. What I had in mind was something akin to Python’s [`queue`](https://docs.python.org/3/library/queue.html) module – more specifically: - a **blocking** [MPMC](http://www.1024cores.net/home/lock-free-algorithms/queues) queue with **fixed capacity**; - that supports `push`, `try_push`, `pop`, and `try_pop`; - that depends only on the **C++11 standard library**; - and has a simple and robust **header-only** implementation. Since I could choose the size of the unit of work arbitrarily, the performance of the queue as measured by microbenchmarks wasn’t a priority, as long as it wasn’t abysmal. Likewise, the queue capacity would be fairly small, on the order of the number of cores on the system, so the queue wouldn’t need to be particularly efficient at allocation – while a ring buffer would be optimal for storage, a linked list or deque would work as well. Other non-requirements included: - strong exception safety – while certainly an indicator of quality, the work queue did not require it because its payload would consist of smart pointers, either `std::unique_ptr<T>` or `std::shared_ptr<T>`, neither of which throws on move/copy. - lock-free/wait-free mutation – again, most definitely a plus (so long as the queue can fall back to blocking when necessary), but not a requirement because the workers would spend the majority of time doing actual work. - timed versions of blocking operations – useful in general, but I didn’t need them. Googling C++ bounded blocking MPMC queue gives quite a few results, but surprisingly, most of them don’t really meet the requirements set out at the beginning. 
For example, [this queue](http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue) is written by [Dmitry Vyukov](http://www.1024cores.net/home/about-me) and [endorsed by Martin Thompson](https://groups.google.com/d/msg/mechanical-sympathy/yR2imPNvlr0/0QsDoF-3BQAJ), so it’s definitely of the highest quality, and way beyond something I (or most people) would be able to come up with on my own. However, it’s a lock-free queue that doesn’t expose any kind of blocking operation by design. According to the author, waiting for events such as the queue no longer being empty or full is a concern that should be handled outside the queue. This position has a lot of merit, especially in more complex scenarios where you might need to wait on multiple queues at once, but it complicates use in many simple cases, such as the work distribution I was implementing. In these use cases a thread communicates with only one queue at a time. When the queue is unavailable due to being empty or full, that is a signal of backpressure, and there is nothing else the thread can do but sleep until the situation changes. If done right, this sleep should neither hog the CPU nor introduce unnecessary latency, so it is at least convenient for it to be implemented as part of the queue.

The next implementation that looks really promising is [Erik Rigtorp’s MPMCQueue](https://github.com/rigtorp/MPMCQueue). It has a small implementation, supports both blocking and non-blocking variants of the enqueue and dequeue operations, and covers inserting elements by move, copy, and emplace. The author claims that it’s battle-tested in several high-profile EA games, as well as in a low-latency trading platform. However, a closer look at `push()` and `pop()` betrays a problem with the blocking operations.
For example, `pop()` contains the following loop:

```
while (turn(tail) * 2 + 1 != slot.turn.load(std::memory_order_acquire))
    ;
```

Waiting is implemented with a busy loop: a thread that is unable to pop is not put to sleep, it just spins until popping becomes possible. In practice that means that if the producers stall for any reason, such as reading new data from a slow network disk, consumers will angrily pounce on the CPU waiting for new items to appear in the queue. And if the producers are faster than the consumers, then they are the ones who will spend CPU cycles waiting for free space to appear in the queue. In the latter case, the busy-looping producers will actually take CPU cycles away from the workers doing useful work, prolonging the wait. This is not to say that Erik’s queue is not of high quality, just that it doesn’t fit this use case. I suspect that the applications the queue was designed for very rarely invoke the blocking versions of these operations, and when they do, it is in scenarios where they are confident that the blockage is about to clear up.

Then there are a lot of queue implementations on Stack Overflow and various personal blogs that seem like they could be used, but don’t stand up to scrutiny. For example, [this queue](https://vorbrodt.blog/2019/02/03/blocking-queue/) comes up pretty high in search results, definitely supports blocking, and appears to meet the stated requirements. Compiling it does produce a nasty warning, though:

```
warning: operation on 'm_popIndex' may be undefined [-Wsequence-point]
```

Looking at the source, the assignment `m_popIndex = ++m_popIndex % m_size` is an infamous case of [undefined behavior](https://stackoverflow.com/questions/4176328/undefined-behavior-and-sequence-points) in C and C++. The author probably thought he had found a way to avoid the parentheses in `m_popIndex = (m_popIndex + 1) % m_size`, but C++ doesn’t work like that.
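For reference, the index advance can be expressed with a small helper that computes the new value before assigning it (a minimal sketch of my own; the helper name is mine, not from the original code):

```
#include <cstddef>

// Compute the successor of a wrapping ring-buffer index. Unlike
// `m_popIndex = ++m_popIndex % m_size`, which modifies the variable twice
// in one expression (undefined behavior before C++17), this reads the old
// value, computes the new one, and lets the caller assign it once.
inline size_t next_index(size_t index, size_t size) {
    return (index + 1) % size;
}
```

With such a helper, the assignment becomes the unambiguous `m_popIndex = next_index(m_popIndex, m_size);`.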
The problem is easily fixed, but it leaves a bad overall impression. Another issue is that the queue requires a semaphore implementation, using boost’s inter-process semaphore by default. The page does provide a replacement semaphore, but the two in combination become unreasonably heavy-weight: the queue contains no less than three mutexes, two condition variables, and two counters. Simple queues implemented with mutexes and condition variables tend to have a single mutex and one or two condition variables, depending on whether the queue is bounded.

It was at this point that it occurred to me that it would actually be *more* productive to just take a simple and well-tested classic queue and port it to C++ than to review and correct examples found by googling. I started from Python’s queue [implementation](https://github.com/python/cpython/blob/cd7db76a636c218b2d81d3526eb435cfae61f212/Lib/queue.py#L27) and ended up with the following:

```
template<typename T>
class queue {
    std::deque<T> content;
    size_t capacity;

    std::mutex mutex;
    std::condition_variable not_empty;
    std::condition_variable not_full;

    queue(const queue &) = delete;
    queue(queue &&) = delete;
    queue &operator = (const queue &) = delete;
    queue &operator = (queue &&) = delete;

public:
    queue(size_t capacity): capacity(capacity) {}

    void push(T &&item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            not_full.wait(lk, [this]() { return content.size() < capacity; });
            content.push_back(std::move(item));
        }
        not_empty.notify_one();
    }

    bool try_push(T &&item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            if (content.size() == capacity)
                return false;
            content.push_back(std::move(item));
        }
        not_empty.notify_one();
        return true;
    }

    void pop(T &item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            not_empty.wait(lk, [this]() { return !content.empty(); });
            item = std::move(content.front());
            content.pop_front();
        }
        not_full.notify_one();
    }

    bool try_pop(T &item) {
        {
            std::unique_lock<std::mutex> lk(mutex);
            if (content.empty())
                return false;
            item = std::move(content.front());
            content.pop_front();
        }
        not_full.notify_one();
        return true;
    }
};
```

Notes on the implementation:

- `push()` and `try_push()` only accept rvalue references, so elements get *moved* into the queue, and if you have them in a variable, you might need to use `queue.push(std::move(el))`. This property allows the use of `std::unique_ptr<T>` as the queue element. If you have copyable types that you want to copy into the queue, you can always use `queue.push(T(el))`.
- `pop()` and `try_pop()` accept a reference to the item rather than returning the item. This provides the strong exception guarantee – if moving the element from the front of the queue to `item` throws, the queue remains unchanged. Yes, I know I said I didn’t care for exception guarantees in my use case, but this design is shared by most C++ queues, and the exception guarantee follows from it quite naturally, so it’s a good idea to embrace it. I would have preferred `pop()` to just *return* an item, but that would require the C++17 `std::optional` (or equivalent) for the return type of `try_pop()`, and I couldn’t use C++17 in the project. The downside of accepting a reference is that to pop an item you must first be able to construct an empty one. Of course, that’s not a problem if you use smart pointers, which default-construct to hold `nullptr`, but it’s worth mentioning. Again, the same design is used by other C++ queues, so it’s apparently an acceptable choice.
- The use of a mutex and condition variables makes this queue utterly boring to people excited about parallel and lock-free programming. Still, those primitives are implemented very efficiently in modern C++, without the need for system calls in the non-contended case. In benchmarks against fancier queues I was never able to measure a difference on the workload of the application I was using.

Is this queue as battle-tested as the ones by Mr Vyukov and Mr Rigtorp? Certainly not!
But it does work fine in production, it has passed code review by in-house experts, and most importantly, the code is easy to follow and understand.

The remaining question is – is this a classic example of [NIH](https://en.wikipedia.org/wiki/Not_invented_here)? Is there really no other way than to implement your own queue? How do others do it? To shed some light on this, I invite the reader to read Dmitry Vyukov’s [classification of queues](http://www.1024cores.net/home/lock-free-algorithms/queues). In short, he categorizes queues across several dimensions: by whether they support multiple producers, multiple consumers, or both; by the underlying data structure; by maximum size, overflow behavior, support for priorities, ordering guarantees, and many, many others! Differences in these choices vastly affect the implementation, and there is **no** one class that fits all use cases. If you need extremely low latency, definitely look into lock-free queues like the ones from Vyukov or Rigtorp. If your needs are matched by the “classic” blocking queues, then a simple implementation like the one shown in this article might be the right choice.

I would still prefer to have one good implementation written by experts that tries to cover the middle ground, for example a C++ equivalent of Rust’s [crossbeam channels](https://docs.rs/crossbeam/latest/crossbeam/). ~~But that kind of thing doesn’t appear to exist for C++ yet, at least not as a free-standing class.~~

**EDIT** As pointed out on [reddit](https://www.reddit.com/r/cpp/comments/equle2/account_of_search_for_a_minimalistic_bounded/), the last paragraph is not completely correct. There is a queue implementation that satisfies the requirements and has most of the crossbeam channels functionality: the [moodycamel queue](https://github.com/cameron314/concurrentqueue).
I did stumble on that queue when originally researching the topic, but missed it when writing the article later, since it didn’t come up on the first page of results when googling for *C++ bounded blocking MPMC queue*. That queue was too complex to drop into the project I was working on, since its size would have dwarfed the rest of the code. But if size doesn’t scare you and you’re looking for a full-featured, high-quality queue implementation, do consider it.