From Celery to ARQ to 'Screw It, I'll Build My Own': A Task Queue Journey
There's a specific kind of optimism that hits you when you first set up Celery. You read the docs, you see the little ASCII celery stalk in the terminal, you fire off your first task, and you think: this is it, this is the abstraction that will carry me to production and beyond.
What I'm Building
StatusDude is a website monitoring service. Users set up monitors, I ping their websites from multiple regions (EU, US, Asia), and tell them when things go down. Simple concept, interesting engineering problem.
At scale, "ping websites" means scheduling tens of thousands of HTTP and TCP checks per minute, routing them to the right regional worker, executing them with proper timeouts and SSL validation, collecting the results, detecting status changes, and firing notifications. All of this needs to happen reliably, quickly, and without falling over when someone adds monitor number 50,001.
The beating heart of this system is the task queue and scheduler. It's what turns "check this website" into an actual HTTP request on an actual server in an actual region. Get the task queue wrong and everything downstream falls apart: checks pile up, notifications arrive late, and your monitoring service becomes the thing that needs monitoring.
Act I: Celery's Greatest Hits (Of Pain)
Celery with Redis as the broker, gevent as the execution pool. Battle-tested, widely deployed, the default answer to "how do I run background tasks in Python." I set --pool=gevent with --concurrency=500 per worker and figured I was good for a while.
When things were running locally, everything on one host? All good! Speeeeeeeeeeeeeeeeeeeeeeeeeeeed!
And then you need to get tasks on remote workers, which means you either set up a Redis cluster (eww!) or use that fantastic global network to reach out and grab some work from a shared Redis instance across the ocean.
The Per-Task Overhead Problem
Here's what nobody tells you about Celery until you're already in production with thousands of tasks per second: every single task is its own little transaction with the broker.
Task arrives in Redis. Worker picks it up — that's a BRPOP. Worker starts executing. Worker finishes. Now it needs to acknowledge the task — that's another round-trip to Redis. With task_acks_late=True (the "safe" mode that acknowledges after execution, so crashed tasks get retried), each ACK is a full network round-trip to the broker.
On localhost? Negligible. Sub-millisecond. You'd never notice.
With a remote Redis? Let's say 50ms latency to your cross-region broker. Each task acknowledgement takes 50ms. That's a hard ceiling of ~20 tasks/second per worker just for the ACK overhead, before you've done any actual work. I had concurrency set to 2000 and Flower was showing 1-3 active tasks. One to three. Out of two thousand available slots.
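That ceiling is plain arithmetic. A quick sanity check, assuming each acknowledgement is one serialized 50ms round-trip as described above:

```python
# Per-task ACK cost against a cross-region broker.
ack_rtt_s = 0.050  # 50 ms round-trip to the remote Redis

# One full round-trip per acknowledgement caps a serialized
# ACK stream at:
max_tasks_per_s = 1 / ack_rtt_s
print(max_tasks_per_s)  # ~20 tasks/second before any actual work happens
```

Concurrency can hide some of this, but the per-task cost never goes away: every slot still pays the round-trip tax on every task.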
I tried task_acks_late=False (acknowledge immediately, accept the risk of losing tasks on crash). Better throughput, but now you've traded reliability for speed, and that's not a trade you want to make in a monitoring system where a lost task means a missed outage.
The Prefetch Multiplier Trap
While debugging the throughput issues, I stumbled into Celery's prefetch multiplier. This setting controls how many tasks each worker prefetches from the queue into local memory. The actual prefetch count is multiplier × concurrency.
The tuning journey:
- multiplier=128 (initial): 128 × 500 greenlets = 64,000 tasks cached locally. Redis queue looks empty, workers look busy, but most tasks are sitting in a Python list inside the worker process, invisible to monitoring. Urgent tasks (rechecks, priority checks) get queued behind tens of thousands of already-prefetched items.
- multiplier=4: Better visibility, queue behaves like a queue again. But the per-task ACK overhead is still there.
- multiplier=100: Tried to find a sweet spot. There isn't one. The per-task ACK overhead is still there.
- multiplier=1: Minimal prefetch. Maximum Redis round-trips. Pick your poison. Per-task ACK overhead is still there.
Maybe I just can't do it.
No value of the prefetch multiplier solves the fundamental problem: each task is still one Redis round-trip for pickup and one for acknowledgement.
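For reference, here is roughly where those knobs live. A hedged sketch of the config in question (the app name and broker URL are placeholders; the setting names are Celery's documented ones):

```python
from celery import Celery

# Placeholder app name and broker URL.
app = Celery("statusdude", broker="redis://remote-broker:6379/0")

app.conf.update(
    task_acks_late=True,           # ACK after execution: safe, but one extra RTT per task
    worker_prefetch_multiplier=4,  # effective prefetch = multiplier x concurrency
)

# Worker launched with the gevent pool, as described above:
#   celery -A statusdude worker --pool=gevent --concurrency=500
```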
One Task, One Result, No Batching
Beyond the broker overhead, there was a more fundamental issue with how Celery handles work. The pattern is: task arrives -> worker picks it up -> worker executes it -> worker processes the result. One task, one execution, one result.
For website monitoring at scale, this means every single check is its own isolated unit. 50,000 monitors at 1-minute intervals = ~800 task completions per second, each independently processing its own result. No batching, no aggregation, no "hey, I just did 200 checks, let me write all the results at once." Every check is a beautiful, independent snowflake that has no idea its 832 siblings exist.
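For the record, the arithmetic behind those numbers:

```python
monitors = 50_000
interval_s = 60  # 1-minute check interval

# Each monitor completes one check per interval, so:
completions_per_s = monitors / interval_s
print(round(completions_per_s))  # ~833 independent results per second
```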
"But What About celery-batches?"
Yes, I know about celery-batches. It's a Celery extension that buffers tasks and processes them as a list. Sounds like exactly what I needed.
Except it requires setting worker_prefetch_multiplier to 0 (unlimited) or to a value higher than the batch size. Remember the prefetch multiplier? Setting it to 0 means the worker will try to pull the entire queue into memory. The celery-batches docs themselves warn that this "can cause excessive resource consumption on both Celery workers and the broker when used with a deep queue." Their words, not mine.
Result handling requires manually calling mark_as_done() on each request. Retries need special treatment. And underneath all of it, you still have the same per-task broker overhead for getting tasks into the worker in the first place.
I looked at it, sighed, and kept scrolling. I usually miss something, and surely I've missed something here as well. Or simply did it wrong. Either way, moving on...
Act II: ARQ — The Honeymoon
ARQ is a small async task queue built on Redis, written by Samuel Colvin (yes, the Pydantic guy). It uses native Python asyncio instead of gevent greenlets. No monkey-patching. No import-order landmines. Just async def and await.
What immediately worked:
- Real async concurrency. 500 concurrent checks actually meant 500 concurrent checks. With aiohttp as the HTTP client, every I/O operation is a real await, not a greenlet pretending to be async. No monkey-patching, no import order games, no guessing which library is secretly blocking.
- No prefetch multiplier tuning. ARQ doesn't prefetch. It pops jobs from a Redis sorted set as they're due. No local buffering, no invisible queues, no ghost tasks, but...
But ARQ Had Its Own Overhead
ARQ solved the gevent mess, but it still had a per-task pickup problem — just a different one.
ARQ's default _poll_iteration uses ZRANGEBYSCORE to find due jobs, then for each job, does a WATCH/MULTI/EXEC transaction to claim it. That's an optimistic locking pattern — 5 Redis round-trips per job. It's designed for correctness when multiple workers compete for the same queue, but with 50ms latency to remote Redis, that's 250ms just to claim a single task. Need to pick up 100 jobs? That's 500 Redis calls and 25 seconds of network time. For a system that needs to process 800 tasks per second, that's not going to work.
I monkey-patched ARQ's _poll_iteration() to use ZPOPMIN with a count — one atomic Redis command that pops up to 1,000 jobs from the sorted set in a single call. Then one MGET to fetch all their data. Two Redis round-trips total for the entire batch, not per job. With 50ms latency, that's 100ms for 1,000 jobs instead of 250,000ms. The math starts mathing again.
The trick is running one worker per queue (per region), so there's no contention — no need for the locking dance. ZPOPMIN is atomic, so even if two workers did race, they'd just split the work. But with dedicated queues per region, it's a non-issue.
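The patched pickup, in spirit. This is a sketch of the idea rather than ARQ's actual internals: it assumes a redis.asyncio client created with `decode_responses=True` (so sorted-set members come back as strings), and the `arq:job:` key prefix, which matches ARQ's default job-data keys.

```python
import asyncio


async def pop_due_jobs(redis, queue_name: str, limit: int = 1000):
    """Claim up to `limit` jobs in two round-trips instead of ~5 per job."""
    # One atomic ZPOPMIN: removes and returns up to `limit` members
    # (job ids, scored by their due timestamp) in a single call.
    popped = await redis.zpopmin(queue_name, limit)
    if not popped:
        return []
    job_ids = [member for member, _score in popped]
    # One MGET for every job payload in the batch.
    payloads = await redis.mget([f"arq:job:{jid}" for jid in job_ids])
    return list(zip(job_ids, payloads))
```

Because ZPOPMIN removes what it returns, a crash between the pop and execution loses that batch; that trade-off is exactly the caveat discussed at the end of this post.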
Yes, I monkey-patched the thing I adopted because it didn't require monkey-patching. The irony is not lost on me.
Even with ARQ and the ZPOPMIN fix, I was still stuck in the one-task-one-result mindset. Each check_monitor task would execute an HTTP check, get the result, and... then what? Write it somewhere? Send a notification? Check if this is a multi-region recheck? If every worker handles its own results, every worker needs to understand the full business logic.
At 800 results per second, having each task individually process its own result means 800 individual writes, 800 status change checks, 800 potential notification evaluations — all happening in parallel across multiple workers with no coordination.
The Batch Processor
Here's where I diverged from the standard task queue pattern entirely.
Instead of workers processing their own results, they just push a JSON blob to a Redis list (LPUSH). That's it. The check is done, the worker moves on to the next task. Zero business logic in the worker.
A separate processor worker runs a continuous while True loop. Each iteration, it atomically pops up to 1,000 results from the Redis list (using LRANGE + LTRIM in a pipeline — atomic, no race conditions), groups them by monitor, and batch-inserts them in a single database transaction.
When the buffer is empty, the processor backs off: 100ms, 200ms, 500ms, up to 1 second. When results start flowing again, it snaps back to full speed. No wasted CPU cycles, no missed results.
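A sketch of that loop. The helper names and key names here are mine, not the production code; `drain_batch` leans on redis-py pipelines being transactional (MULTI/EXEC) by default, which is what makes the LRANGE + LTRIM pair safe against concurrent LPUSHes.

```python
import time

BACKOFF_STEPS = (0.1, 0.2, 0.5, 1.0)  # 100ms -> 200ms -> 500ms -> 1s cap


def next_backoff(idle_rounds: int) -> float:
    """Delay for the Nth consecutive empty poll."""
    return BACKOFF_STEPS[min(idle_rounds, len(BACKOFF_STEPS) - 1)]


def drain_batch(redis, key: str, max_items: int = 1000):
    """Atomically pop up to max_items results: LRANGE + LTRIM in one
    transactional pipeline, so no result is read twice or lost."""
    pipe = redis.pipeline()  # transaction=True by default in redis-py
    pipe.lrange(key, 0, max_items - 1)
    pipe.ltrim(key, max_items, -1)
    items, _ = pipe.execute()
    return items


def run_processor(redis, key: str, handle_batch):
    idle = 0
    while True:
        batch = drain_batch(redis, key)
        if batch:
            handle_batch(batch)  # group by monitor, one DB transaction
            idle = 0             # work is flowing: snap back to full speed
        else:
            time.sleep(next_backoff(idle))
            idle += 1
```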
One smart processor replacing 800 individual per-task writes.
The Elephant in the README
ARQ is officially in maintenance-only mode. The author opened issue #510 saying as much. Last release was v0.27.0 in February 2025. There are 86 open issues, 16 open PRs, and 2.8k stars from people who presumably also noticed the maintenance status and decided they're fine with it.
For my use case, this is actually okay. I need a small, stable async task runner and scheduler — not a framework that ships new features every quarter. ARQ's codebase is compact enough that I can (and do) fork and patch the parts I need. Already monkey-patched the hot path. Practically a co-maintainer at this point, just without the commit access.
So, are we not really using the queue system, just the scheduler? Kinda yes, and also no ;-)
There's one caveat though: a worker crash will essentially lose all the tasks in the current batch, and that is... OK.
I mean, it's not OK for HFT systems, critical billing processing, or even health checks and alerting for those, but then you'd want to be running 1-second checks anyway.
BTW, if that's your thing - Contact Us - and we'll figure something out!