Security Insights delivers practical security guidance tailored to each Cloudflare account. To uncover these insights, we conduct routine scans across all accounts, zones, and DNS records, identifying potential vulnerabilities and configuration errors.
However, two major challenges surfaced. First, our scanning cadence was too slow—occurring only once every one to two weeks—leaving newly introduced security flaws undetected for as long as fourteen days. Second, automatic scanning was optional for many users on free plans, meaning a significant number of accounts were never scanned at all.
The dangers of delayed or absent scans are growing: as automated cyberattacks become faster and more frequent, the timeframe for catching misconfigurations is narrowing. Ensuring we detect these issues for every customer is essential to our mission of creating a safer, more secure Internet for all.
We determined that to boost scanning frequency and extend automatic scanning to all accounts, we’d need to ramp up our scanning capacity by roughly tenfold—jumping from 10 scans per second to 100 per second. Yet our existing system was already buckling under pressure: millions of events piled up in our processing queue, our API regularly timed out, and our services frequently crashed. We had to stabilize our infrastructure—and make it scalable.
This is the story of how we boosted Security Insights scanning throughput by over 10x, extended security coverage to millions of additional customers, and doubled the scan frequency for everyone. Keep reading to learn how we made it happen.
How we scan for security insights
In broad terms, our automated security scans are initiated by a scheduler. When an account or zone is scheduled for a scan, the scheduler sends one or more messages to Apache Kafka, an open-source platform for distributed event streaming. These messages are then distributed to various checkers—specialized Go microservices designed to inspect specific assets or configurations.
For each message, every checker forwards its findings (i.e., the security insights it detects) to our internal API, which stores them in a Postgres database.
Apache Kafka isn’t technically a queue—it’s a partitioned event stream (though it has recently added queue-like capabilities). Within each partition, messages must be consumed and processed sequentially. This contrasts with traditional queues, where messages may be dequeued in order but processed concurrently. Because of this, only one consumer per partition can be active within a given consumer group.
This design imposes two key limitations on our system:
Messages that take a long time to process prevent the consumer from moving on to the next message.
Each checker can have no more consumers than there are partitions (since each checker operates in its own consumer group).
We considered scaling by adding more partitions, but that would have placed additional strain on the Kafka broker—a shared resource used by many other services. We decided to treat that option as a last resort and instead focused first on refining our code and system architecture.
Introducing parallel processing
Even though messages must be consumed in sequence, there’s no restriction preventing us from handling multiple messages simultaneously.
We updated our checkers to process messages in batches, with each message handled in its own goroutine. The trade-off? If a crash occurs mid-batch, we’d need to reprocess more work, and memory usage would rise slightly. For our use case, both were acceptable compromises.
Avoiding head-of-line blocking
Certain messages handled by some checkers require significantly more time than others. For instance, an account or zone with a large number of assets will naturally take longer to scan. In extreme cases, these messages can take minutes or even hours—compared to the typical processing time of seconds or milliseconds.
We adopted a straightforward solution: we split our consumer groups and checkers into two lanes—the ‘slow lane’ and the ‘fast lane’. We could quickly assess whether a message would be slow or fast to process. If a ‘fast lane’ checker detects a slow message, it simply skips it.
This approach worked perfectly: slow messages received dedicated resources and ample time to complete, while fast messages continued to be processed at their usual rapid pace.
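The lane split can be sketched in a few lines of Go. Everything here is an assumption made for illustration: the `ScanMessage` type, the use of an asset count as the cheap "will this be slow?" signal, the threshold value, and the idea that the slow lane symmetrically skips fast messages (the text only states that the fast lane skips slow ones).

```go
package main

import "fmt"

// Lane identifies which consumer group a checker instance belongs to.
type Lane int

const (
	FastLane Lane = iota
	SlowLane
)

// ScanMessage is a hypothetical scan request. AssetCount stands in for
// whatever cheap signal is used to predict processing time.
type ScanMessage struct {
	ID         string
	AssetCount int
}

// slowThreshold is an assumed cutoff, not a value from the real system.
const slowThreshold = 10_000

// laneFor classifies a message without doing any of the expensive work.
func laneFor(m ScanMessage) Lane {
	if m.AssetCount >= slowThreshold {
		return SlowLane
	}
	return FastLane
}

// shouldProcess reports whether a checker running in checkerLane should
// handle m. A skipped message is simply acknowledged and left to the other lane.
func shouldProcess(checkerLane Lane, m ScanMessage) bool {
	return laneFor(m) == checkerLane
}

func main() {
	small := ScanMessage{ID: "zone-small", AssetCount: 12}
	large := ScanMessage{ID: "account-large", AssetCount: 250_000}

	fmt.Println(shouldProcess(FastLane, small)) // true
	fmt.Println(shouldProcess(FastLane, large)) // false: left for the slow lane
	fmt.Println(shouldProcess(SlowLane, large)) // true
}
```

Because each lane is its own consumer group, both lanes read every message, and the classification decides which one does the work.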
Optimizing our database queries
Every insight we discover is saved to our Postgres database. A single API endpoint handles this process, which our checkers call with a list of insights. The implementation looked like this:
for _, issue := range issues {
    _, err = tx.Exec(ctx, `INSERT INTO table ... VALUES ($1, $2, ...) ON CONFLICT DO UPDATE ...`, ...)
    if err != nil {
        return err
    }
}
Sharp-eyed readers will notice that for large sets of insights, this code sends a separate request to the database for each insight. With a maximum observed size of 500,000, this meant half a million individual requests, queries, and transactions within a single API call.
We initially attempted the recommended method for bulk inserts in Postgres: COPY into a temporary table. However, we discovered that this approach caused bloat in the Postgres system tables.
We ultimately adopted a hybrid strategy:
This gave us the best of both worlds: reasonably fast inserts for massive sets of insights (seconds), and even faster inserts (milliseconds) for small sets of insights.
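The text does not spell out the parts of the hybrid strategy, so the following Go sketch does not try to reproduce it. It only illustrates the general idea of replacing many single-row statements with one multi-row `INSERT`, which is a common way to cut round trips for small and medium batches. The function name, table, and column names are hypothetical, and the `ON CONFLICT` clause is omitted because the original query elides it.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBulkInsert returns a single multi-row INSERT statement with numbered
// placeholders, plus the flattened argument list to pass alongside it.
// table and columns are interpolated directly, so they must come from trusted
// code, never from user input. Postgres allows at most 65,535 bind parameters
// per statement, so callers need to split very large inputs into chunks.
func buildBulkInsert(table string, columns []string, rows [][]any) (string, []any, error) {
	if len(rows) == 0 {
		return "", nil, fmt.Errorf("no rows to insert")
	}
	var sb strings.Builder
	fmt.Fprintf(&sb, "INSERT INTO %s (%s) VALUES ", table, strings.Join(columns, ", "))
	args := make([]any, 0, len(rows)*len(columns))
	for i, row := range rows {
		if len(row) != len(columns) {
			return "", nil, fmt.Errorf("row %d has %d values, want %d", i, len(row), len(columns))
		}
		if i > 0 {
			sb.WriteString(", ")
		}
		sb.WriteString("(")
		for j, v := range row {
			if j > 0 {
				sb.WriteString(", ")
			}
			fmt.Fprintf(&sb, "$%d", len(args)+1)
			args = append(args, v)
		}
		sb.WriteString(")")
	}
	return sb.String(), args, nil
}

func main() {
	query, args, err := buildBulkInsert(
		"insights", // hypothetical table name
		[]string{"zone_id", "insight_type"},
		[][]any{{"z1", "dns"}, {"z2", "tls"}},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
	// INSERT INTO insights (zone_id, insight_type) VALUES ($1, $2), ($3, $4)
	fmt.Println(len(args)) // 4
}
```

With a statement like this, 1,000 insights cost one round trip instead of 1,000.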
Investigating our API timeouts
As we attempted to scale, we observed several unusual behaviors in our internal API:
A significant number of requests were causing client-side timeouts.
Many checkers were spending 20-90% of their processing time on a single API call.
When running a large number of scans, our throughput would start strong and then steadily decline.
All of these issues shared a single root cause: latency.
Our primary database is located in Portland, Oregon. Our API, however, was operating in an active-active setup in both Portland and Amsterdam. Even at the speed of light, a round trip between Portland and Amsterdam takes roughly 50 milliseconds.
Because of this latency, database queries from the Amsterdam API instance took significantly longer, keeping connections from our client-side connection pool open. With the high volume of requests we were sending to the API, the connection pool was quickly depleted, causing timeouts while waiting for an available connection. Our average API call finished in 10 ms from Portland, but took nearly 3 seconds from Amsterdam!
But what caused the drop in message throughput? Each checker process is assigned a specific set of partitions from the Kafka stream to process. Our API is load-balanced. Since we maintain the connection open for the entire duration of the process, some processes were connected to the Amsterdam API, while others were connected to the Portland API. The partitions handled by Portland were processed quickly, but those handled by the Amsterdam-connected processes fell behind:
Kafka lag (the number of messages waiting to be processed within a single consumer group) by partition for one of our checkers. Note that we have 30 partitions in this case. Exactly 15 partitions can be seen falling behind (the lines that reach or approach zero later than around 03/10 03:00). This occurs because the load balancer distributes traffic evenly between our API endpoints.
The fix was straightforward: we switched our API to active-passive, making sure the active API was in the same location as our primary database. Our latency issues vanished overnight.
We had parallelized our Kafka consumers. We had optimized our database queries. We had resolved our API issues. However, we still faced a challenge: we needed to ensure our scans were spread out relatively evenly over time. Queuing all scans simultaneously wasn’t practical, because our Kafka topic uses a time-based retention policy: scans would accumulate in Kafka and eventually be deleted before they could be processed.
Our scheduler wasn’t effective at distributing scans evenly. The number of scans triggered at any given time was erratic and unpredictable. At certain points during the week, hundreds of thousands of scans would be triggered within minutes of each other. What was causing this?
The scheduler triggers scans at fixed, recurring intervals. In pseudocode, the logic looked like this:
Loop continuously:
    Find accounts where last_scheduled_at + scanning frequency <= now
    For each account:
        Trigger scan for account
        Trigger scan for all zones in the account
        Update last_scheduled_at = now
We quickly noticed that last_scheduled_at was identical for a large number of accounts in our database, which contributed to this unevenness.
However, even with a perfectly uniform distribution, increasing our scanning frequency would have made the problem worse. For instance, shortening the scanning interval from 15 days to 7 days would mean roughly 53% of accounts (every account last scanned more than 7 days ago) would suddenly be due for a scan.
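The 53% figure follows from simple proportions, assuming last_scheduled_at values are spread uniformly over the old interval. The helper below is a hypothetical illustration of that arithmetic, not part of the scheduler.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// dueFraction returns the share of accounts that become due immediately when
// the scan interval shrinks from oldInterval to newInterval, assuming their
// last scan times are uniformly distributed across oldInterval.
func dueFraction(oldInterval, newInterval time.Duration) float64 {
	if newInterval >= oldInterval {
		return 0 // lengthening the interval makes nothing newly due
	}
	return float64(oldInterval-newInterval) / float64(oldInterval)
}

func main() {
	f := dueFraction(15*day, 7*day)
	fmt.Printf("%.0f%% of accounts are due at once\n", f*100) // 53% of accounts are due at once
}
```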
There was another issue with this approach. Some accounts contain a very large number of zones. When these accounts were scheduled, a cascade of scans was triggered for all their zones. This overwhelmed our Kafka partitions and caused delays for scans of much smaller accounts.
To resolve these issues, we implemented three key changes:
Schedule zones independently from accounts: each zone receives its own last_scheduled_at field.
Randomize the last_scheduled_at time for existing accounts and zones.
Introduce adaptive rate limiting for scan scheduling.
Scheduling zones independently was an obvious solution for handling large accounts. Randomizing the last_scheduled_at time (while ensuring no scans were delayed during this transition) allowed us to address the existing unevenness in our database.
Adaptive rate limiting is more nuanced. Rate limiting helps prevent a sudden spike in scans when we change scanning frequencies. For example, if we wanted to increase scanning frequency to every 7 days, with 50 million accounts, a rate limit of approximately 83 scans/second would ensure they were distributed evenly across 7 days.
But what if we added 10 million more accounts? At that fixed rate limit, scanning all 60 million accounts would take about 8.4 days instead of 7. This is where the adaptive aspect becomes important: the rate limit is recalculated asynchronously every half-hour based on the total number of accounts and zones we have, along with our scanning frequencies. This ensures scans continue on schedule even as we onboard thousands or millions of new accounts.
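The numbers in the two paragraphs above can be checked with a short calculation. This is a single-tier sketch with hypothetical helper names. It ignores zones, plan tiers, and any safety buffer.

```go
package main

import (
	"fmt"
	"time"
)

const week = 7 * 24 * time.Hour

// requiredRate returns the scans per second needed to cover every account
// exactly once per interval.
func requiredRate(accounts int64, interval time.Duration) float64 {
	return float64(accounts) / interval.Seconds()
}

// timeToScanAll returns how long one full pass takes at a fixed rate.
func timeToScanAll(accounts int64, ratePerSecond float64) time.Duration {
	seconds := float64(accounts) / ratePerSecond
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	fixed := requiredRate(50_000_000, week)
	fmt.Printf("rate for 50M accounts: %.1f scans/s\n", fixed) // 82.7

	// Same rate, 10 million more accounts: the pass no longer fits in 7 days.
	pass := timeToScanAll(60_000_000, fixed)
	fmt.Printf("60M accounts at that rate: %.1f days\n", pass.Hours()/24) // 8.4

	// Recalculating from the current account count restores the 7-day schedule.
	fmt.Printf("recalculated rate: %.1f scans/s\n", requiredRate(60_000_000, week)) // 99.2
}
```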
func computeRate(free, pro, biz, ent int64) rate.Limit {
    r := float64(free)/freeScanInterval.Seconds() +
        float64(pro)/proScanInterval.Seconds() +
        float64(biz)/bizScanInterval.Seconds() +
        float64(ent)/entScanInterval.Seconds()
    // Protect against zero counts. We always want to plan at least one scan per second.
    if r < 1 {
        r = 1
    }
    // Boost rate limit above the 'ideal' value, to provide a buffer in case of any downtime
    // or load surges.
    r *= rateLimitBufferFactor
    return rate.Limit(r)
}
After applying these fixes, our 7-day moving average throughput per checker increased by over 10x.
Prior to these enhancements, we were carrying out around 10 scans per second. The difference between this and our target throughput of 100 scans per second seemed enormous. We talked about dedicating more resources to the issue, adding more partitions to our Kafka topic, and even scrapping our entire architecture.
However, our fixes proved to be the key. Currently, Security Insights maintains over 120 scans per second during peak scheduling, surpassing our 10x improvement target. Our internal API no longer experiences timeouts, and our Kafka lag indicators are much more stable. These scalability upgrades have enabled us to activate automatic scanning for all free accounts and zones and raise the scanning frequency for all customers.
The enhanced system reliability has given us the confidence to develop new features that we were previously unable to create. We’ve introduced the capability to perform detailed on-demand scans. You can now manually re-scan a Cloudflare account, zone, insight, or insight type.
Initiating a detailed on-demand scan from the Security Overview page in the Cloudflare dashboard
The takeaway from our experience is that it’s essential to thoroughly understand the current system before discarding anything. By carefully examining our code, SQL queries, logs, and metrics (especially metrics!), we managed to expand our capacity without simply adding more pods or partitions. By challenging our assumptions, investigating unusual-looking metrics, and avoiding easy shortcuts (like raising API client-side timeouts), we built a more dependable and resilient system.
Allocating more resources to the problem might occasionally be the solution, but at Cloudflare, we prefer to engineer our way through challenges.
Security Insights scans are turned on by default across all Cloudflare plans. Sign in to the Cloudflare dashboard today to review and manage your security insights.



