I recently had the chance to evaluate five popular SIEM solutions as part of a judging panel for a security award. While each platform had its own distinctive flair, their core promises were remarkably consistent:
- 24/7/365 SOC monitoring: Round-the-clock coverage backed by global experts to validate and prioritize alerts.
- Proactive threat hunting: Active searches for hidden threats rather than simply waiting for automated triggers.
- AI and machine learning integration: Leveraging everything from basic anomaly detection to “agentic AI” to reduce noise and accelerate investigations.
- Active incident response and containment: Capabilities to isolate endpoints or disable compromised users to stop lateral movement.
- Third-party tool integrations: Ingesting telemetry from the “native stack” and third-party tools like CrowdStrike or Microsoft Defender.
- Continuous intelligence updates: Constant streams of new detection rules and playbooks based on global research.
- Service-level guarantees: Financial credits or pricing adjustments for broken SLOs.
These offerings are impressive, yet a glaring omission stood out: none of them discussed how they handle multi-tenancy. In a cloud-native world, it is highly likely that most, if not all, of these providers operate on shared infrastructure. That means they are not immune to the “noisy neighbor” effect, a phenomenon where a single misbehaving tenant can degrade the security posture of everyone else on the platform.
The noisy neighbor effect
As security operations move toward cloud-native frameworks to handle the exponential growth of telemetry data (often reaching petabytes of logs), they rely on the elasticity of software-as-a-service (SaaS). However, sharing physical resources (including CPU, memory and I/O) among independent customers introduces a significant engineering risk.
When one tenant’s workload consumes a disproportionate share of those resources, it creates a bottleneck. For other tenants, this translates to increased ingestion latency, delayed threat detection and violated SLAs. In security, a “delayed” alert is often as useless as no alert at all.
The multi-tenant paradox
The core appeal of multi-tenant SIEM solutions is efficiency: shared infrastructure leads to lower costs and unified management. Yet, without deliberate engineering, this becomes a zero-sum game. In a naive system, a high-volume tenant can saturate the ingestion pipeline, causing “starvation” for smaller tenants. This breaks the real-time detection and response (RTDR) promise that these companies market so heavily.
The key distinction is that multi-tenancy doesn’t have to be zero-sum. The fairness strategies explored in this article exist precisely to prevent that outcome, but only if vendors have invested in them. The silence in marketing materials suggests many haven’t.
Why fairness is an engineering problem
Engineering “fairness” is not merely about setting hard limits; it’s about sophisticated resource orchestration. I highly recommend reading AWS’s paper on fairness in multi-tenant systems. A rigid cap might protect the system, but punish a client during a genuine security emergency when they need ingestion capacity most. Conversely, a completely open system is vulnerable to cascading failures.
To solve this, engineers must move beyond simple rate limiting and embrace “fair share” scheduling, intelligent queuing and dynamic resource allocation. This article explores the architectural strategies required to ensure that every tenant receives the performance they were promised, even when their neighbor’s house is on fire.
The anatomy of a contemporary SIEM
To understand where fairness fails in a multi-tenant environment, we must first dissect the anatomy of a modern SIEM. It is not a monolithic database, but a distributed data pipeline designed to ingest, transform and analyze petabytes of telemetry. The pipeline relies on decoupling producers from consumers using message queues, ensuring that a spike in one layer doesn’t necessarily lead to a total system failure.
The ingestion layer
The ingestion layer is the system’s front door. It is responsible for collecting raw telemetry from diverse sources such as EDR agents, cloud APIs and firewalls. To handle the “firehose” of incoming data, which can spike unpredictably during a security incident, this layer doesn’t process data immediately. Instead, it acts as a high-throughput buffer, writing raw events directly into a raw event queue (typically Apache Kafka). This decoupling is critical because it ensures that even if downstream processing layers are slow, the system can still accept incoming logs without data loss.
The normalization layer
The normalization layer consumes raw events from the initial queue. Its primary role is to bring order to chaos by parsing heterogeneous log formats (JSON, XML or Syslog) into a structured schema such as the common information model (CIM). This involves CPU-intensive tasks like regex matching, field extraction and enrichment. Once processed, the structured events are published to a second, normalized event queue. This central bus becomes the single source of truth for all downstream consumers.
The rule-based detection layer (real-time)
The first consumer of the normalized queue is the rule-based detection layer, in recent years often powered by streaming engines like Apache Flink. This layer is optimized for speed, executing low-latency, rule-based logic on events as they flow through the pipe. It handles high-volume, simple detections, such as “five failed logins in one minute,” in milliseconds. By alerting on these patterns immediately, it reduces time-to-detect for critical threats without waiting for data to be indexed.
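A minimal sketch of that “five failed logins in one minute” rule as a sliding window, in plain Python. Engines like Flink keep this state distributed and fault-tolerant; the threshold and window here are just the example values from above.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5

# Per-user timestamps of recent failed logins (the rule's state).
failed_logins: dict[str, deque] = defaultdict(deque)

def on_failed_login(user: str, timestamp: float) -> bool:
    """Record a failed login; return True when the rule should alert."""
    window = failed_logins[user]
    window.append(timestamp)
    # Evict failures that fell out of the sliding window.
    while window and window[0] <= timestamp - WINDOW_SECONDS:
        window.popleft()
    return len(window) >= THRESHOLD
```

The key challenge noted in the summary table below is visible even here: this state must be kept for every user seen across every tenant, simultaneously.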
The ad-hoc search layer
Parallel to the streaming engine, the ad-hoc search layer also consumes from the normalized queue. This system (often using Elasticsearch or Splunk indexers) is optimized for human interaction. It indexes the data to support sub-second search and retrieval, enabling security analysts to perform investigations and threat hunting. While the streaming layer finds known threats, this layer helps analysts find the unknown ones through interactive querying.
The storage layer (long-term retention)
Concurrently, a third consumer reads from the normalized queue to persist data into the storage layer. This layer is architected for durability and cost-efficiency, typically writing data to object storage (like Amazon S3) in a columnar format (such as Parquet). This “cold storage” satisfies data retention requirements at a fraction of the cost of the high-performance search tier, effectively decoupling retention from compute.
The analytics and correlation layer (batch)
Finally, the analytics and correlation layer consumes data from the storage layer. Unlike the streaming engine, which looks at individual events in motion, this layer executes complex queries over vast historical datasets. It runs scheduled jobs to detect subtle patterns, such as “beaconing to a rare domain over thirty days,” that require analyzing long time windows. By reading from storage rather than the real-time stream, it isolates these resource-intensive jobs from the ingestion and search pipelines.
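As a toy illustration of what such a batch job computes, here is one common beaconing heuristic: connections to a domain at nearly uniform intervals (low jitter relative to the mean interval) suggest automated check-ins rather than human browsing. The jitter threshold is an arbitrary illustrative value, not a production-tuned one.

```python
from statistics import mean, pstdev

def looks_like_beaconing(timestamps: list[float], max_jitter: float = 0.1) -> bool:
    """Heuristic: True if connection times are suspiciously regular.

    Computes the gaps between consecutive connections and flags the
    series when the gaps' standard deviation is small relative to
    their mean (i.e., the traffic is nearly periodic).
    """
    if len(timestamps) < 4:
        return False  # too few samples to judge periodicity
    ts = sorted(timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(intervals)
    return avg > 0 and pstdev(intervals) / avg < max_jitter
```

Run over thirty days of per-domain connection logs, a scan like this touches enormous amounts of data, which is exactly why it belongs in the batch layer rather than the real-time stream.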
Summary of SIEM layers
| Layer | Primary function | Key challenge |
| --- | --- | --- |
| Ingestion | Collects raw logs and buffers them into a raw queue. | Handling massive throughput spikes without data loss. |
| Normalization | Parses raw logs into a common schema and publishes to a normalized queue. | High CPU overhead from regex parsing and enrichment. |
| Rule-based detection | Consumes the normalized stream for fast, rule-based alerting. | Managing state and windowing for millions of concurrent entities. |
| Ad-hoc search | Indexes normalized data for fast, interactive investigation. | Unpredictable resource consumption from complex analyst queries. |
| Storage | Persists normalized data for long-term retention. | Optimizing file formats (Parquet or Avro) for efficient reads and writes. |
| Analytics | Executes complex batch queries against storage. | Scheduling long-running jobs without impacting other workloads. |
Strategies to encode fairness
Without deliberate intervention, shared infrastructure will always favor the loudest voice. To build a resilient SIEM, engineers must implement strategies that enforce isolation and ensure equitable resource distribution. These strategies generally fall into three categories: admission control, tenant-aware scheduling and resource partitioning.
Admission control and rate limiting
The first line of defense sits at the very front of the ingestion pipeline. Admission control ensures that a single tenant cannot flood the raw event queue beyond a certain threshold. However, modern SIEMs move beyond “hard” rate limits (where data is simply dropped) and instead use “soft” limits or shaping.
A common approach is the token bucket algorithm. Each tenant is allocated a certain number of tokens per second, representing their licensed ingestion rate. During a spike, they can consume accumulated tokens to “burst” above their limit for a short duration. Once the bucket is empty, the system can begin “shaping” the traffic, introducing slight delays to the ingestion of that particular tenant’s logs to protect the system’s global stability without immediately discarding critical security data.
In practice: A tenant contracted at 10,000 events per second might be permitted to burst to 15,000 EPS for up to 60 seconds by drawing on their accumulated token reserve. A real incident producing 20,000 EPS would exhaust the bucket and trigger shaping: their logs slow down, but nothing is dropped. Meanwhile, every other tenant on the platform continues processing at full speed.
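A minimal sketch of a per-tenant token bucket that shapes rather than drops, in Python. The class and method names are hypothetical; real implementations typically live in the ingestion gateway and share bucket state across nodes.

```python
import time

class TokenBucket:
    """Per-tenant token bucket: tokens refill at the contracted rate,
    and an accumulated reserve allows short bursts above it."""

    def __init__(self, rate_eps: float, burst_capacity: float):
        self.rate = rate_eps            # contracted events per second
        self.capacity = burst_capacity  # maximum token reserve
        self.tokens = burst_capacity    # start with a full reserve
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def admit(self, batch_size: int) -> float:
        """Return the delay in seconds to apply before ingesting a batch.

        0.0 means admit immediately; a positive value means "shape"
        (slow down) rather than drop, so no security data is discarded.
        """
        self._refill()
        if self.tokens >= batch_size:
            self.tokens -= batch_size
            return 0.0
        deficit = batch_size - self.tokens
        self.tokens = 0.0
        return deficit / self.rate  # time needed to earn the missing tokens
```

Sizing the reserve at, say, `burst_capacity = 300_000` for a 10,000 EPS tenant yields roughly 60 seconds of headroom at 15,000 EPS, matching the contract described above.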
Tenant-aware fair-share scheduling
Inside the processing layers (such as normalization or analytics), the system must decide which tenant’s tasks to execute next. In a naive “first-in, first-out” (FIFO) model, a massive batch of logs from one tenant will block everyone else.
Engineers solve this by implementing weighted fair queuing (WFQ). Instead of one giant queue for all events, the system maintains virtual queues for each tenant. The scheduler cycles through these queues, picking a small batch of events from each. This ensures that a small tenant with only ten events per second never has to wait behind a large tenant processing ten million. This “interleaving” of processing tasks guarantees that every customer makes progress, regardless of their neighbor’s activity.
In practice: In a Kafka-backed SIEM, this can be implemented by assigning each tenant their own partition (or partition group) within a topic. Normalization consumers are then configured to process a bounded number of records per tenant per poll cycle, cycling through partitions in round-robin order. A tenant producing a 50x spike in log volume fills up their own partition, but the consumer never spends more than its fair share of processing time on that partition before moving on to the next tenant.
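The round-robin interleaving can be sketched in a few lines, with plain deques standing in for per-tenant Kafka partitions. The function name and batch size are illustrative; a real consumer would bound its work per poll cycle rather than draining to completion.

```python
from collections import deque

def fair_schedule(tenant_queues: dict[str, deque], batch_size: int) -> list:
    """Drain per-tenant queues round-robin, taking at most `batch_size`
    events from each tenant per pass, so no tenant can monopolize the
    consumer no matter how deep its backlog is."""
    processed = []
    while any(tenant_queues.values()):
        for tenant, queue in tenant_queues.items():
            for _ in range(min(batch_size, len(queue))):
                processed.append((tenant, queue.popleft()))
    return processed
```

Note the fairness property: a tenant with one pending event is served within the first pass, even if a neighbor has millions queued.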
Virtual resource isolation (quotas and reservations)
For components like the ad-hoc search layer, where resource usage is highly unpredictable, engineers use resource partitioning. This involves setting up logical boundaries within the shared compute pool.
Through resource quotas, the SIEM provider can cap the maximum CPU and memory a single tenant’s queries can consume at any given time. Some advanced architectures take this a step further with guaranteed reservations. A high-tier customer might be guaranteed a specific share of the cluster’s resources, ensuring that even during a global system spike, their SOC analysts can still run search queries with the sub-second latency they expect.
In practice: In Elasticsearch, this can be implemented through a combination of per-node search thread pool sizing and query-level circuit breakers. A tenant’s queries can be routed to a dedicated set of nodes (using shard allocation filtering), and circuit breaker limits can be enforced per tenant at the coordinating node layer. The result is that a runaway analyst query generating an expensive aggregation across 90 days of data will hit its memory ceiling and fail gracefully, rather than cascading across the entire cluster.
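The circuit-breaker idea reduces to simple accounting, sketched below in Python. This is a toy model of the concept, not Elasticsearch’s actual breaker implementation; the class name and byte estimates are hypothetical.

```python
class TenantCircuitBreaker:
    """Track estimated memory in flight per tenant and reject queries
    that would exceed the tenant's ceiling, so one runaway aggregation
    fails fast instead of destabilizing the shared cluster."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.in_flight = 0

    def reserve(self, estimated_bytes: int) -> None:
        """Claim memory before running a query; raise if over quota."""
        if self.in_flight + estimated_bytes > self.limit:
            raise MemoryError(
                f"query rejected: {self.in_flight + estimated_bytes} bytes "
                f"would exceed the {self.limit}-byte tenant quota")
        self.in_flight += estimated_bytes

    def release(self, estimated_bytes: int) -> None:
        """Return memory when the query completes or is cancelled."""
        self.in_flight = max(0, self.in_flight - estimated_bytes)
```

The important design choice is that rejection happens before execution, based on an estimate: a cheap upfront failure is far better than an out-of-memory crash that takes every tenant’s queries down with it.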
Per-tenant buffering and decoupled processing
In a highly resilient SIEM, backpressure (where a downstream failure forces the front end to stop accepting data) should be avoided. Instead of pressuring the ingestion layer to stop, the system uses the queues positioned between each layer as shock absorbers.
By implementing per-tenant virtual partitions within these queues, the system can ensure that a bottleneck in the storage or search layers only affects the processing speed of the responsible tenant. If one tenant’s data is being written too slowly, their specific virtual queue grows, while others continue to process at full speed. This results in delayed detection for the “noisy” tenant, but it guarantees data completeness. The system eventually catches up without ever dropping a log or impacting the real-time performance of the rest of the platform.
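A minimal sketch of those per-tenant shock absorbers, assuming a dictionary of deques stands in for the virtual partitions and the storage writer is stubbed out. All names are illustrative.

```python
from collections import defaultdict, deque

class PerTenantBuffer:
    """Shock-absorber queues between pipeline layers: each tenant gets
    its own virtual queue, so a slow downstream writer for one tenant
    only grows that tenant's backlog."""

    def __init__(self):
        self.queues: dict[str, deque] = defaultdict(deque)

    def enqueue(self, tenant: str, event: dict) -> None:
        # Never reject: completeness is prioritized over latency.
        self.queues[tenant].append(event)

    def drain(self, tenant: str, writer_capacity: int) -> int:
        """Hand up to `writer_capacity` events to the (stubbed) storage
        writer for one tenant; return the backlog that remains."""
        q = self.queues[tenant]
        for _ in range(min(writer_capacity, len(q))):
            q.popleft()  # in a real system: write to storage/search
        return len(q)
```

The trade made explicit here is latency for completeness: the noisy tenant’s backlog grows and their detection lags, but no log is ever dropped and no other tenant slows down.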
The ultimate isolation: Physical vs. logical
The strategies above address fairness within shared infrastructure. But for certain organizations, the right answer is no sharing at all.
In a modern cloud environment, it is entirely feasible to provision an entire, independent SIEM stack per tenant. This “cluster-per-tenant” model eliminates the noisy neighbor problem outright because there are no neighbors. Each customer’s ingestion pipeline, normalization workers, search nodes and storage buckets are fully dedicated to their own workload.
The compliance implications alone make this worth serious consideration. Frameworks like FedRAMP, ITAR and CJIS often have explicit or implicit requirements around compute and data isolation that a shared multi-tenant cluster cannot satisfy without significant architectural gymnastics. A dedicated cluster satisfies these requirements cleanly, reduces audit surface area and simplifies the evidence chain during compliance reviews.
The trade-off is cost. Dedicated clusters carry significantly higher per-tenant overhead: idle compute must be provisioned to handle peak loads, management complexity scales with cluster count and the economies of scale that make shared SaaS attractive are partially surrendered. In practice, providers who offer this model typically charge a meaningful premium (often 2-3x the multi-tenant equivalent) and reserve it for enterprise or public sector customers with specific regulatory requirements.
The practical framework for security leaders evaluating this decision is straightforward. If your organization operates under a compliance framework that names compute or data isolation as a requirement, start with the dedicated-cluster conversation. If your primary concern is detection performance and cost, invest time instead in understanding how deeply a vendor has engineered fairness into its shared environment, because that engineering is what determines whether the multi-tenant promise holds when it matters most.
Conclusion
The silence regarding multi-tenancy in leading SIEM marketing is a risk that security leaders shouldn’t ignore. As telemetry volumes continue to explode, the engineering behind “fairness” becomes just as critical as the AI detecting the threats.
A good SIEM solution should offer the best of both worlds: the flexibility of a multi-tenant cluster where fairness is deeply engineered into every layer, combined with the option to deploy dedicated, physically isolated clusters for organizations with extreme performance or compliance needs. Until SIEM providers are transparent about how they manage the noisy tenants next door, the promise of 24/7/365 security remains vulnerable to the activity of a neighbor you didn’t even know you had.
This article is published as part of the Foundry Expert Contributor Network.