Experiment #324 ended nicely. 😉 This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark. Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study. What that means in practice:
What I discover particularly attention-grabbing is that that is in all probability the primary log anomaly detection mannequin constructed on high of Mamba-3 / SSM, which was solely revealed a number of weeks in the past. The mannequin is small:
For comparison, my previous approach took around 20 hours to train. The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:
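For anyone who hasn't worked with this dataset: HDFS logs are conventionally grouped into per-block sessions via the `blk_` ID, and each block carries a normal/anomalous label. Here's a minimal sketch of that grouping step (the regex and function names are my own illustration of the standard preprocessing, not the author's code):

```python
import re
from collections import defaultdict

# HDFS lines reference block IDs like "blk_-1608999687919862906";
# the standard benchmark groups lines into one session per block.
BLOCK_ID = re.compile(r"blk_-?\d+")

def group_by_block(lines):
    """Map each block ID to the ordered list of log lines mentioning it."""
    sessions = defaultdict(list)
    for line in lines:
        for blk in BLOCK_ID.findall(line):
            sessions[blk].append(line)
    return dict(sessions)

logs = [
    "081109 203518 INFO dfs.DataNode: Receiving block blk_-160899 src ...",
    "081109 203518 INFO dfs.FSNamesystem: blk_-160899 is added to invalidSet",
    "081109 203519 INFO dfs.DataNode: Receiving block blk_7503 src ...",
]
sessions = group_by_block(logs)
print(len(sessions))                  # → 2 distinct blocks
print(len(sessions["blk_-160899"]))   # → 2 lines in the first session
```

Each resulting session is one labeled sequence, which is what the model below actually consumes.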
This benchmark has been used in plenty of papers since 2017, so it's a useful place to test ideas. The part that surprised me most was not just the score, but what actually made the difference. I started with a fairly standard NLP-style approach:
That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough. The breakthrough came when I stopped treating logs like natural language. Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type. So instead of feeding the model raw text, I feed it sequences like this: [5, 3, 7, 5, 5, 3, 12, 12, 5, …] Where, for example:
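To make the template idea concrete: real pipelines usually rely on a template miner such as Drain, but the core mapping can be hand-rolled in a few lines. This toy version (entirely my illustration, not the author's tokenizer) masks variable fields and assigns each distinct template a fresh integer ID:

```python
import re

class TemplateTokenizer:
    """Map raw log lines to integer event-type IDs (one template = one token)."""

    def __init__(self):
        self.template_to_id = {}

    def encode(self, line):
        # Mask variable parts (hex values, numbers) so lines that differ
        # only in their parameters collapse onto the same template.
        template = re.sub(r"0x[0-9a-fA-F]+|\d+(\.\d+)*", "<*>", line)
        if template not in self.template_to_id:
            self.template_to_id[template] = len(self.template_to_id)
        return self.template_to_id[template]

tok = TemplateTokenizer()
lines = [
    "Received block blk_123 of size 67108864",
    "Received block blk_456 of size 67108864",
    "Deleting block blk_123",
]
print([tok.encode(l) for l in lines])  # → [0, 0, 1]
```

The first two lines differ only in the block ID, so they share a template and therefore a token; the vocabulary shrinks from subword units to a few hundred event types.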
That one change did a lot at once:
The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped. The training pipeline was simple:
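The pooling point can be shown without any Mamba code at all. With a causal backbone, the hidden state at the final position already summarizes the whole sequence, so the head should read that position rather than, say, mean-pool over time. A numpy sketch (shapes and the logistic head are illustrative assumptions, not the post's actual architecture):

```python
import numpy as np

def classify_last_token(hidden, w, b):
    """hidden: (batch, seq_len, d_model) states from a causal backbone.

    Because the model is causal, hidden[:, -1, :] has seen the entire
    sequence, so the classifier reads only that position."""
    last = hidden[:, -1, :]               # (batch, d_model)
    logits = last @ w + b                 # (batch,)
    return 1.0 / (1.0 + np.exp(-logits))  # anomaly score in (0, 1)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 128, 32))         # 4 sequences, 128 steps, d_model=32
scores = classify_last_token(h, rng.normal(size=32), 0.0)
print(scores.shape)  # → (4,)
```

Mean-pooling would mix in early positions whose states only saw a prefix, which is presumably what the author means by "respecting" the causal setup.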
The data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model didn't see during training. Another useful thing is that the output isn't just binary. The model gives a continuous anomaly score from 0 to 1. So in production this could be used with multiple thresholds, for example:
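The post doesn't spell out the thresholds, so here is one hypothetical two-tier policy built on that continuous score (the 0.5 / 0.9 cutoffs and action names are placeholders to tune per system):

```python
def triage(score, warn=0.5, alert=0.9):
    """Map a continuous anomaly score in [0, 1] to an action tier."""
    if score >= alert:
        return "page-oncall"      # high-confidence anomaly
    if score >= warn:
        return "log-for-review"   # suspicious, but not page-worthy
    return "ignore"

print([triage(s) for s in (0.12, 0.61, 0.97)])
# → ['ignore', 'log-for-review', 'page-oncall']
```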
Or with an adaptive threshold that tracks the baseline noise level of a particular system. A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That's not exactly new – plenty of AI labs started with games, and many still do – but it's satisfying to see it work in practice. Also, I definitely didn't get here alone. This is a mixture of:
A very rough split:
Now I'll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I might push it further first on BGL, Thunderbird, or Spirit. Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough. Curious what people here think:
If there's interest, I can share more about the preprocessing, training loop, and the mistakes that got me stuck at 60–70% before it finally clicked. P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before. submitted by /u/Adam_Jesion
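For the P.S., a seed sweep usually looks something like the harness below. `train_and_eval` is a hypothetical stand-in for the full pipeline, stubbed with noise around the reported score purely so the harness runs; the real function would retrain the model under each seed:

```python
import random
import statistics

def train_and_eval(seed):
    """Hypothetical stand-in for the real train/evaluate pipeline.

    Stubbed: returns the reported F1 plus seed-dependent noise so the
    harness is runnable without the actual model."""
    rng = random.Random(seed)
    return 0.9975 + rng.uniform(-0.001, 0.001)

seeds = [0, 1, 2, 3, 4]
f1s = [train_and_eval(s) for s in seeds]
print(f"mean F1 = {statistics.mean(f1s):.4f} ± {statistics.pstdev(f1s):.4f}")
```

Reporting mean ± spread across seeds, rather than a single best run, is what makes the headline number believable.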