In this tutorial, we delve into the lambda/hermes-agent-reasoning-traces dataset to uncover how agent-based models reason, interact with tools, and craft responses throughout multi-turn dialogues. We begin by loading and exploring the dataset, reviewing its layout, categories, and conversation structure to gain a solid understanding of the data at hand. Next, we develop straightforward parsers to isolate essential elements—such as reasoning traces, tool invocations, and tool outputs—enabling us to distinguish between internal deliberation and external tool interactions. We then examine patterns including how often tools are used, how long conversations tend to be, and how frequently errors occur, all to build a clearer picture of agent behavior. To make these insights more accessible, we generate visualizations that illustrate key trends. Lastly, we reformat the dataset into a structure ready for training, making it well-suited for applications like supervised fine-tuning.
!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl
import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets
random.seed(0)
CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))
COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))
sample = ds[0]
print("n=== Sample 0 ===")
print("id :", sample["id"])
print("category :", sample["category"], "/", sample["subcategory"])
print("task :", sample["task"])
print("turns :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...n")We install all required libraries and import the necessary modules to prepare our environment. After that, we load the lambda/hermes-agent-reasoning-traces dataset and examine its structure, fields, and categories. We also have the option to merge multiple dataset configurations and inspect a sample entry to get familiar with the conversation format.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)
def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}
def parse_tool(value: str):
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try: return json.loads(body)
    except: return {"raw": body}
first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]]) We create regex-based parsers to pull out reasoning traces, tool calls, and tool responses from the dataset. Assistant messages are processed to neatly separate internal thoughts, tool actions, and final replies into a structured format. We then run the parser on a sample conversation to confirm that the extraction works as expected.
N = 3000
sub = ds.select(range(min(N, len(ds))))
tool_calls = Counter()
parallel_widths = Counter()
thoughts_per_turn = []
calls_per_traj = []
errors_per_traj = []
turns_per_traj = []
cat_counts = Counter()
for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)
print(f"nScanned {len(sub)} trajectories")
print(f"Avg turns/traj : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools :", tool_calls.most_common(10))
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")
ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")
axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")
cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")
plt.tight_layout(); plt.show()
We carry out dataset-wide analysis to measure tool usage, conversation lengths, and error patterns. By aggregating statistics across thousands of samples, we build a comprehensive understanding of how agents behave. We also produce visualizations that bring these trends to life, showing tool frequency, parallel call patterns, and how categories are distributed across the dataset.
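If a per-category breakdown is useful on top of the aggregate numbers, the sketch below (an optional addition that reuses the per-trajectory lists collected in the loop above) summarizes turns, tool calls, and error rates with pandas.
# Optional per-category summary built from the per-trajectory statistics above.
df = pd.DataFrame({
    "category": [ex["category"] for ex in sub],
    "turns": turns_per_traj,
    "tool_calls": calls_per_traj,
    "errors": errors_per_traj,
})
summary = (df.groupby("category")
             .agg(trajectories=("turns", "size"),
                  avg_turns=("turns", "mean"),
                  avg_tool_calls=("tool_calls", "mean"),
                  error_rate=("errors", lambda e: (e > 0).mean()))
             .sort_values("trajectories", ascending=False))
print(summary.round(2))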
def render_trace(ex, max_chars=350):
print(f"n{'='*72}nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}n{'='*72}")
for t in ex["conversations"]:
role = t["from"]
if role == "system":
continue
if role == "human":
print(f"n[USER]n{ww.shorten(t['value'], 600)}")
elif role == "gpt":
p = parse_assistant(t["value"])
for th in p["thoughts"]:
print(f"n[THINK]n{ww.shorten(th, max_chars)}")
for c in p["tool_calls"]:
args = j.dumps(c.get("arguments", {}))[:200]
print(f"[CALL] {c.get('name')}({args})")
if p["final"]:
print(f"n[ANSWER]n{ww.shorten(p['final'], max_chars)}")
elif role == "tool":
print(f"[TOOL_RESPONSE] {ww.shorten(t['value'], 220)}")
print("="*72)
idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])
def get_tool_schemas(ex):
    try: return json.loads(ex["tools"])
    except: return []
schemas = get_tool_schemas(sample)
print(f"nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
fn = s.get("function", {})
print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
def to_openai_messages(conv):
return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]
example_msgs = to_openai_messages(sample["conversations"])
print("nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
print(" ", m["role"], "→", m["content"][:120].replace("n", " "), "...")To gain a clearer picture, we create tools that display entire conversation logs in an easy-to-read layout. We also pull out tool definitions and reshape the dataset into the OpenAI message structure, making it ready for use in training workflows. This makes it much simpler to grasp both how tools are set up and how dialogues can be standardized.
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)
def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]
ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")
think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
for t in ex["conversations"]:
if t["from"] != "gpt": continue
p = parse_assistant(t["value"])
for th in p["thoughts"]: think_lens.append(len(th))
for c in p["tool_calls"]: call_lens.append(len(j.dumps(c)))
if p["final"]: ans_lens.append(len(p["final"]))
plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()
class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)
    def __len__(self): return len(self.steps)
    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙️ {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")
rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
    rp.play(i)
TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig
    train_subset = ds.select(range(200))
    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch
    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tune demo finished.")
print("n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
"tokenized + label-masked SFT examples, and an optional training hook.") We convert the dialogues into tokens and use label masking so that only the assistant’s replies are factored into the learning process. We examine how long the reasoning steps, tool calls, and answers tend to be, giving us a deeper look into their characteristics. We also build a trace player that lets you walk through an agent’s actions step by step, and optionally run a small-scale fine-tuning session.
All in all, we built a well-organized process to parse, study, and work with agent reasoning logs. We broke conversations into clear, understandable parts, explored how agents think through problems, and tracked how they use tools along the way. Through charts and analysis, we uncovered recurring patterns and behaviors within the dataset. We also transformed the data into a format ready for training language models, taking care of both the tokenization and the label masking for assistant outputs. This whole approach lays a solid groundwork for studying, evaluating, and enhancing tool-driven AI systems in a hands-on, scalable manner.