In this tutorial, we work directly with the A-Evolve framework in Colab and build a complete evolutionary agent pipeline from the ground up. We set up the repository, configure an OpenAI-powered agent, define a custom benchmark, and build our own evolution engine to see how A-Evolve actually improves an agent through iterative workspace mutations. Throughout the code, we use the framework's core abstractions for prompts, skills, memory, benchmarking, and evolution, which help us understand not just how to run A-Evolve, but also how to extend it in a practical, Colab-friendly way.
import os
import sys
import json
import textwrap
import subprocess
import shutil
from pathlib import Path
from getpass import getpass
from collections import Counter, defaultdict
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "openai>=1.30.0", "pyyaml>=6.0", "matplotlib>=3.8"])
REPO_DIR = Path("/content/a-evolve")
if REPO_DIR.exists():
    shutil.rmtree(REPO_DIR)
subprocess.check_call(["git", "clone", "--depth", "1", "<A-EVOLVE-REPO-URL>", str(REPO_DIR)])  # replace the placeholder with the A-Evolve repository URL
sys.path.insert(0, str(REPO_DIR))
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ").strip()
OPENAI_MODEL = "gpt-4o-mini"
import yaml
import matplotlib.pyplot as plt
import agent_evolve as ae
from agent_evolve.protocol.base_agent import BaseAgent
from agent_evolve.benchmarks.base import BenchmarkAdapter
from agent_evolve.engine.base import EvolutionEngine
from agent_evolve.types import Task, Trajectory, Feedback, StepResult
from agent_evolve.contract.workspace import AgentWorkspace
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
WORKSPACE_ROOT = Path("/content/a_evolve_demo_workspace")
if WORKSPACE_ROOT.exists():
    shutil.rmtree(WORKSPACE_ROOT)
(WORKSPACE_ROOT / "prompts").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "skills").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "memory").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "tools").mkdir(parents=True, exist_ok=True)
# Minimal workspace manifest; the exact field names are illustrative and simply
# record where the workspace keeps its prompt, skills, memory, and tools.
manifest = {
    "name": "colab-ae-resolver-agent",
    "system_prompt": "prompts/system.md",
    "skills_dir": "skills",
    "memory_dir": "memory",
    "tools_dir": "tools",
}
with open(WORKSPACE_ROOT / "manifest.yaml", "w") as f:
    yaml.dump(manifest, f, sort_keys=False)
initial_system_prompt = textwrap.dedent("""
You are a precise text-transformation agent.
Solve the task exactly.
Be concise.
Return only the final answer with no explanation unless the task explicitly asks for JSON.
""").strip()
(WORKSPACE_ROOT / "prompts" / "system.md").write_text(initial_system_prompt)

We prepare the full Colab environment needed to run the tutorial from start to finish. We install the required packages, clone the A-Evolve repository, load the framework imports, and securely collect the OpenAI API key for model access. We also define the workspace structure and initialize the manifest and system prompt, giving our evolving agent a valid starting point within the A-Evolve framework.
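Before moving on, it helps to confirm the workspace really has the layout the agent expects. The snippet below is a minimal sketch that only uses the yaml and pathlib calls already imported above: it reloads the manifest and lists the directories and files we just created.

# Sanity check: reload the manifest and print the freshly created workspace layout.
with open(WORKSPACE_ROOT / "manifest.yaml") as f:
    print(yaml.safe_load(f))
for p in sorted(WORKSPACE_ROOT.rglob("*")):
    print("[DIR ]" if p.is_dir() else "[FILE]", p.relative_to(WORKSPACE_ROOT))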
def build_dataset():
    # Training examples (a subset; the remaining rows of the original training set are omitted here).
    train = [
        {
            "id": "train-05",
            "rule": "pipe_unique_sorted_lower",
            "input": "Tokens: Banana, apple, banana, Cherry, apple",
            "answer": "apple|banana|cherry"
        },
        {
            "id": "train-08",
            "rule": "vowel_parity",
            "input": "Word: education",
            "answer": "ODD"
        },
    ]
    holdout = [
        {
            "id": "holdout-01",
            "rule": "json_sum",
            "input": "Numbers: 100, 1, 9",
            "answer": '{"sum":110}'
        },
        {
            "id": "holdout-02",
            "rule": "acronym_upper",
            "input": "Create the acronym from: artificial general intelligence",
            "answer": "AGI"
        },
        {
            "id": "holdout-03",
            "rule": "pipe_unique_sorted_lower",
            "input": "Tokens: Mango, apple, mango, Berry, berry",
            "answer": "apple|berry|mango"
        },
        {
            "id": "holdout-04",
            "rule": "vowel_parity",
            "input": "Word: aeroplane",
            "answer": "ODD"
        },
    ]
    return train, holdout
TRAIN_DATA, HOLDOUT_DATA = build_dataset()
def normalize_text(x: str) -> str:
    return x.strip().replace(" ", "")

class MiniTextBenchmark(BenchmarkAdapter):
    def __init__(self):
        self.train = TRAIN_DATA
        self.holdout = HOLDOUT_DATA

    def get_tasks(self, split: str = "train", limit: int = 10):
        data = self.train if split == "train" else self.holdout
        tasks = []
        for row in data[:limit]:
            tasks.append(
                Task(
                    id=row["id"],
                    input=row["input"],
                    metadata={
                        "rule": row["rule"],
                        "answer": row["answer"]
                    }
                )
            )
        return tasks

    def evaluate(self, task: Task, trajectory: Trajectory):
        pred = trajectory.output.strip()
        gold = task.metadata["answer"].strip()
        success = normalize_text(pred) == normalize_text(gold)
        detail = {
            "rule": task.metadata["rule"],
            "gold": gold,
            "pred": pred,
            "input": task.input,
            "success": success
        }
        score = 1.0 if success else 0.0
        return Feedback(
            success=success,
            score=score,
            detail=json.dumps(detail, ensure_ascii=False),
            raw=detail
        )
SKILL_ROUTING = {
"json_sum": ["json", "sum"],
"acronym_upper": ["acronym", "uppercase"],
"pipe_unique_sorted_lower": ["unique", "sorted", "lowercase", "pipe"],
"vowel_parity": ["vowel", "odd", "even", "parity"]
}
We define the training and holdout datasets used to measure the agent before and after evolution. We build a custom benchmark class that packages each example into A-Evolve tasks and evaluates predictions against exact expected outputs. We also set up the routing hints for skills, which prepares the system to connect different task types with the right behavioral patterns later in the workflow.
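As a quick illustration of how these pieces fit together, the short sketch below (which assumes the Task, Trajectory, and Feedback types behave exactly as they are used above) scores a hand-written prediction against one training task, without calling the model at all.

# Feed the gold answer back through the benchmark; we expect success=True and score=1.0.
bench_check = MiniTextBenchmark()
sample_task = bench_check.get_tasks(split="train", limit=1)[0]
gold_traj = Trajectory(task_id=sample_task.id, output=sample_task.metadata["answer"], steps=[])
fb_check = bench_check.evaluate(sample_task, gold_traj)
print(fb_check.success, fb_check.score)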
class ColabAEResolverAgent(BaseAgent):
    def __init__(self, workspace_dir: str | Path, model: str = OPENAI_MODEL):
        self.model = model
        super().__init__(workspace_dir)

    def _pick_relevant_skills(self, task: Task):
        rule = task.metadata.get("rule", "")
        chosen = []
        for skill in self.skills:
            hay = f"{skill.name} {skill.description}".lower()
            if rule == "json_sum" and ("json" in hay or "sum" in hay):
                chosen.append(skill)
            elif rule == "acronym_upper" and ("acronym" in hay or "uppercase" in hay):
                chosen.append(skill)
            elif rule == "pipe_unique_sorted_lower" and any(k in hay for k in ["unique", "sorted", "lowercase", "pipe"]):
                chosen.append(skill)
            elif rule == "vowel_parity" and any(k in hay for k in ["vowel", "odd", "even", "parity"]):
                chosen.append(skill)
        return chosen[:3]

    def solve(self, task: Task) -> Trajectory:
        relevant_skills = self._pick_relevant_skills(task)
        relevant_skill_texts = []
        for s in relevant_skills:
            relevant_skill_texts.append(self.get_skill_content(s.name))
        memory_text = "\n".join(
            [f"- {m.get('content', '')}" for m in self.memories[-8:]]
        ).strip()
        skill_block = "\n\n".join(relevant_skill_texts).strip()
        if not skill_block:
            skill_block = "(no skills loaded yet)"
        if not memory_text:
            memory_text = "(no memory yet)"
        user_prompt = textwrap.dedent(f"""
        TASK RULE: {task.metadata.get("rule")}
        TASK INPUT:
        {task.input}
        ACTIVE SYSTEM PROMPT:
        {self.system_prompt}
        RELEVANT SKILLS:
        {skill_block}
        RECENT MEMORIES:
        {memory_text}
        Solve the task exactly.
        Return only the final answer.
        """).strip()
        response = client.chat.completions.create(
            model=self.model,
            temperature=0,
            messages=[
                {"role": "system", "content": "You are an exact text-transformation agent."},
                {"role": "user", "content": user_prompt}
            ]
        )
        output = (response.choices[0].message.content or "").strip()
        self.remember(
            content=f"Task {task.id} under rule {task.metadata.get('rule')} produced output: {output}",
            category="episodic"
        )
        return Trajectory(
            task_id=task.id,
            output=output,
            steps=[
                {
                    "rule": task.metadata.get("rule"),
                    "used_skills": [s.name for s in relevant_skills],
                    "system_prompt_chars": len(self.system_prompt),
                    "memory_items_seen": len(self.memories)
                }
            ]
        )
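# Optional sanity probe: run the freshly defined agent on a single holdout task before
# any skills or memories exist. A minimal sketch that reuses the classes defined above
# (it makes one OpenAI call); the probe_* names are illustrative only.
probe_agent = ColabAEResolverAgent(WORKSPACE_ROOT, model=OPENAI_MODEL)
probe_task = MiniTextBenchmark().get_tasks(split="holdout", limit=1)[0]
probe_traj = probe_agent.solve(probe_task)
print("probe output:", probe_traj.output)
print("probe steps :", probe_traj.steps[0])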
SKILL_TEMPLATES = {
    "json_sum": textwrap.dedent("""
    ---
    name: json-sum-exact
    description: Add all integers and output strict compact JSON with the single key sum.
    ---
    # JSON Sum Exact
    Procedure:
    1. Extract all integers from the task input.
    2. Add them.
    3. Return exactly one compact JSON object in this format:
    {"sum":NUMBER}
    4. Do not add spaces, explanations, markdown, or extra keys.
    """).strip(),
    "acronym_upper": textwrap.dedent("""
    ---
    name: acronym-upper-exact
    description: Build an uppercase acronym by taking the first letter of each word.
    ---
    # Acronym Upper Exact
    Procedure:
    1. Identify the phrase after the colon.
    2. Take the first letter of each word.
    3. Convert every letter to uppercase.
    4. Return only the final acronym, with no punctuation or explanation.
    """).strip(),
    "pipe_unique_sorted_lower": textwrap.dedent("""
    ---
    name: pipe-unique-sorted-lower
    description: Normalize tokens to lowercase, deduplicate them, sort ascending, and join them with pipes.
    ---
    # Pipe Unique Sorted Lower
    Procedure:
    1. Read the token list after the colon.
    2. Split by commas.
    3. Trim spaces and lowercase every token.
    4. Remove duplicates.
    5. Sort alphabetically ascending.
    6. Join with "|" and return only the final string.
    """).strip(),
    "vowel_parity": textwrap.dedent("""
    ---
    name: vowel-parity-exact
    description: Count vowels in the word and output ODD or EVEN only.
    ---
    # Vowel Parity Exact
    Procedure:
    1. Read the target word after the colon.
    2. Count vowels using a, e, i, o, u.
    3. If the count is odd, output ODD.
    4. If the count is even, output EVEN.
    5. Return only ODD or EVEN with no extra text.
    """).strip(),
}
PROMPT_APPENDIX = textwrap.dedent("""
## STRICT OUTPUT CONTRACT
- Output only the final answer.
- Never explain your reasoning.
- If a task expects JSON, return compact JSON with exact keys only.
- When a relevant skill exists, follow it literally.
- Exact format is more important than being conversational.
""").strip()

We implement the custom A-Evolve agent that reads the active prompt, skills, and memory from the workspace and uses OpenAI to solve each task. We design the agent so it selects relevant skills, injects recent memory, and returns trajectories in the structure expected by the framework. We also define the skill templates and the strict output contract, which serve as the main components the evolution engine can add to improve performance over time.
class ColabMutationEngine(EvolutionEngine):
    def __init__(self):
        self.cycle_count = 0

    def step(self, workspace: AgentWorkspace, observations, history, trial):
        self.cycle_count += 1
        failed_by_rule = defaultdict(list)
        for obs in observations:
            if not obs.feedback.success:
                failed_by_rule[obs.task.metadata["rule"]].append({
                    "task_id": obs.task.id,
                    "input": obs.task.input,
                    "gold": obs.task.metadata["answer"],
                    "pred": obs.trajectory.output
                })
        mutated = False
        summaries = []
        current_prompt = workspace.read_prompt()
        if "STRICT OUTPUT CONTRACT" not in current_prompt:
            workspace.write_prompt(current_prompt.rstrip() + "\n\n" + PROMPT_APPENDIX + "\n")
            mutated = True
            summaries.append("prompt hardened")
        existing_skill_names = {s.name for s in workspace.list_skills()}
        needed_rule_to_skill_name = {
            "json_sum": "json-sum-exact",
            "acronym_upper": "acronym-upper-exact",
            "pipe_unique_sorted_lower": "pipe-unique-sorted-lower",
            "vowel_parity": "vowel-parity-exact",
        }
        for rule, fails in failed_by_rule.items():
            skill_name = needed_rule_to_skill_name[rule]
            if skill_name not in existing_skill_names:
                workspace.write_skill(skill_name, SKILL_TEMPLATES[rule])
                mutated = True
                summaries.append(f"added skill {skill_name}")
            workspace.add_memory({
                "content": f"Cycle {self.cycle_count}: rule={rule} failed {len(fails)} time(s). Common failure pattern: output formatting or procedure mismatch. Gold examples must be followed exactly.",
                "rule": rule,
                "examples": fails[:2]
            }, category="episodic")
        if not failed_by_rule:
            workspace.add_memory({
                "content": f"Cycle {self.cycle_count}: all current training tasks succeeded. Preserve exact formatting behavior."
            }, category="episodic")
        summary = " | ".join(summaries) if summaries else "no mutation needed"
        return StepResult(
            mutated=mutated,
            summary=summary,
            metadata={
                "failed_rules": list(failed_by_rule.keys()),
                "num_failed_rules": len(failed_by_rule),
                "cycle": self.cycle_count
            }
        )
def evaluate_split(agent, benchmark, split="train"):
    tasks = benchmark.get_tasks(split=split, limit=100)
    rows = []
    total = 0
    correct = 0
    for task in tasks:
        traj = agent.solve(task)
        fb = benchmark.evaluate(task, traj)
        rows.append({
            "task_id": task.id,
            "rule": task.metadata["rule"],
            "input": task.input,
            "gold": task.metadata["answer"],
            "pred": traj.output,
            "score": fb.score,
            "success": fb.success
        })
        total += 1
        correct += int(fb.success)
    score = correct / max(total, 1)
    return score, rows
def print_table(rows, title, max_rows=20):
    print("\n" + "=" * 110)
    print(title)
    print("=" * 110)
    shown = rows[:max_rows]
    for r in shown:
        print(f"[{r['task_id']}] rule={r['rule']}")
        print(f"  input : {r['input']}")
        print(f"  gold  : {r['gold']}")
        print(f"  pred  : {r['pred']}")
        print(f"  score : {r['score']} success={r['success']}")
        print("-" * 110)
def show_workspace(root: Path):
    print("\n" + "=" * 110)
    print("EVOLVED WORKSPACE SNAPSHOT")
    print("=" * 110)
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_dir():
            print(f"[DIR ] {rel}/")
        else:
            print(f"[FILE] {rel}")
def show_skill_contents(root: Path):
    skill_files = sorted((root / "skills").glob("*/SKILL.md"))
    print("\n" + "=" * 110)
    print("SKILL FILES")
    print("=" * 110)
    if not skill_files:
        print("No skill files yet.")
    for sf in skill_files:
        print(f"\n--- {sf.parent.name}/SKILL.md ---")
        print(sf.read_text())

We build a custom evolution engine that inspects failures and decides how to mutate the workspace. We use it to harden the prompt, add missing skills, and store episodic memory so the agent gradually learns better formatting and task-specific behavior across cycles. We also define evaluation and reporting utilities that help us score the agent, inspect predictions, and view the evolved workspace clearly.
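Before wiring everything into the full loop, we can exercise these helpers on their own. The sketch below assumes the agent, benchmark, and evaluate_split defined earlier (and makes one OpenAI call per holdout task); it reports holdout accuracy plus a per-rule breakdown of successes.

# Per-rule success counts on the holdout split, built from the row dicts evaluate_split returns.
check_score, check_rows = evaluate_split(ColabAEResolverAgent(WORKSPACE_ROOT), MiniTextBenchmark(), split="holdout")
per_rule_success = Counter(r["rule"] for r in check_rows if r["success"])
print(f"holdout accuracy: {check_score:.3f}")
print("successes by rule:", dict(per_rule_success))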
benchmark = MiniTextBenchmark()
agent = ColabAEResolverAgent(WORKSPACE_ROOT, model=OPENAI_MODEL)
engine = ColabMutationEngine()
baseline_train_score, baseline_train_rows = evaluate_split(agent, benchmark, split="train")
baseline_holdout_score, baseline_holdout_rows = evaluate_split(agent, benchmark, split="holdout")
print(f"Baseline train score : {baseline_train_score:.3f}")
print(f"Baseline holdout score : {baseline_holdout_score:.3f}")
print_table(baseline_train_rows, "BASELINE TRAIN RESULTS")
print_table(baseline_holdout_rows, "BASELINE HOLDOUT RESULTS")
config = ae.EvolveConfig(
    batch_size=8,
    max_cycles=4,
    egl_window=2
)
evolver = ae.Evolver(
    agent=agent,
    benchmark=benchmark,
    config=config,
    engine=engine
)
result = evolver.run(cycles=4)
print("\n" + "=" * 110)
print("A-EVOLVE RUN SUMMARY")
print("=" * 110)
print(f"Cycles completed : {result.cycles_completed}")
print(f"Final train score: {result.final_score:.3f}")
print(f"Score history : {result.score_history}")
print(f"Converged : {result.converged}")
agent.reload_from_fs()
final_train_score, final_train_rows = evaluate_split(agent, benchmark, split="train")
final_holdout_score, final_holdout_rows = evaluate_split(agent, benchmark, split="holdout")
print(f"\nFinal train score   : {final_train_score:.3f}")
print(f"Final holdout score : {final_holdout_score:.3f}")
print_table(final_train_rows, "FINAL TRAIN RESULTS")
print_table(final_holdout_rows, "FINAL HOLDOUT RESULTS")
show_workspace(WORKSPACE_ROOT)
show_skill_contents(WORKSPACE_ROOT)
print("\n" + "=" * 110)
print("FINAL SYSTEM PROMPT")
print("=" * 110)
print((WORKSPACE_ROOT / "prompts" / "system.md").read_text())
episodic_path = WORKSPACE_ROOT / "memory" / "episodic.jsonl"
if episodic_path.exists():
    print("\n" + "=" * 110)
    print("RECENT EPISODIC MEMORY")
    print("=" * 110)
    lines = episodic_path.read_text().strip().splitlines()
    for line in lines[-10:]:
        print(line)
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(result.score_history) + 1), result.score_history, marker="o")
plt.xlabel("Evolution cycle")
plt.ylabel("Train score")
plt.title("A-Evolve score history")
plt.grid(True)
plt.show()
print("\n" + "=" * 110)
print("COMPARISON")
print("=" * 110)
print(f"Train : {baseline_train_score:.3f} -> {final_train_score:.3f}")
print(f"Holdout : {baseline_holdout_score:.3f} -> {final_holdout_score:.3f}")
improved_rules = []
for before, after in zip(sorted(baseline_train_rows, key=lambda x: x["task_id"]), sorted(final_train_rows, key=lambda x: x["task_id"])):
    if (not before["success"]) and after["success"]:
        improved_rules.append(after["rule"])
print(f"Improved train cases by rule: {dict(Counter(improved_rules))}")
print("\nDone. This notebook used the real A-Evolve framework and demonstrated:")
print("1) a valid agent workspace")
print("2) a BaseAgent subclass")
print("3) a BenchmarkAdapter subclass")
print("4) an EvolutionEngine subclass")
print("5) prompt / skill / memory mutations across A-Evolve cycles")

We put everything together and run the complete A-Evolve loop from baseline evaluation to post-evolution analysis. We measure the agent before training, execute several evolution cycles, reload the workspace, and then compare the final train and holdout performance to see what improves. We also inspect the evolved prompt, skills, memory, and score history, which lets us clearly observe how the framework transforms the agent step by step.
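As a final lightweight check, and assuming the evolution run above completed and the engine appended the prompt appendix, we can confirm the reloaded agent is actually using the hardened prompt.

# After reload_from_fs(), the active system prompt should contain the strict output contract.
print("STRICT OUTPUT CONTRACT" in agent.system_prompt)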
In conclusion, we successfully built and ran a full A-Evolve workflow rather than just inspecting the repository at a surface level. We created a valid workspace, plugged in a custom agent, benchmarked it on structured tasks, and then evolved its behavior by modifying prompts, adding skills, and storing memory across cycles. We also saw how A-Evolve's design lets us treat agent improvement as a repeatable engineering process, in which we measure baseline performance, apply controlled mutations, and track how the system becomes more accurate over time.



