Many Operations Research practitioners have hit the same wall when using AI to build mathematical optimization models for real business problems. The tools do well on textbook examples but falter on actual company datasets and practical scenarios.
This gap is not accidental. It follows from how these tools were designed, and it is the reason ORPilot exists.
The Potential of AI-Driven Optimization
For decades, Operations Research (OR) has supported pivotal business decisions, from routing fleets and scheduling production to designing supply chains and allocating cargo. The underlying mathematics is well developed, and advanced solvers are readily available. The main bottleneck has been the specialized knowledge needed to translate a business problem into a solvable mathematical model.
Large Language Models (LLMs) appeared to offer the ideal solution. Projects such as OptiMUS and OR-LLM have shown that advanced LLMs can generate correct solver code for well-defined linear programming (LP) and mixed-integer programming (MIP) problems, and early benchmark results looked promising.
However, once these tools are tested against genuine business complexities, their limitations become immediately apparent.
Limitations of Current Technologies
Nearly all current LLM-based OR solutions rely on a hidden premise: that the problem description is fully defined and presented in a structured format within a single prompt alongside all relevant datasets.
This rarely mirrors actual OR workflows. In practice, developing a supply chain optimization model often involves the following realities:
- Problem descriptions are often vague and incomplete. A logistics team might aim to “minimize freight expenses” without including distribution center capacity limits, restricted transport routes, or one-time setup costs for new facilities. These gaps aren’t necessarily errors; they are often details analysts take for granted. If an AI begins modeling before these specifics are finalized, the model may be mathematically valid but practically useless.
- Datasets are often too large for prompts. Real-world logistics problems may include hundreds of locations and thousands of products across many periods, so a demand table can run to millions of rows that cannot be embedded in a text prompt. Even where the data would fit, loading raw tables into the prompt greatly increases the amount of text the model has to process.
- Raw datasets frequently require preprocessing. A model may need a distance matrix derived from GPS coordinates or aggregate demand totals extracted from detailed sales ledgers. Converting raw figures into model parameters is a real engineering task that most existing tools ignore.
- Reproducibility and portability matter after development. Rerunning on updated data, switching solvers (for example, from Gurobi to an open-source alternative), or reproducing results on a colleague’s machine all require reusable outputs. Most current tools generate one-off code without considering long-term use.
These are not rare edge cases; they are the norm in production. Existing systems were designed for idealized settings and fail visibly in the field.
Presenting ORPilot
ORPilot is an open-source AI agent built for industrial-scale problems. It is among the first OR tools designed for the complexities of large-scale, data-intensive optimization work.
While most systems rush to generate code, ORPilot takes a different approach: it prioritizes understanding your specific needs.
That core philosophy—clarity before action—reflects a simple truth: the system should mimic the methodology of a seasoned human consultant.
An experienced consultant does not immediately draft equations during an initial meeting. They gather requirements, listen to constraints, and clarify ambiguities. They ensure datasets are properly prepared. Only then do they begin the technical modeling.
ORPilot is organized across five distinct stages mirroring this professional workflow.
Stage 1: Strategic Interviewing
The system initiates with an intake phase. Upon receiving an initial problem outline—potentially complex or contradictory—the platform engages in a guided dialogue to resolve uncertainties. Modeling begins only after all information has been verified.
The platform is configured to:
- Identify gaps in the current description.
- Ask targeted clarifying questions progressively (one at a time) to avoid confusion.
- Confirm that the objective, variables, constraints, and data requirements are fully specified before proceeding.
Common clarifications include:
- ORPilot: “Are facilities permanently open once established, or can they be deactivated later?”
- ORPilot: “Are you modeling a single product category or multiple variations?”
- ORPilot: “Is the transportation cost per unit, per shipment, or calculated differently?”
Before moving forward, the platform provides a complete summary of objectives, variables, and parameters, allowing for final adjustments. This prevents the most common failure: modeling an incorrect representation of the problem.
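The gate at the end of the interview can be pictured as a completeness check over a structured specification. The following is a minimal sketch, not ORPilot's actual implementation: the `ProblemSpec` record, its field names, and the `next_question` helper are hypothetical and chosen only to mirror the four items listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemSpec:
    """Hypothetical record of what the interview has established so far."""
    objective: str = ""
    variables: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    data_requirements: List[str] = field(default_factory=list)


def missing_fields(spec: ProblemSpec) -> List[str]:
    """Return the names of the fields that are still empty."""
    return [name for name, value in vars(spec).items() if not value]


def next_question(spec: ProblemSpec) -> Optional[str]:
    """Ask about one gap at a time; None means the interview may end."""
    gaps = missing_fields(spec)
    if not gaps:
        return None
    return f"Could you describe the {gaps[0].replace('_', ' ')} of the problem?"


spec = ProblemSpec(objective="minimize total freight cost")
print(missing_fields(spec))  # ['variables', 'constraints', 'data_requirements']
print(next_question(spec))
```

In a real agent the question would be phrased by the LLM rather than a template; the point of the sketch is only that modeling stays blocked while any field is empty.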
Stage 2: Efficient Data Handling
This stage is a key structural innovation missing in most similar tools.
While previous systems assumed data was readily available within the prompt, ORPilot recognizes that industrial datasets are massive and complex. For example, a supply chain with 500 sites, hundreds of products, and 12 monthly periods can produce millions of data points, far more than fit in a text prompt. Even when the data does fit, embedding such volumes increases error rates significantly.
ORPilot separates data management from prompting. All data stays in CSV files, and the AI can access it only by writing and executing code. The data collection agent’s role is to determine the exact structure required for those CSV files.
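A rough count shows how quickly such tables grow. The sketch below assumes a demand table with one row for every customer, product, and period combination, and borrows the dimensions of the test instance described later in this post; the dense indexing is an assumption made for illustration.

```python
# Assumed dimensions, taken from the test instance later in this post.
customers, products, periods = 500, 500, 12

# One demand value per (customer, product, period) combination.
demand_rows = customers * products * periods
print(f"{demand_rows:,} demand rows")  # 3,000,000 demand rows
```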
Based on the problem description from the interview agent, the data collection agent identifies:
- Which entities (sets) are present in the model
- What attributes (parameters) each entity requires
- The exact schema for each necessary table: column names, data types, and meaning
It presents this specification to you and waits until you provide all the files in the correct format. It checks for completeness before moving forward.
Importantly, the agent is adaptable: if you lack a specific piece of model-ready data (for example, the model requires a distance matrix but you only have GPS coordinates), you inform the agent about what you actually possess, and it adjusts the schema accordingly — passing the gap to the next stage to resolve.
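To make the idea of a table schema and a completeness check concrete, here is a minimal sketch using only the Python standard library. The schema layout, the column names, and the `validate_table` helper are hypothetical; ORPilot's own specification format may look different.

```python
import csv
import io

# Hypothetical schema for one table: column name -> (type, meaning).
DEMAND_SCHEMA = {
    "customer_id": (str, "customer identifier"),
    "product_id": (str, "product identifier"),
    "period": (int, "planning period index, starting at 1"),
    "demand": (float, "units demanded in the period"),
}


def validate_table(handle, schema):
    """Return a list of problems found in a CSV file-like object."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in schema if col not in header]
    if problems:
        return problems
    # start=2 because line 1 of the file is the header
    for line_no, row in enumerate(reader, start=2):
        for col, (cast, _meaning) in schema.items():
            try:
                cast(row[col])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: bad {col!r} value {row[col]!r}")
    return problems


# Made-up sample data; the second data row has a non-integer period.
sample = io.StringIO(
    "customer_id,product_id,period,demand\n"
    "C001,P010,1,120.5\n"
    "C002,P010,one,80\n"
)
print(validate_table(sample, DEMAND_SCHEMA))
```

An empty list means the file matches the schema, which is the condition under which the pipeline would move on.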
Stage 3: Parameter Computation Agent
Nearly every existing LLM-for-OR tool assumes the numerical values required by the model are directly present in the user-provided data. In reality, this is almost never the case. Two examples that frequently arise in actual OR problems:
- A vehicle routing model requires a pairwise distance matrix. The user has GPS coordinates. Calculating Euclidean or geographic distances is a transformation entirely outside the scope of LP/MIP formulation.
- A multi-period production model requires aggregate demand per period. The user has a transaction ledger with one row per order. The model parameter is a sum-aggregation that must be computed from the raw data.
The parameter computation agent automatically bridges this gap. It receives the problem specification and the raw CSV files, then:
- Identifies which model parameters cannot be directly read from the raw tables
- Generates a Python script to compute those derived parameters
- Runs the script in a sandboxed environment
- Writes the results as additional CSV files, passed to the modeling step
This ensures that by the time the modeling agent sees the data, it is clean, correctly typed, correctly indexed, and model-ready. In our experiments, this step significantly reduced code generation failures and retry counts.
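The two examples above, a distance matrix from GPS coordinates and per-period demand from an order ledger, might be computed along the following lines. This is an illustrative sketch of the kind of script such an agent could generate, not output copied from ORPilot; the function names and sample data are made up, and the haversine formula is one common choice for geographic distance.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(coords):
    """coords: {site: (lat, lon)} -> {(i, j): km} for every ordered pair."""
    return {
        (i, j): haversine_km(*coords[i], *coords[j])
        for i in coords
        for j in coords
    }


def aggregate_demand(ledger):
    """ledger: iterable of (period, quantity) order rows -> {period: total}."""
    totals = defaultdict(float)
    for period, quantity in ledger:
        totals[period] += quantity
    return dict(totals)


coords = {"A": (0.0, 0.0), "B": (0.0, 90.0)}  # made-up sites on the equator
dist = distance_matrix(coords)
orders = [(1, 40.0), (1, 60.0), (2, 25.0)]    # made-up ledger rows
print(aggregate_demand(orders))  # {1: 100.0, 2: 25.0}
```

In the pipeline described here, results like `dist` and the aggregated totals would be written out as additional CSV files rather than kept in memory.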
Another common case where the parameter computation agent proves useful is computing big-M values. In some experiments I ran with ORPilot, the agent calculated the big-M value needed for constraints linking continuous shipment variables to binary facility-opening decisions. This is a derived parameter that would be impractical to ask the user to provide directly.
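As an illustration of what such a derived parameter can look like, the sketch below computes a per-facility big-M for linking constraints of the form "total shipments from facility i <= M[i] * y[i]". Taking the smaller of the facility's capacity and the total demand is a standard textbook bound; it is assumed here for illustration and is not necessarily the rule ORPilot applied. The function name and data are made up.

```python
def big_m_values(capacity, demand):
    """Per-facility big-M for constraints sum_j x[i, j] <= M[i] * y[i].

    A facility can never ship more than its own capacity or more than
    the total demand, so the smaller of the two is a valid bound.
    """
    total_demand = sum(demand.values())
    return {i: min(cap, total_demand) for i, cap in capacity.items()}


capacity = {"F1": 500.0, "F2": 2000.0}  # made-up facility capacities
demand = {"C1": 300.0, "C2": 450.0}     # made-up customer demands
print(big_m_values(capacity, demand))   # {'F1': 500.0, 'F2': 750.0}
```

Keeping M as small as validity allows generally gives a tighter LP relaxation than an arbitrary large constant, which is why it is worth computing from the data.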
Stage 4: Code Generation Agent
With a complete problem specification, raw data, and derived parameters all prepared, the code generation agent produces a complete Python solver script for your chosen backend. ORPilot currently supports five backends: Gurobi, CPLEX, PuLP, Pyomo, and OR-Tools.
The generated code is immediately executed in a sandbox. If anything goes wrong (a syntax error, a runtime exception, or an infeasible or unbounded solver result), the full error message and traceback are fed back to the LLM along with the previously generated code. The agent retries, up to a user-configurable maximum number of attempts.
In practice, most failures are resolved within one or two retries. The key reason ORPilot’s retry loop is effective is that the upstream stages have already done the hard work: the problem is correctly specified, the data is model-ready, and the agent only needs to fix a code-level mistake rather than rethink the entire model structure.
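The generate, execute, and retry cycle can be sketched as follows. This is a simplified illustration under stated assumptions: `generate_code` is a hypothetical wrapper around the LLM call, a child Python process stands in for a real sandbox, and only process-level failures are detected (checking for infeasible or unbounded solver status is omitted).

```python
import subprocess
import sys


def run_script(code, timeout=60):
    """Run code in a separate Python process; return (ok, error_text).
    A child process is only a stand-in for a real sandbox."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode == 0, result.stderr


def generate_with_retries(generate_code, max_attempts=3):
    """generate_code(previous_code, error_text) -> str is a hypothetical
    wrapper around the LLM call. Returns (code, attempts) on success."""
    code, error = None, None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(code, error)  # feed back last code and traceback
        ok, error = run_script(code)
        if ok:
            return code, attempt
    raise RuntimeError(f"no working script after {max_attempts} attempts:\n{error}")


def fake_llm(previous_code, error_text):
    """Stand-in for the LLM: the first draft has a bug, the retry fixes it."""
    if error_text is None:
        return "print(undefined_name)"
    return "print('solved')"


code, attempts = generate_with_retries(fake_llm)
print(attempts)  # 2
```

Passing both the previous code and the traceback back to the generator matches the feedback described above, and `max_attempts` plays the role of the user-configurable retry limit.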
Stage 5: Reporter Agent
After a successful solve, a reporter agent translates the numerical results into plain English. It explains which facilities to open, which routes to use, and what quantities to produce, in the domain language of the original business problem, so that a business user rather than an OR expert can act on them.
Why This Order Matters
The pipeline is intentionally sequential. Each stage is gated on the previous one completing successfully. The interview must finish before data collection begins. Data must be validated before parameter computation runs. Parameters must be ready before code is generated.
This sequencing prevents the most common failure mode in LLM-based OR tools: cascading errors where an ambiguous problem description propagates through the pipeline and produces code that is syntactically valid but models the wrong objective.
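The gating logic itself is simple to express. The sketch below is a hypothetical illustration, not ORPilot's orchestration code: each stage is a function returning a success flag and the updated state, and the runner stops at the first stage that does not complete.

```python
def run_pipeline(stages, state):
    """Run stages in order; each stage returns (ok, state).
    Stop at the first stage that does not complete successfully."""
    for name, stage in stages:
        ok, state = stage(state)
        if not ok:
            return name, state  # name of the stage that blocked progress
    return None, state


# Hypothetical stand-ins for the first three stages.
def interview(state):
    return True, {**state, "spec": "confirmed"}


def collect_data(state):
    return False, state  # e.g. a required CSV file is still missing


def compute_parameters(state):
    return True, {**state, "params": "ready"}


blocked_at, final_state = run_pipeline(
    [("interview", interview),
     ("data", collect_data),
     ("parameters", compute_parameters)],
    {},
)
print(blocked_at)  # data
```

Because parameter computation never runs while data collection is incomplete, an upstream gap cannot silently turn into a downstream modeling error.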
What This Looks Like at Scale
I tested ORPilot on several OR problems, one of which is a supply chain network design problem with 50 production sites, 50 distribution centers, 500 customers, 500 products, and 12 periods. The resulting model had more than 9.7 million decision variables and 963,000 constraints. ORPilot handled the full pipeline end to end, from the initial conversation through data collection, parameter computation, code generation, and solution reporting, and produced an optimal solution with Gurobi. See the paper for results on more test problems.
Getting Started
ORPilot is open source and available now:
GitHub: https://github.com/GuangruiXieVT/ORPilot
Paper:
Installation takes a few minutes. ORPilot supports OpenAI, Anthropic, Google, and DeepSeek as LLM providers, and Gurobi, CPLEX, PuLP, Pyomo, and OR-Tools as solver backends.
In the next post in this series, we’ll take a deep dive into the Intermediate Representation (IR), the solver-agnostic JSON artifact that makes ORPilot’s results reproducible and portable across backends without ever calling the LLM again. Stay tuned!



