Introduction
Logistics is an industry that consistently operates with stunning inefficiency: manual processes, piles of paperwork, legal complexities. Many firms still run on paper or Excel and don't even collect data on their shipments.
But what if a company is large enough to save millions — or even hundreds of millions — of dollars through optimization (to say nothing of the environmental impact)? Or what if a company is small, but poised for rapid growth?
Optimization is often non-existent or rudimentary — designed for operational convenience rather than maximizing savings. The industry is clearly lagging behind, yet there's a TON of money on the table. Cargo networks span the globe, from Alaska to Sydney. I won't bore you with market size statistics here. Insiders already know the scale, and outsiders can make an educated (or not so educated) guess.
And that's where I came in. As a Data Science and Machine Learning specialist, I found myself at a large, fast-growing logistics company. Crucially, the team there wasn't just going through the motions; they genuinely wanted to optimize. This led to the creation of a line-haul optimization project that I led for two years — and that's the story I'm here to tell.
This project will always hold a warm spot in my heart, even though it never fully made it to production. I believe it holds huge potential — especially in the combination of logistics and RL's unique ability to generalize decision-making.
While traditional optimization projects usually focus on maximizing the objective function or execution speed, the most interesting metric here is how many unseen conditions we can solve with the same model (zero-shot or few-shot).
In other words, we're aiming for a generalizable zero-shot policy.
Ideally, we train an agent, drop it into new conditions (ones it has never seen), and it just works — without any retraining or with only minimal fine-tuning. We don't need perfection; we just need it to perform 'good enough' to not breach the SLA.
Then we can say: 'Cool, the agent generalized this case, too.'
I'm confident that this approach can yield models capable of ever-increasing generalization over time. I believe this is the future of the industry.
And as one of my favorite stand-up comedians once said:
Eventually, somebody will do it anyway. Let it be us.
Business Context
The company had scaled rapidly, growing into a network of over 100 line-haul terminals. At this magnitude, manual scheduling reached its operational limit. Once established, a schedule — along with its underlying business contracts and arrangements — would typically remain static for months without a single change.
We observed a consistent inefficiency: trucks were frequently dispatched with suboptimal loads — either underutilized (driving up unit costs) or bottlenecked by last-minute overflows.
The financial impact of this inefficiency was significant. In a network of this size, even a 1% increase in vehicle utilization translates to millions of dollars in annual savings. Therefore, maximizing vehicle utilization became the primary lever for cost reduction.
Big-Picture Problem
We had access to historical shipment data. While the storage format was far from convenient, the volume was sufficient for modeling. Thanks to the efforts of my data engineering and data science colleagues, this raw data was transformed into a clean, usable state (I'll cover the specific data engineering challenges in a separate article).
My initial goal was to generate a 'good' schedule. A schedule is defined here as a tabular dataset where every row represents a physical movement (shipment):
- Timestamp: Hourly precision.
- Origin & Destination: The specific edge in the graph.
- Vehicle Type: The discrete asset class (e.g., 20-ton semi, 5-ton van, etc.).
- Load Manifest: The exact set of aggregated 'pallets' packed inside.
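To make the row format concrete, here is a minimal sketch of one schedule row as a Python dataclass. The field names and example values are illustrative only, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ShipmentRow:
    timestamp: datetime       # hourly precision
    origin: str               # departure terminal (graph node)
    destination: str          # arrival terminal (graph node)
    vehicle_type: str         # discrete asset class, e.g. "semi_20t"
    load_manifest: list[str]  # ids of the aggregated pallets packed inside

# One hypothetical row of the schedule table.
row = ShipmentRow(
    timestamp=datetime(2024, 3, 1, 14, 0),
    origin="HUB_A",
    destination="HUB_B",
    vehicle_type="semi_20t",
    load_manifest=["pallet_001", "pallet_002"],
)
```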
Therefore, building a schedule requires four distinct decisions:
- Choose which packages to ship. What can go wrong: if low-priority packages are sent first, valuable or urgent cargo might get stranded at the warehouse. We don't want that, because the penalty is higher for the more valuable packages.
- Choose the next warehouse (where to ship). Essentially, this is a routing problem: selecting the optimal 'next edge' on the graph for every single package.
- Choose vehicle types and their quantity. This is a balancing act. What can go wrong: sending several small vehicles instead of one large one creates fleet inefficiency, while dispatching large trucks that drive mostly empty means paying for air. Conversely, under-provisioning the fleet leads to delays, costing us in both SLA penalties and reputation.
- Finally, inaction is also an action. For any given time step, the optimal move might be to send no trucks at all. To create an optimized schedule, the system must perfectly balance active shipments with 'doing nothing'.
However, reality introduces additional complexities and constraints into the problem space:
- Pace of Change: Business rules are numerous, complex, and evolve rapidly. The real world can be far more complex and messier than a basic simulation. And changes in the real world lead to expensive and time-consuming code updates.
- Stochastic Demand: Demand is non-deterministic, unknown in advance, and dynamic (e.g., multiple visits to a customer within a window).
- Multi-Objective Optimization: We aren't just minimizing cost; we're balancing cost against SLA penalties (lateness) and fleet expenses.
So now we understand that we not only need to create a good schedule, but also build a system that respects dynamic demand, truck capacity, and numerous custom business rules, which can also change frequently. This crystallized into the following.
Wish List
- Low-Cost Reusability. We need the ability to reuse the mechanism for new tasks and contexts cheaply. Since real-world problems shift quickly, the solution must be flexible — adaptable to new settings without requiring us to retrain the model from scratch every time.
- Fast Inference. While slow training is acceptable if it yields stronger generalization, the inference (decision-making) must be fast.
- 'Good Enough' Effectiveness. The system doesn't need to be perfect, but it must strictly adhere to the baseline SLA levels.
- Global Optimization. We need to optimize the system as a whole, rather than optimizing its individual components in isolation.
System Specifications
- Topology: Custom graph containing 2 to 100 nodes
- Decision frequency: 1-hour intervals, 480 steps/episode (representing 20 days)
- Agents: Decentralized hubs acting as independent decision-makers
- Constraints: Hard physical limits on vehicle volume (m³) and weight (kg). Hard limit on the number of vehicles dispatched from a terminal per hour.
- Objective: Minimize global cost while adhering to dynamic SLA windows.
- Primary metrics: Shipment cost, share of late packages (SLA violations), count of dispatched vehicles by type
- Secondary "long-term" metrics: Average transit time and vehicle capacity utilization.
Why Not Standard Solvers?
Spoiler: they can't cut it, and they aren't good enough.
Naturally, we started by exploring standard solvers and off-the-shelf tools like Google OR-Tools. However, the consensus was discouraging: these tools would either solve our exact problem poorly, or they would perfectly solve a different, imaginary version of the problem. Ultimately, I concluded that this approach was a dead end.
Linear Optimization
This is the simplest and cheapest approach, but it has a fatal flaw: a linear formulation fails to account for temporal dynamics (every step depends on the previous one).
Essentially, LP assumes the entire optimization problem fits into a single, static snapshot. It ignores the fact that each step depends on the previous one. This is fundamentally wrong and divorced from reality, where every action in the network creates ripple effects elsewhere.
Moreover, the sheer volume of business rules makes it practically impossible to cram them all into a "flat" solver. In short, while Linear Programming is a great tool, it is simply too rigid for a problem of this magnitude.
Genetic Algorithms
Genetic Algorithms (GA) were closer in philosophy to what we needed. While they do work, they come with significant drawbacks of their own.
First, slow inference. To get a result, you essentially have to run the optimization from scratch every time (evolving the population). You cannot simply "train" a model and freeze the weights, because there are no weights to freeze. As a result, the system's response time is measured in seconds or even minutes — not the milliseconds typical of a neural network or a heuristic. In a production setting dealing with hundreds of hubs in real time, this becomes a major bottleneck.
Second, lack of determinism. If you run the scheduler twice on the same dataset, a GA can yield two completely different schedules. Business customers usually don't like that very much, which can lead to trust issues.
Why Not Pure RL?
Theoretically, one could try to solve the entire problem end-to-end using pure Reinforcement Learning. But that is definitely the hard way.
A possible pure RL solution would take one of two forms: either a single "God Mode" agent that sees everything and allocates every package to every truck on every route at every step, or a team of sequential agents acting one after another.
God-Mode Agent
In the first case, the action space becomes unmanageable. You aren't just selecting a route — you have to choose every truck (from N types) K times for every direction. With packages, it gets even worse: you don't just need to select a subset of cargo — you have to assign specific packages to specific trucks. Plus, you retain the option to leave a package at the warehouse.
Even with a small fleet, the number of ways to assign specific packages to specific trucks is astronomical. Asking a neural network to explore this entire space from scratch is inefficient. It would spend eons just trying to figure out which package fits into which bin.
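A quick back-of-the-envelope calculation shows how fast this blows up. If each of P packages can go into any of T trucks or stay at the warehouse, there are (T + 1)^P possible load assignments. The numbers below are purely illustrative:

```python
def assignment_count(num_packages: int, num_trucks: int) -> int:
    # Each package independently picks one of the trucks, or stays behind.
    return (num_trucks + 1) ** num_packages

# A modest instance: 200 packages and 10 trucks already yields
# a number of possible assignments that is 209 digits long.
print(len(str(assignment_count(200, 10))))  # 209
```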
Sequential Agents
A chain of agents passing packages down the line would create a non-stationarity nightmare.
While Agent 1 is learning, its behavior is essentially random. Agent 2 tries to adapt to Agent 1, but since Agent 1 keeps changing its strategy, Agent 2 can never stabilize. Instead of solving logistics, each agent is forced to endlessly adapt to its neighbor's instability. It becomes a case of the blind leading the blind, unlikely to converge in any reasonable time.
Moreover, pure RL struggles to learn hard constraints (like maximum weight limits) without incurring massive penalties. It tends to "hallucinate" solutions — outputs that look efficient but are physically impossible.
However, we’ve got Linear Programming (LP): a quick, easy solver that handles onerous constraints natively. The temptation to carve out a sub-problem and offload it to LP was too nice to withstand.
And that’s the reason I selected a hybrid method.
Implemented Solution
MARL + LP Hybrid Architecture
Let's build an RL agent that observes the state of the logistics network and orchestrates the flow of packages — deciding exactly what volume of cargo moves between warehouses at any given moment. Ideally, this agent makes decisions strategically, factoring in the global state of the system rather than just optimizing individual warehouses in isolation.
An agent, then, represents a specific warehouse responsible for shipping packages to its neighbors. We connect these agents into a multi-agent network. Since every action taken by an agent corresponds to a shipment to one or more destinations, the aggregate sequence of these actions constitutes the final schedule.
Technically, we implemented a Multi-Agent Reinforcement Learning (MARL) framework. The RL environment trains the algorithms to generate viable transportation schedules for real-world shipments. Crucially, this project includes both the environment creation and the agent training pipelines, ensuring that the solution can adapt (via continual learning) to increasingly complex scenarios with minimal human intervention.
What Agents See
Below are the key observations (model inputs) fed into the agent (I'll cover more of the implementation details in Part 2).
- Local Inventory: The volume of packages at each warehouse.
- In-Transit Volume: The volume of packages currently traveling on the edges between warehouses.
- Cargo Value: The total financial value of the inventory (crucial for risk management) at each warehouse.
- SLA Heatmap: The nearest deadlines for the current stock (identifying urgent cargo).
- Inbound Forecast: The volume of packages expected to arrive within the next 24 hours.
- Heuristic Hints: Used only during the imitation learning stage to bootstrap training.
Version 1. Agents Slicing a PriorityQueue
In this version, packages are lined up in a priority queue, sorted in descending order based on a simple formula: Priority = Value × Urgency (proximity to deadline). The RL agent "slices" a portion of this queue by selecting a fraction of the top packages and deciding which warehouse to send them to.
We use heuristics to pre-filter the options — discarding packages we definitely don't want to ship yet, or ruling out nonsensical destinations (e.g., shipping a package in the opposite direction of its destination).
Once the RL selects the what and the where, the Linear Programming solver steps in to pick the quantity and type of vehicles. The LP enforces hard constraints on weight, volume, and fleet availability to ensure the simulation doesn't violate the laws of physics.
In Version 1, a single action consists of sending packages to one neighbor only. The volume is determined by the "fraction" (0.0 to 1.0) chosen by the agent. "Doing nothing" is simply choosing a fraction of 0.
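As a rough sketch of the Version 1 action, assuming a linear urgency term that grows as the deadline approaches (the exact urgency formula and the 480-hour horizon are my assumptions, not the production code):

```python
def slice_queue(values, hours_to_deadline, fraction, horizon=480):
    """Pick the top `fraction` of packages by Priority = Value x Urgency."""
    urgency = [1.0 - h / horizon for h in hours_to_deadline]  # closer deadline -> more urgent
    priority = [v * u for v, u in zip(values, urgency)]
    order = sorted(range(len(values)), key=lambda i: -priority[i])  # descending priority
    k = round(fraction * len(order))  # fraction == 0.0 means "do nothing"
    return order[:k]  # indices of the packages to ship this step

# The valuable, urgent package wins over the cheap and the non-urgent ones.
picked = slice_queue(values=[100, 10, 50], hours_to_deadline=[5, 400, 50], fraction=0.34)
# picked == [0]
```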

But then, it hit me!
Version 2. Agents Sending Trucks
TL;DR: Instead of picking packages, we built an agent that selects how many trucks to dispatch to each destination. The Linear Programming (LP) solver then decides exactly which packages to pack into those trucks.
What if the agent controlled the fleet capacity directly? This would let the LP solver handle the low-level "bin packing" work, while the RL agent focuses purely on high-level flow management. That was exactly what we needed!
Here is the new division of labor:
RL Agent — Fleet Manager. Decides the quantity of vehicles and their destinations.
- Intuition: It looks at the map, checks the calendar, and shouts: "Send 5 trucks to the North Hub!" It handles the flow management.
- Skill: Strategy, foresight, and balancing.
LP Solver — Dock Worker. Selects the exact vehicle types (optimizing the fleet mix) and picks the exact packages to pack.
- Intuition: It takes the "5 trucks" order and the pile of boxes, then packs them perfectly to maximize value density.
- Skill: Tetris, algebra, and physical validity.
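To illustrate the dock worker's sub-problem, here is a toy stand-in: a greedy value-density loader for a single truck. The real system solved this with an LP over the whole fleet; this sketch (with made-up package tuples) only shows the kind of decision the solver owns:

```python
def pack_truck(packages, max_weight, max_volume):
    """Fill one truck greedily by value density under weight/volume limits.

    packages: list of (pkg_id, value, weight, volume) tuples.
    A production system would solve this as a (mixed-integer) linear program;
    greedy loading is only an approximation for illustration.
    """
    by_density = sorted(packages, key=lambda p: p[1] / (p[2] + p[3]), reverse=True)
    loaded, weight_used, volume_used = [], 0.0, 0.0
    for pkg_id, value, weight, volume in by_density:
        if weight_used + weight <= max_weight and volume_used + volume <= max_volume:
            loaded.append(pkg_id)
            weight_used += weight
            volume_used += volume
    return loaded

# Dense, valuable packages make the cut; the heavy low-value one stays behind.
cargo = [("a", 100, 50, 1), ("b", 90, 10, 1), ("c", 10, 60, 1)]
print(pack_truck(cargo, max_weight=60, max_volume=3))  # ['b', 'a']
```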
Previously, the agent controlled a "fraction of the queue," which determined the package count, which determined the truck count, which finally determined the reward. Now, the agent controls the truck count directly. The link between action and reward became much shorter and more predictable, making training faster and more stable. In technical terms, we significantly reduced the stochastic noise in the reward signal. The LP now optimizes only the packing and fleet mix after the strategic capacity decision has already been made.
But the engineering benefits didn't stop there. Since the LP now selects the packages, we no longer need to maintain a sorted PriorityQueue. This simplified the architecture in three significant ways:
- Concurrency: we eliminated the multiprocessing headaches of sharing complex PriorityQueue objects between processes.
- Vectorization: we no longer have to iterate through a queue item by item (a slow Python loop). We can now rewrite everything using matrix operations, which unlocked huge potential for speed optimization. Plus, the code became significantly shorter and cleaner.
- Multi-destination actions: the agent can now dispatch X trucks to N different warehouses in a single step (unlike V1, which was limited to one destination per step).
It became immediately clear that this was the winning architecture.

Scale-Invariant Observation Space and Generalization
TL;DR: I use histogram state representations normalized to 0–1 instead of absolute values, to make the agents transferable to new conditions.
A core pillar of this project's philosophy is universality — the ability to reuse the solution across different tasks and new conditions without retraining. However, standard RL requires a rigidly fixed action and observation space.
To reconcile this, we normalized the observation space to make it scale-invariant. Instead of tracking raw counts (e.g., "how many packages were sent"), we track ratios (e.g., "what percentage of the total backlog was sent"). This allows the agent to operate at a higher level of abstraction where absolute numbers are irrelevant.
The result is a model capable of generalizing across different scenarios, enabling zero-shot transfer across nodes with vastly different capacities.
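A minimal sketch of the idea, using a hypothetical backlog feature (the point is the ratio encoding, not the specific feature):

```python
def scale_invariant_obs(raw_counts, total):
    """Encode counts as shares of the total, mapping every node onto a 0-1 scale."""
    if total == 0:
        return [0.0] * len(raw_counts)
    return [count / total for count in raw_counts]

# A tiny depot and a huge hub with the same backlog *shape*
# produce identical observations for the agent.
small_depot = scale_invariant_obs([2, 3, 5], total=10)
big_hub = scale_invariant_obs([2000, 3000, 5000], total=10_000)
assert small_depot == big_hub == [0.2, 0.3, 0.5]
```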
A Glimpse of the Performance
Agents Learned "LTL Consolidation" Behavior
TL;DR: Increased shipment cost led to more idle actions and fewer vehicles.
One of the most impressive emergent behaviors was the agents' ability to perform LTL (Less-Than-Truckload) consolidation. At the beginning of training, the agents were trigger-happy, dispatching many partially filled trucks at every step. Over time, their behavior shifted.
The shipment cost is calculated as the product of the vehicle cost and the shipment cost multiplier. When the multiplier increases, a shipment costs more relative to the value of the packages. That gives us a simple way to manually adjust the shipment-cost component of the reward.
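In reward terms, the knob might look like the following sketch. The penalty structure, function name, and weights are my assumptions, not the production reward:

```python
def step_reward(vehicle_costs, cost_multiplier, num_late_packages, late_penalty):
    """Negative reward: pay for dispatched vehicles (scaled by the shipment
    cost multiplier) plus a penalty per SLA violation."""
    shipping_cost = cost_multiplier * sum(vehicle_costs)
    sla_cost = late_penalty * num_late_packages
    return -(shipping_cost + sla_cost)

# Doubling the multiplier makes dispatching relatively more expensive,
# nudging agents toward idle actions and fuller trucks.
cheap = step_reward([100, 100], cost_multiplier=1.0, num_late_packages=0, late_penalty=50)
pricey = step_reward([100, 100], cost_multiplier=2.0, num_late_packages=0, late_penalty=50)
# cheap == -200.0, pricey == -400.0
```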

As we increased the shipment cost multiplier (making logistics more expensive relative to package value), the agents learned to be patient. They began choosing more "idle" actions, effectively accumulating inventory in order to send fewer, fuller trucks.

Because it's costly to send a truck half-empty (or half-full, depending on your worldview), agents started waiting to fill trucks closer to 100% capacity. In other words, the agents learned to optimize vehicle utilization indirectly, purely as a byproduct of the cost/reward function.
However, sending fewer vehicles led to a higher number of overdue packages. I believe this kind of trade-off — cost vs. speed — should be decided by each business independently, based on its specific strategy and SLAs. In our case, we had a hard cap on the percentage of allowed delays, so we could optimize while staying below that cap.
More results and experiments will be shown in the upcoming Part 3.
Constraints and Benefits
As I mentioned earlier, high-quality data is crucial for this engine. If you don't have data, you have no simulation, no schedules, and no package-flow forecasts — the very foundation of the entire system.
You also need the willingness to adapt your business processes. In practice, this is often met with resistance. And, of course, you need the raw compute power (substantial RAM + CPU) to run the simulations.
But once you overcome these hurdles, you might find that your logistics network has transformed into something far more powerful — a network that:
- Can withstand overloads, peak seasons, and unexpected events, because you have a fast, reliable way to generate a new schedule instantly by simply applying your pre-trained agents to the new data.
- Is more efficient than the competition. MARL has the potential to achieve not just local optimization, but global optimization of the entire network over a continuous time horizon.
- Can rapidly expand or contract as needed. This flexibility comes precisely from the model's generalization capabilities.
All the best to everyone, and may your shipments always be fast and reliable!
See the upcoming Part 2 for the implementation specifics and techniques I used to make this work!
LinkedIn | Email



