I’d like to share a practical variation of Uber’s Two-Tower Embedding (TTE) approach for cases where both user-related data and computing resources are limited. The problem came from a heavy-traffic discovery widget on the home screen of a food delivery app. This widget shows curated selections such as Italian, Burgers, Sushi, or Healthy. The selections are built from tags: each restaurant can have multiple tags, and each tile is essentially a tag-defined slice of the catalog (plus some manual curation). In other words, the candidate set is already known, so the real problem is not retrieval but ranking.
At the time, this widget was significantly underperforming compared to other widgets on the discovery (main) screen. The final list was ranked by general popularity without taking any personalized signals into account. What we found is that users are reluctant to scroll: if they don’t see something interesting within the first 10 to 12 positions, they usually don’t convert. Yet the selections can be large, in some cases up to 1,500 restaurants. On top of that, a single restaurant can belong to several selections. McDonald’s, for example, can appear in both Burgers and Ice Cream, but its popularity is really only valid for the first; general popularity sorting would put it on top in both.
The product setup makes the problem even less friendly to static solutions such as general popularity sorting. These collections are dynamic and change frequently due to seasonal campaigns, operational needs, or new business initiatives. Because of that, training a dedicated model for each individual selection is not realistic. A useful recommender has to generalize to new tag-based collections from day one.
Before moving to a two-tower-style solution, we tried simpler approaches such as localized popularity ranking at the city-district level and multi-armed bandits. In our case, neither delivered a measurable uplift over a general popularity sort. As part of our research initiative, we tried to adapt Uber’s TTE to our case.
Two-Tower Embeddings Recap
A two-tower model learns two encoders in parallel: one for the user side and one for the restaurant side. Each tower produces a vector in a shared latent space, and relevance is estimated from a similarity score, usually a dot product. The operational advantage is decoupling: restaurant embeddings can be precomputed offline, while the user embedding is generated online at request time. This makes the approach attractive for systems that need fast scoring and reusable representations.
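As a minimal sketch of the scoring step, where random vectors stand in for real tower outputs and the shapes (1,500 candidates, 64-dim embeddings) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Normalize vectors so the dot product behaves like cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical precomputed restaurant embeddings (offline tower output).
restaurant_vecs = l2_normalize(rng.normal(size=(1500, 64)))

# User embedding produced online at request time.
user_vec = l2_normalize(rng.normal(size=64))

# Relevance = dot product in the shared latent space; rank all candidates.
scores = restaurant_vecs @ user_vec
ranking = np.argsort(-scores)  # indices of restaurants, best-scoring first
```

Because the restaurant side is precomputed, the online cost per request is one matrix-vector product over the candidate slice.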
Uber’s write-up focused primarily on retrieval, but it also noted that the same architecture can serve as a final ranking layer when candidate generation is already handled elsewhere and latency must stay low. That second formulation was much closer to our use case.
Our Approach
We kept the two-tower structure but simplified the most resource-heavy parts. On the restaurant side, we didn’t fine-tune a language model inside the recommender. Instead, we reused a TinyBERT model that had already been fine-tuned for search in the app and treated it as a frozen semantic encoder. Its text embedding was combined with explicit restaurant features such as price, ratings, and recent performance indicators, plus a small trainable restaurant ID embedding, and then projected into the final restaurant vector. This gave us semantic coverage without paying the full cost of end-to-end language-model training. For a POC or MVP, a small frozen sentence-transformer would be a reasonable starting point as well.
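A rough sketch of how such a restaurant-tower forward pass might be wired. All dimensions, feature choices, and initializations here are illustrative assumptions (the 312-dim text embedding mirrors the hidden size of the small TinyBERT variants, but your encoder may differ), and a real projection would be trained, not random:

```python
import numpy as np

rng = np.random.default_rng(1)

TEXT_DIM, FEAT_DIM, ID_DIM, OUT_DIM = 312, 3, 16, 64

# Output of the frozen semantic encoder (would come from the fine-tuned TinyBERT).
text_emb = rng.normal(size=TEXT_DIM)

# Explicit features, e.g. price level, rating, recent conversion rate (made up).
explicit_feats = np.array([2.0, 4.6, 0.12])

# Trainable lookup table of restaurant-ID embeddings (random init here).
id_table = rng.normal(size=(1500, ID_DIM))

# Trainable projection into the final restaurant vector.
W = rng.normal(size=(TEXT_DIM + FEAT_DIM + ID_DIM, OUT_DIM)) * 0.02

def restaurant_tower(text_emb, explicit_feats, restaurant_id):
    """Concatenate frozen text embedding, explicit features, and ID embedding,
    then project into the shared latent space."""
    x = np.concatenate([text_emb, explicit_feats, id_table[restaurant_id]])
    return x @ W

vec = restaurant_tower(text_emb, explicit_feats, restaurant_id=42)
```

Only `id_table` and `W` would receive gradients during training; the text encoder stays frozen, which is where most of the compute savings come from.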
We avoided learning a dedicated user-ID embedding and instead represented each user on the fly through their previous interactions. The user vector was built from averaged embeddings of restaurants the customer had ordered from (Uber’s post mentioned this source as well, but the authors don’t specify how it was used), together with user and session features. We also used views without orders as a weak negative signal. That mattered when order history was sparse or irrelevant to the current selection. If the model couldn’t clearly infer what the user liked, it still helped to know which restaurants had already been explored and rejected.
The most important modeling choice was filtering that history by the tag of the current selection. Averaging the whole order history created too much noise. If a customer mostly ordered burgers and then opened an Ice Cream selection, a global average might pull the model toward burger places that happened to sell desserts rather than toward the strongest ice cream candidates. By filtering past interactions to matching tags before averaging, we made the user representation contextual instead of global. In practice, this was the difference between modeling long-term taste and modeling current intent.
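A sketch of the tag-filtered averaging. The fallback to the full-history average when no past order matches the tag is my assumption (the text doesn’t specify what happens in that case), and the tiny 2-dim vectors are purely illustrative:

```python
import numpy as np

def user_vector(order_history, restaurant_vecs, restaurant_tags, current_tag, dim=64):
    """Average embeddings of past orders whose restaurant matches the
    selection's tag; fall back to the full-history average, then to zeros."""
    matching = [r for r in order_history if current_tag in restaurant_tags[r]]
    ids = matching if matching else list(order_history)
    if not ids:
        return np.zeros(dim)  # no history at all: neutral vector
    return np.mean([restaurant_vecs[r] for r in ids], axis=0)

# Illustrative data: a user who mostly orders burgers opens an Ice Cream selection.
vecs = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.0, 3.0])}
tags = {0: {"burgers"}, 1: {"burgers", "ice-cream"}, 2: {"ice-cream"}}
history = [0, 0, 1, 2]

u = user_vector(history, vecs, tags, current_tag="ice-cream", dim=2)
# Only restaurants 1 and 2 match the tag, so the burger-only place (0) is ignored.
```

The global average would be dominated by the repeated burger orders; the filtered one reflects current intent.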
Finally, we trained the model at the session level and used multi-task learning. The same restaurant could be positive in one session and negative in another, depending on the user’s current intent. The ranking head predicted click, add-to-basket, and order jointly, with a simple funnel constraint: P(order) ≤ P(add-to-basket) ≤ P(click). This made the model less static and improved ranking quality compared with optimizing a single objective in isolation.
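The text doesn’t say how the funnel constraint was enforced; one common way (chaining conditional probabilities, similar in spirit to ESMM-style multi-task heads) makes the ordering hold by construction:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def funnel_head(click_logit, atb_logit, order_logit):
    """Chain conditional probabilities so that
    P(click) >= P(add-to-basket) >= P(order) always holds."""
    p_click = sigmoid(click_logit)
    p_atb = p_click * sigmoid(atb_logit)    # P(atb) = P(atb | click) * P(click)
    p_order = p_atb * sigmoid(order_logit)  # P(order) = P(order | atb) * P(atb)
    return p_click, p_atb, p_order

p_click, p_atb, p_order = funnel_head(0.3, -1.2, 2.0)
```

Each logit comes from its own task head, but because every downstream probability is a product involving the upstream one, no extra penalty term is needed to keep the funnel consistent.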
Offline validation was also stricter than a random split: evaluation used out-of-time data and users unseen during training, which made the setup closer to production behavior.
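A minimal sketch of such a split (the event-dict schema with `user` and `ts` keys is made up for illustration):

```python
def out_of_time_split(events, cutoff_ts):
    """Train on events before the cutoff; evaluate only on later events
    from users never seen in training."""
    train = [e for e in events if e["ts"] < cutoff_ts]
    train_users = {e["user"] for e in train}
    test = [e for e in events if e["ts"] >= cutoff_ts and e["user"] not in train_users]
    return train, test

events = [
    {"user": "u1", "ts": 1},
    {"user": "u2", "ts": 2},
    {"user": "u1", "ts": 5},  # u1 appears in training, so this is excluded
    {"user": "u3", "ts": 6},
]
train, test = out_of_time_split(events, cutoff_ts=4)
```

This double filter (future time window plus unseen users) is what keeps offline metrics from being inflated by memorized user behavior.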
Results
According to A/B tests, the final system showed a statistically significant uplift in conversion rate. Just as importantly, it was not tied to one widget. Because the model scores a user–restaurant pair rather than a fixed list, it generalized naturally to new selections without architectural changes, since tags are part of a restaurant’s metadata and can be retrieved without any particular selection in mind.
That transferability made the model useful beyond the original ranking surface. We later reused it in Ads, where its CTR-oriented output was applied to individual promoted restaurants with positive results. The same representation-learning setup therefore worked both for selection ranking and for other recommendation-like placement problems inside the app.
Further Research
The most obvious next step is multimodality. Restaurant photos, icons, and potentially menu visuals can be added as extra branches to the restaurant tower. That matters because click behavior is strongly influenced by presentation: a pizza place inside a pizza selection may underperform if its main photo doesn’t show pizza, while a budget restaurant can look premium purely because of its hero image. Text and tabular features don’t capture that gap well.
Key Takeaways:
- Two-tower models can work even with limited data. You don’t need Uber-scale infrastructure if candidate retrieval is already solved and the model focuses only on the ranking stage.
- Reuse pretrained embeddings instead of training from scratch. A frozen lightweight language model (e.g., TinyBERT or a small sentence-transformer) can provide strong semantic signals without expensive fine-tuning.
- Averaging embeddings of previously ordered restaurants works surprisingly well when user history is sparse.
- Contextual filtering reduces noise and helps the model capture the user’s current intent, not just long-term taste.
- Negative signals help in sparse environments. Restaurants that users viewed but didn’t order from provide useful information when positive signals are limited.
- Multi-task learning stabilizes ranking. Predicting click, add-to-basket, and order jointly with funnel constraints produces more consistent scores.
- Design for reuse. A model that scores user–restaurant pairs rather than specific lists can be reused across product surfaces such as selections, search ranking, or ads.



