Spotify recently shipped “Prompted Playlists” in beta. I built a few playlists and found that the LLM behind the agent tries to meet your request, but fails because it doesn’t know enough and won’t admit it. Here’s what I mean: one of my first playlist prompts was “songs in a minor key within rock”. The playlist was swiftly created. I then added the caveat “and no song should have more than 10 million plays”. The AI agent bubbled up an error explaining that it didn’t have access to global play counts. It also, surprisingly, explained that it didn’t have access to a few other things like musical keys, even though it had claimed to use that in the playlist’s construction. The agent was using its LLM’s knowledge of what key a given song was in and adding songs from its own memory accordingly. A close inspection of the playlist revealed a few songs that weren’t in a minor key at all. The LLM had, of course, hallucinated this information and proudly displayed it as a valid match to the playlist’s prompt.
Clearly, a playlist creator is a fairly low-stakes AI agent capability. The playlist it made was fine! The trouble is that it only really used about 25% of my constraints as validated input. The remaining 75% of my constraints were simply guessed at by the LLM, and the system never told me until I dug in deeper. This isn’t a Spotify problem; it’s an every-agent problem.
Three Propositions
To explore this concept of prompt fidelity more broadly, I need to make three propositions:
- Any AI agent’s verified data layer has a limited, finite capacity. An agent can only query the tools it’s been given, and those tools expose a fixed set of fields with finite resolution. You can enumerate every field in the schema and measure how much each narrows the search. A popularity score eliminates some fraction of candidates. A release date eliminates another. A genre tag eliminates more. Add up how much narrowing all the fields can do together and you get a rough number: the maximum amount of filtering the agent can prove it did. I’ll call that number I_max.
- User intent expressed in natural language is effectively unbounded. A person can write a prompt of arbitrary specificity. “Create a playlist with songs that are bass-led in minor key, post-punk from Manchester, recorded in studios with analog equipment between 1979 and 1983 that influenced the gothic rock movement but never charted.” Each clause narrows the search. Each adjective adds precision. There is no ceiling on how specific a user’s request can be, because natural language wasn’t designed around database schemas.
- Following directly from the first two: for any AI agent, there exists a point where the user’s prompt asks for more than the data layer can verify. Once a prompt demands more narrowing than the verified fields can provide, the remaining work has to come from somewhere. That somewhere is the LLM’s general knowledge, pattern matching, and inference. The agent will still deliver a confident result. It just can’t prove all of it. Not because the model is poorly built, but because the math doesn’t allow anything else.
This isn’t a quality problem, but a structural one. A better model doesn’t raise the ceiling. Better models do get better at inferring and filling in the rest of the user’s needs. Only adding more verified data fields raises this ceiling, and even then, each new field offers diminishing returns because fields are correlated (genre and energy aren’t independent, release date and tempo trends aren’t independent). The gap between what language can express and what data can verify is permanent.
The Problem: Agents Don’t Report Their Compression Ratio
Every AI agent with access to tools and skills does the same thing: it takes your request, decomposes that request into a set of actions, executes those actions, reasons over the output of those actions, and then presents a unified response.

This decomposition from request to action quietly erodes the meaning between what you asked for and what the AI agent responds with. The agent’s narration layer flattens what you asked and what it inferred into a single response.
The problem is that, as the user of an AI agent, you have no way to know what fraction of your input was used to trigger an action, what fraction of the response was grounded in real data, and what fraction was inferred from the actions the agent took. This is a problem for playlists because I got songs in a major key after explicitly asking for only songs in a minor key. It is much more of a problem when your AI agent is classifying financial receipts and transactions.
We need a metric for measuring this. I’m calling it Prompt Fidelity.
The Metric: Prompt Fidelity
Prompt fidelity for AI agents is defined by the constraints you give the agent when asking it to perform some action. Each constraint within a prompt narrows the possible paths the agent can take by some measurable amount. A naive approach to calculating fidelity would be to count every constraint, tallying up the ones that are verifiable and the ones that are inferred. The problem with that approach is that every constraint is weighted the same, and real-life datasets are often heavily skewed. A constraint that eliminates 95% of the catalog is doing vastly more work than one that eliminates 20%. Counting each constraint the same is wrong.
Therefore, we need to properly weight each constraint according to the work it does filtering the dataset. Logarithms achieve that weighting. The bits of information in a prompt can be defined as “-log2(p)” bits, where p is the surviving fraction of data after the constraint or filter you’ve applied.

In any agent action, each constraint can only be a) verified by tool calls or b) inferred by the LLM. Prompt fidelity measures the ratio between those two buckets: the verified bits divided by the total bits.

Prompt fidelity has a range of 0 to 1. A perfect 1.0 means that every part of your request was backed by real data. A fidelity of 0.0 means that the entire output of the AI agent was driven by its internal reasoning, or vibes.
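To make that concrete, here is a minimal sketch of the arithmetic in Python. The survival fractions and the verified/inferred labels are inputs you have to estimate yourself; nothing here queries a real catalog.

```python
from math import log2

def bits(p: float) -> float:
    """Information contributed by a constraint that keeps fraction p of the catalog."""
    return -log2(p)

def prompt_fidelity(constraints: list[tuple[float, bool]]) -> float:
    """constraints: (surviving fraction, verified by a tool call?) per constraint.
    Returns verified bits / total bits, which lands in [0, 1]."""
    total = sum(bits(p) for p, _ in constraints)
    verified = sum(bits(p) for p, is_verified in constraints if is_verified)
    return verified / total if total else 1.0  # an unconstrained prompt has nothing to verify
```

A constraint that keeps 5% of the catalog contributes −log₂(0.05) ≈ 4.3 bits, more than ten times the weight of one that keeps 80% (≈0.3 bits), which is exactly the skew the naive count misses.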

Spotify’s system above always reports an implicit, perfect 1.0 in this scenario. In reality, the prompt fidelity of the playlist creation was around 25%: two constraints (under 4 minutes and recorded before 2005) were fulfilled by the agent; the rest were inferred from the agent’s existing (and possibly faulty) knowledge and recall. At scale, and applied to more impactful problems, falsely reporting a high prompt fidelity becomes a big problem.
What Fidelity Actually Means (and Doesn’t Mean)
In audio systems, “fidelity” is a measure of how faithfully the system reproduces the original signal. High fidelity doesn’t guarantee that the music itself is good. High fidelity only guarantees that the music sounds the way it did when it was recorded. Prompt fidelity is the same idea: how much of your original intent (signal) was faithfully fulfilled by the agentic system.
High prompt fidelity means that the system did what you asked and you can PROVE it. Low prompt fidelity means the system probably did something close to what you wanted, but you’ll have to review it (listen to the whole playlist) to make sure that’s true.
Prompt fidelity is NOT an accuracy score. It cannot tell you that “75% of the songs in a playlist match your prompt”. A playlist with a 0.25 fidelity could be 100% good. The LLM might have nailed every single inference about each song it added. Or half the songs could be wrong. You don’t know. You can’t know until you listen to all the songs. That’s the point of a measurable prompt fidelity.
Instead, prompt fidelity measures how much of the result you can TRUST WITHOUT CHECKING. In a financial audit, if 25% of the line items have receipts and 75% of the line items are estimates, the total bill might still be 100% accurate, but your CONFIDENCE in that total is fundamentally different from an audit with every single line item supported by a receipt. The distinction matters because there are domains where ‘just trust the vibes’ is fine (music) and domains where it isn’t (medical advice, financial guidance, legal compliance).
Prompt fidelity is more like a measure of the documentation rate for a given set of constraints, not the error rate of the response itself.
Practically, in our Spotify example: as you add more constraints to your playlist prompt, the prompt fidelity drops, and the playlist becomes less of a precise report and more of a recommendation. That’s perfectly fine, but the user needs to be told which one they’re getting. Is this playlist exactly what I asked for? Or did you make something close to fulfill the goal I gave you? Surfacing that metric to the user is essential for building trust in these agentic systems.
The Case Study: Reverse-Engineering Spotify’s AI Playlist Agent
Spotify’s Prompted Playlists feature is what started this exploration into prompt fidelity. Let’s dive deeper into how these work and what I did to explore this capability purely from the standard prompt input field.
Prompted Playlists let you describe what you want in natural language. For example, in this playlist, the prompt is simply “rock songs in minor keys, under 4 minutes, recorded before 2005, featuring bass lines as a lead melodic element”.
Normally, to make a playlist, you’d have to comb through hours of music to land on exactly what you wanted. This playlist is 52 minutes long and took only a minute to generate. The appeal here is obvious and I genuinely enjoy this feature. Without having to know all the key rock artists, I can be introduced to the music and explore it more quickly and more easily.
Unfortunately, the official documentation from Spotify is very light. There are almost no details about what the system can or can’t do, what metadata it keys off of, nor is there any data mapping available.
Using a simple technique, however, I was able to map what I believe is the full data contract available to the agent over the course of one evening (all from my couch watching The Sopranos, naturally).
The Technique: Impossible Constraints as a Forcing Function
Because of how Spotify architected this playlist-building agent, when the agent cannot fulfill a request, its error messages can be coaxed into revealing architectural details that are otherwise unavailable. When you find a constraint the agent can’t build on, it will error, and you can leverage that error to learn what it CAN do. I’ll use this as the constant for probing the system.
In our example playlist above, Minor Keys & Bass Lines, adding the unlock phrase “with less than 10 million streams” acts as a circuit breaker for the agent, signalling that it cannot fulfill the user’s request. With this phrase in place, you can explore the possibilities by changing other aspects of the prompt over and over until you can see what the agent has access to. Collecting the responses, asking overlapping questions, and reviewing the answers lets you build a foundational understanding of what’s available to the agent.
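For a feel of the loop, here is a hypothetical sketch of the probing procedure. The `submit_prompt` callable is a stand-in for however you drive the agent (I did this by hand in the app), and the candidate constraints are illustrative, not a real field list.

```python
# Hypothetical audit loop: pair each candidate constraint with the known
# circuit-breaker phrase and collect the refusal text, which tends to name
# the fields the agent can actually use.
BASE = "rock songs in minor keys"
CIRCUIT_BREAKER = "with less than 10 million streams"  # known-unfulfillable

CANDIDATES = [
    "under 4 minutes",
    "released before 2005",
    "with high danceability",
    "that I skipped more than 5 times",
]

def probe(submit_prompt) -> dict[str, str]:
    """submit_prompt(prompt) -> the agent's response text (manual or scripted)."""
    findings = {}
    for candidate in CANDIDATES:
        findings[candidate] = submit_prompt(f"{BASE}, {candidate}, {CIRCUIT_BREAKER}")
    return findings
```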

What I Found: The Three-Tier Architecture
Spotify’s Prompted Playlist agent has a wealth of data available to it. I’ve separated it into three tiers: musical metadata, user-based data, and LLM inference. Beyond that, it appears that Spotify has excluded various data sources from its agent, either as a product choice or as a “get this out the door” choice.
- Tier 1
  - Verified track metadata: duration, release date, popularity, tempo, energy, explicit, genre, language
- Tier 2
  - Verified user behavioral data: play counts, skip counts, timestamps, recency flags, ms played, source, period analytics (40+ fields total)
- Tier 3
  - LLM inference: key/mode, danceability, valence, acousticness, mood, instrumentation: all inferred from general knowledge, narrated as if verified
- Deliberate exclusion:
  - Spotify’s public API has audio features (danceability, valence, etc.), but the agent doesn’t have access. Likely a product choice, not a technical limitation.
A full list of available fields is included at the bottom of this post.

The Behavioral Findings
The agent demonstrated surprisingly resilient behavior in the face of ambiguous requests and conflicting instructions. It sometimes reported that it was double-checking various constraints while fulfilling the user’s request. However, whether those constraints were actually checked against a validated dataset was never exposed.

When the playlist agent can get a close, but not exact, match to the constraints listed in the prompt, it runs a “related” query and silently substitutes the results of that query as valid results for the original request. This dilutes trust in the system, since a prompt requesting ONLY bass-driven rock music might gather non-bass-driven rock music into the playlist, likely dissatisfying the user.
There does appear to be a “certainty threshold” the agent isn’t comfortable crossing. For example, this entire exploration was based on the “less than 10 million plays” unlock phrase. When it trips, the agent exposes only a handful of the fields it has access to each time, and the list of fields changes from prompt to prompt, even when the prompt is identical between runs. That is classic LLM non-determinism. Exposing what the agent DOES have access to in a straightforward way would tell the human exactly what they can and can’t ask about, and would improve trust in the system.
Finally, when these two types of data are mixed, the agent isn’t transparent about which songs it used verified data for and which it used inferred data for. Both verified and inferred decisions are mixed together and presented with identical authority in the song notes. For example, if you craft a prompted playlist about your own user data (“songs I’ve skipped more than 30 times with a punchy bass-driven melody”), the agent will put real data (“you skipped this song 83 times last year!”) right next to inferred information (“John Deacon’s bass line commands attention throughout this song”). To be clear, I have not skipped any Queen songs 83 times, to my knowledge. And the AI agent doesn’t have a “bass_player” field anywhere in its available data to query against. The AI knows that Queen often has a strong bass line in their songs, and the knowledge that John Deacon was Queen’s bass guitarist lets its LLM infer that it’s his bass line that caused the song to be added to the playlist.
Applying the Math: Two Playlists, Two Fidelity Scores
Let’s apply this prompt fidelity concept to example playlists. I don’t have full access to the Spotify music catalog, so I’ll use example survivorship numbers for each criteria filter in the fidelity bit computations. The formula is the same at every step: bits = −log₂(p), where p is the estimated fraction of the catalog that survives the filter being applied.
“Minor Bass Melodies”: The Confident Illusion
This playlist is the one with Queen. “A playlist of rock music, all in minor key, under 4 minutes of playtime, released pre-2005, and bass-led”. I’ll apply our formula and use the bits of information from each step to compute the prompt fidelity.
Duration < 4 minutes
- Estimate: ~80% of tracks are under 4 minutes → p = 0.80, so bits = −log₂(0.80) ≈ 0.32
- This barely narrows anything, which is why it contributes so little
Release date before 2005
- Estimate: ~30% of Spotify’s catalog is pre-2005 (the catalog skews heavily toward recent releases) → p = 0.30, so bits ≈ 1.74
- More selective: eliminates 70% of the catalog
Minor key
- Estimate: ~40% of popular music is in a minor key → p = 0.40, so bits ≈ 1.32
- Moderate selectivity, but this is entirely inferred: the agent confirmed key/mode isn’t a verified field
Bass-led melodic element
- Estimate: ~5% of tracks feature bass as the lead melodic element → p = 0.05, so bits ≈ 4.32
- By far the most selective constraint. This single filter does more work than the other three combined. And it’s 100% inferred.
Totals:
- Verified bits: 0.32 (duration) + 1.74 (release date) = 2.06
- Inferred bits: 1.32 (minor key) + 4.32 (bass-led) = 5.64
- Prompt fidelity = 2.06 / (2.06 + 5.64) ≈ 0.27, the roughly 25% figure from earlier
These survival fractions are estimates. However, the structural point holds regardless of the exact numbers: the most selective constraint is the least verifiable, and that’s not a coincidence. The things that make a prompt interesting are almost always the things an agent has to guess at.

“Skipped Songs”: The Honest Playlist
This prompt is very straightforward: “A playlist of songs I’ve skipped more than 5 times”. This is very easy to verify, and the agent leans on the data it has access to.
Skip count > 5
- Estimate: ~10% of tracks in your library have been skipped more than 5 times → p = 0.10, so bits = −log₂(0.10) ≈ 3.32
- This is the only constraint, and it’s a verified field (user_skip_count)
Totals:
- Verified bits: 3.32
- Inferred bits: 0
- Prompt fidelity = 3.32 / 3.32 = 1.0
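Running both playlists through the same arithmetic (a self-contained repeat of the earlier sketch; the survival fractions are the estimates above):

```python
from math import log2

def fidelity(constraints: list[tuple[float, bool]]) -> float:
    """(surviving fraction, verified?) per constraint -> verified bits / total bits."""
    total = sum(-log2(p) for p, _ in constraints)
    verified = sum(-log2(p) for p, ok in constraints if ok)
    return verified / total

minor_bass = [(0.80, True),   # duration < 4 minutes (verified)
              (0.30, True),   # released pre-2005 (verified)
              (0.40, False),  # minor key (inferred)
              (0.05, False)]  # bass-led (inferred)
skipped = [(0.10, True)]      # user_skip_count > 5 (verified)

print(f"Minor Bass Melodies: {fidelity(minor_bass):.2f}")  # ≈ 0.27
print(f"Skipped Songs:       {fidelity(skipped):.2f}")     # 1.00
```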
The Structural Insight
The interesting part about prompt fidelity is evident in each playlist: the “most interesting” prompt is the least verifiable. A playlist of all my skipped songs is trivially easy to implement, but Spotify doesn’t want to show it. After all, these are all songs I generally don’t want to listen to, hence the skips. Similarly, a release date before 2005 is very easy to verify, but the resulting playlist is unlikely to be interesting to the average user.
The bass-line constraint, though, is very interesting for a user. Constraints like these are where the Prompted Playlist concept will shine. Already today I’ve created and listened to two such playlists, generated from just a notion of a song I wanted to hear more of.
However, the concept of a “bass-driven” song is hard to quantify, especially at Spotify’s scale. Even if they did quantify it, I’d ask for “clarinet jazz” the next day and they’d all have to get back to work finding and labeling those songs. And that, of course, is the magic of the Prompted Playlist feature.
Validation: A Controlled Agent
The Spotify examples are compelling, but I don’t have direct access to the schema, the tools, or the agentic harness itself. So I built a movie recommendation agent in order to test this concept in a more controlled setting.
The movie recommendation agent is built on the TMDB API, which provides the verified layer. The fields in the schema are genre, year, rating, runtime, language, cast, and director. All the other constraints, like mood, tone, and pacing, are not verified data and are instead sourced from the LLM’s own knowledge of movies. As the agent fulfills a user’s request, it records its data sources as either verified or inferred and scores its own response.
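The bookkeeping inside that agent is simple. Here is a simplified sketch: the verified field list matches the TMDB schema above, but the constraint dictionaries and their survival estimates come from the agent’s planner, which I’m omitting.

```python
from math import log2

# Fields the TMDB-backed layer can actually filter on.
VERIFIED_FIELDS = {"genre", "year", "rating", "runtime", "language", "cast", "director"}

def score_response(constraints: list[dict]) -> float:
    """constraints: one dict per constraint, e.g.
    {"field": "genre", "p": 0.12}  -> schema-backed filter (verified)
    {"field": "mood",  "p": 0.10}  -> LLM judgement call (inferred)
    Returns the agent's self-reported prompt fidelity."""
    verified = inferred = 0.0
    for c in constraints:
        b = -log2(c["p"])  # selectivity-weighted bits for this constraint
        if c["field"] in VERIFIED_FIELDS:
            verified += b  # resolved with a TMDB query
        else:
            inferred += b  # resolved from the LLM's knowledge of movies
    total = verified + inferred
    return verified / total if total else 1.0
```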
The Boring Prompt (F = 1.0)
We’ll start with a “boring” prompt: “Action movies from the 1980s rated above 7.0”. This gives the agent three constraints to work with: genre, date range, and rating. All of these constraints correspond to verified data values in the database.
If I run this through the test agent, a high fidelity score pops out naturally because every constraint is tied to verified data.

Every result here is verifiably correct. The LLM made zero judgement calls because it had data to base its response on for every constraint.
The Vibes Prompt (F = 0.0)
In this case, I’ll search for “movies that feel like a rainy Sunday afternoon”. No constraints in this prompt align to any verified data in our dataset. The work required of the agent falls entirely on the LLM, reasoning from its existing knowledge of movies.

The recommendations are defensible and are certainly good movies, but they aren’t verifiable according to the data we have access to. With no verified constraints to anchor the search, the candidate pool was the entire TMDb catalog, and the LLM had to do all the work. Some picks are great; others are the model reaching for obscure films it isn’t confident about.
The Takeaway
This test movie recommendation agent validates the prompt fidelity framework as a solid way to expose how an agent’s interpretation of a user’s intent pushes its response toward being either a precision tool or a recommendation engine. Where a response lands between those two options is essential for informing users and building trust in agentic systems.
The Fidelity Frontier
To make this concrete: Spotify’s catalog contains roughly 100 million tracks. I’ll call the total information your prompt needs to carry to narrow the catalog down to your playlist I_required.

To select a 20-song playlist from that catalog, you need roughly 22 bits of selectivity:

I_required = log₂(100,000,000 / 20) ≈ 22.25 bits

The verified fields (duration, release date, popularity, tempo, energy, genre, explicit flag, language, and the full suite of user behavioral data) have a combined capacity that tops out at roughly 10 to 12 bits, depending on how you estimate the selectivity of each field. After that, the verified layer is exhausted. Every additional bit of specificity your prompt demands has to come from LLM inference. I’ll call this maximum I_max.

That gives you a fidelity ceiling for any prompt:

F_max(prompt) = min(1, I_max / I_prompt)

And the fidelity ceiling for any fully specified playlist:

F_ceiling = I_max / I_required ≈ 12 / 22.25 ≈ 0.55

For the Spotify agent, a maximally specific prompt that fully defines a playlist cannot exceed roughly 55% fidelity. The other 45% is structurally guaranteed to be inference. For simpler prompts that don’t push past the verified layer’s capacity, fidelity can reach 1.0. But as prompts get more specific, fidelity drops, not gradually but by necessity.

This defines what I’m calling the fidelity frontier: the curve of maximum achievable fidelity as a function of prompt specificity. Every agent has one. It’s computable in advance from the tool schema. Simple prompts sit on the left of the curve, where fidelity is high. Creative, specific, interesting prompts sit on the right, where fidelity is structurally bounded below 1.0.
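The frontier is cheap to compute once you’ve estimated I_max from the schema. A sketch with the numbers above (I_max ≈ 12 bits is my estimate from the field audit, not a Spotify-published figure):

```python
from math import log2

CATALOG_SIZE = 100_000_000  # rough Spotify catalog
PLAYLIST_SIZE = 20
I_REQUIRED = log2(CATALOG_SIZE / PLAYLIST_SIZE)  # ≈ 22.25 bits to fully pin down a playlist
I_MAX = 12.0  # estimated capacity of the verified layer

def fidelity_ceiling(prompt_bits: float) -> float:
    """Maximum provable fidelity for a prompt carrying prompt_bits of specificity."""
    return min(1.0, I_MAX / prompt_bits)

for b in (4, 8, 12, 16, I_REQUIRED):
    print(f"{b:6.2f} bits -> ceiling {fidelity_ceiling(b):.2f}")
# Below ~12 bits the ceiling sits at 1.00; past it, fidelity falls by necessity.
```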
The uncomfortable implication is that the prompts users care about most (the ones that feel personal, specific, and tailored) are exactly the ones that push past the verified layer’s capacity. The most interesting outputs come from the least faithful execution. And the most boring prompts are the most trustworthy. That tradeoff is baked into the math. It doesn’t go away with scale, better models, or bigger databases. It only shifts.
For anyone building agents, the practical takeaway is this: you can compute your own I_max by auditing your tool schema. You can estimate the typical specificity of your users’ prompts. The ratio tells you how much of your agent’s output is structurally guaranteed to be inference. That’s a number you can put in front of a product team or a risk committee. And for agents handling policy questions, medical information, or financial advice, it means there’s a provable lower bound on how much of any response cannot be grounded in retrieved data. You can shrink it. You cannot eliminate it.
The Broader Application: Every Agent Has This Problem
This isn’t a Spotify problem. It’s a problem for any system where an LLM orchestrates tool calls to answer a user’s question.
Consider Retrieval Augmented Generation (RAG) systems, which power most enterprise AI knowledge-base deployments today. When an employee asks an internal assistant a policy question, part of the answer comes from retrieved documents and part comes from the LLM synthesizing across them, filling gaps, and smoothing the language into something readable. The retrieval is verified. The synthesis is inferred. And the response reads as one seamless paragraph with no indication of where the seams are. A compliance officer reading that answer has no way to know which sentence came from the corporate policy document and which sentence the model invented to connect two paragraphs that didn’t quite fit together. The fidelity question is identical to the playlist question, just with higher stakes.
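A RAG version of the metric is harder to pin down, but even a crude sketch shows the shape of it. This hypothetical scorer equal-weights claims (the very simplification the selectivity argument warns against) and uses word overlap where a production system would need entailment checking:

```python
def rag_fidelity(claims: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of answer claims with (loose) support in a retrieved chunk.
    Word overlap stands in for real support checking; both the threshold and
    the equal weighting are simplifications."""
    def supported(claim: str) -> bool:
        words = set(claim.lower().split())
        if not words:
            return True
        return any(
            len(words & set(chunk.lower().split())) / len(words) > 0.7
            for chunk in retrieved_chunks
        )
    if not claims:
        return 1.0
    return sum(supported(c) for c in claims) / len(claims)
```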
Coding agents face the same decomposition. When an AI generates a function, some of it may reference established patterns from its training data or documentation lookups, and some of it is novel generation. As more production code is written by AI, surfacing that ratio becomes a real engineering concern. A function that’s 90% grounded in well-tested patterns carries different risks than one that’s 90% novel generation, even if both pass the same test suite today.
Customer service bots may be the highest-stakes example. When a bot tells a customer what their refund policy is, that answer needs to be drawn directly from policy documents, full stop. Any inferred or synthesized content in that response is a liability. The silent substitution behavior observed in Spotify (where the agent ran a nearby query and narrated it as if it fulfilled the original request) would be genuinely dangerous in a customer service context. Imagine a bot confidently stating a return window or coverage term that it inferred rather than retrieved.
The general form of prompt fidelity applies to all of these:
Fidelity = (bits of the response grounded in tool calls) / (total bits of the response)
The hard part, and increasingly the core challenge of AI engineering work, is defining what “bits” means in each context. For a playlist with discrete constraints, it’s clear. For free-text generation, you’d have to decompose a response into individual claims and assess each one, which is closer to what factuality benchmarks already try to do, just reframed as an information-theoretic measure. That’s a hard measurement problem, and I don’t claim to have solved it here.
But I think the framework has value even when exact measurement is impractical. If the people building these systems are thinking about fidelity as a design constraint (what fraction of this response can I ground in tool calls, and how do I communicate that to the user?), the outputs will be more trustworthy whether or not anyone computes a precise score. The point isn’t a number on a dashboard. The point is a mental model that shapes how we build.
The Complexity Ceiling
Every agent has a complexity ceiling. Simple lookups (what’s the play count for this track?) are essentially free. Filtering the catalog against a set of field-level predicates (show me everything under 4 minutes, pre-2005, popularity below 40) scales linearly and runs fast. But the moment a prompt requires cross-referencing entities against each other (does this track appear in more than three of my playlists? was there a year-long gap somewhere in my listening history?), the cost jumps quadratically, and the agent either refuses outright or silently approximates.
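A toy illustration of where that boundary sits, on synthetic data. The field predicates are one pass over the catalog; the cross-referencing query scans every playlist for every track, which is the cost shape that forces an agent to approximate:

```python
# Toy data: a synthetic catalog and some synthetic playlist memberships.
tracks = [{"id": i, "duration_ms": 120_000 + 200 * i, "year": 1980 + i % 40}
          for i in range(1_000)]
playlists = {f"pl{j}": set(range(j, 1_000, j + 1)) for j in range(1, 20)}

# Linear: independent WHERE-style predicates, one pass over the catalog.
short_and_old = [t for t in tracks
                 if t["duration_ms"] < 240_000 and t["year"] < 2005]

# Cross-referencing: every track checked against every playlist,
# O(tracks x playlists). Swap playlists for a full listening history and
# this is the query shape that pushes an agent into silent approximation.
in_three_plus = [t for t in tracks
                 if sum(t["id"] in members for members in playlists.values()) >= 3]
```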
That silent approximation is the interesting failure mode. The agent follows a kind of principle of least computational action: when the exact query is too expensive, it relaxes your constraints until it finds a version it can afford to run. You asked for a specific valley in the search space; it rolled downhill to the nearest one instead. The result is a local minimum, close enough to look right, cheap enough to serve, but it’s not what you asked for, and it doesn’t tell you the difference.
This ceiling isn’t unique to Spotify. Any agent built on indexed database lookups will hit the same wall. The boundary sits right where queries stop being decomposable into independent WHERE clauses and start requiring joins, full scans, or aggregations across your entire history. Below that line, the agent is a precision tool. Above it, it’s a recommendation engine wearing a precision tool’s clothes. The question for anyone building these systems isn’t whether the ceiling exists (it always does) but whether your users know where it is.
What to Do About It: Design Recommendations
If prompt fidelity is a real and measurable property of agentic systems, the natural question is what to do about it. Here are five recommendations for anyone building or deploying AI agents with tool access.
- Report fidelity, even roughly. Spotify already shows audio quality as a simple indicator (low, normal, high, very high) when you’re streaming music. The same pattern works for prompt fidelity. You don’t need to show the user a decimal score. A simple label (“this playlist closely matches your prompt” versus “this playlist is inspired by your prompt”) would be enough to set expectations correctly. The difference between a precision tool and a recommendation engine is fine, as long as the user knows which one they’re holding.
- Distinguish grounded claims from inferred ones in the UX. This can be subtle. A small icon, a slight color shift, a footnote. When Spotify’s playlist notes say “86 skips”, that’s a fact from a database. When they say “John Deacon’s bass line drives the whole track”, that’s the LLM’s general knowledge. Both are presented identically today. Even a minimal visual distinction would let users calibrate their trust per claim rather than trusting or distrusting the entire output as a block.
- Disclose substitutions explicitly. When an agent can’t fulfill a request exactly but can get close, it should say so. “I couldn’t filter on download status, so I found songs from albums you’ve saved but haven’t liked” preserves trust far more than silently serving a nearby result and narrating it as if the original request was fulfilled. Users are forgiving of limitations. They’re much less forgiving of being misled.
- Provide deterministic capability discovery. When I asked the Spotify agent to list every field it could filter on, it produced a different answer each time, depending on the context of the prompt. The LLM was reconstructing the field list from memory rather than reading from a fixed reference. Any agent that exposes filtering or querying capabilities to users should have a stable, deterministic way to discover those capabilities. A “show me what you can do” command that returns the same answer every time is table stakes for user trust (a minimal sketch of this pattern follows this list).
- Audit your own agent with this technique before your users do. The methodology in this piece (pairing impossible constraints with target fields to force informative refusals) is a general-purpose audit technique that works on any agent with tool access. It took one evening and a few dozen prompts to map Spotify’s full data contract. Your users will do the same thing, whether you invite them to or not. The question is whether you understand your own system’s boundaries before they do.
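For the fourth recommendation, the fix is structural rather than clever: serve capability discovery from a static schema instead of the model’s recall. A hypothetical sketch; the field names here echo the tables at the bottom of this post.

```python
# Hypothetical: capability discovery served from a static schema, not LLM memory.
AGENT_SCHEMA = {
    "filterable_fields": ["duration_ms", "release_date", "popularity", "tempo",
                          "energy", "genre", "explicit", "language_of_performance"],
    "unavailable": ["global stream counts", "key/mode", "danceability"],
}

def describe_capabilities() -> dict:
    """Registered as a tool; returns the same answer on every call, so the model
    narrates the schema instead of reconstructing it from memory."""
    return AGENT_SCHEMA
```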
Closing
Every AI agent has a fidelity score. Most are lower than you’d expect. None of them report it.
The methodology here (using impossible constraints to force informative refusals) isn’t specific to music or playlists. It works on any agent that calls tools. If the system can refuse, it can leak. If it can leak, you can map it. A dozen well-crafted prompts and an evening of curiosity is all it takes to learn what a production agent can actually do versus what it claims to do.
The math generalizes too. Weighting constraints by their selectivity rather than just counting them reveals something that a naive audit misses: the constraints that make a prompt feel personal and specific are almost always the ones the system can’t verify. The most interesting outputs come from the least faithful execution. That tension doesn’t go away with better models or bigger databases. It’s structural.
As AI agents become the primary way people interact with data systems (their music libraries today, their financial accounts and medical records tomorrow), users will probe the boundaries. They’ll find the gaps between what was promised and what was delivered. They’ll discover that the confident, well-narrated response was partially grounded and partially invented, with no way to tell which parts were which.
The question isn’t whether your agent’s fidelity will be measured. It’s whether you measured it first.
Bonus: Prompts Worth Trying (If You Have Spotify Premium)
Once you know the schema, you can write prompts that surface genuinely surprising things about your listening history. These all worked for me, with varying degrees of tweaking:
The Relationship Autopsy
- “Songs where my skip count is higher than my play count”
- Fair warning: this one may trigger existential discomfort (you skip these songs for a reason!)
Love at First Listen
- “Songs where I saved them within 24 hours of my first play, sorted by oldest first”
- A chronological timeline of tracks that grabbed you instantly
The Lifecycle
- “Songs I first ever played, sorted by most plays”
- Your origin story on the platform
The Marathon
- “Songs where my total ms_played is highest, convert to hours”
- Not most plays, but most total time. A different and often surprising list
The Longest Relationship
- “Songs with the smallest gap between first play and most recent play, with at least 50 plays, ordered by earliest first listen”
The One-Week Obsessions
- “Songs I played more than 10 times in a single week and then never touched again”
- Your former obsessions, fossilized. This was like a time machine for me.
The Time Capsule
- “One song from each year I’ve been on Spotify — the song with the most plays from that year”
The Before and After
- “Two sets: my 10 most-played songs in the 6 months before [milestone date] and my 10 most-played in the 6 months after”
- Plug in any date that mattered: a move, a new job, a breakup, or even the Covid-19 lockdown
The Soundtrack to a Year
- “Pick the year where my total ms_played was highest. Build a playlist of my top songs from that year”
What Didn’t Work (and Why)
- Comeback Story (year-long gap detection): “Songs I rediscovered after a year-long gap in listening”
  - The agent can’t scan the full play history for gaps. Snapshot queries work; timeline scans don’t.
- Seasonal patterns (only played in December): “Songs I only played in December but never any other month”
  - Proving a universal negative requires a full scan. Same fundamental limitation.
- Derived math (ms_played / play_count): “Songs where my average listen time is under 30 seconds per play”
  - The agent struggles with computed fields. Stick to raw comparisons.
- These failures map directly to the complexity ceiling: they require O(n²) or full-scan operations the agent can’t, or isn’t allowed to, perform.
Tips
- Reference field names directly when the agent misinterprets natural language
- Start broad and tighten. Loose constraints succeed more often
- “If you can’t do X, tell me what you CAN do” is the universal audit prompt
Track Metadata
| Field | Status | Description |
| --- | --- | --- |
| album | ✅ Verified | Album name |
| album_uri | ✅ Verified | Spotify URI for the album |
| artist | ✅ Verified | Artist name |
| artist_uri | ✅ Verified | Spotify URI for the artist |
| duration_ms | ✅ Verified | Track length in milliseconds |
| release_date | ✅ Verified | Release date, supports arbitrary cutoffs |
| popularity | ✅ Verified | 0–100 index. Proxy for streams, not a precise count |
| explicit | ✅ Verified | Boolean flag for explicit content |
| genre | ✅ Verified | Genre tags for track/artist |
| language_of_performance | ✅ Verified | Language code. “zxx” (no linguistic content) used as an instrumentalness proxy |
Audio Features (Partial)
| Field | Status | Description |
| --- | --- | --- |
| energy | ✅ Verified | Available as a filterable field |
| tempo | ✅ Verified | BPM, available as a filterable field |
| key / mode | ❌ Unavailable | “Would have to infer from knowledge; no verified field” |
| danceability | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| valence | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| acousticness | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| speechiness | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| instrumentalness | ❌ Unavailable | Replaced by the language_of_performance == “zxx” workaround |
User Behavioral Data
| Field | Status | Description |
| --- | --- | --- |
| user_play_count | ✅ Verified | Total plays per track. Observed: 122, 210, 276 |
| user_ms_played | ✅ Verified | Total milliseconds streamed per track, album, artist |
| user_skip_count | ✅ Verified | Total skips per track. Observed: 64, 86 |
| user_saved | ✅ Verified | Whether the track is in Liked Songs |
| user_saved_album | ✅ Verified | Whether the album is saved to the library |
| user_saved_date | ✅ Verified | Timestamp of when the track/album was saved |
| user_first_played | ✅ Verified | Timestamp of first play |
| user_last_played | ✅ Verified | Timestamp of most recent play |
| user_days_since_played | ✅ Verified | Pre-computed convenience field for recency filtering |
| user_streamed_track | ✅ Verified | Boolean: ever streamed this track |
| user_streamed_track_recently | ✅ Verified | Boolean: streamed in approx. the last 6 months |
| user_streamed_artist | ✅ Verified | Boolean: ever streamed this artist |
| user_streamed_artist_recently | ✅ Verified | Boolean: streamed this artist recently |
| user_added_at | ✅ Verified | When a track was added to a playlist |
Source & Context
| Field | Status | Description |
| --- | --- | --- |
| source | ✅ Verified | Play source: playlist, album, radio, autoplay, etc. |
| source_index | ✅ Verified | Position within the source |
| matched_playlist_name | ✅ Verified | Which playlist a track belongs to. No cross-playlist aggregation. |
Period Analytics (Time-Windowed)
| Field | Status | Description |
| --- | --- | --- |
| period_ms_played | ✅ Verified | Milliseconds played within a rolling time window |
| period_plays | ✅ Verified | Play count within a rolling time window |
| period_skips | ✅ Verified | Skip count within a rolling time window |
| period_total | ✅ Verified | Total engagement metric within a rolling time window |
Query / Search Fields
| Field | Status | Description |
| --- | --- | --- |
| title_query | ✅ Verified | Fuzzy text matching on track titles |
| artist_query | ✅ Verified | Fuzzy text matching on artist names |
Confirmed Unavailable
| Field | Status | Notes |
| --- | --- | --- |
| Global stream counts | ❌ Unavailable | Cannot filter by exact play count (e.g., “under 10M streams”) |
| Cross-playlist count | ❌ Unavailable | Cannot count how many playlists a track appears in |
| Family/household data | ❌ Unavailable | Cannot access other users’ listening data |
| Download status | ⚠️ Unreliable | Agent served results but most tracks lacked download indicators. Likely device-local. |