Here’s the paraphrased version of the article in HTML format:
I was recently assigned a new task at work: using raw, unstructured text data, produce a thorough PDF report summarizing what customers are saying about our products this quarter.
So I composed a straightforward prompt. I provided Claude with a comprehensive set of directions, supplied the dataset, and received the output. I submitted it.
However, when the stakeholder and I examined the deliverable closely, we started noticing some increasingly troubling issues.
Claude was wrong with total confidence.
Not completely wrong, in the sense of fabricating nonexistent facts. It was more like…excessively certain in its errors. Claude would produce a quarterly insight report and claim something along the lines of:
“Negative sentiment in the Dresses department climbed 23% this quarter, pointing to a considerable shift in customer satisfaction that demands urgent attention from the product team.”
It sounds impressive. However, that spike was largely driven by a single well-known item that launched mid-quarter with a recognized sizing flaw. One item — not the entire department.
Claude had no way of knowing that. And my prompt never instructed it to consider that angle.
A Creating a Quarterly Customer Review Report Skill
In this article, I’ll walk you through a Claude skill I developed that produces a quarterly customer sentiment report from disorganized product review data, delivered as a PDF for stakeholders.
Naturally, I won’t be revealing the actual dataset I worked with at the office. Instead, I’m using the Women’s E-Commerce Clothing Reviews dataset from Kaggle (CC0 license). It consists of 23,000 genuine, anonymized customer reviews spanning clothing departments (Tops, Dresses, Bottoms, Jackets, and others), including review text, star ratings, and product details. Any references to the company within the reviews have been substituted with “retailer.”
This skill should:
- Extract and read a filtered subset of reviews from the current quarter
- Organize them by department
- Detect emerging trends and pain points
- Produce a polished summary PDF for the product leadership team
Here’s the initial prompt I used:
You are a data analyst preparing a quarterly customer sentiment report for a women’s clothing e-commerce retailer. Based on this quarter’s customer reviews (including review text, star ratings, and department), compose a polished stakeholder report covering:
– A broad sentiment overview for the quarter
– Major themes organized by department (Tops, Dresses, Bottoms, Jackets)
– 2–3 notable insights drawn from the review text
– A concise recommendation directed at the product team
Write in a professional, clear tone.
Once you finish, please create a skill called reviews-analysis and store your instructions there.
What “Confident Errors” Really Look Like
Below is a sample of what Claude generated with the basic skill described above, during a quarter when the Dresses department received a surge of negative reviews:
“Negative sentiment in the Dresses department rose sharply this quarter, with customers repeatedly mentioning fit and sizing problems. This implies that the retailer’s sizing standards may be drifting away from what customers expect — a pattern that, if left unchecked, could weaken brand loyalty in this important category.”
The actual reason? A single dress (a specific SKU) launched in Week 7 with a manufacturing defect. The reviews focused almost entirely on that one product. The remainder of the Dresses department was performing well.
Claude didn’t fabricate anything outright. It simply lacked the background to understand why the trend appeared. And missing that understanding, it followed the standard LLM approach: it bridged the gap with a narrative that sounded convincing.

The Solution: Four Critical Lines You Must Add
Line 1: Clarify What Context Claude Is Missing
You do NOT have access to product launch calendars, inventory records, promotional campaign timelines, or individual SKU-level history. Do NOT assign department-level trends to company-wide causes. Report only the patterns visible within the text; avoid explaining their origin unless the reviews themselves make the reason unmistakably clear.
This single addition prevents a major class of confidently inaccurate outputs. Without it, Claude will default to crafting a strategic explanation because that’s what a skilled analyst would do — and Claude is trying to act like a skilled analyst.
The issue is that a skilled analyst also recognizes the boundaries of their knowledge. A real analyst would say: “We’re noticing a rise in sizing complaints in Dresses this quarter. It may be tied to a recent product launch, but we’d need SKU-level data to verify that.” Claude won’t make that distinction unless explicitly told to.
Line 2: Specify What “Significant” Actually Entails
Claude is fond of using the word significant. It deploys it constantly — and it hardly ever defines what that means.
Only label a sentiment shift as “significant” if it reflects a change of more than 15 percentage points in the positive-to-negative ratio compared to the previous quarter, OR if a theme shows up in over 20% of reviews within a specific department. For subtler signals, use phrasing like “slight uptick” or “minor increase.” Never apply the words “notable” or “significant” to anything falling beneath these thresholds. Always include the actual percentage alongside your claim.
Feel free to adjust those 15% and 20% thresholds to fit your own dataset. The key principle is grounding Claude’s word choices in measurable criteria.
Without this guardrail, Claude will describe both a 3-review uptick in complaints and a legitimate 30-point sentiment plunge as equally “significant.” Your stakeholders will eventually start ignoring the reports. And when a genuinely major shift
If they don’t know it happened, they won’t know about it.
Line 3: Add a Confidence Rating to Every Finding
Before each finding, add a confidence level in brackets: [Data-Backed], [Likely], or [Assumed].
Only use [Data-Backed] when the finding is directly supported by the review content given. Use [Likely] when it’s a logical conclusion drawn from the text. Use [Assumed] when you’re guessing about reasons or background not mentioned in the reviews.
When I initially included this rule, I thought most tags would be [Data-Backed]. Instead, I got a blend of all three, which revealed just how much Claude had been making assumptions in earlier reports without my awareness.
Here’s an example of the result after applying this rule:

Now your team can clearly distinguish what’s confirmed and what’s an estimate. That’s a far more transparent report.
Line 4: Make Claude Outline What the Analysis Can’t Do
At the conclusion of the report, add a section titled “What This Report Can’t Answer.” List 2-3 pieces of information that would be necessary for stronger insights, such as item-level review details, return rate statistics, or repeat buyer behavior.
This directive pushes Claude to recognize the boundaries of its own evaluation. It also gives your stakeholders a clear guide for what areas to explore next, which is truly one of the most valuable contributions an analyst can offer.
Here’s the generated output:

How to Use Claude to Improve the Skill
Creating a skill once isn’t sufficient. You have to test and refine it just as you’d refine a data model.
Step 1: Test the skill against familiar cases.
Narrow your data to a period where you already understand what occurred. (A quarter with a product recall, a seasonal sale, an unusually high return period, and so on.) Check what Claude reports. Does it apply the term “significant” appropriately? Does it reference actual facts and numbers where it should?
Step 2: Give Claude its own output and request a self-review.
Claude can effectively identify its own inflated confidence when prompted to check for it.
Below is a quarterly customer sentiment report produced by an AI analyst. Examine each finding in this report and flag any that:
– Draw cause-and-effect conclusions without clear support in the review text
– Apply terms like “significant” or “notable” without supporting evidence
– Connect isolated product problems to company-wide patterns
– Presume context missing from the dataset (launch dates, stock levels, purchase records)
For each flagged finding, propose a revised version that’s more cautiously worded.
Step 3: Insert a rule for each error you catch.
Whenever Claude generates a report with an obviously incorrect or overly confident finding, instruct it to add a new restriction to your skill. Gradually, your skill essentially becomes a catalog of everything Claude gets wrong.
A Quick Warning
Inserting restrictions into your skill can occasionally cause Claude to produce write-ups where nearly every line concludes with “…though more evidence would be required to verify this.”
That’s not helpful either.
The objective is balanced confidence—where the assertiveness of Claude’s wording aligns with how strong the evidence is. If Claude starts sounding excessively indecisive, you can include a corrective rule:
Don’t qualify every single statement. If a trend appears clearly and repeatedly across many reviews, state it directly and reference the supporting data. Only use hedging language for genuinely uncertain or speculative points.
Conclusion
Claude is excellent at producing polished, professional-looking reports—and that can sometimes be the issue.
The surface-level polish masks the underlying overconfidence. Your stakeholders see sleek formatting and convincing language, and they take the findings as rock-solid even when they’re not.
The four guidelines I’ve outlined here don’t weaken Claude. They make it more truthful. And in a reporting environment, truthful is more valuable than polished.
Explore other practical applications for Claude here, including creating dashboards, troubleshooting, and drafting documentation:
→ 3 Claude Skills Every Data Scientist Needs in 2026
Thanks for Reading
Reach out to me on LinkedIn
Buy me a coffee to support my work!



