Support QA is hard to get right. If you've tried it before and it didn't stick, you're not alone.
{.text-lg .text-neutral-600 .mb-8}
Most QA programs fail. The rubric gets built for auditors, not for the people who actually need to use it: agents, team leads, and the customers who never see it but feel the results.
You've probably seen this pattern: Someone builds a scorecard with 15 categories and 47 sub-questions. Reviewers spend 30 minutes per ticket. Scores go into a spreadsheet. Nothing changes. Within three months, QA becomes a checkbox exercise everyone resents, and the spreadsheet stops getting updated.
Or maybe your QA program does run, but it feels disconnected from reality. Agents with high scores somehow have worse customer satisfaction. The rubric rewards following scripts instead of solving problems. You're measuring something, but you're not sure it's the right thing.
This playbook takes a different approach. It's built around one question: Did we solve the problem and keep the customer?
Everything else only matters if it serves that goal: the tone, the greeting, the script adherence—all of it. Look, I know you've seen QA programs die before. Give this a read anyway. By the end, you'll have a rubric you can actually use, guidance on the stuff that makes it stick, and a clear path to getting started this week.
Why Most QA Programs Fail
Before we get to the rubric, it's worth understanding why QA fails. You already know it's hard. But understanding the specific failure modes helps you avoid them.
Measuring what's easy instead of what matters
Process metrics are seductive because they're objective. "Did the agent use the customer's name?" has a clear answer. "Did the customer leave feeling valued?" requires judgment. So rubrics drift toward what's easy to measure: greetings, sign-offs, script compliance, hold time.
Here's the thing: customers want their problem solved. The greeting is background noise. An agent who says "Thanks for calling, how can I provide you with excellent service today?" but fails to fix the issue is worse than an agent who says "Hey, what's up?" and resolves it in two minutes.
What works: Weight outcomes heavily. This rubric uses 60% outcomes, 40% process. Some organisations go as high as 85/15. (I've watched teams argue about the exact ratio for hours. Start with 60/40 and adjust based on what you learn.) The right ratio depends on your context, but if your rubric weights process and outcomes equally, you're over-weighting process.
Calibration isn't optional
Here's what happens without it: your 4 is someone else's 3. I've seen teams where scores varied by a full point depending on who reviewed. At that point you're measuring reviewer mood, not agent performance.
The fix is simple but people skip it anyway. Monthly calibration sessions where reviewers score the same conversations independently, then compare. Target 80% agreement within one point. (Honestly, 70% is fine when you're starting—the point is you're converging.) The disagreements are the whole point. That's where you figure out what the rubric actually means.
Scores without action are just data entry
If you're not coaching off the scores, stop scoring. You're wasting everyone's time.
Enough about failure. Here's the rubric.
The Outcomes Rubric
This rubric has six categories, weighted toward outcomes. It's designed to be scored in 10-15 minutes per conversation by someone who understands your product and customers. It won't capture everything, but it will capture what matters most.
A note on the 60/40 split: We recommend 60% outcomes, 40% process as a starting point. SQM Group's benchmark research uses 85/15. The right ratio depends on your industry and what actually correlates with retention in your data. Regulated industries may need higher process weights. Start with 60/40, then adjust based on what you learn.
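If you're unsure how much the split actually matters, it helps to see the arithmetic. Here's a minimal sketch, assuming a hypothetical conversation that averages 4.0 across the outcome categories and 3.0 across the process categories; the numbers and the Python are purely illustrative, not part of the rubric.

```python
# Illustrative only: how the outcomes/process split moves the final 1-5 score
# for the same hypothetical conversation.
outcomes_avg, process_avg = 4.0, 3.0

for outcomes_weight in (0.60, 0.70, 0.85):
    process_weight = 1 - outcomes_weight
    final = outcomes_weight * outcomes_avg + process_weight * process_avg
    print(f"{outcomes_weight:.0%}/{process_weight:.0%} split -> {final:.2f}")

# 60%/40% split -> 3.60
# 70%/30% split -> 3.70
# 85%/15% split -> 3.85
```

The same strong-outcomes, weaker-process conversation scores higher as the split shifts toward outcomes. Run a few of your own conversations through this before committing to a ratio.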
Outcomes (60% of total score)
These three categories ask: What happened for the customer?
1. Resolution Quality (25%)
This is the most important category—worth spending time on. A customer contacts support because something is wrong. Did we make it right? Everything else is secondary.
| Score | Criteria |
| --- | --- |
| 5 | Issue fully resolved in this interaction. Customer confirmed or clearly indicated satisfaction. |
| 4 | Issue resolved but minor loose ends remain (e.g., "tracking email coming in 24 hours"). |
| 3 | Core issue addressed but secondary concerns left open. Acceptable for complex cases. |
| 2 | Attempted resolution but issue persists. Customer will need to contact again. |
| 1 | No meaningful progress. Issue unaddressed, or agent actions made the situation worse. |
What about issues that can't be resolved in one contact? Shipping delays, third-party dependencies, and complex bugs take time. Score based on whether the agent did everything within their control and set clear expectations. A 3 is appropriate when resolution depends on factors outside the agent's influence: the delay isn't their fault, so score what they controlled.
2. Customer Effort (20%)
CES research suggests effort predicts loyalty better than satisfaction scores. When customers have to repeat themselves, get transferred, or do the agent's research for them, they remember. And they leave.
| Score | Criteria |
| --- | --- |
| 5 | Effortless. Agent anticipated needs and took initiative. Customer provided info once. |
| 4 | Low effort. Minimal clarification needed. Conversation flowed smoothly. |
| 3 | Moderate effort. Some back-and-forth, but justified by issue complexity. |
| 2 | High effort. Customer repeated information, faced unnecessary transfers or holds. |
| 1 | Customer had to diagnose, research, or solve parts of the problem themselves. |
Security verification requires effort by design. Don't penalise agents for identity checks, two-factor authentication, or compliance requirements. Score based on effort beyond what's necessary for security.
3. Retention Signal (15%)
This one's fuzzy. On purpose. You're reading signals, not predicting the future. Some customers never say thank you; others complain loudly but stay for years. Focus on explicit statements and clear sentiment shifts.
| Score | Criteria |
| --- | --- |
| 5 | Explicit positive signal: gratitude, stated intent to continue, or sentiment clearly improved. |
| 4 | Positive close. No churn indicators. Customer's final message suggests satisfaction. |
| 3 | Neutral. Issue handled but customer's sentiment unchanged from start of conversation. |
| 2 | Warning signs: customer mentioned competitors, expressed unresolved frustration. |
| 1 | Explicit churn signal: threatened to leave, requested cancellation, or escalated complaint. |
Important: A 1 here doesn't mean the agent failed. Some customers have already decided to leave before they contact you. A graceful goodbye isn't a failure—score the signal, then separately note whether recovery was realistic.
Process (40% of total score)
These three categories ask: How did we get there? Process matters because it's what you can train and coach.
4. Accuracy & Knowledge (20%)
Wrong information erodes trust faster than almost anything else. A customer who gets incorrect pricing, wrong return windows, or bad technical advice will remember. And they'll tell others.
| Score | Criteria |
| --- | --- |
| 5 | All information accurate. Agent demonstrated strong product knowledge and gave proactive guidance. |
| 4 | Accurate on key points. Minor details missing but nothing that affected resolution. |
| 3 | Mostly accurate. One minor error that didn't meaningfully impact the customer. |
| 2 | Error present but caught and corrected during the conversation. |
| 1 | Significant error given and not corrected. Misinformation affected the outcome. |
Policy often has legitimate ambiguity. If an agent interprets policy in a customer-friendly way that's defensible, that's not inaccuracy. Reserve low scores for objectively wrong information.
5. Communication & Empathy (15%)
Customers can tell when they're being processed versus being helped. Empathy means acknowledging the situation before jumping to solutions. That's it. (Scoring this one gets contentious. I've watched teams argue over a single conversation for forty minutes, and the answer is almost always: it depends on the situation.)
| Score | Criteria |
| --- | --- |
| 5 | Exceptional. Natural language, genuine empathy, tone appropriate to the customer's state. |
| 4 | Good. Professional, warm, acknowledged the customer's situation directly. |
| 3 | Adequate. Professional but noticeably templated. Got the job done. |
| 2 | Needs work. Tone mismatch, dismissed customer's concerns, or felt robotic. |
| 1 | Poor. Dismissive, condescending, or made the customer feel worse about contacting us. |
Quick note on tone-matching: it's harder than people think. Match the customer's energy, not their negativity. An agent who stays composed while a customer yells deserves a high score, not a low one.
6. Process Adherence (5%)
This category is weighted low. Process should serve outcomes. But some processes exist for good reasons: compliance, security, documentation.
| Score | Criteria |
| --- | --- |
| 5 | Full compliance. Escalation paths followed, proper documentation, SLAs met. |
| 3 | Minor gaps. Key procedures followed but documentation incomplete or minor steps missed. |
| 1 | Major violation. Required compliance steps skipped, SLA breach, or policy violation. |
If you're in finance, healthcare, or legal, consider increasing this to 10-15%. Compliance failures carry disproportionate risk in regulated industries.
That's the rubric. Six categories, weighted toward outcomes. Simple enough to score in 15 minutes, nuanced enough to capture what matters.
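For teams that want a single number per conversation, here's a minimal sketch of how the six category scores roll up using the weights above. The dictionary keys and the example scores are our own choices for illustration, not something your QA tool will ship with.

```python
# Category weights from the rubric, in percent (outcomes 60, process 40).
WEIGHTS = {
    "resolution_quality": 25,
    "customer_effort": 20,
    "retention_signal": 15,
    "accuracy_knowledge": 20,
    "communication_empathy": 15,
    "process_adherence": 5,
}

def composite_score(scores: dict[str, int]) -> float:
    """Weighted average of the 1-5 category scores, still on a 1-5 scale."""
    return sum(WEIGHTS[category] * scores[category] for category in WEIGHTS) / 100

# Example: strong outcomes, templated communication, minor documentation gaps.
print(composite_score({
    "resolution_quality": 5,
    "customer_effort": 4,
    "retention_signal": 4,
    "accuracy_knowledge": 5,
    "communication_empathy": 3,
    "process_adherence": 3,
}))  # -> 4.25, a solid conversation despite two 3s
```

A straight-3s conversation lands at exactly 3.0, which is a useful number to have in hand when the "3 problem" comes up later.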
But having a good rubric is necessary, not sufficient. In my experience, the hard part isn't the rubric. It's using it consistently, fairly, and in a way that actually improves performance.
What Nobody Tells You About QA Scoring
If you've made it this far, you have a rubric. But rubrics don't score conversations. People do. And people bring biases, moods, and assumptions that affect their scores in ways they don't notice.
This section covers the unwritten rules. Experienced QA reviewers know these things but rarely explain them. It's become second nature, so they forget it ever needed explaining. If you're new to QA, this will save you months of learning the hard way.
Your Brain Will Sabotage Your Scores
You anchor on your first score. If you start with a 5 on Resolution Quality, you'll unconsciously be more lenient on Communication. If you start with a 2, you'll be harsher throughout. This happens to everyone.
Fix: Randomise the order you score categories. Or do a quick "high/medium/low" pass on all categories first, then assign specific numbers.
Your mood affects your scores. End of day? After a frustrating meeting? Hungry? Your scores will be lower. Fatigue makes everyone harsher. (I scored 50 conversations one afternoon and my averages were a full point lower than the morning's. Took me a while to figure out why.)
Fix: Batch your QA work when you're in a neutral state. If you notice you're irritated, take a break before scoring. Never score right after a difficult conversation of your own.
You'll score the conversation you would have had. You'll think "I would have offered a refund here" and penalise the agent for not doing what you imagined. But your approach isn't the only valid one.
Fix: Ask "Did the agent's approach work?" A different approach isn't wrong if it worked.
One bad moment will colour everything. If an agent makes a clumsy statement in message 3, you'll read messages 4-10 looking for more problems.
Fix: Score each category based only on what's relevant to that category. A tone slip doesn't make their product knowledge wrong. Compartmentalise.
The "3" Problem
This will undermine your entire program if you ignore it.
On a 1-5 scale, 3 should mean "met expectations." Baseline. Acceptable. Solid.
But most teams accidentally create a culture where 3 feels like failure. Agents get anxious about 3s. Managers treat 3s as problems to discuss. This happens because school taught us 60% is barely passing, performance reviews inflate scores to 4s and 5s as "normal," and we only discuss low scores, never adequate ones.
The result? Reviewers inflate scores to avoid giving "bad" 3s. Agents don't trust their scores. The rubric becomes meaningless. In my experience, the 3 problem kills more programs than anything else.
Fix: Address this explicitly with your team. Say the words: "A 3 means you did your job. It's not a problem. We're looking for patterns of 2s, and celebrating 5s. A bunch of 3s means you're solid." Then say it again next week. The first time you give someone a 3 in person, you'll feel bad about it. That's normal. Say the words anyway.
Putting It Into Practice
You have a rubric. You know the psychology. Now let's talk about making it work in the real world, where you have limited time, imperfect tools, and a team that may be skeptical of another QA initiative.
Sampling: Make Your 2% Count
The average team reviews 2% of conversations (Source: Zendesk QA Benchmark Report). Most teams I've talked to are closer to 1%. So if you hit 2%, you're ahead. You can't review everything. The goal is making your sample representative enough to catch patterns.
- Baseline: 4-5 conversations per agent per week, in line with what most guides recommend. This gives you ~200 data points per agent per year, enough to spot trends.
- Prioritise edge cases: Long conversations, escalations, refund requests, low CSAT. These are where process breaks down.
- Mix channels if you handle chat, email, and phone. Performance varies.
- Include random samples: 2-3 purely random tickets per week to catch blind spots.
Reality check: 5 reviews × 20 agents = 100 reviews/week. At 10-15 minutes each, that's 17-25 hours of QA time. If those numbers made you wince, you're not alone. If you don't have dedicated QA staff, you'll need to reduce sample size or use automation to surface conversations worth reviewing.
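If your helpdesk lets you export tickets, the sampling mix above is easy to script. Here's a minimal sketch, assuming each exported ticket is a dict with fields like escalated, refund_requested, message_count, and csat; those field names, the thresholds, and the function itself are placeholders to adapt, not anything your tool provides out of the box.

```python
import random

REVIEWS_PER_AGENT = 5   # weekly baseline from above
RANDOM_PER_AGENT = 2    # purely random picks to catch blind spots

def weekly_sample(agent_tickets: list[dict]) -> list[dict]:
    """Pick one agent's reviews for the week: edge cases first, then random fill."""
    # Edge cases: escalations, refund requests, long conversations, low CSAT.
    edge_cases = [
        t for t in agent_tickets
        if t.get("escalated")
        or t.get("refund_requested")
        or t.get("message_count", 0) > 10          # "long conversation" threshold is a placeholder
        or (t.get("csat") is not None and t["csat"] <= 2)
    ]
    random.shuffle(edge_cases)
    picked = edge_cases[: REVIEWS_PER_AGENT - RANDOM_PER_AGENT]

    # Fill the remaining slots with purely random tickets not already picked.
    remaining = [t for t in agent_tickets if t not in picked]
    picked += random.sample(remaining, min(len(remaining), REVIEWS_PER_AGENT - len(picked)))
    return picked

# Reality check: weekly review hours for a 20-agent team at ~12 minutes per review.
print(20 * REVIEWS_PER_AGENT * 12 / 60)  # -> 20.0 hours
```

The exact thresholds matter less than running the same pass every week; consistency is what makes your 2% representative.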
Calibration: Make Scores Mean Something
- Weekly (15 min): Quick standup to discuss edge cases. "How would you score this?" These sessions can get tense. That's actually a good sign—it means people care about getting it right.
- Monthly (1 hour): Everyone scores the same 3 conversations independently. Compare. Target 80% agreement within 1 point (one way to measure this is sketched after this list).
- Quarterly: Revisit the whole rubric. Has your product changed? Have your patterns?
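Here's one way to check the 80%-within-one-point target from the monthly session: a minimal sketch, assuming each reviewer has scored the same conversation on the same category (the reviewer names are made up).

```python
from itertools import combinations

def agreement_rate(reviewer_scores: dict[str, int]) -> float:
    """Share of reviewer pairs whose scores are within one point of each other."""
    pairs = list(combinations(reviewer_scores.values(), 2))
    within_one = sum(1 for a, b in pairs if abs(a - b) <= 1)
    return within_one / len(pairs)

# Three reviewers score Resolution Quality for the same conversation.
rate = agreement_rate({"dana": 4, "sam": 3, "priya": 5})
print(f"{rate:.0%}")  # -> 67%: the 3 and the 5 are the pair worth discussing
```

Run it per category per conversation; the pairs that land more than a point apart are your calibration agenda.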
Closing the Loop: Scores to Action
- Coaching trigger: Scores of 2 or below on any category trigger a coaching conversation within the week. Focus on one category at a time.
- Pattern recognition: One low score is noise. Three weeks of declining scores is a signal. Look for trends; a simple check for this and the coaching trigger above is sketched after this list.
- System issues: Low Accuracy across multiple agents usually indicates a KB gap, not individual failure. Fix the system.
- Recognition: Share examples of 5s. Examples work better than rules.
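If your scores live in a spreadsheet export, both triggers above take only a few lines to automate. A minimal sketch; the function names and the three-week window are our own framing of the rules above, not a standard.

```python
def coaching_categories(category_scores: dict[str, int]) -> list[str]:
    """Categories scoring 2 or below in a single review, which trigger a coaching conversation."""
    return [category for category, score in category_scores.items() if score <= 2]

def declining_trend(weekly_averages: list[float], weeks: int = 3) -> bool:
    """True if an agent's average score has dropped for `weeks` consecutive weeks."""
    recent = weekly_averages[-(weeks + 1):]
    return len(recent) == weeks + 1 and all(
        later < earlier for earlier, later in zip(recent, recent[1:])
    )

print(coaching_categories({"resolution_quality": 2, "customer_effort": 4}))  # -> ['resolution_quality']
print(declining_trend([4.1, 4.0, 3.6, 3.2]))  # three straight weekly drops -> True
```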
Your First 4 Weeks
You don't need to implement everything at once. Here's a realistic rollout that builds momentum without overwhelming your team.
Week 1: Foundation
Customise this rubric for your context. Adjust category weights if needed. Add specific examples from your own conversations. Real ones from your queue, not hypotheticals.
Score 10 conversations yourself to get a feel for the rubric. Don't share these scores; they're practice. Identify 2-3 people who will be doing regular reviews.
By Friday, you should have a rubric you believe in and know who's reviewing.
Week 2: Calibration
Hold the calibration session. Select 3 conversations that span the range, have everyone score them independently, then sit in a room and compare. You'll be tempted to skip this because it feels like overhead. Don't. Update the rubric based on the disagreements—that's where the learning happens.
Week 3: Soft Launch
Start reviewing 2-3 conversations per agent. Don't communicate scores to agents yet. Focus on whether the rubric is working: Are scores feeling accurate? What's missing?
This is where you'll find the rubric's rough edges. That's normal: you have real data now, and you're catching issues before agents ever see their scores. Address the "3 problem" with your review team this week, so it's settled before launch.
Week 4: Full Launch
Communicate the program to agents. Explain the rubric. Emphasise that 3 = good (say it more than once). Start sharing scores with written feedback. Schedule coaching conversations for any 2s.
After this, you're running. It gets easier.
After Week 4
Now it's about consistency. Weekly quick calibrations. Monthly deep calibrations. Quarterly rubric reviews. Track whether scores correlate with retention and CSAT. Adjust weights based on what you learn.
The first four weeks are the hardest. After that, QA stops feeling like a project. That's when it starts working.