Recipe: a Claude subagent ICP scorer for inbound

A working recipe for a Claude subagent that scores inbound leads against ICP using closed-won DNA. Inputs, prompts, outputs, validation loop.

BY ROME THORNDIKE May 14, 2026

The ICP scorer is the most-shipped subagent in our GTM engagements. It reads new accounts, scores them against closed-won DNA, writes back to the CRM, and feeds downstream subagents that do contact discovery and drafting. This post is a concrete recipe: inputs, prompts, outputs, validation loop. Working code patterns from the builds we run.

The recipe assumes an ICP discovery pass has already produced a closed-won analysis. If it hasn't, run that first; the scorer is only as good as the reference data behind it. The discovery process is documented in ICP discovery from closed-won data.

The shape of the subagent

The scorer lives in a `subagents/icp-scorer/` folder inside the GTM repo. Five files matter.

SKILL.md: when to use the scorer and how it thinks about the task.
prompt.md: the full scoring prompt the agent runs per account.
rubric.md: the weighted feature list derived from closed-won DNA.
examples/: 5-10 worked examples of scored accounts with rationale.
output-schema.json: the structured output the scorer emits.

The root CLAUDE.md at the repo level provides the ICP definition, voice rules, and the data contract. The scorer's files specialize that context for the scoring task. A new engineer or RevOps lead can read the five files and understand exactly what the scorer does and why.
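A small pre-flight check can confirm the folder is complete before a run. This is a minimal sketch: the function name `missing_files` is hypothetical, and the five names are taken from the list above.

```python
from pathlib import Path

# The five entries listed above; "examples" is a directory.
REQUIRED = ["SKILL.md", "prompt.md", "rubric.md", "examples", "output-schema.json"]


def missing_files(scorer_dir):
    """Return the required names that are absent from the scorer folder."""
    root = Path(scorer_dir)
    return [name for name in REQUIRED if not (root / name).exists()]


if __name__ == "__main__":
    # Path from the post; prints an empty list when everything is in place.
    print(missing_files("subagents/icp-scorer"))
```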

Inputs

The scorer reads four things per run.

The account record: pulled from the CRM via an MCP server or a pre-staged warehouse table. Name, domain, industry, employee count, revenue band, geography, tech stack (if known), recent activities (if any), open opportunities (if any).

The signals table: the output of the signal capture stage. Hiring events, funding events, leadership changes, tech moves. Joined to the account by domain or org ID. Only signals from the last 60 days come in; older signals are filtered out upstream.
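The 60-day window and the join by domain could be sketched as follows. The field names (`domain`, `date`, `type`) and the function name are assumptions for illustration; in the recipe the window filter runs upstream of the scorer.

```python
from datetime import date, timedelta

SIGNAL_WINDOW_DAYS = 60  # the window stated above


def recent_signals_for(account, signals, today):
    """Return this account's signals dated within the last 60 days."""
    cutoff = today - timedelta(days=SIGNAL_WINDOW_DAYS)
    domain = account["domain"].lower()  # case-insensitive join on domain
    return [
        s for s in signals
        if s["domain"].lower() == domain and s["date"] >= cutoff
    ]


account = {"account_id": "001", "domain": "Example.com"}
signals = [
    {"domain": "example.com", "type": "funding", "date": date(2026, 5, 1)},
    {"domain": "example.com", "type": "hiring", "date": date(2026, 1, 10)},
    {"domain": "other.io", "type": "hiring", "date": date(2026, 5, 2)},
]
matched = recent_signals_for(account, signals, today=date(2026, 5, 14))
print([s["type"] for s in matched])  # ['funding']
```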

The rubric: the weighted feature list in rubric.md. Each feature has a weight, a definition, and one or two example values. The rubric is the result of the ICP discovery pass; updating it is a deliberate act, not something the scorer changes on its own.
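Because rubric edits are deliberate, a lint step that checks each feature for a weight, a definition, and one or two example values can catch a half-finished edit. A minimal sketch, assuming the rubric has been parsed into a list of dicts; that shape and the key names are hypothetical, since rubric.md itself is prose.

```python
def lint_rubric(features):
    """Return a list of problems found in the parsed rubric features."""
    problems = []
    for f in features:
        name = f.get("name", "<unnamed>")
        weight = f.get("weight")
        is_number = isinstance(weight, (int, float)) and not isinstance(weight, bool)
        if not is_number or weight <= 0:
            problems.append(f"{name}: weight must be a positive number")
        if not f.get("definition"):
            problems.append(f"{name}: missing definition")
        if not 1 <= len(f.get("examples", [])) <= 2:
            problems.append(f"{name}: expected one or two example values")
    return problems


features = [
    {"name": "industry_fit", "weight": 0.3,
     "definition": "Industry matches a closed-won cluster",
     "examples": ["fintech", "healthtech"]},
    {"name": "recent_funding", "weight": 0, "definition": "", "examples": []},
]
problems = lint_rubric(features)
for p in problems:
    print(p)
```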

The examples: five to ten worked examples covering high-score, medium-score, and low-score accounts. Each example shows the account input, the score, the top three rationale points, and a one-line summary of why the score is what it is. The examples calibrate the model on the team's actual ICP.

The prompt

A working scorer prompt has four sections in order.

Section one: role and task. "You score B2B accounts against this team's ICP. Your output is a structured record with score, rationale, and signal references. You do not draft messages. You do not invent triggers." Clear scope. No drift.

Section two: the rubric. The full feature list with weights. The agent reads the rubric on every run. We tested injecting the rubric text directly into the prompt versus referencing the file; injecting performs measurably better for scoring tasks.

Section three: the few-shot examples. The five to ten worked accounts with their scores and rationale. The examples teach the model the voice of the rationale (terse, signal-anchored, no fluff) and the calibration of the scores (don't drift above 8 of 10, don't bunch in the middle).

Section four: the account and signals. The agent reads the account record, joins it against the signals table, and produces the scored output. The model is now constrained by the rubric, calibrated by the examples, and operating on the account at hand.
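The four sections can be assembled in order with a few lines of code. This is a sketch under assumptions: the `##` section headings, the function name `build_prompt`, and the JSON rendering of the account are illustrative choices, not the contents of the actual prompt.md. The role text is quoted from section one.

```python
import json

ROLE = (
    "You score B2B accounts against this team's ICP. Your output is a "
    "structured record with score, rationale, and signal references. "
    "You do not draft messages. You do not invent triggers."
)


def build_prompt(rubric_md, examples, account, signals):
    """Assemble role, rubric, examples, then account and signals, in that order."""
    payload = json.dumps({"account": account, "signals": signals},
                         indent=2, sort_keys=True)
    sections = [
        ("Role and task", ROLE),
        ("Rubric", rubric_md.strip()),  # injected in full, not referenced
        ("Examples", "\n\n".join(e.strip() for e in examples)),
        ("Account and signals", payload),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


prompt = build_prompt(
    rubric_md="industry_fit (weight 0.3): industry matches a closed-won cluster",
    examples=["Account: Acme. Score: 8. Why: funding event inside the window."],
    account={"account_id": "001", "domain": "example.com"},
    signals=[{"type": "funding", "date": "2026-05-01", "source": "press"}],
)
print(prompt)
```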

The output schema

Every score is a structured JSON record with a fixed schema. Loose schemas produce loose downstream behavior; the contract is non-negotiable.

account_id: stable identifier from the CRM.
score: integer from 1 to 10.
tier: enum from "kill", "watch", "qualified", "priority".
top_signals: array of up to three signal references with date and source.
top_rationale: array of up to three short rationale strings, each tied to a feature in the rubric.
confidence: enum from "low", "medium", "high" based on data completeness.
asset_recommendation: optional reference to an asset from the library that fits the trigger.

The downstream subagents (contact discovery, drafting) read this record and do their work. The CRM write-back step takes the score, tier, top three rationale points, and asset recommendation and updates the account record. The full JSON is archived in a `/runs/` folder for audit.
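Enforcing the contract means rejecting any record that strays from the fields above before it reaches the CRM. A minimal standard-library sketch follows; a real build would more likely validate against output-schema.json, and treating `account_id` as a string is an assumption. The optional `asset_recommendation` field is not checked.

```python
TIERS = {"kill", "watch", "qualified", "priority"}
CONFIDENCE = {"low", "medium", "high"}


def validate_record(rec):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    account_id = rec.get("account_id")
    if not isinstance(account_id, str) or not account_id:
        errors.append("account_id must be a non-empty string")
    score = rec.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        errors.append("score must be an integer from 1 to 10")
    tier = rec.get("tier")
    if not isinstance(tier, str) or tier not in TIERS:
        errors.append("tier must be one of kill, watch, qualified, priority")
    confidence = rec.get("confidence")
    if not isinstance(confidence, str) or confidence not in CONFIDENCE:
        errors.append("confidence must be one of low, medium, high")
    for key in ("top_signals", "top_rationale"):
        value = rec.get(key)
        if not isinstance(value, list) or len(value) > 3:
            errors.append(f"{key} must be an array of up to three items")
    return errors


good = {
    "account_id": "001A",
    "score": 8,
    "tier": "priority",
    "top_signals": [{"type": "funding", "date": "2026-05-01", "source": "press"}],
    "top_rationale": ["Funding event inside the signal window"],
    "confidence": "high",
}
bad = dict(good, score=11, tier="hot")
print(validate_record(good))  # []
print(validate_record(bad))
```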

The validation loop

A scorer is wrong about something every week. The validation loop catches the drift before it compounds.

Every Monday, the team pulls two cohorts: the 10 highest-scored accounts from the last 90 days that didn't convert to opportunities (the false positives), and the 10 lowest-scored accounts from the same window that did convert (the false negatives).
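Pulling the two cohorts is a pair of sorts. The sketch below assumes the input is already limited to the 90-day window and that each row carries a boolean `converted` flag; both the field names and the function name are hypothetical.

```python
def validation_cohorts(scored, n=10):
    """Return (false_positives, false_negatives) for the weekly review."""
    misses = [a for a in scored if not a["converted"]]
    wins = [a for a in scored if a["converted"]]
    # Highest-scored accounts that did not convert.
    false_positives = sorted(misses, key=lambda a: a["score"], reverse=True)[:n]
    # Lowest-scored accounts that did convert.
    false_negatives = sorted(wins, key=lambda a: a["score"])[:n]
    return false_positives, false_negatives


scored = [
    {"account_id": "a", "score": 9, "converted": False},
    {"account_id": "b", "score": 3, "converted": True},
    {"account_id": "c", "score": 8, "converted": True},
    {"account_id": "d", "score": 7, "converted": False},
    {"account_id": "e", "score": 2, "converted": False},
    {"account_id": "f", "score": 5, "converted": True},
]
fp, fn = validation_cohorts(scored, n=2)
print([a["account_id"] for a in fp])  # ['a', 'd']
print([a["account_id"] for a in fn])  # ['b', 'f']
```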

For each account, the team reads the score, the rationale, and the actual outcome. They look for patterns. Are the false positives clustered in one industry that the rubric over-weights? Are the false negatives clustered around a signal type the scorer is missing? Are the rationale strings systematically wrong on a specific feature?

The patterns drive updates. The rubric weights change. The examples get refreshed. The CLAUDE.md ICP definition gets sharpened. Most weeks, two or three small changes. Over a quarter, the scorer's precision (qualified rate among scored accounts) typically rises from 25-35% at launch to 55-70% with regular tuning.

Common failure modes

Three failures show up in early builds.

Score bunching. The scorer outputs 6s and 7s for almost every account. The cause is usually a weak rubric (features that don't discriminate) or examples that don't cover the score range. The fix is to update the examples to include real 3-score and real 9-score accounts and to remove low-discrimination features from the rubric.
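Bunching is easy to spot from a histogram of a run's scores. A minimal sketch, with a hypothetical function name; what share of 6s and 7s counts as "bunched" is a team judgment call, so no threshold is set here.

```python
from collections import Counter


def bunching_report(scores, band=(6, 7)):
    """Summarize how much of a run's scores fall inside the given band."""
    counts = Counter(scores)
    in_band = sum(counts[s] for s in range(band[0], band[1] + 1))
    share = in_band / len(scores) if scores else 0.0
    return {
        "band_share": share,
        "distinct_scores": len(counts),
        "histogram": dict(sorted(counts.items())),
    }


report = bunching_report([6, 7, 6, 7, 7, 6, 5, 7, 6, 8])
print(report["band_share"])  # 0.8
print(report["histogram"])   # {5: 1, 6: 4, 7: 4, 8: 1}
```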

Signal blindness. The scorer ignores the signals table and scores on firmographics alone. The cause is usually a prompt that mentions signals as supplementary rather than as a required input. The fix is to restructure the prompt so signals are read before firmographics and rationale must reference signals when they exist.

Rationale that doesn't match the score. The scorer says 8 of 10 and gives rationale that reads like a 5. The cause is usually the model defaulting to optimistic rationale framing. The fix is to add a calibration line to the prompt: "If your rationale doesn't include at least one strong positive signal, the score cannot exceed 6." Calibration constraints in the prompt are the cheapest fix and the highest yield.
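The same calibration rule can be checked after the fact on the structured output. This sketch uses an empty `top_signals` array as a rough proxy for "no strong positive signal", which is an assumption; judging whether a signal is actually strong still takes a human or a second pass.

```python
SCORE_CAP_WITHOUT_SIGNAL = 6  # the cap from the calibration line above


def calibration_violations(records):
    """Return account_ids scored above the cap with no signal references."""
    return [
        r["account_id"]
        for r in records
        if r["score"] > SCORE_CAP_WITHOUT_SIGNAL and not r.get("top_signals")
    ]


records = [
    {"account_id": "a", "score": 8, "top_signals": []},
    {"account_id": "b", "score": 8, "top_signals": [{"type": "funding"}]},
    {"account_id": "c", "score": 6, "top_signals": []},
]
print(calibration_violations(records))  # ['a']
```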

How it ships in production

A typical deployment runs as a nightly job on the team's repo or scheduler. The job reads new accounts and signals from the warehouse, calls the scorer subagent, parses the structured output, and writes back to the CRM.

The shell command is roughly: `claude run icp-scorer --input data/new-accounts.json --output runs/$(date +%Y-%m-%d).json --update-crm`. The first run scores the entire account universe (usually 5,000-50,000 accounts). Subsequent runs score incremental additions plus a weekly re-scoring of the whole universe to capture signal changes.
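The choice between an incremental run and a full re-score could look like the sketch below. The Sunday re-score day, the function name, and the `account_id` key are assumptions for illustration; the post only says the full re-score is weekly.

```python
from datetime import date

FULL_RESCORE_WEEKDAY = 6  # hypothetical choice: Sunday (Monday is 0)


def accounts_to_score(universe, scored_ids, today):
    """Pick the accounts for tonight's run."""
    first_run = not scored_ids
    if first_run or today.weekday() == FULL_RESCORE_WEEKDAY:
        return list(universe)  # whole universe
    # Incremental: only accounts that have never been scored.
    return [a for a in universe if a["account_id"] not in scored_ids]


universe = [{"account_id": "a"}, {"account_id": "b"}, {"account_id": "c"}]
thursday = accounts_to_score(universe, {"a", "b"}, date(2026, 5, 14))
sunday = accounts_to_score(universe, {"a", "b"}, date(2026, 5, 17))
print(len(thursday), len(sunday))  # 1 3
```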

The total cost runs $30-120 per nightly run depending on universe size and signal density. Annualized over 365 nightly runs, that's roughly $11K-44K for the scorer infrastructure, compared to $150-300K for a BDR doing comparable account research at lower coverage and lower quality.
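The annualized figure is the per-run cost multiplied by the number of nightly runs in a year. A quick check of the arithmetic, assuming one run every night:

```python
def annualized(per_run_low, per_run_high, runs_per_year=365):
    """Scale a per-run cost range to a yearly range."""
    return per_run_low * runs_per_year, per_run_high * runs_per_year


low, high = annualized(30, 120)
print(low, high)  # 10950 43800
```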

What ships with the recipe

When we deploy this recipe as part of a GTM engagement, the team receives the full repo: SKILL.md, prompt.md, rubric.md, examples, output schema, and a working CRM write-back integration. The CLAUDE.md is tuned to the team's ICP and voice. A documented validation loop runs every Monday for the first quarter.

The team owns the repo on day one. The rubric, the examples, and the prompt are theirs to edit. We hand off, we don't lock in. The scorer keeps shipping value as the team's ICP sharpens, which is the point.

Where the recipe stops being a recipe

The recipe above is the working baseline. The teams that get the most out of it extend it in three ways within the first six months.

First, they add a segmentation layer. Instead of one rubric, the scorer reads three or four rubrics, one per ICP segment, and decides which rubric applies before scoring. The segmentation logic is a thin upstream subagent that reads the account and emits a segment label. The scorer then loads the matching rubric.
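The routing step between the segment label and the rubric can be a plain lookup. The segment names and file names below are hypothetical. Failing loudly on an unknown label is a design choice in this sketch, so a new segment cannot silently fall back to the wrong rubric.

```python
# Hypothetical segment labels and rubric files, one per ICP segment.
RUBRICS = {
    "smb": "rubric-smb.md",
    "mid-market": "rubric-mid-market.md",
    "enterprise": "rubric-enterprise.md",
}


def rubric_for(segment):
    """Map the upstream subagent's segment label to a rubric file name."""
    try:
        return RUBRICS[segment]
    except KeyError:
        raise ValueError(f"no rubric for segment {segment!r}") from None


print(rubric_for("mid-market"))  # rubric-mid-market.md
```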

Second, they wire the scorer's output to the dashboard layer. Lead score by week, tier mix by source, false positive rate by industry. The dashboard is what makes the scorer's work visible to the CRO. Without the dashboard, the scorer is invisible and the team forgets it's running.

Third, they tighten the validation cadence. Monday becomes daily for the priority tier. Reps flag false positives in real time. The flags feed a separate review queue that triggers a small CLAUDE.md update once a week. The faster the feedback, the faster the scorer converges on what real ICP looks like for this team.

For the broader pipeline this scorer feeds into, see the outbound data pipeline post. For the ICP discovery process that produces the rubric, see ICP discovery from closed-won data. For the CLAUDE.md that sits above the scorer, see how to write a CLAUDE.md for your revenue team. We publish the ICP rebuild worksheet as a downloadable companion to this recipe.

Questions.

What's closed-won DNA?

The structured profile of accounts that have closed-won. Firmographic features (industry, size, geography), technographic features (stack), behavioral features (how they bought, how long it took, what triggered the deal), and revenue features (ACV, expansion rate, churn). The scorer uses closed-won DNA as the reference set, not the marketing team's aspirational ICP doc.

How many closed-won accounts do we need for this to work?

Minimum 20-30 deals. Below that, the scorer overfits to a handful of cases. Above 100, the signal is strong enough to support multi-segment scoring. Most Series A through C companies we work with have 30-100 closed-won deals, which is the sweet spot for this recipe.

Does the scorer write back to the CRM?

Yes, for the accounts that score above threshold. Score, top-three rationale, and signal references get written to a custom field on the account record. Reps can sort, filter, and report on the score. The write-back is the integration point that makes the scorer useful day-to-day.

How often should we re-run the scorer?

Weekly for full re-scoring of the account universe. Daily for incremental scoring of new accounts (new logos pulled from the warehouse). The weekly run captures signal changes (new triggers); the daily run keeps the account universe fresh. Both runs are cheap enough to run more often if the team wants.

What's the validation loop?

Each week, the team reviews the 10 highest-scored accounts that didn't convert and the 10 lowest-scored accounts that did. The patterns expose where the scorer is over-weighting or under-weighting features. CLAUDE.md updates follow. Over a quarter, the scorer's precision climbs and the team's understanding of ICP sharpens together.

ABOUT THE AUTHOR

Rome Thorndike is the founder of subagent/gtm, where he builds AI agents that analyze GTM data for CROs and RevOps leaders. His career spans 15+ years in B2B sales and revenue leadership: enterprise sales at Salesforce and Microsoft, helping scale Sequoia-backed Snapdocs from Series A through Series D, and leading sales at Datajoy through its acquisition by Databricks. Rome holds an MBA from UC Berkeley's Haas School of Business, where he studied statistical analysis and machine learning. He's based in the San Francisco Bay Area and publishes The AI News Digest, a top-ranked technology newsletter.
