A B2B services company with 16,000+ HubSpot contacts and a multi-million-dollar revenue target needed its sales pipeline automated. The CEO was manually reviewing contacts, drafting outreach emails, and managing follow-ups across multiple reps. The ask: build an AI system that generates a prioritized task queue every morning that the CEO can review and approve in under 20 minutes.
Nothing sends without human approval. This was a non-negotiable constraint that shaped the entire architecture. Every outreach email, every follow-up, every cold intro sits in a queue until a human says go. The system recommends. The human decides.
OpenClaw (a self-hosted AI agent framework) serves as the brain. Python CLI tools serve as the hands. The pattern: OpenClaw thinks and plans, Python tools execute. The tools read HubSpot, score contacts, draft emails, and queue tasks in Airtable.
The execution surface is Airtable. The CEO controls everything through table views and status fields. Adding a new sales rep or changing ICP criteria requires zero code changes. n8n (a self-hosted workflow automation platform) polls Airtable every 60 seconds and sends approved emails from the correct inbox.
Config-driven architecture: every business rule lives in Airtable rows, not code.
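A hedged sketch of what the config-driven pattern looks like in code. The field names ("Rule", "Value", "Enabled") and the example rules are illustrative, not the client's actual schema; the point is that runtime behavior comes from rows, not constants.

```python
def load_rules(records: list[dict]) -> dict:
    """Collapse Airtable-style records into a {rule_name: value} config.

    Records mirror the Airtable API shape: each has a "fields" dict.
    Disabled rows are skipped, so toggling a rule is a checkbox, not a deploy.
    """
    rules = {}
    for rec in records:
        fields = rec.get("fields", {})
        if fields.get("Enabled", False):
            rules[fields["Rule"]] = fields["Value"]
    return rules

# Adding a rep or changing ICP criteria means editing rows, not code:
records = [
    {"fields": {"Rule": "daily_send_cap", "Value": 25, "Enabled": True}},
    {"fields": {"Rule": "min_deal_value", "Value": 5000, "Enabled": True}},
    {"fields": {"Rule": "legacy_rule", "Value": 1, "Enabled": False}},
]
print(load_rules(records))  # {'daily_send_cap': 25, 'min_deal_value': 5000}
```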
All 5 agents run in a single Docker container with shared tool libraries. Each agent has a distinct role and trigger schedule.
| Agent | Role | Trigger |
|---|---|---|
| Sales Agent | Morning scan, contact scoring 0-100, email drafting, task generation | 6 AM daily + midday |
| Reply Router | Classifies inbound sales emails into 8 types, drafts contextual replies | On new email |
| Deal Mover | Cadence-based follow-ups, stage velocity tracking, flags stalling deals | Hourly heartbeat |
| Sales Manager | Revenue vs target by service line, rep scorecards, weighted escalation | 7 AM daily Slack report |
| Lead Generator | ICP scoring, lead enrichment, 100+ leads/day for cold callers | Daily batch |
The system uses a dual-persona model. An AI sales assistant persona handles routine outreach; a second persona matches the CEO's writing voice for VIP and C-level contacts. Routing is automatic, based on contact tier and deal value.
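The routing rule is simple enough to sketch. The tier labels, threshold, and persona names below are assumptions for illustration; the source only says routing keys on contact tier and deal value.

```python
def pick_persona(tier: str, deal_value: float,
                 vip_threshold: float = 50_000) -> str:
    """Route VIP/C-level contacts (or big deals) to the CEO-voice persona;
    everyone else gets the assistant persona. Threshold is illustrative."""
    if tier in {"VIP", "C-level"} or deal_value >= vip_threshold:
        return "ceo_voice"
    return "assistant"

print(pick_persona("C-level", 10_000))   # ceo_voice
print(pick_persona("standard", 2_000))   # assistant
```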
Production systems reveal bugs that tests and staging never surface. These are the real failures from this build.
Three of five scoring components returned hardcoded or default values because the HubSpot fields they read were never populated. The priority queue looked functional: contacts had scores, tasks were ranked. But the rankings were essentially random. Found and fixed by auditing every scoring function against real CRM data.
The opt-out check was case-sensitive, but HubSpot stores opt-out flags inconsistently ("YES", "yes", "True"), so only one variant matched. Contacts who had opted out could have received email. Fixed with a normalized comparison plus 13 regression tests.
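The fix is a one-line normalization. A minimal sketch, assuming the flag arrives as a string or boolean; the exact set of truthy variants here is illustrative.

```python
TRUTHY = {"yes", "true", "y", "1"}

def has_opted_out(raw_flag) -> bool:
    """Normalize HubSpot's inconsistent opt-out values before comparing.

    "YES", "yes", "True", and boolean True all mean opted out.
    None (field never set) means no opt-out on record.
    """
    if raw_flag is None:
        return False
    return str(raw_flag).strip().lower() in TRUTHY

# Every variant now blocks outreach:
assert all(has_opted_out(v) for v in ["YES", "yes", "True", True, " y "])
assert not has_opted_out("no") and not has_opted_out(None)
```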
Contact names with apostrophes (O'Brien) or quotes broke Airtable filter formulas, or, worse, could inject formula logic. Found across 16 formula-building call sites in 7 files. Fixed with a shared escape utility.
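A sketch of what such a shared escape utility can look like, not the project's actual code. It assumes values are interpolated into double-quoted Airtable formula string literals, so escaping backslashes and double quotes makes an apostrophe harmless; verify against Airtable's formula grammar before relying on it.

```python
def escape_formula_value(value: str) -> str:
    """Escape a user-supplied string for interpolation into an Airtable
    filterByFormula double-quoted string literal.

    Backslashes and double quotes are escaped so the value cannot break
    out of the literal; a bare apostrophe (O'Brien) is then safe.
    """
    return value.replace("\\", "\\\\").replace('"', '\\"')

def name_filter(name: str) -> str:
    """Build a filter like {Name} = "O'Brien" with the value escaped."""
    return '{Name} = "' + escape_formula_value(name) + '"'

print(name_filter("O'Brien"))  # {Name} = "O'Brien"
```

The key design point is that every one of the 16 call sites goes through the one utility, so a future escaping bug gets fixed in exactly one place.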
The code assumed specific Airtable field names; the client's actual base used different ones. The result: 28 failing tests and a systematic remap across 4 core modules.
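One common way to make that remap a one-time cost is to route every field access through a single mapping, so the next rename touches config rather than 4 modules. The logical and column names below are invented for illustration.

```python
# Logical name used in code -> actual Airtable column name (illustrative).
FIELD_MAP = {
    "status": "Task Status",
    "assignee": "Sales Rep",
    "due": "Follow-Up Date",
}

def get_field(record_fields: dict, logical_name: str):
    """Read a field by its logical name; schema drift only changes FIELD_MAP."""
    return record_fields.get(FIELD_MAP[logical_name])

row = {"Task Status": "Approved", "Sales Rep": "Dana", "Follow-Up Date": "2024-05-01"}
print(get_field(row, "status"))  # Approved
```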
Kill switch: any agent can be paused independently. A master pause stops all outbound in under 60 seconds.
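A hedged sketch of how such a gate might look; the switch names are invented. The master pause is checked first, and because the switch state would be re-read from Airtable on each poll cycle, shutdown latency is bounded by the 60-second poll interval.

```python
def is_send_allowed(agent: str, switches: dict) -> bool:
    """Return False if the master pause or this agent's own pause is set.

    `switches` represents state re-read from Airtable each poll cycle,
    which is what bounds a full stop to the polling interval.
    """
    if switches.get("master_pause", False):
        return False
    return not switches.get(f"pause_{agent}", False)

switches = {"master_pause": False, "pause_reply_router": True}
print(is_send_allowed("sales_agent", switches))   # True
print(is_send_allowed("reply_router", switches))  # False
switches["master_pause"] = True
print(is_send_allowed("sales_agent", switches))   # False
```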
AI sales agents connect to HubSpot via the Private App API to read contacts, deals, and activity history, then generate prioritized outreach tasks. In this system, the agent scores every contact 0-100 based on deal value, recency, and ICP match, then drafts personalized emails queued for human approval.
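The shape of such a scoring function can be sketched as a weighted blend of the three signals. The weights, the 90-day recency decay, and the deal-value cap below are assumptions for illustration, not the production values.

```python
def score_contact(deal_value: float, days_since_touch: int,
                  icp_match: float, max_deal: float = 100_000) -> int:
    """Weighted 0-100 score from deal value, recency, and ICP fit (0-1).

    Weights (40/30/30) and the 90-day decay window are illustrative.
    """
    value_score = min(deal_value / max_deal, 1.0)          # clamp to 0-1
    recency_score = max(0.0, 1.0 - days_since_touch / 90)  # decays over 90 days
    score = 40 * value_score + 30 * recency_score + 30 * icp_match
    return round(score)

print(score_contact(deal_value=50_000, days_since_touch=9, icp_match=0.8))  # 71
```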
Yes, but with guardrails. This system requires human approval for all outreach except pure scheduling emails, which can auto-send if they pass 5 safety checks: scheduling keyword detected, under daily cap, within business hours, non-VIP contact, no escalation flag.
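The five checks compose naturally into a single gate where any failure falls back to the human queue. A minimal sketch: the keyword list, daily cap, and business-hours window are illustrative defaults, not the production values.

```python
from datetime import datetime

def can_auto_send(email: dict, sent_today: int, now: datetime,
                  daily_cap: int = 20) -> bool:
    """All five checks must pass before a scheduling email skips approval.

    Any failure means the email waits in the queue for a human.
    """
    checks = [
        any(k in email["body"].lower()
            for k in ("schedule", "calendar", "reschedule")),  # scheduling keyword
        sent_today < daily_cap,                                # under daily cap
        9 <= now.hour < 17,                                    # business hours
        not email.get("is_vip", False),                        # non-VIP contact
        not email.get("escalation_flag", False),               # no escalation
    ]
    return all(checks)

email = {"body": "Happy to reschedule. Does Tuesday work?", "is_vip": False}
print(can_auto_send(email, sent_today=3, now=datetime(2024, 5, 6, 10, 30)))  # True
```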
The core single-agent system shipped in 7 days. Expanding to 5 agents with reply routing, deal management, sales reporting, and lead generation took an additional 2 weeks. Total: 23 days from first commit to full deployment.
This system has 1,360 tests across 46 test files, a 2.2:1 test-to-code line ratio. Tests cover golden-path integration, per-rep routing, scheduling autonomy edge cases, guardrails enforcement, CAN-SPAM compliance, and audit logging.
Three layers: (1) human approval required for all non-scheduling outreach, (2) CAN-SPAM opt-out checking with normalized field comparison, (3) independent kill switches per agent with a master pause that stops all outbound within 60 seconds.
OpenClaw is a self-hosted AI agent framework that runs an agentic loop: think, plan, act, observe. It serves as the reasoning engine while Python CLI tools handle execution (CRM reads, email drafting, task queuing). It runs in Docker and uses Claude as the underlying LLM.
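The think-plan-act-observe loop can be sketched generically. This is not OpenClaw's actual API; `llm` here stands in for any callable that returns either a tool call or a final answer, and the dict protocol is invented for illustration.

```python
def agent_loop(llm, tools: dict, goal: str, max_steps: int = 10):
    """Minimal agentic loop: think/plan (LLM call), act (tool), observe.

    `llm(goal, observations)` returns {"tool": name, "args": {...}} to act,
    or {"done": result} to finish. Returns None if max_steps is exhausted.
    """
    observations = []
    for _ in range(max_steps):
        decision = llm(goal, observations)                    # think + plan
        if "done" in decision:
            return decision["done"]
        result = tools[decision["tool"]](**decision["args"])  # act
        observations.append(result)                           # observe
    return None

# Stub LLM: fetch once, then sum what it saw.
def fake_llm(goal, obs):
    return {"tool": "fetch", "args": {"n": 3}} if not obs else {"done": sum(obs[0])}

tools = {"fetch": lambda n: list(range(n))}
print(agent_loop(fake_llm, tools, "sum the numbers"))  # 3
```

The cap on `max_steps` matters in production: it keeps a confused agent from looping on tool calls indefinitely.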