
Data Automation · 2024

LeadHunter (archived)

Automated B2B lead identification

18,000 · Targets processed
95% · Time reduction
$750K · Qualified orders
Weeks → Hours · Research cycle

B2B prospecting at scale is a time sink. Sales teams spend weeks manually researching targets across directories, LinkedIn, and company databases — qualifying leads one at a time.

Built a Selenium-driven ETL pipeline that automated prospecting across 18,000 B2B targets. The system crawls multiple data sources, normalizes company profiles, scores leads against qualification criteria, and feeds results into a structured database for sales outreach.
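A minimal sketch of the normalize-and-score stages described above. The field names, qualification thresholds, and weights are illustrative assumptions, not the production criteria:

```python
from dataclasses import dataclass

# Hypothetical profile shape; field names are illustrative,
# not taken from the real codebase.
@dataclass
class CompanyProfile:
    name: str
    industry: str
    headcount: int
    revenue_usd: int

def normalize(raw: dict) -> CompanyProfile:
    """Coerce a raw crawl record into a uniform company profile."""
    return CompanyProfile(
        name=raw["name"].strip(),
        industry=raw.get("industry", "unknown").lower(),
        headcount=int(raw.get("headcount", 0)),
        revenue_usd=int(raw.get("revenue_usd", 0)),
    )

def score(profile: CompanyProfile) -> int:
    """Score a lead against simple qualification criteria (illustrative weights)."""
    points = 0
    if profile.headcount >= 50:
        points += 40
    if profile.revenue_usd >= 1_000_000:
        points += 40
    if profile.industry in {"saas", "logistics"}:
        points += 20
    return points

lead = normalize({"name": " Acme Corp ", "industry": "SaaS",
                  "headcount": "120", "revenue_usd": "5000000"})
print(score(lead))  # 100
```

The real pipeline feeds these scored rows into PostgreSQL; the point here is that normalization and scoring are pure functions, which makes them trivial to retry from the queue.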

How it's wired

Selenium ETL with a Redis-backed job queue, retry worker, and dead-letter handling.

Data sources: LinkedIn, directories, company DBs, web crawl targets
Orchestration: Selenium pool of headless workers · session rotation · rate-limit guard · proxy rotation · watchdog timeout
Job queue: Redis queue (pending → running → complete / failed) · dead-letter queue for jobs hung > 60s or with malformed payloads
Validation: schema checks + scoring · type coercion · industry, size, and signal features
Storage: PostgreSQL, indexed by score, with a CRM export view
Retry: retry worker with exponential backoff
Output: sales outreach · 18k targets · $2.2M pipeline

Technical decisions

Why this stack, what the trade-offs were.

Why a Redis queue instead of running Selenium synchronously?

Two reasons. First, a hung Selenium session on a complex page can take 60+ seconds to detect; without a queue, that blocks the entire pipeline. The queue lets a watchdog kill stuck jobs and reassign them. Second, it decouples ingestion speed from validation speed: sources can run at their own pace while the scoring engine drains the queue at its own.
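The watchdog rule can be sketched with an in-memory job table. The real system keeps this state in Redis; a plain dict stands in here so the sketch runs standalone, and the job IDs are made up:

```python
HANG_TIMEOUT_S = 60  # matches the "hung > 60s" dead-letter rule

# In-memory stand-in for the Redis-backed job records.
jobs = {}

def enqueue(job_id):
    jobs[job_id] = {"state": "pending", "started_at": None}

def start(job_id, now):
    jobs[job_id].update(state="running", started_at=now)

def watchdog_sweep(now):
    """Move any job running past the timeout into the dead-letter state."""
    dead = []
    for job_id, job in jobs.items():
        if job["state"] == "running" and now - job["started_at"] > HANG_TIMEOUT_S:
            job["state"] = "dead-letter"
            dead.append(job_id)
    return dead

enqueue("target-001")
start("target-001", now=0)
print(watchdog_sweep(now=61))  # ['target-001']
```

Because the sweep only inspects timestamps, it runs independently of the workers: a hung Selenium session can't block it, which is the whole point of decoupling.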

Why no message queue in the original version?

Honest answer: the original was a batch job, not a user-facing service. A nightly cron run is acceptable when nobody's waiting on a UI. The queue is the 10x answer: once you have 50 sources hitting the pipeline concurrently, the synchronous model breaks. I'm showing the 10x architecture here because it answers the question a hiring team will ask.

Why PostgreSQL instead of Mongo or a CSV pipeline?

Lead scoring needs indexed range queries (revenue, headcount) and joins across enriched sources. CSVs break down past ~50k rows; Mongo would have worked, but Postgres gave SQL-native scoring functions and a real query planner. Postgres also won on observability: sales ops could write their own SELECTs.
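The kind of query this enables can be sketched as follows. SQLite stands in for Postgres so the example runs anywhere, and the table, columns, and thresholds are illustrative rather than the production schema:

```python
import sqlite3

# In-memory SQLite as a stand-in for the production Postgres database.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE leads (
    name TEXT, headcount INTEGER, revenue_usd INTEGER, signal_score INTEGER)""")
# Composite index backing the range filters below.
db.execute("CREATE INDEX idx_leads_range ON leads (headcount, revenue_usd)")
db.executemany(
    "INSERT INTO leads VALUES (?, ?, ?, ?)",
    [("Acme", 120, 5_000_000, 80),
     ("Tiny Co", 3, 40_000, 90),
     ("BigCorp", 9000, 2_000_000_000, 10)],
)

# The kind of SELECT sales ops could write themselves:
# indexed range filters plus an ORDER BY on the score.
rows = db.execute("""
    SELECT name FROM leads
    WHERE headcount BETWEEN 50 AND 5000
      AND revenue_usd >= 1000000
    ORDER BY signal_score DESC
""").fetchall()
print(rows)  # [('Acme',)]
```

This is exactly the query shape a CSV pipeline can't serve without a full scan, and the observability win in practice: anyone with read access can filter and rank leads.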

How was rate limiting handled without getting blocked?

Three layers: per-domain token bucket, randomized inter-request jitter (sampled from a beta distribution to look human), and a session-rotation pool with sticky residential proxies. Failed sessions cool down on a 24h exponential backoff.
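The first two layers can be sketched in a few lines. The refill rate, bucket capacity, and jitter parameters below are illustrative assumptions, not the tuned production values:

```python
import random
import time

class DomainRateLimiter:
    """Per-domain token bucket; rate and capacity are illustrative."""
    def __init__(self, rate_per_s=0.5, capacity=5):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def human_jitter(base_s=2.0, spread_s=6.0):
    """Beta-distributed delay: skewed toward short pauses with a long tail,
    rather than the uniform spacing that fingerprints a bot."""
    return base_s + random.betavariate(2, 5) * spread_s

limiter = DomainRateLimiter()
print(limiter.acquire())  # True (bucket starts full)
```

The third layer, proxy and session rotation, sits outside this sketch: each worker holds a session object that is retired onto the 24h backoff list when it starts seeing blocks.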

Edge case handling

What breaks at the edges, and how the system responds.

Limitations

What this system is not today — to be precise about scope.

What breaks first at 10x

Single Python service today. To 10x: split orchestrator from workers (Celery + Redis already shown), move scoring to a dedicated container, front Postgres with a read replica. The natural bottleneck is the Selenium worker pool — adding workers is a config change, not a rewrite.

What I'd build next

Replace the cron-driven batch with the queued architecture shown above. Add a small admin UI for sales ops to inspect dead-letter rows and trigger reprocessing without engineer involvement.

Build details

Python · Selenium · PostgreSQL · ETL

Want to dig deeper?

Happy to walk through code, decisions, or design files.

Get in touch →