LeadHunter archived

Automated B2B lead identification

18,000

Targets processed

95%

Time reduction

$750K

Qualified orders

Weeks → Hours

Research cycle

The problem

B2B prospecting at scale is a time sink. Sales teams spend weeks manually researching targets across directories, LinkedIn, and company databases — qualifying leads one at a time.

The solution

Built a Selenium-driven ETL pipeline that automated prospecting across 18,000 B2B targets. The system crawls multiple data sources, normalizes company profiles, scores leads against qualification criteria, and feeds results into a structured database for sales outreach.

Why a Redis queue instead of running Selenium synchronously?

Two reasons. First, a hung Selenium session on a complex page can take 60+ seconds to detect — without a queue that blocks the entire pipeline. The queue lets a watchdog kill stuck jobs and reassign them. Second, it decouples ingestion speed from validation speed; sources can run at their own pace while the scoring engine drains the queue at its own.

Why no message queue in the original version?

Honest answer: the original was a batch job, not a user-facing service. A nightly cron run is acceptable when nobody's waiting on a UI. The queue is the 10x answer — once you have 50 sources hitting the pipeline concurrently, the synchronous model breaks. I'm showing the 10x architecture here because it's the question a hiring team will ask.

Why PostgreSQL instead of Mongo or a CSV pipeline?

Lead scoring needs indexed range queries (revenue, headcount) and joins across enriched sources. CSVs break past ~50k rows; Mongo would have worked but Postgres gave SQL-native scoring functions and a real query planner. Postgres won on observability — sales ops could write their own SELECTs.

How was rate limiting handled without getting blocked?

Three layers: per-domain token bucket, randomized inter-request jitter (sampled from a beta distribution to look human), and a session-rotation pool with sticky residential proxies. Failed sessions cool down on a 24h exponential backoff.

Selenium worker hangs on a complex page → watchdog kills the process after 60s, job goes to dead-letter, retry worker picks it up with exponential backoff
Malformed listings (missing fields, encoded HTML) → routed to dead-letter table with raw payload for later schema patching, never silently dropped
CAPTCHA / soft-blocks → session marked dirty, rotated out, IP cooldown starts; orchestrator scales remaining workers up to maintain throughput
Duplicate companies across sources → fuzzy-match on normalized domain + name; canonical record keeps highest-confidence fields per column
Scoring drift over time → weekly re-scoring batch job; deltas surfaced to the sales team so they know what changed and why

Single-tenant — one customer's data per deployment, not multi-tenant SaaS
Batch, not real-time — leads are scored in nightly windows, not on submit
Proxy budget is the practical cap on throughput; scaling past ~50k targets/day requires a real proxy spend

Single Python service today. To 10x: split orchestrator from workers (Celery + Redis already shown), move scoring to a dedicated container, front Postgres with a read replica. The natural bottleneck is the Selenium worker pool — adding workers is a config change, not a rewrite.

Replace the cron-driven batch with the queued architecture shown above. Add a small admin UI for sales ops to inspect dead-letter rows and trigger reprocessing without engineer involvement.

Implementation notes

Build details

Selenium WebDriver orchestrates headless browser sessions across multiple data sources simultaneously
Custom ETL pipeline normalizes heterogeneous data formats into unified company profiles
Scoring algorithm ranks leads by fit criteria — industry, headcount, revenue signals, tech stack indicators
PostgreSQL backend with indexed queries for fast filtering and export to CRM-ready formats
Fault-tolerant architecture with retry logic, session rotation, and rate limiting to avoid detection
Modular design allows new data sources to be added without modifying core pipeline logic

Tech stack

Python Selenium PostgreSQL ETL

Want to dig deeper?

Happy to walk through code, decisions, or design files.

Get in touch →

LeadHunter archived

How it's wired

Technical decisions

Why a Redis queue instead of running Selenium synchronously?

Why no message queue in the original version?

Why PostgreSQL instead of Mongo or a CSV pipeline?

How was rate limiting handled without getting blocked?

Edge case handling

Limitations

What breaks first at 10x

What I'd build next

Build details

Want to dig deeper?