Data Automation · 2024
LeadHunter (archived)
Automated B2B lead identification
The problem
B2B prospecting at scale is a time sink. Sales teams spend weeks manually researching targets across directories, LinkedIn, and company databases — qualifying leads one at a time.
The solution
Built a Selenium-driven ETL pipeline that automated prospecting across 18,000 B2B targets. The system crawls multiple data sources, normalizes company profiles, scores leads against qualification criteria, and feeds results into a structured database for sales outreach.
System architecture
How it's wired
Selenium ETL with a Redis-backed job queue, retry worker, and dead-letter handling. In the architecture diagram, arrows show data-flow direction; coral arrows mark failure paths.
Engineering rationale
Technical decisions
Why this stack, what the trade-offs were.
Why a Redis queue instead of running Selenium synchronously?
Two reasons. First, a hung Selenium session on a complex page can take 60+ seconds to detect; without a queue, that hang blocks the entire pipeline. The queue lets a watchdog kill stuck jobs and reassign them. Second, it decouples ingestion speed from validation speed: sources can run at their own pace while the scoring engine drains the queue at its own.
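The watchdog-plus-queue pattern can be sketched in plain Python. The real system uses Redis and kills the stuck browser process outright; this in-memory sketch (with a stub `scrape` in place of a Selenium worker) just abandons the hung future and dead-letters the job:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as JobTimeout
from queue import Queue

def scrape(url: str) -> str:
    # Stand-in for a Selenium worker driving a headless browser session.
    return f"profile:{url}"

def drain(jobs: Queue, dead_letter: list, timeout_s: float = 60) -> list:
    """Pop jobs and run each under a hard timeout.

    In production the watchdog kills the stuck browser process and the job
    lands in a Redis dead-letter queue; here a hung job is simply abandoned
    and appended to an in-memory dead-letter list for the retry worker.
    """
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        while not jobs.empty():
            url = jobs.get()
            future = pool.submit(scrape, url)
            try:
                results.append(future.result(timeout=timeout_s))
            except JobTimeout:
                dead_letter.append(url)
    return results
```

The point of the pattern is that a slow or dead job costs one worker slot for `timeout_s`, not the whole pipeline.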
Why no message queue in the original version?
Honest answer: the original was a batch job, not a user-facing service. A nightly cron run is acceptable when nobody's waiting on a UI. The queue is the 10x answer — once you have 50 sources hitting the pipeline concurrently, the synchronous model breaks. I'm showing the 10x architecture here because it's the question a hiring team will ask.
Why PostgreSQL instead of Mongo or a CSV pipeline?
Lead scoring needs indexed range queries (revenue, headcount) and joins across enriched sources. CSVs break past ~50k rows; Mongo would have worked, but Postgres gave SQL-native scoring functions and a real query planner. It also won on observability — sales ops could write their own SELECTs.
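SQL-native scoring over an indexed range filter looks roughly like this. SQLite stands in for Postgres so the snippet is self-contained; the table columns, thresholds, and weights are invented for the example, not the production scoring criteria:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (domain TEXT PRIMARY KEY, headcount INT, revenue_musd REAL);
    CREATE INDEX idx_headcount ON companies(headcount);
    INSERT INTO companies VALUES
        ('acme.com',   250,  40.0),
        ('globex.io',   12,   1.5),
        ('initech.co', 900, 120.0);
""")

# Indexed range filter plus a score computed in SQL; weights are illustrative.
rows = conn.execute("""
    SELECT domain,
           0.6 * MIN(headcount / 500.0, 1.0)
         + 0.4 * MIN(revenue_musd / 100.0, 1.0) AS score
    FROM companies
    WHERE headcount BETWEEN 50 AND 1000
    ORDER BY score DESC
""").fetchall()
```

Because the score is a plain SQL expression, sales ops can tweak the query in any SQL client without touching the pipeline — the observability point above.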
How was rate limiting handled without getting blocked?
Three layers: per-domain token bucket, randomized inter-request jitter (sampled from a beta distribution to look human), and a session-rotation pool with sticky residential proxies. Failed sessions cool down on a 24h exponential backoff.
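The three layers can be sketched as follows. All parameters (bucket rate and capacity, beta shape, backoff base) are illustrative, not the production values:

```python
import random
import time

class TokenBucket:
    """Per-domain rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def jitter_s(lo: float = 1.0, hi: float = 8.0) -> float:
    # Beta(2, 5) is right-skewed: mostly short pauses with occasional long
    # ones, which reads more like a human than uniform delays.
    return lo + (hi - lo) * random.betavariate(2, 5)

def cooldown_s(failures: int, base_s: int = 3600, cap_s: int = 86400) -> int:
    # Exponential backoff for a failed session, capped at the 24h ceiling.
    return min(cap_s, base_s * 2 ** failures)
```

A worker checks `try_acquire()` for the target domain, sleeps `jitter_s()` between requests, and on a failure parks the session for `cooldown_s(failures)`.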
Failure modes
Edge case handling
What breaks at the edges, and how the system responds.
- Selenium worker hangs on a complex page → watchdog kills the process after 60s, job goes to dead-letter, retry worker picks it up with exponential backoff
- Malformed listings (missing fields, encoded HTML) → routed to dead-letter table with raw payload for later schema patching, never silently dropped
- CAPTCHA / soft-blocks → session marked dirty, rotated out, IP cooldown starts; orchestrator scales remaining workers up to maintain throughput
- Duplicate companies across sources → fuzzy-match on normalized domain + name; canonical record keeps highest-confidence fields per column
- Scoring drift over time → weekly re-scoring batch job; deltas surfaced to the sales team so they know what changed and why
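The duplicate-company check from the list above might look like this sketch, with `url`/`name` as hypothetical field names and stdlib `difflib` standing in for whatever fuzzy matcher the pipeline actually used:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def norm_domain(url: str) -> str:
    # Normalize "https://www.acme.com/about" and "acme.com" to the same key.
    host = urlparse(url if "//" in url else f"//{url}").hostname or ""
    return host.removeprefix("www.")

def same_company(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Fuzzy duplicate check on normalized domain, falling back to name similarity."""
    if norm_domain(a["url"]) == norm_domain(b["url"]):
        return True
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold
```

When two records match, the merge step keeps the highest-confidence value per column to build the canonical record.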
Current scope
Limitations
What this system is not today — to be precise about scope.
- Single-tenant — one customer's data per deployment, not multi-tenant SaaS
- Batch, not real-time — leads are scored in nightly windows, not on submit
- Proxy budget is the practical cap on throughput; scaling past ~50k targets/day requires a real proxy spend
Scaling
What breaks first at 10x
Single Python service today. To 10x: split orchestrator from workers (Celery + Redis already shown), move scoring to a dedicated container, front Postgres with a read replica. The natural bottleneck is the Selenium worker pool — adding workers is a config change, not a rewrite.
Roadmap
What I'd build next
Implementation notes
Build details
- Selenium WebDriver orchestrates headless browser sessions across multiple data sources simultaneously
- Custom ETL pipeline normalizes heterogeneous data formats into unified company profiles
- Scoring algorithm ranks leads by fit criteria — industry, headcount, revenue signals, tech stack indicators
- PostgreSQL backend with indexed queries for fast filtering and export to CRM-ready formats
- Fault-tolerant architecture with retry logic, session rotation, and rate limiting to avoid detection
- Modular design allows new data sources to be added without modifying core pipeline logic
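The modular-source point can be illustrated with a small registry pattern; the decorator, source name, and record fields here are hypothetical, not the project's actual API:

```python
from typing import Callable, Dict, Iterator, List

SOURCES: Dict[str, Callable[[], Iterator[dict]]] = {}

def source(name: str):
    """Register a data source under `name`; the core pipeline only iterates
    the registry, so adding a source is a new module, not a pipeline change."""
    def deco(fn):
        SOURCES[name] = fn
        return fn
    return deco

@source("directory_a")  # hypothetical source name
def directory_a() -> Iterator[dict]:
    # A real source would drive Selenium; this one yields a canned record.
    yield {"name": "Acme Corp", "url": "acme.com"}

def run_pipeline() -> List[dict]:
    rows = []
    for name, fetch in SOURCES.items():
        for raw in fetch():
            # Normalization into unified company profiles happens here.
            rows.append({**raw, "source": name})
    return rows
```

Each source only has to yield raw records; everything downstream (normalization, scoring, storage) stays untouched when a source is added.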
Tech stack
Want to dig deeper?
Happy to walk through code, decisions, or design files.
Get in touch →