Building a Distributed Job Scraper with Electron and Edge Functions
Web scraping at scale breaks in predictable ways: IP rate-limiting, client-side rendering walls, and geographic content differences. A single-machine script hitting 20 job boards will get blocked within hours. The solution is a distributed architecture where the heavy lifting happens across multiple execution environments.
Why Electron for the desktop layer
Electron gets a bad rap for memory usage, but for a job-scraping desktop app, it solves two hard problems elegantly. First, it bundles a full Chromium runtime, so you get Puppeteer-grade browser automation without asking users to install Chrome Canary or maintain headless binaries. Second, it provides a cross-platform UI shell where users can configure search parameters, view results, and trigger actions without touching a terminal.
The desktop client is not doing the scraping itself. It is a configuration and notification layer. The actual scraping happens elsewhere.
The distributed scraping architecture
Here is the topology that actually works at scale:
Desktop Client (Electron)
-> Config API (user preferences, search terms)
-> Edge Functions (one per job board, geographically distributed)
-> Database (deduplicated listings)
-> Notification Pipeline (desktop alerts, email, Slack)
The key insight: each job board gets its own edge function deployed near the board's primary geographic region. A function scraping Indeed US runs in us-east-1. A function scraping StepStone runs in eu-central-1. This minimizes latency and reduces the chance of geographic IP blocking.
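The board-to-region pairing can be written down as a small routing table. This is only a sketch: the board identifiers, the `BoardTarget` shape, and the `boardsByRegion` helper are hypothetical, and it assumes a hosting platform that lets you pin a function to a named region.

```typescript
// Hypothetical routing table; board names and regions are illustrative only.
type Region = "us-east-1" | "eu-central-1";

interface BoardTarget {
  board: string;  // identifier used by the scheduler
  region: Region; // where that board's scraping function is deployed
}

const BOARD_TARGETS: BoardTarget[] = [
  { board: "indeed-us", region: "us-east-1" },
  { board: "stepstone", region: "eu-central-1" },
];

// Group boards by region so each regional deployment knows its workload.
function boardsByRegion(targets: BoardTarget[]): Map<Region, string[]> {
  const grouped = new Map<Region, string[]>();
  for (const { board, region } of targets) {
    const list = grouped.get(region) ?? [];
    list.push(board);
    grouped.set(region, list);
  }
  return grouped;
}
```

Keeping this table in the Config API rather than in each function would let you move a board to another region without redeploying the scraper code.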
The Puppeteer-backed edge function
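Here is a minimal scraping function written against the standard Puppeteer API. To run it in a Cloudflare Worker or Deno Deploy environment with browser rendering support, expect to swap the launch call for the platform's own browser binding:
import puppeteer from 'puppeteer';

interface Job {
  title: string;
  company: string;
  location: string;
  url: string;
}

async function scrapeLeverJobs(company: string): Promise<Job[]> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(`https://jobs.lever.co/${company}`, {
      waitUntil: 'networkidle0',
    });
    // Runs inside the page; the selectors assume Lever's current markup.
    const rows = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.posting')).map((el) => ({
        title: el.querySelector('h5')?.textContent?.trim() ?? '',
        location: el.querySelector('.location')?.textContent?.trim() ?? '',
        // The container may not be the link itself; look on or inside it.
        url: (el.closest('a') ?? el.querySelector('a'))?.href ?? '',
      })),
    );
    // Drop incomplete rows and attach the company, which jobHash needs.
    return rows
      .filter((j) => j.title && j.url)
      .map((j) => ({ ...j, company }));
  } finally {
    await browser.close(); // release the browser even if navigation fails
  }
}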
Deduplication and change detection
Scraping on a schedule produces duplicates. Every run may return the same 80 listings plus 3 new ones. You need a dedup layer that hashes each job by a composite key of title, company, and location:
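import { createHash } from 'node:crypto';

// Composite key: title, company, location, normalized for case and whitespace.
function jobHash(job: Job): string {
  const normalized = `${job.title}|${job.company}|${job.location}`
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim();
  return createHash('sha256').update(normalized).digest('hex');
}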
Store hashes in your database. On each scrape run, compare incoming hashes against the stored set. Only insert rows for new hashes. This keeps your database lean and ensures every notification represents a genuinely new posting.
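The compare-and-insert step might look like the sketch below. The `filterNewJobs` helper is hypothetical, and the `Set` passed as `seen` stands in for whatever hash column or index your database exposes. The hashing function is repeated here only so the snippet runs on its own.

```typescript
import { createHash } from "node:crypto";

interface Job {
  title: string;
  company: string;
  location: string;
  url: string;
}

// Same composite key as above: title, company, location.
function jobHash(job: Job): string {
  const normalized = `${job.title}|${job.company}|${job.location}`
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Returns only jobs whose hash is not yet known and records them in `seen`.
// `seen` is an in-memory stand-in for the stored hash set.
function filterNewJobs(incoming: Job[], seen: Set<string>): Job[] {
  const fresh: Job[] = [];
  for (const job of incoming) {
    const hash = jobHash(job);
    if (!seen.has(hash)) {
      seen.add(hash); // also removes repeats within the same batch
      fresh.push(job);
    }
  }
  return fresh;
}
```

In a real deployment a unique constraint on the hash column would do the same job more safely, since two scrape runs can overlap in time.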
The Electron client responsibilities
The Electron app handles three things: letting users configure search terms and target companies, polling your API for new results, and firing OS-level notifications when matches appear. It does zero scraping directly. This separation means you can update the scraping layer without pushing a desktop app update.
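A single poll cycle can be sketched as follows. Everything here is an assumption made for illustration: the `JobMatch` shape, the cursor-based API, and the injected `fetchMatches` and `notify` callbacks, which keep the logic testable outside Electron.

```typescript
interface JobMatch {
  id: string;
  title: string;
  company: string;
}

// Hypothetical callbacks: one talks to your API, one raises a notification.
type FetchMatches = (since: string | null) => Promise<JobMatch[]>;
type Notify = (title: string, body: string) => void;

// Runs one poll cycle: fetch matches newer than the cursor, notify for each,
// and return the id to use as the next cursor.
async function pollOnce(
  fetchMatches: FetchMatches,
  notify: Notify,
  cursor: string | null,
): Promise<string | null> {
  const matches = await fetchMatches(cursor);
  for (const m of matches) {
    notify(`New job: ${m.title}`, m.company);
  }
  // Assumes the API returns matches oldest-first, so the last id is the newest.
  return matches.length > 0 ? matches[matches.length - 1].id : cursor;
}
```

In the Electron main process, `notify` could wrap `new Notification({ title, body }).show()`, and `pollOnce` could be called on a timer with the returned cursor fed into the next call.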
Rate-limit coordination
When you have 15 edge functions polling different boards, you need a coordination layer to prevent accidental thundering-herd problems against shared infrastructure. A simple Redis-backed semaphore works:
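import redis

r = redis.Redis()  # assumes a reachable Redis instance

class RateLimitError(Exception):
    """Raised when a board already has max_concurrent scrapes in flight."""

def acquire_scrape_slot(board: str, max_concurrent: int = 3) -> None:
    key = f"scraping:{board}"
    current = r.incr(key)
    r.expire(key, 60)  # safety net: counter clears after 60s without new acquires
    if current > max_concurrent:
        r.decr(key)  # give back the slot we just took
        raise RateLimitError(f"{board} at capacity")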
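This caps each board at three concurrent scrape operations, regardless of how many edge functions are deployed, provided every caller decrements the counter when its scrape finishes. The 60-second expiry is only a safety net: it clears the counter once a board has seen no new acquisitions for a minute, which recovers slots leaked by a function that crashed mid-scrape.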
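On the edge-function side, acquiring and releasing a slot can be wrapped around the scrape itself so the release cannot be forgotten. The sketch below is an assumption-laden illustration: `CounterStore`, `MemoryCounterStore`, and `withScrapeSlot` are hypothetical names, and the in-memory store stands in for a Redis client so the example runs without one.

```typescript
// Minimal counter interface; a Redis client's INCR and DECR would satisfy it.
interface CounterStore {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
}

// In-memory stand-in so the sketch runs without Redis.
class MemoryCounterStore implements CounterStore {
  private counts = new Map<string, number>();
  async incr(key: string): Promise<number> {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
  async decr(key: string): Promise<number> {
    const next = (this.counts.get(key) ?? 0) - 1;
    this.counts.set(key, next);
    return next;
  }
}

class RateLimitError extends Error {}

// Acquire a slot, run the task, and always release the slot afterwards.
async function withScrapeSlot<T>(
  store: CounterStore,
  board: string,
  maxConcurrent: number,
  task: () => Promise<T>,
): Promise<T> {
  const key = `scraping:${board}`;
  const current = await store.incr(key);
  if (current > maxConcurrent) {
    await store.decr(key); // give back the slot we just took
    throw new RateLimitError(`${board} at capacity`);
  }
  try {
    return await task();
  } finally {
    await store.decr(key); // release even if the scrape throws
  }
}
```

A caller that receives `RateLimitError` would typically skip this run and wait for its next scheduled invocation rather than retry in a tight loop.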
The takeaway
A desktop job scraper is not a weekend project if you want it to run reliably. But the architecture described here (Electron for UI, distributed edge functions for scraping, hash-based dedup for storage, and Redis for coordination) is designed to scale from one user to thousands without collapsing under rate limits or IP bans.