Ethical Job Board Scraping: Rate Limits, robots.txt, and Staying Out of Legal Trouble
Scraping job boards sits in a legal and ethical gray zone. But "gray zone" does not mean "anything goes." If you are building a job aggregator, a personal alert system, or just a weekend script, the difference between a responsible scraper and an abusive one is a handful of configuration decisions.
Start with robots.txt—every time
Every major job board publishes a robots.txt file that declares which paths are off-limits to crawlers. Before writing a single line of scraping code, fetch and read it:
curl -s https://www.indeed.com/robots.txt
Look for Disallow directives on job listing paths. If a path is explicitly disallowed, scraping it goes against the site owner's stated wishes, frequently breaches the terms of service as well, and can lead to IP bans, legal letters, or worse. The robots.txt file is not legally binding in all jurisdictions, but it is the industry-standard signal of the site owner's intent. Respect it.
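The same check can be automated so that it runs before every fetch. The following is a minimal sketch using Python's standard-library urllib.robotparser; the bot name, the jobs.example.com URLs, and the sample rules are made up for illustration, and in a real scraper the rules text would come from the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name; use the same token you send in your User-Agent header.
USER_AGENT = "MyJobBot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; a real file lives at https://<host>/robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_parser(SAMPLE_RULES)
print(is_allowed(parser, "https://jobs.example.com/listings/123"))
print(is_allowed(parser, "https://jobs.example.com/private/drafts"))
print(parser.crawl_delay(USER_AGENT))  # delay requested by the site, if any
```

With the sample rules, the listing URL is allowed and the /private/ URL is not. If the file declares a Crawl-delay, treating it as a lower bound on your own delay is a reasonable default.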
Rate limiting is not optional
A script hammering a job board at 50 requests per second is effectively a denial-of-service attack, whether you meant it or not. Many small-to-medium job boards run on modest infrastructure, and you can take them offline by accident.
Set a minimum delay of 2 to 5 seconds between requests. If the board returns a 429 Too Many Requests status, back off exponentially. A good pattern:
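import time
import requests

HEADERS = {"User-Agent": "MyJobBot/1.0 (your-email@example.com)"}

def polite_get(url, min_delay=3, max_retries=4):
    """Fetch url after a fixed delay, backing off exponentially on HTTP 429."""
    time.sleep(min_delay)  # baseline politeness delay before every request
    resp = requests.get(url, headers=HEADERS, timeout=30)
    for attempt in range(max_retries):
        if resp.status_code != 429:
            break
        # Honor a numeric Retry-After; otherwise wait 60s, 120s, 240s, ...
        retry_after = resp.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 60 * 2 ** attempt
        time.sleep(wait)
        resp = requests.get(url, headers=HEADERS, timeout=30)
    return resp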
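Sleeping a fixed interval before every call is simple, but it also delays the first request and requests to unrelated hosts. One alternative, sketched below under the assumption that you track the time of the last request per host, is to sleep only for whatever remains of the minimum interval. The HostThrottle class and its parameters are hypothetical names for illustration, not part of any library.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep just long enough to honor min_interval; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually released, including the sleep.
        self._last_request[host] = now + delay
        return delay
```

In use, create one HostThrottle(min_interval=3.0) for the whole scraper and call throttle.wait(url) immediately before each request. The first request to a host goes out at once, and later ones wait only as long as needed.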
Set an honest User-Agent header
Spoofing a Chrome user-agent string to bypass blocks is both unethical and easy to detect. Server-side fingerprinting can identify headless browsers and generic scripts regardless of the UA string. Instead, use a transparent header that identifies your bot and includes contact information:
MyJobBot/1.0 (your-email@example.com)
Many site operators are willing to whitelist transparent bots. Deceptive ones tend to get blocked as soon as they are detected.
Cache aggressively
Every duplicate request to the same URL is wasted bandwidth for the target server and wasted compute for your script. Job listings do not change minute-to-minute. Cache responses with a sensible TTL:
- Job search result pages: cache for 15 to 30 minutes
- Individual job detail pages: cache for 6 to 12 hours
- Company profile pages: cache for 24 hours
A simple SQLite or Redis cache can eliminate most redundant requests and makes your scraper faster for its users too.
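As a rough illustration of the SQLite option, the sketch below stores each response body with an expiry time and only calls the network when the entry is missing or stale. The class name, table layout, and TTL constants are assumptions chosen for this example (the constants use the upper ends of the ranges listed above), and fetch stands in for whatever function actually downloads a page, such as a wrapper around polite_get.

```python
import sqlite3
import time

# TTLs in seconds, using the upper bounds of the ranges suggested above.
TTL_SEARCH = 30 * 60
TTL_JOB_DETAIL = 12 * 60 * 60
TTL_COMPANY = 24 * 60 * 60


class ResponseCache:
    """A tiny URL -> body cache backed by SQLite, with per-entry expiry."""

    def __init__(self, path=":memory:", clock=time.time):
        self._db = sqlite3.connect(path)
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, body TEXT NOT NULL, expires_at REAL NOT NULL)"
        )

    def get(self, url):
        """Return the cached body, or None if it is missing or expired."""
        row = self._db.execute(
            "SELECT body, expires_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None or row[1] <= self._clock():
            return None
        return row[0]

    def put(self, url, body, ttl):
        """Store body for ttl seconds, replacing any earlier entry for url."""
        self._db.execute(
            "INSERT OR REPLACE INTO cache (url, body, expires_at) VALUES (?, ?, ?)",
            (url, body, self._clock() + ttl),
        )
        self._db.commit()


def cached_fetch(cache, url, ttl, fetch):
    """Return a fresh cached body if there is one; otherwise fetch and store it."""
    body = cache.get(url)
    if body is None:
        body = fetch(url)
        cache.put(url, body, ttl)
    return body
```

Passing a file path instead of the default ":memory:" keeps the cache across runs, which is where most of the savings come from for a script that is restarted often.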
Do not scrape behind login walls
If a job board requires authentication to access listings, scraping it is almost certainly a terms-of-service violation and may fall under the Computer Fraud and Abuse Act in the United States. Publicly accessible pages are one thing. Authenticated pages are another. The legal risk escalates sharply.
Consider using official APIs first
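Before writing a scraper, check whether the board offers a public API. LinkedIn, Indeed, and Glassdoor have all run API or partner programs, some free, some paid, though access is often limited to approved partners and availability changes over time, so check each board's current developer documentation. An API call is faster, more reliable, and governed by explicit terms of use. Scraping should be a fallback, not a first resort.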
The practical checklist
Before deploying any job board scraper, run through these questions:
- Have I read and respected robots.txt?
- Am I waiting at least 2 to 5 seconds between requests and backing off on 429 responses?
- Am I using a transparent User-Agent with contact info?
- Am I caching responses to minimize duplicate requests?
- Am I only scraping publicly accessible pages?
- Did I check for an official API first?
If you can answer yes to all six, your scraper is probably on solid ground. Skip any one of them, and you are taking on unnecessary risk for yourself and unnecessary load for someone else's server.