By Hunter Cestone · 2026-05-11

How to Scrape Client Career Pages Without Getting Blocked

Scraping has changed. The simple "requests.get(url) + BeautifulSoup" pattern that worked in 2018 will get you IP-banned in 30 seconds in 2026. Cloudflare Turnstile, Akamai Bot Manager, and PerimeterX have made even basic scraping a real engineering project.

This is the playbook we use at placement.solutions to keep 100+ ATS adapters running cleanly across 5,000+ active scrapes per day. Steal it.

This is for legitimate scraping of public job postings on client career pages. Do not use this to scrape gated content, copyrighted material, or anything explicitly disallowed by robots.txt for a specific path.

The four anti-bot systems you will hit

If you scrape career pages in 2026, you will run into at least one of these four systems, and custom rate limiting almost everywhere else:

System	Used by	Difficulty
Cloudflare Turnstile	~30% of career pages	Medium
Akamai Bot Manager	~15% of large enterprise	Hard
PerimeterX (HUMAN)	~10% of mid-market SaaS	Hard
DataDome	~5%, growing	Hard
Custom rate limiting	Most others	Easy

Each one detects bots differently. Cloudflare uses TLS fingerprinting and JS challenges. Akamai runs server-side fingerprinting and behavioral analysis. PerimeterX combines challenges with behavioral biometrics. DataDome is increasingly behavioral.

The five rules that get you through 95% of cases

1. Use real Chromium via Playwright, not raw HTTP libraries.
2. Throttle to 1 request per 5+ seconds per host. Jitter the gaps.
3. Respect robots.txt for any path you cannot prove is a public job listing endpoint.
4. Rotate IPs only if you have to, and only with residential proxies.
5. Do not scrape past anti-bot challenges programmatically. Stop and ask the customer to whitelist your scraper.

The fifth rule is what separates mature scrapers from the ones that flame out in 2 weeks. If a target has a real anti-bot system, do not try to defeat it. Ask the customer to add your scraper IP to their allowlist or to give you API access. We have not had a customer say no to this in 18 months.
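The robots.txt rule above can be checked mechanically before a path is ever queued. A minimal sketch using Python's standard `urllib.robotparser`; the user-agent string and the sample robots.txt content are hypothetical, and in practice you would download the real file once per host and cache it:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-jobs-bot"  # hypothetical user-agent string


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt does not disallow this URL for our agent."""
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /internal/
Allow: /careers/
"""

parser = build_parser(SAMPLE)
print(may_fetch(parser, "https://example.com/careers/jobs"))
print(may_fetch(parser, "https://example.com/internal/reports"))
```

Anything that fails the check goes to a human to decide whether it really is a public job listing endpoint, rather than being scraped by default.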

Why Playwright beats raw HTTP

Raw requests or httpx calls give you a TLS fingerprint that screams "Python." Cloudflare sees this and challenges you immediately.

Playwright drives a real Chromium. The TLS fingerprint matches a real browser. The browser executes the JavaScript challenge. The browser handles cookies. You get the page like a normal user would.

Cost: Playwright is heavier (RAM, CPU, latency). Plan for 200-500MB per browser instance and 3-8 second page loads.

Whenever a page has no API you can call directly, the tradeoff is worth it.

Detecting the ATS so you do not waste browser cycles

Loading Chromium for every URL is expensive. Most career pages run on a known ATS. Detect the ATS from the URL and use the API directly when possible.

ATS	URL signal	Skip browser?
Workday	`myworkdayjobs.com`	Yes — JSON API
Greenhouse	`boards.greenhouse.io`	Yes — JSON API
Lever	`jobs.lever.co`	Yes — JSON API
SmartRecruiters	`careers.smartrecruiters.com`	Yes — JSON API
Ashby	`jobs.ashbyhq.com`	Yes — GraphQL
iCIMS	`careers-{slug}.icims.com`	No — HTML scrape
Taleo	`*.taleo.net`	No — sticky session HTML
BambooHR	`{slug}.bamboohr.com/jobs`	Yes — clean HTML
Custom / unknown	varies	No — Playwright fallback

For known ATSes with public APIs, hit the API directly. Faster, cleaner, less likely to break.
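The table above translates into a small lookup. This is a sketch only: the regular expressions are one reading of the URL signals in the table, and the function name and return shape are hypothetical.

```python
import re
from urllib.parse import urlparse

# (ATS name, hostname regex, required path prefix, skip the browser?)
# Patterns are an interpretation of the URL signals in the table above.
ATS_RULES = [
    ("workday", r"(^|\.)myworkdayjobs\.com$", "", True),
    ("greenhouse", r"^boards\.greenhouse\.io$", "", True),
    ("lever", r"^jobs\.lever\.co$", "", True),
    ("smartrecruiters", r"^careers\.smartrecruiters\.com$", "", True),
    ("ashby", r"^jobs\.ashbyhq\.com$", "", True),
    ("icims", r"^careers-[^.]+\.icims\.com$", "", False),
    ("taleo", r"\.taleo\.net$", "", False),
    ("bamboohr", r"^[^.]+\.bamboohr\.com$", "/jobs", True),
]


def detect_ats(url: str) -> tuple[str, bool]:
    """Return (ats_name, skip_browser) for a career page URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    for name, host_pattern, path_prefix, skip_browser in ATS_RULES:
        if re.search(host_pattern, host) and parsed.path.startswith(path_prefix):
            return name, skip_browser
    # Custom or unknown pages fall through to the Playwright fallback.
    return "unknown", False


print(detect_ats("https://acme.wd5.myworkdayjobs.com/en-US/External"))
print(detect_ats("https://acme.com/careers"))
```

Running this check before launching Chromium means the browser is only paid for on iCIMS, Taleo, and unknown pages.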

Workday specifically (the #1 ATS in legal recruiting)

Workday is the most common ATS at AmLaw firms. The trick: do not scrape the SPA. Hit the JSON endpoint behind it.

Pattern:

POST /wday/cxs/{tenant}/{site_id}/jobs
Content-Type: application/json
{
  "appliedFacets": {},
  "limit": 20,
  "offset": 0,
  "searchText": ""
}

The exact URL varies by tenant. Find it by opening the SPA in a real browser, opening DevTools Network tab, and watching for the jobs POST request. Once you have the URL pattern for one tenant, it generalizes across all Workday clients.

Each call returns up to `limit` jobs (20 in the request above). Paginate with offset until a page comes back empty. No browser needed.
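Offset pagination against that endpoint might look like the sketch below. It uses only the standard library. The `jobPostings` response field is an assumption (confirm it against the response you see in DevTools), and the function names, the 5-second gap default, and the `max_pages` guard are illustrative choices, not part of the Workday API.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator

PostJson = Callable[[str, dict], dict]


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_workday_jobs(
    url: str,
    limit: int = 20,
    post: PostJson = post_json,
    sleep: Callable[[float], None] = time.sleep,
    gap_seconds: float = 5.0,
    max_pages: int = 500,
) -> Iterator[dict]:
    """Yield job postings page by page until an empty page comes back."""
    offset = 0
    for page_number in range(max_pages):
        if page_number:
            sleep(gap_seconds)  # keep the per-host gap between pages
        body = {"appliedFacets": {}, "limit": limit, "offset": offset, "searchText": ""}
        data = post(url, body)
        jobs = data.get("jobPostings", [])  # assumed response field name
        if not jobs:
            return
        yield from jobs
        offset += len(jobs)
```

The `post` and `sleep` parameters are injectable so the pagination logic can be exercised against a fake endpoint without touching the network.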

iCIMS specifically (second most common)

iCIMS is harder. The job list is paginated HTML and they use anti-bot measures. The pattern that works:

Use Playwright with headless=True initially. If blocked, switch to headless=False (a visible browser actually helps).
Set realistic viewport: 1920x1080 typical.
Wait for the job list to render: page.wait_for_selector('.iCIMS_JobsTable, [data-test=job-list]', timeout=15000).
Extract job links from the rendered HTML.
Visit each job detail page with a 5-10 second jittered delay.

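iCIMS will rate-limit you if you hit more than ~6 pages per minute, which works out to an average gap of about 10 seconds. Stay toward the upper end of the delay range and plan accordingly.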
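The steps above can be sketched with Playwright's sync API for Python. The job-list selector comes from the steps above; the job-link selector is hypothetical and needs to be checked against the real page, and the function names are illustrative.

```python
import random
import time
from typing import Optional

JOB_LIST_SELECTOR = ".iCIMS_JobsTable, [data-test=job-list]"
JOB_LINK_SELECTOR = ".iCIMS_JobsTable a[href]"  # hypothetical; inspect the real page


def jittered_delay(low: float = 5.0, high: float = 10.0,
                   rng: Optional[random.Random] = None) -> float:
    """Pick a delay in seconds between low and high."""
    return (rng or random).uniform(low, high)


def scrape_icims(list_url: str, headless: bool = True) -> dict[str, str]:
    """Return {job_url: rendered_html} for every job linked from the list page."""
    # Imported lazily so the delay helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    pages: dict[str, str] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()
        page.goto(list_url)
        page.wait_for_selector(JOB_LIST_SELECTOR, timeout=15000)
        links = page.eval_on_selector_all(
            JOB_LINK_SELECTOR, "els => els.map(e => e.href)"
        )
        for link in dict.fromkeys(links):  # de-duplicate, keep page order
            time.sleep(jittered_delay())
            page.goto(link)
            pages[link] = page.content()
        browser.close()
    return pages
```

If the headless run gets blocked, calling `scrape_icims(url, headless=False)` is the switch described in the first step.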

Throttling rules that work

Aggressive throttling is the single biggest factor in scraper longevity. The rules:

Per-host gap: 5 seconds minimum between requests to the same host. Jitter +/- 30%.
Per-host concurrency: 1. Always. No exceptions.
Daily volume cap per host: 500 requests. If you need more, split across multiple days.
Backoff on errors: 1 second on first 5xx, 5 seconds on second, 30 seconds on third, then pause for an hour.
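These rules are small enough to encode directly. A sketch, with hypothetical names; the daily reset of the budget and the wiring into an actual request loop are left out:

```python
import random
from typing import Optional

BACKOFF_SECONDS = [1, 5, 30]  # first, second, third consecutive 5xx
PAUSE_SECONDS = 3600          # after that, pause for an hour


def next_gap(base: float = 5.0, jitter: float = 0.30,
             rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before the next request to the same host."""
    rng = rng or random
    # With the defaults this lands between 3.5 and 6.5 seconds.
    return base * (1 + rng.uniform(-jitter, jitter))


def backoff_delay(consecutive_errors: int) -> int:
    """Seconds to wait after the Nth consecutive 5xx response."""
    if consecutive_errors <= 0:
        return 0
    if consecutive_errors <= len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[consecutive_errors - 1]
    return PAUSE_SECONDS


class HostBudget:
    """Tracks the daily request cap for one host."""

    def __init__(self, daily_cap: int = 500) -> None:
        self.daily_cap = daily_cap
        self.used = 0

    def try_spend(self) -> bool:
        """Count one request; return False once the cap is reached."""
        if self.used >= self.daily_cap:
            return False
        self.used += 1
        return True


print(backoff_delay(1), backoff_delay(2), backoff_delay(3), backoff_delay(4))
```

If you read the 5-second gap as a hard floor rather than a midpoint, pass a larger `base` so the jittered value never drops below it. Per-host concurrency of 1 falls out naturally if each host gets a single worker that sleeps for `next_gap()` between requests.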

If you violate these, you get banned within hours. Once banned, recovery is days, sometimes never. We have lost access to specific subdomains permanently because of one bad weekend.

IP rotation: when to do it (and when not to)

The default: do not rotate IPs. Most scraping problems are not IP problems, they are throttling problems. Rotating IPs to send more requests faster is what gets your scraper banned across all your IPs at once.

Rotate IPs only when:

A specific host is hard-blocking your IP across all retries.
You can prove the block is IP-based and not throttling-based.
You have residential IPs available (not datacenter).

Datacenter IPs (AWS, GCP, Azure, DigitalOcean) are flagged immediately by most anti-bot systems. Residential IPs (consumer ISPs, mobile networks) cost more — $300 to $1,500 per month for sufficient pool size — but actually work.

Vendors we have used: Bright Data, Smartproxy, Oxylabs. Bright Data has the largest pool but the highest price. Smartproxy is the best balance for most agencies.

When to give up and ask the client to whitelist you

You will hit hosts where every trick fails. The pattern looks like:

95% success rate degrades to 50% over a week
Pages start returning Cloudflare challenge HTML instead of content
IP rotation does not help
Adding random delays does not help

When this happens, do not escalate the scraping. Instead, send the client this email:

Hi {client_contact},

We pull jobs from your career page (URL) into our system to keep your candidate pipeline fresh. Recently your anti-bot system has started challenging our scraper, which means we are missing new postings.

Two ways to fix this:

1. Whitelist our IPs ({your_ip_list}) on the career page domain.

2. Get us API access if you have an internal feed.

Either is a 5-minute change for your IT team. Let me know which is easier and I'll send the technical details.

{your_name}

We have sent this email 30+ times. Every client said yes. Most fixed it within a week.
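Knowing when to send that email depends on noticing challenge pages early. The sketch below is a rough heuristic, not a reliable detector: the marker strings are assumptions based on what Cloudflare interstitials have commonly contained and may change at any time, and the 50% threshold simply borrows the figure from the symptom list above.

```python
# Assumed markers; verify against the challenge HTML you actually receive.
CHALLENGE_MARKERS = (
    "<title>just a moment",  # title commonly seen on Cloudflare interstitials
    "_cf_chl_opt",           # script variable commonly seen on challenge pages
)


def looks_like_challenge(html: str) -> bool:
    """Heuristic: does this HTML look like a challenge page, not content?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def should_stop_and_email(pages: list[str], max_challenge_share: float = 0.5) -> bool:
    """True when too many recent pages for one client were challenges."""
    if not pages:
        return False
    challenged = sum(looks_like_challenge(page) for page in pages)
    return challenged / len(pages) >= max_challenge_share


challenge = "<html><head><title>Just a moment...</title></head><body></body></html>"
normal = "<html><head><title>Careers</title></head><body>Associate Attorney</body></html>"
print(should_stop_and_email([challenge, normal]))
print(should_stop_and_email([normal, normal, normal]))
```

When the check trips, the scraper for that client stops and a human sends the email. Nothing here attempts to get past the challenge.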

Handling the long tail (custom career pages)

Maybe 5% of clients run custom career pages — usually built in-house by their marketing team. These are the wild west. Patterns:

HTML scraping with Playwright is your fallback.
Use Defuddle or a similar content-extraction library to pull job text without manual selectors.
For each new custom page, expect 30 to 90 minutes of one-time setup to build a recipe.

Maintain a per-client recipe file. When the client redesigns their site, you only have to fix one recipe.

Storing what you scrape

A practical schema for scraped jobs:

Column	Type	Why
job_id	text PK	UUID, your internal ID
client_id	text	FK to clients table
source_url	text	Where you got it
scraped_at	timestamp	For freshness checks
title	text	Normalized
location	text	Normalized to city, state
description	text	Full body
salary_min, salary_max	int	Parsed from text
hash	text	SHA256 of (title + location + first 500 chars desc)
is_active	bool	False after 2 missing pulls

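The hash column is critical. On every pull, match each job to its stored row on a stable key (for example client_id plus source_url) and compare the new hash to the stored one. A new hash for an existing key means the job was updated. A key that is missing from two consecutive pulls means the job was removed.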
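A sketch of the hash and the comparison, assuming `source_url` is the stable key for a job. That key choice, the separator character between fields, and the function names are illustrative and not part of the schema above.

```python
import hashlib


def job_hash(title: str, location: str, description: str) -> str:
    """SHA256 over title, location, and the first 500 chars of the description."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing the same.
    payload = "\x1f".join([title, location, description[:500]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_pull(stored: dict[str, str], fresh: dict[str, str]) -> dict[str, list[str]]:
    """Compare {source_url: hash} from the database with the latest pull."""
    return {
        "new": sorted(k for k in fresh if k not in stored),
        "updated": sorted(k for k in fresh if k in stored and fresh[k] != stored[k]),
        "missing": sorted(k for k in stored if k not in fresh),
    }


stored = {
    "https://example.com/jobs/1": job_hash("Associate", "New York, NY", "Old text"),
    "https://example.com/jobs/2": job_hash("Paralegal", "Boston, MA", "Same text"),
}
fresh = {
    "https://example.com/jobs/1": job_hash("Associate", "New York, NY", "New text"),
    "https://example.com/jobs/3": job_hash("Counsel", "Austin, TX", "Brand new"),
}
print(diff_pull(stored, fresh))
```

Keys in `missing` are candidates for `is_active = False`, applied only after the second consecutive miss so that one flaky pull does not deactivate live jobs.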

Monitoring and alerting

A scraper without monitoring is a scraper that has been broken for 6 weeks without anyone noticing.

Track per-client:

Last successful scrape timestamp
Job count delta vs. previous pull (anomaly: delta > 50%)
Error rate (anomaly: > 5% over 24 hours)
Time to complete (anomaly: > 2x normal)

Alert on any anomaly. The alert should be a Slack ping with the client name and the issue. Do not wait for the customer to notice.
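The thresholds above fit in one function. A sketch with hypothetical field names; it only builds the alert text, and posting to Slack is left to whatever webhook or client you already use. A staleness check on the last successful scrape timestamp is omitted because no threshold is given for it here.

```python
from dataclasses import dataclass


@dataclass
class PullStats:
    client: str
    job_count: int
    previous_job_count: int
    error_rate_24h: float           # fraction between 0.0 and 1.0
    duration_seconds: float
    normal_duration_seconds: float


def find_anomalies(stats: PullStats) -> list[str]:
    """Return a human-readable issue for each threshold that was crossed."""
    issues = []
    if stats.previous_job_count > 0:
        delta = abs(stats.job_count - stats.previous_job_count) / stats.previous_job_count
        if delta > 0.5:
            issues.append(f"job count changed by {delta:.0%}")
    if stats.error_rate_24h > 0.05:
        issues.append(f"error rate {stats.error_rate_24h:.0%} over 24 hours")
    if stats.duration_seconds > 2 * stats.normal_duration_seconds:
        issues.append("run took more than 2x the normal time")
    return issues


def format_alert(stats: PullStats, issues: list[str]) -> str:
    """Build the message text: client name plus the issues."""
    return f"[scraper] {stats.client}: " + "; ".join(issues)


stats = PullStats("Acme LLP", 40, 100, 0.02, 300.0, 120.0)
issues = find_anomalies(stats)
if issues:
    print(format_alert(stats, issues))
```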

How long does this stack stay working?

The honest answer: scrapers break constantly. ATS vendors update their UIs. Client sites get redesigned. Anti-bot systems get tightened. Plan for 5 to 15% of your scrapers to break in any given month.

The mitigations:

Tiered scrapers: when a client-specific scraper fails, fall back to the ATS-family scraper, then to the generic Playwright scraper.
Auto-recovery: when a scraper has been failing for 24 hours, kick a regeneration job that uses an LLM to suggest new selectors based on a screenshot.
Customer-notify: when a scraper has been failing for 72 hours, email the customer.

We rebuild roughly 8% of our scraper recipes per month. The customer never sees most of these because tier 1 falls back to tier 2 cleanly.
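The tiered fallback is a loop over scrapers in priority order. A sketch with hypothetical names; treating an empty result as a failure is an assumption made here so that a silently broken selector also triggers the next tier.

```python
from typing import Callable, Sequence

Scraper = Callable[[str], list]


def scrape_with_fallback(url: str, tiers: Sequence[tuple[str, Scraper]]) -> tuple[str, list]:
    """Try each (tier_name, scraper) in order; return the first non-empty result."""
    errors = []
    for name, scraper in tiers:
        try:
            jobs = scraper(url)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
            continue
        if jobs:
            return name, jobs
        errors.append(f"{name}: returned no jobs")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


def client_specific(url: str) -> list:
    raise ValueError("selector not found")  # simulate a broken recipe


def ats_family(url: str) -> list:
    return [{"title": "Associate Attorney"}]


def generic_playwright(url: str) -> list:
    return []


tier, jobs = scrape_with_fallback(
    "https://example.com/careers",
    [("client", client_specific), ("ats_family", ats_family), ("generic", generic_playwright)],
)
print(tier, len(jobs))
```

Recording which tier served each pull gives the monitoring a cheap signal: a client that keeps landing on the second tier has a recipe that needs rebuilding.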

Build vs. buy

Building a scraping pipeline that handles 100+ ATSes, anti-bot systems, throttling, monitoring, and auto-recovery takes a senior engineer 4 to 6 months full time. Maintenance is roughly 8 to 16 hours per week ongoing.

If you have engineering and like control, build it. If you want to spend that engineering time on other parts of your product, placement.solutions bundles this for $99 to $499 per month with all 100+ adapters and auto-recovery already built in.

About placement.solutions: Built for recruiting agencies. 100+ ATS adapters, auto-recovering scrapers, semantic candidate matching. Sign up free.