How to Scrape Client Career Pages Without Getting Blocked

Scraping has changed. The simple "requests.get(url) + BeautifulSoup" pattern that worked in 2018 will get you IP-banned in 30 seconds in 2026. Cloudflare Turnstile, Akamai Bot Manager, PerimeterX, and DataDome have turned even basic scraping into a real engineering project.

This is the playbook we use at placement.solutions to keep 100+ ATS adapters running cleanly across 5,000+ active scrapes per day. Steal it.

This is for legitimate scraping of public job postings on client career pages. Do not use this to scrape gated content, copyrighted material, or anything explicitly disallowed by robots.txt for a specific path.

The four anti-bot systems you will hit

If you scrape any career page in 2026 you will encounter one of:

System                | Used by                   | Difficulty
Cloudflare Turnstile  | ~30% of career pages      | Medium
Akamai Bot Manager    | ~15% of large enterprise  | Hard
PerimeterX (HUMAN)    | ~10% of mid-market SaaS   | Hard
DataDome              | ~5%, growing              | Hard
Custom rate limiting  | Most others               | Easy

Each one detects bots differently. Cloudflare uses TLS fingerprinting and JS challenges. Akamai runs server-side fingerprinting and behavioral analysis. PerimeterX combines challenges with behavioral biometrics. DataDome is increasingly behavioral.

The five rules that get you through 95% of cases

  1. Use real Chromium via Playwright, not raw HTTP libraries.
  2. Throttle to 1 request per 5+ seconds per host. Jitter the gaps.
  3. Respect robots.txt for any path you cannot prove is a public job listing endpoint.
  4. Rotate IPs only if you have to, and only with residential proxies.
  5. Do not scrape past anti-bot challenges programmatically. Stop and ask the customer to whitelist your scraper.

The fifth rule is what separates mature scrapers from the ones that flame out in 2 weeks. If a target has a real anti-bot system, do not try to defeat it. Ask the customer to add your scraper IP to their allowlist or to use the API. We have not had a customer say no to this in 18 months.
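Rule 3 is easy to check mechanically before any request goes out. The sketch below uses Python's standard urllib.robotparser; the user agent string and the sample robots.txt are hypothetical, and a real scraper would first fetch the file from the target host.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-careers-bot"  # hypothetical user agent string


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt used only to illustrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /internal/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/careers/jobs"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/internal/feed"))
```

Running the check per path, rather than once per host, matches the rule as stated: a host can allow its job listings while disallowing other paths.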

Why Playwright beats raw HTTP

Raw requests or httpx calls give you a TLS fingerprint that screams "Python." Cloudflare sees this and challenges you immediately.

Playwright drives a real Chromium. The TLS fingerprint matches a real browser. The browser executes the JavaScript challenge. The browser handles cookies. You get the page like a normal user would.

Cost: Playwright is heavier (RAM, CPU, latency). Plan for 200-500MB per browser instance and 3-8 second page loads.

When a page has no public API you can call instead, the tradeoff is almost always worth it.
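A minimal sketch of the browser path, assuming Playwright for Python is installed along with its Chromium build. The function name and the 30 second timeout are illustrative choices, not settings from our production stack.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load url in headless Chromium and return the rendered HTML."""
    # Imported lazily so this module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for network activity to settle so JS-rendered content exists.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            # Always release the browser; each instance holds hundreds of MB.
            browser.close()
```

Launching one browser per call keeps the sketch simple. At volume you would likely reuse a browser and open a fresh context per host, which is where the RAM figures above start to matter.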

Detecting the ATS so you do not waste browser cycles

Loading Chromium for every URL is expensive. Most career pages run on a known ATS. Detect the ATS from the URL and use the API directly when possible.

ATS               | URL signal                   | Skip browser?
Workday           | myworkdayjobs.com            | Yes, JSON API
Greenhouse        | boards.greenhouse.io         | Yes, JSON API
Lever             | jobs.lever.co                | Yes, JSON API
SmartRecruiters   | careers.smartrecruiters.com  | Yes, JSON API
Ashby             | jobs.ashbyhq.com             | Yes, GraphQL
iCIMS             | careers-{slug}.icims.com     | No, HTML scrape
Taleo             | *.taleo.net                  | No, sticky session HTML
BambooHR          | {slug}.bamboohr.com/jobs     | Yes, clean HTML
Custom / unknown  | varies                       | No, Playwright fallback

For known ATSes with public APIs, hit the API directly. Faster, cleaner, less likely to break.
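The detection step can be a simple hostname match. This sketch encodes the table above as suffix rules; the function name and the lowercase ATS labels are our own choices for illustration.

```python
from urllib.parse import urlparse

# (host suffix, ATS label, can skip the browser), taken from the table above.
ATS_HOST_RULES = [
    ("myworkdayjobs.com", "workday", True),
    ("boards.greenhouse.io", "greenhouse", True),
    ("jobs.lever.co", "lever", True),
    ("careers.smartrecruiters.com", "smartrecruiters", True),
    ("jobs.ashbyhq.com", "ashby", True),
    ("icims.com", "icims", False),
    ("taleo.net", "taleo", False),
    ("bamboohr.com", "bamboohr", True),
]


def detect_ats(url: str) -> tuple[str, bool]:
    """Return (ats_label, skip_browser) for a career page URL."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, label, skip_browser in ATS_HOST_RULES:
        # Match the exact host or any subdomain of it, never a partial word.
        if host == suffix or host.endswith("." + suffix):
            return label, skip_browser
    return "unknown", False  # unknown hosts go to the Playwright fallback


if __name__ == "__main__":
    print(detect_ats("https://acme.wd1.myworkdayjobs.com/en-US/Careers"))
    print(detect_ats("https://careers-acme.icims.com/jobs/search"))
```

Matching on the hostname rather than the whole URL avoids false positives when an ATS domain merely appears in a query string or redirect parameter.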

Workday specifically (the #1 ATS in legal recruiting)

Workday is the most common ATS at AmLaw firms. The trick: do not scrape the SPA. Hit the JSON endpoint behind it.

Pattern:

POST /wday/cxs/{tenant}/{site_id}/jobs
Content-Type: application/json
{
  "appliedFacets": {},
  "limit": 20,
  "offset": 0,
  "searchText": ""
}

The exact URL varies by tenant. Find it by opening the SPA in a real browser, opening DevTools Network tab, and watching for the jobs POST request. Once you have the URL pattern for one tenant, it generalizes across all Workday clients.

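Each call returns one page of jobs, up to the limit in the request body (20 in the example above). Paginate by increasing offset until a page comes back empty or short. No browser needed.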
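Pagination over that endpoint might look like the sketch below, which uses only the standard library. The "jobPostings" response key is an assumption based on what the DevTools response typically shows, so confirm it against your tenant. The post parameter exists so the loop can be exercised without network access.

```python
import json
import urllib.request
from typing import Callable, Iterator


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_workday_jobs(
    url: str,
    limit: int = 20,
    post: Callable[[str, dict], dict] = post_json,
    max_pages: int = 500,
) -> Iterator[dict]:
    """Yield job postings page by page until the endpoint runs dry."""
    offset = 0
    for _ in range(max_pages):  # hard stop in case the endpoint never empties
        body = {"appliedFacets": {}, "limit": limit, "offset": offset, "searchText": ""}
        page = post(url, body).get("jobPostings", [])  # assumed response key
        yield from page
        if len(page) < limit:  # an empty or short page is the last one
            return
        offset += limit
```

This loop does no throttling of its own. In practice each call to post should go through the same per-host gap as any other request.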

iCIMS specifically (second most common)

iCIMS is harder. The job list is paginated HTML and they use anti-bot measures. The pattern that works:

  1. Use Playwright with headless=True initially. If blocked, switch to headless=False (visible browser actually helps).
  2. Set realistic viewport: 1920x1080 typical.
  3. Wait for the job list to render: page.wait_for_selector('.iCIMS_JobsTable, [data-test=job-list]', timeout=15000).
  4. Extract job links from the rendered HTML.
  5. Visit each job detail page with a 5-10 second jittered delay.

iCIMS will rate-limit you if you hit more than ~6 pages per minute. Plan accordingly.
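The five steps above can be sketched roughly as follows, assuming Playwright for Python. Two things are assumptions on our part: every link inside the job list is treated as a job detail link, and any Playwright error (including the selector timeout) is treated as "blocked" for the purpose of the headless fallback.

```python
import random
import time

LIST_SELECTOR = ".iCIMS_JobsTable, [data-test=job-list]"
LINK_SELECTOR = ".iCIMS_JobsTable a[href], [data-test=job-list] a[href]"


def jittered_delay(low: float = 5.0, high: float = 10.0) -> float:
    """Pick a random pause in seconds between detail page visits."""
    return random.uniform(low, high)


def dedupe(links: list[str]) -> list[str]:
    """Drop repeated links while keeping first-seen order."""
    return list(dict.fromkeys(links))


def scrape_icims(list_url: str, headless: bool = True) -> dict[str, str]:
    """Return {job_url: rendered_html} for every job linked from list_url."""
    from playwright.sync_api import sync_playwright  # lazy import

    pages: dict[str, str] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        try:
            context = browser.new_context(viewport={"width": 1920, "height": 1080})
            page = context.new_page()
            page.goto(list_url)
            page.wait_for_selector(LIST_SELECTOR, timeout=15000)
            hrefs = page.eval_on_selector_all(
                LINK_SELECTOR, "els => els.map(e => e.href)"
            )
            for link in dedupe(hrefs):
                time.sleep(jittered_delay())  # stays under ~6 pages per minute
                page.goto(link)
                pages[link] = page.content()
        finally:
            browser.close()
    return pages


def scrape_icims_with_fallback(list_url: str) -> dict[str, str]:
    """Try headless first, then retry once with a visible browser."""
    from playwright.sync_api import Error as PlaywrightError

    try:
        return scrape_icims(list_url, headless=True)
    except PlaywrightError:
        return scrape_icims(list_url, headless=False)
```

With a 5 to 10 second pause before every detail page, the loop averages about eight pages per minute at the very fastest and usually fewer, so raise the lower bound if you see rate limiting.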

Throttling rules that work

Aggressive throttling is the single biggest factor in scraper longevity. The rules:

  1. Per-host gap: 5 seconds minimum between requests to the same host. Jitter +/- 30%.
  2. Per-host concurrency: 1. Always. No exceptions.
  3. Daily volume cap per host: 500 requests. If you need more, split across multiple days.
  4. Backoff on errors: 1 second on first 5xx, 5 seconds on second, 30 seconds on third, then pause for an hour.

If you violate these, you get banned within hours. Once banned, recovery is days, sometimes never. We have lost access to specific subdomains permanently because of one bad weekend.
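A minimal per-host throttle that follows these rules might look like this. Two interpretations are ours: the 5 second minimum is treated as a hard floor, so the jitter is applied around a higher base gap of 7.5 seconds, and "then pause for an hour" is read as applying from the fourth consecutive error onward.

```python
import random
import time

MIN_GAP_SECONDS = 5.0         # rule 1: floor between requests to one host
JITTER = 0.30                 # rule 1: +/- 30%
DAILY_CAP = 500               # rule 3
BACKOFF_SECONDS = [1, 5, 30]  # rule 4: first, second, third 5xx
PAUSE_SECONDS = 3600          # rule 4: after that, pause for an hour


def next_gap(base: float = 7.5) -> float:
    """Jitter the base gap by +/- 30% without dropping below the floor."""
    factor = 1 + JITTER * (2 * random.random() - 1)
    return max(MIN_GAP_SECONDS, base * factor)


def backoff_delay(consecutive_errors: int) -> int:
    """Seconds to wait after the Nth consecutive 5xx response."""
    if consecutive_errors <= 0:
        return 0
    if consecutive_errors <= len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[consecutive_errors - 1]
    return PAUSE_SECONDS


class HostThrottle:
    """One instance per host, used by a single worker (rule 2)."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0
        self._sent_today = 0

    def wait_turn(self) -> None:
        """Block until the next request to this host is allowed."""
        if self._sent_today >= DAILY_CAP:
            raise RuntimeError("daily cap reached for this host")
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
        self._next_allowed = max(now, self._next_allowed) + next_gap()
        self._sent_today += 1

    def reset_daily(self) -> None:
        """Call once per day, for example from a scheduler."""
        self._sent_today = 0
```

Call wait_turn() immediately before each request to the host. The clock and sleep arguments exist only so the class can be tested without real waiting.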

IP rotation: when to do it (and when not to)

The default: do not rotate IPs. Most scraping problems are not IP problems, they are throttling problems. Rotating IPs to send more requests faster is what gets your scraper banned across all your IPs at once.

Rotate IPs only when:

Datacenter IPs (AWS, GCP, Azure, DigitalOcean) are flagged immediately by most anti-bot systems. Residential IPs (consumer ISPs, mobile networks) cost more — $300 to $1,500 per month for sufficient pool size — but actually work.

Vendors we have used: Bright Data, Smartproxy, Oxylabs. Bright Data has the largest pool but the highest price. Smartproxy is the best balance for most agencies.

When to give up and ask the client to whitelist you

You will hit hosts where every trick fails. The pattern looks like:

When this happens, do not escalate the scraping. Instead, send the client this email:

Hi {client_contact},

We pull jobs from your career page ({career_page_url}) into our system to keep your candidate pipeline fresh. Recently your anti-bot system has started challenging our scraper, which means we are missing new postings.

Two ways to fix this:

1. Whitelist our IPs ({your_ip_list}) on the career page domain.

2. Get us API access if you have an internal feed.

Either is a 5-minute change for your IT team. Let me know which is easier and I'll send the technical details.

{your_name}

We have sent this email 30+ times. Every client said yes. Most fixed it within a week.

Handling the long tail (custom career pages)

Maybe 5% of clients run custom career pages — usually built in-house by their marketing team. These are the wild west. Patterns:

Maintain a per-client recipe file. When the client redesigns their site, you only have to fix one recipe.

Storing what you scrape

A practical schema for scraped jobs:

Column                  | Type       | Why
job_id                  | text PK    | UUID, your internal ID
client_id               | text       | FK to clients table
source_url              | text       | Where you got it
scraped_at              | timestamp  | For freshness checks
title                   | text       | Normalized
location                | text       | Normalized to city, state
description             | text       | Full body
salary_min, salary_max  | int        | Parsed from text
hash                    | text       | SHA256 of (title + location + first 500 chars desc)
is_active               | bool       | False after 2 missing pulls
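As a concrete starting point, here is the schema above expressed as SQLite DDL and created from Python. The NOT NULL constraints and the default for is_active are our assumptions, and the foreign key to the clients table is left as a comment so the sketch stays self-contained.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id      TEXT PRIMARY KEY,            -- UUID, internal ID
    client_id   TEXT NOT NULL,               -- FK to clients table (not declared here)
    source_url  TEXT NOT NULL,               -- where the job was found
    scraped_at  TIMESTAMP NOT NULL,          -- for freshness checks
    title       TEXT NOT NULL,               -- normalized
    location    TEXT,                        -- normalized to city, state
    description TEXT,                        -- full body
    salary_min  INTEGER,                     -- parsed from text, may be absent
    salary_max  INTEGER,
    hash        TEXT NOT NULL,               -- SHA256 change-detection hash
    is_active   BOOLEAN NOT NULL DEFAULT 1   -- set to 0 after 2 missing pulls
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the jobs table if it does not exist yet."""
    conn.executescript(SCHEMA)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    for row in connection.execute("PRAGMA table_info(jobs)"):
        print(row[1], row[2])
```

The same column list carries over to Postgres with minor type changes.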

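The hash column is critical. On every pull, compare each job's new hash to the stored one. A new hash for an existing job means the job was updated. A job that is missing from the pull was probably removed, so mark it inactive once it has been missing from two pulls in a row.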
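The hash comparison could be sketched like this. Two details are assumptions: the three fields are joined with a newline so that field boundaries cannot blur together, and jobs are matched between pulls by source_url, since the schema does not say which key identifies "the same job".

```python
import hashlib


def job_hash(title: str, location: str, description: str) -> str:
    """SHA256 over title, location, and the first 500 chars of the description."""
    material = "\n".join([title, location, description[:500]])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def diff_pull(stored: dict[str, str], pulled: dict[str, str]) -> dict[str, list[str]]:
    """Compare {source_url: hash} from the database with the latest pull."""
    return {
        "new": sorted(k for k in pulled if k not in stored),
        "updated": sorted(k for k in pulled if k in stored and pulled[k] != stored[k]),
        "missing": sorted(k for k in stored if k not in pulled),
    }


def still_active(consecutive_missing_pulls: int) -> bool:
    """A job stays active until it has been missing from 2 pulls in a row."""
    return consecutive_missing_pulls < 2


if __name__ == "__main__":
    stored = {"https://example.com/jobs/1": job_hash("Associate", "Boston, MA", "Old text")}
    pulled = {
        "https://example.com/jobs/1": job_hash("Associate", "Boston, MA", "New text"),
        "https://example.com/jobs/2": job_hash("Paralegal", "Austin, TX", "Body"),
    }
    print(diff_pull(stored, pulled))
```

Hashing only the first 500 characters means edits deep in a long description will not register as updates, which is the tradeoff the schema above already makes.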

Monitoring and alerting

A scraper without monitoring is a scraper that has been broken for 6 weeks without anyone noticing.

Track per-client:

Alert on any anomaly. The alert should be a Slack ping with the client name and the issue. Do not wait for the customer to notice.
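The Slack ping can be as small as a POST to an incoming webhook, which accepts a JSON body with a "text" field. In this sketch the webhook URL is supplied by the caller, and the message wording and function names are illustrative only.

```python
import json
import urllib.request


def build_alert(client_name: str, issue: str) -> dict:
    """Build the Slack payload: client name first, then the issue."""
    return {"text": f"Scraper alert for {client_name}: {issue}"}


def send_alert(webhook_url: str, client_name: str, issue: str) -> int:
    """POST the alert to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_alert(client_name, issue)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Prints the payload only; no network call is made here.
    print(build_alert("Acme LLP", "0 jobs returned on the last 2 pulls"))
```

Keeping the payload builder separate from the sender makes it easy to route the same message to email or a pager later.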

How long does this stack stay working?

The honest answer: scrapers break constantly. ATS vendors update their UIs. Client sites get redesigned. Anti-bot systems get tightened. Plan for 5 to 15% of your scrapers to break in any given month.

The mitigations:

We rebuild roughly 8% of our scraper recipes per month. The customer never sees most of these because tier 1 falls back to tier 2 cleanly.

Build vs. buy

Building a scraping pipeline that handles 100+ ATSes, anti-bot systems, throttling, monitoring, and auto-recovery takes a senior engineer 4 to 6 months full time. Maintenance is roughly 8 to 16 hours per week ongoing.

If you have engineering and like control, build it. If you want to spend that engineering time on other parts of your product, placement.solutions bundles this for $99 to $499 per month with all 100+ adapters and auto-recovery already built in.

