Headless browsers for scraping: when to use Playwright, Puppeteer, or no browser at all

A headless browser is a real Chromium, Firefox, or WebKit running without a visible window. It renders JavaScript, fires network requests, holds cookies, and behaves like a user. That makes it the right tool for scraping modern single-page apps, sites that ship their data only after client-side rendering, and any flow that needs a real session. It is also heavy, slow, and expensive compared to a plain HTTP client. The whole job of picking a scraper is deciding when the extra weight is worth it.

First question: do you actually need a browser?

Before reaching for Playwright, open the site in Chrome with DevTools, go to the Network tab, and watch what happens when the page loads. Nine times out of ten the data you want is coming back as JSON from an XHR or fetch request, and you can hit that endpoint directly with an HTTP client. No JavaScript execution needed. No browser. A couple of headers, maybe a session cookie, and you are done. This path is typically ten to twenty times faster, uses a fraction of the memory, and survives site updates more gracefully, because JSON endpoints tend to change less often than page markup. If the endpoint exists, use it.
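As a minimal sketch of that direct path, the snippet below replays a request the way DevTools shows it, using only the Python standard library. The endpoint URL, the header values, and the function names are assumptions for illustration; copy the real ones from the Network tab of the target site.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL you see in the Network tab.
API_URL = "https://example.com/api/products?page=1"

# Placeholder header values; copy the real ones from the browser's request.
DEFAULT_HEADERS = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "X-Requested-With": "XMLHttpRequest",
}


def build_request(url, session_cookie=None):
    """Build a GET request that mirrors the page's own XHR/fetch call."""
    headers = dict(DEFAULT_HEADERS)
    if session_cookie:
        # Only needed when the endpoint is tied to a logged-in session.
        headers["Cookie"] = session_cookie
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(url, session_cookie=None, timeout=10.0):
    """Fetch the endpoint and decode the body as JSON (assumes UTF-8)."""
    request = build_request(url, session_cookie)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(API_URL, session_cookie="sid=...")` would return the parsed payload if the endpoint behaves like the one seen in the browser. If it responds with a challenge page instead of JSON, that is a sign the target falls into the browser category.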

You need a real browser when: the site genuinely renders content client-side with no clean API call, anti-bot systems (Cloudflare, Akamai Bot Manager, DataDome) challenge non-browser clients on the first request, or the flow requires real cookies, localStorage, or multi-step interaction that HTTP alone cannot replicate. Everything else is an HTTP job.
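The rule above is simple enough to write down as code, which can help when triaging a long list of targets. This is only a sketch: the `TargetProfile` fields and the `choose_client` function are hypothetical names, and the flags still have to be filled in by a person looking at DevTools.

```python
from dataclasses import dataclass


@dataclass
class TargetProfile:
    """What a manual look at the target told us (all fields are assumptions)."""
    client_rendered_without_api: bool      # content appears only after JS runs, no clean endpoint
    challenges_non_browser_clients: bool   # anti-bot challenge on the first request
    needs_browser_state: bool              # real cookies, localStorage, or multi-step interaction


def choose_client(profile):
    """Return "browser" if any of the three conditions holds, else "http"."""
    needs_browser = (
        profile.client_rendered_without_api
        or profile.challenges_non_browser_clients
        or profile.needs_browser_state
    )
    return "browser" if needs_browser else "http"
```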

Playwright: our default in 2026

Playwright is the tool we reach for on new scraping work. It is maintained by Microsoft, supports Chromium, Firefox, and WebKit from the same API, and its auto-waiting model (locators retry until they find the element or time out) is genuinely better than the manual sleeps and polling that older libraries relied on. Its device emulation, realistic user-agent handling, and first-class support for concurrent contexts make it a natural fit for scaling scrapes across thousands of targets without writing the same boilerplate a dozen times.
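To make the auto-waiting point concrete, here is a small sketch using Playwright's Python sync API. The function names, the default timeout, and the selector you pass in are assumptions; the Playwright import is done lazily inside the function so the module still loads on a machine without Playwright installed.

```python
def normalise(texts):
    """Collapse whitespace and drop empty strings from scraped text."""
    return [" ".join(t.split()) for t in texts if t.strip()]


def scrape_texts(url, selector, timeout_ms=10_000):
    """Open the page headlessly and return the text of every element matching selector."""
    # Lazy import: requires `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            items = page.locator(selector)
            # No sleep() or polling loop: wait_for retries until the first
            # match is visible or the timeout expires.
            items.first.wait_for(state="visible", timeout=timeout_ms)
            return normalise(items.all_inner_texts())
        finally:
            browser.close()
```

A call such as `scrape_texts("https://example.com/catalogue", ".product-title")` (both values hypothetical) would raise a timeout error if nothing matching the selector appears, which is usually the behaviour you want in a scraper.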

The Python binding is excellent. We run Playwright on Cloud Run Jobs for scheduled scrapes and on Compute Engine for long-lived workloads where the browser stays warm between targets. Memory per browser is in the 150-400 MB range depending on the page complexity, so plan your container sizing around how many concurrent pages you want per instance. For most Australian business clients, one browser per Cloud Run job instance with 10-30 tabs cycling through targets is the pattern that scales best.
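The "one browser, a fixed number of tabs cycling through targets" pattern can be sketched with a plain asyncio worker pool. This is an illustration rather than our production code: `run_tab_pool` and `handle_target` are hypothetical names, and `handle_target` stands in for whatever coroutine opens a page in the shared browser and scrapes one target.

```python
import asyncio


async def run_tab_pool(targets, handle_target, tabs=10):
    """Process targets with at most `tabs` running concurrently.

    Returns a dict mapping each target to its result, or to the exception
    it raised, so one bad page does not stop the rest of the batch.
    """
    queue = asyncio.Queue()
    for target in targets:
        queue.put_nowait(target)
    results = {}

    async def worker():
        # Each worker plays the role of one tab: take a target, scrape it, repeat.
        while True:
            try:
                target = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results[target] = await handle_target(target)
            except Exception as exc:
                results[target] = exc

    await asyncio.gather(*(worker() for _ in range(tabs)))
    return results
```

The `tabs` value is the knob to tune against container memory: start low, watch peak memory for the heaviest pages, and raise it only while there is headroom.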

Puppeteer: still strong, especially in Node ecosystems

Puppeteer is the Google-maintained Node library that predates Playwright and was the foundation many teams built on. It is Chromium-first (recent releases can also drive Firefox), which is fine for most scraping since Chromium renders the modern web correctly, and the API is mature and well-documented. If your team lives in Node and you already have Puppeteer infrastructure running, there is no reason to rewrite. Its ecosystem of stealth plugins (puppeteer-extra, puppeteer-extra-plugin-stealth) is also more mature than Playwright's equivalent, which matters when you are actively fighting detection.

Selenium: the compatibility layer, not the first choice

Selenium is still the standard when you need cross-browser scraping at scale or when you are integrating with legacy test harnesses. For pure web scraping in 2026, Playwright handles the same cases with a saner API and better defaults. We only pick Selenium when we inherit an existing Selenium codebase that would cost more to rewrite than to maintain, or when a target specifically requires a browser profile setup that is easier to automate with Selenium's older, more flexible driver model.

The anti-bot reality

Every serious commercial target now runs bot detection. Cloudflare is the most common, Akamai and DataDome show up on enterprise properties, and PerimeterX (now part of HUMAN Security) guards a lot of Australian retail. Raw headless browsers are detected easily because the JavaScript environment leaks automation signals (missing plugins, navigator.webdriver set to true, suspicious screen dimensions). Stealth plugins patch the obvious leaks. Residential proxies address the IP reputation problem. Neither is a silver bullet. The scrapers that survive long-term are the ones that look genuinely human: realistic user agents, real mouse and scroll events, sensible timing between actions, cookies persisted across runs, and session rotation when something starts to smell.
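Two of those habits, sensible timing and cookies persisted across runs, are easy to sketch. The helper names, the default delay values, and the state-file path are assumptions; `browser` and `context` are expected to be Playwright `Browser` and `BrowserContext` objects, whose `storage_state` option saves and restores cookies and localStorage.

```python
import random
from pathlib import Path


def human_delay(base_s=1.5, jitter_s=1.0, rng=random):
    """Return a pause in seconds: base plus uniform jitter, never negative."""
    return max(0.0, base_s + rng.uniform(-jitter_s, jitter_s))


def open_persistent_context(browser, state_path):
    """Create a context that reuses state saved by a previous run, if any."""
    path = Path(state_path)
    if path.exists():
        return browser.new_context(storage_state=str(path))
    # First run: start with a clean context.
    return browser.new_context()


def save_context_state(context, state_path):
    """Write cookies and localStorage to disk for the next run."""
    context.storage_state(path=str(state_path))
```

In a scrape loop this would look like `time.sleep(human_delay())` between actions and a `save_context_state(context, "state.json")` call before closing the browser. Neither makes a scraper undetectable; they only remove two of the more obvious tells.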

Operational reality: browsers break more than HTTP

The biggest operational cost of running headless browsers in production is not the compute. It is the debugging. A Playwright script that works on your laptop fails in a container because a font is missing. A scrape that works today breaks tomorrow because the site shipped a new layout. Memory leaks in long-lived browser sessions silently kill your job. We build every browser scraper with screenshots on failure, video capture on retry, structured logs for every action, and automated schema-drift detection so we know when the target changed before the data goes stale. If you are putting a browser-based scraper into production without these guardrails, you are buying yourself a pager at 3am.
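Schema-drift detection does not need to be elaborate to be useful. The sketch below checks a batch of scraped records against an expected shape and emits one structured log line per event. The field names, the threshold, and both function names are hypothetical; a real check would be tuned to the target.

```python
import json
import logging

# Hypothetical expected shape of one scraped record.
EXPECTED_FIELDS = {"title": str, "price": float, "url": str}


def detect_schema_drift(records, expected=None, max_missing_ratio=0.2):
    """Return a list of drift problems; an empty list means the batch looks healthy."""
    expected = EXPECTED_FIELDS if expected is None else expected
    if not records:
        return ["no records scraped"]
    problems = []
    total = len(records)
    for field, expected_type in expected.items():
        values = [r.get(field) for r in records]
        missing = sum(1 for v in values if v in (None, ""))
        wrong_type = sum(
            1 for v in values
            if v not in (None, "") and not isinstance(v, expected_type)
        )
        if missing / total > max_missing_ratio:
            problems.append(f"{field}: missing in {missing}/{total} records")
        if wrong_type:
            problems.append(f"{field}: unexpected type in {wrong_type}/{total} records")
    unexpected = sorted({key for r in records for key in r} - set(expected))
    if unexpected:
        problems.append("unexpected fields: " + ", ".join(unexpected))
    return problems


def log_event(logger, action, **fields):
    """Emit one JSON log line per scraper action so logs stay machine-readable."""
    logger.info(json.dumps({"action": action, **fields}, sort_keys=True))
```

Running `detect_schema_drift` on a small sample at the end of every job, and alerting when the list is non-empty, catches a changed layout on the first bad run rather than after the data has gone stale.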

The short version

  • First rule: check the DevTools Network tab. If there is a clean API, use it and skip the browser
  • Playwright is the default for new browser scraping work in 2026 (Python or Node)
  • Puppeteer is fine if you are already running it in a Node stack, especially with stealth plugins
  • Selenium stays for legacy codebases and specific driver requirements, not for new work
  • Plan for anti-bot: realistic user agents, residential proxies where needed, human-shaped timing
  • In production, instrument everything: screenshots, video on retry, schema drift alerts, memory watchdogs
