Web Scraping

Cloud Run Jobs, Compute Engine, and Selenium: how we actually run scrapers in production

Most scraping engagements we take on in Australia eventually land on the same infrastructure question: do we run this on Cloud Run Jobs, Compute Engine, or something else entirely? The answer depends on how long the job runs, whether it needs a real browser, how much IP reputation matters, and how often it fires. This post walks through how we actually decide, what each piece does, and where Selenium (and more often Playwright) fits into the stack.

Cloud Run Jobs: the default for most scraping work

Cloud Run Jobs is the first tool we reach for on almost every scraping engagement. It runs a containerised job on Google Cloud, scales to zero between runs, handles retries and timeouts natively, and is cheap: you are billed only for the CPU and memory the job uses while it runs. For scheduled scrapes (daily price feeds, weekly listing pulls, hourly competitor snapshots) it is hard to beat.

The sweet spot is jobs that run for less than an hour, fit inside 8 GB of memory, and do not need a long-lived IP address. That covers the majority of business-grade scraping: BigQuery-fed competitor monitoring, property listings for Australian real estate platforms, pricing intelligence for retail, and regulatory filings. We wire Cloud Scheduler to fire the job on a cron schedule; the job reads its target list from a GCS bucket or BigQuery, writes structured output back to BigQuery, and emits structured logs to Cloud Logging, where they can feed log-based metrics. That is a complete scraper in a few hundred lines plus a Dockerfile.
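
As a rough illustration of that entrypoint, here is a minimal sketch. It assumes the target list has already been loaded into a Python list. Cloud Run Jobs sets CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT on each task; the function names, log fields, and example URLs are illustrative assumptions rather than our production code. Writing one JSON object per line to stdout lets Cloud Logging pick up the severity and message fields.

```python
import json
import os
import sys


def shard(targets, task_index, task_count):
    """Return the targets this task is responsible for (round-robin split)."""
    return [t for i, t in enumerate(targets) if i % task_count == task_index]


def log(severity, message, **fields):
    """Emit one JSON line on stdout; Cloud Logging reads 'severity' and 'message'."""
    record = {"severity": severity, "message": message, **fields}
    sys.stdout.write(json.dumps(record) + "\n")
    return record


def main(targets):
    # Cloud Run Jobs sets these per task; default to a single task for local runs.
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))
    mine = shard(targets, task_index, task_count)
    log("INFO", "task starting", task_index=task_index, assigned=len(mine))
    return mine


if __name__ == "__main__":
    # Hypothetical targets; in the setup above they would come from GCS or BigQuery.
    main(["https://example.com/a", "https://example.com/b", "https://example.com/c"])
```

Splitting the target list by task index means the same container can run as one task or as many parallel tasks without code changes.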

When Cloud Run Jobs starts to hurt

Cloud Run Jobs stops being the right choice when you hit one of four walls: a runtime longer than the task timeout (10 minutes by default, and configurable only up to a hard ceiling); memory pressure from holding large in-memory state or a headless browser session that will not release; target sites that aggressively rate-limit or fingerprint the shared Google Cloud IP ranges; or a crawl that needs persistent state (cookies, session tokens, cached embeddings) across runs without rebuilding from scratch each time.
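
One way to soften the first and last of those walls before moving to a VM is to stop cleanly before the timeout and checkpoint the remaining work. The sketch below is a hedged illustration, not a description of our production code: `run_with_budget`, its parameters, and the checkpoint format are all hypothetical, and the local file path stands in for durable storage such as a GCS object, because a Cloud Run container's filesystem does not survive the run.

```python
import json
import time
from pathlib import Path


def run_with_budget(targets, scrape_one, budget_seconds, checkpoint_path,
                    clock=time.monotonic):
    """Process targets until the time budget is spent, then checkpoint the rest."""
    checkpoint = Path(checkpoint_path)
    # Resume from a previous run's leftover work if a checkpoint exists.
    if checkpoint.exists():
        pending = json.loads(checkpoint.read_text())
    else:
        pending = list(targets)

    deadline = clock() + budget_seconds
    done = []
    while pending and clock() < deadline:
        done.append(scrape_one(pending.pop(0)))

    if pending:
        checkpoint.write_text(json.dumps(pending))  # next run picks these up
    elif checkpoint.exists():
        checkpoint.unlink()  # finished cleanly; nothing left to resume
    return done, pending
```

Setting the budget comfortably below the configured task timeout leaves room for the final write. If every run ends with work still pending, that is a signal the crawl has outgrown this model.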

Compute Engine: when you need a long-lived machine

Compute Engine is the escape hatch for scrapes that Cloud Run cannot hold. A VM runs continuously, can hold a stable external IP (or a pool of them), keeps state between runs, and can drive a full browser for hours without restart. It bills for every hour it is up, idle or not, but for long, continuous work it can come out cheaper per job than Cloud Run.

We use Compute Engine for three specific cases. First: crawls that need IP warmup. Some target sites score each IP's reputation on months of request history and gate access on that score. A VM with a reserved static IP earns that reputation; a Cloud Run Job on a shared egress range never will. Second: browser automation at scale. Holding a Playwright or Selenium browser session open for hours is cheaper and more reliable on a VM than spinning up a fresh container per job. Third: any scrape that needs a long tail of human-like interaction, which means real cookies, real local storage, and occasionally a real mouse moving across a page.
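
To make the long-lived browser case concrete, here is a minimal sketch using Playwright's persistent context, which keeps cookies and local storage in a directory on the VM's disk between runs. The helper names, the profile root path, and the one-profile-per-host layout are assumptions for illustration, not a description of our setup.

```python
from pathlib import Path
from urllib.parse import urlparse


def profile_dir_for(url, root="/var/lib/scraper/profiles"):
    """One browser profile per target host, so state persists per site."""
    host = urlparse(url).hostname or "unknown"
    return Path(root) / host


def crawl_with_profile(urls, profile_dir):
    """Visit several URLs in one browser session backed by an on-disk profile."""
    # Third-party import kept local so HTTP-only code paths do not need Playwright.
    from playwright.sync_api import sync_playwright

    pages = {}
    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(
            user_data_dir=str(profile_dir),
            headless=True,
        )
        try:
            # A persistent context may already have a page open; reuse it if so.
            page = context.pages[0] if context.pages else context.new_page()
            for url in urls:
                page.goto(url, wait_until="domcontentloaded")
                pages[url] = page.content()
        finally:
            context.close()
    return pages
```

On a VM the profile directory sits on a persistent disk, so the next run starts with the same cookies and storage. On Cloud Run the same directory would be gone after each execution.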

Where Selenium and Playwright fit

Selenium is the old workhorse of browser automation. It works, it has been around forever, and you can find an engineer who knows it anywhere. For most new work in 2026 we use Playwright instead. Playwright is faster, has a saner API, handles modern single-page apps more predictably, and has first-class support for the things real scrapes lean on (per-context user agents for rotation, device emulation, locator auto-waiting). If you are still maintaining legacy Selenium scrapers, keep them running; do not rewrite for the sake of it. If you are starting fresh, use Playwright.

A common mistake is reaching for a browser automation library when you do not need one. If the target site returns clean HTML, or the browser's network tab shows a JSON API behind the page, plain HTTP plus a parser (httpx + selectolax is our default in Python) is ten times faster, uses a fraction of the memory, and breaks less often. We only bring in a headless browser when JavaScript rendering is genuinely required, and even then we try to capture the XHR calls the page is making and hit those directly. The rule of thumb: use the smallest tool that gets the data.
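
A minimal sketch of that HTTP-first path, using httpx and selectolax. The JSON payload shape, the CSS selector, and the function names are hypothetical placeholders; a real target would need its own. The third-party imports sit inside the functions only so the JSON branch can be exercised without those packages installed.

```python
import json


def extract_titles(content_type, body, selector="h2.title"):
    """Pull titles from a JSON payload if there is one, otherwise from HTML."""
    if "json" in content_type:
        # Assumed payload shape: {"items": [{"title": ...}, ...]}
        return [item["title"] for item in json.loads(body)["items"]]
    from selectolax.parser import HTMLParser  # only needed on the HTML path
    return [node.text(strip=True) for node in HTMLParser(body).css(selector)]


def scrape(url):
    """Fetch one URL over plain HTTP and extract titles; no browser involved."""
    import httpx

    with httpx.Client(timeout=10.0, follow_redirects=True) as client:
        response = client.get(url)
        response.raise_for_status()
        content_type = response.headers.get("content-type", "")
        return extract_titles(content_type, response.text)
```

If `scrape` comes back empty for a page that clearly shows data in a browser, that is usually the cue to look for the underlying XHR endpoint before reaching for a headless browser.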

What a production stack looks like for an Australian client

A typical production scrape we build for an Australian business looks like this:

  • A Cloud Run Job runs the scraper on a schedule.
  • Python does the work, with httpx for the HTTP path and Playwright for the browser path.
  • Proxies go through a residential pool if IP reputation matters, or direct egress if not.
  • Structured output goes to BigQuery.
  • Failures and schema drift fire alerts to Slack via Cloud Logging sinks.
  • Dead letters land in a GCS bucket for reprocessing.
  • A Looker Studio dashboard on top gives the client live visibility.

If the scrape needs a long-lived browser or a reserved IP, we swap the Cloud Run Job for a Compute Engine VM and run the same code with a different driver.
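
The schema-drift and dead-letter steps in that stack can be sketched roughly as follows. The column names and types in `EXPECTED_SCHEMA` are hypothetical, and the actual BigQuery and GCS writes are left out; this only shows the split between rows that are safe to load and rows that should be parked for reprocessing.

```python
EXPECTED_SCHEMA = {"url": str, "price": float, "scraped_at": str}  # hypothetical columns


def validate_row(row, schema=EXPECTED_SCHEMA):
    """Return a list of drift problems; an empty list means the row matches."""
    problems = [f"missing field: {k}" for k in schema if k not in row]
    problems += [f"unexpected field: {k}" for k in row if k not in schema]
    problems += [
        f"wrong type for {k}: {type(row[k]).__name__}"
        for k, expected in schema.items()
        if k in row and not isinstance(row[k], expected)
    ]
    return problems


def partition_rows(rows, schema=EXPECTED_SCHEMA):
    """Split rows into clean rows (for BigQuery) and dead letters (for GCS)."""
    clean, dead = [], []
    for row in rows:
        problems = validate_row(row, schema)
        if problems:
            # Keep the original row alongside the reasons so it can be replayed.
            dead.append({"row": row, "problems": problems})
        else:
            clean.append(row)
    return clean, dead
```

Logging the count of dead letters per run at a high severity gives the alerting layer something simple to trigger on.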

The short version

  • Default to Cloud Run Jobs for scheduled, containerised scrapes under an hour
  • Switch to Compute Engine when you need IP reputation, long-lived browser sessions, or persistent state
  • Prefer Playwright over Selenium for new browser automation (but keep existing Selenium scrapers running)
  • Do not use a browser at all if the target has an HTTP or JSON endpoint you can hit directly
  • Always write structured output to BigQuery, alert on schema drift, and keep a dead-letter path for failures
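
The checklist above can also be written down as a small decision helper. This is a simplification under stated assumptions: the `Workload` fields and the returned labels are hypothetical names, and the 60-minute threshold simply mirrors the "under an hour" rule of thumb.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_javascript: bool = False        # page only renders with a browser
    expected_minutes: float = 10          # typical runtime of one run
    needs_stable_ip: bool = False         # target scores IP reputation
    needs_persistent_state: bool = False  # cookies or sessions across runs
    long_lived_browser: bool = False      # browser session held for hours


def choose_stack(w):
    """Return (runtime, fetcher) following the rules of thumb in the list above."""
    runtime = "cloud-run-job"
    if (w.needs_stable_ip or w.needs_persistent_state
            or w.long_lived_browser or w.expected_minutes >= 60):
        runtime = "compute-engine"
    # Only bring in a browser when JavaScript rendering is actually required.
    fetcher = "playwright" if (w.needs_javascript or w.long_lived_browser) else "httpx"
    return runtime, fetcher
```

For example, `choose_stack(Workload(needs_javascript=True))` keeps the job on Cloud Run but switches the fetcher to Playwright, while setting `needs_stable_ip=True` moves it to a VM.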
