How I Use Serverless Web Scraping for Cheap, Reliable Data — and When It Breaks

A practical playbook for using serverless web scraping to collect scheduled data cheaply, with tradeoffs, IP strategies, and India‑specific pitfalls.

Written by: Rohan Deshpande

[Cover image: person coding on a laptop, code visible on screen. Credit: Christin Hume on Unsplash]

I needed a reliable daily snapshot of a handful of Indian websites — price lists, public tender entries, a competitor’s mobile plan page — without running a 24/7 server or paying for a big proxy farm. The solution that stuck was simple: schedule tiny serverless functions to spin up a headless browser, fetch the page, save HTML to object storage, and parse later.

If you care about cost, simplicity, and low ops overhead, serverless web scraping is a surprisingly practical pattern. It’s not a magic wand — it has a handful of real limits — but for many use cases it’s the best tradeoff between money, reliability, and maintenance.
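The loop described above — spin up a headless browser, fetch, save HTML, parse later — can be sketched roughly as follows. `saveObject` is a hypothetical stand-in for whatever storage SDK you use (S3, GCS, etc.), and the date-stamped key shape is my assumption, not a standard:

```javascript
// Sketch of a scheduled snapshot function: render a page headlessly,
// then write the raw HTML to object storage under a date-stamped key.

// Build an idempotent storage key: same URL + same day = same object,
// so a retried run overwrites instead of duplicating.
function objectKey(url, isoDate) {
  const { hostname, pathname } = new URL(url);
  const path = pathname.replace(/\/+$/, '') || '/index';
  return `${isoDate}/${hostname}${path}.html`;
}

async function snapshot(url, saveObject) {
  // Lazy-require keeps module load cheap when the function is invoked
  // for non-render work; puppeteer is assumed to be bundled in the image.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    const html = await page.content();
    const key = objectKey(url, new Date().toISOString().slice(0, 10));
    await saveObject(key, html);
    return key;
  } finally {
    await browser.close(); // always release the browser, even on failure
  }
}

module.exports = { objectKey, snapshot };
```

Because the key is deterministic per URL per day, the scheduler can retry a failed run freely without leaving duplicate objects behind.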

What I run (the architecture)

Why this works for me

Three practical patterns I use

  1. Small, frequent snapshots

    • Best when you track small changes (price, availability).
    • Keep runs short: request, wait for a specific DOM element, dump HTML, quit.
    • Save screenshots only on change to cut storage costs.
  2. Render‑then‑parse

    • Use Puppeteer/Playwright to render JavaScript-heavy pages, then hand the HTML to a parser.
    • Split rendering and parsing: the renderer function writes HTML to storage; a much cheaper parser function reads it back and extracts fields. Splitting the two keeps retry logic isolated and means a parser bug never forces a re-render.
  3. Containerized headless browsers

    • Cold starts and Lambda's deployment-size limits push me to Cloud Run or small containers when pages need a full browser.
    • Container images let me bake in fonts, locales, and binary tweaks so the render matches what users see in India.

Real constraints and tradeoffs

India‑specific notes

Operational tips that saved me time

When to avoid serverless web scraping

Final thought

Serverless web scraping isn’t a one-size-fits-all solution, but it is a pragmatic, low‑ops way to collect small to moderate amounts of web data reliably and cheaply — especially if you need a handful of regionally accurate snapshots for analytics, monitoring, or small automation tasks. The pattern forces you to design for idempotency, logging, and graceful degradation — which pays dividends later.
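The graceful-degradation part can be as small as a retry wrapper that falls back to the last stored snapshot instead of failing the whole pipeline. A minimal sketch, where `fetchFresh` and `loadLastGood` are hypothetical stand-ins for the render step and the storage read:

```javascript
// Try the scrape a few times with exponential backoff; on total
// failure, log and serve the last-known-good snapshot marked stale.
async function snapshotWithFallback(fetchFresh, loadLastGood, attempts = 3, baseDelayMs = 1000) {
  for (let i = 0; i < attempts; i++) {
    try {
      return { stale: false, html: await fetchFresh() };
    } catch (err) {
      console.error(`attempt ${i + 1}/${attempts} failed: ${err.message}`);
      if (i < attempts - 1) {
        // Backoff between attempts: base, 2x base, 4x base, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  return { stale: true, html: await loadLastGood() };
}
```

Downstream consumers then get a `stale` flag instead of a hard failure, which is usually the right behavior for daily monitoring jobs.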

If you want, I can share my Cloud Run container Dockerfile and a tiny Puppeteer starter job (the version I use to fetch a retail price page and save a screenshot). It’s a handy starting point if you want to test this pattern without committing to a VM.