How to Build a Content Scraper for Context (Without Breaking)

December 15, 2025

We manage over $1M/month in paid media with a team of five media buyers. No creative team. No creative budget. To scale, we built an automated pipeline that generates hundreds of ad variations from client content—quiz questions, article sections, product descriptions, landing pages.

But there was a problem: we needed structured content data to feed the pipeline.

Some clients had APIs. Most didn’t. We needed to scrape thousands of URLs across WordPress sites, Shopify stores, custom React apps, and headless CMS platforms. The content had to be clean, structured, and ready to feed into our image generation and ad assembly systems.

I thought I’d built a content scraper. Five minutes into testing, I realized I’d built a very expensive way to scrape nothing.

The code was clean. The logic was sound. I’d fetch the URL, parse the HTML with Cheerio, extract the content, and store it in our database. Simple, elegant, and—as it turned out—completely useless.
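
For context, that first version looked roughly like this. A minimal sketch, not our production code; naiveScrape and the selector are illustrative:

import { load } from 'cheerio'

async function naiveScrape(url: string): Promise<string[]> {
  const res = await fetch(url)   // plain HTTP GET, no JavaScript execution
  const html = await res.text()
  const $ = load(html)

  // Collect paragraph text from whatever the server actually returned
  const paragraphs: string[] = []
  $('p').each((_, p) => {
    const text = $(p).text().trim()
    if (text.length > 0) paragraphs.push(text)
  })

  return paragraphs  // on a client-rendered app, this comes back nearly empty
}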

Why Everything Breaks

Here’s what I was getting: empty divs, skeleton loaders, and placeholder text. The kind of HTML payload you’d see if you disabled JavaScript in your browser and tried to use a modern web app. Which is exactly what I was doing.

Modern websites aren’t static HTML anymore. They’re React components, Vue templates, and Next.js apps. The server sends a shell, and JavaScript renders the content. When you fetch() a URL, you get the before-picture—the empty container waiting for JavaScript to fill it in.

But it’s worse than that. Even on sites that do server-side rendering, half the content is lazy-loaded. Scroll down, new content appears. Click “load more,” more content appears. Your simple fetch call happens before any of that loads, so you miss it.

I needed something that could:

  • Execute JavaScript and wait for content to render
  • Handle infinite scroll and lazy loading
  • Work across WordPress, Shopify, custom React apps—everything
  • Not break every time a site updates its HTML structure

The $2 Solution

Enter ScrapingFish. It’s a paid API ($0.002 per request) that handles the hard parts: JavaScript rendering, browser automation, infinite scroll, IP rotation. At $2 per thousand pages, it’s cheaper than the time I’d spend debugging my own browser automation setup.

The key setting is this:

const params = new URLSearchParams({
  url,
  api_key: apiKey,
  render_js: 'true',  // The magic switch
  js_scenario: JSON.stringify(scenario),
})

That render_js: 'true' is the difference between scraping modern websites and scraping empty shells. Learn more about JavaScript rendering in ScrapingFish. But the really clever part is the js_scenario.
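
Wiring that into a request is a single GET. Here's a hedged sketch; SCRAPINGFISH_ENDPOINT is a placeholder for the request URL from your ScrapingFish dashboard, and fetchRenderedHtml is just an illustrative name:

// Sketch only: the endpoint constant is a placeholder, not the real ScrapingFish URL.
const SCRAPINGFISH_ENDPOINT = 'https://your-scrapingfish-endpoint/'

async function fetchRenderedHtml(url: string, apiKey: string, scenario: object): Promise<string> {
  const params = new URLSearchParams({
    url,
    api_key: apiKey,
    render_js: 'true',
    js_scenario: JSON.stringify(scenario),
  })

  const res = await fetch(`${SCRAPINGFISH_ENDPOINT}?${params}`)
  if (!res.ok) {
    throw new Error(`Scrape failed for ${url}: HTTP ${res.status}`)
  }

  return res.text()  // fully rendered HTML, ready for Cheerio
}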

Teaching a Scraper to Scroll

Most interesting content on modern sites loads as you scroll. Product descriptions, blog posts, quiz questions—they’re not in the initial payload. They load when you get near them. A scraper that doesn’t scroll is a scraper that’s missing most of the content.

Here’s what we learned: you need to scroll slowly, wait between scrolls, and do it enough times to trigger all the lazy-loaded sections. After testing on dozens of sites, we found this pattern:

function buildInfiniteScrollScenario() {
  const steps = [{ wait: 800 }]  // Initial page load
  
  // Scroll down 8 times, waiting between each
  for (let i = 0; i < 8; i++) {
    steps.push({ scroll: 1400 })  // Scroll 1400px
    steps.push({ wait: 700 })     // Wait 700ms for content to load
  }
  
  steps.push({ wait: 1200 })  // Final wait
  return { steps }
}

Eight scrolls with 700ms waits. Why those numbers? Trial and error. Too few scrolls and you miss content. Too many and you’re wasting time and money. No waits and content doesn’t have time to load. This pattern catches about 95% of lazy-loaded content across the sites we tested.
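
Putting the pieces together, a single page scrape looks something like this (a sketch reusing the illustrative fetchRenderedHtml wrapper from above; the URL is made up):

// Render the page with the scroll scenario, then hand the HTML to the extractor
const scenario = buildInfiniteScrollScenario()
const html = await fetchRenderedHtml('https://example.com/quiz', apiKey, scenario)
const sections = extractSectionsFromHtml(html)  // covered in the next section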

Tip

Real example: A quiz site that loads questions as you scroll. Without this scenario, we got 3 questions. With it, we got all 20. That’s the difference between useful and useless.

Making It Work Everywhere

Every CMS structures HTML differently. WordPress uses .entry-content wrappers. Shopify has [data-product-description] attributes. Custom React apps might use data-section tags. Headless CMS sites usually have clean semantic HTML.

Our strategy: try the best-case scenario first, fall back to generic patterns.

import { load } from 'cheerio'

function extractSectionsFromHtml(html: string) {
  const $ = load(html)
  const sections: { title: string; paragraphs: string[] }[] = []

  $('h2').each((_, heading) => {
    const title = $(heading).text().trim()
    const paragraphs: string[] = []

    // Try semantic wrapper first
    const section = $(heading).closest('[data-section]')
    if (section.length > 0) {
      section.find('p').each((_, p) => {
        const text = $(p).text().trim()
        if (text.length > 10) paragraphs.push(text)
      })
    } else {
      // Fallback: follow DOM siblings until next heading
      let sibling = $(heading).next()
      while (sibling.length && !/^h[1-3]$/.test(sibling[0].tagName ?? '')) {
        if (sibling.is('p')) {
          const text = sibling.text().trim()
          if (text.length > 10) paragraphs.push(text)
        }
        sibling = sibling.next()
      }
    }

    if (paragraphs.length > 0) {
      sections.push({ title, paragraphs })
    }
  })

  return sections
}

If a developer added data-section attributes, we trust them—they’re telling us “this is a content section.” When that’s not available, we follow a pattern that works on 80% of sites: headings followed by paragraphs until the next heading.

The length > 10 filter matters. Short text is usually navigation, buttons, or metadata. Real content is longer.

The Unglamorous Part: Not Breaking

When you’re scraping thousands of URLs, things will break. APIs fail, pages 404, content structures change. The key is: one failure shouldn’t kill the entire batch.

for (const offer of targetOffers) {
  try {
    const sections = await scrapeOfferContent(offer.url, apiKey)
    const paragraphs = flattenSections(sections)
    
    await storeContent(db, offer.id, paragraphs)
    await updateStatus(db, offer.id, { 
      status: 'completed',
      paragraphs_scraped: paragraphs.length 
    })
    
    summaries.push({ offerId: offer.id, status: 'success' })
  } catch (error) {
    summaries.push({ 
      offerId: offer.id, 
      status: 'error',
      error: error.message 
    })
    
    await updateStatus(db, offer.id, { 
      status: 'failed',
      error: error.message 
    })
  }
}

Process items one at a time (rate limits), catch errors per-item (resilience), and track status in the database (so you can retry failures without re-scraping successes). It’s boring, but it’s the difference between a script that works once and a system that runs reliably.
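
Because status lives in the database, retrying is just a filtered re-run. A sketch, assuming the same helpers as above plus a hypothetical getFailedOffers query:

// Only re-scrape offers whose last run failed.
// getFailedOffers is illustrative, e.g. SELECT ... WHERE status = 'failed'
const failedOffers = await getFailedOffers(db)

for (const offer of failedOffers) {
  try {
    const sections = await scrapeOfferContent(offer.url, apiKey)
    await storeContent(db, offer.id, flattenSections(sections))
    await updateStatus(db, offer.id, { status: 'completed' })
  } catch (error) {
    await updateStatus(db, offer.id, { status: 'failed', error: String(error) })
  }
}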

The Bigger Picture

This scraper is Step 1 of our 3-step automated creative pipeline. Once content is synced and structured in our database, we:

  1. Generate prompts using gpt-4o-mini tailored to the content type
  2. Generate images using fast models like Flux at ~$5 per 1,000 images
  3. Assemble banners automatically with matched copy and overlays

The scraper is the foundation. Without reliable, structured content ingestion, the rest of the pipeline doesn’t work.

Info

We use ScrapingFish, a $0.002/request API that handles JavaScript rendering, browser automation, and infinite scroll. At that price point, building and maintaining our own browser automation infrastructure wasn’t worth it.

Want to see the full pipeline in action? Get a free creative pack with 25 ad-ready images generated from your actual content.

What We Learned

  1. Modern web scraping requires JavaScript rendering. Simple fetch calls work on maybe 20% of sites today.

  2. Lazy loading is everywhere. If you’re not simulating scroll behavior, you’re missing content.

  3. Sites change their HTML structure constantly. Build fallback extraction strategies, not brittle selectors.

  4. Per-item error handling is non-negotiable. One broken URL shouldn’t stop processing of 999 others.

  5. $0.002 per request is cheaper than your time. Building and maintaining browser automation infrastructure isn’t worth it unless you’re doing millions of requests.

The scraper we built handles thousands of URLs across different CMS platforms, recovers gracefully from errors, and hasn’t broken once in production. Not because we’re brilliant, but because we learned these lessons the hard way and built them into the system.

When should you use this approach? Any time you’re scraping JavaScript-heavy sites at scale. When is direct fetch fine? Static HTML sites you control, or simple one-off scripts where reliability doesn’t matter.

Sometimes the pragmatic solution isn’t the one you’d build from scratch—it’s the one that works and lets you move on to the next problem.

See What Our Creative System Can Do for Your Brand

Start with your free creative audit + ad bundle—no obligation, just real insights and test-ready creative.