Tutorial on Web Scraping Using a Headless Browser: An Introduction to Puppeteer

This guide explores web scraping (web automation) through the use of a headless browser.

What Is a Headless Browser, and Why Use One?

The web has transformed from simple HTML/CSS websites to dynamic web apps with advanced UIs, often powered by frameworks like Angular or React. JavaScript now plays a crucial role, influencing most website interactions.

Because JavaScript runs client-side, the browser executes it only after receiving the script files (or inline scripts embedded in the HTML) from the server. This poses a challenge for web scraping and automation: the content we want is often rendered by JavaScript rather than being present in the raw HTML response.

Browsers can process JavaScript and display web pages correctly. Headless browser automation allows us to utilize this capability by programmatically controlling browsers for our scraping requirements.

Headless? It means there’s no graphical user interface (GUI). Instead of driving the browser with a mouse or touch input, we control it programmatically, typically through a command-line interface or network communication.

Headless Chrome and Puppeteer

Numerous web scraping tools support headless browsing, like Zombie.js or headless Firefox using Selenium. This guide focuses on headless Chrome via Puppeteer, a relatively new option whose 1.0 release arrived in early 2018. Editor’s note: Intoli’s Remote Browser is another new tool, but we’ll save that for another article.

Puppeteer, a Node.js library, provides a high-level API to control headless Chrome or Chromium, interacting with the DevTools protocol. It’s maintained by the Chrome DevTools team and a dedicated open-source community.

Let’s dive into the code and explore automated web scraping with Puppeteer’s headless browsing!

Setting Up

You’ll need Node.js 8+ installed. Download it from the official Node.js website, or if you use Ubuntu like me, run these commands:

curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs

Additionally, ensure you have these packages:

sudo apt-get install -yq --no-install-recommends libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 libnss3

Headless Chrome and Puppeteer Setup

Installing Puppeteer with npm is recommended, as it includes a compatible, stable Chromium version.

In your project directory, run:

npm i puppeteer --save

Note: This might take a while; Puppeteer downloads and installs Chromium in the background.

With everything set up, let’s begin!

Automated Web Scraping with the Puppeteer API

Let’s start our Puppeteer tutorial with a simple example: a script to take a website screenshot using our headless browser.

In your project directory, create screenshot.js and open it in your code editor.

First, import the Puppeteer library:

const puppeteer = require('puppeteer');

Next, get the URL from command-line arguments:

const url = process.argv[2];
if (!url) {
    throw "Please provide a URL as the first argument";
}

Remember that Puppeteer is promise-based, using asynchronous calls to the headless Chrome instance. Let’s maintain clean code with async/await by defining an async function for our Puppeteer code:

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}
run();

The complete code looks like this:

const puppeteer = require('puppeteer');
const url = process.argv[2];
if (!url) {
    throw "Please provide URL as a first argument";
}
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}
run();

Execute the following in your project’s root directory:

node screenshot.js https://github.com

In moments, our headless browser creates screenshot.png with the rendered GitHub homepage. We have a working Chrome web scraper!

Let’s analyze the run() function.

We launch a headless browser instance, open a new page (tab), and navigate to the provided URL. Puppeteer’s built-in screenshot method saves it to the specified path. Finally, we close the headless browser.
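
If you want to watch the browser work or capture more than the visible viewport, a couple of small tweaks to run() are enough. Here’s a minimal sketch, assuming you want a fixed viewport and a full-page capture; headless, setViewport, and fullPage are standard Puppeteer options:

async function run () {
    // Launch with a visible window while debugging (headless is the default).
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    // Set a predictable viewport so screenshots are consistent across runs.
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url);
    // fullPage captures the entire scrollable page, not just the current viewport.
    await page.screenshot({ path: 'screenshot-full.png', fullPage: true });
    await browser.close();
}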

With the basics covered, let’s try something more advanced.

Another Puppeteer Scraping Example

Let’s scrape the latest articles from Hacker News.

Create ycombinator-scraper.js and add this code:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");
            let urls = await page.evaluate(() => {
                let results = [];
                let items = document.querySelectorAll('a.storylink');
                items.forEach((item) => {
                    results.push({
                        url:  item.getAttribute('href'),
                        text: item.innerText,
                    });
                });
                return results;
            })
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

There’s more happening here compared to the previous example.

Because run() now explicitly returns a promise, the async keyword has moved to the anonymous function we pass to the Promise constructor.

A try-catch block handles potential promise rejection errors.

We’re using Puppeteer’s evaluate() method to run custom JavaScript inside the page, as if we were typing it into the DevTools console. Whatever that in-page function returns becomes the resolved value of the promise that evaluate() returns. This method is invaluable for scraping data and performing custom actions.

The code within evaluate() is plain browser-side JavaScript that builds an array of objects, each holding the url and text of a story link on https://news.ycombinator.com/.
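
Note that the callback runs inside the page, so only serializable values can travel back to Node. You can also pass arguments in from the Node side; here’s a small sketch reusing the same selector, just to illustrate the mechanism:

// Pass a selector from Node into the page context and get a count back.
const selector = 'a.storylink';
const count = await page.evaluate((sel) => {
    return document.querySelectorAll(sel).length;
}, selector);
console.log(`Found ${count} story links on the current page`);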

The script’s output resembles this (truncated; a single page yields 30 entries):

[ { url: 'https://www.nature.com/articles/d41586-018-05469-3',
    text: 'Bias detectives: the researchers striving to make algorithms fair' },
  { url: 'https://mino-games.workable.com/jobs/415887',
    text: 'Mino Games Is Hiring Programmers in Montreal' },
  { url: 'http://srobb.net/pf.html',
    text: 'A Beginner\'s Guide to Firewalling with pf' },
  // ...
  { url: 'https://tools.ietf.org/html/rfc8439',
    text: 'ChaCha20 and Poly1305 for IETF Protocols' } ]

Impressive!

However, we only have 30 items, while more exist on subsequent pages. We need to click “More” to load them.

Let’s modify our script to incorporate pagination:

const puppeteer = require('puppeteer');
function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");
            let currentPage = 1;
            let urls = [];
            while (currentPage <= pagesToScrape) {
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    let items = document.querySelectorAll('a.storylink');
                    items.forEach((item) => {
                        results.push({
                            url:  item.getAttribute('href'),
                            text: item.innerText,
                        });
                    });
                    return results;
                });
                urls = urls.concat(newUrls);
                if (currentPage < pagesToScrape) {
                    await Promise.all([
                        page.click('a.morelink'),
                        page.waitForSelector('a.storylink')
                    ]);
                }
                currentPage++;
            }
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run(5).then(console.log).catch(console.error);

Here’s what we’ve done:

  1. Added a pagesToScrape argument to run() to control the number of scraped pages.
  2. Introduced a currentPage variable, initially set to 1, tracking the current results page. We’ve wrapped our evaluate() function in a while loop to continue as long as currentPage is less than or equal to pagesToScrape.
  3. Added a block to navigate to the next page and wait for it to load before resuming the while loop.

We used page.click() to simulate a click on “More” and waitForSelector() to pause execution until story links are present in the DOM.

These are convenient high-level Puppeteer API methods.

A common challenge in Puppeteer scraping is waiting for page loads. Hacker News has a simple structure, making it easy. For complex cases, Puppeteer provides extensive built-in functionality detailed in the API documentation on GitHub.
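
For example, when a click triggers a full navigation, or content only appears after some client-side rendering, other waiting primitives come in handy. The following sketch shows two common patterns (not something this particular script needs); waitForNavigation and waitForFunction are both part of the Puppeteer API:

// Pattern 1: a click that triggers a navigation.
// Start waiting for the navigation before clicking so the event isn't missed.
await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
    page.click('a.morelink'),
]);

// Pattern 2: wait for an arbitrary condition evaluated inside the page.
// The threshold of 30 links is just an illustrative value.
await page.waitForFunction(
    (min) => document.querySelectorAll('a.storylink').length >= min,
    {},
    30
);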

Let’s optimize our Puppeteer script for better performance.

Optimizing Our Puppeteer Script

The key is minimizing unnecessary work for the headless browser, such as loading images, applying CSS, or making XHR requests.

As with any tool, Puppeteer optimization is use-case dependent; some techniques won’t suit your project. For instance, blocking image loading in our first example would have produced an incomplete screenshot.

Optimizations involve caching assets on the first request or blocking unnecessary HTTP requests.

Let’s start with caching.

Puppeteer creates a temporary profile directory for each new headless browser instance. This directory is deleted when the browser closes and isn’t accessible in subsequent instances. Consequently, cached images, CSS, cookies, and other data are lost.

We can configure Puppeteer to use a persistent data directory for cookies and cache, allowing reuse across runs until they expire or are manually cleared.

const browser = await puppeteer.launch({
    userDataDir: './data',
});

This significantly improves performance, as cached CSS and images from the first request are reused, reducing download overhead.

However, cached or not, these assets are still processed when the page renders. Since the Hacker News links we’re scraping are present in the HTML document itself, let’s block the unnecessary requests entirely.

Puppeteer offers custom hooks, enabling us to intercept and cancel requests.

Here’s an example interceptor:

await page.setRequestInterception(true);
page.on('request', (request) => {
    if (request.resourceType() === 'document') {
        request.continue();
    } else {
        request.abort();
    }
});

This gives us granular control over requests. We can implement custom logic to allow or abort requests based on their resourceType or other properties like request.url.

In this example, only "document" requests (the main HTML response) are allowed, blocking images, CSS, and other assets.
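
The same hook can filter on the URL as well. As a hedged variation, the sketch below also drops anything served from a placeholder analytics host (example-analytics.com is not a real endpoint, just an illustration):

await page.setRequestInterception(true);
page.on('request', (request) => {
    const url = request.url();
    const type = request.resourceType();
    // Block images and requests to a hypothetical third-party analytics host.
    if (type === 'image' || url.includes('example-analytics.com')) {
        request.abort();
    } else {
        request.continue();
    }
});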

Here’s our optimized code:

const puppeteer = require('puppeteer');
function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                if (request.resourceType() === 'document') {
                    request.continue();
                } else {
                    request.abort();
                }
            });
            await page.goto("https://news.ycombinator.com/");
            let currentPage = 1;
            let urls = [];
            while (currentPage <= pagesToScrape) {
                await page.waitForSelector('a.storylink');
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    let items = document.querySelectorAll('a.storylink');
                    items.forEach((item) => {
                        results.push({
                            url:  item.getAttribute('href'),
                            text: item.innerText,
                        });
                    });
                    return results;
                });
                urls = urls.concat(newUrls);
                if (currentPage < pagesToScrape) {
                    await page.waitForSelector('a.morelink');
                    await Promise.all([
                        page.click('a.morelink'),
                        page.waitForSelector('a.storylink')
                    ]);
                }
                currentPage++;
            }
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run(5).then(console.log).catch(console.error);

Respecting Rate Limits

Headless browsers are powerful, capable of complex web automation tasks made even easier with Puppeteer. However, it’s crucial to adhere to website terms of service and avoid system abuse.

While architectural considerations are beyond the scope of this Puppeteer tutorial, a basic rate-limiting technique is adding sleep commands:

await page.waitFor(5000);

This pauses the script for five seconds (5000 ms). You can place this before browser.close().
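
In the pagination script, a more natural spot is inside the while loop, so consecutive page loads are spaced out. A sketch of that placement:

if (currentPage < pagesToScrape) {
    // Pause between pages so we don't hammer the site.
    await page.waitFor(5000);
    await Promise.all([
        page.click('a.morelink'),
        page.waitForSelector('a.storylink')
    ]);
}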

Beyond basic rate limiting, more robust approaches exist, such as implementing a queue system with limited workers. Each Puppeteer task is added to the queue, but only a certain number of workers process tasks concurrently. This common practice for handling third-party API rate limits applies to Puppeteer web data scraping as well.
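
A bare-bones version of that idea, using no external libraries, might look like the sketch below. The scrapePage() function and the urlsToScrape array are hypothetical stand-ins for the Puppeteer logic above and your own list of targets:

async function processQueue(tasks, concurrency) {
    const results = [];
    let index = 0;
    // Each worker repeatedly pulls the next task until the queue is empty.
    async function worker() {
        while (index < tasks.length) {
            const task = tasks[index++];
            results.push(await task());
        }
    }
    await Promise.all(Array.from({ length: concurrency }, () => worker()));
    return results;
}

// Queue up one scraping job per URL, but run at most two at a time.
// scrapePage(url) is a hypothetical wrapper around the Puppeteer code above.
const jobs = urlsToScrape.map((url) => () => scrapePage(url));
processQueue(jobs, 2).then(console.log).catch(console.error);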

Puppeteer in a Rapidly Evolving Web

This Puppeteer tutorial showcased its basic web-scraping capabilities. However, its potential extends to headless browser testing, PDF generation, performance monitoring, and more.
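
To give a taste of one of those, exporting a page as a PDF only takes swapping the screenshot call in our first script; a minimal sketch, where format is one of Puppeteer’s standard paper sizes:

// Replace the screenshot call with a PDF export (works in headless mode).
await page.pdf({ path: 'page.pdf', format: 'A4' });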

Web technologies are advancing rapidly. Many websites heavily rely on JavaScript rendering, making traditional HTTP request-based scraping and automation challenging. Headless browsers like Puppeteer, driven by dedicated teams, provide increasingly accessible solutions for our automation needs.

Licensed under CC BY-NC-SA 4.0