This guide explores web scraping (web automation) through the use of a headless browser.
What Is a Headless Browser, and Why Use One?
The web has transformed from simple HTML/CSS websites to dynamic web apps with advanced UIs, often powered by frameworks like Angular or React. JavaScript now plays a crucial role, influencing most website interactions.
JavaScript, being client-side, is processed by the browser after the server sends JavaScript files or scripts embedded in HTML. This poses a challenge for web scraping and automation because the content we want to access is often rendered by JavaScript, not directly available in the raw HTML response.
Browsers can process JavaScript and display web pages correctly. Headless browser automation allows us to utilize this capability by programmatically controlling browsers for our scraping requirements.
Headless? It means there’s no graphical user interface (GUI). Instead of interacting with visual elements the way you normally would (with a mouse or touch), you drive the browser programmatically, typically from a command-line interface (CLI) or over a network connection.
Headless Chrome and Puppeteer
Numerous web scraping tools support headless browsing, like Zombie.js or headless Firefox driven by Selenium. This guide focuses on headless Chrome via Puppeteer, a relatively new option that reached version 1.0 in early 2018. Editor’s note: Intoli’s Remote Browser is another new tool, but we’ll save that for another article.
Puppeteer, a Node.js library, provides a high-level API to control headless Chrome or Chromium, interacting with the DevTools protocol. It’s maintained by the Chrome DevTools team and a dedicated open-source community.
Let’s dive into the code and explore automated web scraping with Puppeteer’s headless browsing!
Setting Up
You’ll need Node.js 8+ installed. Get it from the official Node.js downloads page, or if you use Ubuntu like me, run something like these commands, which pull Node.js 8 from the NodeSource repository:
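```bash
# Add the NodeSource repository for Node.js 8.x, then install Node.js (npm is included).
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs
```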
Additionally, ensure you have these packages:
| |
Headless Chrome and Puppeteer Setup
Installing Puppeteer with npm is recommended, as it includes a compatible, stable Chromium version.
In your project directory, run:
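```bash
npm install puppeteer --save
```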
Note: This might take a while; Puppeteer downloads and installs Chromium in the background.
With everything set up, let’s begin!
Automated Web Scraping with the Puppeteer API
Let’s start our Puppeteer tutorial with a simple example: a script to take a website screenshot using our headless browser.
In your project directory, create screenshot.js and open it in your code editor.
First, import the Puppeteer library:
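```js
const puppeteer = require('puppeteer');
```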
Next, get the URL from command-line arguments:
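```js
// Take the first argument after the script name; complain if it's missing.
const url = process.argv[2];
if (!url) {
    throw new Error('Please provide a URL as the first argument');
}
```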
Remember that Puppeteer is promise-based, using asynchronous calls to the headless Chrome instance. Let’s maintain clean code with async/await by defining an async function for our Puppeteer code:
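```js
async function run() {
    // Launch a headless browser instance and open a new page (tab).
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to the provided URL and save a screenshot of the rendered page.
    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png' });

    // Shut the headless browser down.
    await browser.close();
}

run();
```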
The complete code looks like this:
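```js
// screenshot.js
const puppeteer = require('puppeteer');

const url = process.argv[2];
if (!url) {
    throw new Error('Please provide a URL as the first argument');
}

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png' });

    await browser.close();
}

run();
```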
Execute the following in your project’s root directory:
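```bash
# Any URL will do; we'll use GitHub's homepage here.
node screenshot.js https://github.com
```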
In moments, our headless browser creates screenshot.png with the rendered GitHub homepage. We have a working Chrome web scraper!
Let’s analyze the run() function.
We launch a headless browser instance, open a new page (tab), and navigate to the provided URL. Puppeteer’s built-in screenshot method saves it to the specified path. Finally, we close the headless browser.
With the basics covered, let’s try something more advanced.
Another Puppeteer Scraping Example
Let’s scrape the latest articles from Hacker News.
Create ycombinator-scraper.js and add this code:
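(A sketch follows; the a.storylink selector reflects Hacker News’s markup at the time of writing and may need adjusting if the site has changed.)

```js
const puppeteer = require('puppeteer');

async function run() {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('https://news.ycombinator.com/');

        // Run custom JavaScript inside the page, just as we would in the DevTools console.
        const stories = await page.evaluate(() => {
            const results = [];
            // 'a.storylink' is an assumption about the current Hacker News markup.
            document.querySelectorAll('a.storylink').forEach((link) => {
                results.push({
                    url: link.getAttribute('href'),
                    text: link.innerText,
                });
            });
            return results;
        });

        await browser.close();
        return stories;
    } catch (err) {
        console.error(err);
    }
}

run().then(console.log);
```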
There’s more happening here compared to the previous example.
The run() function now returns a promise, hence the async keyword in its definition.
A try-catch block handles potential promise rejection errors.
We’re using Puppeteer’s evaluate() method to run custom JavaScript inside the page, just as if we were typing it into the DevTools console. Whatever that code returns becomes the resolved value of the promise that evaluate() gives back. This method is invaluable for scraping and for performing custom actions.
The code within evaluate() is basic JavaScript that constructs an array of objects, each containing url and text properties representing story URLs on https://news.ycombinator.com/.
The script’s output resembles this (truncated to 30 entries):
| |
Impressive!
However, we only have 30 items, while more exist on subsequent pages. We need to click “More” to load them.
Let’s modify our script to incorporate pagination:
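(Again a sketch, with the same selector caveat as before; besides page.click() and waitForSelector(), a waitForNavigation() call is included so the next page has actually loaded before we query it.)

```js
const puppeteer = require('puppeteer');

async function run(pagesToScrape = 1) {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('https://news.ycombinator.com/');

        let currentPage = 1;
        let stories = [];

        while (currentPage <= pagesToScrape) {
            const newStories = await page.evaluate(() => {
                const results = [];
                document.querySelectorAll('a.storylink').forEach((link) => {
                    results.push({
                        url: link.getAttribute('href'),
                        text: link.innerText,
                    });
                });
                return results;
            });
            stories = stories.concat(newStories);

            if (currentPage < pagesToScrape) {
                // Click "More", wait for the navigation to finish,
                // then wait until the story links are present again.
                await Promise.all([
                    page.waitForNavigation(),
                    page.click('a.morelink'),
                ]);
                await page.waitForSelector('a.storylink');
            }
            currentPage++;
        }

        await browser.close();
        return stories;
    } catch (err) {
        console.error(err);
    }
}

run(5).then(console.log);
```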
Here’s what we’ve done:
- Added a pagesToScrape argument to run() to control the number of scraped pages.
- Introduced a currentPage variable, initially set to 1, tracking the current results page, and wrapped our evaluate() call in a while loop that continues as long as currentPage is less than or equal to pagesToScrape.
- Added a block that navigates to the next page and waits for it to load before resuming the while loop.
We used page.click() to simulate a click on “More” and waitForSelector() to pause execution until the page content loads.
These are convenient high-level Puppeteer API methods.
A common challenge in Puppeteer scraping is waiting for page loads. Hacker News has a simple structure, making it easy. For complex cases, Puppeteer provides extensive built-in functionality detailed in the API documentation on GitHub.
Let’s optimize our Puppeteer script for better performance.
Optimizing Our Puppeteer Script
The key is minimizing unnecessary work for the headless browser, such as loading images, applying CSS, or executing XHR requests.
Puppeteer optimization, like any tool, is use-case dependent. Some optimizations might not be suitable for your project. For instance, preventing image loading in our first example would’ve resulted in an incomplete screenshot.
Optimizations involve caching assets on the first request or blocking unnecessary HTTP requests.
Let’s start with caching.
Puppeteer creates a temporary profile directory for each new headless browser instance. This directory is deleted when the browser closes and isn’t accessible in subsequent instances. Consequently, cached images, CSS, cookies, and other data are lost.
We can configure Puppeteer to use a persistent data directory for cookies and cache, allowing reuse across runs until they expire or are manually cleared.
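This is done with the userDataDir option when launching the browser; the directory path below is just an example:

```js
const browser = await puppeteer.launch({
    // Persist the browser profile here so cookies and cached assets survive between runs.
    userDataDir: './data',
});
```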
This significantly improves performance, as cached CSS and images from the first request are reused, reducing download overhead.
However, these assets are still used during rendering. Since we only need the raw HTML for scraping Y Combinator news articles, let’s block unnecessary requests.
Puppeteer supports request interception, letting us inspect each outgoing request and cancel the ones we don’t need.
Here’s an example interceptor:
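```js
// Enable interception so every request emitted by the page can be inspected.
await page.setRequestInterception(true);

page.on('request', (request) => {
    if (request.resourceType() === 'document') {
        // Let the main HTML document through...
        request.continue();
    } else {
        // ...and block images, stylesheets, fonts, XHR, and everything else.
        request.abort();
    }
});
```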
This gives us granular control over requests. We can implement custom logic to allow or abort requests based on their resourceType or other properties like request.url.
In this example, only "document" requests (the main HTML response) are allowed, blocking images, CSS, and other assets.
Here’s our optimized code:
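(Once more as a sketch, combining the persistent userDataDir with the request interceptor in the pagination scraper.)

```js
const puppeteer = require('puppeteer');

async function run(pagesToScrape = 1) {
    try {
        const browser = await puppeteer.launch({
            userDataDir: './data',
        });
        const page = await browser.newPage();

        // Only let the main HTML document through; abort every other request.
        await page.setRequestInterception(true);
        page.on('request', (request) => {
            if (request.resourceType() === 'document') {
                request.continue();
            } else {
                request.abort();
            }
        });

        await page.goto('https://news.ycombinator.com/');

        let currentPage = 1;
        let stories = [];

        while (currentPage <= pagesToScrape) {
            const newStories = await page.evaluate(() => {
                const results = [];
                document.querySelectorAll('a.storylink').forEach((link) => {
                    results.push({
                        url: link.getAttribute('href'),
                        text: link.innerText,
                    });
                });
                return results;
            });
            stories = stories.concat(newStories);

            if (currentPage < pagesToScrape) {
                await Promise.all([
                    page.waitForNavigation(),
                    page.click('a.morelink'),
                ]);
                await page.waitForSelector('a.storylink');
            }
            currentPage++;
        }

        await browser.close();
        return stories;
    } catch (err) {
        console.error(err);
    }
}

run(5).then(console.log);
```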
Respecting Rate Limits
Headless browsers are powerful, capable of complex web automation tasks made even easier with Puppeteer. However, it’s crucial to adhere to website terms of service and avoid system abuse.
While architectural considerations are beyond the scope of this Puppeteer tutorial, a basic rate-limiting technique is adding sleep commands:
```js
await page.waitFor(5000);
```
This pauses the script for five seconds (5000 ms). You can place this before browser.close().
Beyond basic rate limiting, more robust approaches exist, such as implementing a queue system with limited workers. Each Puppeteer task is added to the queue, but only a certain number of workers process tasks concurrently. This common practice for handling third-party API rate limits applies to Puppeteer web data scraping as well.
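As an illustration only, a bare-bones queue with a fixed number of workers could look like the sketch below; MAX_WORKERS and enqueue() are made-up names for this example, not part of Puppeteer:

```js
// A minimal concurrency-limited queue (illustrative; names are not from Puppeteer).
const MAX_WORKERS = 2;
const queue = [];
let active = 0;

function enqueue(task) {
    // task is a function returning a promise, e.g. () => run(1)
    return new Promise((resolve, reject) => {
        queue.push({ task, resolve, reject });
        next();
    });
}

function next() {
    if (active >= MAX_WORKERS || queue.length === 0) {
        return;
    }
    const { task, resolve, reject } = queue.shift();
    active++;
    const finish = () => {
        active--;
        next();
    };
    task().then(
        (value) => { finish(); resolve(value); },
        (error) => { finish(); reject(error); }
    );
}

// Usage: every Puppeteer job goes through the queue, so at most
// MAX_WORKERS headless browsers run at the same time.
// enqueue(() => run(1)).then(console.log);
```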
Puppeteer in a Rapidly Evolving Web
This Puppeteer tutorial showcased its basic web-scraping capabilities. However, its potential extends to headless browser testing, PDF generation, performance monitoring, and more.
Web technologies are advancing rapidly. Many websites heavily rely on JavaScript rendering, making traditional HTTP request-based scraping and automation challenging. Headless browsers like Puppeteer, driven by dedicated teams, provide increasingly accessible solutions for our automation needs.