
The best Node.js web scrapers for your use case

Editor’s note: This article was last updated on 17 October 2024.

In this article, we’ll explore a few of the best Node.js web scraping libraries and techniques. You’ll also learn how they differ and when each is the right fit for your project’s needs.

The best Node.js web scraping libraries

Whether you want to build your own search engine, monitor a website to alert you when tickets for your favorite concert are available, or you need essential information for your company, there are many Node.js web scraper libraries that have you covered.

Axios

If you’re familiar with Axios, it might not sound like the most appealing option for scraping the web. Be that as it may, it is a simple solution that can help you get the job done, and it offers the added benefit of being a library you likely already know quite well.

Axios is a promise-based HTTP client for Node.js that became super popular among JavaScript projects for its simplicity and adaptability. Although Axios is typically used in the context of calling REST APIs, it can fetch websites’ HTML as well.

Because Axios will only get the response from the server, it will be up to you to parse and work with the result. Therefore, I recommend using this library when working with JSON responses or for simple scraping needs.

You can install Axios using your favorite package manager as follows:

npm install axios
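
Because Axios parses JSON bodies for you, the JSON case needs no extra work. Here’s a minimal sketch, assuming a hypothetical endpoint at https://example.com/api/posts that returns an array of posts:

const axios = require('axios');
// The endpoint below is a hypothetical JSON API, used only for illustration
axios
  .get('https://example.com/api/posts')
  .then(function (response) {
    // Axios has already parsed the JSON body into response.data
    response.data.forEach(post => console.log(`- ${post.title}`));
  })
  .catch(function (error) {
    console.error('Request failed:', error.message);
  });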

Below is an example of using Axios to list all the article headlines from the LogRocket Blog’s homepage:

const axios = require('axios');
axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    const reTitles = /(?<=\<h2 class="card-title"><a\shref=.*?\>).*?(?=\<\/a\>)/g;
    [...response.data.matchAll(reTitles)].forEach(([title]) => console.log(`- ${title}`));
  });

In the example above, you can see how Axios is great for HTTP requests. However, parsing the resulting HTML means writing elaborate rules or regular expressions, even for simple tasks.

So, if regular expressions aren’t your thing and you prefer a more DOM-based approach, you could transform the HTML into a DOM-like object with libraries like JSDom or Cheerio. Let’s explore the same example from above using JSDom instead:

const axios = require('axios');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    const dom = new JSDOM(response.data);
    [...dom.window.document.querySelectorAll('.card-title a')].forEach(el => console.log(`- ${el.textContent}`));
  });
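
If you’d rather use Cheerio, the same idea looks like this. This sketch assumes the same .card-title a selector still matches the headlines:

const axios = require('axios');
const cheerio = require('cheerio');
axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    // Load the raw HTML into Cheerio's jQuery-like wrapper
    const $ = cheerio.load(response.data);
    $('.card-title a').each((i, el) => console.log(`- ${$(el).text()}`));
  });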

This kind of solution would soon encounter its limitations. For example, you’ll only get the raw response from the server — what if elements on the page you want to access are loaded asynchronously?

What about single-page applications (SPAs), where the HTML simply loads JavaScript libraries that do all the rendering work on the client? Or what if you encounter one of the limitations imposed by such libraries? After all, they aren’t a full HTML/DOM implementation, just a subset of one.

In scenarios like these, or for complex websites, the best choice may be a completely different approach using other libraries.

Puppeteer

Puppeteer is a high-level Node.js API to control Chrome or Chromium with code. So, what does it mean for us in terms of web scraping?

With Puppeteer, you get the power of a full-fledged browser like Chromium, running in the background in headless mode, to navigate websites and fully render styles, scripts, and asynchronous information.

To use Puppeteer in your project, you can install it like any other JavaScript package:

npm install puppeteer

Now, let’s see an example of Puppeteer in action:

const puppeteer = require("puppeteer");
async function parseLogRocketBlogHome() {
    // Launch the browser
    const browser = await puppeteer.launch();
    // Open a new tab
    const page = await browser.newPage(); 
    // Visit the page and wait until network connections are completed
    await page.goto('https://logrocket.com/blog', { waitUntil: 'networkidle2' });
    // Interact with the DOM to retrieve the titles
    const titles = await page.evaluate(() => { 
        // Select all headline links and extract their text
        return [...document.querySelectorAll('.card-title a')].map(el => el.textContent);
    });
    // Don't forget to close the browser instance to clean up the memory
    await browser.close();
    // Print the results
    titles.forEach(title => console.log(`- ${title}`))
}
parseLogRocketBlogHome();

While Puppeteer is a fantastic solution, it is more complex to work with, especially for simple projects. It is also much more demanding in terms of resources; you are, after all, running a full Chromium browser, and we know how memory-hungry those can be.

X-Ray

X-Ray is a Node.js library created for scraping the web, so it’s no surprise that its API is heavily focused on that task. As such, it abstracts most of the complexity we encounter in Puppeteer and Axios.

To install X-Ray, you can run the following command:

npm install x-ray

Now, let’s build our example using X-Ray:

const Xray = require('x-ray');
const x = Xray();
x('https://logrocket.com/blog', {
    titles: ['.card-title a']
})((err, result) => {
    if (err) return console.error(err);
    result.titles.forEach(title => console.log(`- ${title}`));
});

X-Ray is a great option if your use case involves scraping a large number of webpages. It supports concurrency and pagination out of the box, so you don’t need to worry about those details.
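
For example, pagination boils down to chaining .paginate() and .limit() onto the query. The sketch below borrows the array-of-objects selector shape from X-Ray’s README; the next-page selector is only a guess at the blog’s markup:

const Xray = require('x-ray');
const x = Xray();
const query = x('https://logrocket.com/blog', '.card-title', [{
    title: 'a',
    link: 'a@href'
}])
  // Follow the "next page" link; this selector is an assumption about the markup
  .paginate('.pagination .next a@href')
  // Stop after three pages to stay polite
  .limit(3);

query((err, posts) => {
  if (err) return console.error(err);
  posts.forEach(post => console.log(`- ${post.title} (${post.link})`));
});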

Osmosis

Osmosis is very similar to X-Ray; it is designed explicitly for scraping webpages and extracting data from HTML, XML, and JSON documents.

To install Osmosis, run the following code:

npm install osmosis

Below is the sample code:

var osmosis = require('osmosis');
osmosis.get('https://logrocket.com/blog')
.set({
    titles: ['.card-title a']
})
.data(function(result) {
    result.titles.forEach(title => console.log(`- ${title}`));
});

As you can see, Osmosis is similar to X-Ray in terms of syntax and style used to retrieve and work with data.
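
Osmosis can also chain crawling steps, such as following each link it finds. Here’s a minimal sketch, assuming every headline link leads to an article page with an h1 title:

var osmosis = require('osmosis');
osmosis
    .get('https://logrocket.com/blog')
    // Visit each headline link found on the homepage (the selector is an assumption)
    .follow('.card-title a@href')
    // On every article page, grab the main heading
    .set({ title: 'h1' })
    .data(function(result) {
        console.log(`- ${result.title}`);
    })
    .error(console.error);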

Superagent

Superagent is a lightweight, progressive HTTP request library that works both in the browser and in Node.js. Due to its simplicity and ease of use, it is commonly used for web scraping.

Just like Axios, Superagent is also limited to only getting the response from the server; it will be up to you to parse and work with the result. Depending on your scraping needs, you can retrieve HTML pages, JSON data, or other types of content using Superagent.

To use Superagent in your project, you can install it like any other JavaScript package:

npm install superagent
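
For JSON endpoints, there’s nothing to parse at all. Here’s a minimal sketch, assuming a hypothetical API at https://example.com/api/posts that returns an array of posts:

const superagent = require('superagent');
// The endpoint below is a hypothetical JSON API, used only for illustration
superagent
  .get('https://example.com/api/posts')
  .then((res) => {
    // Superagent exposes the parsed JSON body on res.body
    res.body.forEach(post => console.log(`- ${post.title}`));
  })
  .catch((err) => console.error('Request failed:', err.message));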

When scraping HTML pages, you must parse the HTML content to extract the desired data. For this, you can use libraries like Cheerio or JSDOM.

To use Cheerio in your project, you can install it like any other JavaScript package:

npm install cheerio

Let’s review an example of web scraping with Superagent and Cheerio in action:

const superagent = require("superagent");
const cheerio = require("cheerio");
const url = "https://blog.logrocket.com";
superagent.get(url).end((err, res) => {
  if (err) {
    console.error("Error fetching the website:", err);
    return;
  }
  const $ = cheerio.load(res.text);
  // Replace the following selectors with the actual HTML elements you want to scrape
  const titles = $(".card-title a")
    .map((i, el) => $(el).text())
    .get();
  const descriptions = $("p.description")
    .map((i, el) => $(el).text())
    .get();
  // Display the scraped data
  console.log("Titles:", titles);
  console.log("Descriptions:", descriptions);
});

The script will make an HTTP GET request to the specified URL using Superagent, fetch the HTML content of the page, and then use Cheerio to extract the data from the specified selectors.

While Superagent is a great solution, using it for web scraping may result in incomplete or inaccurate data extraction, depending on the complexity of the website’s structure and the parsing methods used.

Playwright

Playwright is a powerful tool for web scraping and browser automation, especially when dealing with modern web applications with dynamic content and complex interactions. Its multibrowser support, automation capabilities, and performance make it an excellent choice for developers looking to perform advanced web scraping tasks in Node.js applications.

Playwright is a relatively new open source library developed by Microsoft. It provides complete control over the browser’s state, cookies, network requests, and browser events, making it ideal for complex scraping scenarios.

To use Playwright in your project, you can install it like so:

npm install playwright

Let’s look at an example of web scraping with Playwright:

const { chromium } = require("playwright");
(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  const url = "https://blog.logrocket.com"; // Replace with the URL of the website you want to scrape
  try {
    await page.goto(url);
    // Replace the following selectors with the actual HTML elements you want to scrape
    const titleElement = await page.$("h1");
    const descriptionElement = await page.$("p.description");
    const title = await titleElement.textContent();
    const description = await descriptionElement.textContent();
    const inputElement = await page.$('input[type="text"]');
    const value = await inputElement.inputValue();
    console.log(value);
    console.log("Title:", title);
    console.log("Description:", description);
  } catch (error) {
    console.error("Error while scraping:", error);
  } finally {
    await browser.close();
  }
})();

The script will launch a Chromium browser, navigate to the specified URL, and use Playwright’s methods to interact with the website and extract data from the specified selectors.

Playwright is a robust scraping library, but when compared to lightweight HTTP-based scraping libraries, it incurs more resource overhead because it uses headless browsers to perform scraping tasks. This can have an impact on performance and memory usage, especially if you’re scraping multiple pages or performing a large number of scraping tasks.
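
One way to claw back some of that overhead is to use Playwright’s network control and block resources you don’t need. The following sketch, whose selectors are only assumptions about the blog’s markup, aborts requests for images, fonts, and media before collecting the headlines:

const { chromium } = require("playwright");
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // Abort requests for heavy resources we don't need when scraping text
  await page.route("**/*", (route) => {
    const type = route.request().resourceType();
    return ["image", "font", "media"].includes(type)
      ? route.abort()
      : route.continue();
  });
  await page.goto("https://blog.logrocket.com");
  // Grab every headline in one evaluate call (selector is an assumption)
  const titles = await page.$$eval(".card-title a", (els) =>
    els.map((el) => el.textContent)
  );
  titles.forEach((title) => console.log(`- ${title}`));
  await browser.close();
})();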

Things to know about scraping the web

Although scraping publicly available information is generally legal, you should be aware that many sites put limitations in place as part of their terms of service. Some may even include rate limits to prevent you from slowing down their services. But why is that?

When you scrape information from a site, you consume its resources. If you access too many pages too aggressively, you may degrade the site’s general performance for its users. So, when scraping the web, get consent or permission from the owner and be mindful of the strain you are putting on their site.
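
A simple way to reduce that strain is to add a delay between requests rather than firing them all at once. Here’s a minimal sketch of a throttled fetch helper using Axios:

const axios = require('axios');
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetch(urls) {
  const pages = [];
  for (const url of urls) {
    // Fetch one page at a time instead of firing every request at once
    pages.push(await axios.get(url));
    // Pause for a second between requests to keep the load light
    await sleep(1000);
  }
  return pages;
}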

Lastly, web scraping requires considerable development effort and, in many cases, ongoing maintenance. Changes in the structure of the target site may break your scraping code and require you to update your script to adjust to the new formats.

For this reason, I prefer consuming an API when possible and scraping the web only as a last resort.

Which is the best Node.js scraper?

Ultimately, the best Node.js scraper is the one that best fits your project needs. In this article, we covered some factors to help influence your decision.

For most tasks, any of these options will suffice, so choose the one you feel most comfortable with. In my professional life, I’ve had the opportunity to build multiple projects with information-gathering requirements from publicly available information and internal systems.

Because the requirements were diverse, each of these projects used different approaches and libraries, ranging from Axios to X-Ray, ultimately turning to Puppeteer for the most complex situations.

Finally, you should always respect the website’s terms and conditions regardless of what scraper you choose. Scraping data can be a powerful tool, but with that comes great responsibility. Thanks for reading!

Source: blog.logrocket.com
