WebReaper 3.3.0

There is a newer version of this package available.
See the version list below for details.

WebReaper

Overview

WebReaper is a declarative, high-performance web scraper, crawler and parser in C#. It is designed as a simple, extensible and scalable web scraping solution. Easily crawl any website and parse the data, then save the structured result to a file, a database, or pretty much anywhere you want.

It provides a simple yet extensible API to make web scraping a breeze.

Install

dotnet add package WebReaper

Requirements

.NET 7

📋 Example:

<img width="758" alt="raycast-untitled (4)" src="https://user-images.githubusercontent.com/6662454/205595931-e7e9c764-3d6a-42d0-92fc-152592fefc81.png">

Features:

  • ⚡ It's extremely fast due to parallelism and asynchrony
  • 🗒 Declarative parsing with a structured scheme
  • 💾 Saving data to any sink, such as a JSON or CSV file, MongoDB, CosmosDB, Redis, etc.
  • 🌎 Distributed crawling support: run your web scraper on any cloud VMs, serverless functions, on-prem servers, etc.
  • 🐙 Crawling and parsing Single Page Applications with Puppeteer
  • 🖥 Proxy support
  • 🌀 Automatic retries

Usage examples

  • Data mining
  • Gathering data for machine learning
  • Online price change monitoring and price comparison
  • News aggregation
  • Product review scraping (to watch the competition)
  • Gathering real estate listings
  • Tracking online presence and reputation
  • Web mashup and web data integration
  • MAP compliance
  • Lead generation

API overview

SPA parsing example

Parsing single-page applications is simple: use the GetWithBrowser and/or FollowWithBrowser method. In this case Puppeteer is used to load the pages.

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/")
    .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
    .Parse(new()
    {
        new("title", "._eYtD2XCVieq6emjKBH3m"),
        new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
    })
    .WriteToJsonFile("output.json")
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Additionally, you can run any JavaScript on dynamic pages as they are loaded with a headless browser. To do that, add some page actions:

using WebReaper.Core.Builders;

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
    .Parse(new()
    {
        new("title", "._eYtD2XCVieq6emjKBH3m"),
        new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
    })
    .WriteToJsonFile("output.json")
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Console.ReadLine();

It can be helpful if the required content is loaded only after some user interactions such as clicks, scrolls, etc.

Persist the progress locally

If you want to persist the visited links and the job queue locally, so that you can resume crawling where you left off, use the ScheduleWithTextFile and TrackVisitedLinksInFile methods:

var engine = await new ScraperEngineBuilder()
	.WithLogger(logger)
	.Get("https://rutracker.org/forum/index.php?c=33")
	.Follow("#cf-33 .forumlink>a")
	.Follow(".forumlink>a")
	.Paginate("a.torTopic", ".pg")
	.Parse(new()
	{
		new("name", "#topic-title"),
		new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
		new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
		new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
		new("torrentLink", ".magnet-link", "href"),
		new("coverImageUrl", ".postImg", "src")
	})
	.WriteToJsonFile("result.json")
	.IgnoreUrls(blackList)
	.ScheduleWithTextFile("jobs.txt", "progress.txt")
	.TrackVisitedLinksInFile("links.txt")
	.BuildAsync();
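
The idea behind tracking visited links in a file can be sketched in plain C#. This is not WebReaper's implementation, only an illustration under stated assumptions: the helper names LoadVisited and MarkVisited, the temp file and the example.com URL are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical file that plays the role of "links.txt".
var path = Path.Combine(Path.GetTempPath(), $"visited-{Guid.NewGuid():N}.txt");

// First run: nothing on disk yet, so the set starts empty.
var firstRun = LoadVisited(path);
bool firstAdd = MarkVisited(firstRun, path, "https://example.com/a");
bool duplicateAdd = MarkVisited(firstRun, path, "https://example.com/a");

// Second run (simulated restart): the set is rebuilt from the file.
var secondRun = LoadVisited(path);
bool rememberedAfterRestart = secondRun.Contains("https://example.com/a");

Console.WriteLine($"{firstAdd} {duplicateAdd} {rememberedAfterRestart}");
File.Delete(path);

// Reads previously visited URLs, one per line, if the file exists.
static HashSet<string> LoadVisited(string file) =>
    File.Exists(file)
        ? new HashSet<string>(File.ReadAllLines(file))
        : new HashSet<string>();

// Records a URL in memory and on disk; returns false if it was already known.
static bool MarkVisited(HashSet<string> visited, string file, string url)
{
    if (!visited.Add(url)) return false;
    File.AppendAllLines(file, new[] { url });
    return true;
}
```

Appending one line per URL keeps each write small, and a restarted crawler only has to read the file once to skip everything it has already seen.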

Authorization

If the website requires authorization before it can be parsed, call the SetCookies method and fill the CookieContainer with all the cookies required for authorization. You are responsible for performing the login operation with your credentials; the scraper only uses the cookies that you provide.

var engine = await new ScraperEngineBuilder()
    .WithLogger(logger)
    .Get("https://rutracker.org/forum/index.php?c=33")
    .SetCookies(cookies =>
    {
        cookies.Add(new Cookie("AuthToken", "123"));
    })
	...
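
When filling a CookieContainer, note that the standard .NET CookieContainer.Add(Cookie) overload expects the cookie to carry a domain. The sketch below uses only System.Net types and no WebReaper calls. The example.com domain and the token value are placeholders, and it assumes the object passed to SetCookies is a regular System.Net.CookieContainer, as the text above suggests.

```csharp
using System;
using System.Net;

var container = new CookieContainer();

// Name, value, path and domain. The domain tells the container
// which requests the cookie should be attached to.
container.Add(new Cookie("AuthToken", "123", "/", "example.com"));

// Ask the container which Cookie header it would send for a given URL.
var uri = new Uri("https://example.com/forum/index.php");
string header = container.GetCookieHeader(uri);
Console.WriteLine(header);

// A cookie without a domain is rejected by this overload.
bool rejected = false;
try
{
    container.Add(new Cookie("AuthToken", "123"));
}
catch (ArgumentException)
{
    rejected = true;
}
Console.WriteLine($"Cookie without a domain rejected: {rejected}");
```

If cookies do not seem to reach the site, checking GetCookieHeader for the start URL is a quick way to confirm that the domain and path match.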

How to disable headless mode

If you scrape pages with a browser using the GetWithBrowser and FollowWithBrowser methods, the default mode is headless, meaning that you won't see the browser during scraping. However, seeing the browser can be useful for debugging or troubleshooting. To disable headless mode, use the .HeadlessMode(false) method call.


var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .HeadlessMode(false)
	...

How to clean previously scraped data during the next web scraping run

You may want to clean the data received during the previous run to start your web scraping from scratch. In this case use dataCleanupOnStart when adding a new sink:


var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .WriteToJsonFile("output.json", dataCleanupOnStart: true)

This dataCleanupOnStart parameter is present for all sinks, e.g. MongoDbSink, RedisSink, CosmosSink, etc.

Distributed web scraping with a serverless approach

In the Examples folder you can find the project called WebReaper.AzureFuncs. It demonstrates the use of WebReaper with Azure Functions. It consists of two serverless functions:

StartScrapting

First, this function uses ScraperConfigBuilder to build the scraper configuration.

Second, it writes the first web scraping job with startUrl to the Azure Service Bus queue.

WebReaperSpider

This Azure function is triggered by messages sent to the Azure Service Bus queue. Each message represents a web scraping job.

First, this function builds the spider that is going to execute the job from the queue.

Second, it executes the job by loading the page, parsing the content, saving the result to the database, etc.

Finally, it iterates through the new jobs that result from this and sends them to the job queue.
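
The queue-driven flow described above can be sketched without any cloud services. In this illustration an in-memory Queue<string> stands in for the Azure Service Bus queue and a small dictionary stands in for the website; both are hypothetical. The visited set is an added assumption that keeps the sketch from looping. None of this is the code from the WebReaper.AzureFuncs project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical site: each URL maps to the links found on that page.
var site = new Dictionary<string, string[]>
{
    ["/"] = new[] { "/a", "/b" },
    ["/a"] = new[] { "/b", "/c" },
    ["/b"] = Array.Empty<string>(),
    ["/c"] = new[] { "/" },
};

var queue = new Queue<string>();     // stand-in for the Service Bus queue
var visited = new HashSet<string>(); // avoids scheduling the same URL twice
var processed = new List<string>();

// Role of the start function: put the first job on the queue.
queue.Enqueue("/");
visited.Add("/");

// Role of the spider function: one iteration per queue message.
while (queue.Count > 0)
{
    var job = queue.Dequeue();

    // "Execute the job": here we only record that the page was handled.
    processed.Add(job);

    // Send newly discovered jobs back to the queue.
    foreach (var link in site[job])
    {
        if (visited.Add(link)) queue.Enqueue(link);
    }
}

Console.WriteLine(string.Join(", ", processed)); // prints "/, /a, /b, /c"
```

Because every job comes from the queue and every new job goes back to it, any number of workers can consume the same queue, which is what makes the serverless setup scale.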

Extensibility

Adding a new sink to persist your data

Out of the box there are four sinks you can send your parsed data to: ConsoleSink, CsvFileSink, JsonFileSink, and CosmosSink (Azure Cosmos database).

You can easily add your own by implementing the IScraperSink interface:

public interface IScraperSink
{
    public Task EmitAsync(ParsedData data);
}

Here is an example of the Console sink:

public class ConsoleSink : IScraperSink
{
    public Task EmitAsync(ParsedData parsedItem)
    {
        Console.WriteLine(parsedItem.Data.ToString());
        return Task.CompletedTask;
    }
}

Adding your sink to the scraper is simple: call the AddSink method on the builder:

_ = new ScraperEngineBuilder()
    .AddSink(new ConsoleSink())
    .Get("https://rutracker.org/forum/index.php?c=33")
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .Parse(new() {
        new("name", "#topic-title"),
    });
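
As a further illustration, here is a minimal sketch of a custom sink that collects items in memory, which can be handy in tests. To keep the snippet self-contained it declares local stand-ins for ParsedData and IScraperSink; the stand-in ParsedData shape (a URL plus a string payload) is an assumption and may differ from the library's real type. In a real project you would reference WebReaper's own types instead. InMemorySink is a hypothetical name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Emit two items, then read them back from the sink.
var sink = new InMemorySink();
await sink.EmitAsync(new ParsedData("https://example.com/1", "{\"name\":\"first\"}"));
await sink.EmitAsync(new ParsedData("https://example.com/2", "{\"name\":\"second\"}"));
Console.WriteLine($"Collected {sink.Items.Count} items");

// Stand-in for the library's ParsedData; the real type's shape may differ.
public record ParsedData(string Url, string Data);

// Same shape as the IScraperSink interface shown above.
public interface IScraperSink
{
    Task EmitAsync(ParsedData data);
}

// Collects emitted items in memory. ConcurrentQueue keeps EmitAsync safe
// when several pages are parsed in parallel.
public class InMemorySink : IScraperSink
{
    private readonly ConcurrentQueue<ParsedData> _items = new();

    public IReadOnlyCollection<ParsedData> Items => _items;

    public Task EmitAsync(ParsedData data)
    {
        _items.Enqueue(data);
        return Task.CompletedTask;
    }
}
```

A thread-safe collection is used because the scraper works in parallel, so a sink should expect EmitAsync to be called concurrently.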

For other ways to extend the functionality, see the next section.

Interfaces

Interface Description
IScheduler Reading and writing from the job queue. By default, the in-memory queue is used, but you can provide your own implementation
IVisitedLinkTracker Tracker of visited links. A default implementation is an in-memory tracker. You can provide your own for Redis, MongoDB, etc.
IPageLoader Loader that takes URL and returns HTML of the page as a string
IContentParser Takes HTML and schema and returns JSON representation (JObject).
ILinkParser Takes HTML as a string and returns page links
IScraperSink Represents a data store for writing the results of web scraping. Takes the JObject as parameter
ISpider A spider that does the crawling, parsing, and saving of the data

Main entities

  • Job - a record that represents a job for the spider
  • LinkPathSelector - represents a selector for links to be crawled

Repository structure

Project Description
WebReaper Library for web scraping
WebReaper.ScraperWorkerService Example of using WebReaper library in a Worker Service .NET project.
WebReaper.DistributedScraperWorkerService Example of using WebReaper library in a distributed way with Azure Service Bus
WebReaper.AzureFuncs Example of using WebReaper library with serverless approach using Azure Functions
WebReaper.ConsoleApplication Example of using WebReaper library in a console application

See the LICENSE file for license rights and limitations (GNU GPLv3).

Product Compatible and additional computed target framework versions.
.NET net7.0 is compatible.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
3.5.2 256 10/19/2024
3.5.1 2,827 8/15/2023
3.5.0 148 8/9/2023
3.4.0 248 4/17/2023
3.3.0 220 4/3/2023
3.2.0 198 4/2/2023
3.1.0 320 2/28/2023
3.0.8 435 11/12/2022
3.0.7 357 11/4/2022
3.0.6 320 11/3/2022
3.0.5 347 10/31/2022
3.0.4 368 10/29/2022
3.0.3 357 10/29/2022
3.0.2 372 10/24/2022
3.0.1 379 10/21/2022
3.0.0 382 10/7/2022