ScrapeAAS 1.0.3

This package has a SemVer 2.0.0 package version: 1.0.3+build.43.

.NET CLI:
dotnet add package ScrapeAAS --version 1.0.3

Package Manager (intended for the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):
NuGet\Install-Package ScrapeAAS -Version 1.0.3

PackageReference (for projects that support PackageReference, copy this XML node into the project file):
<PackageReference Include="ScrapeAAS" Version="1.0.3" />

Paket CLI:
paket add ScrapeAAS --version 1.0.3

Script & Interactive (the #r directive can be used in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script's source code):
#r "nuget: ScrapeAAS, 1.0.3"

Cake:
// Install ScrapeAAS as a Cake Addin
#addin nuget:?package=ScrapeAAS&version=1.0.3

// Install ScrapeAAS as a Cake Tool
#tool nuget:?package=ScrapeAAS&version=1.0.3

Scrape as a service

ScrapeAAS integrates existing packages and ASP.NET features into a toolstack that lets you, the developer, design your scraping service in a familiar environment.

Quickstart

Add ASP.NET Hosting, ScrapeAAS, a validator of your choice (here Dawn.Guard, RIP), an object mapper of your choice (here AutoMapper), and the database or message queue you feel most comfortable with (here EF Core with SQLite).

dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection

Full example of scraping the r/dotnet subreddit.

Create a crawler, a service that periodically triggers scraping.

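var builder = Host.CreateApplicationBuilder(args);
builder.Services
  .AddAutoMapper()
  .AddScrapeAAS()
  .AddHostedService<RedditSubredditCrawler>() // crawler: triggers scraping
  .AddDataflow<RedditPostSpider>()            // spider: collects and normalizes data
  .AddDataflow<RedditSqliteSink>();           // sink: persists the scraped data

// Build the host and run it until shutdown is requested.
await builder.Build().RunAsync();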

sealed class RedditSubredditCrawler : BackgroundService {
  private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
  private readonly IDataflowPublisher<RedditPost> _publisher;
  ...
  protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
    // ... execute service scope periodically
  }

  private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
  {
    _logger.LogInformation("Crawling /r/dotnet");
    await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
    _logger.LogInformation("Crawling complete");
  }
}
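
The "execute service scope periodically" step is elided above. As a rough sketch of one way to do it (not necessarily how ScrapeAAS or the original crawler does it), a `BackgroundService` can create a fresh DI scope on every tick of a `PeriodicTimer`. The names `PeriodicScopedService` and `RunOnceAsync` are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Hypothetical base class: runs RunOnceAsync immediately, then once per period,
// each time inside its own service scope.
abstract class PeriodicScopedService : BackgroundService
{
    private readonly IServiceScopeFactory _scopeFactory;
    private readonly TimeSpan _period;

    protected PeriodicScopedService(IServiceScopeFactory scopeFactory, TimeSpan period)
    {
        _scopeFactory = scopeFactory;
        _period = period;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(_period);
        try
        {
            do
            {
                // A new scope per run, so scoped services are created and disposed per crawl.
                await using var scope = _scopeFactory.CreateAsyncScope();
                await RunOnceAsync(scope.ServiceProvider, stoppingToken);
            }
            while (await timer.WaitForNextTickAsync(stoppingToken));
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            // Host is shutting down; exit quietly.
        }
    }

    // Resolve whatever the run needs from the scoped provider and do the work.
    protected abstract Task RunOnceAsync(IServiceProvider services, CancellationToken stoppingToken);
}
```

A crawler built this way would resolve its publisher inside `RunOnceAsync`, for example with `services.GetRequiredService<IDataflowPublisher<RedditSubreddit>>()`, assuming the publisher is registered so that it can be resolved from a scope.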

Implement your spiders, services that collect and normalize data.


sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
  private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
  private readonly IDataflowPublisher<RedditPost> _publisher;
  ...

  private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
  {
    Url root = new("https://old.reddit.com/");
    _logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
    var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
    _logger.LogInformation("Request complete");
    var queriedContent = document
      .QuerySelectorAll("div.thing")
      .AsParallel()
      .Select(div => new
      {
        PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
        Title = div.QuerySelector("a.title")?.TextContent,
        Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
        Comments = div.QuerySelector("a.comments")?.TextContent,
        CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
        PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
        PostedBy = div.QuerySelector("a.author")?.TextContent,
      })
      .Select(queried => new RedditPost(
        new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
        Guard.Argument(queried.Title).NotEmpty(),
        long.Parse(queried.Upvotes.AsSpan()),
        Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
        new(queried.CommentsUrl),
        DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
        new(Guard.Argument(queried.PostedBy).NotEmpty())
      ), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
    foreach (var item in queriedContent)
    {
      await _publisher.PublishAsync(item, stoppingToken);
    }
    _logger.LogInformation("Parsing complete");
  }
}
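
The comment count in the spider is taken from the leading digits of the comments link text, falling back to 0 when the text does not start with a number. Pulled out on its own, that expression looks roughly like this. `CommentCountParser` is a hypothetical helper name, and the sample strings in the usage note are assumed inputs, not captured page content.

```csharp
using System.Text.RegularExpressions;

static class CommentCountParser
{
    // Same pattern as in the spider: take the leading digits, otherwise return 0.
    public static long Parse(string? linkText) =>
        Regex.Match(linkText ?? "", "^\\d+") is { Success: true } match
            ? long.Parse(match.Value)
            : 0;
}
```

With this helper, `CommentCountParser.Parse("12 comments")` yields 12, while a link text without a leading number, or a missing link (`null`), yields 0.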

Add a sink, a service that commits the scraped data to disk or the network.

sealed class RedditSqliteSink : IAsyncDisposable, IDataflowHandler<RedditSubreddit>, IDataflowHandler<RedditPost>
{
  private readonly RedditPostSqliteContext _context;
  private readonly IMapper _mapper;
  ...
  public async ValueTask DisposeAsync()
  {
    await _context.Database.EnsureCreatedAsync();
    await _context.SaveChangesAsync();
  }

  public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
  {
    var messageDto = _mapper.Map<RedditSubredditDto>(message);
    await _context.Database.EnsureCreatedAsync(cancellationToken);
    await _context.Subreddits.AddAsync(messageDto, cancellationToken);
  }

  public async ValueTask HandleAsync(RedditPost message, CancellationToken cancellationToken = default)
  {
    var messageDto = _mapper.Map<RedditPostDto>(message);
    if (await _context.Users.FindAsync(new object[] { message.PostedBy.Id }, cancellationToken) is { } existingUser)
    {
      messageDto.PostedById = existingUser.Id;
      messageDto.PostedBy = existingUser;
    }
    await _context.Database.EnsureCreatedAsync(cancellationToken);
    await _context.Posts.AddAsync(messageDto, cancellationToken);
  }
}

Why not WebReaper or DotnetSpider?

I have tried both toolstacks and found them wanting, so I tried to do better by delegating as much work as is reasonable to existing projects.

In addition to my own goals, having evaluated both libraries, I want to keep all their pros and discard all their cons. The verbosity of this library sits comfortably between WebReaper and DotnetSpider, though closer to the DotnetSpider end of things.

  • Integration into ASP.NET Hosting.
  • No dependencies at the core of the project. Instead package a reasonable set of addons by default.
  • Use and expose integrated NuGet packages in addons when possible to allow developers to benefit from existing ecosystems.

Evaluation of DotnetSpider

The overall data flow in ScrapeAAS is adopted from DotnetSpider: Crawler --> Spider --> Sink.

  • Pro: Pub/Sub event handling for decoupled data flow.
  • Pro: Easy extensibility by tapping events.
  • Con: Terrible debugging experience using model annotations.
  • Con: Smelly, dynamic-riddled design when storing to a database.
  • Con: Retry policies missing.
  • Con: Much boilerplate necessary.
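
To make the Crawler --> Spider --> Sink flow and its pub/sub decoupling concrete, here is a minimal in-memory sketch. It is not the ScrapeAAS implementation: `InMemoryBus`, `Pipeline`, and the `Subreddit` and `Post` records are hypothetical names made up for this illustration, and the "posts" are hard-coded instead of scraped.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical message types, one per hand-off between stages.
record Subreddit(string Name);
record Post(string Subreddit, string Title);

// Hypothetical minimal pub/sub bus: handlers are looked up by message type.
sealed class InMemoryBus
{
    private readonly Dictionary<Type, List<Func<object, Task>>> _handlers = new();

    public void Subscribe<T>(Func<T, Task> handler) where T : notnull
    {
        if (!_handlers.TryGetValue(typeof(T), out var list))
        {
            _handlers[typeof(T)] = list = new();
        }
        list.Add(message => handler((T)message));
    }

    public async Task PublishAsync<T>(T message) where T : notnull
    {
        if (_handlers.TryGetValue(typeof(T), out var list))
        {
            foreach (var handler in list)
            {
                await handler(message);
            }
        }
    }
}

static class Pipeline
{
    public static async Task<List<Post>> RunAsync()
    {
        var bus = new InMemoryBus();
        var stored = new List<Post>();

        // Spider: turns each Subreddit message into Post messages.
        bus.Subscribe<Subreddit>(async subreddit =>
        {
            await bus.PublishAsync(new Post(subreddit.Name, "first post"));
            await bus.PublishAsync(new Post(subreddit.Name, "second post"));
        });

        // Sink: persists Post messages (here, into a list).
        bus.Subscribe<Post>(post =>
        {
            stored.Add(post);
            return Task.CompletedTask;
        });

        // Crawler: starts the flow by publishing what to scrape.
        await bus.PublishAsync(new Subreddit("dotnet"));
        return stored;
    }
}
```

Each stage only knows the message types it consumes and produces, so a second sink or an extra spider can be added with another `Subscribe` call without touching the existing stages.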

Evaluation of WebReaper

The Puppeteer browser handling is a mixture of the lifetime-tracking HTTP handler and the WebReaper Puppeteer integration.

  • Pro: Simple declarative builder API. No boilerplate needed.
  • Pro: Easy extensibility by implementing interfaces.
  • Pro: Puppeteer browser.
  • Con: Unable to control data flow.
  • Con: Unable to parse data.
  • Con: No ASP.NET or any DI integration possible.
  • Con: Dependencies for optional extensions, such as Redis, MySql, RabbitMq, are always included in the package.

Compatible and additional computed target framework versions

.NET
  Compatible: net6.0, net7.0
  Computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version             Downloads  Last updated
1.0.3               132        12/31/2023
1.0.2               47         12/31/2023
1.0.1               46         12/31/2023
1.0.0               47         12/21/2023
0.1.2               96         11/5/2023
0.1.1               79         10/15/2023
0.1.0-hotfix.1      49         10/15/2023
0.1.0-alpha.3       39         10/14/2023
0.0.0-preview.0.71  44         10/14/2023