RafaelEstevam.Simple.Spider
0.9.1
See the version list below for details.
dotnet add package RafaelEstevam.Simple.Spider --version 0.9.1
NuGet\Install-Package RafaelEstevam.Simple.Spider -Version 0.9.1
<PackageReference Include="RafaelEstevam.Simple.Spider" Version="0.9.1" />
paket add RafaelEstevam.Simple.Spider --version 0.9.1
#r "nuget: RafaelEstevam.Simple.Spider, 0.9.1"
// Install RafaelEstevam.Simple.Spider as a Cake Addin
#addin nuget:?package=RafaelEstevam.Simple.Spider&version=0.9.1
// Install RafaelEstevam.Simple.Spider as a Cake Tool
#tool nuget:?package=RafaelEstevam.Simple.Spider&version=0.9.1
The NuGet package was changed
Simple Spider
A simple-to-implement, modular web spider written in C#. It multi-targets:
- .NET 5.0
- .NET Core 3.1
- .NET Standard 2.1
Why should I use a "simple" spider instead of a full-stack framework?
The main focus of this project is to create a library that is simple to implement and operate.
It is lightweight and uses as few resources and libraries as possible.
Ideal scenarios:
- Personal bots: want to know when something goes on sale or becomes available?
- Lots of small projects: the simple implementation makes it easy to create small bots
- A good number of small bots can be achieved with a few lines of code
- With the new .NET 5.0 top-level statements, entire applications can be written with very little effort
Some advantages
- Very simple to use and operate, ideal for lots of small projects or personal ones
- Easy HTML filtering with HObject (an HtmlNode wrapper used much like JObject)
- Internal conversion from HTML to XElement, with no need for external tools
- Automatic JSON parsing to JObject
- Automatic JSON deserialization to <T>
- Modular Parser engine (you can add your own parsers!)
  - JSON and XML already included
- Modular Storage engine to easily save what you collect (you can add your own!)
  - The main library has a JsonLines engine
  - External (Modules folder) has a SQLite one
- Modular Caching engine (you can add your own!)
  - Standalone cache engine included, no external software needed
- Modular Downloader engine (you can add your own!)
  - WebClient (with cookies) and HttpClient download engines included
Easy import with NuGet
Installation
Install the SimpleSpider NuGet package: Install-Package RafaelEstevam.Simple.Spider
Getting started
- Start a new console project and add the NuGet reference
  - PM> Install-Package RafaelEstevam.Simple.Spider
- Create a class for your spider (or leave it in Program)
- Create a new instance of SimpleSpider
  - Give it a name; the cache and log will be saved under that name
  - Give it a domain (your spider will not stray from it)
- Add a handler to the FetchCompleted event to process each fetched page
- Optionally give a first page with AddPage. If omitted, the spider starts from the home page of the domain
- Call Execute()
void run()
{
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"));
    // Set the completed event to implement your stuff
    spider.FetchCompleted += fetchCompleted_items;
    // execute
    spider.Execute();
}
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // walk around ...
    // TIP: inspect args to see stuff
    var hObj = args.GetHObject();
    string[] quotes = hObj["span > .text"];
}
TIP: Use the Simple.Tests project to see examples and poke around
Samples
All samples live in the Simple.Tests folder; these are some of them:
Use XPath to select content
Use XPath to select HTML elements and filter data.
void run()
{
    var spider = new SimpleSpider("BooksToScrape", new Uri("http://books.toscrape.com/"));
    // callback to gather items, new links are collected automatically
    spider.FetchCompleted += fetchCompleted_items;
    // Ignore (cancel) the pages containing "/reviews/"
    spider.ShouldFetch += (s, a) => { a.Cancel = a.Link.Uri.ToString().Contains("/reviews/"); };
    // execute from first page
    spider.Execute();
}
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;
    // HObject also processes XPath
    var hObj = args.GetHObject();
    // collect book data
    var articleProd = hObj.XPathSelect("//article[@class=\"product_page\"]");
    if (articleProd.IsEmpty()) return; // not a book
    // Book info
    string sTitle = articleProd.XPathSelect("//h1");
    string sPrice = articleProd.XPathSelect("//p[@class=\"price_color\"]");
    string sStock = articleProd.XPathSelect("//p[@class=\"instock availability\"]").GetValue().Trim();
    string sDesc = articleProd.XPathSelect("p")?.GetValue(); // some books have no description
}
Below is the same example, but using HObject to select HTML elements:
void run() ... /* Same run() method */
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;
    var hObj = args.GetHObject();
    // collect book data
    var articleProd = hObj["article > .product_page"]; // XPath: "//article[@class=\"product_page\"]"
    if (articleProd.IsEmpty()) return; // not a book
    // Book info
    string sTitle = articleProd["h1"]; // XPath: "//h1"
    string sPrice = articleProd["p > .price_color"]; // XPath: "//p[@class=\"price_color\"]"
    string sStock = articleProd["p > .instock"].GetValue().Trim(); // XPath: "//p[@class=\"instock\"]"
    string sDesc = articleProd.Children("p"); // XPath: "p"
}
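The comments in the sample above pair each HObject selector with an XPath expression. As a rough illustration of that correspondence only, the sketch below translates the two documented forms ("tag" and "tag > .class") into XPath. `SelectorSketch` and `ToXPath` are hypothetical names, this is not how HObject works internally, and rendering a chain of tags as descendant steps is an assumption.

```csharp
using System;
using System.Text;

public static class SelectorSketch
{
    // Translates "tag" or "tag > .class" selectors (joined with '>') into
    // the XPath form shown in the sample comments. Illustration only.
    public static string ToXPath(string selector)
    {
        var xpath = new StringBuilder();
        foreach (var raw in selector.Split('>'))
        {
            var part = raw.Trim();
            if (part.Length == 0) continue;

            if (part[0] == '.')
            {
                // A class filter applies to the previous step; if there is
                // none, fall back to "any element" (assumption).
                if (xpath.Length == 0) xpath.Append("//*");
                xpath.Append("[@class=\"").Append(part.Substring(1)).Append("\"]");
            }
            else
            {
                // A tag name becomes a descendant step (assumption for chains).
                xpath.Append("//").Append(part);
            }
        }
        return xpath.ToString();
    }
}
```

For example, `SelectorSketch.ToXPath("p > .price_color")` produces the same string as the XPath comment next to `sPrice` above.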
Use our HObject implementation to select content
Use an indexer-style object representation of the HTML document, similar to Newtonsoft's JObject.
void run()
{
    // Get Quotes.ToScrape.com as HObject
    HObject hObj = FetchHelper.FetchResourceHObject(new Uri("http://quotes.toscrape.com/"));
    ...
    // Example 2
    // Get all spans and filter by class='text'
    HObject ex2 = hObj["span"].OfClass("text");
    // Supports CSS selector style, dot for class
    HObject ex2B = hObj["span"][".text"];
    // Also supports CSS '>' selector style
    HObject ex2C = hObj["span > .text"];
    ...
    // Example 4
    // Get all spans filtered by some arbitrary attribute
    // Original HTML: <span class="text" itemprop="text">
    HObject ex4 = hObj["span"].OfWhich("itemprop", "text");
    ...
    // Example 9
    // Exports values as strings, explicitly with a method or implicitly
    string[] ex9A = hObj["span"].OfClass("text").GetValues();
    string[] ex9B = hObj["span"].OfClass("text");
    ...
    // Example 13
    // Gets an attribute's value
    string ex13 = hObj["footer"].GetClassValue();
    // Example 14
    // Chain queries to specify an item and then get attribute values
    // Gets the next page URL
    string ex14A = hObj["nav"]["ul"]["li"]["a"].GetAttributeValue("href"); // Specify one attribute
    string ex14B = hObj["nav"]["ul"]["li"]["a"].GetHrefValue(); // directly
    // Multiple selectors can be passed as an array
    string ex14C = hObj["nav", "ul", "li", "a"].GetHrefValue();
    // Multiple selectors can be combined with ' > '
    string ex14D = hObj["nav > ul > li > a"].GetHrefValue();
}
Easy storage with StorageEngines
Store your data with attached storage engines, some of which are included!
void run()
{
    var iP = new InitializationParams()
        // Defines a Storage Engine
        // All stored items will be in spider folder as JsonLines
        .SetStorage(new Storage.JsonLinesStorage());
    var spider = new SimpleSpider("BooksToScrape", new Uri("http://books.toscrape.com/"), iP);
    // callback to gather items
    spider.FetchCompleted += fetchCompleted_items;
    // execute
    spider.Execute();
}
static void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;
    var tag = new Tag(args.GetDocument());
    var books = tag.SelectTags<Article>("//article[@class=\"product_page\"]");
    foreach (var book in books)
    {
        // process prices
        var priceP = book.SelectTag<Paragraph>(".//p[@class=\"price_color\"]");
        var price = priceP.InnerText.Trim();
        // Store name and prices
        (Sender as SimpleSpider).Storage.AddItem(args.Link, new
        {
            name = book.SelectTag("//h1").InnerText,
            price
        });
    }
}
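The sample above stores items as JsonLines. For readers unfamiliar with the format, it is simply one JSON object per line. The sketch below shows the idea with the built-in System.Text.Json (assuming .NET Core 3.1 or later); `JsonLinesSketch` and `ToJsonLines` are hypothetical names, and this is not the library's JsonLinesStorage implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class JsonLinesSketch
{
    // Serializes each item as one compact JSON object followed by a newline,
    // which is the essence of the JSON Lines format.
    public static string ToJsonLines<T>(IEnumerable<T> items)
    {
        var sb = new StringBuilder();
        foreach (var item in items)
        {
            sb.Append(JsonSerializer.Serialize(item)).Append('\n');
        }
        return sb.ToString();
    }
}
```

Because every line is a complete JSON document, such a file can be appended to while the spider runs and read back line by line later.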
Easy initialization with chaining
Initialize your spider easily with chaining and a good variety of options.
void run()
{
    var init = new InitializationParams()
        .SetCacher(new ContentCacher()) // Easy cache engine change
        .SetDownloader(new WebClientDownloader()) // Easy download engine change
        .SetSpiderStartupDirectory(@"D:\spiders\") // Default directory
        // create a json parser for our QuotesObject class
        .AddParser(new JsonDeserializeParser<QuotesObject>(parsedResult_event))
        .SetConfig(c => c.Enable_Caching() // Already enabled by default
                         .Disable_Cookies() // Already disabled by default
                         .Disable_AutoAnchorsLinks()
                         .Set_CachingNoLimit() // Already set by default
                         .Set_DownloadDelay(5000));
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"), init);
    // add first
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}
Easy single resource fetch
Easy API polling for updates with single resource fetch.
void run()
{
    var uri = new Uri("http://quotes.toscrape.com/api/quotes?page=1");
    var quotes = FetchHelper.FetchResourceJson<QuotesObject>(uri);
    // show the quotes deserialized
    foreach (var quote in quotes.quotes)
    {
        Console.WriteLine($"Quote: {quote.text}");
        Console.WriteLine($" - {quote.author.name}");
        Console.WriteLine();
    }
}
Use Json to deserialize Quotes
JSON response? Get an event with your data already deserialized.
(Yes, the few lines below are a fully functional example!)
void run()
{
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"));
    // create a json parser for our QuotesObject class
    spider.Parsers.Add(new JsonDeserializeParser<QuotesObject>(parsedResult_event));
    // add first page /api/quotes?page={pageNo}
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}
void parsedResult_event(object sender, ParserEventArgs<QuotesObject> args)
{
    // add next
    if (args.ParsedData.has_next)
    {
        int next = args.ParsedData.page + 1;
        (sender as SimpleSpider).AddPage(buildPageUri(next), args.FetchInfo.Link);
    }
    // process data (show on console)
    foreach (var q in args.ParsedData.quotes)
    {
        Console.WriteLine($"{q.author.name}: {q.text}");
    }
}
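The samples above use QuotesObject and buildPageUri without showing their definitions. The sketch below is one plausible shape, inferred only from the members the samples read (has_next, page, quotes, text, author.name) and from the "/api/quotes?page={pageNo}" comment. The names `QuoteItem`, `QuoteAuthor` and `QuotesApiSketch` are hypothetical, and the `Parse` helper uses the built-in System.Text.Json (assuming .NET Core 3.1 or later) rather than the library's own parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical model: only the members the samples actually read.
public class QuotesObject
{
    public bool has_next { get; set; }
    public int page { get; set; }
    public List<QuoteItem> quotes { get; set; }
}

public class QuoteItem
{
    public string text { get; set; }
    public QuoteAuthor author { get; set; }
}

public class QuoteAuthor
{
    public string name { get; set; }
}

public static class QuotesApiSketch
{
    // Hypothetical helper: builds the URL of one page of the quotes API.
    public static Uri buildPageUri(int page) =>
        new Uri($"http://quotes.toscrape.com/api/quotes?page={page}");

    // Deserializes a response body; JSON properties not present in the
    // model are ignored by default.
    public static QuotesObject Parse(string json) =>
        JsonSerializer.Deserialize<QuotesObject>(json);
}
```

The property names are kept in lower case so that they match the JSON keys without extra attributes or naming policies.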
Complete spider application with SQLite storage in less than 50 lines
Using .NET 5.0 top-level statements, you can create a complete application that crawls a site, collects your data, stores it in SQLite and displays it in the console in less than 50 lines (including comments).
using System;
using RafaelEstevam.Simple.Spider;
using RafaelEstevam.Simple.Spider.Extensions;
using RafaelEstevam.Simple.Spider.Storage;
// Creates a new instance (can be chained in init)
var storage = new SQLiteStorage<Quote>();
// Initialize with a good set of configs
var init = InitializationParams.Default002().SetStorage(storage);
var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"), init);
Console.WriteLine($"The sqlite database is at {storage.DatabaseFilePath}");
Console.WriteLine($"The quotes are being stored in the table '{storage.TableNameOfT}'");
spider.FetchCompleted += spider_FetchCompleted;
spider.Execute();
// process each page
static void spider_FetchCompleted(object Sender, FetchCompleteEventArgs args)
{
    var hObj = args.GetHObject();
    // get all quotes, divs with class "quote"
    foreach (var q in hObj["div > .quote"])
    {
        var quote = new Quote()
        {
            Text = q["span > .text"].GetValue().HtmlDecode(),
            Author = q["small > .author"].GetValue().HtmlDecode(),
            Tags = string.Join(';', q["a > .tag"].GetValues())
        };
        // store them
        ((SimpleSpider)Sender).Storage.AddItem(args.Link, quote);
    }
}
class Quote
{
    public string Author { get; set; }
    public string Text { get; set; }
    public string Tags { get; set; }
}
Based on SQLite module example, which has less than 70 lines of code 😉
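The application above decodes HTML entities and joins the tags into a single string before storing each quote. To show what those two steps do to the data, here is a minimal sketch using only the base class library; `QuoteFieldsSketch`, `Clean` and `JoinTags` are hypothetical names, and `WebUtility.HtmlDecode` stands in for the library's own HtmlDecode() extension, whose exact behavior may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class QuoteFieldsSketch
{
    // Turns entity-encoded text such as "Tom &amp; Jerry" into plain text
    // and trims surrounding whitespace.
    public static string Clean(string rawHtmlText) =>
        WebUtility.HtmlDecode(rawHtmlText).Trim();

    // Flattens a list of tags into one ';'-separated column value.
    public static string JoinTags(IEnumerable<string> tags) =>
        string.Join(";", tags);
}
```

Storing the tags as one delimited string keeps the Quote class flat, which maps naturally onto a single table.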
Some Helpers
- FetchHelper: Fast single resource fetch with lots of parsers
- RequestHelper: Make requests (gets and posts) easily
- XmlSerializerHelper: Generic class to serialize and deserialize stuff using Xml, easy way to save what you collected without any database
- CSV Helper: Read csv files (even compressed) without external libraries
- UriHelper: Manipulates parts of the Uri
- XElement to Stuff: Extract tables from a page into a DataTable
Giants' shoulders
- Html parsing with Html Agility Pack
- Json parsing with Newtonsoft
- Logging with Serilog
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 is compatible. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. |
.NET Core | netcoreapp3.0 was computed. netcoreapp3.1 is compatible. |
.NET Standard | netstandard2.1 is compatible. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETCoreApp 3.1
  - HtmlAgilityPack (>= 1.11.43)
  - Microsoft.CSharp (>= 4.7.0)
  - Newtonsoft.Json (>= 13.0.1)
  - Serilog (>= 2.11.0)
  - Serilog.Sinks.Console (>= 4.0.1)
  - Serilog.Sinks.File (>= 5.0.0)
- .NETStandard 2.1
  - HtmlAgilityPack (>= 1.11.43)
  - Microsoft.CSharp (>= 4.7.0)
  - Newtonsoft.Json (>= 13.0.1)
  - Serilog (>= 2.11.0)
  - Serilog.Sinks.Console (>= 4.0.1)
  - Serilog.Sinks.File (>= 5.0.0)
- net5.0
  - HtmlAgilityPack (>= 1.11.43)
  - Microsoft.CSharp (>= 4.7.0)
  - Newtonsoft.Json (>= 13.0.1)
  - Serilog (>= 2.11.0)
  - Serilog.Sinks.Console (>= 4.0.1)
  - Serilog.Sinks.File (>= 5.0.0)
- net6.0
  - HtmlAgilityPack (>= 1.11.43)
  - Microsoft.CSharp (>= 4.7.0)
  - Newtonsoft.Json (>= 13.0.1)
  - Serilog (>= 2.11.0)
  - Serilog.Sinks.Console (>= 4.0.1)
  - Serilog.Sinks.File (>= 5.0.0)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on RafaelEstevam.Simple.Spider:
Package | Downloads |
---|---|
RafaelEstevam.Simple.Spider.SqliteStorage: SQLite-based storage engine for SimpleSpider. See examples and documentation on the GitHub page. | |
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated | |
---|---|---|---|
0.9.13 | 237 | 3/26/2024 | |
0.9.12 | 294 | 11/5/2023 | |
0.9.11 | 190 | 9/22/2023 | |
0.9.10 | 125 | 9/21/2023 | |
0.9.9 | 259 | 4/23/2023 | |
0.9.8 | 208 | 4/15/2023 | |
0.9.7 | 228 | 4/15/2023 | |
0.9.6 | 206 | 4/15/2023 | |
0.9.5 | 194 | 4/12/2023 | |
0.9.4 | 208 | 4/11/2023 | |
0.9.3 | 446 | 8/7/2022 | |
0.9.2 | 402 | 8/5/2022 | |
0.9.1 | 416 | 7/25/2022 | |
0.9.0 | 457 | 4/1/2022 | |
0.8.7 | 286 | 1/6/2022 | |
0.8.6 | 384 | 10/8/2021 | |
0.8.5 | 430 | 8/1/2021 | |
0.8.4 | 350 | 6/23/2021 | |
0.8.3 | 347 | 6/14/2021 | |
0.8.2 | 334 | 6/12/2021 | |
0.8.1 | 578 | 3/27/2021 | |
0.7.521 | 458 | 2/20/2021 | |
0.7.508 | 328 | 1/31/2021 | |
0.7.501 | 593 | 12/30/2020 | |
0.7.484 | 672 | 12/23/2020 | |
0.7.458 | 614 | 11/30/2020 | |
0.7.425 | 496 | 11/25/2020 | |
0.7.400 | 439 | 11/16/2020 | |
0.7.390 | 506 | 11/15/2020 | |
0.7.378 | 382 | 11/13/2020 |
See examples and documentation on the GitHub page
Commit 468afbd
Release Candidate