Net.RafaelEstevam.Spider.Simple.Lib
0.5.192
See the version list below for details.
dotnet add package Net.RafaelEstevam.Spider.Simple.Lib --version 0.5.192
NuGet\Install-Package Net.RafaelEstevam.Spider.Simple.Lib -Version 0.5.192
<PackageReference Include="Net.RafaelEstevam.Spider.Simple.Lib" Version="0.5.192" />
paket add Net.RafaelEstevam.Spider.Simple.Lib --version 0.5.192
#r "nuget: Net.RafaelEstevam.Spider.Simple.Lib, 0.5.192"
// Install Net.RafaelEstevam.Spider.Simple.Lib as a Cake Addin
#addin nuget:?package=Net.RafaelEstevam.Spider.Simple.Lib&version=0.5.192
// Install Net.RafaelEstevam.Spider.Simple.Lib as a Cake Tool
#tool nuget:?package=Net.RafaelEstevam.Spider.Simple.Lib&version=0.5.192
SimpleSpider
A simple and modular web spider written in C# for .NET Core
Some advantages
- Very simple to use and operate, ideal for personal or one-off projects
- Easy HTML filtering with HObject (an XElement wrapper used much like JObject)
- Built-in conversion from HTML to XElement, no external tools needed
- Automatic JSON parsing to JObject
- Automatic JSON deserialization to <T>
- Modular Parser engine (you can add your own parsers!)
- Modular Caching engine (you can add your own!)
- Modular Downloader engine (you can add your own!)
Samples
The Simple.Tests folder contains various samples. Here are some of them:
Use Json to deserialize Quotes
JSON response? Get an event with your data already deserialized.
(Yes, the few lines below are fully functional examples!)
void run()
{
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"));
    // create a json parser for our QuotesObject class
    spider.Parsers.Add(new JsonDeserializeParser<QuotesObject>(parsedResult_event));
    // add first page /api/quotes?page={pageNo}
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}
void parsedResult_event(object sender, ParserEventArgs<QuotesObject> args)
{
    // add next
    if (args.ParsedData.has_next)
    {
        int next = args.ParsedData.page + 1;
        (sender as SimpleSpider).AddPage(buildPageUri(next), args.FetchInfo.Link);
    }
    // process data (show on console)
    foreach (var q in args.ParsedData.quotes)
    {
        Console.WriteLine($"{q.author.name}: {q.text}");
    }
}
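The sample above relies on a `QuotesObject` class and a `buildPageUri` helper that are not shown here. A minimal sketch of what they might look like, assuming only the members the sample actually reads (`has_next`, `page`, `quotes`, `author.name`, `text`) and the URL pattern from the comment. The `Quote`, `Author` and `QuotesUris` names are hypothetical, and the real definitions in the Simple.Tests folder may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data classes: only the members read by the sample are declared.
public class QuotesObject
{
    public bool has_next { get; set; }
    public int page { get; set; }
    public List<Quote> quotes { get; set; }
}

public class Quote
{
    public Author author { get; set; }
    public string text { get; set; }
}

public class Author
{
    public string name { get; set; }
}

public static class QuotesUris
{
    // Builds /api/quotes?page={pageNo} for the site used in the sample.
    public static Uri buildPageUri(int pageNo)
    {
        return new Uri($"http://quotes.toscrape.com/api/quotes?page={pageNo}");
    }
}
```

In the sample `buildPageUri` is called unqualified, so there it would be a method of the same class as `run()`. It is placed in a static class here only to keep the sketch self-contained.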
Use XPath to select content
Use XPath to select html elements and filter data.
void run()
{
    var spider = new SimpleSpider("BooksToScrape", new Uri("http://books.toscrape.com/"));
    // callback to gather items, new links are collected automatically
    spider.FetchCompleted += fetchCompleted_items;
    // Ignore (cancel) the pages containing "/reviews/"
    spider.ShouldFetch += (s, a) => { a.Cancel = a.Link.Uri.ToString().Contains("/reviews/"); };
    // execute from first page
    spider.Execute();
}
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;
    var root = HtmlToXElement.Parse(args.Html);
    // collect book data
    var articleProd = root.XPathSelectElement("//article[@class=\"product_page\"]");
    if (articleProd == null) return; // not a book
    // Book info (".//" keeps each query inside the article element)
    string sTitle = articleProd.XPathSelectElement(".//h1").Value;
    string sPrice = articleProd.XPathSelectElement(".//p[@class=\"price_color\"]").Value;
    string sStock = articleProd.XPathSelectElement(".//p[@class=\"instock availability\"]").Value.Trim();
    string sDesc = articleProd.XPathSelectElement("p")?.Value; // some books have no description
}
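One XPath detail is worth keeping in mind when querying from a child element: a path starting with `//` is evaluated from the top of the document, even when it is called on a nested element, while `.//` only searches below the current element. The sketch below shows the difference using only `System.Xml.Linq` and `System.Xml.XPath`. The markup is a made-up stand-in for a product page, and `XPathScopeDemo` is a hypothetical name.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathScopeDemo
{
    // Returns the <h1> text found through an absolute and a relative path,
    // both evaluated on the product <article> element.
    public static (string absolute, string relative) Titles(string xml)
    {
        var doc = XDocument.Parse(xml);
        var article = doc.XPathSelectElement("//article[@class='product_page']");

        // "//h1" starts at the document root: first <h1> in the whole page.
        string absolute = article.XPathSelectElement("//h1").Value;
        // ".//h1" starts at the article: first <h1> inside it.
        string relative = article.XPathSelectElement(".//h1").Value;

        return (absolute, relative);
    }

    public static void Main()
    {
        var result = Titles(
            "<html><body>" +
            "<h1>Site title</h1>" +
            "<article class='product_page'><h1>Book title</h1></article>" +
            "</body></html>");

        Console.WriteLine(result.absolute); // Site title
        Console.WriteLine(result.relative); // Book title
    }
}
```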
Easy single resource fetch
Easy API polling for updates with single resource fetch.
void run()
{
    var uri = new Uri("http://quotes.toscrape.com/api/quotes?page=1");
    var quotes = FetchHelper.FetchResourceJson<QuotesObject>(uri);
    // show the quotes deserialized
    foreach (var quote in quotes.quotes)
    {
        Console.WriteLine($"Quote: {quote.text}");
        Console.WriteLine($" - {quote.author.name}");
        Console.WriteLine();
    }
}
Use our HObject implementation to select content
Use an index-style object representation of the HTML document, similar to Newtonsoft's JObject.
void run()
{
    // Get Quotes.ToScrape.com as HObject
    HObject hObj = FetchHelper.FetchResourceHObject(new Uri("http://quotes.toscrape.com/"));
    ...
    // Example 2
    // Get all spans and filter by class='text'
    HObject ex2 = hObj["span"].OfClass("text");
    // Supports CSS selector style, dot for class
    HObject ex2B = hObj["span"][".text"];
    // Also supports the CSS '>' selector style
    HObject ex2C = hObj["span > .text"];
    ...
    // Example 4
    // Get all spans filtered by an arbitrary attribute
    // Original HTML: <span class="text" itemprop="text">
    HObject ex4 = hObj["span"].OfWhich("itemprop", "text");
    ...
    // Example 14
    // Chain queries to select an item and then get attribute values
    // Gets the next page URL
    string ex14A = hObj["nav"]["ul"]["li"]["a"].GetAttributeValue("href"); // Specify one attribute
    string ex14B = hObj["nav"]["ul"]["li"]["a"].GetHrefValue(); // directly
    // Multiple selectors can be passed as an array
    string ex14C = hObj["nav", "ul", "li", "a"].GetHrefValue();
    // Multiple selectors can be chained with ' > '
    string ex14D = hObj["nav > ul > li > a"].GetHrefValue();
}
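To give a feel for what a query such as `hObj["span"].OfClass("text")` expresses, the same kind of selection can be written by hand with plain LINQ to XML. This is only a comparison sketch, not how HObject is implemented. It assumes the class attribute matches exactly (HObject's matching rules may differ), the markup is made up, and `SpanFilterDemo` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SpanFilterDemo
{
    // Text of every <span> descendant whose class attribute is exactly "text".
    public static List<string> TextSpans(XElement root)
    {
        return root.Descendants("span")
                   .Where(e => (string)e.Attribute("class") == "text")
                   .Select(e => e.Value)
                   .ToList();
    }

    public static void Main()
    {
        var root = XElement.Parse(
            "<div>" +
            "<span class=\"text\" itemprop=\"text\">First quote</span>" +
            "<span class=\"tag\">ignored</span>" +
            "<span class=\"text\" itemprop=\"text\">Second quote</span>" +
            "</div>");

        foreach (var text in TextSpans(root))
        {
            Console.WriteLine(text);
        }
    }
}
```

The indexer style in the sample above keeps this kind of filtering to a single short expression, which is the main appeal of HObject.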
Easy initialization with chaining
Initialize your spider easily with chaining and a good variety of options.
void run()
{
    var init = new InitializationParams()
        .SetCacher(new ContentCacher()) // Easy cache engine change
        .SetDownloader(new WebClientDownloader()) // Easy download engine change
        .SetSpiderStartupDirectory(@"D:\spiders\") // Default directory
        // create a json parser for our QuotesObject class
        .AddParser(new JsonDeserializeParser<QuotesObject>(parsedResult_event))
        .SetConfig(c => c.Enable_Caching() // Already enabled by default
                         .Disable_Cookies() // Already disabled by default
                         .Disable_AutoAnchorsLinks()
                         .Set_CachingNoLimit() // Already set by default
                         .Set_DownloadDelay(5000));
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"), init);
    // add first
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}
Some Helpers
- FetchHelper: Fast single resource fetch with lots of parsers
- FormsHelper: Deserialize HTML forms to easily manipulate data and create new requests
- XmlSerializerHelper: Generic class to serialize and deserialize objects as XML, an easy way to save what you collected without a database
- CSV Helper: Read CSV files, even compressed ones, without external libraries
- UriHelper: Manipulates parts of the Uri
- XElement to Stuff: Extract tables from a page into a DataTable
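As an illustration of the "save what you collected without a database" idea, here is a minimal sketch using the framework's own `System.Xml.Serialization.XmlSerializer`. It does not use or reproduce the `XmlSerializerHelper` API. `BookRecord`, `XmlStoreDemo`, `Save` and `Load` are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical record type for collected data.
public class BookRecord
{
    public string Title { get; set; }
    public string Price { get; set; }
}

public static class XmlStoreDemo
{
    // Writes any XML-serializable object to a file.
    public static void Save<T>(T data, string path)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var stream = File.Create(path))
        {
            serializer.Serialize(stream, data);
        }
    }

    // Reads the object back from the file.
    public static T Load<T>(string path)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var stream = File.OpenRead(path))
        {
            return (T)serializer.Deserialize(stream);
        }
    }

    public static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), "books.xml");
        var books = new List<BookRecord>
        {
            new BookRecord { Title = "Sample book", Price = "10.00" }
        };

        Save(books, path);
        var loaded = Load<List<BookRecord>>(path);
        Console.WriteLine(loaded[0].Title);
    }
}
```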
Giants' shoulders
- Html parsing with Html Agility Pack
- Json parsing with Newtonsoft
- Logging with Serilog
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp3.1 is compatible. |
- .NETCoreApp 3.1
  - HtmlAgilityPack (>= 1.11.24)
  - Newtonsoft.Json (>= 12.0.3)
  - Serilog (>= 2.9.0)
  - Serilog.Sinks.Console (>= 3.1.1)
  - Serilog.Sinks.File (>= 4.1.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated | |
---|---|---|---|
0.6.366 | 522 | 11/30/2020 | |
0.6.365 | 721 | 10/20/2020 | |
0.5.347 | 612 | 9/23/2020 | |
0.5.320 | 439 | 9/17/2020 | |
0.5.272 | 503 | 8/22/2020 | |
0.5.239 | 508 | 8/10/2020 | |
0.5.192 | 525 | 8/3/2020 | |
0.5.164 | 581 | 7/30/2020 | |
0.5.145 | 648 | 7/29/2020 | |
0.4.116 | 541 | 7/26/2020 | |
0.4.104 | 458 | 7/24/2020 | |
0.4.76 | 492 | 7/21/2020 | |
0.4.45 | 493 | 7/19/2020 | |
Work in progress. See examples and documentation on the GitHub page.
Better redirect handling, added HObject support
Commit 8d5394a