Net.RafaelEstevam.Spider.Simple.Lib
0.4.45
There is a newer version of this package available.
See the version list below for details.
dotnet add package Net.RafaelEstevam.Spider.Simple.Lib --version 0.4.45
NuGet\Install-Package Net.RafaelEstevam.Spider.Simple.Lib -Version 0.4.45
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="Net.RafaelEstevam.Spider.Simple.Lib" Version="0.4.45" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add Net.RafaelEstevam.Spider.Simple.Lib --version 0.4.45
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
#r "nuget: Net.RafaelEstevam.Spider.Simple.Lib, 0.4.45"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install Net.RafaelEstevam.Spider.Simple.Lib as a Cake Addin
#addin nuget:?package=Net.RafaelEstevam.Spider.Simple.Lib&version=0.4.45
// Install Net.RafaelEstevam.Spider.Simple.Lib as a Cake Tool
#tool nuget:?package=Net.RafaelEstevam.Spider.Simple.Lib&version=0.4.45
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
SimpleSpider
[!] Work in Progress
A simple and modular web spider written in C# for .NET Core
Some advantages
- Very simple to use and operate
- Built-in conversion from HTML to XElement, so no external tools are needed
- Automatic JSON parsing to JObject
- Automatic JSON deserialization to <T>
- Modular Parser engine (you can add your own parsers!)
- Modular Caching engine (you can add your own!)
- Modular Downloader engine (you can add your own!)
Samples
The Simple.Tests project contains sample spiders that show how to crawl and collect data
Use JSON to parse quotes
void run()
{
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"));
    // create a JSON parser for our QuotesObject class
    spider.Parsers.Add(new Parsers.JsonDeserializeParser<QuotesObject>(parsedResult_event));
    // add the first page
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}
void parsedResult_event(object sender, Interfaces.ParserEventArgs<QuotesObject> args)
{
    // queue the next page, if there is one
    if (args.ParsedData.has_next)
    {
        int currPage = args.ParsedData.page;
        ((SimpleSpider)sender).AddPage(buildPageUri(currPage + 1), args.FetchInfo.Link);
    }
    // process data (show on console)
    foreach (var j in args.ParsedData.quotes)
    {
        Console.WriteLine($"{j.author.name}: {j.text}");
    }
}
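The sample above relies on a QuotesObject class and a buildPageUri helper that are not shown. A minimal sketch of what they might look like follows. The member names mirror the fields the sample reads (has_next, page, quotes, author.name, text); the class layout, the QuotesHelper wrapper, its Parse method, and the /api/quotes?page=N URL pattern are assumptions made for illustration, not the package's own code.

```csharp
using System;
using Newtonsoft.Json;

// Hypothetical data classes: names are lowercase so they match the
// JSON fields that the sample reads, without extra attributes.
public class QuotesObject
{
    public bool has_next { get; set; }
    public int page { get; set; }
    public Quote[] quotes { get; set; }
}

public class Quote
{
    public Author author { get; set; }
    public string text { get; set; }
}

public class Author
{
    public string name { get; set; }
}

public static class QuotesHelper
{
    // Assumed URL pattern for the paged JSON endpoint.
    public static Uri buildPageUri(int page) =>
        new Uri($"http://quotes.toscrape.com/api/quotes?page={page}");

    // Deserializes one page of JSON, standing in for what a
    // JsonDeserializeParser<QuotesObject> would hand to the callback.
    public static QuotesObject Parse(string json) =>
        JsonConvert.DeserializeObject<QuotesObject>(json);
}
```

Only the members the callback actually reads need to be declared; Newtonsoft.Json ignores JSON fields that have no matching property by default.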
Use XPath to select content
void run()
{
    var spider = new SimpleSpider("BooksToScrape", new Uri("http://books.toscrape.com/"));
    // callback to gather items
    spider.FetchCompleted += fetchCompleted_items;
    // ignore (cancel) the pages containing "/reviews/"
    spider.ShouldFetch += (s, a) => { a.Cancel = a.Link.Uri.ToString().Contains("/reviews/"); };
    // execute from the first page
    spider.Execute();
}
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // collect new links
    (Sender as SimpleSpider).AddPage(AnchorHelper.GetAnchors(args.Link.Uri, args.Html), args.Link);
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;
    var XElement = HtmlToEXelement.Parse(args.Html);
    // collect book data
    var articleProd = XElement.XPathSelectElement("//article[@class=\"product_page\"]");
    if (articleProd == null) return; // not a book
    // book info (the leading "." keeps each search inside the article element)
    string sTitle = articleProd.XPathSelectElement(".//h1").Value;
    string sPrice = articleProd.XPathSelectElement(".//p[@class=\"price_color\"]").Value;
    string sStock = articleProd.XPathSelectElement(".//p[@class=\"instock availability\"]").Value.Trim();
    string sDesc = articleProd.XPathSelectElement("p")?.Value; // books can be descriptionless
}
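When selecting inside an element that was itself found with XPath, the scope of the expression matters: a path starting with "//" searches from the document root even when it is evaluated against a nested element, while ".//" searches only that element's descendants, and a bare name such as "p" matches direct children only. The sketch below shows the difference using only System.Xml.Linq and System.Xml.XPath; the XPathScopeDemo class and the inline markup are made up for illustration and are not part of the package.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathScopeDemo
{
    // Made-up page: one h1 outside the article, one inside it.
    const string Page =
        "<html><body>" +
        "<h1>Site title</h1>" +
        "<article class=\"product_page\"><h1>Book title</h1><p>Description</p></article>" +
        "</body></html>";

    public static (string fromRoot, string scoped, string child) Run()
    {
        XDocument doc = XDocument.Parse(Page);
        XElement article = doc.XPathSelectElement("//article[@class=\"product_page\"]");

        // "//h1" restarts at the document root: first h1 in the whole page.
        string fromRoot = article.XPathSelectElement("//h1").Value;
        // ".//h1" stays inside the article element.
        string scoped = article.XPathSelectElement(".//h1").Value;
        // "p" matches only direct children; null-conditional guards a missing one.
        string child = article.XPathSelectElement("p")?.Value;

        return (fromRoot, scoped, child);
    }
}
```

Calling XPathScopeDemo.Run() on this markup yields "Site title" for the root-anchored query and "Book title" for the scoped one, which is why scoped paths are the safer choice once a container element has been selected.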
Giants' shoulders
- HTML parsing with Html Agility Pack
- JSON parsing with Newtonsoft
- Logging with Serilog
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp3.1 is compatible. |
- .NETCoreApp 3.1
  - HtmlAgilityPack (>= 1.11.24)
  - Newtonsoft.Json (>= 12.0.3)
  - Serilog (>= 2.9.0)
  - Serilog.Sinks.Console (>= 3.1.1)
  - Serilog.Sinks.File (>= 4.1.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated |
---|---|---|
0.6.366 | 495 | 11/30/2020 |
0.6.365 | 705 | 10/20/2020 |
0.5.347 | 593 | 9/23/2020 |
0.5.320 | 423 | 9/17/2020 |
0.5.272 | 482 | 8/22/2020 |
0.5.239 | 491 | 8/10/2020 |
0.5.192 | 506 | 8/3/2020 |
0.5.164 | 560 | 7/30/2020 |
0.5.145 | 628 | 7/29/2020 |
0.4.116 | 521 | 7/26/2020 |
0.4.104 | 442 | 7/24/2020 |
0.4.76 | 474 | 7/21/2020 |
0.4.45 | 475 | 7/19/2020 |