Imagibee.Gigantor
0.6.1
See the version list below for details.
dotnet add package Imagibee.Gigantor --version 0.6.1
NuGet\Install-Package Imagibee.Gigantor -Version 0.6.1
<PackageReference Include="Imagibee.Gigantor" Version="0.6.1" />
paket add Imagibee.Gigantor --version 0.6.1
#r "nuget: Imagibee.Gigantor, 0.6.1"
// Install Imagibee.Gigantor as a Cake Addin
#addin nuget:?package=Imagibee.Gigantor&version=0.6.1
// Install Imagibee.Gigantor as a Cake Tool
#tool nuget:?package=Imagibee.Gigantor&version=0.6.1
Gigantor
Gigantor provides classes that support regular expression searches of gigantic files.
The purpose of Gigantor is robust, easy, ready-made searching of gigantic files that avoids the pitfalls often encountered with this type of application, including unresponsiveness, excessive memory usage, and long processing times.
To accomplish this goal, Gigantor provides the RegexSearcher and LineIndexer classes, which work together to search and read a file. Both classes use a similar approach: they partition the file into chunks in the background, launch threads to work on each partition, update progress statistics, and finally join and sort the results.
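The partition, search, and join steps described above can be sketched with nothing but the .NET base class library. This is a minimal illustration under stated assumptions, not Gigantor's implementation: SearchInChunks and its chunkSize, overlap, and maxWorkers parameters are hypothetical names for this sketch, the overlap is assumed to be at least as long as the longest possible match, and bytes are decoded as Latin-1 so that a character index equals a file position.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper (not Gigantor's API): search a file in fixed-size
// chunks on parallel workers and return the sorted match positions.
static long[] SearchInChunks(
    string path, Regex regex, int chunkSize, int overlap, int maxWorkers)
{
    var length = new FileInfo(path).Length;
    var chunkCount = (int)((length + chunkSize - 1) / chunkSize);
    var found = new ConcurrentBag<long>();
    Parallel.For(
        0,
        chunkCount,
        new ParallelOptions { MaxDegreeOfParallelism = maxWorkers },
        chunk =>
        {
            // Each worker reads its own partition plus a small overlap so
            // that a match spanning a chunk boundary is still seen whole
            var start = (long)chunk * chunkSize;
            var buffer = new byte[
                (int)Math.Min((long)chunkSize + overlap, length - start)];
            using var fs = new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read);
            fs.Seek(start, SeekOrigin.Begin);
            var read = 0;
            while (read < buffer.Length) {
                var n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
            // Latin-1 maps one byte to one char, so Match.Index is a byte offset
            var text = Encoding.Latin1.GetString(buffer, 0, read);
            foreach (Match m in regex.Matches(text)) {
                // A match that starts inside the overlap belongs to the next chunk
                if (m.Index < chunkSize) found.Add(start + m.Index);
            }
        });
    // Join: put the per-chunk results back into file order
    return found.OrderBy(p => p).ToArray();
}

// Usage: build a small test file and search it with 4 workers
var path = Path.GetTempFileName();
File.WriteAllText(
    path,
    string.Concat(Enumerable.Repeat("lorem ipsum comfort food dolor\n", 1000)));
var positions = SearchInChunks(
    path, new Regex(@"comfort\s*food"), 4096, 64, 4);
Console.WriteLine(
    $"{positions.Length} matches, first at fpos {positions[0]}");
File.Delete(path);
```

The sketch keeps only match positions and ignores progress reporting and cancellation, which the real classes expose through IBackground.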
Since many file processing applications fit into this parallel chunk processing paradigm, Gigantor also provides FileMapJoin<T> as a reusable base class for creating new file map/join classes. This base class is customizable through its Start, Map, Join, and Finish methods as well as its chunkSize, maxWorkers, and joinMode constructor parameters.
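To make the map/join idea concrete, here is a rough sketch of the pattern using delegates in place of overridable methods. It does not use or reproduce the FileMapJoin<T> signatures, which are not shown in this document: MapJoin, its map and join delegates, and the parameter order are all assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper (not part of Gigantor): map every chunk of a file to
// a T on parallel workers, then join the per-chunk results in file order.
static TResult MapJoin<T, TResult>(
    string path, int chunkSize, int maxWorkers,
    Func<long, byte[], T> map, Func<T[], TResult> join)
{
    var length = new FileInfo(path).Length;
    var results = new T[(int)((length + chunkSize - 1) / chunkSize)];
    Parallel.For(
        0,
        results.Length,
        new ParallelOptions { MaxDegreeOfParallelism = maxWorkers },
        i =>
        {
            // Read chunk i into its own buffer
            var start = (long)i * chunkSize;
            var buffer = new byte[(int)Math.Min(chunkSize, length - start)];
            using var fs = new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read);
            fs.Seek(start, SeekOrigin.Begin);
            var read = 0;
            while (read < buffer.Length) {
                var n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
            // Map: the caller decides what a chunk turns into
            results[i] = map(start, buffer);
        });
    // Join: the caller combines the ordered per-chunk results
    return join(results);
}

// Usage: count the lines of a file by counting newlines per chunk
var path = Path.GetTempFileName();
File.WriteAllText(
    path, string.Concat(Enumerable.Repeat("one line of text\n", 500)));
var lineCount = MapJoin<int, int>(
    path, 1024, 4,
    (start, bytes) => bytes.Count(b => b == (byte)'\n'),
    counts => counts.Sum());
Console.WriteLine($"{lineCount} lines");
File.Delete(path);
```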
Contents
- RegexSearcher - regex searching in the background
- LineIndexer - line counting in the background, maps lines to fpos and fpos to lines
- DuplicateChecker - file duplicate detection in the background
- FileMapJoin<T> - base class for implementing custom file-based map/join operations
- IBackground - common interface for controlling a background job
- Background - functions for managing collections of IBackground
Stream Mode
RegexSearcher supports stream and file modes. File mode is faster and more flexible, but in some situations stream mode may be preferable. For example, stream mode allows searching a compressed file without decompressing it to disk.
Example 1
Here is an example that illustrates constructing a searcher to search a gzipped file without decompressing it to disk.
// Open a compressed file stream
using var fs = new FileStream(
    "myfile.gz", FileMode.Open);
// Create a decompression stream (requires System.IO.Compression)
var stream = new GZipStream(
    fs, CompressionMode.Decompress, true);
// Create the searcher, passing it the decompression stream
// (regex and progress are created as in Example 2)
RegexSearcher searcher = new(
    stream,
    regex,
    progress);
Example 2
Here is a more extensive example that illustrates searching a large uncompressed file and reading several lines around a match.
using System.Text.RegularExpressions;
using Imagibee.Gigantor;

// Get enwik9 (this takes a moment)
var path = Utilities.GetEnwik9();

// The regular expression for the search
const string pattern = @"comfort\s*food";
Regex regex = new(
    pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

// A shared wait event for progress notifications
AutoResetEvent progress = new(false);

// Create the search and indexing workers
LineIndexer indexer = new(path, progress);
RegexSearcher searcher = new(path, regex, progress);

// Create an IBackground collection for convenient management
var processes = new List<IBackground>()
{
    indexer,
    searcher
};

// Create a progress bar to illustrate progress updates
Utilities.ByteProgress progressBar = new(
    40, processes.Count * Utilities.FileByteCount(path));

// Start search and indexing in parallel and wait for completion
Console.WriteLine("Searching ...");
Background.StartAndWait(
    processes,
    progress,
    (_) =>
    {
        progressBar.Update(
            processes.Select((p) => p.ByteCount).Sum());
    },
    1000);
Console.Write('\n');

// All done, check for errors
var error = Background.AnyError(processes);
if (error.Length != 0) {
    throw new Exception(error);
}

// Check for cancellation
if (Background.AnyCancelled(processes)) {
    throw new Exception("search cancelled");
}

// Display search results
if (searcher.MatchCount != 0) {
    Console.WriteLine($"Found {searcher.MatchCount} matches ...");
    var matchDatas = searcher.GetMatchData();
    for (var i = 0; i < matchDatas.Count; i++) {
        var matchData = matchDatas[i];
        Console.WriteLine(
            $"[{i}]({matchData.Value}) ({matchData.Name}) " +
            $"line {indexer.LineFromPosition(matchData.StartFpos)} " +
            $"fpos {matchData.StartFpos}");
    }

    // Get the line of the first match
    var matchLine = indexer.LineFromPosition(
        searcher.GetMatchData()[0].StartFpos);

    // Open the searched file for reading
    using FileStream fileStream = new(path, FileMode.Open);
    Imagibee.Gigantor.StreamReader gigantorReader = new(fileStream);

    // Seek to the first line we want to read
    var contextLines = 6;
    fileStream.Seek(indexer.PositionFromLine(
        matchLine - contextLines), SeekOrigin.Begin);

    // Read and display a few lines around the match
    for (var line = matchLine - contextLines;
        line <= matchLine + contextLines;
        line++) {
        Console.WriteLine(
            $"[{line}]({indexer.PositionFromLine(line)}) " +
            gigantorReader.ReadLine());
    }
}
Example console output
Searching ...
########################################
Found 11 matches ...
[0](Comfort food) (0) line 2115660 fpos 185913740
[1](comfort food) (0) line 2115660 fpos 185913753
[2](comfort food) (0) line 2405473 fpos 212784867
[3](comfort food) (0) line 3254241 fpos 275813781
[4](comfort food) (0) line 3254259 fpos 275817860
[5](comfort food) (0) line 3993946 fpos 334916584
[6](comfort food) (0) line 4029113 fpos 337507601
[7](comfort food) (0) line 4194105 fpos 350053436
[8](comfort food) (0) line 8614841 fpos 691616502
[9](comfort food) (0) line 10190137 fpos 799397876
[10](comfort food) (0) line 12488963 fpos 954837923
[2115654](185912493)
[2115655](185912494) Some [[fruit]]s were available in the area. [[Muscadine]]s, [[blackberry|blackberries]], [[raspberry|raspberries]], and many other wild berries were part of settlers&#8217; diets when they were available.
[2115656](185912703)
[2115657](185912704) Early settlers also supplemented their diets with meats. Most meat came from the hunting of native game. [[Venison]] was an important meat staple due to the abundance of [[white-tailed deer]] in the area. Settlers also hunted [[rabbit]]s, [[squirrel]]s, [[opossum]]s, and [[raccoon]]s, all of which were pests to the crops they raised. [[Livestock]] in the form of [[hog]]s and [[cattle]] were kept. When game or livestock was killed, the entire animal was used. Aside from the meat, it was not uncommon for settlers to eat organ meats such as [[liver]], [[brain]]s and [[intestine]]s. This tradition remains today in hallmark dishes like [[chitterlings]] (commonly called ''chit&#8217;lins'') which are fried large [[intestines]] of [[hog]]s, [[livermush]] (a common dish in the Carolinas made from hog liver), and pork [[brain]]s and eggs. The fat of the animals, particularly hogs, was rendered and used for cooking and frying.
[2115658](185913646)
[2115659](185913647) ===Southern cuisine for the masses===
[2115660](185913685) A niche market for Southern food along with American [[Comfort food|comfort food]] has proven profitable for chains such as [[Cracker Barrel]], who have extended their market across the country, instead of staying solely in the South.
[2115661](185913920)
[2115662](185913921) Southern chains that are popular across the country include [[Stuckey's]] and [[Popeyes Chicken & Biscuits|Popeye's]]. The former is known for being a "pecan shoppe" and the latter is known for its spicy fried chicken.
[2115663](185914154)
[2115664](185914155) Other Southern chains which specialize in this type of cuisine, but have decided mainly to stay in the South, are [[Po' Folks]] (also known as ''Folks'' in some markets) and Famous Amos. Another type of selection is [[Sonny's Real Pit Bar-B-Q]].
[2115665](185914401)
[2115666](185914402) ==Cajun and Creole cuisine==
Refer to the tests and console apps for additional examples.
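The mapping between lines and file positions used in Example 2 can be pictured as a sorted list of line start offsets plus a binary search. The following is only a sketch of that idea, not LineIndexer's code: BuildLineStarts is a hypothetical helper, the two lookup functions merely borrow the names seen in the example, and 1-based line numbers are an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Record the byte offset at which every line starts (line 1 starts at 0)
static List<long> BuildLineStarts(byte[] data)
{
    var starts = new List<long> { 0 };
    for (var i = 0; i < data.Length; i++) {
        // A newline starts a new line unless it is the last byte
        if (data[i] == (byte)'\n' && i + 1 < data.Length) starts.Add(i + 1);
    }
    return starts;
}

// 1-based line number -> byte offset of the first byte of that line
static long PositionFromLine(List<long> starts, int line) => starts[line - 1];

// Byte offset -> 1-based number of the line that contains it
static int LineFromPosition(List<long> starts, long fpos)
{
    var i = starts.BinarySearch(fpos);
    // On a miss, BinarySearch returns the complement of the index of the
    // next larger element, which equals the 1-based containing line
    return i >= 0 ? i + 1 : ~i;
}

// Usage: three lines starting at offsets 0, 6 and 11
var data = Encoding.UTF8.GetBytes("alpha\nbeta\ngamma\n");
var starts = BuildLineStarts(data);
Console.WriteLine(PositionFromLine(starts, 2));
Console.WriteLine(LineFromPosition(starts, 8));
```

With such an index, seeking to a line a few rows before a match is a constant-time lookup followed by a single Seek on the file stream.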
Performance
The first performance graph comes from running the included benchmarking apps over enwik9 and measuring the throughput at different values of maxWorkers. For RegexSearcher, file, stream, and gzipped stream modes are all benchmarked. Enwik9 is a 1e9 byte file that is not included.
Here is the search benchmark console output for a 5 GByte (5e9 byte) search. On the test system, performance peaked around 16 worker threads, and the peak is roughly eight times faster (8x) than the single-threaded baseline.
$ dotnet SearchApp/bin/Release/net6.0/SearchApp.dll benchmark ${TMPDIR}/enwik9
........................
maxWorkers=1, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 24.0289207 seconds
-> 208.0825877460239 MBytes/s
..............
maxWorkers=2, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 12.692795 seconds
-> 393.92426963485974 MBytes/s
.........
maxWorkers=4, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 6.8668367 seconds
-> 728.1373095707955 MBytes/s
....
maxWorkers=8, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 3.7174496 seconds
-> 1345.0081475213544 MBytes/s
....
maxWorkers=16, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 3.0211296 seconds
-> 1655.0100995336313 MBytes/s
....
maxWorkers=32, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 3.191699 seconds
-> 1566.5637643148682 MBytes/s
....
maxWorkers=64, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 3.2240221 seconds
-> 1550.8578554718963 MBytes/s
....
maxWorkers=128, chunkKiBytes=512, maxThread=32767
105160 matches found
searched 5000000000 bytes in 3.3693127 seconds
-> 1483.982178323787 MBytes/s
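The speedup figure can be checked directly from the elapsed times in the output above. The short program below is just that arithmetic; the timings are copied from the benchmark output, and the efficiency column (speedup divided by worker count) is an extra metric added for this illustration.

```csharp
using System;
using System.Linq;

// Elapsed seconds copied from the benchmark output, keyed by maxWorkers
var runs = new (int Workers, double Seconds)[]
{
    (1, 24.0289207), (2, 12.692795), (4, 6.8668367), (8, 3.7174496),
    (16, 3.0211296), (32, 3.191699), (64, 3.2240221), (128, 3.3693127),
};

// Speedup is the single-worker time divided by the time of each run
var baseline = runs[0].Seconds;
foreach (var (workers, seconds) in runs) {
    var speedup = baseline / seconds;
    Console.WriteLine(
        $"maxWorkers={workers,3}: speedup {speedup:F2}x, " +
        $"efficiency {speedup / workers:P0}");
}

// The fastest run and its speedup over the baseline
var best = runs.MinBy(r => r.Seconds);
var bestSpeedup = baseline / best.Seconds;
Console.WriteLine(
    $"fastest: maxWorkers={best.Workers} at {bestSpeedup:F2}x");
```

The fastest run is the one with 16 workers, at just under 8x the single-worker throughput; adding more workers beyond that point slightly reduces throughput on this machine.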
For the gzip stream mode, throughput caps out at 2 workers. Another option for getting more throughput in this mode is to search multiple files in parallel. The following data compares searching multiple copies of Enwik9 with maxWorkers=16 in gzip stream mode.
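The idea of searching several gzipped files at once can be sketched without Gigantor as one decompress-and-search task per file. This is an assumption-laden illustration rather than the benchmark code: CountMatches is a hypothetical helper, matching is done line by line for simplicity, and the three small temporary files stand in for the Enwik9 copies.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper: count matches in one gzipped file, decompressing on
// the fly so that nothing is written to disk
static int CountMatches(string gzPath, Regex regex)
{
    using var fs = new FileStream(gzPath, FileMode.Open, FileAccess.Read);
    using var gz = new GZipStream(fs, CompressionMode.Decompress);
    using var reader = new StreamReader(gz);
    var count = 0;
    for (var line = reader.ReadLine(); line != null; line = reader.ReadLine()) {
        count += regex.Matches(line).Count;
    }
    return count;
}

// Create three small gzipped test files
var paths = Enumerable.Range(0, 3)
    .Select(_ => Path.GetTempFileName())
    .ToArray();
var content = Encoding.UTF8.GetBytes(
    string.Concat(Enumerable.Repeat("comfort food\nother\n", 100)));
foreach (var p in paths) {
    using var gz = new GZipStream(File.Create(p), CompressionLevel.Fastest);
    gz.Write(content, 0, content.Length);
}

// Search all files in parallel, one task per file
var regex = new Regex(@"comfort\s*food", RegexOptions.Compiled);
var counts = new int[paths.Length];
Parallel.For(0, paths.Length, i =>
{
    counts[i] = CountMatches(paths[i], regex);
});
var total = counts.Sum();
Console.WriteLine($"{total} matches in {paths.Length} files");
foreach (var p in paths) File.Delete(p);
```

Because each gzip stream must be decompressed sequentially, running one stream per file is a natural way to keep more cores busy.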
The hardware used to measure performance was a MacBook Pro:
- 8-Core Intel Core i9
- L2 Cache (per Core): 256 KB
- L3 Cache: 16 MB
- Memory: 16 GB
License
Versioning
This package uses semantic versioning. Tags on the main branch indicate versions. It is recommended to use a tagged version. The latest version on the main branch should be considered under development when it is not tagged.
Issues
Report and track issues here.
Contributing
Minor changes such as bug fixes are welcome. Simply make a pull request. Please discuss more significant changes prior to making the pull request by opening a new issue that describes the change.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
- net6.0: No dependencies.
Version | Downloads | Last updated |
---|---|---|
3.0.0 | 478 | 4/5/2023 |
2.0.1 | 198 | 4/3/2023 |
2.0.0 | 192 | 4/3/2023 |
1.0.2 | 185 | 3/30/2023 |
1.0.1 | 203 | 3/25/2023 |
1.0.0 | 217 | 3/24/2023 |
0.8.2 | 247 | 3/8/2023 |
0.8.1 | 218 | 3/8/2023 |
0.8.0 | 240 | 3/6/2023 |
0.7.1 | 235 | 3/6/2023 |
0.7.0 | 233 | 3/5/2023 |
0.6.3 | 234 | 3/1/2023 |
0.6.2 | 233 | 2/21/2023 |
0.6.1 | 236 | 2/18/2023 |
0.6.0 | 261 | 2/18/2023 |
0.5.0 | 250 | 2/13/2023 |
0.4.1 | 257 | 2/10/2023 |
0.4.0 | 272 | 2/8/2023 |
0.3.5 | 375 | 2/7/2023 |
0.3.4 | 244 | 2/7/2023 |
0.3.3 | 252 | 2/6/2023 |
0.3.2 | 272 | 2/6/2023 |
0.3.1 | 262 | 2/6/2023 |