vforteli.DataLakeClientExtensions 0.3.0

dotnet add package vforteli.DataLakeClientExtensions --version 0.3.0                

DataLakeFileSystemClientExtension ListPathsParallelAsync

Extension method for listing paths in parallel with the Azure DataLakeFileSystemClient. In Azure Data Lake Storage Gen2, calling the SDK's GetPathsAsync method recursively on the DataLakeFileSystemClient can take tens of minutes or even hours with as few as hundreds of thousands of files spread across directories.

This extension method uses multiple threads to avoid calling the expensive recursive version of GetPathsAsync. This improves performance significantly; however, the actual numbers vary depending on the directory structure.
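
The general idea can be illustrated with a minimal, self-contained sketch. This is not the package's actual implementation: the names ParallelListingSketch, Entry, ListParallelAsync and the listDirectory delegate are hypothetical. The delegate stands in for a non-recursive listing of a single directory (for example, the SDK's GetPathsAsync called with recursive set to false), so the sketch runs without an Azure account.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ParallelListingSketch
{
    // Hypothetical entry type standing in for the SDK's PathItem.
    public record Entry(string Path, bool IsDirectory);

    // listDirectory is an assumed stand-in for listing ONE directory non-recursively.
    public static async IAsyncEnumerable<Entry> ListParallelAsync(
        string root,
        Func<string, Task<IReadOnlyList<Entry>>> listDirectory,
        int workers = 16)
    {
        var directories = Channel.CreateUnbounded<string>();
        var results = Channel.CreateUnbounded<Entry>();
        var pending = 1; // directories that are queued or currently being listed

        directories.Writer.TryWrite(root);

        async Task WorkerAsync()
        {
            try
            {
                await foreach (var directory in directories.Reader.ReadAllAsync())
                {
                    foreach (var entry in await listDirectory(directory))
                    {
                        if (entry.IsDirectory)
                        {
                            // Count the subdirectory before queueing it for any worker.
                            Interlocked.Increment(ref pending);
                            directories.Writer.TryWrite(entry.Path);
                        }
                        results.Writer.TryWrite(entry);
                    }

                    // When nothing is queued or in progress, the traversal is finished.
                    if (Interlocked.Decrement(ref pending) == 0)
                    {
                        directories.Writer.TryComplete();
                    }
                }
            }
            catch (Exception ex)
            {
                // Stop the other workers instead of leaving them waiting forever.
                directories.Writer.TryComplete(ex);
                throw;
            }
        }

        var all = Task.WhenAll(Enumerable.Range(0, workers).Select(_ => WorkerAsync()));
        _ = all.ContinueWith(t => results.Writer.TryComplete(t.Exception), TaskScheduler.Default);

        await foreach (var entry in results.Reader.ReadAllAsync())
        {
            yield return entry;
        }
    }
}
```

Because several workers list directories concurrently, entries arrive in no particular order. Callers that need sorted output would have to collect and sort the results themselves.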

Benchmarks

These not-so-scientific benchmarks were run on a storage account with a single filesystem containing 32 folders. Each folder contains 1600 subfolders and one file, and each subfolder contains 10 files.
Total files and folders: 563234.

Tests were run on a MacBook Pro M2 with a 100/10 Mbit connection against an Azure Storage Account with the Standard SKU and hierarchical namespace enabled (Data Lake Gen2).

Test                                 Duration
SDK GetPathsAsync                    474 sec
ListPathsParallelAsync 16 threads    157 sec
ListPathsParallelAsync 128 threads   25 sec
ListPathsParallelAsync 256 threads   17 sec
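
To put the durations in relative terms, the speedup over the SDK baseline is the baseline duration divided by the parallel duration. The small sketch below only redoes that arithmetic on the numbers from the table; the BenchmarkSpeedup class and Speedup method are hypothetical names introduced for illustration.

```csharp
using System;

public static class BenchmarkSpeedup
{
    // Speedup factor = baseline duration / measured duration.
    public static double Speedup(double baselineSeconds, double seconds) =>
        baselineSeconds / seconds;

    public static void Main()
    {
        const double baseline = 474; // SDK GetPathsAsync, from the benchmark table

        // (threads, seconds) pairs copied from the benchmark table.
        foreach (var (threads, seconds) in new[] { (16, 157.0), (128, 25.0), (256, 17.0) })
        {
            Console.WriteLine($"{threads} threads: {Speedup(baseline, seconds):F1}x faster");
        }
    }
}
```

This works out to roughly 3x, 19x and 28x faster for 16, 128 and 256 threads respectively, on this particular directory layout and connection.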

Installation

Build from source or download NuGet package: https://www.nuget.org/packages/vforteli.DataLakeClientExtensions

Target frameworks: .NET 6 and .NET Standard 2.1

Usage

List files in directory

  using Azure.Storage.Files.DataLake;

  // List paths with IAsyncEnumerable
  // sourceConnection is a string holding the Data Lake service URI
  var sourceFileSystemClient = new DataLakeServiceClient(new Uri(sourceConnection)).GetFileSystemClient("somefilesystem");
  await foreach (var path in sourceFileSystemClient.ListPathsParallelAsync("/"))
  {
      // do something with PathItem
  }

Version   Downloads   Last updated
0.3.0     175         12/26/2023
0.2.0     266         6/12/2023

Switch from BlockingCollection to Channel