Encamina.Enmarcha.SemanticKernel.Connectors.Document 8.1.3-preview-04

This is a prerelease version of Encamina.Enmarcha.SemanticKernel.Connectors.Document.
There is a newer version of this package available.
See the version list below for details.

Semantic Kernel - Document Connectors

Document Connectors specializes in reading information from files in various formats and then chunking it. The most typical use case, in the context of generating document embeddings, is reading information from a variety of file formats (pdf, docx, pptx, etc.) and chunking the content into smaller parts.

Setup

NuGet package

First, install NuGet. Then, install Encamina.Enmarcha.SemanticKernel.Connectors.Document from the package manager console:

PM> Install-Package Encamina.Enmarcha.SemanticKernel.Connectors.Document

.NET CLI:

First, install .NET CLI. Then, install Encamina.Enmarcha.SemanticKernel.Connectors.Document from the .NET CLI:

dotnet add package Encamina.Enmarcha.SemanticKernel.Connectors.Document

How to use

Starting from a Program.cs or a similar entry point file in your project, add the following code:

// Entry point
var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
   // ...
});

// ...

var services = builder.Services;

services.AddDefaultDocumentContentExtractor();

This extension method will add the default implementation of the IDocumentContentExtractor interface as a singleton. The default implementation is DefaultDocumentContentExtractor. With this, we can resolve the IDocumentContentExtractor interface and obtain the chunks of a file:

Constructor injection

public class MyClass
{
    private readonly IDocumentContentExtractor documentContentExtractor;

    public MyClass(IDocumentContentExtractor documentContentExtractor)
    {
        this.documentContentExtractor = documentContentExtractor;
    }

    public IEnumerable<string> GetPdfChunks()
    {
        using var file = File.OpenRead("example.pdf");

        var pdfChunks = documentContentExtractor.GetDocumentContent(file, ".pdf");

        return pdfChunks;
    }
}

Service Provider

var serviceProvider = services.BuildServiceProvider();
var documentContentExtractor = serviceProvider.GetRequiredService<IDocumentContentExtractor>();

using var file = File.OpenRead("example.pdf");
var fileChunks = documentContentExtractor.GetDocumentContent(file, ".pdf");

For the above code to be fully functional, it is necessary to configure some additional services, specifically the ITextSplitter interface and a function to calculate the length of each chunk.

Based on the file extension, the previous code looks up a suitable IDocumentConnector for the file type, uses it to extract the text from the file, and finally uses an ITextSplitter to split that text into chunks.
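The three steps (pick a connector by extension, extract the text, split it into chunks) can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: ISketchConnector, SketchTxtConnector, SketchExtractor, and the maxLength parameter are hypothetical stand-ins for IDocumentConnector, ITextSplitter, and the extractor, and the word-packing splitter is an assumption chosen for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for IDocumentConnector: turns a stream into plain text.
public interface ISketchConnector
{
    string ReadText(Stream stream);
}

// Hypothetical connector that reads the whole stream as UTF-8 text.
public sealed class SketchTxtConnector : ISketchConnector
{
    public string ReadText(Stream stream)
    {
        using var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true);
        return reader.ReadToEnd();
    }
}

public static class SketchExtractor
{
    // Step 1: choose a connector from the (case-insensitive) file extension.
    public static ISketchConnector GetConnector(string fileExtension) =>
        fileExtension.ToLowerInvariant() switch
        {
            ".txt" or ".md" => new SketchTxtConnector(),
            _ => throw new NotSupportedException(fileExtension),
        };

    // Step 3: pack words into chunks whose measured length stays within maxLength.
    // A single word longer than maxLength is emitted as its own chunk.
    public static IEnumerable<string> Split(string text, int maxLength, Func<string, int> lengthFunction)
    {
        var separators = new[] { ' ', '\t', '\r', '\n' };
        var current = string.Empty;

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            var candidate = current.Length == 0 ? word : current + " " + word;

            if (current.Length > 0 && lengthFunction(candidate) > maxLength)
            {
                yield return current;
                current = word;
            }
            else
            {
                current = candidate;
            }
        }

        if (current.Length > 0)
        {
            yield return current;
        }
    }

    // Steps 1 to 3 combined: connector lookup, text extraction (step 2), chunking.
    public static List<string> GetDocumentContent(Stream stream, string fileExtension, int maxLength, Func<string, int> lengthFunction)
    {
        var connector = GetConnector(fileExtension);
        var text = connector.ReadText(stream);
        return new List<string>(Split(text, maxLength, lengthFunction));
    }
}
```

The length function is passed in rather than hard-coded so that chunk size can be measured in characters (text => text.Length, as in this sketch) or in tokens when the chunks feed an embedding model.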

Details about the IDocumentConnector

The default implementation, DefaultDocumentContentExtractor, uses the following IDocumentConnectors:

  • WordDocumentConnector: For .docx files, it extracts the text from the file by adding each paragraph on a new line.

  • CleanPdfDocumentConnector: For .pdf files, it extracts the raw text from the file (with all words separated by spaces) and removes common words, typically headers or footers that appear in at least 25% of the document.

  • ParagraphPptxDocumentConnector: For .pptx files, it extracts the text from the file, with one line per paragraph found in each slide.

  • TxtDocumentConnector: For .txt files, it extracts the raw text from the file using UTF-8 as the character encoding.

  • TxtDocumentConnector: For .md files, it extracts the raw text from the file using UTF-8 as the character encoding.

  • VttDocumentConnector: For .vtt files, it extracts the text from the subtitles while removing the timestamp marks. It uses UTF-8 as the character encoding.

For other formats, it throws a NotSupportedException.
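To illustrate what removing the timestamp marks from subtitles involves, here is a minimal sketch. It is not the code of VttDocumentConnector: the VttSketch class and its rules (skip the WEBVTT header, blank lines, cue timing lines containing "-->", and purely numeric cue identifiers) are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the library.
public static class VttSketch
{
    public static string ExtractText(string vtt)
    {
        var lines = new List<string>();
        using var reader = new StringReader(vtt);

        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            var trimmed = line.Trim();

            // Skip blank separators and the file header.
            if (trimmed.Length == 0 || trimmed.StartsWith("WEBVTT", StringComparison.Ordinal))
            {
                continue;
            }

            // Skip cue timing lines such as "00:00:01.000 --> 00:00:04.000".
            if (trimmed.Contains("-->"))
            {
                continue;
            }

            // Assumption: a line made only of digits is a cue identifier.
            // This would also drop a subtitle whose whole text is a number.
            if (int.TryParse(trimmed, out _))
            {
                continue;
            }

            lines.Add(trimmed);
        }

        return string.Join("\n", lines);
    }
}
```

A real connector also has to handle NOTE and STYLE blocks and inline cue tags, which this sketch ignores.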

Other available IDocumentConnectors

  • SlidePptxDocumentConnector: For .pptx files, it extracts the text from the file with just one line for each slide found.

  • PdfDocumentConnector: For .pdf files, it extracts the raw text from the file for each page (all words separated by spaces) and adds a line break between the text of each page.

  • PdfWithTocDocumentConnector: For .pdf files, it retrieves the Table of Contents and generates, for each Table of Contents item, a text with the section title, a colon (:), and the content text of the section (e.g. Title1: Content of the Title1 section). It adds a line break between each section. The output format of the text is configurable with the TocItemFormat property. Additionally, it removes common words, typically headers or footers that appear in at least 25% of the document.

  • StrictFormatCleanPdfDocumentConnector: For .pdf files, it extracts the text from the file and attempts to preserve the document's formatting, including paragraphs, titles, and other structural elements. Additionally, it removes common words, typically headers or footers that appear in at least 25% of the document, and it excludes non-horizontal text. During the text extraction process, an effort is made to retain the document's format; however, it is important to note that this process relies on OCR recognition, which is not perfect, and the results may vary depending on the quality of the PDF.
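Several connectors above remove text that repeats across at least 25% of the document. One way to picture this is the sketch below. It is an assumption about the general technique and not the library's algorithm: RepeatedLineFilter is a hypothetical helper, it works on whole lines per page rather than words, and it additionally requires a line to occur on at least two pages so that short documents are not emptied.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the library.
public static class RepeatedLineFilter
{
    public static List<List<string>> RemoveRepeatedLines(
        IReadOnlyList<IReadOnlyList<string>> pages,
        double threshold = 0.25)
    {
        // Count on how many pages each distinct, non-empty line appears.
        var pageCounts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (var page in pages)
        {
            foreach (var line in page.Select(l => l.Trim()).Where(l => l.Length > 0).Distinct())
            {
                pageCounts[line] = pageCounts.TryGetValue(line, out var count) ? count + 1 : 1;
            }
        }

        // A line is treated as a header or footer when it appears on at least
        // `threshold` of the pages, and on no fewer than two pages.
        var minPages = Math.Max(2, Math.Ceiling(pages.Count * threshold));
        var repeated = new HashSet<string>(
            pageCounts.Where(pair => pair.Value >= minPages).Select(pair => pair.Key));

        // Keep every line that is not in the repeated set.
        return pages
            .Select(page => page.Where(l => !repeated.Contains(l.Trim())).ToList())
            .ToList();
    }
}
```

For a four-page input where every page starts with the same company name, the name is dropped from every page and the body lines are kept.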

Use your own IDocumentConnector

To use your own IDocumentConnectors, you can use the base class DocumentContentExtractorBase and override the GetDocumentConnector method. This way, you can return your own IDocumentConnectors to handle a specific file format based on the file extension.

public class MyCustomDocumentContentExtractor : DocumentContentExtractorBase
{
    public MyCustomDocumentContentExtractor(ITextSplitter textSplitter, Func<string, int> lengthFunction) : base(textSplitter, lengthFunction)
    {
    }

    protected override IDocumentConnector GetDocumentConnector(string fileExtension)
    {
        // Normalize to lower case so the extension matches the lower-case patterns below.
        return fileExtension.ToLowerInvariant() switch
        {
            @".rtf" => new MyCustomRtfDocumentConnector(),
            @".pdf" => new PdfWithTocDocumentConnector(),
            @".txt" => new TxtDocumentConnector(Encoding.UTF8),
            _ => throw new NotSupportedException(fileExtension),
        };
    }
}

Don't forget to register it.

// Entry point
var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
   // ...
});

// ...

var services = builder.Services;

// Now we use our own implementation
// services.AddDefaultDocumentContentExtractor();

services.AddSingleton<IDocumentContentExtractor, MyCustomDocumentContentExtractor>();

With this, you will be able to use the extractor you need for each type of file.

Compatible and additional computed target framework versions
.NET: net8.0 is compatible. The following were computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
8.2.0 372 10/22/2024
8.2.0-preview-01-m01 94 9/17/2024
8.1.9-preview-03 257 11/19/2024
8.1.9-preview-02 69 10/22/2024
8.1.9-preview-01 211 10/4/2024
8.1.8 170 9/23/2024
8.1.8-preview-07 305 9/12/2024
8.1.8-preview-06 144 9/11/2024
8.1.8-preview-05 143 9/10/2024
8.1.8-preview-04 215 8/16/2024
8.1.8-preview-03 135 8/13/2024
8.1.8-preview-02 92 8/13/2024
8.1.8-preview-01 104 8/12/2024
8.1.7 106 8/7/2024
8.1.7-preview-09 131 7/3/2024
8.1.7-preview-08 112 7/2/2024
8.1.7-preview-07 89 6/10/2024
8.1.7-preview-06 92 6/10/2024
8.1.7-preview-05 115 6/6/2024
8.1.7-preview-04 93 6/6/2024
8.1.7-preview-03 101 5/24/2024
8.1.7-preview-02 110 5/10/2024
8.1.7-preview-01 101 5/8/2024
8.1.6 138 5/7/2024
8.1.6-preview-08 59 5/2/2024
8.1.6-preview-07 102 4/29/2024
8.1.6-preview-06 280 4/26/2024
8.1.6-preview-05 102 4/24/2024
8.1.6-preview-04 109 4/22/2024
8.1.6-preview-03 99 4/22/2024
8.1.6-preview-02 130 4/17/2024
8.1.6-preview-01 108 4/15/2024
8.1.5 112 4/15/2024
8.1.5-preview-15 91 4/10/2024
8.1.5-preview-14 146 3/20/2024
8.1.5-preview-13 88 3/18/2024
8.1.5-preview-12 121 3/13/2024
8.1.5-preview-11 94 3/13/2024
8.1.5-preview-10 126 3/13/2024
8.1.5-preview-09 100 3/12/2024
8.1.5-preview-08 83 3/12/2024
8.1.5-preview-07 101 3/8/2024
8.1.5-preview-06 206 3/8/2024
8.1.5-preview-05 95 3/7/2024
8.1.5-preview-04 95 3/7/2024
8.1.5-preview-03 87 3/7/2024
8.1.5-preview-02 149 2/28/2024
8.1.5-preview-01 147 2/19/2024
8.1.4 194 2/15/2024
8.1.3 144 2/13/2024
8.1.3-preview-07 80 2/13/2024
8.1.3-preview-06 109 2/12/2024
8.1.3-preview-05 96 2/9/2024
8.1.3-preview-04 103 2/8/2024
8.1.3-preview-03 126 2/7/2024
8.1.3-preview-02 83 2/2/2024
8.1.3-preview-01 91 2/2/2024
8.1.2 165 2/1/2024
8.1.2-preview-9 99 1/22/2024
8.1.2-preview-8 94 1/19/2024
8.1.2-preview-7 90 1/19/2024
8.1.2-preview-6 93 1/19/2024
8.1.2-preview-5 97 1/19/2024
8.1.2-preview-4 110 1/19/2024
8.1.2-preview-3 93 1/18/2024
8.1.2-preview-2 83 1/18/2024
8.1.2-preview-16 76 1/31/2024
8.1.2-preview-15 88 1/31/2024
8.1.2-preview-14 193 1/25/2024
8.1.2-preview-13 94 1/25/2024
8.1.2-preview-12 100 1/23/2024
8.1.2-preview-11 87 1/23/2024
8.1.2-preview-10 79 1/22/2024
8.1.2-preview-1 101 1/18/2024
8.1.1 132 1/18/2024
8.1.0 116 1/18/2024
8.0.3 178 12/29/2023
8.0.1 160 12/14/2023
8.0.0 159 12/7/2023
6.0.4.3 163 12/29/2023
6.0.4.2 168 12/20/2023
6.0.4.1 107 12/19/2023
6.0.4 173 12/4/2023
6.0.3.20 125 11/27/2023
6.0.3.19 136 11/22/2023