GroupDocs.Parser 24.9.0

There is a newer version of this package available.
See the version list below for details.

dotnet add package GroupDocs.Parser --version 24.9.0

NuGet\Install-Package GroupDocs.Parser -Version 24.9.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="GroupDocs.Parser" Version="24.9.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add GroupDocs.Parser --version 24.9.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: GroupDocs.Parser, 24.9.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install GroupDocs.Parser as a Cake Addin
#addin nuget:?package=GroupDocs.Parser&version=24.9.0

// Install GroupDocs.Parser as a Cake Tool
#tool nuget:?package=GroupDocs.Parser&version=24.9.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Advanced Document Parsing API for .NET

Important Note: Starting from 24.2.0, GroupDocs.Parser is published as two platform packages, one for .NET Standard and one for .NET Framework. The GroupDocs.Parser package targets .NET Standard, which makes it compatible with .NET Core, .NET 5, .NET 6, etc.; it also works with .NET Framework versions starting from 4.6.2. The GroupDocs.Parser.NETFramework package is built for the .NET Framework runtime: it includes all the GroupDocs product libraries in their respective .NET Framework versions and offers better dependency resolution for .NET Framework projects. If you have any questions or concerns, please reach out to our free support forum.

GroupDocs.Parser for .NET is a powerful API designed for advanced document parsing, offering extensive features like text extraction, metadata retrieval, and image extraction across various document formats, including PDFs, Word, Excel, and PowerPoint. This robust API supports .NET Standard and .NET Framework, making it compatible with .NET Core, .NET 5, and .NET 6, while also providing backward compatibility with older .NET Framework versions. With specialized parsing capabilities for PDF documents, email parsing, and template-based data extraction, GroupDocs.Parser ensures high-performance, secure parsing and scalability, suitable for cross-platform environments including Windows, Linux, and macOS. It's the ideal solution for developers needing to integrate efficient document processing into their .NET applications.

Text Extraction

Document Text Extraction

Extract text from PDF, Word, Excel, and more.

Retain Text Formatting

Extract text with font styles, sizes, and colors.

Text Search and Extraction

Search for and extract specific text.

OCR Text Extraction

Extract text from images using OCR.

Metadata Extraction

Document Metadata Extraction

Extract properties like author, title, and subject.

Date Property Extraction

Extract creation and modification dates.

Field-Specific Data Extraction

Extract custom fields like invoice numbers.

Image and Attachment Extraction

Extract Embedded Images

Extract images within documents.

Extract File Attachments

Extract attachments from PDF and email files.

Barcode Extraction

Extract and recognize barcodes from documents.

Document Structure Analysis

Structured Document Analysis

Analyze and extract tables, lists, and paragraphs.

Table Extraction

Extract tables and their content.

Hyperlink Extraction

Extract hyperlinks from documents.

Bookmark Extraction

Extract bookmarks from PDFs.

PDF-Specific Parsing

PDF Parsing

Extract text, images, and metadata from PDFs.

Extract PDF Page Count

Extract page count and PDF-specific properties.

PDF Bookmark Management

Extract and manage bookmarks in PDFs.

Email Parsing

Email Content Extraction

Extract text, attachments, and metadata from emails.

Email Property Extraction

Extract sender, receiver, subject, and body content.

Spreadsheet Parsing

Excel Data Extraction

Extract text, metadata, and data from Excel files.

Specific Range Extraction

Extract specific cells, ranges, or sheets from Excel.

Presentation Parsing

PowerPoint Extraction

Extract text, images, and metadata from presentations.

Slide-Specific Extraction

Extract content from slides, including notes and shapes.

Template-Based Data Extraction

Template Data Extraction

Use templates for structured data extraction.

Template Editor

Create and edit templates for data extraction.

Custom Parsing Rules

Define custom content extraction rules.

Advanced Features

Multi-Format Support

Support for PDF, DOCX, XLSX, PPTX, and more.

Cross-Platform Compatibility

Works on Windows, Linux, and macOS.

.NET Integration

Integrate with .NET applications.

High Performance

Efficient handling of large documents.

Secure Parsing

Maintain document security and integrity.

Scalable Batch Processing

Handle large document volumes.

Additional Features

Page Count Retrieval

Retrieve the number of pages in a document.

Form Data Extraction

Extract data from forms and interactive elements.

Content-Aware Parsing

Detect and extract specific data types.

Supported Document Formats

Word Processing

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
DOC - Microsoft Word Document	✔️	✔️	✔️	✔️	✔️	✔️	✔️		✔️	✔️
DOT - Microsoft Word Document Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️		✔️	✔️
DOCX - Office Open XML Document	✔️	✔️	✔️	✔️	✔️	✔️	✔️		✔️	✔️
DOCM - Office Open XML Macro-Enabled Document	✔️	✔️	✔️	✔️	✔️	✔️	✔️		✔️	✔️
DOTX - Office Open XML Document Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️		✔️	✔️
DOTM - Office Open XML Document Macro-Enabled Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️		✔️	✔️
TXT - Plain text		✔️
ODT - Open Document Text	✔️	✔️	✔️	✔️	✔️			✔️	✔️	✔️
OTT - Open Document Text Template	✔️	✔️	✔️	✔️	✔️			✔️	✔️	✔️
RTF - Rich Text Format	✔️	✔️	✔️	✔️	✔️			✔️	✔️	✔️

PDF

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
PDF - Portable Document Format	✔️	✔️	✔️		✔️	✔️	✔️	✔️	✔️	✔️	✔️

Markup

Document Type	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Metadata
XHTML - Extensible Hypertext Markup Language File	✔️		✔️
MHTML - MIME HTML File	✔️		✔️
MD - Markdown	✔️	✔️ (formatted text is not supported)
XML - XML File	✔️

Ebook

Document Type	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Metadata	Extract Containers and Attachments	Scan Barcode
CHM - Compiled HTML Help File	✔️	✔️	✔️	✔️	✔️
EPUB - Digital E-Book File Format	✔️	✔️	✔️	✔️	✔️
FB2 - FictionBook 2.0 File	✔️	✔️
MOBI - Mobipocket	✔️
AZW3 - Kindle Format 8	✔️

Spreadsheet

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments
XLS - Microsoft Excel Spreadsheet	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
XLT - Microsoft Excel Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
XLSX - Office Open XML Spreadsheet	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
XLSM - Office Open XML Macro-Enabled Spreadsheet	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
XLSB - Office Open XML Binary Spreadsheet	✔️	✔️		✔️	✔️	✔️	✔️	✔️
XLTX - Office Open XML Spreadsheet Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
XLTM - Office Open XML Macro-Enabled Spreadsheet Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
ODS - Open Document Spreadsheet	✔️	✔️		✔️	✔️	✔️
OTS - Open Document Spreadsheet Template	✔️	✔️		✔️	✔️	✔️
CSV - Comma Separated Values		✔️
XLA - Excel Add-In File	✔️	✔️	✔️	✔️	✔️	✔️	✔️
XLAM - Excel Open XML Macro-Enabled Add-In	✔️	✔️	✔️	✔️	✔️	✔️	✔️
NUMBERS - Apple iWork Numbers	✔️	✔️		✔️		✔️

Presentation

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Scan Barcode
PPT - PowerPoint Presentation	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
PPS - PowerPoint Slideshow	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
POT - PowerPoint Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
PPTX - Office Open XML Presentation	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
PPTM - Office Open XML Macro-Enabled Presentation	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
POTX - Office Open XML Presentation Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
POTM - Office Open XML Macro-Enabled Presentation Template	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
PPSX - Office Open XML Presentation Slideshow	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
PPSM - Office Open XML Macro-Enabled Presentation Slideshow	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️
ODP - Open Document Presentation	✔️	✔️		✔️	✔️	✔️			✔️
OTP - Open Document Presentation Template	✔️	✔️		✔️	✔️	✔️			✔️

Email

Document Type	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Metadata	Extract Images	Extract Containers and Attachments
PST - Outlook Personal Information Store File					✔️
OST - Outlook Offline Data File					✔️
EML - E-Mail Message	✔️	✔️	✔️	✔️	✔️
EMLX - Apple Mail Message	✔️	✔️	✔️	✔️	✔️
MSG - Outlook Mail Message	✔️	✔️	✔️	✔️	✔️

Note

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
ONE - OneNote Document		✔️

Image*

Document Type	Extract Text (Accurate)	Extract Table of Contents
BMP - Bitmap Image file	✔️	✔️
GIF - Graphics Interchange Format		✔️
JP2 - JPEG 2000		✔️
JPG, JPEG - JPEG Image file	✔️	✔️
PNG - Portable Network Graphics	✔️	✔️
TIF, TIFF - Tagged Image File Format	✔️	✔️
DICOM - DICOM (Digital Imaging and Communications in Medicine)		✔️
DJVU - DjVu File Format	✔️	✔️
EMF - Enhanced metafile		✔️
J2K - JPEG 2000		✔️
PS - PostScript File Format		✔️
PSD - Photoshop Document		✔️
SVG - Scalable Vector Graphics file		✔️
SVGZ - Scalable Vector Graphics file (with gzip compression)		✔️
WEBP - WebP Image File Format		✔️
WMF - Microsoft Windows Metafile		✔️

Database

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
ADO.NET		✔️									✔️

Platform Independence

GroupDocs.Parser for .NET does not require any external software or third-party tool to be installed. It supports any 32-bit or 64-bit operating system where the .NET or Mono framework is installed. The details are as follows:

Microsoft Windows: Microsoft Windows Desktop (x86, x64) (XP & up), Microsoft Windows Server (x86, x64) (2000 & up), Windows Azure
Mac OS: Mac OS X
Linux: Linux (Ubuntu, OpenSUSE, CentOS and others)
Development Environments: Microsoft Visual Studio (2010 & up), Xamarin.Android, Xamarin.iOS, Xamarin.Mac, MonoDevelop 2.4 and later.
Supported Frameworks: GroupDocs.Parser for .NET supports .NET and Mono frameworks.

Get Started

Are you ready to give GroupDocs.Parser for .NET a try? Simply execute Install-Package GroupDocs.Parser from the Package Manager Console in Visual Studio to fetch and reference the GroupDocs.Parser assembly in your project. If you already have GroupDocs.Parser for .NET and want to upgrade it, please execute Update-Package GroupDocs.Parser to get the latest version.

Please check the GitHub Repository for other common usage scenarios.

How to Install GroupDocs.Parser for .NET

1. Install from NuGet

Option 1: Using Package Manager GUI

Open Visual Studio:
- Load your solution/project.
Access NuGet Package Manager:
- Go to Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution.
- Alternatively, right-click the solution or project in Solution Explorer and select Manage NuGet Packages.
Search for GroupDocs.Parser:
- Navigate to the Browse tab.
- Type “GroupDocs.Parser” in the search box.
Install the Package:
- Click the Install button to add the latest version of GroupDocs.Parser to your project.

Option 2: Using Package Manager Console

Open Visual Studio:
- Load your solution/project.
Open Package Manager Console:
- Go to Tools -> NuGet Package Manager -> Package Manager Console.
Install GroupDocs.Parser:
- Type the command Install-Package GroupDocs.Parser and press Enter.
Verify Installation:
- GroupDocs.Parser should now be referenced in your application.

2. Handling .NET Framework and .NET Standard

Starting with version 24.2, GroupDocs.Parser is split into two packages: one for .NET Framework and one for .NET Standard.
For .NET Framework projects:
- Ensure AutoGenerateBindingRedirects is enabled.
- Add the following to your project file for unit tests:

<PropertyGroup>
    <AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
    <GenerateBindingRedirectsOutputType>true</GenerateBindingRedirectsOutputType>
</PropertyGroup>

3. Install from the Official GroupDocs Website

Download GroupDocs.Parser:
- Visit the official GroupDocs website and download the package.
Unpack or Install:
- Unzip the archive or run the MSI installer.
Add a Reference in Visual Studio:
- In Solution Explorer, right-click the References node of your project and select Add Reference.
- If you used the MSI installer, select GroupDocs.Parser from the .NET tab. Otherwise, browse to the location of the GroupDocs.Parser.dll file.
Confirm Reference:
- Ensure GroupDocs.Parser appears under the References node in your project.

4. Additional Considerations

.NET Standard 2.0 Version:
- This version has external references to several packages like System.Drawing.Common, System.Text.Encoding.CodePages, SkiaSharp, etc.
Linux Environment:
- Install the following packages for proper functionality:
  - libgdiplus
  - libc6-dev
  - ttf-mscorefonts-installer (e.g., sudo apt-get install ttf-mscorefonts-installer)
- Also, ensure SkiaSharp.NativeAssets.Linux.NoDependencies is installed.

GroupDocs.Parser for .NET Coding Samples

Code Sample 1: Extracting Text from a PDF Document

This code loads a PDF file (sample.pdf) and extracts its text content using the GetText() method, which returns a TextReader. The extracted text is then displayed in the console.

using System;
using System.IO;
using GroupDocs.Parser;

public class ExtractTextFromPdf
{
    public static void Run()
    {
        // Load the PDF document
        using (Parser parser = new Parser("sample.pdf"))
        // GetText() returns a TextReader, or null if text extraction isn't supported
        using (TextReader reader = parser.GetText())
        {
            // Output the extracted text
            Console.WriteLine(reader == null
                ? "Text extraction isn't supported for this document."
                : reader.ReadToEnd());
        }
    }
}
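
When a document is large, it can be more practical to read the text one page at a time. The following is a minimal sketch, assuming the Features.Text flag, GetDocumentInfo().PageCount, and the GetText(int pageIndex) overload behave as described in the GroupDocs.Parser documentation. The class name and the filePath parameter are hypothetical and used only for illustration.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

public static class ExtractTextPageByPage
{
    // filePath is a hypothetical input, e.g. "sample.pdf"
    public static void Run(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Stop early if the format doesn't support text extraction
            if (!parser.Features.Text)
            {
                Console.WriteLine("Text extraction isn't supported for this document.");
                return;
            }

            // Page indexes are zero-based
            int pageCount = parser.GetDocumentInfo().PageCount;
            for (int pageIndex = 0; pageIndex < pageCount; pageIndex++)
            {
                using (TextReader reader = parser.GetText(pageIndex))
                {
                    Console.WriteLine($"--- Page {pageIndex + 1} ---");
                    Console.WriteLine(reader.ReadToEnd());
                }
            }
        }
    }
}
```

Calling ExtractTextPageByPage.Run("sample.pdf") would print each page under its own header. Reading page by page avoids holding the whole document text in a single string.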

Code Sample 2: Extracting Images from a Word Document

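This code loads a Word document (sample.docx) and extracts all images found within the document. Each image is converted to PNG and saved as a separate file.

using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

public class ExtractImagesFromWord
{
    public static void Run()
    {
        // Load the Word document
        using (Parser parser = new Parser("sample.docx"))
        {
            // Get images from the document (null if image extraction isn't supported)
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                return;
            }

            // Ask for PNG output; without options an image keeps its original format
            ImageOptions options = new ImageOptions(ImageFormat.Png);

            // Save each image to a file
            int imageNumber = 1;
            foreach (PageImageArea image in images)
            {
                image.Save($"image{imageNumber++}.png", options);
            }
        }
    }
}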
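
File attachments are exposed through a similar pattern. The sketch below is illustrative only: it assumes GetContainer() returns the attachments of an email as ContainerItem objects (or null when the format has no container support), as described in the GroupDocs.Parser documentation. The class name and the sample.msg file are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public static class ListEmailAttachments
{
    // filePath is a hypothetical input, e.g. "sample.msg"
    public static void Run(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Get the attachments (null if container extraction isn't supported)
            IEnumerable<ContainerItem> attachments = parser.GetContainer();
            if (attachments == null)
            {
                Console.WriteLine("Attachment extraction isn't supported for this document.");
                return;
            }

            // Print the path of each attachment inside the container
            foreach (ContainerItem item in attachments)
            {
                Console.WriteLine(item.FilePath);
            }
        }
    }
}
```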

Code Sample 3: Parsing Metadata from an Excel Spreadsheet

This code loads an Excel spreadsheet (sample.xlsx) and extracts its metadata, such as author, title, and creation date. The metadata is then displayed in the console.

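using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractMetadataFromExcel
{
    public static void Run()
    {
        // Load the Excel spreadsheet
        using (Parser parser = new Parser("sample.xlsx"))
        {
            // Get the document's metadata (null if metadata extraction isn't supported)
            IEnumerable<MetadataItem> metadata = parser.GetMetadata();
            if (metadata == null)
            {
                Console.WriteLine("Metadata extraction isn't supported for this document.");
                return;
            }

            // Output the metadata
            foreach (MetadataItem item in metadata)
            {
                Console.WriteLine($"{item.Name}: {item.Value}");
            }
        }
    }
}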
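
The text search feature listed earlier follows the same load-then-query pattern. This is a minimal sketch rather than an official sample: it assumes Search(string) returns SearchResult items exposing Position and Text (or null when search isn't supported), as described in the GroupDocs.Parser documentation. The class name, the filePath parameter, and the keyword are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public static class SearchKeyword
{
    // filePath and keyword are hypothetical inputs, e.g. "sample.pdf" and "invoice"
    public static void Run(string filePath, string keyword)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Search for the keyword (null if search isn't supported)
            IEnumerable<SearchResult> results = parser.Search(keyword);
            if (results == null)
            {
                Console.WriteLine("Search isn't supported for this document.");
                return;
            }

            // Print the position and matched text of each occurrence
            foreach (SearchResult result in results)
            {
                Console.WriteLine($"At {result.Position}: {result.Text}");
            }
        }
    }
}
```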

Product	Compatible and additional computed target framework versions.
.NET	net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.
.NET Core	netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed.
.NET Standard	netstandard2.0 is compatible. netstandard2.1 was computed.
.NET Framework	net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed.
MonoAndroid	monoandroid was computed.
MonoMac	monomac was computed.
MonoTouch	monotouch was computed.
Tizen	tizen40 was computed. tizen60 was computed.
Xamarin.iOS	xamarinios was computed.
Xamarin.Mac	xamarinmac was computed.
Xamarin.TVOS	xamarintvos was computed.
Xamarin.WatchOS	xamarinwatchos was computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last updated
24.10.0	705	11/1/2024
24.9.0	2,187	9/30/2024
24.8.0	23,133	8/30/2024
24.7.0	1,505	7/24/2024
24.6.0	2,574	6/29/2024
24.5.0	5,193	5/31/2024
24.4.0	5,312	4/23/2024
24.2.1	7,005	3/13/2024
24.2.0	1,295	2/29/2024
23.12.0	133,525	12/23/2023
23.11.0	36,269	11/24/2023
23.10.0	13,456	10/21/2023
23.8.0	65,476	8/18/2023
23.5.0	84,331	5/31/2023
23.3.0	16,027	3/31/2023
23.2.0	22,844	3/1/2023
22.11.1	24,565	1/17/2023
22.11.0	38,848	11/29/2022
22.8.0	74,089	8/12/2022
22.6.0	31,412	6/7/2022
22.2.0	37,152	2/25/2022
21.5.0	63,072	5/31/2021
21.2.0	50,838	2/22/2021
20.12.0	24,372	12/30/2020
20.10.0	168,028	10/27/2020
20.8.0	48,774	8/19/2020
20.6.1	47,372	6/30/2020
20.6.0	20,025	6/19/2020
20.5.0	35,106	5/8/2020
20.3.0	48,312	3/19/2020
20.1.0	35,646	1/31/2020
19.12.0	33,523	12/27/2019
19.11.0	28,444	11/22/2019
19.9.0	2,801	9/27/2019
19.5.0	3,031	5/29/2019
18.12.0	3,207	12/11/2018
18.11.0	2,694	11/8/2018
18.10.0	2,777	10/10/2018
18.9.0	2,765	9/5/2018
18.8.0	2,834	8/7/2018
18.7.0	2,784	7/3/2018
18.5.0	3,006	5/23/2018

https://releases.groupdocs.com/parser/net/release-notes/2024/groupdocs-parser-for-net-24-9-release-notes/

Archive

Document Type	Extract Images	Extract Containers and Attachments
7Z* - 7Z File	✔️	✔️
ZIP - Zipped File	✔️	✔️
RAR - Rar File	✔️	✔️
TAR - Tar File	✔️	✔️
GZ - GZip file	✔️	✔️
BZ2 - BZip2 File	✔️	✔️
