Working with Text Extraction

17 Nov 2025, 21 minutes to read

Essential® PDF allows you to extract the text from a particular page or the entire PDF document.


Working with basic text extraction

You can extract the text from a page using the ExtractText method of the PdfPageBase class.

The following code snippet shows how to extract text from a page.

C# [Cross-platform]
C# [Windows-specific]
VB.NET [Windows-specific]
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load the PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
//Load the first page.
PdfPageBase page = loadedDocument.Pages[0];

//Extract text from first page.
string extractedText = page.ExtractText();
//Close the document.
loadedDocument.Close(true);
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load an existing PDF.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
//Load the first page.
PdfPageBase page = loadedDocument.Pages[0];

//Extract text from first page.
string extractedText = page.ExtractText();
//Close the document
loadedDocument.Close(true);
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

'Load an existing PDF.
Dim loadedDocument As New PdfLoadedDocument("Input.pdf")
'Load the first page.
Dim page As PdfPageBase = loadedDocument.Pages(0)

'Extract the text from first page.
Dim extractedText As String = page.ExtractText()
'close the document.
loadedDocument.Close(True)

You can download a complete working sample from GitHub.

NOTE

With this method, text is extracted in the order in which it is written in the document's content stream, which may differ from the order in which it appears in a PDF reader application.

NOTE

Extracting text from the PDF document pages will not load the entire document content into memory.

The following code illustrates how to extract text from an entire PDF document:

C# [Cross-platform]
C# [Windows-specific]
VB.NET [Windows-specific]
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load the PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Loading page collections
PdfLoadedPageCollection loadedPages = loadedDocument.Pages;

string extractedText = string.Empty;
// Extract text from existing PDF document pages
foreach (PdfLoadedPage loadedPage in loadedPages)
{
    extractedText += loadedPage.ExtractText();
}
//Close the document.
loadedDocument.Close(true);
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

// Load an existing PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Loading page collections
PdfLoadedPageCollection loadedPages = loadedDocument.Pages;

string extractedText = string.Empty;
// Extract text from existing PDF document pages
foreach (PdfLoadedPage loadedPage in loadedPages)
{
    extractedText += loadedPage.ExtractText();
}
// Close the document.
loadedDocument.Close(true);
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

' Load an existing PDF document.
Dim loadedDocument As New PdfLoadedDocument("Input.pdf")
' Loading page collections
Dim loadedPages As PdfLoadedPageCollection = loadedDocument.Pages

Dim extractedText As String = String.Empty
' Extract text from existing PDF document pages
For Each loadedPage As PdfLoadedPage In loadedPages
    extractedText &= loadedPage.ExtractText()
Next loadedPage
' Close the document.
loadedDocument.Close(True)

You can download a complete working sample from GitHub.
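
Appending to a string inside a loop copies the accumulated text on every iteration, which can become slow for documents with many pages. The following is a minimal sketch, not part of the library, of collecting per-page text with a StringBuilder instead. The helper name CombinePages and the form-feed page separator are assumptions made for illustration, and the sample strings stand in for the values returned by loadedPage.ExtractText().

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: joins per-page text into a single string.
// The form feed ("\f") separator is an illustrative choice that keeps
// page boundaries recoverable after the pages are combined.
static string CombinePages(IEnumerable<string> pageTexts, string separator = "\f")
{
    var builder = new StringBuilder();
    bool isFirstPage = true;
    foreach (string pageText in pageTexts)
    {
        if (!isFirstPage)
        {
            builder.Append(separator);
        }
        builder.Append(pageText);
        isFirstPage = false;
    }
    return builder.ToString();
}

// Stand-ins for the strings returned by ExtractText() on each page.
string[] samplePages = { "Text of page one", "Text of page two" };
string combinedText = CombinePages(samplePages);

// Splitting on the separator recovers the individual pages.
Console.WriteLine(combinedText.Split('\f').Length);
```

In the loops shown above, each loadedPage.ExtractText() result would take the place of the sample strings, for example by adding it to a list that is then passed to the helper.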

Working with layout based text extraction

You can extract text from a given PDF page based on its layout using the ExtractText(bool) overload. With this method, the text is extracted in the layout in which it is viewed in the reader application.

Refer to the following code snippet to extract text with layout.

C# [Cross-platform]
C# [Windows-specific]
VB.NET [Windows-specific]
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load the PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
//Load first page.
PdfPageBase page = loadedDocument.Pages[0];

//Extract text from first page.
string extractedTexts = page.ExtractText(true);
//Close the document.
loadedDocument.Close(true);
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load an existing PDF.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
//Load first page.
PdfPageBase page = loadedDocument.Pages[0];

//Extract text from first page.
string extractedTexts = page.ExtractText(true);
//close the document
loadedDocument.Close(true);
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

' Load an existing PDF
Dim loadedDocument As New PdfLoadedDocument("Input.pdf")

' Load the first page
Dim page As PdfPageBase = loadedDocument.Pages(0)

' Extract text from the first page
Dim extractedTexts As String = page.ExtractText(True)

' Close the document
loadedDocument.Close(True)

You can download a complete working sample from GitHub.

NOTE

Layout-based text extraction may take more processing time than the normal extraction mode.

Text Extraction with Bounds

Working with Lines

You can get a line of text and its properties, such as its bounds, by using TextLine. Refer to the following code sample.

C# [Cross-platform]
C# [Windows-specific]
VB.NET [Windows-specific]
//Getting lines and their properties using TextLine is supported only on the WinForms, WPF, and Xamarin platforms. In ASP.NET Core, TextLineCollection can be used instead of TextLine.

using Syncfusion.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

// Load the existing PDF document
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Get the first page of the loaded PDF document
PdfPageBase page = loadedDocument.Pages[0];
var lineCollection = new TextLineCollection();

// Extract text from the first page
string extractedText = page.ExtractText(out lineCollection);
// Gets each line from the collection
foreach (var line in lineCollection.TextLine)
{
    // Gets bounds of the line
    RectangleF lineBounds = line.Bounds;
    // Gets text in the line
    string text = line.Text;
}
using System.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

// Load the existing PDF document
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Get the first page of the loaded PDF document
PdfPageBase page = loadedDocument.Pages[0];
TextLines lineCollection = new TextLines();

// Extract text from the first page
string extractedText = page.ExtractText(out lineCollection);
// Gets specific line from the collection
TextLine line = lineCollection[0];
// Gets bounds of the line
RectangleF lineBounds = line.Bounds;
// Gets text in the line
string text = line.Text;
Imports System.Drawing
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

' Load the existing PDF document
Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
' Get the first page of the loaded PDF document
Dim page As PdfPageBase = loadedDocument.Pages(0)
Dim lineCollection As TextLines = New TextLines()

' Extract text from the first page
Dim extractedText As String = page.ExtractText(lineCollection)
' Gets specific line from the collection
Dim line As TextLine = lineCollection(0)
' Gets bounds of the line
Dim lineBounds As RectangleF = line.Bounds
' Gets text in the line
Dim text As String = line.Text

You can download a complete working sample from GitHub.
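
Because basic extraction follows the order of the content stream, the lines returned with bounds can be rearranged into visual reading order after extraction. The following is a minimal sketch of that idea using plain tuples rather than library types. It assumes that Y grows downward from the top of the page; the helper name InReadingOrder and the rowTolerance threshold are hypothetical, and the sample tuples stand in for line.Text, line.Bounds.X, and line.Bounds.Y.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: orders lines top-to-bottom, then left-to-right in a row.
// Assumption: Y grows downward from the top of the page. rowTolerance is an
// illustrative threshold (in points) for treating lines as the same row.
static List<string> InReadingOrder(
    IEnumerable<(string Text, float X, float Y)> lines,
    float rowTolerance = 2f)
{
    var ordered = new List<string>();
    var row = new List<(string Text, float X, float Y)>();

    foreach (var line in lines.OrderBy(l => l.Y))
    {
        // Start a new row once a line sits clearly below the row's first line.
        if (row.Count > 0 && line.Y - row[0].Y > rowTolerance)
        {
            ordered.AddRange(row.OrderBy(l => l.X).Select(l => l.Text));
            row.Clear();
        }
        row.Add(line);
    }

    // Flush the last row.
    ordered.AddRange(row.OrderBy(l => l.X).Select(l => l.Text));
    return ordered;
}

// Stand-ins for (line.Text, line.Bounds.X, line.Bounds.Y) from each TextLine.
var sampleLines = new List<(string Text, float X, float Y)>
{
    ("Right column", 300f, 100.5f),
    ("Footer", 50f, 700f),
    ("Left column", 50f, 100f),
};

foreach (string lineText in InReadingOrder(sampleLines))
{
    Console.WriteLine(lineText);
}
```

Grouping into rows before sorting by X avoids comparing lines with a tolerance inside a sort comparer, which would not give a consistent ordering.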

Working with words

You can get a single word and its properties by using TextWord. Refer to the following code sample.

C# [Cross-platform]
C# [Windows-specific]
VB.NET [Windows-specific]
//Getting words and their properties using TextWord is supported only on the WinForms, WPF, and Xamarin platforms. In ASP.NET Core, TextLineCollection can be used instead of TextLine.

using System.Collections.Generic;
using Syncfusion.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

// Load the existing PDF document
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Get the first page of the loaded PDF document
PdfPageBase page = loadedDocument.Pages[0];
var lineCollection = new TextLineCollection();

// Extract text from the first page
string extractedText = page.ExtractText(out lineCollection);
// Gets each line from the collection
foreach (var line in lineCollection.TextLine)
{   
    // Gets bounds of the line
    RectangleF lineBounds = line.Bounds;
    // Gets text in the line
    string text = line.Text;
    // Gets collection of the words in the line
    List<TextWord> textWordCollection = line.WordCollection;
}
using System.Collections.Generic;
using System.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

// Load the existing PDF document
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Get the first page of the loaded PDF document
PdfPageBase page = loadedDocument.Pages[0];
TextLines lineCollection = new TextLines();

// Extract text from the first page
string extractedText = page.ExtractText(out lineCollection);
// Gets specific line from the collection
TextLine line = lineCollection[0];
// Gets bounds of the line
RectangleF lineBounds = line.Bounds;
// Gets text in the line
string text = line.Text;
// Gets collection of the words in the line
List<TextWord> textWordCollection = line.WordCollection;
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing
Imports System.Drawing

' Load the existing PDF document
Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
' Get the first page of the loaded PDF document
Dim page As PdfPageBase = loadedDocument.Pages(0)
Dim lineCollection As TextLines = New TextLines()

' Extract text from the first page
Dim extractedText As String = page.ExtractText(lineCollection)
' Gets specific line from the collection
Dim line As TextLine = lineCollection(0)
' Gets bounds of the line
Dim lineBounds As RectangleF = line.Bounds
' Gets text in the line
Dim text As String = line.Text
' Gets collection of the words in the line
Dim textWordCollection As List(Of TextWord) = line.WordCollection

You can download a complete working sample from GitHub.
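
Once words are available with their positions, a common next step is to keep only the words that fall inside a region of the page, such as a header area. The following is a minimal sketch of that filter. It assumes that TextWord exposes Text and Bounds in the same way TextLine does, which this page does not show; the helper name WordsInRegion is hypothetical. The sketch uses System.Drawing.RectangleF as in the Windows-specific samples; in cross-platform projects the Syncfusion.Drawing type would need to be substituted or converted.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical helper: returns the text of every word whose bounds
// overlap the given region, preserving the input order.
static List<string> WordsInRegion(
    IEnumerable<(string Text, RectangleF Bounds)> words,
    RectangleF region)
{
    return words
        .Where(word => word.Bounds.IntersectsWith(region))
        .Select(word => word.Text)
        .ToList();
}

// Stand-ins for the assumed (textWord.Text, textWord.Bounds) pairs.
var sampleWords = new List<(string Text, RectangleF Bounds)>
{
    ("Invoice", new RectangleF(40f, 30f, 60f, 12f)),
    ("Total", new RectangleF(40f, 700f, 35f, 12f)),
};

// Illustrative header area: the top 100 points of the page.
var headerRegion = new RectangleF(0f, 0f, 600f, 100f);
Console.WriteLine(string.Join(" ", WordsInRegion(sampleWords, headerRegion)));
```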

Working with characters

You can retrieve a single character and its properties, including bounds, font name, font size, font style, and text color, using the TextGlyph instance. Refer to the code sample below.

C# [Cross-platform]
VB.NET [Windows-specific]
using System.Collections.Generic;
using Syncfusion.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

// Load the existing PDF document
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
// Get the first page of the loaded PDF document
PdfPageBase page = loadedDocument.Pages[0];
TextLineCollection lineCollection = new TextLineCollection();

// Extract text from the first page
string extractedText = page.ExtractText(out lineCollection);
// Get a specific line from the collection
TextLine line = lineCollection.TextLine[0];
// Get the collection of words in the line
List<TextWord> textWordCollection = line.WordCollection;
// Get a word from the collection using an index
TextWord textWord = textWordCollection[0];
// Get Glyph details of the word
List<TextGlyph> textGlyphCollection = textWord.Glyphs;

// Get a character from the word
TextGlyph textGlyph = textGlyphCollection[0];
// Get bounds of the character
RectangleF glyphBounds = textGlyph.Bounds;
// Get font name of the character
string glyphFontName = textGlyph.FontName;
// Get font size of the character
float glyphFontSize = textGlyph.FontSize;
// Get font style of the character
FontStyle glyphFontStyle = textGlyph.FontStyle;
// Get the character in the word
char glyphText = textGlyph.Text;
// Get the color of the character
Color glyphColor = textGlyph.TextColor;
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing
Imports System.Drawing

' Load the existing PDF document
Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
' Get the first page of the loaded PDF document
Dim page As PdfPageBase = loadedDocument.Pages(0)
Dim lineCollection As New TextLineCollection()

' Extract text from the first page
Dim extractedText As String = page.ExtractText(lineCollection)
' Get a specific line from the collection
Dim line As TextLine = lineCollection.TextLine(0)
' Get a collection of words in the line
Dim textWordCollection As List(Of TextWord) = line.WordCollection
' Get a word from the collection using an index
Dim textWord As TextWord = textWordCollection(0)
' Get Glyph details of the word
Dim textGlyphCollection As List(Of TextGlyph) = textWord.Glyphs

' Get a character from the word
Dim textGlyph As TextGlyph = textGlyphCollection(0)
' Get bounds of the character
Dim glyphBounds As RectangleF = textGlyph.Bounds
' Get font name of the character
Dim glyphFontName As String = textGlyph.FontName
' Get font size of the character
Dim glyphFontSize As Single = textGlyph.FontSize
' Get font style of the character
Dim glyphFontStyle As FontStyle = textGlyph.FontStyle
' Get the character in the word
Dim glyphText As Char = textGlyph.Text
' Get the color of the character
Dim glyphColor As Color = textGlyph.TextColor

You can download a complete working sample from GitHub.
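
Per-character font details can be used to split text into runs that share the same formatting, for example to spot where a heading font or a bold font begins. The following is a minimal sketch of that grouping using plain tuples. The helper name GroupIntoRuns is hypothetical, and the sample tuples stand in for textGlyph.Text, textGlyph.FontName, and textGlyph.FontSize.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: merges consecutive characters that share the same
// font name and font size into a single run of text.
static List<(string Text, string FontName, float FontSize)> GroupIntoRuns(
    IEnumerable<(char Text, string FontName, float FontSize)> glyphs)
{
    var runs = new List<(string Text, string FontName, float FontSize)>();
    var current = new StringBuilder();
    string fontName = string.Empty;
    float fontSize = 0f;

    foreach (var glyph in glyphs)
    {
        bool fontChanged = glyph.FontName != fontName || glyph.FontSize != fontSize;
        if (current.Length > 0 && fontChanged)
        {
            // Close the run collected so far before starting a new one.
            runs.Add((current.ToString(), fontName, fontSize));
            current.Clear();
        }
        current.Append(glyph.Text);
        fontName = glyph.FontName;
        fontSize = glyph.FontSize;
    }

    if (current.Length > 0)
    {
        runs.Add((current.ToString(), fontName, fontSize));
    }
    return runs;
}

// Stand-ins for (textGlyph.Text, textGlyph.FontName, textGlyph.FontSize).
var sampleGlyphs = new List<(char Text, string FontName, float FontSize)>
{
    ('H', "Arial", 12f), ('i', "Arial", 12f), ('!', "Arial-Bold", 12f),
};

foreach (var run in GroupIntoRuns(sampleGlyphs))
{
    Console.WriteLine($"{run.Text} [{run.FontName}, {run.FontSize}]");
}
```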

NOTE

In .NET Framework, use the ExtractText(out List<TextData>) or ExtractText(out List<TextLine>) method to extract text with metadata from a PDF.
In contrast, for .NET Core, the equivalent method is ExtractText(out TextLineCollection), which provides a unified structure for handling extracted text data.

Find Text

The following code example shows how to use the FindText method of the PdfLoadedDocument class to locate text in a PDF document. The method provides the page number and the bounding rectangles of each occurrence found.

C# [Cross-platform]
C# [Windows-specific]
VB.NET [Windows-specific]
using System.Collections.Generic;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load an existing PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
//Returns page number and rectangle positions of the text matches.
Dictionary<int, List<Syncfusion.Drawing.RectangleF>> matchRects = new Dictionary<int, List<Syncfusion.Drawing.RectangleF>>();
loadedDocument.FindText("document", out matchRects);
//Close the document.
loadedDocument.Close(true);
using System.Collections.Generic;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//Load an existing PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
//Returns page number and rectangle positions of the text matches.
Dictionary<int, List<System.Drawing.RectangleF>> matchRects = new Dictionary<int, List<System.Drawing.RectangleF>>();
loadedDocument.FindText("document", out matchRects);           
//Close the document.
loadedDocument.Close(true);
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

'Load an existing PDF document. 
Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
'Returns page number and rectangle positions of the text matches.
Dim matchRects As Dictionary(Of Integer, List(Of System.Drawing.RectangleF)) = New Dictionary(Of Integer, List(Of System.Drawing.RectangleF))()
loadedDocument.FindText("document", matchRects)
'Close the document.
loadedDocument.Close(True)

You can download a complete working sample from GitHub.
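
The dictionary filled by FindText groups match rectangles under a per-page key, which is convenient for lookups but awkward when every match needs to be processed in sequence. The following is a minimal sketch of flattening such a dictionary into a single ordered list. The helper name FlattenMatches is hypothetical, and the sample dictionary stands in for the matchRects value from the samples above. The sketch uses System.Drawing.RectangleF as in the Windows-specific sample; the cross-platform sample uses the Syncfusion.Drawing type instead.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical helper: turns the per-page dictionary into one list of
// (page key, rectangle) pairs, ordered by page key.
static List<(int Page, RectangleF Bounds)> FlattenMatches(
    Dictionary<int, List<RectangleF>> matchRects)
{
    return matchRects
        .OrderBy(entry => entry.Key)
        .SelectMany(entry => entry.Value.Select(rect => (Page: entry.Key, Bounds: rect)))
        .ToList();
}

// Stand-in for the dictionary filled by FindText; values are illustrative.
var sampleMatches = new Dictionary<int, List<RectangleF>>
{
    [2] = new List<RectangleF> { new RectangleF(72f, 90f, 58f, 12f) },
    [0] = new List<RectangleF>
    {
        new RectangleF(72f, 120f, 58f, 12f),
        new RectangleF(200f, 300f, 58f, 12f),
    },
};

foreach (var match in FlattenMatches(sampleMatches))
{
    Console.WriteLine($"Key {match.Page}: {match.Bounds}");
}
```

The flattened list also gives the total number of matches directly through its Count property.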

FindText Module API Reference

Method	Return Type	Description
FindText(List<string> searchItems, out TextSearchResultCollection searchResult)	bool	Searches for a list of text strings (`searchItems`) across the entire document, storing the results in `searchResult`.
FindText(List<string> searchItems, out TextSearchResultCollection searchResult, bool enableMultiThreading)	bool	Searches for a list of text strings with multi-threading enabled, storing the results in `searchResult`.
FindText(List<string> searchItems, int pageIndex, out List<MatchedItem> searchResults)	bool	Searches for text strings on a specific page (`pageIndex`), returning matches in `searchResults`.
FindText(List<string> searchItems, int pageIndex, TextSearchOptions textSearchOption, out List<MatchedItem> searchResults)	bool	Searches on a specific page with search options, returning matches in `searchResults`.
FindText(List<string> searchItems, TextSearchOptions textSearchOption, out TextSearchResultCollection searchResult)	bool	Searches with custom options, storing results in `searchResult`.
FindText(List<string> searchItems, TextSearchOptions textSearchOption, out TextSearchResultCollection searchResult, bool enableMultiThreading)	bool	Performs a multi-threaded search with options, saving results in `searchResult`.
FindText(List<TextSearchItem> searchItems, out TextSearchResultCollection searchResult)	bool	Searches using `TextSearchItem` objects, storing results in `searchResult`.
FindText(List<TextSearchItem> searchItems, out TextSearchResultCollection searchResult, bool enableMultiThreading)	bool	Performs a multi-threaded search using `TextSearchItem` objects, storing results in `searchResult`.
FindText(List<TextSearchItem> searchItems, int pageIndex, out List<MatchedItem> searchResults)	bool	Searches using `TextSearchItem` on a specific page, returning `MatchedItem` results.
FindText(string text, out Dictionary<int, List<RectangleF>> matchRect)	bool	Finds a text string and returns match rectangles for all pages in `matchRect`.
FindText(string text, int index, out List<RectangleF> matchRect)	bool	Finds a text string on a specific page (`index`), returning rectangles in `matchRect`.