Extract Text from PDF Files in WPF Pdf Viewer

29 Jul 2021, 8 minutes to read

PDF Viewer allows you to extract text from a particular page or from the entire PDF file using the ExtractText method of the PdfDocumentView class.

NOTE

PDF Viewer uses PDFium as the default rendering engine to extract text from PDF files. Refer to the PDF rendering engines documentation for more details.

Extract text from a particular page

You can extract the text from a page using the ExtractText method of the PdfDocumentView class. The method takes a zero-based page index and returns the text of that page as a string. The following code sample extracts the text from the first page (index 0).

using System.Windows;
using Syncfusion.Pdf;
using Syncfusion.Windows.PdfViewer;

namespace TextExtractionDemo
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        #region Constructor
        public MainWindow()
        {
            InitializeComponent();

            //Initialize the `PdfDocumentView` control.
            PdfDocumentView pdfDocumentView = new PdfDocumentView();

            //Load the PDF file.
            pdfDocumentView.Load(@"Sample.pdf");

            //Extract text from the first page.
            TextLines textLines = new TextLines();
            string extractedText = pdfDocumentView.ExtractText(0, out textLines);
        }
        #endregion
    }
}

NOTE

With this method, text is extracted in the order in which it is written in the document stream, which may differ from the order in which it appears in a PDF reader application.

Extract text from an entire file

To extract text from an entire file, call the ExtractText method for each page and concatenate the results, as in the following code sample.

using System.Windows;
using Syncfusion.Pdf;
using Syncfusion.Windows.PdfViewer;

namespace TextExtractionDemo
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        #region Constructor
        public MainWindow()
        {
            InitializeComponent();

            //Initialize the `PdfDocumentView` control.
            PdfDocumentView pdfDocumentView = new PdfDocumentView();

            //Load the PDF file.
            pdfDocumentView.Load(@"Sample.pdf");

            //Extract text from the file.
            TextLines textLines = new TextLines();
            string extractedText = string.Empty;
            for (int i = 0; i < pdfDocumentView.PageCount; i++)
            {
                extractedText += pdfDocumentView.ExtractText(i, out textLines);
            }
        }
        #endregion
    }
}
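
For documents with many pages, repeated string concatenation in a loop can become costly. The following is a minimal sketch of the same page loop using a StringBuilder. It relies only on the ExtractText method and PageCount property used in the sample above; the class name TextExtractionHelper, the method name ExtractAllText, and the line break inserted between pages are illustrative assumptions and not part of the PDF Viewer API.

```csharp
using System.Text;
using Syncfusion.Pdf;
using Syncfusion.Windows.PdfViewer;

namespace TextExtractionDemo
{
    public static class TextExtractionHelper
    {
        // Hypothetical helper (not part of the PDF Viewer API):
        // concatenates the text of every page of an already loaded document.
        public static string ExtractAllText(PdfDocumentView pdfDocumentView)
        {
            StringBuilder builder = new StringBuilder();

            for (int i = 0; i < pdfDocumentView.PageCount; i++)
            {
                // The line collection is required by the method signature
                // but is not used in this sketch.
                TextLines textLines;
                builder.Append(pdfDocumentView.ExtractText(i, out textLines));

                // Assumption: a line break is an acceptable page separator.
                builder.AppendLine();
            }

            return builder.ToString();
        }
    }
}
```

After loading a document, the helper could be called as `string allText = TextExtractionHelper.ExtractAllText(pdfDocumentView);`.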

Extract text with bounds

Extract lines

You can get the text line by line, along with the bounds of each line, through the TextLines collection returned by the out parameter of the ExtractText method. Each TextLine in the collection exposes its Text and Bounds. Refer to the following code sample.

using System.Drawing;
using System.Windows;
using Syncfusion.Pdf;
using Syncfusion.Windows.PdfViewer;

namespace TextExtractionDemo
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        #region Constructor
        public MainWindow()
        {
            InitializeComponent();

            //Initialize the `PdfDocumentView` control.
            PdfDocumentView pdfDocumentView = new PdfDocumentView();

            //Load the PDF file.
            pdfDocumentView.Load(@"Sample.pdf");

            //Initialize the `TextLines`
            TextLines textLines = new TextLines();

            //Pass the `TextLines` as a parameter to the `ExtractText` method.
            pdfDocumentView.ExtractText(0, out textLines);

            //Get a specific line from the collection by its index.
            TextLine line = textLines[0];

            //Get the text in the line.
            string text = line.Text;

            //Get the bounds of the line.
            RectangleF lineBounds = line.Bounds;
        }
        #endregion
    }
}
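
The sample above reads only the first line. As a rough sketch, every line of a page could be visited in the same way. The Text and Bounds properties are the ones shown above; the class name LineExtractionHelper, the method name DescribeLines, and the output format are hypothetical, and the sketch assumes that the TextLines collection can be enumerated with foreach, which the sample above does not demonstrate.

```csharp
using System.Collections.Generic;
using System.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Windows.PdfViewer;

namespace TextExtractionDemo
{
    public static class LineExtractionHelper
    {
        // Hypothetical helper (not part of the PDF Viewer API):
        // returns one description per extracted line of the given page.
        public static List<string> DescribeLines(PdfDocumentView pdfDocumentView, int pageIndex)
        {
            List<string> descriptions = new List<string>();

            TextLines textLines;
            pdfDocumentView.ExtractText(pageIndex, out textLines);

            // Defensive check in case no line collection is returned.
            if (textLines == null)
            {
                return descriptions;
            }

            // Assumption: TextLines supports foreach enumeration.
            foreach (TextLine line in textLines)
            {
                RectangleF bounds = line.Bounds;
                descriptions.Add(string.Format(
                    "{0} [X={1}, Y={2}, Width={3}, Height={4}]",
                    line.Text, bounds.X, bounds.Y, bounds.Width, bounds.Height));
            }

            return descriptions;
        }
    }
}
```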

Extract words

You can get the words in a line, along with their bounds, using the WordCollection property of a TextLine obtained from the ExtractText method. Refer to the following code sample.

using System.Collections.Generic;
using System.Drawing;
using System.Windows;
using Syncfusion.Pdf;
using Syncfusion.Windows.PdfViewer;

namespace TextExtractionDemo
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        #region Constructor
        public MainWindow()
        {
            InitializeComponent();

            //Initialize the `PdfDocumentView` control.
            PdfDocumentView pdfDocumentView = new PdfDocumentView();

            //Load the PDF file.
            pdfDocumentView.Load(@"Sample.pdf");

            //Initialize the `TextLines`
            TextLines textLines = new TextLines();

            //Pass the `TextLines` as a parameter to the `ExtractText` method.
            pdfDocumentView.ExtractText(0, out textLines);

            //Get a specific line from the collection by its index.
            TextLine line = textLines[0];

            //Get the word collection in the line.
            List<TextWord> wordCollection = line.WordCollection;

            //Get the text of the first word.
            string word = wordCollection[0].Text;

            //Get the bounds of the first word.
            RectangleF bounds = wordCollection[0].Bounds;
        }
        #endregion
    }
}
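
Word bounds are typically useful for locating a term on the page. The following sketch collects the bounds of every word in a line that matches a search term. It uses only the WordCollection, Text, and Bounds members shown above; the class name WordSearchHelper, the method name FindWordBounds, and the choice of a trimmed, case-insensitive comparison are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using Syncfusion.Pdf;

namespace TextExtractionDemo
{
    public static class WordSearchHelper
    {
        // Hypothetical helper (not part of the PDF Viewer API):
        // returns the bounds of each word in the line that equals the target.
        public static List<RectangleF> FindWordBounds(TextLine line, string target)
        {
            List<RectangleF> matches = new List<RectangleF>();

            foreach (TextWord word in line.WordCollection)
            {
                // Assumption: surrounding whitespace and letter case
                // should be ignored when comparing words.
                string candidate = (word.Text ?? string.Empty).Trim();

                if (string.Equals(candidate, target, StringComparison.OrdinalIgnoreCase))
                {
                    matches.Add(word.Bounds);
                }
            }

            return matches;
        }
    }
}
```

With the variables from the sample above, a call might look like `List<RectangleF> hits = WordSearchHelper.FindWordBounds(textLines[0], "PDF");`. Words that carry attached punctuation would not match under this comparison.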

NOTE

You can refer to our WPF PDF Viewer feature tour page for an overview of its features. You can also explore our WPF PDF Viewer example to learn how to render and configure the PDF Viewer.