Working with Data Extraction
25 May 2026 · 24 minutes to read
Extract Data as JSON from PDF or Image
The Smart Data Extractor enables you to process PDF documents or scanned images and export the structured content as JSON.
This section covers two scenarios:
- Extracting data as JSON from a PDF document.
- Extracting data as JSON from an image.
Extract Data as JSON from PDF
To extract structured data from a PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
Extract Data as JSON from an Image
To extract structured data from an image file using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
//Open the input image file as a stream.
using (FileStream stream = new FileStream("Image.png", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as JSON from the image stream.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
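The schema of the JSON returned by ExtractDataAsJson is not described in this section, so a reasonable first step is to inspect the output generically. The following is a minimal sketch that uses System.Text.Json from the .NET base class library to list the top-level members of a JSON string and pretty-print it. The sampleJson value and the DescribeTopLevel helper are hypothetical and only stand in for real extractor output, whose structure will differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for the string returned by ExtractDataAsJson.
// The real output schema is not shown here and will differ.
string sampleJson = "{\"pages\":[{\"number\":1}],\"title\":\"Sample\"}";

// Parse the text; JsonDocument.Parse throws JsonException on malformed input.
using JsonDocument document = JsonDocument.Parse(sampleJson);

// Collect "name: kind" for each top-level member without assuming a schema.
List<string> summary = DescribeTopLevel(document.RootElement);
foreach (string line in summary)
{
    Console.WriteLine(line);
}

// Re-serialize with indentation for easier reading.
string pretty = JsonSerializer.Serialize(
    document.RootElement,
    new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(pretty);

// Hypothetical helper: describes the members of the root JSON value.
static List<string> DescribeTopLevel(JsonElement root)
{
    var lines = new List<string>();
    if (root.ValueKind == JsonValueKind.Object)
    {
        foreach (JsonProperty property in root.EnumerateObject())
        {
            lines.Add($"{property.Name}: {property.Value.ValueKind}");
        }
    }
    else
    {
        lines.Add($"(root): {root.ValueKind}");
    }
    return lines;
}
```

In practice, you would pass the data string from the examples above instead of sampleJson.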
Extract Data as Markdown from PDF or Image
The Smart Data Extractor enables you to process PDF documents or scanned images and export the structured content as Markdown (MD).
This section covers two scenarios:
- Extracting data as Markdown from a PDF document.
- Extracting data as Markdown from an image.
Extract Data as Markdown from PDF
To extract structured data from a PDF document using the ExtractDataAsMarkdown method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as Markdown.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
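When several PDF files need to be converted, the same open, extract, and save pattern can be wrapped in a loop. The following is a sketch only: ConvertFolder is a hypothetical helper, and the extract delegate is a stand-in for a call such as extractor.ExtractDataAsMarkdown, assuming that method takes a stream and returns a string as in the example above. A temporary folder with one placeholder file is created so that the sketch runs without the library.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in for extractor.ExtractDataAsMarkdown (assumption: Stream in, string out).
Func<Stream, string> extract = s => $"# Extracted {s.Length} bytes";

// Create a temporary folder with one placeholder input so the sketch is runnable.
string folder = Path.Combine(Path.GetTempPath(), "extract-sketch");
Directory.CreateDirectory(folder);
File.WriteAllBytes(Path.Combine(folder, "Input.pdf"), new byte[] { 1, 2, 3 });

int converted = ConvertFolder(folder, "*.pdf", extract);
Console.WriteLine($"Converted {converted} file(s).");

// Hypothetical helper: writes one .md file next to each matching input file.
static int ConvertFolder(string sourceFolder, string searchPattern, Func<Stream, string> extractText)
{
    int count = 0;
    foreach (string inputPath in Directory.GetFiles(sourceFolder, searchPattern))
    {
        using (FileStream input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        {
            string markdown = extractText(input);
            // Input.pdf becomes Input.md in the same folder.
            File.WriteAllText(Path.ChangeExtension(inputPath, ".md"), markdown, Encoding.UTF8);
        }
        count++;
    }
    return count;
}
```

With the library referenced, the delegate could be replaced by a lambda that forwards to a DataExtractor instance.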
Extract Data as Markdown from Image
To extract structured data from an image file using the ExtractDataAsMarkdown method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
//Open the input image file as a stream.
using (FileStream stream = new FileStream("Input.png", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as Markdown.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
Extract Data from PDF or Image and Save as Digital PDF
The Smart Data Extractor allows you to process PDF documents or scanned images and generate a digital PDF output.
In this section, you will learn how to:
- Extract structured content and save it directly as a PDF document.
- Work with the extracted content as a PDF stream for flexible storage or further processing.
Extract Data from PDF Document
To extract structured data such as text, form fields, tables and images from an entire PDF document using the ExtractDataAsPdfDocument method of the DataExtractor class, refer to the following code example:
using System.IO;
using Syncfusion.Pdf.Parsing;
using Syncfusion.SmartDataExtractor;
//Open the input PDF file as a stream.
using (FileStream inputStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data and return as a loaded PDF document.
    PdfLoadedDocument document = extractor.ExtractDataAsPdfDocument(inputStream);
    //Save the extracted output as a new PDF file.
    document.Save("Output.pdf");
    //Close the document.
    document.Close(true);
}
You can download a complete working sample from GitHub.
Extract Data as Stream
To extract structured data from a PDF document and return the output as a stream using the ExtractDataAsPdfStream method of the DataExtractor class, refer to the following code example:
using System.IO;
using Syncfusion.SmartDataExtractor;
//Open the input PDF file as a stream.
using (FileStream inputStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data and return as a PDF stream.
    Stream pdfStream = extractor.ExtractDataAsPdfStream(inputStream);
    //Save the extracted PDF stream into an output file.
    using (FileStream outputStream = new FileStream("Output.pdf", FileMode.Create, FileAccess.Write))
    {
        pdfStream.CopyTo(outputStream);
    }
}
You can download a complete working sample from GitHub.
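This section does not state whether the stream returned by ExtractDataAsPdfStream is positioned at its start. Stream.CopyTo copies from the current position, so a defensive approach is to rewind a seekable stream before copying it. The following is a minimal sketch of that idea: SaveStream is a hypothetical helper, and the MemoryStream is only a stand-in for the extractor's result.

```csharp
using System;
using System.IO;

// Stand-in for the stream returned by ExtractDataAsPdfStream (assumption:
// any readable Stream). Writing leaves the position at the end on purpose.
using MemoryStream pdfStream = new MemoryStream();
pdfStream.Write(new byte[] { 0x25, 0x50, 0x44, 0x46 }, 0, 4); // the bytes of "%PDF"

string outputPath = Path.Combine(Path.GetTempPath(), "Output-sketch.pdf");
long bytesWritten = SaveStream(pdfStream, outputPath);
Console.WriteLine($"Wrote {bytesWritten} bytes to {outputPath}");

// Hypothetical helper: rewinds the source when possible, then copies it to a file.
static long SaveStream(Stream source, string path)
{
    if (source.CanSeek)
    {
        // Without this, CopyTo would start at the current position and could copy nothing.
        source.Position = 0;
    }
    using (FileStream output = new FileStream(path, FileMode.Create, FileAccess.Write))
    {
        source.CopyTo(output);
    }
    // Report the size of the file that was written.
    return new FileInfo(path).Length;
}
```

The same helper works if the result needs to go to another destination first, because the source stream can be rewound and copied again as long as it is seekable.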
Disable Form Detection
To disable form field detection while extracting structured data from a PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Disable form field detection. The default value is true.
    extractor.EnableFormDetection = false;
    //Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
Disable Table Detection
To disable table detection while extracting structured data from a PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
// Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    // Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    // Disable table detection. The default value is true.
    extractor.EnableTableDetection = false;
    // Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    // Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
Extract Data with Form Recognizer Options
To extract structured data from a PDF document using different Form Recognizer options with the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
using Syncfusion.SmartFormRecognizer;
//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Enable form detection in the document to identify form fields.
    extractor.EnableFormDetection = true;
    //Configure form recognition options for advanced detection.
    FormRecognizeOptions formOptions = new FormRecognizeOptions();
    //Recognize forms across pages 1 to 5 in the document.
    formOptions.PageRange = new int[,] { { 1, 5 } };
    //Set confidence threshold for form recognition to filter results.
    formOptions.ConfidenceThreshold = 0.6;
    //Enable detection of signatures within the document.
    formOptions.DetectSignatures = true;
    //Enable detection of textboxes within the document.
    formOptions.DetectTextboxes = true;
    //Enable detection of checkboxes within the document.
    formOptions.DetectCheckboxes = true;
    //Enable detection of radio buttons within the document.
    formOptions.DetectRadioButtons = true;
    //Assign the configured form recognition options to the extractor.
    extractor.FormRecognizeOptions = formOptions;
    //Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
Extract Data with Table Extraction Options
To extract structured table data from a PDF document using advanced Table Extraction options with the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
using Syncfusion.SmartTableExtractor;
// Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    // Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    // Enable table detection.
    extractor.EnableTableDetection = true;
    // Configure table extraction options.
    TableExtractionOptions tableOptions = new TableExtractionOptions();
    // Extract tables across pages 1 to 5.
    tableOptions.PageRange = new int[,] { { 1, 5 } };
    // Set confidence threshold for table extraction.
    tableOptions.ConfidenceThreshold = 0.6;
    // Enable detection of borderless tables.
    tableOptions.DetectBorderlessTables = true;
    // Assign the table extraction options to the extractor.
    extractor.TableExtractionOptions = tableOptions;
    // Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    // Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
Apply Confidence Threshold for Data Extraction
To apply confidence thresholding when extracting data from a PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
// Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    // Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    // Apply a confidence threshold to the extraction.
    // Only elements with confidence >= 0.75 will be included in the results.
    // The default confidence threshold is 0.6.
    extractor.ConfidenceThreshold = 0.75;
    // Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    // Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
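The comment in the example above describes the threshold as an inclusive lower bound: an element is kept when its confidence is greater than or equal to the configured value. The extractor applies this filter internally. The following sketch only illustrates the comparison, using invented labels and scores rather than real extractor results.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical detections; labels and scores are invented for illustration.
var detections = new List<(string Label, double Confidence)>
{
    ("table", 0.92),
    ("checkbox", 0.75),
    ("signature", 0.61),
};

double threshold = 0.75;

// Keep elements whose confidence is at or above the threshold (inclusive).
List<string> kept = detections
    .Where(d => d.Confidence >= threshold)
    .Select(d => d.Label)
    .ToList();

Console.WriteLine(string.Join(", ", kept)); // prints "table, checkbox"
```

With the default value of 0.6, the third element in this sample would also be kept, so raising the threshold trades recall for fewer low-confidence results.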
Extract Data within a Specific Page Range
To extract data from a specific range of pages in a PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
using System.IO;
using System.Text;
using Syncfusion.SmartDataExtractor;
//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = new int[,] { { 1, 3 } };
    //Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
You can download a complete working sample from GitHub.
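PageRange is an int[,] in which each row holds a start page and an end page, as in the { 1, 3 } row above. The two-dimensional type suggests that several rows can be supplied, but that is an assumption not confirmed in this section, as is the 1-based page numbering implied by the comments. The following sketch shows a hypothetical BuildPageRanges helper that builds such an array from start and end pairs and rejects reversed or non-positive ranges before they reach the extractor.

```csharp
using System;

// Two ranges: pages 1 to 3 and pages 7 to 9.
int[,] ranges = BuildPageRanges((1, 3), (7, 9));
Console.WriteLine($"{ranges.GetLength(0)} ranges, first ends at page {ranges[0, 1]}");

// With a DataExtractor instance, the result would be assigned as:
// extractor.PageRange = ranges;

// Hypothetical helper: converts (start, end) pairs into an int[,] of rows.
static int[,] BuildPageRanges(params (int Start, int End)[] pairs)
{
    int[,] result = new int[pairs.Length, 2];
    for (int i = 0; i < pairs.Length; i++)
    {
        (int start, int end) = pairs[i];
        // Assumes 1-based page numbers and start <= end.
        if (start < 1 || end < start)
        {
            throw new ArgumentException($"Invalid page range {start}-{end}.");
        }
        result[i, 0] = start;
        result[i, 1] = end;
    }
    return result;
}
```

Validating the pairs up front keeps range mistakes out of the extraction call, where their effect is not documented here.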