Overview of Smart Data Extractor

25 May 2026 · 4 minutes to read

Syncfusion® Smart Data Extractor is a high-performance, deterministic C# library for extracting structured content from PDFs and images. Designed for .NET workflows, it interprets visual layout patterns (lines, boxes, labels, and alignment) to identify and extract tables, text elements, images, headers, footers, and form fields. Each extracted element carries a per-field confidence score, so results can be validated or filtered before they are exported or passed to an application.

Key Features of Syncfusion® Smart Data Extractor

The following list highlights the core capabilities of the Syncfusion® Smart Data Extractor:

  • Document structure extraction: detects text elements, images, headers/footers, and complete table structures (regions, header rows, columns, cell boundaries, merged cells).
  • File format support: works with PDF and common image formats such as JPEG and PNG.
  • Table extraction: specialized parsing to recover table rows, columns, header detection, and cell spans.
  • Form recognition: detects and extracts form fields (text inputs, checkboxes, radio buttons) with field types and values.
  • Page‑level control: extract data from specific pages or defined page ranges.
  • Confidence thresholding: filters results based on a configurable confidence score (0.0–1.0).
  • Deterministic performance: ensures predictable, repeatable extraction across environments including Windows, Linux, Azure, and Docker.

JSON Output Structure and Attributes

The Syncfusion® Data Extraction libraries process PDFs and scanned images to extract structured document data—including tables, form fields, text elements, images, headers, and footers—by analyzing layout patterns, table regions, borders, alignment cues, and cell structures. The extracted output is returned as structured JSON with per‑field and per‑cell confidence scores, along with complete document and table hierarchies, making it ready for immediate review, export, or integration into downstream workflows.

Root Structure

Below is the root structure of the JSON result:

{
  "Pages": [
    {
      "PageNumber": 1,
      "Width": 0,
      "Height": 0,
      "PageObjects": [],
      "FormObjects": [] 
    }
  ]
}

NOTE

In the Smart Table Extractor root structure, the FormObjects element will not be present.
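
The root structure can be consumed with any JSON parser. The following is a minimal sketch in Python, used here only for illustration because the JSON itself is language-neutral. The sample document and the `summarize_pages` helper are hypothetical; only the key names come from the structure above. `FormObjects` is read with a default so the same code also handles Smart Table Extractor output, where that key is absent.

```python
import json

# Hypothetical sample shaped like the root structure above.
# The second page omits FormObjects, as Smart Table Extractor output would.
SAMPLE = """
{
  "Pages": [
    {"PageNumber": 1, "Width": 612.0, "Height": 792.0,
     "PageObjects": [{"Type": "Table"}], "FormObjects": []},
    {"PageNumber": 2, "Width": 612.0, "Height": 792.0,
     "PageObjects": []}
  ]
}
"""

def summarize_pages(result_json: str) -> list[dict]:
    """Return one summary per page: page number, object count, form field count."""
    root = json.loads(result_json)
    summaries = []
    for page in root.get("Pages", []):
        summaries.append({
            "page": page["PageNumber"],
            "objects": len(page.get("PageObjects", [])),
            # Default to an empty list when FormObjects is not present.
            "forms": len(page.get("FormObjects", [])),
        })
    return summaries

for summary in summarize_pages(SAMPLE):
    print(summary)
```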

JSON Attributes

Page Object

The Page Object holds a page's metadata together with the elements detected on it: all detected element types in the Smart Data Extractor, and table elements only in the Smart Table Extractor.

| Attribute | Type | Description |
|---|---|---|
| PageNumber | Integer | Sequential number of the page in the document. |
| Width | Float | Page width in points/pixels. |
| Height | Float | Page height in points/pixels. |
| PageObjects | Array | List of detected objects (for example, text, tables, and images; tables only in the Smart Table Extractor). |
| FormObjects | Array | List of detected form fields (checkboxes, text boxes, radio buttons, signatures, etc.). |

NOTE

The FormObjects array is not included in the Smart Table Extractor output structure, as it is specific to the Smart Data Extractor and Smart Form Recognizer.

PageObjects

Each entry in PageObjects represents one element detected on the page. In the Smart Data Extractor these include text, headers, footers, tables, images, and numbers; in the Smart Table Extractor they are limited to the detected table elements.

| Attribute | Type | Description |
|---|---|---|
| Type | String | Kind of object detected on the page (for example, Table). |
| Bounds | Array of Floats | Bounding box coordinates [X, Y, Width, Height] giving the object's position and size on the page. |
| Content | Object | Extracted textual content together with its style attributes (FontName, FontStyle, FontSize). |
| Confidence | Float | Confidence score (0–1) indicating the accuracy of detection. |
| TableFormat (only for tables) | Object | Metadata about table detection, including detection score and label. |
| Rows (only for tables) | Array | Collection of row objects that make up the table. |
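
As an illustration of how the Type, Confidence, and Bounds attributes can be used after extraction, here is a small Python sketch that filters page objects by type and confidence and converts a bounding box to corner coordinates. The helper names, the sample page, and the "Text" type value are hypothetical; the source only confirms "Table" as a Type value and [X, Y, Width, Height] as the Bounds layout.

```python
def select_objects(page: dict, kind: str, min_confidence: float = 0.5) -> list[dict]:
    """Keep objects of the given Type whose Confidence meets the threshold."""
    return [
        obj for obj in page.get("PageObjects", [])
        if obj.get("Type") == kind and obj.get("Confidence", 0.0) >= min_confidence
    ]

def bounds_to_corners(bounds: list[float]) -> tuple[float, float, float, float]:
    """Convert [X, Y, Width, Height] into (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bounds
    return (x, y, x + width, y + height)

# Hypothetical page fragment; "Text" is an assumed Type value.
SAMPLE_PAGE = {
    "PageObjects": [
        {"Type": "Table", "Bounds": [50.0, 100.0, 200.0, 80.0], "Confidence": 0.92},
        {"Type": "Table", "Bounds": [50.0, 300.0, 200.0, 60.0], "Confidence": 0.31},
        {"Type": "Text", "Bounds": [50.0, 40.0, 300.0, 20.0], "Confidence": 0.99},
    ]
}

tables = select_objects(SAMPLE_PAGE, "Table", min_confidence=0.5)
for table in tables:
    print(bounds_to_corners(table["Bounds"]))
```

Filtering on the consumer side like this is independent of any threshold applied during extraction, so it can be tightened per use case without re-running the extractor.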

Row Object

The Row Object represents a single horizontal group of cells within a table, along with its bounding box.

| Attribute | Type | Description |
|---|---|---|
| Type | String | Specifies the row type (for example, tr). |
| Rect | Array | Bounding box coordinates for the row. |
| Cells | Array | Collection of cell objects contained in the row. |

Cell Object

The Cell Object represents an individual table entry, containing text values, spanning details, and positional coordinates.

| Attribute | Type | Description |
|---|---|---|
| Type | String | Cell type (e.g., td). |
| Rect | Array | Bounding box coordinates for the cell. |
| RowSpan / ColSpan | Integer | Number of rows or columns spanned by the cell. |
| RowStart / ColStart | Integer | Starting row and column index of the cell. |
| Content.Value | String | Text content inside the cell. |
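
The row and cell attributes are enough to rebuild a table as a grid. The Python sketch below is one possible approach, not the library's own method: it maps every (row, column) position a cell covers to that cell's text, repeating the text of merged cells across their span. It assumes that `Content` is a nested object with a `Value` key (inferred from the `Content.Value` attribute name) and that missing spans default to 1; the `table_to_grid` helper and the sample table are hypothetical. Because positions are used as dictionary keys, the sketch does not depend on whether indices start at 0 or 1.

```python
def table_to_grid(table: dict) -> dict[tuple[int, int], str]:
    """Map each (row, column) position covered by a cell to the cell's text."""
    grid: dict[tuple[int, int], str] = {}
    for row in table.get("Rows", []):
        for cell in row.get("Cells", []):
            # Assumption: the text lives under Content -> Value.
            text = cell.get("Content", {}).get("Value", "")
            row_span = cell.get("RowSpan", 1)  # assumed default when absent
            col_span = cell.get("ColSpan", 1)  # assumed default when absent
            for r in range(cell["RowStart"], cell["RowStart"] + row_span):
                for c in range(cell["ColStart"], cell["ColStart"] + col_span):
                    grid[(r, c)] = text
    return grid

# Hypothetical table: a header cell merged across two columns, then two cells.
SAMPLE_TABLE = {
    "Type": "Table",
    "Rows": [
        {"Type": "tr", "Cells": [
            {"Type": "td", "RowStart": 0, "ColStart": 0, "RowSpan": 1, "ColSpan": 2,
             "Content": {"Value": "Totals"}},
        ]},
        {"Type": "tr", "Cells": [
            {"Type": "td", "RowStart": 1, "ColStart": 0, "RowSpan": 1, "ColSpan": 1,
             "Content": {"Value": "Q1"}},
            {"Type": "td", "RowStart": 1, "ColStart": 1, "RowSpan": 1, "ColSpan": 1,
             "Content": {"Value": "Q2"}},
        ]},
    ],
}

grid = table_to_grid(SAMPLE_TABLE)
for position in sorted(grid):
    print(position, grid[position])
```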

FormObjects

FormObjects represent interactive form fields detected on the page, such as text boxes, checkboxes, radio buttons, and signature regions. Each object includes positional data, dimensions, field type, and a confidence score that indicates detection reliability.

| Attribute | Type | Description |
|---|---|---|
| X / Y | Float | Coordinates of the form field on the page. |
| Width / Height | Float | Dimensions of the form field. |
| Type | Integer | Numeric identifier for the form field type (for example, 0 = TextArea, 1 = Checkbox, 2 = Radio Button, 3 = Signature). |
| Confidence | Float | Confidence score (0–1) indicating detection accuracy. |

NOTE

The FormObjects structure is not available in the Smart Table Extractor output.
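
Since the form field Type is a numeric identifier, consumers typically translate it to a readable name. The Python sketch below does this with the example values listed in the table above. It assumes that `X`, `Y`, `Width`, and `Height` are separate keys on each form object, and it falls back to an "Unknown" label because the listed identifiers are given only as examples and may not be exhaustive. The helper name and the sample page are hypothetical.

```python
# Example identifiers taken from the attribute table; the list may be incomplete.
FORM_TYPE_NAMES = {0: "TextArea", 1: "Checkbox", 2: "Radio Button", 3: "Signature"}

def describe_form_fields(page: dict, min_confidence: float = 0.0) -> list[dict]:
    """Return readable descriptions of the form fields detected on a page."""
    fields = []
    # FormObjects is absent in Smart Table Extractor output, hence the default.
    for field in page.get("FormObjects", []):
        if field.get("Confidence", 0.0) < min_confidence:
            continue
        type_id = field["Type"]
        fields.append({
            "kind": FORM_TYPE_NAMES.get(type_id, f"Unknown({type_id})"),
            # Assumption: position and size are stored as separate keys.
            "bounds": (field["X"], field["Y"], field["Width"], field["Height"]),
            "confidence": field["Confidence"],
        })
    return fields

# Hypothetical page fragment.
SAMPLE_FORM_PAGE = {
    "FormObjects": [
        {"X": 72.0, "Y": 120.0, "Width": 12.0, "Height": 12.0, "Type": 1, "Confidence": 0.97},
        {"X": 72.0, "Y": 160.0, "Width": 180.0, "Height": 24.0, "Type": 0, "Confidence": 0.42},
        {"X": 72.0, "Y": 400.0, "Width": 150.0, "Height": 40.0, "Type": 9, "Confidence": 0.88},
    ]
}

for field in describe_form_fields(SAMPLE_FORM_PAGE, min_confidence=0.5):
    print(field["kind"], field["bounds"], field["confidence"])
```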

Text Attribute

Represents the text formatting attributes (font family, font style, font size) applied to the extracted text.

| Attribute | Type | Description |
|---|---|---|
| FontName | String | Specifies the font family name used for the text (for example, "Arial"). |
| FontStyle | Integer | Specifies the numeric identifier for the font style (for example, 0 = Regular, 1 = Bold, 2 = Italic). |
| FontSize | Float | Specifies the font size used for the text. |
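
A short Python sketch of turning these attributes into a readable label. It takes a dictionary that carries the three text attributes directly, so it makes no assumption about where they sit inside the `Content` object. The style names cover only the example values listed above, and the `format_text_style` helper is hypothetical.

```python
# Example identifiers from the table above; other values may exist.
FONT_STYLE_NAMES = {0: "Regular", 1: "Bold", 2: "Italic"}

def format_text_style(attributes: dict) -> str:
    """Build a label such as 'Arial, Bold, size 12.0' from text attributes."""
    style_id = attributes.get("FontStyle", 0)
    style = FONT_STYLE_NAMES.get(style_id, f"Style {style_id}")
    return f"{attributes['FontName']}, {style}, size {attributes['FontSize']}"

label = format_text_style({"FontName": "Arial", "FontStyle": 1, "FontSize": 12.0})
print(label)
```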