Overview of Smart Data Extractor

26 Jun 2026 | 4 minutes to read

.NET Smart Data Extractor is a high‑performance, deterministic C# library for extracting structured document content from PDFs and images. It interprets visual layout patterns (lines, boxes, labels, and alignment) to identify and extract tables, text elements, images, headers, footers, and form fields. Each extracted element includes per‑field confidence scores, so results can be validated before they are exported or passed to an application.

Key Features of Syncfusion® Smart Data Extractor

The following list highlights the core capabilities of the Syncfusion® Smart Data Extractor:

Document structure extraction: detects text elements, images, headers/footers, and complete table structures (regions, header rows, columns, cell boundaries, merged cells).
File format support: works with PDF and common image formats such as JPEG and PNG.
Table extraction: specialized parsing to recover table rows, columns, header detection, and cell spans.
Form recognition: detects and extracts form fields (text inputs, checkboxes, radio buttons) with field types and values.
Page‑level control: extract data from specific pages or defined page ranges.
Confidence thresholding: filters results based on a configurable confidence score (0.0–1.0).
Deterministic performance: ensures predictable, repeatable extraction across environments including Windows, Linux, Azure, and Docker.
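The confidence thresholding feature can also be approximated on the client side once the JSON output is available. The following is a minimal Python sketch, not the library's own API: it assumes the result has already been serialized to the JSON shape described under "JSON Output Structure and Attributes", and the function name `filter_by_confidence` is hypothetical.

```python
def filter_by_confidence(result, threshold):
    """Return a copy of the result whose PageObjects meet the threshold.

    Assumes the documented JSON shape: result["Pages"][i]["PageObjects"][j]["Confidence"].
    This is post-processing of the output, not a call into the extractor itself.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")

    filtered_pages = []
    for page in result.get("Pages", []):
        kept = [
            obj for obj in page.get("PageObjects", [])
            # Objects without a Confidence value are treated as 0.0 (an assumption).
            if obj.get("Confidence", 0.0) >= threshold
        ]
        # Build a new page dict so the original result is left untouched.
        filtered_pages.append({**page, "PageObjects": kept})
    return {**result, "Pages": filtered_pages}


# Hypothetical sample data for illustration only.
sample = {
    "Pages": [
        {
            "PageNumber": 1,
            "PageObjects": [
                {"Type": "Table", "Confidence": 0.93},
                {"Type": "Table", "Confidence": 0.41},
            ],
        }
    ]
}
high_confidence = filter_by_confidence(sample, 0.6)
print(len(high_confidence["Pages"][0]["PageObjects"]))
```

Filtering after extraction keeps the raw output available, so a stricter or looser threshold can be tried without running the extraction again.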

JSON Output Structure and Attributes

The Syncfusion® Data Extraction libraries process PDFs and scanned images to extract structured document data—including tables, form fields, text elements, images, headers, and footers—by analyzing layout patterns, table regions, borders, alignment cues, and cell structures. The extracted output is returned as structured JSON with per‑field and per‑cell confidence scores, along with complete document and table hierarchies, making it ready for immediate review, export, or integration into downstream workflows.

Root Structure

Below is the root structure of the JSON result:

JSON
{
  "Pages": [
    {
      "PageNumber": 1,
      "Width": 0,
      "Height": 0,
      "PageObjects": [],
      "FormObjects": [] 
    }
  ]
}

NOTE

In the Smart Table Extractor root structure, the FormObjects element will not be present.
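As an illustration of how the root structure can be consumed, the sketch below loads the JSON text and summarizes each page. It is written in Python and assumes only the keys shown in the root structure; the helper name `summarize_pages` is hypothetical. Because FormObjects is absent from Smart Table Extractor output, the code treats a missing key as an empty list.

```python
import json


def summarize_pages(json_text):
    """Parse extractor JSON and return one summary dict per page."""
    result = json.loads(json_text)
    summary = []
    for page in result.get("Pages", []):
        summary.append({
            "page": page["PageNumber"],
            "size": (page["Width"], page["Height"]),
            "objects": len(page.get("PageObjects", [])),
            # FormObjects is not present in Smart Table Extractor output,
            # so a missing key is counted as zero form fields.
            "forms": len(page.get("FormObjects", [])),
        })
    return summary


# The root structure from this section, used as input.
root = """
{
  "Pages": [
    {
      "PageNumber": 1,
      "Width": 0,
      "Height": 0,
      "PageObjects": [],
      "FormObjects": []
    }
  ]
}
"""
for entry in summarize_pages(root):
    print(entry)
```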

JSON Attributes

Page Object

The Page Object represents the metadata of a page along with all the detected elements it contains in the Smart Data Extractor, and the table elements it contains in the Smart Table Extractor.

Attribute	Type	Description
PageNumber	Integer	Sequential number of the page in the document.
Width	Float	Page width in points/pixels.
Height	Float	Page height in points/pixels.
PageObjects	Array	List of objects detected on the page (for example, tables).
FormObjects	Array	List of detected form fields (checkboxes, text boxes, radio buttons, signatures, etc.).

NOTE

The FormObjects array is not included in the Smart Table Extractor output structure, as it is specific to the Smart Data Extractor and Smart Form Recognizer.

PageObjects

PageObjects are the elements detected on a page. In the Smart Data Extractor they include text, headers, footers, tables, images, and numbers; in the Smart Table Extractor they include only the detected table elements.

Attribute	Type	Description
Type	String	Defines the kind of object detected on the page (for example, Table).
Bounds	Array of Floats	The bounding box coordinates [X, Y, Width, Height] representing the object's position and size on the page.
Content	Object	Holds the extracted textual content along with its style attributes (FontName, FontStyle, FontSize) that describe the appearance of the text.
Confidence	Float	Confidence score (0–1) indicating the accuracy of detection.
TableFormat (only for tables)	Object	Metadata about table detection, including detection score and label.
Rows (only for tables)	Array	Collection of row objects that make up the table.
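The Bounds and Type attributes are often the first things a consumer needs. The Python sketch below converts a `[X, Y, Width, Height]` box into corner coordinates and selects objects by type. The helper names are hypothetical, and the conversion is pure arithmetic: this page does not state the coordinate origin or axis direction, so the code makes no assumption about them.

```python
def bounds_to_corners(bounds):
    """Convert [X, Y, Width, Height] into (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bounds
    return (x, y, x + width, y + height)


def objects_of_type(page, type_name):
    """Return the PageObjects on a page whose Type equals type_name."""
    return [
        obj for obj in page.get("PageObjects", [])
        if obj.get("Type") == type_name
    ]


# Hypothetical page data for illustration only.
page = {
    "PageNumber": 1,
    "PageObjects": [
        {"Type": "Table", "Bounds": [10, 20, 100, 50], "Confidence": 0.9},
    ],
}
for table in objects_of_type(page, "Table"):
    print(bounds_to_corners(table["Bounds"]))
```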

Row Object

The Row Object represents a single horizontal group of cells within a table, along with its bounding box.

Attribute	Type	Description
Type	String	Specifies the row type (for example, tr).
Rect	Array	Bounding box coordinates for the row.
Cells	Array	Collection of cell objects contained in the row.

Cell Object

The Cell Object represents an individual table entry, containing text values, spanning details, and positional coordinates.

Attribute	Type	Description
Type	String	Cell type (e.g., td).
Rect	Array	Bounding box coordinates for the cell.
RowSpan / ColSpan	Integer	Number of rows or columns spanned by the cell.
RowStart / ColStart	Integer	Starting row and column index of the cell.
Content.Value	String	Text content inside the cell.
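The row and cell attributes carry enough information to rebuild a rectangular grid. The following Python sketch shows one way to do that, under two assumptions that this page does not confirm: RowStart and ColStart are zero-based, and RowSpan and ColSpan are at least 1. If the indices turn out to be one-based, the grid gains an empty leading row and column. The function name `table_to_grid` is hypothetical.

```python
def table_to_grid(table):
    """Expand a table's Rows/Cells into a list of lists of strings.

    The text of a merged cell is repeated in every grid position it spans.
    Assumes zero-based RowStart/ColStart (not confirmed by the documentation).
    """
    placed = {}
    n_rows = 0
    n_cols = 0
    for row in table.get("Rows", []):
        for cell in row.get("Cells", []):
            row_start = cell["RowStart"]
            col_start = cell["ColStart"]
            # Treat a missing or zero span as a single row or column.
            row_span = max(1, cell.get("RowSpan", 1))
            col_span = max(1, cell.get("ColSpan", 1))
            text = cell.get("Content", {}).get("Value", "")
            for r in range(row_start, row_start + row_span):
                for c in range(col_start, col_start + col_span):
                    placed[(r, c)] = text
            n_rows = max(n_rows, row_start + row_span)
            n_cols = max(n_cols, col_start + col_span)
    # Positions that no cell covers are filled with empty strings.
    return [[placed.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical table: a header spanning two columns above two plain cells.
table = {
    "Rows": [
        {"Type": "tr", "Cells": [
            {"Type": "td", "RowStart": 0, "ColStart": 0, "RowSpan": 1, "ColSpan": 2,
             "Content": {"Value": "Totals"}},
        ]},
        {"Type": "tr", "Cells": [
            {"Type": "td", "RowStart": 1, "ColStart": 0, "RowSpan": 1, "ColSpan": 1,
             "Content": {"Value": "Q1"}},
            {"Type": "td", "RowStart": 1, "ColStart": 1, "RowSpan": 1, "ColSpan": 1,
             "Content": {"Value": "Q2"}},
        ]},
    ]
}
for line in table_to_grid(table):
    print(line)
```

Repeating merged-cell text keeps every row the same length, which makes the grid easy to write to CSV; a consumer that needs to preserve the merge should keep the span values instead.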

FormObjects

FormObjects represent interactive form fields detected on the page, such as text boxes, checkboxes, radio buttons, and signature regions. Each object includes positional data, dimensions, field type, and a confidence score that indicates detection reliability.

Attribute	Type	Description
X / Y	Float	Coordinates of the form field on the page.
Width / Height	Float	Dimensions of the form field.
Type	Integer	Numeric identifier for the form field type (for example, 0 = TextArea, 1 = Checkbox, 2 = Radio Button, 3 = Signature).
Confidence	Float	Confidence score (0–1) indicating detection accuracy.

NOTE

The FormObjects structure is not available in the Smart Table Extractor output.
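Because the form field Type is a numeric identifier, consumers usually map it to a readable name. The Python sketch below does this with the four example values listed in the table above; that list is given only as an example, so the mapping may be incomplete and unknown codes are reported rather than guessed. The names `FORM_FIELD_TYPES` and `describe_form_fields` are hypothetical.

```python
# Mapping taken from the examples in the FormObjects table; it may be incomplete.
FORM_FIELD_TYPES = {
    0: "TextArea",
    1: "Checkbox",
    2: "Radio Button",
    3: "Signature",
}


def describe_form_fields(page, min_confidence=0.0):
    """Return readable descriptions of a page's form fields.

    A page without FormObjects (as in Smart Table Extractor output)
    yields an empty list.
    """
    fields = []
    for field in page.get("FormObjects", []):
        confidence = field.get("Confidence", 0.0)
        if confidence < min_confidence:
            continue
        code = field.get("Type")
        fields.append({
            "kind": FORM_FIELD_TYPES.get(code, f"Unknown({code})"),
            "bounds": (field["X"], field["Y"], field["Width"], field["Height"]),
            "confidence": confidence,
        })
    return fields


# Hypothetical page data for illustration only.
page = {
    "FormObjects": [
        {"X": 40, "Y": 100, "Width": 12, "Height": 12, "Type": 1, "Confidence": 0.88},
        {"X": 40, "Y": 140, "Width": 200, "Height": 18, "Type": 0, "Confidence": 0.35},
    ]
}
for item in describe_form_fields(page, min_confidence=0.5):
    print(item)
```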

Text Attribute

Represents the text formatting attributes (font family, font style, font size) applied to the extracted text.

Attribute	Type	Description
FontName	String	Specifies the font family name used for the text (for example, "Arial").
FontStyle	Integer	Specifies the numeric identifier for the font style (for example, 0 = Regular, 1 = Bold, 2 = Italic).
FontSize	Float	Specifies the font size used for the text.
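The text attributes can be turned into a short human-readable label, as in the Python sketch below. It assumes the style attributes are direct keys of an object's Content, as the PageObjects table suggests, and it uses only the three FontStyle examples given above. Whether other values exist, or whether styles can be combined, is not stated on this page, so any other value is labeled as unknown. The function name `describe_text_style` is hypothetical.

```python
# Style names taken from the examples in the Text Attribute table.
FONT_STYLES = {0: "Regular", 1: "Bold", 2: "Italic"}


def describe_text_style(content):
    """Build a label such as 'Arial, Bold, 12.0' from a Content object."""
    name = content.get("FontName", "unknown font")
    code = content.get("FontStyle")
    style = FONT_STYLES.get(code, f"unknown style ({code})")
    size = content.get("FontSize")
    # The unit of FontSize is not documented here, so no unit is appended.
    size_text = "unknown size" if size is None else str(size)
    return f"{name}, {style}, {size_text}"


# Hypothetical Content object for illustration only.
content = {"Value": "Invoice", "FontName": "Arial", "FontStyle": 1, "FontSize": 12.0}
print(describe_text_style(content))
```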
