Overview of Smart Data Extractor
25 May 2026 · 4 minutes to read
Syncfusion® Smart Data Extractor is a high‑performance, deterministic C# library for extracting structured document content from PDFs and images. Tailored for modern .NET workflows, it interprets visual layout patterns—lines, boxes, labels, and alignment—to accurately identify and extract tables, text elements, images, headers, footers, and form fields. Each extracted element includes per‑field confidence scores, ensuring reliable validation, seamless export, and smooth integration into applications.
Key Features of Syncfusion® Smart Data Extractor
The following list highlights the core capabilities of the Syncfusion® Smart Data Extractor:
- Document structure extraction: detects text elements, images, headers/footers, and complete table structures (regions, header rows, columns, cell boundaries, merged cells).
- File format support: works with PDF and common image formats such as JPEG and PNG.
- Table extraction: specialized parsing to recover table rows, columns, header detection, and cell spans.
- Form recognition: detects and extracts form fields (text inputs, checkboxes, radio buttons) with field types and values.
- Page‑level control: extract data from specific pages or defined page ranges.
- Confidence thresholding: filters results based on a configurable confidence score (0.0–1.0).
- Deterministic performance: ensures predictable, repeatable extraction across environments including Windows, Linux, Azure, and Docker.
JSON Output Structure and Attributes
The Syncfusion® Data Extraction libraries process PDFs and scanned images to extract structured document data—including tables, form fields, text elements, images, headers, and footers—by analyzing layout patterns, table regions, borders, alignment cues, and cell structures. The extracted output is returned as structured JSON with per‑field and per‑cell confidence scores, along with complete document and table hierarchies, making it ready for immediate review, export, or integration into downstream workflows.
Root Structure
Below is the root structure of the JSON result:
{
  "Pages": [
    {
      "PageNumber": 1,
      "Width": 0,
      "Height": 0,
      "PageObjects": [],
      "FormObjects": []
    }
  ]
}
NOTE
In the Smart Table Extractor root structure, the FormObjects element will not be present.
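As a minimal sketch of consuming this root structure, the snippet below parses a result and walks its pages. It is written in Python purely for illustration and does not use the Syncfusion® API. The sample values and the helper name `summarize_pages` are hypothetical, and the second page omits `FormObjects` to mimic Smart Table Extractor output.

```python
import json

# Sample shaped like the root structure above; all values are illustrative.
raw = """
{
  "Pages": [
    {"PageNumber": 1, "Width": 612, "Height": 792,
     "PageObjects": [], "FormObjects": []},
    {"PageNumber": 2, "Width": 612, "Height": 792,
     "PageObjects": []}
  ]
}
"""


def summarize_pages(result):
    """Return (page number, object count, form field count) for each page."""
    summary = []
    for page in result.get("Pages", []):
        # FormObjects is absent in Smart Table Extractor output, so default to [].
        forms = page.get("FormObjects", [])
        objects = page.get("PageObjects", [])
        summary.append((page["PageNumber"], len(objects), len(forms)))
    return summary


result = json.loads(raw)
for number, objects, forms in summarize_pages(result):
    print(f"page {number}: {objects} objects, {forms} form fields")
```

Reading `FormObjects` with a default keeps the same code path usable for both output variants.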
JSON Attributes
Page Object
The Page Object holds the metadata of a page together with the elements detected on it: all detected elements in the Smart Data Extractor, and only table elements in the Smart Table Extractor.
| Attribute | Type | Description |
|---|---|---|
| PageNumber | Integer | Sequential number of the page in the document. |
| Width | Float | Page width in points/pixels. |
| Height | Float | Page height in points/pixels. |
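| PageObjects | Array | List of detected objects (text, headers, footers, tables, images, and so on in the Smart Data Extractor; tables only in the Smart Table Extractor). |
| FormObjects | Array | List of detected form fields (checkboxes, text boxes, radio buttons, signatures, etc.). |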
NOTE
The FormObjects array is not included in the Smart Table Extractor output structure, as it is specific to the Smart Data Extractor and Smart Form Recognizer.
PageObjects
Each entry in PageObjects describes one element detected on the page. In the Smart Data Extractor these elements include text, headers, footers, tables, images, and numbers; in the Smart Table Extractor they are limited to the detected tables.
| Attribute | Type | Description |
|---|---|---|
| Type | String | Kind of object detected on the page (for example, Table). |
| Bounds | Array of Floats | The bounding box coordinates [X, Y, Width, Height] representing the object's position and size on the page. |
| Content | Object | Holds the extracted textual content along with its style attributes (FontName, FontStyle, FontSize) that describe the appearance of the text. |
| Confidence | Float | Confidence score (0–1) indicating the accuracy of detection. |
| TableFormat (only for tables) | Object | Metadata about table detection, including detection score and label. |
| Rows (only for tables) | Array | Collection of row objects that make up the table. |
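The attributes above can be used to post-filter a parsed page, for example to keep only confidently detected tables. The following Python sketch operates on the JSON output only and is not the library's own thresholding option. The helper names, the sample values, and the `"Text"` type label are assumptions made for illustration.

```python
def select_objects(page, kind=None, min_confidence=0.0):
    """Return page objects matching an optional Type and a minimum Confidence."""
    selected = []
    for obj in page.get("PageObjects", []):
        if kind is not None and obj.get("Type") != kind:
            continue
        if obj.get("Confidence", 0.0) < min_confidence:
            continue
        selected.append(obj)
    return selected


def bounds_to_extent(bounds):
    """Convert [X, Y, Width, Height] into (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bounds
    return (x, y, x + width, y + height)


# Illustrative page; the values and the "Text" label are made up.
page = {
    "PageNumber": 1,
    "PageObjects": [
        {"Type": "Table", "Bounds": [50.0, 100.0, 400.0, 120.0], "Confidence": 0.93},
        {"Type": "Table", "Bounds": [50.0, 300.0, 400.0, 80.0], "Confidence": 0.41},
        {"Type": "Text", "Bounds": [50.0, 40.0, 200.0, 20.0], "Confidence": 0.98},
    ],
}

tables = select_objects(page, kind="Table", min_confidence=0.5)
for table in tables:
    print(table["Type"], bounds_to_extent(table["Bounds"]))
```

The extent helper makes no assumption about where the coordinate origin sits; it only adds the width and height to the starting point.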
Row Object
The Row Object represents a single horizontal group of cells within a table, along with its bounding box.
| Attribute | Type | Description |
|---|---|---|
| Type | String | Specifies the row type (for example, tr). |
| Rect | Array | Bounding box coordinates for the row. |
| Cells | Array | Collection of cell objects contained in the row. |
Cell Object
The Cell Object represents an individual table entry, containing text values, spanning details, and positional coordinates.
| Attribute | Type | Description |
|---|---|---|
| Type | String | Specifies the cell type (for example, td). |
| Rect | Array | Bounding box coordinates for the cell. |
| RowSpan / ColSpan | Integer | Number of rows or columns spanned by the cell. |
| RowStart / ColStart | Integer | Starting row and column index of the cell. |
| Content.Value | String | Text content inside the cell. |
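To show how the row and cell attributes fit together, here is a hedged Python sketch that expands a table object into a rectangular grid of strings. The source does not state whether `RowStart` and `ColStart` are zero-based or one-based, so the sketch normalizes against the smallest index it sees. The function name `table_to_grid` and the sample table are hypothetical.

```python
def _span(cell, key):
    """Read a span attribute, treating missing or non-positive values as 1."""
    return max(1, cell.get(key, 1))


def table_to_grid(table):
    """Expand a table's Rows/Cells into a rectangular list of lists of strings."""
    cells = [c for row in table.get("Rows", []) for c in row.get("Cells", [])]
    if not cells:
        return []
    # Normalize against the smallest start index so the sketch works whether
    # the indices are zero-based or one-based (the source does not say).
    row0 = min(c["RowStart"] for c in cells)
    col0 = min(c["ColStart"] for c in cells)
    n_rows = max(c["RowStart"] - row0 + _span(c, "RowSpan") for c in cells)
    n_cols = max(c["ColStart"] - col0 + _span(c, "ColSpan") for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        value = c.get("Content", {}).get("Value", "")
        r, k = c["RowStart"] - row0, c["ColStart"] - col0
        # Repeat the value across every slot that a merged cell covers.
        for dr in range(_span(c, "RowSpan")):
            for dc in range(_span(c, "ColSpan")):
                grid[r + dr][k + dc] = value
    return grid


# Illustrative table: a header cell merged across two columns, then two cells.
table = {
    "Type": "Table",
    "Rows": [
        {"Type": "tr", "Cells": [
            {"Type": "td", "RowStart": 0, "ColStart": 0, "RowSpan": 1,
             "ColSpan": 2, "Content": {"Value": "Totals"}},
        ]},
        {"Type": "tr", "Cells": [
            {"Type": "td", "RowStart": 1, "ColStart": 0, "RowSpan": 1,
             "ColSpan": 1, "Content": {"Value": "Q1"}},
            {"Type": "td", "RowStart": 1, "ColStart": 1, "RowSpan": 1,
             "ColSpan": 1, "Content": {"Value": "Q2"}},
        ]},
    ],
}

for line in table_to_grid(table):
    print(line)
```

Repeating a merged cell's value in every covered slot is a design choice that suits CSV-style export; leaving the extra slots empty is an equally reasonable alternative.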
FormObjects
FormObjects represent interactive form fields detected on the page, such as text boxes, checkboxes, radio buttons, and signature regions. Each object includes positional data, dimensions, field type, and a confidence score that indicates detection reliability.
| Attribute | Type | Description |
|---|---|---|
| X / Y | Float | Coordinates of the form field on the page. |
| Width / Height | Float | Dimensions of the form field. |
| Type | Integer | Numeric identifier for the form field type (for example, 0 = TextArea, 1 = Checkbox, 2 = Radio Button, 3 = Signature). |
| Confidence | Float | Confidence score (0–1) indicating detection accuracy. |
NOTE
The FormObjects structure is not available in the Smart Table Extractor output.
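Because the field type is a numeric code, consumers usually translate it before display. The Python sketch below does so using only the example codes listed in the table above; other codes may exist, so unknown values are passed through rather than rejected. The helper name `describe_form_fields` and the sample page are hypothetical.

```python
# Mapping taken from the examples in the table above; other codes may exist.
FORM_FIELD_TYPES = {0: "TextArea", 1: "Checkbox", 2: "Radio Button", 3: "Signature"}


def describe_form_fields(page, min_confidence=0.0):
    """Return readable descriptions of a page's form fields."""
    described = []
    # FormObjects is missing in Smart Table Extractor output, so default to [].
    for field in page.get("FormObjects", []):
        confidence = field.get("Confidence", 0.0)
        if confidence < min_confidence:
            continue
        code = field["Type"]
        described.append({
            "kind": FORM_FIELD_TYPES.get(code, f"Unknown({code})"),
            "box": (field["X"], field["Y"], field["Width"], field["Height"]),
            "confidence": confidence,
        })
    return described


# Illustrative page; all values are made up.
page = {
    "PageNumber": 1,
    "PageObjects": [],
    "FormObjects": [
        {"X": 72.0, "Y": 140.0, "Width": 200.0, "Height": 18.0,
         "Type": 0, "Confidence": 0.96},
        {"X": 72.0, "Y": 180.0, "Width": 12.0, "Height": 12.0,
         "Type": 1, "Confidence": 0.35},
        {"X": 72.0, "Y": 220.0, "Width": 150.0, "Height": 40.0,
         "Type": 7, "Confidence": 0.88},
    ],
}

for item in describe_form_fields(page, min_confidence=0.5):
    print(item["kind"], item["box"], item["confidence"])
```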
Text Attribute
The Text Attribute describes the formatting (font family, font style, font size) applied to the extracted text. These attributes are carried in the Content object of a page object.
| Attribute | Type | Description |
|---|---|---|
| FontName | String | Specifies the font family name used for the text (for example, "Arial"). |
| FontStyle | Integer | Specifies the numeric identifier for the font style (for example, 0 = Regular, 1 = Bold, 2 = Italic). |
| FontSize | Float | Specifies the font size used for the text. |
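As a small illustration of reading these attributes, the Python sketch below formats a Content object as a one-line description. It assumes that `Value` sits alongside `FontName`, `FontStyle`, and `FontSize` inside Content, which the source implies but does not show. It also treats `FontStyle` as a simple lookup using only the example codes above. The helper name `describe_text` and the sample values are hypothetical.

```python
# Codes taken from the examples in the table above; other values may exist.
FONT_STYLES = {0: "Regular", 1: "Bold", 2: "Italic"}


def describe_text(content):
    """Summarize a Content object's text and font attributes in one line."""
    value = content.get("Value", "")
    font = content.get("FontName", "unknown font")
    style = FONT_STYLES.get(content.get("FontStyle"), "Unknown")
    size = content.get("FontSize", "unknown")
    return f"{value!r} in {font} {style}, size {size}"


# Illustrative Content object; the nesting is an assumption.
content = {"Value": "Invoice", "FontName": "Arial", "FontStyle": 1, "FontSize": 14.0}
print(describe_text(content))
```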