PDF Processing API (Deep Perception Engine)

Extract component information, specifications, and technical data from PDF datasheets and catalogs using AI-powered multi-agent analysis. The Deep Perception Engine uses GPT-4o-mini vision capabilities for accurate extraction.

6 Endpoints | GPT-4o-mini Vision | Multi-Agent Pipeline | Auto-Fill Forms

Overview

Capabilities

• Component Extraction: Name, manufacturer, specs
• Catalog Support: Multi-product PDFs
• Table Detection: Parse specification tables
• Image Analysis: Product diagrams

Processing Pipeline

1. Upload PDF (max 50MB)
2. Convert pages to images
3. GPT-4o-mini vision analysis
4. Structured JSON output

Component Extraction

Upload a PDF datasheet and extract structured component information automatically. For multi-product catalogs, provide a user instruction to specify which product to extract.

POST /v1/pdf/extract-component (Auth Required)

Extract Component from PDF

Upload PDF and extract component information. Supports both single-product datasheets and multi-product catalogs.

Request Body

// Form data (multipart/form-data)
{
  "pdf_file": <file>,
  "user_instruction": "Extract the M8 connector on page 3"  // Optional
}

Response

{
  "success": true,
  "component_data": {
    "name": "M8 4-Pin Connector",
    "manufacturer": "Festo",
    "model": "NEBU-M8G4-K-2.5-M8G3",
    "category": "CONNECTOR",
    "part_number": "550236",
    "specifications": {
      "pin_count": "4",
      "cable_length": "2.5m",
      "ip_rating": "IP67",
      "voltage_rating": "30V DC"
    }
  },
  "processing_details": {
    "pages_analyzed": 2,
    "extraction_confidence": 0.92,
    "catalog_detected": false
  },
  "pdf_url": "https://storage.sapienstream.com/org-123/pdfs/datasheet.pdf"
}
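Before using the extracted fields, a client will usually want to check `success` and the reported `extraction_confidence`. The sketch below assumes the response shape shown above. The 0.8 threshold and the `needs_review` helper are illustrative assumptions, not part of the API.

```python
import json

# Hypothetical cut-off; the API does not document a recommended threshold.
REVIEW_THRESHOLD = 0.8

def needs_review(response: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when an extract-component response should be checked by a person."""
    if not response.get("success"):
        return True
    details = response.get("processing_details", {})
    # Treat a missing confidence score as low confidence.
    confidence = details.get("extraction_confidence", 0.0)
    return confidence < threshold

# Trimmed copy of the sample response above.
sample = json.loads("""
{
  "success": true,
  "component_data": {"name": "M8 4-Pin Connector", "part_number": "550236"},
  "processing_details": {
    "pages_analyzed": 2,
    "extraction_confidence": 0.92,
    "catalog_detected": false
  }
}
""")
print(needs_review(sample))
```

Raising the threshold routes more extractions to manual review. A suitable value depends on how costly a wrong field is in the target form.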

Try it out

curl -X POST "https://sapienstream.com/api/v1/pdf/extract-component" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "pdf_file=@datasheet.pdf" \
  -F "user_instruction=Extract the M8 connector on page 3"
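The same upload can be sketched in Python with only the standard library. The endpoint URL and the `pdf_file` and `user_instruction` field names come from this page. The `build_extract_request` helper and the part layout are an illustrative assumption of how a client might encode the form, not an official SDK.

```python
import urllib.request
import uuid

API_URL = "https://sapienstream.com/api/v1/pdf/extract-component"

def build_extract_request(pdf_bytes, filename, api_key, user_instruction=None):
    """Build (but do not send) a multipart/form-data POST for extract-component."""
    boundary = uuid.uuid4().hex
    parts = []
    if user_instruction is not None:
        # Plain text form field.
        parts.append(
            f'--{boundary}\r\n'
            'Content-Disposition: form-data; name="user_instruction"\r\n\r\n'
            f'{user_instruction}\r\n'.encode("utf-8")
        )
    # File field: part headers, then the raw PDF bytes.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="pdf_file"; filename="{filename}"\r\n'
        'Content-Type: application/pdf\r\n\r\n'.encode("utf-8")
        + pdf_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

req = build_extract_request(
    b"%PDF-1.4 placeholder bytes",  # in practice: open(path, "rb").read()
    "datasheet.pdf",
    "YOUR_API_KEY",
    user_instruction="Extract the M8 connector on page 3",
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```

The content type must be `multipart/form-data` with a boundary, not `application/json`, because the body carries a file.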

POST /v1/pdf/detect-catalog (Auth Required)

Detect Catalog Structure

Analyze a PDF to detect if it's a multi-product catalog and list available products.

Request Body

// Form data (multipart/form-data)
{
  "pdf_file": <file>
}

Response

{
  "is_catalog": true,
  "product_count": 12,
  "products": [
    {
      "name": "M8 4-Pin Connector",
      "page_start": 2,
      "page_end": 3
    },
    {
      "name": "M12 8-Pin Connector",
      "page_start": 4,
      "page_end": 6
    }
  ],
  "suggestion": "Use user_instruction parameter to specify which product to extract"
}

Try it out

curl -X POST "https://sapienstream.com/api/v1/pdf/detect-catalog" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "pdf_file=@catalog.pdf"
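One way to use the detect-catalog response is to turn each listed product into a `user_instruction` string for a follow-up extract-component call. The sketch below assumes the response shape shown above. The `instructions_from_catalog` helper and the exact wording of the generated instructions are illustrative assumptions.

```python
def instructions_from_catalog(catalog: dict) -> list[str]:
    """Turn a detect-catalog response into one user_instruction string per product."""
    if not catalog.get("is_catalog"):
        return []  # Single-product datasheet: no instruction needed.
    instructions = []
    for product in catalog.get("products", []):
        start, end = product["page_start"], product["page_end"]
        pages = f"page {start}" if start == end else f"pages {start}-{end}"
        instructions.append(f"Extract the {product['name']} on {pages}")
    return instructions

# Products copied from the sample response above.
catalog = {
    "is_catalog": True,
    "product_count": 12,
    "products": [
        {"name": "M8 4-Pin Connector", "page_start": 2, "page_end": 3},
        {"name": "M12 8-Pin Connector", "page_start": 4, "page_end": 6},
    ],
}
for instruction in instructions_from_catalog(catalog):
    print(instruction)
```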

Complete Processing Pipeline

Full pipeline that uploads, analyzes, and returns component data ready for form auto-fill.

POST /v1/pdf/upload-and-analyze (Auth Required)

Upload and Analyze PDF

Complete processing: upload PDF to storage, extract component info, return structured data.

Request Body

// Form data (multipart/form-data)
{
  "file": <pdf_file>
}

Response

{
  "success": true,
  "message": "PDF processed successfully",
  "component_data": {
    "name": "Position Sensor SRBS",
    "manufacturer": "Festo",
    "category": "SENSOR",
    "part_number": "560781",
    "specifications": {
      "sensing_range": "0-120mm",
      "output_type": "Analog 0-10V",
      "supply_voltage": "24V DC",
      "protection_class": "IP65"
    },
    "description": "Analog position sensor for pneumatic cylinders"
  },
  "pdf_url": "https://storage.sapienstream.com/org-123/component-files/srbs-datasheet.pdf",
  "processing_details": {
    "steps_completed": ["upload", "text_extraction", "vision_analysis", "data_structuring"],
    "processing_time_ms": 3420,
    "tokens_used": 2150
  }
}

Try it out

curl -X POST "https://sapienstream.com/api/v1/pdf/upload-and-analyze" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@datasheet.pdf"
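For form auto-fill, the nested `component_data` usually has to be flattened into individual field values. The sketch below shows one possible mapping. The `to_form_fields` helper and the `spec_` prefix for specification keys are assumptions for illustration; the actual form schema is not described on this page.

```python
def to_form_fields(component_data: dict) -> dict:
    """Flatten component_data into a single-level dict of form field values."""
    # Copy the top-level fields (name, manufacturer, ...) as they are.
    fields = {k: v for k, v in component_data.items() if k != "specifications"}
    # Prefix each specification so it cannot collide with a top-level field.
    for key, value in component_data.get("specifications", {}).items():
        fields[f"spec_{key}"] = value
    return fields

# component_data copied from the sample response above.
component_data = {
    "name": "Position Sensor SRBS",
    "manufacturer": "Festo",
    "category": "SENSOR",
    "part_number": "560781",
    "specifications": {
        "sensing_range": "0-120mm",
        "output_type": "Analog 0-10V",
        "supply_voltage": "24V DC",
        "protection_class": "IP65",
    },
    "description": "Analog position sensor for pneumatic cylinders",
}
for field, value in to_form_fields(component_data).items():
    print(f"{field}: {value}")
```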

POST /v1/pdf/analyze-existing (Auth Required)

Analyze Existing PDF

Analyze a PDF that is already in storage, identified by its file_id.

Request Body

{
  "file_id": "550e8400-e29b-41d4-a716-446655440000"
}

Response

{
  "success": true,
  "component_data": {
    "name": "Servo Drive",
    "manufacturer": "Siemens",
    "model": "SINAMICS V90"
  },
  "file_id": "550e8400-e29b-41d4-a716-446655440000"
}

Try it out

curl -X POST "https://sapienstream.com/api/v1/pdf/analyze-existing" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"file_id": "550e8400-e29b-41d4-a716-446655440000"}'
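Unlike the upload endpoints, this one takes a JSON body. A minimal Python sketch of the request follows. The `build_analyze_existing_request` helper is hypothetical, and the local UUID check assumes that every `file_id` is a UUID like the one in the example.

```python
import json
import urllib.request
import uuid

API_URL = "https://sapienstream.com/api/v1/pdf/analyze-existing"

def build_analyze_existing_request(file_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the JSON POST for analyze-existing."""
    # Assumption: file IDs are UUIDs. uuid.UUID raises ValueError on malformed input.
    uuid.UUID(file_id)
    body = json.dumps({"file_id": file_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_analyze_existing_request(
    "550e8400-e29b-41d4-a716-446655440000", "YOUR_API_KEY"
)
print(req.get_method(), req.data.decode("utf-8"))
# To send it: urllib.request.urlopen(req)
```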

Processing Status

Check the status of asynchronous PDF processing tasks.

GET /v1/pdf/processing-status/{task_id} (Auth Required)

Get Processing Status

Check the status of an asynchronous PDF processing task.

Parameters

task_id (string, required): Task identifier returned from upload

Response

{
  "task_id": "task_abc123",
  "status": "completed",
  "progress": 100,
  "result": {
    "success": true,
    "component_data": { ... }
  },
  "created_at": "2024-08-26T15:30:00Z",
  "completed_at": "2024-08-26T15:30:05Z"
}

Try it out

curl -X GET "https://sapienstream.com/api/v1/pdf/processing-status/task_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY"
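Asynchronous tasks are typically polled until they finish. The sketch below shows one such loop with the HTTP call injected as a function so it can run without network access. Only the `"completed"` status appears on this page. The `"processing"` and `"failed"` values, the polling interval, and the `wait_for_task` helper are assumptions.

```python
import time
from typing import Callable

def wait_for_task(task_id: str,
                  fetch_status: Callable[[str], dict],
                  interval_s: float = 1.0,
                  max_attempts: int = 30,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll fetch_status(task_id) until the task completes, then return its result."""
    for _ in range(max_attempts):
        task = fetch_status(task_id)
        state = task.get("status")
        if state == "completed":
            return task["result"]
        if state == "failed":  # Assumed status value, not documented here.
            raise RuntimeError(f"task {task_id} failed")
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")

# Stand-in for a real GET /v1/pdf/processing-status/{task_id} call.
_fake_responses = iter([
    {"task_id": "task_abc123", "status": "processing", "progress": 40},
    {"task_id": "task_abc123", "status": "completed", "progress": 100,
     "result": {"success": True}},
])
result = wait_for_task("task_abc123", lambda _id: next(_fake_responses),
                       sleep=lambda _s: None)
print(result)
```

Capping the number of polls keeps a stuck task from blocking the caller indefinitely.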

GET /v1/pdf/supported-formats

Get Supported Formats

List supported file formats and processing capabilities.

Response

{
  "supported_formats": [".pdf"],
  "max_file_size_mb": 50,
  "max_pages": 100,
  "features": [
    "component_extraction",
    "table_detection",
    "catalog_analysis",
    "specification_parsing"
  ]
}

Try it out

curl -X GET "https://sapienstream.com/api/v1/pdf/supported-formats" \
  -H "Authorization: Bearer YOUR_API_KEY"
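The limits in this response can be checked on the client before uploading. The sketch below assumes the response shape shown above. The `check_against_limits` helper is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the page does not say how the limit is measured.

```python
import os

def check_against_limits(filename: str, size_bytes: int, page_count: int,
                         limits: dict) -> list[str]:
    """Return a list of reasons the file would be rejected (empty if none)."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in limits["supported_formats"]:
        problems.append(f"unsupported format: {extension or '(none)'}")
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    if size_bytes > limits["max_file_size_mb"] * 1024 * 1024:
        problems.append(f"file larger than {limits['max_file_size_mb']} MB")
    if page_count > limits["max_pages"]:
        problems.append(f"more than {limits['max_pages']} pages")
    return problems

# Limits copied from the sample response above.
limits = {"supported_formats": [".pdf"], "max_file_size_mb": 50, "max_pages": 100}
print(check_against_limits("datasheet.pdf", 10 * 1024 * 1024, 12, limits))
print(check_against_limits("big.pdf", 51 * 1024 * 1024, 120, limits))
```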

6-Stage Multi-Agent Pipeline

1. Page Agent: Analyzes PDF structure, identifies product pages vs index/legal content
2. Image Conversion: Converts relevant pages to high-quality images for vision analysis
3. Vision Agent: GPT-4o-mini analyzes images to extract text, tables, and diagrams
4. Extraction Agent: Identifies and extracts component name, specs, part numbers
5. Validation Agent: Cross-checks extracted data for consistency and completeness
6. Structuring Agent: Formats output as structured JSON matching component schema

Best Practices

PDF Quality

Use high-quality PDFs with clear text and images. Scanned documents work but may have lower accuracy than native PDFs.

Catalog Instructions

For multi-product catalogs, provide a specific user_instruction such as "Extract the M8 connector" or "Get specifications from page 5".

Token Usage

Each page analyzed consumes approximately 500-1500 tokens. For large PDFs, consider using detect-catalog first to identify relevant pages.
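The per-page figure gives a quick way to bound token usage before submitting a document. The sketch below multiplies a page count by the 500-1500 range quoted above. The helper names are illustrative, and the result is a rough estimate, not a billing figure.

```python
def estimate_tokens(pages: int, low_per_page: int = 500,
                    high_per_page: int = 1500) -> tuple[int, int]:
    """Return a (low, high) token estimate for analyzing the given number of pages."""
    return pages * low_per_page, pages * high_per_page

def pages_in_product(product: dict) -> int:
    """Count the pages of one entry from a detect-catalog products list (inclusive range)."""
    return product["page_end"] - product["page_start"] + 1

# A full 100-page catalog versus a single product spanning pages 4-6.
print(estimate_tokens(100))
product = {"name": "M12 8-Pin Connector", "page_start": 4, "page_end": 6}
print(estimate_tokens(pages_in_product(product)))
```

Under these figures, a 100-page catalog comes to roughly 50,000-150,000 tokens, while the three-page product comes to roughly 1,500-4,500.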