Introduction

PDFDATA provides PDF data extraction as a service using pdfQL®, via in-browser tools as well as the HTTP API described in this document.

The examples in this documentation are currently shown in curl, the command-line HTTP client.

Setup

curl is available through most operating systems' package managers, or as a direct download.

While you can use curl itself to interact with PDFDATA, these examples are provided in part because curl invocations are widely understood as representations of HTTP requests. Until we provide a client library for your programming language, you should be able to use these example curl interactions to guide your integration with the PDFDATA API using your language's HTTP client implementation.

API Endpoints

If your PDFDATA account's service level includes region pinning — i.e. the ability to require that data and operations be strictly located in a single selected geographic region, usually to satisfy data residency requirements — your API endpoint will correspond with that region. For example, an EU PDFDATA API endpoint might be

https://api.eu-1.pdfdata.com

while a United States API endpoint might be

https://api.us-2.pdfdata.com

"Standard" PDFDATA accounts that do not include region pinning are randomly assigned to an API "pool"; API endpoints for these accounts always begin with an api.gN prefix, e.g.

https://api.g3.pdfdata.com

While unpinned accounts may be rebalanced from time to time to different compute regions, their associated domain name and API endpoints will not change.

Your PDFDATA account has an associated API endpoint, determined by the service level you've selected and, where applicable, by the geographic region you have opted to "pin" your account to.

To find your account's API endpoint and credentials, go to the "Settings" tab after logging into PDFDATA.

Throughout this documentation, samples will refer to one of the United States regions' PDFDATA API endpoints, https://api.us-1.pdfdata.com.

Authentication & Credentials

Throughout this documentation, samples will use a dummy API key:

api___key

Use either the -u option to provide your API key (remember to add a trailing colon, :, to prevent curl from asking for a corresponding password):

curl https://api.us-1.pdfdata.com/... \
  -u api___key: \
  .....

Or, you can splice your API key into the URL, just before the hostname:

curl https://api___key@api.us-1.pdfdata.com/...

Requests to the PDFDATA API must carry credentials, and must be made via HTTPS; any that do not will fail.

API requests are authenticated by HTTP Basic Auth. Your API key is the HTTP Basic username; the password is empty / not provided.
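If you are integrating without curl, the Basic Auth scheme described above can be sketched in Python using only the standard library. The helper name basic_auth_header is hypothetical (it is not part of any PDFDATA client library), and the key is the dummy api___key used throughout these samples.

```python
import base64


def basic_auth_header(api_key: str) -> dict:
    """Build an HTTP Basic Authorization header for a PDFDATA API key.

    Hypothetical helper for illustration; not part of a PDFDATA library.
    """
    # Basic credentials have the form "username:password". The password is
    # empty, which is why the curl samples end the key with a trailing colon.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# The dummy key from this documentation; substitute your own.
headers = basic_auth_header("api___key")
```

Most HTTP clients can also build this header for you when given a username and an empty password.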

You can view and manage your API keys in the "Settings" tab of the PDFDATA application. While all communications between your application and PDFDATA are secured by HTTPS / TLS, API keys enable access to your source documents and extracted data, so be sure to keep them secret! Do not leave your API keys in source control repositories, client-side code, or other widely accessible areas. We will honor any API request that includes your credentials as being authorized by you, so protect them accordingly.

Using the PDFDATA API

Once you have:

  1. An active PDFDATA account (create one now!)
  2. Set up your development environment and obtained your API key
  3. Written a pdfQL query via the PDFDATA application that is ready to be applied (or you've opted to have us write queries for you!)

…you can productively use the API, which always follows the same cycle:

  1. Upload your PDF files
  2. Run a pdfQL query
  3. Retrieve the extracted data from the query job

Errors

PDFDATA's HTTP API uses standard HTTP status codes and explanatory JSON bodies to indicate API request failures. All API-level errors can be broadly separated into problems with the request (these will provoke responses with 4xx status codes), or unexpected problems on our part in processing the request (resulting in a 5xx status code).

Example error response

HTTP/1.1 400 Bad Request
Content-Type: application/json
Content-Length: 55

{ "error": { "message": "No `source` parameter(s)" } }

PDFDATA error responses are JSON documents that describe the nature of the error as much as possible.

  Error object attributes
error An object describing the nature of the error. It will always include a message string attribute, which will contain a human-friendly description of the problem.
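A minimal sketch of how a client might turn these responses into exceptions follows. The names PdfdataError and raise_for_error are hypothetical; the only things taken from this document are the 4xx/5xx split and the error.message attribute.

```python
import json


class PdfdataError(Exception):
    """Hypothetical exception type for this sketch."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message
        # 4xx: a problem with the request; 5xx: a problem on the service side.
        self.is_client_error = 400 <= status < 500


def raise_for_error(status: int, body: bytes) -> None:
    """Raise PdfdataError for 4xx/5xx responses; do nothing otherwise."""
    if status < 400:
        return
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        # The body did not have the documented shape; keep the raw text.
        message = body.decode("utf-8", errors="replace")
    raise PdfdataError(status, message)


# Usage with the example error response shown above.
try:
    raise_for_error(400, b'{ "error": { "message": "No `source` parameter(s)" } }')
    caught = None
except PdfdataError as exc:
    caught = exc
```

Keeping the status code on the exception lets calling code decide whether to fix the request (4xx) or retry later (5xx).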

Files

PDFDATA can only work with PDF files that have been uploaded to it, via either the in-browser application or the API. Once uploaded, you can run pdfQL queries over them to extract the data you care about.

Uploading files

Definition

POST https://api.us-1.pdfdata.com/v1/file

Example request

curl "https://api.us-1.pdfdata.com/v1/file?name=Weekly%20invoices" \
  -u api___key: \
  -F pdf=@{PATH_TO_PDF} \
  -F pdf=@{PATH_TO_PDF2}

Documents are uploaded by sending a multipart/form-data POST request, which can contain one or many documents along with their filenames.

  Query parameters
name optional The name that should be given to the upload job (useful for finding uploads in the in-browser PDFDATA application)
  Request parameters
pdf Each pdf parameter's value should consist of the raw binary content of a source PDF document. The filename document attribute is sourced from the filename property of each pdf request parameter. Any provided type attribute is ignored; every range of pdf data is assumed to be a PDF document.
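For readers not using curl, the following is a rough sketch of the same upload request built with Python's standard library. The function name build_upload_request is hypothetical, the file content is a placeholder, and the request is only constructed here, not sent. It assumes filenames contain no double-quote characters.

```python
import base64
import uuid
from urllib.parse import quote
from urllib.request import Request


def build_upload_request(endpoint, api_key, files, name=None):
    """Build (but do not send) a multipart/form-data upload request.

    `files` is a list of (filename, pdf_bytes) pairs; each pair becomes one
    `pdf` part whose filename property carries the document's name.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for filename, content in files:
        part_header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="pdf"; filename="{filename}"\r\n'
            # The docs say any provided type is ignored; sent for completeness.
            "Content-Type: application/pdf\r\n\r\n"
        )
        chunks.append(part_header.encode("utf-8"))
        chunks.append(content)
        chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("ascii"))

    url = f"{endpoint}/v1/file"
    if name is not None:
        url += "?name=" + quote(name)  # percent-encode spaces and the like

    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return Request(url, data=b"".join(chunks), headers=headers, method="POST")


request = build_upload_request(
    "https://api.us-1.pdfdata.com",
    "api___key",
    [("Return.pdf", b"%PDF-1.4 placeholder bytes")],
    name="Weekly invoices",
)
# urllib.request.urlopen(request) would actually send the upload.
```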

When successful, file uploads will yield a 201 Created HTTP response, the body of which will be a JSON-encoded representation of the upload job associated with the files that were sent. Take note of the job ID and the IDs assigned to the individual files you provided, as those will be needed when issuing requests to run pdfQL queries over those files.

Example response

HTTP/1.1 201 Created
Content-Type: application/json
Content-Length: 428

{
  "id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
  "name": "Weekly invoices",
  "files": [
    {
      "id": "f831017ab-b130-49e7-b39e-5a13c056fd8d",
      "name": "Return.pdf",
      "hash": "16285d42df088864f77c6d33f1f300e2e76eb621"
    },
    {
      "id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
      "name": "18-30550_1574638_7-23-2018_4567527.pdf",
      "hash": "4055d3bd01939c34a25d08a4d0078e6cc5cbcbc3"
    }
  ]
}
  Upload job response
id The unique ID of the upload job
name optional The name given to the upload job
files An array of file objects
  File object
id The unique ID assigned to the named file
name The name of the file, as provided at the time of the upload
hash The SHA-1 hash of the file's contents, useful for verifying PDFDATA's accurate receipt of each file's contents.
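One way to use the hash attribute is to compare it with a locally computed SHA-1 digest after each upload. The sketch below assumes the hash is the lowercase hexadecimal digest of the file's bytes (which matches the 40-character values in the example response) and that local files are matched by name; mismatched_files is a hypothetical helper.

```python
import hashlib


def mismatched_files(local_files, upload_job):
    """Return the names of uploaded files whose reported hash differs locally.

    `local_files` maps filename -> bytes; `upload_job` is the decoded JSON
    body of an upload job response.
    """
    mismatches = []
    for entry in upload_job["files"]:
        content = local_files.get(entry["name"])
        if content is None:
            continue  # no local copy under that name to compare against
        if hashlib.sha1(content).hexdigest() != entry["hash"]:
            mismatches.append(entry["name"])
    return mismatches


# Usage with placeholder data standing in for real PDFs and a real response.
pdf_bytes = b"placeholder PDF bytes"
job = {
    "id": "j-example",
    "files": [
        {"id": "f-1", "name": "a.pdf", "hash": hashlib.sha1(pdf_bytes).hexdigest()},
        {"id": "f-2", "name": "b.pdf", "hash": "0" * 40},
    ],
}
bad = mismatched_files({"a.pdf": pdf_bytes, "b.pdf": pdf_bytes}, job)
```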

Queries

PDFDATA is built around pdfQL®; you interact with pdfQL queries in two ways:

  1. By authoring, testing, and maintaining queries in the PDFDATA application
  2. By running those pdfQL queries against one or more source PDF files, via either the PDFDATA application or the query facilities provided by this API.

Definition

POST https://api.us-1.pdfdata.com/v1/query/<QUERY_ID>

You can run a pdfQL query by issuing a POST request to the URL corresponding to that query, indicating which files are to be queried. To do this, find the query ID at the top of its page in the PDFDATA application (it will be a long alphanumeric string, prefixed with a q), and append it to the root query resource URL.

Example query-running request

curl https://api.us-1.pdfdata.com/v1/query/qc92d0c4f-e133-46c9-81bb-823913a6c5ba \
  -u api___key: \
  -d "name=Membership 2022" \
  -d source=jf6dfeae2-ca66-42f2-9dbf-6845cfb761d6 \
  -d source=f831017ab-b130-49e7-b39e-5a13c056fd8d
  Request parameters
name optional The name that should be given to the query job (useful for finding a particular job in the in-browser PDFDATA application)
source IDs of the files to be queried, or document acquisition job IDs that identify groups of files to be queried.
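The request above can be approximated in Python as follows. This is a sketch under assumptions: build_query_request is a hypothetical name, the body is form-encoded in the same way curl's -d option sends it, and the request is built but not sent.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(endpoint, api_key, query_id, sources, name=None):
    """Build (but do not send) a POST that runs a pdfQL query over `sources`."""
    fields = []
    if name is not None:
        fields.append(("name", name))
    # One `source` field per file ID or job ID, mirroring the repeated -d flags.
    fields.extend(("source", source) for source in sources)
    body = urlencode(fields).encode("ascii")

    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return Request(
        f"{endpoint}/v1/query/{query_id}",
        data=body,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


query_request = build_query_request(
    "https://api.us-1.pdfdata.com",
    "api___key",
    "qc92d0c4f-e133-46c9-81bb-823913a6c5ba",
    [
        "jf6dfeae2-ca66-42f2-9dbf-6845cfb761d6",
        "f831017ab-b130-49e7-b39e-5a13c056fd8d",
    ],
    name="Membership 2022",
)
# urllib.request.urlopen(query_request) would actually start the query job.
```

Passing a list of pairs to urlencode, rather than a dict, is what allows the source field to appear more than once.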

Example new query job response

HTTP/1.1 202 Accepted
Content-Type: application/json
Location: /job/jf5f4d319-d942-4255-b8ca-4174539dda84
Content-Length: 235

{
  "id": "jf5f4d319-d942-4255-b8ca-4174539dda84",
  "query_id": "qc92d0c4f-e133-46c9-81bb-823913a6c5ba",
  "name": "Membership 2022",
  "files": [
    "f831017ab-b130-49e7-b39e-5a13c056fd8d",
    "fdde795a2-c907-44cc-9a27-f707e6c6df17"
  ]
}

This will start an (asynchronous) query job, and respond with some metadata about it, including a Location header providing the API's URL for the new job where you can "pick up" the query results, as well as a listing of the files that are being queried.

  Pending query job response
id The unique ID of the query job
query_id The ID of the pdfQL query being run in the job
name optional The name given to the job
files An array of file IDs; if the query-running request included any document acquisition job IDs as sources, those jobs' files will be resolved and included here.

Jobs

Definition

GET https://api.us-1.pdfdata.com/v1/job/<JOB_ID>

A PDFDATA job is any long-running process. The jobs described in this document are upload jobs and query jobs.

Example job request

curl https://api.us-1.pdfdata.com/v1/job/jc92d0c4f-e133-46c9-81bb-823913a6c5ba \
  -u api___key:

To retrieve a job's status (and in the case of a query job, the results of a pdfQL query), issue a GET request to the corresponding job resource.

Example (upload) job response

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 428

{
  "id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
  "name": "Weekly invoices",
  "files": [
    {
      "id": "f831017ab-b130-49e7-b39e-5a13c056fd8d",
      "name": "Return.pdf",
      "hash": "16285d42df088864f77c6d33f1f300e2e76eb621"
    },
    {
      "id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
      "name": "18-30550_1574638_7-23-2018_4567527.pdf",
      "hash": "4055d3bd01939c34a25d08a4d0078e6cc5cbcbc3"
    }
  ]
}

When successful and complete, job requests will yield a 200 OK HTTP response, the body of which will be a JSON-encoded representation of the requested job; the specifics of that representation vary based on the type of job that was requested.

If a job is still pending, job requests will yield a 202 Accepted response, and may include status information (e.g. a list of counts indicating how many files are yet to be processed in the course of a pdfQL query job).
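The 200 / 202 distinction lends itself to a simple polling loop. The sketch below is one possible shape, not a prescribed client: fetch_job is a caller-supplied function (hypothetical) that performs the GET and returns the status code with the decoded JSON body, and the interval and attempt limit are arbitrary defaults chosen for illustration.

```python
import time


def wait_for_job(fetch_job, job_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll a job until it completes and return its JSON representation.

    `fetch_job(job_id)` must return (status_code, decoded_json_body).
    """
    for _ in range(max_attempts):
        status_code, body = fetch_job(job_id)
        if status_code == 200:
            return body  # job finished; body is the completed job
        if status_code != 202:
            raise RuntimeError(f"unexpected status {status_code} for job {job_id}")
        sleep(interval)  # still pending; wait before asking again
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} attempts")


# Usage with canned responses standing in for real HTTP calls.
canned = iter([
    (202, {"id": "j-example", "status": {"waiting": 1}}),
    (200, {"id": "j-example", "results": []}),
])
finished = wait_for_job(lambda job_id: next(canned), "j-example", sleep=lambda s: None)
```

Injecting fetch_job and sleep keeps the loop independent of any particular HTTP client and easy to exercise without network access.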

Example (query) job response

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 214
 
{
  "id": "j6fbbdb41-26d9-49e3-89a1-20276903b596",
  "query_id": "q61432972-baaa-4829-ad41-d8ebbdb8a6cf",
  "name": "testquery-123",
  "results": {
    "solutions": [ { "data": [ "AZ", "AY", "AX" ], "score": 6 } ]
  }
}
  Common job attributes
id The unique ID of the requested job
name optional The name of the requested job

In addition to the common attributes above, specific types of job objects also contain additional attributes:

  Upload job attributes
files An array of file objects
  Completed query job attributes
query_id The ID of the pdfQL query applied to the PDF files in the job.
results A list of result objects, one per successfully-queried file.
  Pending query job attributes
query_id The ID of the pdfQL query applied to the PDF files in the job.
status A status object indicating the aggregate status of querying each of the source PDF files in the job.
  Result object
source An object indicating the file from which the result data was extracted, contains (file) id and (file) name attributes.
solutions A list of "solution" objects, each containing a data attribute that is a list of the data elements produced by the pdfQL query. The structure of those data elements is not documented here, as they can vary significantly depending on the structure and options specified by the pdfQL query; see our documentation of pdfQL itself elsewhere for details on this subject.
  Status object
waiting optional The number of files that have not yet been processed.
running optional The number of files currently being processed.
recovering optional  
timeout optional The number of files that could not be processed within the quota window associated with the owning account.
failed optional The number of files that produced hard failures upon querying.
done optional The number of files that have been successfully queried.
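Because every count in the status object is optional, client code should treat missing keys as zero. The following sketch summarizes a status object under two assumptions that this document does not confirm: that each file is counted under exactly one key, and that recovering files are still in progress. summarize_status is a hypothetical helper.

```python
STATUS_KEYS = ("waiting", "running", "recovering", "timeout", "failed", "done")


def summarize_status(status):
    """Return (total_files, unfinished, problems) from a status object.

    Missing keys are treated as 0, since every count is optional.
    """
    counts = {key: status.get(key, 0) for key in STATUS_KEYS}
    total = sum(counts.values())
    # Assumption: "recovering" files are still in progress.
    unfinished = counts["waiting"] + counts["running"] + counts["recovering"]
    problems = counts["timeout"] + counts["failed"]
    return total, unfinished, problems


summary = summarize_status({"waiting": 3, "running": 2, "done": 5})
```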