Introduction

PDFDATA provides PDF data extraction as a service using pdfQL™, via in-browser tools as well as the HTTP API described in this document.

You can use any HTTP client with the PDFDATA API, but the examples in this documentation are currently shown in curl, a popular command-line HTTP client available for just about any operating system.

Setup

curl is probably available via your operating system's package manager, or as a direct download.

You can interact with PDFDATA using curl itself, but these examples are written as curl invocations mainly because they are widely understood as representations of HTTP requests. You can use curl directly, use its library variant libcurl, or treat these example interactions as a guide for integrating with the PDFDATA API through your language's HTTP client implementation.

API Endpoints

If your PDFDATA account's service level includes region pinning — the ability to require that data and operations be strictly located in a single geographic region, usually to satisfy data residency requirements — your API endpoint will correspond with that region. For example, an EU PDFDATA API endpoint might be
https://eu1.api.pdfdata.com
while a United States API endpoint might be
https://us2.api.pdfdata.com
"Basic" PDFDATA accounts that do not include region pinning are randomly assigned to an API "pool"; API endpoints for these accounts always begin with a basicN.api prefix, e.g.
https://basic1.api.pdfdata.com
While unpinned accounts may be rebalanced from time to time to different compute regions, their associated domain name and API endpoints will not change.

Your PDFDATA account has an associated API endpoint, determined by the service level you've selected, and then potentially which geographic region you have opted to "pin" your account to.

To find your account's API endpoint and credentials, go to the "Account" tab after logging into PDFDATA, and look for the "API Access" section.

Throughout this documentation, samples will refer to one of the United States regions' PDFDATA API endpoints, https://us1.api.pdfdata.com.

Authentication & Credentials

Throughout this documentation, samples will use a dummy API key:

api___key

Use either the -u option to provide your API key (remember to add a trailing colon, :, to prevent curl from asking for a corresponding password):
curl https://us1.api.pdfdata.com/... \
  -u api___key: \
  .....
Or, you can splice your API key into the URL, just before the hostname:
curl https://api___key@us1.api.pdfdata.com/...

Requests to the PDFDATA API must carry credentials, and must be made via HTTPS; any that do not will fail.

API requests are authenticated by HTTP Basic Auth. Your API key is the HTTP Basic username; the password is empty / not provided.
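
For integrations that do not use curl, the scheme above can be sketched with Python's standard library. This is a minimal illustration, not an official client: the helper names `basic_auth_header` and `authed_request` are hypothetical, and the endpoint and key are the sample values used throughout this documentation.

```python
import base64
import urllib.request

API_ENDPOINT = "https://us1.api.pdfdata.com"  # sample endpoint from this documentation
API_KEY = "api___key"                         # dummy key from this documentation


def basic_auth_header(api_key: str) -> str:
    """Build an HTTP Basic Authorization value: key as username, empty password."""
    credentials = f"{api_key}:".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def authed_request(url: str, api_key: str = API_KEY) -> urllib.request.Request:
    """Return an unsent Request carrying the credentials; refuse non-HTTPS URLs."""
    if not url.startswith("https://"):
        raise ValueError("PDFDATA API requests must be made via HTTPS")
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(api_key))
    return request
```

The trailing colon in the encoded credentials plays the same role as the trailing colon in `-u api___key:`, marking the password as empty.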

You can view and manage your API keys in the "Account" tab of the PDFDATA application. While all communications between your application and PDFDATA are secured by HTTPS / TLS, API keys enable access to your source documents and extracted data, so be sure to keep them secret! Do not leave your API keys in source control repositories, client-side code, or other widely-accessible areas. We will honor any API request that includes your credentials as being authorized by you, so protect them accordingly.

Using the PDFDATA API

Once you have:

An active PDFDATA account (create one now!)
Set up your development environment and obtained your API key
Written a pdfQL query via the PDFDATA application that is ready to be applied (or you've opted to have us write queries for you!)

…you can productively use the API, which always follows the same cycle:

Upload your PDF files
Run a pdfQL query
Either:
1. Receive the extracted data via webhook delivery, OR
2. Retrieve the extracted data from the query job

Errors

Example error response

PDFDATA's HTTP API uses standard HTTP status codes and explanatory JSON bodies to indicate API request failures. All API-level errors can be broadly separated into problems with the request (these will provoke responses with 4xx status codes), or unexpected problems on our part in processing the request (resulting in a 5xx status code).

HTTP/1.1 400 Bad Request
Content-Type: application/json
Content-Length: 55

{ "error": { "message": "No `source` parameter(s)" } }

PDFDATA error responses are JSON documents that describe the nature of the error as much as possible.

	Error object attributes
error	An object describing the nature of the error. It will always include a `message` string attribute, which will contain a human-friendly description of the problem.
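
A client might convert such responses into exceptions along the following lines. This is only a sketch: the `PdfdataError` class and `error_from_response` helper are hypothetical names, and the fallback for bodies that do not match the documented shape is an assumption, not documented behavior.

```python
import json


class PdfdataError(Exception):
    """Hypothetical exception type wrapping a PDFDATA error response."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message
        # 4xx: problem with the request; 5xx: problem on the service side.
        self.is_client_error = 400 <= status < 500


def error_from_response(status: int, body: bytes) -> PdfdataError:
    """Turn a failed response's status code and JSON body into a PdfdataError."""
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        # Assumption: fall back to the raw body if it is not the documented shape.
        message = body.decode("utf-8", errors="replace")
    return PdfdataError(status, message)
```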

Files

You need to get your PDF files into PDFDATA in order for it to work with them; this is done by uploading files via either the in-browser application, or via the API. Once uploaded, you can run pdfQL queries over them to extract the data you care about.

Uploading files

Definition

POST https://us1.api.pdfdata.com/v1/file

Example request

curl "https://us1.api.pdfdata.com/v1/file?name=Weekly%20invoices" \
  -u api___key: \
  -F pdf=@{PATH_TO_PDF} \
  -F pdf=@{PATH_TO_PDF2}

Documents are uploaded by sending a multipart/form-data POST request, which can contain one or many documents along with their filenames.

	Query parameters
name optional	The name that should be given to the upload job (useful for finding uploads in the in-browser PDFDATA application)
tag optional	The tag that should be added to each of the uploaded files. To add multiple tags to the set of uploaded files, use multiple `tag` parameters.

	Request parameters
pdf	Each `pdf` parameter's value should consist of the raw binary content of a source PDF document. The `filename` document attribute is sourced from the `filename` property of each `pdf` request parameter. Any provided `type` attribute is ignored; the content of every `pdf` parameter is assumed to be a PDF document.

When successful, file uploads will yield a 201 Created HTTP response, the body of which will be a JSON-encoded representation of the upload job associated with the files that were sent. Take note of the job ID and the IDs assigned to the individual files you provided, as those will be needed when issuing requests to run pdfQL queries over those files.
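
For readers not using curl's `-F` option, the upload request described above could be assembled by hand roughly as follows. This is a sketch using only Python's standard library: `build_upload_request` is a hypothetical helper, it assumes filenames contain no double quotes, and it builds the request without sending it (pass the result to `urllib.request.urlopen` to send).

```python
import base64
import urllib.parse
import urllib.request
import uuid


def build_upload_request(endpoint, api_key, files, name=None, tags=()):
    """Build (without sending) a multipart/form-data upload request.

    `files` is a list of (filename, content_bytes) pairs.
    """
    # Query parameters: an optional job name, and zero or more tags.
    query = [("name", name)] if name else []
    query += [("tag", tag) for tag in tags]
    url = f"{endpoint}/v1/file"
    if query:
        url += "?" + urllib.parse.urlencode(query, quote_via=urllib.parse.quote)

    # One `pdf` part per file, each carrying its filename.
    boundary = uuid.uuid4().hex
    body = b""
    for filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="pdf"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n"
            "\r\n"
        )
        body += header.encode("utf-8") + content + b"\r\n"
    body += f"--{boundary}--\r\n".encode("ascii")

    request = urllib.request.Request(url, data=body, method="POST")
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return request
```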

Example response

HTTP/1.1 201 Created
Content-Type: application/json
Content-Length: 428

{
  "id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
  "name": "Weekly invoices",
  "files": [
    {
      "id": "f831017ab-b130-49e7-b39e-5a13c056fd8d",
      "name": "Return.pdf",
      "hash": "16285d42df088864f77c6d33f1f300e2e76eb621"
    },
    {
      "id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
      "name": "18-30550_1574638_7-23-2018_4567527.pdf",
      "hash": "4055d3bd01939c34a25d08a4d0078e6cc5cbcbc3"
    }
  ]
}

	Upload job response
id	The unique ID of the upload job
name optional	The name given to the upload job
files	An array of file objects

	File object
id	The unique ID assigned to the named file
name	The name of the file, as provided at the time of the upload
hash	The SHA-1 hash of the file's contents, useful for verifying PDFDATA's accurate receipt of each file's contents.
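
One way to use the `hash` attribute is to compare it against a locally computed digest. The sketch below assumes the hash is the hex-encoded SHA-1 of the raw file bytes (consistent with the 40-character values in the example response) and that filenames within an upload are unique; `mismatched_files` is a hypothetical helper name.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """Hex-encoded SHA-1 digest of the given bytes."""
    return hashlib.sha1(content).hexdigest()


def mismatched_files(local_files: dict, upload_job: dict) -> list:
    """Return names of uploaded files whose reported hash differs from the local one.

    `local_files` maps filename -> bytes as sent; `upload_job` is the decoded
    JSON body of the upload response.
    """
    mismatches = []
    for file_obj in upload_job["files"]:
        local = local_files.get(file_obj["name"])
        if local is None or sha1_hex(local) != file_obj["hash"].lower():
            mismatches.append(file_obj["name"])
    return mismatches
```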

Queries

PDFDATA is built around pdfQL™; you interact with pdfQL queries in two ways:

By authoring, testing, and maintaining queries in the PDFDATA application
By running those pdfQL queries against one or many source PDF files, via either the PDFDATA application, or via the query facilities provided by this API.

Definition

POST https://us1.api.pdfdata.com/v1/query/<QUERY_ID>

You can run a pdfQL query by issuing a POST request to the URL corresponding to that query, indicating which files should be queried. To do this, find the query ID at the top of the query's page in the PDFDATA application (it will be a string like ~/<QUERY_NAME>), and append it to the root query resource URL.

Example query-running request

curl https://us1.api.pdfdata.com/v1/query/~/roll-call-query \
  -u api___key: \
  -d "name=Membership 2022" \
  -d "webhook_url=https://...." \
  -d source=jf6dfeae2-ca66-42f2-9dbf-6845cfb761d6 \
  -d source=f831017ab-b130-49e7-b39e-5a13c056fd8d

	Request parameters
name optional	The name that should be given to the query job (useful for finding a particular job in the in-browser PDFDATA application)
webhook_url optional	The URL that should receive a webhook request once the query job is complete.
source	The ID of the file to be queried, or a tag that should be used to find and include all files that have that tag. To include multiple files or tags, use multiple `source` parameters.
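
The same request can be sketched in Python's standard library. `build_query_request` is a hypothetical helper; it assumes the parameters are sent as an `application/x-www-form-urlencoded` body (which is what curl's `-d` option produces) and it builds the request without sending it.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(endpoint, api_key, query_id, sources, name=None, webhook_url=None):
    """Build (without sending) the POST request that starts a pdfQL query job."""
    fields = []
    if name:
        fields.append(("name", name))
    if webhook_url:
        fields.append(("webhook_url", webhook_url))
    # One `source` field per file ID or tag.
    fields += [("source", source) for source in sources]
    body = urllib.parse.urlencode(fields).encode("ascii")

    # The query ID (e.g. "~/roll-call-query") is appended to the query resource URL.
    url = f"{endpoint}/v1/query/{query_id}"
    request = urllib.request.Request(url, data=body, method="POST")
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request
```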

Example new query job response

HTTP/1.1 202 Accepted
Content-Type: application/json
Location: /job/jf5f4d319-d942-4255-b8ca-4174539dda84
Content-Length: 235

{
  "id": "jf5f4d319-d942-4255-b8ca-4174539dda84",
  "query_id": "~/roll-call-query",
  "name": "Membership 2022",
  "files": [
    "f831017ab-b130-49e7-b39e-5a13c056fd8d",
    "fdde795a2-c907-44cc-9a27-f707e6c6df17"
  ]
}

A successful request starts an asynchronous query job. The response carries some metadata about the job: a Location header providing the API's URL for the new job, where you can retrieve its status and results, and a listing of the files that are being queried.

	Pending query job response
id	The unique ID of the query job
query_id	The ID of the pdfQL query being run in the job
name optional	The name given to the job
files	An array of file IDs; if the query-running request included any tags as `source`s, all files with those tags will be resolved and included here.

Jobs

Definition

GET https://us1.api.pdfdata.com/v1/job/<JOB_ID>

A PDFDATA job is any long-running process. Examples include:

Acquiring files from any source (whether a direct upload, or retrieval from external sources like Dropbox or an S3 bucket)
Running a pdfQL query over a set of source files

Example job request

curl https://us1.api.pdfdata.com/v1/job/jc92d0c4f-e133-46c9-81bb-823913a6c5ba \
  -u api___key:

To retrieve a job's status (and in the case of a query job, the results of a pdfQL query), issue a GET request to the corresponding job resource.

Example (upload) job response

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 428

{
  "id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
  "name": "Weekly invoices",
  "files": [
    {
      "id": "f831017ab-b130-49e7-b39e-5a13c056fd8d",
      "name": "Return.pdf",
      "hash": "16285d42df088864f77c6d33f1f300e2e76eb621"
    },
    {
      "id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
      "name": "18-30550_1574638_7-23-2018_4567527.pdf",
      "hash": "4055d3bd01939c34a25d08a4d0078e6cc5cbcbc3"
    }
  ]
}

When successful and complete, job requests will yield a 200 OK HTTP response, the body of which will be a JSON-encoded representation of the requested job; the specifics of that representation vary based on the type of job that was requested.

If a job is still pending, job requests will yield a 202 Accepted response, and include status information: a list of counts indicating how many files are yet to be processed in the course of a pdfQL query job.
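
If you do poll rather than use webhooks, the 200/202 distinction above suggests a loop like the following sketch. The `wait_for_job` helper, its `fetch_job` callback, the polling interval, and the attempt limit are all illustrative assumptions; `fetch_job` stands in for whatever code issues the GET request and returns the status code with the decoded JSON body.

```python
import time


def wait_for_job(fetch_job, job_id, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll until the job responds 200 OK, then return its decoded body.

    `fetch_job(job_id)` is a caller-supplied function returning a
    (status_code, decoded_json_body) pair for a GET on the job resource.
    """
    for _ in range(max_attempts):
        status_code, body = fetch_job(job_id)
        if status_code == 200:
            return body  # job is complete
        if status_code != 202:
            raise RuntimeError(f"Unexpected HTTP {status_code} for job {job_id}: {body}")
        sleep(interval)  # still pending; wait before asking again
    raise TimeoutError(f"Job {job_id} still pending after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.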

Example (query) job response

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 214
 
{
  "id": "j6fbbdb41-26d9-49e3-89a1-20276903b596",
  "query_id": "~/roll-call-query",
  "name": "testquery-123",
  "results": {
    "solutions": [ { "data": [ "AZ", "AY", "AX" ], "score": 6 } ]
  }
}

	Common job attributes
id	The unique ID of the requested job
name optional	The name of the requested job

In addition to the common attributes above, specific types of job objects also contain additional attributes:

	Upload job attributes
files	An array of file objects

	Completed query job attributes
query_id	The ID of the pdfQL query applied to the PDF files in the job.
results	A list of result objects, one per successfully-queried file.

	Pending query job attributes
query_id	The ID of the pdfQL query applied to the PDF files in the job.
status	A status object indicating the aggregate status of querying each of the source PDF files in the job.

	Result object
source	An object indicating the file from which the result data was extracted; it contains (file) `id` and (file) `name` attributes.
solutions	A list of "solution" objects, each containing a `data` attribute that is a list of the data elements produced by the pdfQL query. The structure of those data elements is not documented here, as they can vary significantly depending on the structure and options specified by the pdfQL query; see our documentation of pdfQL itself elsewhere for details on this subject.

	Status object
waiting optional	The number of files that have not yet been processed.
running optional	The number of files currently being processed.
recovering optional
timeout optional	The number of files that could not be processed within the quota window associated with the owning account.
failed optional	The number of files that produced hard failures upon querying.
done optional	The number of files that have been successfully queried.
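
A status object can be reduced to a simple progress summary, as in this sketch. It assumes that an absent attribute means a count of zero and that each file is counted under exactly one attribute; neither is stated above, and `summarize_status` is a hypothetical helper name.

```python
STATUS_KEYS = ("waiting", "running", "recovering", "timeout", "failed", "done")


def summarize_status(status: dict) -> dict:
    """Summarize a pending query job's status object."""
    # Assumption: optional attributes that are absent count as zero.
    counts = {key: int(status.get(key, 0)) for key in STATUS_KEYS}
    total = sum(counts.values())
    return {
        "total": total,
        "done": counts["done"],
        "unsuccessful": counts["timeout"] + counts["failed"],
        "fraction_done": counts["done"] / total if total else 0.0,
    }
```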

Webhooks

Every pdfQL™ query job is an asynchronous, potentially long-running task. This means that when you start such a job (either via the PDFDATA web application, or this API), results from that job are not immediately available; PDFDATA needs to do a bunch of work to gather the source PDF files you've selected, analyze the pdfQL query to be used, and then apply it to each document.

While you could continuously poll the API to track a job's status and then eventually retrieve its results, we strongly encourage you to make use of PDFDATA's webhook facility. A webhook is basically an HTTP callback; many services use webhooks to notify other services of events and deliver data in a "push" fashion, which is generally faster and more efficient than "pull" methods like repeated polling.

Configuring webhooks

In the "Account" section of the PDFDATA application, you can set a default webhook URL, and optionally a secret token that will be relayed with all webhook requests.

The webhook URL must use https. There are no restrictions on the content of the optional secret token; if set, it will be sent with all webhook requests, and can be used to verify that each request is actually originating from PDFDATA.

The webhook URL you provide here will be used by default for all pdfQL query jobs that you start via the API, and will be pre-filled in the form used to start a query job in the PDFDATA web application.

Receiving webhook requests

When a pdfQL query job finishes, PDFDATA will send an HTTP POST request to the webhook URL for that job carrying:

The same data that you would obtain by retrieving a job via the API, and
The job's ID as the value of the X-PDFDATA-JOBID HTTP header, and
If you have set a secret token in your account's webhook settings, it will be included in the webhook request as the value of the X-PDFDATA-TOKEN HTTP header.

How you implement your webhook handler is entirely up to you: it can be a web service you deploy to a public cloud, a server you own and maintain yourself, or it can be an endpoint provided by a service like Zapier. As long as the webhook URL points to a service that can accept HTTPS POST requests, PDFDATA will be able to deliver your pdfQL query job results to it.
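
Whatever framework hosts your handler, the core checks might look like the following sketch. `handle_webhook` is a hypothetical function to be called from your HTTP framework's request handler; it assumes a secret token has been configured and that the request body is the same JSON job representation the API returns.

```python
import hmac
import json


def handle_webhook(headers: dict, body: bytes, expected_token: str):
    """Validate a webhook delivery and return (job_id, job), or raise PermissionError.

    `headers` is a mapping of header names to values as received.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    token = normalized.get("x-pdfdata-token", "")
    # Constant-time comparison avoids leaking the token through timing differences.
    if not hmac.compare_digest(token.encode("utf-8"), expected_token.encode("utf-8")):
        raise PermissionError("Webhook token missing or incorrect")
    job_id = normalized.get("x-pdfdata-jobid")
    # Assumption: the body is the JSON job representation described under Jobs.
    job = json.loads(body)
    return job_id, job
```

Rejecting requests whose token does not match is what lets the handler trust that a delivery actually originated from PDFDATA.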