Introduction
PDFDATA provides PDF data extraction as a service using pdfQL™, via in-browser tools as well as the HTTP API described in this document.
You can use any HTTP client with the PDFDATA API, but the examples in this documentation are currently shown in curl, a popular command-line HTTP client available for just about any operating system.
Setup
curl is probably available via your operating system's package manager, or as a direct download.
While you can use curl itself to interact with PDFDATA, these examples are provided in part because curl invocations are widely understood as representations of HTTP requests. So, you can use curl directly, its library variant libcurl, or you should be able to use these example curl interactions to guide your integration with the PDFDATA API using your language's HTTP client implementation.
API Endpoints
If your PDFDATA account's service level includes region pinning (the ability to require that data and operations be strictly located in a single geographic region, usually to satisfy data residency requirements), your API endpoint will correspond with that region. For example, an EU PDFDATA API endpoint might be https://eu1.api.pdfdata.com, while a United States API endpoint might be https://us2.api.pdfdata.com.
"Basic" PDFDATA accounts that do not include region pinning are randomly assigned to an API "pool"; API endpoints for these accounts always begin with a basicN.api prefix, e.g. https://basic1.api.pdfdata.com.
While unpinned accounts may be rebalanced from time to time to different compute regions, their associated domain name and API endpoints will not change.
Your PDFDATA account has an associated API endpoint, determined by the service level you've selected, and then potentially which geographic region you have opted to "pin" your account to.
To find your account's API endpoint and credentials, go to the "Account" tab after logging into PDFDATA, and look for the "API Access" section.
Throughout this documentation, samples will refer to one of the United States regions' PDFDATA API endpoints, https://us1.api.pdfdata.com.
Authentication & Credentials
Throughout this documentation, samples will use a dummy API key:
api___key
Use either the -u option to provide your API key (remember to add a trailing colon, :, to prevent curl from asking for a corresponding password):
curl https://us1.api.pdfdata.com/... \
-u api___key: .....
Or, you can splice your API key into the URL, just before the hostname:
curl https://api___key@us1.api.pdfdata.com/...
Requests to the PDFDATA API must carry credentials, and must be made via HTTPS; any that do not will fail.
API requests are authenticated by HTTP Basic Auth. Your API key is the HTTP Basic username; the password is empty / not provided.
You can view and manage your API keys in the "Account" tab of the PDFDATA application. While all communications between your application and PDFDATA are secured by HTTPS / TLS, API keys enable access to your source documents and extracted data, so be sure to keep them secret! Do not leave your API keys in source control repositories, client-side code, or other widely accessible areas. We will treat any API request that includes your credentials as authorized by you, so protect them accordingly.
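As a minimal sketch of what curl's -u option puts on the wire, the Python below builds the HTTP Basic Authorization header by hand using only the standard library. The helper name basic_auth_header is ours for illustration and is not part of any PDFDATA client.

```python
import base64


def basic_auth_header(api_key: str) -> str:
    """Build an HTTP Basic Authorization header value.

    The API key is the username and the password is empty, so the
    credentials string is the key followed by a single colon.
    """
    credentials = f"{api_key}:".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Dummy key used throughout this documentation.
headers = {"Authorization": basic_auth_header("api___key")}
```

In a real integration, the key would more likely be read from an environment variable or a secrets store than written into source code.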
Using the PDFDATA API
Once you have:
- An active PDFDATA account (create one now!)
- Set up your development environment and obtained your API key
- Written a pdfQL query via the PDFDATA application that is ready to be applied (or you've opted to have us write queries for you!)
…you can productively use the API, which always follows the same cycle:
- Upload your PDF files
- Run a pdfQL query
- Either:
  - Receive the extracted data via webhook delivery, OR
  - Retrieve the extracted data from the query job
Errors
PDFDATA's HTTP API uses standard HTTP status codes and explanatory JSON bodies to indicate API request failures. All API-level errors can be broadly separated into problems with the request (these will provoke responses with 4xx status codes), or unexpected problems on our part in processing the request (resulting in a 5xx status code).
Example error response
HTTP/1.1 400 Bad Request
Content-Type: application/json
Content-Length: 55
{ "error": { "message": "No `source` parameter(s)" } }
PDFDATA error responses are JSON documents that describe the nature of the error as much as possible.
Error object attributes | |
---|---|
error | An object describing the nature of the error. It will always include a message string attribute, which will contain a human-friendly description of the problem. |
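One way a client might surface these errors is sketched below in Python. It assumes only the documented shape (an error object with a message string); the PdfdataError class and raise_for_error helper are hypothetical names, not part of the API.

```python
import json


class PdfdataError(Exception):
    """Hypothetical exception type carrying the API's error message."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def raise_for_error(status: int, body: bytes) -> None:
    """Raise PdfdataError for 4xx/5xx responses; do nothing otherwise."""
    if status < 400:
        return
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        # Fall back to the raw body if it is not the documented JSON shape.
        message = body.decode("utf-8", errors="replace")
    raise PdfdataError(status, message)
```

Callers can then branch on the status attribute: a 4xx value points at the request itself, while a 5xx value may be worth retrying later.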
Files
You need to get your PDF files into PDFDATA in order for it to work with them; this is done by uploading files via either the in-browser application, or via the API. Once uploaded, you can run pdfQL queries over them to extract the data you care about.
Uploading files
Definition
POST https://us1.api.pdfdata.com/v1/file
Example request
curl "https://us1.api.pdfdata.com/v1/file?name=Weekly%20invoices" \
-u api___key: \
-F pdf=@{PATH_TO_PDF} \
-F pdf=@{PATH_TO_PDF2}
Documents are uploaded by sending a multipart/form-data POST request, which can contain one or many documents along with their filenames.
Query parameters | |
---|---|
name optional | The name that should be given to the upload job (useful for finding uploads in the in-browser PDFDATA application) |
tag optional | The tag that should be added to each of the uploaded files. To add multiple tags to the set of uploaded files, use multiple tag parameters. |
Request parameters | |
---|---|
pdf | Each pdf parameter's value should consist of the raw binary content of a source PDF document. The filename document attribute is sourced from the filename property of each pdf request parameter. Any provided type attribute is ignored; every pdf value is assumed to be a PDF document. |
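For readers not using curl, the following Python sketch builds (but does not send) an equivalent upload request with the standard library. It assumes the parameters described above; the function name build_upload_request, its arguments, and the hand-rolled multipart encoding are illustrative choices, not an official client.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://us1.api.pdfdata.com"  # sample endpoint used in this document


def build_upload_request(pdfs, name=None, tags=(), auth_header=None):
    """Build a multipart/form-data POST request for /v1/file.

    `pdfs` is a list of (filename, raw_bytes) pairs; each becomes one
    `pdf` form part. Filenames containing double quotes are not handled.
    """
    query = [("name", name)] if name else []
    query += [("tag", tag) for tag in tags]
    url = f"{ENDPOINT}/v1/file"
    if query:
        url += "?" + urlencode(query)  # encodes spaces and reserved characters

    boundary = uuid.uuid4().hex
    body = b""
    for filename, content in pdfs:
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="pdf"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode("utf-8")
        body += content + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if auth_header:
        headers["Authorization"] = auth_header  # e.g. an HTTP Basic header value
    return Request(url, data=body, headers=headers, method="POST")
```

The returned object can be passed to urllib.request.urlopen to perform the upload; most third-party HTTP libraries offer a shorter way to send multipart bodies.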
When successful, file uploads will yield a 201 Created HTTP response, the body of which will be a JSON-encoded representation of the upload job associated with the files that were sent. Take note of the job ID and the IDs assigned to the individual files you provided, as those will be needed when issuing requests to run pdfQL queries over those files.
Example response
HTTP/1.1 201 Created
Content-Type: application/json
Content-Length: 428
{
"id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
"name": "Weekly invoices",
"files": [
{
"id": "f831017ab-b130-49e7-b39e-5a13c056fd8d",
"name": "Return.pdf",
"hash": "16285d42df088864f77c6d33f1f300e2e76eb621"
},
{
"id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
"name": "18-30550_1574638_7-23-2018_4567527.pdf",
"hash": "4055d3bd01939c34a25d08a4d0078e6cc5cbcbc3"
}
]
}
Upload job response | |
---|---|
id | The unique ID of the upload job |
name optional | The name given to the upload job |
files | An array of file objects |
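Since the job ID and file IDs are needed later, a client will usually pull them out of this response straight away. The Python below is a small sketch of that step, assuming the response shape shown in the example above; upload_ids is a hypothetical helper name.

```python
import json


def upload_ids(response_body):
    """Return (job_id, [file_id, ...]) from an upload job response body."""
    job = json.loads(response_body)
    return job["id"], [f["id"] for f in job["files"]]


# Abbreviated copy of the example response above.
sample = """
{
  "id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
  "name": "Weekly invoices",
  "files": [
    {"id": "f831017ab-b130-49e7-b39e-5a13c056fd8d", "name": "Return.pdf"},
    {"id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
     "name": "18-30550_1574638_7-23-2018_4567527.pdf"}
  ]
}
"""
job_id, file_ids = upload_ids(sample)
```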
Queries
PDFDATA is built around pdfQL™; you interact with pdfQL queries in two ways:
- By authoring, testing, and maintaining queries in the PDFDATA application
- By running those pdfQL queries against one or many source PDF files, via either the PDFDATA application, or via the query facilities provided by this API.
Definition
POST https://us1.api.pdfdata.com/v1/query/<QUERY_ID>
You can run a pdfQL query by issuing a POST request to the URL corresponding to that query, indicating which files are to be queried. To do this, find the query ID at the top of its page in the PDFDATA application (it will be a string like ~/<QUERY_NAME>), and append it to the root query resource URL.
Example query-running request
curl https://us1.api.pdfdata.com/v1/query/~/roll-call-query \
-u api___key: \
-d "name=Membership 2022" \
-d "webhook_url=https://...." \
-d source=jf6dfeae2-ca66-42f2-9dbf-6845cfb761d6 \
-d source=f831017ab-b130-49e7-b39e-5a13c056fd8d
Request parameters | |
---|---|
name optional | The name that should be given to the query job (useful for finding a particular job in the in-browser PDFDATA application) |
webhook_url optional | The URL that should receive a webhook request once the query job is complete. |
source | The ID of the file to be queried, or a tag that should be used to find and include all files that have that tag. To include multiple files or tags, use multiple source parameters. |
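The request parameters above can also be sent from code. The Python sketch below builds (without sending) a form-encoded POST equivalent to what curl's -d option produces, with one source field per file ID or tag. The name build_query_request and its signature are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://us1.api.pdfdata.com"  # sample endpoint used in this document


def build_query_request(query_id, sources, name=None, webhook_url=None,
                        auth_header=None):
    """Build a POST request that starts a query job.

    `query_id` is the ID shown in the PDFDATA application, e.g.
    "~/roll-call-query"; `sources` is a list of file IDs and/or tags.
    """
    fields = []
    if name:
        fields.append(("name", name))
    if webhook_url:
        fields.append(("webhook_url", webhook_url))
    # Repeat the `source` field once per file ID or tag.
    fields += [("source", source) for source in sources]

    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if auth_header:
        headers["Authorization"] = auth_header
    return Request(
        f"{ENDPOINT}/v1/query/{query_id}",
        data=urlencode(fields).encode("ascii"),
        headers=headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would start the job.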
Example new query job response
HTTP/1.1 202 Accepted
Content-Type: application/json
Location: /job/jf5f4d319-d942-4255-b8ca-4174539dda84
Content-Length: 235
{
"id": "jf5f4d319-d942-4255-b8ca-4174539dda84",
"query_id": "~/roll-call-query",
"name": "Membership 2022",
"files": [
"f831017ab-b130-49e7-b39e-5a13c056fd8d",
"fdde795a2-c907-44cc-9a27-f707e6c6df17"
]
}
This will start an (asynchronous) query job, and respond with some metadata about it, including a Location header providing the API's URL for the new job where you can retrieve query job status and results, as well as a listing of the files that are being queried.
Pending query job response | |
---|---|
id | The unique ID of the query job |
query_id | The ID of the pdfQL query being run in the job |
name optional | The name given to the job |
files | An array of file IDs; if the query-running request included any tags as sources, all files with those tags will be resolved and included here. |
Jobs
Definition
GET https://us1.api.pdfdata.com/v1/job/<JOB_ID>
A PDFDATA job is any long-running process. Examples include:
- Acquiring files from any source (whether a direct upload, or retrieval from external sources like Dropbox or an S3 bucket)
- Running a pdfQL query over a set of source files
Example job request
curl https://us1.api.pdfdata.com/v1/job/jc92d0c4f-e133-46c9-81bb-823913a6c5ba \
-u api___key:
To retrieve a job's status (and in the case of a query job, the results of a pdfQL query), issue a GET request to the corresponding job resource.
Example (upload) job response
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 428
{
"id": "jc92d0c4f-e133-46c9-81bb-823913a6c5ba",
"name": "weekly-invoices",
"files": [
{
"id": "f831017ab-b130-49e7-b39e-5a13c056fd8d",
"name": "Return.pdf",
"hash": "16285d42df088864f77c6d33f1f300e2e76eb621"
},
{
"id": "fdde795a2-c907-44cc-9a27-f707e6c6df17",
"name": "18-30550_1574638_7-23-2018_4567527.pdf",
"hash": "4055d3bd01939c34a25d08a4d0078e6cc5cbcbc3"
}
]
}
When successful and complete, job requests will yield a 200 OK HTTP response, the body of which will be a JSON-encoded representation of the requested job; the specifics of that representation vary based on the type of job that was requested.
If a job is still pending, job requests will yield a 202 Accepted response, and include status information: a list of counts indicating how many files are yet to be processed in the course of a pdfQL query job.
Example (query) job response
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 214
{
"id": "j6fbbdb41-26d9-49e3-89a1-20276903b596",
"query_id": "~/roll-call-query",
"name": "testquery-123",
"results": {
"solutions": [ { "data": [ "AZ", "AY", "AX" ], "score": 6 } ]
}
}
Common job attributes | |
---|---|
id | The unique ID of the requested job |
name optional | The name of the requested job |
In addition to the common attributes above, specific types of job objects also contain additional attributes:
Upload job attributes | |
---|---|
files | An array of file objects |
Completed query job attributes | |
---|---|
query_id | The ID of the pdfQL query applied to the PDF files in the job. |
results | A list of result objects, one per successfully-queried file. |
Pending query job attributes | |
---|---|
query_id | The ID of the pdfQL query applied to the PDF files in the job. |
status | A status object indicating the aggregate status of querying each of the source PDF files in the job. |
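If you do poll rather than use webhooks, the 200 OK versus 202 Accepted distinction described above is what drives the loop. The Python sketch below illustrates one possible shape for it. The HTTP call is injected as a fetch callable so that the loop stays independent of any particular HTTP client; wait_for_job, the polling interval, and the attempt limit are all assumptions rather than PDFDATA recommendations.

```python
import json
import time


def wait_for_job(fetch, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll a job resource until it completes, then return the decoded job.

    `fetch` is any zero-argument callable that issues the GET request for
    the job resource and returns (status_code, body_bytes).
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            # Job complete: the body is the full job representation.
            return json.loads(body)
        if status != 202:
            raise RuntimeError(f"unexpected HTTP status {status}: {body!r}")
        # 202 Accepted: still pending, so wait before asking again.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("job still pending after the polling limit")
```

Injecting fetch and sleep also makes the loop easy to exercise in tests with canned responses.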
Webhooks
Every pdfQL™ query job is an asynchronous, potentially long-running task. This means that when you start such a job (either via the PDFDATA web application, or this API), results from that job are not immediately available; PDFDATA needs to do a bunch of work to gather the source PDF files you've selected, analyze the pdfQL query to be used, and then apply it to each document.
While you could continuously poll the API to track a job's status and then eventually retrieve its results, we strongly encourage you to make use of PDFDATA's webhook facility. A webhook is basically an HTTP callback; many services use webhooks to notify other services of events and deliver data in a "push" fashion, which is generally faster and more efficient than "pull" methods like repeated polling.
Configuring webhooks
In the "Account" section of the PDFDATA application, you can set a default webhook URL, and optionally a secret token that will be relayed with all webhook requests.
The webhook URL must use https. There are no restrictions on the content of the optional secret token; if set, it will be sent with all webhook requests, and can be used to verify that each request is actually originating from PDFDATA.
The webhook URL you provide here will be used by default for all pdfQL query jobs that you start via the API, and will be pre-filled in the form used to start a query job in the PDFDATA web application.
Receiving webhook requests
When a pdfQL query job finishes, PDFDATA will send an HTTP POST request to the webhook URL for that job carrying:
- The same data that you would obtain by retrieving a job via the API, and
- The job's ID as the value of the X-PDFDATA-JOBID HTTP header, and
- If you have set a secret token in your account's webhook settings, it will be included in the webhook request as the value of the X-PDFDATA-TOKEN HTTP header.
How you implement your webhook handler is entirely up to you: it can be a web service you deploy to a public cloud, a server you own and maintain yourself, or it can be an endpoint provided by a service like Zapier. As long as the webhook URL points to a service that can accept HTTPS POST requests, PDFDATA will be able to deliver your pdfQL query job results to it.
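As one illustration of such a handler, the Python sketch below uses the standard library's http.server to check the X-PDFDATA-TOKEN header and read the job ID and JSON body. Several parts are assumptions: the EXPECTED_TOKEN value, the handle_job function, and the 204 and 403 response codes are placeholders, since this document does not specify what response PDFDATA expects from a webhook endpoint.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler

EXPECTED_TOKEN = "replace-with-your-secret-token"  # hypothetical value


def token_matches(headers, expected):
    """Compare the X-PDFDATA-TOKEN header to the expected token.

    hmac.compare_digest performs the comparison in constant time.
    """
    received = headers.get("X-PDFDATA-TOKEN") or ""
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))


def handle_job(job_id, job):
    """Hypothetical application hook; replace with your own processing."""
    print(f"job {job_id} finished with keys: {sorted(job)}")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Reject requests that do not carry the configured secret token.
        if not token_matches(self.headers, EXPECTED_TOKEN):
            self.send_response(403)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        job = json.loads(self.rfile.read(length))
        handle_job(self.headers.get("X-PDFDATA-JOBID"), job)
        self.send_response(204)  # assumed success response
        self.end_headers()
```

Because the webhook URL must use https, a handler like this would need to sit behind something that terminates TLS, such as a reverse proxy. It could be served locally with http.server.HTTPServer(("", 8000), WebhookHandler).serve_forever().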