How To Validate Three Common Document Types in Python

2024-11-30

Regardless of the industry vertical we work in – whether that be somewhere in the technology, e-commerce, manufacturing, or financial fields (or some Venn diagram of them all) – we mostly rely on the same handful of standard document formats to share critical information between our internal teams and with external organizations. These documents are almost always on the move, bouncing across our networks as fast as our servers will allow them to. Many internal business automation workflows, for example – whether written with custom code or pieced together through an enterprise automation platform – are designed to process standard PDF invoice documents step by step as they pass from one stakeholder to another. Similarly, customized reporting applications are typically used to access and process Excel spreadsheets which the financial stakeholders of a given organization (internal or external) rely on. All the while, these documents remain beholden to strictly enforced data standards, and each application must consistently uphold these standards. That’s because every document, no matter how common, is uniquely capable of defying some specific industry regulation, containing an unknown error in its encoding, or even - at the very worst - hiding a malicious security threat behind a benign façade.

As rapidly evolving business applications continue to make our professional lives more efficient, business users on any network place more and more trust in the cogs that turn within their suite of assigned applications to uphold high data standards on their behalf. As our documents travel from one location to another, the applications they pass through are ultimately responsible for determining the integrity, security, and compliance of each document’s contents. If an invalid PDF file somehow reaches its end destination, the application which processes it – and, by extension, those stakeholders responsible for creating, configuring, deploying, and maintaining the application in the first place – will have some difficult questions to answer. It’s important to know upfront, right away, whether there are any issues present within the documents our applications are actively processing. If we don’t have a way of doing that, we run the risk of allowing our own applications to shoot us in the foot.

Thankfully, it’s straightforward (and standard) to solve this problem with layers of data validation APIs. In particular, document validation APIs are designed to fit seamlessly within the architecture of a file processing application, providing a quick feedback loop on each individual file they encounter to ensure the application runs smoothly when valid documents pass through and halting its process immediately when invalid documents are identified. There are dozens of common document types which require validation in a file processing application, and many of the most common among those, including PDF, Excel, and DOCX (which this article seeks to highlight), are all compressed and encoded in very unique ways, making it particularly vital to programmatically identify whether their contents are structured correctly and securely.

Document Validation APIs

The purpose of this article is to highlight three API solutions that can be used to validate three separate and exceedingly common document types within your various document processing applications: PDF, Excel XLSX, and Microsoft Word DOCX. These APIs are all free to use, requiring a free-tier API key and only a few lines of code (provided below in Python for your convenience) to call their services. While the process of validating each document type listed above is unique, the response body provided by each API is standardized, making it efficient and straightforward to identify whether an error was found within each document type and if so, what warnings are associated with that error. Below, I’ll quickly outline the general body of information supplied in each of the above document validation API's response:

DocumentIsValid – This response contains a simple Boolean value indicating whether the document in question is valid based on its encoding.
PasswordProtected – This response provides a Boolean value indicating whether the document in question contains password protection (which – if unexpected – can indicate an underlying security threat).
ErrorCount – This response provides an integer reflecting the number of errors detected within the document in question.
WarningCount – This response indicates the number of warnings produced by the API response independently of the error count.
ErrorsAndWarnings – This response category includes more detailed information about each error identified within a document, including an error description, error path, error URI (uniform resource identifier, such as URL or URN), and IsError Boolean.

Demonstration

To use any of the three APIs referred to above, the first step is to install the Python SDK with a pip command provided below:

pip install cloudmersive-convert-api-client

With installation complete, we can turn our attention to the individual functions which call each individual API’s services.

To call the PDF validation API, we can use the following code:

     Python 
   
 
 
   from __future__ import print_function
import time
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'



# create an instance of the API class
api_instance = cloudmersive_convert_api_client.ValidateDocumentApi(cloudmersive_convert_api_client.ApiClient(configuration))
input_file = '/path/to/inputfile' # file | Input file to perform the operation on.

try:
    # Validate a PDF document file
    api_response = api_instance.validate_document_pdf_validation(input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ValidateDocumentApi->validate_document_pdf_validation: %s\n" % e) 
  

To call the Microsoft Excel XLSX validation API, we can use the following code instead:

     Python 
   
 
 
   from __future__ import print_function
import time
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'



# create an instance of the API class
api_instance = cloudmersive_convert_api_client.ValidateDocumentApi(cloudmersive_convert_api_client.ApiClient(configuration))
input_file = '/path/to/inputfile' # file | Input file to perform the operation on.

try:
    # Validate a Excel document (XLSX)
    api_response = api_instance.validate_document_xlsx_validation(input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ValidateDocumentApi->validate_document_xlsx_validation: %s\n" % e) 
  

And finally, to call the Microsoft Word DOCX validation API, we can use the final code snippet supplied below:

     Python 
   
 
 
   from __future__ import print_function
import time
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'



# create an instance of the API class
api_instance = cloudmersive_convert_api_client.ValidateDocumentApi(cloudmersive_convert_api_client.ApiClient(configuration))
input_file = '/path/to/inputfile' # file | Input file to perform the operation on.

try:
    # Validate a Word document (DOCX)
    api_response = api_instance.validate_document_docx_validation(input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ValidateDocumentApi->validate_document_docx_validation: %s\n" % e) 
  

Please note that while these APIs do provide some basic security benefits during their document validation processes (i.e., identifying unexpected password protection on a file, which is a common method for sneaking malicious files through a network - the password can be supplied to an unsuspecting downstream user at a later date), they do not constitute fully formed security APIs, such as those that would specifically hunt for viruses, malware, and other forms of malicious content hidden within a file. Any document – especially those that originated outside of your internal network – should always be thoroughly vetted through specific security-related services (i.e., services equipped with virus and malware signatures) before entering or leaving your file storage systems.