Extracting data from PDF using Azure Form Recognizer / AI Document Intelligence
Form Recognizer is now called Azure AI Document Intelligence!
In one of my engagements, I worked with a customer department that kept thousands of product research documents as Word forms in SharePoint. They wanted to explore the data in those forms but had no option other than manually opening each document and keying its contents into a database!
In this blog we’ll see how to handle similar scenarios: helping customers extract data from PDF/Word documents into a structured format that they can then use for descriptive analytics.
Azure Form Recognizer
Azure Form Recognizer is an applied AI service that extracts text from images and PDFs. It offers three types of APIs:
- Layout API — Detects and extracts text and layout of documents, such as tables, checkboxes and objects.
- Pre-built API — These are pre-trained models for common scenarios such as IDs, receipts and invoices, that extract text, key-value pairs and line items from documents.
- Custom API — This custom form service lets you train on your own data to learn the structure of your documents in an intelligent way.
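To make the Layout API a little more concrete, here is a minimal Python sketch of pulling tables out of a document. It assumes the `azure-ai-formrecognizer` SDK (version 3.2 or later) and an endpoint, key and file path of your own. The helper names `table_to_rows` and `analyze_layout` are my own illustrations, not part of the SDK, and the `sample` object is a hand-made stand-in for a detected table so the helper can be tried without calling the service.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Turn one detected table into a list of rows (lists of cell text)."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def analyze_layout(path, endpoint, key):
    """Send a local file to the prebuilt layout model and return its tables as rows.

    endpoint and key are placeholders for your own resource's values.
    """
    # Imported here so table_to_rows can be tried without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as f:
        result = client.begin_analyze_document("prebuilt-layout", document=f).result()
    return [table_to_rows(t) for t in (result.tables or [])]


# Offline stand-in shaped like a detected table (invented values, not real output).
sample = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Product"),
        SimpleNamespace(row_index=0, column_index=1, content="Score"),
        SimpleNamespace(row_index=1, column_index=0, content="A"),
        SimpleNamespace(row_index=1, column_index=1, content="9"),
    ],
)
rows = table_to_rows(sample)
```

Each row can then be written to a CSV file or a database table. Cells that span several rows or columns would need extra handling that this sketch leaves out.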
Custom vs Layout Model
I preferred the Custom API, where I created a model by labelling 5–6 well-constructed PDF forms (a supervised approach). PDF labelling, custom model…
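As a rough sketch of how a custom model trained this way could be used, the Python below sends a form to the model and flattens the labelled fields into one record per document, ready for a database row. It assumes the `azure-ai-formrecognizer` SDK (version 3.2 or later). The `model_id` value, the field names `ProductName` and `ResearchDate`, the confidence threshold and both helper names are hypothetical placeholders, not details from my actual project.

```python
from types import SimpleNamespace


def fields_to_record(fields, min_confidence=0.0):
    """Flatten labelled fields into {name: text}; low-confidence values become None."""
    record = {}
    for name, field in fields.items():
        confident = field.confidence is None or field.confidence >= min_confidence
        record[name] = field.content if confident else None
    return record


def analyze_with_custom_model(path, endpoint, key, model_id, min_confidence=0.5):
    """Run a custom model on a local file and return one record per document found.

    model_id is whatever ID you gave your own trained model (hypothetical here).
    """
    # Imported here so fields_to_record can be tried without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as f:
        result = client.begin_analyze_document(model_id, document=f).result()
    return [fields_to_record(d.fields, min_confidence) for d in (result.documents or [])]


# Offline stand-in for labelled fields (invented names and values).
sample_fields = {
    "ProductName": SimpleNamespace(content="Widget", confidence=0.98),
    "ResearchDate": SimpleNamespace(content="2021-03-1?", confidence=0.35),
}
record = fields_to_record(sample_fields, min_confidence=0.5)
```

Blanking low-confidence fields instead of dropping them keeps every record the same shape, which makes it easier to spot forms that need a manual check.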