Document AI - Introduction
In this tutorial we will show how to set up, train, and use Document AI in Snowflake.
Video
The video is still in development.
Requirement
This tutorial assumes you are starting from an empty Snowflake (trial) account and have no complex security needs.
Downloads
If you don't have a database, schema, or warehouse yet, run the following:
use role sysadmin;
-- Create a database to store our schemas.
create database if not exists raw;
-- Create the schema. The schema stores all our objects.
create schema if not exists raw.documents;
/*
Warehouses are synonymous with the idea of compute
resources in other systems. We will use this
warehouse to run our Document AI extraction queries.
*/
create warehouse if not exists development
warehouse_size = xsmall
initially_suspended = true;
Snowflake
Let's start under AI & ML > Document AI, where we'll build our first model.
Training
Let's give the model a name and a location where it will be stored. Once done, click "Create".
Now that it's created, let's upload our training resumes. Click "Upload documents".
Browse to and upload the training PDF files.
Once all have been uploaded, click "Done".
Now that they are uploaded to the dataset, let's define the questions/values we want to pull.
Click "+ value" to start asking questions.
On the left will be its title/key, and on the right will be the question we want to ask to retrieve the answer from the resumes. Once you press Enter, it will pull the value and give you an accuracy number.
I asked it four questions, plus one that pulls multiple responses/items. When I'm happy with the responses, I can click the check box or correct the model.
After we accept all and review next, we'll get a new example, which we'll want to validate or correct. We'll follow this process until we have reviewed all the training resumes.
Once all resumes are reviewed, we'll be able to see the status of all the training resumes.
If we are not happy with the accuracy, we can keep training the model to improve it.
Once we are happy with the accuracy, we can publish the model so it can be used on our testing resumes.
Parsing new documents
Now we'll notice that two examples are provided for parsing new documents with our published model. We can copy either the folder example or the single-file example; we will use one of these later.
Upload Testing Data
Let's create a new stage in our schema for our testing PDF documents.
Let's call it resumes, encrypt it using "Server-side encryption", and click "Create".
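If you prefer SQL over the UI, roughly the same stage can be created from a worksheet. This is a minimal sketch assuming the raw.documents schema from the setup above; enabling the directory table is what lets us query the stage's file listing when we parse the whole folder later.
-- Sketch of a SQL equivalent of the UI steps above (names assume the raw.documents schema).
create stage if not exists raw.documents.resumes
  encryption = (type = 'snowflake_sse') -- server-side encryption
  directory = (enable = true); -- directory table, used when parsing the whole folder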
Let's upload our testing resumes.
Browse, upload all three PDFs, and click "Upload".
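As an alternative to the browser upload, the files can also be staged from SnowSQL with a PUT command. This is a sketch only; the local path is a placeholder, and auto-compression is turned off so the stage holds the raw PDFs.
-- Run from SnowSQL (not a Snowsight worksheet); the local path is hypothetical.
put file:///path/to/testing_resumes/*.pdf @raw.documents.resumes
  auto_compress = false;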
Parse new documents
Once uploaded, let's open a worksheet and run one of our two example queries pointed at our stage.
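For reference, the single-file example generally looks something like the sketch below. The model name resume_model, its raw.documents location, the stage, the file name, and version 1 are all assumptions; replace them with what your model page generated.
-- Minimal sketch of the single-file example; model name, location, stage, file, and version are assumptions.
use warehouse development;

select raw.documents.resume_model!predict(
  get_presigned_url(@raw.documents.resumes, 'resume_1.pdf'),
  1
);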
Now we can see our JSON response. If you parse an entire directory, you can also put the query in a CTE and flatten the results after they are parsed.
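As a rough sketch of that approach, assuming the same model and stage names as in the sketch above, the folder example can be wrapped in a CTE and the returned JSON flattened into one row per extracted field.
-- Sketch only: resume_model, the raw.documents.resumes stage, and version 1 are assumptions.
with parsed as (
    select
        relative_path,
        raw.documents.resume_model!predict(
            get_presigned_url(@raw.documents.resumes, relative_path),
            1
        ) as response
    from directory(@raw.documents.resumes)
)
select
    p.relative_path,
    f.key as field_name, -- the title/key we defined during training
    f.value as field_value -- the extracted value(s) and their scores
from parsed p,
lateral flatten(input => p.response) f;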