
Document AI - Introduction

In this tutorial we will show how to set up, train, and use Document AI in Snowflake.

Video

Video is still in development.

Requirements

This tutorial assumes you are starting from an empty (trial) Snowflake account and have no complex security needs.
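If you are on a full account rather than a fresh trial, your role will also need access to the Document AI feature. A minimal sketch, assuming you can use ACCOUNTADMIN and want to grant the feature to SYSADMIN (adjust to your own role model):

use role accountadmin;

-- Document AI is gated behind this built-in database role.
grant database role snowflake.document_intelligence_creator to role sysadmin;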

Downloads

If you don't have a database, schema, or warehouse yet, run the following:
use role sysadmin;

-- Create a database to store our schemas.
create database if not exists raw;

-- Create the schema. The schema stores all our objects.
create schema if not exists raw.documents;

/*
    Warehouses are synonymous with the idea of compute
    resources in other systems. We will use this
    warehouse to call our user defined function.
*/
create warehouse if not exists development 
    warehouse_size = xsmall
    initially_suspended = true;

Snowflake

Let's start in AI & ML under Document AI, where we'll build our first model.

Training

Let's give the model a name and a location to store it in. Once done, click "Create".

Now that it's created, let's upload our training resumes. Click "Upload documents".

Browse to and upload the training PDF files.

Once all have been uploaded, click "Done".

Now that they are uploaded to the dataset, let's define the questions/values we want to pull.

Click "+ value" to start asking questions. UPDATE

On the left is the value's title/key, and on the right is the question we want to ask of the resumes (for example, a key of name with the question "What is the candidate's full name?"). Once you press Enter, it will pull the value and give you a confidence score.

I asked it four questions, including one that pulls multiple responses/items. When I'm happy with a response I can click the check box to accept it, or correct the model.

After we accept all values and move to the next review, we'll get a new example to validate or correct. We'll follow this process until we have reviewed all training resumes.

Once all resumes are reviewed, we'll be able to see the status of every training resume.

If we are not happy with the accuracy, we can train the model to improve it.

Once we are happy with the accuracy, we can publish the model so it can be used on our testing resumes.

Click "Publish".

Parsing new documents

Now we'll notice that two examples are provided for parsing new documents with our published model: one for a single file and one for a whole folder. Copy either; we will use them later.

Upload Testing Data

Let's create a new stage in our schema for our testing PDF documents.

Let's call it resumes, select "Server-side encryption", and click "Create".
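If you'd rather create the stage from a worksheet, a minimal equivalent sketch looks like this (server-side encryption is what Document AI expects, and the directory table lets us list the stage's files later):

use schema raw.documents;

-- Directory table so we can query the stage's contents,
-- server-side encryption so Document AI can read the files.
create stage if not exists resumes
    directory = (enable = true)
    encryption = (type = 'snowflake_sse');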

Let's upload our testing resumes.

Browse, upload all three PDFs, and click "Upload".

Parse new documents

Once uploaded, let's open a worksheet and run one of our two example queries, pointed at our stage. First, the single-file version:

use schema raw.documents;
use warehouse development;

select resumes!predict(get_presigned_url(@resumes, 'Ex11.pdf'), 1);
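The call returns one JSON object keyed by the value names you defined during training, with each value holding a list of {score, value} pairs. As a hypothetical sketch, assuming you defined a value named name, you could pull just its text like this:

-- "name" is a hypothetical value key; use one you actually defined.
select
    resumes!predict(get_presigned_url(@resumes, 'Ex11.pdf'), 1):name[0]:value::string as candidate_name;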
And the folder version, which parses every file listed in the stage's directory table:

use schema raw.documents;

select 
    resumes!predict(get_presigned_url(@resumes, relative_path), 1)
from
    directory(@resumes);

Now we can see our JSON response. If you parse an entire directory, you can also put the query in a CTE and flatten the results after they are parsed, as shown below.
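A sketch of that pattern, assuming one of your multi-item values is named skills (a hypothetical key; swap in whatever you defined):

use schema raw.documents;

-- Parse every resume once in a CTE, then explode the
-- multi-item "skills" value into one row per item.
with predictions as (
    select
        relative_path,
        resumes!predict(get_presigned_url(@resumes, relative_path), 1) as prediction
    from
        directory(@resumes)
)
select
    p.relative_path,
    s.value:value::string as skill,
    s.value:score::float  as confidence
from
    predictions p,
    lateral flatten(input => p.prediction:skills) s;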