
Document AI - Introduction

In this tutorial we will show how to set up, train, and use Document AI in Snowflake.

Video

Video is still in development.

Requirements

This tutorial assumes you are starting from an empty (trial) Snowflake account and have no complex security needs.
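If you are on a full account rather than a fresh trial, your role will also need access to the Document AI feature. A minimal sketch, assuming you can use ACCOUNTADMIN and want to grant the feature to SYSADMIN (adjust to your own role model):

use role accountadmin;

-- Document AI is gated behind this built-in database role.
grant database role snowflake.document_intelligence_creator to role sysadmin;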

Downloads

If you don't have a database, schema, or warehouse yet, run the following:
use role sysadmin;

-- Create a database to store our schemas.
create database if not exists raw;

-- Create the schema. The schema stores all our objects.
create schema if not exists raw.documents;

/*
    Warehouses are synonymous with the idea of compute
    resources in other systems. We will use this
    warehouse to call our user defined function.
*/
create warehouse if not exists development 
    warehouse_size = xsmall
    initially_suspended = true;

Snowflake

Let's start in AI & ML under Document AI, where we'll build our first model.

Training

Let's give the model a name and a location to store it in. Once done, click "Create".

Now that it's created, let's upload our training resumes. Click "Upload documents".

Browse to and upload the training PDF files.

Once all have been uploaded, click "Done".

Now that they are uploaded to the dataset, let's define the questions/values we want to pull.

Click "+ value" to start asking questions. UPDATE

On the left is the value's title/key, and on the right is the question we want to ask of the resumes (for example, a key of name with the question "What is the candidate's full name?"). Once you press Enter, it will pull the value and give you a confidence score.

I asked it four questions, including one that pulls multiple responses/items. When I'm happy with a response I can click the check box to accept it, or correct the model.

After we accept all values and move to the next review, we'll get a new example to validate or correct. We'll follow this process until we have reviewed all training resumes.

Once all resumes are reviewed, we'll be able to see the status of every training resume.

If we are not happy with the accuracy, we can train the model to improve it.

Once we are happy with the accuracy, we can publish the model so it can be used on our testing resumes.

Click "Publish".

Parsing new documents

Now we'll notice that two examples are provided for parsing new documents with our published model: one for a single file and one for a whole folder. Copy either; we will use them later.

Upload Testing Data

Let's create a new stage in our schema for our testing PDF documents.

Let's call it resumes, select "Server-side encryption", and click "Create".
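If you'd rather create the stage from a worksheet, a minimal equivalent sketch looks like this (server-side encryption is what Document AI expects, and the directory table lets us list the stage's files later):

use schema raw.documents;

-- Directory table so we can query the stage's contents,
-- server-side encryption so Document AI can read the files.
create stage if not exists resumes
    directory = (enable = true)
    encryption = (type = 'snowflake_sse');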

Let's upload our testing resumes.

Browse, upload all three PDFs, and click "Upload".

Parse new documents

Once uploaded, let's open a worksheet and run one of our two example queries, pointed at our stage. First, the single-file version:

use schema raw.documents;
use warehouse development;

select resumes!predict(get_presigned_url(@resumes, 'Ex11.pdf'), 1);
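The call returns one JSON object keyed by the value names you defined during training, with each value holding a list of {score, value} pairs. As a hypothetical sketch, assuming you defined a value named name, you could pull just its text like this:

-- "name" is a hypothetical value key; use one you actually defined.
select
    resumes!predict(get_presigned_url(@resumes, 'Ex11.pdf'), 1):name[0]:value::string as candidate_name;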
And the folder version, which parses every file listed in the stage's directory table:

use schema raw.documents;

select 
    resumes!predict(get_presigned_url(@resumes, relative_path), 1)
from
    directory(@resumes);

Now we can see our JSON response. If you parse an entire directory, you can also put the query in a CTE and flatten the results after they are parsed, as shown below.
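A sketch of that pattern, assuming one of your multi-item values is named skills (a hypothetical key; swap in whatever you defined):

use schema raw.documents;

-- Parse every resume once in a CTE, then explode the
-- multi-item "skills" value into one row per item.
with predictions as (
    select
        relative_path,
        resumes!predict(get_presigned_url(@resumes, relative_path), 1) as prediction
    from
        directory(@resumes)
)
select
    p.relative_path,
    s.value:value::string as skill,
    s.value:score::float  as confidence
from
    predictions p,
    lateral flatten(input => p.prediction:skills) s;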