Connect Snowflake to S3 Storage
The goal of this tutorial is to load JSON and CSV data from an S3 bucket using the COPY INTO SQL command, and to automate the ingestion process with Snowpipe.
Requirements
- Snowflake account; a free trial works. We also assume no complex security requirements.
- AWS account; you can set up a free account to get started.
Video
Download
- Sample data (Link)
Manual Loading
Let's start by setting up a Snowflake connection to AWS S3 storage and loading JSON data. After that, we'll use Snowpipe to automate the ingestion of CSV files.
AWS
Sign in to your AWS account.
Create S3 bucket
Create the bucket you intend to use. In our case we'll call the bucket danielwilczak.
Upload sample data
Upload the sample data (JSON/CSV) provided in the data folder to your S3 bucket.
Policy and role
Copy your bucket ARN. This will be used in the policy step.
Copy the template policy JSON below and add your bucket's ARN.
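A minimal sketch of a read-only bucket policy, based on the permissions Snowflake's documentation calls for; the bucket ARN below assumes the danielwilczak bucket from earlier, so substitute your own.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::danielwilczak/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::danielwilczak"
        }
    ]
}
```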
Enter your policy using the template / example above and click next.
Give the policy a name and then click "create policy".
Next, let's create a role. Navigate back to IAM:
Select Roles in the navigation pane and then click "Create role":
Select "AWS account", choose "This account", check "Require external ID", and enter 0000 for now. We will update this later.
Add the role name and click "create role":
Copy your role ARN; it will be used in the next step:
Snowflake
Let's start the Snowflake setup by creating our database and schema, followed by the storage integration to AWS. Run the code below with your copied role ARN and bucket name:
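A minimal sketch of that code, assuming illustrative database/schema names (raw.aws) and integration name (s3_integration); replace the role ARN and bucket with your own values.

```sql
-- Assumed names: database RAW, schema AWS, integration S3_INTEGRATION.
CREATE DATABASE IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS raw.aws;
USE SCHEMA raw.aws;

-- Replace the role ARN and bucket name with the values from your AWS account.
CREATE OR REPLACE STORAGE INTEGRATION s3_integration
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<role-name>'
    STORAGE_ALLOWED_LOCATIONS = ('s3://danielwilczak/');

-- Returns the IAM user ARN and external ID shown in the table below.
DESC INTEGRATION s3_integration;
```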
| property | property_value |
| --- | --- |
| STORAGE_AWS_IAM_USER_ARN | arn:aws:iam::001782626159:user/8pbb0000-s |
| STORAGE_AWS_EXTERNAL_ID | GGB82720_SFCRole=2_vcN2MIiC7PW0OMOyA82W5BLJrqY= |
Note the result; it will be used in the next step.
Grant Access in S3
Copy the policy JSON template below and add your STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from the prior Snowflake step.
From the Snowflake results we got earlier:

| property | property_value |
| --- | --- |
| STORAGE_AWS_IAM_USER_ARN | arn:aws:iam::001782626159:user/8pbb0000-s |
| STORAGE_AWS_EXTERNAL_ID | GGB82720_SFCRole=2_vcN2MIiC7PW0OMOyA82W5BLJrqY= |
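A sketch of the role's trust policy, following the format described in Snowflake's documentation; the two placeholders stand in for the values from the table above.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
                }
            }
        }
    ]
}
```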
Enter your policy using the template / example above.
Load the data
Let's set up the stage, file format, and warehouse, and finally load some JSON data.
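A sketch of those steps, assuming the integration from above, illustrative object names, and a json/ folder in the danielwilczak bucket.

```sql
-- File format and stage for the JSON folder of the bucket.
CREATE OR REPLACE FILE FORMAT json_format
    TYPE = 'JSON';

CREATE OR REPLACE STAGE s3_json
    URL = 's3://danielwilczak/json/'
    STORAGE_INTEGRATION = s3_integration
    FILE_FORMAT = json_format;

-- A small warehouse to run the load.
CREATE WAREHOUSE IF NOT EXISTS loading_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60;

-- Land the raw JSON into a single VARIANT column and load it.
CREATE OR REPLACE TABLE json_data (data VARIANT);

COPY INTO json_data
    FROM @s3_json;
```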
| file | status |
| --- | --- |
| s3://danielwilczak/json/sample.json | LOADED |
Look at the data you just loaded.
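For example, assuming the json_data table from the sketch above:

```sql
-- Peek at the raw JSON rows that were just loaded.
SELECT data FROM json_data LIMIT 10;
```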
Automatic Loading
Warning
If you have not yet manually loaded data from S3, please go back and complete that section first.
Let's create a pipe to automate copying data into a table. Create the file format, table, and pipe in Snowflake. This approach infers the schema, so you don't have to name all the columns manually. The code will also give you the SQS queue ARN to be entered into AWS later.
Snowflake
Note
Sometimes it may take 1-2 minutes before you see data in the table. This depends on how AWS is feeling today.
In this case, we'll load a CSV file by automating the creation of the table and inferring the column names in the CSV pipe.
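A sketch of that pipeline, assuming the same integration, a csv/ folder in the bucket, and illustrative table/pipe names (csv_data and csv).

```sql
-- CSV file format that reads the header row so column names can be inferred.
CREATE OR REPLACE FILE FORMAT csv_format
    TYPE = 'CSV'
    PARSE_HEADER = TRUE;

CREATE OR REPLACE STAGE s3_csv
    URL = 's3://danielwilczak/csv/'
    STORAGE_INTEGRATION = s3_integration
    FILE_FORMAT = csv_format;

-- Create the table by inferring column names and types from the staged file.
CREATE OR REPLACE TABLE csv_data
    USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(
            INFER_SCHEMA(
                LOCATION => '@s3_csv',
                FILE_FORMAT => 'csv_format'
            )
        )
    );

-- Pipe that auto-ingests new files and matches columns by header name.
CREATE OR REPLACE PIPE csv
    AUTO_INGEST = TRUE
    AS
    COPY INTO csv_data
    FROM @s3_csv
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- The notification_channel column holds the SQS queue ARN needed in AWS.
SHOW PIPES;
```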
| name | sqs_queue |
| --- | --- |
| CSV | arn:aws:sqs:us-west-2:001782626159:sf-snowpipe-AIDAQ... |
Grant Access in S3
Navigate to your bucket and click properties:
Scroll down to "Create event notification":
Add a name to the notification and select "All object create events":
Scroll down, enter the SQS queue ARN we got from the Snowflake step, and click "Save changes":
Almost done. In Snowflake, let's refresh the pipe so that it ingests all the current files.
Load the data
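A sketch of the refresh, assuming the pipe name csv from the earlier sketch:

```sql
-- Queue any files already sitting in the stage for ingestion by the pipe.
ALTER PIPE csv REFRESH;
```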
| File | Status |
| --- | --- |
| /sample_1.csv | SENT |
Result
Note
Sometimes it may take 1-2 minutes before you see data in the table. This depends on how AWS is feeling today.
Let's add more sample data to the CSV folder of the S3 bucket and see it appear in Snowflake about 30 seconds later. We can verify this by running a count on our table and seeing 20 records, whereas the original CSV only had 10 records.
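For example, assuming the csv_data table from the earlier sketch:

```sql
-- After the second file lands, the count should roughly double (10 -> 20 rows).
SELECT COUNT(*) FROM csv_data;
```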