Connect Snowflake to Google Cloud Storage
The goal of this tutorial is to load JSON and CSV data from Google Cloud Storage using the COPY INTO SQL command, and to automate the ingestion process with Snowpipe.
Video
Requirements
- A Snowflake account; a free trial works. We also assume no complex security needs.
- A Google Cloud account; you can set up a free account to get started.
Download
- Sample data (Link)
Manual Loading
Let's start by setting up a Snowflake connection to Google Cloud Storage and loading JSON data. After that, we'll use Snowpipe to automate the ingestion of CSV files.
Google Cloud
Sign in to your Google account.
Create project
If you don't have a project, start by selecting/creating one. In our case we'll call the project danielwilczak and select the default No organization for the location.
Create cloud storage
We will select our new project and click Cloud Storage to create a storage bucket. Click Create or Create bucket.
I'm going to name the bucket danielwilczak as well. Copy this name; we will use it later.
Upload sample data
Upload the sample data to your google cloud storage bucket (json/csv) provided in the data folder.
Snowflake
Let's set up Snowflake by creating a worksheet, adding the code below with your bucket name from earlier, and hitting run:
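A minimal sketch of the storage integration; the integration name gcs_integration is illustrative, and the bucket name should be the one you copied earlier:

```sql
-- Sketch: integration name and bucket are illustrative; swap in your bucket.
create or replace storage integration gcs_integration
    type = external_stage
    storage_provider = 'GCS'
    enabled = true
    storage_allowed_locations = ('gcs://danielwilczak/');

-- Show the service account Snowflake created for this integration.
desc storage integration gcs_integration;
```

The DESC output includes the STORAGE_GCP_SERVICE_ACCOUNT property, which is the account we'll grant access to in Google Cloud.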
property | principal |
---|---|
STORAGE_GCP_SERVICE_ACCOUNT | rbnxkdkujw@prod3-f617.iam.gserviceaccount.com |
Grant Access in Google Cloud
Let's navigate to IAM so that we can give Snowflake access to our storage bucket.
Fill in the role information. We will call it snowflake. After that, click Add Permissions.
The permissions to select can be found in Snowflake's documentation. In this tutorial I have chosen Data loading and unloading. I have also provided a gif to show how to select the permissions because the user interface is terrible.
Navigate back to our bucket. Click Permissions, followed by Add principal.
In the new principals section, add the STORAGE_GCP_SERVICE_ACCOUNT given by Snowflake earlier.
Now add your role by clicking select role
-> custom
-> snowflake
. The last one will be your role name.
If you get a 'Domain restricted sharing' error when you click 'Save':
If you run into this error, it's because Google Cloud updated their policy as of March 2024, and we'll have to update it. First select your organization (not your project), then search for IAM, and click "Grant access".
Next we'll add our user email into the new principals area, then search for and click "Organization Policy Administrator".
Next we'll want to update the policy: search for IAM, select Organization policies, search for "domain", and click "Domain restricted sharing".
Note
"Allow All" is the simple approach, but feel free to use a more fine-grained approach via the Snowflake documentation.
We'll want to override the parent policy with a new rule. Select Replace the policy, then select "Allow All". Click Done and "Set Policy", and you're good to go.
The policy has been updated and you can retry adding the role to the new principal.
Click Save
and you're finished with Google Cloud for manual loading.
Load the data
Let's set up the stage and file format, and finally load some JSON data.
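A sketch of those steps, assuming the hypothetical names json_format, gcs_stage, and sample_json, and the storage integration created earlier:

```sql
-- File format for the JSON sample data.
create or replace file format json_format
    type = json;

-- External stage pointing at the bucket, secured by the storage integration.
create or replace stage gcs_stage
    url = 'gcs://danielwilczak/'
    storage_integration = gcs_integration
    file_format = json_format;

-- Land the raw JSON in a single variant column.
create or replace table sample_json (data variant);

copy into sample_json
    from @gcs_stage/json/;
```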
file | status |
---|---|
gcs://danielwilczak/json/sample.json | LOADED |
Look at the data you just loaded.
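For example, assuming the illustrative table name sample_json with its variant column:

```sql
select * from sample_json;

-- Individual fields can be pulled out of the variant column,
-- e.g. data:some_field::string (field names depend on your file).
```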
Automatic Loading
Warning
If you have not yet loaded data manually from Google Cloud Storage, please go back and complete that section first.
Google Cloud
First we'll start by clicking on the project selector.
We'll copy our project id. We will use this later.
Next we'll open the Google Cloud command line interface (CLI).
You will have to authorize the CLI, but once opened, a window will show below.
First we'll set the current project.
Next we'll create the notification topic for our cloud storage.
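These two CLI steps can be sketched as follows; the project id and the topic name snowpipe are illustrative, so substitute the project id you copied earlier and your bucket name:

```shell
# Set the current project (use your own project id).
gcloud config set project danielwilczak

# Create a Pub/Sub notification topic for the storage bucket.
gsutil notification create -t snowpipe -f json gs://danielwilczak/
```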
After that we're done with the CLI. We can close it.
Next we'll navigate to pub/sub using the search bar.
We'll click on the topic we created.
We'll click create subscription
.
We'll give it a name; I used snowpipe_subscription. Make sure it's set to Pull.
Once created we'll copy the full subscription name
. We will use this in the next step.
Snowflake
Let's create the notification integration in a Snowflake worksheet.
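A sketch, assuming the hypothetical name gcs_notification_integration; the subscription name is the full name you copied in the previous step:

```sql
create or replace notification integration gcs_notification_integration
    type = queue
    notification_provider = gcp_pubsub
    enabled = true
    gcp_pubsub_subscription_name = '<full subscription name from the previous step>';

-- Show the service account Snowflake uses to read from Pub/Sub.
desc notification integration gcs_notification_integration;
```

The DESC output includes the GCP_PUBSUB_SERVICE_ACCOUNT property, which we'll grant access to next.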
property | principal |
---|---|
GCP_PUBSUB_SERVICE_ACCOUNT | geimkrazlq@prod3-f617.iam.gserviceaccount.com |
Grant Access in Google Cloud
Let's go back into Google Cloud and click on our subscription.
Click show panel if it's not open already.
Add the principal we got from Snowflake in the prior step.
Click Select role and select Pub/Sub -> Pub/Sub Subscriber.
Next we'll want to go back to IAM.
Add the principal we got from Snowflake in the prior step.
Click Select role, search for Monitoring Viewer, and click Monitoring Viewer.
Final Google Cloud step - Click Save
.
Load the data
Note
Sometimes it may take 1-2 minutes before you see data in the table. This depends on how Google Cloud is feeling today.
In this case we'll load a CSV file by automating the creation of the table and inferring the column names in the pipe.
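A sketch of that setup, assuming the hypothetical names csv_format, sample_csv, and gcs_pipe, plus the stage and notification integration created earlier:

```sql
-- CSV file format that reads column names from the header row.
create or replace file format csv_format
    type = csv
    parse_header = true;

-- Create the table by inferring the schema from the staged files.
create or replace table sample_csv
    using template (
        select array_agg(object_construct(*))
        from table(infer_schema(
            location => '@gcs_stage/csv/',
            file_format => 'csv_format'
        ))
    );

-- The pipe auto-ingests new files as Pub/Sub notifications arrive.
create or replace pipe gcs_pipe
    auto_ingest = true
    integration = 'GCS_NOTIFICATION_INTEGRATION'
    as
    copy into sample_csv
    from @gcs_stage/csv/
    file_format = (format_name = 'csv_format')
    match_by_column_name = case_insensitive;

-- Queue any files already sitting in the bucket.
alter pipe gcs_pipe refresh;
```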
File | Status |
---|---|
/sample_1.csv | SENT |
Result
Let's add more sample data to the csv folder of the Google Cloud Storage bucket and see it appear in Snowflake ~30 seconds later. We can see this by doing a count on our table: 20 records, where the original CSV only has 10.
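Assuming an illustrative target table named sample_csv, the check is a simple count:

```sql
-- Per the walkthrough, this should show 20 rows once the second file lands.
select count(*) from sample_csv;
```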