AWS Kinesis to Snowflake
In this tutorial we'll show how to set up a Kinesis stream and Firehose to load data into Snowflake. We'll use a local Python script generating taxi data as our source.
Credit for the original tutorial goes to Umesh Patel!
Video
Requirements
This tutorial assumes you're starting with an empty (trial) Snowflake account and have no complex security requirements.
Download needed files:
- Data generator code (Link)
Snowflake Setup
For the Snowflake setup we'll create a service user with key-pair authentication and the table to load data into.
Generate a key pair
Let's create the private and public key so that we can attach the public key to our Snowflake user.
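One way to generate the pair is with OpenSSL, following Snowflake's key-pair authentication docs (a sketch; the original tutorial's exact commands may differ):

```bash
# Generate an unencrypted private key in PKCS#8 format (rsa_key.p8).
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

# Derive the matching public key (rsa_key.pub).
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
```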
This will create the private and public key files in the folder we're currently in.
User
Let's create the user in Snowflake and attach the public key to it. The public key file ends with `.pub`; we created it in the "Generate a key pair" step.
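A minimal sketch of that setup; the user name is illustrative, and SYSADMIN matches the role we'll point Firehose at later:

```sql
-- Create the service user and attach the public key.
-- Paste the contents of rsa_key.pub without the BEGIN/END header and footer lines.
CREATE USER kinesis_user
  DEFAULT_ROLE = SYSADMIN
  RSA_PUBLIC_KEY = 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A...';

-- Let the user assume the role Firehose will connect with.
GRANT ROLE SYSADMIN TO USER kinesis_user;
```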
Table
Let's create the table in Snowflake to load data into.
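A sketch of the DDL, using the database and schema we'll point Firehose at later. The columns are hypothetical; they must match the JSON keys your generator emits:

```sql
CREATE TABLE RAW.KINESIS.TAXI_DATA (
    trip_distance   FLOAT,
    passenger_count INTEGER,
    fare_amount     FLOAT
);
```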
If you don't have a database, schema, or warehouse yet, create them first.
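For example (the warehouse name and size here are just examples):

```sql
CREATE DATABASE IF NOT EXISTS RAW;
CREATE SCHEMA IF NOT EXISTS RAW.KINESIS;
CREATE WAREHOUSE IF NOT EXISTS KINESIS_WH
  WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```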
AWS Setup
Let's set up AWS to receive data via a Kinesis stream and then move it to Snowflake via Firehose. Along the way we'll create a user and credentials.
Stream
Let's start by heading into the AWS console and going to Kinesis.
Let's give it a name and select on-demand capacity so we're only charged for what we use.
Once it's created, we'll want to copy the region, in this case us-east-1. We'll use this later in our data generator.
Firehose
Next we'll want to search for Kinesis Firehose. Firehose is the tool that will move our stream's data into Snowflake.
Click create Firehose stream.
We'll select Kinesis Data Stream as the source and Snowflake as the destination. Next, give it a name and click "Browse" to find the stream we set up earlier.
Next we'll want to copy our account URL from Snowflake. It will look something like https://<account_identifier>.snowflakecomputing.com.
First, paste your Snowflake URL. Next, select "Private Key" and enter your Snowflake username along with the contents of the private key we generated earlier. For the role, choose "Custom Snowflake Role" and specify SYSADMIN.
For the database configuration we'll enter `RAW` for the database, `KINESIS` for the schema, and `TAXI_DATA` for the table. We'll also want Firehose to use the JSON keys as column names when the data lands in Snowflake.
For the backup settings we'll create a new S3 bucket.
We'll give the bucket a name and then scroll down and click create.
We'll go back and select the bucket we just created. If it doesn't show up right away, try clicking the refresh button in the top right.
Finish by clicking create Firehose stream.
That finishes our Firehose setup, and we're ready to have data moved into Snowflake.
AWS User / Access Key
For our data generator to connect, we'll need an AWS user with permissions for Kinesis, plus an access key and secret. In this section we'll set that up.
User
Let's start by searching for IAM and then going to Users.
Give your user a name and click next.
We need to attach a policy directly. Search for "Kinesis" and select `AmazonKinesisFullAccess`. This policy grants the permissions needed to push data to the stream. Once selected, click Next to proceed.
Access Key
After the user is created, we'll select it by clicking on its name.
We'll create an access key by clicking "Create access key".
When asked for the use case, select Command Line Interface (CLI), since the key will be used with the AWS CLI.
Now we'll want to copy our access key and secret access key so we can put them into our CLI configuration in the next step.
AWS CLI Setup
In this section we'll set up the AWS CLI so our Python script can connect to the stream. I suggest using the GUI installer for this example.
https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Download the CLI and install it. It's pretty straightforward.
Once installed, we'll want to run `aws configure` and fill in the prompts:
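The prompts look like this (the values are placeholders; use the access key and secret from the previous step and the region we copied earlier):

```bash
aws configure
# AWS Access Key ID [None]: <your access key>
# AWS Secret Access Key [None]: <your secret access key>
# Default region name [None]: us-east-1
# Output format [None]: json
```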
Python data generator
Now we're ready to start generating the data that will be passed to Kinesis and then loaded into our Snowflake table. Open the folder with the files we downloaded at the start of the tutorial.
Let's start by updating our code to use the stream we set up at the beginning. Inside `main.py` we'll update:
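Something along these lines; the variable names here are hypothetical, so match them to whatever the downloaded `main.py` actually uses:

```python
STREAM_NAME = "<your stream name>"  # the Kinesis stream we created earlier
REGION = "us-east-1"                # the region we copied from the Kinesis console
```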
Next we'll run that code to start generating the data, which will be moved to Snowflake.
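Run it with `python main.py`. For a sense of what such a generator does internally, here's a minimal sketch using boto3 (the stream name and record fields are hypothetical; the downloaded script will differ):

```python
import json
import random
import time

import boto3

STREAM_NAME = "<your stream name>"  # hypothetical; use the stream we created earlier
REGION = "us-east-1"                # the region we copied from the Kinesis console

kinesis = boto3.client("kinesis", region_name=REGION)

while True:
    # Build a fake taxi record; the JSON keys become column names in Snowflake.
    record = {
        "trip_distance": round(random.uniform(0.5, 20.0), 2),
        "passenger_count": random.randint(1, 4),
        "fare_amount": round(random.uniform(3.0, 80.0), 2),
    }
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(random.randint(0, 1000)),
    )
    print(f"Sent: {record}")
    time.sleep(1)
```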
We'll be able to see the messages being sent to Kinesis, and shortly after, our data should start arriving in Snowflake. If it doesn't, check your Firehose error logs; you might have a connection issue.
Result
Now we should see our data being loaded into Snowflake.
If we refresh our page we'll see new records being streamed into our table.
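To verify from a worksheet, a quick query works as well:

```sql
SELECT COUNT(*) FROM RAW.KINESIS.TAXI_DATA;
SELECT * FROM RAW.KINESIS.TAXI_DATA LIMIT 10;
```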