Exercise: Amazon Redshift

Natasha Ong
This is some text inside of a div block.
4 min read

Objective

By the end of this hands-on exercise, you should be able to do the following:

  • Launch a Redshift cluster*
  • Connect an SQL client to the Amazon Redshift cluster
  • Load data from an S3 bucket into the Amazon Redshift cluster
  • Run queries against data stored in Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, provided by AWS. It allows users to run complex queries on large datasets and is optimised for high-performance analysis and reporting of big data.
A Redshift cluster is a collection of computing nodes/resources and storage in Amazon Redshift, used for efficiently storing and analyzing large datasets. It consists of one or more compute nodes managed by Redshift's data warehousing service.

Task 0: Accessing the AWS Management Console

  1. Sign in to your IAM user and access your AWS Management Console.
  2. Select US West (Oregon) region for doing this exercise.

Task 1: Create an IAM role and security group

In this task, you'll create an IAM role (which is like a permission ticket) for Redshift. This IAM role will give Redshift permission to access and work with data stored in Amazon S3.

  1. Navigate to IAM in your Console.
  2. On the left side, choose Roles. Choose Create role.
  3. Under the first section (Trusted entity type), what do you think best applies to Amazon Redshift?
  4. If you're stuck: the answer is AWS service.
  5. On the Use case section, type Redshift, and choose Redshift on the drop-down.
  6. Click Next.
  7. On the Add permissions page, choose Next. We'll be adding permissions in a special way in a second!
  8. On the Name, review, and create page, type in the Rule name CustomRedshift-Role
  9. Finally, choose Create role.
  10. Under the Permissions tab, choose Add permissions, and choose Create Inline policy.
An inline policy in AWS IAM is like a set of rules written directly inside a role, outlining what that role is allowed or not allowed to do.

11. Under the Policy editor header, select JSON.

JSON is a way to organise and exchange data in a way that's easy for both people and computers to understand. So while it might look like a programming language, JSON is not a language - it's a format for structuring data! Websites often use JSON to transfer and organise data between the server and the web browser

12. Below is the JSON policy that you'll add to your CustomRedshift-Role role.

Curious what this JSON policy is saying? Let's break it down:
"Version": "2012-10-17", - This means  2012-10-17 is the date of the latest policy version. This tells you whether the policy is up to date with the latest standards and practises.

The "Statement" is a main part of the policy and includes the actual permissions we're trying to set

"Action" refers to a specific task that users can do. In this case, "s3:*" means all tasks relating to S3 (like create, update, delete).

"Resource": "*". This means that the policy applies to all S3 buckets in the account.

"Effect": "Allow". This emphasises that the actions in this policy are permitted, not denied.

13. For the policy name, enter S3forRedshift-policy.

Once you added this IAM policy to your role, you're good to go to the next step.

14. Next, you will create a security group that you will use for Redshift. Navigate to VPC console, and choose Security groups in the left hand panel.

15. Choose Create security group.

16. Configure the following:

  • Security group name: Redshift Security Group
  • Description: Redshift Security Group
  • VPC: Select the default one
  • Inbound rules: Choose Add rule
  • Type: Redshift
  • Source/Source type: Anywhere-IPv4
This source type allows any device connected to the internet using IPv4 to access your resource.

17. Choose Create security group.

Task 2: Launch an Amazon Redshift Cluster

In this task, you'll be guiding yourself through the process of setting up an Amazon Redshift cluster.

1. Navi gate to Amazon Redshift in the console.

2. Before creating a cluster, you must first create a cluster subnet group. To create the cluster subnet group, on the left side of Redshift console expand the Configurations, and choose Subnet groups.

A cluster subnet group lets you pick specific places in your network where you want your Redshift cluster to live. This helps you control what your Redshift cluster can connect to and keeps things organised and secure.

3. Choose Create cluster subnet group. Configure the following:

  • Name: redshiftclustersubnetgroup
  • Description: LabClusterSubnet
  • VPC: Choose the default VPC
  • Availability Zone: <region_code>-2a
  • Subnet: Choose the one subnet that will appear on the drop-down
  • Click Add subnet.

4. Choose Create cluster subnet group.

5. Now let's create the cluster that falls in this cluster group!

6. In the left navigation pane, choose Clusters.

7. Choose Create cluster.

8. In the Cluster configuration section:

  • Cluster identifier: lab-exercise
  • Node type*: dc2.large
When you start a Redshift cluster, one of the key things you decide is the "node type." Your node type determines how much computing power the cluster will use, how much data it can store, and how much it will cost you. For datasets under 1 TB uncompressed, DC2 node types offer the best performance at the lowest price (and dc2.large is the cheapest of the lot).
  • Number of nodes: 2
Having 2 compute nodes means that the Redshift cluster is made up of two separate processing units. Each node is like a mini-computer that works with the other to process queries and store data. This setup helps improve performance and maintaining high availability, making it faster and more reliable for running queries and working with large datasets.

9. In the Database configurations section, configure:

  • Admin user name: master
  • Select Manually add the admin password.
  • Admin user password: Redshift123

10. Under Associated IAM roles, click on Associate IAM role.

11. Select CustomRedshift-Role. This is the role we created in Task 1. It grants permission for Amazon Redshift to read data from Amazon S3.

12. In the Additional configurations section, deselect Use defaults.

13. Expand Network and security, then configure:

  • Virtual private cloud: Choose the default VPC.

VPC security groups:

  • Deselect default
  • Select Redshift Security Group
  • Cluster subnet group: redshiftclustersubnetgroup

14. Expand  Database configurations, then configure:

  • Database name: labdb
  • Scroll to the bottom of the screen, then choose Create cluster.
  • The cluster will take a few minutes to launch. While you wait, select lab-exercise and browse around the details of the cluster you've just set up.

Wait for the Status of your cluster to display Available. You can click on the refresh icon at the General information section of the page, in case the status doesn't update automatically.

Task 3: Use the Redshift Query Editor to Communicate with your Redshift Cluster

Amazon Redshift uses standard SQL. Any SQL client that works with JDBC or ODBC* can connect to Redshift.

JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity) are APIs that help computer programs communicate with a Redshift database using SQL queries. Without JDBC and ODBC, you would spend much more effort, time, and specialised knowledge to communicate with the database effectively.

JDBC is used for programs written in Java.

ODBC is used for programs written in a range of other programming languages, like C++, Python or C#

In this exercise, you will use the Amazon Redshift Query editor.

  1. In the left navigation pane, choose Query editor, then select Connect to database then configure:
  2. Cluster: lab-exercise
  3. Database name: labdb
  4. Database user: master
  5. Choose Connect.
  6. Close the modal if it doesn't close automatically.

You will see the status Connected.

Task 4: Create a Table

In this task, you will execute SQL commands to create a table in Redshift

1. Use this SQL command below - write it into the Query 1 window, then choose Run. This command will create a table named users that includes a user's name, address, and preferred music genre.


Woah! Lots of lines here. Overall, each line from the second line onwards defines a columns of a table called users. Here's a concise breakdown of each type:

  • INTEGER NOT NULL . INTEGER means the userid will be recorded in whole numbers. NOT NULL means userid cannot be left empty; every row in the database must have a value for userid.
  • CHAR(8), CHAR(2), CHAR(14) - CHAR is like a box with a fixed number of spaces. For example, CHAR(8) can only hold words that are exactly 8 spaces long. If you put a shorter word in the box, it fills the extra spaces with empty spaces.
  • VARCHAR(30), VARCHAR(100) - VARCHAR is like a flexible box that stretches to fit the word inside, but only up to a certain limit. The number in parentheses sets the limit for maximum word length. Unlike CHAR, VARCHAR doesn't add extra spaces; it only uses as much space as needed, but it won't go beyond the limit.
  • BOOLEAN  - BOOLEAN is like a switch with two positions: on (true) or off (false). It's a way to store simple yes/no or true/false information. Some databases may also represent it as 1 (true) or 0 (false).

After running the query, you should see the users table on the left side Resources. Now you'll get to take a peek at the table columns you've just set up in the SQL statement!

Task 5: Load Sample Data from Amazon S3

Amazon Redshift can bring in data from Amazon S3, and it can handle various file types, like CSV files or files with custom separators. In our case, the data is in a format where fields are separated by a vertical bar (|).

1. Delete the existing query, copy the below SQL command then paste it into the Query 1 window. Don't press Run yet!

Before running this command, you will need to insert the Role that Redshift will use to access Amazon S3.

2. On a separate tab, revisit the IAM role that you've created, and then copy the ARN of that role. It will start with arn:aws:iam::

3. Paste the Role into the query window, replacing the text YOUR-ROLE

It should look like this (except your ARN would be different)


The command loads data from the allusers_pipe.txt file (in the S3 bucket awssampledbuswest2/tickit) into the users table. It uses the specified IAM role for authentication and identifies the data as being pipe-separated (|).

Note: in case you were wondering why we didn't need to set up this bucket ourselves, it's because this bucket has been set up by AWS for learning and testing with sample data!

4. Choose Run. It will take about 10 seconds to load this file, which contains 49,990 rows of data!

Task 6: Query Data

Now that you have the data in your Redshift database you can query the data using SQL select statements and queries. If you are familiar with SQL, feel free to try additional commands to query the data.

1. Run this query to count the number of rows in the users table:


2. Scroll down to see the results. The result shows that there are almost 50,000 rows in the table.

3. Next, run this query:


This query displays users in Ohio (OH) who like sports but do not like opera. The list is sorted by their first name.

Finally, run this query:


This query shows the Top 10 cities where Jazz-loving users live.

Bonus Task: SQL Challenge!

Try to write a query for these requirements:

  1. Display the firstname and lastname...
  2. For users who like both Theatre and Classical...
  3. ... AND has the last name Smith.

Try to do it yourself before seeing the answer.

The expected output should be like this:

Task 7: Delete your resources

Challenge yourself: do you think you can delete the cluster and cluster subnet group we've created on your own?

Tip: you cannot delete a cluster subnet group until you've deleted the cluster associated with it. When deleting the cluster, make sure you uncheck the option for Final snapshot so that you don't incur any costs.

Feeling stuck? Here are the instructions:

  1. Navigate to the Clusters dashboard
  2. Select the cluster that you've created, and then choose Actions.
  3. Choose Delete.

4. Ensure you uncheck the option for Final snapshot so that you don't incur any costs.

5. Type delete to confirm the deletion, and choose Delete cluster.

  • You should see that your cluster is now being deleted.
If you forgot to uncheck this option, on the left hand navigation bar, expand Clusters and choose Snapshots. You can find, and delete there the saved snapshot of the clusters.

Now, let's delete the cluster subnet group:

  1. In Redshift console, expand Configurations at the left hand panel and choose Subnet groups.
  2. If your cluster subnet group does not show up on the list, click on the refresh button on the page.
  3. Select the cluster subnet group you've created, and choose Delete.

Optional:

You can delete the Redshift Security Group we created in Task 1 too.

Congratulations! You've completed this hands-on exercise! You have successfully:

  • Launched a Redshift cluster
  • Connected an SQL client to the Amazon Redshift cluster.
  • Loaded the data from an S3 bucket into the Amazon Redshift cluster
  • Run the queries against data stored in Amazon Redshift