By the end of this hands-on exercise, you should be able to do the following:
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, provided by AWS. It allows users to run complex queries on large datasets and is optimised for high-performance analysis and reporting of big data.
A Redshift cluster is a collection of computing nodes/resources and storage in Amazon Redshift, used for efficiently storing and analyzing large datasets. It consists of one or more compute nodes managed by Redshift's data warehousing service.
In this task, you'll create an IAM role (which is like a permission ticket) for Redshift. This IAM role will give Redshift permission to access and work with data stored in Amazon S3.
An inline policy in AWS IAM is like a set of rules written directly inside a role, outlining what that role is allowed or not allowed to do.
11. Under the Policy editor header, select JSON.
JSON is a way to organise and exchange data in a way that's easy for both people and computers to understand. So while it might look like a programming language, JSON is not a language - it's a format for structuring data! Websites often use JSON to transfer and organise data between the server and the web browser
12. Below is the JSON policy that you'll add to your CustomRedshift-Role role.
Curious what this JSON policy is saying? Let's break it down:
"Version": "2012-10-17", - This means 2012-10-17 is the date of the latest policy version. This tells you whether the policy is up to date with the latest standards and practises.
The "Statement" is a main part of the policy and includes the actual permissions we're trying to set
"Action" refers to a specific task that users can do. In this case, "s3:*" means all tasks relating to S3 (like create, update, delete).
"Resource": "*". This means that the policy applies to all S3 buckets in the account.
"Effect": "Allow". This emphasises that the actions in this policy are permitted, not denied.
13. For the policy name, enter S3forRedshift-policy.
Once you added this IAM policy to your role, you're good to go to the next step.
14. Next, you will create a security group that you will use for Redshift. Navigate to VPC console, and choose Security groups in the left hand panel.
15. Choose Create security group.
16. Configure the following:
This source type allows any device connected to the internet using IPv4 to access your resource.
17. Choose Create security group.
In this task, you'll be guiding yourself through the process of setting up an Amazon Redshift cluster.
1. Navi gate to Amazon Redshift in the console.
2. Before creating a cluster, you must first create a cluster subnet group. To create the cluster subnet group, on the left side of Redshift console expand the Configurations, and choose Subnet groups.
A cluster subnet group lets you pick specific places in your network where you want your Redshift cluster to live. This helps you control what your Redshift cluster can connect to and keeps things organised and secure.
3. Choose Create cluster subnet group. Configure the following:
4. Choose Create cluster subnet group.
5. Now let's create the cluster that falls in this cluster group!
6. In the left navigation pane, choose Clusters.
7. Choose Create cluster.
8. In the Cluster configuration section:
When you start a Redshift cluster, one of the key things you decide is the "node type." Your node type determines how much computing power the cluster will use, how much data it can store, and how much it will cost you. For datasets under 1 TB uncompressed, DC2 node types offer the best performance at the lowest price (and dc2.large is the cheapest of the lot).
Having 2 compute nodes means that the Redshift cluster is made up of two separate processing units. Each node is like a mini-computer that works with the other to process queries and store data. This setup helps improve performance and maintaining high availability, making it faster and more reliable for running queries and working with large datasets.
9. In the Database configurations section, configure:
10. Under Associated IAM roles, click on Associate IAM role.
11. Select CustomRedshift-Role. This is the role we created in Task 1. It grants permission for Amazon Redshift to read data from Amazon S3.
12. In the Additional configurations section, deselect Use defaults.
13. Expand Network and security, then configure:
VPC security groups:
14. Expand Database configurations, then configure:
Wait for the Status of your cluster to display Available. You can click on the refresh icon at the General information section of the page, in case the status doesn't update automatically.
Amazon Redshift uses standard SQL. Any SQL client that works with JDBC or ODBC* can connect to Redshift.
JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity) are APIs that help computer programs communicate with a Redshift database using SQL queries. Without JDBC and ODBC, you would spend much more effort, time, and specialised knowledge to communicate with the database effectively.
JDBC is used for programs written in Java.
ODBC is used for programs written in a range of other programming languages, like C++, Python or C#
In this exercise, you will use the Amazon Redshift Query editor.
You will see the status Connected.
In this task, you will execute SQL commands to create a table in Redshift
1. Use this SQL command below - write it into the Query 1 window, then choose Run. This command will create a table named users that includes a user's name, address, and preferred music genre.
Woah! Lots of lines here. Overall, each line from the second line onwards defines a columns of a table called users. Here's a concise breakdown of each type:
After running the query, you should see the users table on the left side Resources. Now you'll get to take a peek at the table columns you've just set up in the SQL statement!
Amazon Redshift can bring in data from Amazon S3, and it can handle various file types, like CSV files or files with custom separators. In our case, the data is in a format where fields are separated by a vertical bar (|).
1. Delete the existing query, copy the below SQL command then paste it into the Query 1 window. Don't press Run yet!
2. On a separate tab, revisit the IAM role that you've created, and then copy the ARN of that role. It will start with arn:aws:iam::
3. Paste the Role into the query window, replacing the text YOUR-ROLE
The command loads data from the allusers_pipe.txt file (in the S3 bucket awssampledbuswest2/tickit) into the users table. It uses the specified IAM role for authentication and identifies the data as being pipe-separated (|).
Note: in case you were wondering why we didn't need to set up this bucket ourselves, it's because this bucket has been set up by AWS for learning and testing with sample data!
4. Choose Run. It will take about 10 seconds to load this file, which contains 49,990 rows of data!
Now that you have the data in your Redshift database you can query the data using SQL select statements and queries. If you are familiar with SQL, feel free to try additional commands to query the data.
1. Run this query to count the number of rows in the users table:
2. Scroll down to see the results. The result shows that there are almost 50,000 rows in the table.
3. Next, run this query:
This query displays users in Ohio (OH) who like sports but do not like opera. The list is sorted by their first name.
Finally, run this query:
This query shows the Top 10 cities where Jazz-loving users live.
Try to write a query for these requirements:
Try to do it yourself before seeing the answer.
The expected output should be like this:
Challenge yourself: do you think you can delete the cluster and cluster subnet group we've created on your own?
Tip: you cannot delete a cluster subnet group until you've deleted the cluster associated with it. When deleting the cluster, make sure you uncheck the option for Final snapshot so that you don't incur any costs.
Feeling stuck? Here are the instructions:
4. Ensure you uncheck the option for Final snapshot so that you don't incur any costs.
5. Type delete to confirm the deletion, and choose Delete cluster.
If you forgot to uncheck this option, on the left hand navigation bar, expand Clusters and choose Snapshots. You can find, and delete there the saved snapshot of the clusters.
Now, let's delete the cluster subnet group:
Optional:
You can delete the Redshift Security Group we created in Task 1 too.
Congratulations! You've completed this hands-on exercise! You have successfully: