In a nutshell:
Amazon Athena: Search and understand data in S3 using SQL. It's fast, cheap, and serverless (you only pay per query).
Amazon Redshift: A popular cloud data warehouse that analyses lots of data using SQL. It works with BI tools, starts at a low cost, and offers Redshift Spectrum for running queries directly against S3 objects.
Amazon Kinesis: A service that handles and analyses data as it happens, like watching live videos or tracking clicks on a website. It splits into four sub-services: Data Firehose, Managed Service for Apache Flink, Data Streams, and Video Streams for different data needs.
Amazon Kinesis Data Firehose: Grabs, changes, and loads data in real-time into S3, Redshift, or OpenSearch.
Amazon Managed Service for Apache Flink: Takes care of processing live data using an open source tool called Apache Flink.
Amazon Kinesis Data Streams: Captures gigabytes of live data every second from many sources.
Amazon Kinesis Video Streams: Streams video from devices into AWS for analysis and replay. It's also handy for using artificial intelligence on videos.
Amazon Managed Streaming for Apache Kafka (Amazon MSK): Makes it easy to use Apache Kafka (a tool for putting together data from multiple streams) by handling the server management for you.
Amazon OpenSearch Service: Helps search, analyse, and visualise data quickly. Good for tasks like monitoring applications or looking through logs.
AWS Data Exchange: A marketplace to subscribe to useful data and use it in your AWS applications without dealing with servers.
Amazon EMR: A big data tool that processes loads of different kinds of information quickly. It uses open source tools like Apache Spark and is cheaper than traditional on-premises solutions.
AWS Glue: Sticks your data from different sources together for analysis.
Amazon Macie: Uses AI to find and protect sensitive info in your data.
Amazon QuickSight: Transforms data into charts and graphs.
What is data analytics?
Data analytics converts raw data into actionable insights. These insights then help companies have a deeper understanding of:
- Their business processes
- Employee productivity
- Their products and services
- The customer experience and customer problems
What's the difference between AI/ML and data analytics?
- Data analytics does not continuously learn or train to improve its model over time, whereas ML models do.
- Data analytics services summarise data to create reports, dashboards and graphs. AI/ML services go a step further, using data to make predictions, recognise images, or power entire applications (like Lex and Polly).
- How would a data analytics service analyse data and create insights? Psst... AI/ML services are running in the background to make it all happen! You'll spot AI and ML get mentioned in this topic too.
Now let's dive into some of the awesome AWS data analytics services!
Amazon Athena is your search engine for data stored in S3.
- You can analyse and query data using SQL, which is super useful for digging into log files, reports and any raw data on users' activity on your applications. Athena delivers most results in seconds.
- Athena is serverless, so you don't have to set up any complex processes or servers to run your queries.
- You pay per query, based on the amount of data each query scans (priced per terabyte).
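To make this concrete, here is a minimal sketch of the kind of SQL you might run against log data, plus a rough cost estimate. It does not call Athena: Python's built-in `sqlite3` stands in for the query engine, and the table name `access_logs`, the sample rows, and the price passed to `estimate_scan_cost` are all made up for illustration. Treating a terabyte as 1024^4 bytes is also an assumption.

```python
import sqlite3

# Hypothetical access-log rows; with Athena these would live as files in S3.
LOG_ROWS = [
    ("2024-01-01", "/home", 200),
    ("2024-01-01", "/checkout", 500),
    ("2024-01-02", "/home", 200),
    ("2024-01-02", "/checkout", 500),
    ("2024-01-02", "/search", 404),
]


def count_errors_by_path(rows):
    """Run a SQL aggregation over log rows using an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_logs (day TEXT, path TEXT, status INTEGER)")
    conn.executemany("INSERT INTO access_logs VALUES (?, ?, ?)", rows)
    query = """
        SELECT path, COUNT(*) AS errors
        FROM access_logs
        WHERE status >= 400
        GROUP BY path
        ORDER BY errors DESC, path
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


def estimate_scan_cost(bytes_scanned, price_per_tb):
    """Estimate the cost of one query when billing follows data scanned.

    price_per_tb is an input you supply; it is not a quoted AWS price.
    """
    terabytes = bytes_scanned / 1024**4
    return terabytes * price_per_tb


if __name__ == "__main__":
    print(count_errors_by_path(LOG_ROWS))
    print(estimate_scan_cost(512 * 1024**3, price_per_tb=5.0))
```

The cost function shows why scanning less data (for example by querying fewer columns or smaller files) makes each query cheaper.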
Amazon Redshift is the most widely used cloud data warehouse. We've learnt about Redshift as a column-based database solution, but let's focus on its data analytics skills.
- Redshift can work with your existing Business Intelligence tools and SQL to run queries against terabytes to petabytes of data.
- You can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional on-premises solutions.
- It can accept data from S3, DynamoDB, other databases, and Amazon Kinesis (which you'll learn about below).
- Amazon Redshift Spectrum: An extension that lets you query data directly in S3 without needing to load it into Redshift. Unlike Athena, Redshift Spectrum is not serverless, so you need to provide Redshift compute instances to process the queries. In return, Redshift Spectrum lets you combine data stored in your Redshift tables with external data in S3 in a single query.
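As a quick worked example of the pricing figures quoted above, the sketch below turns them into monthly and yearly totals. The two constants come straight from the text; the function names, the 30-day month, and treating a petabyte as 1,000 terabytes are assumptions made for the arithmetic.

```python
HOURLY_ENTRY_PRICE = 0.25     # dollars per hour, figure quoted in the text
YEARLY_PRICE_PER_TB = 1000.0  # dollars per terabyte per year, figure quoted in the text


def entry_cost(hours):
    """Cost of running the smallest setup for a number of hours."""
    return hours * HOURLY_ENTRY_PRICE


def warehouse_cost_per_year(terabytes):
    """Yearly cost of a warehouse holding the given number of terabytes."""
    return terabytes * YEARLY_PRICE_PER_TB


if __name__ == "__main__":
    # Assumption: a 30-day month of non-stop use.
    print(entry_cost(24 * 30))            # 180.0
    # Assumption: one petabyte taken as 1,000 terabytes.
    print(warehouse_cost_per_year(1000))  # 1000000.0
```

So starting small costs roughly $180 for a month of continuous use, while a petabyte-scale warehouse at the quoted rate is about $1 million per year.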
Kinesis is a fully managed service that processes and analyses streaming data.
- Streaming data is data that's being generated in real-time, like live video and audio, users clicking on your websites, and records of activity in your application.
- For example, a mobile app could use Kinesis to track what users tap on and which areas of that application are most used.
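The mobile app example can be sketched in a few lines of plain Python. This is not the Kinesis API: the event shape (a dict with `user` and `area` keys) and the function name `most_used_areas` are invented for illustration, and a short list stands in for a stream that would normally never end.

```python
from collections import Counter


def most_used_areas(tap_events, top_n=3):
    """Count taps per app area as events arrive and return the busiest areas."""
    counts = Counter()
    for event in tap_events:  # with real streaming data this loop keeps running
        counts[event["area"]] += 1
    return counts.most_common(top_n)


# Hypothetical tap events from a mobile app.
TAPS = [
    {"user": "u1", "area": "search"},
    {"user": "u2", "area": "cart"},
    {"user": "u1", "area": "search"},
    {"user": "u3", "area": "profile"},
    {"user": "u2", "area": "search"},
    {"user": "u3", "area": "cart"},
]

if __name__ == "__main__":
    print(most_used_areas(TAPS))
```

The key idea is that each event updates the running totals the moment it arrives, rather than waiting for a nightly report.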
Amazon Kinesis currently offers four services:
- Kinesis Data Firehose
- Managed Service for Apache Flink
- Kinesis Data Streams
- Kinesis Video Streams
Amazon Kinesis Data Firehose
- Data Firehose can capture, transform, and load streaming data into Amazon S3, Amazon Redshift and Amazon OpenSearch Service (which you'll learn about in a second).
- It does all this in real-time, which means you can get near real-time analytics if you pass this data onto your existing business intelligence tools and dashboards set up for visualising data!
- It is a fully managed service that automatically scales to match the volume of your data.
- It can also group, compress (i.e. shrink its size), transform, and encrypt the data before loading it, minimising storage use and increasing security.
- You can set up Firehose in minutes from your AWS Management Console and continuously stream data from hundreds of thousands of data sources to AWS.
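Here is a rough sketch of the group, transform, and compress steps described above, written in plain Python rather than against the Firehose API. The record shape, the `transform` rule (lower-casing an event name), and the function names are assumptions for illustration. Encryption is left out to keep the example short.

```python
import gzip
import json


def transform(record):
    """Example transformation: normalise the event name to lower case."""
    return {**record, "event": record["event"].lower()}


def batch_and_compress(records, batch_size):
    """Group records into batches, transform each record, and gzip each batch."""
    blobs = []
    for start in range(0, len(records), batch_size):
        batch = [transform(r) for r in records[start:start + batch_size]]
        # One JSON document per line is a common layout for files of records.
        payload = "\n".join(json.dumps(r) for r in batch).encode("utf-8")
        blobs.append(gzip.compress(payload))
    return blobs


def decompress(blob):
    """Reverse the compression to read the records back."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Hypothetical incoming records.
RECORDS = [{"id": i, "event": "CLICK"} for i in range(5)]

if __name__ == "__main__":
    blobs = batch_and_compress(RECORDS, batch_size=2)
    print(len(blobs), "compressed batches")
    print(decompress(blobs[0]))
```

Batching means fewer, larger objects land in storage, and compressing each batch is what reduces the storage used.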
Amazon Managed Service for Apache Flink
- Apache Flink is a data processing engine that specialises in live data. You can directly query the data it collects using SQL, or write code in Java to perform complex tasks.
- Apache Flink was not created by AWS, but Amazon Managed Service for Apache Flink (what a mouthful!) makes it easier to build and run real-time stream processing applications using Apache Flink. This service also replicates your data across multiple Availability Zones without you having to pay for additional storage capacity.
- This service integrates with a wide range of destinations, including Amazon MSK (which you'll learn about in a second), Kinesis Data Streams, Kinesis Data Firehose, S3, and DynamoDB.
- It is also a fully managed service - there are no servers to manage, and there is no compute and storage infrastructure to set up. You pay only for the resources you use.
Amazon Kinesis Data Streams
- Kinesis Data Streams can continuously record gigabytes of data per second from hundreds of thousands of sources.
- These sources can be website clickstreams (i.e. data on every time users click on your website), financial transactions, social media feeds and location-tracking events.
- The data collected is available in milliseconds for real-time analytics, which is helpful for real-time dashboards, detecting suspicious behaviour online, dynamic pricing (e.g. recording stock prices changing), and more.
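To give a feel for "detecting suspicious behaviour" on live data, here is a small sketch that flags users who generate too many events in a short time window. It is plain Python, not the Kinesis Data Streams API, and the event shape, thresholds, and function name are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def flag_suspicious(events, max_events, window_seconds):
    """Return users who exceed max_events within any window_seconds span.

    events: iterable of (timestamp_seconds, user_id), ordered by timestamp.
    """
    recent = defaultdict(deque)  # per-user timestamps still inside the window
    flagged = set()
    for timestamp, user in events:
        window = recent[user]
        window.append(timestamp)
        # Drop events that have fallen out of the time window.
        while window and timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_events:
            flagged.add(user)
    return flagged


# Hypothetical login attempts: user "a" tries four times in four seconds.
EVENTS = [(0, "a"), (0, "b"), (1, "a"), (2, "a"), (3, "a"), (30, "b")]

if __name__ == "__main__":
    print(flag_suspicious(EVENTS, max_events=3, window_seconds=10))
```

Because each event is checked as it arrives, the alert can fire within moments of the suspicious activity instead of hours later.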
Amazon Kinesis Video Streams
- Amazon Kinesis Video Streams makes it easy to securely stream video from millions of connected devices to AWS.
- You can use the streamed video data for analytics, further processing with AI/ML services (like Rekognition) and replay.
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Like Apache Flink, Apache Kafka is a platform for building real-time streaming data processes and apps. While Flink is more about processing and analysing data in real-time, Kafka's focus is on collecting streams of data from multiple sources and passing them on to the systems that need them.
- When you run Apache Kafka on your own, you need to set up your own servers, replace them when they fail, manage updates, architect them for high availability and scaling, and handle storage.
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes this process easy for you. With a few clicks you can create highly available Apache Kafka clusters (i.e. groups of servers) with automatic settings based on best practices. Amazon MSK also monitors their performance for high availability and secures your data by encrypting data at rest.
Amazon OpenSearch Service
- OpenSearch is open-source software that helps you easily search, analyse, and understand your data. It can find exactly what you're looking for in a massive amount of information, and turn that information into insights.
- Amazon OpenSearch Service integrates OpenSearch with AWS services like Amazon VPC, AWS KMS, Kinesis Data Firehose, AWS Lambda, AWS IAM, Amazon Cognito, and Amazon CloudWatch, so that you can go from raw data to actionable insights quickly.
- People often use OpenSearch Service for analysing computer logs, searching through text, monitoring activity on applications, and analysing how users interact on their website.
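To show what "searching through logs" involves, here is a toy search index in plain Python. It uses an inverted index (a map from each word to the lines that contain it), which is a common search technique; this is a simplified illustration, not the OpenSearch API, and the function names and sample log lines are invented.

```python
from collections import defaultdict


def build_index(log_lines):
    """Map each lower-cased word to the set of line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(log_lines):
        for word in line.lower().split():
            index[word].add(line_no)
    return index


def search(index, log_lines, query):
    """Return the log lines that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return [log_lines[i] for i in sorted(matches)]


# Hypothetical application log lines.
LOGS = [
    "ERROR payment timeout",
    "INFO user login",
    "ERROR user login failed",
]

if __name__ == "__main__":
    index = build_index(LOGS)
    print(search(index, LOGS, "error login"))
```

Building the index up front is what lets a search answer quickly: a lookup touches only the words in the query, not every line of every log.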
AWS Data Exchange
AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. Think of it as a data marketplace that connects you with valuable information for your projects. Data users can buy subscriptions to high quality data in just a few clicks. Data providers can reach millions of AWS customers without having to worry about servers for data storage, delivery, or customer billing.
The data providers on Data Exchange are all well-known brands in their industry, like:
- Reuters (one of the largest news agencies in the world), who curate data from over 2.2 million unique news stories per year in multiple languages
- Change Healthcare (healthcare technology company with 89 locations in the world), who process and anonymise more than 14 billion healthcare transactions and $1 trillion in claims annually
- Dun & Bradstreet (one of the world's leading suppliers of business information), who maintain a database of more than 500 million records of global businesses.
- Foursquare (a leading platform that powers location services in our devices and apps), who have location data from 220 million consumers.
Once subscribed to a data source, you can use the AWS Data Exchange API to load data directly into Amazon S3 and then analyse it with other AWS analytics and ML services. Some popular use cases are:
- Property insurers can subscribe to historical weather data to figure out the riskiness of different locations.
- Restaurants (especially fast food chains) can subscribe to population and location data to identify optimal areas for expansion.
- Academic researchers can subscribe to data on carbon dioxide emissions for their studies on climate change.
- Healthcare professionals can subscribe to data from other clinical trials to help in their own research.
- Amazon EMR (previously called Elastic MapReduce) is a big data* service for processing huge amounts of data using open source tools.
Big data means handling a huge amount of different kinds of information. "Big" doesn't just refer to the volume of data, but also its:
- Variety i.e. data is coming in different formats and types, some data could be unstructured while others are structured.
- Velocity i.e. data is generated and collected at high speed.
- Veracity i.e. the quality and reliability of the data can vary.
- Some examples of big data are social media posts and readings from motion sensors. Special tools are needed to make sense of big data and help businesses make smart decisions.
- An open source tool that EMR often uses is Apache Spark, one of the most popular engines for running calculations, custom analysis, and machine learning on large amounts of data.
- Amazon EMR automates time-consuming tasks like setting up and managing clusters. You can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over three times faster than standard open source Apache Spark.
- You can run EMR on Amazon EC2 instances, Amazon EKS clusters, or on-premises.
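The "MapReduce" in EMR's old name refers to a way of splitting a big job into small pieces. Here is a minimal single-machine sketch of that idea in plain Python, counting words across chunks of text. It is not Spark or EMR code, and the function names are invented; on a real cluster each chunk would be handled by a different server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word, 1) for word in chunk.lower().split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map, shuffle, and reduce over every chunk of input."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data big", "data tools"]))
```

Because each chunk is mapped independently, the work can be spread across as many servers as you like, which is what makes this approach suit huge datasets.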
Preparing your data is the first step in getting quality results in any analytics or machine learning project.
- AWS Glue is a serverless ETL* service that makes it easy to discover, prepare, move, and integrate data from multiple sources. Remember it as the 'glue' that sticks together all your data!
- ETL (extract, transform, load) is a process for handling data to make sure it's in the right shape and place for analytics.
- Extract = collecting data from different sources.
- Transform = cleaning and reshaping data to make everything consistent with each other.
- Load = putting data in the target destination.
- You can immediately search and query loaded data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
- If you're thinking this sounds oddly similar to Amazon MSK, you're not alone! They're often confused for each other, but remember that Glue is focused on preparing data in batches for analytics while MSK is focused on simplifying Kafka clusters for real-time data streaming.
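The three ETL steps can be sketched in plain Python. This is an illustration of the process rather than AWS Glue code: the two sample sources, the field names `email` and `spend`, and the list standing in for the target destination are all assumptions.

```python
import csv
import io
import json

# Two hypothetical sources that describe customers in different formats.
CSV_SOURCE = "Email,Spend\nAda@Example.com ,10.5\n"
JSON_SOURCE = '[{"email": "bob@example.com", "spend": 7}]'


def extract(csv_text, json_text):
    """Extract: collect rows from two differently formatted sources."""
    csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
    json_rows = json.loads(json_text)
    return csv_rows + json_rows


def transform(rows):
    """Transform: make field names, casing, and types consistent."""
    cleaned = []
    for row in rows:
        normalised = {key.strip().lower(): value for key, value in row.items()}
        cleaned.append({
            "email": str(normalised["email"]).strip().lower(),
            "spend": float(normalised["spend"]),
        })
    return cleaned


def load(rows, target):
    """Load: put the cleaned rows in the target destination (a list here)."""
    target.extend(rows)
    return target


if __name__ == "__main__":
    warehouse = load(transform(extract(CSV_SOURCE, JSON_SOURCE)), [])
    print(warehouse)
```

After the transform step, both rows share the same field names and types, so a single query can run across data that started out in two different shapes.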
- Macie uses machine learning to discover, classify, and protect sensitive data stored in S3.
- Macie also uses AI to recognise if your S3 objects contain any sensitive PII* data and provides dashboards, reports, and alerts on your data stored in S3.
PII (personally identifiable information) is personal data used to establish a person's identity. This includes your name, home address, email address, driver's license/passport number, date of birth, bank account/credit card information, and more. PII needs to be kept secure so it is not misused or exploited.
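To picture what "recognising PII" means, here is a deliberately simple sketch that scans text for two kinds of PII using pattern matching. Macie itself uses machine learning; the regular expressions, labels, and function name below are simplified stand-ins for illustration and would miss many real cases.

```python
import re

# Simplified, assumed patterns; real PII detection is far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def find_pii(text):
    """Return a dict of PII label -> list of matching strings found in text."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


if __name__ == "__main__":
    sample = "Contact ada@example.com, card 4111 1111 1111 1111."
    print(find_pii(sample))
```

A report like this tells you which objects contain sensitive data, which is the starting point for locking them down.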
- Amazon QuickSight is a business intelligence (BI) service that lets you create and publish interactive dashboards that can be accessed by anyone through their web browser.
- QuickSight is often used by companies to share internal information (e.g. financial performance) with all employees. You can also embed dashboards in your apps, giving customers analytics about their own usage or data.
- Amazon QuickSight is fully managed, so it easily scales to tens of thousands of users without any software to install, servers to deploy, or infrastructure to manage.