I find it interesting that a quote from Charles Darwin, stated over 100 years ago, is essentially being quoted today by Amazon Founder & CEO Jeff Bezos… but I know why!
Today more than ever businesses & IT professionals have to evolve, because everything is changing so fast. You & I need to evolve to newer technologies at a more rapid pace than ever before.
This is the point I’d like to make in this (rather long, but broken into sections) post.
I created a 6.5-hour video tutorial on AWS Glue & Amazon Athena for Pluralsight, which was reduced to 2.5 hours and published under the title “Serverless Analytics on AWS”.
Since I had the extra content, I created “augmented videos” that complement the course modules. The next 6 sections are the first 6 videos (here, made into a blog post) in the series I entitled “AI/ML & Advanced Analytics on AWS”. NOTE: nothing overlaps with the course, so it’s best to use a combination of course content & YouTube content, which is explained in “Explainer Video 1”.
For each of the sections, I embed the YouTube video associated with the section & then include screenshots of the slides for those of you who like to read (smile!)
To help you navigate these first 6 videos, I’ve created a document that describes what topics are in each video along with their respective timeframes. In addition, this document begins with a list of the DEMOS that the Pluralsight course has. Below are a couple of screenshots showing how I break down the content in the document:
SECTION/VIDEO 1: “Explainer Video“
Below you’ll find the embedded YouTube Explainer Video 1:
As of today, 11/7/2019, I have YouTube Videos 1, 2, 3.1, 3.2, 3.3, & 3.4 done; as soon as I can, I’ll finish the rest of the YouTube videos. Video 4 will be on Amazon S3 Data Lakes. The other videos will be on AWS Glue, Amazon Athena, Amazon QuickSight with a DEMO, an entirely new DEMO on automating AWS Glue Crawlers & AWS Glue Transformations, & knowing me there will be more ( 🙂 ), but I don’t know the order yet other than Video 4. I’ll get these done asap for you!
Here are some screenshots from the explainer video to give you an idea of how I explain the best way to watch the course content with the YouTube content:
SECTION/VIDEO 1: Explainer Video in Text & Screenshots
Analytics today stretches beyond Business Intelligence to include real-time streaming & analytics augmented with Artificial Intelligence. AWS Glue, Amazon Athena, & Amazon S3 Data Lakes have transformed how businesses perform cutting-edge analytics in this age of AI & ML. Thus, the sooner you begin to learn how to use these services, the smaller the learning curve you’ll face as AI & ML progress.
Let’s look at a couple different scenarios that you’ll see in each of the videos in this series.
The screenshot on this slide shows the TOC of the Pluralsight course. Nothing in the YouTube videos is in the Pluralsight course & vice-versa.
Each video will contain a slide or two that will point out in a screenshot what you should watch before that video (if relevant) & what you should watch after that video (if relevant) for the particular YT video you’re looking at.
Let’s say a YouTube video fits perfectly in between 2 Pluralsight modules. I’d show a screenshot like the one above, with the 2 modules in the course TOC surrounded by colored rectangles, & suggest which module to watch before the YouTube video & which to watch after. This pattern of interweaving the YouTube videos with the course modules will be included in every YouTube video, so you’re getting the maximum understanding & value from both sources.
This is another scenario for the suggested viewing order of this YT video series. Let’s say you’re watching the Pluralsight course module entitled “The Power of Amazon Athena”, surrounded by an orange rectangle. This module consists of individual sections, shown by the larger orange rectangle. Let’s say you just finished watching the section entitled “Databases & Tables” in the course, shown by the yellow rectangle, & you’re about to watch the first demo in the section. Demos will always have a pink rectangle around them in all screenshots of the course.
Let’s say I suggest that you watch the section on Databases & Tables in the YT video series after the section on Databases & Tables in the course, because there’s more content in the YT video, & THEN watch the Demo shown on this slide. Another example: I might suggest that after you watch the last section in the course, entitled “Monitoring & Tracing Ephemeral Data,” & before you start the next course section, entitled “How to AI & ML Your Apps & Business Processes,” you watch video Demo #2 on YT, because that entire demo was cut from the course. You get the picture!
In the next few slides I’ll show you the TOC of the Pluralsight course, so you’ll know what to look forward to learning; HOWEVER, note that you’ll get more than what you see in the course, including some completely separate full modules that aren’t in the course as well! Note also that all of the demos & the course code download link to GitHub are only in the course, & the content slides in the course aren’t in the YT series.
The slide above shows you the 1st 3 modules of the course. Notice that each module has a play button, surrounded by a blue circle, to the left of the module name that you click to watch that section. In addition, each module has a chevron (a “caret” character) on the right, surrounded by a pink circle, that expands the top-level module name to show the individual sections & their titles within each module.
The first module has only 1 section, entitled “Course Overview,” that serves as a summary of what you’ll learn & how you’ll benefit from watching the course. The next module also has only 1 section, entitled “Download & Install Course Pre-requisites.” You’ll need to watch this section so you can download the course code assets from my GitHub account & make sure you have a 3rd-party SQL client installed locally on your machine.
The next module shown in this screenshot is entitled “The State of Analytics in the AWS Cloud.” It overviews the AWS services we’ll be using in the course, including the Amazon S3 Data Lake platform, AWS Glue, Amazon Athena, & the cloud-native relational database Amazon Aurora. We’ll create an Aurora database in the cloud, give it a schema & data via our 3rd-party SQL client, then query it & do some transformations. Later in the course, we’ll use AWS Glue Crawlers to crawl this Aurora database, extract the schema & statistics, & populate the AWS Glue Data Catalog with that data. Subsequently, we’ll perform the same queries & transformations on the Aurora table in the Glue Data Catalog just as we did with the 3rd-party SQL client, but without having to connect to the database remotely, showing how much faster, easier, & more efficient querying with AWS Glue & Amazon Athena is vs. connecting remotely (among other things!)
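To make the querying part of that workflow concrete, here’s a minimal sketch (not from the course) of running a query with boto3’s Athena client against a database the crawler populated in the Glue Data Catalog. The database & S3 bucket names are placeholders, & the client is passed in as a parameter so the flow can be exercised without an AWS account:

```python
import time


def run_athena_query(athena, sql, database, output_s3):
    """Start an Athena query & poll until it reaches a terminal state.

    `athena` is a boto3 Athena client in real use; it's injected as a
    parameter so the function can be tested with a stub.
    """
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(2)  # give the query time to run between polls


# Real use (requires AWS credentials; names below are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   qid, state = run_athena_query(
#       athena, "SELECT * FROM customers LIMIT 10",
#       database="aurora_catalog_db",
#       output_s3="s3://my-athena-results-bucket/")
```

The client-injection pattern also mirrors how you’d wire this into a larger automation script.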
This screenshot shows the next 2 modules in the course. At the top is the module named “Infrastructure & Data Setup via Amazon CloudFormation” that has the first DEMO in the course. This demo will build out the entire relational data part of the course. This entire module isn’t replicated in the YT video series because this module is absolutely centered on the DEMO, which I can’t replicate due to my contract with Pluralsight. You should watch this module in the course because we’ll be building upon this initial AWS framework throughout the rest of the course & video series.
The next module shown in this screenshot is entitled “The Power of AWS Glue.” This module explains important core concepts & features of AWS Glue. The information in the course module is important & not duplicated in the YT video series, so you should watch the course module. However, I suggest you watch the corresponding video in the YT series after that because it contains so many more integral concepts.
The screenshot above shows the next module expanded named “Creating AWS Glue Resources & Populating the AWS Glue Data Catalog.” This module has 4 DEMOS that you’ll only be able to watch in the course. The demos are:
- Configuring an Amazon S3 Bucket & Uploading the Python Transformation Script
- Creating the AWS Glue Infrastructure Architecture
- Running the First Crawler to Populate the Glue Data Catalog & Run the Glue ETL Job to Transform the Data
- Creating a New AWS Crawler to Crawl the Parquet-formatted Data
After watching this module in the course & doing all the demos, I suggest watching the YT video associated with this module because it gives you more insight into what you’re doing in the demos, as well as a pretty good understanding of AWS Glue’s primary & necessary topics. In addition, the accompanying AWS Glue YT video will have a demo that automates a Glue Crawler & a Glue ETL Job for transforming data, building a completely automated Glue workflow. There are no automation demos in the course, so you’ll need to watch that YT video once it’s published. Automation can save you a lot of time indeed!
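As a taste of what that kind of automation looks like, here’s a hedged sketch of chaining a Glue Crawler & a Glue ETL Job with boto3: start the crawler, wait for it to return to READY, then kick off the job. The crawler & job names are hypothetical, & the client is injected so the logic can be tested with a stub:

```python
import time


def run_crawler_then_job(glue, crawler_name, job_name, poll_seconds=15):
    """Crawl first (to refresh the Data Catalog), then transform.

    `glue` is a boto3 Glue client in real use; injected for testability.
    Returns the ETL job's run id.
    """
    glue.start_crawler(Name=crawler_name)
    # Crawler states cycle RUNNING -> STOPPING -> READY; wait for READY.
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)
    return glue.start_job_run(JobName=job_name)["JobRunId"]


# Real use (requires AWS credentials; names below are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   run_id = run_crawler_then_job(glue, "aurora-crawler",
#                                 "parquet-transform-job")
```

In practice you’d likely schedule this (or use Glue triggers/workflows), but the crawl-then-transform ordering is the heart of it.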
The above screenshot shows the last 2 modules of the Pluralsight course. At the top the module is entitled “The Power of Amazon Athena”, & it has 1 DEMO on working with Amazon Athena’s Databases & Tables.
I suggest after watching the course module on Athena, that you watch the YT video that corresponds to the online course module. It’ll have MUCH MORE information in it that’s very beneficial to know, including but not limited to:
- Elaborating more on Databases & Tables in Amazon Athena
- An entire section on using Workgroups with Athena that isn’t in the course
- Elaboration on using Partitions in Athena
- An entire section on The Top 10 Performance Tuning Tips for Athena
- An extended section on Monitoring & Tracing Ephemeral Data
- A section on using Amazon QuickSight, including a DEMO
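To give you a flavor of the Workgroups topic, here’s a minimal sketch, assuming boto3, of creating an Athena Workgroup that enforces its own result location & caps the bytes scanned per query, a common cost guardrail. The workgroup name & bucket are placeholders, & the client is injected for testability:

```python
def create_analytics_workgroup(athena, name, output_s3, scan_limit_bytes):
    """Create an Athena Workgroup with an enforced result location
    and a per-query scan limit.

    `athena` is a boto3 Athena client in real use; injected so the
    call can be verified with a stub.
    """
    return athena.create_work_group(
        Name=name,
        Configuration={
            "ResultConfiguration": {"OutputLocation": output_s3},
            # Enforce the workgroup's settings over per-query overrides.
            "EnforceWorkGroupConfiguration": True,
            "PublishCloudWatchMetricsEnabled": True,
            "BytesScannedCutoffPerQuery": scan_limit_bytes,
        },
        Description="Workgroup for the analytics team",
    )


# Real use (requires AWS credentials; names below are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   create_analytics_workgroup(
#       athena, "analytics-team",
#       "s3://my-athena-results-bucket/analytics/",
#       10 * 1024 ** 3)  # 10 GB per query
```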
The last module for both the course & the YT video series is entitled “How to AI & ML Your Apps & Business Processes”. You should watch the course module, then afterwards watch the accompanying YT video because the YT video has a lot of additional information that totally ties together everything you’ve learned in both the Pluralsight course & the video series!
There will be additional content in this YT video series that isn’t in the Pluralsight course. I’ll point out those videos when I create them.
Personally, I LOVE quotes. I’ve added quotes in the YT series that amazingly fit perfectly into each YT video.
The above quote is from Sergey Brin, Co-founder of Google. It reads “The only way you are going to have success is to have lots of failures first”. I’m aware that what I’m going to be teaching you through these videos isn’t going to be easy: it’s a steep learning curve. But don’t give up! Having many failures & even small wins will bring you closer to conquering that mountain!
The above quote is from Jack Welch, Former Chairman & CEO of General Electric. It reads “Change before you have to”.
Jack Welch’s quote has a direct correlation to the rapid pace of technology today. Those who start when the technologies are beginning to advance rapidly will have an easier time keeping up as newer technologies, built upon what’s new now, are released.
SECTION/VIDEO 2: Advanced Data & Emerging Technologies
This section augments the course by explaining how advanced analytics has impressively & massively progressed. We are now able to ask & solve many new types of analytical questions, giving deeper insights than ever before. This section also describes what technologies are emerging today – at an unprecedented rate – that are changing our lives completely.
Below you’ll find the YouTube Video 2 of the course:
Here are some screenshots from “Advanced Data & Emerging Technologies” to give you an idea of what’s covered in this video 2:
SECTION/VIDEO 2: Advanced Data & Emerging Technologies in Text & Screenshots
In this Video #2, I’ll explain how advanced analytics has impressively progressed, the types of questions that can be solved with analytics today, & what technologies are emerging at an unprecedented rate! I’ll also begin to explain why YOU need to know the compelling content in this video series, which will be expanded upon in subsequent videos when appropriate.
Shown above, this 2nd video in the series fits nicely AFTER “Download & Install Course Pre-requisites” & BEFORE “The State of Analytics in the AWS Cloud“.
I find that the reason some really awesome technologies aren’t as widely adopted as they should be (other than for very large enterprises), is because courses usually just say “how” to do something. If you understand “why” & “where” these technologies fit into your business, what business problems they solve, & then learn “how” to implement them, you’re more likely to take the time to advance your skills & use the very best technologies.
Thus, that’s my approach for this series, both in blog format & video format.
First, let me set the stage for this series with some “What if you could…” type questions, kind of like “In a perfect world” type questions.
(The first of 2 slides on this topic is shown above & the second slide on this topic is shown below)
What if you could…
- Know in real-time what your customers want, the prices they’re willing to pay, & the triggers that make them buy more in real-time?
- Have 1 unified view of your data no matter where in the world it’s stored?
- Query your global data without the need to move it to another location in order to perform any type of analysis you want?
- Automate any changes made to underlying data stores, keeping the 1 unified view in sync at all times?
What if you could…
- Have a single, centralized, secure, & durable platform combining storage, data governance, analytics, AI, & ML that allows ingestion of structured, semi-structured, & unstructured data, transforming these raw data sets in a “just-in-time” manner? Not ETL (extract, transform, & load), but ELT (extract, load, THEN transform): transforming on the fly only when you need the data?
- Drastically reduce the amount of time spent on mapping disparate data, which is the most time-consuming step in analytics?
- Turbo-charge your existing apps by adding AI into their workflows? And build new apps & systems using this best-practice design pattern?
- Future-proof your data, have multiple permutations of the same underlying data source without affecting the source, and use a plethora of analytics and AI/ML services on that data simultaneously?
What is stated on the last 2 slides would be transformative in how your business operates. It would lead to new opportunities & give you a tremendous competitive advantage: the ability to satisfy your customers & attract new ones, wouldn’t it?
Now, imagine having all these insights on steroids!!!
The content on the above slide is Gartner’s 2019 Top Strategic Technology Trends. This annual report is called “The Intelligent Digital Mesh”. Gartner defines a strategic technology trend as one with substantial disruptive potential that is beginning to break out of an emerging state into broader impact & use, or a rapidly growing trend with a high degree of volatility that will reach a tipping point over the next five years.
In other words, “be prepared” today; don’t wait for these technologies to be 100% or even 40% mainstream: then you’re too late!
The 3 categories are defined as follows:
- The Intelligent category provides insights on the very best technologies, which today are under the heading AI
- The Digital category provides insights on the technologies that are moving us into an immersive world
- The Mesh category provides insights on what technologies are cutting-edge that intertwine the digital & physical world
In the Intelligent category, we have the AI strategic trends of:
- Autonomous Things: Autonomous Things exist across 5 types: ROBOTICS, VEHICLES, DRONES, APPLIANCES, & AGENTS
- Augmented Analytics is the result of the vast amount of data that needs to be analyzed today. It’s easy to miss key insights among so many hypotheses. Augmented analytics automates algorithms to explore more hypotheses through data science & machine learning platforms. This trend has transformed how businesses generate analytical insights
- AI-driven Development highlights the tools, technologies & best practices for embedding AI into apps & using AI to create AI-powered tools
In the Digital category, we have the Digital strategic trends of:
- Digital Twins: A Digital Twin is a digital representation that mirrors a real-life object, process, or system. Digital twins improve enterprise decision making by providing information on maintenance & reliability, insight into how a product could perform more effectively, & data for developing new products with increased efficiency
- Empowered Edges: This is a topology where information processing, content collection, & delivery are placed closer to the sources of the information, with the idea that keeping traffic local will reduce latency. Currently, much of the focus of this technology is a result of the need for IoT systems to deliver disconnected or distributed capabilities into the embedded IoT world
- Immersive Experiences: Gartner predicts that through 2028, conversational platforms will change how users interact with the world. Technologies like Augmented Reality (AR), Mixed Reality (MR), & Virtual Reality (VR) change how users perceive the world. These technologies increase productivity, with the next generation of VR able to sense shapes & track a user’s position, while MR will enable people to view & interact with their world, & augmented reality will just blow your mind: it’s oh-so-cool! (SHAMELESS PLUG: The Augmented World Expo is the biggest conference & expo for people involved in Augmented Reality, Virtual Reality, & Wearable Technology (Wikipedia). This is a shameless plug because I know the founders & have attended this mind-blowing conference for years!)
In the Mesh category, we have the strategic trends of:
- Blockchain: Blockchain is a type of distributed ledger, an expanding chronologically ordered list of cryptographically signed, irrevocable transactional records shared by all participants in a network. Blockchain allows companies to trace a transaction & work with untrusted parties without the need for a centralized party such as banks. This greatly reduces business friction & has applications that began in finance, but have expanded to government, healthcare, manufacturing, supply chain & others. Blockchain could potentially lower costs, reduce transaction settlement times & improve cash flow
- Smart Spaces: Smart Spaces are evolving along 5 key dimensions: OPENNESS, CONNECTEDNESS, COORDINATION, INTELLIGENCE & SCOPE. Essentially, smart spaces are developing as individual technologies emerge from silos to work together to create a collaborative & interactive environment. The most extensive example of smart spaces is smart cities, where areas that combine business, residential & industrial communities are being designed using intelligent urban ecosystem frameworks, with all sectors linking to social & community collaboration
Spanning all 3 categories are:
- Digital Ethics & Privacy: This represents how consumers have a growing awareness of the value of their personal information, & they are increasingly concerned with how it’s being used by public & private entities. Enterprises that don’t pay attention are at risk of consumer backlash
- Quantum Computing: This is a type of nonclassical computing that’s based on (Now, for all of you who aren’t familiar with this, put your seatbelts on because this is absolutely phenomenal!!!) the quantum state of subatomic particles that represent information as elements denoted as quantum bits or “qubits.” Quantum computers are an exponentially scalable & highly parallel computing model. A way to imagine the difference between traditional & quantum computers is to imagine a giant library of books. While a classic computer would read every book in a library in a linear fashion, a quantum computer would read all the books simultaneously. Quantum computers are able to theoretically work on millions of computations at once. Real-world applications range from personalized medicine to optimization of pattern recognition.
This course covers most of these strategic IT trends for 2019. You can read more about each category by visiting the entire report at the URL on the bottom left of the slide: it’s not only a fascinating read, but also a reality check on what you should be focusing on today!
Today we’re experiencing what has been called The 4th Industrial Revolution. A broad definition of the term Industrial Revolution is “unprecedented technological & economic development that has changed the world”. The timeline on the above slide indicates when each Industrial Revolution occurred & what new inventions defined such a large-scale change. The emergence of the 4th Industrial Revolution is attributed primarily to technological advances built atop the technologies that were paramount in Industry 3.0.
As the pace of technological change quickens, we need to be sure that we & our employees are keeping up with the right skills to thrive in the Fourth Industrial Revolution.
The quote on the slide above should confirm that having 1 location for all your global data is essential with the vast amount of data today, in order to take advantage of all your data for analytics. The quote is from Eric Schmidt (LOVE his last name!), Former Executive Chairman of Google. It reads, “There were five exabytes of information created between the dawn of civilization through 2013, but that much information is now created every two days”. This quote was from around 2013, & boy have things accelerated beyond anyone’s imagination at the time!
The next video in the YouTube series is a 4-part sub-series entitled “Cloud & Data Metamorphosis“.
SECTION/VIDEO 3.1: “Cloud & Data Metamorphosis, Part 1“
This first video in “Cloud & Data Metamorphosis” will cover many topics, including:
- Big Data Evolution
- Cloud Services Evolution
- Amazon’s Purpose-built Databases
- Polyglot Persistence
- Traditional vs. Cloud Application Characteristics
- Operational vs. Analytical Database Characteristics
If you watched the “explainer video,” video #1 in this series, you’ll know that this YouTube series complements my Pluralsight course entitled “Serverless Analytics on AWS”.
All 4 video segments of this 3rd video in the series, “Cloud & Data Metamorphosis” ideally should be watched AFTER Module 2 “Download & Install Course Pre-requisites” & BEFORE Module 3 “The State of Analytics in the AWS Cloud”.
Below you’ll find the embedded YouTube Video 3.1:
SECTION/VIDEO 3.1: Cloud & Data Metamorphosis Video in Text & Screenshots
Now that you’ve learned what was taught in the first content video of this series, “AI, ML, and Advanced Analytics on AWS” (Video 2, since Video 1 is an “explainer video”), in this third video, entitled “Cloud and Data Metamorphosis,” I’ll explain the technological changes that have evolved to the point of the 4th Industrial Revolution, so you understand the significance of what you’ll learn in this video series!
Understanding different technology evolutions is important because, through a linear timeline, you can see how technologies advanced alongside hardware & software advances, or how a technology solution had to evolve to work with advanced & emerging technologies. First, let’s look at Big Data Evolution. A big data challenge is HOW to build big data applications. A few years ago the answer was Spark, no matter what the question was; today, the answer is AI, no matter what the question is.
Initially, batch processing was the only way to analyze data. Batch processing has the following characteristics: the data’s scope is all or most of the data in the dataset; data arrives in large batches; performance has latencies of minutes to hours; & analyses are complex. Think of OLAP, online analytical processing.
Next in big data evolution came stream processing. With the advent of broadband internet, cloud computing, & the IoT, data’s increased volume & velocity necessitated processing a continuous transfer of data rolling in over a small time period at a steady, high-speed rate. Detecting insights in data streams is near-instantaneous, which enables companies to make data-driven decisions in real-time. Queries are usually done on just the most recent data, in the form of individual records or micro-batches of a few records; performance has latencies on the order of seconds or milliseconds, & analyses are simple response functions, aggregates, & rolling metrics.
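A rolling metric over a stream of records can be sketched in a few lines of plain Python; the window size & sensor readings below are made-up values for illustration:

```python
from collections import deque


class RollingMean:
    """Rolling aggregate over the most recent N records: the kind of
    simple response function typical of stream processing."""

    def __init__(self, window):
        # deque with maxlen automatically evicts the oldest record.
        self.window = deque(maxlen=window)

    def add(self, value):
        """Ingest one record & return the mean over the current window."""
        self.window.append(value)
        return sum(self.window) / len(self.window)


# Feed a stream of (hypothetical) sensor readings one record at a time.
stream = [10.0, 12.0, 11.0, 30.0, 29.0]
rm = RollingMean(window=3)
latest = [rm.add(v) for v in stream]
```

A real pipeline would pull records from a service like Amazon Kinesis instead of a list, but the per-record update pattern is the same.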
Today, the prevalence of AI processing gives rise to cognitive computing. Cognitive computing describes technology platforms that simulate human thought processes in a computerized model. Using self-learning algorithms that use data mining, pattern recognition, & Natural Language Understanding (NLU), the computer can mimic the way a human brain works.
The good news is that any batch or stream processing you already have in place can be “AI’d”.
Now let’s look at Cloud Services Evolution. Initially, running Virtual Machines in the cloud was the only option available. Shared computing resources of hardware & software provided a cloud computing environment that lowered the quantity of assets needed & the associated maintenance costs.
Next in cloud services evolution comes Managed Services. This term refers to outsourcing daily IT management for cloud-based services and technical support to automate and enhance your business operations.
Now we have Serverless computing. It’s the native architecture of the cloud that enables you to shift more of your operational responsibilities to AWS, increasing your agility and time for innovation. Serverless allows you to build and run applications and services without thinking about servers. It eliminates infrastructure management tasks such as server or cluster provisioning, patching, operating system maintenance, and capacity provisioning. Application scaling is automated, & it provides built-in availability & fault tolerance. You only pay for consistent throughput or execution vs. by server unit.
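To illustrate what “without thinking about servers” looks like in practice, here’s a minimal AWS Lambda handler sketch. The API Gateway-style event shape is an assumption for illustration; a real handler would match your trigger’s event format:

```python
import json


def handler(event, context):
    """Minimal Lambda handler: the platform handles provisioning,
    scaling, patching, & availability; you supply only this function.

    Assumes an API Gateway-style event with a JSON string body.
    """
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Locally, you can invoke the handler directly with a sample event.
resp = handler({"body": json.dumps({"name": "Athena"})}, None)
```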
When you develop cloud analytical or AI apps, choosing the database that’s right for your data & the analytics you’ll perform on that data is the first & foremost decision to make. This section builds upon cloud & big data evolution, adding database & data storage evolution.
AWS has a number of purpose-built databases. They’re all scalable, fully-managed or serverless, & are enterprise class.
Relational databases support ACID transactions. Data has a pre-defined schema & relationships between the tables. Examples of where relational databases are still a valid choice include traditional applications, ERP, CRM, & e-commerce. With AWS, the offerings include MySQL, PostgreSQL, MariaDB, Oracle, & SQL Server. For analytics, the relational database of choice is Amazon Aurora because it’s cloud native.
Key-value databases are optimized to store & retrieve key-value pairs in large volumes & in milliseconds, without the performance overhead & scale-limitations of relational databases. Examples of where key-value databases are used include Internet-scale applications, real-time bidding, shopping carts, & customer preferences. With AWS the key-value database is the almighty DynamoDB!
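As a small illustration of the key-value model, here’s a hedged sketch of writing an item to DynamoDB with boto3’s low-level client, which represents values in DynamoDB’s AttributeValue format. The table & attribute names are hypothetical:

```python
def to_attr(value):
    """Convert a Python value to DynamoDB's low-level AttributeValue
    format used by the base client (put_item / get_item)."""
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # DynamoDB sends numbers as strings
    raise TypeError(f"unsupported type: {type(value)!r}")


def to_item(record):
    """Build a full DynamoDB item from a plain dict."""
    return {k: to_attr(v) for k, v in record.items()}


# Real use (requires AWS credentials; table & attributes are placeholders):
#   import boto3
#   ddb = boto3.client("dynamodb")
#   ddb.put_item(
#       TableName="CustomerPreferences",
#       Item=to_item({"customer_id": "c-42", "theme": "dark", "visits": 7}))
```

(boto3’s higher-level `Table` resource accepts plain Python types directly; the low-level format above is what travels over the wire.)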
Document databases are designed to store semi-structured data as documents & are intuitive for developers because the data is typically represented as a readable document. Examples of where document databases are used include Content management, personalization, & mobile applications. With AWS, this offering is Amazon DocumentDB with MongoDB-compatibility.
Graph databases are used for applications that need to enable millions of users to query & navigate relationships between highly-connected, graph datasets with millisecond latency. Examples of where graph databases are used include Fraud detection, social networking, & recommendation engines. With AWS, this offering is Amazon Neptune.
In-memory databases are used by applications that require real-time access to data. By storing data directly in memory, these databases provide microsecond latency where millisecond latency isn’t enough. Examples of where in-memory databases are used include Caching, gaming leaderboards, & real-time analytics. With AWS, this is either Amazon ElastiCache for Redis or Amazon ElastiCache for Memcached.
Time Series databases are used to efficiently collect, synthesize, & derive insights from enormous amounts of data that changes over time (aka time-series data). Examples of where time-series databases are used include IoT applications, DevOps, & Industrial telemetry. With AWS, this offering is Amazon Timestream.
Ledger databases are used when you need a centralized, trusted authority to maintain a scalable, complete, & cryptographically verifiable record of transactions. Ledger databases were first used in the financial industry, but have since expanded to Manufacturing, Supply Chain, Healthcare, & Government. Examples of where ledger databases are used include systems of record, supply chain, registrations, & banking transactions. With AWS, this offering is Amazon Quantum Ledger Database (QLDB).
It’s only natural that with the variety of data types today, the old one-size-fits-all approach to choosing data storage for your application just doesn’t work anymore. The plethora of purpose-built AWS databases enables POLYGLOT PERSISTENCE in your applications: using multiple types of databases together, each matching the characteristics of the part of the application it serves, vs. forcing the data into a database that will perform poorly because it wasn’t built for the many types of data within apps today. Services are loosely coupled & communicate through queues.
In the sample application diagram on the slide above, 5 different databases are used in 1 application. The type of database you choose in different parts of one application should be based on application characteristics.
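The idea of loosely coupled services routing data to purpose-built stores through a queue can be sketched with the standard library; the in-memory dicts below stand in for, say, DynamoDB & ElastiCache, & the event shapes are made up for illustration:

```python
import queue

# Stand-ins for purpose-built stores; in a real app these would be
# separate services (e.g. a key-value database & an in-memory cache).
kv_store, cache = {}, {}

events = queue.Queue()  # services stay loosely coupled through a queue


def handle(event):
    """Route each event to the store built for its access pattern."""
    if event["kind"] == "preference":      # durable key-value data
        kv_store[event["key"]] = event["value"]
    elif event["kind"] == "leaderboard":   # hot, low-latency data
        cache[event["key"]] = event["value"]


for e in [
    {"kind": "preference", "key": "user:1:theme", "value": "dark"},
    {"kind": "leaderboard", "key": "game:1:top", "value": 9001},
]:
    events.put(e)

while not events.empty():
    handle(events.get())
```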
The table on the slide above explains the difference between traditional vs cloud-based applications. This is important to take note of because application characteristics help define what data storage you’ll use. Rather than the old way of fitting data into relational databases no matter what the data structure was, today you have the flexibility to choose the databases to fit the structure & function that works most efficiently with your various application data needs.
Before I cover how to optimize raw data in S3, I want to again emphasize how much better your apps will perform by using the right data store for all the data in your app. You need to consider whether your database will be used for operational workloads or for analytical workloads. The table on the left side of the slide above lists the major characteristics of operational vs. analytical databases.
On the right side of the above slide, the table at the top lists the general characteristics of operational workloads, along with the primary dimensions to consider. The table on the lower half of the slide above lists the general characteristics of analytical workloads, along with the primary dimensions to consider for analytical workloads.
The next video is “Cloud & Data Metamorphosis, Part 2“. The primary topics that’ll be covered include:
- Amazing Data Factoids that impact analytical systems
- Data Architecture Evolution
- Modern Data Analytics Pipelines
SECTION/VIDEO 3.2: “Cloud & Data Metamorphosis, Part 2“
This video is a continuation of the video entitled “Cloud & Data Metamorphosis, Part 1”. “Cloud & Data Metamorphosis” is the 3rd video, itself a multi-part series, in the larger video series “AI, ML, & Advanced Analytics on AWS”.
Below you’ll find the embedded YouTube Video 3.2:
SECTION/VIDEO 3.2: “Cloud & Data Metamorphosis, Part 2” in Text & Screenshots
Let’s quickly overview the topics in “Cloud & Data Metamorphosis, Part 1” & quickly introduce you to the topics that’ll be covered in this “Cloud & Data Metamorphosis, Part 2“.
In Part 1 of Cloud & Data Metamorphosis, I covered the following:
- Big Data Evolution
- Cloud Evolution
- Database Evolution & the Many Types of AWS’ Purpose-built Databases
- Polyglot Persistence
- The Differences Between Traditional & Cloud Application Characteristics
- How to Determine if Your Database is Operational or Analytical
In Part 2, I’ll cover:
- Amazing Data Factoids that impact analytical systems
- Analytical Platform Evolution
- Dark Data
- The Problems with Data Silos
- Data Architecture Evolution
- Modern, Distributed Data Pipelines
In the next few slides, I’ll overview Amazing Data Factoids that impact analytical systems.
There are more ways to analyze data than ever before:
- 11 years ago Hadoop was the data king of analysis
- 8 years ago Elasticsearch was the data king of analysis
- 5 years ago Presto was the data king of analysis
- < 4 years ago Spark was the data king of analysis
- TODAY, AI IS THE DATA KING OF ANALYSIS
I’m sure you know that there is so much more data than most people think. Today, however, we also have more ways to analyze data than ever before. With the democratization of data, there are more people working with data than ever before. Job titles that never used data before must use it now to analyze performance in many ways, for reports, and more.
The old adage “garbage in, garbage out” carries more weight today than ever before. Organizations gather huge volumes of data which, they believe, will help improve their products and services. For example, a company may collect data on how users use its products, internal statistics about software development processes, and website visits. However, a large portion of the collected data is never even analyzed. According to the International Data Corporation (IDC), a provider of market intelligence on technology, 90% of unstructured data is never analyzed. Such data is known as dark data.
Dark data is a subset of big data, but it constitutes the biggest portion of the total volume of big data collected by organizations in a year. Dark data consists of the information assets organizations collect, process, & store during regular business activities but generally fail to use for other purposes (i.e., analytics, business relationships, & direct monetization).
The graph shown on the above slide illustrates that as the amount of Big Data grows, the amount of dark data increases. But that does not lessen its importance in the context of business value. There are two ways to view the importance of dark data. One view is that unanalyzed data contains undiscovered, important insights and represents an opportunity lost, whether today or in the future. The other view is that unanalyzed data, if not handled well, can result in a lot of problems such as legal and security problems, like with the recent GDPR laws and more.
Just storing data isn’t enough. DATA NEEDS TO BE DISCOVERABLE, & I’ll be showing you how to do that in this video series. As varied data types & new sources are added to an analytical platform, it’s important that the platform is able to keep up! This means the system has to be flexible enough to adapt to changing data types and volumes. In the remainder of this video & throughout the next videos in Video series 3, I’ll explain the technologies that not only make data discoverable, make it available to numerous analytical & AI services, & clean and enrich your data, BUT ALSO FUTURE-PROOF YOUR DATA!!!
The quote on the above slide is from Atul Butte, who earned an MD from Brown University & a Ph.D. in Health Sciences & Technology from MIT & Harvard Medical School, & is the Founder of Butte Lab, solving problems in genomic medicine. It reads “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world”. Oh, how so very true! (I love data!)
Analytical systems are only as good as the data used to power them. Believe it or not, most people have “dirty data”, meaning not only incorrect data, but missing data, duplicate data, & incomplete data augmentation. A robust analytics platform is critical for an organization’s success, & dirty data can derail analytical projects completely. In addition, most organizations struggle to have complete data for analysis because of the difficult, complex, laborious, & time-consuming task of bringing together the data they need for complete & accurate analysis from many disparate data silos scattered across the globe in differing formats & proprietary systems. These organizations have been forced to use different systems at different stages of growth because there really wasn’t a solution, much less an elegant solution, for having all the scattered data silos in 1 place (until now, which will be covered in this series). The data silo effect is amplified by vendor lock-in and the immense work required to transform legacy data systems to work with current data systems.
Just like there’s big data & cloud evolution, there’s also data architecture evolution. Let’s dive in!
Finding Value in data is a total journey, & is often painstakingly undertaken with data silos using a plethora of tools. Data silos are a chronic, deep-rooted problem companies face on a daily basis.
Some of the problems data silos create are the following:
- The Contents of Data Silos will Likely Differ Slightly
- It’s Difficult to Determine Which One is the Most Accurate or Up to Date
- Data Silos Cause Wasted Resources and Inhibited Productivity
- The Different Tools and Interfaces Used Across Job Titles Aren’t Used in a Consistent Manner, Producing Unpredictable & Sometimes Disastrous Results
- Many Situations Occur Where Separate Teams Could Use Another Team’s Data to Solve Problems More Efficiently if They Had Access to It
- It’s Difficult to Move Data Across Silos
- With Multiple People and Teams Working on the Same Data, You’re Forced to Keep Multiple Copies of Data
- Getting the Right Data Out of Data Silos is Extremely Complex & Often Abandoned, Leaving Analysis to Run on a Subset of the Data, Resulting in Incorrect & Incomplete Analytical Insights
- Having Consistent Data Transformation & Data Governance Throughout All Silos Borders on Impossible, Depending on What Type of Storage Each Silo Has
- Users Struggle to Find the Data They Need, Because Finding Data Stored in a Siloed Architecture is Akin to Looking for a Needle in a Haystack
These factors & more slow innovation & evolution drastically. It’s expensive, not only to move the data, but to pay for many different silo services, which in all probability adds up to paying for storage you don’t need today. Many companies don’t have all these silos mapped out, & if they do, paying employees to keep up legacy systems wastes their time & talents that could go toward more important tasks like innovating, experimenting, and building great products.
There has to be a better way to deal with siloed data. Well, there is, & I can’t wait to explain to you what that is, starting in the remainder of this video & extending to the next video!
The diagram above shows some of the challenges data teams face today:
- With the exponential growth of data from many & varied data sources, the older systems weren’t designed to handle the volume, variety, velocity & the veracity – or accuracy – of data
- Data is now ubiquitous. Almost every employee has to work with data, but oftentimes in different ways on the same data, increasing the complexity of the data while increasing the chance of data errors
- And, with the explosion of the many ways to access the data, most people have their “tool of choice” & moreover certain job roles necessitated working with particular access mechanisms
The reason we create applications is to deliver business value, which we do by creating business logic & operating it so it can provide a service to others. The time between creating business logic and providing service to users with that logic is the time to value. The cost of providing that value is the cost of creation plus the cost of delivery. As technology has progressed over the last decade, we’ve seen an evolution from monolithic applications to microservices and are now seeing the rise of serverless event driven functions, led by AWS Lambda.
What factors have driven this evolution? Low latency messaging enabled the move from monoliths to microservices, & low latency provisioning enabled the move to Lambda.
Monolithic means composed all in one piece. A monolithic application is a single-tiered software application in which different components are combined into a single program on a single platform.
The diagram on the above slide is a sample e-commerce application. Despite having different components, modules, & services, the app is built & deployed as one Application for all platforms (i.e. desktop, mobile and tablet) using RDBMS as the data source.
The drawbacks of this architecture include:
- Apps can get so large & complex that it’s challenging to make changes fast & properly
- The size of the app can slow down startup time
- With each update no matter how small it is, you have to redeploy the entire app
- It’s very challenging to scale
- A bug in any module can bring down the entire app because everything for the app is connected together
- It’s very difficult to adopt new & advanced technologies, & if you try to do that, you have to pretty much rebuild the app, which is costly, takes a lot of time, & a ton of effort
The conceptual image on the above slide provides a representation of breaking a monolithic application into microservices. In monolithic architectures, all processes are tightly coupled and run as a single service. With a microservices architecture, an application is built as independent components that run each application process as a service. These services communicate via a well-defined interface using lightweight APIs. Services are built for business capabilities and each service performs a single function. Because they are independently run, each service can be updated, deployed, and scaled to meet demand for specific functions of an application.
Characteristics of Microservices include the following:
- Microservices are autonomous. Each component service can be developed, deployed, operated, and scaled without affecting the functioning of other services
- Microservices are specialized. Each service is designed for a set of capabilities and focuses on solving a specific problem
- Microservices have great agility. Microservices foster an organization of small, independent teams that take ownership of their services. Teams act within a small and well understood context and are empowered to work more independently and more quickly. This shortens development cycle times. You benefit significantly from the aggregate throughput of the organization
- Microservices have flexible scaling. Microservices allow each service to be independently scaled to meet demand for the application feature it supports. This enables teams to “right-size” infrastructure needs, accurately measure the cost of a feature, and maintain availability if a service experiences a spike in demand
- It’s easy to deploy Microservices. Microservices enable continuous integration and continuous delivery, making it easy to try out new ideas and to roll back if something doesn’t work.
- Microservices allow technical freedom. Microservices architectures don’t follow a “one size fits all” approach. Teams have the freedom to choose the best tool to solve their specific problems
- With microservices, the code is reusable. A service written for a certain function can be used as a building block for another feature. This allows an application to bootstrap off itself, as developers can create new capabilities without writing code from scratch
- Microservices are resilient. Service independence increases an application’s resistance to failure. Applications handle total service failure by degrading functionality and not crashing the entire application
On the above slide is the e-commerce diagram (redrawn from the previous slide) to represent a modular, Microservices architecture consisting of several components/modules. Each module supports a specific business goal and uses a simple, well-defined interface to communicate with other sets of services. Instead of sharing a single database as in Monolithic application, each microservice has its own database. Having a database per service is essential if you want to benefit from microservices, because it ensures loose coupling. Moreover, a service can use a type of database that is best suited to its needs.
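To make the loose coupling concrete, here’s a minimal Python sketch (with hypothetical service & item names) of 2 tiny “services”, each owning its own private data store & talking to the other only through a well-defined interface, just like the database-per-service pattern described above:

```python
# A minimal sketch of two independent "microservices" (hypothetical names).
# Each owns its own data store; they interact only via public methods,
# standing in for the lightweight APIs real services would expose.

class OrderService:
    """Owns its own store; other services never touch it directly."""
    def __init__(self):
        self._orders = {}          # stands in for this service's private database

    def place_order(self, order_id, item):
        self._orders[order_id] = {"item": item, "status": "PLACED"}
        return self._orders[order_id]

    def get_status(self, order_id):
        return self._orders[order_id]["status"]


class InventoryService:
    """A separate service with a separate store, called via its public API."""
    def __init__(self, stock):
        self._stock = dict(stock)  # this service's private database

    def reserve(self, item):
        if self._stock.get(item, 0) > 0:
            self._stock[item] -= 1
            return True
        return False


# Each service can be developed, deployed, & scaled independently;
# they cooperate only through these well-defined calls.
inventory = InventoryService({"book": 2})
orders = OrderService()
if inventory.reserve("book"):
    orders.place_order("o-1", "book")
```

In a real system each class would be a separately deployed service with its own database, & the method calls would be API requests, but the key idea is the same: no service reaches into another service’s data store.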
The benefits of a microservice architecture includes:
- Continuous delivery & deployment of large, complex apps
- Services are smaller & faster to test
- Services can be deployed independently
- It enables you to organize the development effort around multiple teams. Each team is responsible for one or more single service
- Each microservice is relatively small
- They’re easier for developers to understand
- The IDE is faster, making developers more productive
- They have faster startup time
- They have improved fault isolation; if one service has a bug, it doesn’t affect the other services
- They eliminate any long-term commitment to a technology stack. When developing a new service you can pick a new technology stack. Similarly, when making major changes to an existing service you can rewrite it using a new technology stack
The architectural diagram on the slide above is an example of modern data architecture. You’ll notice an abundance of data sources in the far-left column surrounded by the blue rectangle. These represent a good sample of the wide variety of data sources.
Moving to the next column surrounded by a red rectangle, the services shown represent a good sample of the many different ways to ingest data into AWS. Which Service you use depends on several factors, such as if the data is streaming in or not, how the data is migrated in, if you have a hybrid cloud environment, and more.
The next section is the storage section. Notice there are 2 separate storage sections; 1 on the bottom surrounded by a blue rectangle, which represents data that’s streaming in, shown by the blue arrow. The top section is surrounded by the green rectangle. That represents slow-moving data, or batch data.
In the streaming section, Amazon Kinesis Data Streams is streaming in data, shown in both the ingestion section & the streaming storage section surrounded by red rectangles & pointed to by red arrows. The Kinesis Event Capture in the streaming storage section captures the streaming data as it flows in.
In this particular architecture, the Kinesis Event Capture passes the ingested data stream to Amazon EMR to perform some initial stream processing shown surrounded by a blue rectangle & pointed to by another blue arrow. When the initial processing is complete, the streamed-in data is put into an Amazon S3 raw zone bucket, surrounded by an orange rectangle & pointed to by an orange arrow. If the data wasn’t streaming data & was slow-moving data, then rather than Amazon Kinesis Data Streams, the data would be put into the Amazon S3 raw zone bucket by Amazon Kinesis Data Firehose, although this diagram doesn’t show that.
Whenever any further processing needs to be performed, the data in the S3 raw zone is acted upon by whatever the processing service is, shown in this diagram as Amazon EMR’s MLlib performing ETL, surrounded by the green rectangle & pointed to by a green arrow. MLlib is one of the Apache Spark machine learning libraries built into Amazon EMR. The results of EMR’s ML predictive analytics are placed into an Amazon S3 processed zone bucket, surrounded by an orange rectangle & pointed to by an orange arrow. The Amazon S3 processed zone bucket is where the data is staged in an S3 Data Lake while it awaits any downstream requests to use that curated data.
Surrounded by a pink rectangle & pointed to by a pink arrow is a sample of AWS data stores that serve the curated data to consumers, illustrating which types of AWS services use the curated data. The yellow rectangle & arrow show the job roles of the people who typically work with the AWS services in the pink rectangle.
On the top right, you’ll see some AWS Services that are commonly used in a modern cloud data architecture, surrounded by a purple rectangle & pointed to by a purple arrow.
Notice how all ingested data, whether it comes streaming in or moving slowly, eventually ends up in an Amazon S3 Raw Zone bucket &, after some initial processing to prepare the raw data for downstream consumption, ends up in an Amazon S3 Processed Zone bucket (on this diagram it’s called “Staged Data”). I’ll elaborate on those 2 S3 storage locations in the next video on Amazon S3 Data Lakes.
Let’s now look at how modern data analytics pipelines are built today.
Modern analytical data pipelines decouple Storage from Compute & Processing. In this manner, when a newer technology is introduced, it’s much easier to swap out a storage or compute service for another in the same category. Modern data pipelines have many iterations of processing, analyzing, & storage. Pipelines can go off in multiple directions depending on what downstream applications will be doing with the processed data.
What you’re looking at on the above slide is a simplified architectural diagram to emphasize that both batch & real-time processing can occur at the same time (& even have dozens & dozens of threads being simultaneously processed within each). The point is to see that multiple strings of analytics can be performed simultaneously on the same dataset, shown here as a batch workflow & a real-time streaming workflow, when you use 3 key AWS Services – Amazon S3, AWS Glue, & Amazon Athena – that I’ll be introducing shortly.
The really cool thing about this is that, using these technologies I’ll soon be describing, there’s no limit on how many concurrent users can work on the same underlying datasets without affecting the underlying data source, no matter where in the world it’s stored!
In the next video, “Cloud & Data Metamorphosis, Video 3.3” I’ll cover the following:
- Cloud Services Evolution in Regard to Serverless Architectures
- Serverless Architectures
- AWS Lambda
- EVERYTHING You Need to Know About Containers
SECTION/VIDEO 3.3: “Cloud & Data Metamorphosis, Video 3, Part 3“
Part 3 of Video 3 continues from where Video 3, Part 2 left off.
“Cloud & Data Metamorphosis” is a multi-video series, in the larger video series “AI, ML, & Advanced Analytics on AWS” that augments my Pluralsight course shown on this slide called “Serverless Analytics on AWS”, surrounded by a blue rectangle & pointed to by a blue arrow.
Below you’ll find the embedded YouTube video 3.3:
SECTION/VIDEO 3.3: “Cloud & Data Metamorphosis, Video 3, Part 3” in Text & Screenshots
In Part 2 of Cloud & Data Metamorphosis, I covered the following:
- Amazing Data Factoids that impact analytical systems
- Analytical Platform Evolution
- Dark Data
- The Problems with Data Silos
- Data Architecture Evolution
- Modern, Distributed Data Pipelines
In Part 3, I’ll cover:
- Serverless Architectures
- AWS Lambda
- AWS’ Serverless Application Model, or SAM
- All About AWS’ Containers
- AWS Fargate
- Amazon Elastic Container Registry (ECR)
Let’s now look at The Evolution of Cloud Services in regard to Serverless Architectures.
It wasn’t that long ago that all companies had to buy servers, guessing how many they’d need for peak usage. What normally happened was they erred on the side of being prepared for the worst & ended up overprovisioning the number of servers, costing a ton of money up front.
As the image on the right shows, with AWS Serverless, gone are the days of “racking & stacking”, called undifferentiated heavy lifting in AWS terminology. Serverless has a “pay as you go” pricing model, dramatically reducing costs & thus a perfect way to experiment & innovate cheaply.
It has the proper data pipeline architecture of decoupling storage from compute & analysis. Load balancing, autoscaling, failure recovery, security isolation, OS management, & utilization management are handled for you. The benefits of serverless include:
- Greater Agility
- Less Overhead
- Better Focus
- Increased Scalability
- More Flexibility
- Faster Time-to-Market
Serverless computing came into being back in 2014 at the AWS re:Invent conference, with Amazon Web Services’ announcement of AWS Lambda. AWS Lambda is an event-driven, serverless computing platform provided by AWS: a computing service that runs code in response to events & automatically manages the computing resources required by that code. Serverless computing is an extension of microservices, & the serverless architecture is divided into specific core components. To compare the two: microservices group similar functionalities into one service, while serverless computing defines functionalities as finer-grained components.
The code you run on AWS Lambda is called a “Lambda function.” After you create your Lambda function, it’s always ready to run as soon as it is triggered. Lambda functions are “stateless,” with no affinity to the underlying infrastructure, so that Lambda can rapidly launch as many copies of the function as needed to scale to the rate of incoming events. Billing is metered in increments of 100 milliseconds, making it cost-effective and easy to scale automatically from a few requests per day to thousands per second.
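As a quick illustration, here’s a minimal sketch of what a stateless Lambda function looks like in Python (the event fields here are hypothetical; real event shapes depend on the trigger):

```python
# A minimal sketch of a stateless Lambda handler. Lambda calls this function
# once per triggering event; no state is kept between invocations, which is
# what lets Lambda launch as many copies as needed to match the event rate.

import json

def lambda_handler(event, context):
    """Entry point Lambda invokes for each incoming event."""
    name = event.get("name", "world")   # read input from the triggering event
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Invoking it with `{"name": "Ada"}` returns a response whose body contains `"Hello, Ada!"`; because the function holds no state of its own, a thousand concurrent copies behave exactly like one.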
Since AWS Lambda’s release, there’s been an astonishing growth of over 300 percent year over year. Serverless analytics can be done via Amazon Athena making it easy to analyze big data directly in S3 using standard SQL. Developers create custom code, then the code is executed as autonomous and isolated functions that run in stateless compute services called CONTAINERS.
Let’s back up a bit here & discuss common lambda use cases.
Common AWS Lambda Use Cases include:
- Static websites, complex Web Applications
- Backend Systems for apps, services, mobile, & IoT
- Data processing for either Batch or Real-time computing using AWS Lambda & Amazon Elastic MapReduce (EMR)
- Powering chatbot logic
- Powering Amazon Alexa voice-enabled apps & for implementing the Alexa Skills Kit
- IT Automation for policy engines, extending AWS services, & infrastructure management
The architectural diagram on the slide above shows 1 way to build a simple website using all SERVERLESS AWS Services. Each service is fully managed and doesn’t require you to provision or manage servers. The only thing you need to do to build this is to configure them together and upload your application code to AWS Lambda.
The workflow represented in the diagram above goes like this:
- After uploading your assets to the S3 bucket, you need to ensure that your bucket has public access settings, which you configure in the bucket’s Permissions tab. You then enable static website hosting from the bucket’s Properties tab
This will make your objects available at the AWS Region-specific website endpoint of the bucket. Your end users will access your site using the public website URL exposed by Amazon S3. You don’t need to run any web servers or use other services in order to make your site available.
- When users visit your website they will first register a new user account. This is done by Amazon Cognito, highlighted by a #2 in a blue circle. After users submit their registration, Cognito will send a confirmation email with a verification code to the email address of the new visitor to your site. This user will return to your site and enter their email address and the verification code they received. After users have a confirmed account, they’ll be able to sign in.
- Next, you create a backend process for handling requests for your app using AWS Lambda & DynamoDB, highlighted by a #3 in a blue circle. The Lambda function runs its code in response to events like HTTP events. Each time a user makes a request to the static website, the function records the request in a DynamoDB table then responds to the front-end app with details about the data being dispatched. The Lambda function is invoked from the browser using Amazon API Gateway, highlighted by a #4 in a blue circle, which handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.
- The Amazon API Gateway acts as a “front door” for applications to access data, business logic, or functionality from your backend services, such as workloads running on Amazon Elastic Compute Cloud (Amazon EC2), code running on AWS Lambda, & any web application, or real-time communication applications. The API Gateway creates a RESTful API that exposes an HTTP endpoint. The API Gateway uses the JWT tokens returned by Cognito User Pools to authenticate API calls. You then connect the Lambda function to that API in order to create a fully functional backend for your web application.
- Now, whenever a user makes a dynamic API call, every AWS Service is configured properly & will run without you having to provision, scale, or manage any servers.
You can build them for nearly any type of application or backend service, and everything required to run and scale your application with high availability is handled for you. Pretty cool, eh?
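To make the backend piece concrete, here’s a minimal Python sketch of a handler like the one described above. The table is injected so the sketch runs without AWS; in a real function you’d pass in something like `boto3.resource("dynamodb").Table("Requests")` (the table name & event fields here are hypothetical):

```python
# A sketch of the backend Lambda described above: each request is recorded
# in a table, then details are returned to the front-end app. The table is
# passed in so the sketch runs locally without AWS credentials.

import json

def make_handler(table):
    def handler(event, context):
        item = {"id": event["requestId"], "path": event.get("path", "/")}
        table.put_item(Item=item)            # record the request
        return {"statusCode": 200, "body": json.dumps(item)}
    return handler

# A stand-in with the same put_item interface as a DynamoDB table,
# purely for local illustration:
class FakeTable:
    def __init__(self):
        self.items = []
    def put_item(self, Item):
        self.items.append(Item)

table = FakeTable()
handler = make_handler(table)
response = handler({"requestId": "r-1", "path": "/unicorns"}, None)
```

In the real architecture, API Gateway would build the `event` from the incoming HTTP request & hand it to this handler for you.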
AWS has an open-source framework for building serverless applications called the AWS Serverless Application Model (SAM). It provides shorthand syntax to express functions, APIs, databases, and event source mappings. With just a few lines per resource, you can define the application you want and model it using YAML. During deployment, SAM transforms and expands the SAM syntax into AWS CloudFormation syntax, enabling you to build serverless applications faster. If you watched any of the demos from the Pluralsight course (& I hope you did!), you’ll be familiar with how cool CloudFormation is!
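To give you a feel for that shorthand syntax, here’s a minimal sketch of a SAM template (all logical names, paths, & runtime choices here are hypothetical) defining 1 Lambda function behind an HTTP endpoint:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31      # tells CloudFormation to expand SAM syntax
Description: Minimal sketch - one function behind an HTTP endpoint.

Resources:
  HelloFunction:                           # logical name (hypothetical)
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler          # file.function inside the code bundle
      Runtime: python3.9
      CodeUri: ./src                       # path to the function code (hypothetical)
      Events:
        HelloApi:
          Type: Api                        # SAM expands this into API Gateway resources
          Properties:
            Path: /hello
            Method: get
```

During deployment, SAM expands the `AWS::Serverless::Function` resource & its `Api` event into the corresponding CloudFormation resources (the function itself, an API Gateway API, IAM roles, & permissions), which is exactly the transformation described above.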
Delivering a production serverless application that can run at scale demands a platform with a broad set of capabilities.
AWS supports enterprise-grade serverless applications in the following ways:
- The Cloud Logic Layer: Power your business logic with AWS Lambda, which can act as the control plane and logic layer for all your interconnected infrastructure resources and web APIs. Define, orchestrate, and run production-grade containerized applications and microservices without needing to manage any infrastructure using AWS Fargate.
- Responsive Data Sources: Choose from a broad set of data sources and providers that you can use to process data or trigger events in real-time. AWS Lambda integrates with other AWS services to invoke functions. A small sampling of the other services includes Amazon Kinesis, Amazon DynamoDB, Amazon Cognito, various AI services, queues & messaging, and DevOps code services
- Integrations Library: The AWS Serverless Application Repository is a managed repository for serverless applications. It enables teams, organizations, and individual developers to store and share reusable applications, and easily assemble and deploy serverless architectures in powerful new ways. Using the Serverless Application Repository, you don’t need to clone, build, package, or publish source code to AWS before deploying it. Instead, you can use pre-built applications from the Serverless Application Repository in your serverless architectures, helping you and your teams reduce duplicated work, ensure organizational best practices, and get to market faster. Samples of the types of apps you’ll find in the App Repository include use cases for web & mobile backends, chatbots, IoT, Alexa Skills, data processing, stream processing, and more. You can also find integrations with popular third-party services (e.g., Slack, Algorithmia, Twilio, Loggly, Splunk, Sumo Logic, Box, etc)
- Developer Ecosystem: AWS provides tools and services that aid developers in the serverless application development process. AWS and its partner ecosystem offer tools for continuous integration and delivery, testing, deployments, monitoring and diagnostics, SDKs, frameworks, and integrated development environment (IDE) plugins
- Application Modeling Framework: The AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications. It provides shorthand syntax to express functions, APIs, databases, and event source mappings. With just a few lines of configuration, you can define the application you want and model it
- Orchestration & State Management: You coordinate and manage the state of each distributed component or microservice of your serverless application using AWS Step Functions. Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly
- Global Scale & Reach: Take your application and services global in minutes using our global reach. AWS Lambda is available in multiple AWS regions and in all AWS edge locations via Lambda@Edge. You can also run Lambda functions on local, connected devices with AWS Greengrass
- Reliability & Performance: AWS provides highly available, scalable, low-cost services that deliver performance for enterprise scale. AWS Lambda reliably executes your business logic with built-in features such as dead letter queues and automatic retries. See our customer stories to learn how companies are using AWS to run their applications
- Security & Access Control: Enforce compliance and secure your entire IT environment with logging, change tracking, access controls, and encryption. Securely control access to your AWS resources with AWS Identity and Access Management (IAM). Manage and authenticate end users of your serverless applications with Amazon Cognito. Use Amazon Virtual Private Cloud (VPC) to create private virtual networks which only you can access
With the AWS Serverless Platform, big data workflows can focus on the analytics & not the infrastructure or undifferentiated heavy lifting (racking & stacking), & you only pay for what you use.
I’ll show you 3 simplified use cases that use serverless architectures.
The first sample is a real-time streaming data pipeline. Explained briefly, in this architectural diagram:
- Data is published to an Amazon Kinesis Data Stream
- The AWS Lambda function is mapped to the data stream & polls the data stream for records at a base rate of once per second
- When new records are in the stream, it invokes the Lambda function synchronously with an event that contains stream records. Lambda reads records in batches and invokes your function to process records from the batch. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more
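Here’s a minimal Python sketch of such a Lambda function. Kinesis delivers each record’s payload base64-encoded inside the event, so the handler decodes every record before processing (this sketch assumes the payloads are JSON):

```python
# A sketch of a Lambda function processing a batch of Kinesis stream records.
# Each record's payload arrives base64-encoded at event["Records"][i]["kinesis"]["data"].

import base64
import json

def lambda_handler(event, context):
    decoded = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])  # raw bytes
        decoded.append(json.loads(payload))                    # assumes JSON payloads
    # ...real-time analytics on `decoded` (dashboards, anomaly detection,
    # dynamic pricing, etc.) would happen here...
    return {"batchSize": len(decoded), "records": decoded}
```

Lambda invokes this once per batch it reads from the stream, so the function only ever has to think about one batch of records at a time.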
The next sample shows creating a Chatbot with Amazon Lex. In the diagram above, explained briefly:
- Amazon Lex is used to build a conversational interface for any application using voice and text. Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions. This enables you to build sophisticated, natural language, conversational bots (“chatbots”)
- To build an Amazon Lex bot, you will need to identify a set of actions – known as ‘intents’ — that you want your bot to fulfill. A bot can have multiple intents. For example, a ‘BookTickets’ bot can have intents to make reservations, cancel reservations and review reservations. An intent performs an action in response to natural language user input
- To create a bot, you will first define the actions performed by the bot. These actions are the intents that need to be fulfilled by the bot. For each intent, you will add sample utterances and slots. Utterances are phrases that invoke the intent. Slots are input data required to fulfill the intent. Lastly, you will provide the business logic necessary to execute the action. Amazon Lex integrates with AWS Lambda which you can use to easily trigger functions for execution of your back-end business logic for data retrieval and updates
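To make the intent/utterance/slot relationship concrete, here's a hedged sketch of one intent for the 'BookTickets' bot described above, as a plain dict in the spirit of what the Lex console captures (all names & the ARN are hypothetical):

```python
# Hypothetical 'MakeReservation' intent for a 'BookTickets' bot.
# Utterances are phrases that invoke the intent; slots are the
# inputs Lex must collect before the intent can be fulfilled.
make_reservation_intent = {
    "intentName": "MakeReservation",
    "sampleUtterances": [
        "Book {SeatCount} tickets for {ShowDate}",
        "I want tickets for {ShowDate}",
    ],
    "slots": [
        {"name": "ShowDate", "slotType": "AMAZON.DATE", "required": True},
        {"name": "SeatCount", "slotType": "AMAZON.NUMBER", "required": True},
    ],
    # Fulfillment delegates the back-end business logic to a Lambda function.
    "fulfillmentLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:book-tickets",
}
```

Notice how the slots appear inside the utterances as placeholders — Lex fills them from the conversation before handing the intent to Lambda for fulfillment.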
The last sample shows using Amazon CloudWatch events to respond to state changes in your AWS resources.
In the above diagram, explained briefly:
- Amazon CloudWatch Events help you to respond to state changes in your AWS resources. When your resources change state, they automatically send events into an event stream. You can create rules that match selected events in the stream and route them to your AWS Lambda function to take action
- Alternatively, a Lambda function can be configured so that AWS Lambda executes it on a regular schedule. You can specify a fixed rate (for example, execute a Lambda function every hour or every 15 minutes), or you can specify a cron expression
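The rule-matching idea above can be modeled in a few lines. This is a deliberately simplified sketch (an assumption, not the full CloudWatch Events pattern spec, which also supports nested fields):

```python
# Simplified model: a rule's event pattern maps each field to a list of
# acceptable values; an event matches when every patterned field holds
# one of them.
def matches(pattern, event):
    return all(event.get(field) in allowed for field, allowed in pattern.items())

# Route only EC2 instance state-change notifications to a target:
rule_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
}

# Scheduled invocation, by contrast, uses a schedule expression instead
# of an event pattern, e.g. "rate(15 minutes)" or "cron(0 12 * * ? *)".
```

An event from S3 (`"source": "aws.s3"`) would fail the `source` check & never reach this rule's Lambda target.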
In an earlier slide, I mentioned that serverless Lambda functions are executed as autonomous and isolated functions that run in stateless compute services called CONTAINERS. Let’s look at containers & the value they provide in more depth.
AWS Lambda functions execute in a container (also known as a “sandbox”) that isolates them from other functions and provides resources, such as memory, specified in the function’s configuration.
So what’s the difference between Virtual Machines & Containers?
- Virtual Machines and Containers differ in several ways, but the primary difference is that Containers provide a way to virtualize an OS so that multiple workloads can run on a single OS instance
- With VMs, the hardware is being virtualized to run multiple OS instances
So How is a Docker Container Different than a Hypervisor?
- Docker containers are executed with the Docker engine rather than the Hypervisor. Containers are therefore smaller than Virtual Machines, start up faster, & offer better performance and greater compatibility (with less isolation) because they share the host’s kernel
Virtualization offers the ability to emulate hardware to run multiple operating systems (OS) on a single computer
So What’s the Difference Between Hypervisors & Containers?
- In terms of Hypervisor categories, “bare-metal” refers to a Hypervisor running directly on the hardware, as opposed to a “hosted” Hypervisor that runs within the OS
- When a Hypervisor runs at the bare-metal level, it controls execution at the processor. From that perspective, the OSes are the apps running on top of the Hypervisor
- So from Docker’s perspective, Containers are the apps running on your OS
Similar to how a Virtual Machine virtualizes (meaning it removes the need to directly manage) server hardware, Containers virtualize the operating system of a server
In the beginning, the only option available as a LAUNCH TYPE was Amazon EC2. Soon, customers started containerizing applications within EC2 instances using Docker. Containers made it easy to build & scale cloud-native applications. The advantage of doing this was that the Docker IMAGES were PACKAGED APPLICATION CODE that’s portable, reproducible, & immutable!
Like any new application solution, once one problem is tackled, another one eventually is identified. This is how advancements in technology are born: a customer has a new request, & companies discover ways to fulfill that request. The next request was that customers needed an easier way to manage large clusters of instances & containers. The problem was solved by AWS creating Amazon ECS that provides cluster management as a hosted service. It’s a highly scalable, high-performance container orchestration service that supports DOCKER containers & allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines.
However, Cluster Management is only half the equation. Using Amazon EC2 as the container launch type, you end up managing more than just containers. ECS is responsible for managing the lifecycle & placement of tasks. Tasks are one or more containers that work together. You can start or stop a task with ECS, & it stores your intent. But it doesn’t run or execute your containers; it only manages tasks. An EC2 Container instance is simply an EC2 instance that runs the ECS Container Agent. Usually, you run a cluster of EC2 container instances in an autoscaling group. But, you still have to patch & upgrade the OS & agents, monitor & secure the instances, & scale for optimal utilization.
If you have a fleet of EC2 instances, managing fleets is hard work. This includes having to patch & upgrade the OS, the container agents & more. You also have to scale the instance fleet for optimization, & that can be a lot of work depending on the size of your fleet.
When you use Amazon EC2 Instances to launch your containers, running 1 container is easy. But running many containers isn’t! This led to the launch of a new AWS Service to handle this.
Introducing AWS FARGATE! AWS Fargate is a compute engine to run containers without having to manage servers or clusters. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters. Fargate lets you focus on designing and building your applications instead of managing the infrastructure that runs them.
Container management tools can be broken down into three categories: compute, orchestration, and registry.
Orchestration Services manage when & where your containers run. AWS helps manage your containers & their deployments, so you don’t have to worry about the underlying infrastructure.
AWS Container Services that fall under the functionality of “Orchestration” include:
- Amazon Elastic Container Service (or ECS): ECS is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines. With simple API calls, you can launch and stop Docker-enabled applications, query the complete state of your application, and access many familiar features such as IAM roles, security groups, load balancers, Amazon CloudWatch Events, AWS CloudFormation templates, and AWS CloudTrail logs. Use cases for Amazon ECS include MICROSERVICES, BATCH PROCESSING, APPLICATION MIGRATION TO THE AWS CLOUD, & ML.
- Amazon Elastic Kubernetes Service (Amazon EKS): makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS. Amazon EKS runs the Kubernetes management infrastructure for you across multiple AWS availability zones to eliminate a single point of failure. Kubernetes is open source software that allows you to deploy and manage containerized applications at scale. Kubernetes manages clusters of Amazon EC2 compute instances and runs containers on those instances with processes for deployment, maintenance, and scaling. Kubernetes works by managing a cluster of compute instances and scheduling containers to run on the cluster based on the available compute resources and the resource requirements of each container. Containers are run in logical groupings called pods and you can run and scale one or many containers together as a pod. Use cases for EKS include MICROSERVICES, HYBRID CONTAINER DEPLOYMENTS, BATCH PROCESSING, & APPLICATION MIGRATION.
Compute engines power your containers. AWS Container Services that fall under the functionality of “Compute” include:
- Amazon Elastic Compute Cloud (EC2): EC2 runs containers on virtual machine infrastructure with full control over configuration & scaling
- AWS Fargate: Fargate is a serverless compute engine for Amazon ECS that allows you to run containers in production at any scale. Fargate allows you to run containers without having to manage servers or clusters. With AWS Fargate, you no longer have to provision, configure, and scale clusters of virtual machines to run containers. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters.
The AWS Container Service that falls under the functionality of “Registry” is:
- Amazon Elastic Container Registry, or ECR. ECR is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. Amazon ECR is integrated with Amazon Elastic Container Service (ECS), simplifying your development to production workflow. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure. Amazon ECR hosts your images in a highly available, secure and scalable architecture, allowing you to reliably deploy containers for your applications. Integration with AWS Identity and Access Management (IAM) provides resource-level control of each repository
The steps to run a managed container on AWS are the following:
- You first choose your orchestration tool, either ECS or EKS
- Then choose your launch type EC2 or Fargate
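Once those two choices are made, launching comes down to a handful of request parameters. As a hedged sketch, here's what you might pass to the ECS RunTask API (e.g., via boto3's `ecs_client.run_task(**run_task_params)`) for the ECS + Fargate combination — the cluster, task-definition, & subnet names are hypothetical:

```python
# Parameters for launching a task on ECS with the Fargate launch type.
# Fargate requires awsvpc networking, so a subnet must be supplied.
run_task_params = {
    "cluster": "demo-cluster",          # hypothetical cluster name
    "taskDefinition": "web-app:1",      # hypothetical task definition & revision
    "launchType": "FARGATE",            # or "EC2" for the EC2 launch type
    "count": 1,
    "networkConfiguration": {
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
}
```

Swapping `"launchType"` to `"EC2"` is the only change needed in this request to target a self-managed cluster of container instances instead.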
The EC2 launch type allows you to have server-level, more granular control over the infrastructure that runs your container applications. With EC2 launch type, you can use Amazon ECS to manage a cluster of servers and schedule placement of containers on the servers. Amazon ECS keeps track of all the CPU, memory and other resources in your cluster, and also finds the best server for a container to run on based on your specified resource requirements. You are responsible for provisioning, patching, and scaling clusters of servers, what type of server to use, which applications and how many containers to run in a cluster to optimize utilization, and when you should add or remove servers from a cluster. EC2 launch type provides a broader range of customization options, which might be required to support some specific applications or possible compliance and government requirements.
AWS Fargate is a compute engine that can be used as a launch type that allows you to run containers without having to manage servers or clusters. With AWS Fargate, you no longer have to provision, configure, and scale clusters of virtual machines to run containers. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters. Fargate lets you focus on designing and building your applications instead of managing the infrastructure that runs them. AWS Fargate uses an on-demand pricing model that charges per vCPU and per GB of memory reserved per second, with a 1-minute minimum.
To sum up AWS Fargate’s primary benefits, they are the following:
- There’s absolutely NO INFRASTRUCTURE TO MANAGE!
- Everything is managed at the container level
- Fargate launches containers quickly, & they scale easily
- And, there’s resource-based pricing. You only pay when the service is running
Docker is an operating system for containers. DOCKER USERS ON AVERAGE SHIP SOFTWARE 7X MORE FREQUENTLY THAN NON-DOCKER USERS! It’s an engine that enables any payload to be encapsulated as a lightweight, portable, self-sufficient container. Docker accelerates application delivery by standardizing environments and removing conflicts between language stacks and versions. Docker can be manipulated using standard operations & run consistently on virtually any hardware platform, making it easy to deploy, identify issues, & roll back for remediation. With Docker, you get a single object that can reliably run anywhere. Docker is widely adopted, so there’s a robust ecosystem of tools and off-the-shelf applications that are ready to use with Docker. AWS supports both Docker open-source and commercial solutions.
Running Docker on AWS provides developers and admins a highly reliable, low-cost way to build, ship, and run distributed applications at any scale. You can run Amazon ECS, Amazon EKS, AWS Fargate, Amazon ECR & AWS Batch in Docker containers (as well as ML & ML algorithms for Amazon SageMaker). You can use Docker containers as a core building block creating modern applications and platforms. Docker makes it easy to build and run distributed microservices architectures, deploy your code with standardized continuous integration and delivery pipelines, build highly-scalable data processing systems, and create fully-managed platforms for your developers. A Docker image is a read-only template that defines your container. The image contains the code that will run including any definitions for any libraries & dependencies your code needs. A Docker container is an instantiated (running) Docker image. AWS provides ECR, which is an image registry for storing & quickly retrieving Docker images.
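That read-only template is defined by a Dockerfile. As a minimal, hypothetical example (the base image & file names are assumptions, not from the course):

```dockerfile
# Hypothetical minimal image for a small Python app
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
```

Building this with `docker build` produces the immutable image; running it instantiates a container; pushing it to ECR makes it retrievable by ECS tasks.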
Docker solves one of the main problems that system administrators and developers have faced for years. They would ask the question, “It was working in dev and QA. Why isn’t it working in the production environment?” Well, the problem most of the time is a version mismatch of some library, or a few packages not being installed, etc. This is where Docker steps in OR (**and, I suggest you sing the next few words to the tune of the “Mighty Mouse” theme***) Here comes Docker to Save the Day!
In the above example, there are 4 separate environments using the same Docker container. Docker encourages you to split your applications into their individual components, & ECS is optimized for this pattern. Tasks allow you to define a set of containers that you’d like to be placed together (or, part of the same placement decision), their properties, & how they’re linked. TASKS are a unit of work in ECS that provides grouping of related containers, & they run on container instances. Tasks include all the information that Amazon ECS needs to make the placement decision. To launch a single container, your Task Definition should only include one container definition.
Docker solves this problem by making an image of an entire application with all its dependencies, allowing you to ship it to whatever your required target environment or server is. So in short, if the app worked on your local system, it should work anywhere in the world (because you are shipping the entire thing).
Shown on the above slide is a schematic of an Amazon ECS cluster. Before you can run Docker containers on Amazon ECS, you must create a task definition. You can define multiple containers and data volumes in a task definition. Tasks reference the Container image from the Elastic Container Registry. The ECS Agent pulls the image and starts the container, which runs as part of a Task
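A task definition is declarative JSON. Here's a hedged, minimal sketch with a single container referencing an ECR image (the account ID, family, & image names are hypothetical):

```python
import json

# Minimal ECS task definition: one essential container, Fargate-compatible,
# pulling its image from ECR.
task_definition = {
    "family": "web-app",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",       # CPU units, used for task placement decisions
    "memory": "512",    # MiB, likewise part of the placement decision
    "containerDefinitions": [
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest",
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "essential": True,
        }
    ],
}

# Serialized, this is what gets registered with ECS.
registered = json.dumps(task_definition)
```

Note that to launch a single container, the `containerDefinitions` list holds exactly one entry, matching the guidance above.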
Amazon ECS allows you to run and maintain a specified number of instances of a task definition simultaneously in an Amazon ECS cluster. This is called a service: a long-running collection of Tasks. If any of your tasks should fail or stop for any reason, the Amazon ECS service scheduler launches another instance of your task definition to replace it and maintain the desired count of tasks in the service, depending on the scheduling strategy used. You can optionally run your service behind a load balancer. The load balancer distributes traffic across the tasks that are associated with the service.
At the bottom of the orange rectangle on the right, there’s a red rectangle surrounding the words that read “Key/Value Store”. This refers to ETCD, an open-source distributed key-value store whose job is to safely store critical data for DISTRIBUTED SYSTEMS. Its primary use is to store configuration data, state, and metadata. Containers usually run on a cluster of several machines, so Etcd makes it easy to store data across a cluster and watch for changes, allowing any node in a cluster to read and write data. Etcd’s watch functionality is used by container orchestrators to monitor changes to either the actual or the desired state of the system; if the two differ, the orchestrator makes changes to reconcile them.
How you architect your application on Amazon ECS depends on several factors, with the launch type you are using being a key differentiator. The image on the left represents Amazon ECS launched with the EC2 launch type. Let’s look at that architecture vs. ECS launched with the Fargate launch type that’s shown in the image on the right. The EC2 launch type consists of a varying number of EC2 instances. Both launch types show Scheduling & Orchestration, & a Cluster Manager & a Placement Engine that both use ECS.
Orchestration provides the following: Configuration, Scheduling, Deployment, Scaling, Storage or Volume Mapping, Secret Management, High Availability, & Load Balancing Integration. The service scheduler is ideally suited for long running stateless services and applications. The service scheduler ensures that the scheduling strategy you specify is followed and reschedules tasks when a task fails (for example, if the underlying infrastructure fails for some reason). Cluster management systems schedule work and manage the state of each cluster resource. A common example of developers interacting with a cluster management system is when you run a MapReduce job via Apache Hadoop or Apache Spark. Both of these systems typically manage a coordinated cluster of machines working together to perform a large task. When a task that uses the EC2 launch type is launched, Amazon ECS must determine where to place the task based on the requirements specified in the task definition, such as CPU and memory. Similarly, when you scale down the task count, Amazon ECS must determine which tasks to terminate. You can apply task placement strategies and constraints to customize how Amazon ECS places and terminates tasks. Task placement strategies and constraints are not supported for tasks using the Fargate launch type. By default, Fargate tasks are spread across Availability Zones.
From this point on, what you see in these two images differs a lot.
The Amazon ECS container agent, shown near the bottom of the EC2 launch type image, allows container instances to connect to your cluster. The Amazon ECS container agent is only supported on Amazon EC2 instances.
Amazon ECS uses Docker images in task definitions to launch containers on Amazon EC2 instances in your clusters. A Docker Agent is the containerized version of the host Agent & is shown next to the ECS Container Agents. Amazon ECS-optimized AMIs, or Amazon Machine Images, are pre-configured with all the recommended instance specification requirements.
Now let’s look at the image on the right, showing a diagram of using the AWS Fargate launch type. Amazon ECS (& EKS) supports Fargate technology, so customers can choose AWS Fargate to launch their containers without having to provision or manage EC2 instances. AWS Fargate is the easiest way to launch and run containers on AWS. Customers who require greater control of their EC2 instances to support compliance and governance requirements or broader customization options can choose to use ECS without Fargate to launch EC2 instances.
Fargate is like EC2 but instead of giving you a virtual machine, you get a container. Fargate is a compute engine that allows you to use containers as a fundamental compute primitive without having to manage the underlying instances. AWS Fargate removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters, whereas with the EC2 launch type all of these tasks & more must be handled by you manually & continually. With Fargate launch type, all you have to do is package your application in containers, specify the CPU and memory requirements, define networking and IAM policies, and launch the application.
By using Amazon ECS, AWS reports that customers can reduce their compute footprint by up to 70%. With that context, let’s quickly review the difference between ECS & EKS for clarity.
Both container services have CONTAINER-LEVEL NETWORKING. They also both have DEEP INTEGRATION WITH AWS PLATFORM, but the feature similarities stop there.
The Amazon ECS CLI makes it easy to set up your local environment & supports Docker Compose, an open-source tool for defining & running multi-container apps. Amazon EKS has a scalable and highly-available control plane that runs across multiple AWS availability zones. The Amazon EKS service automatically manages the availability and scalability of the Kubernetes API servers and the etcd persistence layer for each cluster, running the Kubernetes control plane across three Availability Zones to ensure high availability, and it automatically detects and replaces unhealthy masters. Amazon ECS allows you to define tasks through a declarative JSON template called a Task Definition. Within a Task Definition you can specify one or more containers that are required for your task, including the Docker repository and image, memory and CPU requirements, shared data volumes, and how the containers are linked to each other. Task Definition files also allow you to have version control over your application specification. Amazon EKS also provides patching for the control plane. Amazon ECS includes multiple scheduling strategies that place containers across your clusters based on your resource needs (for example, CPU or RAM) and availability requirements. Using the available scheduling strategies, you can schedule batch jobs, long-running applications and services, and daemon processes. Amazon EKS performs managed, in-place cluster upgrades for both Kubernetes and the Amazon EKS platform version. There are two types of updates that you can apply to your Amazon EKS cluster: Kubernetes version updates and Amazon EKS platform version updates.
Amazon ECS is integrated with Elastic Load Balancing, allowing you to distribute traffic across your containers using Application Load Balancers or Network Load Balancers. You specify the task definition and the load balancer to use, and Amazon ECS automatically adds and removes containers from the load balancer. Amazon EKS is fully compatible with Kubernetes community tools and supports popular Kubernetes add-ons. These include CoreDNS to create a DNS service for your cluster and both the Kubernetes Dashboard web-based UI and the kubectl command line tool to access and manage your cluster on Amazon EKS. Amazon ECS is built on technology developed from many years of experience running highly scalable services; you can launch tens or tens of thousands of Docker containers in seconds using Amazon ECS with no additional complexity. Amazon ECS also provides monitoring capabilities for your containers and clusters through Amazon CloudWatch. You can monitor average and aggregate CPU and memory utilization of running tasks as grouped by task definition, service, or cluster, and you can set CloudWatch alarms to alert you when your containers or clusters need to scale up or down. Amazon EKS runs upstream Kubernetes and is certified Kubernetes conformant, so applications managed by Amazon EKS are fully compatible with applications managed by any standard Kubernetes environment. Finally, Amazon ECS includes an integrated service discovery that makes it easy for your containerized services to discover and connect with each other.
Previously, to ensure that services were able to discover and connect with each other, you had to configure and run your own service discovery system or connect every service to a load balancer. Now, you can enable service discovery for your containerized services with a simple selection in the ECS console, AWS CLI, or using the ECS API. Amazon ECS creates and manages a registry of service names using the Route 53 Auto Naming API. Names are automatically mapped to a set of DNS records so you can refer to services by an alias, and have this alias automatically resolve to the service’s endpoint at runtime. You can specify health check conditions in a service’s task definition and Amazon ECS will ensure that only healthy service endpoints are returned by a service lookup.
Amazon ECR is a highly available and secure private container repository that makes it easy to store and manage your Docker container images, encrypting and compressing images at rest so they are fast to pull and secure. It’s a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure. Amazon ECR hosts your images in a highly available and scalable architecture, allowing you to reliably deploy containers for your application. There’s deep integration with AWS Identity and Access Management (IAM) to provide resource-level control of each repository, & ECR integrates natively with other AWS Services.
In the next section of video #4, “Cloud & Data Metamorphosis, Part 4” I’ll cover the following:
- The Evolution of Data Analysis Platform Technologies
- The Benefits of Serverless Analytics
- And, an Introduction to AWS Glue & Amazon Athena
SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4“
This 4th video in the 4-part video series “Cloud & Data Metamorphosis” is the last video in the Video 3 series. By the end of this 3.4 video, you’ll have a complete foundation for the upcoming videos that are on specific AWS Services. Most of the upcoming videos will be on – but not limited to – AWS Glue, Amazon Athena, & the Amazon S3 Data Lake Storage Platform. But I’m saving the last video in this series to give you a “wow” factor that hopefully will bring all topics full-circle, with everything summed up in a, quite frankly, unexpected manner. As you look back after watching the last video, you’ll see that the end of the video is actually the beginning 🙂 That last video is entitled, “How to AI & ML Your Apps & Business Processes“.
There’s a ~7 minute video set to 3 songs that represent the journey I hope to take you on through my course & YouTube videos (whose graphics are squares, music stops abruptly & ok, I lingered a bit long on John Lennon’s “Imagine” 🙂 , & my grammar is atrocious!) that you can find on my YouTube channel entitled, “Learn to Use AWS AI, ML, & Analytics Fast!“. I’ll tempt you with that (I mean bad quality :-0 ) video now…it can be viewed here. Keep in mind that I didn’t “polish” that quickly-created video, but nevertheless, it’s relevant (& FUN!)
Below you’ll find the embedded YouTube Video 3.4:
SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4” in Text & Screenshots
This is the 4th & last video of the video series entitled “Cloud & Data Metamorphosis” that augments my Pluralsight course, “Serverless Analytics on AWS”, which is highlighted with a blue rectangle & pointed to by a blue arrow. Under the blue arrow is the link to that Pluralsight course.
In Part 3 of “Cloud & Data Metamorphosis“, I covered the following:
- Serverless Architectures
- AWS Lambda
- AWS’ Serverless Application Model, or SAM
- All About AWS Containers
- AWS Fargate
- Amazon Elastic Container Registry (ECR)
In Part 4, I’ll cover:
- The Evolution of Data Analysis Platform Technologies
- Serverless Analytics
- How to Give Redbull to Your Data Transformations
- AWS Glue & Amazon Athena
- Clusterless Computing
- An Introduction to Amazon S3 Data Lake Architectures
Continuing how technologies have evolved, in this section I’ll cover the Evolution of Data Analysis Platform Technologies.
The timeline on this slide shows the evolution of data analysis platform technologies.
Around the year 1985, Data Warehouse appliances were the platform of choice. This consisted of multi-core CPUs and networking components with improved storage devices such as Network Attached Storage, or NAS, appliances
Around the year 2006, Hadoop clusters were the platform of choice. This consisted of a Hadoop master node and a network of many computers that provided a software framework for distributed storage and processing of big data using the MapReduce programming model. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. You can see logos for Amazon Elastic MapReduce (EMR) that represent Hadoop frameworks, such as Spark, Hbase, Presto, Hue, Hive, Pig, & Zeppelin.
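The MapReduce model described above can be sketched in miniature. Here's a toy word count, with a short list of strings standing in for HDFS blocks (in a real cluster, the map step would run in parallel on the nodes holding each block):

```python
from collections import Counter
from itertools import chain

# Toy MapReduce: map each "block" of input to (word, 1) pairs,
# then reduce by summing the counts per word.
blocks = ["the quick brown fox", "the lazy dog", "the fox"]

def map_block(block):
    # Map phase: emit a (key, value) pair per word.
    return [(word, 1) for word in block.split()]

def reduce_pairs(pairs):
    # Reduce phase: sum values grouped by key.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_pairs(chain.from_iterable(map_block(b) for b in blocks))
print(word_counts["the"])  # → 3
```

The framework's job — distributing blocks, shipping the `map_block` code to the data, shuffling pairs to reducers, & retrying on hardware failure — is exactly what Hadoop (& EMR) automates.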
Superman is actually an animated gif in the video. Everything Superman is flying towards were created by AWS.
Around the year 2009, Decoupled EMR clusters were the platform of choice. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), shown here using Amazon S3 via EMRFS (the EMR File System), and a processing part which is the MapReduce programming model. This is the first occurrence of compute/memory decoupled from storage. Hadoop again splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
Around the year 2012, the Amazon Redshift cloud data warehouse was introduced, which was a very transformative data analysis platform technology for oh so many awesome reasons! The diagram on the timeline underneath the year 2012 is difficult to read, so I’ll explain it to you. Directly under the written year 2012 is a set of 3 ”sample” client apps that talk to the leader node & receive information back from it via ODBC or JDBC connections. Under the leader node are multiple compute nodes with multiple node slices that have 2-way communication with the leader node as well as with a variety of data sources, shown by the 4 icons on the bottom of that image.
Today, the data analysis platform technology of choice is serverless & clusterless computing via an Amazon S3 data lake, using AWS Glue for ETL & Amazon Athena for SQL, both having 2-way communication with the underlying S3 data store.
Since data is changing, so must the analytics used to glean insights from any type of data. Today data is captured & stored at PB, EB & even ZB scale, and analytics engines must be able to keep up.
Some Samples of the New Types of Analytics include:
- Machine Learning
- Big Data Processing
- Real-time Analytics
- Full Text Search
At this point, I’m excited to share with you the two innovative, cutting-edge serverless analytics services provided by AWS: Amazon Athena & AWS Glue! These services are not only cutting edge because they’re state-of-the-art technologies, but also because they’re serverless. Having a cloud-native serverless architecture enables you to build modern applications with increased agility & lower cost of ownership. It enables you to shift most of your operational & infrastructure management responsibilities to AWS, so you can focus on developing great products that are highly reliable and scalable. Joining the AWS services of Glue & Athena is the Amazon S3 Data Lake Platform. S3 Data Lakes will be covered in the next video.
Data preparation is by far the most difficult & time-consuming task when mapping disparate data types for data analytics. 60% of time is spent on cleaning & organizing data, & 19% of time is spent collecting datasets. The third most time-consuming task is mining data for patterns. The fourth is refining algorithms. The fifth falls under the broad category of “Other”. And the sixth is building training sets for machine learning.
The moral of this story is there HAS TO BE A SOLUTION TO DECREASE THE TIME SPENT ON ALL THESE TASKS! Well, there is, & I can’t wait to share it with you! I’ll begin in the next few slides then elaborate more in the next video.
AWS Glue solves the business problems of heterogeneous data transformation and globally siloed data.
Let’s look at what AWS Glue “Is”:
- The AWS Glue Data Catalog provides 1 Location for All Globally Siloed Data – NO MATTER WHERE IN THE WORLD THE UNDERLYING DATA STORE IS!
- AWS Glue Crawlers crawl global data sources, populate the Glue Data Catalog with enough metadata & statistics to recreate the data set when needed for analytics, & keep the Data Catalog in sync with all changes to data, located across the globe
- AWS Glue automatically identifies data formats & data types
- AWS Glue has built-in Error Handling
- AWS Glue Jobs perform the data transformation, which can be automated via a Glue Job Scheduler, either event-based or time-based
- ETL is the most common data transformation AWS Glue performs, but there are many other data transformations built in
- AWS Glue has monitoring and alerting built in!
- And, AWS Glue ELIMINATES DARK DATA
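To make the crawler & scheduling ideas above concrete, here’s a minimal sketch of defining a crawler with boto3. The crawler name, IAM role, database, & S3 path are all hypothetical placeholders — only the request-building part runs without AWS credentials:

```python
# Sketch: assemble the parameters for glue.create_crawler().
# All names/ARNs/paths below are hypothetical placeholders.
def build_crawler_params(name, role_arn, database, s3_path, schedule=None):
    """Build the keyword arguments for an AWS Glue create_crawler call."""
    params = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:  # time-based scheduling uses a cron expression
        params["Schedule"] = schedule
    return params

params = build_crawler_params(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://my-example-bucket/raw/sales/",
    schedule="cron(0 2 * * ? *)",  # nightly at 02:00 UTC
)
# With AWS credentials configured you would then run:
#   import boto3
#   boto3.client("glue").create_crawler(**params)
```

Event-based runs work similarly, except the crawler is kicked off by a Glue trigger or a Lambda function instead of the cron-style `Schedule`.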
Amazon Athena solves the business problems of heterogeneous data analytics & gives the ability to instantaneously query data without ETL.
Let’s look at what Amazon Athena “Is”:
- Amazon Athena is an interactive query service
- You query data directly from S3 using ANSI SQL
- You can analyze unstructured, semi-structured, & structured data
- Athena scales automatically
- Query execution is extremely fast because queries execute in parallel
- You can query encrypted data in S3 & write encrypted data back to another S3 bucket
- And, you only pay for the queries you run
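As a sketch of how lightweight that is, here’s roughly what submitting a query to Athena looks like with boto3. The database, table, & result bucket are hypothetical, & only the request-building part runs without AWS credentials:

```python
# Sketch: assemble the parameters for athena.start_query_execution().
# Database, table, and bucket names are hypothetical placeholders.
def build_athena_request(sql, database, output_s3):
    """Build the keyword arguments for an Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

sql = """
SELECT region, SUM(amount) AS total_sales
FROM   sales
WHERE  year = 2019
GROUP  BY region
ORDER  BY total_sales DESC
"""

request = build_athena_request(
    sql, "sales_db", "s3://my-example-bucket/athena-results/")
# With AWS credentials configured you would then run:
#   import boto3
#   qid = boto3.client("athena").start_query_execution(**request)["QueryExecutionId"]
```

Note there’s no cluster to spin up first — the ANSI SQL runs directly against the files sitting in S3, & the results land in the output bucket you specify.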
I’ll now touch on the Data Architecture evolution regarding Amazon S3 Data Lakes.
Serverless architectures remove most of the need for traditional “always on” server components. The term “CLUSTERLESS” means today’s architectures don’t require you to provision & manage a cluster of 2 or more computers working together, thus these services are both serverless & clusterless.
AWS Glue, Amazon Athena, & Amazon S3 are the 3 core services that make AWS Data Lake Architectures possible!!! These 3 AWS Services are pretty AMAZING!
Under Amazon Athena’s covers are both Presto & Apache Hive. Presto is an in-memory distributed SQL query engine used for DML (Data Manipulation Language) statements such as SELECT. It can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. It can perform interactive data analysis against GB to PB of data. And, it’s ANSI-SQL compatible with extensions.
Hive is used to execute DDL statements… “Data Definition Language”, a subset of SQL statements that change the structure of the database schema in some way, typically by creating, deleting, or modifying schema objects such as databases, tables, & views in Amazon Athena. It can work with complex data types & a plethora of data formats. It’s used by Amazon Athena to partition data. Hive also supports MSCK REPAIR TABLE (or, ALTER TABLE RECOVER PARTITIONS) to recover partitions & the data associated with them.
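Partitioning in Athena relies on Hive-style `key=value` path segments in S3, so the engine can prune whole folders at query time. Here’s a small sketch — the bucket, table, & columns are hypothetical:

```python
# Sketch: build Hive-style partition prefixes for an S3 data layout.
# Bucket and table names below are hypothetical placeholders.
def partition_prefix(base, **parts):
    """Build an S3 prefix like base/year=2019/month=11/ from keyword args."""
    segments = "/".join(f"{k}={v}" for k, v in parts.items())
    return f"{base.rstrip('/')}/{segments}/"

prefix = partition_prefix("s3://my-example-bucket/logs", year=2019, month=11)

# The matching (hypothetical) Athena DDL, executed by Hive under the covers.
# Note the partition keys live in PARTITIONED BY, not in the column list.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  request_id string,
  status     int
)
PARTITIONED BY (year int, month int)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/logs/';
"""
# After new partition folders land in S3, load them into the catalog with:
#   MSCK REPAIR TABLE logs;
```

Queries that filter on `year` & `month` then scan only the matching prefixes, which cuts both latency & the per-query cost.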
AWS Glue builds on Apache Spark to offer ETL-specific functionality. Apache Spark is a high-performance, in-memory data processing framework that can perform large-scale data processing. You can configure your AWS Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore because the Data Catalog is Hive-metastore compatible. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. By using Spark SQL you can use existing Hive metastores, SerDes, & UDFs.
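To make that concrete, here’s a hedged skeleton of what a Glue ETL job script typically looks like. The database, table, & output path are hypothetical, & this is a sketch rather than runnable sample code — the `awsglue` library only exists inside the AWS Glue job environment:

```python
# Sketch of a Glue ETL job: read from the Data Catalog, transform, write Parquet.
# Runs only on AWS Glue (awsglue ships with the Glue runtime); names are hypothetical.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read a table that a crawler registered in the Glue Data Catalog
raw = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales")

# Cast the amount column from string to double as a sample transformation
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[("region", "string", "region", "string"),
              ("amount", "string", "amount", "double")])

# Write the curated result back to the S3 data lake as Parquet
glue_ctx.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/curated/sales/"},
    format="parquet")
```

Because the Data Catalog acts as the Hive metastore, the same job could also run `spark.sql(...)` statements directly against Catalog tables.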
Either way, serverless architectures provide all the benefits of cloud computing, but with considerably less time spent creating, maintaining, & monitoring infrastructure, & at an amazing cost savings!
I’m going to end this video with some awesome quotes…
Hopefully you remember from Video 2 in this series that we’re in the midst of the 4th Industrial Revolution. On the above slide is a quote from Miguel Milano, President, EMEA, Salesforce. It reads, “The technology driving the 4th Industrial Revolution gives companies an ever-greater influence on how we live, work, & function as a society. As a result, customers expect business leaders to go beyond shareholders & have a longer-term, broader strategic approach“. In other words, what worked yesterday will not work today, at least not for long. Businesses & ourselves HAVE to keep up with the rapid pace of technological change. “It’s the end of the world as we know it!”
The quote above is from Jeff Bezos, Founder & CEO of Amazon.com. It reads, “In today’s era of volatility, there’s no other way but to re:invent. The only sustainable advantage you can have over others is agility, that’s it. Because, nothing else is sustainable, everything else you create, someone else will replicate.”
It’s interesting to watch how blazingly fast everything we do in life is recorded through the very technologies that also help us. People have always stood on the shoulders of giants to bootstrap their careers, but this time it’s different. If you don’t learn AI today, it’s a sobering fact that you’ll fall behind the pack. So, keep up with my course & these videos, stand on my “dwarf” (not giant! I’m REALLY SHORT!) shoulders, & be tenacious to thrive in the 4th Industrial Revolution. I know you can do it!
Keeping the last quote in mind, read the next quote above. It’s a quote from Brendan Witcher, Principal Analyst at Forrester. It reads, “You’re not behind your competitors; you’re behind your customers – behind their expectations“. There are 2 concepts I’d like you to ponder here. First, although you need to keep up with & ideally surpass your competition, knowing what your customers want is always the way to approach a business or a job, & AI can tell you that in real time. Secondly, as your competitors offer state-of-the-art AI solutions, your customers will come to expect that from anyone they choose to do business with.
The last quote I’ll leave you with is a bit of an extension of the last quote. This one is from Theodore Levitt, Former Harvard Business School Marketing Professor. It reads, “People don’t buy a quarter-inch drill. They want a quarter-inch hole“. Acquiring the talent to decipher customers’ requests into what they really mean but perhaps can’t articulate is a valuable characteristic indeed!
This is the end of multi-video series 3 in the parent multi-video series “AI, ML, & Advanced Analytics on AWS”. In the next video in this series, video #4, I’ll go into depth on just how cool the Amazon S3 Data Lake Storage Platform is & why. I’ll also describe how AWS Glue & Amazon Athena fit into that platform. I think you’ll be amazed at how these technologies together, & other AWS Services, provide a complete portfolio of data exploration, reporting, analytics, machine learning, AI, & visualization tools to use on all of your data.
By the way, every top-level video in this series will end with this slide with this image on it & the URL at the top. The URL leads you to my book site, where you can download a 99-pg chapter on how to create a predictive analytics workflow on AWS using Amazon SageMaker, Amazon DynamoDB, AWS Lambda, & some other really awesome AWS technologies. The reason the chapter is 99 pages long is because I leave no black boxes, so that people who are advanced at analytics can get something out of it as well as a complete novice. I walk readers through step-by-step in creating the workflow & describe why each service is used & what it’s doing. Note however, that it’s not fully edited, but there’s a lot of content as I walk you through building the architecture with a lot of screenshots, so you can confirm you’re following the instructions correctly.
I’ll get Video 4 up asap! Until then, #gottaluvAWS!