AI / ML & Advanced Analytics on AWS Series

I find it interesting that a quote attributed to Charles Darwin more than 100 years ago is being repeated today by Amazon Founder & CEO Jeff Bezos…but I know why!

Today more than ever businesses & IT professionals have to evolve, because everything is changing so fast. You & I need to evolve to newer technologies at a more rapid pace than ever before.

This is the point I’d like to make in this (rather long, but broken into sections) post. 

I created a 6.5-hour video tutorial on AWS Glue & Amazon Athena that was edited down to 2.5 hours for Pluralsight, published under the title “Serverless Analytics on AWS”.

Since I had the extra content, I created “augmented videos” that complement the course modules. The next 6 sections are the first 6 videos (here, made into a blog post) in the series I entitled “AI/ML, & Advanced Analytics on AWS”. NOTE: none of the content overlaps with the course, so it’s best to use a combination of course content & YouTube content, which is explained in the “Explainer Video 1”.

For each section, I embed the associated YouTube video & then include screenshots of the slides for those of you who like to read (smile!)

To help you navigate these first 6 videos, I’ve created a document that describes what topics are in each video along with their respective timeframes. In addition, this document begins with a list of the DEMOS that the Pluralsight course has. Below are a couple of screenshots showing how I break down the content in the document:

SECTION/VIDEO 1: “Explainer Video”

This section introduces you to the course & suggests the best way to interweave the YouTube videos/blog post with the course. (The meat of the content begins in Section 2.)

Pluralsight TOC

The Pluralsight course’s Table of Contents

Below you’ll find the embedded YouTube Explainer Video 1:

As of today, 11/7/2019, I have YouTube Videos 1, 2, 3.1, 3.2, 3.3, & 3.4 done; as soon as I can, I’ll finish the rest of the YouTube videos. Video 4 will be on Amazon S3 Data Lakes. The other videos will be on AWS Glue, Amazon Athena, Amazon QuickSight with a DEMO, an entirely new DEMO on automating AWS Glue Crawlers & AWS Glue Transformations, & knowing me there will be more ( 🙂 ), but I don’t know the order yet other than Video 4. I’ll get these done asap for you!

Here are some screenshots from the explainer video to give you an idea of how I explain the best way to watch the course content with the YouTube content:

SECTION/VIDEO 1: Explainer Video in Text & Screenshots

First Slide of Explainer Video

First Slide Explainer Video

Analytics today stretches beyond Business Intelligence to include real-time streaming & analytics augmented with Artificial Intelligence. AWS Glue, Amazon Athena, & Amazon S3 Data Lakes have transformed how businesses perform cutting-edge analytics in the day & age of AI & ML. Thus, the sooner you begin to learn how to use these services, the less of a learning curve you’ll have as AI & ML progress.

Viewing Order Intertwined

Viewing Order Intertwined

Let’s look at a couple different scenarios that you’ll see in each of the videos in this series.

TOC of Pluralsight Course

TOC of Pluralsight Course

The screenshot on this slide shows the TOC of the Pluralsight course. Nothing in the YouTube videos is in the Pluralsight course & vice-versa.

The Optimal Time to Watch the Explainer Video to Augment the Pluralsight Course

The Optimal Time to Watch the Explainer Video to Augment the Pluralsight Course

Each video will contain a slide or two that will point out in a screenshot what you should watch before that video (if relevant) & what you should watch after that video (if relevant) for the particular YT video you’re looking at.

Let’s say a YouTube video fits perfectly in between 2 Pluralsight modules. I’d show a screenshot like the one above of the course TOC, with those 2 modules surrounded by a colored rectangle, & suggest which module to watch before the YouTube video & which module to watch after it. This pattern of when to watch the YouTube videos interweaved with the course modules will be included in every YouTube video, so you get the maximum understanding & value from both sources.

Sample: Suggested Structure for Viewing Order

Sample: Suggested Structure for Viewing Order

This is another scenario for the suggested viewing order of this YT video series. Let’s say you’re watching the Pluralsight course module entitled “The Power of Amazon Athena”, surrounded by an orange rectangle. This module consists of individual sections, shown by the larger orange rectangle. Let’s say you just finished watching the section entitled “Databases & Tables” in the course, shown by the yellow rectangle, & you’re about to watch the first demo in the section. Demos always have a pink rectangle around them in all screenshots of the course.

Let’s say I suggest that you watch the section on Databases & Tables in the YT video series after the section on Databases & Tables in the course, because there’s more content in the YT video, & THEN watch the Demo shown on this slide. Another example: I might suggest that after you watch the last section in the course, entitled “Monitoring & Tracing Ephemeral Data,” & before you start the next course section, entitled “How to AI & ML Your Apps & Business Processes,” you watch Demo #2 on YT, because that entire demo was cut from the course. You get the picture!

The Entire Table of Contents of the Pluralsight Course

The Entire Table of Contents of the Pluralsight Course

In the next few slides I’ll show you the TOC of the Pluralsight course, so you’ll know what to look forward to learning; HOWEVER, note that you’ll get more than what you see in the course, including some completely separate full modules that aren’t in the course! Note also that all of the demos & the course code download link to GitHub are only in the course, & the content slides in the course aren’t in the YT series.

The First 3 Modules of the Pluralsight Course

The First 3 Modules of the Pluralsight Course

The slide above shows you the 1st 3 modules of the course. Notice that every module has a play button, surrounded by a blue circle, to the left of the module name, which you click to watch that section. In addition, each module has a chevron (a “caret” character) on the right, surrounded by a pink circle, that expands the top-level module name to show the individual sections & their titles within each module.

The first module has only 1 section, entitled “Course Overview,” which serves as a summary of what you’ll learn & how you’ll benefit from watching the course. The next module also has only 1 section, entitled “Download & Install Course Pre-requisites.” You’ll need to watch that section so you can download the course code assets from my GitHub account & make sure you have a 3rd-party SQL client installed locally on your machine.

The next module shown in this screenshot is entitled “The State of Analytics in the AWS Cloud”. It overviews the AWS services we’ll be using in the course, including the Amazon S3 Data Lake platform, AWS Glue, Amazon Athena, & the cloud-native relational database Amazon Aurora. We’ll create an Aurora database in the cloud, then, via our 3rd-party SQL client, give the database a schema & data, then query it & do some transformations. Later in the course, we’ll use AWS Glue Crawlers to crawl this Aurora database, extract the schema & statistics, & populate the AWS Glue Data Catalog with that metadata. Subsequently, we’ll perform the same queries & transformations on the Aurora table through the Glue Data Catalog, just as we did with the 3rd-party SQL client but without having to connect to the database remotely, showing how much faster, easier, & more efficient querying with AWS Glue & Amazon Athena is vs. connecting remotely (among other things!)
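
To make that crawling step a bit more concrete, here’s a minimal boto3 sketch of what registering an Aurora source in the Glue Data Catalog can look like. The crawler, role, connection, & catalog database names below are hypothetical placeholders, not the ones used in the course demos:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical names -- substitute the ones you create in your own account.
glue.create_crawler(
    Name="aurora-sales-crawler",
    Role="GlueServiceRole",                         # IAM role with Glue + RDS permissions
    DatabaseName="sales_catalog",                   # Glue Data Catalog database to populate
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "aurora-connection",  # Glue connection to the Aurora cluster
            "Path": "salesdb/%",                    # crawl every table in the salesdb schema
        }]
    },
)

# Run the crawler; when it finishes, the Aurora tables' schema & statistics
# appear in the Glue Data Catalog & are immediately queryable from Athena.
glue.start_crawler(Name="aurora-sales-crawler")
```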

Modules 4 & 5 of Pluralsight Course

Modules 4 & 5 of Pluralsight Course

This screenshot shows the next 2 modules in the course. At the top is the module named “Infrastructure & Data Setup via Amazon CloudFormation” that has the first DEMO in the course. This demo will build out the entire relational data part of the course. This entire module isn’t replicated in the YT video series because this module is absolutely centered on the DEMO, which I can’t replicate due to my contract with Pluralsight. You should watch this module in the course because we’ll be building upon this initial AWS framework throughout the rest of the course & video series.

The next module shown in this screenshot is entitled “The Power of AWS Glue.” This module explains important core concepts & features of AWS Glue. The information in the course module is important & not duplicated in the YT video series, so you should watch the course module. However, I suggest you watch the corresponding video in the YT series after that because it contains many more integral concepts.

AWS Glue Demos in the Pluralsight Course

AWS Glue Demos in the Pluralsight Course

The screenshot above shows the next module expanded named “Creating AWS Glue Resources & Populating the AWS Glue Data Catalog.” This module has 4 DEMOS that you’ll only be able to watch in the course. The demos are:

  • Configuring an Amazon S3 Bucket & Uploading the Python Transformation Script
  • Creating the AWS Glue Infrastructure Architecture
  • Running the First Crawler to Populate the Glue Data Catalog & Run the Glue ETL Job to Transform the Data
  • Creating a New AWS Crawler to Crawl the Parquet-formatted Data

After watching this module in the course & doing all the demos, I suggest watching the YT video associated with this module because it gives you more insight into what you’re doing in the demos, as well as a solid understanding of AWS Glue’s primary topics. In addition, the accompanying AWS Glue YT video will have a demo that automates a Glue Crawler & a Glue ETL Job for transforming data, building a completely automated Glue workflow (a rough sketch of the idea follows below). There are no automation demos in the course, so you’ll need to watch that YT video once published. Automation can save you a lot of time indeed!
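
The automation demo itself will live in that upcoming YT video, but to give you the flavor now, here’s a rough, hypothetical boto3 sketch of chaining a crawler run into a Glue ETL job; the crawler & job names are invented for illustration:

```python
import time
import boto3

glue = boto3.client("glue")

CRAWLER = "raw-csv-crawler"       # hypothetical crawler over the raw S3 data
ETL_JOB = "csv-to-parquet-job"    # hypothetical Glue job that writes Parquet

# 1. Crawl the raw data so the Data Catalog reflects the latest schema.
glue.start_crawler(Name=CRAWLER)
time.sleep(60)  # simplified: give the crawler time to enter the RUNNING state

# 2. Wait for the crawler to return to the READY state, then run the ETL job.
while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
    time.sleep(30)

run_id = glue.start_job_run(JobName=ETL_JOB)["JobRunId"]
print("Started Glue job run:", run_id)
```

In a real workflow you’d more likely wire these steps together with Glue Triggers or a Glue Workflow rather than polling in a script, but the sequence of events is the same.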


The Last 2 Modules of the Pluralsight Course

The Last 2 Modules of the Pluralsight Course

The above screenshot shows the last 2 modules of the Pluralsight course. At the top the module is entitled “The Power of Amazon Athena”, & it has 1 DEMO on working with Amazon Athena’s Databases & Tables.

I suggest after watching the course module on Athena, that you watch the YT video that corresponds to the online course module. It’ll have MUCH MORE information in it that’s very beneficial to know (with a short code sketch after this list illustrating a couple of the items), including but not limited to:

  • Elaborating more on Databases & Tables in Amazon Athena
  • An entire section on using Workgroups with Athena that isn’t in the course
  • Elaboration on using Partitions in Athena
  • An entire section on The Top 10 Performance Tuning Tips for Athena
  • An extended section on Monitoring & Tracing Ephemeral Data
  • A section on using Amazon QuickSight, including a DEMO
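
To give you a small taste of the Workgroups & Partitions items above, here’s a hedged boto3 sketch of running a partition-pruned query inside a dedicated Athena workgroup. The database, table, workgroup, & results-bucket names are made up for illustration:

```python
import boto3

athena = boto3.client("athena")

# Filtering on the partition columns (year, month) means Athena only scans
# that slice of the data in S3, which keeps both latency & cost down.
query = """
    SELECT customer_id, SUM(order_total) AS revenue
    FROM sales_catalog.orders_parquet
    WHERE year = '2019' AND month = '11'
    GROUP BY customer_id
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sales_catalog"},  # Glue Data Catalog database
    WorkGroup="analytics-team",                           # hypothetical workgroup with its own limits & metrics
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```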

The last module for both the course & the YT video series is entitled “How to AI & ML Your Apps & Business Processes”. You should watch the course module, then afterwards watch the accompanying YT video because the YT video has a lot of additional information that totally ties together everything you’ve learned in both the Pluralsight course & the video series!!!

Unique Modules Not in the Pluralsight Course

Unique Modules Not in the Pluralsight Course

There will be additional content in this YT video series that isn’t in the Pluralsight course. I’ll point out those videos when I create them.

Personally, I LOVE quotes. I’ve added quotes in the YT series that amazingly fit perfectly into each YT video.

Lots of Failures First

Quote by Sergey Brin

The above quote is from Sergey Brin, Co-founder of Google. It reads “The only way you are going to have success is to have lots of failures first”. I’m aware that what I’m going to be teaching you through these videos isn’t going to be easy: it’s a steep learning curve. But don’t give up! Having many failures & even small wins will bring you closer to conquering that mountain!

Change Before You Have To

Quote from Jack Welch, Former Chairman & CEO of General Electric

The above quote is from Jack Welch, Former Chairman & CEO of General Electric. It reads “Change before you have to”.

Jack Welch’s quote has a direct correlation to the rapid pace of technology today. Those that start when the technologies are beginning to advance rapidly will have an easier time keeping up as newer technologies built upon what’s new now are released.

Last Slide of Explainer Video 1

The Last Slide of Explainer Video 1, Introducing Video 2

SECTION/VIDEO 2: Advanced Data & Emerging Technologies

This section augments the course by explaining how advanced analytics has impressively & massively progressed. We are now able to ask, & solve, many new types of analytical questions, gaining deeper insights than ever before. This section also describes the technologies emerging today – at an unprecedented rate – that are changing our lives completely.

Below you’ll find the embedded YouTube Video 2 of the series:

Here are some screenshots from “Advanced Data & Emerging Technologies” to give you an idea of what’s covered in this video 2:

SECTION/VIDEO 2: Advanced Data & Emerging Technologies in Text & Screenshots

Advanced Data & Emerging Technology, Video 2

Advanced Data & Emerging Technology, Video 2

In this Video #2, I’ll explain how advanced analytics has impressively progressed, the types of questions that can be solved with analytics today, & what technologies are emerging at an unprecedented rate! I’ll also begin to explain why YOU need to know the compelling content in this video series, which will be expanded upon in subsequent videos when appropriate.

The Optimal Time to Watch Video 2

The Optimal Time to Watch Video 2 to Interweave the Pluralsight Course

Shown above, this 2nd video in the series fits nicely AFTER “Download & Install Course Pre-requisites” & BEFORE “The State of Analytics in the AWS Cloud“.

What's In This FOR ME???

What’s In This FOR ME???

I find that the reason some really awesome technologies aren’t as widely adopted as they should be (other than by very large enterprises) is that courses usually just show “how” to do something. If you understand “why” & “where” these technologies fit into your business, what business problems they solve, & then learn “how” to implement them, you’re more likely to take the time to advance your skills & use the very best technologies.

Thus, that’s my approach for this series, both in blog format & video format.

In a Perfect World 1...

In a Perfect World 1…

First, let me set the stage for this series with some “What if you could…” type questions, kind of like “In a perfect world” type questions.

(The first of 2 slides on this topic is shown above & the second slide on this topic is shown below)

What if you could…

  • Know in real-time what your customers want, the prices they’re willing to pay, & the triggers that make them buy more?
  • Have 1 unified view of your data no matter where in the world it’s stored?
  • Query your global data without the need to move it to another location, in order to perform any type of analysis you want to?
  • Automate any changes made to underlying data stores, keeping the 1 unified view in sync at all times?
In a Perfect World 2...

In a Perfect World 2…

What if you could…

  • Have a single, centralized, secure and durable platform combining storage, data governance, analytics, AI and ML that allows ingestion of structured, semi-structured, and unstructured data, transforming these raw data sets in a “just-in-time” manner? Not ETL (extract, transform, & load), but ELT (extract, load, THEN transform) performed on the fly when you need the data?
  • Drastically reduce the amount of time spent on mapping disparate data, which is the most time-consuming step in analytics?
  • Turbo-charge your existing apps by adding AI into their workflows? And build new apps & systems using this best-practice design pattern?
  • Future-proof your data, have multiple permutations of the same underlying data source without affecting the source, and use a plethora of analytics and AI/ML services on that data simultaneously?

What is stated on the last 2 slides would be transformative in how your business operates. It would lead to new opportunities, give you a tremendous competitive advantage, & provide the ability to satisfy your existing customers while attracting new ones, wouldn’t it?

Now, imagine having all these insights on steroids!!!

Gartner's Intelligent Digital Mesh

Gartner’s Intelligent Digital Mesh

The content on the above slide is Gartner’s 2019 Top Strategic Technology Trends. This annual report is called “The Intelligent Digital Mesh“. Gartner defines a strategic technology trend as one with substantial disruptive potential that is beginning to break out of an emerging state into broader impact and use, or a rapidly growing trend with a high degree of volatility that is expected to reach a tipping point over the next five years.

In other words, “be prepared” today; don’t wait for these technologies to be 100% or even 40% mainstream: then you’re too late!

The definitions of the 3 categories are the following:

  • The Intelligent category provides insights on the very best technologies, which today are under the heading AI
  • The Digital category provides insights on the technologies that are moving us into an immersive world
  • The Mesh category provides insights on what technologies are cutting-edge that intertwine the digital & physical world

In the Intelligent category, we have the AI strategic trends of:

  • Autonomous Things: Autonomous Things exist across 5 types: ROBOTICS, VEHICLES, DRONES, APPLIANCES, & AGENTS
  • Augmented Analytics is the result of the vast amount of data that needs to be analyzed today. It’s easy to miss key insights from hypotheses. Augmented analytics uses automated algorithms to explore more hypotheses through data science & machine learning platforms. This trend has transformed how businesses generate analytical insights
  • AI-driven Development highlights the tools, technologies & best practices for embedding AI into apps & using AI to create AI-powered tools

In the Digital category, we have the Digital strategic trends of:

  • Digital Twins: A Digital Twin is a digital representation that mirrors a real-life object, process or system. They improve enterprise decision making by providing information on maintenance & reliability, insight into how a product could perform more effectively, & data about new products with increased efficiency
  • Empowered Edges: This is a topology where information processing, content collection, & delivery are placed closer to the sources of the information, with the idea that keeping traffic local will reduce latency. Currently, much of the focus of this technology is a result of the need for IoT systems to deliver disconnected or distributed capabilities into the embedded IoT world
  • Immersive Experiences: Gartner predicts that through 2028, conversational platforms will change how users interact with the world. Technologies like Augmented Reality (AR), Mixed Reality (MR) & Virtual Reality (VR) change how users perceive the world. These technologies increase productivity, with the next generation of VR able to sense shapes & track a user’s position, while MR will enable people to view & interact with their world, & augmented reality will just blow your mind, it’s oh-so-cool! (SHAMELESS PLUG: The Augmented World Expo is the biggest conference and expo for people involved in Augmented Reality, Virtual Reality and Wearable Technology (Wikipedia). This is a shameless plug because I know the founders & have attended this mind-blowing conference for years!)

In the Mesh category, we have the strategic trends of:

  • Blockchain: Blockchain is a type of distributed ledger, an expanding chronologically ordered list of cryptographically signed, irrevocable transactional records shared by all participants in a network. Blockchain allows companies to trace a transaction & work with untrusted parties without the need for a centralized party such as banks. This greatly reduces business friction & has applications that began in finance, but have expanded to government, healthcare, manufacturing, supply chain & others. Blockchain could potentially lower costs, reduce transaction settlement times & improve cash flow
  • Smart Spaces: Smart Spaces are evolving along 5 key dimensions: OPENNESS, CONNECTEDNESS, COORDINATION, INTELLIGENCE & SCOPE. Essentially, smart spaces are developing as individual technologies emerge from silos to work together to create a collaborative & interactive environment. The most extensive example of smart spaces is smart cities, where areas that combine business, residential & industrial communities are being designed using intelligent urban ecosystem frameworks, with all sectors linking to social & community collaboration

Spanning all 3 categories are:

  • Digital Ethics & Privacy: This represents how consumers have a growing awareness of the value of their personal information, & they are increasingly concerned with how it’s being used by public & private entities. Enterprises that don’t pay attention are at risk of consumer backlash
  • Quantum Computing: This is a type of nonclassical computing that’s based on (Now, for all of you who aren’t familiar with this, put your seatbelts on because this is absolutely phenomenal!!!) the quantum state of subatomic particles that represent information as elements denoted as quantum bits or “qubits.” Quantum computers are an exponentially scalable & highly parallel computing model.  A way to imagine the difference between traditional & quantum computers is to imagine a giant library of books. While a classic computer would read every book in a library in a linear fashion, a quantum computer would read all the books simultaneously. Quantum computers are able to theoretically work on millions of computations at once. Real-world applications range from personalized medicine to optimization of pattern recognition.

This course covers most of these strategic IT trends for 2019. You can read more about each category by visiting the entire report at the URL on the bottom left of the slide: it’s not only a fascinating read, but also a reality check on what you should be focusing on today!

The 4th Industrial Revolution

The 4th Industrial Revolution

Today we’re experiencing what has been called The 4th Industrial Revolution. A broad definition of the term Industrial Revolution is “unprecedented technological & economic development that has changed the world”. The timeline on the above slide indicates when each Industrial Revolution occurred & what new inventions defined such a large-scale change. The emergence of the 4th Industrial Revolution is attributed primarily to technological advances built atop the technologies that were paramount in Industry 3.0.

As the pace of technological change quickens, we need to be sure that we & our employees are keeping up with the right skills to thrive in the Fourth Industrial Revolution.

Eric Schmidt Quote

Eric Schmidt Quote

The quote on the slide above should confirm that, given the vast amount of data today, having 1 location for all your global data is essential in order to take advantage of all of it for analytics. The quote is from Eric Schmidt (LOVE his last name!), Former Executive Chairman of Google. It reads, “There were five exabytes of information created between the dawn of civilization through 2013, but that much information is now created every two days”. This quote was from around 2013, & boy have things accelerated beyond anyone’s imagination at the time!

Last Slide of Video 2

Last Slide of Video 2

The next video in the YouTube series is a 4-part sub-series entitled “Cloud & Data Metamorphosis“.

SECTION/VIDEO 3.1: “Cloud & Data Metamorphosis, Part 1”

This first video in “Cloud & Data Metamorphosis” will cover many topics, including:

  • Big Data Evolution
  • Cloud Services Evolution
  • Amazon’s Purpose-built Databases
  • Polyglot Persistence
  • Traditional vs. Cloud Application Characteristics
  • Operational vs. Analytical Database Characteristics
Optimal Time to Watch This Video 3 with the Pluralsight Course

Optimal Time to Watch This Video 3 with the Pluralsight Course

If you watched the “explainer video” video #1 in this series, you’ll know that this YouTube course series complements my Pluralsight course entitled “Serverless Analytics on AWS”.

All 4 video segments of this 3rd video in the series, “Cloud & Data Metamorphosis” ideally should be watched AFTER Module 2 “Download & Install Course Pre-requisites” & BEFORE Module 3 “The State of Analytics in the AWS Cloud”.

Below you’ll find the embedded YouTube Video 3.1:

SECTION/VIDEO 3.1: Cloud & Data Metamorphosis Video in Text & Screenshots

Cloud & Data Metamorphosis, Video 3, Part 1

Cloud & Data Metamorphosis, Video 3, Part 1

Now that you’ve learned what was taught in the first content video of this series (Video 2, since Video 1 is an “explainer video”), in this third video, entitled “Cloud and Data Metamorphosis,” I’ll explain the technological changes that have brought us to the 4th Industrial Revolution, so you understand the significance of what you’ll learn in this video series!

Cloud & Big Data Evolution

Cloud & Big Data Evolution

Understanding different technology evolutions is important because, through a linear timeline, you can see how technologies advanced alongside hardware & software advances, or how a technology solution had to evolve to work with advanced & emerging technologies. First, let’s look at Big Data Evolution. A big data challenge is HOW to build big data applications. A few years ago the answer was Spark, no matter what the question was; today, the answer is AI, no matter what the question is.

Big Data Evolution

Big Data Evolution

Initially, batch processing was the only way to analyze data. Batch processing has the following characteristics: the data’s scope is limited to querying or processing over all or most of the data in the dataset. Thus, data comes in the form of large batches, performance has latencies of minutes to hours, & analyses are complex. Think of OLAP, online analytical processing.

Next in big data evolution came stream processing. With the advent of broadband internet, cloud computing, & the IoT, data’s increased volume & velocity necessitated processing a continuous transfer of data rolling in over a small time period at a steady, high-speed rate. Detecting insights in data streams is near-instantaneous. This enabled companies to make data-driven decisions in real-time. Queries are usually done on just the most recent data, in the form of individual records or micro-batches consisting of a few records, with latencies on the order of seconds or milliseconds, & analyses are simple response functions, aggregates, & rolling metrics.

Today, the prevalence of AI processing gives rise to cognitive computing. Cognitive computing describes technology platforms that simulate human thought processes in a computerized model. Using self-learning algorithms that use data mining, pattern recognition, & Natural Language Understanding (NLU), the computer can mimic the way a human brain works.

The good news is that any batch or stream processing you already have in place can be “AI’d”.

Cloud Services Evolution

Cloud Services Evolution

Now let’s look at Cloud Services Evolution. Initially, running Virtual Machines in the cloud was the only option available. Shared computing resources of hardware & software provided a cloud computing environment that lowered the quantity of assets needed & the associated maintenance costs.

Next in cloud services evolution came Managed Services. This term refers to outsourcing daily IT management for cloud-based services and technical support to automate and enhance your business operations.

Now we have Serverless computing. It’s the native architecture of the cloud that enables you to shift more of your operational responsibilities to AWS, increasing your agility and time for innovation. Serverless allows you to build and run applications and services without thinking about servers. It eliminates infrastructure management tasks such as server or cluster provisioning, patching, operating system maintenance, and capacity provisioning. Application scaling is automated, & it provides built-in availability & fault tolerance. You only pay for consistent throughput or execution vs. by server unit.

Database Evolution

Database Evolution

When you develop cloud analytical or AI apps, choosing the database that’s right for your data & the analytics you’ll perform on that data is the first & foremost decision to make. This section builds upon cloud & big data evolution, adding database & data storage evolution.

AWS' Purpose-built Databases

AWS’ Purpose-built Databases

AWS has a number of purpose-built databases. They’re all scalable, fully-managed or serverless, & are enterprise class.

Relational databases support ACID transactions. The data has a pre-defined schema & relationships between the tables. Examples of where relational databases are still a valid choice include traditional applications, ERP, CRM, & e-commerce. With AWS, the offerings include MySQL, PostgreSQL, MariaDB, Oracle, & SQL Server. For analytics, the relational database of choice is Amazon Aurora because it’s cloud native.

Key-value databases are optimized to store & retrieve key-value pairs in large volumes & in milliseconds, without the performance overhead & scale limitations of relational databases. Examples of where key-value databases are used include Internet-scale applications, real-time bidding, shopping carts, & customer preferences. With AWS, the key-value database is the almighty Amazon DynamoDB!
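
As a quick, hypothetical illustration of that key-value access pattern (the table & attribute names below are invented, not from the course):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
cart = dynamodb.Table("ShoppingCart")    # hypothetical table keyed on customer_id

# Write & read a single item by its key -- no joins, no fixed schema beyond the
# key, & consistently fast lookups even at internet scale.
cart.put_item(Item={"customer_id": "c-123", "item_sku": "sku-42", "quantity": 2})
item = cart.get_item(Key={"customer_id": "c-123"})["Item"]
print(item["item_sku"], item["quantity"])
```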

Document databases are designed to store semi-structured data as documents & are intuitive for developers because the data is typically represented as a readable document. Examples of where document databases are used include Content management, personalization, & mobile applications. With AWS, this offering is Amazon DocumentDB with MongoDB-compatibility.

Graph databases are used for applications that need to enable millions of users to query & navigate relationships between highly-connected, graph datasets with millisecond latency. Examples of where graph databases are used include Fraud detection, social networking, & recommendation engines. With AWS, this offering is Amazon Neptune.

In-memory databases are used by applications that require real-time access to data. By storing data directly in memory, these databases provide microsecond latency where millisecond latency isn’t enough. Examples of where in-memory databases are used include Caching, gaming leaderboards, & real-time analytics. With AWS, this is either Amazon ElastiCache for Redis or Amazon ElastiCache for Memcached.

Time Series databases are used to efficiently collect, synthesize, & derive insights from enormous amounts of data that changes over time (aka time-series data). Examples of where time-series databases are used include IoT applications, DevOps, & Industrial telemetry. With AWS, this offering is Amazon Timestream.

Ledger databases are used when you need a centralized, trusted authority to maintain a scalable, complete & cryptographically verifiable record of transactions. Ledger databases were first used in the financial industry, but have since expanded to manufacturing, supply chain, healthcare, & government. Examples of where ledger databases are used include systems of record, supply chain, registrations, & banking transactions. With AWS, this offering is Amazon Quantum Ledger Database (QLDB).

Polyglot Persistence

Polyglot Persistence

It’s only natural that, with the variety of data types today, the old one-size-fits-all approach to choosing data storage for your application just doesn’t work anymore. The plethora of purpose-built AWS databases provides the ability to use POLYGLOT PERSISTENCE in your applications. This means using multiple types of databases together, each matched to the characteristics of the data & the app it serves, vs. forcing all the data into a single database that will perform poorly because it wasn’t built for the many types of data within apps today. The services are loosely coupled & communicate through queues.

In the sample application diagram on the slide above, 5 different databases are used in 1 application. The type of database you choose in different parts of one application should be based on application characteristics.

Traditional vs. Cloud App Characteristics

Traditional vs. Cloud App Characteristics

The table on the slide above explains the difference between traditional vs cloud-based applications. This is important to take note of because application characteristics help define what data storage you’ll use. Rather than the old way of fitting data into relational databases no matter what the data structure was, today you have the flexibility to choose the databases to fit the structure & function that works most efficiently with your various application data needs.

Operational vs. Analytical Databases

Operational vs. Analytical Databases

Before I cover how to optimize raw data in S3, I want to again emphasize how much better your apps will perform by using the right data store for all the data in your app. You need to consider whether your database will be used for operational workloads or for analytical workloads. The table on the left side of the slide above lists the major characteristics of operational vs. analytical databases.

On the right side of the above slide, the table at the top lists the general characteristics of operational workloads, along with the primary dimensions to consider. The table on the lower half of the slide above lists the general characteristics of analytical workloads, along with the primary dimensions to consider for analytical workloads.

Primary Topics of Video 3, Part 2

Primary Topics of Video 3, Part 2

The next video is “Cloud & Data Metamorphosis, Part 2“. The primary topics that’ll be covered include:

  • Amazing Data Factoids that impact analytical systems
  • Data Architecture Evolution
  • Modern Data Analytics Pipelines

SECTION/VIDEO 3.2: “Cloud & Data Metamorphosis, Part 2”

This video is a continuation of “Cloud & Data Metamorphosis, Part 1”. Video 3 is itself a multi-part sub-series within the larger video series “AI, ML, & Advanced Analytics on AWS”.

Cloud & Data Metamorphosis, Part 2

Cloud & Data Metamorphosis, Part 2

Below you’ll find the embedded YouTube Video 3.2:

SECTION/VIDEO 3.2: “Cloud & Data Metamorphosis, Part 2” in Text & Screenshots

Topics Covered in Video 3, Part 1, & Topics Covered in Video 3, Part 2

Topics Covered in Video 3, Part 1, & Topics Covered in Video 3, Part 2

Let’s quickly review the topics covered in “Cloud & Data Metamorphosis, Part 1” & introduce the topics that’ll be covered in this “Cloud & Data Metamorphosis, Part 2“.

In Part 1 of Cloud & Data Metamorphosis, I covered the following:

  • Big Data Evolution
  • Cloud Evolution
  • Database Evolution & the Many Types of AWS’ Purpose-built Databases
  • Polyglot Persistence
  • The Differences Between Traditional & Cloud Application Characteristics
  • How to Determine if Your Database is Operational or Analytical

In Part 2, I’ll cover:

  • Amazing Data Factoids that impact analytical systems
  • Analytical Platform Evolution
  • Dark Data
  • The Problems with Data Silos
  • Data Architecture Evolution
  • Modern, Distributed Data Pipelines
Amazing Data Factoids

Amazing Data Factoids

In the next few slides, I’ll overview Amazing Data Factoids that impact analytical systems.

There Are More Ways to Analyze Data Than Ever Before

There Are More Ways to Analyze Data Than Ever Before

There are more ways to analyze data than ever before:

  • 11 years ago Hadoop was the data king of analysis
  • 8 years ago Elasticsearch was the data king of analysis
  • 5 years ago Presto was the data king of analysis
  • < 4 years ago Spark was the data king of analysis
  • TODAY, AI IS THE DATA KING OF ANALYSIS
Amazing Data Factoids

Amazing Data Factoids That Impact Analytical Systems

I’m sure you know that there is so much more data than most people think. Today, however, we also have more ways to analyze data than ever before. With the democratization of data, there are more people working with data than ever before. Job titles that never used data before must use it now to analyze performance in many ways, for reports, and more.

Garbage In, Garbage Out

Garbage In, Garbage Out

The old adage “garbage in, garbage out” carries more weight today than ever before. Organizations gather huge volumes of data which, they believe, will help improve their products and services. For example, a company may collect data on how users use its products, internal statistics about software development processes, and website visits. However, a large portion of the collected data is never even analyzed. According to International Data Corporation (IDC), a provider of market intelligence on technology, 90% of unstructured data is never analyzed. Such data is known as dark data.

Dark data is a subset of big data, but it constitutes the biggest portion of the total volume of big data collected by organizations in a year. Dark data is the information assets organizations collect, process, & store during regular business activities but generally fail to use for other purposes (i.e., analytics, business relationships, & direct monetization).

The graph shown on the above slide illustrates that as the amount of Big Data grows, the amount of dark data increases. But that does not lessen its importance in the context of business value. There are two ways to view the importance of dark data. One view is that unanalyzed data contains undiscovered, important insights and represents an opportunity lost, whether today or in the future. The other view is that unanalyzed data, if not handled well, can result in a lot of problems, such as legal & security issues (think of the recent GDPR regulations), & more.

Just storing data isn’t enough. DATA NEEDS TO BE DISCOVERABLE, & I’ll be showing you how to do that in this video series. As varied data types & new sources are added to an analytical platform, it’s important that the platform is able to keep up! This means the system has to be flexible enough to adapt to changing data types and volumes. In the remainder of this video & throughout the rest of the Video 3 series, I’ll explain the technologies that not only make data discoverable, available to numerous analytical & AI services, & clean & enrich your data, BUT ALSO FUTURE-PROOF YOUR DATA!!!

A Chilling Quote by Atul Butte

A Chilling Quote by Atul Butte

The quote on the above slide is from Atul Butte, MD (Brown University), Ph.D. in Health Sciences & Technology (MIT & Harvard Medical School), & Founder of Butte Labs, solving problems in genomic medicine. It reads “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world”. Oh, how so very true! (I love data!)

The Importance of Having Complete Data for Analysis

The Importance of Having Complete Data for Analysis

Analytical systems are only as good as the data used to power them. Believe it or not, most organizations have “dirty data”, meaning not only incorrect data, but missing data, duplicate data & incomplete data augmentation. A robust analytics platform is critical for an organization’s success, & dirty data can derail analytical projects completely. In addition, most organizations struggle to have complete data for analysis because of the difficult, complex, laborious & time-consuming task of bringing together the data they need for complete & accurate analysis from many disparate data silos scattered across the globe in differing formats & proprietary systems. These organizations have been forced to use different systems at different stages of growth because there really wasn’t a solution, much less an elegant one, for having all the scattered data silos in 1 place (until now, which will be covered in this series). The data silo effect is amplified by vendor lock-in & the immense work required to make legacy data systems work with current data systems.

Data Architecture Evolution

Data Architecture Evolution

Just like there’s big data & cloud evolution, there’s also data architecture evolution. Let’s dive in!

The Data Silo Problem

The Data Silo Problem

Finding value in data is a journey, often painstakingly undertaken across data silos using a plethora of tools. Data silos are a chronic, deep-rooted problem companies face on a daily basis.

Some of the problems data silos create are the following:

  • The Contents of Data Silos will Likely Differ Slightly
  • It’s Difficult to Determine Which One is the Most Accurate or Up to Date
  • Data Silos Cause Wasted Resources and Inhibited Productivity
  • The Different Tools & Interfaces Used by People with Differing Job Titles Aren’t Used in a Consistent Manner, Producing Unpredictable & Sometimes Disastrous Results
  • Many Situations Occur Where Separate Teams Could Use Another Team’s Data to Solve Problems More Efficiently if They Had Access to It
  • It’s Difficult to Move Data Across Silos
  • With Multiple People and Teams Working on the Same Data, You’re Forced to Keep Multiple Copies of Data
  • Getting the Right Data From Data Silos is Extremely Complex & Often Abandoned, So Analysis is Done on a Subset of the Data, Resulting in Incorrect & Incomplete Analytical Insights
  • Having Consistent Data Transformation & Data Governance Throughout All Silos Borders on Impossible, Depending on What Type of Storage Each Silo Has
  • Users Struggle to Find the Data They Need, Because Finding Data Stored in a Siloed Architecture is Akin to Looking for a Needle in a Haystack

These factors & more slow innovation & evolution drastically. It’s expensive, not only to move the data, but also to pay for many different silo services, which in all probability adds up to paying for storage you don’t need today. Many companies don’t have all these silos mapped out, & if they do, paying employees to keep up legacy systems wastes their time & talents that could go toward more important tasks like innovating, experimenting, & building great products.

There has to be a better way to deal with siloed data. Well, there is, & I can’t wait to explain to you what that is, starting in the remainder of this video & extending to the next video!

Common Challenges Faced by Data Teams

Common Challenges Faced by Data Teams

The diagram above shows some of the challenges data teams face today:

  • With the exponential growth of data from many & varied data sources, the older systems weren’t designed to handle the volume, variety, velocity & the veracity – or accuracy – of data
  • Data is now ubiquitous. Almost every employee has to work with data, but oftentimes in different ways on the same data, increasing the complexity of the data while increasing the chance of data errors
  • And, with the explosion of the many ways to access the data, most people have their “tool of choice”, & moreover certain job roles necessitate working with particular access mechanisms
From Monoliths to Microservices to Serverless

From Monoliths to Microservices to Serverless

The reason we create applications is to deliver business value: we create business logic & operate it so it can provide a service to others. The time between creating business logic and providing service to users with that logic is the time to value. The cost of providing that value is the cost of creation plus the cost of delivery. As technology has progressed over the last decade, we’ve seen an evolution from monolithic applications to microservices and are now seeing the rise of serverless event-driven functions, led by AWS Lambda.

What factors have driven this evolution? Low latency messaging enabled the move from monoliths to microservices, & low latency provisioning enabled the move to Lambda.

Legacy Data Architectures Are Monolithic

Legacy Data Architectures Are Monolithic

Monolithic means composed all in one piece. A monolithic application describes a single-tiered software application in which different components are combined into a single program on a single platform.

Sample of a Monolithic Architecture

Sample of a Monolithic Architecture

The diagram on the above slide is a sample e-commerce application. Despite having different components, modules, & services, the app is built & deployed as one application for all platforms (i.e., desktop, mobile and tablet), using an RDBMS as the data source.

The drawbacks of this architecture include:

  • Apps can get so large & complex that it’s challenging to make changes fast & properly
  • The size of the app can slow down startup time
  • With each update no matter how small it is, you have to redeploy the entire app
  • It’s very challenging to scale
  • A bug in any module can bring down the entire app because everything for the app is connected together
  • It’s very difficult to adopt new & advanced technologies, & if you try to do that, you have to pretty much rebuild the app, which is costly, takes a lot of time, & a ton of effort
Breaking Down a Monolith into Microservices

Breaking Down a Monolith into Microservices

The conceptual image on the above slide provides a representation of breaking a monolithic application into microservices. In monolithic architectures, all processes are tightly coupled and run as a single service. With a microservices architecture, an application is built as independent components that run each application process as a service. These services communicate via a well-defined interface using lightweight APIs. Services are built for business capabilities and each service performs a single function. Because they are independently run, each service can be updated, deployed, and scaled to meet demand for specific functions of an application.

Characteristics of Microservices include the following:

  • Microservices are autonomous. Each component service can be developed, deployed, operated, and scaled without affecting the functioning of other services
  • Microservices are specialized. Each service is designed for a set of capabilities and focuses on solving a specific problem
  • Microservices have great agility. Microservices foster an organization of small, independent teams that take ownership of their services. Teams act within a small and well understood context and are empowered to work more independently and more quickly. This shortens development cycle times. You benefit significantly from the aggregate throughput of the organization
  • Microservices have flexible scaling. Microservices allow each service to be independently scaled to meet demand for the application feature it supports. This enables teams to “right-size” infrastructure needs, accurately measure the cost of a feature, and maintain availability if a service experiences a spike in demand
  • It’s easy to deploy Microservices. Microservices enable continuous integration and continuous delivery, making it easy to try out new ideas and to roll back if something doesn’t work.
  • Microservices allow technical freedom. Microservices architectures don’t follow a “one size fits all” approach. Teams have the freedom to choose the best tool to solve their specific problems
  • With microservices, the code is reusable.  A service written for a certain function can be used as a building block for another feature. This allows an application to bootstrap off itself, as developers can create new capabilities without writing code from scratch
  • Microservices are resilient. Service independence increases an application’s resistance to failure. Applications handle total service failure by degrading functionality and not crashing the entire application
Sample of a Microservices Architecture

Sample of a Microservices Architecture

On the above slide is the e-commerce diagram (redrawn from the previous slide) to represent a modular, Microservices architecture consisting of several components/modules. Each module supports a specific business goal and uses a simple, well-defined interface to communicate with other sets of services. Instead of sharing a single database as in a monolithic application, each microservice has its own database. Having a database per service is essential if you want to benefit from microservices, because it ensures loose coupling. Moreover, a service can use the type of database that is best suited to its needs.

The benefits of a microservice architecture include:

  • Continuous delivery & deployment of large, complex apps
  • Services are smaller & faster to test
  • Services can be deployed independently
  • It enables you to organize the development effort around multiple teams. Each team is responsible for one or more services
  • Each microservice is relatively small
  • They’re easier for developers to understand
  • The IDE is faster, making developers more productive
  • They have faster startup time
  • They have improved fault isolation; if one service has a bug, it doesn’t affect the other services
  • They eliminate any long-term commitment to a technology stack. When developing a new service you can pick a new technology stack. Similarly, when making major changes to an existing service you can rewrite it using a new technology stack
An Example of Modern Data Architecture Diagram

An Example of Modern Data Architecture Diagram

The architectural diagram on the slide above is an example of modern data architecture. You’ll notice an abundance of data sources in the far-left column surrounded by the blue rectangle. These represent a good sample of the wide variety of data sources.

Moving to the next column, surrounded by a red rectangle, the services shown represent a good sample of the many different ways to ingest data into AWS. Which service you use depends on several factors, such as whether the data is streaming in or not, how the data is migrated in, whether you have a hybrid cloud environment, & more.

The next section is the storage section. Notice there are 2 separate storage sections; 1 on the bottom surrounded by a blue rectangle, which represents data that’s streaming in, shown by the blue arrow. The top section is surrounded by the green rectangle. That represents slow-moving data, or batch data.

In the streaming section, Amazon Kinesis Data Streams is streaming in data, shown in both the ingestion section & the streaming storage section surrounded by red rectangles & pointed to by red arrows. The Kinesis Event Capture in the streaming storage section captures the streaming data as it flows in.

In this particular architecture, the Kinesis Event Capture passes the ingested data stream to Amazon EMR to perform some initial stream processing shown surrounded by a blue rectangle & pointed to by another blue arrow. When the initial processing is complete, the streamed-in data is put into an Amazon S3 raw zone bucket, surrounded by an orange rectangle & pointed to by an orange arrow. If the data wasn’t streaming data & was slow-moving data, then rather than Amazon Kinesis Data Streams, the data would be put into the Amazon S3 raw zone bucket by Amazon Kinesis Data Firehose, although this diagram doesn’t show that.

Whenever any further processing needs to be performed, the data in the S3 raw zone is acted upon by whatever the processing service is, shown in this diagram as Spark MLlib on Amazon EMR performing ETL, surrounded by the green rectangle & pointed to by a green arrow. MLlib is Apache Spark’s built-in machine learning library, available on Amazon EMR. The results of EMR’s ML predictive analytics are placed into an Amazon S3 processed zone bucket, surrounded by an orange rectangle & pointed to by an orange arrow. The Amazon S3 processed zone bucket is where the data is staged in an S3 Data Lake, where it awaits any downstream requests to use that curated data.

Surrounded by a pink rectangle & pointed to by a pink arrow is a sample of AWS data stores that serve the data to consumers, illustrating which types of AWS services use the curated data. The yellow rectangle & arrow show the job roles of the people who typically work with the AWS services in the pink rectangle.

On the top right, you’ll see some AWS Services that are commonly used in a modern cloud data architecture, surrounded by a purple rectangle & pointed to by a purple arrow.

Notice how all ingested data, whether it comes streaming in or moves in slowly, eventually ends up in an Amazon S3 Raw Zone bucket &, after some initial processing to prepare the raw data for downstream consumption, in an Amazon S3 Processed Zone bucket (on this diagram it’s called “Staged Data”). I’ll elaborate on those 2 S3 storage locations in the next video, on Amazon S3 Data Lakes.
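
To ground the streaming half of the diagram, here’s a tiny, hypothetical producer pushing a click-stream event onto the Kinesis data stream that feeds this pipeline; the stream name & payload are made up for illustration:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# One click-stream event; in the architecture above, records like this flow
# through Kinesis Data Streams, get processed on EMR, & land in the S3 raw zone.
event = {"user_id": "u-42", "action": "add_to_cart", "sku": "sku-42"}

kinesis.put_record(
    StreamName="clickstream-events",          # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],            # spreads records across the stream's shards
)
```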

Schematic of the Flow of Modern Data Analytics Pipelines

Schematic of the Flow of Modern Data Analytics Pipelines

Let’s now look at how modern data analytics pipelines are built today.

Modern analytical data pipelines decouple Storage from Compute & Processing. In this manner, when a newer technology is introduced, it’s much easier to swap out a storage or compute service for another in the same category. Modern data pipelines have many iterations of processing, analyzing, & storage. Pipelines can go off in multiple directions depending on what downstream applications will be doing with the processed data.

Example of Batch & Real-time Processing

Example of Batch & Real-time Processing

What you’re looking at on the above slide is a simplified architectural diagram to emphasize that both batch & real-time processing can occur at the same time (& even have dozens & dozens of threads being simultaneously processed within each). The point is to see that multiple strings of analytics can be performed simultaneously on the same dataset, shown here as a batch workflow & a real-time streaming workflow, when you use 3 key AWS Services – Amazon S3, AWS Glue, & Amazon Athena – that I’ll be introducing shortly.

The really cool thing about this is that, using the technologies I’ll soon be describing, there’s no limit on how many concurrent users can work on the same underlying datasets without affecting the underlying data source, no matter where in the world it’s stored!


Primary Topics of Video 3.3

Primary Topics of Video 3.3

In the next video, “Cloud & Data Metamorphosis, Video 3.3” I’ll cover the following:

  • Cloud Services Evolution in Regard to Serverless Architectures
  • Serverless Architectures
  • AWS Lambda
  • EVERYTHING You Need to Know About Containers

SECTION/VIDEO 3.3: “Cloud & Data Metamorphosis, Video 3, Part 3”

Part 3 of Video 3 continues from where Video 3, Part 2 left off.

Cloud & Data Metamorphosis, Video 3, Part 3

Cloud & Data Metamorphosis, Video 3, Part 3

“Cloud & Data Metamorphosis” is a multi-video sub-series in the larger video series “AI, ML, & Advanced Analytics on AWS” that augments my Pluralsight course shown on this slide, “Serverless Analytics on AWS”, surrounded by a blue rectangle & pointed to by a blue arrow.

Below you’ll find the embedded YouTube video 3.3:

SECTION/VIDEO 3.3: “Cloud & Data Metamorphosis, Video 3, Part 3” in Text & Screenshots

What I Covered in Video 3.2 & What I’ll Be Covering in Video 3.3

What I Covered in Video 3.2 & What I’ll Be Covering in Video 3.3

In Part 2 of Cloud & Data Metamorphosis, I covered the following:

  • Amazing Data Factoids that impact analytical systems
  • Analytical Platform Evolution
  • Dark Data
  • The Problems with Data Silos
  • Data Architecture Evolution
  • Modern, Distributed Data Pipelines

In Part 3, I'll cover:

  • Serverless Architectures
  • AWS Lambda
  • AWS’ Serverless Application Model, or SAM
  • All About AWS’ Containers
  • AWS Fargate
  • Amazon Elastic Container Registry (ECR)
Cloud Services Evolution in Regard to Serverless Architectures

Cloud Services Evolution in Regard to Serverless Architectures

Let’s now look at The Evolution of Cloud Services in regard to Serverless Architectures.

An Overview of AWS Serverless Architectures

An Overview of AWS Serverless Architectures

It wasn't that long ago that all companies had to buy servers, guessing how many they'd need for peak usage. What normally happened was they erred on the side of being prepared for the worst & ended up overprovisioning the number of servers, costing a ton of money up front.

In the image on the right, with AWS Serverless, gone are the days of "racking & stacking", called undifferentiated heavy lifting in AWS terminology. Serverless has a "pay as you go" pricing model, dramatically reducing costs & making it a perfect way to experiment & innovate cheaply.
It also has the proper data pipeline architecture of decoupling storage from compute & analysis. Load Balancing, Autoscaling, Failure Recovery, Security Isolation, OS Management, & Utilization Management are all handled for you.

Serverless means:

  • Greater Agility
  • Less Overhead
  • Better Focus
  • Increased Scalability
  • More Flexibility
  • Faster Time-to-Market
AWS Serverless

AWS Serverless

Serverless computing came into being back in 2014 at the AWS re:Invent conference, with Amazon Web Services' announcement of AWS Lambda. AWS Lambda is an event-driven, serverless computing platform provided by AWS: a compute service that runs code in response to events & automatically manages the computing resources required by that code. Serverless computing is an extension of microservices, & the serverless architecture is divided into specific core components. To compare the two, microservices group similar functionalities into one service, while serverless computing breaks functionalities into finer-grained components.

The code you run on AWS Lambda is called a "Lambda function." After you create your Lambda function, it's always ready to run as soon as it is triggered. Lambda functions are "stateless," with no affinity to the underlying infrastructure, so that Lambda can rapidly launch as many copies of the function as needed to scale to the rate of incoming events. Billing is metered in increments of 100 milliseconds, making it cost-effective and easy to scale automatically from a few requests per day to thousands per second.

Since AWS Lambda's release, its usage has grown at an astonishing rate of over 300 percent year over year. Serverless analytics can be done via Amazon Athena, which makes it easy to analyze big data directly in S3 using standard SQL. Developers create custom code, & that code is executed as autonomous and isolated functions that run in stateless compute services called CONTAINERS.
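
To make the idea of a Lambda function concrete, here's a minimal sketch of a handler in Python. The function name & event shape are generic placeholders; Lambda simply calls whatever handler you configure with whatever event the trigger delivers.

```python
import json

def lambda_handler(event, context):
    """Minimal AWS Lambda handler: it only runs when an event triggers it.

    `event` carries the trigger's payload (an S3 notification, an API call,
    a Kinesis batch, etc.); `context` carries runtime metadata.
    """
    print("Received event:", json.dumps(event))

    # Return a simple result; for an API Gateway proxy integration this
    # statusCode/body shape is what the caller sees.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello from a serverless function!"}),
    }
```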

AWS Lambda Use Cases

AWS Lambda Use Cases

Let’s back up a bit here & discuss common lambda use cases.

Common AWS Lambda Use Cases include:

  • Static websites, complex Web Applications
  • Backend Systems for apps, services, mobile, & IoT
  • Data processing for either Batch or Real-time computing using AWS Lambda & Amazon Elastic MapReduce (EMR)
  • Powering chatbot logic
  • Powering Amazon Alexa voice-enabled apps & for implementing the Alexa Skills Kit
  • IT Automation for policy engines, extending AWS services, & infrastructure management
Sample of a Total Serverless App

Sample of a Total Serverless App

The architectural diagram on the slide above shows 1 way to build a simple website using all serverless AWS Services. Each service is fully managed and doesn't require you to provision or manage servers. The only thing you need to do to build this is to configure the services together and upload your application code to AWS Lambda.

The workflow represented in the diagram above goes like this:

  • Highlighted by a #1 in a blue circle is the first step, where you configure an Amazon Simple Storage Service (S3) bucket to host static resources for your web application such as HTML, CSS, JavaScript, images, & other files
  • After uploading your assets to the S3 bucket, you need to ensure that your bucket allows public access. You do that in the bucket's Permissions tab, & then enable static website hosting in the bucket's Properties tab
    This will make your objects available at the AWS Region-specific website endpoint of the bucket. Your end users will access your site using the public website URL exposed by Amazon S3. You don’t need to run any web servers or use other services in order to make your site available.
  • When users visit your website they will first register a new user account. This is done by Amazon Cognito, highlighted by a #2 in a blue circle. After users submit their registration, Cognito will send a confirmation email with a verification code to the email address of the new visitor to your site. This user will return to your site and enter their email address and the verification code they received. After users have a confirmed account, they’ll be able to sign in.
  • When users sign in, they enter their username (or email) and password which triggers a JavaScript function that communicates with Amazon Cognito & authenticates using the Secure Remote Password protocol (SRP), and receives back a set of JSON Web Tokens (JWT). The JWTs contain claims about the identity of the user and will be used later to authenticate against the RESTful API of the Amazon API Gateway. Cognito User Pools add sign-up & sign-in functionality to your application. A user pool is a user directory in Amazon Cognito. With a user pool, your users can sign into your web or mobile app through Amazon Cognito.
  • Next, you create a backend process for handling requests for your app using AWS Lambda & DynamoDB, highlighted by a #3 in a blue circle. The Lambda function runs its code in response to events like HTTP events. Each time a user makes a request to the static website, the function records the request in a DynamoDB table then responds to the front-end app with details about the data being dispatched. The Lambda function is invoked from the browser using Amazon API Gateway, highlighted by a #4 in a blue circle, which handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.
  • The Amazon API Gateway acts as a “front door” for applications to access data, business logic, or functionality from your backend services, such as workloads running on Amazon Elastic Compute Cloud (Amazon EC2), code running on AWS Lambda, & any web application, or real-time communication applications. The API Gateway creates a RESTful API that exposes an HTTP endpoint. The API Gateway uses the JWT tokens returned by Cognito User Pools to authenticate API calls. You then connect the Lambda function to that API in order to create a fully functional backend for your web application.
  • Now, whenever a user makes a dynamic API call, every AWS Service is configured properly & will run without you having to provision, scale, or manage any servers.

You can build serverless applications like this for nearly any type of application or backend service, and everything required to run and scale your application with high availability is handled for you. Pretty cool, eh?
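
As a rough sketch of step 3 in the workflow above, the Lambda function sitting behind the API Gateway endpoint could look something like this. The table name, item attributes, & CORS header are hypothetical placeholders, not the exact code behind the diagram.

```python
import json
import os
import uuid

import boto3

# Hypothetical DynamoDB table name, passed in as an environment variable.
table = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "Requests"))

def lambda_handler(event, context):
    """Invoked by Amazon API Gateway (proxy integration) for each dynamic request."""
    body = json.loads(event.get("body") or "{}")

    # Record the request in DynamoDB (step 3 in the diagram).
    item = {"RequestId": str(uuid.uuid4()), "Payload": body}
    table.put_item(Item=item)

    # Respond to the front-end app with details about what was dispatched.
    return {
        "statusCode": 201,
        "headers": {"Access-Control-Allow-Origin": "*"},
        "body": json.dumps({"requestId": item["RequestId"]}),
    }
```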

AWS Serverless Application Model

AWS Serverless Application Model

AWS has an open-source framework for building serverless applications called the AWS Serverless Application Model (SAM). It provides shorthand syntax to express functions, APIs, databases, and event source mappings. With just a few lines per resource, you can define the application you want and model it using YAML. During deployment, SAM transforms and expands the SAM syntax into AWS CloudFormation syntax, enabling you to build serverless applications faster. If you watched any of the demos from the Pluralsight course (& I hope you did!), you’ll be familiar with how cool CloudFormation is!

Capabilities of an AWS Serverless Platform

Capabilities of an AWS Serverless Platform

Delivering a production serverless application that can run at scale demands a platform with a broad set of capabilities.

AWS supports enterprise-grade serverless applications in the following ways:

  • The Cloud Logic Layer: Power your business logic with AWS Lambda, which can act as the control plane and logic layer for all your interconnected infrastructure resources and web APIs. Define, orchestrate, and run production-grade containerized applications and microservices without needing to manage any infrastructure using AWS Fargate.
  • Responsive Data Sources: Choose from a broad set of data sources and providers that you can use to process data or trigger events in real-time. AWS Lambda integrates with other AWS services to invoke functions. A small sampling of the other services includes Amazon Kinesis, Amazon DynamoDB, AWS Cognito, various AI services, queues & messaging, and DevOps code services
  • Integrations Library: The AWS Serverless Application Repository is a managed repository for serverless applications. It enables teams, organizations, and individual developers to store and share reusable applications, and easily assemble and deploy serverless architectures in powerful new ways. Using the Serverless Application Repository, you don’t need to clone, build, package, or publish source code to AWS before deploying it. Instead, you can use pre-built applications from the Serverless Application Repository in your serverless architectures, helping you and your teams reduce duplicated work, ensure organizational best practices, and get to market faster. Samples of the types of apps you’ll find in the App Repository include use cases for web & mobile backends, chatbots, IoT, Alexa Skills, data processing, stream processing, and more. You can also find integrations with popular third-party services (e.g., Slack, Algorithmia, Twilio, Loggly, Splunk, Sumo Logic, Box, etc)
  • Developer Ecosystem: AWS provides tools and services that aid developers in the serverless application development process. AWS and its partner ecosystem offer tools for continuous integration and delivery, testing, deployments, monitoring and diagnostics, SDKs, frameworks, and integrated development environment (IDE) plugins
  • Application Modeling Framework: The AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications. It provides shorthand syntax to express functions, APIs, databases, and event source mappings. With just a few lines of configuration, you can define the application you want and model it
  • Orchestration & State Management: You coordinate and manage the state of each distributed component or microservice of your serverless application using AWS Step Functions. Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly
  • Global Scale & Reach: Take your application and services global in minutes using our global reach. AWS Lambda is available in multiple AWS regions and in all AWS edge locations via Lambda@Edge. You can also run Lambda functions on local, connected devices with AWS Greengrass
  • Reliability & Performance: AWS provides highly available, scalable, low-cost services that deliver performance for enterprise scale. AWS Lambda reliably executes your business logic with built-in features such as dead letter queues and automatic retries. See our customer stories to learn how companies are using AWS to run their applications
  • Security & Access Control: Enforce compliance and secure your entire IT environment with logging, change tracking, access controls, and encryption. Securely control access to your AWS resources with AWS Identity and Access Management (IAM). Manage and authenticate end users of your serverless applications with Amazon Cognito. Use Amazon Virtual Private Cloud (VPC) to create private virtual networks which only you can access

With the AWS Serverless Platform, big data workflows can focus on the analytics & not the infrastructure or undifferentiated heavy lifting (racking & stacking), & you only pay for what you use.

Sample Serverless Architectures 1

Sample Serverless Architectures 1

I’ll show you 3 simplified use cases that use serverless architectures.

The first sample is a real-time streaming data pipeline. Explained briefly, in this architectural diagram (a short code sketch follows the list):

  1. Data is published to an Amazon Kinesis Data Stream
  2. The AWS Lambda function is mapped to the data stream & polls the data stream for records at a base rate of once per second
  3. When new records are in the stream, Lambda invokes your function synchronously with an event that contains the stream records. Lambda reads records in batches and invokes your function to process the records from each batch. The data collected is available in milliseconds, enabling real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more
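
Here's a minimal sketch of what the Lambda function in steps 2 & 3 could look like. Kinesis delivers record payloads to Lambda base64-encoded, so the handler decodes each one before doing its real-time work; the threshold check is just a hypothetical stand-in for your analytics logic.

```python
import base64
import json

def lambda_handler(event, context):
    """Invoked by AWS Lambda with a batch of Amazon Kinesis stream records."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Hypothetical real-time check standing in for dashboards,
        # anomaly detection, dynamic pricing, etc.
        if payload.get("value", 0) > 100:
            print("Possible anomaly:", payload)

    return {"recordsProcessed": len(event["Records"])}
```
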
Sample Serverless Architecture 2

Sample Serverless Architecture 2

The next sample shows creating a chatbot with Amazon Lex. Explained briefly, in the diagram above (with a code sketch after the list):

  1. Amazon Lex is used to build a conversational interface for any application using voice and text. Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions. This enables you to build sophisticated, natural language, conversational bots (“chatbots”)
  2. To build an Amazon Lex bot, you will need to identify a set of actions – known as ‘intents’ — that you want your bot to fulfill. A bot can have multiple intents. For example, a ‘BookTickets’ bot can have intents to make reservations, cancel reservations and review reservations. An intent performs an action in response to natural language user input
  3. To create a bot, you will first define the actions performed by the bot. These actions are the intents that need to be fulfilled by the bot. For each intent, you will add sample utterances and slots. Utterances are phrases that invoke the intent. Slots are input data required to fulfill the intent. Lastly, you will provide the business logic necessary to execute the action. Amazon Lex integrates with AWS Lambda which you can use to easily trigger functions for execution of your back-end business logic for data retrieval and updates
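
To ground step 3, here's a rough sketch of a Lambda fulfillment function for the hypothetical 'BookTickets' bot mentioned above, using the Lex (V1) event & response shapes; the slot names are made up for illustration.

```python
def lambda_handler(event, context):
    """Fulfillment hook that Amazon Lex calls once the intent's slots are filled."""
    intent = event["currentIntent"]["name"]    # e.g. the hypothetical "BookTickets"
    slots = event["currentIntent"]["slots"]    # e.g. {"TicketCount": ..., "ShowDate": ...}

    # Back-end business logic (data retrieval & updates) would go here.
    message = f"{intent} complete: {slots.get('TicketCount')} ticket(s) for {slots.get('ShowDate')}."

    # Tell Lex the conversation is complete and what to say back to the user.
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }
```
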
Sample Serverless Architecture 3

Sample Serverless Architecture 3

The last sample shows using Amazon CloudWatch events to respond to state changes in your AWS resources.

In the above diagram, explained briefly (a short sketch follows the list):

  1. Amazon CloudWatch Events help you to respond to state changes in your AWS resources. When your resources change state, they automatically send events into an event stream. You can create rules that match selected events in the stream and route them to your AWS Lambda function to take action
  2. Additionally, you can direct AWS Lambda to execute a function on a regular schedule. You can specify a fixed rate (for example, execute a Lambda function every hour or every 15 minutes), or you can specify a Cron expression
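
As a sketch of item 2, this is roughly how you could wire up a scheduled rule with boto3; the rule name, schedule, account ID, & Lambda ARN are hypothetical.

```python
import boto3

events = boto3.client("events")
aws_lambda = boto3.client("lambda")

# Hypothetical function ARN & rule name.
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:my-scheduled-task"

# Fire every 15 minutes (a cron expression would work here too).
rule = events.put_rule(Name="every-15-minutes", ScheduleExpression="rate(15 minutes)")

# Point the rule at the Lambda function...
events.put_targets(Rule="every-15-minutes", Targets=[{"Id": "1", "Arn": function_arn}])

# ...and allow CloudWatch Events to invoke that function.
aws_lambda.add_permission(
    FunctionName="my-scheduled-task",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```
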
Containers for Serverless Architectures

Containers for Serverless Architectures

In an earlier slide, I mentioned that serverless Lambda functions are executed as autonomous and isolated functions that run in stateless compute services called CONTAINERS. Let’s look at containers & the value they provide in more depth.

AWS Containers

AWS Containers

AWS Lambda functions execute in a container (also known as a “sandbox”) that isolates them from other functions and provides resources, such as memory, specified in the function’s configuration.

The Difference Between VMs & Containers, & Hypervisors vs. Containers

The Difference Between VMs & Containers, & Hypervisors vs. Containers

So what’s the difference between Virtual Machines & Containers?

  • Virtual Machines and Containers differ in several ways, but the primary difference is that Containers provide a way to virtualize an OS so that multiple workloads can run on a single OS instance
  • With VMs, the hardware is being virtualized to run multiple OS instances

So How is a Docker Container Different than a Hypervisor?

  • Docker containers are executed by the Docker engine rather than by a Hypervisor. Containers are therefore smaller than Virtual Machines, start up faster, & offer better performance & greater compatibility, though with less isolation, because they share the host's kernel.
    Virtualization offers the ability to emulate hardware to run multiple operating systems (OS) on a single computer

So What’s the Difference Between Hypervisors & Containers?

  • In terms of Hypervisor categories, "bare-metal" refers to a Hypervisor running directly on the hardware, as opposed to a "hosted" Hypervisor that runs within the OS
  • When a Hypervisor runs at the bare-metal level, it controls execution at the processor level. From that perspective, the OSes are the apps running on top of the Hypervisor
  • From Docker's perspective, Containers are the apps running on your OS
Virtualization, Explained

Virtualization, Explained

Similar to how a Virtual Machine virtualizes (meaning it removes the need to directly manage) server hardware, Containers virtualize the operating system of a server.

Containers, In the Beginning

Containers, In the Beginning

In the beginning, the only option available as a LAUNCH TYPE was Amazon EC2. Soon, customers started containerizing applications within EC2 instances using Docker. Containers made it easy to build & scale cloud-native applications. The advantage of doing this was that the Docker IMAGES were PACKAGED APPLICATION CODE that’s portable, reproducible, & immutable!

Like any new application solution, once one problem is tackled, another one eventually is identified. This is how advancements in technology are born: a customer has a new request, & companies discover ways to fulfill that request. The next request was that customers needed an easier way to manage large clusters of instances & containers. AWS solved that problem by creating Amazon ECS, which provides cluster management as a hosted service. It's a highly scalable, high-performance container orchestration service that supports DOCKER containers & allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines.
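
To give a feel for how little is involved in standing up that hosted cluster-management layer, here's a minimal boto3 sketch; the cluster name is a hypothetical placeholder.

```python
import boto3

ecs = boto3.client("ecs")

# Create a cluster: ECS hosts the cluster-management layer for you, so this
# single call is the entire "install your own orchestration software" step.
ecs.create_cluster(clusterName="demo-cluster")

# The new cluster shows up immediately alongside any others in the account.
print(ecs.list_clusters()["clusterArns"])
```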

Cluster Management is Only Half of the Equation

Cluster Management is Only Half of the Equation

However, Cluster Management is only half the equation. Using Amazon EC2 as the container launch type, you end up managing more than just containers. ECS is responsible for managing the lifecycle & placement of tasks. Tasks are one or more containers that work together. You can start or stop a task with ECS, & it stores your intent. But it doesn't run or execute your containers; it only manages tasks. An EC2 Container Instance is simply an EC2 instance that runs the ECS Container Agent. Usually, you run a cluster of EC2 container instances in an autoscaling group. But you still have to patch & upgrade the OS & agents, monitor & secure the instances, & scale for optimal utilization.

Containers Launched via EC2 Instances

Containers Launched via EC2 Instances

If you have a fleet of EC2 instances, managing fleets is hard work. This includes having to patch & upgrade the OS, the container agents & more. You also have to scale the instance fleet for optimization, & that can be a lot of work depending on the size of your fleet.

Running Many Containers is Hard Work!

Running Many Containers is Hard Work!

When you use Amazon EC2 Instances to launch your containers, running 1 container is easy. But running many containers isn’t! This led to the launch of a new AWS Service to handle this.

Introducing AWS Fargate!

Introducing AWS Fargate!

Introducing AWS FARGATE! AWS Fargate is a compute engine to run containers without having to manage servers or clusters. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters. Fargate lets you focus on designing and building your applications instead of managing the infrastructure that runs them.

AWS' Container Services

AWS’ Container Services

Container management tools can be broken down into three categories: compute, orchestration, and registry.

Orchestration Services manage when & where your containers run. AWS helps manage your containers & their deployments, so you don’t have to worry about the underlying infrastructure.

AWS Container Services that fall under the functionality of "Orchestration" include:

  • Amazon Elastic Container Service (or ECS): ECS is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines. With simple API calls, you can launch and stop Docker-enabled applications, query the complete state of your application, and access many familiar features such as IAM roles, security groups, load balancers, Amazon CloudWatch Events, AWS CloudFormation templates, and AWS CloudTrail logs. Use cases for Amazon ECS include MICROSERVICES, BATCH PROCESSING, APPLICATION MIGRATION TO THE AWS CLOUD, & ML.
  • Amazon Elastic Kubernetes Service (Amazon EKS): makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS. Amazon EKS runs the Kubernetes management infrastructure for you across multiple AWS availability zones to eliminate a single point of failure. Kubernetes is open source software that allows you to deploy and manage containerized applications at scale. Kubernetes manages clusters of Amazon EC2 compute instances and runs containers on those instances with processes for deployment, maintenance, and scaling.  Kubernetes works by managing a cluster of compute instances and scheduling containers to run on the cluster based on the available compute resources and the resource requirements of each container. Containers are run in logical groupings called pods and you can run and scale one or many containers together as a pod. Use cases for EKS include MICROSERVICES, HYBRID CONTAINER DEPLOYMENTS, BATCH PROCESSING, & APPLICATION MIGRATION.

Compute engines power your containers. AWS Container Services that fall under the functionality of "Compute" include:

  • Amazon Elastic Compute Cloud (EC2): EC2 runs containers on virtual machine infrastructure, giving you full control over configuration & scaling
  • AWS Fargate: Fargate is a serverless compute engine for Amazon ECS that allows you to run containers in production at any scale. Fargate allows you to run containers without having to manage servers or clusters. With AWS Fargate, you no longer have to provision, configure, and scale clusters of virtual machines to run containers. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters.

The AWS Container Service that falls under the functionality of “Registry” is:

  • Amazon Elastic Container Registry, or ECR. ECR is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. Amazon ECR is integrated with Amazon Elastic Container Service (ECS), simplifying your development to production workflow. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure. Amazon ECR hosts your images in a highly available, secure and scalable architecture, allowing you to reliably deploy containers for your applications. Integration with AWS Identity and Access Management (IAM) provides resource-level control of each repository
The Steps to Run a Managed Container on AWS

The Steps to Run a Managed Container on AWS

The steps to run a managed container on AWS are the following:

  1. You first choose your orchestration tool, either ECS or EKS
  2. Then choose your launch type EC2 or Fargate

The EC2 launch type allows you to have server-level, more granular control over the infrastructure that runs your container applications. With the EC2 launch type, you can use Amazon ECS to manage a cluster of servers and schedule placement of containers on the servers. Amazon ECS keeps track of all the CPU, memory and other resources in your cluster, and also finds the best server for a container to run on based on your specified resource requirements. You are responsible for provisioning, patching, and scaling clusters of servers; deciding what type of server to use; deciding which applications and how many containers to run in a cluster to optimize utilization; and deciding when to add or remove servers from a cluster. The EC2 launch type provides a broader range of customization options, which might be required to support some specific applications or compliance and governance requirements.

AWS Fargate is a compute engine that can be used as a launch type that allows you to run containers without having to manage servers or clusters. With AWS Fargate, you no longer have to provision, configure, and scale clusters of virtual machines to run containers. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters. Fargate lets you focus on designing and building your applications instead of managing the infrastructure that runs them. AWS Fargate uses an on-demand pricing model that charges per vCPU and per GB of memory reserved per second, with a 1-minute minimum.
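
Here's a hedged sketch of launching a task with the Fargate launch type via boto3, to contrast with everything you would manage yourself under the EC2 launch type. The cluster, task definition, & subnet ID are hypothetical placeholders.

```python
import boto3

ecs = boto3.client("ecs")

# Run a container with no instances to provision: Fargate supplies the compute.
response = ecs.run_task(
    cluster="demo-cluster",                    # hypothetical cluster
    launchType="FARGATE",
    taskDefinition="demo-task:1",              # hypothetical task definition
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
            "assignPublicIp": "ENABLED",
        }
    },
)

print(response["tasks"][0]["taskArn"])
```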

AWS Fargate: Containers on Demand

AWS Fargate: Containers on Demand

To sum up AWS Fargate’s primary benefits, they are the following:

  • There’s absolutely NO INFRASTRUCTURE TO MANAGE!
  • Everything is managed at the container level
  • Fargate launches containers quickly, & they scale easily
  • And, there’s resource-based pricing. You only pay when the service is running
About Docker Containers

About Docker Containers

Docker is an operating system for containers. DOCKER USERS ON AVERAGE SHIP SOFTWARE 7X MORE FREQUENTLY THAN NON-DOCKER USERS! It’s an engine that enables any payload to be encapsulated as a lightweight, portable, self-sufficient container. Docker accelerates application delivery by standardizing environments and removing conflicts between language stacks and versions. Docker can be manipulated using standard operations & run consistently on virtually any hardware platform, making it easy to deploy, identify issues, & roll back for remediation. With Docker, you get a single object that can reliably run anywhere. Docker is widely adopted, so there’s a robust ecosystem of tools and off-the-shelf applications that are ready to use with Docker. AWS supports both Docker open-source and commercial solutions.

Running Docker on AWS provides developers and admins a highly reliable, low-cost way to build, ship, and run distributed applications at any scale. You can run Docker containers on Amazon ECS, Amazon EKS, AWS Fargate, & AWS Batch, store their images in Amazon ECR, & package ML models & algorithms for Amazon SageMaker in them. You can use Docker containers as a core building block for creating modern applications and platforms. Docker makes it easy to build and run distributed microservices architectures, deploy your code with standardized continuous integration and delivery pipelines, build highly-scalable data processing systems, and create fully-managed platforms for your developers. A Docker image is a read-only template that defines your container. The image contains the code that will run, including any definitions for the libraries & dependencies your code needs. A Docker container is an instantiated (running) Docker image. AWS provides ECR, an image registry for storing & quickly retrieving Docker images.

Here Comes Docker to Save the Day!

Here Comes Docker to Save the Day!

Docker solves one of the main problems that system administrators and developers faced for years. They would ask the question, "It was working on dev and QA. Why isn't it working in the production environment?" Well, most of the time the problem is a version mismatch of some library, a few packages not being installed, etc. This is where Docker steps in (and I suggest you sing the next few words to the tune of the "Mighty Mouse" theme)…

The Evolution of Data Analysis Platform Technologies

The Evolution of Data Analysis Platform Technologies

…Here comes Docker to Save the Day!

Example: 4 Environments Using the Same Container

Example: 4 Environments Using the Same Container

In the above example, there are 4 separate environments using the same Docker container. Docker encourages you to split your applications into their individual components, & ECS is optimized for this pattern. Tasks allow you to define a set of containers that you'd like to be placed together (or, part of the same placement decision), their properties, & how they're linked. TASKS are the unit of work in ECS that groups related containers, & they run on the container instances. Tasks include all the information that Amazon ECS needs to make the placement decision. To launch a single container, your Task Definition should only include one container definition.

Docker Solves the Quagmire That Plagued IT Professionals for Years

Docker Solves the Quagmire That Plagued IT Professionals for Years

Docker solves this problem by making an image of an entire application with all its dependencies, allowing you to ship it to whatever target environment or server you require. So in short, if the app worked on your local system, it should work anywhere in the world (because you are shipping the entire thing).

A Schematic of an Amazon ECS Cluster

A Schematic of an Amazon ECS Cluster

Shown on the above slide is a schematic of an Amazon ECS cluster. Before you can run Docker containers on Amazon ECS, you must create a task definition. You can define multiple containers and data volumes in a task definition. Tasks reference the container image from the Elastic Container Registry. The ECS Agent pulls the image and starts the container, which runs in the form of a Task (i.e., a running instance of the task definition).

Amazon ECS allows you to run and maintain a specified number of instances of a task definition simultaneously in an Amazon ECS cluster. This is called a service, which is a long-running collection of Tasks. If any of your tasks fail or stop for any reason, the Amazon ECS service scheduler launches another instance of your task definition to replace it and maintain the desired count of tasks in the service, depending on the scheduling strategy used. You can optionally run your service behind a load balancer. The load balancer distributes traffic across the tasks that are associated with the service.
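
A rough boto3 sketch of that flow: register a task definition, then wrap it in a service so ECS keeps the desired number of copies running. The names, image URI, sizes, & subnet are hypothetical placeholders.

```python
import boto3

ecs = boto3.client("ecs")

# A task definition describes the container(s) ECS should place together.
ecs.register_task_definition(
    family="web-task",                         # hypothetical family name
    networkMode="awsvpc",
    requiresCompatibilities=["FARGATE"],
    cpu="256",
    memory="512",
    containerDefinitions=[{
        "name": "web",
        # Hypothetical image pulled from the Elastic Container Registry.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
        "portMappings": [{"containerPort": 80}],
        "essential": True,
    }],
)

# A service is a long-running collection of Tasks; ECS replaces any that stop.
ecs.create_service(
    cluster="demo-cluster",
    serviceName="web-service",
    taskDefinition="web-task",
    desiredCount=2,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
            "assignPublicIp": "ENABLED",
        }
    },
)
```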

At the bottom of the orange rectangle on the right, there's a red rectangle surrounding the words "Key/Value Store". This refers to etcd, an open-source distributed key-value store whose job is to safely store critical data for DISTRIBUTED SYSTEMS. Its primary purpose is to store configuration data, state, and metadata. Containers usually run on a cluster of several machines, so etcd makes it easy to store data across a cluster and watch for changes, allowing any node in a cluster to read and write data. Etcd's watch functionality is used by container orchestrators to monitor changes to either the actual or the desired state of the system; if they differ, the orchestrator makes changes to reconcile the two states.

A Visual Diagram of How AWS Fargate Works vs. Amazon EC2 as Launch Types for Containers

A Visual Diagram of How AWS Fargate Works vs. Amazon EC2 as Launch Types for Containers

How you architect your application on Amazon ECS depends on several factors, with the launch type you are using being a key differentiator. The image on the left represents Amazon ECS with the EC2 launch type. Let's look at that architecture vs. ECS with the Fargate launch type, shown in the image on the right. The EC2 launch type consists of a varying number of EC2 instances. Both launch types show Scheduling & Orchestration, plus a Cluster Manager & a Placement Engine, all provided by ECS.

Orchestration provides the following: Configuration, Scheduling, Deployment, Scaling, Storage or Volume Mapping, Secret Management, High Availability, & Load Balancing Integration. The service scheduler is ideally suited for long-running, stateless services and applications. The service scheduler ensures that the scheduling strategy you specify is followed and reschedules tasks when a task fails (for example, if the underlying infrastructure fails for some reason). Cluster management systems schedule work and manage the state of each cluster resource. A common example of developers interacting with a cluster management system is running a MapReduce job via Apache Hadoop or Apache Spark; both of these systems typically manage a coordinated cluster of machines working together to perform a large task.

When a task that uses the EC2 launch type is launched, Amazon ECS must determine where to place the task based on the requirements specified in the task definition, such as CPU and memory. Similarly, when you scale down the task count, Amazon ECS must determine which tasks to terminate. You can apply task placement strategies and constraints to customize how Amazon ECS places and terminates tasks. Task placement strategies and constraints are not supported for tasks using the Fargate launch type; by default, Fargate tasks are spread across Availability Zones.

From this point on, what you see in these two images differs a lot.

The Amazon ECS container agent, shown near the bottom of the EC2 launch type image, allows container instances to connect to your cluster. The Amazon ECS container agent is only supported on Amazon EC2 instances.

Amazon ECS uses Docker images in task definitions to launch containers on Amazon EC2 instances in your clusters. A Docker Agent is the containerized version of the host Agent & is shown next to the ECS Container Agent. Amazon ECS-optimized AMIs, or Amazon Machine Images, come pre-configured with all the recommended instance specification requirements.

Now let's look at the image on the right, showing a diagram of the AWS Fargate launch type. Amazon ECS (& EKS) supports Fargate technology, so customers can choose AWS Fargate to launch their containers without having to provision or manage EC2 instances. AWS Fargate is the easiest way to launch and run containers on AWS. Customers who require greater control of their EC2 instances to support compliance and governance requirements or broader customization options can choose to use ECS without Fargate & launch EC2 instances instead.

Fargate is like EC2 but instead of giving you a virtual machine, you get a container. Fargate is a compute engine that allows you to use containers as a fundamental compute primitive without having to manage the underlying instances. AWS Fargate removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters, whereas with the EC2 launch type all of these tasks & more must be handled by you manually & continually. With Fargate launch type, all you have to do is package your application in containers, specify the CPU and memory requirements, define networking and IAM policies, and launch the application.

Comparing ECS to EKS

Comparing ECS to EKS

By using Amazon ECS, you can reduce your compute footprint by as much as 70%. With that in mind, let's quickly review the difference between ECS & EKS for clarity.

Both container services have CONTAINER-LEVEL NETWORKING. They also both have DEEP INTEGRATION WITH THE AWS PLATFORM, but the feature similarities stop there.

Amazon ECS highlights:

  • The Amazon ECS CLI makes it easy to set up your local environment & supports Docker Compose, an open-source tool for defining & running multi-container apps
  • You define tasks through a declarative JSON template called a Task Definition. Within a Task Definition you can specify one or more containers that are required for your task, including the Docker repository and image, memory and CPU requirements, shared data volumes, and how the containers are linked to each other. Task Definition files also give you version control over your application specification
  • Multiple scheduling strategies place containers across your clusters based on your resource needs (for example, CPU or RAM) and availability requirements. Using the available scheduling strategies, you can schedule batch jobs, long-running applications and services, and daemon processes
  • Integration with Elastic Load Balancing lets you distribute traffic across your containers using Application Load Balancers or Network Load Balancers. You specify the task definition and the load balancer to use, and Amazon ECS automatically adds and removes containers from the load balancer
  • Amazon ECS is built on technology developed from many years of experience running highly scalable services. You can launch tens or tens of thousands of Docker containers in seconds with no additional complexity
  • Monitoring for your containers and clusters comes through Amazon CloudWatch. You can monitor average and aggregate CPU and memory utilization of running tasks as grouped by task definition, service, or cluster, and you can set CloudWatch alarms to alert you when your containers or clusters need to scale up or down
  • Integrated service discovery makes it easy for your containerized services to discover and connect with each other. Previously, you had to configure and run your own service discovery system or connect every service to a load balancer; now you can enable service discovery with a simple selection in the ECS console, AWS CLI, or ECS API. Amazon ECS creates and manages a registry of service names using the Route 53 Auto Naming API. Names are automatically mapped to a set of DNS records so you can refer to services by an alias and have this alias automatically resolve to the service's endpoint at runtime. You can specify health check conditions in a service's task definition, and Amazon ECS will ensure that only healthy service endpoints are returned by a service lookup

Amazon EKS highlights:

  • Amazon EKS provisions and scales a highly-available Kubernetes control plane, including the API servers and the etcd persistence layer, running across three AWS Availability Zones for high availability and fault tolerance. It automatically detects and replaces unhealthy control plane nodes (masters) and provides patching for the control plane
  • Amazon EKS performs managed, in-place cluster upgrades for both the Kubernetes version and the Amazon EKS platform version
  • Amazon EKS is fully compatible with Kubernetes community tools and supports popular Kubernetes add-ons, including CoreDNS to create a DNS service for your cluster, the Kubernetes Dashboard web-based UI, and the kubectl command line tool to access and manage your cluster
  • Amazon EKS runs upstream Kubernetes and is certified Kubernetes conformant, so applications managed by Amazon EKS are fully compatible with applications managed by any standard Kubernetes environment

Amazon Elastic Container Registry

Amazon Elastic Container Registry

Amazon ECR is a highly available and secure private container repository that makes it easy to store and manage your Docker container images, encrypting and compressing images at rest so they are fast to pull and secure. It's a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure, and it hosts your images in a highly available and scalable architecture, allowing you to reliably deploy containers for your application. There's deep integration with AWS Identity and Access Management (IAM) to provide resource-level control of each repository, & ECR integrates natively with other AWS Services.
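
To make that concrete, here's a minimal boto3 sketch of creating a repository & fetching the temporary registry credentials that a docker login/push would use; the repository name is hypothetical.

```python
import base64

import boto3

ecr = boto3.client("ecr")

# Create a private repository to hold versions of one container image.
repo = ecr.create_repository(repositoryName="demo/web")
print("Push images to:", repo["repository"]["repositoryUri"])

# Fetch a temporary authorization token; these are the credentials you'd hand
# to `docker login` before pushing the image to the repository URI above.
auth = ecr.get_authorization_token()["authorizationData"][0]
username, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
print("Registry endpoint:", auth["proxyEndpoint"])
```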

Introduction to the Next Video 3.4

Introduction to the Next Video 3.4

In the next video, "Cloud & Data Metamorphosis, Video 3, Part 4", I'll cover the following:

  • The Evolution of Data Analysis Platform Technologies
  • The Benefits of Serverless Analytics
  • And, an Introduction to AWS Glue & Amazon Athena

SECTION/VIDEO 3.4: "Cloud & Data Metamorphosis, Video 3, Part 4"

This 4th video, "Cloud & Data Metamorphosis, Video 3, Part 4", is the last video in the 4-part Video 3 series. Thus, by the end of this 3.4 video, you'll have a complete foundation for the upcoming videos, which are on specific AWS Services. Most of the upcoming videos will be on – but not limited to – AWS Glue, Amazon Athena, & the Amazon S3 Data Lake Storage Platform. But I'm saving the last video in the parent series for a "wow" factor that hopefully will bring all topics full-circle, with everything summed up in a frankly unexpected manner. As you look back after watching that last video, you'll see that the end of the video is actually the beginning 🙂 That last video is entitled, "How to AI & ML Your Apps & Business Processes".

There’s a ~7 minute video set to 3 songs that represent the journey I hope to take you on through my course & YouTube videos (whose graphics are squares, music stops abruptly & ok, I lingered a bit long on John Lennon’s “Imagine” 🙂 , & my grammar is atrocious!) that you can find on my YouTube channel entitled, “Learn to Use AWS AI, ML, & Analytics Fast!“. I’ll tempt you with that (I mean bad quality :-0 ) video now…it can be viewed here. Keep in mind that I didn’t “polish” that quickly-created video, but nevertheless, it’s relevant (& FUN!)

Below you’ll find the embedded YouTube Video 3.4:

SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4” in Text & Screenshots

Title Slide for Cloud & Data Metamorphosis, Video 3, Part 4

Title Slide for Cloud & Data Metamorphosis, Video 3, Part 4

This is the 4th & last video of the video series entitled "Cloud & Data Metamorphosis", which augments my Pluralsight course, "Serverless Analytics on AWS", highlighted with a blue rectangle & pointed to by a blue arrow on the slide. Under the blue arrow is the link to that Pluralsight course.

Overviewing the Topics in Part 3.3 & an Introduction to the Topics We'll Cover in Video 3.4

Overview of the Topics in Part 3.3 & an Introduction to the Topics We’ll Cover in Video 3.4

In Part 3 of “Cloud & Data Metamorphosis“, I covered the following:

  • Serverless Architectures
  • AWS Lambda
  • AWS’ Serverless Application Model, or SAM
  • All About AWS Containers
  • AWS Fargate
  • Amazon Elastic Container Registry (ECR)

In Part 4, I’ll cover:

  • The Evolution of Data Analysis Platform Technologies
  • Serverless Analytics
  • How to Give Redbull to Your Data Transformations
  • AWS Glue & Amazon Athena
  • Clusterless Computing
  • An Introduction to Amazon S3 Data Lake Architectures
The Evolution of Data Analysis Platform Technologies

The Evolution of Data Analysis Platform Technologies

Continuing how technologies have evolved, in this section I’ll cover the Evolution of Data Analysis Platform Technologies.

The Evolution of Data Analysis Platform Technologies

The Evolution of Data Analysis Platform Technologies

The timeline on this slide shows the evolution of data analysis platform technologies.

Around the year 1985, Data Warehouse appliances were the platform of choice. These consisted of multi-core CPUs and networking components with improved storage devices such as Network Attached Storage (NAS) appliances.

Around the year 2006, Hadoop clusters were the platform of choice. This consisted of a Hadoop master node and a network of many computers that provided a software framework for distributed storage and processing of big data using the MapReduce programming model. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. You can see logos for Amazon Elastic MapReduce (EMR) that represent Hadoop frameworks such as Spark, HBase, Presto, Hue, Hive, Pig, & Zeppelin.

The Last 3 Data Analysis Platform Technologies are All AWS Services!

The Last 3 Data Analysis Platform Technologies are All AWS Services!

Superman is actually an animated gif in the video. Everything Superman is flying towards was created by AWS.

The Last 3 Data Analysis Platform Technologies

The Last 3 Data Analysis Platform Technologies

Around the year 2009, decoupled EMR clusters were the platform of choice. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), shown here using Amazon S3 via EMRFS (the EMR File System), and a processing part, which is the MapReduce programming model. This is the first occurrence of compute/memory being decoupled from storage. Hadoop again splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Around the year 2012, the Amazon Redshift cloud data warehouse was introduced, which was a very transformative data analysis platform technology for oh so many awesome reasons! The diagram on the timeline underneath the year 2012 is difficult to read, so I'll explain it to you. Directly under the written year 2012 is a set of 3 "sample" client apps that talk to the Leader node & receive information back from it via ODBC or JDBC connections. Under the Leader node are multiple Compute nodes with multiple node slices that have 2-way communication with the Leader node as well as with a variety of data sources, shown by the 4 icons on the bottom of that image.

Today, the data analysis platform technology of choice is serverless & Clusterless computing via an Amazon S3 data lake, using AWS Glue for ETL & Amazon Athena for SQL, both having 2-way communication with the underlying S3 data store.

The Evolution of Analytics

The Evolution of Analytics

Since data is changing, so must the analytics used to glean insights from any type of data. Today data is captured & stored at PB, EB & even ZB scale, and analytics engines must be able to keep up.

Some Samples of the New Types of Analytics include:

  • Machine Learning
  • Big Data Processing
  • Real-time Analytics
  • Full Text Search
Introductory Slide to Introduce AWS Glue & Amazon Athena

Introductory Slide to Introduce AWS Glue & Amazon Athena

At this point, I’m excited to share with you the two innovative, cutting-edge serverless analytics services provided by AWS: Amazon Athena & AWS Glue! These services are not only cutting edge because they’re state-of-the-art technologies, but also because they’re serverless. Having a cloud-native Serverless architecture enables you to build modern applications with increased agility & lower cost of ownership. It enables you to shift most of your operational & infrastructure management responsibilities to AWS, so you can focus on developing great products that are highly-reliable and scalable. Joining the AWS Services of Glue & Athena is the Amazon S3 Data Lake Platform. S3 Data Lakes will be covered in the next video.

Data Preparation Takes 60% of Data Transformation's Time to Complete

Data Preparation Takes 60% of Data Transformation’s Time to Complete

Data preparation is by far the most difficult & time-consuming task when mapping disparate data types for data analytics. 60% of time is spent on cleaning & organizing data. 19% of time is spent collecting datasets. The third most time-consuming task is mining data for patterns. The fourth is refining algorithms. The fifth falls under the broad category of "Other". The sixth is building training sets for machine learning.

The moral of this story is there HAS TO BE A SOLUTION TO DECREASE THE TIME SPENT ON ALL THESE TASKS! Well, there is, & I can’t wait to share it with you! I’ll begin in the next few slides then elaborate more in the next video.

Cutting-edge Data Architecture with Serverless AWS Glue

Cutting-edge Data Architecture with Serverless AWS Glue

AWS Glue solves the business problems of heterogeneous data transformation and globally siloed data.

Let's look at what AWS Glue "Is" (a short code sketch follows the list):

  • The AWS Glue Data Catalog provides 1 Location for All Globally Siloed Data – NO MATTER WHERE IN THE WORLD THE UNDERLYING DATA STORE IS!
  • AWS Glue Crawlers crawl global data sources, populate the Glue Data Catalog with enough metadata & statistics to recreate the data set when needed for analytics, & keep the Data Catalog in sync with all changes to data located across the globe
  • AWS Glue automatically identifies data formats & data types
  • AWS Glue has built-in Error Handling
  • AWS Glue Jobs perform the data transformation, which can be automated via a Glue Job Scheduler, either event-based or time-based
  • AWS Glue ETL is one of the common data transformations AWS Glue does, but there are many other data transformations built-in
  • AWS Glue has monitoring and alerting built in!
  • And, AWS Glue ELIMINATES DARK DATA
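
To ground those bullets, here's a hedged boto3 sketch of creating & running a Glue Crawler over a hypothetical S3 path; the crawler name, IAM role, database, & path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# A crawler scans the data store, infers formats & data types, & writes table
# definitions (metadata only) into the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",                                   # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="sales_db",                                # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/sales/"}]},  # hypothetical path
)

# Run it on demand; in practice you'd usually attach a schedule or a trigger.
glue.start_crawler(Name="sales-crawler")
```
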
Cutting-edge Data Analytics with Serverless Amazon Athena

Cutting-edge Data Analytics with Serverless Amazon Athena

Amazon Athena solves the business problems of heterogeneous data analytics & gives the ability to instantaneously query data without ETL.

Let's look at what Amazon Athena "Is" (a short code sketch follows the list):

  • Amazon Athena is an interactive query service
  • You query data directly from S3 using ANSI SQL
  • You can analyze unstructured, semi-structured, & structured data
  • Athena scales automatically
  • Query execution is extremely fast via executing queries in parallel
  • You can query encrypted data in S3 & write encrypted data back to another S3 bucket
  • And, you only pay for the queries you run
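
And here's a minimal boto3 sketch of running an Athena query directly against data in S3; the database, table, & results bucket are hypothetical placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

# Submit a standard SQL query; results land in the S3 location you choose.
execution = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "sales_db"},                      # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```
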
Amazon S3 Data Lake Architecture Introduction

Amazon S3 Data Lake Architecture Introduction

I’ll now touch on the Data Architecture evolution regarding Amazon S3 Data Lakes.

Serverless Architectures remove most of the need for traditional "always on" server components. The term "CLUSTERLESS" means that these architectures don't require you to stand up or manage a cluster of 2 or more computers working together, so these services are both Serverless & Clusterless.

AWS Glue, Amazon Athena, & Amazon S3 are the 3 core services that make AWS Data Lake Architectures possible!!! These 3 AWS Services are pretty AMAZING!

Under Amazon Athena's covers are both Presto & Apache Hive. Presto is an in-memory distributed SQL query engine used for DML (Data Manipulation Language) statements like SELECT. It can query data where it is stored, without needing to move the data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. It can perform interactive data analysis against GBs to PBs of data. And it's ANSI-SQL compatible, with extensions.

Hive is used to execute DDL (Data Definition Language) statements, the subset of SQL statements that change the structure of the database schema in some way, typically by creating, deleting, or modifying schema objects such as databases, tables, and views in Amazon Athena (e.g., CREATE & ALTER). It can work with complex data types & a plethora of data formats. It's used by Amazon Athena to partition data. Hive also supports MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS) to recover partitions and the data associated with those partitions.

AWS Glue builds on Apache Spark to offer ETL-specific functionality. Apache Spark is a high-performance, in-memory data processing framework that can perform large-scale data processing. You can configure your AWS Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore because the Data Catalog is Hive-metastore compatible. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. By using SparkSQL you can use existing Hive metastores, SerDes, & UDFs.
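
Since AWS Glue jobs are Apache Spark under the hood, a Glue ETL script is essentially PySpark plus Glue's DynamicFrame helpers. Here's a hedged sketch of a job that reads a table registered in the Data Catalog & writes it back to S3 as Parquet; the database, table, column names, & output path are hypothetical.

```python
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

# Glue wraps a normal Spark context with Data Catalog-aware helpers.
glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table the crawler registered in the Data Catalog (hypothetical names).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales"
)

# One of Glue's built-in transformations: rename & retype columns.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the curated data back to the lake in an analytics-friendly format.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/processed/sales/"},
    format="parquet",
)
```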

Either way, serverless architectures provide all the benefits of cloud computing with considerably less time spent creating, maintaining, & monitoring infrastructure, & at an amazing cost savings!

I’m going to end this video with some awesome quotes…

Technology Today in the 4th Industrial Revolution Quote

Technology Today in the 4th Industrial Revolution Quote

Hopefully you remember from Video 2 in this series that we're in the midst of the 4th Industrial Revolution. On the above slide is a quote from Miguel Milano, President, EMEA, Salesforce. It reads, "The technology driving the 4th Industrial Revolution gives companies an ever-greater influence on how we live, work, & function as a society. As a result, customers expect business leaders to go beyond shareholders & have a longer-term, broader strategic approach". In other words, what worked yesterday will not work today, at least not for long. Businesses & we ourselves HAVE to keep up with the rapid pace of technological change. "It's the end of the world as we know it!"

In Today's Era of Volatility, There's No Other Way But to re:Invent Quote

In Today’s Era of Volatility, There’s No Other Way But to re:Invent Quote

The quote above is from Jeff Bezos, Founder & CEO of Amazon.com. It reads, “In today’s era of volatility, there’s no other way but to re:invent. The only sustainable advantage you can have over others is agility, that’s it. Because, nothing else is sustainable, everything else you create, someone else will replicate.”

It’s interesting to watch how blazingly fast everything we do in life is recorded through the very technologies that also help us. People have always stood on the shoulders of giants to bootstrap their careers, but this time it’s different. If you don’t learn AI today, it’s a sobering fact that you’ll fall behind the pack. So, keep up with my course & these videos, stand on my “dwarf” (not giant! I’m REALLY SHORT!) shoulders, & be tenacious to thrive in the 4th Industrial Revolution. I know you can do it!

You're Behind Your Customers Quote

You’re Behind Your Customers Quote

Keeping the last quote in mind, read the next quote above. It’s a quote from Brendan Witcher, Principal Analyst at Forrester. It reads, “You’re not behind your competitors; you’re behind your customers – beyond their expectations”. There are 2 concepts I’d like you to ponder here. First, although you need to keep up with & ideally surpass your competition, knowing what your customers want, in real-time, is always the way to approach a business or a job, & AI can tell you that. Secondly, as your competitors offer state-of-the-art AI solutions, your customers will come to expect the same from anyone they choose to do business with.

Quarter Inch Hole Quote

Listen to Your Customers to Figure Out What They REALLY Want

The last quote I’ll leave you with is a bit of an extension of the last quote. This one is from Theodore Levitt, Former Harvard Business School Marketing Professor. It reads, “People don’t buy a quarter-inch drill. They want a quarter-inch hole”. Acquiring the talent to decipher customers’ requests into what they really mean but perhaps can’t articulate is a valuable characteristic indeed!

Video 3.4 End Slide

Video 3.4 End Slide

This is the end of multi-video series 3 in the parent multi-video series “AI, ML, & Advanced Analytics on AWS”. In the next video in this series, video #4, I’ll go into depth on just how cool the Amazon S3 Data Lake Storage Platform is & why. I’ll also describe how AWS Glue & Amazon Athena fit into that platform. I think you’ll be amazed at how these technologies, together with other AWS Services, provide a complete portfolio of data exploration, reporting, analytics, machine learning, AI, & visualization tools to use on all of your data.

By the way, every top-level video in this series will end with this slide with this image on it & the URL at the top. The URL leads you to my book site, where you can download a 99-pg chapter on how to create a predictive analytics workflow on AWS using Amazon SageMaker, Amazon DynamoDB, AWS Lambda, & some other really awesome AWS technologies. The reason the chapter is 99 pages long is that I leave no black boxes, so people who are advanced at analytics can get something out of it as well as complete novices. I walk readers step-by-step through creating the workflow & describe why each service is used & what it’s doing. Note, however, that it’s not fully edited, but there’s a lot of content as I walk you through building the architecture with a lot of screenshots, so you can confirm you’re following the instructions correctly.

I’ll get Video 4 up asap! Until then, #gottaluvAWS!


6 Hour Video Tutorial on AI, ML, AWS Glue, Amazon Athena, and S3 Data Lakes

How Would You Like 6 Hours of Video Tutorials that Will Comprehensively Teach You How to Use the Best, State-of-the-Art AWS Services using Best Practices to Keep Up with the Rapid Pace of Technological Changes?

Tomorrow (8-19-2019) is the scheduled publish date for a video tutorial I created for an online tech training company. That published tutorial is about 2.5 hours long, but I originally created about 6 hours of material, much of which covers very important information beyond how to use the services. Only the 2.5 hours of hands-on training will be in that course, which is awesome!

So, I’m publishing the eliminated content both as blog posts here & as videos on my YouTube account. I’ll embed the first video with the augmented content below.

 

I find that the reason some really awesome technologies aren’t as widely adopted as they should be (other than for very large enterprises), is because courses usually just say “how” to do something.

What’s in this FOR ME???

If you understand “why” & “where” these technologies fit into your business, what business problems they solve, & then learn “how” to implement them, you’re more likely to take the time to advance your skills & use the very best technologies.

Thus, that’s my approach for this series, both in blog format & video format.

First, let me set the stage for this series with some “What if you could…” questions, kind of like “In a perfect world-type” questions.

WHAT IF YOU COULD:

What If You Could…

  1. What if you could know in real-time what your customers want, the prices they’re willing to pay, & the triggers that make them buy more?
  2. What if you could have 1 unified view of your data no matter where in the world it’s stored?
  3. What if you could query your global data without the need to move it into another location in order to perform any type of analysis you want?
  4. What if you could automate any changes made to underlying data stores, keeping the 1 unified view in sync at all times?

What if you also could….

  1. What if you could have a single, centralized, secure and durable platform combining storage, data governance, analytics, AI and ML that allows ingestion of structured, semi-structured, and unstructured data, transforming these raw data sets in a “just-in-time” manner? Not ETL (extract, transform, & load), but ELT (extract, load, THEN transform) when you need the data, performing the transformation on the fly?
  2. Mapping disparate data, or, data preparation, is the most time consuming step in analytics. What if you could drastically reduce the amount of time spent on this task?
  3. What if you could “turbo-charge” your existing apps by adding AI into their workflows? And build new apps & systems using this best-practice design pattern?
  4. What if you could future-proof your data, have multiple permutations of the same underlying data source without affecting the source, and use a plethora of analytics and AI/ML services on that data simultaneously?

What’s stated on the two previous images would be transformative in how your business operates. It would lead to new opportunities, give you a tremendous competitive advantage, & give you the ability to satisfy your existing customers & attract new ones, wouldn’t it? Now, imagine having all these insights on steroids!

Gartner’s 2019 Top Strategic Technology Trends are annually published in a report entitled “The Intelligent Digital Mesh.”

The Intelligent Digital Mesh for 2019

Gartner defines a strategic technology trend as one with substantial disruptive potential that is beginning to break out of an emerging state into broader impact and use, or as a rapidly growing trend with a high degree of volatility that will reach a tipping point over the next five years. In other words, “be prepared” today; don’t wait for these technologies to be 100% or even 40% mainstream: then you’re too late.

The Intelligent category provides insights on the very best technologies, which today are under the heading AI.

The Digital category provides insights on the technologies that are moving us into an immersive world.

The Mesh category provides insights on what technologies are cutting-edge that intertwine the digital & physical worlds.

In the Intelligent category, we have the AI strategic trends of:

  • Autonomous Things exist across 5 types: ROBOTICS, VEHICLES, DRONES, APPLIANCES, & AGENTS
  • Augmented Analytics is a result of the vast amount of data that needs to be analyzed today; it’s easy to miss key insights when only a handful of hypotheses can be explored manually. Augmented analytics uses automated algorithms to explore more hypotheses through data science & machine learning platforms. This trend has transformed how businesses generate analytics insights.
  • AI-driven Development highlights the tools, technologies & best practices for embedding AI into apps & using AI to create AI-powered tools.

In the Digital category, we have the Digital strategic trends of:

  • A Digital Twin is a digital representation that mirrors a real-life object, process or system. They improve enterprise decision making by providing information on maintenance & reliability, insight into how a product could perform more effectively, & data to help design new products with increased efficiency.
  • The Empowered Edge is a topology where information processing, content collection, & delivery are placed closer to the sources of the information, with the idea that keeping traffic local will reduce latency. Currently, much of the focus of this technology is a result of the need for IoT systems to deliver disconnected or distributed capabilities into the embedded IoT world.
  • Immersive Experiences: Gartner predicts that through 2028, conversational platforms will change how users interact with the world. Technologies like augmented reality (AR), mixed reality (MR) & virtual reality (VR) change how users perceive the world. These technologies increase productivity, with the next generation of VR able to sense shapes & track a user’s position, while MR will enable people to view & interact with their world.

In the Mesh category we have the strategic trends of:

  • Blockchain is a type of distributed ledger, an expanding chronologically ordered list of cryptographically signed, irrevocable transactional records shared by all participants in a network. Blockchain allows companies to trace a transaction & work with untrusted parties without the need for a centralized party such as banks. This greatly reduces business friction & has applications that began in finance, but have expanded to government, healthcare, manufacturing, supply chains & others. Blockchain could potentially lower costs, reduce transaction settlement times & improve cash flow.
  • Smart Spaces are evolving along 5 key dimensions: openness, connectedness, coordination, intelligence & scope. Essentially, smart spaces are developing as individual technologies emerge from silos to work together to create a collaborative & interactive environment. The most extensive example of smart spaces is smart cities, where areas that combine business, residential & industrial communities are being designed using intelligent urban ecosystem frameworks, with all sectors linking to social & community collaboration.

Spanning all 3 categories are:

  • Digital Ethics & Privacy represents how consumers have a growing awareness of the value of their personal information, & they are increasingly concerned with how it’s being used by public & private entities. Enterprises that don’t pay attention are at risk of consumer backlash.
  • Quantum Computing is a type of nonclassical computing that’s based on the quantum state of subatomic particles that represent information as elements denoted as quantum bits or “qubits.” Quantum computers are an exponentially scalable & highly parallel computing model.  A way to imagine the difference between traditional & quantum computers is to imagine a giant library of books. While a classic computer would read every book in a library in a linear fashion, a quantum computer would read all the books simultaneously. Quantum computers are able to theoretically work on millions of computations at once. Real-world applications range from personalized medicine to optimization of pattern recognition.

This course covers most of these strategic IT trends for 2019. You can read more about each category by visiting the entire report at the URL on the screen http://bit.ly/Gartner_IDM_2019: it’s not only a fascinating read, but also a reality check on what you should be focusing on today.

The 4th Industrial Revolution

Today we’re experiencing what has been called the 4th Industrial Revolution. A broad definition of the term Industrial Revolution is “unprecedented technological & economic developments that have changed the world.”

The timeline on the above slide indicates when each Industrial Revolution occurred & what new inventions defined such a large-scale change.

The emergence of the 4th Industrial Revolution is attributed to primarily technological advances built atop the technologies that were paramount in Industry 3.0.

As the pace of technological change quickens, we need to be sure that our employees & we ourselves are keeping up with the right skills to thrive in the Fourth Industrial Revolution.

Quote by Eric Schmidt, Former Executive Chairman, Google

The image above contains a quote that should confirm that, with so much data being generated, having 1 unified location for all your global data (via the AWS Glue Data Catalog) is essential in order to take advantage of all of it for analytics.

The last image in this post explains the next video/blog post in this series.

The next section will cover…

See you there!

#gottaluvAWS!


We’re Living in a Metrics-Driven World (copied from read.acloud.guru blog)

*NOTE: To read this post on its original acloud.guru blog – which you should be reading daily anyway if you love the cloud & AWS – click here*

Analytics-Driven Organizations know how to turn data sources and solutions into insights that offer a competitive advantage

In 2017 & beyond, data will become more intelligent, more usable, and more relevant than ever

“The goal is to turn data into information, and information
into insight” — Carly Fiorina

We live in a data-driven world. For Fortune 500 companies, the value of data is clear and compelling. They invest millions of dollars annually in information systems that improve their performance and outcomes. Independent businesses have the same need to be data-driven; however, there’s a persistent entrepreneurial resistance to becoming truly metrics-driven.

Founders are often tempted to postpone building necessary metrics in favor of spending time and resources on building products. While that might work in the short-term, it will very quickly come back to haunt them.

Very few companies have successfully achieved exponential growth, raised capital, or negotiated strong exits without first having a solid analytics model that has been iterated upon for many months or years.

The Analytics-Driven Organization

The importance of big data in the business world can’t be overstated. We know that there’s an enormous amount of valuable data in the world, but few companies are using it to maximum effect. Analytics drive business by showing how your customers think, what they want, and how the market views your brand.

In the age of Digital Transformation, almost everything can be measured. In the coming year this will be a cornerstone of how businesses operate. Every important decision can and should be supported by the application of data and analytics.

To keep competitive today, you need to “think ahead” and answer questions in real-time, provide alerts to mitigate negative impacts on your business, and use predictive analytics to forecast what’s going to happen before it ever does so you are prepared at any given point in time.

In 2017, data will become more intelligent, more usable, and more relevant than ever. Cloud technologies, primarily Amazon Web Services (AWS), have opened the doors to affordable, smart data solutions that make it possible for non-technical users to explore (through visualization tools) the power of predictive analytics.

There’s also an increasing democratization of artificial intelligence (AI), which is driving more sophisticated consumer insights and decision-making. Forward-thinking organizations will approach predictive analytics with the future and extensibility in mind.

Analyzing extensive data sets requires significant compute capacity that can fluctuate in size based on the data inputs and type of analytics. This characteristic of scaling workloads is perfectly suited to AWS and its Marketplace pay-as-you-go cloud model — where applications can scale up and down (and in and out) based on demand.

In 2017, entrepreneurs will learn how to embrace the power of cloud analytics.

The ubiquity of cloud is nothing new for anybody who stays up-to-date with Business Intelligence trends. The cloud will continue its reign as more and more companies move towards it as a result of the proliferation of cloud-based tools available on the market. Most of the elements — data sources, data models, processing applications, computing power, analytic models and data storage — are located in the cloud.

Very few companies have achieved success without first having a solid analytics model

No matter the role, no matter the sector, data is transforming it. Some companies have restructured themselves, their internal processes, their data systems & their cultures to embrace the opportunities provided by data.

At their core, the best data-driven companies operationalize data. Data informs the actions of each employee every morning & evening. These businesses use the morning’s purchasing data to inform which merchandise sits on the shelves in the afternoon, for example.

The Analytics-Driven Organization has also developed functional data supply chains that send insight to the people who need it. This supply chain comprises all the people, software, & processes related to data as it’s generated, stored, and accessed.

These businesses also have created a data dictionary — a common language of metrics used by the company to create a universal language used throughout all departments of the company.

As companies become analytics-driven, they aren’t just enjoying incremental improvements. The benefits enabled by analytical data processing become the heart of the business — enabling new applications and business processes, using a variety of data sources and analytical solutions — giving them insights into their data they never dreamed of and a great competitive advantage.

The Types of Analytics and Their Use Cases

Descriptive Analytics

Uses business intelligence and data mining to ask “What has happened?”

Descriptive Analytics mines data to provide trending information on past or current events that can give businesses the context they need for future actions. Descriptive Analytics are characterized by the use of KPIs. It drills down into data to uncover details such as the frequency of events, the cost of operations and the root cause of failures.

Most traditional business intelligence reporting falls into this realm, but complex and sophisticated analytic techniques also fall into this realm when their purpose is to describe or characterize past events and states.

Summary statistics, clustering techniques, and association rules used in market basket analysis are all examples of Descriptive Analytics.
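
A minimal sketch of Descriptive Analytics in code (assuming a hypothetical orders.csv with order_date, region, and amount columns) might look like this:

```python
import pandas as pd

# Descriptive analytics sketch: summarize what has already happened.
# orders.csv and its columns (order_date, region, amount) are hypothetical placeholders.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

print(orders["amount"].describe())               # summary statistics (KPI-style)
print(orders.groupby("region")["amount"].sum())  # revenue drill-down by region
print(orders.set_index("order_date")["amount"].resample("M").sum())  # monthly trend of past events
```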

Diagnostic Analytics

Examines data or content to answer the question “Why did it happen?”

Diagnostic Analytics is characterized by techniques such as drill-down, data discovery, data mining and correlations. You can think of it as causal inference and the comparative effect of different variables on a particular outcome. While Descriptive Analytics might be concerned with describing how large or significant a particular outcome is, Diagnostic Analytics is more focused on determining what factors and events contributed to the outcome.

As more and more cases are included in a particular analysis and more factors or dimensions are included, it may be impossible to determine precise, limited statements regarding sequences and outcomes. Contradictory cases, data sparseness, missing factors (“unknown unknowns”), and data sampling and preparation techniques all contribute to uncertainty and the need to qualify conclusions in Diagnostic Analytics as occurring in a “probability space”.

Training algorithms for classification and regression techniques can be seen as falling into this space since they combine the analysis of past events and states with probability distributions. Other examples of Diagnostic Analytics include attribute importance, principal component analysis, sensitivity analysis and conjoint analysis.
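
To illustrate the flavor of Diagnostic Analytics, here is a small sketch (the churn.csv file, its numeric columns, and its 0/1 churned flag are all hypothetical) that checks which attributes correlate with an outcome and how much variation a couple of principal components explain:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Diagnostic analytics sketch: explore "why did it happen?"
# churn.csv, its numeric columns, and the 0/1 churned flag are hypothetical placeholders.
df = pd.read_csv("churn.csv")
numeric = df.select_dtypes("number")

# Correlation drill-down: which attributes move together with the churn flag?
print(numeric.corr()["churned"].sort_values(ascending=False))

# Principal component analysis: how much variation do a couple of components explain?
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(numeric.drop(columns=["churned"])))
print(pca.explained_variance_ratio_)
```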

Discovery Analytics

Approaches the data in an iterative process of “explore, discover, verify and operationalize.” It doesn’t begin with a predefined hypothesis but rather with a goal.

This method uncovers new insights and then builds and operationalizes new analytic models that provide value back to the business. The key to delivering the most value through Discovery Analytics is to enable as many users as possible across the organization to participate in it to harness the collective intelligence.

Discovery Analytics searches for patterns or specific items in a data set. It uses applications such as geographical maps, pivot tables and heat maps to make the process of finding patterns or specific items rapid and intuitive.

Examples of Discovery Analytics include using advanced analytical geospatial mapping to find location intelligence or frequency analysis to find concentrations of insurance claims to detect fraud.

Predictive Analytics

Asks “What could happen?”

Predictive Analytics is used to make predictions about unknown future events. It uses many techniques from data mining, machine learning and artificial intelligence. This type of analytics is all about understanding predictions based on quantitative analysis on data sets.

It’s in the realm of “predictive modeling” and statistical evaluation of those models. It helps businesses anticipate likely scenarios so they can plan ahead, rather than reacting to what already happened.

Examples of Predictive Analytics includes classification models, regression models, Monte Carlo analysis, random forest models and Bayesian analysis.
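
As a small, hedged example of Predictive Analytics (again using a hypothetical churn.csv with numeric features and a 0/1 churned label), a classification model can be trained and statistically evaluated like this:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Predictive analytics sketch: estimate "what could happen?"
# churn.csv, its numeric feature columns, and the churned label are hypothetical placeholders.
df = pd.read_csv("churn.csv")
X = df.drop(columns=["churned"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Statistically evaluate the predictive model before planning ahead with its predictions.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```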

Prescriptive Analytics

Uses optimization and simulation to ask “What should we do?”

Prescriptive Analytics explores a set of possible actions and suggests actions based on Descriptive and Predictive Analyses of complex data. It’s all about automating future actions or decisions which are defined programmatically through an analytical process. The emphasis is on defined future responses or actions and rules that specify what actions to take.

While simple threshold based “if then” statements are included in Prescriptive Analytics, highly sophisticated algorithms such as neural nets are also typically in the realm of Prescriptive Analytics because they’re focused on making a specific prediction.

Examples include recommendation engines, next best offer analysis, cueing analysis with automated assignment systems and most operations research optimization analyses.
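
A tiny illustrative sketch of the threshold-based “if then” end of Prescriptive Analytics (the probabilities, thresholds, and offers below are made up) could look like this:

```python
# Prescriptive analytics sketch: turn a predicted churn probability into a defined action.
# The thresholds, customer value cutoff, and offers are hypothetical examples.
def next_best_action(churn_probability: float, customer_value: float) -> str:
    if churn_probability > 0.8 and customer_value > 1000:
        return "route to a retention specialist with a premium offer"
    if churn_probability > 0.5:
        return "send an automated discount offer"
    return "no action"

print(next_best_action(0.85, 2500.0))  # -> route to a retention specialist with a premium offer
```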

Sentiment Analysis

The process of determining whether a piece of writing is positive, negative, or neutral.

Sentiment Analysis is also known as opinion mining — deriving the opinion or attitude of a speaker. Social media tweets, comments, & posts typically feed sentiment analysis. This is a sub-category of general Text Analytics. A common use case of sentiment analysis is to discover how people feel about a particular topic.
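
On AWS, a quick sketch of sentiment analysis uses Amazon Comprehend’s detect_sentiment API (the sample text and region below are made up):

```python
import boto3

# Sentiment analysis sketch with Amazon Comprehend; the sample text is a made-up tweet.
comprehend = boto3.client("comprehend", region_name="us-east-1")

result = comprehend.detect_sentiment(
    Text="Just tried the new checkout flow - so much faster, love it!",
    LanguageCode="en",
)
print(result["Sentiment"])       # POSITIVE / NEGATIVE / NEUTRAL / MIXED
print(result["SentimentScore"])  # confidence score for each label
```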

Geospatial Analytics

There is a growing realization that by adding geographic location to business data and mapping it, organizations can dramatically enhance their insights into tabular data.

Geospatial Analytics, or Location Analytics, provide a whole new context that is simply not possible with tables and charts. This context can almost immediately help users discover new understandings and more effectively communicate and collaborate using maps as a common language.

When you can visualize millions of points on a map, use cases include route planning, geographic customer targeting, disease spread and more.
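
As one tiny building block for geographic customer targeting, here is a sketch of the haversine (great-circle) distance between two coordinates; the coordinates and radius below are arbitrary examples:

```python
from math import asin, cos, radians, sin, sqrt

# Geospatial sketch: great-circle (haversine) distance in kilometers between two lat/lon points.
def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # Earth's mean radius is roughly 6371 km

# Example: is a customer in Tacoma within 25 km of a store in Seattle? (arbitrary coordinates)
print(haversine_km(47.6062, -122.3321, 47.2529, -122.4443) <= 25)
```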

Interesting Types of Geospatial Analysis (courtesy of AWS Marketplace video of MapLarge Software)

The Culture of Digital Transformation

Change is going to happen whether you pursue it or not — you only need to look at how the role of cloud computing in 2016 has evolved to understand. Modern enterprises succeed when they adapt to industry and marketplace shifts and incorporate new technology into company culture and regular operations.

Digital transformation isn’t only about technology, it’s about bringing together the power of technology with a culture that embraces the change that it can lead for the organization.

Proactive innovation is one of the best ways to stay competitive in an evolving marketplace. New technology needs to be assessed, tested, analyzed, and judged more quickly than ever. Businesses can no longer afford to waste time and resources implementing new tools that offer no real value. This means a “Fail fast, to succeed faster,” mentality.

Some projects will work straight away, others will have significant learning curves. The faster your organization can go from idea to implementation the more it can embrace opportunities to transform and even disrupt markets and internal business models. We’ve already talked about adaptability, but that plays a major role here as well.

If a company has an adaptive culture where new tech can be easily integrated — or is at least encouraged — that enterprise is set up for long-term success.

Bring together the power of technology with a culture that embraces the change

#gottaluvAWS! #gottaluvAWSMarketplace!


Bringing Predictive Data Analytics to the People with PredicSis.ai

Don’t Be Left Behind in the Days Where Predictive Analytics is Mandatory
for Long-Term Business Success

BY FRANK for DATALEADER · JULY 24, 2017

Over the last decade, the term “big data” grew to prominence. The rush was on to create technologies to capture and store vast quantities of data. The focus of many enterprises, both large and small, was on data capture and storage. Now, the rush is on to monetize and exploit these sizable data stores. Companies want to make the right decisions, for the right customers at the precise time to maximize value and minimize risks. In short, businesses need to predict the future by anticipating behaviors and identifying trends, at the individual customer level and at scale across their entire customer base. The company that can achieve all of this before its competition will win in the marketplace. Companies that don’t capitalize on their data resources by converting them to insights and actions will lose market share, fall behind, and, ultimately, fail.

Data: The New Oil

In the early 20th century there were hardly any automobiles. Accordingly, there were no gas stations or car mechanics. Over time, gas stations evolved to have convenience stores attached to them, auto repair shops thrived, and the insurance industry found a solid new foundation for an entirely new line of business.

Look around you and you will see an entire civilization transformed by oil with all its benefits and detriments. Now, imagine what the world will look like one hundred years from now when the data revolution has played out all its effects and unintended consequences.

Pushing the “data is the new oil” analogy further, consider that raw data exists in a natural, unprocessed state, very often deep underground. A considerable amount of labor goes into taking it from that primordial state into something that can be used to fuel a car or heat a home. The data must be extracted, shaped and processed in a process analogous to what oil refineries do. Finally, the output of the refinery gets sold as a product to consumers. In other words, just as more oil does not make for better gasoline, more data doesn’t necessarily make your business data-centric.

Turning Raw Data into Insight

Over the last few years, more and more organizations have discovered that data can be turned into any number of Artificial Intelligence (AI), Machine Learning (ML), or other “cognitive” services. Some of these new services may blossom into new revenue streams and will more than likely disrupt entire industries as the normal way of business is upended in favor of automation and accelerated decision making.

Collecting raw data for the sake of collecting raw data, argues Hal Varian, Google’s chief economist, exhibits “decreasing returns to scale.” In other words, each additional piece of data is somewhat less valuable and at some point, collecting more does not add anything. What matters more, he says, is the quality of the algorithms that process the data and the talent a firm has brought on to develop these algorithms. Success for Google is in the “recipe” not the “ingredients.”

As for the new world of data, the product could be a service that rates the likelihood that a transaction is fraudulent, with the internal auditing department as the consumer of that service. In this way, data will enable new markets and even economic ecosystems as a previously undervalued resource develops into new streams of income and creates entirely new offshoot industries.

Drawbacks of Conventional Analytics

With the future of their businesses at stake, one would think that every single enterprise would be eagerly scouring their data sets and feeding them to any number of algorithms in order to extract any deeper understanding of their customers’ activities and identify trends as they unfold. However, this is not the case. Why?

The answer comes down to cost and risk: cost, both in terms of finding people with the skills to perform this type of work and in the compute infrastructure often required to run existing algorithms; and risk, because the value is often difficult to foresee and the complexity difficult to manage across analytics initiatives.

This is not mere risk aversion or fear of the unknown: there are hurdles everywhere, indeed. Data will need to be shaped and cleaned; the in-house team may not have the right skills, and hiring consultants is expensive. Not to mention the infrastructure investments required to store the data and compute the model. The payoff is hard to evaluate and the ROI is even harder to envision.

In a few words: getting tangible results from analytics is not straightforward. At least, not for everyone.

What if there was a way to take the cost of recruiting experienced data scientists and data engineers, remove the expense associated with beefing up IT infrastructure, and make advanced data analytics more approachable to the average knowledge worker?

Well, there is.

Enter PredicSis.ai

PredicSis.ai changes the game by making advanced analytics more accessible and affordable. No longer are advanced analytics limited to large organizations with massive budgets devoted towards hiring, training, and maintaining a data science team. Now, anyone with or without data science and machine learning skills can leverage its power with a few clicks of the mouse.

Simple to Use

The real power of PredicSis.ai lies in its ability to place data analytics into an easy to use self-service SaaS model. PredicSis.ai is now available on the AWS Marketplace. Just activate an account on the marketplace and pay as you go. No software installs or commitments.

Automatic, Swift & Agile Integrated Predictive Analytics

PredicSis.ai automates much of the work normally associated with machine learning. Using autoML algorithms, PredicSis.ai surfaces and evaluates new data features and only displays the ones with meaningful impact on predictive outcomes. In other words, the software automatically filters out the fields, or features, that lack correlation to the predicted outcome. From the meaningful features it discovers, PredicSis.ai then creates a predictive model for future input. As the workflow and the display are made straightforward and intuitive, users can focus on rapidly iterating data models, exploring the data and, finally, delivering added value to the business.

Blazingly Fast Heterogeneous or Homogeneous Data Integration

Getting data into PredicSis.ai is fast and easy. Simply drag and drop ASCII or UTF-8 encoded CSV files: once the primary dataset file is uploaded, users can upload any number of additional peripheral data tables. It’s then up to PredicSis.ai, supervised by users and their business knowledge, to detect, surface and display meaningful features and insights from those multiple datasets.

Accessible Advanced Data Analytics

Making advanced analytics accessible opens up new worlds of possibilities. With the freedom of self-service analytics, all sorts of scenarios are possible. Marketing and sales departments can determine the customers most likely to leave for a competitor – before they leave. They can pre-emptively identify which accounts are high priority calls for the sales team. Outbound calls from the sales center can be optimized to increase conversion and sales performance. Marketing and sales teams can be self-sufficient with their model creation, exploration, and experimentation.

However, the advantages go beyond marketing and sales: business analysts can leverage their deep domain knowledge and apply predictive analytics to pre-emptively address challenges that the business faces and take corrective action ahead of time. They can also use the built-in sharing functionality to share insights with their colleagues and management.

Speaking of management, company leadership can turn dashboards into foresight, predicting the course of the business. Using the same tools, senior management can take proactive steps to steer the company around dangers and risks, and even find new opportunities that they may have otherwise missed.

Finally, even seasoned data scientists can leverage the flexibility and power of PredicSis.ai to explore models with ease and speed. Data science teams can explore more datasets in less time, allowing them to explore more options, create more effective models, and add more value to the business.

In short, PredicSis.ai allows non-data scientists to reap the rewards of data science and makes data scientists more efficient.

It is pointless and painful not to use PredicSis.ai!

Getting Started

The truly best way to see how PredicSis.ai works is to use it and see for yourself just how approachable it makes data analytics.

Now that PredicSis.ai is available on the AWS Marketplace, it couldn’t be easier to use. No software to download, nothing to install, no new hardware to provision: it’s just a service that runs in the AWS cloud.

The first step to using PredicSis.ai is to browse over to the AWS Marketplace and search for PredicSis.ai. It will appear as one of the options in the autocomplete dropdown.

Then choose from one of the following options. For the purposes of this blog post, choose the PredicSis.ai (single-user) option.

On the next page, choose the AWS Region and EC2 Instance Type you wish to use. Then click on the “Launch with 1-click” button.

On the following page, click on the EC2 Console link to browse to the EC2 console.

On the EC2 Console page, retrieve the URL where PredicSis.ai was deployed. It will be the domain name next to Public DNS and end with “compute.amazonaws.com”.

Copy the URL and then browse to it.

You should see the PredicSis.ai software home page with no projects.

Data Modeling 101

Now you’re about to create your first data model and discover what lies hidden inside the data: the PredicSis.ai way. You may be asking yourself, exactly what is a data model? A data model is a way to store and represent data and its relation to other data. For instance, customers have various attributes about them stored in the data model and customer actions are also logged in the data model.

By connecting a customer’s attribute, such as age, to their behavior, such as buying certain products, one can infer that customers of a similar age are more likely to make the same purchases. Many relationships are well known: parents with young children are more likely to buy diapers than those whose children are in college.
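
To make the idea tangible outside of PredicSis.ai, here is a tiny pandas sketch of relating an attribute (age) to a behavior (buying diapers); the DataFrame is made-up illustrative data, not real customer data:

```python
import pandas as pd

# Relate a customer attribute (age) to a behavior (bought_diapers) with made-up data.
customers = pd.DataFrame({
    "age":            [24, 31, 35, 42, 58, 63],
    "bought_diapers": [0, 1, 1, 0, 0, 0],
})
customers["age_band"] = pd.cut(customers["age"], bins=[18, 30, 45, 70])

# Purchase rate per age band hints at which attributes predict the behavior.
print(customers.groupby("age_band")["bought_diapers"].mean())
```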

What’s great about machine learning and data analytics is the ability to identify patterns quickly and even see patterns that exist that humans may not be able to readily identify. How does that work? Well, let’s create a project and see how easy PredicSis.ai makes this process.

Creating Your First Project

Click on the Create button located on the upper right-hand corner of the screen. The following dialog box asks for a name for the new project. You may enter anything, but, for this example, I entered “first project.”

Click Create and then the “first project” project now appears in the workspace area.

Click on the first project button to bring up the screen below:

Now drag and drop files to add data to the project. As mentioned previously, data files must be in CSV format. Once the appropriate files are uploaded, specify the one containing the outcome.

Once completed, click the Get Insights button in the upper right hand corner of the screen and PredicSis.ai will get to work analyzing the data.

After a few moments, the results are in and we can now explore the results.

First, click on the magnifying glass icon to view the model. Click around and explore the features of the data. Note how the visualizations change with each field.

To get back to the previous screen, click the drop down list that starts with “first project” and then choose Models List.

Now click on the chart icon to assess the quality of the model.

The following page explains how well the predicted outcomes the model created matched up with the test data.

This screen contains a lot of information. However, if you look at the Performance number, the model scored a “0.6941,” meaning it was correct around 69.41% of the time.

Certainly, there is room for improvement and PredicSis.ai provides ways for you to adjust and improve the model.

Go back to the models list page and this time, click on the lightning bolt icon to improve the model.

This page allows you to manually adjust the features that go into creating the predictive model. Remove the average_basket feature by clicking the checkbox. Now, add the region_code feature by clicking on the check box. Additionally, change the value in the dropdown list in the Type column to Categorical. Your feature set should look like the following.

Click the Apply Changes button in the upper right hand corner to apply the changes to the model and see if the performance has improved.

Upon quick inspection, you can see that the small changes improved the model modestly, up to 74.11% now. Go back and experiment to see which fields and options improve the model and which ones have the opposite effect.

Once you’ve made improvements to the model, it’s time to share it.

On the Models List page, click on the magnifying glass icon to view the model. Check all boxes next to the fields you wish to include in the report. Now click on the Get PDF button in the upper right hand corner of the screen. PredicSis.ai generates a report in a PDF file that you can share.

If you wanted to share the performance metrics of the model, you can do that as well. Go back to the Models List page and this time click on the chart icon to see the Assess Model screen once more. On the upper right hand corner of the screen, there is once again a Get PDF button. Click on it to generate and download a report as PDF file.

Conclusion

PredicSis.ai allows for easy creation and exploration of predictive data models in just a few clicks of the mouse. It gives business users many of the same analytical tools that have previously only been in the hands of data scientists. With wider use and deployment of data analytics, businesses can more easily spot trends, detect fraud, better serve their existing customers, and find new ones.

Data analytics is already changing the game of business and now its power is in your hands. The recommendation is simple: use PredicSis.ai before your competition does.


AWS Data Analytics Services Leveraging AWS Marketplace in Detail

Unlock Hidden Insights within Massive Data Sources

Summary

Analyzing extensive data sets requires significant compute capacity that can fluctuate in size based on the data inputs and type of analytics. This characteristic of scaling workloads is perfectly suited to AWS and the AWS Marketplace’s pay-as-you-go cloud model, where applications can scale up and down based on demand. Analyzing data quickly to derive valuable insights can be done within minutes rather than months, and you only pay for what you use.

Introduction

As an ever-increasing, ubiquitous proliferation of data is emitted from new and previously unforeseen sources, traditional in-house IT solutions are unable to keep up with the pace. Heavily investing in data centers and servers by “best guess” is a waste of time and money, and a never-ending job.

Traditional data warehouses required very highly skilled employees that addressed a fixed set of questions. The need for speed and agility today in analyzing data differently and efficiently requires complex architectures that are available and ready for use with the click of a button on AWS – eliminating the need to concern yourself with the underlying mechanisms and configurations that you’d have to do on premises.

The AWS Marketplace streamlines the procurement of software solutions provided from popular software vendors by providing AMIs that are pre-integrated with the AWS cloud, further expediting and assisting you with supporting big data analytical software services. The AWS Marketplace has over 290 big data solutions to-date.

This eBook will cover big data and big data analytics as a whole in depth: what it is, where and how it comes from, and what kinds of information you can find when analyzing all of this data. It will then discuss the facts about why AWS and the solutions provided by top software vendors in AWS Marketplace provide the best big data analytics services and offerings. Then there will be a walk-through of the AWS Services that are used in big data analytics with augmented solutions from AWS Marketplace. In conclusion, you will see how AWS is the unequivocal leader when implementing big data analytic solutions.

Conventions Used in this eBook

In order to provide cohesiveness to the longer sections of this eBook, tables are used. The header of the table lists the name of the topic, and each subtopic is listed below it. An example is shown below:

TABLE 1: EXAMPLE OF A TABLE WITH TOPIC IN THE HEADER WITH SUBTOPICS BELOW

Big Data Analytics Challenges

Data is not only getting bigger (in “Volume”) and arriving in ever-increasing different formats (the “Variety”) faster (the “Velocity”); the need to derive “Value” through analytics that provide actionable insights is a differentiating factor between successful businesses that can mitigate risk and respond to customer actions in near real-time and businesses that will fall behind in this day and age of data deluge. Using Amazon Web Services cloud architectures and software solutions available from popular software vendors on AWS Marketplace, big data analytics solutions change from extremely complicated to set up and manage to just a couple of clicks to deploy.

In addition to the metaphorical “V’s” mentioned above to describe big data, there is one more: “Veracity” – being sure your data is clean prior to performing any analytics whatsoever. Garbage in, garbage out. There’s no time to waste making improper, misinformed decisions based on dirty data. This is paramount. Using solutions in the AWS Marketplace makes this crucial and difficult step easy.

Big data has also evolved. It used to be that batch processing for reports was sufficient (and the only solution available). To keep competitive today, you need to “think ahead” and answer questions in real-time, provide alerts to mitigate negative impacts on your business, and use predictive analytics to forecast what’s going to happen before it ever does so you are prepared at any given point in time.

Overview of AWS and AWS Marketplace Big Data Analytics Advantages

TABLE 2: OVERVIEW OF AWS AND AWS MARKETPLACE BIG DATA ANALYTICS ADVANTAGES

AWS Big Data Analytics Advantages Overview

Analyzing large data sets requires significant compute capacity that can vary in size based upon the amount of input data and the type of analysis. Thus it’s apparent that these big data workloads are ideally suited to a pay-as-you-go cloud environment.

Many companies that have successfully taken advantage of AWS big data analytics processing aren’t just enjoying incremental improvements. The benefits enabled by big data processing become the heart of the business – enabling new applications and business processes, using a variety of data sources and analytical solutions – giving them insights into their data they never dreamed of and a great competitive advantage.

Ongoing developments in AWS cloud computing are rapidly moving the promise of deriving business value from big data in real-time into a reality. With billions of devices globally already streaming data, forward-thinking companies have begun to leverage AWS to reap huge benefits from this data storm.

AWS has the broadest platform for big data in the market today, with deep and rapidly expanding functionality across big data stores, data warehousing, distributed analytics, real-time streaming, machine learning, and business intelligence. Gartner2 confirms AWS has the most diverse customer base and the broadest range of use cases, including enterprise mission-critical applications. For the sixth consecutive year, Gartner2 also confirms AWS is the overwhelming market share leader, with over 10 times more cloud compute capacity in use than the aggregate total of the other 14 providers in their Magic Quadrant!

2 Gartner

AWS has a tiered competency-badged network of partners that provide application development expertise, managed services and professional services such as data migration. This ecosystem, along with AWS’s training and certification programs, makes it easy to adopt and operate AWS in a best-practice fashion.

The AWS cloud provides governance capabilities enabling continuous monitoring of configuration changes to your IT resources as well as giving you the ability to leverage multiple native AWS security and encryption features for a higher level of data protection and compliance – security at every level up to the most stringent government compliance no matter what your industry.

Listed Below are Some of the Specific AWS Big Data Analytics Advantages:

  • The vast majority of big data use cases deployed in the cloud today run on AWS, with unique customer references for big data analytics, of which 67 are enterprise, household names
  • Over 50 AWS Services and hundreds of features to support virtually any big data application and workload
  • AWS releases new services and features weekly, enabling you to keep the technologies you use aligned with the most current, state-of-the-art big data analytics capabilities and functionalities
  • AWS delivers an extensive range of tools for fast and secure data movement to and from the AWS cloud
  • Computational power that’s second to none2, with instance types optimized with varying combinations of CPU, memory, storage and networking capacity to meet the needs of any big data use case
  • AWS makes fast, scalable, gigabyte-to-petabyte scale analytics affordable to anyone via their broad range of storage, compute and analytical options, guaranteed!
  • AWS provides capabilities across all of your locations, your networks, software and business processes meeting the strictest security requirements that are continually audited for the broadest range of security certifications
  • AWS removes limits to the types of database and storage technologies you can use by providing managed database services that offer enterprise performance at open source cost. This results in applications running on many different data technologies, using the right technology for each workload
  • Virtually unlimited capacity for massive datasets
  • AWS provides data encryption at rest and in transit for all services, with the ability for you to directly analyze the encrypted data
  • AWS provides a scalable architecture that supports growth in users, traffic or data without a drop in performance, both vertically and horizontally, and allows for distributed processing
  • Faster time-to-market of products and services, enabling rapid and informed decision-making while shrinking product and service development time
  • Lower cost of ownership and reduced management overhead costs, freeing up your business for more strategic and business-focused tasks
  • In addition to the huge cost savings of simply moving from on-premises to the cloud, AWS provides suggestions on how to further reduce your costs. Providing the most cost-efficient cloud solutions is a frugality rule at AWS
  • Numerous ways to achieve and optimize a globally-available, unlimited on-demand capacity of resources so you can grow as fast as you can
  • Fault tolerance across multiple servers in Availability Zones and across geographically distant Regions
  • An extremely agile application development environment: go from concept to full production deployment in 24 hours
  • Security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture built to meet the requirements of the most security-sensitive customers
  • AWS provides many suggestions on how to remove a single point of failure

2 Gartner

AWS Marketplace Big Data Analytics Advantages Overview

AWS provides an extensive set of managed services that help you build, secure, and scale big data analytics applications quickly and easily. Whether your applications require real-time streaming, a data warehouse solution, or batch data processing, AWS provides the infrastructure and tools to perform virtually any type of big data project.

When you combine the managed AWS services with software solutions available from popular software vendors on AWS Marketplace, you can get the precise business intelligence and big data analytical solutions you want that augment and enhance your project beyond what the services themselves provide. You get to data-driven results faster by decreasing the time it takes to plan, forecast, and make software provisioning decisions. This greatly improves the way you build business analytics solutions and run your business.

Gartner2 confirms that because AWS has a multi-year competitive advantage over all its competitors, it’s been able to attract over a thousand technology partners and independent software vendors that have licensed and packaged their software to run on AWS, integrated their software with AWS capabilities, or delivered add-on services, all through the AWS Marketplace. The AWS Marketplace is the largest “app store” in the world, even though it’s strictly a B2B app store!

2 Gartner

FIGURE 1: THE AWS MARKETPLACE

Since AWS resources can be instantiated in seconds, you can treat them as “disposable” resources – not hardware or software you’ve spent months choosing, with a significant up-front expenditure and no guarantee it will solve your problems. The “Services not Servers” mantra of AWS provides many ways to increase developer productivity and operational efficiency, along with the ability to “try on” various solutions available on AWS Marketplace to find the perfect fit for your business needs without committing to long-term contracts.

Listed Below are Some of the Specific AWS Marketplace Big Data Analytics Advantages:

  • Get to data-driven results faster by decreasing the time it takes to plan, forecast, and make decisions by performing big data analytics and visualizations on AWS data services and other third-party data sources via software solutions available from popular software vendors on AWS Marketplace – the largest ecosystem of popular software vendors and integrators of any provider2 – giving your organization the agility to experiment and innovate with the click of a button
  • The AWS Marketplace maintains the largest partner ecosystem of any provider. It has over 290 big data software solutions available from popular software vendors that are pre-integrated with the AWS cloud
  • Deploy business intelligence and advanced analytics pre-configured software solutions in minutes
  • On-demand infrastructure through software solutions on AWS Marketplace allows iterative, experimental deployment and usage to take advantage of advanced analytics and emerging technologies within minutes, paying only for what you consume, by the hour or by the month
  • Many AWS Marketplace solutions offer free trials, so you can “try on” multiple big data analytical solutions to solve the same business problem to see which is the best fit for your specific scenario

2 Gartner

Example Solutions Achieved Through Augmenting AWS Services with Software Solutions Available on AWS Marketplace

Using software solutions available from popular software vendors on AWS Marketplace, you can customize and tailor your big data analytics project to precisely fit your business scenario. Below is just a fraction of the example solutions you can achieve when using AWS Marketplace’s software solutions with the AWS big data services.

You can:

  • Launch pre-configured and pre-tested experimentation platforms for big data analysis
  • Query your data where it sits (in-datasource analysis) without moving or storing your data on an intermediate server while directly accessing the most powerful functions of the underlying database
  • Perform “ELT” (extract, load, and transform) vs. “ETL” (extract, transform, and load) your data into Amazon’s Redshift data warehouse so the data is in its original form, giving you the ability to perform multiple data warehouse transforms on the same data
  • Have long-term connectivity among many different databases
  • Ensure your data is clean and complete prior to analysis
  • Visualize millions of data points on a map
  • Develop route planning and geographic customer targeting
  • Embed visualizations in applications or stand-alone applications
  • Visualize billions of rows in seconds
  • Graph data and drill into areas of concern
  • Have built-in data science
  • Export information into any format
  • Deploy machine-learning algorithms for data mining and predictive analytics
  • Meet the needs of specialized data connector requirements
  • Create real-time geospatial visualization and interactive analytics
  • Have both OLAP and OLTP analytical processing
  • Map disparate data sources (cloud, social, Google Analytics, mobile, on-prem, big data or relational data) using high-performance massively parallel processing (MPP) with easy-to-use wizards
  • Fine-tune the type of analytical result (location, prescriptive, statistical, text, predictive, behavior, machine learning models and so on)
  • Customize the visualizations in countless views with different levels of interactivity
  • Integrate with existing SAP products
  • Deploy a new data warehouse or extend your existing one

AWS Marketplace-Specific Site for Data Analytics Solutions
There’s a plethora of options on AWS Marketplace to run big data analytics software solutions available from popular vendors that are already pre-configured on an Amazon Machine Image (AMI) that solve a variety of very specific needs, some of which were mentioned above.

You can visit the AWS Marketplace Big Data Analytics-specific site by clicking the bottom left icon on the AWS Marketplace site or by clicking here to view the premier AWS Marketplace solution providers for transforming and moving your data, processing and analyzing your data, and reporting and visualizing your data.

I’d like to point out that if you click the “Learn More” link at the bottom of each type of solution (for example, below I’m showing the section “Business Intelligence and Data Visualization”), you’re taken to an awesome section that works like a “Channel Guide” for webcasts to teach you how to work with some of the solutions, presented by software vendor representatives!

The first screenshot below shows where to find the “Learn More” link, and the second screenshot below is of the “Channel Guide” for webcasts by representatives for some of AWS Marketplace software vendors:

FIGURE 2: THE “LEARN MORE” LINK UNDER EACH ANALYTIC SOLUTION TYPE
ON THE BIG DATA ANALYTICS-SPECIFIC SITE

Click the “Learn More” link highlighted above in the red rectangle, and for whichever type of Analytics solution you click on, you’re taken to the Webcast Channels:

FIGURE 3: THE “CHANNEL GUIDE”-TYPE WEBCAST INTERFACE TO HELP YOU UNDERSTAND
EACH PARTICULAR TOPIC ON THE BIG DATA ANALYTICS-SPECIFIC SITE

Overview of AWS Cloud Architecture
The AWS cloud is based on the general design principles of the “Well-Architected Framework”, which increase the likelihood of business success. It rests on the following four pillars:

  1. Security: The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies
    • AWS’s built-in security features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here
  2. Reliability: The ability of a system to recover from infrastructure or service failures, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
    • AWS’s built-in fault tolerance and infrastructure disruption features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse solutions for fault tolerance, click here, and for infrastructure/network solutions click here
  3. Performance Efficiency: The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve
    • AWS’s built-in performance features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here
  4. Cost Optimization: The ability to avoid or eliminate unneeded cost or suboptimal resources
    • AWS’s built-in cost alerting features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here

If you’d like to know more about AWS’ “Well-Architected Framework”, from which the above is referenced, click here.
Some of the Types of Big Data Analytical Insights and Example Use Cases

TABLE 3: SOME EXAMPLES OF THE TYPES OF BIG DATA ANALYTICAL INSIGHTS WITH USE CASES

Introduction
Big Data is such a buzz-word that it’s prudent to ensure we wrap our heads around what it means.

Big data means a massive volume of both structured and unstructured data that’s so large it’s difficult to process using traditional database and software techniques.

In Enterprise scenarios, the volume of data is often too big, it moves too fast, or it exceeds the processing capabilities available on-premises. But when this data is captured, formatted, manipulated, and stored, analytics can yield powerful insights, some never imagined.

Below you’ll find a description of some of the types of big data analytical insights and common use cases for each:

Descriptive: Descriptive Analytics uses business intelligence and data mining to ask “What has happened?” Descriptive Analytics mines data to provide trending information on past or current events that can give businesses the context they need for future actions. Descriptive Analytics is characterized by the use of KPIs. It drills down into data to uncover details such as the frequency of events, the cost of operations and the root cause of failures. Most traditional business intelligence reporting falls into this realm, but complex and sophisticated analytic techniques also fall into this realm when their purpose is to describe or characterize past events and states. Summary statistics, clustering techniques, and association rules used in market basket analysis are all examples of Descriptive Analytics.

Diagnostic: Diagnostic Analytics examines data or content to answer the question “Why did it happen?” It’s characterized by techniques such as drill-down, data discovery, data mining and correlations. You can think of it as causal inference and the comparative effect of different variables on a particular outcome. While Descriptive Analytics might be concerned with describing how large or significant a particular outcome is, Diagnostic Analytics is more focused on determining what factors and events contributed to that outcome. As more cases and more factors or dimensions are included in a particular analysis, it may become impossible to make precise, limited statements about sequences and outcomes. Contradictory cases, data sparseness, missing factors (“unknown unknowns”), and data sampling and preparation techniques all contribute to uncertainty and the need to qualify conclusions in Diagnostic Analytics as occurring in a “probability space”. Training algorithms for classification and regression techniques can be seen as falling into this space since they combine the analysis of past events and states with probability distributions. Other examples of Diagnostic Analytics include attribute importance, principal component analysis, sensitivity analysis and conjoint analysis.

Discovery: Discovery Analytics doesn’t begin with a pre-definition but rather with a goal. It approaches the data in an iterative process of “explore, discover, verify and operationalize.” This method uncovers new insights and then builds and operationalizes new analytic models that provide value back to the business. The key to delivering the most value through Discovery Analytics is to enable as many users as possible across the organization to participate in it to harness the collective intelligence. Discovery Analytics searches for patterns or specific items in a data set. It uses applications such as geographical maps, pivot tables and heat maps to make the process of finding patterns or specific items rapid and intuitive. Examples of Discovery Analytics include using advanced analytical geospatial mapping to find location intelligence or frequency analysis to find concentrations of insurance claims to detect fraud.

Predictive: Predictive Analytics asks “What could happen?” It’s used to make predictions about unknown future events, drawing on many techniques from data mining, machine learning and artificial intelligence. This type of analytics is all about making predictions based on quantitative analysis of data sets; it’s in the realm of “predictive modeling” and the statistical evaluation of those models. Examples of Predictive Analytics include classification models, regression models, Monte Carlo analysis, random forest models and Bayesian analysis. It helps businesses anticipate likely scenarios so they can plan ahead, rather than reacting to what already happened.

Prescriptive: Prescriptive Analytics uses optimization and simulation to ask “What should we do?” It explores a set of possible actions and suggests actions based on Descriptive and Predictive Analyses of complex data. It’s all about automating future actions or decisions that are defined programmatically through an analytical process. The emphasis is on defined future responses or actions and rules that specify what actions to take. While simple threshold-based “if then” statements are included in Prescriptive Analytics, highly sophisticated algorithms such as neural nets typically fall into this realm as well because they’re focused on making a specific prediction. Examples include recommendation engines, next best offer analysis, queuing analysis with automated assignment systems and most operations research optimization analyses.
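
To make the contrast between these categories a bit more concrete, here is a minimal, hypothetical Python sketch that pairs a descriptive summary with a simple predictive model. The file name and columns (daily_sales.csv, ad_spend, revenue) are assumptions for illustration only, not data referenced anywhere in this post.

# A minimal sketch, assuming a CSV with ad_spend and revenue columns exists.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("daily_sales.csv")   # hypothetical daily sales extract

# Descriptive: summarize what has happened
print(df[["ad_spend", "revenue"]].describe())

# Predictive: fit a simple regression to ask "what could happen?"
model = LinearRegression().fit(df[["ad_spend"]], df["revenue"])
projected = model.predict(pd.DataFrame({"ad_spend": [10_000]}))
print(projected)                      # projected revenue at $10k ad spend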

How Big is “Big Data”?
The amount of digital information a typical business has to deal with doubles every two years. It has been predicted that the data we create and copy annually (“the digital universe”) will reach 44 zettabytes – or 44 trillion gigabytes – by the year 2020³. With AWS and the analytical solutions provided by popular software vendors on AWS Marketplace, there is a wealth of yet-to-be-discovered insights waiting in all that data for countless types of research.
3 EMC Digital Universe with Research & Analysis by IDC

FIGURE 4: HOW BIG IS BIG DATA?

Examples of Big Data Producers
This section is included to give you an example of some of the types of “things” that produce massive amounts of data that can be analyzed and repurposed.

TABLE 4: EXAMPLES OF BIG DATA PRODUCERS

Machine and Sensor Data
Machine and sensor data come from many sources, and sources continue to proliferate. Some examples are energy meters, telecommunications, road/air/sea pattern analysis, satellites, meteorological sensors and other natural phenomena monitoring, scientific and technical services, manufacturing, medical devices and the Internet of Things (IoT) such as smart homes, appliances and cities. Analyses of this type of data can reveal many trends and predictive analysis can be performed to take action to prevent unwanted scenarios or be alerted when something goes awry.

Image and Video Data
It would take more than 5 million years to watch the amount of video that will cross global IP Networks each month in 2020⁴. Some examples of image and video data are video surveillance, immersive video, virtual (and augmented) reality, internet gaming, smartphone images and video, photo and video sharing sites (YouTube, Instagram, Pinterest, etc.) and streaming video content (such as Netflix). Topological, contextual, hidden statistical patterns and historical analyses are examples of some of the types of analytics that can be done on image and video data⁵.

4 For a detailed report on Visual Networking, read Cisco’s Visual Networking Index: Forecast and Methodology, 2015-2020
5 Wired.com

Social Data
There were approximately 2 billion internet users on social networks in 2016⁶, producing enormous amounts of data not only through posts and tweets, but also through comments, likes, and so forth. Some examples include Facebook and Facebook Messenger, Twitter, LinkedIn, Vine, WhatsApp, Skype, and so forth. This type of data is useful for text and sentiment analysis.

6 Statista Statistics Portal

Internet Data
The current forecast projects global IP traffic to nearly triple from 2015 to 2020, growing to 194 exabytes per month⁷. Examples of internet data include data stored on websites, blogs, and news sources, online banking and financial transactions, package and asset tracking, transportation data, telemedicine, first responder connectivity, and even chips for pets! Internet data can be analyzed for security breaches, bank fraud, traffic patterns, the geographic distribution of DNS clients, and the origins of cybercrime⁸.

7 Cisco: The Zettabyte Era – Trends and Analysis 2016

8 CAIDA: Center for Applied Internet Data Analysis

Log Data
Log files are records of events that occur in a system, in software, or in communications between users of software. There are many types of logging systems. Some examples are event logs, server logs, RFID logs, Active Directory logs, security logs, mail logs, network logs and transaction logs. Log data analysis includes performance analysis, software debugging, testing of new features, audit trails for unauthorized or malicious access, and more.

Third-Party Data
Third-party data is any information collected by an entity that does not have a direct relationship with the user the data is being collected on. Often this data is generated on a variety of platforms and then aggregated together for analysis. Examples include geospatial data, mapping and demographic data, content delivery networks, CRM and other business software systems. Third-party data can be analyzed for trends in traffic, spread of disease, user behavior, and more.

AWS Cloud Computing Models and Deployment Models
Cloud computing lets developers and IT departments focus on what matters most and avoid undifferentiated work like procurement, maintenance, and capacity planning. There are several different models and deployment strategies that meet the specific needs of different users. Each type of cloud service and deployment method provides different levels of control, flexibility and management. Understanding the differences between “Infrastructure as a Service” (IaaS), “Platform as a Service” (PaaS), and “Software as a Service” (SaaS), in addition to the different deployment strategies available, can help you decide what set of services is right for your business needs.

Before an analytical cloud project starts, it’s important to determine the right cloud computing and deployment architectures. Many factors come into play: the location of the data to be analyzed, where the analytics processing will be performed, and the legal and regulatory requirements of different countries. Once you’ve determined the best cloud computing and deployment model, you can utilize Amazon CloudFront to speed up the distribution of your application. Amazon CloudFront delivers your content through a worldwide network of edge locations, so when a user requests content served from CloudFront, they’re routed to the edge location that provides the lowest latency and content is delivered with the best possible performance. For more details about Amazon CloudFront, click here.

TABLE 5: AWS CLOUD COMPUTING MODELS

AWS Cloud Computing Models
There are three main models for cloud computing on AWS. Each model represents different parts of the cloud computing stack.

  1. IaaS contains the basic building blocks for cloud IT. This model typically provides access to networking features, “computers” (virtual or on dedicated hardware), and data storage space. It gives you the highest level of flexibility and management control over your IT resources and is most similar to the existing IT resources most organizations run on-premises today. IaaS is usually the first model used when moving to the cloud
  2. PaaS removes the need to manage the underlying infrastructure (usually hardware and operating systems) which allows you to focus on the deployment and management of your applications. This increases efficiency because you don’t have to worry about resource procurement, capacity planning, software maintenance, patching, or any of the other undifferentiated “heavy lifting” involved in running your applications
  3. SaaS provides you with a completed product that’s run and managed by the service provider. In most cases, people referring to SaaS are referring to end-user applications. With SaaS you don’t have to think about how the service is maintained or how the underlying infrastructure is managed; you only need to think about how you’ll use the software.

You’ll find popular open source and commercial software on AWS Marketplace available as SaaS, in addition to individual Amazon Machine Images (AMIs) or clusters of AMIs deployed through an AWS CloudFormation template.

AWS Cloud Computing Deployment Models
There are three AWS cloud computing deployment models: Public, Hybrid, and Private.

*NOTE: These describe where the IT resources reside, and are separate from the many ways to get your data and applications onto AWS.

TABLE 6: AWS CLOUD DEPLOYMENT MODELS

AWS Public Cloud Model (Cloud Native)
The AWS public cloud is where most companies and individuals start. It’s the easiest, fastest way to begin using on-demand delivery of IT resources and applications via the Internet, with a low-cost, pay-as-you-go pricing model, through AWS services and solutions available on AWS Marketplace.

The public cloud is an ideal place to quickly use big data analytics solutions on the AWS Marketplace to experiment, innovate and try new and different analytical solutions. Spin up solutions as you need them, turn them off when you’re done and only pay for what you’ve used.

The public cloud provides a simple way to access servers, storage, databases and a huge set of application services. AWS owns and maintains the network-connected hardware required for these services, while you provision and use what you need via the AWS console. Using the public cloud gives you the benefits of cloud computing such as the following:

  • Rather than investing in data centers and servers before you know what you’re going to use, you pay only when you consume computing resources, and only for how much you use
  • Achieve lower variable costs because usage from hundreds of thousands of customers is aggregated in the cloud
  • Eliminate guessing on infrastructure capacity needs. Access as much or as little as you need, and scale up or down as required within minutes
  • Increase speed and agility, since new IT resources are only a click away
  • Focus on projects that differentiate your business since AWS does all the heavy lifting of racking, stacking and powering servers
  • Easily deploy your application in multiple regions around the world, going global in minutes

The public network contains elements that may be sourced from the internet, data sources and users, and the edge services needed to access the AWS cloud or enterprise network. The flow from the external internet may come through normal edge services including DNS servers (Amazon Route 53, for example), content delivery networks (Amazon CloudFront, for example), firewalls (Amazon VPC security groups, for example) and load balancers (Elastic Load Balancing, for example) before entering the data integration or data streaming entry points of the data analytics solution.

AWS Hybrid Cloud Model
For companies that have significant on-premises and/or data center investments, migrating to the cloud can take years. Therefore, it’s very common to see Enterprises use a “Hybrid Cloud Architecture”, where critical data and processing remain in the data center and other resources are deployed in public cloud environments. Processing resources can be further optimized with a hybrid topology that enables cloud analytics engines to work with on-premises data. This leverages the cloud’s faster software deployment and update cycles while keeping data inside the firewall.

Another benefit of a hybrid environment is the ability to develop applications on dedicated resource pools, which eliminates the need to compromise on configuration details like processors, GPUs, memory, networking and even software licensing constraints. The resulting solution can subsequently be deployed to an Infrastructure as a Service (IaaS) cloud service that offers compute capacity matching the dedicated hardware environment that would otherwise be hosted on-premises. This capability is rapidly becoming a big differentiator for cloud applications that need to hit the ground running with the right configuration to meet real-world demands.

AWS Private/Enterprise Cloud Model

The main reason to choose a Private Cloud Environment is network isolation. Your EC2 instances are created in a virtual private cloud (VPC) to provide a logically isolated section of the AWS cloud.

Within that VPC, you have complete control over the virtual networking environment, including your own IP range selection, subnet creation, and configuration of route tables and network gateways. You can also create a hardware Virtual Private Network (VPN) connection. You can implement fine-grained access roles and groups, and staged levels of isolation for users. Enterprise governance and private encryption resources are available in a private cloud model. For more information on Enterprise cloud computing with AWS, click here. There are also solutions on AWS Marketplace that allow you to perform big data analytics in the cloud while keeping your data on-premises.
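
As a rough illustration of that control, below is a hedged boto3 sketch that carves out a VPC, a subnet, and a route table. The CIDR ranges are placeholder assumptions, not a recommended production layout, and in practice you’d add gateways, security groups, and access policies to match your isolation requirements.

# A minimal sketch, assuming boto3 credentials are already configured.
import boto3

ec2 = boto3.client("ec2")

# Create the logically isolated network with your own IP range (placeholder CIDR)
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Carve out a subnet and a route table inside the VPC
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
route_table = ec2.create_route_table(VpcId=vpc_id)
ec2.associate_route_table(RouteTableId=route_table["RouteTable"]["RouteTableId"],
                          SubnetId=subnet["Subnet"]["SubnetId"])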

Overview: AWS Identity and Access Management and Other AWS Built-In Security Features

TABLE 7: AWS BUILT-IN SECURITY SERVICES

AWS Security Overview
Before delving into any of AWS’s services used for big data advanced analytics, security must at least be addressed briefly. For any business, cloud security is the number one concern. AWS has industry-leading capabilities across facilities, networks, software and business processes, meeting the strictest requirements of any vertical. Security is a core functional requirement that protects mission-critical information from accidental or deliberate theft, leakage, integrity compromise and deletion.

AWS customers benefit from a data center and network architecture built to satisfy the requirements of its most security-sensitive customers. AWS uses redundant, layered controls, continuous validation and testing, and a substantial amount of automation to ensure that the underlying infrastructure is monitored and protected 24×7. These controls are replicated in every new data center and service.

Under the AWS “Shared Responsibility Model”, AWS is responsible for the security of the underlying cloud infrastructure and you are responsible for securing workloads you deploy in AWS, giving you the flexibility and agility to implement the most applicable security controls for your business functions in the AWS environment.

There are certain security features, such as individual user accounts and credentials, SSL/TLS for data transmissions and user activity logging that you should configure no matter which AWS service you use.

Identity and Access Management – User Accounts
AWS provides a variety of tools and features to keep your AWS account and resources safe from unauthorized use. This includes credentials for access control, HTTPS endpoints for encrypted data transmission, the creation of separate Identity and Access Management (IAM) user accounts, user activity logging for security monitoring, and Trusted Advisor security checks.

Only the business owner should have “root access” to your AWS account. The screenshot below is what you see when you’re logging in with your “root credentials”:

FIGURE 5: LOGIN PAGE USING YOUR ROOT CREDENTIALS

The screenshot below is the login page you see when you log in with your Identity and Access Management (IAM) account credentials (Note the additional “Account” textbox and the link at the bottom stating “Sign-in using root account credentials”):

FIGURE 6: LOGIN PAGE WHEN YOU’RE LOGGING IN WITH IAM CREDENTIALS

To avoid day-to-day use of your AWS root account, use IAM to create and manage individual users (an individual, system, or application that interacts with your AWS resources). With IAM, you define policies that control which AWS services your users can access and what they can do with them. This gives you very fine-grained control, granting only the minimum permissions each user needs to do their job.
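
To make that concrete, here is a minimal, hypothetical boto3 sketch that creates an IAM user and grants it read-only access to a single bucket. The user name, policy name, and bucket ARN are assumptions for illustration, not resources referenced elsewhere in this post.

# A minimal least-privilege sketch, assuming the caller has IAM admin rights.
import json
import boto3

iam = boto3.client("iam")
iam.create_user(UserName="analytics-reader")          # hypothetical user name

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-analytics-bucket",
                     "arn:aws:s3:::example-analytics-bucket/*"],   # placeholder bucket
    }],
}

iam.put_user_policy(UserName="analytics-reader",
                    PolicyName="S3ReadOnlyExampleBucket",
                    PolicyDocument=json.dumps(read_only_policy))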

The Table Below Gives You an Overview of AWS User Security Measures:

TABLE 8: AWS BUILT-IN USER SECURITY MEASURES

To read more about IAM security best practices, click here.

AWS Network, Data and API Security
The AWS network has been architected to let you select the level of security and resiliency appropriate for your workload. It enables you to build geographically dispersed, fault-tolerant web architectures with cloud resources, backed by a world-class network infrastructure that’s continually monitored and managed.

Most enterprises take advantage of Amazon Virtual Private Cloud (VPC), which enables you to launch AWS resources into a virtual network you define that resembles your own data center network, but with the benefits of the scalable infrastructure of AWS. For more information, click here.

Below are Some of AWS Network Security Measures:
• Firewall and other boundary devices that employ rule sets, access control lists (ACLs) and configurations
• Secure access points with comprehensive monitoring
• Transmission protection via HTTPS using SSL
• Continually monitoring systems at all levels
• Account audits every 90 days
• Security logs
• Individual service-specific security
• Virtual Private Gateways / Internet Gateways
• Amazon Route 53 Security (DNS)
• CloudFront Security
• Direct Connect Security for Hybrid Cloud Architectures
• Multiple Data Security Options
• Encryption and Data Encryption at rest
• Event Notifications
• Amazon Cognito Federated Identity Authentication

For more information on AWS Network, Data, and API Security, look here.

AWS Trusted Advisor
AWS Trusted Advisor scours your infrastructure and provides continual best practice recommendations free of charge in four categories: Cost Optimization, Performance, Security and Fault Tolerance. Within the Trusted Advisor console, details are given and there are direct links to the exact resource that requires attention. However, if you have a Business or Enterprise Support Plan, you have access to numerous other best practice recommendations. See the image below to grok the way AWS Trusted Advisor works:

FIGURE 7: AWS TRUSTED ADVISOR OVERVIEW DIAGRAM

To read more about AWS Trusted Advisor click here.

AWS Marketplace Software Solutions to Augment AWS’s Built-In Security Features

AWS’s infrastructure monitoring and security features can be enhanced and customized to meet the needs of any business by augmenting AWS built-in features with a plethora of options available on AWS Marketplace to create a secure cloud nirvana.

Some of the solutions to enhance security can be found here.

Using AWS Services with Solutions Available on AWS Marketplace for Big Data Analytics
This section will describe how to implement, augment, or customize some of the most commonly used AWS managed services in big data analytics with solutions available on AWS Marketplace.

Below you’ll find the AWS Management Console (the view below is once you’ve logged in), from where you access AWS’s managed services:

FIGURE 8: THE AWS MANAGEMENT CONSOLE WITH THE MANAGED SERVICES

Amazon EC2: Self-Managed Big Data Analytics Solutions on AWS Marketplace

TABLE 9: AMAZON EC2 SELF-MANAGED ANALYTICS

Amazon Elastic Cloud Compute (EC2) Overview
Amazon EC2 provides an ideal platform for operating your own self-managed big data analytics applications on AWS infrastructure. Almost any software you can install on Linux or Windows virtualized environments can be run on Amazon EC2 with pay-as-you-go pricing using a solution available on AWS Marketplace. Amazon EC2 lets you distribute computing power across parallel servers so your analytics algorithms execute as efficiently as possible.

Amazon EC2 provides scalable computing capacity through highly configurable instance types launched as an Amazon Machine Image (AMI). You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking and manage storage. For a quick test run or one-time big data analytics project, you can use instance store volumes for temporary data that’s deleted when you stop or terminate your instance, or use Amazon Elastic Block Store (EBS) for persistent storage.
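
As a rough sketch (assuming you’ve already subscribed to a Marketplace AMI and created a key pair and security group in your account; every identifier below is a placeholder), launching such an instance with boto3 might look like this:

# A hedged launch sketch; replace all IDs with values from your own account.
import boto3

ec2 = boto3.client("ec2")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder Marketplace AMI
    InstanceType="c4.2xlarge",            # compute-optimized, pay-as-you-go
    MinCount=1,
    MaxCount=1,
    KeyName="my-analytics-key",           # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],
    BlockDeviceMappings=[{                # optional persistent EBS volume
        "DeviceName": "/dev/xvdb",
        "Ebs": {"VolumeSize": 100, "VolumeType": "gp2"},
    }],
)
print(response["Instances"][0]["InstanceId"])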

EC2 also lets you create virtual networks, known as virtual private clouds (VPCs), that are logically isolated from the rest of the AWS cloud and that you can optionally connect to your own network. You could use a VPC to run analytics solutions against data in your data center, or use one of the solutions on AWS Marketplace that facilitates a hybrid deployment model, like Attunity CloudBeam (which has many other big data analytics features).

Examples of Amazon EC2 Self-Managed Analytics Solutions on AWS Marketplace
Some examples of self-managed big data analytics that run on Amazon EC2 include the following:
• A Splunk Enterprise Platform, the leading software platform for real-time Operational Intelligence. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. A Splunk Analytics for Hadoop solution, called Hunk, is also available on AWS Marketplace; it enables interactive exploration, analysis, and data visualization for data stored in Amazon EMR and Amazon S3
• A Tableau Server Data Visualization Instance, for users to interact with pre-built data visualizations created using Tableau Desktop. Tableau server allows for ad-hoc querying and data discovery, supports high-volume data visualization and historical analysis, and enables the creation of reports and dashboards
• A SAP HANA One Instance, a single-tenant SAP HANA database instance that has SAP HANA’s in-memory platform, to do transactional processing, operational reporting, online analytical processing, predictive and text analysis
• A Geospatial AMI such as MapLarge, that brings high-performance, real-time geospatial visualization and interactive analytics. MapLarge’s visualization results are useful for plotting addresses on a map to determine demographics, analyzing law enforcement and intelligence data, delivering insight to public health information, and visualizing distances such as roads and pipelines
• An Advanced Analytics Zementis ADAPA Decision Engine Instance, which is a platform and scoring engine to produce Data Science predictive models that integrate with other predictive models like R, Python, KNIME, SAS, SPSS, SAP, FICO and more. Zementis ADAPA Decision Engine can score data in real-time using web services or in batch mode from local files or data in Amazon S3 buckets. It provides predictive analytics through many predictive algorithms, sensor data processing (IoT), behavior analysis, and machine learning models
• A Matillion Data Integration Instance, an ELT service natively built for Amazon Redshift that uses Amazon Redshift’s own processing for data transformations to take advantage of its blazing speed and scalability. Matillion gives you the ability to orchestrate and/or transform data upon ingestion, or simply load the data so it can be transformed multiple times as your business requires

Below is an awesome brochure on how using solutions available on AWS Marketplace “re-invents the way you choose, test, and deploy analytics software”:

FIGURE 9: AWS MARKETPLACE BROCHURE: RE-INVENTING THE WAY YOU CHOOSE, TEST, AND DEPLOY ANALYTICS SOFTWARE

Amazon EC2 Instance Types
Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity; compute capacity is measured in “vCPUs” (virtual CPUs), versus the legacy “ECU” (EC2 Compute Unit) measure that you’ll still see at times today. Each instance type includes one or more instance sizes, allowing you to scale your resources to the requirements of your target analytical workload. To read more about the differences between Amazon EC2-Classic and Amazon EC2-VPC, read this.

Performance is based on the Amazon EC2 instance type you choose. There are many instance types that you can read about here, but below the four main EC2 types that power big data analytics are described:

  • Compute Optimized: Compute-optimized instances, such as C4 instances, feature the highest performing processors and the lowest price/compute performance in EC2. With support for clustering C4 instances, they’re ideal for batch processing, distributed analytics, high performance science and engineering applications, ad serving, MMO gaming, and video encoding
  • Memory Optimized: Memory optimized instances have the lowest cost per GB of RAM among Amazon EC2 instance types. These instances are ideal for high performance databases, distributed memory caches, in-memory analytics, genome assembly and analysis, and other large enterprise applications
  • GPU Optimized: GPU instances are ideal to power graphics-intensive applications such as 3D streaming, machine learning, and video encoding. Each instance features high-performance NVIDIA GPUs with an on-board hardware video encoder designed to support up to eight real-time HD video streams (720p@30fps) or up to four real-time full HD video streams (1080p@30fps)
  • Dense Storage: Featuring up to 48 TB of HDD-based local storage, dense storage instances deliver high throughput, and offer the lowest price per disk throughput performance on EC2. This instance type is ideal for Massively Parallel Processing (MPP), Hadoop, distributed file systems, network file systems, and big data processing applications

Amazon S3: A Data Store for Computation and Large-Scale Analytics

TABLE 10: AMAZON S3 COMPUTATION & ANALYTICS DATA STORE

Amazon Simple Storage Service (S3) Overview
Amazon S3 is storage for the internet. It’s a simple storage service that offers software developers a highly-scalable, reliable, and low-cost data storage infrastructure. It provides a simple web service interface that can be used to store and retrieve any amount of data, at any time, from within Amazon EC2 or anywhere on the web. You can read, write and delete objects containing from 1 byte to 5 TB of data each. The number of objects you can store in an S3 “bucket” is virtually unlimited. It’s highly secure, supports encryption at rest, and provides multiple mechanisms for fine-grained control of access to Amazon S3 resources. Amazon S3 is also highly performant: it allows concurrent read or write access by many separate clients or application threads, and no storage provisioning is necessary.
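
For a sense of how simple that interface is, here is a minimal boto3 sketch that writes and then reads back one object; the bucket and key names are assumptions for illustration:

# A minimal sketch, assuming the placeholder bucket already exists in your account.
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="example-analytics-bucket",
              Key="raw/clickstream/2019-11-07.json",
              Body=b'{"user": 42, "event": "page_view"}')

obj = s3.get_object(Bucket="example-analytics-bucket",
                    Key="raw/clickstream/2019-11-07.json")
print(obj["Body"].read())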

Amazon S3 is very commonly used as a data store for computation and large-scale analytics, such as financial transactions, clickstream analytics, and media transcoding. Because of the horizontal scalability of Amazon S3, you can access your data from multiple computing nodes concurrently without being constrained by a single connection.

Amazon S3 is the common data repository for pre- and post-processing with Amazon EMR.

FIGURE 10: USING AMAZON S3 FOR STORAGE PRE-AND-POST AMAZON EMR ANALYSIS

Amazon S3 is well-suited for extremely spiky bandwidth demands, making it the perfect storage for Amazon EMR batch analysis. Because Amazon S3 is inexpensive, stores objects redundantly on multiple devices across multiple facilities, and protects critical data from inadvertent deletion with its versioning capability, data is often kept on S3 long after Amazon EMR processing completes so new queries can be run on the same data later. If you store your data on Amazon S3, you can access that data from as many Amazon EMR clusters as you need.

FIGURE 11: ACCESSING DATA IN AMAZON S3 FROM MULTIPLE AMAZON EMR CLUSTERS

Amazon S3 is the common data repository for Amazon Redshift before loading the data into the Amazon Redshift Data Warehouse. You use the “COPY” command to load data from Amazon S3:

FIGURE 12: AMAZON S3 COPY COMMAND
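
For reference, issuing that COPY command from Python might look roughly like the hedged sketch below, which uses the third-party psycopg2 driver; the cluster endpoint, database, credentials, table, bucket path, and IAM role are all placeholders:

# A minimal sketch, assuming a reachable cluster and an attached COPY-capable IAM role.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="awsuser", password="...")   # placeholder credentials

with conn, conn.cursor() as cur:
    cur.execute("""
        COPY clickstream_events
        FROM 's3://example-analytics-bucket/raw/clickstream/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """)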

In addition, all data written to any node in an Amazon Redshift cluster is continually backed up to Amazon S3.

FIGURE 13: AMAZON REDSHIFT CLUSTER DATA BACKS UP AUTOMATICALLY TO AMAZON S3

Examples of Some of Amazon S3’s Benefits in Large-Scale Analytics:
• S3 storage provides the highest level of data durability and availability in the AWS platform
• Error correction is built-in, and there are no single points of failure. It’s designed to sustain concurrent loss of data in two facilities, making it very well-suited to serve as the primary data storage for mission-critical data
• Amazon S3 is designed for 99.999999999% (11 nines) durability per object and 99.99% availability over a one-year period
• Highly scalable, with practically unlimited storage
• Access to Amazon S3 from Amazon EC2 in the same region is lightning fast; server-side latencies are insignificant relative to Internet latencies
• Because Amazon S3 can be accessed by multiple threads, multiple applications and multiple clients concurrently, total Amazon S3 aggregate throughput scales to rates that far exceed what any single server can generate or consume
• To speed access to relevant data, many developers pair Amazon S3 with a database, such as Amazon DynamoDB or Amazon RDS, where Amazon S3 stores the actual information and the database serves as the repository for the associated metadata. Metadata in the database can be easily indexed and queried, making it efficient to locate an object’s reference via a database query; that result can then be used to pinpoint and retrieve the object itself from Amazon S3 (a minimal sketch of this pattern follows this list)
• You can nest folders in Amazon S3 “buckets” and give fine-grained access control to each
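
Here is the S3-plus-DynamoDB pairing mentioned above as a minimal, hypothetical boto3 sketch; the bucket, table, and attribute names are assumptions, and the DynamoDB table is presumed to already exist with asset_id as its key:

# A hedged sketch of the "object in S3, metadata in DynamoDB" pattern.
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("media-metadata")   # hypothetical table

# Store the object itself in S3 and its searchable metadata in DynamoDB
s3.put_object(Bucket="example-media-bucket", Key="videos/clip-001.mp4", Body=b"...")
table.put_item(Item={"asset_id": "clip-001",
                     "s3_key": "videos/clip-001.mp4",
                     "duration_seconds": 84,
                     "category": "product-demo"})

# Later: query the metadata, then fetch only the object you need from S3
item = table.get_item(Key={"asset_id": "clip-001"})["Item"]
video = s3.get_object(Bucket="example-media-bucket", Key=item["s3_key"])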

Amazon Redshift: A Massively Parallel Processing (MPP) Petabyte-Scale Enterprise Data Warehouse

TABLE 11: AMAZON REDSHIFT DATA WAREHOUSE

Amazon Redshift Overview
Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse that makes it easy and cost-effective to analyze all your data by seamlessly integrating with existing business intelligence, reporting, and analytics tools. It’s optimized for datasets ranging from a few hundred gigabytes to a petabyte or more. You can start small for a very low cost per hour with no commitments and scale to petabytes for roughly one-tenth the cost of traditional solutions. And when you need to scale, you simply add more nodes to your cluster and Amazon Redshift redistributes your data for maximum performance, with no downtime.

FIGURE 13: ADDING MORE NODES TO AN AMAZON REDSHIFT CLUSTER
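
Resizing can be done from the console, or with a couple of lines of boto3 along the lines of the hedged sketch below; the cluster identifier, node type, and node count are placeholders:

# A minimal resize sketch; Redshift redistributes data across the new node count.
import boto3

redshift = boto3.client("redshift")
redshift.modify_cluster(ClusterIdentifier="example-analytics-cluster",
                        NodeType="dc2.large",
                        NumberOfNodes=4)      # scale out from the current node count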

Amazon Redshift is a SQL data warehouse solution and uses standard ODBC and JDBC connections. Your data warehouse can be up and running in minutes, enabling you to use your data to acquire new insights for your business and customers continually.

Traditional data warehouses require significant expenditures, time and resources to buy, build, and maintain, and they don’t scale easily: as your requirements grow, you have to invest in more hardware and resources, plus hire enough DBAs to keep queries running correctly and guard against data loss. Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups, patches, and upgrades.

Amazon Redshift’s Features Enabling Large-Scale Analytics
Amazon Redshift uses columnar storage and a massively parallel processing (MPP) architecture to parallelize and distribute queries across multiple nodes to consistently deliver high performance at any volume of data.

FIGURE 14: AMAZON REDSHIFT’S COLUMNAR STORAGE ARCHITECTURE GIVES THE ABILITY TO ONLY READ THE DATA YOU NEED

It automatically and continuously monitors your cluster and copies your data into Amazon S3 so you can restore your data warehouse with a few clicks. Amazon Redshift stores three copies of your data for reliability. Amazon Redshift utilizes data compression and zone maps to reduce the amount of I/O needed to perform queries. Security is built in: you can encrypt data at rest and in transit using hardware-accelerated AES-256 and SSL, and if you want to use Amazon VPC with your Amazon Redshift cluster, that’s also built in. All API calls, connection attempts, queries and changes to the cluster are logged and auditable.

An Amazon Redshift data warehouse is a collection of computing resources called “nodes” that are organized into a group called a “cluster”. Each cluster runs an Amazon Redshift engine and contains one or more databases. Each cluster has a leader node and one or more compute nodes. The “leader node” receives queries from client applications, parses the queries and develops query execution plans. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes then finally returns the results back to the client applications. “Compute nodes” execute the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent back to the leader node for aggregation before being sent back to the client applications.

FIGURE 15: AMAZON REDSHIFT DATA WAREHOUSE SYSTEM ARCHITECTURE

Data typically flows into a data warehouse from many different sources and in many different formats including structured, semi-structured, and unstructured data. This data is processed, transformed, and ingested at a regular cadence. You can use AWS Data Pipeline to extract, transform, and load data into Amazon Redshift. AWS Data Pipeline provides fault tolerance, scheduling, resource management and an easy-to-extend API for your ETL. It can reliably process and move data between different AWS compute and storage services as well as on-premises data sources.

You can also use AWS Database Migration Service to stream data to Amazon Redshift from any of the supported sources including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, SAP ASE and SQL Server, enabling consolidation for easy analysis of data in Amazon Redshift.

Amazon Redshift is integrated with other AWS services and has built-in commands to load data in parallel to each node from Amazon S3, Amazon DynamoDB or your Amazon EC2 and on-premises servers using SSH. Amazon Kinesis and AWS Lambda integrate with Amazon Redshift as a target. You can also load streaming data into Amazon Redshift using Amazon Kinesis Firehose.
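
As a small illustration of the streaming path, here is a hedged boto3 sketch that pushes a record into an assumed, pre-configured Kinesis Firehose delivery stream whose destination is the Redshift cluster; the stream name and event fields are placeholders:

# A minimal sketch; the delivery stream is assumed to already target Redshift.
import json
import boto3

firehose = boto3.client("firehose")
event = {"user": 42, "event": "add_to_cart", "ts": "2019-11-07T12:00:00Z"}
firehose.put_record(DeliveryStreamName="clickstream-to-redshift",
                    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")})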

Amazon Redshift Analytics Examples
• Analyze Global Sales Data for Multiple Products
• Store Historical Stock Trade Data
• Analyze Ad Impressions and Clicks
• Aggregate Gaming Data
• Analyze Social Trends
• Measure Clinical Quality, Operation Efficiency, and Financial Performance in the Healthcare Space

AWS Marketplace Solutions for Amazon Redshift
Data can be loaded into Amazon Redshift from a multitude of solutions available from popular software vendors on AWS Marketplace to assist in Data Integration, Analytics, and Reporting and Visualization. Many of the solutions include much more than the broad topic titles.

For Data Integration and more, Matillion ETL for Redshift is a fast, modern, easy-to-use and powerful ETL/ELT tool that makes it simple and productive to load and transform data on Amazon Redshift. It’s 100x faster than traditional ETL technology and up and running in under 5 minutes. With a few clicks you can load data directly into Redshift, fast, from Amazon S3; Amazon RDS; relational, columnar, cloud and NoSQL databases; FTP/HTTP; REST, SOAP, & JSON APIs; Amazon EMR; and directly from enterprise and cloud-based systems including Google Analytics, Google Adwords, Facebook, Twitter and more.

FIGURE 16: MATILLION PROCESSES MILLIONS OF ROWS IN SECONDS WITH REAL-TIME FEEDBACK

Matillion ETL for Redshift transforms data at eye-popping speed in a productivity-oriented, streamlined, browser-based graphical job development environment. Expect a 50% reduction in ETL development and maintenance effort, and months off your project, as a result of the streamlined UI, the tight integration with AWS and Redshift, and the sheer speed.

FIGURE 17: MATILLION JOINS, TRANSFORMS, FILTERS & MANIPULATES BIG DATA AT BLISTERING SPEED IN A MODERN, BEAUTIFUL, BROWSER-BASED ENVIRONMENT

Matillion ETL for Redshift delivers a rich orchestration environment where you can orchestrate and schedule data load and transform; control flow; integrate with other systems and AWS services via Amazon SQS, Amazon SNS and Python; iterate; manage variables; create and drop tables; vacuum and analyze tables; soft code ETL/ELTs from configuration tables; control transactions and commit/roll-back; setup alerting; and develop data quality, error-handling and conditional logic.

For Advanced Analytics, TIBCO Spotfire Analytics Platform is a complete analytics solution that helps you quickly uncover insights for better decision-making. Explore, visualize, and create dashboards for Amazon Redshift, RDS, Microsoft Excel, SQL Server, Oracle, and more in minutes. Easily scale from a small team to the entire organization with Spotfire for AWS. It includes 1 Spotfire Analyst user (via Microsoft Remote Desktop), unlimited Consumer and Business Author (web) users, plus Spotfire Server, Web Player, Automation Services and Statistics Services. Go from data to dashboard in under a minute. No other solution makes it as easy to get started or deliver analytics expertise. TIBCO Spotfire® Recommendations suggests the best visualizations based on years of best practices. Broadest Data Connectivity: access and combine all of your data, cloud or on-premises, small or big, in a single analysis to get a holistic view of your business, with best-in-class analytics for any data source, including Amazon Redshift and RDS.

FIGURE 18: TIBCO SPOTFIRE ANALYTICS PLATFORM CONNECTS TO CLOUD OR ON-PREMISES DATA SOURCES

Comprehensive Analytics – A full spectrum of analytics capabilities to empower novice to advanced users, including: interactive visualizations, data mashup, predictive and prescriptive analytics, location analytics, and more.

FIGURE 19: TIBCO SPOTFIRE GIVES COMPREHENSIVE ANALYTICS & VISUALIZATIONS

For Data Analysis and Visualization, Tableau Server for AWS is browser- and mobile-based visual analytics anyone can use. Publish interactive dashboards with Tableau Desktop and share them throughout your organization.

FIGURE 20: TABLEAU SERVER BROWSER & MOBILE-BASED VISUAL ANALYTICS

FIGURE 21: TABLEAU SERVER’S PUBLISHED INTERACTIVE DASHBOARDS SHARED THROUGHOUT YOUR ORGANIZATION

Embedded or as a stand-alone application, you can empower your business to find answers in minutes, not months. By deploying from AWS Marketplace you can stand up a perfectly sized instance for your Tableau Server with just a few clicks. Tableau helps tens of thousands of people see and understand their data by making it simple for the everyday data worker to perform ad-hoc visual analytics and data discovery, as well as to seamlessly build beautiful dashboards and reports. Tableau is designed to make connecting live to data of all types a simple process that does not require any coding or scripting. From cloud sources like Amazon Redshift, to on-premises Hadoop clusters, to local spreadsheets, Tableau gives everyone the power to quickly start visually exploring data of any size to find new insights.

For Data Warehouse Databases, SAP HANA One is a production-ready, single-tenant SAP HANA database instance, upgradable to the latest HANA SPS version (by add-on). Perform real-time analysis, and develop and deploy real-time applications, with SAP HANA One. Natively built using in-memory technology and now deployed on AWS, SAP HANA One accelerates transactional processing, operational reporting, OLAP, and predictive and text analysis while bypassing the traditional data latency and maintenance issues created by pre-materializing views and pre-caching query results. Unlike other database management systems, SAP HANA One on AWS streamlines both transactional (OLTP) and analytical (OLAP) processing by working with a single data copy in the in-memory columnar data store. By consolidating OLAP and OLTP workloads into a single in-memory RDBMS, you benefit from a dramatically lower TCO in addition to mind-blowing speed. Build new, or deploy existing, on-demand applications on top of this instance for productive use. Developers can take advantage of this offering through standards-based open connectivity protocols (ODBC, JDBC, ODBO, OData and MDX), allowing easy integration with existing tools and technologies. Transform decision processing by streamlining transactions, analytics, planning, and predictive and text analytics on a single in-memory platform. HANA One instances are now more secure with SSH root login disabled; customers can now log in to the instance using the new ‘ec2-user’ user.

Amazon EMR: A Managed Hadoop Distributed Computing Framework

TABLE 12: AMAZON EMR – A MANAGED HADOOP DISTRIBUTED COMPUTING FRAMEWORK

Amazon Elastic MapReduce (EMR) Overview

With Amazon EMR you can analyze and process vast amounts of data by distributing the computational work across a resizable cluster of virtual servers using Apache Hadoop, an open-source framework. Open-source projects that run on top of the Hadoop architecture can also be run on Amazon EMR, such as Hive, Pig, Spark, etc.

Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced down to a single output set. One node, designated as the “master node”, controls the distribution of tasks.

FIGURE 22: AMAZON EMR HADOOP CLUSTER WITH THE MASTER NODE DIRECTING A GROUP OF SLAVE NODES TO PROCESS THE DATA

Amazon EMR has made enhancements to Hadoop and the other open-source applications so they work seamlessly with AWS. For example, Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and Amazon CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of Amazon DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster; this orchestrated set of resources is called an Amazon EMR cluster.

FIGURE 23: AMAZON EMR INTERACTING WITH OTHER AWS SERVICES

Amazon EMR’s Features Enabling Large-Scale Analytics

Hadoop provides the framework to run big data processing and analytics, and Amazon EMR does all the heavy lifting involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster. You can provision a fully managed Hadoop framework in minutes, scale your cluster dynamically, and pay only for what you use, from one to thousands of compute instances. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances. You can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete.
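
To illustrate the “temporary cluster” pattern, here is a hedged boto3 sketch that launches a transient cluster which shuts down when its work is done; the release label, instance types, roles, and log path are placeholders you’d replace with values from your own account:

# A minimal sketch, assuming the default EMR roles already exist in your account.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="nightly-log-analysis",
    ReleaseLabel="emr-5.27.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,   # temporary cluster: terminate when idle
    },
    LogUri="s3://example-analytics-bucket/emr-logs/",   # placeholder log bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)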

Amazon EMR securely and reliably handles big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

Amazon EMR monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. It automatically configures Amazon EC2 firewall settings that control access to instances, and you can launch clusters in an Amazon VPC. For objects stored in Amazon S3, you can use Amazon S3 server-side encryption or Amazon S3 client-side encryption with EMRFS, with AWS Key Management Service or customer-managed keys. You can customize every cluster.

Apache Spark is an engine in the Apache Hadoop ecosystem for fast and efficient processing of large datasets. By using in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to define data transformations, Spark has shown significant performance increases over Hadoop MapReduce for certain workloads. Spark also powers higher-level tools such as Spark SQL, and it can be run on top of YARN (the resource manager for Hadoop 2). AWS has revised the bootstrap action to install Spark 1.x on AWS Hadoop 2.x AMIs and run it on top of YARN. The bootstrap action also installs and configures Spark SQL (SQL-driven data warehouse), Spark Streaming (streaming applications), MLlib (machine learning), and GraphX (graph systems).
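
To give a feel for how those lazily built DAGs look in practice, here is a minimal PySpark sketch, assuming it’s submitted (for example with spark-submit) to an EMR cluster where Spark is installed; the S3 paths are placeholders:

# A minimal PySpark sketch; transformations build the DAG lazily until write() runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-summary").getOrCreate()

logs = spark.read.json("s3://example-analytics-bucket/raw/clickstream/")
(logs.groupBy("event")
     .count()
     .write.parquet("s3://example-analytics-bucket/summaries/event-counts/"))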

The S3 location for the Spark installation bootstrap action is:

Amazon EMR Analytics Examples

  • Log processing and analytics
  • Large extract, transform, and load (ETL) data movement
  • Risk modeling and threat analytics
  • Ad targeting and click stream analytics
  • Genomics
  • Predictive analytics
  • Ad hoc data mining and analytics

You can view an architectural diagram of Web Log Analysis on AWS here, and another on Advertisement Serving here.

AWS Marketplace Solutions for Amazon EMR and the Hadoop Ecosystem

Data can be loaded into EMR from a multitude of solutions available from popular software vendors on AWS Marketplace to assist in Data Integration, Analytics, and Reporting and Visualization. Many of the solutions include much more than the broad topic titles.

For Data Integration, Attunity CloudBeam for S3, EMR and other Hadoop distributions simplifies, automates, and accelerates the loading and replication of data from a variety of structured and unstructured sources to create a data lake for Hadoop consumption on Amazon S3, including replication across Amazon Regions.

FIGURE 24: ATTUNITY CLOUDBEAM ACCELERATES THE LOADING & REPLICATION OF DATA FROM A VARIETY OF SOURCES

Attunity CloudBeam simplifies and streamlines ingesting enterprise data for use in Big Data Analytics by EMR or other Hadoop distributions from Cloudera, Hortonworks or MapR as well as for pre-processing before moving data into Redshift, S3, or RDS. Attunity CloudBeam is designed to handle files of any size, transferring content over any given network connection, thereby achieving best-in-class acceleration and guaranteed delivery.

FIGURE 25: ATTUNITY CLOUDBEAM STREAMLINES INGESTING ENTERPRISE DATA FOR USE IN BIG DATA ANALYTICS BY AMAZON EMR

Attunity CloudBeam’s automation provides intuitive administration, scheduling, replication of deltas only, security and monitoring.

FIGURE 26: ATTUNITY CLOUDBEAM’S AUTOMATED SCHEDULING

FIGURE 27: ATTUNITY CLOUDBEAM’S AUTOMATED REPLICATION

For Advanced Analytics, Infosys Information Platform (IIP) leverages the power of open source to address big data adoption challenges such as the inadequate accessibility of easy-to-use development tools, a fragmented approach to building data pipelines, and the lack of an enterprise-ready open source big data analytics platform that can support all forms of data: structured, semi-structured, and unstructured.

FIGURE 28: INFOSYS INFORMATION PLATFORM SUPPORTS ALL FORMS OF DATA

It’s a one-stop solution that covers everything from Data Engineering to Data Science, from ingestion to visualization, with one-click launch, high performance, scalability, and enterprise-grade security.

FIGURE 29: INFOSYS INFORMATION PLATFORM SOLUTION FROM DATA ENGINEERING TO DATA SCIENCE

It delivers actionable insights in real time.

FIGURE 30: INFOSYS INFORMATION PLATFORM GIVES ACTIONABLE INSIGHTS IN REAL- TIME

For Data Analysis and Visualization, TIBCO Jaspersoft Reporting and Analytics for AWS is a commercial open source reporting and analytics server built for AWS that can run standalone or be embedded in your application. It’s priced very aggressively, with a low hourly rate, no data or user limits and no additional fees. A multi-tenant version is available as a separate Marketplace listing. Free online support is available upon registering after launching the instance, and professional support is available separately from TIBCO sales.

FIGURE 31: TIBCO JASPERSOFT SUPPORT

Jaspersoft’s business intelligence suite allows you to easily create beautiful, interactive reports, dashboards and data visualizations. Designed to quickly connect to your Amazon RDS, Redshift and EMR data sources, you can be analyzing your data and building reports in under 10 minutes.

FIGURE 32: TIBCO JASPERSOFT ANALYZING YOUR DATA & BUILDING REPORTS

TIBCO Jaspersoft’s software empowers millions of people every day to make better decisions faster by bringing them timely, actionable data inside their apps and business processes. Thanks to a community hundreds of thousands strong, TIBCO Jaspersoft’s software has been downloaded millions of times and is used to create the intelligence inside hundreds of thousands of apps and business processes. Full BI Server for Cents/Hour: no user or data limits and no additional fees. The suite includes ad hoc query and reporting, dashboards, data analysis, data visualization and data virtualization. 10 Minutes to Your AWS Data: purpose-built for AWS, the reporting and analytics server allows you to quickly and easily connect to Amazon RDS, Redshift and EMR; in under 10 minutes you can be reporting on and analyzing your data. BI for Your Business or App: built to modern web standards with an HTML5 UI and JavaScript and REST APIs, this flexible BI suite can be used to analyze your business or deliver stunning interactive reports and dashboards inside your app.

FIGURE 33: TIBCO JASPERSOFT INTERACTIVE ANALYTICS DASHBOARD

Amazon Elasticsearch Service: Real-time Data Analysis and Visualization

TABLE 13: AMAZON ELASTICSEARCH SERVICE – REAL-TIME DATA ANALYSIS & VISUALIZATION

Amazon Elasticsearch Service (ES) Overview

Organizations are collecting an ever-increasing amount of data from numerous sources such as log systems, click streams, and connected devices. Launched in 2009, Elasticsearch, an open-source analytics and search engine, has emerged as a popular tool for real-time analytics and visualization of data. Some of the most common use cases include risk assessment, error detection, and sentiment analysis. However, as data volumes and applications grow, managing open source Elasticsearch clusters can consume significant IT resources while adding little or no differentiated value to the organization. Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Amazon ES offers the benefits of a managed service, including cluster provisioning, easy configuration, replication for high availability, scaling options, data durability, security, and node monitoring.

Amazon ES integrates tightly with Logstash, and a Kibana instance is automatically configured for you. When you deploy Amazon ES, you effectively deploy an “ELK stack” (Elasticsearch, Logstash, and Kibana).

Amazon ES Service Features Enabling Large-Scale Analytics

Logstash is an open source data pipeline that helps process logs and other event data and has built-in support for Kibana. Kibana is an open source analytics and visualization platform that helps you get a better understanding of your data in Amazon ES Service. You can set up your Amazon ES Service domain as the backend store for all logs coming through your Logstash implementation to easily ingest structured and unstructured data from a variety of sources. Amazon ES Service lets you explore your data at a speed and scale never before possible, and it’s used for full-text search, structured search, analytics, or all three in combination.

Amazon ES Service gives you direct access to the open-source Elasticsearch APIs to load, query and analyze data and manage indices. There is integration for streaming data from Amazon S3, Amazon Kinesis Streams, and DynamoDB Streams. The integrations use a Lambda function as an event handler in the cloud that responds to new data by processing it and streaming the data to your Amazon ES Service domain.
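To make the Lambda-based integration concrete, below is a minimal sketch of a Lambda handler that reads DynamoDB Streams events and indexes them into an Amazon ES domain. It assumes the elasticsearch and requests-aws4auth Python packages are bundled with the function, and the domain endpoint, index name, and key attribute ('id') are placeholders, not values from the course or this post.

```python
import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

REGION = 'us-east-1'
# Placeholder Amazon ES domain endpoint -- replace with your own.
ES_HOST = 'search-mydomain-abc123xyz.us-east-1.es.amazonaws.com'

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, 'es', session_token=credentials.token)

es = Elasticsearch(hosts=[{'host': ES_HOST, 'port': 443}],
                   http_auth=awsauth, use_ssl=True, verify_certs=True,
                   connection_class=RequestsHttpConnection)

def handler(event, context):
    """Index inserts/updates from a DynamoDB Stream; delete removed items."""
    for record in event['Records']:
        doc_id = record['dynamodb']['Keys']['id']['S']   # assumes a string key named 'id'
        if record['eventName'] in ('INSERT', 'MODIFY'):
            # NewImage is in DynamoDB's typed format; index it as-is for simplicity.
            es.index(index='items', doc_type='item', id=doc_id,
                     body=record['dynamodb']['NewImage'])
        elif record['eventName'] == 'REMOVE':
            es.delete(index='items', doc_type='item', id=doc_id)
    return 'processed {} records'.format(len(event['Records']))
```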

Click here to read how to get started with Amazon ES Service and Kibana on Amazon EMR.

Amazon ES Service Examples

  • Real-time application monitoring
  • Analyze activity logs
  • Analyze Amazon CloudWatch logs
  • Analyze product usage data coming from various services and systems
  • Analyze social media sentiments and CRM data, and find trends for brands and products
  • Analyze data stream updates from other AWS services, such as Amazon Kinesis Streams and DynamoDB
  • Monitor usage for mobile applications
  • e-Commerce filtering and navigation
  • Streaming data analytics
  • Social media sentiment analysis
  • Text search
  • Risk assessment
  • Error detection

AWS Case Study: MLB Advanced Media Using Amazon ES Service as Part of a New Data Collection and Analysis Tool

MLB Advanced Media (MLBAM) wanted a new way to capture and analyze every play using data-collection and analysis tools. It needed a platform that could quickly ingest data from ballparks across North America, provide enough compute power for real-time analytics, produce results in seconds, and then be shut down during the off season.  It turned to AWS to power its revolutionary Player Tracking System, which is transforming the sport by revealing new, richly detailed information about the nuances and athleticism of the game—information that’s generating new levels of excitement among fans, broadcasters, and teams.

FIGURE 34: THE RELEASE OF AMAZON ES WAS ANNOUNCED AT AWS RE:INVENT 2015. THIS IS THE SLIDE SHOWN IN REGARD TO THE AWS MAJOR LEAGUE BASEBALL USE CASE (view YouTube video here)

You can read the story here.

AWS Marketplace Solutions for Amazon ES, Logstash, and Kibana (ELK)

There are quite a few software solutions available from popular software vendors on AWS Marketplace that help implement Amazon Elasticsearch Service. You can browse them here. Some of them provide the entire ELK Stack.

ELK Stack (PV) built by Stratalux: the ELK Stack is the leading open-source centralized log management solution for companies that want the benefits of a centralized logging solution without the enterprise software price. It provides a centralized and searchable repository for all your infrastructure logs, giving you a unique and holistic insight into your infrastructure. The ELK Stack built by Stratalux AMI is configured with all the basic components that together make a complete working solution: the Logstash server, the Kibana web interface, Elasticsearch storage, and the Redis data structure server. Simply install it, point your Logstash agents to this AMI, and begin searching through your logs and creating custom dashboards. It also provides a sandbox environment to try out different functions. With over five years of experience, Stratalux is the leading cloud-based managed services company for the ELK Stack on AWS. An image of this product is shown below:

FIGURE 35: STRATALUX ELK STACK (PV) OFFERING IN AWS MARKETPLACE

Amazon Machine Learning: Highly Scalable Predictive Analytics

TABLE 14: AMAZON MACHINE LEARNING – CREATE ML MODELS WITHOUT LEARNING COMPLEX ML ALGORITHMS

Amazon Machine Learning (ML) Overview
You see machine learning in action every day. Websites make suggestions on products you’re likely to buy based on past purchases, you get an alert from your bank if they suspect a fraudulent transaction and you get emails from stores when items you typically buy are on sale.

With Amazon Machine Learning, anyone can create ML models via Amazon ML’s learning and visualization tools and wizards without having to learn complex machine learning algorithms and technology. Amazon ML can create models based on data stored in Amazon S3, Amazon Redshift, or Amazon RDS. There is no set-up cost and you pay as you go, so you can start small and scale as your application grows, and you don’t have to manage any infrastructure required for the large amount of data used in machine learning.

Amazon ML Features Enabling Large-Scale Analytics
Built-in wizards guide you through the steps of interactively exploring your data and training the ML model by finding patterns in existing data, then using those patterns to make predictions from new data as it becomes available. You’re guided through the process of measuring the quality of your models, evaluating the accuracy of predictions, and fine-tuning the predictions to align with business goals. You don’t have to implement custom prediction generation code.

FIGURE 36: FINE-TUNING AMAZON MACHINE LEARNING INTERPRETATIONS

Amazon ML can generate billions of predictions daily, and serve those predictions in low-latency batches or in real-time at high throughput.
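As a rough illustration of that real-time mode, here is a hedged boto3 sketch that creates a real-time endpoint for an existing Amazon ML model and requests a single prediction. The model ID and feature names are hypothetical placeholders.

```python
import boto3

ml = boto3.client('machinelearning', region_name='us-east-1')

MODEL_ID = 'ml-EXAMPLEMODELID'   # hypothetical model trained earlier from S3/Redshift/RDS data

# One-time step: expose the model through a low-latency real-time endpoint.
# (The endpoint can take a few minutes to become READY after this call.)
endpoint = ml.create_realtime_endpoint(
    MLModelId=MODEL_ID)['RealtimeEndpointInfo']['EndpointUrl']

# Request one prediction; Amazon ML expects all record values as strings.
response = ml.predict(
    MLModelId=MODEL_ID,
    Record={'customer_age': '34', 'monthly_spend': '120.50', 'plan': 'premium'},
    PredictEndpoint=endpoint)

print(response['Prediction'])   # predictedLabel / predictedValue plus scores
```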

List of Amazon ML Analytics Examples
Amazon ML can perform document classification to help you process unstructured text and take actions based on content from forms, emails and product reviews, for example. You can process free-form feedback from your customers, including email messages, comments or phone conversation transcripts, and recommend actions based on their concerns. One example would be using Amazon ML to analyze social media traffic to discover customers who have a product support issue, and connect them with the right customer care specialists.

Other examples you can perform with Amazon ML include the following:
• Predict customer churn
• Detect fraud
• Personalize content
• Build propensity models for marketing campaigns
• Predict readmission through patient risk stratification
• Predict whether a website comment is spam
• Forecast product demand
• Predict user activity

AWS Marketplace Solutions for Amazon ML
There are many solutions from leading software vendors available on AWS Marketplace, some of which are highlighted below.

BigML PredictServer is a dedicated machine image that you can deploy in your own AWS account to create blazingly fast predictions from your BigML models and ensembles.
PredictServer is ideal for real-time scoring and/or for very large batch predictions (millions and upwards). The dedicated in-memory prediction server guarantees fast and consistent prediction rates, and the built-in dashboard makes it easy to track performance. Models and ensembles are cached directly from bigml.io, and predictions can be created with API calls similar to those of the BigML.io API and/or through BigML’s command-line tool, bigmler. You can deploy BigML PredictServer in a region closer to your application servers to reduce latency, or even in a VPC.

FIGURE 37: YOU HAVE AN APPLICATION USING BIGML TO MAKE PREDICTIONS; IT SPECIFIES WHICH MODELS & ENSEMBLES TO DOWNLOAD TO YOUR BIGML PREDICTSERVER

FIGURE 38: THEN, CHANGE YOUR CONNECTION TO USE PREDICT SERVER FOR PREDICTIONS

BigML also supports text analytics.

FIGURE 39: BIGML TEXT ANALYTICS

Zementis ADAPA Decision Engine is a predictive analytics decision engine based on the PMML (Predictive Model Markup Language) standard.

FIGURE 40: ZEMENTIS ADAPA DECISION ENGINE IS BASED ON THE INDUSTRY STANDARD PMML MARKUP LANGUAGE

With ADAPA, deploy one or many predictive models from data mining tools like R, Python, KNIME, SAS, SPSS, SAP, FICO, and many others. Score your data in real time using web services, or use ADAPA in batch mode for Big Data scoring directly from your local file system or an Amazon S3 bucket. As a central solution for today’s data-rich environments, ADAPA delivers precise insights into customer behavior and sensor information. Highlights from the listing include:
• Predictive Analytics Using Vendor-neutral Standards: ADAPA uses the Predictive Model Markup Language (PMML) industry standard to import and deploy predictive algorithms and machine learning models. ADAPA can understand any version of PMML and is compatible with most data mining tools, open source and commercial.
• Model Deployment Made Easy: ADAPA allows one or many predictive models to be deployed at the same time. It executes many algorithms, from simple regression models to the most complex machine learning ensembles, e.g., Random Forest and boosted models.
• Scoring at the Speed of Business: ADAPA is able to instantly transform your scores into business decisions. The use of PMML-based rules allows different score ranges to be paired with specific business decisions. Applications range from fraud detection and risk scoring to marketing campaign optimization and sensor data processing in the Internet of Things (IoT).

FIGURE 41: ZEMENTIS ADAPA DECISION ENGINE SUPPORTS MANY TYPES OF ANALYSES

AWS Storage and Database Options for Use in Big Data Analytics with Use Cases

TABLE 15: AWS STORAGE OPTIONS FOR BIG DATA ANALYTICS

AWS Big Data Analytics Storage and Database Options Overview
AWS has a broad set of engines for storing data throughout a big data analytics lifecycle. Each has a unique combination of performance, durability, availability, scalability, elasticity and interfaces.

Most big data analytics infrastructures and application architectures employ multiple storage technologies in concert, each of which has been selected to satisfy the needs of a particular subclass of data storage, or for the storage of data at a particular point in its lifecycle. These combinations form a hierarchy of data storage tiers. The image below gives one example of using a combination of data storage and database usage during a big data analytics workflow:

FIGURE 42: WITH AWS YOU CAN BUILD AN ENTIRE ANALYTICS APPLICATION TO POWER YOUR BUSINESS

Amazon S3 for Storage for Big Data Advanced Analytics
Please refer to the section above on the benefits of Amazon S3 for storage in big data analytics here.

Amazon DynamoDB Database for Big Data Advanced Analytics
Amazon DynamoDB is a fast, flexible and fully managed NoSQL database service for all applications – mobile, web, gaming, ad tech, IoT and more – that need a consistent, single-digit millisecond latency at any scale. It supports both document and key-value store models.

DynamoDB supports storing, querying, and updating documents. It supports three data types – number, string, and binary – in both scalar and multi-valued sets. Using the AWS SDK, you can write applications that store JSON documents directly in Amazon DynamoDB tables. This capability reduces the amount of code you have to write to insert, update, and retrieve JSON documents, and it lets you perform powerful database operations like nested JSON queries using only a few lines of code.

FIGURE 43: AMAZON DYNAMODB STORING JSON DOCUMENTS

Other document formats Amazon DynamoDB supports are XML and HTML. Tables don’t have to have a fixed schema, so each data item can have a different number of attributes. The primary key can be either a single-attribute hash key or a composite hash-and-range key.

In addition to querying the primary key, you can query non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes. DynamoDB provides eventually consistent reads by default, optional strongly consistent reads, and implicit item-level transactions for put, update, delete, conditional operations, and atomic increment/decrement.
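A short boto3 sketch of the points above: storing a nested JSON document, reading it back with a strongly consistent read, and querying a non-key attribute through a Global Secondary Index. The table name, key names, and the 'title-index' GSI are assumptions for illustration, not something defined in this post.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('Products')   # assumed table with partition key 'product_id'

# Store a nested JSON document directly -- no custom serialization code required.
table.put_item(Item={
    'product_id': 'B0001',
    'title': 'Trail Running Shoe',
    'price': 89,
    'attributes': {'color': ['red', 'black'], 'sizes': [8, 9, 10]},
    'reviews': [{'user': 'pat', 'stars': 5}, {'user': 'lee', 'stars': 4}],
})

# Reads are eventually consistent by default; opt into a strongly consistent read.
item = table.get_item(Key={'product_id': 'B0001'}, ConsistentRead=True)['Item']
print(item['attributes']['color'])

# Query a non-primary-key attribute via an assumed Global Secondary Index.
matches = table.query(IndexName='title-index',
                      KeyConditionExpression=Key('title').eq('Trail Running Shoe'))
print(matches['Count'])
```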

There is no limit to the amount of data you can store in an Amazon DynamoDB table; the service automatically allocates more storage as you store more data. DynamoDB Streams captures all data activity that happens on your table and lets you set up replication from one geographic region to another for greater availability.

DynamoDB integrates with AWS Lambda to provide triggers that alert you when things change in your DynamoDB tables. It also integrates with Amazon Elasticsearch Service, using the Amazon DynamoDB Logstash plugin, to search Amazon DynamoDB content for things like messages, locations, tags, and keywords. It integrates with Amazon EMR, so Amazon EMR can analyze data sets stored in DynamoDB while keeping the original data set intact. It integrates with Amazon Redshift to perform complex data analysis queries, including joins with other tables in the Amazon Redshift cluster. DynamoDB integrates with AWS Data Pipeline to automate data movement and transformation into and out of Amazon DynamoDB. It also integrates with Amazon S3 for analytics, AWS Import/Export, backup, and archive.

Common use cases for DynamoDB include:
• Gaming
• Mobile applications
• Digital Ad serving
• Live voting
• Sensor networks
• IoT
• Log ingestion
• Access control for e-Commerce shopping carts or other web-based content
• Web session management

Amazon DynamoDB can be the storage backend for Titan, enabling you to store and traverse Titan graphs of any size, up to hundreds of billions of vertices and edges, in fully managed DynamoDB tables distributed across a multi-machine cluster.

Amazon Redshift Data Warehouse for Storage in Big Data Advanced Analytics
Please refer to the section above on the benefits of Amazon Redshift for a data warehousing solution in big data analytics here.

Amazon Aurora Database for Big Data Advanced Analytics
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora delivers five times the throughput of standard MySQL open source databases. This performance is achieved by tightly integrating the database engine with an SSD-backed virtualized storage layer that’s fault-tolerant and self-healing; disk failures are repaired in the background without loss of database availability. Aurora automates most administrative tasks and enables point-in-time recovery of your instances. It can help cut your database costs by 90% or more while improving reliability and providing high availability. It tolerates failures and fixes them automatically, backs up your data continuously and automatically to Amazon S3, and keeps six copies of your data replicated across three Availability Zones.

Amazon Aurora is compatible with MySQL 5.6 so that any existing MySQL applications and tools can run on Aurora without modification. It’s managed by Amazon RDS, which takes care of complicated administration tasks like provisioning, patching, and monitoring.

Historical data analysis is the most common type of big data analytics implemented on Aurora. The key benefit of Aurora vs. other relational databases is its scalability: it can process terabytes of real-time data daily and scale to millions of transactions per minute. If you need more throughput, you can add replicas, up to 15 of them, and storage automatically scales up as needed to 64 TB.

FIGURE 44: ARCHITECTURAL DIAGRAM OF THE RELATIONSHIP BETWEEN AMAZON AURORA CLUSTER VOLUMES & THE PRIMARY & REPLICAS IN AN AURORA DB CLUSTER
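Here's a minimal, hedged boto3 sketch of that scaling pattern: adding one more Aurora Replica to an existing cluster and looking up the cluster's reader endpoint, which spreads read traffic across all replicas. The cluster and instance identifiers and the instance class are placeholders.

```python
import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Add another reader to an existing Aurora cluster (repeat up to the 15-replica limit).
rds.create_db_instance(
    DBInstanceIdentifier='analytics-aurora-reader-2',   # placeholder name
    DBInstanceClass='db.r3.large',                      # placeholder instance class
    Engine='aurora',                                    # MySQL 5.6-compatible Aurora
    DBClusterIdentifier='analytics-aurora-cluster')     # placeholder cluster

# Point read-heavy analytics at the cluster's reader endpoint so connections are
# balanced across every replica automatically.
cluster = rds.describe_db_clusters(
    DBClusterIdentifier='analytics-aurora-cluster')['DBClusters'][0]
print(cluster['ReaderEndpoint'])
```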

Common use cases for Amazon Aurora include:
• Data warehouse analytics
• Website responsiveness
• Content Management
• IoT data analysis
• Transaction processing
• Great for any enterprise application that uses a relational database
• SaaS applications that need flexibility in instance and storage scaling
• Web and mobile applications

Amazon EC2 Instance Store Volumes for Big Data Advanced Analytics
Amazon EC2 provides flexible, cost-effective, and easy-to-use data storage for your instances. Each option has a unique combination of performance and durability. These options can be used independently or in combination to suit your requirements. The storage option that best fits running advanced analytics is called “Amazon EC2 Instance Store Volumes”.

Amazon EC2 Instance Store Volumes (also called ephemeral drives) provide temporary block-level storage for many EC2 instance types. The storage-optimized Amazon EC2 instance family provides special-purpose instance storage targeted to specific use cases. HI1 instances provide very fast solid-state drive (SSD)-backed instance storage capable of supporting over 120,000 random read IOPS, and are optimized for very high random I/O performance and low cost per IOPS.

Example applications well-suited to HI1 storage-optimized EC2 Instance Store Volumes include data warehouses, Hadoop storage nodes, seismic analysis, cluster file systems, etc. Note, however, that the data on instance store volumes is lost if the Amazon EC2 instance is stopped, terminated, or fails.

AWS Marketplace Solutions to Augment AWS Storage and Database Services for Big Data Advanced Analytics

TABLE 16: AWS MARKETPLACE SOLUTIONS TO AUGMENT AWS STORAGE & DATABASE SOLUTIONS FOR BIG DATA ADVANCED ANALYTICS

Overview of AWS Marketplace Solutions to Augment AWS Big Data Analytics Storage and Database Services
There’s an abundance of AWS Marketplace software solutions from top vendors that augment these AWS built-in solutions for storage and databases used in big data analytics.

AWS Marketplace Solutions to Augment Amazon S3
There are many AWS Marketplace solutions from top software vendors that augment the functionality of Amazon S3 or interact with it as part of complete, end-to-end, out-of-the-box solutions. I’ll mention a couple of them below, but you can browse for yourself here.

Attunity CloudBeam for Amazon S3, EMR, and Hadoop was described in an earlier section. Click here to review that section again.

Matillion ETL for Redshift was also described in an earlier section. Matillion first loads data into Amazon S3 prior to Redshift ingestion. Click here to return to review that section again.

Informatica Cloud for Amazon S3 provides native, high-volume connectivity to S3. It is designed and optimized for data integration between cloud and on-premises data sources and S3 as an object store. It handles special characters within data sets, Unicode characters, escape characters, and multiple formats of delimited files. It also supports multi-part upload and download to/from S3. It allows you to develop and run data integration tasks (mappings), task flows, and unlimited scheduling, restricted to use with S3 only. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of S3 storage and includes:
• Cloud Designer: a cloud-based service that enables visual design, development, and deployment of data integration mappings (static data flows), with a simple 6-step wizard to support the needs of citizen integrators
• Informatica Cloud Data Synchronization Service
• Secure Agent: a lightweight binary, installed on your AMI, that runs in the AWS EC2 environment to access the Informatica Cloud services located in Informatica’s hosted environment
• One instance of the Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon S3 as the target and one connector as the source, running on SUSE Linux Enterprise Server 11

AWS Marketplace Solutions to Augment Amazon DynamoDB
The AWS Marketplace has independent software vendors that augment Amazon DynamoDB, in addition to solutions that offer complete graph solutions other than Titan. Below you’ll find some selected solutions:

Informatica Cloud for Amazon DynamoDB provides native, high-volume connectivity to DynamoDB. It is designed to catalog data from data sources such as SQL, NoSQL, and social into a single DynamoDB store and take advantage of the high throughput and scale of DynamoDB. It saves cost by temporarily increasing write capacity or read capacity as needed. It allows you to develop and run data integration tasks (mappings), task flows, and unlimited scheduling, restricted to use with DynamoDB only. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of DynamoDB storage and includes:
• Cloud Designer: a cloud-based service that enables visual design, development, and deployment of data integration mappings (static data flows), with a simple 6-step wizard to support the needs of citizen integrators
• Informatica Cloud Data Synchronization Service
• Secure Agent: a lightweight binary, installed on your AMI, that runs in the AWS EC2 environment to access the Informatica Cloud services located in Informatica’s hosted environment
• One instance of the Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon DynamoDB as the target and one connector as the source, running on SUSE Linux Enterprise Server 11

As mentioned earlier, Amazon DynamoDB integrates with Titan graphs. Below I’ll mention one of the solutions available from top software vendors that offers analytics and graphing capabilities beyond Titan, but you can browse them yourself here.

MicroStrategy Analytics Platform is a powerful Mobile and Business Intelligence solution that enables leading organizations to quickly analyze vast amounts of data and distribute actionable business insight throughout the enterprise. MicroStrategy enables users to conduct ad hoc analysis and share their insights anywhere, anytime with reports, documents, and dashboards delivered via Web or mobile devices. Anyone can create dashboards with stunning visualizations, explore dynamic reports to investigate performance, graph data instantly, drill into areas of concern, and export information into any format.

FIGURE 44: MICROSTRATEGY’S INTERACTIVE REPORTS WITH FILTERS

Users benefit from powerful, sophisticated statistical analysis that yields new critical business insights. Uniquely positioned at the nexus of analytics, security, and mobility, MicroStrategy delivers superior analytics and mobile applications secured with advanced authentication, enhanced user administration, and user authentication tracking. MicroStrategy’s software is built for AWS and is certified with numerous AWS services such as Amazon Redshift and Amazon RDS, and MicroStrategy is an AWS Advanced Technology Partner.

FIGURE 45: MICROSTRATEGY’S DATA DISCOVERY TECHNOLOGY HELPS YOU COMBINE INFORMATION FROM DIFFERENT SYSTEMS WITHOUT COMPLICATED SCRIPTS, DATA MODELS, OR HELP FROM IT

AWS Marketplace Solutions to Augment Amazon Redshift
The AWS Marketplace has independent software vendors that augment or work in tandem with Amazon Redshift. Some will be highlighted below, but you can browse the many solutions available in the AWS Marketplace that work with Amazon Redshift here.

Matillion ETL for Redshift was also described in an earlier section. Click here to return to review that section again.

Attunity CloudBeam for Amazon Redshift enables organizations to simplify, automate, and accelerate data loading and near real-time incremental changes from on-premises sources (Oracle, Microsoft SQL Server, and MySQL) to Amazon Redshift.

FIGURE 46: ATTUNITY CLOUDBEAM FOR AMAZON REDSHIFT’S REPLICATION DIAGRAM

Attunity CloudBeam allows your team to avoid the heavy lifting of manually extracting data, transferring via API/script, chopping, staging, and importing. A click-to-load solution, Attunity CloudBeam is easy to set up and allows organizations to start validating or realizing the benefits of Amazon Redshift in just minutes. Its zero-footprint technology reduces the impact on IT operations with log-based capture and delivery of transaction data that does not require Attunity software to be installed on each source and target database, while delivering accelerated, secured, and guaranteed delivery of data.

AWS Marketplace Solutions to Augment Amazon Aurora
The AWS Marketplace has services from top software vendors that augment or work in tandem with Amazon Aurora. You can view them all here, but below you’ll find two Informatica solutions.

Informatica Cloud for Amazon Aurora (Windows) provides native, high-volume connectivity to Aurora and is optimized for Oracle-to-Aurora migration. It allows you to develop and run data integration tasks (mappings), data synchronization and replication tasks, task flows, and unlimited scheduling, restricted to use with Aurora only. It supports single inserts and batched statements, as well as more advanced capabilities such as creating tables on the fly, custom queries, look-ups, joiners, filters, expressions, and sorters. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of Aurora storage and includes:
• Cloud Designer: a cloud-based service that enables visual design, development, and deployment of data integration mappings (static data flows), with a simple 6-step wizard to support the needs of citizen integrators
• Informatica Cloud Data Synchronization Service
• Secure Agent: a lightweight binary, installed on your AMI, that runs in the AWS EC2 environment to access the Informatica Cloud services located in Informatica’s hosted environment
• One instance of the Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon Aurora (MySQL) as the target and one connector as the source, running on Windows Server 2012 R2

Informatica Cloud for Amazon Aurora (Linux) has the same features as the Windows version above, except that it runs on SUSE Linux Enterprise Server 11 instead of Windows Server 2012 R2, again with one connector for Amazon Aurora (MySQL) as the target and one connector as the source.

AWS Marketplace Solutions to Augment Amazon EMR
The AWS Marketplace has solutions from independent software vendors that augment your big data analytics solutions built on Amazon EMR.

Syncsort DMX-h, Amazon EMR Edition is designed for Hadoop and now deployed on Amazon EMR. It helps organizations propel their Big Data initiatives, becoming productive and delivering results with Hadoop and Amazon EMR in almost no time.

FIGURE 47: SYNCSORT DMX-H, AMAZON EMR EDITION HELPS PROPEL BIG DATA INITIATIVES BY LESSENING THE LEARNING CURVE

Syncsort DMX-h is the only Hadoop ETL application available for EMR. Syncsort DMX-h delivers: 1) Blazingly fast, easy to use Hadoop ETL in the Cloud. 2) A graphical user interface for developing & maintaining MapReduce ETL jobs.

FIGURE 47: SYNCSORT’S GRAPHICAL USER INTERFACE

3) A library of Use Case Accelerators to fast-track development. 4) Unbounded scalability at a disruptively low price. With Syncsort Ironcluster (10 Nodes) you can test, pilot and perform Proof of Concepts for free on up to ten Hadoop EMR nodes. 30 days of free phone and email support are also available.

FIGURE 48: SYNCSORT IS AVAILABLE IN AWS MARKETPLACE

MapR Enterprise Edition Plus Spark includes 24/7 support for the MapR Enterprise Edition plus the Apache Spark stack. IMPORTANT: Use MapR Standard Cluster with VPC Support delivery method to launch your cluster. This edition provides a standards-based enterprise-class distributed file system, complete with high availability and disaster recovery features.

FIGURE 49: MAPR ENTERPRISE EDITION PLUS SPARK’S FEATURES OF HIGH AVAILABILITY, PERFORMANCE, & DISASTER RECOVERY

Also included is a broad range of technologies like data processing with Spark, machine learning with MLlib, SQL with Spark SQL, graph processing with GraphX, and YARN for resource management.

FIGURE 50: MAPR ENTERPRISE EDITION SUPPORTS A BROAD RANGE OF HADOOP FRAMEWORKS

With the browser-based management console, MapR Control System, you can monitor and manage your Hadoop cluster easily and efficiently.

Other Notable Marketplace Solutions to Augment AWS Built-In Storage and Databases for Big Data Analytics
The AWS Marketplace has services from top software vendors that augment or work in tandem with AWS Big Data Storage and Database Services. Some notable choices are listed below.

Looker Analytics Platform for AWS allows anyone in your business to quickly analyze and find insights in your Redshift and RDS datasets. By connecting directly to your AWS instance, Looker opens up access to high-resolution data for detailed exploration and collaborative discovery, building the foundation for a truly data-driven organization.

FIGURE 51: LOOKER – ANALYTICS EVOLVED…HADOOP & AMAZON REDSHIFT

To help you get started quickly, the Looker for AWS license includes implementation services from Looker’s team of expert analysts, and throughout your entire subscription you’ll receive unlimited support from a live analyst using the in-app chat functionality. Purpose-built to leverage the next generation of analytic databases, like Amazon Redshift, and to live in the cloud, Looker takes an entirely new approach to business intelligence. Unlike traditional BI tools, Looker doesn’t move and store your data; instead, it optimizes data discovery within the database itself. Using Looker’s modern data modeling language, called LookML, data analysts create rich experiences so that end users can self-serve their own data discovery. Key to LookML is its reusability: measures and dimensions are created in only one place and then consistently (and automatically) reused in all relevant views of that same data concept, creating a single source of truth across your organization. The platform provides:
• Powerful data discovery, including contextual filtering, pivoting, sequencing, and cohort tiering, so your entire organization can ask questions, share views, and collaborate, all from within the browser, on any device
• A live connection to the database using the LookML data modeling language and a browser-based agile development IDE, so data analysts can call any Redshift function, such as sortkeys, distkeys, and HyperLogLog, for advanced performance and insights
• A wide set of visualizations, including scatter, table, bar, and line charts, a streamlined approach to dashboarding, and the ability to embed visualizations in any web application

Mapping by MapLarge provides a 5-user license for the MapLarge Mapping Engine, a high-performance geospatial visualization platform that dynamically renders data for interactive analysis and collaboration. It scales to millions of records and beyond, has an intuitive user interface, and offers robust APIs that allow complete customization. For more information visit http://maplarge.com.

FIGURE 52: EXAMPLES OF SOME OF THE GEOSPATIAL VISUALIZATIONS THAT CAN BE PRODUCED BY MAPLARGE

The Teradata Database Developer (Single Node) with SSD local storage is the same full-featured data warehouse software that powers analytics at many of the world’s greatest companies. Teradata Database Developer includes Teradata Columnar and rights to use Teradata Parallel Transporter (TPT), including the Load, Update, and Export operators; Teradata Studio; and Teradata Tools and Utilities (TTU). These tools are included with the Teradata Database AMI or available as a free download. In addition to the Teradata Database, your subscription includes rights to use the following products, which are listed in the AWS Marketplace: Teradata Data Stream Utility, Teradata REST Services, Teradata Server Management, and Teradata Viewpoint (Single Teradata System). With Teradata Database, customers get quick time to value and low total cost of ownership. Applications are portable across cloud and on-premises platforms, and there is no re-training required. Teradata 15.10 is the newest release of the Teradata Database, bringing industry-leading features for enhanced data fabric support, fast JSON performance, and the world’s most advanced hybrid row/column table storage, which accelerates query processing by enabling high performance for selective queries on tables with many columns while also allowing pinpoint access to single rows by operational queries.

HPE Vertica Analytics Platform is enterprise-class analytics that fits your budget. Until now, enterprise-class Big Data analytics in the cloud was simply not available; current cloud analytics offerings lack critical enterprise features such as fine-tuning capabilities, integrated BI/reporting, data ingestion, and more. With HPE Vertica Analytics Platform for Amazon Web Services, you can tap into all of the core enterprise capabilities and more. It offers you the flexibility to start small and grow as your business grows, and you get analytics functionality that no other cloud analytics provider can offer. The HPE Vertica Analytics Platform also runs on-premises on industry-standard hardware as well as in the cloud, so you can get started immediately with your analytics initiative via the cloud or the deployment model that makes sense for your business, without compromises or limits. It provides optimized data ingestion for high performance, fast query optimization for quick insight, comprehensive SQL and extensions for true openness, enhanced data storage for cost efficiency, and ease of administration for true reliability.

Zoomdata for AWS is the Fastest Visual Analytics for Big Data and includes smart connectors for Redshift, S3, Kinesis, Apache Spark, Cloudera, Hortonworks, MapR, Elastic, real time, SQL and NoSQL sources.

FIGURE 53: ZOOMDATA LEVERAGES SMART CONNECTORS TO CONNECT TO MANY TYPES OF DATA SOURCES, INCLUDING ON-PREMISES

Sign up for a free trial today, and you’ll be visualizing billions of rows of data in seconds! Free support is available for users who register at http://go.zoomdata.com/awstrial. Using patented Data Sharpening and micro-query technologies, Zoomdata visualizes Big Data in seconds, even across billions of rows of data. Zoomdata is designed for the business user — not just data analysts — via an intuitive user interface that can be used for interactive dashboards or embedded in a custom application.

FIGURE 54: AN EXAMPLE OF ZOOMDATA’S VISUALIZATION DASHBOARDS

Built for Big Data: By taking the query to the data, Zoomdata leverages the power of modern databases to visualize billions of data points in seconds. Includes Redshift, S3, Cloudera, Solr, and Hortonworks connectors.

FIGURE 55: ZOOMDATA HAS CONNECTORS TO MANY DIFFERENT DATA SOURCES

TIBCO Clarity is the data cleaning and standardization component of the TIBCO Software System. It serves as a single solution for business users to handle massive, messy data across various applications and systems, such as TIBCO Jaspersoft, Spotfire, Marketo, and Salesforce. The quality of data impacts your decision-making, so data coming from external sources such as SaaS applications or partners needs to be validated before being used in your systems. TIBCO Clarity makes it easy for business users to profile, standardize, and transform data so that trends can be identified and smart decisions can be made quickly.

FIGURE 56: TIBCO CLARITY PROVIDES THE UBER-IMPORTANT STEP OF DATA CLEANSING, AUGMENTING, & ENHANCEMENT

TIBCO Clarity provides an easy-to-use web environment, and since it’s a cloud-based subscription service, it only requires an investment relative to your usage of the service. De-duplication: TIBCO Clarity discovers duplicate records in a dataset using configurable fuzzy-match algorithms. Seamless integration: you can collect your raw data from disparate sources in a variety of data formats, such as files, databases, and spreadsheets, both cloud and on-premises. Data discovery and profiling: TIBCO Clarity detects data patterns and data types for auto-metadata generation, enabling profiling of row and column data for completeness, uniqueness, and variation.

AWS Services for Data Collection

TABLE 17: AWS DATA COLLECTION OPTIONS

AWS Data Collection Overview
Before you can do any big data analytics using AWS services, you have to get the data loaded into an AWS storage location. This is a crucial step, and it can inhibit a company’s first move into a cloud-based environment: the process can seem complex and time consuming, and concerns about how to recode and convert your data to another format can seem daunting. However, AWS has many different services to help you move your data onto AWS, whether you’re loading from numerous external sources, integrating with your on-premises infrastructure, or migrating data from an existing data center.

AWS Database Migration Service (DMS)
With just a few clicks, AWS Database Migration Service starts migrating your data while your original database stays live. AWS DMS handles all the complexity, and you can even replicate back to your original database, or to other databases in different Regions or Availability Zones. Heterogeneous migration is taken care of by the AWS Schema Conversion Tool: migration assessment and code conversion are handled for you, the source database schema and code are converted into a format compatible with the target database, and any code that can’t be converted is marked for manual conversion. Costs start at $3.00 per TB.
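For a sense of how that looks in code, here is a hedged boto3 sketch that creates and starts a full-load-plus-CDC replication task between a source and target endpoint assumed to exist already; all ARNs, names, and the table filter are placeholders.

```python
import json
import boto3

dms = boto3.client('dms', region_name='us-east-1')

# Full load plus ongoing change data capture, so the source database stays live.
task = dms.create_replication_task(
    ReplicationTaskIdentifier='orders-to-target',
    SourceEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE',
    TargetEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:TARGET',
    ReplicationInstanceArn='arn:aws:dms:us-east-1:123456789012:rep:INSTANCE',
    MigrationType='full-load-and-cdc',
    TableMappings=json.dumps({
        'rules': [{
            'rule-type': 'selection', 'rule-id': '1', 'rule-name': 'include-orders',
            'object-locator': {'schema-name': 'sales', 'table-name': 'orders%'},
            'rule-action': 'include'}]}))

dms.start_replication_task(
    ReplicationTaskArn=task['ReplicationTask']['ReplicationTaskArn'],
    StartReplicationTaskType='start-replication')
```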

AWS Import/Export Snowball
AWS Import/Export Snowball is an easy, secure, and affordable solution for even the biggest (petabyte-scale) data transfer jobs via a highly secure hardware appliance. You don’t need to purchase any hardware: with just a few clicks in the AWS Management Console you create a job, and a Snowball appliance will be shipped to you, or up to 50 of them if you need them. When it arrives, you attach the appliance to your network, download and run the Snowball client to establish a connection, then use the client to select the file directories you want to transfer. Snowball encrypts and transfers the files at extremely high speed. When the transfer is complete, you ship the appliance back using the free shipping label supplied. Snowball uses multiple layers of security to protect your data, including tamper-resistant enclosures, 256-bit encryption, and an industry-standard Trusted Platform Module (TPM) designed to ensure both security and full chain of custody for your data. Snowball unloads your data into Amazon S3, and from there you can use any AWS service you need.

Amazon S3 Transfer Acceleration
Amazon S3 Transfer Acceleration can be used when your upload speeds to Amazon S3 are sub-optimal, which typically happens when clients are geographically distant from the bucket’s Region. It enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket by taking advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, it is routed to Amazon S3 over an optimized network path.
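A small boto3 sketch of how you might enable and use it: turn acceleration on for a bucket, then upload through a client configured to use the accelerate endpoint. The bucket and file names are placeholders.

```python
import boto3
from botocore.config import Config

s3 = boto3.client('s3')

# One-time: enable Transfer Acceleration on the bucket (placeholder bucket name).
s3.put_bucket_accelerate_configuration(
    Bucket='analytics-landing-zone',
    AccelerateConfiguration={'Status': 'Enabled'})

# Uploads through this client are routed via the nearest edge location.
s3_accel = boto3.client('s3', config=Config(s3={'use_accelerate_endpoint': True}))
s3_accel.upload_file('clickstream-2017-05-23.gz',
                     'analytics-landing-zone',
                     'raw/clickstream-2017-05-23.gz')
```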

AWS Direct Connect
AWS Direct Connect makes it easy to establish a dedicated, private network connection between AWS and your premises, data center, or co-location environment. This increases bandwidth and throughput while reducing network costs, and it provides a more consistent network experience than Internet-based connections. The dedicated connection can be partitioned into multiple virtual interfaces, so you can use the same connection to access public and private resources while maintaining separation between the environments.

AWS Storage Gateway
AWS Storage Gateway is a service that connects an on-premises software appliance with cloud-based storage to provide seamless and secure integration between an organization’s on-premises IT environment and AWS’s storage infrastructure. The service allows you to securely store data in the AWS cloud for scalable and cost-effective backups without buying more storage or managing the infrastructure, and you pay only for what you use. It supports industry-standard storage protocols that work with your existing applications, and it provides low-latency performance by maintaining frequently accessed data on-premises while securely storing all your data, encrypted, in Amazon S3 or Amazon Glacier. The AWS Storage Gateway appliance is software that sits in your data center between your applications and your storage infrastructure.

FIGURE 57: AWS STORAGE GATEWAY SOFTWARE APPLIANCE DIAGRAM

Without making any changes to your applications, it backs up your data with SSL encryption, and you can pull your old data back when needed.

Amazon Kinesis Streams
Amazon Kinesis Streams captures large amounts of data (terabytes per hour) in real time from data producers and streams it into custom applications for data processing and analysis. Kinesis replicates streaming data across three Availability Zones to ensure reliability.

Amazon Kinesis Streams is capable of scaling from megabytes up to terabytes per hour of streaming data, but in contrast to Firehose, you have to provision the capacity manually. Amazon provides a “shard calculator” (see the image below) when creating a Kinesis stream so you can provision the appropriate number of shards for the volume of data you’re going to process. Once the stream is created, you can scale the number of shards up or down to meet demand.

FIGURE 58: AWS KINESIS STREAMS SHARD CALCULATOR
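Here's a rough boto3 sketch of the same provisioning logic the shard calculator applies (one shard ingests about 1 MB/s or 1,000 records/s), followed by creating the stream, writing a record, and later resizing it. The stream name, record size, and rates are illustrative assumptions.

```python
import math
import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

# Rough shard estimate: each shard ingests ~1 MB/s and ~1,000 records/s.
avg_record_kb, records_per_sec = 2, 3000
shards = max(int(math.ceil(avg_record_kb * records_per_sec / 1024.0)),
             int(math.ceil(records_per_sec / 1000.0)))

kinesis.create_stream(StreamName='clickstream', ShardCount=shards)
kinesis.get_waiter('stream_exists').wait(StreamName='clickstream')

# Producers write records with a partition key that spreads load across shards.
kinesis.put_record(StreamName='clickstream',
                   Data=b'{"user": "u-123", "page": "/checkout"}',
                   PartitionKey='u-123')

# Later, manually scale the provisioned capacity up (or down) to follow demand.
kinesis.update_shard_count(StreamName='clickstream',
                           TargetShardCount=shards * 2,
                           ScalingType='UNIFORM_SCALING')
```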

AWS Marketplace Solutions to Assist and Augment Data Collection
The AWS Marketplace has services from top software vendors that augment or work in tandem with AWS Services for data ingestion. Many have capabilities beyond data ingestion, including the ability to perform ETL/ELT, data cleansing and much more.

Matillion ETL for Redshift was also described in an earlier section. Click here to return to review that section again.

Attunity CloudBeam for Amazon S3, EMR, and Hadoop was also described in an earlier section. Click here to return to review that section again.

CloudBerry Backup Desktop Edition provides simple and fast backup to the Amazon S3 cloud. CloudBerry Backup is a secure online backup solution that helps organizations store backup copies of their data in online storage. It is a powerful backup-and-restore program designed to leverage Amazon S3 technology to make your disaster recovery plan simple, reliable, and affordable. You can keep your backups in a remote location, access them anywhere you have an internet connection, and rely on strong data encryption to protect your data from unauthorized access.

CloudBasic RDS Deploy for DevOps DLM/Jenkins (SQL Server) enables you to move RDS databases between development and staging environments without access to the RDS file system. Integrate RDS Deploy into your DevOps tools, such as Jenkins and GO, to further automate DLM and achieve true one-click deployments. A sample DevOps scenario involving Jenkins and RDS: the job deploys 3 SQL Server databases from RDS to a standard SQL Server, merges data from two of them into the third, and sends the third database back to RDS as the new production system. The job also flushes any open sessions to the website, takes it offline, and puts it back online when everything is finished. This is all done in a PowerShell script executed from Jenkins, so it can be performed by employees with minimal security access (access only to that job) and all history is recorded. No file-system access to the RDS instance is required; traditional tools require access to the SQL Server file system and cannot be used with AWS RDS. It offers easy integration into DevOps tools such as Jenkins and GO, and a REST API allows RDS DB deployments to be initiated from PowerShell and other tools.

AWS Services for Data Orchestration and Analytic Workflows

TABLE 18: AWS DATA ORCHESTRATION & ANALYTIC WORKFLOW SERVICES

AWS Services for Data Orchestration and Analytic Workflow Overview
Big data advanced analytics solutions very often require automated arrangement, coordination, and management of data once it’s on the AWS cloud, moving data from one AWS service to another for subsequent processing or storage as each step finishes. Automating these workflows ensures that the necessary activities take place when required to drive the analytic processes.

Amazon Simple Workflow Service (SWF)
Amazon SWF allows you to build distributed applications in any programming language with components that are accessible from anywhere. It reduces infrastructure and administration overhead because you don’t need to run orchestration infrastructure. SWF provides durable, distributed-state management that enables resilient, truly distributed applications. Think of SWF as a fully-managed state tracker and task coordinator in the cloud.

Amazon SWF’s key concepts and capabilities include the following:
o Workflows are collections of actions
o Domains are collections of related workflows
o Actions are tasks or workflow steps
o Activity workers implement actions
o Deciders implement a workflow’s execution logic
o Maintains distributed application state
o Tracks workflow executions and logs their progress
o Controls which tasks each of your application hosts will be assigned to execute
o Supports the execution of Lambda functions as “workers”
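To make the activity-worker concept concrete, here is a minimal boto3 sketch of a worker that long-polls a task list, performs a stand-in piece of work, and reports the result back to SWF. The domain, task list, and the "work" itself are placeholder assumptions.

```python
import boto3

swf = boto3.client('swf', region_name='us-east-1')

# Minimal activity worker loop: long-poll for a task, do the work, report back.
while True:
    task = swf.poll_for_activity_task(
        domain='analytics-domain',              # placeholder domain
        taskList={'name': 'etl-task-list'},     # placeholder task list
        identity='worker-1')

    if not task.get('taskToken'):
        continue   # the long poll timed out with no work; poll again

    # Stand-in for the real activity, e.g. kicking off an ETL or EMR step.
    result = 'rows_processed=42'

    swf.respond_activity_task_completed(taskToken=task['taskToken'], result=result)
```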

AWS Data Pipeline
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.

AWS Data Pipeline handles:
o Your jobs’ scheduling, execution, and retry logic
o Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until all of its dependencies are met
o Sending any necessary failure notifications
o Creating and managing any temporary compute resources your jobs may require
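Here is an abbreviated, hedged boto3 sketch of defining and activating a pipeline along the lines of the hourly example above. A production definition would also reference an EMR cluster/activity, S3 log locations, and IAM roles; the object names and fields below are illustrative only.

```python
import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')

pipeline_id = dp.create_pipeline(name='hourly-log-analysis',
                                 uniqueId='hourly-log-analysis-v1')['pipelineId']

# Abbreviated definition: an hourly schedule plus one activity on a managed EC2 resource.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {'id': 'Default', 'name': 'Default', 'fields': [
            {'key': 'scheduleType', 'stringValue': 'cron'},
            {'key': 'schedule', 'refValue': 'HourlySchedule'}]},
        {'id': 'HourlySchedule', 'name': 'HourlySchedule', 'fields': [
            {'key': 'type', 'stringValue': 'Schedule'},
            {'key': 'period', 'stringValue': '1 hour'},
            {'key': 'startAt', 'stringValue': 'FIRST_ACTIVATION_DATE_TIME'}]},
        {'id': 'ProcessLogs', 'name': 'ProcessLogs', 'fields': [
            {'key': 'type', 'stringValue': 'ShellCommandActivity'},
            {'key': 'command', 'stringValue': 'echo processing this hour of logs'},
            {'key': 'runsOn', 'refValue': 'WorkerInstance'}]},
        {'id': 'WorkerInstance', 'name': 'WorkerInstance', 'fields': [
            {'key': 'type', 'stringValue': 'Ec2Resource'},
            {'key': 'instanceType', 'stringValue': 'm3.medium'}]}])

dp.activate_pipeline(pipelineId=pipeline_id)
```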

FIGURE 58: AMAZON DATA PIPELINE HIGH-LEVEL DIAGRAM

Amazon Kinesis Firehose
Amazon Kinesis Firehose is AWS’s data-ingestion product offering for Kinesis. It’s used to capture and load streaming data into other Amazon services such as Amazon S3 or Amazon Redshift. From there, you can load the streams into data processing and analysis tools like Amazon Elastic MapReduce (EMR) or Amazon Elasticsearch Service. It’s also possible to load the same data into Amazon S3 and Amazon Redshift at the same time using Firehose.

Firehose can scale to gigabytes of streaming data per second and allows for batching, encryption, and compression of data. It automatically scales to meet demand. There are several ways to load data into Firehose, including HTTPS, the Kinesis Producer Library, the Kinesis Client Library, and the Kinesis Agent. Monitoring is available through Amazon CloudWatch.
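A short boto3 sketch of the producer side: writing single records and batches to a Firehose delivery stream assumed to already exist and point at S3 (and/or Redshift). The stream name and payloads are placeholders.

```python
import json
import boto3

firehose = boto3.client('firehose', region_name='us-east-1')

stream = 'clickstream-to-s3'   # assumed existing delivery stream

# Single record -- Firehose buffers, batches, and delivers it for you.
firehose.put_record(
    DeliveryStreamName=stream,
    Record={'Data': json.dumps({'user': 'u-123', 'page': '/checkout'}) + '\n'})

# Higher throughput: batch up to 500 records per call.
events = [{'user': 'u-%d' % i, 'page': '/home'} for i in range(100)]
firehose.put_record_batch(
    DeliveryStreamName=stream,
    Records=[{'Data': json.dumps(e) + '\n'} for e in events])
```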

Amazon CloudFront
Amazon CloudFront is a global Content Delivery Network (CDN) service that gives you the ability to distribute your application globally in minutes. In Amazon CloudFront, your content is organized into distributions. A distribution specifies the location or locations of the original version of your files. You store the original versions of your files on one or more origin servers; an origin server is the location of the definitive version of an object. Origin servers can be other Amazon Web Services – an Amazon S3 bucket, an Amazon EC2 instance, or an Elastic Load Balancer – or your own origin server. You create a distribution to register your origin servers with Amazon CloudFront through a simple API call or the AWS Management Console. When configuring more than one origin server, you use URL pattern matches to specify which origin has which content, and you can assign one of the origins as the default. You then use your distribution’s domain name in your web pages, media player, or application. When end users request an object using this domain name, they are automatically routed to the nearest edge location for high-performance delivery of your content. An edge location is where end users access services located at AWS; edge locations exist in most major cities around the world and are used by CloudFront to distribute content to end users with reduced latency.

AWS Data Processing Types

TABLE 19: AWS DATA PROCESSING TYPES

AWS Data Processing Types Overview
Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Big data can be processed and analyzed in two different ways on AWS: batch processing or stream processing.

When deciding which type of processing you need, consider latency: batch processing delivers results with latencies of minutes to hours, while stream processing delivers latencies on the order of seconds or milliseconds.

AWS Batch Processing
Batch processing on AWS is normally done using Amazon S3 for storage pre- and post-processing. Amazon EMR is then used to run managed analytic clusters on top of this data with Hadoop-ecosystem tools like Spark, Presto, Hive, and Pig. Batch processing is often used to normalize the data and then compute arbitrary queries over varying sets of data: it computes results derived from all the data it encompasses and enables deep analysis of large data sets. Once you have the results, you can shut down your Amazon EMR cluster or keep it running for further processing or querying.
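As a hedged illustration of that batch pattern, here is a boto3 sketch that launches a transient EMR cluster, runs one Spark step over data in S3, and shuts itself down when the step finishes. The bucket names, script path, release label, and instance sizes are all placeholder assumptions.

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Transient cluster: read from S3, run one Spark step, then terminate automatically.
response = emr.run_job_flow(
    Name='nightly-batch-analysis',
    ReleaseLabel='emr-5.5.0',
    Applications=[{'Name': 'Spark'}, {'Name': 'Hive'}],
    Instances={
        'InstanceGroups': [
            {'Name': 'Master', 'InstanceRole': 'MASTER',
             'InstanceType': 'm3.xlarge', 'InstanceCount': 1},
            {'Name': 'Core', 'InstanceRole': 'CORE',
             'InstanceType': 'm3.xlarge', 'InstanceCount': 4}],
        'KeepJobFlowAliveWhenNoSteps': False},   # shut the cluster down when done
    Steps=[{
        'Name': 'normalize-and-aggregate',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', 's3://analytics-code/jobs/normalize.py',
                     's3://analytics-raw/2017/05/23/', 's3://analytics-results/']}}],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    LogUri='s3://analytics-logs/emr/')

print(response['JobFlowId'])
```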

You can find an architectural drawing of AWS Batch Processing here.

AWS Stream Processing
Streaming data is generated continuously by thousands of data sources, which typically send records simultaneously in small sizes. This event data needs to be processed sequentially and incrementally on a record-by-record basis over sliding time windows and used for a wide variety of analytics. Information from such analysis incrementally updates metrics, reports, and summary statistics, which gives companies visibility into many aspects of business and consumer activity as it streams into AWS and allows them to respond promptly to emerging situations. Amazon Kinesis Streams or Amazon Kinesis Firehose is used to capture and load the data into a data store.

With AWS stream processing, analytics and decision making are applied to in-motion, transient data with minimal latency. In-motion data can also be filtered and diverted to a data warehouse like Amazon Redshift, where existing business intelligence tools analyze it for deeper background analysis and/or data augmentation.

Sources of streaming data include machine data, sensor-based monitoring devices, messaging systems, IoT devices, and financial market feeds.

You can find an architectural drawing of AWS Time Series Processing, which is a type of Stream Processing, here.

Conclusion
In this day and age of analytics, with more data, more questions to answer, and tougher competition, your success depends on making the right decisions by relying on fast, secure, scalable, durable cloud data analytics, and AWS is the clear leader in this realm by leaps & bounds!

**Caveat: This document was created ~9 months ago (Today is 5/23/2017), so there might be more up-to-date information & there are certainly more AWS Data Analytics Services that should be included here. However, the information herein was complete & accurate when created. Thank you!

#gottaluvAWS! #gottaluvAWSMarketplace!


DISCOVER, MIGRATE & DEPLOY PRE-CONFIGURED BIG DATA BI & ADVANCED ANALYTIC SOLUTIONS IN MINUTES – AND PAY ONLY FOR WHAT YOU USE BY THE HOUR (Chapter 3.7 in “All AWS Data Analytics Services”)

3.7  DISCOVER, MIGRATE & DEPLOY PRE-CONFIGURED BIG DATA BI & ADVANCED ANALYTIC SOLUTIONS IN MINUTES – AND PAY ONLY FOR WHAT YOU USE BY THE HOUR

Talking about AWS Marketplace is a passion of mine. I view it as AWS’ gift to the business world. No other cloud provider has the authority and influence to attract over a thousand technology partners and independent software vendors (ISVs) that have licensed and packaged their software to run on AWS, integrated their software with AWS capabilities, or built add-on services that benefit their customers as greatly as AWS does through the AWS Marketplace. The AWS Marketplace is the largest “app store” in the world, even though it is strictly a B2B app store!

This translates to the best & most popular software vendors going to great lengths to adapt their software to integrate with other AWS services and run seamlessly on the AWS cloud. Only AWS has the prominence and well-deserved reputation as a totally customer-centric company that’s necessary to attract such renowned ISVs, and to make it worth each one’s time to be offered in AWS Marketplace.

The AWS Marketplace facilitates the discovery, purchase, and deployment of BI and Big Data solutions (and many more categories) on AWS, letting you migrate to or acquire the business intelligence and data analytics solutions you want in minutes… and pay only for what you consume.
For those of you who haven’t heard about AWS Marketplace or dismiss it (for any number of pre-conceived ideas) as another “That’s how they get you!” thought, let me explain the facts, the benefits, and how to navigate AWS Marketplace. Please read on: AWS Marketplace has more than 100,000 active customers who use 300M compute hours/month deployed on Amazon EC2, with more than 3,000 listings from over 1,000 popular software vendors (not including the new SaaS launch that occurred in late November, 2016.)

Since AWS resources can be instantiated in seconds, you can treat them as “disposable” resources – not hardware or software you’ve spent months selecting and paid a significant up-front expenditure for without knowing whether it will solve your problems. The “Services, not Servers” mantra of AWS provides many ways to increase developer productivity and operational efficiency, and lets you “try on” the various solutions available on AWS Marketplace to find the perfect fit for your business needs without committing to long-term contracts.

1-Click, on-demand infrastructure through the software solutions on AWS Marketplace allows iterative, experimental deployment, so you can take advantage of advanced analytics and emerging technologies within minutes, paying only for what you consume, by the hour or by the month.
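
To make the pay-by-the-hour idea concrete, below is a minimal sketch (my own illustration, not an official AWS example) of launching a Marketplace solution programmatically with boto3 after you’ve subscribed to it. The AMI ID, key pair, and security group are hypothetical placeholders you’d replace with the values shown on the listing’s launch page; the 1-Click launch in the AWS Marketplace console accomplishes the same thing without any code.

```python
# Hedged sketch: launch a subscribed AWS Marketplace AMI on Amazon EC2 with boto3.
# All identifiers below (AMI ID, key pair, security group) are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder: the Marketplace AMI you subscribed to
    InstanceType="m5.xlarge",                    # billed only while the instance is running
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                       # placeholder key pair for SSH/RDP access
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "marketplace-analytics-trial"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}")

# When the experiment is over, terminate the instance so the hourly charges stop:
# ec2.terminate_instances(InstanceIds=[instance_id])
```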

The vast majority of big data use cases deployed in the cloud today run on AWS, which has unique customer references for big data analytics, 67 of which are household names. AWS offers over 50 services and hundreds of features to support virtually any big data application and workload. When you combine the managed AWS services with the software solutions available on AWS Marketplace (there are over 290 big data solutions listed), you can get the precise business intelligence and big data analytics solution you want, one that augments and enhances your project beyond what the services themselves provide. As a result, you get to data-driven results faster by decreasing the time it takes to plan, forecast, and make software provisioning decisions, which greatly improves the way you build business analytics solutions and run your business.

You can read the whitepaper on Big Data Analytics Leveraging the AWS Marketplace, to which I’m a contributing author, by going to https://aws.amazon.com/mp/bi/, scrolling to the bottom under “Additional Resources”, & clicking “Download Solution Overview”. Below is a screenshot of the first page:

I’m a Contributor to this “Business Intelligence & Big Data on AWS, Leveraging ISV AWS Marketplace Solutions” Whitepaper

Below is just a fraction of the solutions you can achieve by combining AWS Marketplace software with AWS big data services:

You can:

  • Launch pre-configured and pre-tested experimentation platforms for big data analysis
  • Query your data where it sits (in-datasource analysis) without moving or storing your data on an intermediate server while directly accessing the most powerful functions of the underlying database
  • Perform “ELT” (extract, load, and transform) instead of “ETL” (extract, transform, and load) when moving your data into the Amazon Redshift data warehouse, so the data lands in its original form and you can run multiple data warehouse transforms on the same data (a short sketch of this pattern follows the list)
  • Have long-term connectivity among many different databases
  • Ensure your data is clean and complete prior to analysis
  • Visualize millions of data points on a map
  • Develop route planning and geographic customer targeting
  • Embed visualizations in existing applications or deliver them as stand-alone applications
  • Visualize billions of rows in seconds
  • Graph data and drill into areas of concern
  • Have built-in data science
  • Export information into any format
  • Deploy machine-learning algorithms for data mining and predictive analytics
  • Meet the needs of specialized data connector requirements
  • Create real-time geospatial visualization and interactive analytics
  • Run both OLAP (analytical) and OLTP (transactional) processing
  • Map disparate data sources (cloud, social, Google Analytics, mobile, on-prem, big data or relational data) using high-performance massively parallel processing (MPP) with easy-to-use wizards
  • Fine-tune the type of analytical result (location, prescriptive, statistical, text, predictive, behavior, machine learning models and so on)
  • Customize the visualizations in countless views with different levels of interactivity
  • Integrate with existing SAP products
  • Deploy a new data warehouse or extend your existing one
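
To illustrate the ELT bullet above, here is a minimal sketch of the pattern, assuming a hypothetical Amazon Redshift cluster, S3 bucket, IAM role, and table names (none of these come from the whitepaper). The raw data is loaded unchanged first, then transformed inside Redshift, so the same staging table can be re-transformed later without re-extracting from the source.

```python
# Hedged ELT sketch against Amazon Redshift; every name here is a placeholder.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="...",
)
conn.autocommit = True
cur = conn.cursor()

# "L" before "T": land the raw file from S3 in a staging table, unchanged.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_sales (
        sale_date DATE,
        region    VARCHAR(32),
        amount    DECIMAL(12, 2)
    );
""")
cur.execute("""
    COPY raw_sales
    FROM 's3://my-bucket/sales/2019/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
""")

# "T" in-database: because raw_sales still holds the original data, this (or any
# other) transformation can be re-run as often as the business requires.
cur.execute("""
    CREATE TABLE daily_revenue AS
    SELECT sale_date, region, SUM(amount) AS revenue
    FROM raw_sales
    GROUP BY sale_date, region;
""")

cur.close()
conn.close()
```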

Amazon EC2 provides an ideal platform for operating your own self-managed big data analytics applications on AWS infrastructure. Almost any software you can install on a Linux or Windows virtualized environment can be run on Amazon EC2 with a pay-as-you-go pricing model via a solution available on AWS Marketplace, and the solutions themselves can distribute their computation across parallel EC2 instances to execute their algorithms efficiently.

Some examples of self-managed big data analytics that run on Amazon EC2 include the following:

  • A Splunk Enterprise Platform, the leading software platform for real-time Operational Intelligence. Splunk software and cloud services enable organizations to search, monitor, analyze, and visualize machine-generated big data coming from websites, applications, servers, networks, sensors, and mobile devices. A Splunk Analytics for Hadoop solution, called Hunk, is also available on AWS Marketplace; it enables interactive exploration, analysis, and visualization of data stored in Amazon EMR and Amazon S3
  • A Tableau Server Data Visualization Instance, for users to interact with pre-built data visualizations created using Tableau Desktop. Tableau Server allows for ad-hoc querying and data discovery, supports high-volume data visualization and historical analysis, and enables the creation of reports and dashboards
  • A SAP HANA One Instance, a single-tenant SAP HANA database instance that has SAP HANA’s in-memory platform, to do transactional processing, operational reporting, online analytical processing, predictive and text analysis
  • A Geospatial AMI such as MapLarge, which brings high-performance, real-time geospatial visualization and interactive analytics. MapLarge’s visualization results are useful for plotting addresses on a map to determine demographics, analyzing law enforcement and intelligence data, delivering insight into public health information, and visualizing linear features such as roads and pipelines
  • An Advanced Analytics Zementis ADAPA Decision Engine Instance, a platform and scoring engine for deploying data science predictive models built with tools such as R, Python, KNIME, SAS, SPSS, SAP, FICO, and more. Zementis ADAPA Decision Engine can score data in real time using web services or in batch mode from local files or data in Amazon S3 buckets. It provides predictive analytics through many predictive algorithms, sensor data processing (IoT), behavior analysis, and machine learning models
  • A Matillion Data Integration Instance, an ELT service natively built for Amazon Redshift that uses Amazon Redshift’s own processing for data transformations to take advantage of its blazing speed and scalability. Matillion gives you the ability to orchestrate and/or transform data upon ingestion, or simply load the data so it can be transformed multiple times as your business requires

Below is an AWS Marketplace brochure explaining the benefits of using Marketplace solutions for big data analytics (it can also be found on the “BI & Big Data Landing Page” if you scroll to the bottom of the page & click “Download PDF Poster”).

Benefits of AWS Marketplace in Analytical Solutions

Poster on the Benefits of AWS Marketplace in Analytical Solutions

The Main Categories on AWS Marketplace

AWS Marketplace has solutions well beyond big data analytics; listed below are all of the main sections, with links to each topic’s respective landing page:

Breaking Down the Main AWS Marketplace Categories to Specific Functionalities:

Security Solutions:

Network Infrastructure Solutions:

Storage Solutions:

BI and Big Data Solutions:

Database Solutions:

Application Development Solutions:

Content Delivery Solutions:

Mobile Solutions:

Microsoft Solutions (note: the list of  “Third-Party Software Products” is a small fraction of the AWS Marketplace solutions that run on Microsoft Servers):

  • Microsoft Workloads:
    • Windows Server (many editions, type “Windows Server” into AWS Marketplace search bar)
    • Exchange Server
    • Microsoft Dynamics (many editions, type “Microsoft Dynamics” into AWS Marketplace search bar)
    • Microsoft SQL Server  (many editions, type “SQL Server” into AWS Marketplace search bar)
    • SharePoint (many editions, type “SharePoint” into AWS Marketplace search bar)
  • Third-Party Software Products:

Migration Solutions:

I hope you read through the entire post, and that you now realize how much time, frustration, configuration, and money you can save by using the pre-configured software solutions available in AWS Marketplace, paying only for what you use!

Why do it any other way?

Using Pre-Configured Software Solutions from AWS Marketplace with 1-Click Deployments & Paying by the Hour – Why Do It Any Other Way???

Read the previous post here.

#gottaluvAWS! #gottaluvAWSMarketplace!


TRADITIONAL RELATIONAL DATABASE MANAGEMENT SYSTEMS (Chapter 3.6 in “All AWS Data Analytics Services”)

A Traditional Relational Database Schema Showing Tables and Relations

A Traditional Relational Database Schema Showing Tables, Relations, & Keys

3.6  TRADITIONAL RELATIONAL DATABASE MANAGEMENT SYSTEMS

A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model.

In 1970, Edgar F. Codd, a British computer scientist working at IBM, published “A Relational Model of Data for Large Shared Data Banks.” At the time, the now-renowned paper attracted little interest, and few understood how Codd’s groundbreaking work would define the basic rules for relational data storage for decades to come. Those rules can be simplified as:

  1. Data must be stored and presented as relations, i.e., tables that have relationships with each other, e.g., through primary/foreign keys.
  2. To manipulate the data stored in tables, a system should provide relational operators – operations that test the relationship between two entities. A good example is the WHERE clause of a SELECT statement: the SQL statement SELECT * FROM CUSTOMER_MASTER WHERE CUSTOMER_SURNAME = 'Smith' will query the CUSTOMER_MASTER table and return all customers with a surname of Smith. (A runnable sketch of both rules follows this list.)
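
To make both rules concrete, here is a small self-contained sketch using Python’s built-in sqlite3 module; the table and column names are invented for the example, and any RDBMS would behave the same way.

```python
# Tiny illustration of Codd's two rules with Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Rule 1: data is stored as relations (tables) linked by primary/foreign keys.
cur.executescript("""
    CREATE TABLE customer_master (
        customer_id      INTEGER PRIMARY KEY,
        customer_surname TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer_master(customer_id),
        amount      REAL
    );
    INSERT INTO customer_master VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO orders VALUES (10, 1, 99.50), (11, 2, 15.00);
""")

# Rule 2: relational operators test relationships between rows -- here a WHERE
# clause plus a join across the two related tables.
cur.execute("""
    SELECT c.customer_surname, o.order_id, o.amount
    FROM customer_master c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.customer_surname = 'Smith';
""")
print(cur.fetchall())  # [('Smith', 10, 99.5)]

conn.close()
```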

RDBMSs have been a common choice since the 1980s for storing information in databases used for financial records, manufacturing and logistical information, personnel data, and other applications built on historical, transactional data.

However, relational databases were challenged, unsuccessfully, by object database management systems in the 1980s and 1990s (introduced to address the so-called object-relational impedance mismatch between relational databases and object-oriented application programs) and again by XML database management systems in the 1990s.

Despite such attempts, RDBMSs keep most of the market share, but that share is declining because of their limited ability to scale, concurrency issues, and the high network bandwidth consumed when queries must traverse the many tables of a highly normalized architecture. Database normalization is the technique used to organize the data in an RDBMS: a systematic approach to decomposing tables to eliminate data redundancy and improve data integrity (a small before/after sketch follows).
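
As a quick, hedged illustration of normalization (with invented names): the denormalized design below repeats the customer’s surname on every order row, while the normalized design stores it once and lets orders reference it by key.

```python
# Before/after sketch of normalization using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Before: one wide table; the surname is duplicated on every order row,
    -- so a misspelling or a name change has to be fixed in many places.
    CREATE TABLE orders_denormalized (
        order_id         INTEGER PRIMARY KEY,
        customer_surname TEXT,
        amount           REAL
    );

    -- After: the surname lives in exactly one row; orders reference it by key,
    -- eliminating the redundancy and protecting data integrity.
    CREATE TABLE customers (
        customer_id      INTEGER PRIMARY KEY,
        customer_surname TEXT
    );
    CREATE TABLE orders_normalized (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
conn.close()
```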

Two examples of traditional relational databases are Microsoft SQL Server & Oracle Database.

In Chapter 22, the second section will compare traditional relational databases with the Amazon Aurora database (part of Amazon RDS), a new RDBMS built from the ground up for the cloud that recently surpassed Amazon Redshift as AWS’ fastest-growing service.

Read the previous post here.

Read the next post here.

#gottaluvAWS! #gottaluvAWSMarketplace!

 
