We’re Living in a Metrics-Driven World (copied from read.acloud.guru blog)

*NOTE: To read this post at its original home on the acloud.guru blog – which you should be reading daily anyway if you love the cloud & AWS – click here*

Analytics-Driven Organizations know how to turn data sources and solutions into insights that offer a competitive advantage

In 2017 & beyond, data will become more intelligent, more usable, and more relevant than ever

“The goal is to turn data into information, and information into insight” — Carly Fiorina

We live in a data-driven world. For Fortune 500 companies, the value of data is clear and compelling. They invest millions of dollars annually in information systems that improve their performance and outcomes. Independent businesses have the same need to be data-driven; however, there’s a persistent entrepreneurial resistance to becoming truly metrics-driven.

Founders are often tempted to postpone building necessary metrics in favor of spending time and resources on building products. While that might work in the short-term, it will very quickly come back to haunt them.

Very few companies have successfully achieved exponential growth, raised capital, or negotiated strong exits without first having a solid analytics model that has been iterated upon for many months or years.

The Analytics-Driven Organization

The importance of big data in the business world can’t be overstated. We know that there’s an enormous amount of valuable data in the world, but few companies are using it to maximum effect. Analytics drive business by showing how your customers think, what they want, and how the market views your brand.

In the age of Digital Transformation, almost everything can be measured. In the coming year this will be a cornerstone of how businesses operate. Every important decision can and should be supported by the application of data and analytics.

To keep competitive today, you need to “think ahead” and answer questions in real-time to provide alerts that mitigate negative impacts on your business, and you need predictive analytics to forecast what’s going to happen before it ever does, so you are prepared at any given point in time.

In 2017, data will become more intelligent, more usable, and more relevant than ever. Cloud technologies, primarily Amazon Web Services (AWS), have opened the doors to affordable, smart data solutions that make it possible for non-technical users to explore (through visualization tools) the power of predictive analytics.

There’s also an increasing democratization of artificial intelligence (AI), which is driving more sophisticated consumer insights and decision-making. Forward-thinking organizations will approach predictive analytics with the future and extensibility in mind.

Analyzing extensive data sets requires significant compute capacity that can fluctuate in size based on the data inputs and type of analytics. This characteristic of scaling workloads is perfectly suited to AWS and the AWS Marketplace’s pay-as-you-go cloud model — where applications can scale up and down (and in and out) based on demand.

In 2017, entrepreneurs will learn how to embrace the power of cloud analytics.

The ubiquity of cloud is nothing new for anybody who stays up-to-date with Business Intelligence trends. The cloud will continue its reign as more and more companies move towards it as a result of the proliferation of cloud-based tools available on the market. Most of the elements — data sources, data models, processing applications, computing power, analytic models and data storage — are located in the cloud.

Very few companies have achieved success without first having a solid analytics model

No matter the role, no matter the sector, data is transforming it. Some companies have restructured themselves, their internal processes, their data systems & their cultures to embrace the opportunities provided by data.

At their core, the best data-driven companies operationalize data. Data informs the actions of each employee every morning & evening. These businesses use the morning’s purchasing data to inform which merchandise sits on the shelves in the afternoon, for example.

The Analytics-Driven Organization has also developed functional data supply chains that send insight to the people who need it. This supply chain comprises all the people, software, & processes related to data as it’s generated, stored, and accessed.

These businesses have also created a data dictionary — a common language of metrics used throughout all departments of the company.

As companies become analytics-driven, they aren’t just enjoying incremental improvements. The benefits enabled by analytical data processing become the heart of the business — enabling new applications and business processes, using a variety of data sources and analytical solutions — yielding insights into their data they never dreamed of and giving them a great competitive advantage.

The Types of Analytics and Their Use Cases

Descriptive Analytics

Uses business intelligence and data mining to ask “What has happened?”

Descriptive Analytics mines data to provide trending information on past or current events that can give businesses the context they need for future actions. Descriptive Analytics is characterized by the use of KPIs. It drills down into data to uncover details such as the frequency of events, the cost of operations and the root cause of failures.

Most traditional business intelligence reporting falls into this realm, but complex and sophisticated analytic techniques also fall into this realm when their purpose is to describe or characterize past events and states.

Summary statistics, clustering techniques, and association rules used in market basket analysis are all examples of Descriptive Analytics.
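
To make the idea concrete, here is a minimal sketch of Descriptive Analytics in Python using pandas: summary statistics and a simple frequency roll-up over a toy order table. The column names and values are purely illustrative.

```python
import pandas as pd

# Hypothetical transaction data; column names and values are illustrative only.
orders = pd.DataFrame({
    "order_value": [25.0, 40.5, 12.0, 99.9, 40.5, 18.0],
    "region":      ["east", "west", "east", "south", "west", "east"],
})

# Summary statistics: "What has happened?"
print(orders["order_value"].describe())        # count, mean, std, min, quartiles, max

# Frequency of events by region (a simple KPI-style roll-up)
print(orders.groupby("region")["order_value"].agg(["count", "sum", "mean"]))
```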

Diagnostic Analytics

Examines data or content to answer the question “Why did it happen?”

Diagnostic Analytics is characterized by techniques such as drill-down, data discovery, data mining and correlations. You can think of it as causal inference: the comparative effect of different variables on a particular outcome. While Descriptive Analytics might be concerned with describing how large or significant a particular outcome is, Diagnostic Analytics is more focused on determining what factors and events contributed to that outcome.

As more and more cases are included in a particular analysis, and more factors or dimensions are included, it may become impossible to make precise, definitive statements regarding sequences and outcomes. Contradictory cases, data sparseness, missing factors (“unknown unknowns”), and data sampling and preparation techniques all contribute to uncertainty and the need to qualify conclusions in Diagnostic Analytics as occurring in a “probability space”.

Training algorithms for classification and regression techniques can be seen as falling into this space, since they combine the analysis of past events and states with probability distributions. Other examples of Diagnostic Analytics include attribute importance, principal component analysis, sensitivity analysis and conjoint analysis.
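
As a rough illustration of two of those techniques, the sketch below computes attribute importance (via a random forest) and principal component analysis on synthetic data with scikit-learn. The data and feature count are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                     # four hypothetical factors
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Attribute importance: which factors contributed most to the outcome?
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("feature importances:", model.feature_importances_)

# Principal component analysis: how much variance does each component explain?
pca = PCA(n_components=4).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```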

Discovery Analytics

Approaches the data in an iterative process of “explore, discover, verify and operationalize.” It doesn’t begin with a pre-definition but rather with a goal.

This method uncovers new insights and then builds and operationalizes new analytic models that provide value back to the business. The key to delivering the most value through Discovery Analytics is to enable as many users as possible across the organization to participate in it to harness the collective intelligence.

Discovery Analytics searches for patterns or specific items in a data set. It uses applications such as geographical maps, pivot tables and heat maps to make the process of finding patterns or specific items rapid and intuitive.

Examples of Discovery Analytics include using advanced analytical geospatial mapping to find location intelligence or frequency analysis to find concentrations of insurance claims to detect fraud.
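
The pivot-table style of exploration mentioned above might look something like the following pandas sketch, which rolls hypothetical insurance claims up by state and month to surface unusual concentrations. The fields and figures are invented for illustration.

```python
import pandas as pd

# Hypothetical insurance claims; fields and values are illustrative only.
claims = pd.DataFrame({
    "state":  ["GA", "GA", "FL", "FL", "FL", "TX"],
    "month":  ["Jan", "Feb", "Jan", "Jan", "Feb", "Jan"],
    "amount": [1200, 800, 5000, 4700, 300, 950],
})

# Pivot table: claim counts by state and month, a quick way to spot
# unusual concentrations that may warrant a fraud investigation.
pivot = claims.pivot_table(index="state", columns="month",
                           values="amount", aggfunc="count", fill_value=0)
print(pivot)
```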

Predictive Analytics

Asks “What could happen?”

Predictive Analytics is used to make predictions about unknown future events. It uses many techniques from data mining, machine learning and artificial intelligence. This type of analytics is all about understanding predictions based on quantitative analysis of data sets.

It’s in the realm of “predictive modeling” and statistical evaluation of those models. It helps businesses anticipate likely scenarios so they can plan ahead, rather than reacting to what already happened.

Examples of Predictive Analytics include classification models, regression models, Monte Carlo analysis, random forest models and Bayesian analysis.
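
As a minimal example of the classification-model case, the sketch below trains a logistic regression on synthetic “customer” data and scores held-out cases with an AUC metric. Nothing here reflects a real dataset; it only shows the fit-then-predict pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))                       # hypothetical customer features
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# "What could happen?": fit a model on past outcomes, then score unseen cases.
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
print("AUC on held-out data:", roc_auc_score(y_test, probs))
```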

Prescriptive Analytics

Uses optimization and simulation to ask “What should we do?”

Prescriptive Analytics explores a set of possible actions and suggests actions based on Descriptive and Predictive Analyses of complex data. It’s all about automating future actions or decisions which are defined programmatically through an analytical process. The emphasis is on defined future responses or actions and rules that specify what actions to take.

While simple threshold-based “if-then” rules are included in Prescriptive Analytics, highly sophisticated algorithms such as neural nets are also typically in the realm of Prescriptive Analytics because they’re focused on making a specific prediction.

Examples include recommendation engines, next best offer analysis, queueing analysis with automated assignment systems and most operations research optimization analyses.
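
At the simplest end of that spectrum, a threshold-based “if-then” rule can turn a predicted probability into a prescribed action. The sketch below is illustrative only; the thresholds and action names are made up.

```python
def next_best_action(churn_probability: float) -> str:
    """Map a predicted churn probability to a prescribed action.
    Thresholds here are illustrative, not tuned values."""
    if churn_probability >= 0.8:
        return "escalate to retention specialist"
    elif churn_probability >= 0.5:
        return "offer a discount in the next campaign"
    else:
        return "no action"

for p in (0.92, 0.61, 0.10):
    print(p, "->", next_best_action(p))
```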

Sentiment Analysis

The process of determining whether a piece of writing is positive, negative, or neutral.

Sentiment Analysis is also known as opinion mining — deriving the opinion or attitude of a speaker. Social media tweets, comments, & posts typically feed sentiment analysis. This is a sub-category of general Text Analytics. A common use case of sentiment analysis is to discover how people feel about a particular topic.
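
A toy, lexicon-based sketch of the idea is shown below; production sentiment analysis uses trained models and much richer lexicons, but the scoring principle is similar. The word lists are invented for the example.

```python
# A toy lexicon-based sentiment scorer; real systems use trained models,
# but the idea of scoring text against word polarity is the same.
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "angry"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product and the support was great"))   # positive
print(sentiment("terrible experience and I am angry"))              # negative
```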

Geospatial Analytics

There is a growing realization that by adding geographic location to business data and mapping it, organizations can dramatically enhance their insights into tabular data.

Geospatial Analytics, or Location Analytics, provides a whole new context that is simply not possible with tables and charts. This context can almost immediately help users discover new understandings and more effectively communicate and collaborate using maps as a common language.

When you can visualize millions of points on a map, use cases include route planning, geographic customer targeting, tracking the spread of disease and more.
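
Even without a full mapping stack, plotting coordinates quickly reveals spatial clusters. The sketch below uses matplotlib with a handful of made-up customer coordinates; a real Location Analytics tool would render these over map tiles.

```python
import matplotlib.pyplot as plt

# Hypothetical customer coordinates (longitude, latitude); invented for illustration.
lons = [-84.39, -84.41, -80.19, -80.21, -95.37]
lats = [33.75, 33.77, 25.76, 25.78, 29.76]

# Even a plain scatter plot makes geographic clusters visible.
plt.scatter(lons, lats, s=40, alpha=0.7)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Customer locations (illustrative data)")
plt.savefig("customer_map.png")
```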

Interesting Types of Geospatial Analysis (courtesy of AWS Marketplace video of MapLarge Software)

The Culture of Digital Transformation

Change is going to happen whether you pursue it or not — you only need to look at how the role of cloud computing in 2016 has evolved to understand. Modern enterprises succeed when they adapt to industry and marketplace shifts and incorporate new technology into company culture and regular operations.

Digital transformation isn’t only about technology; it’s about bringing together the power of technology with a culture that embraces the change it can lead for the organization.

Proactive innovation is one of the best ways to stay competitive in an evolving marketplace. New technology needs to be assessed, tested, analyzed, and judged more quickly than ever. Businesses can no longer afford to waste time and resources implementing new tools that offer no real value. This means a “Fail fast, to succeed faster,” mentality.

Some projects will work straight away; others will have significant learning curves. The faster your organization can go from idea to implementation, the more it can embrace opportunities to transform and even disrupt markets and internal business models. We’ve already talked about adaptability, but that plays a major role here as well.

If a company has an adaptive culture where new tech can be easily integrated — or is at least encouraged — that enterprise is set up for long-term success.

Bring together the power of technology with a culture that embraces the change

#gottaluvAWS! #gottaluvAWSMarketplace!


Bringing Predictive Data Analytics to the People with PredicSis.ai

Don’t Be Left Behind in the Days Where Predictive Analytics is Mandatory
for Long-Term Business Success

BY FRANK for DATALEADER · JULY 24, 2017

Over the last decade, the term “big data” grew to prominence. The rush was on to create technologies to capture and store vast quantities of data. The focus of many enterprises, both large and small, was on data capture and storage. Now, the rush is on to monetize and exploit these sizable data stores. Companies want to make the right decisions, for the right customers, at the precise time to maximize value and minimize risk. In short, businesses need to predict the future by anticipating behaviors and identifying trends, at the individual customer level and at scale across their entire customer base. The company that can achieve all of this before its competition will win in the marketplace. Companies that don’t capitalize on their data resources by converting them to insights and actions will lose market share, fall behind, and, ultimately, fail.

Data: The New Oil

In the early 20th century there were hardly any automobiles. Accordingly, there were no gas stations or car mechanics. Over time, gas stations evolved to have convenience stores attached to them, auto repair shops thrived, and the insurance industry found a solid new foundation for an entirely new line of business.

Look around you and you will see an entire civilization transformed by oil with all its benefits and detriments. Now, imagine what the world will look like one hundred years from now when the data revolution has played out all its effects and unintended consequences.

Pushing the “data is the new oil” analogy further, consider that raw data exists in a natural, unprocessed state, very often deep underground. A considerable amount of labor goes into taking it from that primordial state into something that can be used to fuel a car or heat a home. The data must be extracted, shaped and processed, much as an oil refinery does with crude. Finally, the output of the refinery gets sold as a product to consumers. In other words, just as more oil does not make for better gasoline, more data doesn’t necessarily make your business data-centric.

Turning Raw Data into Insight

Over the last few years, more and more organizations have discovered that data can be turned into any number of Artificial Intelligence (AI), Machine Learning (ML), or other “cognitive” services. Some of these new services may blossom into new revenue streams and will more than likely disrupt entire industries as the normal way of doing business is upended in favor of automation and accelerated decision-making.

Collecting raw data for the sake of collecting raw data, argues Hal Varian, Google’s chief economist, exhibits “decreasing returns to scale.” In other words, each additional piece of data is somewhat less valuable and at some point, collecting more does not add anything. What matters more, he says, is the quality of the algorithms that process the data and the talent a firm has brought on to develop these algorithms. Success for Google is in the “recipe” not the “ingredients.”

In this new world of data, the product could be a service that rates the likelihood that a transaction is fraudulent, and the consumers of the service could be the internal auditing department. In this way, data will enable new markets and even economic ecosystems as a previously undervalued resource develops into new streams of income and creates entirely new offshoot industries.

Drawbacks of Conventional Analytics

With the future of their businesses at stake, one would think that every single enterprise would be eagerly scouring their data sets and feeding them to any number of algorithms in order to extract any deeper understanding of their customers’ activities and identify trends as they unfold. However, this is not the case. Why?

The answer comes down to cost and risk. Cost includes both finding the people with the skills to perform this type of work and the compute infrastructure often required to run existing algorithms. Risk arises because, as is often said, the value of an analytics initiative is difficult to foresee and its complexity difficult to manage.

This is not mere risk aversion or fear of the unknown: there are hurdles everywhere. Data will need to be shaped and cleaned; the in-house team may not have the skills, and hiring consultants is expensive. Then there are the infrastructure investments required to store the data and compute the model. The payoff is hard to evaluate and the ROI is even harder to envision.

In a few words: getting tangible results from analytics is not straightforward. At least, not for everyone.

What if there was a way to take the cost of recruiting experienced data scientists and data engineers, remove the expense associated with beefing up IT infrastructure, and make advanced data analytics more approachable to the average knowledge worker?

Well, there is.

Enter PredicSis.ai

PredicSis.ai changes the game by making advanced analytics more accessible and affordable. No longer are advanced analytics limited to large organizations with massive budgets devoted towards hiring, training, and maintaining a data science team. Now, anyone, with or without data science and machine learning skills, can leverage that power with a few clicks of the mouse.

Simple to Use

The real power of PredicSis.ai lies in its ability to place data analytics into an easy-to-use, self-service SaaS model. PredicSis.ai is now available on the AWS Marketplace. Just activate an account on the marketplace and pay as you go. No software installs or commitments.

Automatic, Swift & Agile Integrated Predictive Analytics

PredicSis.ai automates much of the work normally associated with machine learning. Using autoML algorithms, PredicSis.ai surfaces and evaluates new data features and only displays the ones with meaningful impact on predictive outcomes. In other words, the software automatically filters out the fields, or features, that lack correlation to the predicted outcome. From the meaningful features it discovers, PredicSis.ai then creates a predictive model for future input. Because the workflow and the display are straightforward and intuitive, users can focus on rapidly iterating data models, exploring the data and, finally, delivering added value to the business.
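
To give a feel for the general idea only (this is emphatically not PredicSis.ai’s actual algorithm), the sketch below ranks candidate features by their absolute correlation with the outcome and keeps only those above a made-up threshold. The dataset, field names and cut-off are all invented.

```python
import numpy as np
import pandas as pd

# Rough sketch of "keep only meaningful features"; not PredicSis.ai's actual method.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 500),
    "support_calls": rng.integers(0, 10, 500),
    "random_noise":  rng.normal(size=500),
})
df["churned"] = (df["support_calls"] > 6).astype(int)     # illustrative outcome

# Rank candidate features by absolute correlation with the outcome and
# keep only the ones above a (made-up) threshold.
correlations = df.drop(columns="churned").corrwith(df["churned"]).abs()
meaningful = correlations[correlations > 0.1].sort_values(ascending=False)
print(meaningful)
```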

Blazingly Fast Heterogeneous or Homogeneous Data Integration

Getting data into PredicSis.ai is fast and easy. Simply drag and drop ASCII or UTF-8 encoded CSV files: once the primary dataset file is uploaded, users can upload any number of additional peripheral data tables. It’s then up to PredicSis.ai, guided by the user and their business knowledge, to detect, surface and display meaningful features and insights from those multiple datasets.

Accessible Advanced Data Analytics

Making advanced analytics accessible opens up new worlds of possibilities. With the freedom of self-service analytics, all sorts of scenarios become possible. Marketing and sales departments can determine the customers most likely to leave for a competitor – before they leave. They can pre-emptively identify which accounts are high-priority calls for the sales team. Outbound calls from the sales center can be optimized to increase conversion and sales performance. Marketing and sales teams can be self-sufficient with their model creation, exploration, and experimentation.

The advantages go beyond marketing and sales, however: business analysts can leverage their deep domain knowledge and apply predictive analytics to pre-emptively address challenges that the business faces and take corrective action ahead of time. They can also use the built-in sharing functionality to share insights with their colleagues and management.

Speaking of management, company leadership can turn dashboards into foresight, predicting the course of the business. Using the same tools, senior management can take proactive steps to steer the company around dangers and risks, and even find new opportunities that they may have otherwise missed.

Finally, even seasoned data scientists can leverage the flexibility and power of PredicSis.ai to explore models with ease and speed. Data science teams can explore more datasets in less time, allowing them to explore more options, create more effective models, and add more value to the business.

In short, PredicSis.ai allows non-data scientists to reap the rewards of data science and makes data scientists more efficient.

It is pointless and painful not to use PredicSis.ai!

Getting Started

The best way to see how PredicSis.ai works is to use it and see for yourself just how approachable it makes data analytics.

Now that PredicSis.ai is available on the AWS Marketplace, it couldn’t be easier to use. No software to download, nothing to install, no new hardware to provision: it’s just a service that runs in the AWS cloud.

The first step to using PredicSis.ai is to browse over to the AWS Marketplace and search for PredicSis.ai. It will appear as one of the options in the autocomplete dropdown.

Then choose from one of the following options. For the purposes of this blog post, choose the PredicSis.ai (single-user) option.

On the next page, choose the AWS Region and EC2 Instance Type you wish to use. Then click on the “Launch with 1-click” button.

On the following page, click on the EC2 Console link to browse to the EC2 console.

On the EC2 Console page, retrieve the URL where PredicSis.ai was deployed. It will be the domain name next to Public DNS and end with “compute.amazonaws.com”.

Copy the URL and then browse to it.
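
If you prefer to script this step, a boto3 sketch along the following lines can retrieve the same Public DNS name; the region and instance ID below are placeholders you would replace with your own.

```python
import boto3

# Look up the Public DNS name of the launched instance
# (region and instance ID are placeholders).
ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
instance = resp["Reservations"][0]["Instances"][0]
print("Browse to: http://" + instance["PublicDnsName"])
```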

You should see the PredicSis.ai software home page with no projects.

Data Modeling 101

Now you’re about to create your first data model and discover what lies hidden inside the data: the PredicSis.ai way. You may be asking yourself, exactly what is a data model? A data model is a way to store and represent data and its relation to other data. For instance, customers have various attributes about them stored in the data model and customer actions are also logged in the data model.

By connecting a customer’s attribute, such as age, to their behavior, such as buying certain products, one can infer that customers of a similar age are more likely to make the same purchases. Many relationships are well known: parents with young children are more likely to buy diapers than those whose children are in college.
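
A tiny pandas sketch of that kind of attribute-to-behavior link is shown below; the age bands and purchase flags are invented, but the group-by pattern is the kind of relationship a model generalizes from.

```python
import pandas as pd

# Illustrative data linking a customer attribute (age band) to a behavior.
purchases = pd.DataFrame({
    "age_band":       ["25-34", "25-34", "35-44", "35-44", "55-64", "55-64"],
    "bought_diapers": [1, 1, 1, 0, 0, 0],
})

# Purchase rate by age band: the kind of relationship a model generalizes from.
print(purchases.groupby("age_band")["bought_diapers"].mean())
```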

What’s great about machine learning and data analytics is the ability to identify patterns quickly, even patterns that humans may not be able to readily identify. How does that work? Well, let’s create a project and see how easy PredicSis.ai makes this process.

Creating Your First Project

Click on the Create button located on the upper right-hand corner of the screen. The following dialog box asks for a name for the new project. You may enter anything, but, for this example, I entered “first project.”

Click Create and then the “first project” project now appears in the workspace area.

Click on the first project button to bring up the screen below:

Now drag and drop files to add data to the project. As mentioned previously, data files must be in CSV format. Once the appropriate files are uploaded, specify the one containing the outcome.

Once completed, click the Get Insights button in the upper right hand corner of the screen and PredicSis.ai will get to work analyzing the data.

After a few moments, the results are in and we can now explore the results.

First, click on the magnifying glass icon to view the model. Click around and explore the features of the data. Note how the visualizations change with each field.

To get back to the previous screen, click the drop down list that starts with “first project” and then choose Models List.

Now click on the chart icon to assess the quality of the model.

The following page shows how well the model’s predicted outcomes matched up with the test data.

This screen contains a lot of information. However, if you look at the Performance number, the model scored a “0.6941,” meaning it was correct around 69.41% of the time.

Certainly, there is room for improvement and PredicSis.ai provides ways for you to adjust and improve the model.

Go back to the models list page and this time, click on the lightning bolt icon to improve the model.

This page allows you to manually adjust the features that go into creating the predictive model. Remove the average_basket feature by clicking the checkbox. Now, add the region_code feature by clicking on the check box. Additionally, change the value in the dropdown list in the Type column to Categorical. Your feature set should look like the following.

Click the Apply Changes button in the upper right hand corner to apply the changes to the model and see if the performance has improved.

Upon quick inspection, you can see that the small changes improved the model modestly, up to 74.11% now. Go back and experiment to see which fields and options improve the model and which ones have the opposite effect.

Once you’ve made improvements to the model, it’s time to share it.

On the Models List page, click on the magnifying glass icon to view the model. Check all boxes next to the fields you wish to include in the report. Now click on the Get PDF button in the upper right hand corner of the screen. PredicSis.ai generates a report in a PDF file that you can share.

If you want to share the performance metrics of the model, you can do that as well. Go back to the Models List page and this time click on the chart icon to see the Assess Model screen once more. On the upper right hand corner of the screen, there is once again a Get PDF button. Click on it to generate and download a report as a PDF file.

Conclusion

PredicSis.ai allows for easy creation and exploration of predictive data models in just a few clicks of the mouse. It gives business users many of the same analytical tools that have previously only been in the hands of data scientists. With wider use and deployment of data analytics, businesses can more easily spot trends, detect fraud, better serve their existing customers, and find new ones.

Data analytics is already changing the game of business and now its power is in your hands. The recommendation is simple: use PredicSis.ai before your competition does.


AWS Data Analytics Services Leveraging AWS Marketplace in Detail

Unlock Hidden Insights within Massive Data Sources

Summary

Analyzing extensive data sets requires significant compute capacity that can fluctuate in size based on the data inputs and type of analytics. This characteristic of scaling workloads is perfectly suited to AWS and the AWS Marketplace’s pay-as-you-go cloud model, where applications can scale up and down based on demand. Analyzing data quickly to derive valuable insights can be done within minutes rather than months, and you only pay for what you use.

Introduction

As an ever-increasing and ubiquitous proliferation of data is emitted from increasingly new and previously unforeseen sources, traditional in-house IT solutions are unable to keep up with the pace. Heavily investing in data centers and servers by “best guess” is a waste of time and money, and a never-ending job.

Traditional data warehouses required very highly skilled employees and addressed a fixed set of questions. Today, the need for speed and agility in analyzing data differently and efficiently requires complex architectures that are available and ready for use with the click of a button on AWS – eliminating the need to concern yourself with the underlying mechanisms and configurations you’d have to manage on premises.

The AWS Marketplace streamlines the procurement of software solutions provided by popular software vendors by providing AMIs that are pre-integrated with the AWS cloud, further expediting and assisting you with supporting big data analytical software services. The AWS Marketplace has over 290 big data solutions to date.

This eBook will cover big data and big data analytics as a whole in depth: what it is, where and how it comes from, and what kinds of information you can find when analyzing all of this data. It will then discuss the facts about why AWS and the solutions provided by top software vendors in AWS Marketplace provide the best big data analytics services and offerings. Then there will be a walk-through of the AWS Services that are used in big data analytics with augmented solutions from AWS Marketplace. In conclusion, you will see how AWS is the unequivocal leader when implementing big data analytic solutions.

Conventions Used in this eBook

In order to provide cohesiveness to the longer sections of this eBook, tables are used. The header of each table lists the name of the topic, and each subtopic is listed below it. An example is shown below:

TABLE 1: EXAMPLE OF A TABLE WITH TOPIC IN THE HEADER WITH SUBTOPICS BELOW

Big Data Analytics Challenges

Data is not only getting bigger (in “Volume”), arriving in ever-increasing formats (the “Variety”) and moving faster (the “Velocity”); the need to derive “Value” through analytics that provide actionable insights is a differentiating factor between businesses that can mitigate risk and respond to customer actions in near real-time and businesses that will fall behind in this age of data deluge. Using Amazon Web Services cloud architectures and software solutions available from popular software vendors on AWS Marketplace, big data analytics solutions change from extremely complicated to set up and manage to a couple of clicks to deployment.

In addition to the metaphorical “V’s” mentioned above to describe big data, there is one more: “Veracity” – being sure your data is clean prior to performing any analytics whatsoever. Garbage in, garbage out. There’s no time to waste making improper, misinformed decisions based on dirty data. This is paramount. Using solutions in the AWS Marketplace makes this crucial and difficult step easy.

Big data has also evolved. It used to be that batch processing for reports was sufficient (and the only solution available). To keep competitive today, you need to “think ahead” and answer questions in real-time to provide alerts that mitigate negative impacts on your business, and you need predictive analytics to forecast what’s going to happen before it ever does, so you are prepared at any given point in time.

Overview of AWS and AWS Marketplace Big Data Analytics Advantages

TABLE 2: OVERVIEW OF AWS AND AWS MARKETPLACE BIG DATA ANALYTICS ADVANTAGES

AWS Big Data Analytics Advantages Overview

Analyzing large data sets requires significant compute capacity that can vary in size based upon the amount of input data and the type of analysis. These big data workloads are thus ideally suited to a pay-as-you-go cloud environment.

Many companies that have successfully taken advantage of AWS big data analytics processing aren’t just enjoying incremental improvements. The benefits enabled by big data processing become the heart of the business – enabling new applications and business processes, using a variety of data sources and analytical solutions – yielding insights into their data they never dreamed of and giving them a great competitive advantage.

Ongoing developments in AWS cloud computing are rapidly moving the promise of deriving business value from big data in real-time into a reality. With billions of devices globally already streaming data, forward-thinking companies have begun to leverage AWS to reap huge benefits from this data storm.

AWS has the broadest platform for big data in the market today, with deep and rapidly expanding functionality across big data stores, data warehousing, distributed analytics, real-time streaming, machine learning, and business intelligence. Gartner² confirms AWS has the most diverse customer base and the broadest range of use cases, including enterprise mission-critical applications. For the sixth consecutive year, Gartner² also confirms AWS is the overwhelming market share leader, with over 10 times more cloud compute capacity in use than the aggregate total of the other 14 providers in their Magic Quadrant!

2 Gartner

AWS has a tiered competency-badged network of partners that provide application development expertise, managed services and professional services such as data migration. This ecosystem, along with AWS’s training and certification programs, makes it easy to adopt and operate AWS in a best-practice fashion.

The AWS cloud provides governance capabilities enabling continuous monitoring of configuration changes to your IT resources as well as giving you the ability to leverage multiple native AWS security and encryption features for a higher level of data protection and compliance – security at every level up to the most stringent government compliance no matter what your industry.

Listed Below are Some of the Specific AWS Big Data Analytics Advantages:

  • The vast majority of big data use cases deployed in the cloud today run on AWS, with unique customer references for big data analytics, of which 67 are enterprise, household names
  • Over 50 AWS Services and hundreds of features to support virtually any big data application and workload
  • AWS releases new services and features weekly, enabling you to keep the technologies you use aligned with the most current, state-of-the-art big data analytics capabilities and functionalities
  • AWS delivers an extensive range of tools for fast and secure data movement to and from the AWS cloud
  • Computational power that’s second to none²; each instance type is optimized with varying combinations of CPU, memory, storage and networking capacity to meet the needs of any big data use case
  • AWS makes fast, scalable, gigabyte-to-petabyte scale analytics affordable to anyone via their broad range of storage, compute and analytical options, guaranteed!
  • AWS provides capabilities across all of your locations, your networks, software and business processes meeting the strictest security requirements that are continually audited for the broadest range of security certifications
  • AWS removes limits to the types of database and storage technologies you can use by providing managed database services that offer enterprise performance at open source cost. This results in applications running on many different data technologies, using the right technology for each workload
  • Virtually unlimited capacity for massive datasets
  • AWS provides data encryption at rest and in transit for all services, with the ability for you to directly analyze the encrypted data
  • AWS provides a scalable architecture that supports growth in users, traffic or data without a drop in performance, both vertically and horizontally, and allows for distributed processing
  • Faster time-to-market of products and services, enabling rapid and informed decision-making while shrinking product and service development time
  • Lower cost of ownership and reduced management overhead costs, freeing up your business for more strategic and business-focused tasks
  • In addition to the huge cost savings of simply moving from on-premises to the cloud, AWS provides suggestions on how to further decrease costs. Providing the most cost-efficient cloud solutions is a frugality rule at AWS
  • Numerous ways to achieve and optimize a globally-available, unlimited on-demand capacity of resources so you can grow as fast as you can
  • Fault tolerance across multiple servers in Availability Zones and across geographically distant Regions
  • An extremely agile application development environment: go from concept to full production deployment in 24 hours
  • Security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture built to meet the requirements of the most security-sensitive customers
  • AWS provides many suggestions on how to remove a single point of failure

2 Gartner

AWS Marketplace Big Data Analytics Advantages Overview

AWS provides an extensive set of managed services that help you build, secure, and scale big data analytics applications quickly and easily. Whether your applications require real-time streaming, a data warehouse solution, or batch data processing, AWS provides the infrastructure and tools to perform virtually any type of big data project.

When you combine the managed AWS services with software solutions available from popular software vendors on AWS Marketplace, you can get the precise business intelligence and big data analytical solutions you want that augment and enhance your project beyond what the services themselves provide. You get to data-driven results faster by decreasing the time it takes to plan, forecast, and make software provisioning decisions. This greatly improves the way you build business analytics solutions and run your business.

Gartner² confirms that because AWS has a multi-year competitive advantage over all its competitors, it’s been able to attract over a thousand technology partners and independent software vendors that have licensed and packaged their software to run on AWS, integrated their software with AWS capabilities, or delivered add-on services, all through the AWS Marketplace. The AWS Marketplace is the largest “app store” in the world, despite being strictly a B2B app store!

2 Gartner

FIGURE 1: THE AWS MARKETPLACE

Since AWS resources can be instantiated in seconds, you can treat them as “disposable” resources – not hardware or software you’ve spent months choosing and committed a significant up-front expenditure to without knowing whether it will solve your problems. The “Services not Servers” mantra of AWS provides many ways to increase developer productivity and operational efficiency, and the ability to “try on” various solutions available on AWS Marketplace to find the perfect fit for your business needs without committing to long-term contracts.

Listed Below are Some of the Specific AWS Marketplace Big Data Analytics Advantages:

  • Get to data-driven results faster by decreasing the time it takes to plan, forecast, and make decisions by performing big data analytics and visualizations on AWS data services and other third-party data sources via software solutions available from popular software vendors on AWS Marketplace – the largest ecosystem of popular software vendors and integrators of any provider² – giving your organization the agility to experiment and innovate with the click of a button
  • The AWS Marketplace maintains the largest partner ecosystem of any provider. It has more than 290 big data software solutions available from popular software vendors that are pre-integrated with the AWS cloud
  • Deploy business intelligence and advanced analytics pre-configured software solutions in minutes
  • On-demand infrastructure through software solutions on AWS Marketplace allows iterative, experimental deployment and usage to take advantage of advanced analytics and emerging technologies within minutes, paying only for what you consume, by the hour or by the month
  • Many AWS Marketplace solutions offer free trials, so you can “try on” multiple big data analytical solutions to solve the same business problem to see which is the best fit for your specific scenario

2 Gartner

Example Solutions Achieved Through Augmenting AWS Services with Software Solutions Available on AWS Marketplace

Using software solutions available from popular software vendors on AWS Marketplace, you can customize and tailor your big data analytic project to precisely fit your business scenario. Below are just a few of the example solutions you can achieve when using AWS Marketplace’s software solutions with the AWS big data services.

You can:

  • Launch pre-configured and pre-tested experimentation platforms for big data analysis
  • Query your data where it sits (in-datasource analysis) without moving or storing your data on an intermediate server while directly accessing the most powerful functions of the underlying database
  • Perform “ELT” (extract, load, and transform) rather than “ETL” (extract, transform, and load) when moving your data into the Amazon Redshift data warehouse, so the data arrives in its original form and you can perform multiple data warehouse transforms on the same data (see the sketch after this list)
  • Have long-term connectivity among many different databases
  • Ensure your data is clean and complete prior to analysis
  • Visualize millions of data points on a map
  • Develop route planning and geographic customer targeting
  • Embed visualizations in applications or stand-alone applications
  • Visualize billions of rows in seconds
  • Graph data and drill into areas of concern
  • Have built-in data science
  • Export information into any format
  • Deploy machine-learning algorithms for data mining and predictive analytics
  • Meet the needs of specialized data connector requirements
  • Create real-time geospatial visualization and interactive analytics
  • Have both OLAP and OLTP analytical processing
  • Map disparate data sources (cloud, social, Google Analytics, mobile, on-prem, big data or relational data) using high-performance massively parallel processing (MPP) with easy-to-use wizards
  • Fine-tune the type of analytical result (location, prescriptive, statistical, text, predictive, behavior, machine learning models and so on)
  • Customize the visualizations in countless views with different levels of interactivity
  • Integrate with existing SAP products
  • Deploy a new data warehouse or extend your existing one
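
As a sketch of the ELT pattern referenced in the list above: load raw CSV data from S3 into Redshift with COPY first, then transform it with SQL inside the warehouse. The example below uses the Redshift Data API through boto3; the cluster, database, user, bucket, table and IAM role names are all placeholders.

```python
import boto3

# ELT pattern: load raw data into Redshift first, transform with SQL afterwards.
# Cluster, database, user, bucket, table, and IAM role names are placeholders.
client = boto3.client("redshift-data", region_name="us-east-1")

load_raw = """
    COPY raw.orders
    FROM 's3://example-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS CSV IGNOREHEADER 1;
"""
transform = """
    INSERT INTO analytics.daily_order_totals
    SELECT order_date, SUM(order_value)
    FROM raw.orders
    GROUP BY order_date;
"""

for sql in (load_raw, transform):
    client.execute_statement(ClusterIdentifier="example-cluster",
                             Database="dev", DbUser="awsuser", Sql=sql)
```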

AWS Marketplace-Specific Site for Data Analytics Solutions
There’s a plethora of options on AWS Marketplace for running big data analytics software solutions from popular vendors, already pre-configured on an Amazon Machine Image (AMI), that solve a variety of very specific needs, some of which were mentioned above.

You can visit the AWS Marketplace Big Data Analytics-specific site by clicking the bottom left icon on the AWS Marketplace site or by clicking here to view the premier AWS Marketplace solution providers for transforming and moving your data, processing and analyzing your data, and reporting and visualizing your data.

I’d like to point out that if you click the “Learn More” link at the bottom of each type of solution (for example, below I’m showing the section “Business Intelligence and Data Visualization”), you’re taken to an awesome section that works like a “Channel Guide” for webcasts, teaching you how to work with some of the solutions, presented by software vendor representatives!

The first screenshot below shows where to find the “Learn More” link, and the second screenshot below is of the “Channel Guide” for webcasts by representatives for some of AWS Marketplace software vendors:

FIGURE 2: THE “LEARN MORE” LINK UNDER EACH ANALYTIC SOLUTION TYPE
ON THE BIG DATA ANALYTICS-SPECIFIC SITE

Click the “Learn More” link highlighted above in the red rectangle, and for whichever type of Analytics solution you click on, you’re taken to the Webcast Channels:

FIGURE 3: THE “CHANNEL GUIDE”-TYPE WEBCAST INTERFACE TO HELP YOU UNDERSTAND
EACH PARTICULAR TOPIC ON THE BIG DATA ANALYTICS-SPECIFIC SITE

Overview of AWS Cloud Architecture
The AWS cloud is based on the general design principles of the “Well-Architected Framework”, which increase the likelihood of business success. The framework is based on the following four pillars:

  1. Security: The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies
    • AWS’s built-in security features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here
  2. Reliability: The ability of a system to recover from infrastructure or service failures, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
    • AWS’s built-in fault tolerance and infrastructure disruption features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse solutions for fault tolerance, click here, and for infrastructure/network solutions click here
  3. Performance Efficiency: The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve
    • AWS’s built-in performance features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here
  4. Cost Optimization: The ability to avoid or eliminate unneeded cost or suboptimal resources
    • AWS’s built-in cost alerting features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here

If you’d like to know more about AWS’ “Well-Architected Framework”, from which the above is referenced, click here.
Some of the Types of Big Data Analytical Insights and Example Use Cases

TABLE 3: SOME EXAMPLES OF THE TYPES OF BIG DATA ANALYTICAL INSIGHTS WITH USE CASES

Introduction
Big Data is such a buzzword that it’s prudent to ensure we wrap our heads around what it means.

Big data means a massive volume of both structured and unstructured data that’s so large it’s difficult to process using traditional database and software techniques.

In Enterprise scenarios, the volume of data is just too big, it moves too fast, or it exceeds processing capabilities available on-premises. But this data, when captured, formatted, manipulated, and stored, yields powerful insights – some never imagined – through analytics.

Below you’ll find a description of some of the types of big data analytical insights and common use cases for each:

Descriptive: Descriptive Analytics uses business intelligence and data mining to ask “What has happened?” Descriptive Analytics mines data to provide trending information on past or current events that can give businesses the context they need for future actions. Descriptive Analytics is characterized by the use of KPIs. It drills down into data to uncover details such as the frequency of events, the cost of operations and the root cause of failures. Most traditional business intelligence reporting falls into this realm, but complex and sophisticated analytic techniques also fall into this realm when their purpose is to describe or characterize past events and states. Summary statistics, clustering techniques, and association rules used in market basket analysis are all examples of Descriptive Analytics.

Diagnostic: Diagnostic Analytics examines data or content to answer the question “Why did it happen?” It’s characterized by techniques such as drill-down, data discovery, data mining and correlations. You can think of it as causal inference: the comparative effect of different variables on a particular outcome. While Descriptive Analytics might be concerned with describing how large or significant a particular outcome is, Diagnostic Analytics is more focused on determining what factors and events contributed to that outcome. As more and more cases are included in a particular analysis, and more factors or dimensions are included, it may become impossible to make precise, definitive statements regarding sequences and outcomes. Contradictory cases, data sparseness, missing factors (“unknown unknowns”), and data sampling and preparation techniques all contribute to uncertainty and the need to qualify conclusions in Diagnostic Analytics as occurring in a “probability space”. Training algorithms for classification and regression techniques can be seen as falling into this space, since they combine the analysis of past events and states with probability distributions. Other examples of Diagnostic Analytics include attribute importance, principal component analysis, sensitivity analysis and conjoint analysis.

Discovery: Discovery Analytics doesn’t begin with a pre-definition but rather with a goal. It approaches the data in an iterative process of “explore, discover, verify and operationalize.” This method uncovers new insights and then builds and operationalizes new analytic models that provide value back to the business. The key to delivering the most value through Discovery Analytics is to enable as many users as possible across the organization to participate in it to harness the collective intelligence. Discovery Analytics searches for patterns or specific items in a data set. It uses applications such as geographical maps, pivot tables and heat maps to make the process of finding patterns or specific items rapid and intuitive. Examples of Discovery Analytics include using advanced analytical geospatial mapping to find location intelligence or frequency analysis to find concentrations of insurance claims to detect fraud.

Predictive: Predictive Analytics asks “What could happen?” It’s used to make predictions about unknown future events. It uses many techniques from data mining, machine learning and artificial intelligence. This type of analytics is all about understanding predictions based on quantitative analysis of data sets. It’s in the realm of “predictive modeling” and statistical evaluation of those models. Examples of Predictive Analytics include classification models, regression models, Monte Carlo analysis, random forest models and Bayesian analysis. It helps businesses anticipate likely scenarios so they can plan ahead, rather than reacting to what already happened.

Prescriptive: Prescriptive Analytics uses optimization and simulation to ask “What should we do?” It explores a set of possible actions and suggests actions based on Descriptive and Predictive Analyses of complex data. It’s all about automating future actions or decisions which are defined programmatically through an analytical process. The emphasis is on defined future responses or actions and rules that specify what actions to take. While simple threshold-based “if-then” rules are included in Prescriptive Analytics, highly sophisticated algorithms such as neural nets are also typically in the realm of Prescriptive Analytics because they’re focused on making a specific prediction. Examples include recommendation engines, next best offer analysis, queueing analysis with automated assignment systems and most operations research optimization analyses.

How Big is “Big Data”?
The amount of digital information a typical business has to deal with doubles every two years. It has been predicted that the data we create and copy annually (“the digital universe”) will reach 44 zettabytes – or 44 trillion gigabytes – by the year 2020³. With AWS and the analytical solutions provided by popular software vendors on the AWS Marketplace, there is a wealth of yet-to-be-discovered insights that could provide a myriad of understanding in countless types of research.
3 EMC Digital Universe with Research & Analysis by IDC

FIGURE 4: HOW BIG IS BIG DATA?

Examples of Big Data Producers
This section is included to give you an example of some of the types of “things” that produce massive amounts of data that can be analyzed and repurposed.

TABLE 4: EXAMPLES OF BIG DATA PRODUCERS

Machine and Sensor Data
Machine and sensor data come from many sources, and the sources continue to proliferate. Some examples are energy meters, telecommunications, road/air/sea pattern analysis, satellites, meteorological sensors and other natural phenomena monitoring, scientific and technical services, manufacturing, medical devices and the Internet of Things (IoT) such as smart homes, appliances and cities. Analyses of this type of data can reveal many trends, and predictive analysis can be used to prevent unwanted scenarios or raise alerts when something goes awry.

Image and Video Data
It would take more than 5 million years to watch the amount of video that will cross global IP Networks each month in 2020⁴. Some examples of image and video data are video surveillance, immersive video, virtual (and augmented) reality, internet gaming, smartphone images and video, photo and video sharing sites (YouTube, Instagram, Pinterest, etc.) and streaming video content (such as Netflix). Topological, contextual, hidden statistical patterns and historical analyses are examples of some of the types of analytics that can be done on image and video data⁵.

4 For a detailed report on Visual Networking, read Cisco’s Visual Networking Index: Forecast and Methodology, 2015-2020
5 Wired.com

Social Data
There are approximately 2 billion internet users using social networks in 2016⁶, producing enormous amounts of data not only through posts and tweets, but also comments, likes, and so forth. Some examples include Facebook and Facebook Messenger, Twitter, LinkedIn, Vine, WhatsApp, Skype, and so forth. This type of data is useful for text and sentiment analysis.

6 Statista Statistics Portal

Internet Data
The current forecast projects global IP traffic to nearly triple from 2015 to 2020, growing to 194 exabytes/month⁷. Examples of internet data include data stored on websites, blogs, and news sources, online banking and financial transactions, package and asset tracking, transportation data, telemedicine, first responder connectivity, and even chips for pets! Internet data can be analyzed for security breaches, bank fraud, traffic analysis, geographic distribution of DNS clients, and discovering the origins of cybercrime⁸.

7 Cisco: The Zettabyte Era – Trends and Analysis 2016

8 CAIDA: Center for Applied Internet Data Analysis

Log Data
Log files are records of events that occur in a system, in software, or in communications between users of software. There are many types of logging systems. Some examples are event logs, server logs, RFID logs, Active Directory logs, security logs, mail logs, network logs and transaction logs. Log data analysis includes analyzing performance, debugging software, testing new features, auditing for unauthorized or malicious access, and more.
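
As a minimal sketch of log analysis, the snippet below counts HTTP status codes in a few invented web-server-style lines; real log formats vary, but counting event frequencies is a typical first step toward spotting error spikes or suspicious access.

```python
import re
from collections import Counter

# Illustrative web-server-style log lines; real formats vary.
log_lines = [
    '10.0.0.1 - - [24/Jul/2017:10:01:02] "GET /index.html" 200',
    '10.0.0.2 - - [24/Jul/2017:10:01:05] "GET /missing" 404',
    '10.0.0.1 - - [24/Jul/2017:10:01:09] "POST /login" 500',
]

# Count responses by status code, a first step toward spotting error spikes.
status_counts = Counter(re.search(r'\s(\d{3})$', line).group(1) for line in log_lines)
print(status_counts)
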
Third-Party Data
Third-party data is any information collected by an entity that does not have a direct relationship with the user the data is being collected on. Often this data is generated on a variety of platforms and then aggregated together for analysis. Examples include geospatial data, mapping and demographic data, content delivery networks, CRM and other business software systems. Third-party data can be analyzed for trends in traffic, spread of disease, user behavior, and more.

AWS Cloud Computing Models and Deployment Models
Cloud computing provides developers and IT departments the ability to focus on what matters most and avoid undifferentiated work like procurement, maintenance, and capacity planning. There are several different models and deployment strategies that help meet the specific needs of different users. Each type of cloud service and deployment method provides different levels of control, flexibility and management. Understanding the differences between “Infrastructure as a Service” (IaaS), “Platform as a Service” (PaaS), and “Software as a Service” (SaaS), in addition to the different deployment strategies available, can help you decide what set of services is right for your business needs.

Before an analytical cloud project starts, it’s important to choose the right cloud computing and deployment architectures. Many factors come into play, including the location of the data to be analyzed, where the analytics processing will be performed, and the legal and regulatory requirements of different countries. Once you’ve determined the best cloud computing and deployment model, you can utilize Amazon CloudFront to speed up the distribution of your application. Amazon CloudFront delivers your content through a worldwide network of edge locations, so that when a user requests content served from CloudFront, they’re routed to the edge location that provides the lowest latency and content is delivered with the best possible performance. For more details about Amazon CloudFront, click here.

TABLE 5: AWS CLOUD COMPUTING MODELS

AWS Cloud Computing Models
There are three main models for cloud computing on AWS. Each model represents different parts of the cloud computing stack.

  1. IaaS contains the basic building blocks for cloud IT. This model typically provides access to networking features, “computers” (virtual or on dedicated hardware), and data storage space. It gives you the highest level of flexibility and management control over your IT resources and is most similar to existing on-premises IT resources. IaaS is usually the first model used when moving to the cloud.
  2. PaaS removes the need to manage the underlying infrastructure (usually hardware and operating systems), which allows you to focus on the deployment and management of your applications. This increases efficiency because you don’t have to worry about resource procurement, capacity planning, software maintenance, patching, or any of the other undifferentiated “heavy lifting” involved in running your applications.
  3. SaaS provides you with a completed product that’s run and managed by the service provider. In most cases, people referring to SaaS are referring to end-user applications. With SaaS you don’t have to think about how the service is maintained or how the underlying infrastructure is managed; you only need to think about how you’ll use the software.

You’ll find popular open source and commercial software on AWS Marketplace available as SaaS (in addition to individual Amazon Machine Images (AMIs) or clusters of AMIs deployed through an AWS CloudFormation template).

AWS Cloud Computing Deployment Models
There are three AWS cloud computing deployment models: Public, Hybrid, and Private.

*NOTE: These describe where the IT resources reside, and are separate from the many ways to get your data and applications onto AWS.*

TABLE 6: AWS CLOUD DEPLOYMENT MODELS

AWS Public Cloud Model (Cloud Native)
The AWS public cloud is where most companies and individuals start. It’s the easiest, fastest way to begin using on-demand delivery of IT resources and applications over the Internet, with low-cost, pay-as-you-go pricing through AWS services and solutions available on AWS Marketplace.

The public cloud is an ideal place to quickly use big data analytics solutions on the AWS Marketplace to experiment, innovate and try new and different analytical solutions. Spin up solutions as you need them, turn them off when you’re done and only pay for what you’ve used.

The public cloud provides a simple way to access servers, storage, databases and a huge set of application services. AWS owns and maintains the network-connected hardware required for these services, while you provision and use what you need via the AWS console. Using the public cloud gives you the benefits of cloud computing such as the following:

  • Rather than investing in data centers and servers before you know what you’re going to use, you pay only when you consume computing resources, and only for how much you use
  • Achieve lower variable cost because hundreds of thousands of customers are aggregated in the cloud
  • Eliminate guessing on infrastructure capacity needs. Access as much or as little as you need, and scale up or down as required within minutes
  • Increase speed and agility, since new IT resources are only a click away
  • Focus on projects that differentiate your business since AWS does all the heavy lifting of racking, stacking and powering servers
  • Easily deploy your application in multiple regions around the world, going global in minutes

The public network contains elements that may be sourced from the internet, such as data sources and users, and the edge services needed to access the AWS cloud or enterprise network. The flow from the external internet may come through normal edge services, including DNS servers (Amazon Route 53, for example), content delivery networks (Amazon CloudFront, for example), firewalls (Amazon VPC security groups, for example) and load balancers (Elastic Load Balancing, for example), before entering the data integration or data streaming entry points of the data analytics solution.

AWS Hybrid Cloud Model
For companies that have significant on-premises and/or data center investments, migrating to the cloud can take years. Therefore, it’s very common to see enterprises use a “Hybrid Cloud Architecture”, where critical data and processing remain in the data center and other resources are deployed in public cloud environments. Processing resources can be further optimized with a hybrid topology that enables cloud analytics engines to work with on-premises data. This leverages the faster software deployment and update cycles of the cloud while keeping data inside the firewall.

Another benefit of a hybrid environment is the ability to develop applications on dedicated resource pools, which eliminates the need to compromise on configuration details like processors, GPUs, memory, networking and even software licensing constraints. The resulting solution can subsequently be deployed to an Infrastructure as a Service (IaaS) cloud service that offers compute capacity matching the dedicated hardware environment that would otherwise be hosted on-premises. This capability is rapidly becoming a big differentiator for cloud applications that need to hit the ground running with the right configuration to meet real-world demands.

AWS Private/Enterprise Cloud Model

The main reason to choose a private cloud environment is network isolation. Your EC2 instances are created in a virtual private cloud (VPC) to provide a logically isolated section of the AWS cloud.

Within that VPC, you have complete control over the virtual networking environment, including your own IP range selection, subnet creation, and configuration of route tables and network gateways. You can also create a hardware Virtual Private Network (VPN) connection back to your own network. You can implement fine-grained access roles and groups, and multiple layers of isolation for users. Enterprise governance and private encryption resources are available in a private cloud model. For more information on Enterprise cloud computing with AWS, click here. There are solutions on AWS Marketplace that allow you to perform big data analytics on the cloud while keeping your data on-premises.
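For illustration, here’s a minimal boto3 (Python) sketch of carving out an isolated VPC with a private subnet and its own route table; the CIDR ranges and region are placeholder assumptions, not recommendations:

```python
import boto3

# Minimal sketch: carve out an isolated VPC with one private subnet.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# A private subnet for analytics workloads that should never face the internet.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")

# A dedicated route table; with no internet gateway route, traffic stays internal.
route_table = ec2.create_route_table(VpcId=vpc_id)
ec2.associate_route_table(
    RouteTableId=route_table["RouteTable"]["RouteTableId"],
    SubnetId=subnet["Subnet"]["SubnetId"],
)
print("Created isolated VPC", vpc_id)
```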

Overview: AWS Identity and Access Management and Other AWS Built-In Security Features

TABLE 7: AWS BUILT-IN SECURITY SERVICES

AWS Security Overview
Before delving into any of the AWS services used for big data advanced analytics, security must at least be addressed at a high level. For any business, cloud security is the number one concern. AWS has industry-leading capabilities across facilities, networks, software and business processes that meet the strictest requirements of any vertical. Security is a core functional requirement that protects mission-critical information from accidental or deliberate theft, leakage, integrity compromise and deletion.

AWS customers benefit from a data center and network architecture built to satisfy the requirements of their most security-sensitive customers. AWS uses redundant layered controls, continuous validation and testing, and a substantial amount of automation to ensure that the underlying infrastructure is monitored and protected 24×7. These controls are replicated in every new data center and service.

Under the AWS “Shared Responsibility Model”, AWS is responsible for the security of the underlying cloud infrastructure and you are responsible for securing workloads you deploy in AWS, giving you the flexibility and agility to implement the most applicable security controls for your business functions in the AWS environment.

There are certain security features, such as individual user accounts and credentials, SSL/TLS for data transmissions, and user activity logging, that you should configure no matter which AWS service you use.

Identity and Access Management – User Accounts
AWS provides a variety of tools and features to keep your AWS account and resources safe from unauthorized use. This includes credentials for access control, HTTPS endpoints for encrypted data transmission, the creation of separate Identity and Access Management (IAM) user accounts, user activity logging for security monitoring, and Trusted Advisor security checks.

Only the business owner should have “root access” to your AWS account. The screenshot below is what you see when you’re logging in with your “root credentials”:

FIGURE 5: LOGIN PAGE USING YOUR ROOT CREDENTIALS

The screenshot below is the login page you see when you log in with your Identity and Access Management (IAM) account credentials (Note the additional “Account” textbox and the link at the bottom stating “Sign-in using root account credentials”):

FIGURE 6: LOGIN PAGE WHEN YOU’RE LOGGING IN WITH IAM CREDENTIALS

To avoid day-to-day use of your AWS root account, use IAM to create and manage individual users (an individual, system, or application that interacts with your AWS resources). With IAM, you define policies that control which AWS services your users can access and what they can do with them. This gives you very fine-grained control: grant users only the minimum permissions needed to do their jobs.
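As a rough sketch of that least-privilege approach, the boto3 snippet below creates a hypothetical “analytics-reader” user and grants it nothing beyond read access to a single, made-up S3 bucket:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical analyst user who only needs to read one S3 bucket.
iam.create_user(UserName="analytics-reader")

least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-analytics-bucket",
            "arn:aws:s3:::example-analytics-bucket/*",
        ],
    }],
}

# Attach the minimal permissions inline; nothing else is granted.
iam.put_user_policy(
    UserName="analytics-reader",
    PolicyName="ReadAnalyticsBucketOnly",
    PolicyDocument=json.dumps(least_privilege_policy),
)
```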

The Table Below Gives You an Overview of AWS User Security Measures:

TABLE 8: AWS BUILT-IN USER SECURITY MEASURES

To read more about IAM security best practices, click here.

AWS Network, Data and API Security
The AWS network has been architected to permit you to select the level of security and resiliency appropriate for your workload, and to enable you to build geographically dispersed, fault-tolerant web architectures with cloud resources on a world-class network infrastructure that’s continually monitored and managed.

Most enterprises take advantage of Amazon Virtual Private Cloud (VPC), which enables you to launch AWS resources into a virtual network you define that resembles your own data center network, but with the benefits of the scalable infrastructure of AWS. For more information, click here.

Below are Some of AWS Network Security Measures:
• Firewall and other boundary devices that employ rule sets, access control lists (ACLs) and configurations
• Secure access points with comprehensive monitoring
• Transmission protection via HTTPS using SSL
• Continually monitoring systems at all levels
• Account audits every 90 days
• Security logs
• Individual service-specific security
• Virtual Private Gateways / Internet Gateways
• Amazon Route 53 Security (DNS)
• CloudFront Security
• Direct Connect Security for Hybrid Cloud Architectures
• Multiple Data Security Options
• Encryption and Data Encryption at rest
• Event Notifications
• Amazon Cognito Federated Identity Authentication

For more information on AWS Network, Data, and API Security, look here.

AWS Trusted Advisor
AWS Trusted Advisor scours your infrastructure and provides continual best practice recommendations free of charge in four categories: Cost Optimization, Performance, Security and Fault Tolerance. Within the Trusted Advisor console, details are given and there are direct links to the exact resource that requires attention. However, if you have a Business or Enterprise Support Plan, you have access to numerous other best practice recommendations. See the image below to grok the way AWS Trusted Advisor works:

FIGURE 7: AWS TRUSTED ADVISOR OVERVIEW DIAGRAM
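If you do have a Business or Enterprise Support plan, you can also pull Trusted Advisor results programmatically through the AWS Support API. A minimal boto3 sketch (check names and output will vary by account):

```python
import boto3

# The AWS Support API (which backs Trusted Advisor) requires a Business or
# Enterprise Support plan and is served from the us-east-1 region.
support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]

# Print the security-category checks and how many resources each one flags.
for check in checks:
    if check["category"] == "security":
        result = support.describe_trusted_advisor_check_result(
            checkId=check["id"], language="en"
        )["result"]
        flagged = result.get("flaggedResources", [])
        print(f'{check["name"]}: {result["status"]}, {len(flagged)} flagged resources')
```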

To read more about AWS Trusted Advisor click here.

AWS Marketplace Software Solutions to Augment AWS’s Built-In Security Features

AWS’s built-in infrastructure monitoring and security features can be enhanced and customized to meet the needs of any business with the plethora of options available on AWS Marketplace, creating a secure cloud nirvana.

Some of the solutions to enhance security can be found here.

Using AWS Services with Solutions Available on AWS Marketplace for Big Data Analytics
This section will describe how to implement, augment, or customize some of the most commonly used AWS managed services in big data analytics with solutions available on AWS Marketplace.

Below you’ll find the AWS Management Console (the view below is once you’ve logged in), from where you access AWS’s managed services:

FIGURE 8: THE AWS MANAGEMENT CONSOLE WITH THE MANAGED SERVICES

Amazon EC2: Self-Managed Big Data Analytics Solutions on AWS Marketplace

TABLE 9: AMAZON EC2 SELF-MANAGED ANALYTICS

Amazon Elastic Compute Cloud (EC2) Overview
Amazon EC2 provides an ideal platform for operating your own self-managed big data analytics applications on AWS infrastructure. Almost any software you can install on Linux or Windows virtualized environments can be run on Amazon EC2 with a pay-as-you-go pricing model using a solution available on AWS Marketplace. With Amazon EC2 you can distribute computing power across parallel servers so your architecture executes its algorithms in the most efficient manner.

Amazon EC2 provides scalable computing capacity through highly configurable instance types launched from an Amazon Machine Image (AMI). You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. For a quick test run or a one-time big data analytics project, you can use instance store volumes for temporary data that’s deleted when you stop or terminate your instance, or use Amazon Elastic Block Store (EBS) for persistent storage.
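As a simple sketch of launching such an instance with boto3, the AMI ID, key pair, and security group below are placeholders you’d replace with the Marketplace AMI and resources in your own account:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # placeholder Marketplace analytics AMI
    InstanceType="c4.2xlarge",              # compute-optimized for batch analytics
    MinCount=1,
    MaxCount=1,
    KeyName="my-analytics-key",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    BlockDeviceMappings=[{
        # Persistent EBS volume so results survive stopping the instance.
        "DeviceName": "/dev/sdf",
        "Ebs": {"VolumeSize": 500, "VolumeType": "gp2", "DeleteOnTermination": False},
    }],
)
print("Launched", response["Instances"][0]["InstanceId"])
```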

EC2 also provides virtual networks you can create that are logically isolated from the rest of the AWS cloud that you can optionally connect to your own network, known as virtual private clouds (VPCs). You could use a VPC to run analytics solutions with data in your data center, or use one of the solutions on AWS Marketplace that facilitates a hybrid deployment model like Attunity CloudBeam (which has many other big data analytics features).

Examples of Amazon EC2 Self-Managed Analytics Solutions on AWS Marketplace
Some examples of self-managed big data analytics that run on Amazon EC2 include the following:
• A Splunk Enterprise Platform, the leading software platform for real-time Operational Intelligence. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. A Splunk Analytics for Hadoop solution, called Hunk, is also available on AWS Marketplace; it enables interactive exploration, analysis, and visualization of data stored in Amazon EMR and Amazon S3
• A Tableau Server Data Visualization Instance, for users to interact with pre-built data visualizations created using Tableau Desktop. Tableau Server allows for ad-hoc querying and data discovery, supports high-volume data visualization and historical analysis, and enables the creation of reports and dashboards
• A SAP HANA One Instance, a single-tenant SAP HANA database instance that has SAP HANA’s in-memory platform, to do transactional processing, operational reporting, online analytical processing, predictive and text analysis
• A Geospatial AMI such as MapLarge, that brings high-performance, real-time geospatial visualization and interactive analytics. MapLarge’s visualization results are useful for plotting addresses on a map to determine demographics, analyzing law enforcement and intelligence data, delivering insight to public health information, and visualizing distances such as roads and pipelines
• An Advanced Analytics Zementis ADAPA Decision Engine Instance, which is a platform and scoring engine to deploy Data Science predictive models built in tools like R, Python, KNIME, SAS, SPSS, SAP, FICO and more. Zementis ADAPA Decision Engine can score data in real-time using web services or in batch mode from local files or data in Amazon S3 buckets. It provides predictive analytics through many predictive algorithms, sensor data processing (IoT), behavior analysis, and machine learning models
• A Matillion Data Integration Instance, an ELT service natively built for Amazon Redshift that uses Amazon Redshift’s processing for data transformations to take advantage of its blazing speed and scalability. Matillion gives you the ability to orchestrate and/or transform data upon ingestion, or simply load the data so it can be transformed multiple times as your business requires

Below is an awesome brochure on how using solutions available on “AWS Marketplace Re-Invents the Way You Choose, Test, and Deploy Analytics Software”:

FIGURE 9: AWS MARKETPLACE BROCHURE: RE-INVENTING THE WAY YOU CHOOSE, TEST, AND DEPLOY ANALYTICS SOFTWARE

Amazon EC2 Instance Types
Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity; compute power is now expressed in “vCPUs” (virtual CPUs), versus the legacy “ECU” (EC2 Compute Unit) measure you’ll still see at times today. Each instance type includes one or more instance sizes, allowing you to scale your resources to the requirements of your target analytical workload. To read more about the differences between Amazon EC2-Classic and Amazon EC2-VPC, read this.

Performance is based on the Amazon EC2 instance type you choose. There are many instance types that you can read about here, but below the four main EC2 types that power big data analytics are described:

  • Compute Optimized: Compute-optimized instances, such as C4 instances, feature the highest performing processors and the lowest price/compute performance in EC2. With support for clustering C4 instances, they’re ideal for batch processing, distributed analytics, high performance science and engineering applications, ad serving, MMO gaming, and video encoding
  • Memory Optimized: Memory optimized instances have the lowest cost per GB of RAM among Amazon EC2 instance types. These instances are ideal for high performance databases, distributed memory caches, in-memory analytics, genome assembly and analysis, and other large enterprise applications
  • GPU Optimized: GPU instances are ideal to power graphics-intensive applications such as 3D streaming, machine learning, and video encoding. Each instance features high-performance NVIDIA GPUs with an on-board hardware video encoder designed to support up to eight real-time HD video streams (720p@30fps) or up to four real-time full HD video streams (1080p@30fps)
  • Dense Storage: Featuring up to 48 TB of HDD-based local storage, dense storage instances deliver high throughput, and offer the lowest price per disk throughput performance on EC2. This instance type is ideal for Massively Parallel Processing (MPP), Hadoop, distributed file systems, network file systems, and big data processing applications

Amazon S3: A Data Store for Computation and Large-Scale Analytics

TABLE 10: AMAZON S3 COMPUTATION & ANALYTICS DATA STORE

Amazon Simple Storage Service (S3) Overview
Amazon S3 is storage for the internet. It’s a simple storage service that offers software developers a highly-scalable, reliable, and low-cost data storage infrastructure. It provides a simple web service interface that can be used to store and retrieve any amount of data, at any time, from within Amazon EC2 or anywhere on the web. You can read, write and delete objects containing from 1 byte to 5 TB of data each. The number of objects you can store in an S3 “bucket” is virtually unlimited. It’s highly secure, supports encryption at rest, and provides multiple mechanisms to provide fine-grained control of access to Amazon S3 resources. Amazon S3 is also highly performant: it allows concurrent read or write access by many separate clients or application threads. No storage provisioning is necessary.

Amazon S3 is very commonly used as a data store for computation and large-scale analytics, such as financial transactions, clickstream analytics, and media transcoding. Because of the horizontal scalability of Amazon S3, you can access your data from multiple computing nodes concurrently without being constrained by a single connection.
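A minimal boto3 sketch of that pattern, writing a batch of data to S3 and reading it back; the bucket and key names are purely illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key names used purely for illustration.
bucket = "example-clickstream-archive"
key = "2017/01/15/events-000123.json.gz"

# Write a raw clickstream batch to S3...
with open("events-000123.json.gz", "rb") as data:
    s3.put_object(Bucket=bucket, Key=key, Body=data)

# ...and read it back later from any number of EC2 or EMR nodes concurrently.
obj = s3.get_object(Bucket=bucket, Key=key)
payload = obj["Body"].read()
print(len(payload), "bytes retrieved")
```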

Amazon S3 is the common data repository for pre-and-post processing with Amazon EMR.

FIGURE 10: USING AMAZON S3 FOR STORAGE PRE-AND-POST AMAZON EMR ANALYSIS

Amazon S3 is well-suited for extremely spiky bandwidth demands, making it the perfect storage for Amazon EMR batch analysis. Because Amazon S3 is inexpensive, highly durable, stores objects redundantly on multiple devices across multiple facilities, and protects critical data from inadvertent deletion with its versioning capability, data is often kept on S3 for long periods of time after processing with Amazon EMR so that subsequent new queries can be run on the same data. If you store your data on Amazon S3, you can access that data from as many Amazon EMR clusters as you need.

FIGURE 11: ACCESSING DATA IN AMAZON S3 FROM MULTIPLE AMAZON EMR CLUSTERS

Amazon S3 is the common data repository for Amazon Redshift before loading the data into the Amazon Redshift Data Warehouse. You use the “COPY” command to load data from Amazon S3:

FIGURE 12: AMAZON S3 COPY COMMAND
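As a sketch of what that load might look like from Python (using the psycopg2 driver), with a placeholder cluster endpoint, credentials, S3 path and IAM role:

```python
import psycopg2

# Connection details and the IAM role ARN below are placeholders.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="********",
)

copy_sql = """
    COPY sales
    FROM 's3://example-analytics-bucket/sales/2017/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    DELIMITER '|';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift loads the S3 files in parallel across slices
conn.close()
```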

In addition, all data written to any node in an Amazon Redshift cluster is continually backed up to Amazon S3.

FIGURE 13: AMAZON REDSHIFT CLUSTER DATA BACKS UP AUTOMATICALLY TO AMAZON S3

Examples of Some of Amazon S3’s Benefits in Large-Scale Analytics:
• S3 storage provides the highest level of data durability and availability in the AWS platform
• Error correction is built-in, and there are no single points of failure. It’s designed to sustain concurrent loss of data in two facilities, making it very well-suited to serve as the primary data storage for mission-critical data
• Amazon S3 is designed for 99.999999999% (11 nines) durability per object and 99.99% availability over a one-year period
• Highly scalable, with practically unlimited storage
• Access to Amazon S3 from Amazon EC2 in the same region is lightning fast; server-side latencies are insignificant relative to Internet latencies
• Because Amazon S3 can be accessed by multiple threads, multiple applications and multiple clients concurrently, total Amazon S3 aggregate throughput scales to rates that far exceed what any single server can generate or consume
• To speed access to relevant data, many developers pair Amazon S3 with a database such as Amazon DynamoDB or Amazon RDS, where Amazon S3 stores the actual information and the database serves as the repository for the associated metadata. Metadata in the database can be easily indexed and queried, making it efficient to locate an object’s reference via a database query; this result can then be used to pinpoint and retrieve the object itself from Amazon S3
• You can nest folders in Amazon S3 “buckets” and give fine-grained access control to each

Amazon Redshift: A Massively Parallel Processing (MPP) Petabyte-Scale Enterprise Data Warehouse

TABLE 11: AMAZON REDSHIFT DATA WAREHOUSE

Amazon Redshift Overview
Amazon Redshift is a fast and powerful, fully-managed, petabyte-scale data warehouse service that makes it easy and cost-effective to efficiently analyze all your data by seamlessly integrating with existing business intelligence, reporting, and analytics tools. It’s optimized for datasets ranging from a few hundred gigabytes to a petabyte or more. You can start small for a very low cost per hour with no commitments and scale to petabytes at as little as one tenth the cost of traditional solutions. And, when you need to scale, you simply add more nodes to your cluster and Amazon Redshift redistributes your data for maximum performance, with no downtime.

FIGURE 13: ADDING MORE NODES TO AN AMAZON REDSHIFT CLUSTER

Amazon Redshift is a SQL data warehouse solution and uses standard ODBC and JDBC connections. Your data warehouse can be up and running in minutes, enabling you to use your data to acquire new insights for your business and customers continually.

Traditional data warehouses require significant expenditures, time and resources to buy, build, and maintain, and they don’t scale well. As your requirements grow, you have to invest in more hardware and resources, as well as hire DBAs to ensure your queries are working right and that there’s no data loss. Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups, patches, and upgrades.

Amazon Redshift’s Features Enabling Large-Scale Analytics
Amazon Redshift uses columnar storage and a massively parallel processing (MPP) architecture to parallelize and distribute queries across multiple nodes to consistently deliver high performance at any volume of data.

FIGURE 14: AMAZON REDSHIFT’S COLUMNAR STORAGE ARCHITECTURE GIVES THE ABILITY TO ONLY READ THE DATA YOU NEED

It automatically and continuously monitors your cluster and copies your data into Amazon S3 so you can restore your data warehouse with a few clicks. Amazon Redshift stores three copies of your data for reliability. Amazon Redshift utilizes data compression and zone maps to reduce the amount of I/O needed to perform queries. Security is built in: you can encrypt data at rest and in transit using hardware-accelerated AES-256 and SSL, and if you want to use Amazon VPC with your Amazon Redshift cluster, that’s also built in. All API calls, connection attempts, queries and changes to the cluster are logged and auditable.

An Amazon Redshift data warehouse is a collection of computing resources called “nodes” that are organized into a group called a “cluster”. Each cluster runs an Amazon Redshift engine and contains one or more databases. Each cluster has a leader node and one or more compute nodes. The “leader node” receives queries from client applications, parses the queries and develops query execution plans. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes then finally returns the results back to the client applications. “Compute nodes” execute the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent back to the leader node for aggregation before being sent back to the client applications.

FIGURE 15: AMAZON REDSHIFT DATA WAREHOUSE SYSTEM ARCHITECTURE

Data typically flows into a data warehouse from many different sources and in many different formats, including structured, semi-structured, and unstructured data. This data is processed, transformed, and ingested at a regular cadence. You can use AWS Data Pipeline to extract, transform, and load data into Amazon Redshift. AWS Data Pipeline provides fault tolerance, scheduling, resource management and an easy-to-extend API for your ETL. It can reliably process and move data between different AWS compute and storage services as well as on-premises data sources.

You can also use AWS Database Migration Service to stream data to Amazon Redshift from any of the supported sources including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, SAP ASE and SQL Server, enabling consolidation for easy analysis of data in Amazon Redshift.

Amazon Redshift is integrated with other AWS services and has built-in commands to load data in parallel to each node from Amazon S3, Amazon DynamoDB, or your Amazon EC2 and on-premises servers using SSH. Amazon Kinesis and AWS Lambda integrate with Amazon Redshift as a target. You can also load streaming data into Amazon Redshift using Amazon Kinesis Firehose.
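For example, a minimal boto3 sketch of pushing events into a hypothetical Firehose delivery stream that has Amazon Redshift configured as its destination (Firehose stages the records in S3 and issues the COPY on your behalf):

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# "clickstream-to-redshift" is a placeholder delivery stream already configured
# with Amazon Redshift as its destination.
event = {"user_id": 42, "page": "/pricing", "ts": "2017-01-15T10:22:31Z"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-redshift",
    Record={"Data": json.dumps(event) + "\n"},  # newline-delimited JSON rows
)
```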

Amazon Redshift Analytics Examples
• Analyze Global Sales Data for Multiple Products
• Store Historical Stock Trade Data
• Analyze Ad Impressions and Clicks
• Aggregate Gaming Data
• Analyze Social Trends
• Measure Clinical Quality, Operation Efficiency, and Financial Performance in the Healthcare Space

AWS Marketplace Solutions for Amazon Redshift
Data can be loaded into Amazon Redshift from a multitude of solutions available from popular software vendors on AWS Marketplace to assist in Data Integration, Analytics, and Reporting and Visualization. Many of the solutions include much more than the broad topic titles.

For Data Integration and more, Matillion ETL for Redshift is a fast, modern, easy-to-use and powerful ETL/ELT tool that makes it simple and productive to load and transform data on Amazon Redshift: 100x faster than traditional ETL technology, and up and running in under 5 minutes. With a few clicks you can load data directly into Redshift, fast, from Amazon S3; Amazon RDS; relational, columnar, cloud and NoSQL databases; FTP/HTTP; REST, SOAP, & JSON APIs; Amazon EMR; and directly from enterprise and cloud-based systems including Google Analytics, Google Adwords, Facebook, Twitter and more.

FIGURE 16: MATILLION PROCESSES MILLIONS OF ROWS IN SECONDS WITH REAL-TIME FEEDBACK

Matillion ETL for Redshift transforms data at eye-popping speed in a productivity-oriented, streamlined, browser-based graphical job development environment. Expect a 50% reduction in ETL development and maintenance effort, and months off your project, as a result of the streamlined UI, tight integration with AWS & Redshift and the sheer speed.

FIGURE 17: MATILLION JOINS, TRANSFORMS, FILTERS & MANIPULATES BIG DATA AT BLISTERING SPEED IN A MODERN, BEAUTIFUL, BROWSER-BASED ENVIRONMENT

Matillion ETL for Redshift delivers a rich orchestration environment where you can orchestrate and schedule data load and transform; control flow; integrate with other systems and AWS services via Amazon SQS, Amazon SNS and Python; iterate; manage variables; create and drop tables; vacuum and analyze tables; soft code ETL/ELTs from configuration tables; control transactions and commit/roll-back; setup alerting; and develop data quality, error-handling and conditional logic.

For Advanced Analytics, TIBCO Spotfire Analytics Platform is a complete analytics solution that helps you quickly uncover insights for better decision-making. Explore, visualize, and create dashboards for Amazon Redshift, RDS, Microsoft Excel, SQL Server, Oracle, and more in minutes. Easily scale from a small team to the entire organization with Spotfire for AWS. The offering includes 1 Spotfire Analyst user (via Microsoft Remote Desktop), unlimited Consumer and Business Author (web) users, plus Spotfire Server, Web Player, Automation Services and Statistics Services. Go from data to dashboard in under a minute. No other solution makes it as easy to get started or deliver analytics expertise. TIBCO Spotfire® Recommendations suggests the best visualizations based on years of best practices. Broadest data connectivity: access and combine all of your data in a single analysis to get a holistic view of your business, cloud or on-premises, small or big data, with best-in-class analytics for any data source, including Amazon Redshift and RDS.

FIGURE 18: TIBCO SPOTFIRE ANALYTICS PLATFORM CONNECTS TO CLOUD OR ON-PREMISES DATA SOURCES

Comprehensive Analytics – A full spectrum of analytics capabilities to empower novice to advanced users, including: interactive visualizations, data mashup, predictive and prescriptive analytics, location analytics, and more.

FIGURE 19: TIBCO SPOTFIRE GIVES COMPREHENSIVE ANALYTICS & VISUALIZATIONS

For Data Analysis and Visualization, Tableau Server for AWS is browser- and mobile-based visual analytics anyone can use. Publish interactive dashboards with Tableau Desktop and share them throughout your organization.

FIGURE 20: TABLEAU SERVER BROWSER & MOBILE-BASED VISUAL ANALYTICS

FIGURE 21: TABLEAU SERVER’S PUBLISHED INTERACTIVE DASHBOARDS SHARED THROUGHOUT YOUR ORGANIZATION

Embedded or as a stand-alone application, Tableau empowers your business to find answers in minutes, not months. By deploying from AWS Marketplace you can stand up a perfectly sized instance for your Tableau Server with just a few clicks. Tableau helps tens of thousands of people see and understand their data by making it simple for the everyday data worker to perform ad-hoc visual analytics and data discovery, as well as seamlessly build beautiful dashboards and reports. Tableau is designed to make connecting live to data of all types a simple process that doesn’t require any coding or scripting. From cloud sources like Amazon Redshift, to on-premises Hadoop clusters, to local spreadsheets, Tableau gives everyone the power to quickly start visually exploring data of any size to find new insights.

For Data Warehouse Databases, SAP HANA One is a production-ready, single-tenant SAP HANA database instance, upgradable to the latest HANA SPS version (by add-on). Perform real-time analysis, and develop and deploy real-time applications, with SAP HANA One. Natively built using in-memory technology and now deployed on AWS, SAP HANA One accelerates transactional processing, operational reporting, OLAP, and predictive and text analysis while bypassing the traditional data latency and maintenance issues created by pre-materializing views and pre-caching query results. Unlike other database management systems, SAP HANA One on AWS streamlines both transactional (OLTP) and analytical (OLAP) processing by working with a single data copy in the in-memory columnar data store. By consolidating OLAP and OLTP workloads into a single in-memory RDBMS, you benefit from a dramatically lower TCO, in addition to mind-blowing speed. Build new, or deploy existing, on-demand applications on top of this instance for productive use. Developers can take advantage of this offering through standards-based open connectivity protocols (ODBC, JDBC, ODBO, ODATA and MDX), allowing easy integration with existing tools and technologies. Transform decision processing by streamlining transactions, analytics, planning, and predictive and text analytics on a single in-memory platform. HANA One instances are now more secure with SSH root login disabled; customers can log in to the instance using the new ‘ec2-user’ user.

Amazon EMR: A Managed Hadoop Distributed Computing Framework

TABLE 12: AMAZON EMR – A MANAGED HADOOP DISTRIBUTED COMPUTING FRAMEWORK

Amazon Elastic MapReduce (EMR) Overview

With Amazon EMR you can analyze and process vast amounts of data by distributing the computational work across a resizable cluster of virtual servers using Apache Hadoop, an open-source framework. Open-source projects that run on top of the Hadoop architecture can also be run on Amazon EMR, such as Hive, Pig, Spark, etc.

Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced down to a single output set. One node, designated as the “master node”, controls the distribution of tasks.

FIGURE 22: AMAZON EMR HADOOP CLUSTER WITH THE MASTER NODE DIRECTING A GROUP OF SLAVE NODES TO PROCESS THE DATA

Amazon EMR has made enhancements to Hadoop and the other open-source applications to work seamlessly with AWS. For example, Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and Amazon CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of Amazon DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. This process is called an Amazon EMR cluster.

FIGURE 23: AMAZON EMR INTERACTING WITH OTHER AWS SERVICES

Amazon EMR’s Features Enabling Large-Scale Analytics

Hadoop provides the framework to run big data processing and analytics, and Amazon EMR does all the heavy lifting involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster. You can easily provision a fully managed Hadoop framework in minutes. You can scale your Hadoop cluster dynamically and pay only for what you use, from one to thousands of compute instances. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances. You can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete.
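A minimal boto3 sketch of launching such a transient cluster; the release label, instance sizes, roles, and S3 log bucket are illustrative assumptions (the default EMR roles must already exist in your account):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-analysis-cluster",
    ReleaseLabel="emr-4.7.0",
    LogUri="s3://example-analytics-bucket/emr-logs/",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 4,                        # 1 master + 3 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,      # terminate when the work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster", response["JobFlowId"])
```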

Amazon EMR securely and reliably handles big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

Amazon EMR monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. It automatically configures Amazon EC2 firewall settings that control access to instances, and you can launch clusters in an Amazon VPC. For objects stored in Amazon S3, you can use Amazon S3 server-side encryption or Amazon S3 client-side encryption with EMRFS, with AWS Key Management Service or customer-managed keys. You can customize every cluster.

Apache Spark is an engine in the Apache Hadoop ecosystem for fast and efficient processing of large datasets. By using in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to define data transformations, Spark has shown significant performance increases for certain workloads when compared to Hadoop MapReduce. Spark also powers additional tools such as Spark SQL, and it can be run on top of YARN (the resource manager for Hadoop 2). AWS has revised the bootstrap action to install Spark 1.x on AWS Hadoop 2.x AMIs and run it on top of YARN. The bootstrap action also installs and configures Spark SQL (SQL-driven data warehouse), Spark Streaming (streaming applications), MLlib (machine learning), and GraphX (graph systems).
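On EMR 4.x and later release labels, Spark can also be installed as a native application and work submitted as a step on a running cluster rather than via a bootstrap action; here’s a minimal boto3 sketch with a placeholder cluster ID and application path:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark application to an existing cluster; the cluster ID and the
# S3 path of the application code are placeholders.
step = {
    "Name": "spark-sessionization",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://example-analytics-bucket/jobs/sessionize.py",
        ],
    },
}

emr.add_job_flow_steps(JobFlowId="j-EXAMPLE12345", Steps=[step])
```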

The S3 location for the Spark installation bootstrap action is:

Amazon EMR Analytics Examples

  • Log processing and analytics
  • Large extract, transform, and load (ETL) data movement
  • Risk modeling and threat analytics
  • Ad targeting and click stream analytics
  • Genomics
  • Predictive analytics
  • Ad hoc data mining and analytics

You can view an architectural diagram of Web Log Analysis on AWS here, and another on Advertisement Serving here.

AWS Marketplace Solutions for Amazon EMR and the Hadoop Ecosystem

Data can be loaded into EMR from a multitude of solutions available from popular software vendors on AWS Marketplace to assist in Data Integration, Analytics, and Reporting and Visualization. Many of the solutions include much more than the broad topic titles.

For Data Integration, Attunity CloudBeam for S3, EMR and other Hadoop distributions simplifies, automates, and accelerates the loading and replication of data from a variety of structured and unstructured sources to create a data lake for Hadoop consumption on Amazon S3, including replication across Amazon Regions.

FIGURE 24: ATTUNITY CLOUDBEAM ACCELERATES THE LOADING & REPLICATION OF DATA FROM A VARIETY OF SOURCES

Attunity CloudBeam simplifies and streamlines ingesting enterprise data for use in Big Data Analytics by EMR or other Hadoop distributions from Cloudera, Hortonworks or MapR as well as for pre-processing before moving data into Redshift, S3, or RDS. Attunity CloudBeam is designed to handle files of any size, transferring content over any given network connection, thereby achieving best-in-class acceleration and guaranteed delivery.

FIGURE 25: ATTUNITY CLOUDBEAM STREAMLINES INGESTING ENTERPRISE DATA FOR USE IN BIG DATA ANALYTICS BY AMAZON EMR

Attunity Cloudbeam’s automation provides intuitive administration, scheduling, replication of deltas only, security and monitoring.

FIGURE 26: ATTUNITY CLOUDBEAM’S AUTOMATED SCHEDULING

FIGURE 27: ATTUNITY CLOUDBEAM’S AUTOMATED REPLICATION

For Advanced Analytics, Infosys Information Platform (IIP) leverages the power of open source to address big data adoption challenges such as inadequate accessibility of easy-to-use development tools, a fragmented approach to building data pipelines, and the lack of an enterprise-ready open source big data analytics platform that can support all forms of data: structured, semi-structured, and unstructured.

FIGURE 28: INFOSYS INFORMATION PLATFORM SUPPORTS ALL FORMS OF DATA

It’s a one-stop solution covering Data Engineering through Data Science requirements, from ingestion to visualization, with one-click launch, high performance, scalability, and enterprise-grade security.

FIGURE 29: INFOSYS INFORMATION PLATFORM SOLUTION FROM DATA ENGINEERING TO DATA SCIENCE

Actionable insights in real-time.

FIGURE 30: INFOSYS INFORMATION PLATFORM GIVES ACTIONABLE INSIGHTS IN REAL- TIME

For Data Analysis and Visualization, TIBCO Jaspersoft Reporting and Analytics for AWS is a commercial open source reporting and analytics server built for AWS that can run standalone or be embedded in your application. It is priced very aggressively with a low hourly rate that has no data or user limits and no additional fees. A multi-tenant version is available as a separate Marketplace listing. Free Online Support is available for registration upon launching the instance. Professional Support is available separately from TIBCO sales.

FIGURE 31: TIBCO JASPERSOFT SUPPORT

Jaspersoft’s business intelligence suite allows you to easily create beautiful, interactive reports, dashboards and data visualizations. Designed to quickly connect to your Amazon RDS, Redshift and EMR data sources, you can be analyzing your data and building reports in under 10 minutes.

FIGURE 32: TIBCO JASPERSOFT ANALYZING YOUR DATA & BUILDING REPORTS

TIBCO Jaspersoft’s software empowers millions of people every day to make better decisions faster by bringing them timely, actionable data inside their apps and business processes. Thanks to a community hundreds of thousands strong, TIBCO Jaspersoft’s software has been downloaded millions of times and is used to create the intelligence inside hundreds of thousands of apps and business processes. Full BI server for cents per hour: no user or data limits and no additional fees. The suite includes ad hoc query and reporting, dashboards, data analysis, data visualization and data virtualization. 10 minutes to your AWS data: purpose-built for AWS, the reporting and analytics server allows you to quickly and easily connect to Amazon RDS, Redshift and EMR; in under 10 minutes you can be reporting on and analyzing your data. BI for your business or app: built to modern web standards with an HTML5 UI and JavaScript and REST APIs, the flexible BI suite can be used to analyze your business or deliver stunning interactive reports and dashboards inside your app.

FIGURE 33: TIBCO JASPERSOFT INTERACTIVE ANALYTICS DASHBOARD

Amazon Elasticsearch Service: Real-time Data Analysis and Visualization

TABLE 13: AMAZON ELASTICSEARCH SERVICE – REAL-TIME DATA ANALYSIS & VISUALIZATION

Amazon Elasticsearch Service (ES) Overview

Organizations are collecting an ever-increasing amount of data from numerous sources such as log systems, click streams, and connected devices. Launched in 2009, Elasticsearch, an open-source analytics and search engine, has emerged as a popular tool for real-time analytics and visualization of data. Some of the most common use cases include risk assessment, error detection, and sentiment analysis. However, as data volumes and applications grow, managing open-source Elasticsearch clusters can consume significant IT resources while adding little or no differentiated value to the organization. Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Amazon ES offers the benefits of a managed service, including cluster provisioning, easy configuration, replication for high availability, scaling options, data durability, security, and node monitoring.

Amazon ES is tightly integrated with Logstash, and a Kibana instance is automatically configured for you. When you deploy Amazon ES, you effectively deploy an “ELK stack”.

Amazon ES Service Features Enabling Large-Scale Analytics

Logstash is an open source data pipeline that helps process logs and other event data and has built-in support for Kibana. Kibana is an open source analytics and visualization platform that helps you get a better understanding of your data in Amazon ES Service. You can set up your Amazon ES Service domain as the backend store for all logs coming through your Logstash implementation to easily ingest structured and unstructured data from a variety of sources. It allows you to explore your data at a speed and at a scale never before possible. It’s used for full text-search, structured search, analytics, and all three in combination.

Amazon ES Service gives you direct access to the open-source Elasticsearch APIs to load, query and analyze data and manage indices. There is integration for streaming data from Amazon S3, Amazon Kinesis Streams, and DynamoDB Streams. The integrations use a Lambda function as an event handler in the cloud that responds to new data by processing it and streaming the data to your Amazon ES Service domain.
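A minimal boto3 sketch of creating a small Amazon ES domain; the domain name, Elasticsearch version, and sizing are placeholder assumptions you’d tune to your workload:

```python
import boto3

es = boto3.client("es", region_name="us-east-1")

es.create_elasticsearch_domain(
    DomainName="log-analytics",
    ElasticsearchVersion="2.3",
    ElasticsearchClusterConfig={
        "InstanceType": "m3.medium.elasticsearch",
        "InstanceCount": 2,
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 20},
)

# The domain endpoint (used by Logstash and Kibana) appears once provisioning
# finishes; poll describe_elasticsearch_domain until it is available.
status = es.describe_elasticsearch_domain(DomainName="log-analytics")
print(status["DomainStatus"].get("Endpoint", "endpoint not ready yet"))
```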

Click here to read how to get started with Amazon ES Service and Kibana on Amazon EMR.

Amazon ES Service Examples

  • Real-time application monitoring
  • Analyze activity logs
  • Analyze Amazon CloudWatch logs
  • Analyze product usage data coming from various services and systems
  • Analyze social media sentiments and CRM data, and find trends for brands and products
  • Analyze data stream updates from other AWS services, such as Amazon Kinesis Streams and DynamoDB
  • Monitor usage for mobile applications
  • e-Commerce filtering and navigation
  • Streaming data analytics
  • Social media sentiment analysis
  • Text search
  • Risk assessment
  • Error detection

AWS Case Study: MLB Advanced Media Using Amazon ES Service as Part of a New Data Collection and Analysis Tool

MLB Advanced Media (MLBAM) wanted a new way to capture and analyze every play using data-collection and analysis tools. It needed a platform that could quickly ingest data from ballparks across North America, provide enough compute power for real-time analytics, produce results in seconds, and then be shut down during the off season.  It turned to AWS to power its revolutionary Player Tracking System, which is transforming the sport by revealing new, richly detailed information about the nuances and athleticism of the game—information that’s generating new levels of excitement among fans, broadcasters, and teams.

FIGURE 34: THE RELEASE OF AMAZON ES WAS ANNOUNCED AT AWS RE:INVENT 2015. THIS IS THE SLIDE SHOWN IN REGARD TO THE AWS MAJOR LEAGUE BASEBALL USE CASE (view YouTube video here)

You can read the story here.

AWS Marketplace Solutions for Amazon ES, Logstash, and Kibana (ELK)

There are quite a few software solutions available from popular software vendors on AWS Marketplace that help implement Amazon Elasticsearch Service. You can browse them here. Some of them provide the entire ELK Stack.

ELK Stack (PV) built by Stratalux: the ELK stack is the leading open-source centralized log management solution for companies that want the benefits of a centralized logging solution without the enterprise software price. The ELK stack provides a centralized and searchable repository for all your infrastructure logs, providing unique and holistic insight into your infrastructure. The ELK stack built by Stratalux AMI has been configured with all the basic components that together make a complete working solution. Included in this AMI are the Logstash server, Kibana web interface, Elasticsearch storage and Redis data structure server. Simply install it, point your Logstash agents to this AMI, and begin searching through your logs and creating custom dashboards. With over five years of experience, Stratalux is the leading cloud-based managed services company for the ELK stack on AWS. A sandbox environment is included to try out different functions. An image of this product is shown below:

FIGURE 35: STRATALUX ELK STACK (PV) OFFERING IN AWS MARKETPLACE

Amazon Machine Learning: Highly Scalable Predictive Analytics

TABLE 14: AMAZON MACHINE LEARNING – CREATE ML MODELS WITHOUT LEARNING COMPLEX ML ALGORITHMS

Amazon Machine Learning (ML) Overview
You see machine learning in action every day. Websites make suggestions on products you’re likely to buy based on past purchases, you get an alert from your bank if they suspect a fraudulent transaction and you get emails from stores when items you typically buy are on sale.

With Amazon Machine Learning, anyone can create ML models via Amazon ML’s learning and visualization tools and wizards without having to learn complex machine learning algorithms and technology. Amazon ML can create models based on data stored in Amazon S3, Amazon Redshift, or Amazon RDS. There is no set-up cost and you pay as you go, so you can start small and scale as your application grows, and you don’t have to manage any infrastructure required for the large amount of data used in machine learning.

Amazon ML Features Enabling Large-Scale Analytics
Built-in wizards guide you through the steps of interactively exploring your data, training the ML model by finding patterns in existing data, and using those patterns to make predictions from new data as it becomes available. You’re guided through the process of measuring the quality of your models, evaluating the accuracy of predictions, and fine-tuning the predictions to align with business goals. You don’t have to implement custom prediction generation code.

FIGURE 36: FINE-TUNING AMAZON MACHINE LEARNING INTERPRETATIONS

Amazon ML can generate billions of predictions daily, and serve those predictions in low-latency batches or in real-time at high throughput.
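As a rough sketch, once a model has been trained and a real-time endpoint created, a single prediction request via boto3 might look like the following (the model ID, endpoint URL, and record fields are hypothetical):

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Placeholder model ID and real-time endpoint for a model you already trained.
response = ml.predict(
    MLModelId="ml-EXAMPLEMODELID",
    Record={"plan": "premium", "monthly_minutes": "42", "support_calls": "3"},
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)

prediction = response["Prediction"]
print("Predicted churn label:", prediction.get("predictedLabel"))
print("Scores:", prediction.get("predictedScores"))
```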

List of Amazon ML Analytics Examples
Amazon ML can perform document classification to help you process unstructured text and take actions based on content from forms, emails and product reviews, for example. You can process free-form feedback from your customers, including email messages, comments or phone conversation transcripts, and recommend actions based on their concerns. One example would be using Amazon ML to analyze social media traffic to discover customers who have a product support issue, and connect them with the right customer care specialists.

Other examples you can perform with Amazon ML include the following:
• Predict customer churn
• Fraud detection
• Content personalization
• Propensity Modeling for marketing campaigns
• Readmission Prediction through patient risk stratification
• Predict if a website comment is spam
• Forecast product demand
• Predict user activity

AWS Marketplace Solutions for Amazon ML
There are many solutions from leading software vendors available on AWS Marketplace, some of which are highlighted below.

BigML PredictServer is a dedicated machine image that you can deploy in your own AWS account to create blazingly fast predictions from your BigML models and ensembles.
PredictServer is ideal for real-time scoring and/or very large batch predictions (millions and upwards). The dedicated in-memory prediction server guarantees fast and consistent prediction rates, and the built-in dashboard makes it easy to track performance. Models and ensembles are cached directly from bigml.io, and predictions can be created with API calls similar to those of the BigML.io API and/or through BigML’s command line tool, bigmler. You can deploy BigML PredictServer in a region closer to your application servers to reduce latency, or even in a VPC.

FIGURE 37: YOU HAVE AN APPLICATION USING BIGML TO MAKE PREDICTIONS, SENDING WHAT MODELS & ENSEMBLES TO DOWNLOAD TO YOUR BIGML PREDICT SERVER

FIGURE 38: THEN, CHANGE YOUR CONNECTION TO USE PREDICT SERVER FOR PREDICTIONS

BigML also supports text analytics.

FIGURE 39: BIGML TEXT ANALYTICS

Zementis ADAPA Decision Engine is a predictive analytics decision engine based on the PMML (Predictive Model Markup Language) standard.

FIGURE 40: ZEMENTIS ADAPA DECISION ENGINE IS BASED ON THE INDUSTRY STANDARD PMML MARKUP LANGUAGE

With ADAPA, deploy one or many predictive models from data mining tools like R, Python, KNIME, SAS, SPSS, SAP, FICO and many others. Score your data in real-time using web services, or use ADAPA in batch mode for big data scoring directly from your local file system or an Amazon S3 bucket. As a central solution for today’s data-rich environments, ADAPA delivers precise insights into customer behavior and sensor information. Predictive analytics using vendor-neutral standards: ADAPA uses the Predictive Model Markup Language (PMML) industry standard to import and deploy predictive algorithms and machine learning models. ADAPA can understand any version of PMML and is compatible with most data mining tools, open source and commercial. Model deployment made easy: ADAPA allows one or many predictive models to be deployed at the same time. It executes many algorithms, from simple regression models to the most complex machine learning ensembles, e.g. Random Forest and boosted models. Scoring at the speed of business: ADAPA is able to instantly transform your scores into business decisions. The use of PMML-based rules allows different score ranges to be paired with specific business decisions. Applications range from fraud detection and risk scoring to marketing campaign optimization and sensor data processing in the Internet of Things (IoT).

FIGURE 41: ZEMENTIS ADAPA DECISION ENGINE SUPPORTS MANY TYPES OF ANALYSES

AWS Storage and Database Options for Use in Big Data Analytics with Use Cases

TABLE 15: AWS STORAGE OPTIONS FOR BIG DATA ANALYTICS

AWS Big Data Analytics Storage and Database Options Overview
AWS has a broad set of engines for storing data throughout a big data analytics lifecycle. Each has a unique combination of performance, durability, availability, scalability, elasticity and interfaces.

Most big data analytics infrastructures and application architectures employ multiple storage technologies in concert, each of which has been selected to satisfy the needs of a particular subclass of data storage, or for the storage of data at a particular point in its lifecycle. These combinations form a hierarchy of data storage tiers. The image below gives one example of using a combination of data storage and database usage during a big data analytics workflow:

FIGURE 42: WITH AWS YOU CAN BUILD AN ENTIRE ANALYTICS APPLICATION TO POWER YOUR BUSINESS

Amazon S3 for Storage for Big Data Advanced Analytics
Please refer to the section above on the benefits of Amazon S3 for storage in big data analytics here.

Amazon DynamoDB Database for Big Data Advanced Analytics
Amazon DynamoDB is a fast, flexible and fully managed NoSQL database service for all applications – mobile, web, gaming, ad tech, IoT and more – that need a consistent, single-digit millisecond latency at any scale. It supports both document and key-value store models.

DynamoDB supports storing, querying, and updating documents. It supports three data types – number, string, and binary – in both scalar and multi-valued sets. Using the AWS SDK you can write applications that store JSON documents directly in Amazon DynamoDB tables. This capability reduces the amount of code needed to insert, update and retrieve JSON documents and to perform powerful database operations, like nested JSON queries, using only a few lines of code.
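A minimal boto3 sketch of storing and reading back a nested JSON document; the table name and key schema are hypothetical:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# "GameSessions" is a hypothetical table with a single-attribute hash key "session_id".
table = dynamodb.Table("GameSessions")

# Nested JSON maps and lists are stored natively; no flattening code is required.
table.put_item(Item={
    "session_id": "sess-0001",
    "player": {"name": "ada", "level": 12},
    "events": [
        {"type": "login", "ts": "2017-01-15T10:00:00Z"},
        {"type": "purchase", "item": "shield", "ts": "2017-01-15T10:05:00Z"},
    ],
})

item = table.get_item(Key={"session_id": "sess-0001"})["Item"]
print(item["player"]["name"], "has", len(item["events"]), "events")
```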

FIGURE 43: AMAZON DYNAMODB STORING JSON DOCUMENTS

Amazon DynamoDB also supports storing XML and HTML documents. Tables don’t have to have a fixed schema, so each data item can have a different number of attributes. The primary key can either be a single-attribute hash key or a composite hash-and-range key.

In addition to querying the primary key, you can query non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes. DynamoDB provides eventually-consistent reads by default, optional strongly-consistent reads, and implicit item-level transactions for item put, update, delete, conditional operations, and increment/decrement.

There is no limit to the amount of data you can store in an Amazon DynamoDB table; the service automatically allocates more storage as you store more data. DynamoDB Streams captures all data activity that happens on your table and makes it possible to set up cross-region replication from one geographic region to another for higher availability.

DynamoDB integrates with AWS Lambda to provide triggers that alert you when things change in your DynamoDB tables. It also integrates with Amazon Elasticsearch Service, using the Amazon DynamoDB Logstash plugin, to search Amazon DynamoDB content for things like messages, locations, tags and keywords. It integrates with Amazon EMR, so Amazon EMR can analyze data sets stored in DynamoDB while keeping the original data set intact. It integrates with Amazon Redshift to perform complex data analysis queries, including joins with other tables in the Amazon Redshift cluster. DynamoDB integrates with AWS Data Pipeline to automate data movement and transformation into and out of Amazon DynamoDB. It also integrates with Amazon S3 for analytics, AWS Import/Export, backup and archive.

Common use cases for DynamoDB include:
• Gaming
• Mobile applications
• Digital Ad serving
• Live voting
• Sensor networks
• IoT
• Log ingestion
• Access control for e-Commerce shopping carts or other web-based content
• Web session management

Amazon DynamoDB can also be the storage backend for Titan, enabling you to store Titan graphs of any size in fully managed DynamoDB tables and to store and traverse both small and large graphs, up to hundreds of billions of vertices and edges, distributed across a multi-machine cluster.

Amazon Redshift Data Warehouse for Storage in Big Data Advanced Analytics
Please refer to the section above on the benefits of Amazon Redshift for a data warehousing solution in big data analytics here.

Amazon Aurora Database for Big Data Advanced Analytics
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora delivers five times the throughput of standard MySQL open source databases. This performance is achieved by tightly integrating the database engine with an SSD-backed virtualized storage layer that’s fault-tolerant and self-healing. Disk failures are repaired in the background without loss of database availability. Aurora automates most administrative tasks and enables point-in-time recovery of your instances. Amazon Aurora can help cut your database costs by 90% or more while improving reliability and providing high availability. It tolerates failures and fixes them automatically. Backups to Amazon S3 are continuous, automatic, and durable, and Aurora keeps six copies of your data replicated across three Availability Zones.

Amazon Aurora is compatible with MySQL 5.6 so that any existing MySQL applications and tools can run on Aurora without modification. It’s managed by Amazon RDS, which takes care of complicated administration tasks like provisioning, patching, and monitoring.

Historical data analysis is the most common type of big data analytics implemented on Aurora. The main benefit of Aurora over other relational databases is its scalability: it can process terabytes of real-time data daily and scale to millions of transactions per minute. If you need more read throughput, you can add replicas, up to 15 of them, and storage automatically scales up as needed to 64 TB.

FIGURE 44: ARCHITECTURAL DIAGRAM OF THE RELATIONSHIP BETWEEN AMAZON AURORA CLUSTER VOLUMES & THE PRIMARY & REPLICAS IN AN AURORA DB CLUSTER
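
Because Aurora is MySQL-compatible, any standard MySQL driver can connect to it. Below is a minimal sketch using the third-party PyMySQL driver; the cluster endpoint, credentials, and schema are hypothetical:

import pymysql

conn = pymysql.connect(
    host="my-cluster.cluster-abc123xyz.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
    user="analytics",
    password="********",
    db="sales",
)
try:
    with conn.cursor() as cur:
        # Read-heavy analytical queries can be pointed at Aurora read replicas
        # (via the reader endpoint) to spread the load.
        cur.execute(
            "SELECT order_date, SUM(total) FROM orders "
            "WHERE order_date >= %s GROUP BY order_date",
            ("2017-01-01",),
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()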

Common use cases for Amazon Aurora include:
• Data warehouse analytics
• Website responsiveness
• Content Management
• IoT data analysis
• Transaction processing
• Great for any enterprise application that uses a relational database
• SaaS applications that need flexibility in instance and storage scaling
• Web and mobile applications

Amazon EC2 Instance Store Volumes for Big Data Advanced Analytics
Amazon EC2 provides flexible, cost-effective, and easy-to-use data storage for your instances. Each option has a unique combination of performance and durability. These options can be used independently or in combination to suit your requirements. The storage option that best fits running advanced analytics is called “Amazon EC2 Instance Store Volumes”.

Amazon EC2 Instance Store Volumes (also called ephemeral drives) provide temporary block-level storage for many EC2 instance types. The storage-optimized Amazon EC2 instance family provides special-purpose instance storage targeted to specific use cases. HI1 instances provide very fast solid-state drive (SSD)-backed instance storage capable of supporting over 120,000 random read IOPS, and are optimized for very high random I/O performance and low cost per IOPS.

Example applications well suited to HI1 storage-optimized EC2 Instance Store Volumes include data warehouses, Hadoop storage nodes, seismic analysis, cluster file systems, etc. Note, however, that the data on instance store volumes is lost if the Amazon EC2 instance is stopped or terminated, or if the underlying host fails.

AWS Marketplace Solutions to Augment AWS Storage and Database Services for Big Data Advanced Analytics

TABLE 16: AWS MARKETPLACE SOLUTIONS TO AUGMENT AWS STORAGE & DATABASE SOLUTIONS FOR BIG DATA ADVANCED ANALYTICS

Overview of AWS Marketplace Solutions to Augment AWS Big Data Analytics Storage and Database Services
There’s an abundance of AWS Marketplace software solutions from top vendors that augment these AWS built-in solutions for storage and databases used in big data analytics.

AWS Marketplace Solutions to Augment Amazon S3
There are many AWS Marketplace solutions from top software vendors that augment the functionality of, or interact with, Amazon S3 to form complete end-to-end, out-of-the-box solutions. I’ll mention a couple of them below, but you can browse for yourself here.

Attunity CloudBeam for Amazon S3, EMR, and Hadoop was described in an earlier section. Click here to review that section again.

Matillion ETL for Redshift was also described in an earlier section. Matillion first loads data into Amazon S3 prior to Redshift ingestion. Click here to return to review that section again.

Informatica Cloud for Amazon S3 provides native, high-volume connectivity to S3. It is designed and optimized for data integration between cloud and on-premises data sources and S3 as an object store. It handles special characters within data sets, Unicode characters, escape characters, and multiple formats of delimited files. It also supports multi-part upload and download to/from S3. It allows you to develop and run data integration tasks (mappings) and task flows, with unlimited scheduling, restricted to use only with S3. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of S3 storage and includes:
• Cloud Designer: a cloud-based service that enables visual design, development and deployment of data integration mappings (static data flows), plus a simple 6-step wizard (the Informatica Cloud Data Synchronization Service) to support the needs of citizen integrators
• Secure Agent: a lightweight binary, installed on your AMI, that runs in your AWS EC2 environment to access the Informatica Cloud services located in Informatica’s hosted environment
• One instance of the Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon S3 as target and one connector as source
• Runs on SUSE Linux Enterprise Server 11

AWS Marketplace Solutions to Augment Amazon DynamoDB
The AWS Marketplace has independent software vendors that augment Amazon DynamoDB, in addition to solutions that offer complete graphing solutions other than Titan graph. Below you’ll find some selected solutions:

Informatica Cloud for Amazon DynamoDB provides native, high-volume connectivity to DynamoDB. It is designed to catalog data from data sources such as SQL, NoSQL, and social into a single DynamoDB store and to take advantage of the high throughput and scale of DynamoDB. It saves cost by temporarily increasing write capacity or read capacity as needed. It allows you to develop and run data integration tasks (mappings) and task flows, with unlimited scheduling, restricted to use only with DynamoDB. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of DynamoDB storage and includes:
• Cloud Designer: a cloud-based service that enables visual design, development and deployment of data integration mappings (static data flows), plus a simple 6-step wizard (the Informatica Cloud Data Synchronization Service) to support the needs of citizen integrators
• Secure Agent: a lightweight binary, installed on your AMI, that runs in your AWS EC2 environment to access the Informatica Cloud services located in Informatica’s hosted environment
• One instance of the Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon DynamoDB as target and one connector as source
• Runs on SUSE Linux Enterprise Server 11

As mentioned earlier, Amazon DynamoDB integrates with Titan graphs. Below I’ll mention one of the solutions available from top software vendors that offers graphing and analytics beyond Titan, but you can browse them yourself here.

MicroStrategy Analytics Platform is a powerful Mobile and Business Intelligence solution that enables leading organizations to quickly analyze vast amounts of data and distribute actionable business insight throughout the enterprise. MicroStrategy enables users to conduct ad hoc analysis and share their insights anywhere, anytime with reports, documents, and dashboards delivered via Web or mobile devices. Anyone can create dashboards with stunning visualizations, explore dynamic reports to investigate performance, graph data instantly, drill into areas of concern, and export information into any format.

FIGURE 44: MICROSTRATEGY’S INTERACTIVE REPORTS WITH FILTERS

Users benefit from powerful, sophisticated statistical analysis that yields new critical business insights. Uniquely positioned at the nexus of analytics, security, and mobility, MicroStrategy delivers superior analytics and mobile applications secured with advanced authentication, enhanced user administration, and user authentication tracking. MicroStrategy’s software is built for AWS and certified with numerous AWS services such as Amazon Redshift and Amazon RDS, and MicroStrategy is an AWS Advanced Technology Partner.

FIGURE 45: MICROSTRATEGY’S DATA DISCOVERY TECHNOLOGY HELPS YOU COMBINE INFORMATION FROM DIFFERENT SYSTEMS WITHOUT COMPLICATED SCRIPTS, DATA MODELS, OR HELP FROM IT

AWS Marketplace Solutions to Augment Amazon Redshift
The AWS Marketplace has independent software vendors that augment or work in tandem with Amazon Redshift. Some will be highlighted below, but you can browse the many solutions available in the AWS Marketplace that work with Amazon Redshift here.

Matillion ETL for Redshift was also described in an earlier section. Click here to return to review that section again.

Attunity CloudBeam for Amazon Redshift enables organizations to simplify, automate, and accelerate data loading and near real-time incremental changes from on-premises sources (Oracle, Microsoft SQL Server, and MySQL) to Amazon Redshift.

FIGURE 46: ATTUNITY CLOUDBEAM FOR AMAZON REDSHIFT’S REPLICATION DIAGRAM

Attunity CloudBeam allows your team to avoid the heavy lifting of manually extracting data, transferring via API/script, chopping, staging, and importing. A click-to-load solution, Attunity CloudBeam is easy to set up and allows organizations to start validating or realizing the benefits of Amazon Redshift in just minutes. Its zero-footprint technology reduces the impact on IT operations with log-based capture and delivery of transaction data that does not require the Attunity software to be installed on each source and target database, and it provides accelerated, secured, and guaranteed delivery of data.

AWS Marketplace Solutions to Augment Amazon Aurora
The AWS Marketplace has services from top software vendors that augment or work in tandem with Amazon Aurora. You can view them all here, but below you’ll find two Informatica solutions.

Informatica Cloud for Amazon Aurora (Windows) provides native, high-volume connectivity to Aurora and is optimized for Oracle-to-Aurora migration. It allows you to develop and run data integration tasks (mappings), data synchronization and replication tasks, and task flows, with unlimited scheduling, restricted to use only with Aurora. It supports single inserts and batched statements, as well as more advanced capabilities such as creating tables on the fly, custom queries, look-ups, joiners, filters, expressions, and sorters. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of Aurora storage and includes:
• Cloud Designer: a cloud-based service that enables visual design, development and deployment of data integration mappings (static data flows), plus a simple 6-step wizard (the Informatica Cloud Data Synchronization Service) to support the needs of citizen integrators
• Secure Agent: a lightweight binary, installed on your AMI, that runs in your AWS EC2 environment to access the Informatica Cloud services located in Informatica’s hosted environment
• One instance of the Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon Aurora (MySQL) as target and one connector as source
• Runs on Windows Server 2012 R2

Informatica Cloud for Amazon Aurora (Linux) has the same features as the Windows version above except for the runtime environment: it includes one connector for Amazon Aurora (MySQL) as target and one connector as source, and it runs on SUSE Linux Enterprise Server 11.

AWS Marketplace Solutions to Augment Amazon EMR
The AWS Marketplace has independent software vendors to augment your big data analytics solutions with Amazon EMR.

Syncsort DMX-h, Amazon EMR Edition is designed for Hadoop and is now deployable on Amazon EMR. Syncsort DMX-h helps organizations propel their big data initiatives, getting productive and delivering results with Hadoop and Amazon EMR in almost no time.

FIGURE 47: SYNCSORT DMX-H, AMAZON EMR EDITION HELPS PROPEL BIG DATA INITIATIVES BY LESSENING THE LEARNING CURVE

Syncsort DMX-h is the only Hadoop ETL application available for EMR. Syncsort DMX-h delivers: 1) Blazingly fast, easy to use Hadoop ETL in the Cloud. 2) A graphical user interface for developing & maintaining MapReduce ETL jobs.

FIGURE 47: SYNCSORT’S GRAPHICAL USER INTERFACE

3) A library of Use Case Accelerators to fast-track development. 4) Unbounded scalability at a disruptively low price. With Syncsort Ironcluster (10 Nodes) you can test, pilot and perform Proof of Concepts for free on up to ten Hadoop EMR nodes. 30 days of free phone and email support are also available.

FIGURE 48: SYNCSORT IS AVAILABLE IN AWS MARKETPLACE

MapR Enterprise Edition Plus Spark includes 24/7 support for the MapR Enterprise Edition plus the Apache Spark stack. IMPORTANT: Use MapR Standard Cluster with VPC Support delivery method to launch your cluster. This edition provides a standards-based enterprise-class distributed file system, complete with high availability and disaster recovery features.

FIGURE 49: MAPR ENTERPRISE EDITION PLUS SPARK’S FEATURES OF HIGH AVAILABILITY, PERFORMANCE, & DISASTER RECOVERY

Also included is a broad range of technologies like data processing with Spark, machine learning with MLlib, SQL with Spark SQL, graph processing with GraphX, and YARN for resource management.

FIGURE 50: MAPR ENTERPRISE EDITION SUPPORTS A BROAD RANGE OF HADOOP FRAMEWORKS

With the browser-based management console, MapR Control System, you can monitor and manage your Hadoop cluster easily and efficiently.

Other Notable Marketplace Solutions to Augment AWS Built-In Storage and Databases for Big Data Analytics
The AWS Marketplace has services from top software vendors that augment or work in tandem with AWS Big Data Storage and Database Services. Some notable choices are listed below.

Looker Analytics Platform for AWS allows anyone in your business to quickly analyze and find insights in your Redshift and RDS datasets. By connecting directly to your AWS instance, Looker opens up access to high-resolution data for detailed exploration and collaborative discovery, building the foundation for a truly data-driven organization.

FIGURE 51: LOOKER – ANALYTICS EVOLVED…HADOOP & AMAZON REDSHIFT

To help you get started quickly, the Looker for AWS license includes implementation services from Looker’s team of expert analysts, and throughout your entire subscription you’ll receive unlimited support from a live analyst using the in-app chat functionality. Purpose-built to leverage the next generation of analytic databases, like Amazon Redshift, and to live in the cloud, Looker takes an entirely new approach to business intelligence. Unlike traditional BI tools, Looker doesn’t move and store your data; instead, it optimizes data discovery within the database itself. Using Looker’s modern data modeling language, called LookML, data analysts create rich experiences so that end users can self-serve their own data discovery. Key capabilities include:
• Reusability in LookML: measures and dimensions are created in only one place and then consistently (and automatically) reused in all relevant views of that same data concept, creating a single source of truth across your organization
• Powerful data discovery, including contextual filtering, pivoting, sequencing, and cohort tiering, so your entire organization can ask questions, share views, and collaborate, all from within the browser, on any device
• Live connection to the database using the LookML data modeling language and a browser-based agile development IDE, so data analysts can use any Redshift feature, such as sort keys, dist keys, and HyperLogLog, for advanced performance and insights
• A wide set of visualizations, including scatter, table, bar, and line charts, a streamlined approach to dashboarding, and the ability to embed visualizations in any web application

Mapping by MapLarge provides a 5-user license for the MapLarge Mapping Engine, a high-performance geospatial visualization platform that dynamically renders data for interactive analysis and collaboration. It scales to millions of records and beyond, offers an intuitive user interface, and has robust APIs that allow complete customization. For more information visit http://maplarge.com.

FIGURE 52: EXAMPLES OF SOME OF THE GEOSPATIAL VISUALIZATIONS THAT CAN BE PRODUCED BY MAPLARGE

The Teradata Database Developer (Single Node) with SSD local storage is the same full-featured data warehouse software that powers analytics at many of the world’s greatest companies. Teradata Database Developer includes Teradata Columnar and rights to use: Teradata Parallel Transporter (TPT), including the Load, Update, and Export operators; Teradata Studio; and Teradata Tools and Utilities (TTU). These tools are included with the Teradata Database AMI or available as a free download. In addition to the Teradata Database, your subscription includes rights to use the following products, which are listed in the AWS Marketplace: Teradata Data Stream Utility, Teradata REST Services, Teradata Server Management, and Teradata Viewpoint (Single Teradata System). With Teradata Database, customers get quick time to value and low total cost of ownership; applications are portable across cloud and on-premises platforms, and no retraining is required. Teradata 15.10 is the newest release of the Teradata Database, bringing industry-leading features for data fabric support and fast JSON performance. Query processing is also accelerated by an advanced hybrid row/column table storage capability, which enables high performance for selective queries on tables with many columns while also allowing pinpoint access to single rows by operational queries.

HPE Vertica Analytics Platform is enterprise-class analytics that fits your budget. Until now, enterprise-class big data analytics in the cloud was simply not available: current cloud analytics offerings lack critical enterprise features such as fine-tuning capabilities, integrated BI/reporting, and data ingestion. With HPE Vertica Analytics Platform for Amazon Web Services, you can tap into all of the core enterprise capabilities and more. It offers you the flexibility to start small and grow as your business grows, with analytics functionality that no other cloud analytics provider can offer. The HPE Vertica Analytics Platform also runs on-premises on industry-standard hardware as well as in the cloud, so you can get started immediately with your analytics initiative via the cloud or whichever deployment model makes sense for your business, without compromises or limits. It provides optimized data ingestion for high performance, fast query optimization for quick insight, comprehensive SQL and extensions for true openness, enhanced data storage for cost efficiency, and ease of administration for true reliability.

Zoomdata for AWS is the Fastest Visual Analytics for Big Data and includes smart connectors for Redshift, S3, Kinesis, Apache Spark, Cloudera, Hortonworks, MapR, Elastic, real time, SQL and NoSQL sources.

FIGURE 53: ZOOMDATA LEVERAGES SMART CONNECTORS TO CONNECT TO MANY TYPES OF DATA SOURCES, INCLUDING ON-PREMISES

Sign up for a free trial today, and you’ll be visualizing billions of rows of data in seconds! Free support is available for users who register at http://go.zoomdata.com/awstrial. Using patented Data Sharpening and micro-query technologies, Zoomdata visualizes Big Data in seconds, even across billions of rows of data. Zoomdata is designed for the business user — not just data analysts — via an intuitive user interface that can be used for interactive dashboards or embedded in a custom application.

FIGURE 54: AN EXAMPLE OF ZOOMDATA’S VISUALIZATION DASHBOARDS

Built for Big Data: By taking the query to the data, Zoomdata leverages the power of modern databases to visualize billions of data points in seconds. Includes Redshift, S3, Cloudera, Solr, and Hortonworks connectors.

FIGURE 55: ZOOMDATA HAS CONNECTORS TO MANY DIFFERENT DATA SOURCES

TIBCO Clarity is the data cleaning and standardization component of the TIBCO software system. It serves as a single solution for business users to handle massive, messy data across various applications and systems, such as TIBCO Jaspersoft, Spotfire, Marketo, and Salesforce. The quality of data impacts your decision-making, so data coming from external sources such as SaaS applications or partners needs to be validated before it is used in your systems. TIBCO Clarity makes it easy for business users to profile, standardize, and transform data so that trends can be identified and smart decisions can be made quickly.

FIGURE 56: TIBCO CLARITY PROVIDES THE UBER-IMPORTANT STEP OF DATA CLEANSING, AUGMENTING, & ENHANCEMENT

TIBCO Clarity provides an easy-to-use web environment, and since it’s a cloud-based subscription service, it only requires an investment relative to your usage of the service. Key capabilities include:
• De-duplication: TIBCO Clarity discovers duplicate records in a dataset by using configurable fuzzy-match algorithms
• Seamless integration: you can collect your raw data from disparate sources, both cloud and on-premises, in a variety of formats such as files, databases, and spreadsheets
• Data discovery and profiling: TIBCO Clarity detects data patterns and data types for automatic metadata generation, enabling profiling of row and column data for completeness, uniqueness, and variation

AWS Services for Data Collection

TABLE 14: AWS DATA COLLECTION OPTIONS

AWS Data Collection Overview
Before you can do any big data analytics using AWS services, you have to load the data into an AWS storage location. This is a crucial step, and it can block a company’s first move into a cloud-based environment: it can seem complex and time-consuming, and concerns about how to recode and convert your data to another format can seem daunting. However, AWS has many different services to help you move your data onto AWS, whether you are loading data from numerous external sources, integrating with your on-premises infrastructure, or migrating data from an existing data center.

AWS Database Migration Service (DMS)
With just a few clicks, an AWS Database Migration Service migration starts while your original database stays live; AWS DMS handles all the complexity. You even have the ability to replicate back to your original database, or to replicate to other databases in different regions or Availability Zones. Heterogeneous migration is taken care of by the AWS Schema Conversion Tool: migration assessment and code conversion are handled for you, the source database schema and code are converted into a format compatible with the target database, and any code that can’t be converted is marked for manual conversion. Costs start at $3.00 per TB.

AWS Import / Export Snowball
AWS Import/Export Snowball is an easy, secure, and affordable solution for even the biggest (petabyte-scale) data transfer jobs, via a highly secure hardware appliance. You don’t need to purchase any hardware: with just a few clicks in the AWS Management Console you create a job and a Snowball appliance will be shipped to you, or up to 50 of them if you need them. When it arrives, you attach the appliance to your network, download and run the Snowball client to establish a connection, then use the client to select the file directories you want to transfer. Snowball encrypts and transfers the files at extremely high speed. When the transfer is complete, you ship the appliance back using the free shipping label supplied. Snowball uses multiple layers of security to protect your data, including tamper-resistant enclosures, 256-bit encryption, and an industry-standard Trusted Platform Module (TPM) designed to ensure both security and full chain of custody of your data. Snowball unloads your data into Amazon S3, and from there you can access any AWS service that you need.

Amazon S3 Transfer Acceleration
Amazon S3 Transfer Acceleration can be used when your upload speeds to Amazon S3 are sub-optimal, which can occur for a few reasons. Amazon S3 Transfer Acceleration enables fast, easy and secure transfers of files over long distances between your client and an S3 bucket. It takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
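
A minimal boto3 sketch of turning on Transfer Acceleration and uploading through the accelerate endpoint might look like the following (the bucket and file names are hypothetical):

import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on a hypothetical bucket (a one-time setting).
s3.put_bucket_accelerate_configuration(
    Bucket="my-analytics-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Later transfers must explicitly target the accelerate endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("big_dataset.csv", "my-analytics-bucket", "raw/big_dataset.csv")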

AWS Direct Connect
AWS Direct Connect makes it easy to establish a dedicated, private network connection between AWS and your premises, data center, or co-location environment. This increases bandwidth and throughput, reduces network costs, and provides a more consistent network experience than Internet-based connections. The dedicated connection can be partitioned into multiple virtual interfaces, so you can use the same connection to access public resources and private resources while maintaining separation between the environments.

AWS Storage Gateway
AWS Storage Gateway is a service that connects an on-premises software appliance with cloud-based storage to provide seamless and secure integration between an organization’s on-premises IT environment and AWS’s storage infrastructure. The service allows you to securely store data in the AWS cloud for scalable and cost-effective backups without buying more storage or managing the infrastructure, and you pay only for what you use. It supports industry-standard storage protocols that work with your existing applications. It provides low-latency performance by maintaining frequently accessed data on premises while securely storing all your data, encrypted, in Amazon S3 or Amazon Glacier. The AWS Storage Gateway appliance is software that sits in your data center between your applications and your storage infrastructure.

FIGURE 57: AWS STORAGE GATEWAY SOFTWARE APPLIANCE DIAGRAM

Without making any changes to your applications, it backs up your data with SSL encryption, and you can pull your old data back when needed.

Amazon Kinesis Streams
Amazon Kinesis Streams captures large amounts of data (terabytes per hour) in real time from data producers and streams it into custom applications for data processing and analysis. Streaming data is replicated by Kinesis across three Availability Zones to ensure reliability.

Amazon Kinesis Streams can scale from a few megabytes up to terabytes per hour of streaming data, but in contrast to Firehose, you have to provision the capacity manually. Amazon provides a “shard calculator” (pictured below) when creating a Kinesis stream to help you provision the appropriate number of shards for the volume of data you’re going to process. Once the stream is created, you can scale the number of shards up or down to meet demand.

FIGURE 58: AWS KINESIS STREAMS SHARD CALCULATOR
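
As a rough sketch of the provisioning and ingestion steps described above, the following boto3 code creates a two-shard stream and writes one record to it; the stream name, shard count, and event payload are assumptions:

import json
import boto3

kinesis = boto3.client("kinesis")

# Two shards gives roughly 2 MB/s (or 2,000 records/s) of write capacity.
kinesis.create_stream(StreamName="clickstream", ShardCount=2)
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream")

event = {"user_id": "u-42", "page": "/pricing", "ts": "2017-05-23T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # records with the same key land on the same shard
)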

AWS Marketplace Solutions to Assist and Augment Data Collection
The AWS Marketplace has services from top software vendors that augment or work in tandem with AWS Services for data ingestion. Many have capabilities beyond data ingestion, including the ability to perform ETL/ELT, data cleansing and much more.

Matillion ETL for Redshift was also described in an earlier section. Click here to return to review that section again.

Attunity CloudBeam for Amazon S3, EMR, and Hadoop was also described in an earlier section. Click here to return to review that section again.

CloudBerry Backup Desktop Edition provides simple and fast backup to the Amazon S3 cloud. CloudBerry Backup is a secure online backup solution that helps organizations store backup copies of their data in online storage. It is a powerful backup-and-restore program designed to leverage Amazon S3 technology to make your disaster recovery plan simple, reliable, and affordable. Keep your backups in a remote location, access them anywhere you have an internet connection, and rely on strong data encryption to protect your data from unauthorized access.

CloudBasic RDS Deploy for DevOps DLM/Jenkins (SQL Server) enables you to move RDS databases between development and staging environments without access to the RDS file system. Integrate RDS Deploy into your DevOps tools, such as Jenkins and GO, to further automate database lifecycle management (DLM) and achieve true one-click deployments. A sample DevOps scenario involving Jenkins and RDS: the job is to deploy three SQL Server databases from RDS to a standard SQL Server, merge data from two of them into the third, and send the third database back to RDS as the new production system. The job also flushes any open sessions to the website, takes it offline, and puts it back online when everything is finished. This is all done in a PowerShell script executed from Jenkins, so that it can be performed by employees with minimal security access (access only to that job) and all history is recorded. No file-system access to the RDS instance is required; traditional tools require access to the SQL Server file system and cannot be used with AWS RDS. RDS Deploy integrates easily into DevOps tools such as Jenkins and GO, and a REST API allows RDS database deployments to be initiated from PowerShell and similar tools.

AWS Services for Data Orchestration and Analytic Workflows

TABLE 15: AWS DATA ORCHESTRATION & ANALYTIC WORKFLOW SERVICES

AWS Services for Data Orchestration and Analytic Workflow Overview
Big data advanced analytics solutions very often require automated arrangement, coordination, and management of data once it’s on the AWS cloud, as well as moving data from one AWS service to another for subsequent processing or storage, and vice versa. Automating workflows ensures that the necessary activities take place when required to drive the analytic processes.

Amazon Simple Workflow Service (SWF)
Amazon SWF allows you to build distributed applications in any programming language with components that are accessible from anywhere. It reduces infrastructure and administration overhead because you don’t need to run orchestration infrastructure. SWF provides durable, distributed-state management that enables resilient, truly distributed applications. Think of SWF as a fully-managed state tracker and task coordinator in the cloud.

Amazon SWF’s key concepts and capabilities include the following (a short API sketch follows the list):
o Workflows are collections of actions
o Domains are collections of related workflows
o Actions are tasks or workflow steps
o Activity workers implement actions
o Deciders implement a workflow’s execution logic
o Maintains distributed application state
o Tracks workflow executions and logs their progress
o Controls which tasks each of your application hosts will be assigned to execute
o Supports the execution of Lambda functions as “workers”
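
As a hedged sketch of how these concepts map to the API, the boto3 calls below register a hypothetical domain and workflow type and then start one execution; the activity workers and decider that would poll for tasks and drive the workflow are not shown:

import boto3

swf = boto3.client("swf")

# One-time setup: a domain and a workflow type (names, version, and timeouts assumed).
swf.register_domain(name="analytics", workflowExecutionRetentionPeriodInDays="7")
swf.register_workflow_type(
    domain="analytics",
    name="nightly-etl",
    version="1.0",
    defaultTaskList={"name": "etl-tasks"},
    defaultTaskStartToCloseTimeout="600",
    defaultExecutionStartToCloseTimeout="3600",
    defaultChildPolicy="TERMINATE",
)

# Start one workflow execution; deciders and activity workers polling SWF would
# then receive tasks and carry the workflow to completion.
swf.start_workflow_execution(
    domain="analytics",
    workflowId="nightly-etl-2017-05-23",
    workflowType={"name": "nightly-etl", "version": "1.0"},
    input='{"date": "2017-05-23"}',
)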

AWS Data Pipeline
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos.

Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.

AWS Data Pipeline handles:
o Your jobs’ scheduling, execution, and retry logic
o Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until all of its dependencies are met
o Sending any necessary failure notifications
o Creating and managing any temporary compute resources your jobs may require

FIGURE 58: AMAZON DATA PIPELINE HIGH-LEVEL DIAGRAM

Amazon Kinesis Firehose
Amazon Kinesis Firehose is AWS’s data-ingestion offering within the Kinesis family. It’s used to capture and load streaming data into other AWS services such as Amazon S3 or Amazon Redshift. From there, you can load the streams into data processing and analysis tools like Amazon Elastic MapReduce (EMR) or Amazon Elasticsearch Service. It’s also possible to load the same data into Amazon S3 and Amazon Redshift at the same time using Firehose.

Firehose can scale to gigabytes of streaming data per second and allows batching, encryption, and compression of data. It automatically scales to meet demand. Data can be loaded into Firehose in several ways, including HTTPS, the Kinesis Producer Library, the Kinesis Client Library, and the Kinesis Agent. Monitoring is available through Amazon CloudWatch.
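
For illustration, a producer might push records into an existing delivery stream like this (a minimal boto3 sketch; the delivery stream name and payload are hypothetical):

import json
import boto3

firehose = boto3.client("firehose")
reading = {"sensor": "temp-01", "value": 21.7, "ts": "2017-05-23T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="sensor-to-s3",  # hypothetical, pre-created delivery stream
    # Newline-delimiting records keeps the resulting S3 objects / Redshift COPY friendly.
    Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")},
)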

Amazon CloudFront
Amazon CloudFront is a global Content Delivery Network (CDN) service that gives you the ability to distribute your application globally in minutes. In Amazon CloudFront, your content is organized into distributions. A distribution specifies the location or locations of the original version of your files. You store the original versions of your files on one or more origin servers; an origin server is the location of the definitive version of an object. Origin servers can be other Amazon Web Services – an Amazon S3 bucket, an Amazon EC2 instance, or an Elastic Load Balancer – or your own origin server. You create a distribution to register your origin servers with Amazon CloudFront through a simple API call or the AWS Management Console. When configuring more than one origin server, you use URL pattern matches to specify which origin has which content, and you can assign one of the origins as the default. You then use your distribution’s domain name in your web pages, media player, or application. When end users request an object using this domain name, they are automatically routed to the nearest edge location for high-performance delivery of your content. An edge location is where end users access services located at AWS; edge locations are found in most of the major cities around the world and are used by CloudFront to distribute content to end users and reduce latency.

AWS Data Processing Types

TABLE 17: AWS DATA PROCESSING TYPES

AWS Data Processing Types Overview
Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Big data can be processed and analyzed in two different ways on AWS: batch processing or stream processing.

When deciding which type of processing you need, consider the latency of batch vs. stream processing: batch processing has latencies of minutes to hours, while stream processing has latencies on the order of seconds or milliseconds.

AWS Batch Processing
Batch processing on AWS is normally done using Amazon S3 for storage pre- and post-processing. Amazon EMR is then used to run managed analytic clusters on top of this data with Hadoop-ecosystem tools like Spark, Presto, Hive, and Pig. Batch processing is often used to normalize the data and then compute arbitrary queries over varying sets of data; it computes results derived from all the data it encompasses and enables deep analysis of large data sets. Once you have the results, you can shut down your Amazon EMR cluster or keep it running for further processing or querying.

You can find an architectural drawing of AWS Batch Processing here.
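
As a hedged sketch of this batch pattern, the boto3 call below launches a transient EMR cluster with Spark, runs a single step against data in S3, and terminates when the step completes; the release label, instance types, bucket paths, and job script are assumptions:

import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="nightly-batch-analytics",
    ReleaseLabel="emr-5.5.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "daily-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-analytics-bucket/jobs/aggregate.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-analytics-bucket/emr-logs/",
)
print("Cluster started:", response["JobFlowId"])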

AWS Stream Processing
Streaming data is generated continuously by thousands of data sources, typically sending records simultaneously in small sizes. This event data needs to be processed sequentially and incrementally on a record-by-record basis over sliding time windows, and it can be used for a wide variety of analytics. Information from such analysis incrementally updates metrics, reports, and summary statistics, which gives companies visibility into many aspects of business and consumer activity as it streams into AWS and allows businesses to respond promptly to emerging situations. Amazon Kinesis Streams or Amazon Kinesis Firehose is used to capture and load the data into a data store.

With AWS stream processing, analytic processing and decision-making on in-motion and transient data is done with minimal latency. In-motion data can also be filtered and diverted to a data warehouse such as Amazon Redshift, where existing business intelligence tools are used to analyze the data for deeper background analysis and/or data augmentation.
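
As a simple sketch of record-by-record consumption (not the recommended production pattern, which would normally use the Kinesis Client Library or AWS Lambda), the loop below reads from one shard of a hypothetical stream; the body is where filtering, metric updates, or diversion to Redshift/S3 would happen:

import json
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "clickstream"  # hypothetical stream

shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        event = json.loads(record["Data"].decode("utf-8"))
        # Update metrics, filter, or divert the record to Redshift/S3 here.
        print(event)
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits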

Producers of streaming data include machine data, sensor-based monitoring devices, messaging systems, IoT and financial market feeds.

You can find an architectural drawing of AWS Time Series Processing, which is a type of Stream Processing, here.

Conclusion
In this day and age of analytics, with more data, more questions to answer, and tougher competition, your success depends on making the right decisions with fast, secure, scalable, durable cloud data analytics, and AWS is the clear leader in this realm by leaps & bounds!

**Caveat: This document was created ~9 months ago (Today is 5/23/2017), so there might be more up-to-date information & there are certainly more AWS Data Analytics Services that should be included here. However, the information herein was complete & accurate when created. Thank you!

#gottaluvAWS! #gottaluvAWSMarketplace!


DISCOVER, MIGRATE & DEPLOY PRE-CONFIGURED BIG DATA BI & ADVANCED ANALYTIC SOLUTIONS IN MINUTES – AND PAY ONLY FOR WHAT YOU USE BY THE HOUR (Chapter 3.7 in “All AWS Data Analytics Services”)

3.7  DISCOVER, MIGRATE & DEPLOY PRE-CONFIGURED BIG DATA BI & ADVANCED ANALYTIC SOLUTIONS IN MINUTES – AND PAY ONLY FOR WHAT YOU USE BY THE HOUR

Talking about AWS Marketplace is a passion of mine. I view it as AWS’ gift to the business world. No other cloud provider has the authority and influence to attract over a thousand technology partners and independent software vendors (ISVs) – popular vendors that have licensed and packaged their software to run on AWS, integrated their software with AWS capabilities, or delivered add-on services – and to benefit their customers as greatly as AWS does through the AWS Marketplace. The AWS Marketplace is the largest “app store” in the world, despite being strictly a B2B app store!

This translates into the best and most popular software vendors going to great lengths to alter their software to integrate seamlessly with other AWS services and run smoothly on the AWS cloud. Only AWS has the prominence and the much-deserved reputation as a totally customer-centric company that’s necessary to attract such renowned ISVs, and for each one to take the time required to be offered in AWS Marketplace.

The AWS Marketplace facilitates the discovery, purchase, and deployment of BI and big data solutions (and many more categories) on AWS, letting you migrate to or acquire the business intelligence and data analytics solutions you want in minutes… and pay only for what you consume.
For those of you who haven’t heard about AWS Marketplace, or who dismiss it (for any number of preconceived ideas) with a “That’s how they get you!” reaction, let me explain the facts, the benefits, and how to navigate AWS Marketplace. Please read on: AWS Marketplace has more than 100,000 active customers who use 300M compute hours/month deployed on Amazon EC2, with more than 3,000 listings from over 1,000 popular software vendors (not including the new SaaS offerings launched in late November 2016).

Since AWS resources can be instantiated in seconds, you can treat them as “disposable” resources – not hardware or software you’ve spent months choosing and made a significant up-front expenditure on without knowing whether it will solve your problems. The “Services not Servers” mantra of AWS provides many ways to increase developer productivity and operational efficiency, and the ability to “try on” various solutions available on AWS Marketplace to find the perfect fit for your business needs without committing to long-term contracts.

1-Click, on-demand infrastructure through software solutions on AWS Marketplace allows iterative, experimental deployment and usage to take advantage of advanced analytics and emerging technologies within minutes, paying only for what you consume, by the hour or by the month.

The vast majority of big data use cases deployed in the cloud today run on AWS, with unique customer references for big data analytics, 67 of which are household names. AWS has over 50 services and hundreds of features to support virtually any big data application and workload. When you combine the managed AWS services with software solutions available on AWS Marketplace, you can get precisely the business intelligence and big data analytical solutions you want, augmenting and enhancing your project beyond what the services themselves provide. There are over 290 big data solutions in AWS Marketplace. You therefore get to data-driven results faster by decreasing the time it takes to plan, forecast, and make software provisioning decisions. This greatly improves the way you build business analytics solutions and run your business.

You can read the whitepaper on AWS Big Data Analytics Leveraging the AWS Marketplace, where I’m an author, by going to https://aws.amazon.com/mp/bi/ –> scroll to the bottom under “Additional Resources”, & click on “Download Solution Overview”. Below is a screenshot of the first page:

I’m a Contributor to this “Business Intelligence & Big Data on AWS, Leveraging ISV AWS Marketplace Solutions” Whitepaper

Below are just a fraction of example solutions you can achieve when using AWS Marketplace’s software solutions with AWS big data services:

You can:

  • Launch pre-configured and pre-tested experimentation platforms for big data analysis
  • Query your data where it sits (in-datasource analysis) without moving or storing your data on an intermediate server while directly accessing the most powerful functions of the underlying database
  • Perform “ELT” (extract, load, and transform) vs. “ETL” (extract, transform, and load) your data into Amazon’s Redshift data warehouse so the data is in its original form, giving you the ability to perform multiple data warehouse transforms on the same data
  • Have long-term connectivity among many different databases
  • Ensure your data is clean and complete prior to analysis
  • Visualize millions of data points on a map
  • Develop route planning and geographic customer targeting
  • Embed visualizations in applications or stand-alone applications
  • Visualize billions of rows in seconds
  • Graph data and drill into areas of concern
  • Have built-in data science
  • Export information into any format
  • Deploy machine-learning algorithms for data mining and predictive analytics
  • Meet the needs of specialized data connector requirements
  • Create real-time geospatial visualization and interactive analytics
  • Have both OLAP and OLTP analytical processing
  • Map disparate data sources (cloud, social, Google Analytics, mobile, on-prem, big data or relational data) using high-performance massively parallel processing (MPP) with easy-to-use wizards
  • Fine-tune the type of analytical result (location, prescriptive, statistical, text, predictive, behavior, machine learning models and so on)
  • Customize the visualizations in countless views with different levels of interactivity
  • Integrate with existing SAP products
  • Deploy a new data warehouse or extend your existing one

Amazon EC2 provides an ideal platform for operating your own self-managed big data analytics applications on AWS infrastructure. Almost any software you can install on Linux or Windows virtualized environments can be run on Amazon EC2 with a pay-as-you-go pricing model via a solution available on AWS Marketplace. Amazon EC2 distributes computing power across parallel servers so the software can execute its algorithms in the most efficient manner.

Some examples of self-managed big data analytics that run on Amazon EC2 include the following:

  • A Splunk Enterprise Platform, the leading software platform for real-time Operational Intelligence. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. A Splunk Analytics for Hadoop solution within AWS, called Hunk, is also available on AWS Marketplace; it enables interactive exploration, analysis, and data visualization for data stored in Amazon EMR and Amazon S3
  • A Tableau Server Data Visualization Instance, for users to interact with pre-built data visualizations created using Tableau Desktop. Tableau server allows for ad-hoc querying and data discovery, supports high-volume data visualization and historical analysis, and enables the creation of reports and dashboards
  • A SAP HANA One Instance, a single-tenant SAP HANA database instance that has SAP HANA’s in-memory platform, to do transactional processing, operational reporting, online analytical processing, predictive and text analysis
  • A Geospatial AMI such as MapLarge, that brings high-performance, real-time geospatial visualization and interactive analytics. MapLarge’s visualization results are useful for plotting addresses on a map to determine demographics, analyzing law enforcement and intelligence data, delivering insight to public health information, and visualizing distances such as roads and pipelines
  • An Advanced Analytics Zementis ADAPA Decision Engine Instance, which is a platform and scoring engine to produce Data Science predictive models that integrate with other predictive models like R, Python, KNIME, SAS, SPSS, SAP, FICO and more. Zementis ADAPA Decision Engine can score data in real-time using web services or in batch mode from local files or data in Amazon S3 buckets. It provides predictive analytics through many predictive algorithms, sensor data processing (IoT), behavior analysis, and machine learning models
  • A Matillion Data Integration Instance, an ELT service natively built for Amazon Redshift, that uses Amazon Redshift’s processing for data transformations to take advantage of its blazing speed and scalability. Matillion gives you the ability to orchestrate and/or transform data upon ingestion, or simply load the data so it can be transformed multiple times as your business requires

Below is an AWS Marketplace brochure explaining the benefits of using Marketplace solutions for big data analytics (it can also be found on the “BI & Big Data Landing Page” if you scroll to the bottom of the page and click on “Download PDF Poster”).

Poster on the Benefits of AWS Marketplace in Analytical Solutions

The Main Categories on AWS Marketplace

AWS Marketplace has solutions for big data analytics, but listed below are all of the main sections, with links to each topic’s respective landing page:

Breaking Down the Main AWS Marketplace Categories to Specific Functionalities:

Security Solutions:

Network Infrastructure Solutions:

Storage Solutions:

BI and Big Data Solutions:

Database Solutions:

Application Development Solutions:

Content Delivery Solutions:

Mobile Solutions:

Microsoft Solutions (note: the list of  “Third-Party Software Products” is a small fraction of the AWS Marketplace solutions that run on Microsoft Servers):

  • Microsoft Workloads:
    • Windows Server (many editions, type “Windows Server” into AWS Marketplace search bar)
    • Exchange Server
    • Microsoft Dynamics (many editions, type “Microsoft Dynamics” into AWS Marketplace search bar)
    • Microsoft SQL Server  (many editions, type “SQL Server” into AWS Marketplace search bar)
    • SharePoint (many editions, type “SharePoint” into AWS Marketplace search bar)
  • Third-Party Software Products:

Migration Solutions:

I hope you read through the entire post, and that you now realize how much time, frustration, configuration, and money you can save by using the preconfigured software solutions available at AWS Marketplace, only paying for what you use!

Why do it any other way?

Using Pre-Configured Software Solutions from AWS Marketplace with 1-Click Deployments & Paying by the Hour – Why Do It Any Other Way???

Read the previous post here.

#gottaluvAWS! #gottaluvAWSMarketplace!


TRADITIONAL RELATIONAL DATABASE MANAGEMENT SYSTEMS (Chapter 3.6 in “All AWS Data Analytics Services”)

A Traditional Relational Database Schema Showing Tables, Relations, & Keys

3.6  TRADITIONAL RELATIONAL DATABASE MANAGEMENT SYSTEMS

A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model.

In 1970, Edgar F. Codd, a British computer scientist with IBM, published “A Relational Model of Data for Large Shared Data Banks.” At the time, the now-renowned paper attracted little interest, and few understood how Codd’s groundbreaking work would define the basic rules for relational data storage for decades to come, which can be simplified as:

  1. Data must be stored and presented as relations, i.e., tables that have relationships with each other, e.g., primary/foreign keys.
  2. To manipulate the data stored in tables, a system should provide relational operators – code that enables the relationship to be tested between two entities. A good example is the WHERE clause of a SELECT statement, i.e., the SQL statement SELECT * FROM CUSTOMER_MASTER WHERE CUSTOMER_SURNAME = 'Smith' will query the CUSTOMER_MASTER table and return all customers with a surname of Smith (see the runnable sketch after this list).
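
To make rule 2 concrete, here is a small, self-contained illustration using Python’s built-in sqlite3 module rather than a production RDBMS; the table and data are hypothetical, but the WHERE clause plays exactly the role of the relational operator described above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER_MASTER (CUSTOMER_ID INTEGER, CUSTOMER_SURNAME TEXT)")
conn.executemany(
    "INSERT INTO CUSTOMER_MASTER VALUES (?, ?)",
    [(1, "Smith"), (2, "Jones"), (3, "Smith")],
)

# The WHERE clause is the relational operator: it restricts the CUSTOMER_MASTER
# relation to the rows whose surname is 'Smith'.
rows = conn.execute(
    "SELECT * FROM CUSTOMER_MASTER WHERE CUSTOMER_SURNAME = ?", ("Smith",)
).fetchall()
print(rows)  # [(1, 'Smith'), (3, 'Smith')]
conn.close()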

RDBMSs have been a common choice for the storage of information in databases used for financial records, manufacturing and logistical information, personnel data, and other applications using historical, transactional data since the 1980s.

However, relational databases were challenged, unsuccessfully, by object database management systems in the 1980s and 1990s (which were introduced to address the so-called object-relational impedance mismatch between relational databases and object-oriented application programs) and by XML database management systems in the 1990s.

Despite such attempts, RDBMSs keep most of the market share, but that share is declining because of their limited ability to scale, concurrency issues, and the high network bandwidth required for queries that have to traverse many tables architected to be highly normalized. Database normalization is the technique used to organize the data in an RDBMS; it’s a systematic approach of decomposing tables to eliminate data redundancy and improve data integrity.

Two examples of traditional relational databases are Microsoft SQL Server & Oracle Databases.

In Chapter 22, the second section will compare traditional relational databases with the Amazon RDS Aurora database, a new RDBMS built from the ground up for the cloud that recently surpassed Amazon Redshift as AWS’ fastest-growing service.

Read the previous post here.

Read the next post here.

#gottaluvAWS! #gottaluvAWSMarketplace!

 


TRADITIONAL DATA WAREHOUSES (Chapter 3.5 in “All AWS Data Analytics Services”)

Schematic of an OLAP Cube Used in Traditional Data Warehouses

3.5  TRADITIONAL DATA WAREHOUSES

A traditional data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.

Online analytical processing (OLAP) cubes are multi-dimensional data structures that traditional data warehouses use to hold the data you import. The cubes divide the data into subsets that are defined by dimensions.

In a dimensional approach, transaction data are partitioned into “facts”, which are generally numeric transaction data, and “dimensions”, which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the total price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
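
As a rough illustration of facts and dimensions, here is a minimal star-schema sketch, again using Python’s built-in sqlite3 module. The SALES_FACT, PRODUCT_DIM, and DATE_DIM tables are hypothetical examples, not any particular product’s schema:

    import sqlite3

    # Purely illustrative star schema: one fact table surrounded by dimension tables.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PRODUCT_DIM (PRODUCT_ID INTEGER PRIMARY KEY, PRODUCT_NAME TEXT);
        CREATE TABLE DATE_DIM    (DATE_ID    INTEGER PRIMARY KEY, ORDER_DATE   TEXT);
        CREATE TABLE SALES_FACT  (
            PRODUCT_ID    INTEGER REFERENCES PRODUCT_DIM(PRODUCT_ID),
            DATE_ID       INTEGER REFERENCES DATE_DIM(DATE_ID),
            UNITS_ORDERED INTEGER,   -- fact: numeric measure
            TOTAL_PRICE   REAL       -- fact: numeric measure
        );
        INSERT INTO PRODUCT_DIM VALUES (1, 'Widget'), (2, 'Gadget');
        INSERT INTO DATE_DIM    VALUES (10, '2017-01-01'), (11, '2017-01-02');
        INSERT INTO SALES_FACT  VALUES (1, 10, 5, 49.95), (1, 11, 2, 19.98), (2, 10, 1, 99.00);
    """)

    # A typical analytical question: total units and revenue per product per day.
    # Facts are aggregated; dimensions supply the context (product name, order date).
    query = """
        SELECT d.ORDER_DATE, p.PRODUCT_NAME, SUM(f.UNITS_ORDERED), SUM(f.TOTAL_PRICE)
        FROM SALES_FACT f
        JOIN PRODUCT_DIM p ON p.PRODUCT_ID = f.PRODUCT_ID
        JOIN DATE_DIM    d ON d.DATE_ID    = f.DATE_ID
        GROUP BY d.ORDER_DATE, p.PRODUCT_NAME"""
    for row in conn.execute(query):
        print(row)

An OLAP cube effectively pre-computes aggregations like this one along each combination of dimensions, so such answers come back very quickly.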

A key advantage of the dimensional approach is that the data warehouse is easier for users to understand and use, and retrieval of data from it tends to be very fast. Dimensional structures are easy for business users to grasp because they are divided into measurements/facts and context/dimensions: facts relate to the organization’s business processes and operational systems, while the dimensions surrounding them provide the context for each measurement. Another advantage is that a dimensional model does not have to be implemented in a relational database every time, which makes this modeling technique very useful for end-user queries in the data warehouse.

The main disadvantages of the dimensional approach are the following:

  1. Maintaining the integrity of facts and dimensions while loading the data warehouse with data from different operational systems is extremely complicated and time-consuming.
  2. Mapping the disparate data sources requires very complex, column-by-column mapping, done through the “T” portion of ETL (Extract, Transform & Load). Depending on the volume of data, the types of disparate sources, and the completeness and cleanliness of the data, this process often took many months of work by highly skilled, highly paid specialists (see the sketch after this list).
  3. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business; normally, a new cube has to be built to answer a different set of analytical questions.
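
As a rough idea of what a single column mapping in the “T” step looks like, here is a minimal, hypothetical sketch in Python. The source field names (cust_nm, ord_dt, amt) and the target column names are invented for illustration; real ETL work involves thousands of such mappings plus validation and cleansing:

    from datetime import datetime

    # Hypothetical raw record as it might arrive from one operational source system.
    source_row = {"cust_nm": "Smith, John", "ord_dt": "01/02/2017", "amt": "49.95"}

    def transform(row):
        """Map source columns onto the warehouse's fact/dimension-friendly schema."""
        last, first = [part.strip() for part in row["cust_nm"].split(",")]
        return {
            "CUSTOMER_SURNAME":  last,
            "CUSTOMER_FORENAME": first,
            "ORDER_DATE":  datetime.strptime(row["ord_dt"], "%m/%d/%Y").date().isoformat(),
            "TOTAL_PRICE": float(row["amt"]),
        }

    print(transform(source_row))
    # -> {'CUSTOMER_SURNAME': 'Smith', 'CUSTOMER_FORENAME': 'John',
    #     'ORDER_DATE': '2017-01-02', 'TOTAL_PRICE': 49.95}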

Chapter 9, “Amazon Redshift Data Warehouse”, compares Amazon Redshift’s modern approach to data warehousing with this traditional approach.

Read the previous post here.

Read the next post here.

#gottaluvAWS! #gottaluvAWSMarketplace!

Posted in Dimensions & Measures, OLAP Cubes, Traditional Data Warehousing | Leave a comment

CLOUD & DATA SECURITY (Chapter 3.4 of “All AWS Data Analytics Services”)

You Don’t Mess with the AWS Cloud – The Most Secure Cloud Platform – OR VITO & ROTTWEILERS WILL COME OUT OF NOWHERE!

3.4  AWS CLOUD & DATA SECURITY

AWS provides security capabilities across all of your locations, networks, software, and business processes, meets the strictest security requirements, and is continually audited against the broadest range of security certifications.

Some of AWS’ Strict Security Compliance and Privacy Certifications

Security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture built to meet the requirements of the most security-sensitive customers. Your data and applications are far more secure on AWS than in your own office.

Government, education, and nonprofit organizations face unique challenges in accomplishing complex missions with limited resources. Public sector leaders engaged in true cloud computing projects overwhelmingly turn to the power and speed of AWS when they want to serve citizens more effectively, achieve scientific breakthroughs, reach broader constituents, and put more of their time and resources into their core missions – while still meeting all mandatory regulatory, compliance, and security requirements.

The AWS cloud provides governance capabilities that enable continuous monitoring of configuration changes to your IT resources, and it lets you leverage multiple native AWS security and encryption features for a higher level of data protection and compliance – security at every level, up to the most stringent government requirements, no matter what your industry. AWS now serves more than 2,300 government, 7,000 education, and 22,000 nonprofit organizations worldwide, including the U.S. Government, the U.S. Intelligence Community, the U.S. Department of Defense, and NASA/JPL.

AWS provides several security capabilities and services to increase privacy and control network access, including network firewalls built into Amazon VPC, data encryption in transit and at rest in Amazon S3, and connectivity options that enable private or dedicated connections from your on-premises environment.
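
As one small, concrete example of customer-controlled encryption at rest, here is a minimal sketch using the boto3 SDK for Python. The bucket name and object key are hypothetical, and SSE-S3 (AES-256) is only one of several encryption options AWS offers:

    import boto3

    s3 = boto3.client("s3")  # credentials and region come from your environment

    # Hypothetical bucket and key - replace with your own.
    s3.put_object(
        Bucket="my-analytics-data-bucket",
        Key="raw/orders/2017-01-01.csv",
        Body=b"order_id,total_price\n100,49.95\n",
        ServerSideEncryption="AES256",  # encryption at rest, managed by S3
    )

    # Encryption in transit is handled for you: the SDK talks to S3 over HTTPS by default.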

AWS uses a “Shared Responsibility Model” when it comes to security, because not every customer wants everything locked down in the same manner. While AWS manages security of the cloud, security in the cloud is the responsibility of the customer. Customers retain control of what security they choose to implement to protect their own content, platform, applications, systems, and networks, no differently than they would for applications in an on-site datacenter.

AWS Shared Security Model Schematic (image courtesy of AWS properties)
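
As a small illustration of the customer’s side of that model – security in the cloud – here is a minimal boto3 sketch that restricts inbound network access on a security group. The group ID, port (5439 is Amazon Redshift’s default), and CIDR range are all hypothetical placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # Allow the analytics cluster to be reached only from the corporate network range.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # hypothetical security group
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,            # Amazon Redshift's default port
            "ToPort": 5439,
            "IpRanges": [{"CidrIp": "10.0.0.0/16",
                          "Description": "Corporate network only"}],
        }],
    )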

To read the AWS Security Best Practices whitepaper, click here.

AWS has a tiered competency-badged network of partners that provide application development expertise, managed services and professional services such as data migration. This ecosystem, along with AWS’s training and certification programs, makes it easy to adopt and operate AWS in a best-practice fashion.

Some of the Types of AWS Security Solutions Available in AWS Marketplace

Recommended AWS Marketplace security solutions are presented in overview form below. For more detail, visit this page.

Below, I’ll give an overview of some of the recommended ISVs for specific security solutions in AWS Marketplace:

You can read the AWS Marketplace “Security Solutions on AWS” whitepaper here.

Access comprehensive developer documentation on AWS Security Resources here.

Read the previous post here.

Read the next post here.

#gottaluvAWS! #gottaluvAWSMarketplace!

 

Posted in Amazon Web Services, AWS Cloud & Data Security, AWS Marketplace, AWS Marketplace Security Solutions, AWS Shared Responsibility Model, Cloud Computing | Leave a comment