AWS Data Analytics Services Leveraging AWS Marketplace in Detail

Unlock Hidden Insights within Massive Data Sources


Analyzing extensive data sets requires significant compute capacity that can fluctuate in size based on the data inputs and type of analytics. This characteristic of scaling workloads is perfectly suited to AWS and the AWS Marketplace’s pay-as-you-go cloud model, where applications can scale up and down based on demand. You can analyze data and derive valuable insights within minutes rather than months, and you only pay for what you use.


As ever more data is emitted from new and previously unforeseen sources, traditional in-house IT solutions are unable to keep pace. Investing heavily in data centers and servers by “best guess” wastes time and money, and the job never ends.

Traditional data warehouses required highly skilled employees and addressed only a fixed set of questions. Today’s need for speed and agility in analyzing data differently and efficiently requires complex architectures, which are available and ready for use with the click of a button on AWS. This eliminates the need to concern yourself with the underlying mechanisms and configurations that you’d have to manage on premises.

The AWS Marketplace streamlines the procurement of software solutions from popular software vendors by providing Amazon Machine Images (AMIs) that are pre-integrated with the AWS cloud, further expediting and assisting you with supporting big data analytical software services. The AWS Marketplace has over 290 big data solutions to date.

This eBook covers big data and big data analytics in depth: what big data is, where it comes from and how it is produced, and what kinds of information you can find when analyzing it. It then discusses why AWS and the solutions provided by top software vendors in AWS Marketplace provide the best big data analytics services and offerings. Next comes a walk-through of the AWS services that are used in big data analytics, along with augmenting solutions from AWS Marketplace. In conclusion, you will see how AWS is the unequivocal leader when implementing big data analytics solutions.

Conventions Used in this eBook

To provide cohesiveness to the longer sections of this eBook, tables are used. The header of the table lists the name of the topic, and each subtopic is listed below it. An example is shown below:


Big Data Analytics Challenges

Data is not only getting bigger (the “Volume”) and arriving in an ever-increasing number of formats (the “Variety”) at a faster pace (the “Velocity”); the need to derive “Value” through analytics that provide actionable insights is also a differentiating factor between successful businesses, which can mitigate risk and respond to customer actions in near real-time, and businesses that will fall behind in the age of data deluge. Using Amazon Web Services cloud architectures and software solutions available from popular software vendors on AWS Marketplace, big data analytics solutions change from extremely complicated to set up and manage to a couple of clicks to deployment.

In addition to the metaphorical “V’s” mentioned above to describe big data, there is one more: “Veracity”, which means being sure your data is clean before any analytics are performed. Garbage in, garbage out. There’s no time to waste making improper, misinformed decisions based on dirty data. This is paramount. Using solutions in the AWS Marketplace makes this crucial and difficult step easy.

Big data has also evolved. It used to be that batch processing for reports was sufficient (and the only solution available). To keep competitive today, you need to “think ahead” and answer questions in real-time to provide alerts that mitigate negative impacts on your business, and you need predictive analytics to forecast what’s going to happen before it does, so you are prepared at any given point in time.

Overview of AWS and AWS Marketplace Big Data Analytics Advantages


AWS Big Data Analytics Advantages Overview

Analyzing large data sets requires significant compute capacity that can vary in size based upon the amount of input data and the type of analysis. Thus it’s apparent that these big data workloads are ideally suited to a pay-as-you-go cloud environment.

Many companies that have successfully taken advantage of AWS big data analytics processing aren’t just enjoying incremental improvements. The benefits enabled by big data processing become the heart of the business: they enable new applications and business processes, using a variety of data sources and analytical solutions, and they give insight into data never dreamed of, along with a great competitive advantage.

Ongoing developments in AWS cloud computing are rapidly moving the promise of deriving business value from big data in real-time into a reality. With billions of devices globally already streaming data, forward-thinking companies have begun to leverage AWS to reap huge benefits from this data storm.

AWS has the broadest platform for big data in the market today, with deep and rapidly expanding functionality across big data stores, data warehousing, distributed analytics, real-time streaming, machine learning, and business intelligence. Gartner² confirms AWS has the most diverse customer base and the broadest range of use cases, including enterprise mission-critical applications. For the sixth consecutive year, Gartner² also confirms AWS is the overwhelming market share leader, with over 10 times more cloud compute capacity in use than the aggregate total of the other 14 providers in their Magic Quadrant!

2 Gartner

AWS has a tiered competency-badged network of partners that provide application development expertise, managed services and professional services such as data migration. This ecosystem, along with AWS’s training and certification programs, makes it easy to adopt and operate AWS in a best-practice fashion.

The AWS cloud provides governance capabilities enabling continuous monitoring of configuration changes to your IT resources as well as giving you the ability to leverage multiple native AWS security and encryption features for a higher level of data protection and compliance – security at every level up to the most stringent government compliance no matter what your industry.

Listed Below are Some of the Specific AWS Big Data Analytics Advantages:

  • The vast majority of big data use cases deployed in the cloud today run on AWS, with unique customer references for big data analytics, of which 67 are enterprise, household names
  • Over 50 AWS Services and hundreds of features to support virtually any big data application and workload
  • AWS releases new services and features weekly, enabling you to keep the technologies you use aligned with the most current, state-of-the-art big data analytics capabilities and functionalities
  • AWS delivers an extensive range of tools for fast and secure data movement to and from the AWS cloud
  • Computational power that’s second to none², with instance types each optimized with varying combinations of CPU, memory, storage and networking capacity to meet the needs of any big data use case
  • AWS makes fast, scalable, gigabyte-to-petabyte scale analytics affordable to anyone via its broad range of storage, compute and analytical options, guaranteed!
  • AWS provides capabilities across all of your locations, your networks, software and business processes meeting the strictest security requirements that are continually audited for the broadest range of security certifications
  • AWS removes limits to the types of database and storage technologies you can use by providing managed database services that offer enterprise performance at open source cost. This results in applications running on many different data technologies, using the right technology for each workload
  • Virtually unlimited capacity for massive datasets
  • AWS provides data encryption at rest and in transit for all services with the ability for you to directly analyze the encrypted data
  • AWS provides a scalable architecture that supports growth in users, traffic or data without a drop in performance, both vertically and horizontally, and allows for distributed processing
  • Faster time-to-market of products and services, enabling rapid and informed decision-making while shrinking product and service development time
  • Lower cost of ownership and reduced management overhead costs, freeing up your business for more strategic and business-focused tasks
  • In addition to the huge cost savings of simply moving from on-premises to the cloud, AWS provides suggestions on how to further decrease costs. Providing the most cost-efficient cloud solutions is a frugality rule at AWS
  • Numerous ways to achieve and optimize a globally-available, unlimited on-demand capacity of resources so you can grow as fast as you can
  • Fault tolerance across multiple servers in Availability Zones and across geographically distant Regions
  • An extremely agile application development environment: go from concept to full production deployment in 24 hours
  • Security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture built to meet the requirements of the most security-sensitive customers
  • AWS provides many suggestions on how to remove a single point of failure

AWS Marketplace Big Data Analytics Advantages Overview

AWS provides an extensive set of managed services that help you build, secure, and scale big data analytics applications quickly and easily. Whether your applications require real-time streaming, a data warehouse solution, or batch data processing, AWS provides the infrastructure and tools to perform virtually any type of big data project.

When you combine the managed AWS services with software solutions available from popular software vendors on AWS Marketplace, you can get the precise business intelligence and big data analytical solutions you want, which augment and enhance your project beyond what the services themselves provide. You get to data-driven results faster by decreasing the time it takes to plan, forecast, and make software provisioning decisions. This greatly improves the way you build business analytics solutions and run your business.

Gartner² confirms that because AWS has a multi-year competitive advantage over all its competitors, it’s been able to attract over a thousand technology partners and independent software vendors that have licensed and packaged their software to run on AWS, have integrated their software with AWS capabilities, or deliver add-on services, all through the AWS Marketplace. The AWS Marketplace is the largest “app store” in the world, even though it is strictly a B2B app store!

Since AWS resources can be instantiated in seconds, you can treat them as “disposable” resources, unlike hardware or software that you’ve spent months choosing and paid a significant up-front cost for without knowing whether it will solve your problems. The “Services not Servers” mantra of AWS provides many ways to increase developer productivity and operational efficiency, and gives you the ability to “try on” various solutions available on AWS Marketplace to find the perfect fit for your business needs without committing to long-term contracts.

Listed Below are Some of the Specific AWS Marketplace Big Data Analytics Advantages:

  • Get to data-driven results faster by decreasing the time it takes to plan, forecast, and make decisions. Perform big data analytics and visualizations on AWS data services and other third-party data sources via software solutions available from popular software vendors on AWS Marketplace, the largest ecosystem of popular software vendors and integrators of any provider², giving your organization the agility to experiment and innovate with the click of a button
  • The AWS Marketplace maintains the largest partner ecosystem of any provider. It has over 290 big data software solutions available from popular software vendors that are pre-integrated with the AWS cloud
  • Deploy business intelligence and advanced analytics pre-configured software solutions in minutes
  • On-demand infrastructure through software solutions on AWS Marketplace allows iterative, experimental deployment and usage to take advantage of advanced analytics and emerging technologies within minutes, paying only for what you consume, by the hour or by the month
  • Many AWS Marketplace solutions offer free trials, so you can “try on” multiple big data analytical solutions to solve the same business problem to see which is the best fit for your specific scenario

Example Solutions Achieved Through Augmenting AWS Services with Software Solutions Available on AWS Marketplace

Using software solutions available from popular software vendors on AWS Marketplace, you can customize and tailor your big data analytics project to precisely fit your business scenario. Below are just a few examples of the solutions you can achieve when using AWS Marketplace’s software solutions with the AWS big data services.

You can:

  • Launch pre-configured and pre-tested experimentation platforms for big data analysis
  • Query your data where it sits (in-datasource analysis) without moving or storing your data on an intermediate server while directly accessing the most powerful functions of the underlying database
  • Perform “ELT” (extract, load, and transform) rather than “ETL” (extract, transform, and load) when moving your data into the Amazon Redshift data warehouse, so the data is stored in its original form, giving you the ability to perform multiple data warehouse transforms on the same data
  • Have long-term connectivity among many different databases
  • Ensure your data is clean and complete prior to analysis
  • Visualize millions of data points on a map
  • Develop route planning and geographic customer targeting
  • Embed visualizations in applications or stand-alone applications
  • Visualize billions of rows in seconds
  • Graph data and drill into areas of concern
  • Have built-in data science
  • Export information into any format
  • Deploy machine-learning algorithms for data mining and predictive analytics
  • Meet the needs of specialized data connector requirements
  • Create real-time geospatial visualization and interactive analytics
  • Have both OLAP (analytical) and OLTP (transactional) processing
  • Map disparate data sources (cloud, social, Google Analytics, mobile, on-prem, big data or relational data) using high-performance massively parallel processing (MPP) with easy-to-use wizards
  • Fine-tune the type of analytical result (location, prescriptive, statistical, text, predictive, behavior, machine learning models and so on)
  • Customize the visualizations in countless views with different levels of interactivity
  • Integrate with existing SAP products
  • Deploy a new data warehouse or extend your existing one
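The ELT item in the list above can be illustrated with a minimal sketch. It assumes Python’s built-in sqlite3 module as a local stand-in for a data warehouse (it is not Amazon Redshift), and the table name, column names and sample rows are hypothetical. The point is only that the raw data is loaded unchanged and then transformed more than once inside the warehouse.

```python
import sqlite3

# sqlite3 is only a local stand-in for a data warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Extract + Load: store the raw records unchanged (hypothetical sample data).
raw_orders = [
    ("2016-01-05", "US", 120.0),
    ("2016-01-05", "DE", 80.0),
    ("2016-01-06", "US", 50.0),
]
conn.execute("CREATE TABLE raw_orders (order_date TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform 1, run inside the warehouse: revenue per country.
revenue_by_country = dict(conn.execute(
    "SELECT country, SUM(amount) FROM raw_orders GROUP BY country"
).fetchall())

# Transform 2, on the same untouched raw data: revenue per day.
revenue_by_day = dict(conn.execute(
    "SELECT order_date, SUM(amount) FROM raw_orders GROUP BY order_date"
).fetchall())

print(revenue_by_country)
print(revenue_by_day)
```

Because the raw table is never modified, further transforms can be added later without reloading the source data.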

AWS Marketplace-Specific Site for Data Analytics Solutions
There’s a plethora of options on AWS Marketplace for running big data analytics software solutions from popular vendors. These solutions come pre-configured on an Amazon Machine Image (AMI) and solve a variety of very specific needs, some of which were mentioned above.

You can visit the AWS Marketplace Big Data Analytics-specific site by clicking the bottom left icon on the AWS Marketplace site or by clicking here to view the premier AWS Marketplace solution providers for transforming and moving your data, processing and analyzing your data, and reporting and visualizing your data.

I’d like to point out that if you click the “Learn More” link at the bottom of each type of solution (for example, below I’m showing the section “Business Intelligence and Data Visualization”), you’re taken to an awesome section that works like a “Channel Guide” for webcasts, presented by software vendor representatives, that teach you how to work with some of the solutions!

The first screenshot below shows where to find the “Learn More” link, and the second screenshot below is of the “Channel Guide” for webcasts by representatives for some of AWS Marketplace software vendors:


Click the “Learn More” link highlighted above in the red rectangle, and for whichever type of Analytics solution you click on, you’re taken to the Webcast Channels:


Overview of AWS Cloud Architecture
The AWS cloud is based on the general design principles of the “Well-Architected Framework”, which increase the likelihood of business success. The framework is based on the following four pillars:

  1. Security: The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies
    • AWS’s built-in security features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here
  2. Reliability: The ability of a system to recover from infrastructure or service failures, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
    • AWS’s built-in fault tolerance and infrastructure disruption features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse solutions for fault tolerance, click here, and for infrastructure/network solutions click here
  3. Performance Efficiency: The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve
    • AWS’s built-in performance features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here
  4. Cost Optimization: The ability to avoid or eliminate unneeded cost or suboptimal resources
    • AWS’s built-in cost alerting features can be enhanced with software solutions available from popular software vendors on AWS Marketplace. To browse the various solutions, click here

If you’d like to know more about AWS’s “Well-Architected Framework”, from which the above is referenced, click here.

Some of the Types of Big Data Analytical Insights and Example Use Cases


Big data is such a buzzword that it’s prudent to ensure we wrap our heads around what it means.

Big data means a massive volume of both structured and unstructured data that’s so large it’s difficult to process using traditional database and software techniques.

In enterprise scenarios, the volume of data is just too big, or it moves too fast, or it exceeds the processing capabilities available on premises. But this data, when captured, formatted, manipulated, and stored, yields powerful insights through analytics, some of them never imagined.

Below you’ll find a description of some of the types of big data analytical insights and common use cases for each:

Descriptive: Descriptive Analytics uses business intelligence and data mining to ask “What has happened?” Descriptive Analytics mines data to provide trending information on past or current events that can give businesses the context they need for future actions. Descriptive Analytics is characterized by the use of KPIs. It drills down into data to uncover details such as the frequency of events, the cost of operations and the root cause of failures. Most traditional business intelligence reporting falls into this realm, but complex and sophisticated analytic techniques also fall into this realm when their purpose is to describe or characterize past events and states. Summary statistics, clustering techniques, and association rules used in market basket analysis are all examples of Descriptive Analytics.
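As a minimal sketch of the summary statistics mentioned above, the following Python snippet describes a week of order counts. The figures and the `daily_orders` name are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical daily order counts for one week (illustrative data only).
daily_orders = [120, 135, 128, 90, 150, 135, 142]

# Descriptive analytics answers "What has happened?" with summary figures.
summary = {
    "total": sum(daily_orders),
    "mean": round(mean(daily_orders), 2),
    "median": median(daily_orders),
    "min": min(daily_orders),
    "max": max(daily_orders),
}
print(summary)
```

Figures like these are the raw material for KPIs: they describe past events without explaining or predicting them.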

Diagnostic: Diagnostic Analytics examines data or content to answer the question “Why did it happen?” It’s characterized by techniques such as drill-down, data discovery, data mining and correlations. You can think of it as causal inference and the comparative effect of different variables on a particular outcome. While Descriptive Analytics might be concerned with describing how large or significant a particular outcome is, Diagnostic Analytics is more focused on determining what factors and events contributed to the outcome. As more and more cases are included in a particular analysis and more factors or dimensions are included, it may be impossible to determine precise, limited statements regarding sequences and outcomes. Contradictory cases, data sparseness, missing factors (“unknown unknowns”), and data sampling and preparation techniques all contribute to uncertainty and the need to qualify conclusions in Diagnostic Analytics as occurring in a “probability space”. Training algorithms for classification and regression techniques can be seen as falling into this space since they combine the analysis of past events and states with probability distributions. Other examples of Diagnostic Analytics include attribute importance, principal component analysis, sensitivity analysis and conjoint analysis.

Discovery: Discovery Analytics doesn’t begin with a pre-definition but rather with a goal. It approaches the data in an iterative process of “explore, discover, verify and operationalize.” This method uncovers new insights and then builds and operationalizes new analytic models that provide value back to the business. The key to delivering the most value through Discovery Analytics is to enable as many users as possible across the organization to participate in it to harness the collective intelligence. Discovery Analytics searches for patterns or specific items in a data set. It uses applications such as geographical maps, pivot tables and heat maps to make the process of finding patterns or specific items rapid and intuitive. Examples of Discovery Analytics include using advanced analytical geospatial mapping to find location intelligence or frequency analysis to find concentrations of insurance claims to detect fraud.

Predictive: Predictive Analytics asks “What could happen?” It’s used to make predictions about unknown future events. It uses many techniques from data mining, machine learning and artificial intelligence. This type of analytics is all about understanding predictions based on quantitative analysis of data sets. It’s in the realm of “predictive modeling” and statistical evaluation of those models. Examples of Predictive Analytics include classification models, regression models, Monte Carlo analysis, random forest models and Bayesian analysis. It helps businesses anticipate likely scenarios so they can plan ahead, rather than reacting to what already happened.
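To make the idea of a regression model concrete, here is a minimal sketch of an ordinary least-squares line fit written in plain Python. The monthly sales figures and the `fit_line` helper are hypothetical; a real project would more likely use a statistics or machine learning library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures (illustrative data only).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.5, 15.5, 18.0, 20.0]

slope, intercept = fit_line(months, sales)

# Predict the next, not yet observed, month from the fitted trend.
forecast_month_6 = slope * 6 + intercept
print(round(forecast_month_6, 2))
```

The forecast is only as good as the assumption that the past trend continues, which is why predictive models are evaluated statistically before they are trusted.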

Prescriptive: Prescriptive Analytics uses optimization and simulation to ask “What should we do?” It explores a set of possible actions and suggests actions based on Descriptive and Predictive Analyses of complex data. It’s all about automating future actions or decisions which are defined programmatically through an analytical process. The emphasis is on defined future responses or actions and rules that specify what actions to take. While simple threshold-based “if then” statements are included in Prescriptive Analytics, highly sophisticated algorithms such as neural nets are also typically in the realm of Prescriptive Analytics because they’re focused on making a specific prediction. Examples include recommendation engines, next best offer analysis, queueing analysis with automated assignment systems and most operations research optimization analyses.
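The simplest form of Prescriptive Analytics described above, a threshold-based “if then” rule, can be sketched as follows. The `recommend_action` function, its parameters and the safety stock value are hypothetical and only show how a prediction can be turned into a suggested action.

```python
def recommend_action(forecast_demand, stock_on_hand, safety_stock=20):
    """Return a suggested action from a simple threshold ("if then") rule."""
    # Units missing once forecast demand and a safety buffer are covered.
    shortfall = forecast_demand + safety_stock - stock_on_hand
    if shortfall > 0:
        return f"reorder {shortfall} units"
    return "no action"


print(recommend_action(forecast_demand=100, stock_on_hand=90))
print(recommend_action(forecast_demand=50, stock_on_hand=90))
```

More sophisticated prescriptive systems replace the fixed rule with optimization or simulation, but the output is the same kind of thing: a recommended action rather than a description or a forecast.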

How Big is “Big Data”?
The amount of digital information a typical business has to deal with doubles every two years. It has been predicted that the data we create and copy annually (“the digital universe”) will reach 44 zettabytes, or 44 trillion gigabytes, by the year 2020³. With AWS and the analytical solutions from popular software vendors on AWS Marketplace, there is a wealth of yet-to-be-discovered insights that could advance understanding in countless types of research.
3 EMC Digital Universe with Research & Analysis by IDC
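The unit conversion and the doubling rate quoted above can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the `growth_factor` helper is a hypothetical name used only here.

```python
BYTES_PER_GIGABYTE = 10**9    # decimal (SI) gigabyte
BYTES_PER_ZETTABYTE = 10**21  # decimal (SI) zettabyte

zettabytes = 44
gigabytes = zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE
print(f"{gigabytes:,} GB")  # 44 followed by twelve zeros, i.e. 44 trillion


def growth_factor(years, doubling_period_years=2):
    """How many times larger the data becomes if it doubles every period."""
    return 2 ** (years / doubling_period_years)


print(growth_factor(10))  # data volume multiplier after ten years
```

At a doubling period of two years, ten years of growth corresponds to five doublings, so the data volume grows by a factor of 32.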


Examples of Big Data Producers
This section is included to give you an example of some of the types of “things” that produce massive amounts of data that can be analyzed and repurposed.


Machine and Sensor Data
Machine and sensor data come from many sources, and sources continue to proliferate. Some examples are energy meters, telecommunications, road/air/sea pattern analysis, satellites, meteorological sensors and other natural phenomena monitoring, scientific and technical services, manufacturing, medical devices and the Internet of Things (IoT) such as smart homes, appliances and cities. Analyses of this type of data can reveal many trends, and predictive analysis can be used to prevent unwanted scenarios or to alert you when something goes awry.

Image and Video Data
It would take more than 5 million years to watch the amount of video that will cross global IP networks each month in 2020⁴. Some examples of image and video data are video surveillance, immersive video, virtual (and augmented) reality, internet gaming, smartphone images and video, photo and video sharing sites (YouTube, Instagram, Pinterest, etc.) and streaming video content (such as Netflix). Topological, contextual, hidden statistical patterns and historical analyses are examples of some of the types of analytics that can be done on image and video data⁵.

4 For a detailed report on Visual Networking, read Cisco’s Visual Networking Index: Forecast and Methodology, 2015-2020

Social Data
There are approximately 2 billion internet users using social networks in 2016⁶, producing enormous amounts of data not only through posts and tweets, but also through comments, likes, and so forth. Some examples include Facebook and Facebook Messenger, Twitter, LinkedIn, Vine, WhatsApp and Skype. This type of data is useful for text and sentiment analysis.

6 Statista Statistics Portal

Internet Data
The current forecast projects global IP traffic to nearly triple from 2015 to 2020, growing to 194 exabytes/month⁷. Examples of internet data include data stored on websites, blogs, and news sources, online banking and financial transactions, package and asset tracking, transportation data, telemedicine, first responder connectivity, and even chips for pets! Internet data can be analyzed for security breaches, bank fraud, traffic analysis, geographic distribution of DNS clients, and discovering the origins of cybercrime⁸.

7 Cisco: The Zettabyte Era – Trends and Analysis 2016

8 CAIDA: Center for Applied Internet Data Analysis

Log Data
Log files are records of events that occur in a system, in software, or in communications between users of software. There are many types of logging systems. Some examples are event logs, server logs, RFID logs, Active Directory logs, security logs, mail logs, network logs and transaction logs. Log data analysis includes analyzing performance, solving software bugs, testing new features, auditing for unauthorized or malicious access, and so on.
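As a small illustration of auditing logs for unauthorized access, the sketch below counts failed logins per user and flags users at or above a threshold. The log layout (“timestamp level user event”), the sample lines and the function names are all hypothetical; real log formats vary widely.

```python
from collections import Counter

# Hypothetical log lines in a made-up "timestamp level user event" layout.
log_lines = [
    "2016-03-01T10:00:01 INFO alice login_ok",
    "2016-03-01T10:00:07 WARN bob login_failed",
    "2016-03-01T10:00:09 WARN bob login_failed",
    "2016-03-01T10:00:12 WARN bob login_failed",
    "2016-03-01T10:01:30 WARN carol login_failed",
]


def failed_logins(lines):
    """Count failed login events per user."""
    counts = Counter()
    for line in lines:
        _timestamp, _level, user, event = line.split()
        if event == "login_failed":
            counts[user] += 1
    return counts


def suspicious_users(lines, threshold=3):
    """Users with at least `threshold` failed logins, sorted by name."""
    return sorted(
        user for user, count in failed_logins(lines).items() if count >= threshold
    )


print(suspicious_users(log_lines))
```

The same counting pattern extends to other log analyses, such as error rates per service or slow requests per endpoint.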

Third-Party Data
Third-party data is any information collected by an entity that does not have a direct relationship with the user the data is being collected on. Often this data is generated on a variety of platforms and then aggregated together for analysis. Examples include geospatial data, mapping and demographic data, content delivery networks, CRM and other business software systems. Third-party data can be analyzed for trends in traffic, spread of disease, user behavior, and more.

AWS Cloud Computing Models and Deployment Models
Cloud computing provides developers and IT departments the ability to focus on what matters most and avoid undifferentiated work like procurement, maintenance, and capacity planning. There are several different models and deployment strategies that help meet specific needs of different users. Each type of cloud service and deployment method provides different levels of control, flexibility and management. Understanding the differences between “Infrastructure as a Service” (IaaS), “Platform as a Service” (PaaS), and “Software as a Service” (SaaS), in addition to the different deployment strategies available, can help you decide what set of services is right for your business needs.

Before an analytical cloud project starts, it’s important to choose the right cloud computing and deployment architectures. Many factors come into play, such as the location of the data to be analyzed, where the analytics processing will be performed, and the need to abide by the legal and regulatory requirements of different countries. Once you’ve determined the best cloud computing and deployment model, you can utilize Amazon CloudFront to speed up the distribution of your application. Amazon CloudFront delivers your content through a worldwide network of edge locations, so that when a user requests content served from CloudFront, they’re routed to the edge location that provides the lowest latency, so content is delivered with the best possible performance. For more details about Amazon CloudFront, click here.
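The lowest-latency routing idea described above can be pictured with a purely conceptual sketch. It is not how Amazon CloudFront is implemented and does not use any CloudFront API; the edge location names, latency values and the `nearest_edge` helper are hypothetical.

```python
def nearest_edge(latency_ms_by_edge):
    """Pick the edge location with the lowest measured latency."""
    return min(latency_ms_by_edge, key=latency_ms_by_edge.get)


# Hypothetical round-trip latencies from one user to three edge locations.
latencies = {
    "edge-frankfurt": 18.0,
    "edge-london": 27.5,
    "edge-virginia": 95.0,
}

print(nearest_edge(latencies))
```

A user in another part of the world would see a different set of latencies and therefore be served from a different edge location.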


AWS Cloud Computing Models
There are three main models for cloud computing on AWS. Each model represents different parts of the cloud computing stack.

  1. IaaS contains the basic building blocks for cloud IT. This model typically provides access to networking features, “computers” (virtual or on dedicated hardware), and data storage space. This model gives you the highest level of flexibility and management control over your IT resources and is most similar to existing on-premises IT resources. IaaS is usually the first model used when moving to the cloud
  2. PaaS removes the need to manage the underlying infrastructure (usually hardware and operating systems) which allows you to focus on the deployment and management of your applications. This increases efficiency because you don’t have to worry about resource procurement, capacity planning, software maintenance, patching, or any of the other undifferentiated “heavy lifting” involved in running your applications
  3. SaaS provides you with a completed product that’s run and managed by the service provider. In most cases, people referring to SaaS are referring to end-user applications. With SaaS you don’t have to think about how the service is maintained or how the underlying infrastructure is managed; you only need to think about how you’ll use the software.

You’ll find popular open source and commercial software on AWS Marketplace available as SaaS (in addition to individual Amazon Machine Images (AMIs) or a cluster of AMIs deployed through an AWS CloudFormation template).

AWS Cloud Computing Deployment Models
There are three AWS cloud computing deployment models: Public, Hybrid, and Private.

*NOTE: These describe where the IT resources reside, and are separate from the many ways to get your data and applications onto AWS.


AWS Public Cloud Model (Cloud Native)
The AWS public cloud is where most companies and individuals start. It’s the easiest, fastest way to begin to use on-demand delivery of IT resources and applications via the Internet with a low-cost, pay-as-you-go pricing model via AWS services and solutions available on AWS Marketplace.

The public cloud is an ideal place to quickly use big data analytics solutions on the AWS Marketplace to experiment, innovate and try new and different analytical solutions. Spin up solutions as you need them, turn them off when you’re done and only pay for what you’ve used.

The public cloud provides a simple way to access servers, storage, databases and a huge set of application services. AWS owns and maintains the network-connected hardware required for these services, while you provision and use what you need via the AWS console. Using the public cloud gives you the benefits of cloud computing such as the following:

  • Rather than investing in data centers and servers before you know what you’re going to use, you pay only when you consume computing resources, and only for how much you use
  • Achieve lower variable cost because hundreds of thousands of customers are aggregated in the cloud
  • Eliminate guessing on infrastructure capacity needs. Access as much or as little as you need, and scale up or down as required within minutes
  • Increase speed and agility, since new IT resources are only a click away
  • Focus on projects that differentiate your business since AWS does all the heavy lifting of racking, stacking and powering servers
  • Easily deploy your application in multiple regions around the world, going global in minutes

The public network contains elements that may be sourced from the internet, data sources and users, and the edge services needed to access the AWS cloud or enterprise network. The flow from the external internet may come through normal edge services including DNS servers (Amazon Route 53, for example), content delivery networks (Amazon CloudFront, for example), firewalls (Amazon VPC security groups, for example) and load balancers (Elastic Load Balancing, for example) before entering the data integration or data streaming entry points to the data analytics solution.

AWS Hybrid Cloud Model
For companies that have significant on-premises and/or data center investments, migrating to the cloud can take years. Therefore, it’s very common to see enterprises use a “Hybrid Cloud Architecture”, where critical data and processing remain in the data center and other resources are deployed in public cloud environments. Processing resources can be further optimized with a hybrid topology that enables cloud analytics engines to work with on-premises data. This leverages enhanced cloud software deployment and update cycles while keeping data inside the firewall.

Another benefit of a hybrid environment is the ability to develop applications on dedicated resource pools that eliminate the need to compromise on configuration details like processors, GPUs, memory, networking and even software licensing constraints. The resulting solution can be subsequently deployed to an Infrastructure as a Service (IaaS) cloud service that offers compute capacity matching the dedicated hardware environment that otherwise would be hosted on-premises. This feature is rapidly becoming a big differentiator for cloud applications that need to hit the ground running with the right configuration to meet real-world demands.

AWS Private/Enterprise Cloud Model

The main reason to choose a Private Cloud Environment is for network isolation. Your EC2 instances are created in a virtual private cloud (VPC) to provide a logically isolated section on the AWS cloud.

Within that VPC, you have complete control over the virtual networking environment, including your own IP range selection, subnet creation, and configuration of route tables and network gateways. You can also create a Hardware Virtual Private Network (VPN). You can implement fine-grained access roles and groups, and stages of isolation for users. Enterprise governance and private encryption resources are available in a private cloud model. For more information on Enterprise cloud computing with AWS, click here. There are solutions on AWS Marketplace that allow you to perform big data analytics on the cloud while keeping your data on-premises.

Overview: AWS Identity and Access Management and Other AWS Built-In Security Features


AWS Security Overview
Before delving into any of the AWS services used for big data advanced analytics, security must at least be briefly addressed. For any business, cloud security is the number one concern. AWS has industry-leading capabilities across facilities, networks, software and business processes, meeting the strictest requirements for any vertical. Security is a core functional requirement that protects mission-critical information from accidental or deliberate theft, leakage, integrity compromise, and deletion.

AWS customers benefit from a data center and network architecture built to satisfy the requirements of its most security-sensitive customers. AWS uses redundant layered controls, continuous validation and testing, and a substantial amount of automation to ensure that the underlying infrastructure is monitored and protected 24×7. These controls are replicated in every new data center and service.

Under the AWS “Shared Responsibility Model”, AWS is responsible for the security of the underlying cloud infrastructure and you are responsible for securing workloads you deploy in AWS, giving you the flexibility and agility to implement the most applicable security controls for your business functions in the AWS environment.

There are certain security features, such as individual user accounts and credentials, SSL/TLS for data transmissions and user activity logging that you should configure no matter which AWS service you use.

Identity and Access Management – User Accounts
AWS provides a variety of tools and features to keep your AWS account and resources safe from unauthorized use. This includes credentials for access control, HTTPS endpoints for encrypted data transmission, the creation of separate Identity and Access Management (IAM) user accounts, user activity logging for security monitoring, and Trusted Advisor security checks.

Only the business owner should have “root access” to your AWS account. The screenshot below is what you see when you’re logging in with your “root credentials”:


The screenshot below is the login page you see when you log in with your Identity and Access Management (IAM) account credentials (Note the additional “Account” textbox and the link at the bottom stating “Sign-in using root account credentials”):


To avoid using your AWS root account for everyday access, use IAM to create and manage individual users (a user can be an individual, system, or application that interacts with your AWS resources). With IAM, you define policies that control which AWS services your users can access and what they can do with them. This gives you very fine-grained control, so you can grant each user only the minimum permissions needed to do their job.
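As a minimal sketch of what "minimum permissions" can look like, the snippet below builds an IAM policy document that allows read-only access to a single Amazon S3 bucket. The bucket name and the helper function are hypothetical and only for illustration; the JSON layout (Version, Statement, Effect, Action, Resource) follows the standard IAM policy format.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# "example-analytics-data" is a hypothetical bucket name.
policy = read_only_bucket_policy("example-analytics-data")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A policy like this could be attached to an analyst's IAM user or group, so that the analyst can read input data but cannot modify or delete it.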

The Table Below Gives You an Overview of AWS User Security Measures:


To read more about IAM security best practices, click here.

AWS Network, Data and API Security
The AWS network has been architected to permit you to select the level of security and resiliency appropriate for your workload. It enables you to build geographically dispersed, fault-tolerant web architectures with cloud resources via a world-class network infrastructure that’s continually monitored and managed.

Most enterprises take advantage of Amazon Virtual Private Cloud (VPC) that enables you to launch AWS resources into a virtual network you define that resembles your own data center network but with the benefits of using the scalable infrastructure of AWS. For more information, click here.

Below are Some of AWS Network Security Measures:
• Firewall and other boundary devices that employ rule sets, access control lists (ACLs) and configurations
• Secure access points with comprehensive monitoring
• Transmission protection via HTTPS using SSL/TLS
• Continually monitoring systems at all levels
• Account audits every 90 days
• Security logs
• Individual service-specific security
• Virtual Private Gateways / Internet Gateways
• Amazon Route 53 Security (DNS)
• CloudFront Security
• Direct Connect Security for Hybrid Cloud Architectures
• Multiple Data Security Options
• Encryption and Data Encryption at rest
• Event Notifications
• Amazon Cognito Federated Identity Authentication

For more information on AWS Network, Data, and API Security, look here.

AWS Trusted Advisor
AWS Trusted Advisor scours your infrastructure and provides continual best practice recommendations free of charge in four categories: Cost Optimization, Performance, Security and Fault Tolerance. Within the Trusted Advisor console, details are given and there are direct links to the exact resource that requires attention. If you have a Business or Enterprise Support Plan, you also have access to numerous other best practice recommendations. See the image below to understand the way AWS Trusted Advisor works:


To read more about AWS Trusted Advisor click here.

AWS Marketplace Software Solutions to Augment AWS’s Built-In Security Features

AWS’s infrastructure monitoring and security features can be enhanced and customized to meet the needs of any business by augmenting AWS built-in features with a plethora of options available on AWS Marketplace to create a secure cloud nirvana.

Some of the solutions to enhance security can be found here.

Using AWS Services with Solutions Available on AWS Marketplace for Big Data Analytics
This section will describe how to implement, augment, or customize some of the most commonly used AWS managed services in big data analytics with solutions available on AWS Marketplace.

Below you’ll find the AWS Management Console (the view below is once you’ve logged in), from where you access AWS’s managed services:


Amazon EC2: Self-Managed Big Data Analytics Solutions on AWS Marketplace


Amazon Elastic Compute Cloud (EC2) Overview
Amazon EC2 provides an ideal platform for operating your own self-managed big data analytics applications on AWS infrastructure. Almost any software you can install on Linux or Windows virtualized environments can be run on Amazon EC2 with a pay-as-you-go pricing model, including solutions available on AWS Marketplace. Depending on the architecture you implement, computing power can be distributed across parallel servers to execute algorithms in the most efficient manner.

Amazon EC2 provides scalable computing capacity through highly configurable instance types launched from an Amazon Machine Image (AMI). You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking and manage storage. For a quick test run or one-time big data analytics project, you can use instance store volumes for temporary data that’s deleted when you stop or terminate your instance, or use Amazon Elastic Block Store (EBS) for persistent storage.

EC2 also provides virtual networks you can create that are logically isolated from the rest of the AWS cloud that you can optionally connect to your own network, known as virtual private clouds (VPCs). You could use a VPC to run analytics solutions with data in your data center, or use one of the solutions on AWS Marketplace that facilitates a hybrid deployment model like Attunity CloudBeam (which has many other big data analytics features).

Examples of Amazon EC2 Self-Managed Analytics Solutions on AWS Marketplace
Some examples of self-managed big data analytics that run on Amazon EC2 include the following:
• A Splunk Enterprise Platform, the leading software platform for real-time Operational Intelligence. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. A Splunk Analytics for Hadoop solution, called Hunk, is also available on AWS Marketplace. It enables interactive exploration, analysis, and data visualization for data stored in Amazon EMR and Amazon S3
• A Tableau Server Data Visualization Instance, for users to interact with pre-built data visualizations created using Tableau Desktop. Tableau server allows for ad-hoc querying and data discovery, supports high-volume data visualization and historical analysis, and enables the creation of reports and dashboards
• A SAP HANA One Instance, a single-tenant SAP HANA database instance that has SAP HANA’s in-memory platform, to do transactional processing, operational reporting, online analytical processing, predictive and text analysis
• A Geospatial AMI such as MapLarge, that brings high-performance, real-time geospatial visualization and interactive analytics. MapLarge’s visualization results are useful for plotting addresses on a map to determine demographics, analyzing law enforcement and intelligence data, delivering insight to public health information, and visualizing distances such as roads and pipelines
• An Advanced Analytics Zementis ADAPA Decision Engine Instance, which is a platform and scoring engine to produce Data Science predictive models that integrate with other predictive models like R, Python, KNIME, SAS, SPSS, SAP, FICO and more. Zementis ADAPA Decision Engine can score data in real-time using web services or in batch mode from local files or data in Amazon S3 buckets. It provides predictive analytics through many predictive algorithms, sensor data processing (IoT), behavior analysis, and machine learning models
• A Matillion Data Integration Instance, an ELT service natively built for Amazon Redshift, that uses Amazon Redshift’s processing for data transformations to utilize its blazing speed and scalability. Matillion gives the ability to orchestrate and/or transform data upon ingestion or simply load the data so it can be transformed multiple times as your business requires

Below is an awesome brochure on how using solutions available on “AWS Marketplace Re-Invents the Way You Choose, Test, and Deploy Analytics Software”:


Amazon EC2 Instance Types
Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity. Compute capacity is described in “vCPUs” (virtual CPUs), which replaced the legacy “ECU” (Elastic Compute Unit) measure that you’ll still see at times today. Each instance type includes one or more instance sizes, allowing you to scale your resources to the requirements of your target analytical workload. To read more about the differences between Amazon EC2-Classic and Amazon EC2-VPC, read this.

Performance is based on the Amazon EC2 instance type you choose. There are many instance types that you can read about here; the four main EC2 instance types that power big data analytics are described below:

  • Compute Optimized: Compute-optimized instances, such as C4 instances, feature the highest performing processors and the lowest price/compute performance in EC2. With support for clustering C4 instances, they’re ideal for batch processing, distributed analytics, high performance science and engineering applications, ad serving, MMO gaming, and video encoding
  • Memory Optimized: Memory optimized instances have the lowest cost per GB of RAM among Amazon EC2 instance types. These instances are ideal for high performance databases, distributed memory caches, in-memory analytics, genome assembly and analysis, and other large enterprise applications
  • GPU Optimized: GPU instances are ideal to power graphics-intensive applications such as 3D streaming, machine learning, and video encoding. Each instance features high-performance NVIDIA GPUs with an on-board hardware video encoder designed to support up to eight real-time HD video streams (720p@30fps) or up to four real-time full HD video streams (1080p@30fps)
  • Dense Storage: Featuring up to 48 TB of HDD-based local storage, dense storage instances deliver high throughput, and offer the lowest price per disk throughput performance on EC2. This instance type is ideal for Massively Parallel Processing (MPP), Hadoop, distributed file systems, network file systems, and big data processing applications

Amazon S3: A Data Store for Computation and Large-Scale Analytics


Amazon Simple Storage Service (S3) Overview
Amazon S3 is storage for the internet. It’s a simple storage service that offers software developers a highly-scalable, reliable, and low-cost data storage infrastructure. It provides a simple web service interface that can be used to store and retrieve any amount of data, at any time, from within Amazon EC2 or anywhere on the web. You can read, write and delete objects containing from 1 byte to 5 TB of data each. The number of objects you can store in an S3 “bucket” is virtually unlimited. It’s highly secure, supports encryption at rest, and provides multiple mechanisms to provide fine-grained control of access to Amazon S3 resources. Amazon S3 allows concurrent read or write access by many separate clients or application threads. No storage provisioning is necessary.

Amazon S3 is very commonly used as a data store for computation and large-scale analytics, such as financial transactions, clickstream analytics, and media transcoding. Because of the horizontal scalability of Amazon S3, you can access your data from multiple computing nodes concurrently without being constrained by a single connection.

Amazon S3 is the common data repository for pre- and post-processing with Amazon EMR.


Amazon S3 is well-suited for extremely spiky bandwidth demands, making it the perfect storage for Amazon EMR batch analysis. Because Amazon S3 is inexpensive, highly durable (it stores objects redundantly on multiple devices across multiple facilities), and protects critical data from inadvertent deletion with its versioning capability, data is often kept on S3 for long periods after processing with Amazon EMR so that new queries can be run on the same data. If you store your data on Amazon S3, you can access that data from as many Amazon EMR clusters as you need.


Amazon S3 is the common data repository for Amazon Redshift before loading the data into the Amazon Redshift Data Warehouse. You use the “COPY” command to load data from Amazon S3:
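As a hedged sketch of what such a load can look like, the helper below assembles a COPY statement for gzip-compressed CSV files in Amazon S3, assuming the cluster is authorized through an IAM role. The table name, bucket path, role ARN and the helper function itself are hypothetical, and the function does no input sanitizing, so it is suitable only for trusted values.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Assemble a Redshift COPY statement for gzip-compressed CSV data in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# All three values below are placeholders, not real resources.
copy_sql = build_copy_statement(
    table="sales",
    s3_uri="s3://example-analytics-data/sales/2016/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
```

The resulting statement would be run against the cluster through any SQL client connected over ODBC or JDBC. Pointing COPY at a key prefix rather than a single file lets Amazon Redshift load the matching files in parallel.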


In addition, all data written to any node in an Amazon Redshift cluster is continually backed up to Amazon S3.


Examples of Some of Amazon S3’s Benefits in Large-Scale Analytics:
• S3 storage provides the highest level of data durability and availability in the AWS platform
• Error correction is built-in, and there are no single points of failure. It’s designed to sustain concurrent loss of data in two facilities, making it very well-suited to serve as the primary data storage for mission-critical data
• Amazon S3 is designed for 99.999999999% (11 nines) durability per object and 99.99% availability over a one-year period
• Highly scalable, with practically unlimited storage
• Access to Amazon S3 from Amazon EC2 in the same region is lightning fast; server-side latencies are insignificant relative to Internet latencies
• Because Amazon S3 can be accessed using multiple threads, multiple applications and multiple clients concurrently, total Amazon S3 aggregate throughput scales to rates that far exceed what any single server can generate or consume
• To speed access to relevant data, many developers pair Amazon S3 with a database, such as Amazon DynamoDB or Amazon RDS, where Amazon S3 stores the actual information and the database serves as the repository for associated metadata. Metadata in the database can be easily indexed and queried, making it efficient to locate an object’s reference via a database query, and this result can then be used to pinpoint and retrieve the object itself from Amazon S3
• You can nest folders in Amazon S3 “buckets” and give fine-grained access control to each
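The pattern of pairing Amazon S3 with a metadata database, described in the list above, can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for Amazon DynamoDB or Amazon RDS, and the table layout, object keys and customer names are invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for the metadata database (an assumption for
# this sketch; in practice this would be Amazon DynamoDB or Amazon RDS).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object_metadata ("
    "s3_key TEXT PRIMARY KEY, customer TEXT, event_date TEXT, size_bytes INTEGER)"
)
# The index makes lookups by customer and date efficient.
conn.execute(
    "CREATE INDEX idx_customer_date ON object_metadata (customer, event_date)"
)

# Hypothetical object keys and metadata.
rows = [
    ("logs/acme/2016-03-01/part-0000.gz", "acme", "2016-03-01", 1048576),
    ("logs/acme/2016-03-01/part-0001.gz", "acme", "2016-03-01", 524288),
    ("logs/acme/2016-03-02/part-0000.gz", "acme", "2016-03-02", 786432),
    ("logs/globex/2016-03-01/part-0000.gz", "globex", "2016-03-01", 262144),
]
conn.executemany("INSERT INTO object_metadata VALUES (?, ?, ?, ?)", rows)


def find_keys(connection, customer, event_date):
    """Return the S3 keys whose metadata matches the given customer and date."""
    cursor = connection.execute(
        "SELECT s3_key FROM object_metadata "
        "WHERE customer = ? AND event_date = ? ORDER BY s3_key",
        (customer, event_date),
    )
    return [row[0] for row in cursor.fetchall()]


matching_keys = find_keys(conn, "acme", "2016-03-01")
print(matching_keys)
```

Only the keys returned by the query would then be fetched from Amazon S3, which avoids listing or scanning the whole bucket to find the relevant objects.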

Amazon Redshift: A Massively Parallel Processing (MPP) Petabyte-Scale Enterprise Data Warehouse


Amazon Redshift Overview
Amazon Redshift is a fast and powerful, fully-managed, petabyte-scale data warehouse service that makes it easy and cost-effective to efficiently analyze all your data by seamlessly integrating with existing business intelligence, reporting, and analytics tools. It’s optimized for datasets ranging from a few hundred gigabytes to a petabyte or more. You can start small for a very low cost per hour with no commitments and scale to petabytes for as little as one tenth the cost of traditional solutions. And, when you need to scale, you simply add more nodes to your cluster and Amazon Redshift redistributes your data for maximum performance, with no downtime.


Amazon Redshift is a SQL data warehouse solution and uses standard ODBC and JDBC connections. Your data warehouse can be up and running in minutes, enabling you to use your data to acquire new insights for your business and customers continually.

Traditional data warehouses required significant expenditures, time and resources to buy, build, and maintain, and they didn’t scale easily. Therefore, as your requirements grew, you had to invest in more hardware and resources. You also had the expense of hiring many DBAs to ensure your queries were working correctly and that there was no data loss. Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups, patches, and upgrades.

Amazon Redshift’s Features Enabling Large-Scale Analytics
Amazon Redshift uses columnar storage and a massively parallel processing (MPP) architecture to parallelize and distribute queries across multiple nodes to consistently deliver high performance at any volume of data.


Amazon Redshift automatically and continuously monitors your cluster and copies your data into Amazon S3 so you can restore your data warehouse with a few clicks. It stores three copies of your data for reliability, and uses data compression and zone maps to reduce the amount of I/O needed to perform queries. Security is built in: you can encrypt data at rest and in transit using hardware-accelerated AES-256 and SSL, and you can run your Amazon Redshift cluster in an Amazon VPC. All API calls, connection attempts, queries and changes to the cluster are logged and auditable.
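To illustrate how zone maps reduce I/O, here is a simplified sketch of the general idea, not Amazon Redshift’s internal implementation. Each block of a column keeps its minimum and maximum value, and a query reads only the blocks whose range could contain the value it is looking for. The block contents and function names are invented for the example.

```python
def build_zone_map(blocks):
    """Record the (min, max) value of every block of column data."""
    return [(min(block), max(block)) for block in blocks]


def scan_equal(blocks, zone_map, target):
    """Find values equal to target, skipping blocks whose range excludes it."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, zone_map):
        if target < low or target > high:
            continue  # The zone map proves this block cannot match: no I/O.
        blocks_read += 1
        matches.extend(value for value in block if value == target)
    return matches, blocks_read


# Three hypothetical blocks of a sorted integer column.
blocks = [[1, 3, 5, 7], [8, 9, 12, 15], [16, 20, 20, 31]]
zone_map = build_zone_map(blocks)
matches, blocks_read = scan_equal(blocks, zone_map, 20)
print(matches, blocks_read)
```

In this example only one of the three blocks has to be read. The benefit is greatest when the data is sorted on the column being filtered, because the block ranges then overlap very little.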

An Amazon Redshift data warehouse is a collection of computing resources called “nodes” that are organized into a group called a “cluster”. Each cluster runs an Amazon Redshift engine and contains one or more databases. Each cluster has a leader node and one or more compute nodes. The “leader node” receives queries from client applications, parses the queries and develops query execution plans. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes, and finally returns the results to the client applications. “Compute nodes” execute the query execution plans, transmit data among themselves to serve these queries, and send their intermediate results back to the leader node for aggregation.
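The division of work between the leader node and the compute nodes can be sketched as a scatter-gather computation. The example below is a toy model under stated assumptions: threads stand in for compute nodes, each "node" holds a slice of a hypothetical sales table, and the function names are invented. It computes an average by combining partial sums and counts, which is the kind of intermediate result a leader would aggregate.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_partial(slice_rows):
    """Run on a 'compute node': return (sum, count) for its slice of rows."""
    amounts = [row["amount"] for row in slice_rows]
    return sum(amounts), len(amounts)


def leader_average(node_slices):
    """Run on the 'leader': fan the work out, then combine partial results."""
    with ThreadPoolExecutor(max_workers=len(node_slices)) as pool:
        partials = list(pool.map(compute_node_partial, node_slices))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Hypothetical rows, already distributed across three compute nodes.
node_slices = [
    [{"amount": 10.0}, {"amount": 20.0}],
    [{"amount": 30.0}],
    [{"amount": 40.0}, {"amount": 50.0}, {"amount": 60.0}],
]
average = leader_average(node_slices)
print(average)
```

Note that the nodes return sums and counts rather than their own averages. Averaging the per-node averages would give the wrong answer when slices have different sizes, which is why the final aggregation step belongs to the leader.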


Data typically flows into a data warehouse from many different sources and in many different formats including structured, semi-structured, and unstructured data. This data is processed, transformed, and ingested at a regular cadence. You can use AWS Data Pipeline to extract, transform, and load data into Amazon Redshift. AWS Data Pipeline provides fault tolerance, scheduling, resource management and an easy-to-extend API for your ETL. It can reliably process and move data between different AWS compute and storage services as well as on-premises data sources.

You can also use AWS Database Migration Service to stream data to Amazon Redshift from any of the supported sources including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, SAP ASE and SQL Server, enabling consolidation for easy analysis of data in Amazon Redshift.

Amazon Redshift is integrated with other AWS services and has built-in commands to load data in parallel to each node from Amazon S3, Amazon DynamoDB or your Amazon EC2 and on-premises servers using SSH. Amazon Kinesis and AWS Lambda integrate with Amazon Redshift as a target. You can also load streaming data into Amazon Redshift using Amazon Kinesis Firehose.

Amazon Redshift Analytics Examples
• Analyze Global Sales Data for Multiple Products
• Store Historical Stock Trade Data
• Analyze Ad Impressions and Clicks
• Aggregate Gaming Data
• Analyze Social Trends
• Measure Clinical Quality, Operation Efficiency, and Financial Performance in the Healthcare Space

AWS Marketplace Solutions for Amazon Redshift
Data can be loaded into Amazon Redshift from a multitude of solutions available from popular software vendors on AWS Marketplace to assist in Data Integration, Analytics, and Reporting and Visualization. Many of the solutions include much more than the broad topic titles.

For Data Integration and more, Matillion ETL for Redshift is a fast, modern, easy-to-use and powerful ETL/ELT tool that makes it simple and productive to load and transform data on Amazon Redshift. According to the vendor, it is up to 100 times faster than traditional ETL technology and can be up and running in under 5 minutes. With a few clicks you can load data directly into Redshift, fast, from Amazon S3; Amazon RDS; relational, columnar, cloud and NoSQL databases; FTP/HTTP; REST, SOAP, & JSON APIs; Amazon EMR; and directly from enterprise and cloud-based systems including Google Analytics, Google Adwords, Facebook, Twitter and more.


Matillion ETL for Redshift transforms data at eye-popping speed in a productivity-orientated, streamlined, browser-based graphical job development environment. Expect a 50% reduction in ETL development and maintenance effort and months off your project as a result of the streamlined UI, tight integration with AWS and Redshift, and the sheer speed.


Matillion ETL for Redshift delivers a rich orchestration environment where you can orchestrate and schedule data load and transform; control flow; integrate with other systems and AWS services via Amazon SQS, Amazon SNS and Python; iterate; manage variables; create and drop tables; vacuum and analyze tables; soft code ETL/ELTs from configuration tables; control transactions and commit/roll-back; setup alerting; and develop data quality, error-handling and conditional logic.

For Advanced Analytics, TIBCO Spotfire Analytics Platform is a complete analytics solution that helps you quickly uncover insights for better decision-making. Explore, visualize, and create dashboards for Amazon Redshift, RDS, Microsoft Excel, SQL Server, Oracle, and more in minutes. Easily scale from a small team to the entire organization with Spotfire for AWS. It includes 1 Spotfire Analyst user (via Microsoft Remote Desktop), unlimited Consumer and Business Author (web) users, plus Spotfire Server, Web Player, Automation Services and Statistics Services. Go from data to dashboard in under a minute. No other solution makes it as easy to get started or deliver analytics expertise. TIBCO Spotfire® Recommendations suggests the best visualizations based on years of best practices. Broadest Data Connectivity: access and combine all of your data in a single analysis to get a holistic view of your business, whether cloud or on-premises, small or big data, with best-in-class analytics for any data source, including Amazon Redshift and RDS.


Comprehensive Analytics: a full spectrum of analytics capabilities to empower novice to advanced users, including interactive visualizations, data mashup, predictive and prescriptive analytics, location analytics, and more.


For Data Analysis and Visualization, Tableau Server for AWS is browser and mobile-based visual analytics anyone can use. Publish interactive dashboards with Tableau Desktop and share them throughout your organization.



Embedded or as a stand-alone application, you can empower your business to find answers in minutes, not months. By deploying on the AWS Marketplace you can stand up a perfectly sized instance for your Tableau Server with just a few clicks. Tableau helps tens of thousands of people see and understand their data by making it simple for the everyday data worker to perform ad-hoc visual analytics and data discovery as well as to seamlessly build beautiful dashboards and reports. Tableau is designed to make connecting live to data of all types a simple process that does not require any coding or scripting. From cloud sources like Amazon Redshift, to on-premises Hadoop clusters, to local spreadsheets, Tableau gives everyone the power to quickly start visually exploring data of any size to find new insights.

For Data Warehouse Databases, SAP HANA One is a production-ready, single-tenant configured SAP HANA database instance that is upgradable to the latest HANA SPS version (by Addon). Perform real-time analysis, and develop and deploy real-time applications with SAP HANA One. Natively built using in-memory technology and now deployed on AWS, SAP HANA One accelerates transactional processing, operational reporting, OLAP, and predictive and text analysis while bypassing traditional data latency and maintenance issues created through pre-materializing views and pre-caching query results. Unlike other database management systems, SAP HANA One on AWS streamlines both transactional (OLTP) and analytical (OLAP) processing by working with a single data copy in the in-memory columnar data store. By consolidating OLAP and OLTP workloads into a single in-memory RDBMS, you benefit from a dramatically lower TCO, in addition to mind-blowing speed. Build new, or deploy existing, on-demand applications on top of this instance for productive use. Developers can take advantage of this offering through standards-based open connectivity protocols ODBC, JDBC, ODBO, ODATA and MDX, allowing ease of integration with existing tools and technologies. Transform decision processing by streamlining transactions, analytics, planning, predictive and text analytics on a single in-memory platform. HANA One instances are now more secure with SSH root login disabled. Customers can now log in to the instance using the new ‘ec2-user’ user.

Amazon EMR: A Managed Hadoop Distributed Computing Framework


Amazon Elastic MapReduce (EMR) Overview

With Amazon EMR you can analyze and process vast amounts of data by distributing the computational work across a resizable cluster of virtual servers using Apache Hadoop, an open-source framework. Open-source projects that run on top of the Hadoop architecture, such as Hive, Pig, and Spark, can also be run on Amazon EMR.

Hadoop uses a distributed processing architecture called MapReduce in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced down to a single output set. One node, designated as the “master node”, controls the distribution of tasks.
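The map and reduce steps can be sketched with the classic word count example. This is a single-process illustration of the programming model only, not Hadoop itself; in a real cluster the map calls run on many servers and the framework performs the grouping (shuffle) between them. The function names and input lines are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data on AWS", "big data analytics"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across as many servers as the data requires.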


Amazon EMR has made enhancements to Hadoop and the other open-source applications to work seamlessly with AWS. For example, Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and Amazon CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of Amazon DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. The resulting managed cluster is called an Amazon EMR cluster.


Amazon EMR’s Features Enabling Large-Scale Analytics

Hadoop provides the framework to run big data processing and analytics, and Amazon EMR does all the heavy lifting involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster. You can easily provision a fully managed Hadoop framework in minutes. You can scale your Hadoop cluster dynamically and pay only for what you use, from one to thousands of compute instances. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances. You can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete.

Amazon EMR securely and reliably handles big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

Amazon EMR monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. It automatically configures Amazon EC2 firewall settings that control access to instances, and you can launch clusters in an Amazon VPC. For objects stored in Amazon S3, you can use Amazon S3 server-side encryption or Amazon S3 client-side encryption with EMRFS, with AWS Key Management Service or customer-managed keys. You can customize every cluster.

Apache Spark is an engine in the Apache Hadoop ecosystem for fast and efficient processing of large datasets. By using in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to define data transformations, Spark has shown significant performance increases for certain workloads when compared to Hadoop MapReduce. Spark also lends its speed to higher-level tools such as Spark SQL, and it can run on top of YARN (the resource manager for Hadoop 2). AWS has revised the bootstrap action to install Spark 1.x on AWS Hadoop 2.x AMIs and run it on top of YARN. The bootstrap action also installs and configures Spark SQL (SQL-driven data warehouse), Spark Streaming (streaming applications), MLlib (machine learning), and GraphX (graph systems).
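To make the idea of defining transformations up front and running them later more concrete, here is a toy sketch in plain Python. It is not the Spark API: LazyDataset is a hypothetical class, and it records a simple linear chain of steps rather than a general DAG. It only illustrates that transformations are recorded as lineage and executed when a result is requested, which is also what allows a lost result to be recomputed from the original input.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores its lineage, not its results."""

    def __init__(self, source, lineage=()):
        self._source = source            # original input, never modified
        self._lineage = tuple(lineage)   # ordered chain of recorded steps

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        """Action: only now is the recorded chain actually executed."""
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


if __name__ == "__main__":
    evens_squared = (
        LazyDataset(range(10))
        .filter(lambda n: n % 2 == 0)
        .map(lambda n: n * n)
    )
    print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

Calling collect() a second time replays the same lineage against the source, which is the essence of how a lineage-based system can rebuild data after a failure instead of replicating every intermediate result.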

The S3 location for the Spark installation bootstrap action is:

Amazon EMR Analytics Examples

  • Log processing and analytics
  • Large extract, transform, and load (ETL) data movement
  • Risk modeling and threat analytics
  • Ad targeting and click stream analytics
  • Genomics
  • Predictive analytics
  • Ad hoc data mining and analytics

You can view an architectural diagram of Web Log Analysis on AWS here, and another on Advertisement Serving here.

AWS Marketplace Solutions for Amazon EMR and the Hadoop Ecosystem

Data can be loaded into EMR from a multitude of solutions available from popular software vendors on AWS Marketplace to assist in Data Integration, Analytics, and Reporting and Visualization. Many of the solutions include much more than the broad topic titles.

For Data Integration, Attunity CloudBeam for S3, EMR and other Hadoop distributions simplifies, automates, and accelerates the loading and replication of data from a variety of structured and unstructured sources to create a data lake for Hadoop consumption on Amazon S3, including replication across Amazon Regions.


Attunity CloudBeam simplifies and streamlines ingesting enterprise data for use in Big Data Analytics by EMR or other Hadoop distributions from Cloudera, Hortonworks or MapR as well as for pre-processing before moving data into Redshift, S3, or RDS. Attunity CloudBeam is designed to handle files of any size, transferring content over any given network connection, thereby achieving best-in-class acceleration and guaranteed delivery.


Attunity CloudBeam’s automation provides intuitive administration, scheduling, delta-only replication, security, and monitoring.



For Advanced Analytics, Infosys Information Platform (IIP) leverages the power of open source to address big data adoption challenges such as inadequate accessibility of easy-to-use development tools, a fragmented approach to building data pipelines, and the lack of an enterprise-ready open source big data analytics platform that can support all forms of data: structured, semi-structured, and unstructured.


It is a one-stop solution that covers requirements from data engineering to data science, from ingestion through visualization. It offers one-click launch, high performance, scalability, and enterprise-grade security, and it delivers actionable insights in real time.


For Data Analysis and Visualization, TIBCO Jaspersoft Reporting and Analytics for AWS is a commercial open source reporting and analytics server built for AWS that can run standalone or be embedded in your application. It is priced very aggressively with a low hourly rate that has no data or user limits and no additional fees. A multi-tenant version is available as a separate Marketplace listing. Free Online Support is available for registration upon launching the instance. Professional Support is available separately from TIBCO sales.


Jaspersoft’s business intelligence suite allows you to easily create beautiful, interactive reports, dashboards and data visualizations. Because it is designed to quickly connect to your Amazon RDS, Redshift and EMR data sources, you can be analyzing your data and building reports in under 10 minutes.


TIBCO Jaspersoft’s software empowers millions of people every day to make better decisions faster by bringing them timely, actionable data inside their apps and business processes. Thanks to a community hundreds of thousands strong, TIBCO Jaspersoft’s software has been downloaded millions of times and is used to create the intelligence inside hundreds of thousands of apps and business processes.

  • Full BI Server for Cents/Hour: no user or data limits and no additional fees. The suite includes ad hoc query and reporting, dashboards, data analysis, data visualization, and data virtualization.
  • 10 Minutes to Your AWS Data: purpose-built for AWS, the reporting and analytics server allows you to quickly and easily connect to Amazon RDS, Redshift, and EMR. In under 10 minutes you can be reporting on and analyzing your data.
  • BI for Your Business or App: built to modern web standards with an HTML5 UI and JavaScript and REST APIs, the flexible BI suite can be used to analyze your business or deliver stunning interactive reports and dashboards inside your app.


Amazon Elasticsearch Service: Real-time Data Analysis and Visualization


Amazon Elasticsearch Service (ES) Overview

Organizations are collecting an ever-increasing amount of data from numerous sources such as log systems, click streams, and connected devices. First released in 2010, Elasticsearch, an open-source analytics and search engine, has emerged as a popular tool for real-time analytics and visualization of data. Some of the most common use cases include risk assessment, error detection, and sentiment analysis. However, as data volumes and applications grow, managing open-source Elasticsearch clusters yourself can consume significant IT resources while adding little or no differentiated value to the organization. Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Amazon ES offers the benefits of a managed service, including cluster provisioning, easy configuration, replication for high availability, scaling options, data durability, security, and node monitoring.

Amazon ES integrates tightly with Logstash, and a Kibana instance is automatically configured for you. Used together, Elasticsearch, Logstash, and Kibana make up what is known as an “ELK Stack”.

Amazon ES Service Features Enabling Large-Scale Analytics

Logstash is an open source data pipeline that helps process logs and other event data and has built-in support for Kibana. Kibana is an open source analytics and visualization platform that helps you get a better understanding of your data in Amazon ES Service. You can set up your Amazon ES Service domain as the backend store for all logs coming through your Logstash implementation to easily ingest structured and unstructured data from a variety of sources. Elasticsearch lets you explore your data at a speed and scale that were previously impractical. It’s used for full-text search, structured search, analytics, and all three in combination.

Amazon ES Service gives you direct access to the open-source Elasticsearch APIs to load, query and analyze data and manage indices. There is integration for streaming data from Amazon S3, Amazon Kinesis Streams, and DynamoDB Streams. The integrations use a Lambda function as an event handler in the cloud that responds to new data by processing it and streaming the data to your Amazon ES Service domain.
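As a rough illustration of the Lambda-based integration described above, the sketch below converts a Kinesis-triggered Lambda event into the newline-delimited body that the Elasticsearch bulk API accepts. The index name "app-logs" and the function names are assumptions made for this example, and the step that signs and sends the request to your domain is deliberately left out. Older Elasticsearch versions may also expect a document type in the action line or the URL.

```python
import base64
import json


def build_bulk_body(event, index_name="app-logs"):
    """Turn a Kinesis-triggered Lambda event into an Elasticsearch bulk payload."""
    lines = []
    for record in event.get("Records", []):
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        document = json.loads(payload)
        # The bulk API expects an action line followed by the document line.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(document))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


def handler(event, context):
    """Lambda entry point; sending the body to the domain is omitted here."""
    body = build_bulk_body(event)
    # A real function would POST `body` to the domain's _bulk endpoint.
    return {"documents": len(event.get("Records", [])), "bulk_bytes": len(body)}
```

Keeping the payload construction in its own function makes it easy to test locally with a hand-built event before the function is wired to a stream.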

Click here to read how to get started with Amazon ES Service and Kibana on Amazon EMR.

Amazon ES Service Examples

  • Real-time application monitoring
  • Analyze activity logs
  • Analyze Amazon CloudWatch logs
  • Analyze product usage data coming from various services and systems
  • Analyze social media sentiments and CRM data, and find trends for brands and products
  • Analyze data stream updates from other AWS services, such as Amazon Kinesis Streams and DynamoDB
  • Monitor usage for mobile applications
  • e-Commerce filtering and navigation
  • Streaming data analytics
  • Social media sentiment analysis
  • Text search
  • Risk assessment
  • Error detection

AWS Case Study: MLB Advanced Media Using Amazon ES Service as Part of a New Data Collection and Analysis Tool

MLB Advanced Media (MLBAM) wanted a new way to capture and analyze every play using data-collection and analysis tools. It needed a platform that could quickly ingest data from ballparks across North America, provide enough compute power for real-time analytics, produce results in seconds, and then be shut down during the off-season. It turned to AWS to power its revolutionary Player Tracking System, which is transforming the sport by revealing new, richly detailed information about the nuances and athleticism of the game. That information is generating new levels of excitement among fans, broadcasters, and teams.


You can read the story here.

AWS Marketplace Solutions for Amazon ES, Logstash, and Kibana (ELK)

There are quite a few software solutions available from popular software vendors on AWS Marketplace that help implement Amazon Elasticsearch Service. You can browse them here. Some of them provide the entire ELK Stack.

ELK Stack (PV) built by Stratalux: ELK stack is the leading open-source centralized log management solution for companies that want the benefits of a centralized logging solution without the enterprise software price. ELK stack provides a centralized and searchable repository for all your infrastructure logs, providing unique and holistic insight into your infrastructure. The ELK stack built by Stratalux AMI has been configured with all the basic components that together make a complete working solution. Included in this AMI are the Logstash server, Kibana web interface, Elasticsearch storage, and Redis data structure server. Simply install and point your Logstash agents to this AMI, begin searching through your logs, and create custom dashboards. With over five years of experience, Stratalux is the leading cloud-based managed services company for ELK stack on AWS. The AMI also serves as a sandbox environment for trying out different functions. An image of this product is shown below:


Amazon Machine Learning: Highly Scalable Predictive Analytics


Amazon Machine Learning (ML) Overview
You see machine learning in action every day. Websites suggest products you’re likely to buy based on past purchases, your bank alerts you when it suspects a fraudulent transaction, and stores email you when items you typically buy are on sale.

With Amazon Machine Learning, anyone can create ML models via Amazon ML’s learning and visualization tools and wizards without having to learn complex machine learning algorithms and technology. Amazon ML can create models based on data stored in Amazon S3, Amazon Redshift, or Amazon RDS. There is no set-up cost and you pay as you go, so you can start small and scale as your application grows, and you don’t have to manage any infrastructure required for the large amount of data used in machine learning.

Amazon ML Features Enabling Large-Scale Analytics
Built-in wizards guide you through the steps of interactively exploring your data and training the ML model, which finds patterns in existing data and uses these patterns to make predictions from new data as it becomes available. You’re guided through the process of measuring the quality of your models, evaluating the accuracy of predictions, and fine-tuning the predictions to align with business goals. You don’t have to implement custom prediction generation code.
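One common way to align predictions with business goals for a yes/no model is to adjust the score cut-off above which a prediction counts as positive. The sketch below shows that trade-off in plain Python. It is not Amazon ML code, and the scores, labels, and function name are hypothetical values chosen for illustration.

```python
def evaluate_threshold(scores, labels, threshold):
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    predicted = [score >= threshold for score in scores]
    true_pos = sum(1 for p, y in zip(predicted, labels) if p and y)
    false_pos = sum(1 for p, y in zip(predicted, labels) if p and not y)
    false_neg = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical model scores and true outcomes (1 = fraud, 0 = legitimate).
    scores = [0.95, 0.80, 0.60, 0.40, 0.20]
    labels = [1, 1, 0, 1, 0]
    for cutoff in (0.3, 0.5, 0.9):
        precision, recall = evaluate_threshold(scores, labels, cutoff)
        print(f"cutoff={cutoff}: precision={precision:.2f} recall={recall:.2f}")
```

A higher cut-off produces fewer false alarms but misses more true cases, so the right value depends on which kind of error is more costly to the business.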


Amazon ML can generate billions of predictions daily, and serve those predictions in low-latency batches or in real-time at high throughput.

List of Amazon ML Analytics Examples
Amazon ML can perform document classification to help you process unstructured text and take actions based on content from forms, emails and product reviews, for example. You can process free-form feedback from your customers, including email messages, comments or phone conversation transcripts, and recommend actions based on their concerns. One example would be using Amazon ML to analyze social media traffic to discover customers who have a product support issue, and connect them with the right customer care specialists.

Other examples you can perform with Amazon ML include the following:
• Predict customer churn
• Fraud detection
• Content personalization
• Propensity Modeling for marketing campaigns
• Readmission Prediction through patient risk stratification
• Predict if a website comment is spam
• Forecast product demand
• Personalize content
• Predict user activity

AWS Marketplace Solutions for Amazon ML
There are many solutions from leading software vendors available on AWS Marketplace, some of which are highlighted below.

BigML PredictServer is a dedicated machine image that you can deploy in your own AWS account to create blazingly fast predictions from your BigML models and ensembles.
PredictServer is ideal for real-time scoring and/or for very large batch predictions (millions and upwards). The dedicated in-memory prediction server guarantees fast and consistent prediction rates, and the built-in dashboard makes it easy to track performance. Models and Ensembles are cached directly from and predictions can be created with API calls similar to that of the API and/or through BigML’s command line tool bigmler. You can deploy BigML PredictServer in a region closer to your application servers to reduce latency, or even in a VPC.



BigML also supports text analytics.


Zementis ADAPA Decision Engine is a predictive analytics decision engine based on the PMML (Predictive Model Markup Language) standard.


With ADAPA, deploy one or many predictive models from data mining tools like R, Python, KNIME, SAS, SPSS, SAP, FICO and many others. Score your data in real-time using Web-Services, or use ADAPA in batch mode for Big Data scoring directly from your local file system or an Amazon S3 bucket. As a central solution for today’s data-rich environments, ADAPA delivers precise insights into customer behavior and sensor information.

• Predictive Analytics Using Vendor-neutral Standards: ADAPA uses the Predictive Model Markup Language (PMML) industry standard to import and deploy predictive algorithms and machine learning models. ADAPA can understand any version of PMML and is compatible with most data mining tools, open source and commercial.
• Model Deployment Made Easy: ADAPA allows for one or many predictive models to be deployed at the same time. It executes many algorithms, from simple regression models to the most complex machine learning ensembles, e.g. Random Forest and boosted models.
• Scoring at the Speed of Business: ADAPA is able to instantly transform your scores into business decisions. The use of PMML-based rules allows for different score ranges to be paired with specific business decisions. Applications range from fraud detection and risk scoring to marketing campaign optimization and sensor data processing in the Internet of Things (IoT).


AWS Storage and Database Options for Use in Big Data Analytics with Use Cases


AWS Big Data Analytics Storage and Database Options Overview
AWS has a broad set of engines for storing data throughout a big data analytics lifecycle. Each has a unique combination of performance, durability, availability, scalability, elasticity and interfaces.

Most big data analytics infrastructures and application architectures employ multiple storage technologies in concert, each of which has been selected to satisfy the needs of a particular subclass of data storage, or for the storage of data at a particular point in its lifecycle. These combinations form a hierarchy of data storage tiers. The image below gives one example of using a combination of data storage and database usage during a big data analytics workflow:


Amazon S3 for Storage for Big Data Advanced Analytics
Please refer to the section above on the benefits of Amazon S3 for storage in big data analytics here.

Amazon DynamoDB Database for Big Data Advanced Analytics
Amazon DynamoDB is a fast, flexible and fully managed NoSQL database service for all applications – mobile, web, gaming, ad tech, IoT and more – that need a consistent, single-digit millisecond latency at any scale. It supports both document and key-value store models.

DynamoDB supports storing, querying, and updating documents. It supports three data types (number, string, and binary) in both scalar and multi-valued sets. Using the AWS SDK you can write applications that store JSON documents directly into Amazon DynamoDB tables. This capability reduces the amount of code you must write to insert, update, and retrieve JSON documents and to perform powerful database operations like nested JSON queries, which then take only a few lines of code.
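To give a feel for what storing a JSON document involves, the following sketch converts a JSON document into the typed attribute-value format that DynamoDB's low-level API uses for items. The AWS SDKs can perform a similar conversion for you, so this hand-written version is only an illustration; the function names and the sample document are assumptions made for this example.

```python
import json
from decimal import Decimal


def to_attribute_value(value):
    """Convert a parsed JSON value into DynamoDB's typed attribute-value format."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # must be checked before numbers: bool is an int
        return {"BOOL": value}
    if isinstance(value, (int, float, Decimal)):
        return {"N": str(value)}  # numbers are sent as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_attribute_value(item) for item in value]}
    if isinstance(value, dict):
        return {"M": {key: to_attribute_value(item) for key, item in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def json_to_item(document):
    """Build the top-level item map for a JSON document supplied as text."""
    return {key: to_attribute_value(val) for key, val in json.loads(document).items()}


if __name__ == "__main__":
    sample = '{"id": "u1", "score": 42, "tags": ["a"], "profile": {"active": true}}'
    print(json_to_item(sample))
```

Nested objects become maps and arrays become lists, which is what makes it possible to address individual nested attributes later without reading and rewriting the whole document.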


Other document formats Amazon DynamoDB supports are XML and HTML. Tables don’t have to have a fixed schema, so each data item can have a different number of attributes. The primary key can be either a single-attribute hash key or a composite hash-and-range key.

In addition to querying the primary key, you can query non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes. DynamoDB provides eventually consistent reads by default, optional strongly consistent reads, and implicit item-level transactions for item put, update, delete, conditional operations, and increment/decrement.

There is no limit to the amount of data you can store in an Amazon DynamoDB table. The service automatically allocates more storage as you store more data. DynamoDB Streams captures all data activity that happens on your table and lets you set up replication from one geographic region to another for greater availability.

DynamoDB integrates with AWS Lambda to provide triggers for alerts when things change in your DynamoDB tables. It also integrates with Amazon Elasticsearch using the Amazon DynamoDB Logstash plugin to search Amazon DynamoDB content for things like messages, locations, tags and keywords. It integrates with Amazon EMR, so Amazon EMR can analyze data sets stored in DynamoDB while keeping the original data set intact. It integrates with Amazon Redshift to perform complex data analysis queries, including joins with other tables in the Amazon Redshift cluster. DynamoDB integrates with the AWS Data Pipeline to automate data movement and transformation into and out of Amazon DynamoDB. It also integrates with Amazon S3 for analytics, AWS Import/Export, backup and archive.
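The Lambda trigger integration mentioned above can be sketched as a small handler that inspects the change records delivered from a table's stream. The alerting rule shown here (flagging deleted items) and the function names are assumptions for illustration, and a real trigger would likely publish a notification instead of printing.

```python
def summarize_changes(event):
    """Count stream records by event type and collect the keys of removed items."""
    counts = {"INSERT": 0, "MODIFY": 0, "REMOVE": 0}
    removed_keys = []
    for record in event.get("Records", []):
        event_name = record["eventName"]
        counts[event_name] = counts.get(event_name, 0) + 1
        if event_name == "REMOVE":
            # Keys arrive in DynamoDB's typed format, e.g. {"id": {"S": "42"}}.
            removed_keys.append(record["dynamodb"]["Keys"])
    return counts, removed_keys


def handler(event, context):
    """Lambda entry point: report deletions and return the per-type counts."""
    counts, removed_keys = summarize_changes(event)
    if removed_keys:
        print(f"ALERT: {len(removed_keys)} item(s) deleted: {removed_keys}")
    return counts
```

Because the handler only reads the event it is given, it can be exercised locally with a hand-written event before being attached to a table's stream.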

Common use cases for DynamoDB include:
• Gaming
• Mobile applications
• Digital Ad serving
• Live voting
• Sensor networks
• IoT
• Log ingestion
• Access control for e-Commerce shopping carts or other web-based content
• Web session management

Amazon DynamoDB can be the storage backend for Titan, enabling you to store Titan graphs of any size in fully managed DynamoDB tables. Titan stores and traverses both small and large graphs, up to hundreds of billions of vertices and edges, distributed across a multi-machine cluster.

Amazon Redshift Data Warehouse for Storage in Big Data Advanced Analytics
Please refer to the section above on the benefits of Amazon Redshift for a data warehousing solution in big data analytics here.

Amazon Aurora Database for Big Data Advanced Analytics
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora delivers up to five times the throughput of standard open source MySQL. This performance is achieved by tightly integrating the database engine with an SSD-backed virtualized storage layer that’s fault-tolerant and self-healing. Disk failures are repaired in the background without loss of database availability. Aurora automates most administrative tasks and enables point-in-time recovery of your instances. Amazon Aurora can help cut down your database costs by 90% or more while improving reliability and providing high availability. It tolerates failures and fixes them automatically. Backups to Amazon S3 are continuous and automatic, and Aurora keeps six copies of your data replicated across three Availability Zones.

Amazon Aurora is compatible with MySQL 5.6 so that any existing MySQL applications and tools can run on Aurora without modification. It’s managed by Amazon RDS, which takes care of complicated administration tasks like provisioning, patching, and monitoring.

Historical data analysis is the most common type of big data analytics implemented on Aurora. The main benefit of using Aurora over other relational databases is its scalability: it can handle terabytes of real-time data processing daily and scales to millions of transactions per minute. If you need more transactions per minute, you can add replicas, up to 15 of them. Storage automatically scales up as needed, up to 64 TB.


Common use cases for Amazon Aurora include:
• Data warehouse analytics
• Website responsiveness
• Content Management
• IoT data analysis
• Transaction processing
• Enterprise applications that use a relational database
• SaaS applications that need flexibility in instance and storage scaling
• Web and mobile applications

Amazon EC2 Instance Store Volumes for Big Data Advanced Analytics
Amazon EC2 provides flexible, cost-effective, and easy-to-use data storage for your instances. Each option has a unique combination of performance and durability. These options can be used independently or in combination to suit your requirements. The storage option that best fits running advanced analytics is called “Amazon EC2 Instance Store Volumes”.

Amazon EC2 Instance Store Volumes (also called ephemeral drives) provide temporary block-level storage for many EC2 instance types. The storage-optimized Amazon EC2 instance family provides special-purpose instance storage targeted to specific use cases. HI1 instances provide very fast solid-state drive (SSD)-backed instance storage capable of supporting over 120,000 random read IOPS, and are optimized for very high random I/O performance and low cost per IOPS.

Example applications well-suited to use HI1 storage-optimized EC2 Instance Store Volumes include data warehouses, Hadoop storage nodes, seismic analysis, cluster file systems, etc. Note, however, that the data on instance store volumes is lost if the Amazon EC2 instance is stopped (and later restarted), terminated, or fails.

AWS Marketplace Solutions to Augment AWS Storage and Database Services for Big Data Advanced Analytics


Overview of AWS Marketplace Solutions to Augment AWS Big Data Analytics Storage and Database Services
There’s an abundance of AWS Marketplace software solutions from top vendors that augment these AWS built-in solutions for storage and databases used in big data analytics.

AWS Marketplace Solutions to Augment Amazon S3
Many AWS Marketplace solutions from top software vendors augment the functionality of, or interact with, Amazon S3 to provide complete, end-to-end, out-of-the-box solutions. I’ll mention a couple of them below, but you can browse for yourself here.

Attunity CloudBeam for Amazon S3, EMR, and Hadoop was described in an earlier section. Click here to review that section again.

Matillion ETL for Redshift was also described in an earlier section. Matillion first loads data into Amazon S3 prior to Redshift ingestion. Click here to return to review that section again.

Informatica Cloud for Amazon S3 provides native, high-volume connectivity to S3. It is designed and optimized for data integration between cloud and on-premises data sources and S3 as an object store. It handles special characters within data sets, Unicode characters, escape characters, and multiple formats of delimited files. It also supports multi-part upload and download to and from S3. It allows you to develop and run data integration tasks (mappings), task flows, and unlimited scheduling, restricted to use only for S3. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of S3 storage. The solution includes:
• Cloud Designer: a cloud-based service that enables visual design, development, and deployment of data integration mappings (static data flows)
• A simple 6-step wizard to support the needs of citizen integrators
• Informatica Cloud Data Synchronization Service Secure Agent: a lightweight binary that runs in the AWS EC2 environment to access the Informatica cloud services located in Informatica’s hosted environment. The Secure Agent is installed on your AMI.
• One instance of Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon S3 as target
• One connector as source
• Runs on SUSE Linux Enterprise Server 11

AWS Marketplace Solutions to Augment Amazon DynamoDB
The AWS Marketplace has independent software vendors that augment Amazon DynamoDB, in addition to solutions that offer complete graphing solutions other than Titan graph. Below you’ll find some selected solutions:

Informatica Cloud for Amazon DynamoDB provides native, high-volume connectivity to DynamoDB. It is designed to catalog data from data sources such as SQL, NoSQL, and social media into a single DynamoDB store and take advantage of the high throughput and scale of DynamoDB. It saves cost by temporarily increasing write capacity or read capacity as needed. It allows you to develop and run data integration tasks (mappings), task flows, and unlimited scheduling, restricted to use only for DynamoDB. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of DynamoDB storage. The solution includes:
• Cloud Designer: a cloud-based service that enables visual design, development, and deployment of data integration mappings (static data flows)
• A simple 6-step wizard to support the needs of citizen integrators
• Informatica Cloud Data Synchronization Service Secure Agent: a lightweight binary that runs in the AWS EC2 environment to access the Informatica cloud services located in Informatica’s hosted environment. The Secure Agent is installed on your AMI.
• One instance of Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon DynamoDB as target
• One connector as source
• Runs on SUSE Linux Enterprise Server 11

As mentioned earlier, Amazon DynamoDB integrates with Titan graphs. Below I’ll mention one solution from a top software vendor that offers a graphing solution other than Titan, but you can browse them yourself here.

MicroStrategy Analytics Platform is a powerful Mobile and Business Intelligence solution that enables leading organizations to quickly analyze vast amounts of data and distribute actionable business insight throughout the enterprise. MicroStrategy enables users to conduct ad hoc analysis and share their insights anywhere, anytime with reports, documents, and dashboards delivered via Web or mobile devices. Anyone can create dashboards with stunning visualizations, explore dynamic reports to investigate performance, graph data instantly, drill into areas of concern, and export information into any format.


Users benefit from powerful, sophisticated statistical analysis that yields new critical business insights. Uniquely positioned at the nexus of analytics, security, and mobility, MicroStrategy delivers superior analytics and mobile applications secured with advanced authentication, enhanced user administration, and user authentication tracking. The software is built for AWS and is certified with numerous AWS services such as Amazon Redshift and Amazon RDS, and MicroStrategy is an Advanced Technology Partner.


AWS Marketplace Solutions to Augment Amazon Redshift
The AWS Marketplace has independent software vendors that augment or work in tandem with Amazon Redshift. Some will be highlighted below, but you can browse the many solutions available in the AWS Marketplace that work with Amazon Redshift here.

Matillion ETL for Redshift was also described in an earlier section. Click here to return to review that section again.

Attunity CloudBeam for Amazon Redshift enables organizations to simplify, automate, and accelerate data loading and near real-time incremental changes from on-premises sources (Oracle, Microsoft SQL Server, and MySQL) to Amazon Redshift.


Attunity CloudBeam allows your team to avoid the heavy lifting of manually extracting data, transferring via API/script, chopping, staging, and importing. A Click-to-Load solution, Attunity CloudBeam is easy to set up and allows organizations to start validating or realizing the benefits of Amazon Redshift in just minutes.
• Zero-footprint technology: reduces impact on IT operations with log-based capture and delivery of transaction data that does not require the Attunity software to be installed on each source and target database.
• Performance: accelerated, secured, and guaranteed delivery of data.

AWS Marketplace Solutions to Augment Amazon Aurora
The AWS Marketplace has services from top software vendors that augment or work in tandem with Amazon Aurora. You can view them all here, but below you’ll find two Informatica solutions.

Informatica Cloud for Amazon Aurora (Windows) provides native, high-volume connectivity to Aurora and is optimized for Oracle to Aurora migration. It allows you to develop and run data integration tasks (mappings), data synchronization and replication tasks, task flows, and unlimited scheduling, restricted to use only for Aurora. It supports single inserts and batched statements, as well as more advanced capabilities such as creating tables on the fly, custom queries, look-ups, joiners, filters, expressions, and sorters. Informatica Cloud integration is a visual, metadata-driven solution, enabling self-documenting code, extensive reuse in development, and automation in deployment. The solution is limited to 1 TB of Aurora storage. The solution includes:
• Cloud Designer: a cloud-based service that enables visual design, development, and deployment of data integration mappings (static data flows)
• A simple 6-step wizard to support the needs of citizen integrators
• Informatica Cloud Data Synchronization Service Secure Agent: a lightweight binary that runs in the AWS EC2 environment to access the Informatica cloud services located in Informatica’s hosted environment. The Secure Agent is installed on your AMI.
• One instance of Informatica Cloud service in Informatica’s hosted environment
• One connector for Amazon Aurora (MySQL) as target
• One connector as source
• Runs on Windows Server 2012 R2

Informatica Cloud for Amazon Aurora (Linux) has the same features as the Windows version above, including one connector for Amazon Aurora (MySQL) as target and one connector as source, but it runs on SUSE Linux Enterprise Server 11.

AWS Marketplace Solutions to Augment Amazon EMR
The AWS Marketplace has independent software vendors to augment your big data analytics solutions with Amazon EMR.

Syncsort DMX-h, Amazon EMR Edition is designed for Hadoop and is now deployed on Amazon EMR. Syncsort DMX-h helps organizations propel their Big Data initiatives, getting productive and delivering results with Hadoop and Amazon EMR in almost no time.


Syncsort DMX-h is the only Hadoop ETL application available for EMR. Syncsort DMX-h delivers:
1) Blazingly fast, easy to use Hadoop ETL in the Cloud.
2) A graphical user interface for developing & maintaining MapReduce ETL jobs.
3) A library of Use Case Accelerators to fast-track development.
4) Unbounded scalability at a disruptively low price.

With Syncsort Ironcluster (10 Nodes) you can test, pilot and perform Proof of Concepts for free on up to ten Hadoop EMR nodes. 30 days of free phone and email support are also available.


MapR Enterprise Edition Plus Spark includes 24/7 support for the MapR Enterprise Edition plus the Apache Spark stack. IMPORTANT: Use the MapR Standard Cluster with VPC Support delivery method to launch your cluster. This edition provides a standards-based, enterprise-class distributed file system, complete with high availability and disaster recovery features.


Also included is a broad range of technologies like data processing with Spark, machine learning with MLlib, SQL with Spark SQL, graph processing with GraphX, and YARN for resource management.


With the browser-based management console, MapR Control System, you can monitor and manage your Hadoop cluster easily and efficiently.

Other Notable Marketplace Solutions to Augment AWS Built-In Storage and Databases for Big Data Analytics
The AWS Marketplace has services from top software vendors that augment or work in tandem with AWS Big Data Storage and Database Services. Some notable choices are listed below.

Looker Analytics Platform for AWS allows anyone in your business to quickly analyze and find insights in your Redshift and RDS datasets. By connecting directly to your AWS instance, Looker opens up access to high-resolution data for detailed exploration and collaborative discovery, building the foundation for a truly data-driven organization.


To help you get started quickly, the Looker for AWS license includes implementation services from Looker’s team of expert analysts. Throughout your entire subscription, you’ll receive unlimited support from a live analyst through in-app chat. Purpose-built to live in the cloud and to leverage the next generation of analytic databases, like Amazon Redshift, Looker takes an entirely new approach to business intelligence. Unlike traditional BI tools, Looker doesn’t move and store your data; instead, it optimizes data discovery within the database itself. Using Looker’s data modeling language, called LookML, data analysts create rich experiences so that end users can self-serve their own data discovery. Key to LookML is its reusability: measures and dimensions are created in only one place and then consistently (and automatically) reused in all relevant views of that same data concept, creating a single source of truth across your organization. Key features include:
o Powerful data discovery, including contextual filtering, pivoting, sequencing, and cohort tiering, so your entire organization can ask questions, share views, and collaborate, all from within the browser, on any device
o A live connection to the database using the LookML data modeling language and a browser-based agile development IDE, so data analysts can use any Redshift feature, such as sortkeys, distkeys, and HyperLogLog, for advanced performance and insights
o A wide set of visualizations, including scatter, table, bar, and line charts, a streamlined approach to dashboarding, and the ability to embed visualizations in any web application

Mapping by MapLarge provides a 5-user license to the MapLarge Mapping Engine, a high-performance geospatial visualization platform that dynamically renders data for interactive analysis and collaboration. It scales to millions of records and beyond, offers an intuitive user interface, and provides robust APIs that allow complete customization.


The Teradata Database Developer (Single Node) with SSD local storage is the same full-feature data warehouse software that powers analytics at many of the world’s greatest companies. Teradata Database Developer includes Teradata Columnar and rights to use Teradata Parallel Transporter (TPT), including the Load, Update, and Export operators; Teradata Studio; and Teradata Tools and Utilities (TTU). These tools are included with the Teradata Database AMI or available as a free download. In addition to the Teradata Database, your subscription includes rights to use the following products, which are listed in the AWS Marketplace: Teradata Data Stream Utility; Teradata REST Services; Teradata Server Management; and Teradata Viewpoint (Single Teradata System).

With Teradata Database, customers get quick time to value and low total cost of ownership. Applications are portable across cloud and on-premises platforms, and no retraining is required. Teradata 15.10 is the newest release of the Teradata Database, bringing industry-leading features for enhanced data fabric support, fast JSON performance, and what Teradata calls the world’s most advanced hybrid row/column database. The enhanced hybrid row/column table storage also accelerates query processing: it enables high performance for selective queries on tables with many columns while also allowing pinpoint access to single rows by operational queries.

HPE Vertica Analytics Platform is enterprise-class analytics that fits your budget. Until now, enterprise-class Big Data analytics in the cloud was just not available. Current cloud analytics offerings lack critical enterprise features such as fine-tuning capabilities, integrated BI/reporting and data ingestion. With HPE Vertica Analytics Platform for Amazon Web Services, you can tap into all of these core enterprise capabilities and more. It offers you the flexibility to start small and grow as your business grows, and you get analytics functionality that no other cloud analytics provider can offer. The HPE Vertica Analytics Platform runs on-premises on industry-standard hardware as well as in the cloud, so you can get started immediately with your analytics initiative using the deployment model that makes sense for your business, without any compromises or limits. Key features include:
o Optimized data ingestion for high performance
o Fast query optimization for quick insight, with comprehensive SQL and extensions for true openness
o Enhanced data storage for cost efficiency
o Ease of administration for true reliability

Zoomdata for AWS is the Fastest Visual Analytics for Big Data and includes smart connectors for Redshift, S3, Kinesis, Apache Spark, Cloudera, Hortonworks, MapR, Elastic, real time, SQL and NoSQL sources.


Sign up for a free trial today, and you’ll be visualizing billions of rows of data in seconds! Free support is available for users who register. Using patented Data Sharpening and micro-query technologies, Zoomdata visualizes Big Data in seconds, even across billions of rows of data. Zoomdata is designed for the business user, not just data analysts, via an intuitive user interface that can be used for interactive dashboards or embedded in a custom application.


Built for Big Data: By taking the query to the data, Zoomdata leverages the power of modern databases to visualize billions of data points in seconds. Includes Redshift, S3, Cloudera, Solr, and Hortonworks connectors.


TIBCO Clarity is the data cleaning and standardization component of the TIBCO Software System. It serves as a single solution for business users to handle massive, messy data across various applications and systems, such as TIBCO Jaspersoft, Spotfire, Marketo and Salesforce. The quality of data impacts your decision-making, so data coming from external sources such as SaaS applications or partners needs to be validated before it is used in your systems. TIBCO Clarity makes it easy for business users to profile, standardize, and transform data so that trends can be identified and smart decisions can be made quickly.

TIBCO Clarity provides an easy-to-use Web environment, and since it’s a cloud-based subscription service, it only requires an investment relative to the usage of the service. Key features include:
o De-duplication: TIBCO Clarity discovers duplicate records in a dataset by using configurable fuzzy match algorithms.
o Seamless Integration: You can collect your raw data from disparate sources in a variety of data formats, such as files, databases and spreadsheets, both in the cloud and on-premises.
o Data Discovery and Profiling: TIBCO Clarity detects data patterns and data types for auto-metadata generation, enabling profiling of row and column data for completeness, uniqueness, and variation.

AWS Services for Data Collection


AWS Data Collection Overview
Before you can do any big data analytics using Amazon services, you have to get the data loaded to an AWS storage location. This is a crucial step, and it can inhibit a company’s first move into a cloud-based environment. The move can seem complex and can appear to demand a lot of time and resources, and concerns about how to recode and convert your data to another format can seem daunting. However, AWS has many different services to help you move your data onto AWS, whether you are loading from numerous external sources, integrating with your on-premises infrastructure, or migrating data from an existing data center.

AWS Database Migration Service (DMS)
With just a few clicks, you can start a migration with the AWS Database Migration Service while your original database stays live. AWS DMS handles all the complexity. You can even replicate back to your original database, or replicate to other databases in different regions or Availability Zones. Heterogeneous migration is handled by the AWS Schema Conversion Tool, which takes care of migration assessment and code conversion for you: the source database schema and code are converted into a format compatible with the target database, and any code that can’t be converted is marked for manual conversion. Costs start at $3.00 per TB.

AWS Import/Export Snowball
AWS Import/Export Snowball is an easy, secure and affordable solution for even the biggest (petabyte-scale) data transfer jobs, using a highly secure hardware appliance. You don’t need to purchase any hardware: with just a few clicks in the AWS Management Console you create a job and a Snowball appliance is shipped to you, or up to 50 of them if you need them. When it arrives, you attach the appliance to your network, download and run the Snowball client to establish a connection, then use the client to select the file directories you want to transfer. Snowball encrypts and transfers the files at an extremely high speed. When the transfer is complete, you ship the appliance back using the free shipping label supplied. Snowball uses multiple layers of security to protect your data, including tamper-resistant enclosures, 256-bit encryption, and an industry-standard Trusted Platform Module (TPM) designed to ensure both security and full chain-of-custody of your data. Snowball unloads your data into Amazon S3, and from there you can access any AWS service that you need.

Amazon S3 Transfer Acceleration
Amazon S3 Transfer Acceleration enables fast, easy and secure transfers of files over long distances between your client and an S3 bucket, and it can be used when your upload speeds to Amazon S3 are sub-optimal. It takes advantage of Amazon CloudFront’s globally distributed edge locations: as data arrives at an edge location, it is routed to Amazon S3 over an optimized network path.

AWS Direct Connect
AWS Direct Connect makes it easy to establish a dedicated, private network connection between AWS and your premises, whether that is an office, a data center or a co-location environment. This increases bandwidth and throughput while reducing network costs, and it provides a more consistent network experience than Internet-based connections. The dedicated connection can be partitioned into multiple virtual interfaces, so you can use the same connection to access public resources and private resources while maintaining separation between the environments.

AWS Storage Gateway
The AWS Storage Gateway is a service that connects an on-premises software appliance with cloud-based storage to provide seamless and secure integration between an organization’s on-premises IT environment and AWS’s storage infrastructure. The service allows you to securely store data in the AWS cloud for scalable and cost-effective storage for data backups without buying more storage or managing the infrastructure, and you pay only for what you use. It supports industry-standard storage protocols that work with your existing applications. It provides low-latency performance by maintaining frequently accessed data on-premises while securely storing all your data encrypted in Amazon S3 or Amazon Glacier. The AWS Storage Gateway appliance is software that sits in your data center between your applications and your storage infrastructure. Without making any changes to your applications, it backs up your data with SSL encryption, and you can pull your old data back when needed.

Amazon Kinesis Streams
Amazon Kinesis Streams captures large amounts of data (TB/hr) in real time from data producers and streams it into custom applications for data processing and analysis. Streaming data is replicated by Kinesis across three Availability Zones to ensure reliability.

Amazon Kinesis Streams is capable of scaling from a single MB up to TB/hr of streaming data, but in contrast to Firehose, you have to manually provision the capacity. Amazon provides a “shard calculator” (below image) when creating a Kinesis Stream to correctly provision the appropriate number of shards for your stream to handle the volume of data you’re going to process. Once created, it’s possible to scale up or down the number of shards to meet demand.
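To make the shard arithmetic concrete, here is a minimal Python sketch of the kind of estimate a shard calculator performs. The per-shard limits (about 1 MB/s and 1,000 records/s for writes, about 2 MB/s for reads) are the commonly documented figures and should be treated as assumptions to confirm against current AWS documentation. The function name and parameters are hypothetical, not part of any AWS tool.

```python
import math

# Assumed per-shard limits; confirm against current AWS documentation.
SHARD_WRITE_KB_PER_SEC = 1000       # about 1 MB/s of incoming data per shard
SHARD_WRITE_RECORDS_PER_SEC = 1000  # incoming records per second per shard
SHARD_READ_KB_PER_SEC = 2000        # about 2 MB/s of outgoing data per shard


def estimate_shards(avg_record_kb, records_per_sec, consumer_count):
    """Estimate how many shards a stream needs (hypothetical helper)."""
    incoming_kb = avg_record_kb * records_per_sec
    # Every consuming application reads the full stream.
    outgoing_kb = incoming_kb * consumer_count
    by_write_bandwidth = math.ceil(incoming_kb / SHARD_WRITE_KB_PER_SEC)
    by_record_rate = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_read_bandwidth = math.ceil(outgoing_kb / SHARD_READ_KB_PER_SEC)
    # The tightest of the three limits decides the shard count.
    return max(1, by_write_bandwidth, by_record_rate, by_read_bandwidth)


if __name__ == "__main__":
    # 2 KB records at 5,000 records/s, read by 3 consuming applications.
    print(estimate_shards(avg_record_kb=2, records_per_sec=5000, consumer_count=3))  # 15
```

In this example the read side is the bottleneck: three consumers each reading 10 MB/s need more shards than the write side alone would.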


AWS Marketplace Solutions to Assist and Augment Data Collection
The AWS Marketplace has services from top software vendors that augment or work in tandem with AWS Services for data ingestion. Many have capabilities beyond data ingestion, including the ability to perform ETL/ELT, data cleansing and much more.

Matillion ETL for Redshift was described in an earlier section.

Informatica CloudBeam for Amazon S3, EMR, Hadoop was also described in an earlier section.

CloudBerry Backup Desktop Edition provides simple and fast backup to the Amazon S3 cloud. CloudBerry Backup is a secure online backup solution that helps organizations store backup copies of their data in online storage. It is a powerful backup and restore program designed to leverage Amazon S3 technology to make your disaster recovery plan simple, reliable, and affordable. You can keep your backups in a remote location and access them anywhere you have an internet connection, and strong data encryption protects your data from unauthorized access.

CloudBasic RDS Deploy for DevOps DLM/Jenkins (SQL Server) enables you to move RDS databases to and from development and staging environments without access to the RDS file system. Integrate RDS Deploy into your DevOps tools, such as Jenkins and GO, to further automate DLM and achieve true one-click deployments.

Sample DevOps scenario involving Jenkins and RDS: the job deploys three SQL Server databases from RDS to a standard SQL Server, merges data from two of them into the third, and sends the third database back to RDS as the new production system. The job also flushes any open sessions to the website, takes it offline and puts it back online when everything is finished. All of this is done in a PowerShell script executed from Jenkins, so it can be performed by employees with minimal security access (access to that job only), and all history is recorded.

Key features:
o No file system access to the RDS instance is required. Traditional tools require access to the SQL Server file system and cannot be used with AWS RDS.
o Easy integration into DevOps tools such as Jenkins and GO
o A REST API allows RDS database deployments to be initiated from PowerShell and other tools

AWS Services for Data Orchestration and Analytic Workflows


AWS Services for Data Orchestration and Analytic Workflow Overview
Advanced big data analytics solutions very often require automated arrangement, coordination, and management of data once it’s on the AWS cloud, including moving data from one AWS service to another for subsequent processing or storage. Automating workflows can ensure that necessary activities take place when required to drive the analytic processes.

Amazon Simple Workflow Service (SWF)
Amazon SWF allows you to build distributed applications in any programming language with components that are accessible from anywhere. It reduces infrastructure and administration overhead because you don’t need to run orchestration infrastructure. SWF provides durable, distributed-state management that enables resilient, truly distributed applications. Think of SWF as a fully-managed state tracker and task coordinator in the cloud.

Amazon SWF’s key concepts include the following:
o Workflows are collections of actions
o Domains are collections of related workflows
o Actions are tasks or workflow steps
o Activity workers implement actions
o Deciders implement a workflow’s execution logic

In addition, Amazon SWF:
o Maintains distributed application state
o Tracks workflow executions and logs their progress
o Controls which tasks each of your application hosts will be assigned to execute
o Supports the execution of Lambda functions as “workers”
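The relationship between deciders, activity workers and tracked state can be illustrated with a small, self-contained Python simulation. This is a toy model and does not use the Amazon SWF API: the names `decide`, `run_workflow`, `ACTIVITY_WORKERS` and the three sample actions are all hypothetical, and the in-memory `history` list merely stands in for the execution history that the service would keep for you.

```python
# Activity workers: plain functions that implement individual actions.
def extract(data):
    return [line.strip() for line in data]


def transform(data):
    return [line.upper() for line in data]


def load(data):
    return {"rows_loaded": len(data)}


ACTIVITY_WORKERS = {"extract": extract, "transform": transform, "load": load}
WORKFLOW_STEPS = ["extract", "transform", "load"]  # the workflow's actions


def decide(history):
    """Decider: inspect the execution history and pick the next action."""
    completed = [event["action"] for event in history]
    for step in WORKFLOW_STEPS:
        if step not in completed:
            return step
    return None  # nothing left to schedule, so the workflow is complete


def run_workflow(initial_input):
    """Stand-in for the service: keeps state and logs every completed step."""
    history = []
    current = initial_input
    while True:
        action = decide(history)
        if action is None:
            return current, history
        current = ACTIVITY_WORKERS[action](current)
        history.append({"action": action, "result": current})


if __name__ == "__main__":
    result, history = run_workflow([" alpha ", "beta "])
    print(result)
    print([event["action"] for event in history])
```

The point of the split is that the decider holds only the execution logic and the workers hold only the task logic, so either side can run on any host while the state lives elsewhere.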

AWS Data Pipeline
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.

AWS Data Pipeline handles:
o Your jobs’ scheduling, execution, and retry logic
o Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until all of its dependencies are met
o Sending any necessary failure notifications
o Creating and managing any temporary compute resources your jobs may require
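The dependency tracking and retry behavior listed above can be sketched in plain Python. This is an illustration of the idea only, not how AWS Data Pipeline is implemented or configured: `run_pipeline`, the activity names and the `max_attempts` parameter are hypothetical, and the sketch relies on the standard library’s `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(activities, dependencies, max_attempts=3):
    """Run each activity only after everything it depends on has succeeded.

    activities:   {name: callable taking no arguments}
    dependencies: {name: set of names that must finish first}
    Returns a log of (activity, attempt, outcome) tuples.
    """
    order = TopologicalSorter(dependencies).static_order()
    log = []
    for name in order:
        for attempt in range(1, max_attempts + 1):
            try:
                activities[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception as error:
                log.append((name, attempt, f"failed: {error}"))
        else:
            # Every attempt failed: a real pipeline would send a notification
            # here. Downstream activities are not run.
            log.append((name, max_attempts, "gave up"))
            return log
    return log


if __name__ == "__main__":
    activities = {
        "copy_s3_logs": lambda: None,
        "emr_analysis": lambda: None,
        "load_results": lambda: None,
    }
    dependencies = {
        "emr_analysis": {"copy_s3_logs"},
        "load_results": {"emr_analysis"},
    }
    for entry in run_pipeline(activities, dependencies):
        print(entry)
```

Ordering the work topologically is what guarantees that no step runs before its inputs exist, and stopping at the first exhausted retry keeps a failed step from feeding bad data downstream.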


Amazon Kinesis Firehose
Amazon Kinesis Firehose is AWS’s data-ingestion product offering for Kinesis. It’s used to capture and load streaming data into other Amazon services such as Amazon S3 or Amazon Redshift. From there, you can load the streams into data processing and analysis tools like Amazon Elastic MapReduce (EMR) or Amazon Elasticsearch Service. It’s also possible to load the same data into Amazon S3 and Amazon Redshift at the same time using Firehose.

Firehose can scale to gigabytes of streaming data per second, and allows for batching, encryption and compression of data. It automatically scales to meet demand. You can load data into Firehose in several ways, including HTTPS, the Kinesis Producer Library, the Kinesis Client Library and the Kinesis Agent. Monitoring is available through Amazon CloudWatch.
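As a rough illustration of what batching and compressing a stream involves, the Python sketch below buffers incoming records until a size threshold is reached and then emits one gzip-compressed batch. It is not Firehose’s implementation and does not call any AWS API: the `BatchBuffer` class, its `max_batch_bytes` parameter and the in-memory `delivered` list (standing in for a destination such as an S3 bucket) are hypothetical, and encryption is left out for brevity.

```python
import gzip


class BatchBuffer:
    """Collects records and emits one gzip-compressed blob per batch."""

    def __init__(self, max_batch_bytes):
        self.max_batch_bytes = max_batch_bytes
        self._records = []
        self._size = 0
        self.delivered = []  # stands in for the destination, e.g. an S3 bucket

    def put_record(self, record: bytes):
        """Buffer one record and deliver the batch once it is large enough."""
        self._records.append(record)
        self._size += len(record)
        if self._size >= self.max_batch_bytes:
            self.flush()

    def flush(self):
        """Compress and deliver whatever is currently buffered."""
        if not self._records:
            return
        payload = b"\n".join(self._records)
        self.delivered.append(gzip.compress(payload))
        self._records = []
        self._size = 0


if __name__ == "__main__":
    buffer = BatchBuffer(max_batch_bytes=10)
    for record in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
        buffer.put_record(record)
    buffer.flush()  # deliver the final partial batch
    print(len(buffer.delivered), "batches delivered")
```

Batching trades a little latency for far fewer, larger writes to the destination, which is why a size threshold (and, in a fuller version, a time limit) is the usual control.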

Amazon CloudFront
Amazon CloudFront is a global Content Delivery Network (CDN) service that gives you the ability to distribute your application globally in minutes. In Amazon CloudFront, your content is organized into distributions. A distribution specifies the location or locations of the original versions of your files, which you store on one or more origin servers. An origin server is the location of the definitive version of an object. Origin servers can be other Amazon Web Services resources (an Amazon S3 bucket, an Amazon EC2 instance, or an Elastic Load Balancer) or your own origin server. You create a distribution to register your origin servers with Amazon CloudFront through a simple API call or the AWS Management Console. When configuring more than one origin server, you use URL pattern matches to specify which origin has what content, and you can assign one of the origins as the default origin.

You then use your distribution’s domain name in your web pages, media player, or application. When end users request an object using this domain name, they are automatically routed to the nearest edge location for high-performance delivery of your content. Edge locations are sites in most of the major cities around the world that CloudFront uses to serve content close to end users and reduce latency.
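The idea of routing requests to origins by URL pattern, with a default origin as the fallback, can be sketched in a few lines of Python. This is only an approximation for illustration: the function `pick_origin`, the pattern list and the origin names are hypothetical, and the standard library’s `fnmatchcase` wildcard rules are assumed to be close enough to CloudFront’s own pattern syntax for the purpose of the example.

```python
from fnmatch import fnmatchcase


def pick_origin(request_path, behaviors, default_origin):
    """Return the origin for the first pattern that matches the path.

    behaviors: ordered list of (url_pattern, origin_name) pairs.
    """
    for pattern, origin in behaviors:
        if fnmatchcase(request_path, pattern):
            return origin
    return default_origin  # no pattern matched


if __name__ == "__main__":
    behaviors = [
        ("/images/*", "s3-bucket-origin"),
        ("/api/*", "ec2-origin"),
    ]
    for path in ["/images/logo.png", "/api/v1/users", "/index.html"]:
        print(path, "->", pick_origin(path, behaviors, "default-origin"))
```

Because the first matching pattern wins, more specific patterns should be listed ahead of broader ones.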

AWS Data Processing Types


AWS Data Processing Types Overview
Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Big data can be processed and analyzed in two different ways on AWS: batch processing or stream processing.

When deciding which type of processing you need, consider the latency of batch vs. stream processing. Batch processing has latencies of minutes to hours, while stream processing has latencies on the order of seconds or milliseconds.

AWS Batch Processing
Batch processing is normally done on AWS using Amazon S3 for storage before and after processing. Amazon EMR is then used to run managed analytic clusters on top of this data with Hadoop ecosystem tools like Spark, Presto, Hive, and Pig. Batch processing is often used to normalize the data and then compute arbitrary queries over the varying sets of data. It computes results derived from all the data it encompasses and enables deep analysis of large data sets. Once you have the results, you can shut down your Amazon EMR cluster or keep it running for further processing or querying.

You can find an architectural drawing of AWS Batch Processing here.

AWS Stream Processing
Streaming data is generated continuously by thousands of data sources, which typically send data simultaneously and in small sizes. This event data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and it is used for a wide variety of analytics. Information from such analysis incrementally updates metrics, reports, and summary statistics, which gives companies visibility into many aspects of business and consumer activity as the data “streams into AWS” and allows businesses to respond promptly to emerging situations. Amazon Kinesis Streams or Amazon Kinesis Firehose is used to capture and load data into a data store.
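Record-by-record processing over a sliding time window can be sketched as follows. This is a generic illustration rather than code for any Kinesis library: the `SlidingWindowAverage` class and its parameters are hypothetical, and the sketch assumes records arrive in timestamp order.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the records seen in the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Process one record and return the updated windowed average."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict records that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


if __name__ == "__main__":
    window = SlidingWindowAverage(window_seconds=10)
    for timestamp, value in [(0, 10), (5, 20), (12, 30), (30, 40)]:
        print(timestamp, window.add(timestamp, value))
```

Each record updates the metric incrementally, so the cost per record stays small no matter how long the stream runs.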

With AWS stream processing, analytic processing and decision making on in-motion and transient data are done with minimal latency. In-motion data can also be filtered and diverted to a data warehouse such as Amazon Redshift, where existing business intelligence tools are used to analyze the data for deeper background analysis and/or data augmentation.

Producers of streaming data include machine data, sensor-based monitoring devices, messaging systems, IoT and financial market feeds.

You can find an architectural drawing of AWS Time Series Processing, which is a type of Stream Processing, here.

In this day and age of needing analytics, with more data, more questions to answer, and tougher competition, your success depends on making the right decisions by relying on fast, secure, scalable, durable cloud data analytics, and AWS is the clear leader in this realm by leaps & bounds!

**Caveat: This document was created ~9 months ago (Today is 5/23/2017), so there might be more up-to-date information & there are certainly more AWS Data Analytics Services that should be included here. However, the information herein was complete & accurate when created. Thank you!

#gottaluvAWS! #gottaluvAWSMarketplace!
