AWS vs Azure vs Google Cloud Platform - Analytics & Big Data
Choosing the right cloud platform provider can be a daunting task. Take the big three, AWS, Azure, and Google Cloud Platform: each offers a huge number of products and services, but understanding how they meet your specific needs is not easy. Since most organisations plan to migrate existing applications, it is important to understand how these systems will operate in the cloud. Through our work helping customers move to the cloud, we have compared all three providers' offerings in relation to three typical migration scenarios:
- Lift and shift - the cloud service can support running legacy systems with minimal change
- Consume PaaS services - the cloud offering is a managed service that can be consumed by existing solutions with minimal architectural change
- Re-architect for cloud - the cloud technology is typically used in solution architectures that have been optimised for cloud
Choosing the right strategy will depend on the nature of the applications being migrated, the business landscape and internal constraints.
In this series, we're comparing cloud services from AWS, Azure and Google Cloud Platform. A full breakdown and comparison of cloud providers and their services are available in this handy poster.
We have grouped all services into 10 categories:
- Storage and Content Delivery
- Analytics & Big Data
- Internet of Things
- Mobile Services
- Security & Identity
- Management & Monitoring
In this post we are looking at...
Analytics & Big Data
Cloud is transforming the way organisations are thinking about their data. Advanced analytics is now driving business decision making and opening up new business opportunities.
AWS
Cloud platforms need a cost-effective way to process the vast amounts of data they store. At the centre of Amazon's analytics offerings is Amazon Elastic MapReduce (EMR), a managed Hadoop, Spark and Presto solution. EMR takes care of setting up an underlying EC2 cluster and provides integration with a number of AWS services including S3 and DynamoDB. EMR pricing is based on an hourly cost for each node plus the price of the underlying EC2 instance. To reduce costs, it is possible to take advantage of Amazon's EC2 reserved instance pricing to run jobs against a known schedule. Clusters can be created and deleted on demand to process specific jobs or kept running for extended periods of time. Clusters typically take around 15 minutes to provision before job execution begins.
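The EMR pricing model described above comes down to simple arithmetic. The following is a minimal sketch, assuming hypothetical hourly rates; the names `ClusterSpec` and `estimate_cluster_cost` and all figures are placeholders for illustration, not actual AWS prices.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    nodes: int                 # number of nodes in the cluster
    hours: float               # how long the cluster runs
    emr_rate_per_node: float   # hypothetical EMR charge per node-hour (USD)
    ec2_rate_per_node: float   # hypothetical EC2 instance charge per node-hour (USD)


def estimate_cluster_cost(spec: ClusterSpec) -> float:
    """Total cost = nodes x hours x (EMR rate + EC2 rate)."""
    hourly_per_node = spec.emr_rate_per_node + spec.ec2_rate_per_node
    return spec.nodes * spec.hours * hourly_per_node


if __name__ == "__main__":
    # Placeholder rates for illustration only; check current AWS pricing.
    nightly_job = ClusterSpec(nodes=10, hours=2,
                              emr_rate_per_node=0.05, ec2_rate_per_node=0.20)
    print(f"Estimated cost per run: ${estimate_cluster_cost(nightly_job):.2f}")
```

A calculation like this makes it easier to compare a cluster created on demand for each job with one kept running for extended periods.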
Data Pipeline is a data orchestration product that moves, copies, transforms and enriches data. Data Pipeline manages the scheduling, orchestration and monitoring of the pipeline activities as well as any logic required to handle failure scenarios. Data Pipeline can read and write data from most AWS storage services and supports a range of data processing activities including EMR, Hive and Pig, and can execute Unix/Linux shell commands.
For high frequency real-time analytics, AWS has Kinesis Streams. Data producers can push data in real time to a Kinesis stream, where it is processed by consuming applications using the Kinesis Client Library and Connector Library. Alternatively, it is possible to connect Kinesis to an Apache Storm cluster.
Kinesis Firehose can be used for large scale data ingestion. Data is pushed to a Kinesis Firehose delivery stream, where it is automatically routed to S3, Redshift or the Elasticsearch Service. The service supports client-side compression and server-side encryption.
Predictive analytics is possible through Amazon Machine Learning. AWS makes it very easy to create predictive models without the need to learn complex algorithms. To create a model, users are guided through the process of selecting data, preparing data, and training and evaluating models through a simple wizard-based UI. It is also possible to create models via the AWS SDK. Once trained, the model can be used to create predictions via an online API (request/response) or a batch API for processing multiple input records.
To make sense of data through dashboards and data visualisations, AWS offers QuickSight (currently in preview). Dashboards can be built from data stored across most AWS data storage services, and QuickSight also supports a number of third party solutions. The number of third party data connectors is currently limited but is likely to grow over time.
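As an illustration of pushing data to a stream, here is a minimal sketch using the `put_record` call of the boto3 Kinesis client. The helper names, the stream name and the event shape are assumptions made for this example, and running `send_event` requires boto3 plus valid AWS credentials.

```python
import json


def build_put_record_args(stream_name: str, event: dict, partition_key: str) -> dict:
    """Build keyword arguments for the Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),  # payload is sent as bytes
        "PartitionKey": partition_key,              # determines which shard receives the record
    }


def send_event(stream_name: str, event: dict, partition_key: str) -> dict:
    """Push one event to a Kinesis stream (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without the SDK

    client = boto3.client("kinesis")
    return client.put_record(**build_put_record_args(stream_name, event, partition_key))


if __name__ == "__main__":
    # "clickstream" is a hypothetical stream name used only for illustration.
    args = build_put_record_args("clickstream", {"user": "u1", "action": "view"}, "u1")
    print(args["StreamName"], args["PartitionKey"], len(args["Data"]))
```

Using a stable value such as a user or device id as the partition key keeps related records on the same shard, which preserves their relative order for consuming applications.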
Azure
Azure Synapse Analytics is a complete data analytics platform service. It brings together Azure Data Warehouse, Azure Data Lake, Azure Data Factory, Spark and Power BI under one unified development experience. It allows you to pick and choose the languages and frameworks that best suit your skills and needs. If you prefer SQL, you have the choice of using the pre-provisioned massively parallel processing (MPP) architecture (formerly SQL Data Warehouse), a new serverless option for querying data in a Data Lake Store, or Spark SQL. A managed Apache Spark environment with a rich interactive notebook experience is also available, and comes with support for C# through .NET for Apache Spark. Pricing is based on the services you use. For example, provisioned SQL is measured in its own performance units called Data Warehouse Units (DWU), while the serverless offering is priced on the amount of data processed per query. Spark is charged at standard VM prices for the VMs that make up the Spark cluster.
If you prefer to stick with open source, there is HDInsight, which comes with Hadoop, Spark, Storm or HBase. The platform has a standard and a premium tier, the latter including the option of running Microsoft Machine Learning Server, Microsoft's enterprise solution for building and running R models at scale. Azure has a simple pricing model that is based on the number and type of nodes running. The node type governs the number of cores, RAM and disk space available on each node. HDInsight comes with a local HDFS and can also connect to Blob storage or Data Lake Store. Data stored on the local HDFS is lost when the cluster is shut down. Clusters can be automatically created and deleted on a schedule using PowerShell and Automation. Alternatively, on-demand HDInsight clusters can be created for specific jobs invoked through Data Factory.
Azure Data Factory still exists as its own standalone service, used to build data processing pipelines. Data Factory can read data from a range of Azure and third party data sources and, through Data Management Gateway, can connect to and consume on-premise data. Data Factory comes with a range of activities that can run compute tasks in HDInsight, Azure Machine Learning, stored procedures, Data Lake and custom code running on Batch.
For processing real-time data, Azure has Stream Analytics. Stream Analytics can process data from Blob storage or data streamed through Event Hubs and IoT Hub. A SQL-like language is used to perform time-series-based queries, and queries can call into Azure Machine Learning to score data in real time.
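Time-series queries of this kind typically aggregate events over fixed time windows. The following is a plain-Python sketch of a tumbling-window count, intended only to illustrate the idea; it is not Stream Analytics query syntax, and the event shape (timestamp in seconds, device id) and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, device) using fixed, non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, device_id in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, device_id)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical readings: (seconds since start, device id).
    readings = [(0.5, "a"), (9.9, "a"), (10.0, "a"), (12.0, "b")]
    for key, count in sorted(tumbling_window_counts(readings, 10).items()):
        print(key, count)
```

In a streaming service the same grouping is applied continuously as events arrive, rather than over a finished list as in this sketch.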
Azure Machine Learning is a fully managed data science platform that is used to build and deploy powerful predictive and statistical models. Azure Machine Learning comes with a flexible UI canvas and a set of predefined modules that can be used to build and run data science experiments. The platform comes with a series of predefined machine learning models and includes the ability to run custom R or Python code. R code can be packaged as a custom module that can then be reused between experiments if required. Trained models can be published as web services for consumption either as a real-time request/response API or for batch execution. Azure Machine Learning also comes with interactive Jupyter notebooks for recording and documenting lab notes.
Microsoft DeployR is another option when it comes to developing and operationalising high performance R models. It provides a set of web services that can be used to integrate R models into applications and is useful when control over the R execution environment and underlying compute nodes is required.
For dashboards and visualisations Azure has Power BI. Power BI can consume data from a range of Azure and third party services, as well as being able to connect to on-premise data sources. Users can choose from a set of built-in visualisations, create their own or select from a custom visuals gallery. Power BI also allows users to run R scripts and embed R generated visuals. Recently Microsoft added Power BI Embedded, a new offering that allows Power BI driven content to be embedded within custom applications.
Azure Data Catalog is a registry of data assets within an organisation. Information is captured about each source including data structures and connection information. Technical and business users can then use Data Catalog to discover datasets and their intent.
Cognitive Services is a suite of ready-made intelligence APIs that make it easy to integrate advanced speech, vision and natural language capabilities into business solutions.
Google Cloud Platform
Cloud Dataproc is Google's fully managed Hadoop and Spark offering. Google boasts an impressive 90-second lead time to start or scale Cloud Dataproc clusters, by far the quickest of the three providers. Pricing is based on the underlying Compute Engine costs plus an additional charge per vCPU per minute. An HDFS-compliant connector is available for Cloud Storage, which can be used to store data that needs to survive after the cluster has been shut down. There is no built-in support for on-demand clusters; however, full control over the cluster is available through the gcloud CLI, REST API or SDK, so this can be automated if required.
Data processing pipelines can be built using Cloud Dataflow. Google has taken a different approach from AWS and Azure, both of which have gone with a declarative model that delegates processing work to other services such as Hadoop. Cloud Dataflow, on the other hand, provides a fully programmable framework, available for Java and Python, and a distributed compute platform. The programming model and SDK were recently submitted to the Apache Software Foundation and have become Apache Beam, which can use Cloud Dataflow as well as Spark for pipeline execution. Cloud Dataflow supports both batch and streaming workers. By default, the number of workers is pre-defined when the service is created, although batch workers have the option to auto-scale based on demand. The latest pricing is based on the aggregate CPU, memory and storage consumed, and varies according to whether batch or streaming workers are used.
Google offers Machine Learning as a fully managed platform for training and hosting TensorFlow models. It relies on Cloud Dataflow for data and feature processing and Cloud Storage for data storage. There is also Cloud Datalab, a lab notebook environment based on Jupyter. A set of pre-trained models is also available: Vision API detects features in images, such as text, faces or company logos; Speech API converts audio to text across a range of languages; Natural Language API can be used to extract meaning from text; and there is an API for translation. In addition, Google Cloud Prediction API sits somewhere in the middle, allowing users to easily train categorical or regression models depending on the nature of the training set. This simply requires users to upload a training dataset and specify an answer column to predict, and Prediction API will do the rest.
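To give a flavour of the programmable model, here is a minimal word-count sketch using the Apache Beam Python SDK. The function names and sample input are assumptions for illustration; with no pipeline options supplied, Beam runs the pipeline locally rather than on Cloud Dataflow.

```python
import re
from typing import List


def split_words(line: str) -> List[str]:
    """Split a line of text into lowercase words."""
    return re.findall(r"[a-z']+", line.lower())


def run_word_count(lines: List[str]) -> None:
    """Build and run a small batch pipeline with Beam's default local runner."""
    import apache_beam as beam  # requires `pip install apache-beam`

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.Create(lines)                    # in-memory source
            | "Split" >> beam.FlatMap(split_words)            # one element per word
            | "Count" >> beam.combiners.Count.PerElement()    # (word, count) pairs
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    # Sample input chosen for illustration only.
    run_word_count(["Hello, hello world", "Hello Dataflow"])
```

The pipeline is ordinary code, so transforms such as `split_words` can be unit tested on their own, and the same pipeline definition can be pointed at a different runner through pipeline options.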
An obvious omission from Google's stack is any form of business facing dashboards and visualisations. For this Google relies on a range of third party partners.
Cloud analytics is clearly a competitive space. It is becoming a critical component of modern business and a core capability that is driving cloud adoption. All three providers offer similar building blocks: data processing, data orchestration, streaming analytics, machine learning and visualisations.
AWS certainly has all the bases covered with a solid set of products that will meet most needs. Minor omissions include pre-trained machine learning models and managed lab notebooks but otherwise AWS scores highly across the board.
Google provides its own twist on cloud analytics with its range of services. With Dataproc and Dataflow, Google has a strong core to its proposition. TensorFlow has been getting a lot of attention recently and there will be many who will be keen to see Machine Learning come out of preview. Google has a rich set of pre-trained APIs but lacks BI dashboards and visualisations.
With the introduction of Azure Synapse Analytics, Azure has taken a significant leap ahead of the other contenders. We think Azure Synapse Analytics is a class above the rest for flexibility, choice and ease of integration. You can find out more from our blog post explaining the 5 reasons we think you should be looking at Synapse.
Next up we will be looking at Internet of Things.