Big Data Cluster Formations

Deploy managed Spark and Hadoop clusters across clouds.

Overview

Big Data Cluster formations provision fully managed Spark and Hadoop clusters through cloud-native services. Each cluster includes a proxy machine for browser access, a master node, and configurable worker nodes.

Clusters are available in On-Demand and Spot modes; worker nodes can run on Spot instances to reduce cost.
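To illustrate why Spot workers can lower cost, here is a minimal sketch of the hourly arithmetic. The function name and every price are hypothetical placeholders, not actual provider rates. It assumes the master node stays On-Demand while only the workers switch to Spot, and it ignores the proxy machine.

```python
def hourly_cluster_cost(master_price, worker_price, worker_count):
    """Return the hourly cost of one master node plus `worker_count` workers."""
    if worker_count < 0:
        raise ValueError("worker_count must be non-negative")
    return master_price + worker_price * worker_count


# Placeholder prices in USD per hour (illustrative only, not real quotes).
ON_DEMAND_PRICE = 0.20
SPOT_PRICE = 0.06  # assumed Spot rate for the same instance type

# All nodes On-Demand versus On-Demand master with Spot workers.
on_demand_total = hourly_cluster_cost(ON_DEMAND_PRICE, ON_DEMAND_PRICE, worker_count=4)
spot_total = hourly_cluster_cost(ON_DEMAND_PRICE, SPOT_PRICE, worker_count=4)

print(f"On-Demand workers: {on_demand_total:.2f} USD/h")
print(f"Spot workers:      {spot_total:.2f} USD/h")
```

Because the worker price is multiplied by the worker count, the saving grows with the size of the worker pool. Real Spot prices vary by provider, region, and time.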

Supported Services

Provider       Service
AWS            EMR (Elastic MapReduce)
GCP            Dataproc
Azure          HDInsight
Alibaba Cloud  E-MapReduce

Architecture

Browser ──► Proxy Machine ──► Master Node ──► Worker Nodes
            (RosettaHub       (Cluster        (Configurable
             container)        controller)     pool size)

Component      Description
Proxy Machine  RosettaHub container providing browser access to the cluster
Master Node    Cluster controller running the resource manager
Worker Nodes   Configurable pool of compute nodes for data processing

Configuration

Setting               Description
Cloud Key             Credentials for the target cloud account
Cluster Service       EMR, Dataproc, HDInsight, or E-MapReduce
Master Instance Type  Compute resources for the master node
Worker Instance Type  Compute resources for worker nodes
Worker Count          Number of worker nodes
Applications          Pre-installed tools (Spark, Hive, Presto, Jupyter, Zeppelin, etc.)
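The settings above can be pictured as a small record with a few sanity checks. This is a sketch only: `ClusterFormationConfig`, its field names, and the validation rules (for example, requiring at least one worker) are assumptions made for illustration and are not part of any RosettaHub API. The credential name and instance types are placeholders.

```python
from dataclasses import dataclass, field

# Service names taken from the Cluster Service setting above.
SUPPORTED_SERVICES = {"EMR", "Dataproc", "HDInsight", "E-MapReduce"}


@dataclass
class ClusterFormationConfig:
    """Hypothetical container mirroring the settings in the table above."""

    cloud_key: str
    cluster_service: str
    master_instance_type: str
    worker_instance_type: str
    worker_count: int
    applications: list = field(default_factory=lambda: ["Spark"])

    def validate(self):
        """Return a list of problems; an empty list means no issue was found."""
        problems = []
        if self.cluster_service not in SUPPORTED_SERVICES:
            problems.append(f"unsupported cluster service: {self.cluster_service}")
        if self.worker_count < 1:  # assumed minimum, not a documented limit
            problems.append("worker count must be at least 1")
        if not self.applications:
            problems.append("at least one application should be selected")
        return problems


config = ClusterFormationConfig(
    cloud_key="my-aws-key",            # placeholder credential name
    cluster_service="EMR",
    master_instance_type="m5.xlarge",  # example AWS instance type
    worker_instance_type="m5.xlarge",
    worker_count=3,
    applications=["Spark", "Hive", "Jupyter"],
)
problems = config.validate()
print(problems or "configuration looks consistent")
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.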

Use Cases

Use Case            Description
Data engineering    ETL pipelines and data transformation at scale
Analytics           Interactive SQL queries on large datasets
Machine learning    Distributed model training with Spark MLlib
Research computing  Large-scale scientific data processing