Deploy managed Spark and Hadoop clusters across clouds.
Overview
Big Data Cluster formations provision fully managed Spark and Hadoop clusters through cloud-native services. Each cluster includes a proxy machine for browser access, a master node, and configurable worker nodes.
Clusters can run in On-Demand or Spot mode; worker nodes can use Spot capacity for cost savings.
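To get a feel for what Spot workers might save, the arithmetic can be sketched as below. This is an illustration only: the `ClusterCost` class and every hourly rate are hypothetical placeholders, not provider prices. It also assumes the master node stays On-Demand and ignores the proxy machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterCost:
    """Hourly cost estimate for one cluster; all rates are hypothetical."""
    master_rate: float        # On-Demand $/hour for the master node
    worker_on_demand: float   # On-Demand $/hour per worker
    worker_spot: float        # Spot $/hour per worker
    worker_count: int

    def hourly(self, spot_workers: bool) -> float:
        # Assumption: only the worker pool switches to Spot; the master stays On-Demand.
        worker_rate = self.worker_spot if spot_workers else self.worker_on_demand
        return self.master_rate + worker_rate * self.worker_count


# Placeholder rates chosen for illustration, not taken from any price list.
cost = ClusterCost(master_rate=0.40, worker_on_demand=0.40, worker_spot=0.12, worker_count=10)
on_demand = cost.hourly(spot_workers=False)
spot = cost.hourly(spot_workers=True)
savings = 1 - spot / on_demand
print(f"On-Demand: ${on_demand:.2f}/h, Spot workers: ${spot:.2f}/h, savings: {savings:.0%}")
```

Because the master is billed at the On-Demand rate in both cases, the relative saving grows with the worker count. Actual Spot prices vary by provider, region, and time.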
Supported Services
| Provider | Service |
|---|---|
| AWS | EMR (Elastic MapReduce) |
| GCP | Dataproc |
| Azure | HDInsight |
| Alibaba Cloud | E-MapReduce |
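If a script needs to pick the service from a provider name, the table above can be expressed as a simple lookup. A minimal sketch: `CLUSTER_SERVICES` and `service_for` are hypothetical names introduced here, not part of any RosettaHub API.

```python
# Provider-to-service pairs copied from the table above.
CLUSTER_SERVICES = {
    "AWS": "EMR",
    "GCP": "Dataproc",
    "Azure": "HDInsight",
    "Alibaba Cloud": "E-MapReduce",
}


def service_for(provider: str) -> str:
    """Return the managed cluster service for a provider name (case-insensitive)."""
    lookup = {name.lower(): service for name, service in CLUSTER_SERVICES.items()}
    key = provider.strip().lower()
    if key not in lookup:
        supported = ", ".join(CLUSTER_SERVICES)
        raise ValueError(f"Unsupported provider {provider!r}; expected one of: {supported}")
    return lookup[key]


print(service_for("gcp"))
```

Raising on an unknown provider, rather than returning a default, keeps a typo from silently selecting the wrong cloud.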
Architecture
Browser ──► Proxy Machine ──► Master Node ──► Worker Nodes
            (RosettaHub       (Cluster        (Configurable
            container)        controller)     pool size)
| Component | Description |
|---|---|
| Proxy Machine | RosettaHub container providing browser access to the cluster |
| Master Node | Cluster controller running the resource manager |
| Worker Nodes | Configurable pool of compute nodes for data processing |
Configuration
| Setting | Description |
|---|---|
| Cloud Key | Credentials for the target cloud account |
| Cluster Service | EMR, Dataproc, HDInsight, or E-MapReduce |
| Master Instance Type | Compute resources for the master node |
| Worker Instance Type | Compute resources for worker nodes |
| Worker Count | Number of worker nodes |
| Applications | Pre-installed tools (Spark, Hive, Presto, Jupyter, Zeppelin, etc.) |
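The settings above can be collected into a small structure and checked before a cluster is requested. The sketch below is an assumption-laden illustration: `ClusterConfig`, its field names, the key name `my-aws-key`, and the rule that at least one worker is required are all hypothetical and do not reflect the actual RosettaHub interface.

```python
from dataclasses import dataclass, field

# Service names taken from the Cluster Service row of the table above.
SUPPORTED_SERVICES = {"EMR", "Dataproc", "HDInsight", "E-MapReduce"}


@dataclass
class ClusterConfig:
    """Hypothetical container for the settings listed in the table above."""
    cloud_key: str                 # name of the stored credentials, not the secret itself
    cluster_service: str
    master_instance_type: str
    worker_instance_type: str
    worker_count: int
    applications: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means no problem was found."""
        problems = []
        if not self.cloud_key:
            problems.append("cloud_key is required")
        if self.cluster_service not in SUPPORTED_SERVICES:
            problems.append(f"unknown cluster_service: {self.cluster_service}")
        if self.worker_count < 1:  # assumed minimum, not a documented limit
            problems.append("worker_count must be at least 1")
        return problems


config = ClusterConfig(
    cloud_key="my-aws-key",            # hypothetical key name
    cluster_service="EMR",
    master_instance_type="m5.xlarge",
    worker_instance_type="m5.2xlarge",
    worker_count=4,
    applications=["Spark", "Hive", "Jupyter"],
)
print(config.validate())
```

Returning every problem at once, instead of raising on the first, lets a caller report all misconfigured settings in a single pass.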
Use Cases
| Use Case | Description |
|---|---|
| Data engineering | ETL pipelines and data transformation at scale |
| Analytics | Interactive SQL queries on large datasets |
| Machine learning | Distributed model training with Spark MLlib |
| Research computing | Large-scale scientific data processing |