RosettaHub Documentation

Big Data Cluster Formations

Deploy managed Spark and Hadoop clusters across clouds.

Overview

Big Data Cluster formations provision fully managed Spark and Hadoop clusters through cloud-native services. Each cluster includes a proxy machine for browser access, a master node, and configurable worker nodes.

Clusters are available in On-Demand and Spot modes. In Spot mode, worker nodes can use Spot capacity for cost savings.
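
To illustrate why Spot workers reduce cost, the sketch below estimates the hourly cost of a cluster whose workers run either On-Demand or Spot. It assumes the master node stays On-Demand and leaves the proxy machine out for simplicity. The function name, its parameters, and every price are hypothetical placeholders, not RosettaHub or cloud provider figures.

```python
def hourly_cluster_cost(master_rate, worker_rate, worker_count, spot_discount=0.0):
    """Estimate hourly cluster cost in the same currency unit as the rates.

    spot_discount is the assumed fractional saving of Spot over On-Demand
    for worker nodes (0.0 means the workers run On-Demand).
    """
    if worker_count < 0:
        raise ValueError("worker_count must be non-negative")
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    worker_cost = worker_count * worker_rate * (1.0 - spot_discount)
    return master_rate + worker_cost


# Placeholder prices, chosen only to keep the arithmetic easy to follow.
on_demand = hourly_cluster_cost(master_rate=0.5, worker_rate=0.25, worker_count=4)
spot = hourly_cluster_cost(master_rate=0.5, worker_rate=0.25, worker_count=4,
                           spot_discount=0.5)
print(on_demand, spot)  # 1.5 1.0
```

Actual Spot savings vary by provider, region, and instance type, so treat the discount as an input to measure rather than a constant.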

Supported Services

Provider	Service
AWS	EMR (Elastic MapReduce)
GCP	Dataproc
Azure	HDInsight
Alibaba Cloud	E-MapReduce
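
The table above can be captured as a small lookup, for example when a script needs to pick the cluster service from a provider name. This is a minimal sketch: `CLUSTER_SERVICES` and `service_for` are hypothetical names for illustration, not part of any RosettaHub API.

```python
# Provider-to-service mapping, copied from the table above.
CLUSTER_SERVICES = {
    "AWS": "EMR (Elastic MapReduce)",
    "GCP": "Dataproc",
    "Azure": "HDInsight",
    "Alibaba Cloud": "E-MapReduce",
}


def service_for(provider):
    """Return the managed cluster service for a provider (case-insensitive)."""
    key = provider.strip().lower()
    for name, service in CLUSTER_SERVICES.items():
        if name.lower() == key:
            return service
    raise KeyError(f"Unsupported provider: {provider!r}")


print(service_for("gcp"))  # Dataproc
```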

Architecture

Browser ──► Proxy Machine ──► Master Node ──► Worker Nodes
            (RosettaHub       (Cluster        (Configurable
             container)        controller)     pool size)

Component	Description
Proxy Machine	RosettaHub container providing browser access to the cluster
Master Node	Cluster controller running the resource manager
Worker Nodes	Configurable pool of compute nodes for data processing

Configuration

Setting	Description
Cloud Key	Credentials for the target cloud account
Cluster Service	EMR, Dataproc, HDInsight, or E-MapReduce
Master Instance Type	Compute resources for the master node
Worker Instance Type	Compute resources for worker nodes
Worker Count	Number of worker nodes
Applications	Pre-installed tools (Spark, Hive, Presto, Jupyter, Zeppelin, etc.)
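
As a sketch of how these settings fit together, the example below models them as a Python dataclass with a basic sanity check. The class, field names, the key name `my-aws-key`, and the rule that at least one worker is required are assumptions made for illustration. RosettaHub's actual configuration format and validation rules may differ.

```python
from dataclasses import dataclass, field

# Service names as listed in the Cluster Service setting.
SUPPORTED_SERVICES = {"EMR", "Dataproc", "HDInsight", "E-MapReduce"}


@dataclass
class ClusterConfig:
    """Hypothetical container for the settings in the table above."""
    cloud_key: str
    cluster_service: str
    master_instance_type: str
    worker_instance_type: str
    worker_count: int
    applications: list = field(default_factory=list)

    def validate(self):
        """Return a list of problems; an empty list means no issues found."""
        problems = []
        if not self.cloud_key:
            problems.append("Cloud Key is required")
        if self.cluster_service not in SUPPORTED_SERVICES:
            problems.append(f"Unsupported Cluster Service: {self.cluster_service}")
        if not self.master_instance_type:
            problems.append("Master Instance Type is required")
        if not self.worker_instance_type:
            problems.append("Worker Instance Type is required")
        if self.worker_count < 1:  # assumed minimum, not a documented limit
            problems.append("Worker Count must be at least 1")
        return problems


config = ClusterConfig(
    cloud_key="my-aws-key",  # placeholder credential name
    cluster_service="EMR",
    master_instance_type="m5.xlarge",
    worker_instance_type="m5.xlarge",
    worker_count=3,
    applications=["Spark", "Hive", "Jupyter"],
)
print(config.validate())  # []
```

Returning a list of problems instead of raising on the first one lets a caller report every misconfigured setting at once.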

Use Cases

Use Case	Description
Data engineering	ETL pipelines and data transformation at scale
Analytics	Interactive SQL queries on large datasets
Machine learning	Distributed model training with Spark MLlib
Research computing	Large-scale scientific data processing

Formations Overview -- All formation types and lifecycle
Sessions -- Manage cluster sessions
Machine Formations -- Single machine deployments