Create a Custom Big Data Lab¶
Overview¶
In this tutorial, you'll create a custom Big Data Lab on the RosettaHub Supercloud platform. Big Data Labs provision managed Hadoop/Spark clusters using services such as AWS EMR or GCP Dataproc. Unlike single-machine formations, Big Data Labs deploy a cluster consisting of a master node and one or more worker nodes, giving you distributed computing power for large-scale data processing.
You'll learn how to clone a Big Data formation, configure cluster size and applications, launch the cluster, connect to it, run workloads, snapshot your customizations, and share the formation with others.
Prerequisites¶
- [ ] RosettaHub account with active subscription
- [ ] An AWS or GCP cloud account connected (see Cloud Keys)
- [ ] Access to a Big Data Lab public formation
- [ ] Basic familiarity with Hadoop/Spark concepts
Steps¶
Step 1: Open the Container Apps Perspective¶
From the RosettaHub dashboard, select the Container Apps perspective. Navigate to the Big Data Labs section, which lists available Big Data formations.
You'll see pre-configured Big Data formations for common cluster configurations with Hadoop, Spark, and related ecosystem tools.
Step 2: Clone a Big Data Formation¶
Select a Big Data formation that matches your workload requirements.
- Right-click the formation
- Select Clone
- Name the new formation my-spark-cluster
Tip
Cloning creates a private copy that you can configure independently. The original formation remains unchanged.
Step 3: Configure the Cluster¶
Right-click my-spark-cluster and select Configure to customize your cluster settings.
Key configuration options include:
| Setting | Description |
|---|---|
| Master Instance Type | Compute resources for the master node (coordinates the cluster) |
| Worker Instance Type | Compute resources for each worker node (executes distributed tasks) |
| Number of Workers | How many worker nodes to provision |
| Applications | Software stack to install (Spark, Hadoop, Hive, Presto, Zeppelin, etc.) |
| Cloud Key | Cloud provider credentials for provisioning |
| Region | Where to deploy the cluster |
Note
Big Data Labs provision a multi-node cluster, not a single machine. The master node manages job scheduling and coordination, while worker nodes perform the actual data processing. More workers means more parallel processing capacity.
Warning
Every node in the cluster incurs its own compute cost, so charges scale with the number of workers. Size your cluster to match your workload needs.
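To see how cost scales with worker count, here is a minimal sketch. The hourly rates are hypothetical placeholders, not RosettaHub or cloud-provider pricing:

```python
def cluster_cost_per_hour(master_rate, worker_rate, num_workers):
    """Estimate hourly cluster cost: one master plus N workers.

    Rates are illustrative placeholders; check your provider's
    pricing for the instance types you actually select.
    """
    return master_rate + worker_rate * num_workers

# Hypothetical rates: $0.30/h for the master, $0.25/h per worker
small = cluster_cost_per_hour(0.30, 0.25, 4)
large = cluster_cost_per_hour(0.30, 0.25, 16)
print(small, large)
```

Quadrupling the workers roughly quadruples the worker portion of the bill, which is why the shutdown step at the end of this tutorial matters.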
Step 4: Configure Applications¶
In the configuration panel, review and customize the applications installed on your cluster:
- Apache Spark - In-memory distributed computing
- Apache Hadoop - Distributed storage and MapReduce
- Apache Hive - SQL-on-Hadoop data warehouse
- Apache Zeppelin - Interactive notebook for Spark
- JupyterHub - Jupyter notebooks with PySpark integration
- Presto - Distributed SQL query engine
Select the applications you need and click Save.
Step 5: Launch the Cluster¶
Click on my-spark-cluster to launch it. Click Yes in the confirmation dialog.
Your session appears under the Sessions panel. Cluster provisioning takes longer than single-machine deployments because multiple nodes must be started and configured.
Wait for the status indicator to show a green tick. This typically takes 5-10 minutes depending on cluster size and region.
Note
During provisioning, the cloud provider is creating the master node, worker nodes, configuring networking, and installing the selected applications across all nodes.
Step 6: Connect to Your Cluster¶
Click the running session to view connectivity options. Big Data Labs offer several ways to connect:
Web Interfaces:
- Jupyter/Zeppelin Notebook - Click the session link to open a notebook interface in your browser for interactive Spark/Hadoop development
- Spark UI - Monitor running and completed Spark jobs
- YARN Resource Manager - View cluster resource utilization
SSH Access:
- Download the PEM/PPK key file from the connectivity panel
- SSH into the master node for command-line access
Tip
Use the notebook interfaces for interactive data exploration and development. Use SSH for system administration, log inspection, and running batch jobs.
Step 7: Run a Workload¶
With your cluster running, you can submit Spark or Hadoop jobs. For example, in a Jupyter notebook with PySpark:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-analysis").getOrCreate()

# Read data from S3
df = spark.read.csv("s3://my-bucket/data.csv", header=True, inferSchema=True)

# Perform analysis
df.groupBy("category").count().show()
```
You can also submit batch jobs over SSH on the master node using spark-submit.
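A minimal spark-submit invocation might look like this (the script name and resource options are illustrative assumptions, not values from your formation):

```shell
# Submit a PySpark script to YARN from the master node
# (my_analysis.py is a hypothetical script you have copied to the master)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_analysis.py
```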
You can also customize Spark and Hadoop configuration files on the master node to tune performance for your specific workloads.
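For instance, executor defaults can be set in spark-defaults.conf on the master node. The values below are illustrative starting points, and the exact path varies by distribution (on EMR it is typically /etc/spark/conf/spark-defaults.conf):

```
spark.executor.memory        4g
spark.executor.cores         4
spark.sql.shuffle.partitions 200
```

Changes like these are exactly the kind of customization worth preserving with a machine image in the next step.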
Step 8: Create a Machine Image¶
After customizing the cluster environment (installing additional libraries, tuning configuration), snapshot your session.
- Right-click your running session in the Sessions panel
- Select Create Machine Image
- Keep Update Originator Formation On Success checked
RosettaHub will:
- Snapshot the cluster configuration into a new machine image
- Register the new image under the Images panel (see Images Guide)
- Automatically update the my-spark-cluster formation to use the new image
Note
The snapshot captures the master node's environment. Custom packages, configuration changes, and installed tools are preserved for future launches.
Step 9: Share Your Formation¶
Share your customized Big Data formation with others:
- Right-click the my-spark-cluster formation
- Select Share
- Choose to share with a specific user, your organization, or a group
Step 10: Shut Down the Cluster¶
When your workload is complete, shut down the cluster to stop incurring costs.
- Right-click your running session
- Select Shutdown
Warning
A running cluster with multiple nodes incurs costs for every node in the cluster. Always shut down the cluster when you are not actively running workloads. After shutdown, only image storage costs remain.
Key Concepts¶
| Concept | Description |
|---|---|
| Big Data Lab | A formation that provisions a managed Hadoop/Spark cluster |
| Master Node | Coordinates the cluster, runs job schedulers (YARN), hosts web interfaces |
| Worker Nodes | Execute distributed tasks, store HDFS data |
| EMR / Dataproc | Managed cluster services from AWS and GCP respectively |
- Big Data formations deploy clusters, not single machines
- The number of worker nodes determines your cluster's parallel processing capacity
- Applications (Spark, Hive, Zeppelin) are installed automatically during provisioning
- Cluster configuration can be customized via SSH on the master node
Next Steps¶
- Create a Custom Jupyter Lab - Single-machine Jupyter for smaller workloads
- Create a Custom RStudio Lab - R development environment
- Object Storages Guide - Store input/output data for your cluster
- Formations User Guide - Complete formations documentation
- Sessions Guide - Managing running sessions with real-time cost tracking
- Cloud Operations - Governance, budgets, and policy enforcement
Troubleshooting¶
Cluster takes a long time to provision
Big Data clusters typically take 5-10 minutes to provision. This is normal because:
- Multiple nodes must be started
- Applications are installed and configured across the cluster
- Networking between nodes must be established
If provisioning exceeds 15 minutes, check your cloud account quotas and region availability.
Cannot connect to the notebook interface
Ensure that:
- The session shows a green tick (all nodes fully provisioned)
- Your browser allows pop-ups from the RosettaHub domain
- Your network does not block the required ports
Try refreshing the session connectivity panel and clicking the link again.
Spark job fails with resource errors
Your cluster may not have enough resources. Consider:
- Increasing the number of worker nodes in the formation configuration
- Selecting a larger instance type for workers
- Optimizing your Spark job's memory and executor settings
- Checking YARN Resource Manager for available cluster capacity
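When tuning memory and executor settings, a common rule of thumb is to reserve one core and some memory on each node for the OS and YARN daemons, and cap executors at about five cores each. A minimal sizing sketch, where the reserved-overhead values are assumptions rather than EMR/Dataproc defaults:

```python
def size_executors(cores_per_node, mem_gb_per_node, num_workers,
                   cores_per_executor=5, reserved_cores=1, reserved_mem_gb=1):
    """Rule-of-thumb Spark executor sizing.

    Reserves one core and 1 GB per node (assumed overheads) for the
    OS and YARN daemons, then packs executors of up to
    cores_per_executor cores onto the remaining capacity.
    Returns (total_executors, memory_gb_per_executor).
    """
    usable_cores = cores_per_node - reserved_cores
    usable_mem = mem_gb_per_node - reserved_mem_gb
    executors_per_node = max(1, usable_cores // cores_per_executor)
    total_executors = executors_per_node * num_workers
    mem_per_executor = usable_mem // executors_per_node
    return total_executors, mem_per_executor

# Example: 4 workers, each with 16 cores and 64 GB RAM
executors, mem_gb = size_executors(16, 64, 4)
print(executors, mem_gb)
```

The results map onto spark-submit's --num-executors and --executor-memory options, or the equivalent spark.executor.* properties.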
Installed libraries are missing after relaunch
Libraries installed via SSH or notebook must be preserved through a machine image snapshot. Use Create Machine Image (Step 8) to save your customizations before shutting down.