Create a Custom Big Data Lab

Intermediate · 20 minutes

Overview

In this tutorial, you'll create a custom Big Data Lab on the RosettaHub Supercloud platform. Big Data Labs provision managed Hadoop/Spark clusters using services such as AWS EMR or GCP Dataproc. Unlike single-machine formations, Big Data Labs deploy a cluster consisting of a master node and one or more worker nodes, giving you distributed computing power for large-scale data processing.

You'll learn how to clone a Big Data formation, configure cluster size and applications, launch the cluster, connect to it, run workloads, snapshot your customizations, and share the formation with others.

Prerequisites

  • [ ] RosettaHub account with active subscription
  • [ ] An AWS or GCP cloud account connected (see Cloud Keys)
  • [ ] Access to a Big Data Lab public formation
  • [ ] Basic familiarity with Hadoop/Spark concepts

Steps

Step 1: Open the Container Apps Perspective

From the RosettaHub dashboard, select the Container Apps perspective. Navigate to the Big Data Labs section, which lists available Big Data formations.

You'll see pre-configured Big Data formations for common cluster configurations with Hadoop, Spark, and related ecosystem tools.


Step 2: Clone a Big Data Formation

Select a Big Data formation that matches your workload requirements.

  1. Right-click the formation
  2. Select Clone
  3. Name the new formation my-spark-cluster

Tip

Cloning creates a private copy that you can configure independently. The original formation remains unchanged.


Step 3: Configure the Cluster

Right-click my-spark-cluster and select Configure to customize your cluster settings.

Key configuration options include:

  • Master Instance Type: Compute resources for the master node (coordinates the cluster)
  • Worker Instance Type: Compute resources for each worker node (executes distributed tasks)
  • Number of Workers: How many worker nodes to provision
  • Applications: Software stack to install (Spark, Hadoop, Hive, Presto, Zeppelin, etc.)
  • Cloud Key: Cloud provider credentials for provisioning
  • Region: Where to deploy the cluster

Note

Big Data Labs provision a multi-node cluster, not a single machine. The master node manages job scheduling and coordination, while worker nodes perform the actual data processing. More workers means more parallel processing capacity.

Warning

Larger clusters with more workers consume resources proportionally. Scale your cluster to match your workload needs.
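The cost scaling above is simple arithmetic: one master plus N workers, each billed per hour. A back-of-envelope sketch (the rates below are placeholders, not real AWS or GCP prices; substitute your provider's published rates for the instance types you selected):

```python
# Back-of-envelope cluster cost estimate. The hourly rates are
# placeholders, not real AWS/GCP prices -- look up your provider's
# published rates for the instance types you chose.

def hourly_cluster_cost(master_rate, worker_rate, num_workers):
    """Total on-demand cost per hour: one master plus N workers."""
    return master_rate + worker_rate * num_workers

# Example: hypothetical $0.27/h master, $0.27/h workers, 4 workers
print(round(hourly_cluster_cost(0.27, 0.27, 4), 2))  # -> 1.35
```

Doubling the worker count roughly doubles the worker portion of the bill, so size the cluster to the job rather than defaulting to the maximum.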


Step 4: Configure Applications

In the configuration panel, review and customize the applications installed on your cluster:

  • Apache Spark - In-memory distributed computing
  • Apache Hadoop - Distributed storage and MapReduce
  • Apache Hive - SQL-on-Hadoop data warehouse
  • Apache Zeppelin - Interactive notebook for Spark
  • JupyterHub - Jupyter notebooks with PySpark integration
  • Presto - Distributed SQL query engine

Select the applications you need and click Save.


Step 5: Launch the Cluster

Click on my-spark-cluster to launch it. Click Yes in the confirmation dialog.

Your session appears under the Sessions panel. Cluster provisioning takes longer than single-machine deployments because multiple nodes must be started and configured.

Wait for the status indicator to show a green tick. This typically takes 5-10 minutes depending on cluster size and region.

Note

During provisioning, the cloud provider is creating the master node, worker nodes, configuring networking, and installing the selected applications across all nodes.


Step 6: Connect to Your Cluster

Click the running session to view connectivity options. Big Data Labs offer several ways to connect:

Web Interfaces:

  • Jupyter/Zeppelin Notebook - Click the session link to open a notebook interface in your browser for interactive Spark/Hadoop development
  • Spark UI - Monitor running and completed Spark jobs
  • YARN Resource Manager - View cluster resource utilization

SSH Access:

  • Download the PEM/PPK key file from the connectivity panel
  • SSH into the master node for command-line access
ssh -i ~/Downloads/my-spark-cluster.pem hadoop@<master-node-ip>
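If ssh rejects the downloaded key with a "permissions are too open" error, tighten the key file's mode first (standard OpenSSH behavior; the path matches the example above):

```shell
# OpenSSH refuses private keys that are readable by other users,
# so restrict the downloaded key to owner-read-only before connecting.
chmod 400 ~/Downloads/my-spark-cluster.pem
```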

Tip

Use the notebook interfaces for interactive data exploration and development. Use SSH for system administration, log inspection, and running batch jobs.


Step 7: Run a Workload

With your cluster running, you can submit Spark or Hadoop jobs. For example, in a Jupyter notebook with PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-analysis").getOrCreate()

# Read data from S3
df = spark.read.csv("s3://my-bucket/data.csv", header=True, inferSchema=True)

# Perform analysis
df.groupBy("category").count().show()

Or via SSH on the master node:

spark-submit --master yarn my_spark_job.py

You can also customize Spark and Hadoop configuration files on the master node to tune performance for your specific workloads.
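For example, a few commonly tuned Spark properties can be set in spark-defaults.conf on the master node (the values below are purely illustrative; size them to the instance types you chose, and check your distribution's documentation for the config file location):

```
# Illustrative values only -- size these to your instance types.
spark.executor.memory        4g
spark.executor.cores         2
spark.sql.shuffle.partitions 200
```

Changes to spark-defaults.conf apply to subsequently submitted jobs; per-job overrides can still be passed via spark-submit flags.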


Step 8: Create a Machine Image

After customizing the cluster environment (installing additional libraries, tuning configuration), snapshot your session.

  1. Right-click your running session in the Sessions panel
  2. Select Create Machine Image
  3. Keep Update Originator Formation On Success checked

RosettaHub will:

  • Snapshot the cluster configuration into a new machine image
  • Add the image to the Images panel (see Images Guide)
  • Update the my-spark-cluster formation to use the new image

Note

The snapshot captures the master node's environment. Custom packages, configuration changes, and installed tools are preserved for future launches.


Step 9: Share Your Formation

Share your customized Big Data formation with others:

  1. Right-click the my-spark-cluster formation
  2. Select Share
  3. Choose to share with a specific user, your organization, or a group

Step 10: Shut Down the Cluster

When your workload is complete, shut down the cluster to stop incurring charges.

  1. Right-click your running session
  2. Select Shutdown

Warning

A running cluster with multiple nodes incurs costs for every node in the cluster. Always shut down the cluster when you are not actively running workloads. After shutdown, only image storage costs remain.


Key Concepts

  • Big Data Lab: A formation that provisions a managed Hadoop/Spark cluster
  • Master Node: Coordinates the cluster, runs the job scheduler (YARN), and hosts web interfaces
  • Worker Nodes: Execute distributed tasks and store HDFS data
  • EMR / Dataproc: Managed cluster services from AWS and GCP, respectively

  • Big Data formations deploy clusters, not single machines
  • The number of worker nodes determines your cluster's parallel processing capacity
  • Applications (Spark, Hive, Zeppelin) are installed automatically during provisioning
  • Cluster configuration can be customized via SSH on the master node

Troubleshooting

Cluster takes a long time to provision

Big Data clusters typically take 5-10 minutes to provision. This is normal because:

  • Multiple nodes must be started
  • Applications are installed and configured across the cluster
  • Networking between nodes must be established

If provisioning exceeds 15 minutes, check your cloud account quotas and region availability.

Cannot connect to the notebook interface

Ensure that:

  • The session shows a green tick (all nodes fully provisioned)
  • Your browser allows pop-ups from the RosettaHub domain
  • Your network does not block the required ports

Try refreshing the session connectivity panel and clicking the link again.

Spark job fails with resource errors

Your cluster may not have enough resources. Consider:

  • Increasing the number of worker nodes in the formation configuration
  • Selecting a larger instance type for workers
  • Optimizing your Spark job's memory and executor settings
  • Checking YARN Resource Manager for available cluster capacity
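The sizing options above interact: the number of executors that fit is bounded by both cores and memory per worker. A rough sketch of that arithmetic (all numbers are illustrative, not EMR or Dataproc defaults; check YARN Resource Manager for your cluster's real per-node limits):

```python
# Rough executor-sizing arithmetic: given worker count and per-worker
# resources, how many executors of a given size fit in the cluster.
# Inputs are illustrative -- real YARN limits reserve some memory and
# cores per node for system daemons.

def max_executors(workers, cores_per_worker, mem_per_worker_gb,
                  executor_cores, executor_mem_gb):
    per_worker = min(cores_per_worker // executor_cores,
                     mem_per_worker_gb // executor_mem_gb)
    return workers * per_worker

# 4 workers with 8 cores / 32 GB each, asking for 2-core / 8 GB executors:
print(max_executors(4, 8, 32, 2, 8))  # -> 16
```

If a job requests more executors (or larger ones) than this bound allows, YARN queues or rejects the containers, which surfaces as the resource errors described above.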

Installed libraries are missing after relaunch

Libraries installed via SSH or notebook must be preserved through a machine image snapshot. Use Create Machine Image (Step 8) to save your customizations before shutting down.