Create a Custom Big Data Lab

Intermediate · 20 minutes

Overview

In this tutorial, you'll create a custom Big Data Lab on the RosettaHub Supercloud platform. Big Data Labs provision managed Hadoop/Spark clusters using services such as AWS EMR or GCP Dataproc. Unlike single-machine formations, Big Data Labs deploy a cluster consisting of a master node and one or more worker nodes, giving you distributed computing power for large-scale data processing.

You'll learn how to clone a Big Data formation, configure cluster size and applications, launch the cluster, connect to it, run workloads, snapshot your customizations, and share the formation with others.

Prerequisites

  • [ ] RosettaHub account with active subscription
  • [ ] An AWS or GCP cloud account connected (see Cloud Keys)
  • [ ] Access to a Big Data Lab public formation
  • [ ] Basic familiarity with Hadoop/Spark concepts

Steps

Step 1: Open the Container Apps Perspective

From the RosettaHub dashboard, select the Container Apps perspective. Navigate to the Big Data Labs section, which lists available Big Data formations.

You'll see pre-configured Big Data formations for common cluster configurations with Hadoop, Spark, and related ecosystem tools.


Step 2: Clone a Big Data Formation

Select a Big Data formation that matches your workload requirements.

  1. Right-click the formation
  2. Select Clone
  3. Name the new formation my-spark-cluster

Tip

Cloning creates a private copy that you can configure independently. The original formation remains unchanged.


Step 3: Configure the Cluster

Right-click my-spark-cluster and select Configure to customize your cluster settings.

Key configuration options include:

  • Master Instance Type: Compute resources for the master node (coordinates the cluster)
  • Worker Instance Type: Compute resources for each worker node (executes distributed tasks)
  • Number of Workers: How many worker nodes to provision
  • Applications: Software stack to install (Spark, Hadoop, Hive, Presto, Zeppelin, etc.)
  • Cloud Key: Cloud provider credentials for provisioning
  • Region: Where to deploy the cluster

Note

Big Data Labs provision a multi-node cluster, not a single machine. The master node manages job scheduling and coordination, while worker nodes perform the actual data processing. More workers means more parallel processing capacity.

Warning

Larger clusters with more workers consume resources proportionally. Scale your cluster to match your workload needs.
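The cost scaling above is simple arithmetic: one master plus N workers, each billed per hour. A back-of-envelope sketch (the rates below are placeholders, not real AWS or GCP prices; substitute your provider's published rates for the instance types you selected):

```python
# Back-of-envelope cluster cost estimate. The hourly rates are
# placeholders, not real AWS/GCP prices -- look up your provider's
# published rates for the instance types you chose.

def hourly_cluster_cost(master_rate, worker_rate, num_workers):
    """Total on-demand cost per hour: one master plus N workers."""
    return master_rate + worker_rate * num_workers

# Example: hypothetical $0.27/h master, $0.27/h workers, 4 workers
print(round(hourly_cluster_cost(0.27, 0.27, 4), 2))  # -> 1.35
```

Doubling the worker count roughly doubles the worker portion of the bill, so size the cluster to the job rather than defaulting to the maximum.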


Step 4: Configure Applications

In the configuration panel, review and customize the applications installed on your cluster:

  • Apache Spark - In-memory distributed computing
  • Apache Hadoop - Distributed storage and MapReduce
  • Apache Hive - SQL-on-Hadoop data warehouse
  • Apache Zeppelin - Interactive notebook for Spark
  • JupyterHub - Jupyter notebooks with PySpark integration
  • Presto - Distributed SQL query engine

Select the applications you need and click Save.


Step 5: Launch the Cluster

Click on my-spark-cluster to launch it. Click Yes in the confirmation dialog.

Your session appears under the Sessions panel. Cluster provisioning takes longer than single-machine deployments because multiple nodes must be started and configured.

Wait for the status indicator to show a green tick. This typically takes 5-10 minutes depending on cluster size and region.

Note

During provisioning, the cloud provider is creating the master node, worker nodes, configuring networking, and installing the selected applications across all nodes.


Step 6: Connect to Your Cluster

Click the running session to view connectivity options. Big Data Labs offer several ways to connect:

Web Interfaces:

  • Jupyter/Zeppelin Notebook - Click the session link to open a notebook interface in your browser for interactive Spark/Hadoop development
  • Spark UI - Monitor running and completed Spark jobs
  • YARN Resource Manager - View cluster resource utilization

SSH Access:

  • Download the PEM/PPK key file from the connectivity panel
  • SSH into the master node for command-line access
ssh -i ~/Downloads/my-spark-cluster.pem hadoop@<master-node-ip>
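If ssh rejects the downloaded key with a "permissions are too open" error, tighten the key file's mode first (standard OpenSSH behavior; the path matches the example above):

```shell
# OpenSSH refuses private keys that are readable by other users,
# so restrict the downloaded key to owner-read-only before connecting.
chmod 400 ~/Downloads/my-spark-cluster.pem
```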

Tip

Use the notebook interfaces for interactive data exploration and development. Use SSH for system administration, log inspection, and running batch jobs.


Step 7: Run a Workload

With your cluster running, you can submit Spark or Hadoop jobs. For example, in a Jupyter notebook with PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-analysis").getOrCreate()

# Read data from S3
df = spark.read.csv("s3://my-bucket/data.csv", header=True, inferSchema=True)

# Perform analysis
df.groupBy("category").count().show()

Or via SSH on the master node:

spark-submit --master yarn my_spark_job.py

You can also customize Spark and Hadoop configuration files on the master node to tune performance for your specific workloads.
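For example, a few commonly tuned Spark properties can be set in spark-defaults.conf on the master node (the values below are purely illustrative; size them to the instance types you chose, and check your distribution's documentation for the config file location):

```
# Illustrative values only -- size these to your instance types.
spark.executor.memory        4g
spark.executor.cores         2
spark.sql.shuffle.partitions 200
```

Changes to spark-defaults.conf apply to subsequently submitted jobs; per-job overrides can still be passed via spark-submit flags.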


Step 8: Create a Machine Image

After customizing the cluster environment (installing additional libraries, tuning configuration), snapshot your session.

  1. Right-click your running session in the Sessions panel
  2. Select Create Machine Image
  3. Keep Update Originator Formation On Success checked

RosettaHub will:

  • Snapshot the cluster configuration into a new machine image
  • Add the image to the Images panel (see Images Guide)
  • Update the my-spark-cluster formation to use the new image

Note

The snapshot captures the master node's environment. Custom packages, configuration changes, and installed tools are preserved for future launches.


Step 9: Share Your Formation

Share your customized Big Data formation with others:

  1. Right-click the my-spark-cluster formation
  2. Select Share
  3. Choose to share with a specific user, your organization, or a group

Step 10: Shut Down the Cluster

When your workload is complete, shut down the cluster to stop incurring charges.

  1. Right-click your running session
  2. Select Shutdown

Warning

A running cluster with multiple nodes incurs costs for every node in the cluster. Always shut down the cluster when you are not actively running workloads. After shutdown, only image storage costs remain.


Key Concepts

  • Big Data Lab: A formation that provisions a managed Hadoop/Spark cluster
  • Master Node: Coordinates the cluster, runs the job scheduler (YARN), and hosts web interfaces
  • Worker Nodes: Execute distributed tasks and store HDFS data
  • EMR / Dataproc: Managed cluster services from AWS and GCP, respectively

  • Big Data formations deploy clusters, not single machines
  • The number of worker nodes determines your cluster's parallel processing capacity
  • Applications (Spark, Hive, Zeppelin) are installed automatically during provisioning
  • Cluster configuration can be customized via SSH on the master node

Troubleshooting

Cluster takes a long time to provision

Big Data clusters typically take 5-10 minutes to provision. This is normal because:

  • Multiple nodes must be started
  • Applications are installed and configured across the cluster
  • Networking between nodes must be established

If provisioning exceeds 15 minutes, check your cloud account quotas and region availability.

Cannot connect to the notebook interface

Ensure that:

  • The session shows a green tick (all nodes fully provisioned)
  • Your browser allows pop-ups from the RosettaHub domain
  • Your network does not block the required ports

Try refreshing the session connectivity panel and clicking the link again.

Spark job fails with resource errors

Your cluster may not have enough resources. Consider:

  • Increasing the number of worker nodes in the formation configuration
  • Selecting a larger instance type for workers
  • Optimizing your Spark job's memory and executor settings
  • Checking YARN Resource Manager for available cluster capacity
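The sizing options above interact: the number of executors that fit is bounded by both cores and memory per worker. A rough sketch of that arithmetic (all numbers are illustrative, not EMR or Dataproc defaults; check YARN Resource Manager for your cluster's real per-node limits):

```python
# Rough executor-sizing arithmetic: given worker count and per-worker
# resources, how many executors of a given size fit in the cluster.
# Inputs are illustrative -- real YARN limits reserve some memory and
# cores per node for system daemons.

def max_executors(workers, cores_per_worker, mem_per_worker_gb,
                  executor_cores, executor_mem_gb):
    per_worker = min(cores_per_worker // executor_cores,
                     mem_per_worker_gb // executor_mem_gb)
    return workers * per_worker

# 4 workers with 8 cores / 32 GB each, asking for 2-core / 8 GB executors:
print(max_executors(4, 8, 32, 2, 8))  # -> 16
```

If a job requests more executors (or larger ones) than this bound allows, YARN queues or rejects the containers, which surfaces as the resource errors described above.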

Installed libraries are missing after relaunch

Libraries installed via SSH or notebook must be preserved through a machine image snapshot. Use Create Machine Image (Step 8) to save your customizations before shutting down.