Skip to content

Observability

The Observability module provides a unified view of metrics, alerts, and events across your Cloud HPC platform resources β€” including clusters, queues, disks, and workstations.
It enables administrators and developers to analyze system health, track resource utilization, and troubleshoot performance or scaling issues from a single pane of glass.


1. How to Use This Dashboard

  • Select the resource type (Cluster, Queue, Disk, or Workstation) from the sidebar.
  • Use search and filtering options to quickly locate a specific instance, queue, or volume.
  • View status, configuration details, live metrics, and alert history for each resource.

Filter Menu


2. Cluster Observability

Overview

  • Provider: Cloud vendor (e.g., AWS).
  • Resource Type: Typically EC2 or ParallelCluster.
  • Account ID / Region: Identifies where the cluster resides.
  • Cluster Name: Unique name or ID for the cluster.
  • Asset ID: Internal tracking identifier within your management system.

2.1 Head Node Details

  • Provider: Indicates the cloud vendor hosting the head node (e.g., AWS).
  • Resource Type: Shows "ParallelCluster" for the overall cluster and "EC2" for the head node instance.
  • Account ID: AWS account that owns the cluster.
  • Region: Geographic AWS region where the cluster is deployed (e.g., us-east-1).
  • Cluster Name: Unique identifier of the HPC cluster.
  • Asset ID: Internal or cloud-mapped ID for tracking purposes.
  • Created At: Timestamp indicating when the head node was provisioned (ISO or local time format).
  • Instance ID: EC2 instance ID of the head node.
  • Instance Type: AWS EC2 instance family/type (e.g., t3.medium).
  • Status: Real-time operational state (e.g., running, stopped).
  • Private IP Address: Internal network IP of the node.
  • Service Name: Underlying infrastructure service (typically ec2).
  • Asset Name: AWS EC2 instance ID or platform-generated name for the instance.
  • Image ID: AMI used to boot the head node (includes OS and app base).
  • Volume ID: EBS volume(s) attached to the instance, listed by ID.

HeadNode Menu

    πŸ’‘ Insight
  • The head node acts as the cluster’s control plane β€” managing job scheduling, monitoring, and communication with compute nodes. Any interruption to it can affect the entire workload queue.

Cluster Metrics

Metrics Panel

The Metrics Panel provides real-time and historical usage data for your selected resource β€” including cluster, queue, disk, or workstation.
It helps track resource health, identify performance bottlenecks, and support troubleshooting activities.


How to Use the Metrics Panel

Duration Selector

  • Located on the top right (e.g., "Last 3 days").
  • Allows you to choose the time window for visible metrics.

Timezone Indicator

  • Displays the time standard for graphed data (usually IST).
  • Ensures consistent comparison and troubleshooting across local and UTC times.

Key Metrics Displayed

CPUUtilization (%)

  • Displays the percentage of CPU resources used over time.
  • Peaks indicate heavy compute jobs or resource contention.

DiskUsedPercent (%)

  • Shows the percentage of disk space currently used.
  • High values (over 80–90%) should be monitored to prevent storage outages.

MemUsedPercent (%)

  • Displays the percentage of system memory in use.
  • Prolonged high memory usage can signal memory leaks or undersized resources.

StatusCheckFailed

  • Counts failed instance health or status checks in the selected period.
  • Any nonzero values require immediate investigation to ensure resource health.

NetworkIn (MB)

  • Amount of incoming network traffic per period (in megabytes).
  • Useful for identifying burst data uploads or distributed workloads.

NetworkOut (MB)

  • Amount of outgoing network traffic per period (in megabytes).
  • Spikes may indicate data exports, result delivery, or synchronization jobs.

Reading the Charts

  • Each metric is visualized as a time-series line graph.
  • Gaps or drops in the graph may signal downtime or monitoring misconfigurations.
  • Sudden spikes or persistently high values are strong indicators that further investigation is required.

HeadNode Metrics


    πŸ’‘ Tip
  • Use this panel regularly to compare performance trends and detect early signs of system stress or misconfiguration.

Cluster Alerts

  • Alert Sources: Cluster-level alerts include failures, degraded node performance, or misconfigurations.
  • Alert Examples:
    • Critical: Head node unreachable or failed initialization.
    • High: CPU utilization above defined thresholds.
    • Medium: Disk latency or memory usage warnings.
  • Monitoring Tools: These alerts are visible in the Alerts Menu under Cluster category.
  • Action: Click an alert to view timestamp, severity, and related AWS resource IDs.

HeadNode Alerts


2.2 Queue Observability

Queue Configuration

  • Queue Configuration: Logical grouping for running compute jobs within the cluster.
  • Subnet IDs: Network subnets where queue nodes are launched (defines AZ and network isolation).
  • Security Group IDs: Security groups applied to queue nodes (firewall rules).
  • Policies: IAM roles/policies that grant nodes permission to access AWS resources (e.g., SSM).
  • Script Location: S3 URI pointing to initialization scripts for queue nodes.

Volume Details

  • Type: Storage class (e.g., gp3 for high-performance SSD).
  • Size: Disk size in GB.
  • Encryption: Indicates whether encryption is enabled (True/False).

S3 Access

  • Bucket: S3 bucket configured for data access (may be blank if not set).
  • Write Access: True if queue nodes have write permissions; otherwise False.

Compute Resources

  • Name: Logical compute group (e.g., xsmall, medium).
  • Instance Type: Instance family and vCPU count (e.g., c6i.4xlarge).
  • Min/Max Instance Count: Defines autoscaling boundaries for the queue.
  • EFA Enabled: True if Elastic Fabric Adapter networking is enabled for low latency.
  • Placement Group Enabled: Indicates whether nodes are grouped for network optimization.
  • Disable Simultaneous Multithreading: Disables vCPU hyper-threading for better single-core performance when set to True.

Additional Details

  • Status: Overall cluster or queue state (e.g., up, down).
  • Cluster Version: Version of AWS ParallelCluster deployed.
  • CloudFormation Stack ARN: Identifier for the deployment stack (optional).
  • Created At / Last Modified At: Timestamps for tracking configuration changes.
  • Compute Fleet Status: Current state of the scheduler or compute fleet (e.g., active, on demand).
  • Tags: Key-value pairs used for grouping, billing, or automation (e.g., node-type, project-environment).

Queue Menu

Queue Metrics and Graphs

  • Live Metrics: Displays queue activity β€” allocated, idle, and failed nodes.
  • Graph Data: Lines represent load, scaling behavior, and utilization trends.

Queue Metrics


Queue Alerts

  • Alert Types:
    • Critical: Queue scaling failure or scheduler crash.
    • High: Spot instance interruptions.
    • Medium: Job pending or policy misconfiguration.
  • Alert Actions:
    • Use severity filters to isolate specific queue issues.
    • Timeline view highlights bursts of queue errors.
  • Screenshot: Insert queue metrics and alert view for visual context.

Queue Metrics

    πŸ’‘ Tip
  • Consistently monitor queue scaling trends β€” unexpected node counts or prolonged idle states may signal misconfigured autoscaling or insufficient IAM permissions.

3. Disk Observability

Disk Overview

  • Provider: Cloud service provider (e.g., AWS).
  • Resource Type: Storage service type (e.g., "ElasticFileSystem" for EFS).
  • Account ID: Owner account responsible for billing and access.
  • Region: AWS region where the disk resides.
  • Asset Name / ID: Identifiers for the EFS or FSx resource.
  • Status: State of the file system (e.g., available, degraded).
  • File System Name: User-assigned or automatically generated name.
  • FileSystemArn: Complete Amazon Resource Name (ARN) for the EFS volume.
  • File System Owner ID: AWS account ID that created the file system.
  • File System Creation Token: Secure token used during provisioning.
  • File System KMS Key ID: Key ARN for encrypting data at rest (if enabled).
  • Created At: Date and time when the file system was created.
  • File System Performance Mode: Defines performance class (e.g., generalPurpose).
  • File System Encrypted: True if encrypted at rest.
  • File System Throughput Mode: Throughput mode (e.g., bursting, provisioned).
  • Tags: Metadata used for identification and automation (e.g., project name, environment).

Disk


Disk Metrics and Graphs

  • Storage Utilization: Graphs showing reads/writes, capacity, and latency.
  • Performance History: Track throughput spikes or performance degradation.

Disk Metrics


Disk Alerts

  • Alert Types:
    • Critical: Filesystem degraded or unavailable.
    • High: I/O saturation or mount errors.
    • Medium: Capacity nearing threshold.
  • Alert Data Fields: Include timestamp, source, affected resource, and message text.
  • Recommended Actions:
    • Expand capacity or tune performance mode.
    • Check KMS encryption status for failures.

Disk Metrics


4. Workstation Observability

Workstation Overview

  • Provider: AWS or other cloud vendor.
  • Account ID / Region: Location and ownership details.
  • Workstation ID / Name: Unique identifiers.
  • Compute Name: Node group association.
  • Resource Type: EC2 or equivalent compute.
  • Asset Name / Asset ID: Instance identifiers.
  • Private/Public IP: Network endpoints.
  • Created At: Launch date and time.
  • Tags: Key-value pairs for management and automation.
  • User Details: Assigned user name and ID.
  • Catalog & Image: Source image details and OS.
  • Volume Details: Storage info, size, and status.
  • Capacity Details: vCPUs, RAM, and cost details.
  • Network & Security: Subnet, VPC, SGs, IAM roles.
  • Stack Details: CloudFormation metadata.

Workstation


Workstation Metrics and Graphs

  • System Health: Displays CPU, RAM, and network usage over time.
  • Performance Trends: Identify underutilized or overused compute.
  • Visuals: Add screenshot of workstation monitoring page.

Workstation


Workstation Alerts

  • Alert Types:
    • Critical: Workstation unresponsive or failed to initialize.
    • High: High CPU/memory utilization or network failure.
    • Medium: Missing IAM role or outdated image.
  • Alert Features:
  • Each alert displays timestamp, resource name, and impact severity.
  • Use search and filters to quickly isolate workstation-specific alerts.

Workstation Alerts


    πŸ’‘ Key Points:
  • Alerts are generated from real-time metrics and threshold events.
  • Filtering and search make diagnosis faster.
  • Timeline visualization helps detect recurring issues.
  • Alerts support trend analysis, capacity planning, and incident response.
    πŸ’‘ Notes:
  • All values and keys are sourced from the actual Observability Dashboard.
  • Each bullet explains what the field represents and its operational importance.
  • Structure is uniform across all sections for maintainability.