Penguin Solutions Releases ICE ClusterWare Management Software 13.0

2025-11-18     taewoo.choi

Penguin Solutions, Inc., a leading provider of high-performance computing and AI infrastructure solutions, announced the release of ICE ClusterWare software 13.0.  

This latest version introduces powerful new capabilities that solve two critical challenges in production-scale AI and HPC: sustaining peak cluster performance and securely provisioning a single cluster to diverse user groups.

These new features enable organizations to maximize return on their AI infrastructure investments by safely sharing resources across more users while ensuring consistent, reliable performance.

When an organization’s AI deployments progress from isolated pilot projects to enterprise-wide production environments, operational demands on infrastructure intensify immediately.

Penguin’s ICE ClusterWare 13.0 addresses this with built-in anomaly detection and auto-remediation, along with network-isolated multi-tenancy, delivering the operational excellence required to support AI as a core business function.

“With the launch of our ICE ClusterWare software 13.0, we’re delivering pivotal advancements to help organizations manage the growing complexity of modern AI and HPC environments,” said Sharri Parsell, vice president of software engineering for Penguin Solutions.

“As AI continues to evolve from experimental pilots to enterprise-scale deployments, organizations need robust, intelligent infrastructure that drives operational excellence and enables AI success across the enterprise.”

The patent-pending anomaly detection and auto-remediation technology ensures peak cluster performance and resource availability, continuously monitoring for hidden performance degradation that traditional diagnostic tools miss.

Upon detection, the system automatically isolates underperforming nodes and initiates remediation in real time, ensuring that workloads are scheduled on validated, high-performing nodes.

This proactive approach reduces administrative burdens, prevents unplanned downtime, and maximizes the cluster’s usable capacity. As a result, the new capability significantly shortens model training time by reducing restarts and lost work.

The new optional network-isolated multi-tenancy feature enables organizations to securely and efficiently share high-value GPU clusters, creating dedicated subclusters to support different departments, projects, or GPU-as-a-Service (GPUaaS) customers.

This capability provides isolated environments, giving tenants the autonomy to select their own workload manager, govern users, and run workloads with confidence that data and operations remain segregated and secure.

By reducing the security and resource-utilization conflicts that previously forced organizations to build separate clusters, the feature drastically improves time to value.

This capability is essential for cloud service providers and hyperscalers providing GPUaaS, enterprises and research institutes delivering AI computing to internal business groups, and federal or government agencies that require the highest level of security and resource isolation.