Tuesday, May 14 (Ballroom – student center south)
|7:30am to 8:30am||Registration/Breakfast|
|8:00am to 12:00pm||Lustre New Users Tutorial
This session will introduce the Lustre file system environment to new users. It will familiarize users and data center management new to Lustre storage environments with Lustre before the conference begins. The class is not for experienced administrators – they should check out the afternoon session. The class will cover:
|12:00pm to 1:00pm||LUNCH|
|1:00pm to 5:00pm||Lustre Day‐to‐Day Operations Tutorial
This session is designed for users already familiar with Lustre. The focus will be on topics relevant to system administrators who manage Lustre on a day-to-day basis. This tutorial is not intended for new users – they should consider attending the morning session. This session will cover:
Plan to arrive early enough so that you can join us in the UH Student Center South Houston Room for our welcoming reception. There will be delicious hot and cold hors d’oeuvres prepared by UH’s own catering service, Chartwells, as well as a wine and beer bar, with other non-alcoholic beverages also available.
Wednesday, May 15 (Houston Room – student center south)
|7:30am to 8:45am||Registration/Breakfast|
|8:45am||Opening Remarks – OpenSFS/University of Houston|
|9:00am||Community Release Update – Peter Jones, Whamcloud||Slides|
|9:30am||IML Overview and Roadmap – Joe Grund, Whamcloud
Integrated Manager for Lustre (IML) is an open source suite of tools and a unified interface for managing / monitoring Lustre. In this session, you will learn what IML is, how it can make your life as a Lustre administrator easier, how it will be improving in the future, and how you can contribute.
While historically IML had been a monolithic piece of proprietary software, its open sourcing has increased its usage beyond the hundreds of paid customer production deployments. This expansion has been further increased due to efforts to modularize IML into components so that it is now possible for users to leverage a subset of the functionality for their own purposes.
IML enables a user to install, monitor and manage a Lustre filesystem, and get near real-time alerts and feedback of statistics and possible issues. Monitoring features include heatmaps of OST bandwidth / IOPS, and a view of top jobs. Management features include High Availability of Lustre filesystems via Corosync and Pacemaker and resource management via a state machine.
The latest 4.0.x release can be accessed at https://github.com/whamcloud/integrated-manager-for-lustre/releases. It supports the latest Lustre 2.10.x LTS and ZFS 0.7.x releases and has added scalability improvements such as device detection via Udev events.
The upcoming 5.0 release adds support for Lustre 2.12, monitoring of devices and pool health via ZED and libzfs, HA using Lustre and ZFS resource agents, delivery of IML in RPM form, manager install via Docker stack, UI Enhancements, and performance improvements using Rust.
Finally, the presentation will outline proposed enhancements after IML 5, including full ZFS management and enhanced deployments.
|10:30am||Lustre 2.13 and beyond – Andreas Dilger, Whamcloud||Slides|
|11:00am||Making Lustre – James Simmons
Over the years Lustre has continued to grow in its feature set. The HPC systems that deploy Lustre also continue to grow in size. The combination of these two factors has created an incredible burden for sites to handle such systems. The largest cost comes from the complexity of configuring the file system to optimize performance as well as managing day-to-day maintenance to keep the file system operating. Traditionally, file systems are not interactive stacks, which requires sites to develop novel techniques to gauge the state of the file system.
For Lustre to move into the Linux kernel source tree, certain requirements have to be met. Adopting those new requirements has actually opened up Lustre to leverage some powerful new functionality. Some of these new approaches offer better performance and scalability. Adopting these new APIs allows Lustre to better integrate with the standard OS software stack as well. This new functionality can ease the burden of configuring as well as maintaining any size deployment of Lustre. In this presentation we will examine new ways to handle large-scale configuration, how Lustre can be monitored for state changes, and what administrative setups can be done to act on those changes. This allows the potential for a cluster to manage its file system without direct administrative action. We can demonstrate the use of various tools in typical HPC environments that were never available before. We will also explore new potential features such as automated file system recovery or adding the ability for Lustre-aware utilities to be aware of Lustre events that occurred on another node.
|11:30am||Lustre – A view from the outside – Neil Brown, SUSE
A little under two years ago I joined the Lustre development community with one very specific goal: to get the Lustre code fully integrated with the Linux kernel. Combining my experience of nearly two decades of active kernel development with a strong desire to see the integration effort succeed has given me a unique perspective into the Lustre code base, a perspective that I would like to share.
Lustre contains much that is good, but could be better. The strengths come from a focused development community, a demanding user community, and effective feedback between these groups allowing (and requiring) problems to be identified and resolved. The weaknesses come in part from a history of trying to support multiple operating systems and in part from being isolated from the much larger communities which can both provide valuable expertise and change the base OS kernel to better support the needs of Lustre. Lustre can clearly benefit from improved integration with both the Linux code and the Linux community, and Linux itself is likely to benefit too.
Through an exploration of various subsystems including wait queues, hash tables, linked lists, locking, build infrastructure, fault injection, trace logging and more, an external perspective on the Lustre code base will be presented, and a case will be made that if full integration can be achieved then there will be clear benefits for Lustre in maintainability, in performance, in memory usage, and in test coverage, and benefits for Linux through improved cross-community collaboration and through enhancements that Lustre demands and others could benefit from.
|1:30pm||Lustre Security Update – Sébastien Buisson, Whamcloud
Today, parallel file systems are not just scratch, and Lustre as a user home or project directory has become commonplace in non-traditional HPC field domains. Some organizations have obligations to comply with new standards, rules, and methods that require security hardening. High Performance file systems are more and more often inserted into ‘Enterprise’ workflows requiring sophisticated security configurations. File systems now have to support technologies that have been designed and developed with enhanced security in mind.
Under these circumstances, Lustre endeavors to fulfill various security requirements, such as authentication, access control, network security, multi-tenancy, encryption, or audit. Unfortunately, a number of these security requirements may sound complicated or unfamiliar to people in charge of file system deployment and administration.
With a pedagogical approach, this presentation proposes to explain how each of these security requirements can be achieved with the community versions of Lustre shipping today. We are going to detail the features involved with each requirement, and show how each of these features can be implemented in order to meet the requirement with the intention of making this a less daunting prospect for those who are responsible for deploying and administrating file systems.
Of course, there is still room for further improvements in the security area, so we will also mention what additional development is currently in progress for each of the requirements. This includes new features being added to future community versions of Lustre, like encryption directly at the Lustre client level, but also stabilization efforts and endeavors to expand the documentation.
|2:00pm||Long Distance Lustre Communication – Raj Gautam, ExxonMobil
Lustre is widely used in HPC datacenters with Infiniband, Omnipath, Aries and Ethernet fabrics. Lustre networking (LNET) plays a big role in how Lustre devices communicate with each other. An LNET router is a great way to bridge different network fabrics together, so that clients and servers across different fabrics can communicate with each other. Using multiple LNET routers to route to a FileSystem also adds resiliency.
We installed a brand new HPC datacenter about 30 miles (48 km) away from an existing datacenter. Both datacenters use Lustre FileSystems in a flat Infiniband network. This presentation explains how we were able to connect these two datacenters so that Lustre clients in one datacenter can access Lustre FileSystems in the other datacenter across town. Since long distance Infiniband is expensive and complex, we chose to use a high speed Ethernet network for long distance communication and IB-Ethernet LNET routers on both ends to bridge the two fabrics together. We will show the various OS and Lustre tunings on LNET routers, Lustre clients and servers that need to be performed to maximize throughput, and show some test results. We will also present challenges that we faced along the way and how we were able to resolve and/or mitigate them. The system is now in production, exceeding our expectations.
|2:30pm||LNet Features Overview – Amir Shehata, Whamcloud
In this talk we will go over the major LNet features which have been developed for the community releases of Lustre over the past few years. The presentation aims to be user centric. It will cover the main functionality provided by Multi-Rail, Dynamic Discovery, LNet Health, Multi-Rail Routing and User Defined Selection Policies. It will give configuration examples of how all these features can be configured on the system to get maximum performance, reliability and control.
The Multi-Rail feature brought the ability to utilize multiple interfaces over the same network. LNet can also use multiple networks to communicate with peers. Performance testing has shown that Multi-Rail almost aggregates the bandwidth of the interfaces. With Dual EDR cards, LNet selftest has shown read/write bandwidth of approximately 24 GB/s.
Dynamic Discovery simplifies Multi-Rail configuration by dynamically discovering peer interfaces without having to explicitly configure them, as will be shown in this presentation.
LNet Health came on the heels of Multi-Rail and brought reliability and redundancy to LNet. With Health, LNet can monitor the errors on the links and dynamically switch to a healthier interface if one is available without dropping messages.
As the LNet design has become more Multi-Rail oriented, it became clear that the routing code needed to be brought in line with LNet’s Multi-Rail design. This was done in order to benefit from the performance and reliability aspects of Multi-Rail and Health. The presentation will cover the configuration changes introduced by the Multi-Rail Routing feature.
Finally, with multiple paths available for traffic to take, many use cases have been discovered that will benefit from the ability to control which interfaces messages should be sent to and from. The User Defined Selection Policies (UDSP) feature brings the ability to configure policies to do just that. By the end of the presentation it should be clear how all these features work together, the benefit they bring to a Lustre installation, and how they can be configured to work smoothly.
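The Multi-Rail bandwidth figure quoted in this abstract can be put in context with a little arithmetic. The sketch below is illustrative only: it assumes that "Dual EDR" means two 4x EDR InfiniBand links at the nominal 100 Gb/s each and that the 24 GB/s figure is measured in one direction; neither assumption is stated in the abstract, and the function names are hypothetical.

```python
# Nominal data rate of one 4x EDR InfiniBand link (assumption: this is the
# link type meant by "Dual EDR" in the abstract).
EDR_LINK_GBIT_PER_S = 100.0
# Bandwidth reported in the abstract for LNet selftest with Multi-Rail.
MEASURED_GBYTE_PER_S = 24.0


def aggregate_gbyte_per_s(links: int, gbit_per_link: float) -> float:
    """Ideal combined bandwidth of several links, converted to GB/s."""
    return links * gbit_per_link / 8.0


def efficiency(measured: float, links: int, gbit_per_link: float) -> float:
    """Fraction of the ideal aggregate bandwidth that was actually measured."""
    return measured / aggregate_gbyte_per_s(links, gbit_per_link)


if __name__ == "__main__":
    ideal = aggregate_gbyte_per_s(2, EDR_LINK_GBIT_PER_S)
    ratio = efficiency(MEASURED_GBYTE_PER_S, 2, EDR_LINK_GBIT_PER_S)
    print(f"ideal aggregate: {ideal:.1f} GB/s, measured share: {ratio:.0%}")
```

Under these assumptions the ideal aggregate is 25 GB/s, so the reported figure corresponds to roughly 96% of it, which is consistent with the claim that Multi-Rail "almost aggregates" the interfaces.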
|3:30pm||A quantitative approach to architecting all‐flash Lustre file systems – Kirill Lozinskiy, LBNL/NERSC
New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this presentation, we present the quantitative approach to requirements definition that resulted in the 30 PB all-flash Lustre file system that will be deployed with NERSC’s upcoming Perlmutter system in 2020. By integrating analysis of current workloads and projections of future performance and throughput, we were able to constrain many critical design space parameters and quantitatively demonstrate that Perlmutter will not only deliver optimal performance, but effectively balance cost with capacity, endurance, and many modern features of Lustre.
The National Energy Research Scientific Computing Center (NERSC) will be deploying the Perlmutter HPC system in 2020 and has designed the system from the ground up to address the needs of both traditional modeling and simulation workloads and these emerging data-driven workloads. A foundation of Perlmutter’s data processing capabilities is its 30 PB, all-flash Lustre file system that is designed to provide both a high peak bandwidth (4 TB/sec) for checkpoint-restart workloads and high peak I/O operation rates for both data and metadata.
All-flash Lustre file systems have been tested at modest scale, and all-flash burst buffers based on custom file systems are being deployed at large scales. However, completely replacing the proven disk-based high-performance tier with an all-flash tier at scale introduces a number of new questions:
– Is it economically possible to deploy enough flash capacity to replace the scratch tier and burst buffer tier?
– How much capacity is enough capacity for a scratch file system? What should the purge policy be to manage this capacity?
– Will the SSDs wear out too quickly? What drive endurance rating is required?
In addition, deploying an all-flash Lustre file system at scale poses a unique set of design questions:
– What new Lustre features are required to get the maximum performance from the SSDs?
– How much flash capacity should be provisioned for metadata versus data?
– Using Lustre’s new Data-on-MDT feature, what is the optimal default file layout to balance low latency, high bandwidth, and overall system cost?
In this presentation, we describe the sources of data and analytical methods applied during the design of the Perlmutter file system’s specifications to answer these questions.
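Two of the questions listed in this abstract (capacity versus bandwidth, and drive endurance) lend themselves to back-of-the-envelope arithmetic. The sketch below is not the authors' method. It uses only the two figures given in the abstract (30 PB capacity, 4 TB/s peak bandwidth); the daily write volume and write amplification are hypothetical inputs chosen purely for illustration, and decimal units (1 PB = 1000 TB) are assumed.

```python
CAPACITY_PB = 30.0       # from the abstract
PEAK_BW_TB_PER_S = 4.0   # from the abstract
TB_PER_PB = 1000.0       # assumption: decimal units


def seconds_to_fill(capacity_pb: float, bandwidth_tb_per_s: float) -> float:
    """Time to write the full capacity once at a sustained bandwidth."""
    return capacity_pb * TB_PER_PB / bandwidth_tb_per_s


def drive_writes_per_day(daily_writes_pb: float, capacity_pb: float,
                         write_amplification: float = 1.0) -> float:
    """Average full-capacity overwrites per day implied by a write volume.

    Ignores RAID/parity overhead and over-provisioning; both would need to
    be modeled in a real endurance estimate.
    """
    return daily_writes_pb * write_amplification / capacity_pb


if __name__ == "__main__":
    fill_s = seconds_to_fill(CAPACITY_PB, PEAK_BW_TB_PER_S)
    print(f"fill time at peak bandwidth: {fill_s:.0f} s ({fill_s / 3600:.2f} h)")
    # Hypothetical workload: 3 PB written per day across the file system.
    dwpd = drive_writes_per_day(daily_writes_pb=3.0, capacity_pb=CAPACITY_PB)
    print(f"implied drive writes per day: {dwpd:.2f}")
```

At peak bandwidth the whole file system could in principle be filled in about two hours, which is why the purge policy and the required endurance rating depend on measured workload data rather than on peak figures alone.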
|4:00pm||Performance evaluation of Lustre on All‐Flash Storage system at OIST (Okinawa Institute of Science and Technology) Graduate University – Koji Tanaka SCDA OIST Graduate University/Shuichi Ihara Whamcloud
The Okinawa Institute of Science and Technology is an interdisciplinary
At OIST, the Scientific Computing and Data Analysis Section (SCDA)
OIST recently expanded its Lustre storage by adding an All-Flash based
This presentation provides fundamental performance evaluation of Lustre
As the use of machine-learning has grown across many research
|4:30pm||Flash Based Lustre Filesystem for the Arm‐based Supercomputer‐Astra – Lee Ward, Sandia National Laboratories
|5:00pm||Hybrid Flash/Disk Storage Systems with Lustre – Nathan Rutman, Cray
|5:30pm||Session End – Proceed to the bus transportation|
Networking Event - St. Arnold’s Brewery
More information on St. Arnold can be found on their website, www.saintarnold.com
Thursday, May 16 (Houston Room – student center south)
|7:30am to 8:45am||Breakfast|
|8:45am||Sponsor Presentation – Kmesh||Slides|
|9:00am||Lustre on Public Clouds‐Opportunities, Challenges & Learning - Vinay Gaonkar & Saravanan Purushothaman, Kmesh
An opportunistic trend that is enticing enterprises, national labs and other scientific institutions to move applications out of on-premises infrastructure is the advent of cloud computing. But not every workload and application workflow is suited for the cloud. Many traditional HPC workloads, like scientific computing, modelling and simulation, need large and dynamically changing compute resources, and cloud offers a cheaper alternative to on-premises. There are many efforts by application vendors in HPC (Rescale) and EDA (Synopsys, Cadence). In this session, we will go over the challenges and opportunities of running these workloads, especially the lessons learned from running Lustre on the public cloud to meet HPC application requirements. The following are examples of the challenges that we will discuss in the session.
Challenge #1: Lustre complexity: In every high-performance file system evaluation, Lustre gets lower ratings for complex setup, configuration and maintenance. We will go over these challenges in the context of the public cloud. We will discuss some cloud-specific Lustre architectures that can simplify the installation and maintenance of Lustre. In addition, we will also discuss how features like DoM & Project Quotas can be leveraged to achieve performance and operational goals on the public cloud.
Challenge #2: Cloud provisioning: Cloud consumption of resources is fairly simple compared to on-premises resources. Bringing traditional applications, like HPC and EDA, to the cloud has its own challenges. We will discuss some opportunities which cloud provides to simplify and extend overall HPC application architecture. We will discuss efforts in terms of moving broader HPC applications to cloud, including the complexities of choosing cloud infrastructure based on price/performance. We will also discuss some lessons learned in terms of optimizing the cloud resource consumption while achieving the best performance.
Challenge #3: Data synchronization: Application data is generated in numerous locations. The challenge is to bring data close to applications. This is more important with applications running on the cloud. We will discuss the challenges of not only bringing data to HPC applications on the cloud but also sharing data and results with other applications and consumers who may not be in the same cloud.
Throughout the presentation, we will be showing performance results and live or recorded demos to illustrate the points that are discussed.
|9:30am||Managing Lustre on AWS – Andy Pollock and Aurélien Degrémont, Amazon
In this presentation we will introduce Amazon FSx for Lustre, a new managed service offering launched at AWS Re:Invent in November 2018. While we already offered Elastic File System (EFS) as a file system on AWS, we heard from customers that their workloads required greater throughput and lower latencies, and they were willing to sacrifice durability to achieve that. These customers often cited Lustre by name as the “F1 Ferrari” of file systems for fast file. Working backwards from customer needs, we introduced an unreplicated managed Lustre service to fill this gap, and it is now a key accelerator of our investment into the High Performance Computing space.
The choice to manage Lustre was an obvious business decision, but how to integrate it seamlessly into the AWS ecosystem technically was not. We will walk through the features of Lustre that we are leveraging to do this integration, why we had a bias toward leveraging prior art from the open source community, and some of the technical challenges that come from leveraging these features at Amazon scale. In particular, we will dive deep into our usage of the Hierarchical Storage Management feature and performance challenges when trying to bulk import and export millions, tens of millions or even hundreds of millions of objects in a spin-up, spin-down compute model. We will walk through an example that uses AWS Batch to orchestrate a deep learning-focused HPC customer workload built on top of Amazon FSx for Lustre. We will close by describing any gaps in the Lustre offering today that are impeding adoption by more customers and how we can help close those gaps.
HSM, Data Movement, Tiering, Layouts and More - Ben Evans, Cray
|11:00am||Smart policies for data placement and storage tiering of Lustre - Xi Li, Whamcloud
In a massive storage system, it becomes more and more common to see heterogeneous media being used at the same time. Different types of mechanical hard disks, SSDs and NVMe can all be attached as storage media in a single Lustre file system with a unified name-space. These devices have different specifications on the aspects of capacity, latency, bandwidth, reliability, cost and so on. A major challenge to the Lustre file system is how to provide necessary support to the users so as to help them to get the maximum benefit out of the different specifications of the storage media.
The Lustre OST/MDT pool mechanism provides a natural basis for supporting heterogeneous storage devices. The OST/MDT pools can be used to classify and isolate the storage devices logically according to their specifications. However, in order to build a sophisticated and complete data management solution in a file system with different storage pools, OST/MDT pools need to provide the necessary mechanisms or tools, including the policies and tools for data placement and movement.
One improvement to OST/MDT pools that we’ve recently been working on (LU-11234) is adding a Data Placement Policy (DPP) mechanism. DPP enables users to define the rules that determine which pools newly created files will be located on. The rules can be based on UID, GID, JobID, file name and expressions based on combinations of these attributes. By configuring proper rules in DPP, administrators of a Lustre file system with different storage types have better ways to control how storage space and bandwidth should be allocated. DPP is useful for the following use cases:
Besides the internals and use cases of DPP, the presentation will introduce how DPP can be used together with existing and upcoming Lustre features or tools for better data management, space allocation and quality of service in a Lustre file system with multiple storage tiers, including:
|11:30am||Robinhood Reports: a new Robinhood web front to help users find their data - Shawn Hall, NAG (Numerical Algorithms Group)
For many Lustre sites, file system purging is not an option. At these sites, the main way to manage the data on the file system is to rely on users cleaning up on their own. This is frequently a losing situation.
Finding data on a file system is also often a difficult task, especially when wading through millions or billions of files. Thankfully Robinhood allows us to maintain a replica of our file system metadata in a database, so we have the foundation necessary for users to find their data easily. The trouble is that there are only so many ways to retrieve that data from Robinhood, and typically those methods aren’t detailed enough or digestible by the common user.
That’s where we have filled the gap. To give our users a simple view into their data, we developed robinhood-reports. This is a new web front end for Robinhood databases that simplifies how users can search for their data by providing reports directed specifically to users’ needs. Currently, robinhood-reports includes the following preconfigured reports:
|1:30pm||Sponsor Presentation – DDN||Slides|
|1:45pm||migratefs: overlay filesystem for transparent, distributed migration of active data - Stéphane Thiell, Stanford
Since February 2019, Stanford’s Sherlock cluster has been running Lustre 2.12 in production, and has been taking advantage of the latest Lustre features like DNE, DOM and PFL. Managed by the Stanford Research Computing Center, Sherlock is a shared and heterogeneous 1,500-node computing cluster available to the whole Stanford research community, running all kinds of Research Computing applications, from interactive tools to the most taxing of HPC and AI workloads. Unlike most large clusters in computing centers, Sherlock is driven by contributions from individual PIs and groups, and as such, is constantly evolving. This provides a valuable pool of resources for its 4,000 users, but also poses a unique set of challenges from the system administration perspective, especially in the domain of data storage.
In my talk, I’ll first describe Sherlock’s new scratch file system design, especially focused on small files performance, and designed around DNE and DOM, and I’ll provide feedback to the community about our early experience with Lustre 2.12.
In a second part, I’ll introduce migratefs, an open source overlay file system that we developed in-house, to ease data migration between file systems. When maintaining access to existing data is required, the lifecycling of data storage systems usually presents two main challenges: trying to minimize the amount of old and unused data that is transferred to the new system, as well as minimizing disruption to users’ existing workflows. migratefs addresses both needs in a novel and innovative way: it combines the old and the new scratch file systems in a single and unified view, and transparently migrates modified data when it’s accessed.
This unique approach allowed us to put our new scratch file system into production within our regular cluster maintenance schedule, without the extended downtime that is usually required to copy all the existing data to the new system. It has also been completely transparent to our users, who can continue to access their existing data and create new files in the new system without having to modify a single line of code, saving them time and hassle.
We released migratefs as an open-source tool, to make it available to the wider community, and plan to continue improving it by making it suitable for additional file system migration use cases.
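The behavior described in this abstract (a single view over the old and new file systems, with data migrated when it is modified) can be illustrated with a small user-space sketch. This is not migratefs itself and says nothing about its actual implementation: the `OverlayView` class, its methods, and the choice to move a file on the first write are all assumptions made for illustration, using ordinary directories in place of file systems.

```python
import shutil
import tempfile
from pathlib import Path


class OverlayView:
    """Toy union of an old (lower) and a new (upper) directory tree."""

    def __init__(self, old_root: Path, new_root: Path):
        self.old_root = old_root
        self.new_root = new_root

    def resolve(self, rel: str) -> Path:
        """Return the real path backing rel, preferring the new tree."""
        new_path = self.new_root / rel
        if new_path.exists():
            return new_path
        old_path = self.old_root / rel
        if old_path.exists():
            return old_path
        raise FileNotFoundError(rel)

    def open(self, rel: str, mode: str = "r"):
        """Open rel; any writing mode first moves the file to the new tree."""
        if any(flag in mode for flag in "wa+x"):
            new_path = self.new_root / rel
            new_path.parent.mkdir(parents=True, exist_ok=True)
            old_path = self.old_root / rel
            if not new_path.exists() and old_path.exists():
                # Assumption: migration is modeled as a move on first write.
                shutil.move(str(old_path), str(new_path))
            return open(new_path, mode)
        # Read-only access never migrates; it just finds the file.
        return open(self.resolve(rel), mode)


def demo() -> dict:
    """Modify one file and only read another, then report where each lives."""
    with tempfile.TemporaryDirectory() as tmp:
        old_root, new_root = Path(tmp, "old"), Path(tmp, "new")
        old_root.mkdir()
        new_root.mkdir()
        (old_root / "results.txt").write_text("run 1\n")
        (old_root / "archive.txt").write_text("untouched\n")
        view = OverlayView(old_root, new_root)
        with view.open("results.txt", "a") as fh:
            fh.write("run 2\n")
        with view.open("archive.txt") as fh:
            archive = fh.read()
        with view.open("results.txt") as fh:
            results = fh.read()
        return {
            "results": results,
            "archive": archive,
            "results_in_new": (new_root / "results.txt").exists(),
            "results_in_old": (old_root / "results.txt").exists(),
            "archive_in_old": (old_root / "archive.txt").exists(),
        }


if __name__ == "__main__":
    print(demo())
```

In this sketch the file that was appended to ends up in the new tree with its earlier contents intact, while the file that was only read stays in the old tree, so stale data is never copied.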
|2:15pm||Cross‐tier Unified Namespace Update - Mohamad Chaarawi, Intel
This presentation will give an update on the cross-tier unified namespace concept that was introduced at LAD’18.
This functionality allows a userspace distributed object store like DAOS to be integrated with a Lustre filesystem under a single unified namespace. The open-source DAOS object store will be used as the baseline example, but the suggested approach and associated Lustre changes are effectively agnostic to the object store and are designed to work with any storage tier relying on a URI to identify a collection of objects (i.e. storage containers in the DAOS case). The presentation will also cover:
The Lustre changes required to support this integration will then be presented in further detail. DAOS containers will be represented in the Lustre namespace through files and directories with special – aka foreign – LOV and LMV EAs (see LU-11376 for further details). Beyond the unified namespace, the foreign LOV and LMV formats can be reused to implement future features like the Lustre Client Container Image (CCI) or advanced HSM functionality.
Finally, dataset migrations between the object store and the Lustre tier will be considered with a discussion on the different use cases (i.e. I/O middleware-level vs DAOS-level copy) and tools that can potentially be leveraged.
|2:45pm||OpenSFS update||Slides1 Slides2|
|3:45pm||A Performance Study of Lustre File System Checker: Bottlenecks and Potentials - Dong Dai, UNC Charlotte
Lustre, as one of the most popular parallel file systems in high-performance computing (HPC), provides a POSIX interface and maintains a large set of POSIX-related metadata, which could be corrupted due to hardware failures, software bugs, configuration errors, etc. The Lustre file system checker (LFSCK) is the remedy tool to detect metadata inconsistencies and to restore a corrupted Lustre to a valid state, and hence is critical for reliable HPC.
Unfortunately, in practice, LFSCK runs slowly in large deployments, making system administrators reluctant to use it as a routine maintenance tool. Consequently, cascading errors may lead to unrecoverable failures, resulting in significant downtime or even data loss. Given that HPC is rapidly marching to Exascale and much larger Lustre file systems are being deployed, it is critical to understand the performance of LFSCK.
In this research, we study the performance of LFSCK to identify its bottlenecks and analyze its performance potential. Specifically, we design an aging method based on real-world HPC workloads to age Lustre to representative states, and then systematically evaluate and analyze how LFSCK runs on such an aged Lustre by monitoring the utilization of various resources. From our experiments, we find that the design and implementation of LFSCK is sub-optimal. It suffers from a scalability bottleneck on the metadata server (MDS), a relatively high fan-out ratio in network utilization, and unnecessary blocking among multiple internal components. Based on these observations, we will discuss potential optimizations and present some preliminary results.
The presentation will include a quick introduction to the Lustre file system checker implementation, a detailed description of our performance evaluation methodology and results, and a discussion of potential LFSCK performance optimizations with preliminary results. The goal of this presentation is to draw the community’s attention to the performance problem of LFSCK and discuss the potential optimizations to solve such issues.
Lustre in the Compute Canada Federation and the deployment of Beluga - Simon Guilbault, Calcul Québec
The first part of the presentation will present the current landscape of Lustre in Compute Canada. The services available to researchers will be presented, with a quick overview of the Canada-wide scientific software stack and general user environment, common across the clusters. A new near-line storage service will be available to researchers in 2019. This service is based on the Lustre HSM feature, with tape libraries based on TSM. An HSM connector to TSM tapes was developed and is now used in production. Experience with this new service will be presented.
The second part of the presentation will focus on the newest deployment of a general purpose cluster in Canada, called Beluga and managed by Calcul Quebec. This cluster is planned to be in production for researchers in April 2019.
The presentation will list and explain the choices leading to the adoption of some of the new features of Lustre. These new features are used in production for the first time on a Compute Canada system: ZFS on OST and MDT, disk encryption, SAS multipath and DNE.
The provisioning system and the modifications needed to the OS will be presented. Some issues and workarounds encountered with the new system hardware will be discussed, for example problems with the scalability of the mpt3sas drivers, and the development of custom scripts to manage and monitor the JBODs.
Finally, a few benchmark results will be shown using VDBench, obdfilter-survey, IOR and mdtest. A limitation in performance was observed during these benchmarks; some measurements point to a bottleneck with the memory bandwidth of the Skylake OSS with ZFS and/or LUKS.
|5:00pm||Session End – Proceed to the Gameroom|
|5:30pm||Networking Event - UH Gameroom
The UH Gameroom is located in the basement of the Student Center, two floors below our conference location. Join us for an evening of bowling, pool, and ping pong. (There are also video games and an air hockey table available). We will also bring in a Tex-Mex buffet of beef/chicken and grilled vegetable fajitas with all the fixings (guacamole, salsa, sour cream, cheese, rice and beans). We have the entire facility reserved for our exclusive use this evening.
Friday, May 17 (Houston Room – student center south)
|7:30am to 8:45am||Breakfast|
|8:45am||Sponsor Presentation – Cray||Slides|
|9:00am||Un‐scratching Lustre - Cameron Harr, LLNL
File systems, especially complex, parallel ones, take many years to mature. Given the risky behavior inherent in adolescence, these young file systems are often used as “scratch” file systems, containing non-critical or easily-reproducible data in case of data loss or corruption from unforeseen bugs. With Lustre now starting its third decade of life since conception at Carnegie Mellon University in 1999, it is striving to cast off its teenage years and present itself in a responsible and mature fashion.
As one of the original funders of Lustre and the first user of Lustre in production back in 2003, Lawrence Livermore National Lab’s Livermore Computing (LC) has a long and close relationship with the file system and has an interest in seeing Lustre further mature. To that end, in the second half of 2018 and coinciding with the retirement of many older Lustre file systems, LC commenced the “un-scratching” of Lustre: the migration of multiple, production, “scratch” Lustre file systems to persistent, non-scratch, non-purgeable, file systems.
This presentation first addresses the state of Lustre in LC through the first half of 2018. It then details the rationale behind this change, from both a user and an administrative perspective. Next it covers some of the mechanics, results, and yes, even a bit of politics involved in the conversion. The presentation then addresses the current state of the production Lustre file systems in LC, including the implementation of user quotas and the takeaways gathered from that experience. Finally, the presentation will touch on what this change means for the future, specifically with regard to refreshing the hardware underlying Lustre.
|9:30am||Introducing pool quotas - Cory Spitz, Cray
Quota controls are the natural way to impose administrative limits on space resources. However, quotas in Lustre today are limited to filesystem-wide limits on a per-user, per-group, or per-project basis. We describe a new pool quotas design that extends Lustre’s quota capabilities to limit allocations within pools. We describe the feature design and explain the initially confusing concepts of using multiple quotas.
|10:00am||Layering ZFS Pools on Lustre - Rick Mohr, University of Tennessee
For most HPC systems, Lustre is a good solution for providing high-bandwidth I/O to shared storage resources that can be accessed simultaneously from many clients for parallel computations. Lustre performs best for large sequential read/write operations, but performance can diminish for workloads that produce lots of small I/O requests or random file accesses. This is the reason many sites deploy additional storage resources (like NFS) for user home directories where tasks like code compilation or interactive file editing may perform better. However, deploying these secondary storage resources adds additional burden to system administrators and fails to leverage the advantages of an existing Lustre investment (like increased storage capacity).
In this presentation, we share our experiences with layering a ZFS file system on top of a Lustre file system. We outline potential use cases and discuss benefits from a system administration standpoint, such as:
– Conserving Lustre inodes by using ZFS to consolidate large numbers of small files into a single Lustre file
We investigate the performance of ZFS-on-Lustre and present the results of several benchmark tests. Based on these benchmark results, we discuss the possibility of using ZFS to speed up code compilation and look at ZFS’ ability to shape I/O traffic to the backend Lustre file system. We also look at using NFS to export a ZFS-on-Lustre configuration and benchmark performance from an NFS client system.
|11:00am||Lustre Overstriping‐Improving Shared File Write Performance - Patrick Farrell, Whamcloud
From its earliest versions, Lustre has included striping files across multiple data targets (OSTs). This foundational feature enables scaling performance of shared-file I/O workloads by striping across additional OSTs. Current Lustre software places one file stripe on each OST, and for many I/O workloads this behavior is optimal. However, faster OSTs backed by non-rotational storage show individual stripe bandwidth limitations due to the underlying file systems (ldiskfs, ZFS). Additionally, shared-file write performance, for I/O workloads that don’t use special optimizations like Lustre lockahead, may be limited by write-lock contention, since Lustre file locks are granted per-stripe. This issue is becoming more pressing as new distributed RAID technologies (DCR, GridRaid, dRaid) allow larger OSTs, reducing the number of OSTs in a file system. Traditionally, the only solution has been to switch from shared file to file per process, which is not ideal and not always possible.
A new Lustre feature known as ‘overstriping’ addresses these limitations by allowing a single file to have more than one stripe per OST. The presence of more than one stripe per OST allows the full bandwidth of a given OST to be exploited while still using one file. This presentation will discuss synthetic and application I/O performance using overstriping and implications for achieving expected performance of next-generation file systems in shared-file I/O workloads.
|11:30am||Solving I/O Slowdown: DoM, DNE and PFL Working Together - John Fragalla and William Loewe, Cray
|12:00pm||IO-500 - A Storage Benchmark for HPC - Andreas Dilger, Whamcloud
For years, high performance computing has been dominated by the overwhelming specter of Linpack and the Top500. Many sites, tempted by the allure of fleeting Top500 glory, chased architectures well-suited to Linpack but to the detriment of their core workflows. Despite this, the Top500 has overall provided value to the community by bringing attention to HPC and driving competition and innovation in processor architectures. Two years ago, with these observations in mind, we formed a comparable list for HPC storage called the IO500.
The IO500 seeks to provide more balance for HPC. By creating a complementary list to the Top500, we hope that sites that pursue these lists will design machines that work well for both the Top500 and the IO500, resulting in more balanced data centers overall. Additionally, the IO500 consists of a suite of benchmarks designed to identify a storage system’s range of possible performance. For too long, storage vendors and data centers have only published their “hero” bandwidth numbers, which does a tremendous disservice to the community by creating unreasonable and unattainable performance expectations. Accordingly, the IO500 forces submitters to report both their “hero” numbers and their performance using notoriously challenging patterns of both data and metadata. This provides the community with an understanding of both a system’s possible and its probable potentials.
Over two years, we have now had three lists and collected over sixty submissions across more than twenty institutions and nine different file systems. All collected data is publicly available so that the community can begin to discover which file systems (and which configurations) will best serve their particular workflow balance.
In this talk, we will present a brief history and motivation of the IO500 and spend the majority of the time attempting to find trends and other observations from the submissions received thus far.
|12:30pm||Conference Concludes – BOXED LUNCH|
Networking Event - Space Center Houston Tour
Learn more about Space Center Houston and all of its attractions at their official website: www.spacecenter.org
We hope to see you next year in Berkeley for LUG 2020.