SIGHPC Systems Professionals Workshop

HPCSYSPROS19

FRIDAY, November 22, 2019

8:30am to noon

Denver, CO

Held in conjunction with SC19 and in cooperation with SIGHPC

 Quick Information

Supercomputing systems present complex challenges to personnel who design, deploy and maintain these systems. Standing up these systems and keeping them running require novel solutions that are unique to high performance computing. The success of any supercomputing center depends on stable and reliable systems, and HPC Systems Professionals are crucial to that success.

The Fourth Annual HPC Systems Professionals Workshop will bring together systems administrators, systems architects, and systems analysts to share best practices, discuss cutting-edge technologies, and advance the state of the practice for HPC systems. This CFP requests that participants submit papers, slide presentations, or 5-minute Lightning Talk proposals along with reproducible artifacts (code segments, test suites, configuration management templates) that can be made available to the community for use.

 Keynote Speaker

Kate Keahey will be presenting Chameleon: How to Build a Cloud++

Chameleon is a large-scale, deeply reconfigurable experimental platform built to support Computer Science systems research. Community projects range from systems research on exascale operating systems, virtualization methods, performance variability, and power management to projects in software-defined networking, machine learning, and resource management. What makes Chameleon unique is that it provides these sophisticated capabilities on top of a mainstream infrastructure cloud system (OpenStack). In this talk, I will describe the challenges we faced in building Chameleon, the lessons we learned, our operations experience, and our packaging of the system, which integrates both the developed capabilities and the operational experience and makes it easier to manage platforms of this kind.

Kate Keahey is one of the pioneers of infrastructure cloud computing. She created the Nimbus project, recognized as the first open source Infrastructure-as-a-Service implementation, and continues to work on research aligning cloud computing concepts with the needs of scientific centers and applications. To facilitate such research for the community at large, Kate leads the Chameleon project, providing a deeply reconfigurable, large-scale, and open experimental platform for Computer Science research. Kate also co-founded and serves as co-Editor-in-Chief of the SoftwareX journal, a new format designed to publish software contributions. Kate is a Senior Computer Scientist with the Math and Computer Science division at Argonne National Laboratory and a Senior Fellow at the Computation Institute at the University of Chicago.

 Schedule

Start     End       Description
8:30 AM   8:45 AM   Welcome
8:45 AM   9:30 AM   Keynote: Chameleon: How to Build a Cloud++, Kate Keahey, Argonne National Laboratory
9:30 AM   9:45 AM   Paper: Decoupling OpenHPC Critical Services, Jacob Chappell, Bhushan Chitre, Vikram Gazula, Lowell Pike, James Griffioen, University of Kentucky
9:45 AM   10:00 AM  Paper: Implementing a Common HPC Environment in a Multi-User Spack Instance, Carson Woods, Matthew L. Curry, Anthony Skjellum, University of Tennessee, Chattanooga and Sandia National Laboratories
10:00 AM  10:30 AM  Break
10:30 AM  10:37 AM  Lightning Talk: Arbiter: Dynamically Limiting Resource Consumption on Login Nodes, Dylan Gardner, Robben Migacz, Brian Haymore, University of Utah, Center for High Performance Computing
10:37 AM  10:44 AM  Lightning Talk: Using GUFI in Data Management, Christopher Hoffman, Bill Anderson, National Center for Atmospheric Research
10:44 AM  11:00 AM  Slide Presentation: Monitoring HPC Services with CheckMK, Kieran Leach, Philip Cass, Craig Manzi, Edinburgh Parallel Computing Centre
11:00 AM  11:15 AM  Slide Presentation: The Road to DevOps HPC Cluster Management, Ken P. Schmidt, Evan J. Felix, Pacific Northwest National Laboratory
11:15 AM  11:30 AM  Paper: What Deploying MFA Taught Us About Changing Infrastructure, Abe Singer, Shane Canon, Rebecca Hartman-Baker, Kelly L. Rowland, David Skinner, Craig Lant, Lawrence Berkeley National Laboratory and National Energy Research Scientific Computing Center
11:30 AM  11:45 AM  Paper: A Better Way of Scheduling Jobs on HPC Systems: Simultaneous Fair-Share, Craig P. Steffen, University of Illinois and National Center for Supercomputing Applications
11:45 AM  12:00 PM  Closing Remarks and Open Discussion

 Proceedings

https://github.com/HPCSYSPROS/Workshop19

 Topics of Interest

The following topics of interest indicate the workshop's direction; other related topics are also welcome.

  • Cluster, configuration, or software management
  • Performance tuning/Benchmarking
  • Resource manager and job scheduler configuration
  • Monitoring/Mean-time-to-failure/ROI/Resource utilization
  • Virtualization/Clouds
  • Designing and troubleshooting HPC interconnects
  • Designing and maintaining HPC storage solutions
  • Cybersecurity and data protection
  • Cluster storage

Example paper ideas might be:

  • Best practices for job scheduler configuration
  • Advantages of cluster automation
  • Managing software on HPC clusters

 Calendar

Event                        Date
Workshop Submissions Open    April 17, 2019
Workshop Submissions Close   August 30, 2019
Reviews Sent                 September 14, 2019
Resubmissions Open           September 16, 2019
Resubmissions Close          September 28, 2019
Acceptance Notification      October 11, 2019

 Organizing Committee

Position                   Name                    Affiliation
Chair                      David Clifton           Ansys
Program Committee Chair    John Blaas              University of Colorado Boulder
Organizing Committee       Stephen Fralich         Boeing
                           Stephen Lien Harrell    Purdue University
                           Adam Hough              University of Washington
                           William Scullin         Laboratory for Laser Energetics
                           Jenett Tillotson        NCAR

 Program Committee

Name                 Affiliation
Jonathon Anderson    University of Colorado Boulder
Jared Baker          NCAR
Matt Bidwell         NREL
Stephen Fralich      Boeing
Brian Haymore        University of Utah
Randy Herban         Sylabs Inc.
Andrew Howard        Microsoft
Ti Leggett           Argonne National Laboratory
Hon Wai Leong        DDN
Scott McMillian      Nvidia
Paul Peltz Jr.       Oak Ridge National Laboratory
Jeff Raymond         University of Pittsburgh
Jenett Tillotson     NCAR
Alex Younts          Purdue University

 Publication Information

All accepted papers and artifacts will be published on GitHub and archived with a DOI in Zenodo. Last year's accepted papers are available in the HPCSYSPROS SC18 Workshop Proceedings.

 Contact Information

If you need to contact us, send email to SIGHPC SYSPROS.

 Links