Program

HPC Summer School – University of Trento
17-21 June 2024

Mon. 17 June 2024

Morning

9:00-10:00

Architecture Fundamentals

F. Mantovani

TBD.


11:00-11:15

Coffee Break

11:15-13:15

A RISC-V vector CPU for High-Performance Computing: architecture, platforms and tools to make it happen

F. Mantovani

RISC-V is an open, royalty-free instruction set architecture (ISA) for computer processors. It is designed to be simple, modular, and extensible, with a minimalist approach to instruction set design that aims to provide flexibility, performance, and energy efficiency. The class provides an introduction to RISC-V and vector supercomputing. A particular focus will be given to the RISC-V vector extension (RVV), and especially to an implementation using large vectors. Students will learn how RVV compares to other vector architectures and explore a design point leveraging up to 16-kbit-wide (16,384-bit) vectors. Students will be exposed to the methodology, tools, and libraries available for vectorization, as well as the challenges and limitations that come with them. A prototype platform implementing a European RISC-V CPU supporting the RVV extension will be presented for testing and analyzing simple codes and parallel scientific applications.
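
As a taste of the vector-length-agnostic style that RVV encourages, here is a minimal SAXPY sketch written with the RVV 1.0 C intrinsics (an illustrative example, not course material; it assumes a compiler with riscv_vector.h support, e.g. recent Clang or GCC with -march=rv64gcv):

    #include <riscv_vector.h>
    #include <cstddef>

    // SAXPY: y[i] += a * x[i], strip-mined so that the same binary runs on
    // any hardware vector length (vsetvl reports the elements per pass).
    void saxpy(std::size_t n, float a, const float *x, float *y) {
        while (n > 0) {
            std::size_t vl = __riscv_vsetvl_e32m8(n);        // elements this pass
            vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);  // load a chunk of x
            vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);  // load a chunk of y
            vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);     // vy += a * vx
            __riscv_vse32_v_f32m8(y, vy, vl);                // store back to y
            n -= vl; x += vl; y += vl;
        }
    }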

13:15-14:30

Lunch

Afternoon

14:30-16:30

RISC-V computer architecture: a practical approach

A. Bartolini

The tutorial will consist of a mix of practical hands-on sessions on the Monte Cimone RISC-V cluster, accompanied by short explanations. It will guide students in applying concepts from computer architecture to evaluate the performance of the underlying hardware and to measure its bottlenecks. For this purpose, performance-monitoring tools will be used and practical case studies will be discussed.
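
As an illustration of the kind of measurement such hands-on sessions involve (our example, not the tutorial's material; the array size is illustrative), the sketch below times a STREAM-like triad loop to estimate sustained memory bandwidth, which can then be compared against the platform's peak:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 25;  // ~32M doubles: far beyond last-level cache
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + 3.0 * b[i];   // triad: two loads + one store per element
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        // Three arrays of n doubles cross the memory bus once each.
        std::printf("triad: %.2f GB/s (check: c[0]=%g)\n",
                    3.0 * n * sizeof(double) / s / 1e9, c[0]);
        return 0;
    }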

16:30-16:45

Coffee Break

16:45-17:45

Mentoring

Students and lecturers

17:45-18:30

Mentoring

Students and lecturers

Tue. 18 June 2024

Morning

9:00-10:00

Compilers Fundamentals

G. Agosta

TBD

10:00-11:00

Compiler Construction with the Multi-Level Intermediate Representation

G. Agosta

The talk introduces the concept of intermediate representation (IR) in compiler construction, its motivation and design principles, as well as an overview of IR designs. Then, the Multi-Level Intermediate Representation (MLIR) is introduced, demonstrating its usefulness through examples that show how high level programming concepts can be modelled in MLIR and lowered through successive steps, balancing the needs of target-independent and target-dependent optimization.

11:00-11:15

Coffee Break

11:15-12:15

The Photonic Engine: The Key to the Future of Deep Learning

P. Velha

As we explore the transformative realm of machine learning, we can observe a meteoric rise in the use of deep learning algorithms, fuelled by increasing computational power and vast amounts of unexplored data. This progression has led to ground-breaking advancements across industries. However, it also presents us with an insatiable demand for computational power, one that traditional electronic hardware approaches struggle to meet sustainably. Enter the concept of the Photonic Engine – a beacon of innovation within the photonics technology domain that promises to revolutionize deep learning through advanced photonic integrated circuits designed to harness light not only for interconnection between computational cores but also for computation itself.

Our investigation into this topic unveils the emerging influence of the Photonic Engine on the computational capabilities required for deep learning, contrasting it with the current paradigm of computational techniques. Embarking on this journey, we will dissect the inner workings of photonic chips, their role in hardware-software integration, and their potential to offer computational advantages in signal processing surpassing current hardware accelerators. We'll further examine the obstacles and prospects that pervade photonics technology, paving the way to understand if the Photonic Engine indeed holds the key to the future of deep learning.

12:15-13:15

Writing, Benchmarking, and Reproducibility of HPC Research Papers

D. De Sensi

This talk presents guidelines for writing an HPC research paper. We will start by discussing how to organize and present research ideas.

We will then analyze good and bad practices in benchmarking and results presentation, with practical and interactive examples, and discuss common mistakes that can undermine the meaningfulness and interpretability of results. Finally, we will discuss how to guarantee the reproducibility of research results.
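
As one small illustration of robust results presentation (our example, not the talk's material), the sketch below summarizes hypothetical timing samples with the median and a rank-based 95% confidence interval rather than a bare mean, which an outlier would distort:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Ten hypothetical runtimes in seconds, one of them an outlier.
        std::vector<double> t{10.2, 10.4, 10.1, 15.9, 10.3,
                              10.2, 10.5, 10.3, 10.2, 10.4};
        std::sort(t.begin(), t.end());
        const std::size_t n = t.size();
        const double median = (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
        // Rank-based CI for the median (binomial normal approximation, z = 1.96).
        const double z = 1.96;
        const std::size_t lo = static_cast<std::size_t>(
            std::floor(0.5 * n - z * std::sqrt(0.25 * n)));
        const std::size_t hi = static_cast<std::size_t>(
            std::ceil(0.5 * n + z * std::sqrt(0.25 * n)));
        std::printf("median %.2f s, 95%% CI [%.2f, %.2f]\n",
                    median, t[lo], t[std::min(hi, n - 1)]);
        return 0;
    }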

13:15-14:30

Lunch

Afternoon

14:30-15:30

CINECA Introduction to HPC principles and Leonardo

A. Marani

This short lecture will present an overview of the state of the art of High-Performance Computing, focusing on present technologies and offering a glimpse of the future.

CINECA, the Italian national consortium providing HPC resources for academic and industrial research, will present the infrastructure it makes available to users, with a major focus on Leonardo, the pre-exascale cluster ranked among the ten most powerful supercomputers in the world.

15:30-16:30

OpenMP Tutorial

M. Guernelli

OpenMP is a commonly used parallel programming paradigm that enables work sharing among threads running on the CPU cores of a shared-memory system. The tutorial will present the basic principles of parallelizing a serial C/C++ code with OpenMP, complete with exercises that give students the opportunity to apply the concepts learned during the lectures.
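
As a flavor of what the tutorial covers, here is a minimal sketch of a serial loop parallelized with an OpenMP work-sharing directive (an illustrative example, not the tutorial's material; compile with e.g. -fopenmp):

    #include <omp.h>
    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;
        // The directive splits the loop iterations among threads; reduction
        // gives each thread a private partial sum, combined at the end.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (i + 1);
        std::printf("harmonic sum = %f (max threads: %d)\n",
                    sum, omp_get_max_threads());
        return 0;
    }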

16:30-16:45

Coffee Break

16:45-18:30

OpenMP Tutorial

M. Guernelli

[continuation of the talk after the break]

OpenMP is a commonly used parallel programming paradigm that enables work sharing among threads running on the CPU cores of a shared-memory system. The tutorial will present the basic principles of parallelizing a serial C/C++ code with OpenMP, complete with exercises that give students the opportunity to apply the concepts learned during the lectures.

Wed. 19 June 2024

Morning

9:00-11:00

GPU Programming Fundamentals

B. Cosenza

Programming modern high-performance computing systems is challenging due to the need to efficiently program GPUs and accelerators, handle data movement between nodes, and support optimization for modern workloads such as neural networks. The C++ language has been greatly enhanced in recent years with features that significantly increase productivity. In particular, the C++-based SYCL standard provides a powerful programming model for heterogeneous systems that can target a wide range of devices, including multicore CPUs, GPUs, FPGAs, and accelerators.

The course consists of two sessions. In the first session, attendees will learn about the architecture of modern GPUs, including the execution model, the memory model, and other key concepts. Attendees will be able to write GPU kernels in SYCL using parallel_for semantics and different memory access models. They will also understand the similarities and differences between devices from different vendors, including AMD, Intel, and NVIDIA, and implement basic optimization techniques.

The second session will focus on advanced SYCL topics, including group algorithms, kernel reductions, atomics, and specialization constants. The session will also present ongoing research efforts to design advanced SYCL-based high-level programming semantics that provide advanced optimizations. In particular, we will see SYCL extensions dealing with clusters of accelerators and workload distribution (CELERITY), energy-efficient computing (SYnergy), support for approximate computation (SYprox), and support for tensor units (joint_matrix).
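
As a flavor of the first session's material, here is a minimal vector-addition sketch using SYCL 2020 buffers, accessors, and parallel_for (an illustrative example assuming a SYCL implementation such as DPC++ or AdaptiveCpp is installed):

    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
        constexpr std::size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
        sycl::queue q;  // default selector: a GPU if present, otherwise the CPU
        {
            sycl::buffer<float> ba(a.data(), sycl::range<1>(n));
            sycl::buffer<float> bb(b.data(), sycl::range<1>(n));
            sycl::buffer<float> bc(c.data(), sycl::range<1>(n));
            q.submit([&](sycl::handler &h) {
                sycl::accessor A(ba, h, sycl::read_only);
                sycl::accessor B(bb, h, sycl::read_only);
                sycl::accessor C(bc, h, sycl::write_only, sycl::no_init);
                // One work-item per element; the runtime maps them to the device.
                h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                    C[i] = A[i] + B[i];
                });
            });
        }  // buffer destructors copy the result back into the host vector c
        std::cout << "c[0] = " << c[0] << '\n';  // expect 3
        return 0;
    }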


11:00-11:15

Coffee Break

11:15-13:15

An Introduction to Distributed Memory Programming and Interconnection Networks for High-Performance Computing

D. De Sensi

This talk introduces the basics of two essential aspects of HPC: distributed memory programming and interconnection networks.

We will analyze the challenges in writing scalable distributed memory applications and the impact the underlying interconnection network might have on application performance. We will cover the message-passing model, collective communication algorithms and how they are optimized for the underlying network, in-network computing, and the impact of congestion control and routing on performance variability.
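
To make the message-passing model concrete, here is a minimal sketch of a collective operation in MPI (an illustrative example; the choice of reduction algorithm, and any in-network offload, is left to the MPI implementation):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        double local = rank + 1.0, global = 0.0;
        // Every rank contributes a value and receives the sum; the library
        // picks the reduction algorithm (tree, ring, or in-network offload).
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("sum over %d ranks: %g\n", size, global);
        MPI_Finalize();
        return 0;
    }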

13:15-14:30

Lunch

Afternoon

14:30-16:30

Introduction to Accelerated Computing using OpenACC

N. Shukla

The OpenACC programming model provides a directive-based approach to accelerating parallel computing on heterogeneous architectures, such as GPUs and multi-core CPUs. Developers annotate their code with compiler directives that guide the parallelization and optimization process. OpenACC thus serves as a bridge between productivity and performance in the era of heterogeneous computing. This short tutorial will cover the fundamentals of accelerated computing with OpenACC, including exercises in C/C++ and Fortran.
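
As a flavor of the directive-based approach, here is a minimal sketch of a loop offloaded with OpenACC (an illustrative example, not the tutorial's material; compile with an OpenACC compiler, e.g. nvc++ -acc):

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        // The directive asks the compiler to run the loop on the accelerator;
        // the copy clauses make the required data movement explicit.
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];
        std::printf("y[0] = %f\n", y[0]);  // expect 4
        return 0;
    }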

16:30-16:45

Coffee Break

16:45-17:45

Introduction to Accelerated Computing using OpenACC

N. Shukla

[continuation of the talk after the break]

The OpenACC programming model provides a directive-based approach to accelerating parallel computing on heterogeneous architectures, such as GPUs and multi-core CPUs. Developers annotate their code with compiler directives that guide the parallelization and optimization process. OpenACC thus serves as a bridge between productivity and performance in the era of heterogeneous computing. This short tutorial will cover the fundamentals of accelerated computing with OpenACC, including exercises in C/C++ and Fortran.

17:45-18:30

Mentoring

Thu. 20 June 2024

Morning

9:00-11:00

HPC Interconnects: Trends and Emerging Technologies

S. Di Girolamo, G. Bloch

AI Data-center HW architecture
Artificial intelligence, and specifically deep neural networks, has become the single most interesting application domain. A growing percentage of the world's compute power is expected to be dedicated to the training and inference of neural networks for many tasks. Training large neural network models, such as Large Language Models (LLMs), requires specialized systems; standard datacenters cannot train such models efficiently. In this talk we will focus on the connectivity requirements of large-scale datacenters (AI factories) running LLM workloads, connecting tens of thousands of processing engines (GPUs). We will discuss these requirements and how in-network computing can accelerate today's and tomorrow's AI workloads.

An Introduction to NVIDIA BlueField 3 DPUs
As we scale systems, the performance of communications becomes increasingly critical, necessitating the optimization of these processes. A promising approach to optimization involves delineating specific parts of the communication stack for offloading to the network, thereby enhancing efficiency and conserving CPU cycles. Unlike traditional hardware offloads, Data Processing Units (DPUs) feature programmable engines. These allow both applications and communication libraries to define and delegate tasks for execution directly on the network card, offering a new network programming paradigm. In this talk, we will delve into the programmable components of the NVIDIA BlueField 3 DPU. We'll explore its architecture, the programming models, and potential use cases, providing insights into how these technologies can be leveraged to optimize large-scale communication systems.

11:00-11:15

Coffee Break

11:15-12:15

When CPUs Take a Back Seat: An Autonomous Execution and Profiling Tool for multi-GPUs

D. Unat

In recent times, GPUs have taken the lead as primary accelerators in high-performance systems, concentrating much computational power in GPU clusters. While multi-GPU acceleration benefits many HPC and ML applications, inter-GPU communication, traditionally handled by the host, can hinder scalability. The host controls execution by managing kernels, communication, and synchronization; this CPU involvement can be shifted entirely to the devices, improving performance for multi-GPU applications.

In this talk, I first present a fully autonomous multi-GPU execution model that eliminates CPU involvement after the initial kernel launch. Our CPU-free model leverages techniques such as persistent kernels and device-initiated communication, reducing communication overhead. We validate the model on iterative solvers: 2D/3D Jacobi stencils and Conjugate Gradient (CG). Second, I introduce a multi-GPU communication profiling tool built on NVBit and based on binary instrumentation. The tool tracks both peer-to-peer data transfers and communication library calls, and provides a variety of visualization modes and levels of detail, ranging from a broad overview of data movement across the system to the precise instructions and memory addresses involved.

12:15-13:15

Keynote on Distributed Memory and Interconnects

Next-Generation Accelerated HPC and AI Networks for Cloud and On-Prem Datacenters

T. Hoefler

Accelerated computing has long been a cornerstone of HPC and AI clusters. GPUs and their programming environment CUDA have led to massive efficiency improvements that drove both the AI and HPC industries over the last decade. Now that computation has been optimized significantly, the new bottleneck is data movement, especially over the network. We discuss these trends and show how streaming Processing In the Network (sPIN) can act as the CUDA for data-movement-intensive workloads. The sPIN framework makes it possible to optimize network transactions and admits many different implementations. We will furthermore provide an outlook on Ultra Ethernet, a next-generation network technology being developed to address the future needs of the rapidly expanding HPC and AI markets.

13:15-14:30

Lunch

Afternoon

14:30-16:30

MPI Tutorial

A. Marani

MPI (Message Passing Interface) is a commonly used parallel programming paradigm that enables data exchange between processes in a distributed-memory environment. The tutorial will present the basic principles of parallelizing a serial C/C++ code with MPI, complete with exercises that give students the opportunity to apply the concepts learned during the lectures.
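
As a flavor of what the tutorial covers, here is a minimal sketch of point-to-point message passing between two ranks (an illustrative example, not the tutorial's material; run with e.g. mpirun -np 2):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int token = 0;
        if (rank == 0 && size > 1) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // to rank 1, tag 0
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d from rank 0\n", token);
        }
        MPI_Finalize();
        return 0;
    }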

16:30-16:45

Coffee Break

16:45-17:45

MPI Tutorial

A. Marani

[continuation of the talk after the break]

MPI (Message Passing Interface) is a commonly used parallel programming paradigm that enables data exchange between processes in a distributed-memory environment. The tutorial will present the basic principles of parallelizing a serial C/C++ code with MPI, complete with exercises that give students the opportunity to apply the concepts learned during the lectures.

17:45-18:30

Mentoring

Fri. 21 June 2024

Morning

9:00-11:00

Linear Algebra Libraries for HPC 

P. D'Ambra

Linear Algebra (LA) libraries are key components in any software platform for scientific and engineering computing. The solution of large linear systems and/or the computation of a few eigenvalues/eigenvectors of an operator are indeed at the core of physics-driven numerical simulations relying on partial differential equations, and often represent a main bottleneck in data-driven procedures such as model reduction, graph analysis, and scientific machine learning.

In this lecture I will give an introduction to scientific libraries for dense LA, such as BLAS, Linpack and its successors, which represent “de facto” standard platforms for scientific code development and also benchmarks for new proposals of hardware/software architectures for HPC. Then I will focus on fundamental algorithms for sparse LA and present some recent research efforts to propose new algorithms and software libraries for high-end GPU-accelerated hybrid architectures.
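
To make the role of a dense LA library concrete, here is a minimal sketch calling the standard CBLAS interface for a matrix-matrix product, C = alpha*A*B + beta*C (an illustrative example; link against a BLAS implementation such as OpenBLAS):

    #include <cblas.h>
    #include <cstdio>

    int main() {
        const int n = 2;
        // Row-major 2x2 matrices.
        const double A[] = {1, 2, 3, 4};
        const double B[] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};
        // C = 1.0 * A * B + 0.0 * C
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        std::printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  // 19 22; 43 50
        return 0;
    }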

11:00-11:15

Coffee Break

11:15-12:15

Programming Dynamic and Intelligent Workflows for the Computing Continuum

D. Lezzi

Progress in science is deeply bound to the effective use of high-performance computing infrastructures and to the efficient extraction of knowledge from vast amounts of data. Such data comes from different sources and follows a cycle composed of pre-processing steps for data curation and preparation, subsequent computing steps, and later analysis and analytics steps applied to the results. However, scientific workflows are currently fragmented into multiple components, with different processes for computing and data management, and with gaps between the viewpoints of the user profiles involved. Our vision is that future workflow environments and tools for the development of scientific workflows should follow a holistic approach, where both data and computing are integrated in a single flow built on simple, high-level interfaces. In this presentation, we put forward a novel definition of workflows that integrates the different data and compute processes, together with dynamic runtimes that support the execution of these workflows efficiently, in terms of both performance and energy, on complex and heterogeneous computing infrastructures. These infrastructures span the so-called computing continuum, which includes highly distributed resources, from sensors, instruments, and edge devices to High-Performance Computing and Cloud computing resources.

12:15-13:15

Scalability and Productivity in Genomics on Massively Parallel Systems

G. Guidi

The use of massively parallel systems continues to be crucial for processing large volumes of data at unprecedented speed and for scientific discoveries in simulation-based research areas. Today, these systems also play a central role in new and diverse areas of data science, such as computational biology and data analytics. Computational biology is a key area where data processing is growing rapidly: the growing data volume and complexity have outpaced the processing capacity of single-node machines, making massively parallel systems an indispensable tool.

The diverse and non-trivial challenges of parallelism in these areas require computing infrastructures that go beyond the demands of traditional simulation-based sciences. However, programming on high-performance computing (HPC) systems poses significant productivity and scalability challenges. It is important to introduce an abstraction layer that provides programming flexibility and productivity while ensuring high system performance. As we enter the post-Moore's Law era, effective programming of specialized architectures is critical for improved performance in HPC. As large-scale systems become more heterogeneous, their efficient use for new, often irregular, and communication-intensive data analysis computation becomes increasingly complex. In this talk, we discuss how to achieve performance and scalability on extreme-scale systems while maintaining productivity for new data-intensive biological challenges, and how to achieve high performance on new specialized architectures such as SRAM-based Graphcore IPUs.

13:15-14:30

Lunch