Parallel Processing for Energy Efficiency (PP4EE)

NTNU, 3 October 2013, 10:00 – 17:30 in Room R5

Link to Room R5 in Realfagbygget: Room R5

Videos of most of the presentations are now available here

The Computer Architecture and Design research group (CARD) at NTNU, together with the NTNU HPC section, ARM Norway and the EMECS Erasmus Mundus study programme, will host a one-day seminar covering topics within supercomputing, computational science, parallel processing, heterogeneous computing and energy efficiency. The programme will have a mix of invited international speakers and presentations from NTNU, and will cover all the main abstraction levels from supercomputer applications down to parallel languages and multicore architectures. The aim of the seminar is to stimulate collaborative research involving partners from several of the institutions and several of the abstraction layers.

The seminar had 81 participants. Thanks to all the presenters, sponsors and co-organizers, and to all the participants for showing interest and taking part in the discussions. Slides and videos of the presentations will be posted here soon.

(The seminar is also part of course TDT1 - Energy Efficient Multicore Computing)

Final programme

Abstracts for the presentations are found below.

  • 09:30 – 10:00: Registration, coffee
  • 10:00 – 10:05: Welcome (Magnus Jahre, NTNU-EECS project manager)
Morning session (Chairman: Nikita Nikitin)
  • 10:05 – 10:50: Hiroshi Okuda, University of Tokyo, Open-Source Parallel FE Software: FrontISTR, Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters, slides, extra slides
  • 10:50 – 11:25: Georgios Goumas, National Techn. University of Athens, Alleviating memory-bandwidth limitations for scalability and energy efficiency: Lessons learned from the optimization of SpMV, slides
  • 11:25 – 12:00: Magnus Jahre, NTNU - CARD/EECS, The NTNU/IME focus area research project: Energy Efficient Computing Systems (EECS), slides
12:00 – 13:00: Lunch in "Hangaren"
Afternoon session (Chairman: Per Gunnar Kjeldsberg)
  • 13:00 – 13:35: Ana Varbanescu, University of Amsterdam, OpenCL and performance portability, slides
  • 13:35 – 14:10: Javed Absar, ARM (GPU Programming Research Group Leader, Cambridge), The EU project CARP: Correct and Efficient Accelerator Programming, slides
  • 14:10 – 14:45: Juan Manuel Cebrian, NTNU-CARD, Are we Optimizing Hardware for non-optimized Applications? PARSEC's Vectorization Effects on Energy Efficiency and Architectural Requirements, slides
  • 14:45 – 15:10: Break (coffee)
  • 15:10 – 15:55: Guillermo Miranda, Universitat Politècnica de Catalunya and Barcelona Supercomputing Centre, OmpSs with OpenCL and OmpSs/MPI, slides
  • 15:55 – 16:20: Jan Chr. Meyer, NTNU, Energy efficiency on the NTNU supercomputer Vilje, slides
  • 16:20 – 16:40: Break (no coffee)
  • 16:40 – 17:30: Discussion, brainstorming

Presentation abstracts

Hiroshi Okuda

Open-Source Parallel FE Software: FrontISTR, Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters

Abstract: FrontISTR is an open-source structural analysis system that supports a rich set of nonlinear analysis functions. FrontISTR also exhibits an innovative aspect that addresses large-scale application, parallelism, and programmability. A 7.5 billion DOF problem can be solved in 13.7 h using 65,536 cores of “K.” Single-core performance is a crucial factor in FEM, which uses iterative equation solvers, and SpMV (sparse matrix-vector product) is a hotspot there. Cache blocking and a contiguous data structure for the matrix have been investigated to tackle the memory wall problem. Running on notebook PCs, PC clusters and supercomputers including the Earth Simulator 2 and the K-computer, FrontISTR has been used for solving various industrial problems, for example, (1) Dynamic friction behaviors between rail and fast running train’s wheel, (2) Thermal structural deformation of electrical devices, (3) Thermal elastic-plastic residual stress of large-scale welded structures, (4) Friction of power transmission belt, (5) Large strain evaluation of fill rubber tire, (6) Fluid-structure coupled behavior of turbine blades, and so on.
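To make the B/F (byte per flop) notion in the title concrete, the following is a minimal sketch of SpMV together with a rough B/F estimate. It assumes the common CSR (compressed sparse row) format, a square matrix, and a simple cost model in which every array is streamed from memory exactly once. None of this is taken from FrontISTR itself, and the function names `spmv_csr` and `csr_bytes_per_flop` are hypothetical.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def csr_bytes_per_flop(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Rough B/F estimate (assumption: each array is read or written once)."""
    # Matrix traffic: one value and one column index per nonzero, plus row_ptr.
    matrix_bytes = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
    # Vector traffic: read x once and write y once (square matrix assumed).
    vector_bytes = 2 * n_rows * value_bytes
    # One multiply and one add per stored nonzero.
    flops = 2 * nnz
    return (matrix_bytes + vector_bytes) / flops


# Small illustrative matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Under this model the kernel needs several bytes of memory traffic per floating-point operation, which is why SpMV tends to be limited by memory bandwidth rather than by arithmetic throughput.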

FrontISTR poster (2 pages), full abstract as PDF

Georgios Goumas

Alleviating memory-bandwidth limitations for scalability and energy efficiency: Lessons learned from the optimization of SpMV

Abstract: In this talk we will present our approach towards optimizing Sparse Matrix-Vector Multiplication (SpMV) on modern multicore platforms. SpMV is one of the most memory-bandwidth hungry computational kernels, heavily used in a large variety of HPC applications. To cope with this problem we propose a new online storage format for sparse matrices called Compressed Sparse eXtended (CSX). CSX applies aggressive compression to the indexing structure of sparse matrices and is able to store them with significantly reduced memory footprint. When it comes to parallel execution, the scheme achieves remarkable performance improvement and stability for a variety of matrices, both in SMP and NUMA configurations. Based on our findings on CSX, we will also discuss directions for future research on the optimization of resource demanding applications in modern execution platforms.

Magnus Jahre

The NTNU/IME focus area research project: Energy Efficient Computing Systems (EECS)

Abstract: Future computing systems are expected to be a collection of processing elements with different energy/performance characteristics due to the Dark Silicon effect. In such systems, only the subset of processing elements that maximizes energy efficiency for the current application is enabled. At least two research breakthroughs are necessary for this vision to become a reality. First, we need to develop efficient software for heterogeneous systems. Second, we need to identify and implement the most efficient processing cores and accelerators, and integrate them efficiently into the complete system. These breakthroughs can be achieved by experiments on commercially available hardware or through simulations. Unfortunately, the level of heterogeneity of commercial hardware is limited, and the performance overhead of simulation is significant.

To meet these challenges, we propose the Single-ISA Heterogeneous MAny-core Computer (SHMAC). SHMAC is an infrastructure for realizing heterogeneous computing systems from a collection of diverse, generic processing elements based on a common high-level architecture. Using reconfigurable FPGAs, it is possible to rapidly evaluate software and hardware innovations in a collection of systems that are significantly more heterogeneous than what is commercially available. In this presentation, we will focus on the motivation and implementation of SHMAC. We will also cover our future plans and how SHMAC can be leveraged in research collaborations.

Ana Varbanescu

Performance Portability in the Multi-core Era: Myths and Facts

Abstract: The “write-once-run-everywhere” programming models are still seen as a marketing trick in computer science. OpenCL, the newest such model, is no exception: proposed in 2009 as an instrument to address the portability of parallel programming over multiple multi-/many-core architectures, it has been quickly criticized for its lack of “performance portability”.

This talk is intended as a thorough discussion on performance portability. Therefore, it addresses three essential questions: (1) what is performance portability? (2) can we quantify performance portability? (3) can a programming model achieve performance portability?

We provide our vision on answering these three questions, while using the OpenCL programming model and multi-/many-cores architectures as running examples.

Javed Absar

The EU project CARP: Correct and Efficient Accelerator Programming


Abstract: Programming accelerators such as GPUs is accomplished today using low-level APIs such as OpenCL, which raises concerns from the programmer productivity and performance portability perspectives. Programmer productivity is affected because low-level APIs distract the programmer from the actual problem. Performance portability is affected because code optimized for a particular accelerator is unlikely to perform as well on another.

This talk will present a compilation flow that we have developed at ARM, along with other European Partners, which aims to address both concerns. The compilation flow includes VOBLA, a DSL that can compactly represent linear algebra operations, separating functional semantics from implementation details such as storage layouts. Any parallelism inherent in the function is not obscured by implementation details, easing parallel code generation.

VOBLA is compiled into PENCIL, a C99 based platform-neutral compute intermediate language, while retaining sufficient information for generating efficient accelerator code. PENCIL is then compiled into OpenCL code optimized for a specific accelerator using techniques based on the polyhedral model which make use of the retained information.

This is exciting research that promises significant benefits for GPU programming in terms of programmer productivity and performance portability.

Juan Manuel Cebrian

Are we Optimizing Hardware for non-optimized Applications? PARSEC's Vectorization Effects on Energy Efficiency and Architectural Requirements

Abstract: Validation of new architectural proposals against real applications is a necessary step in academic research. However, providing benchmarks that keep up with new architectural changes has become a real challenge. If benchmarks don't cover the most common architectural features, architects may end up under- or overestimating the impact of their contributions.

In this work, we extend the PARSEC benchmark suite with SIMD capabilities to provide an enhanced evaluation framework for new academic/industry proposals. We then perform a detailed energy and performance evaluation on different platforms (Intel® and ARM®) of this commonly used application set. Results show how SIMD code alters scalability, energy efficiency and hardware requirements. Performance and energy efficiency improvements (up to 10x) depend greatly on the fraction of code that we can actually vectorize. We base our code on a custom-built wrapper library compatible with SSE, AVX and NEON to facilitate rapid and general vectorization. We aim to distribute the source code to reinforce the evaluation process of new proposals for future computing systems.

Guillermo Miranda

OmpSs with OpenCL and OmpSs/MPI

Abstract: This talk will introduce the OmpSs programming model and its interoperability with MPI and with OpenCL. OmpSs enables mixing MPI code with OpenMP-like directives, improving IPC and allowing communication overlapping. OmpSs also supports CUDA and OpenCL. Programmers can call GPU kernels without worrying about initialisation (troublesome in OpenCL), memory space management, data copying or device selection. OmpSs is able to schedule work to the available GPUs, and provides ways to run the application across all the available computing resources (CPUs or accelerators).

Jan Chr. Meyer

Energy efficiency on the NTNU supercomputer Vilje

Abstract: This talk will describe the process of profiling application energy consumption on the Vilje supercomputer, using model-specific registers present in the Sandy Bridge architecture. These registers enable software to sample the energy consumption of the processor package and dynamic memory with fine granularity, but implementing access presents several challenges which are compounded in a production environment. A survey of our ongoing work to facilitate energy measurement will be presented, along with an overview of results that have been attained throughout the process.
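As a rough illustration of what sampling such registers involves, the sketch below converts two raw readings of an energy counter into joules. It assumes Intel's RAPL interface (MSR_RAPL_POWER_UNIT at 0x606, MSR_PKG_ENERGY_STATUS at 0x611, MSR_DRAM_ENERGY_STATUS at 0x619) and 32-bit counters that wrap around. The abstract does not describe the tooling actually used on Vilje, so the function names and the sample values are hypothetical, and reading the registers themselves (which requires privileged access) is deliberately left out.

```python
def energy_unit_joules(power_unit_msr):
    """Decode the energy unit from a raw MSR_RAPL_POWER_UNIT (0x606) value.

    Bits 12:8 hold the energy status unit (ESU); one counter tick
    corresponds to 1 / 2**ESU joules.
    """
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_delta_joules(before, after, unit_joules, counter_bits=32):
    """Energy consumed between two raw counter samples.

    The counter wraps around, so the difference is taken modulo
    2**counter_bits. This is only correct if at most one wrap occurred
    between the two samples, which bounds the usable sampling interval.
    """
    ticks = (after - before) % (1 << counter_bits)
    return ticks * unit_joules


# Hypothetical raw values, for illustration only (not measured data).
unit = energy_unit_joules(0xA1003)  # ESU = 16, i.e. 1/65536 J per tick
joules = energy_delta_joules(0xFFFFFFF0, 0x00000010, unit)  # spans a wrap
```

The wraparound handling hints at one of the practical challenges: samples must be taken often enough that the counter cannot wrap more than once between readings.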

Trond Kvamsdal

The NTNU/IME focus area research project Computational Science and Engineering (CSE)

Abstract: See PDF

Organizing committee

  • Bjørn Lindi (NTNU-IT HPC)
  • Asbjørn Djupdal (ARM Norway)
  • Per Gunnar Kjeldsberg (NTNU-IET and EMECS)
  • Birgit Sørgård (NTNU-IDI, adm)
  • Contact person: Lasse Natvig (NTNU-IDI, CARD), Mobile phone: +47 906 44 580

2013/11/24 16:37, Lasse Natvig