High performance computing (HPC) resources and workloads are undergoing tumultuous changes. HPC resources are growing more diverse with the adoption of accelerators; HPC workloads have increased in size by orders of magnitude. Despite these changes, when assigning workload jobs to resources, HPC schedulers still rely on users to accurately anticipate their applications' resource usage and remain stuck with the decades-old centralized scheduling model. In this talk we will discuss these ongoing changes and propose alternative models for HPC scheduling based on resource-awareness and fully hierarchical models. A key role in our models' evaluation is played by an emulator of a real open-source, next-generation resource management system. We will discuss the challenges of realistically mimicking the system's scheduling behavior. Our evaluation shows how our models improve scheduling scalability on a diverse set of synthetic and real-world workloads. This is joint work with Stephen Herbein and Michael Wyatt at the University of Delaware, and Tapasya Patki, Dong H. Ahn, Don Lipari, Thomas R.W. Scogland, Marc Stearman, Mark Grondona, Jim Garlick, Tamara Dahlgren, David Domyancic, and Becky Springmeyer at the Lawrence Livermore National Laboratory.
In recent years, the traditional ways of keeping hardware performance growing at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed applications to be developed without worrying too much about the underlying hardware, while computer architects proposed techniques to aggressively exploit Instruction-Level Parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors on a chip. While these designs are able to compensate for the stagnation of clock frequencies, they face multiple problems in terms of power consumption, programmability, resilience, and memory. The solution is to give more responsibility to the runtime system and to let it collaborate tightly with the hardware. The runtime has to drive the design of future multi-core architectures. In this talk, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective. RAA aims at supporting the activity of the parallel runtime system in three ways: first, by enabling fine-grain tasking and supporting the opportunities it offers; second, by improving the performance of the memory subsystem through hybrid hierarchies exposed to the runtime system; and third, by improving performance through the use of vector units. During the talk, we will give a general overview of the problems RAA aims to solve and provide some examples of hardware components supporting the activity of the runtime system in the context of multi-core chips.
Markov models have a long tradition in the modeling and simulation of dynamic systems. In this paper, we look at several properties of a discrete-time Markov chain, including its entropy, trace, and second-largest eigenvalue, to better understand their role in time series analysis. We simulate a number of possible input signals, fit a discrete-time Markov chain to each, and explore the properties with the help of Sobol indices. This research is motivated by recent results in the analysis of cell development for Xenopus laevis in cell biology that relied on the considered entropy measure to distinguish development stages from time series data of calcium levels in cells.
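To make the chain properties concrete, the following sketch (illustrative only; the function names, the toy signal, and the discretization are our own, not taken from the paper) fits a discrete-time Markov chain to a discretized signal and computes its entropy rate, trace, and second-largest eigenvalue with numpy:

```python
import numpy as np

def fit_markov_chain(symbols, n_states):
    """Estimate a row-stochastic transition matrix from a discretized time series."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(symbols[:-1], symbols[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # rows with no observations fall back to a uniform distribution
    return np.divide(counts, rows, out=np.full_like(counts, 1.0 / n_states), where=rows > 0)

def chain_properties(P):
    """Entropy rate, trace, and modulus of the second-largest eigenvalue of P."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    # stationary distribution: eigenvector of P^T associated with eigenvalue 1
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        logP = np.where(P > 0, np.log2(P), 0.0)
    entropy_rate = -np.sum(pi[:, None] * P * logP)
    second_eig = np.sort(np.abs(eigvals))[-2]
    return entropy_rate, np.trace(P), second_eig

# toy 3-state signal: a sine wave discretized into low / middle / high levels
signal = np.digitize(np.sin(np.linspace(0, 20, 500)), bins=[-0.5, 0.5])
print(chain_properties(fit_markov_chain(signal, 3)))
```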
The capture of data about the events executed by a discrete event simulation can easily lead to very large trace data files. While disk space is relatively inexpensive and mostly capable of storing these large trace files, manipulating and analyzing them can prove difficult. Furthermore, some types of analysis must be performed in-core and cannot be carried out when the trace data exceeds the size of the physical RAM of the machine where the analysis is performed. Because of these limits, it is often necessary to strictly limit the simulation run time to satisfy the memory limits of the analysis. Experience with the DESMetrics tool suite (a collection of tools to analyze event trace files) demonstrates that our in-memory analysis tools are limited to trace files on the order of 10GB (on a machine with 24GB of RAM). Furthermore, even when it is possible to analyze large trace files, the analysis can take several days to complete. While high-performance analysis of trace data is not strictly necessary, the results should be available within some reasonably bounded time frame. This paper explores techniques to overcome the limits of analyzing very large event trace files. While out-of-core analysis has been explored as part of this work, out-of-core processing can increase processing time 10-fold. As a result, the work reported here focuses on an approach to capture and analyze small samples from the event trace file. In particular, we examine how closely the analysis of the samples matches the analysis of the full trace file. Two comparison techniques are presented. First, a visual comparison of the analysis results for the full trace and for a trace sample is presented. Second, the differences between the analysis results for the full trace and for the trace sample are quantified using the Wasserstein, directed Hausdorff, and Kolmogorov-Smirnov distance metrics. Finally, the ability to process trace samples from a very large trace file of 80GB is demonstrated.
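As an illustration of the distance-based comparison described above, the snippet below (a hedged sketch on synthetic data; the gamma-distributed "trace" and the sample size are assumptions, not values from the paper) computes the Kolmogorov-Smirnov, Wasserstein, and directed Hausdorff distances between a full data set and a sample drawn from it using scipy:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import directed_hausdorff

rng = np.random.default_rng(42)
full_trace = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # e.g., per-event metric values
sample = rng.choice(full_trace, size=10_000, replace=False)  # small sample drawn from the trace

ks_stat, ks_pvalue = ks_2samp(full_trace, sample)
w_dist = wasserstein_distance(full_trace, sample)
h_dist, _, _ = directed_hausdorff(full_trace.reshape(-1, 1), sample.reshape(-1, 1))

print(f"KS statistic:        {ks_stat:.4f} (p={ks_pvalue:.3f})")
print(f"Wasserstein:         {w_dist:.4f}")
print(f"Directed Hausdorff:  {h_dist:.4f}")
```

Small values of all three metrics indicate that the sampled analysis closely matches the full-trace analysis.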
Assessing the impact of changes in a production network (e.g., new routing protocols or topologies) requires simulation or emulation tools capable of providing results as close as possible to those from a real-world experiment. Large traffic loads and complex control-data plane interactions constitute significant challenges to these tools. To meet these challenges, we propose a model for the fast and convenient evaluation of SDN as well as legacy networks. Our approach emulates the network's control plane and simulates the data plane, achieving the high fidelity necessary for control plane behavior while being capable of handling large traffic loads. We design and implement a proof of concept of the proposed model. The initial results of the prototype, compared to a state-of-the-art solution, show that it can increase the speed of network experiments by nearly 95% in the largest tested network scenario.
Recent trends towards data-driven methods may require a substantial rethinking of the practice of Modeling & Simulation (M&S). Machine Learning (ML) is now becoming an instrumental tool for developing new insights or improving already established knowledge. Reflecting this broad scope, the paper presents a conceptual framework to guide the integration of simulation models with ML. At its core, our approach is based on the premise that system knowledge can be (partially) captured and learned from data in an automated manner aided by ML. We conceive that the approach can help realise adaptive simulation models that learn to change their behaviour in response to behavioural changes in the actual system of interest. Broadly, the study is conceived to foster new ideas and speculative directions towards integrating the practice of M&S with data-driven knowledge learned by ML.
The growth in many-core CPUs has motivated the development of shared-memory, multithreaded solutions to minimize communication and synchronization overheads in Parallel Discrete Event Simulation (PDES). Analogous capabilities, such as Cross Memory Attach (CMA) based approaches, have been added to Message Passing Interface (MPI) libraries. CMA permits MPI processes to directly read/write data from/to a different process's virtual memory space to exchange messages. This paper compares the performance of a CMA-capable, MPI-based version with that of our fine-tuned multithreaded version. The paper also discusses the implementation and optimization of the multithreaded infrastructure to elucidate the design alternatives being compared and assessed. Our experiments, conducted using 2-28 threads and a fine-grained (0.7 us per event) version of the PHOLD benchmark, show that message passing outperforms multithreading (by 10%-20%) in many scenarios but underperforms in others. The complex performance landscape inferred from our experiments suggests that a more in-depth analysis of model characteristics is needed to decide between shared-memory multithreading and message-passing approaches.
Parallel application performance models provide valuable insight into performance on real systems. Tools capable of fast, accurate, and comprehensive prediction and evaluation of high-performance computing (HPC) applications and system architectures are therefore of great value. This paper presents PyPassT, an analysis-based modeling framework built on static program analysis and integrated simulation of the target HPC architectures. More specifically, the framework analyzes application source code written in C with OpenACC directives and transforms it into an application model describing its computation and communication behavior (including CPU and GPU workloads, memory accesses, and message-passing transactions). The application model is then executed on a simulated HPC architecture for performance analysis. Preliminary experiments demonstrate that the proposed framework can represent the runtime behavior of benchmark applications with good accuracy.
We describe an Abstract Model for Diffusion Processes to simulate diffusion processes in multiplex dynamic networks using formal modeling and simulation (M&S) methodologies (in this case, the DEVS formalism). This approach helps users implement diffusion processes over a network using the network specification and the diffusion rules. The result of combining the network specification and the diffusion rules is an Abstract Model for Diffusion Processes, which is formally defined in DEVS and can be converted into a computerized model. Using the proposed Abstract Model for Diffusion Processes, we can study a diffusion process in multiplex networks with a formal simulation algorithm, improving the model's definition. We present a case study using the CDBoost simulation engine.
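For intuition only, the following sketch shows one plain-Python way to combine a multiplex network specification with per-layer diffusion rules; it is not the DEVS-based Abstract Model of the paper, and the data layout and toy rules are assumptions:

```python
def diffusion_step(layers, state, rules):
    """One step of a rule-based diffusion over a multiplex network (illustrative sketch).
    layers: {layer_name: {node: set_of_neighbours}}   -- the network specification
    state:  {node: value}                             -- current node states
    rules:  {layer_name: f(own_value, neighbour_values) -> new_value}  -- diffusion rules
    Layers are applied in sequence within a step; neighbour values are read from the
    previous step's state."""
    new_state = dict(state)
    for name, adjacency in layers.items():
        rule = rules[name]
        for node, neighbours in adjacency.items():
            new_state[node] = rule(new_state[node], [state[n] for n in neighbours])
    return new_state

# toy multiplex network: two layers over the same three nodes
layers = {"social":   {0: {1}, 1: {0, 2}, 2: {1}},
          "physical": {0: {2}, 1: set(), 2: {0}}}
rules = {"social":   lambda v, nbrs: max([v] + nbrs),               # adopt the highest level seen
         "physical": lambda v, nbrs: 1 if (v or any(nbrs)) else 0}  # simple contagion
state = {0: 1, 1: 0, 2: 0}
print(diffusion_step(layers, state, rules))
```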
The share-everything PDES (Parallel Discrete Event Simulation) paradigm is based on fully sharing the possibility to process any individual event across concurrent threads, rather than binding Logical Processes (LPs) and their events to threads. It allows concentrating, at any time, the computing power---the CPU cores on board a shared-memory machine---on the unprocessed events that stand closest to the current commit horizon of the simulation run. This fruitfully biases the delivery of the computing power towards the hot portion of the model execution trajectory. In this article we present an innovative share-everything PDES system that provides (1) fully non-blocking coordination of the threads when accessing shared data structures and (2) fully speculative processing capabilities---Time Warp style processing---of the events. As we show via an experimental study, our proposal can cope with hard workloads where both classical Time Warp systems---based on LPs-to-threads binding---and previous share-everything proposals---not able to exploit fully speculative processing of the events---tend to fail in delivering adequate performance.
The question of the energy consumed by synchronization algorithms for distributed simulation programs is addressed. The concept of zero energy synchronization is introduced wherein a distributed simulation program incurs no additional energy cost for synchronization. A theoretical approach to achieving zero energy synchronization using an oracle is described. An energy efficient implementation of the YAWNS algorithm, termed Low Energy YAWNS (LEY) is presented. It is shown that LEY can yield, in principle, zero energy synchronization for many classes of applications. Preliminary experimental results are presented showing that LEY achieves significantly less energy consumption compared to a conventional implementation of YAWNS that does not consider energy use as a design goal. Further, these experimental results indicate that LEY achieves energy consumption only modestly greater than that of an execution of the same application using an oracle for the test cases that were examined. These results suggest that it may be feasible to develop practical distributed simulation synchronization algorithms that approach zero energy synchronization.
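For readers unfamiliar with YAWNS, the sketch below shows the basic window-based synchronization pattern that an implementation such as LEY builds on; the structure and names are illustrative assumptions, and the energy-related mechanisms of LEY are not represented:

```python
import heapq

def yawns_window(lps, lookahead):
    """One round of a YAWNS-style conservative synchronization protocol (sketch).
    lps: list of per-LP event heaps containing (timestamp, event) tuples.
    Returns the window bound, or None when all heaps are empty."""
    pending = [lp[0][0] for lp in lps if lp]
    if not pending:
        return None
    bound = min(pending) + lookahead   # no new event can arrive with a smaller timestamp
    for lp in lps:
        while lp and lp[0][0] < bound:
            ts, event = heapq.heappop(lp)
            # process 'event' here; any event it schedules for another LP carries a
            # timestamp of at least ts + lookahead and so falls outside this window
    return bound
```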
Controlling power usage has become a core objective in modern computing platforms. In this article we present an innovative Time Warp architecture oriented to efficiently run parallel simulations under a power cap. Our architectural organization considers power usage as a foundational design principle, as opposed to classical power-unaware Time Warp design. We provide early experimental results showing the potential of our proposal.
The data structure that handles the pending event set of a discrete event simulator is a critical component, in that its performance has a direct impact on that of the overall simulation engine. Many data structures have been proposed in the literature. Among them, the Ladder Queue (LadderQ) claims $O(1)$ amortized access time. However, empirical results show that achieving such performance in practice is highly dependent on the distribution of event timestamps, and that in many cases the LadderQ performs similarly to, or even worse than, heap-based priority queues. This paper proposes an adaptive extension of the LadderQ which overcomes most of its weaknesses and achieves $O(1)$ amortized access time in practice.
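To illustrate the bucket-transfer idea behind the LadderQ, here is a heavily simplified, single-rung sketch (our own construction, not the paper's adaptive extension, which uses multiple rungs and adapts the bucket width to the timestamp distribution):

```python
import bisect

class BucketedQueue:
    """Simplified LadderQ-style pending event set: events are held unsorted in 'top'
    and transferred one timestamp bucket at a time into a small sorted 'bottom' list,
    so sorting cost is amortized over many operations."""

    def __init__(self):
        self.top = []                      # unsorted events with ts above the threshold
        self.bottom = []                   # sorted events with ts up to the threshold
        self.threshold = float("-inf")
        self._seq = 0                      # tie-breaker for equal timestamps

    def insert(self, ts, event):
        entry = (ts, self._seq, event)
        self._seq += 1
        if ts <= self.threshold:
            bisect.insort(self.bottom, entry)
        else:
            self.top.append(entry)

    def pop(self):
        if not self.bottom:
            if not self.top:
                raise IndexError("pop from an empty queue")
            ts_min = min(e[0] for e in self.top)
            ts_max = max(e[0] for e in self.top)
            width = max((ts_max - ts_min) / len(self.top), 1e-9)
            self.threshold = ts_min + width
            self.bottom = sorted(e for e in self.top if e[0] <= self.threshold)
            self.top = [e for e in self.top if e[0] > self.threshold]
        ts, _, event = self.bottom.pop(0)
        return ts, event
```

The invariant is that every event in 'bottom' has a timestamp no larger than the threshold and every event in 'top' a larger one, so the head of 'bottom' is always the global minimum.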
Dead reckoning is an important technique used in distributed virtual environments (DVEs) to mitigate the bandwidth consumption of frequent state updates and the negative effects of network latency. This paper proposes a novel dead reckoning approach for common DVE applications such as multiplayer online games. Unlike traditional dead reckoning approaches that estimate the movements of remote entities with purely kinematic models, the new approach performs extrapolations that take environmental factors and human behaviours into consideration. We have performed experiments, based on a distributed car simulator, to compare the new approach with representative existing dead reckoning approaches. The results show that the new approach gives more accurate predictions with an acceptable overhead.
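For reference, the traditional purely kinematic baseline that the paper improves upon can be sketched as follows (illustrative names and threshold values; the proposed approach additionally folds in environmental factors and human behaviour, which is not shown here):

```python
import math

def extrapolate(p, v, a, dt):
    """Second-order kinematic dead reckoning: predict the position from the last
    received position p, velocity v, and acceleration a after dt seconds."""
    return (p[0] + v[0] * dt + 0.5 * a[0] * dt * dt,
            p[1] + v[1] * dt + 0.5 * a[1] * dt * dt)

def needs_update(true_pos, last_sent_pos, last_sent_vel, last_sent_acc, dt, threshold=0.5):
    """Sender-side check: transmit a state update only when the remote peers'
    prediction would deviate from the true position by more than the threshold."""
    predicted = extrapolate(last_sent_pos, last_sent_vel, last_sent_acc, dt)
    return math.dist(true_pos, predicted) > threshold
```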
Traditionally, the model time in agent-based simulations is advanced in fixed time steps. However, a purely time-stepped execution is inefficient in situations where the states of individual agents are independent of other agents and thus easily predictable far into the simulated future. In this work, we propose a method to accelerate microscopic traffic simulations based on identifying independence among agent state updates. Instead of iteratively updating an agent's state throughout a sequence of time steps, a computationally inexpensive "fast-forward" function advances the agent's state to the time of its earliest possible interaction with other agents. To demonstrate the approach in practice, we present an algorithm to efficiently determine intervals of independence in microscopic traffic simulations and derive a fast-forward function for the popular Intelligent Driver Model (IDM). In contrast to existing acceleration approaches based on reducing the level of model detail, our approach retains the microscopic nature of the simulation. A performance evaluation is carried out in a synthetic scenario and on the road network of the city of Singapore. At low traffic densities, we achieved a speedup of up to 2.8, whereas at the highest considered densities, only a few opportunities for fast-forwarding could be identified. The algorithm parameters can be tuned to control the overhead of the approach.
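A minimal sketch of the fast-forward idea, assuming the free-road term of the IDM and coarse numerical steps (the paper derives a dedicated fast-forward function rather than stepping numerically; all parameter values here are illustrative):

```python
def idm_free_acceleration(v, v_desired=33.3, a_max=1.5, delta=4.0):
    """Free-road acceleration term of the Intelligent Driver Model (no leading vehicle)."""
    return a_max * (1.0 - (v / v_desired) ** delta)

def fast_forward(position, speed, t, t_next_interaction, dt=0.5):
    """Advance a vehicle through its independence interval [t, t_next_interaction) in a
    few coarse steps, skipping the per-time-step interaction checks that are unnecessary
    while no other agent can influence it."""
    while t < t_next_interaction:
        h = min(dt, t_next_interaction - t)
        a = idm_free_acceleration(speed)
        position += speed * h + 0.5 * a * h * h
        speed = max(0.0, speed + a * h)
        t += h
    return position, speed, t
```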
In distributed simulation, communication based on publish/subscribe generates a large amount of irrelevant data transmission, thereby degrading performance. To solve this problem, the HLA standard defines data distribution management to filter unnecessary communication. Among several well-known interest matching algorithms, the sort-based algorithm has been proven to be the most efficient in most scenarios. However, the potential of the existing sort-based algorithm has not been fully exploited: the overhead of sorting the bounds can be further reduced, and a portion of unnecessary bit operations can be eliminated. In this paper, we propose a binary search enhanced sort-based interest matching algorithm (BSSIM). Based on a different necessary and sufficient condition for judging interval overlap, the size of the list to be sorted can be remarkably reduced. Moreover, unnecessary bit operations can be eliminated by binary searches. Experimental results show that the BSSIM algorithm outperforms the sort-based algorithm, with approximately 64%-159% performance improvement in different scenarios.
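The core overlap test and the role of binary search can be illustrated with the following simplified one-dimensional matcher (a sketch under our own assumptions, not the BSSIM algorithm itself):

```python
import bisect

def match_regions(subscriptions, updates):
    """Simplified 1-D interest matching: two extent regions overlap iff
    sub_lo <= upd_hi and sub_hi >= upd_lo. Sorting the subscription lower bounds once
    lets a binary search discard, for each update region, every subscription whose
    lower bound already exceeds the update's upper bound."""
    order = sorted(range(len(subscriptions)), key=lambda i: subscriptions[i][0])
    lower_bounds = [subscriptions[i][0] for i in order]
    matches = []
    for j, (upd_lo, upd_hi) in enumerate(updates):
        k = bisect.bisect_right(lower_bounds, upd_hi)   # candidates with sub_lo <= upd_hi
        for idx in order[:k]:
            if subscriptions[idx][1] >= upd_lo:         # second half of the overlap test
                matches.append((idx, j))
    return matches

print(match_regions([(0, 5), (10, 20), (18, 25)], [(4, 12), (30, 40)]))
# [(0, 0), (1, 0)]
```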
Graphics processing units (GPUs) have been shown to be well-suited to accelerate agent-based simulations. A fundamental challenge in agent-based simulations is the resolution of conflicts arising when agents compete for simulated resources, which may introduce substantial overhead. A variety of conflict resolution methods on the GPU have been proposed in the literature. In this paper, we systematize and compare these methods and propose two simple new variants. We present performance measurements on the example of the well-known segregation model. We show that the choice of conflict resolution method can substantially affect the simulation performance. Further, although methods in which agents actively indicate their interest in a resource require the use of costly atomic operations, these methods generally outperform the alternatives.
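The "agents actively indicate interest" family of methods can be illustrated with a CPU analogue in which a scatter-min stands in for the GPU's atomicMin (a hedged sketch; the function and its tie-breaking scheme are our own, not taken from the paper):

```python
import numpy as np

def resolve_conflicts(target_cells, num_cells, rng):
    """Resolve agent-movement conflicts by letting every agent 'claim' its target cell
    with a random priority; the lowest priority wins. np.minimum.at plays the role of
    the atomicMin operation used in GPU-based conflict resolution (illustrative CPU
    analogue, not an actual GPU kernel)."""
    n = len(target_cells)
    priorities = rng.permutation(n)                      # unique random priority per agent
    best = np.full(num_cells, n, dtype=np.int64)         # n acts as the "no claim" sentinel
    np.minimum.at(best, target_cells, priorities)        # scatter-min over target cells
    priority_to_agent = np.empty(n, dtype=np.int64)
    priority_to_agent[priorities] = np.arange(n)
    winners = np.full(num_cells, -1, dtype=np.int64)
    claimed = best < n
    winners[claimed] = priority_to_agent[best[claimed]]
    return winners                                       # winners[c] = agent id, or -1

rng = np.random.default_rng(7)
print(resolve_conflicts(np.array([3, 3, 1, 4]), num_cells=6, rng=rng))
```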
Methods combining deterministic and stochastic concepts present an efficient alternative to a purely stochastic treatment of biochemical models. Traditionally, those methods split biochemical reaction networks into one set of slow reactions that is computed stochastically and one set of fast reactions that is computed deterministically. Applying those methods to multi-level models with dynamic nestings requires coping with dynamic reaction networks that change over time. In addition, in the case of large populations of nested entities, stochastic events can still decrease runtime performance significantly, as reactions of dynamically nested entities are inherently stochastic. In this paper, we apply a hybrid simulation algorithm combining deterministic and stochastic concepts, including an approximation control, to multi-level models. Further, we present an extension of this simulation algorithm that applies an additional approximation by executing multiple independent stochastic events simultaneously in one simulation step. The algorithm has been implemented in the rule-based multi-level modeling language ML-Rules. Its impact on speed and accuracy is evaluated based on simulations performed with a model of Dictyostelium discoideum amoebas.
The Functional Mock-up Interface (FMI) standard is becoming an essential solution for co-simulation. In this paper, we address a specific issue which arises in the context of Distributed Cyber-Physical System (DCPS) co-simulation, where Functional Mock-up Units (FMUs) need to interact with distributed application models. The core of the problem is that, in general, complex distributed application behaviors cannot be easily and accurately captured by a modeling formalism but are instead directly specified using a standard programming language. As a consequence, the model of a distributed application is often a concurrent program. The challenge is then to bridge the gap between this programmatic description and the equation-based framework of FMI in order to make FMUs interact with concurrent programs. In this article, we show how we use the unique execution model of the SimGrid simulation platform to tackle this issue. The platform manages the co-evolution and the interaction between IT models and the different concurrent processes which compose a distributed application code. Thus, SimGrid offers a framework to mix models and concurrent programs. We then show how we specify an FMU as a SimGrid model to solve the DCPS co-simulation issues. Compared to other works in the literature, our solution is not limited to a specific use case and benefits from the versatility and scalability of SimGrid.
This paper explores the improvement of a non-iterative co-simulation master. A simple hybrid system is depicted and analyzed. Based on this analysis, guidelines for determining the calling sequence are introduced. These guidelines enable the implementation of a new constraint programming algorithm. The algorithm allows a better calling sequence to be selected based solely on information about the connections of the co-simulation network. The algorithm is validated on the example of the co-simulation of a hybrid electric vehicle. In this example, the constraint programming algorithm found a subjectively good calling sequence without any involvement of the model developer.
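As a simplified illustration of deriving a calling sequence from connection information alone, the sketch below uses a plain topological ordering under the assumption of direct feed-through on every connection and no algebraic loops; the paper's constraint programming algorithm handles the general case and is not reproduced here:

```python
from collections import defaultdict, deque

def calling_sequence(fmus, connections):
    """Derive a calling order for a non-iterative co-simulation master from the
    connection graph: call an FMU only after all FMUs feeding its inputs have been
    called (hypothetical sketch, not the paper's constraint programming algorithm)."""
    indegree = {f: 0 for f in fmus}
    successors = defaultdict(list)
    for source, sink in connections:          # (output FMU, input FMU)
        successors[source].append(sink)
        indegree[sink] += 1
    ready = deque(f for f in fmus if indegree[f] == 0)
    order = []
    while ready:
        fmu = ready.popleft()
        order.append(fmu)
        for nxt in successors[fmu]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(fmus):
        raise ValueError("algebraic loop: no feed-through-free ordering exists")
    return order

print(calling_sequence(["engine", "battery", "controller"],
                       [("controller", "engine"), ("controller", "battery")]))
```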
Reaction selection is a major and time-consuming step of stochastic simulation algorithms. Current approaches focus on constant sets of reactions. However, in the case of multiple agents whose behaviors are governed by diverse reactions at multiple levels, the number and structure of agents and the number of reactions vary during simulation. Therefore, we equip different variants of stochastic simulation algorithms with strategies to handle dynamic sets of reactions. We implement the next reaction method with a heap and the direct reaction method with two tree-based selection strategies, compare their performance, and discuss open questions for future research.
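One possible way to let a heap-based next-reaction-style scheduler cope with reactions that appear and disappear is lazy invalidation, sketched below (an illustration under our own assumptions, not necessarily the strategy implemented in the paper):

```python
import heapq
import math
import random

class DynamicNextReactionQueue:
    """Heap-based next-reaction-style scheduler with lazy invalidation, so that
    reactions can be added and removed while the simulation runs."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.heap = []            # entries: (putative_time, version, reaction_id)
        self.version = {}         # reaction_id -> latest valid version
        self.now = 0.0

    def schedule(self, reaction_id, propensity):
        """(Re)schedule a reaction; any older heap entry for it becomes stale."""
        self.version[reaction_id] = self.version.get(reaction_id, 0) + 1
        if propensity > 0.0:
            putative = self.now + self.rng.expovariate(propensity)
            heapq.heappush(self.heap, (putative, self.version[reaction_id], reaction_id))

    def remove(self, reaction_id):
        """Remove a reaction (e.g., its nested entity disappeared) without touching the heap."""
        self.version[reaction_id] = self.version.get(reaction_id, 0) + 1

    def next_reaction(self):
        """Pop the earliest still-valid reaction, skipping stale entries."""
        while self.heap:
            putative, version, reaction_id = heapq.heappop(self.heap)
            if self.version.get(reaction_id) == version:
                self.now = putative
                return reaction_id, putative
        return None, math.inf
```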
Many runs of a computer simulation are needed to model uncertainty and evaluate alternate design choices. Such an ensemble of runs often contains many commonalities among the different individual runs. Simulation cloning is a technique that capitalizes on this fact to reduce the amount of computation required by the ensemble. We propose granular cloning, which allows the sharing of state and computations at the scale of simulation objects as small as individual variables, offering savings in computation and memory, increased parallelism, and improved tractability of sample path patterns across multiple runs. The ensemble produces results that are identical to separately executed runs. Whenever simulation objects interact, granular cloning resolves their association to subsets of runs through binary operations on tags. Algorithms and computational techniques required to efficiently implement granular cloning are presented. Results from an experimental study using a cellular automata-based transportation simulation model and a coupled transportation and land use model are presented, providing evidence that the approach can yield significant speedups relative to brute-force replicated runs.
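The tag-based splitting of shared state can be sketched as follows (hypothetical names; run subsets are encoded as bitmasks so that set intersection and difference become bitwise operations):

```python
from dataclasses import dataclass

@dataclass
class SharedCopy:
    """One physical copy of a simulation variable, shared by the subset of ensemble
    runs whose bits are set in 'tag' (hypothetical sketch of the tagging idea)."""
    value: object
    tag: int          # bit i set  <=>  run i uses this copy

def write(copy, writer_tag, new_value):
    """A write that applies only to the runs in writer_tag splits the shared copy:
    the affected runs get a clone holding the new value, the rest keep the old one."""
    affected = copy.tag & writer_tag
    untouched = copy.tag & ~writer_tag
    result = []
    if affected:
        result.append(SharedCopy(new_value, affected))
    if untouched:
        result.append(SharedCopy(copy.value, untouched))
    return result

# one copy shared by runs {0,1,2,3}; runs {1,3} observe a different event outcome
print(write(SharedCopy(value=42, tag=0b1111), writer_tag=0b1010, new_value=99))
# [SharedCopy(value=99, tag=10), SharedCopy(value=42, tag=5)]
```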
Over the years, Parallel Discrete Event Simulation (PDES) has been enriched with programming facilities to bypass state disjointness across the concurrent Logical Processes (LPs). New supports have been proposed, offering the programmer alternatives to message passing for coding complex relations among LPs. Along this path we find Event & Cross-State (ECS), which allows writing event handlers that can perform in-place accesses to the state of any LP by simply relying on pointers. This programming model has been shipped with a runtime support enabling concurrent speculative execution of LPs, but limited to shared-memory machines. In this paper, we present the design of a middleware layer that allows ECS to be ported to distributed-memory clusters of machines. A core application of our middleware is to let ECS-coded models be hosted on top of (low-cost) resources from the Cloud. Overall, ECS-coded models no longer demand powerful shared-memory machines to execute in reasonable time. Thanks to our solution, we indeed retain the possibility to rely on the enriched ECS programming model while still enabling deployments of PDES models on convenient (Cloud-based) infrastructures. An experimental assessment of our proposal is also provided.
Parallel Discrete Event Simulations (PDES) running at large scales involve the coordination of billions of very fine-grained events distributed across a large number of processes. At such large scales, optimistic synchronization protocols, such as Time Warp, allow for a high degree of parallelism between processes, but with the additional complexity of managing event rollback and cancellation. This can become especially problematic in models that exhibit imbalance, resulting in low event efficiency, which increases the total amount of work required to run a simulation to completion. Managing this complexity is key to achieving high performance across a wide range of models. In this paper, we address this issue by analyzing the relationship between synchronization cost and event efficiency. We first look at how these two characteristics are coupled via the computation of Global Virtual Time (GVT). We then introduce dynamic load balancing and show how, when combined with low-overhead GVT computation, we can achieve higher efficiency with less synchronization cost. In doing so, we achieve up to 2x better performance on a variety of benchmarks and models of practical importance.
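For context, a fully synchronous GVT estimate can be sketched with a single MIN-reduction as below (a hedged illustration using mpi4py; real Time Warp engines, including the low-overhead scheme discussed in the paper, use asynchronous algorithms to avoid paying this collective synchronization cost frequently):

```python
from mpi4py import MPI

def compute_gvt(local_virtual_time, min_unacked_sent_ts):
    """Simplified, fully synchronous GVT estimate: a global MIN-reduction over each
    rank's local virtual time and the smallest timestamp of any message it has sent
    that is not yet known to be received."""
    comm = MPI.COMM_WORLD
    local_min = min(local_virtual_time, min_unacked_sent_ts)
    return comm.allreduce(local_min, op=MPI.MIN)

# on each rank: gvt = compute_gvt(my_lvt, my_min_in_flight_ts); events and saved
# states with timestamps strictly below gvt can then be fossil-collected.
```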
We present a fine-grained load migration protocol intended for parallel discrete event simulation (PDES) of spatially extended models. Typical models have domains that are fine-grained discretizations of some volume, e.g., a cell, using an irregular three-dimensional mesh, where most events span several voxels. Phenomena of interest in, e.g., cellular biology, are often non-homogeneous and migrate over the simulated domain, making load balancing a crucial part of a successful PDES. Our load migration protocol is local in the sense that it involves only those processors that exchange workload, and does not affect the running parallel simulation. We present a detailed description of the protocol and a thorough proof for its correctness. We combine our protocol with a strategy for deciding when and what load to migrate, which optimizes both for load balancing and inter-processor communication using tunable parameters. Our evaluation shows that the overhead of the load migration protocol is negligible, and that it significantly reduces the number of rollbacks caused by load imbalance. On the other hand, the implementation mechanisms that we added to support fine-grained load balancing incur a significant cost.