Description
REDuced VIrtual Detector (REDVID) is a simulation framework and a synthetic data generator written in Python. As a reduced simulator, REDVID simulates the propagation of subatomic particles in a virtual detector model with a given geometry, inspired by the detectors installed at the Large Hadron Collider (LHC). The simulation model is complexity-reduced and is intended for generating source data to train Machine Learning (ML) algorithms and to perform ML-assisted solution exploration. The data is in the form of hit point coordinates in space and trajectory function parameters.
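As a concrete (and deliberately naive) illustration of the reduced setting, the sketch below intersects a straight track from the origin with concentric barrel-like layers to obtain hit coordinates. The function and geometry are our own simplification for illustration, not REDVID's actual code.

```python
import math

def linear_track_hits(direction, layer_radii, z_half):
    """Intersect a straight track from the origin with concentric
    cylindrical layers (a toy stand-in for a reduced barrel geometry).

    direction: (dx, dy, dz) 3-vector, need not be unit length;
    layer_radii: increasing layer radii; z_half: detector half-length.
    Returns one (x, y, z) hit per layer the track actually reaches.
    """
    dx, dy, dz = direction
    rho = math.hypot(dx, dy)    # transverse component of the direction
    hits = []
    for r in layer_radii:
        if rho == 0.0:          # track along the beam axis: no barrel hits
            break
        t = r / rho             # parameter where the track crosses radius r
        z = t * dz
        if abs(z) > z_half:     # track exits through the end cap first
            break
        hits.append((t * dx, t * dy, z))
    return hits
```

Hit-coordinate smearing, as in the noisy sample events below, would then amount to adding small random offsets to each returned coordinate.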
Sample events
A few events from simulations with varying recipes are shown for demonstration purposes. The plots below vary in track definitions. The track count is limited to five to improve legibility. From left to right, these plots depict the full event view, the hit points view, and the tracks view, respectively. Note the incorporated detector model geometry as depicted in Figure 1.
The following plots show this virtual detector in a 90-degree rotated orientation. Note that the Z-axis runs through the detector. We keep the scale down in these examples for legibility, i.e., we generate a low number of tracks per event.
3D space, noisy hit coordinates, linear tracks
A sample event with five linear tracks starting at the geometric origin and being randomly directed. The randomisation of these tracks follows the first track randomisation protocol, i.e., Protocol 1 - Last layer hit guarantee. Refer to our relevant publications for further details on track randomisation protocols. Different views for this event are depicted in Figure 2.
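For intuition only, a draw in the spirit of a "last layer hit guarantee" might constrain the polar angle so that a straight track from the origin reaches the outermost layer radius before leaving the detector half-length. The sketch below is a hypothetical reading of such a constraint, not Protocol 1's actual definition; see the publications for the real protocols.

```python
import math
import random

def draw_direction_last_layer(r_last, z_half, rng):
    """Sample a unit direction whose straight-line extrapolation from
    the origin crosses radius r_last within |z| <= z_half (hypothetical
    sketch of a last-layer-hit constraint, not REDVID's protocol)."""
    theta_min = math.atan2(r_last, z_half)  # shallower tracks exit the end cap
    theta = rng.uniform(theta_min, math.pi - theta_min)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))
```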
3D space, noisy hit coordinates, helical uniform tracks
This example event goes a step further in complexity compared to events with linear tracks. Helical uniform tracks do not occur in realistic settings. However, the generated datasets are of independent value for research. Figure 3 depicts such an event with five helical uniform tracks.
3D space, noisy hit coordinates, helical expanding tracks
Helical expanding tracks are the closest type REDVID can generate to real-world tracks. Other complexity-increasing features do not directly influence a track's formation principles. For instance, all these examples have hit point coordinate smearing enabled. Figure 4 showcases an event with five helical expanding tracks, following the same track randomisation protocol as earlier.
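A minimal parametrisation in this spirit, with the transverse radius growing along the curve while z advances linearly, can be sketched as follows. This is purely illustrative; the actual track equations are given in the publications. A uniform helix is the growth = 0 special case.

```python
import math

def helical_expanding_point(t, r0, growth, pitch, phi0=0.0):
    """Point at curve parameter t on a toy 'expanding' helix: the
    transverse radius grows linearly with t (illustrative only)."""
    r = r0 + growth * t               # expanding transverse radius
    x = r * math.cos(phi0 + t)
    y = r * math.sin(phi0 + t)
    z = pitch * t                     # uniform advance along the beam axis
    return (x, y, z)
```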
Feature set
REDVID is highly configurable: many features can be tweaked through the main configuration file according to user requirements. We provide the available and planned features in Table 1, without exhaustive descriptions. Current availability is indicated using status markers:

- Orange: Limited selection
- Red: Under development
| Category | Features |
|---|---|
| Execution config | Anchor path; Multiple output modes; Automated execution parallelism; Automated large job division; Automated batch processing; Automated batch processing parallelism; Performance monitoring; Visualisations; Import/load spawned detectors; Dataset coordinate system |
| Experiment config | Custom/auto experiment ID; Event count; Fixed track count; Variable track count with range; Track direction, designated/random; Shift over the Z-axis |
| Experiment config (2D tracks) | Slope limits; y-intercept limits |
| Experiment config (3D tracks) | Track randomisation protocols; Sub-detector track aggregation; Track type: Linear; Track type: Helical uniform; Track type: Helical expanding; Track type: Multiple types; Track level: Primary tracks; Track level: Secondary tracks; Early terminating tracks; Jet track type: Linear; Jet track level: Primary jets; Jet track level: Secondary jets |
| Experiment config (Hits) | Hit point calculation methods; Hit point smearing; Hit point recording probability; Holes (unrecorded hits) |
| Geometry features | Custom/auto detector ID; Dimension; Detector space Cartesian axis boundaries; Detector space Spherical boundaries |
| Geometry config (2D) | Origin coordinates: (x, y); Sub-det. presence: Pixel, SS, LS; Sub-det. layer count, per type; Sub-det. centre coordinates; Sub-det. layer distance; Sub-det. outer radius; Sub-det. outer-inner radii delta |
| Geometry config (3D) | Origin smearing; Origin smearing type; Origin coordinates: (r, θ, z); Sub-det. presence: Pixel, SS, LS, Barrel; Sub-det. layer count, per type; Sub-det. centre coordinates; Sub-det. layer distance; Sub-det. outer radius; Sub-det. outer-inner radii delta; Sub-det. end z; Sub-det. end-start z delta |
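For a feel of how such configuration knobs combine into a recipe, a sketch is given below. All key names are hypothetical and invented for illustration; the real parameter names and structure are defined in the shipped configuration file.

```python
# Hypothetical recipe sketch -- all key names are illustrative and do
# NOT reflect REDVID's actual configuration file.
recipe = {
    "execution": {
        "parallelism": True,            # automated execution parallelism
        "visualisations": False,        # skip plotting for large runs
    },
    "experiment": {
        "event_count": 10_000,
        "track_count_range": (1, 10),   # variable track count with range
        "track_type": "helical_expanding",
        "randomisation_protocol": 1,    # e.g. last layer hit guarantee
    },
    "hits": {
        "smearing": True,               # hit point coordinate smearing
        "recording_probability": 0.98,  # values < 1.0 introduce holes
    },
}
```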
Code repository
The code is open-source and publicly available. Refer to the included configuration file for a complete list of available parameters and their effects. To understand the overall functionality and usage of the tool, refer to the provided README, RELEASE NOTES, and documentation, as well as the related publication [1].
Datasets
Collections of representative example datasets, generated using the REDVID simulation framework, contain complexity-reduced subatomic particle collision event data for linear [2] and helical [3] tracks. Particle trajectory information and hit coordinates from interactions with reduced-order virtual detector models are included. The data are generated in a 3D domain and follow the cylindrical coordinate system for hit point coordinates in space and trajectory function parameters.
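Since hits are stored in cylindrical coordinates, consumers typically map them to Cartesian space for plotting or as model input. A minimal helper (the function name is ours, not part of the datasets' tooling):

```python
import math

def cyl_to_cart(r, theta, z):
    """Map a cylindrical (r, theta, z) hit coordinate, as stored in
    the datasets, to Cartesian (x, y, z)."""
    return (r * math.cos(theta), r * math.sin(theta), z)
```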
Each of the five included tarballs corresponds to a different data generation recipe. While all recipes include 10000 collision events, the number of tracks per event varies from 1 to 10000, as reflected in the tarball names.
The datasets are intended to be used as synthesised input for research involving ML-assisted pipeline design exploration, as well as ML model design exploration, e.g., Neural Architecture Search (NAS). To understand the data and their generation in detail, refer to the provided README, as well as the related publication [1]. Further details regarding the ML research incorporating these datasets are available in our Connecting The Dots 2023 (CTD 2023) proceedings paper [5].
Publications
Publications and contributions about REDVID
Abstract
Subatomic particle track reconstruction (tracking) is a vital task in High-Energy Physics experiments. Tracking is exceptionally computationally challenging and fielded solutions, relying on traditional algorithms, do not scale linearly. Machine Learning (ML) assisted solutions are a promising answer. We argue that a complexity-reduced problem description and the data representing it, will facilitate the solution exploration workflow. We provide the REDuced VIrtual Detector (REDVID) as a complexity-reduced detector model and particle collision event simulator combo. REDVID is intended as a simulation-in-the-loop, to both generate synthetic data efficiently and to simplify the challenge of ML model design. The fully parametric nature of our tool, with regards to system-level configuration, while in contrast to physics-accurate simulations, allows for the generation of simplified data for research and education, at different levels. Resulting from the reduced complexity, we showcase the computational efficiency of REDVID by providing the computational cost figures for a multitude of simulation benchmarks. As a simulation and a generative tool for ML-assisted solution design, REDVID is highly flexible, reusable and open-source. Reference data sets generated with REDVID are publicly available. Data generated using REDVID has enabled rapid development of multiple novel ML model designs, which is currently ongoing.
Abstract
An example, representative data set is generated using the REDuced VIrtual Detector (REDVID) simulation framework and contains complexity-reduced subatomic particle collision event data. Particle trajectory information and hit coordinates from interactions with reduced-order virtual detector models are included. The data are generated in a 3D domain and follow the cylindrical coordinate system for hit point coordinates in space and trajectory function parameters.

The included five tarballs each belong to a different data generation recipe. While all recipes include 10000 collision events, the number of tracks included in events varies from 1 track per event to 10000 tracks per event. This is noticeable from the tarball names.

The data set is intended to be used as synthesised input for research involving ML-assisted pipeline design exploration, as well as ML model design exploration, e.g., Neural Architecture Search (NAS). To understand the data and its generation in detail, refer to the provided README file, as well as the related publication.
Abstract
Subatomic particle track reconstruction (tracking) is a vital task in High-Energy Physics experiments. Tracking, in its current form, is exceptionally computationally challenging. Fielded solutions, relying on traditional algorithms, do not scale linearly and pose a major limitation for the HL-LHC era. Machine Learning (ML) assisted solutions are a promising answer. Current ML model design practice is predominantly ad hoc. We aim for a methodology for automated search of ML model designs, consisting of complexity reduced descriptions of the main problem, forming a complexity spectrum. As the main pillar of such a method, we provide the REDuced VIrtual Detector (REDVID) as a complexity-aware detector model and particle collision event simulator. Through a multitude of configurable dimensions, REDVID is capable of simulations throughout the complexity spectrum. REDVID can also act as a simulation-in-the-loop, to both generate synthetic data efficiently and to simplify the challenge of ML model design evaluation. Starting from the simplistic end of the spectrum, lesser designs can be eliminated in a systematic fashion, early on. REDVID is not bound by real detector geometries and can simulate arbitrary detector designs. As a simulation and a generative tool for ML-assisted solution design, REDVID is open-source and reference data sets are publicly available. It has enabled rapid development of novel ML models.
Publications and contributions using REDVID
Abstract
Track reconstruction is a vital aspect of High-Energy Physics (HEP) and plays a critical role in major experiments. In this study, we delve into unexplored avenues for particle track reconstruction and hit clustering. Firstly, we enhance the algorithmic design effort by utilising a simplified simulator (REDVID) to generate training data that is specifically composed for simplicity. We demonstrate the effectiveness of this data in guiding the development of optimal network architectures. Additionally, we investigate the application of image segmentation networks for this task, exploring their potential for accurate track reconstruction. Moreover, we approach the task from a different perspective by treating it as a hit sequence to track sequence translation problem. Specifically, we explore the utilisation of Transformer architectures for tracking purposes. Our preliminary findings are covered in detail. By considering this novel approach, we aim to uncover new insights and potential advancements in track reconstruction. This research sheds light on previously unexplored methods and provides valuable insights for the field of particle track reconstruction and hit clustering in HEP.
Abstract
Inspired by the recent successes of language modelling and computer vision machine learning techniques, we study the feasibility of repurposing these developments for particle track reconstruction in the context of high energy physics. In particular, drawing from developments in the field of language modelling we showcase the performance of multiple implementations of the transformer model, including an autoregressive transformer with the original encoder-decoder architecture, and encoder-only architectures for the purpose of track parameter classification and clustering. Furthermore, in the context of computer vision we study a U-net style model with submanifold convolutions, treating the event as an image and highlighting those pixels where a hit was detected.

We benchmark these models on simplified training data utilising a recently developed simulation framework, REDuced VIrtual Detector (REDVID). These data include noisy linear and helical track definitions, similar to those observed in particle detectors from major LHC collaborations such as ATLAS and CMS. We find that the proposed models can be used to effectively reconstruct particle tracks on this simplified dataset, and we compare their performances both in terms of reconstruction efficiency and runtime. As such, this work lays the necessary groundwork for developments in the near future towards such novel machine learning strategies for particle tracking on more realistic data.
Abstract
Track reconstruction is a crucial part of High Energy Physics (HEP) experiments. Traditional methods for the task scale poorly, making machine learning and deep learning appealing alternatives. Following the success of transformers in the field of language processing, we investigate the feasibility of training a Transformer to translate detector signals into track parameters. We study and compare different architectures. Firstly, an autoregressive Transformer model with the original encoder-decoder architecture which reconstructs a particle's trajectory given a few initial hits. Secondly, an encoder-only architecture used as a classifier, producing a class label for each hit in an event, given pre-defined bins within the track parameter space. Lastly, an encoder-only model with the purpose of regressing track parameter values for each hit in an event, followed by clustering.

The Transformer models are benchmarked on simplified datasets generated by the recently developed simulation framework REDuced VIrtual Detector (REDVID) as well as a subset of the TrackML data. The preliminary results of the proposed models show promise for the application of these deep learning techniques on more realistic data for particle reconstruction.

This work has been previously presented at the following conferences: Connecting The Dots 2023 (https://indico.cern.ch/event/1252748/contributions/5521505/), NNV 2023 (https://indico.nikhef.nl/event/4510/contributions/18909/), and ML4Jets2023 (https://indico.cern.ch/event/1253794/contributions/5588602/).
Abstract
Particle track reconstruction is a fundamental aspect of experimental analysis in high-energy particle physics. Conventional methodologies for track reconstruction are suboptimal in terms of efficiency in anticipation of the High Luminosity phase of the Large Hadron Collider. This has motivated researchers to explore the latest developments in deep learning for their scalability and potential enhanced inference efficiency.

We assess the feasibility of three Transformer-inspired model architectures for hit clustering and classification. The first model uses an encoder-decoder architecture to reconstruct a track auto-regressively, given the coordinates of the first few hits. The second model employs an encoder-only architecture as a classifier, using predefined labels for each track. The third model, also utilising an encoder-only configuration, regresses track parameters, and subsequently assigns clusters in the track parameter space to individual tracks.

We discuss preliminary studies on a simplified dataset, showing high success rates for all models under consideration, alongside our latest results using the TrackML dataset from the 2018 Kaggle challenge. Additionally, we present our journey in the adaptation of models and training strategies, addressing the trade-offs among training efficiency, accuracy, and the optimisation of sequence lengths within the memory constraints of the hardware at our disposal.
Abstract
Track reconstruction is a crucial part of High Energy Physics experiments. Traditional methods for the task, relying on Kalman Filters, scale poorly with detector occupancy. In the context of the upcoming High Luminosity-LHC, solutions based on Machine Learning (ML) and deep learning are very appealing. We investigate the feasibility of training multiple ML architectures to infer track-defining parameters from detector signals, for the application of offline reconstruction. We study and compare three Transformer model designs, as well as a U-Net architecture. We describe in detail the two most promising approaches and benchmark the pipelines for physics performance and inference speed on methodically simplified datasets, generated by the recently developed simulation framework, REDuced VIrtual Detector (REDVID). Our second batch of simplified datasets are derived from the TrackML dataset. Our preliminary results show promise for the application of such deep learning techniques on more realistic data for tracking, as well as efficient elimination of solutions.
Abstract
High-Energy Physics experiments are facing a multi-fold data increase with every new iteration. This is certainly the case for the upcoming High-Luminosity LHC upgrade. Such increased data processing requirements forces revisions to almost every step of the data processing pipeline. One such step in need of an overhaul is the task of particle track reconstruction, a.k.a., tracking. A Machine Learning-assisted solution is expected to provide significant improvements, since the most time-consuming step in tracking is the assignment of hits to particles or track candidates. This is the topic of this paper. We take inspiration from large language models. As such, we consider two approaches: the prediction of the next word in a sentence (next hit point in a track), as well as the one-shot prediction of all hits within an event. In an extensive design effort, we have experimented with three models based on the Transformer architecture and one model based on the U-Net architecture, performing track association predictions for collision event hit points. In our evaluation, we consider a spectrum of simple to complex representations of the problem, eliminating designs with lower metrics early on. We report extensive results, covering both prediction accuracy (score) and computational performance. We have made use of the REDVID simulation framework, as well as reductions applied to the TrackML data set, to compose five data sets from simple to complex, for our experiments. The results highlight distinct advantages among different designs in terms of prediction accuracy and computational performance, demonstrating the efficiency of our methodology. Most importantly, the results show the viability of a one-shot encoder-classifier based Transformer solution as a practical approach for the task of tracking.
Abstract
This artefact includes five individual data sets containing particle collision data in virtual detector setups. These data sets are utilised for Machine Learning (ML) model design and training within the publication “TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era”. Three of the data sets are generated using the REDuced VIrtual Detector (REDVID) simulation framework. The other two are reduced versions of the TrackML data set. The full TrackML data set is simulated using Pythia 8 event generator. Refer to the provided README file for further details.
Abstract
TrackFormers is a machine learning framework for track reconstruction in particle physics experiments. It leverages transformer- and U-Net-inspired deep learning architectures to predict particle tracks from hit data. This repository contains 4 directories corresponding to the 4 models described in the paper TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era. EncDec, EncCla, and EncReg are transformer-based models, whereas U-Net is, as the name suggests, a U-Net model. Refer to the provided README file for further details.
Abstract
High-Energy Physics experiments are rapidly escalating in generated data volume, a trend that will intensify with the upcoming High-Luminosity LHC upgrade. This surge in data necessitates critical revisions across the data processing pipeline, with particle track reconstruction being a prime candidate for improvement. In our previous work, we introduced "TrackFormers", a collection of Transformer-based one-shot encoder-only models that effectively associate hits with expected tracks. In this study, we extend our earlier efforts by incorporating loss functions that account for inter-hit correlations, conducting detailed investigations into (various) Transformer attention mechanisms, and a study on the reconstruction of higher-level objects. Furthermore we discuss new datasets that allow the training on hit level for a range of physics processes. These developments collectively aim to boost both the accuracy, and potentially the efficiency of our tracking models, offering a robust solution to meet the demands of next-generation high-energy physics experiments.
Abstract
High-Energy Physics experiments are rapidly escalating in generated data volume, a trend that will intensify with the upcoming High-Luminosity LHC upgrade. This surge in data necessitates critical revisions across the data processing pipeline, with particle track reconstruction being a prime candidate for improvement. In our previous work, we introduced "TrackFormers", a collection of Transformer-based one-shot models that effectively associate hits with expected tracks. In this study, we extend our earlier efforts of model development by incorporating loss functions that account for inter-hit correlations, conducting detailed investigations into (various) Transformer attention mechanisms, and a study on the reconstruction of higher-level objects. Furthermore, we discuss new datasets that allow the training on hit level for a range of physics processes. These developments collectively aim to boost both the accuracy and the potential efficiency of our tracking models, offering a robust solution to meet the demands of next-generation high-energy physics experiments.