

ATLAS $t\bar{t}$ simulation for ML-based jet flavour tagging (JetSet)

ATLAS collaboration

Cite as: ATLAS collaboration (2025). ATLAS $t\bar{t}$ simulation for ML-based jet flavour tagging (JetSet). CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.QG8W.TO8P

Dataset Derived Simulated Datascience ATLAS CERN-LHC


Description

Flavour-tagging, the task of identifying the flavour of jets, is essential for many physics analyses at the ATLAS experiment. This dataset, released for public use, can be used to train and evaluate machine learning models for jet flavour-tagging at ATLAS. It is intended to encourage broader interest in, and further development of, machine learning techniques that improve flavour-tagging performance.

The dataset consists of approximately 50 million events from simulated top quark pair production at a centre-of-mass energy of 13.6 TeV. It is stored in HDF5 format and contains structured event-level, jet-level, track-level and truth hadron information. This dataset is designed to be compatible with the flavour-tagging algorithm development pipeline used at ATLAS, and is supported by accompanying instructions and example configurations provided in open-source repositories.
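Because this record does not list the names of the groups and datasets inside the HDF5 files, a first step could be to discover them rather than assume them. The following is a minimal sketch using the h5py library. The helper name `describe_hdf5` is hypothetical, and the file path assumes one of the dataset files has been downloaded to the working directory.

```python
import os

import h5py


def describe_hdf5(path):
    """Return {dataset_path: (shape, field_names)} for every dataset in the file.

    field_names is an empty tuple for datasets that are not compound
    (structured) arrays.
    """
    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            fields = obj.dtype.names or ()
            summary[name] = (obj.shape, tuple(fields))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary


if __name__ == "__main__":
    # Assumption: the file sits in the current directory.
    path = "mc-flavtag-ttbar-small.h5"
    if os.path.exists(path):
        for name, (shape, fields) in describe_hdf5(path).items():
            print(f"{name}: shape={shape}, first fields={fields[:5]}")
```

Only metadata is read here, so the sketch should stay fast even on the largest file. The accompanying repository remains the authoritative description of the file layout.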

To improve usability, the dataset is split into three mutually exclusive HDF5 files:

  • mc-flavtag-ttbar-small.h5 — ~1.36 million events (~5.6 million jets)
  • mc-flavtag-ttbar-medium.h5 — ~6.23 million events (~25.6 million jets)
  • mc-flavtag-ttbar-large.h5 — ~41.1 million events (~168 million jets)

Downloading all three files will provide access to the complete dataset. The smaller subsets are useful for quick exploration or prototyping workflows.
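To help choose a subset, the approximate counts quoted in the list above can be turned into per-file shares and average jet multiplicities. This is a small arithmetic sketch only: the numbers are the rounded values from this record, and the helper name `subset_stats` is hypothetical.

```python
# Approximate (events, jets) per file in millions, as quoted in this record.
SUBSETS = {
    "mc-flavtag-ttbar-small.h5": (1.36, 5.6),
    "mc-flavtag-ttbar-medium.h5": (6.23, 25.6),
    "mc-flavtag-ttbar-large.h5": (41.1, 168.0),
}


def subset_stats(subsets):
    """Return the total event count (millions) and per-file summary statistics."""
    total_events = sum(events for events, _ in subsets.values())
    stats = {}
    for name, (events, jets) in subsets.items():
        stats[name] = {
            "event_share": events / total_events,  # fraction of all events
            "jets_per_event": jets / events,       # average jet multiplicity
        }
    return total_events, stats


total, stats = subset_stats(SUBSETS)
print(f"total events: ~{total:.2f} million")
for name, s in stats.items():
    print(f"{name}: {s['event_share']:.1%} of events, "
          f"~{s['jets_per_event']:.1f} jets/event")
```

With these rounded inputs, every file averages roughly four jets per event, and the small file holds a few percent of the full sample.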

Dataset characteristics

48,698,675 events. 3 files. 100.4 GiB in total.
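The totals above give a rough idea of storage cost per event, which in turn allows a back-of-the-envelope estimate of how much disk space a given number of events needs. The sketch below assumes the average on-disk size per event is the same across all files, which this record does not state. The function names are hypothetical.

```python
TOTAL_EVENTS = 48_698_675  # from this record
TOTAL_GIB = 100.4          # from this record
GIB = 2**30                # bytes per GiB


def bytes_per_event(total_gib=TOTAL_GIB, n_events=TOTAL_EVENTS):
    """Average on-disk size of one event, in bytes."""
    return total_gib * GIB / n_events


def estimated_gib(n_events, total_gib=TOTAL_GIB, total_events=TOTAL_EVENTS):
    """Estimated size in GiB of n_events, ASSUMING a uniform size per event."""
    return total_gib * n_events / total_events


print(f"average size: ~{bytes_per_event():.0f} bytes/event")
# ~1.36 million events is the approximate size of the small subset.
print(f"1.36 million events: ~{estimated_gib(1.36e6):.1f} GiB (estimate)")
```

The actual file sizes listed on the portal take precedence over these estimates, since compression and jet multiplicity may differ between files.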

How can you use these data?

A detailed explanation of this dataset and instructions for the pre-processing, training, and evaluation workflows are provided in the accompanying GitLab repository. If this dataset is used in a publication, please cite this dataset record along with the accompanying ATLAS paper describing GN2, an ATLAS flavour-tagging algorithm with a transformer-like architecture.

Transforming Jet Flavour: Documentation and training pipeline

ATLAS GN2 paper ATLAS-FTAG-2023-05


      

Disclaimer

These open data are released under the Creative Commons Zero v1.0 Universal license.


Neither the experiment(s) (ATLAS) nor CERN endorse any works, scientific or otherwise, produced using these data.

This release has a unique DOI that you are requested to cite in any applications or publications.
