Public datasets

This page collects public datasets that are used for machine learning studies at the LHC. The resources come from voluntary contributions by the authors of papers and from challenges. The IML is not responsible for the text on the linked description pages. We want to emphasize that public datasets are typically produced with simplified simulations of the real detectors and with much smaller samples than those available to the collaborations. The best results on simplified simulation with a limited number of samples do not automatically indicate an optimal strategy for application in real experiments. However, there is currently no other option for comparisons, as most of the collaborations' data is not public.

Simplified datasets for benchmarking:

Top tagging without heavy flavour & pileup: Data and details of arXiv:1707.08966
Jet substructure: Data from arXiv:1607.08633 at UC Irvine's MLPhysics page
Flavour tagging without pileup: Data from arXiv:1603.09349 at UC Irvine's MLPhysics page

Datasets for developing simulation:

Data for jet images from LAGAN
Data for 3D jet images from CaloGAN
Electromagnetic jet images

Realistic datasets from the CMS experiment:

CMS open data (non-trivial data format, so knowledge of the CMS software is an advantage; limited in size, with older samples from 2011)

LHC Kaggle challenges:

TrackML particle tracking challenge (second phase running Sep-Nov 2018)
Flavours of Physics Challenge
Higgs Boson Challenge

IML workshop short challenges:
