This page collects public datasets used for machine learning studies at the LHC. The resources are voluntary contributions from the authors of papers and from challenges. The IML is not responsible for the text on the linked description pages. We want to emphasize that public datasets are typically produced with simplified simulations of the real detectors and contain much smaller samples than those available to the collaborations. The best results obtained on simplified simulation with a limited number of samples do not automatically indicate an optimal strategy for real experiments; however, there is currently no other option for comparisons, as most of the collaborations' data is not public.
Simplified datasets for benchmarking:
- Top tagging without heavy flavour & pileup: Data and details of arXiv:1707.08966
- Jet substructure: Data from arXiv:1607.08633 at UC Irvine's MLPhysics page
- Flavour tagging without pileup: Data from arXiv:1603.09349 at UC Irvine's MLPhysics page
Datasets for developing simulation:
- Data for jet images from LAGAN
- Data for 3D jet images from CaloGAN
- Electromagnetic jet images
Realistic datasets from the CMS experiment:
- CMS open data (non-trivial data format, for which knowledge of the CMS software is an advantage; limited in size, with older samples from 2011)
LHC Kaggle challenges:
- TrackML particle tracking challenge (second phase running Sep-Nov 2018)
- Flavours of Physics Challenge
- Higgs Boson Challenge
IML workshop short challenges: