Public datasets

This page collects public datasets that are used for machine learning studies at the LHC. The resources are voluntary contributions from the authors of papers and from challenges. The IML is not responsible for the text on the linked description pages. We want to emphasise that public datasets are typically produced with simplified simulations of the real detectors and contain much smaller samples than those available to the collaborations. The best results obtained in simplified simulation with a limited number of samples do not automatically indicate an optimal strategy for real experiments. However, there is currently no other option for comparisons, as most of the collaborations' data is not public.

Simplified datasets for benchmarking:

  • Top tagging without heavy flavour & pileup: Data and details of arXiv:1707.08966
  • Jet substructure: Data from arXiv:1607.08633 at UC Irvine's page MLPhysics
  • Flavour tagging without pileup: Data from arXiv:1603.09349 at UC Irvine's page MLPhysics

Datasets for developing simulation:

Realistic datasets from the CMS experiment:

  • CMS open data (non-trivial data format, so knowledge of the CMS software is an advantage; limited in size, with older samples from 2011)

LHC Kaggle challenges:

IML workshop short challenges: