Maximilian Böther

I am a final-year Ph.D. student in Computer Science at ETH Zurich’s Systems Group and the Efficient Architectures and Systems Lab (EASL), supervised by Ana Klimovic and Gustavo Alonso. I am also a Member of Technical Staff at DatologyAI.

I work at the intersection of systems and data-centric AI. I am excited about data loaders, the data and systems aspects of large-scale LLM/VLM training, data management for machine learning, and distribution shift in real-world machine learning pipelines.

I have published open-source projects like Mixtera (GitHub), a lightweight data plane for LLM/VLM training, and Modyn (GitHub), a platform for training models on datasets that grow over time. I also contributed to the training of Apertus v1, Switzerland’s national LLM.

I have published at venues such as SIGMOD, VLDB, MLSys, ACL, and ICLR, and have interned at Google and Apple. I obtained B.Sc. and M.Sc. degrees in IT-Systems Engineering from Hasso Plattner Institute, Potsdam, Germany, in 2020 and 2022, respectively. Please find my CV here.

news

Jun 5, 2026	I attended SIGMOD’26 in Bengaluru, India, where I presented Mixtera at the main conference and gave an invited talk on it at the DEEM workshop.
May 12, 2026	We just released a report on how data curation alone can increase VLM quality across 20 public benchmarks.
Apr 30, 2026	I attended EuroSys’26 in Edinburgh and presented the current status of the data loader that we are building at DatologyAI. I also gave an invited talk at GreenSys’26 on efficient data mixing and loading for foundation model training.
Apr 10, 2026	We just released a cost-performance trade-off study for test-time agent adaptation that has been accepted to the LLA workshop at ICLR’26!
Apr 7, 2026	Our paper on Apertus v1, Switzerland’s national LLM, has been accepted to ACL’26!
Mar 17, 2026	Ever wondered if you should include your domain-specific finetuning data in your pretraining mix? Check out The Finetuner’s Fallacy, where we dig into this question!
Feb 20, 2026	After finishing my internship at DatologyAI, I am happy to continue as a Member of Technical Staff, working on our (soon-to-be) open-source data loader!
Feb 16, 2026	Check out ÜberWeb, where we present our insights on curating multilingual data at the 20 trillion token scale.
Jan 6, 2026	I gave a talk at the DEEM Lab Research Seminar on Data Engineering for ML at TU Berlin, hosted by Sebastian Schelter.
Dec 7, 2025	I attended NeurIPS’25 in San Diego.
Nov 23, 2025	Mixtera has just been accepted to SIGMOD’26 in Bengaluru, India!
Oct 2, 2025	We organized the Building Efficient System Infrastructure for AI track at the AI+X Summit 2025 in Zurich, and I gave a presentation about Mixtera.
Sep 15, 2025	I joined DatologyAI as a Research Intern! I will be working on data loading for LLM/VLM training.
Sep 1, 2025	We just released Apertus v1, Switzerland’s national LLM. Happy to have contributed to the pretraining data! The models can be found on Hugging Face and the technical report on GitHub.
Jun 27, 2025	I attended SIGMOD’25 in Berlin and presented Modyn at the conference!
May 19, 2025	I started as an ML Intern at Apple in Seattle, working on reinforcement learning infrastructure for diffusion models!
May 15, 2025	I presented our distributed data selection paper at MLSys 2025, and gave a talk at DatologyAI on Mixtera.
Mar 19, 2025	I have received the ML and Systems Rising Star Award 2025. Thank you so much!
Mar 4, 2025	I presented Mixtera and Modyn at BTW’25. Thank you for the great discussions!
Mar 1, 2025	We just released a preprint on Mixtera, our data plane for foundation model training. If you are training LLMs or VLMs, and are looking for infrastructure for data loading and mixing, please feel free to reach out!
Feb 10, 2025	Our paper on distributed submodular subset selection, a result of my Google internship, has been accepted to MLSys’25. See you soon in Santa Clara!
Oct 31, 2024	Our paper on Modyn has been accepted to SIGMOD’25 in Berlin!
Oct 4, 2024	We organized the Systems for Cost-Efficient AI Track at the AI+X Summit in Zurich.
Sep 30, 2024	Our vision paper on Mixtera, our lightweight data lake for LLM training, has been accepted to HotInfra’24 at SOSP. See you in Austin, TX!
Aug 2, 2024	Happy to have attended the Dagstuhl Seminar 24311: Resource-Efficient Machine Learning.
Jun 17, 2024	I will talk about Modyn at the Data-centric Machine Learning (DML) workshop at ICLR’24. See you in Vienna!
Feb 26, 2024	We just released a preprint of our paper on scaling out practical subset selection using submodular functions. This paper is a result of my internship at Google.
Jun 17, 2023	Our paper on analyzing vectorized hash tables across CPU architectures just got accepted at VLDB’23 in Vancouver!
Jun 5, 2023	I joined Google for a summer research internship in Sunnyvale, California, USA! I am working on scaling out submodular data subset selection.
Apr 11, 2023	Our work-in-progress workshop paper on Modyn, our research platform for model training on dynamic datasets, has been accepted at EuroMLSys’23 in Rome!
Nov 1, 2022	I joined the ETH Zurich Systems Group and the Efficient Architectures and Systems Lab (EASL) to do a Ph.D. in Machine Learning Systems, supervised by Professor Ana Klimovic. Looking forward to the new adventures in Switzerland!
Oct 15, 2022	Our paper on efficiently computing directed minimum spanning trees (arborescences) has been accepted for publication at ALENEX 2023. Check out the final version here.
Jun 6, 2022	Our Law Smells paper, which applies concepts of software engineering to the law, has been published in AI&Law. Check out the final version here.
Jan 24, 2022	Our paper on deep learning for combinatorial optimization just got accepted at ICLR! Check out the final version here.
Dec 27, 2021	This website just went online!

selected publications

ACL

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Hernández-Cano, Alejandro, Hägele, Alexander, Huang, Allen Hao and 99 more authors

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2026

@inproceedings{Hernandez2026Apertus,
  author = {Hernández-Cano, Alejandro and Hägele, Alexander and Huang, Allen Hao and Romanou, Angelika and Solergibert, Antoni-Joan and Pasztor, Barna and Messmer, Bettina and Garbaya, Dhia and Ďurech, Eduard Frank and Hakimi, Ido and Giraldo, Juan García and Ismayilzada, Mete and Foroutan, Negar and Moalla, Skander and Chen, Tiancheng and Sabolčec, Vinko and Xu, Yixuan and Aerni, Michael and AlKhamissi, Badr and Mariñas, Inés Altemir and Amani, Mohammad Hossein and Ansaripour, Matin and Badanin, Ilia and Benoit, Harold and Boros, Emanuela and Browning, Nicholas and Bösch, Fabian and Böther, Maximilian and Canova, Niklas and Challier, Camille and Charmillot, Clement and Coles, Jonathan and Deriu, Jan and Devos, Arnout and Drescher, Lukas and Dzenhaliou, Daniil and Ehrmann, Maud and Fan, Dongyang and Fan, Simin and Gao, Silin and Gila, Miguel and Grandury, María and Hashemi, Diba and Hoyle, Alexander and Jiang, Jiaming and Klein, Mark and Kucharavy, Andrei and Kucherenko, Anastasiia and Lübeck, Frederike and Machacek, Roman and Manitaras, Theofilos and Marfurt, Andreas and Matoba, Kyle and Matrenok, Simon and Mendonça, Henrique and Mohamed, Fawzi Roberto and Montariol, Syrielle and Mouchel, Luca and Najem-Meyer, Sven and Ni, Jingwei and Oliva, Gennaro and Pagliardini, Matteo and Palme, Elia and Panferov, Andrei and Paoletti, Léo and Passerini, Marco and Pavlov, Ivan and Poiroux, Auguste and Ponkshe, Kaustubh and Ranchin, Nathan and Rando, Javi and Sauser, Mathieu and Saydaliev, Jakhongir and Sayfiddinov, Muhammad Ali and Schneider, Marian and Schuppli, Stefano and Scialanga, Marco and Semenov, Andrei and Shridhar, Kumar and Singhal, Raghav and Sotnikova, Anna and Sternfeld, Alexander and Tarun, Ayush Kumar and Teiletche, Paul and Vamvas, Jannis and Yao, Xiaozhe and Zhao, Hao and Ilic, Alexander and Klimovic, Ana and Krause, Andreas and Gulcehre, Caglar and Rosenthal, David and Ash, Elliott and Tramèr, Florian and VandeVondele, Joost and Veraldi, Livio and Rajman, 
Martin and Schulthess, Thomas and Hoefler, Torsten and Bosselut, Antoine and Jaggi, Martin and Schlag, Imanol},
  title = {Apertus: Democratizing Open and Compliant LLMs for Global Language Environments},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2026},
}

SIGMOD

Mixtera: A Data Plane for Foundation Model Training

Böther, Maximilian, Yao, Xiaozhe, Kerimoglu, Tolga and 3 more authors

In Proceedings of the Conference on Management of Data (SIGMOD) 2026

@inproceedings{Bother2026Mixtera,
  author = {B{\"{o}}ther, Maximilian and Yao, Xiaozhe and Kerimoglu, Tolga and Graur, Dan and Gsteiger, Viktor and Klimovic, Ana},
  title = {Mixtera: A Data Plane for Foundation Model Training},
  booktitle = {Proceedings of the Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3786668},
  url = {https://dl.acm.org/doi/10.1145/3786668},
  year = {2026},
}

MLSys

On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

Böther, Maximilian, Sebastian, Abraham, Awasthi, Pranjal and 2 more authors

In Proceedings of the Conference on Machine Learning and Systems (MLSys) 2025

@inproceedings{Bother2025Submod,
  author = {B{\"{o}}ther, Maximilian and Sebastian, Abraham and Awasthi, Pranjal and Klimovic, Ana and Ramalingam, Srikumar},
  title = {On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions},
  booktitle = {Proceedings of the Conference on Machine Learning and Systems ({MLSys})},
  year = {2025},
}

SIGMOD

Modyn: Data-Centric Machine Learning Pipeline Orchestration

Böther, Maximilian, Robroek, Ties, Gsteiger, Viktor and 3 more authors

In Proceedings of the Conference on Management of Data (SIGMOD) 2025

@inproceedings{Bother2025Modyn,
  author = {B\"{o}ther, Maximilian and Robroek, Ties and Gsteiger, Viktor and Ma, Xianzhe and T\"{o}z\"{u}n, P{\i}nar and Klimovic, Ana},
  title = {Modyn: Data-Centric Machine Learning Pipeline Orchestration},
  booktitle = {Proceedings of the Conference on Management of Data (SIGMOD)},
  year = {2025},
  doi = {10.1145/3709705},
  url = {https://dl.acm.org/doi/10.1145/3709705},
}

VLDB

Analyzing Vectorized Hash Maps Across CPU Architectures

Böther, Maximilian, Benson, Lawrence, Klimovic, Ana and 1 more author

Proceedings of the VLDB Endowment 2023

@article{Bother2023Hashmaps,
  author = {B\"{o}ther, Maximilian and Benson, Lawrence and Klimovic, Ana and Rabl, Tilmann},
  title = {Analyzing Vectorized Hash Maps Across CPU Architectures},
  journal = {Proceedings of the {VLDB} Endowment},
  pages = {2755--2768},
  number = {11},
  volume = {16},
  year = {2023},
  doi = {10.14778/3611479.3611485},
  url = {https://dl.acm.org/doi/10.14778/3611479.3611485},
}