SpectruMS: Foundation Model for Mass Spectrometry

AI model for mass spectrometry data.

AI training is similar to file compression (think .zip). Both make large files smaller. The added benefit of AI is that the compressed information becomes much easier to search. The downside is that training AI is nothing like right-clicking a folder and archiving it.

SpectruMS was a large language model developed at Pangea Bio that compressed GNPS, MassBank and PubChem into a single AI model: petabytes of chemical data distilled into a single 5 GB BART model.

At Pangea Bio, SpectruMS was developed essentially from scratch. Every step (data curation, AI training, babysitting the TPU, deployment and serving on AWS) was done in-house. What follows is my day-to-day experience of this project and my recommendations for approaching something similar.

Define metrics early

Really. Metrics should be defined on day one and implemented on day two. For SpectruMS, the metric was Top-5 accuracy, and an entire week was wasted before it was implemented.

When a metric appears, it should come with two things:

  • Evaluation dataset
  • Baseline model(s)

Baseline models give a sense of what the laziest solution looks like. If a random model already gets 5% Top-5 accuracy, then actually training a model to 6% Top-5 accuracy is not that impressive; there's probably a bug somewhere in training.

Second is the evaluation dataset. Only two things matter for an evaluation dataset (see the sketch after this list):

  • The model never sees the evaluation dataset during training.
  • The evaluation dataset is genuinely different from the training dataset.
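
Putting the metric, the baseline and the held-out set together can fit in a script this small. Everything below is a hypothetical placeholder (the evaluation pairs, the chemical strings, the predict_top5 callable); the point is only that the metric, the evaluation data and a lazy baseline can exist as working code on day two.

import random

# Hypothetical held-out pairs: (stringified MS/MS spectrum, true chemical string).
eval_set = [
  ("spectrum_text_1", "chemical_A"),
  ("spectrum_text_2", "chemical_B"),
]

# The laziest possible baseline: guess 5 chemicals at random from a known vocabulary.
known_chemicals = ["chemical_A", "chemical_B", "chemical_C", "chemical_D", "chemical_E"]

def random_baseline(spectrum_text):
  return random.sample(known_chemicals, k=5)

def top5_accuracy(predict_top5, pairs):
  # predict_top5(spectrum) must return up to 5 candidate chemical strings.
  hits = sum(1 for spectrum, truth in pairs if truth in predict_top5(spectrum))
  return hits / len(pairs)

print(top5_accuracy(random_baseline, eval_set))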

Settle for the laziest training approach

What is the laziest training approach? The one that trains an LLM over natural language. This is what worked surprisingly well in practice, and it was utterly lazy because there are tons of ready-to-use scripts that train an LLM on every conceivable kind of hardware: GPUs, TPUs, XPUs, you name it.

But wasn’t this about training a foundation model for mass spectrometry? Indeed! But the mass spectrometry training goal was reframed as a language modelling problem, and it worked surprisingly well. Let me explain:

The problem of predicting spectrum -> chemical was reframed as a question-answering problem. Both the mass spectrum and the chemical were turned into long strings, and for each spectrum given as input, the model was trained to output the chemical as deepSMILES text. A simple text-to-text model.
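
For intuition, the stringification could look roughly like the sketch below. The rounding and the separator tokens are made-up assumptions, not the exact format used at Pangea; the point is just that a peak list flattens naturally into text.

def spectrum_to_text(peaks, precursor_mz):
  # peaks: list of (m/z, intensity) pairs from one MS/MS spectrum.
  # Rounding keeps the token vocabulary small and repetitive.
  body = " ".join(f"{mz:.2f}:{intensity:.0f}" for mz, intensity in peaks)
  return f"<|precursor|> {precursor_mz:.2f} <|peaks|> {body}"

# A tiny made-up spectrum:
text = spectrum_to_text([(79.05, 120.0), (107.05, 540.0), (135.04, 999.0)], precursor_mz=153.07)
# -> '<|precursor|> 153.07 <|peaks|> 79.05:120 107.05:540 135.04:999'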

Why did this make sense? Because writing your own training pipeline and making it efficient is a lot of effort (weeks, months even) and a lot of wasted cash, so it makes sense to lean on existing open-source tools. If an AI problem can be reframed as a language modelling problem, weeks of development time can be saved. And this was exactly what was done at Pangea.

In practice, the training was essentially the standard fine-tuning-for-question-answering recipe, preceded by BART masked language modelling pretraining over a corpus of stringified MS/MS data and chemical deepSMILES.
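
A minimal sketch of that recipe with Hugging Face transformers, assuming a BART checkpoint and a single made-up training pair (a real run would use a Trainer, batching, padding and many epochs), might look like this:

from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# One hypothetical training pair: stringified spectrum in, deepSMILES out.
spectrum_text = "<|precursor|> 153.07 <|peaks|> 79.05:120 107.05:540 135.04:999"
target_text = "cccccc6"  # placeholder deepSMILES (benzene), not a real training label

inputs = tokenizer(spectrum_text, return_tensors="pt", truncation=True)
labels = tokenizer(target_text, return_tensors="pt", truncation=True).input_ids

# Standard seq2seq objective: the decoder learns to emit the chemical string.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()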

A BART architecture was chosen for this task instead of a GPT-style model, even though GPT would’ve been (in my opinion) more efficient and easier. In the end, what mattered most was the quality and amount of data, not the model architecture. More on that later.

In essence, the inference looked like this function:

def predict(msms_string: str) -> str:
  # Tokenize the stringified spectrum; the prediction starts as the begin token.
  msms_tokens = tokenizer(msms_string)
  prediction = tokenizer('<|begin_chemical|>')
  while True:
    # Autoregressive decoding: the model proposes one token at a time,
    # conditioned on the spectrum and everything generated so far.
    next_token = model(msms_tokens, prediction)
    if next_token == tokenizer('<|end_chemical|>'):
      break
    prediction += next_token
  # Decode the generated tokens back into a deepSMILES string.
  return tokenizer.decode(prediction)

To get the Top-5 accuracy, the model was simply sampled with nonzero temperature to produce varied outputs. If the exact correct string appeared within the first 5 tries, that counted as a hit.
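
Conceptually the scoring loop was as simple as the sketch below, where predict is the function above, assumed here to sample with nonzero temperature rather than decode greedily:

def top5_hit(msms_string, true_chemical, n_tries=5):
  # Sample the model n_tries times; it is a hit if any sample matches exactly.
  candidates = {predict(msms_string) for _ in range(n_tries)}
  return true_chemical in candidates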

TPU training sucks

Google is an amazing company that specializes in creating amazingly convoluted tools. The TPU is Google’s chip for training AI, but using one had to be one of the most painful experiences an AI engineer could be subjected to. The second biggest problem with TPUs was just how expensive they were to use.

TPUs were slow to start, the errors were cryptic (if the logs could even be accessed at all), and software written for TPU training was useless for training on anything but Google’s TPUs.

The TPU situation was so bad that even though the team was offered some $100k in Google Cloud credits, the team lead still decided to move away from TPUs. Even the fine-tuning for question answering (read: question in MS/MS, answer in deepSMILES) was done off TPUs, on an A100 instance on Google Cloud.

It is easy to burn a lot of cash

And not only on GPUs or TPUs, mind you. When working with a lot of data (there was a petabyte or so of it), it was easy to accidentally burn a lot of cash with a single press of a button. Taking your money out of the bank, dousing it in gasoline and setting it on fire would be slower: it would take more effort per dollar to destroy your money that way.

In one accident, $2.5k went up in flames on AWS because of S3. A lot of data had been written into S3 Glacier Deep Archive storage. During data processing, one of the workflows accidentally read a large portion of that data, and on S3 Glacier the dominant cost is not storage but retrieval, billed per GB read. The workflow read $2.5k worth of data, and the billing for that day looked like a Dirac delta function.

Ouch.
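
One cheap guardrail in hindsight: check an object’s storage class before letting a workflow touch it in bulk. A rough boto3 sketch (the bucket and key names here are made up) could look like this:

import boto3

s3 = boto3.client("s3")

def is_safe_to_read(bucket, key):
  # head_object is a cheap metadata call; StorageClass is only reported
  # for non-STANDARD classes such as GLACIER or DEEP_ARCHIVE.
  head = s3.head_object(Bucket=bucket, Key=key)
  storage_class = head.get("StorageClass", "STANDARD")
  return storage_class not in {"GLACIER", "DEEP_ARCHIVE", "GLACIER_IR"}

# Hypothetical bucket and key:
if not is_safe_to_read("pangea-ms-data", "spectra/batch-001.parquet"):
  raise RuntimeError("Object lives in an archive tier; reading it costs real money per GB.")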

Conclusion

Training the foundation model for Pangea Bio was one of my best achievements in 2025. I trained a large language model from scratch: a model that predicts the identity of a mass spectrum (answering the question “what chemical made this spectrum?”). This top-n classification problem was reformulated as a language modelling and question-answering task, and the training logic was built around that. The model easily smashed the previous internal metric for model performance, and in the process I gained a great deal of practical experience. That kind of experience is going to stay valuable precisely because of how many resources it demands. If you are considering working at Pangea Bio, I can only fully recommend it. Please reach out on LinkedIn and I can tell you more about my very good experience working there.