Datology AI · Jun 23 · 6 min read

🌞 Summer of Data Seminar: Diving Deep into Data Research

Welcome to the Summer of Data Seminar series, hosted by Datology AI. Each week, we invite friends, collaborators, and curious minds doing interesting research in data and pretraining to share what they’re excited about—with a small group of us at the office (or virtually).

Whether it’s how to filter data at scale, reimagine pretraining from scratch, or just what makes a dataset “good”—this seminar is a space to think out loud, ask questions, and hang out with others who are data-obsessed.


🧠 About The Seminar

The Summer of Data Seminar is a casual internal series where we bring in folks doing great research around:

  • Dataset design, pretraining, or scaling laws
  • Synthetic data and data-centric alignment
  • Data contamination, memorization, unlearning
  • Anything else weird and interesting about data

Each session includes a short talk (30–40 mins) followed by open-ended discussion, questions, and some good old-fashioned geeking out. We record these sessions to share with the broader community on YouTube, while keeping the live discussions cozy with just our team.

The joy of research is in sharing it and asking the hard questions together.


📅 Event Schedule

Stay tuned for upcoming talks. Here's who we've hosted so far:

| Presenter | Date | Affiliation | Topic | Resources |
| Charlie Snell | May 5, 2025 | UC Berkeley | Scaling Test-Time Compute & Predicting Emergent Capabilities by Finetuning | |
| Maximilian Böther | May 7, 2025 | ETH Zurich | Mixtera: A Data Plane for Foundation Model Training | Paper |
| Shizhe Diao | May 19, 2025 | NVIDIA Research | CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training | Paper |
| Alexander Gurung | June 2, 2025 | University of Edinburgh | Learning to Reason for Long-Form Story Generation | Paper |
| Jacob Springer | June 9, 2025 | CMU | Echo embeddings & Overtrained Language Models Are Harder to Fine-Tune | |
| Suhas Kotha | June 16, 2025 | Stanford | Standard fine-tuning inefficiently uses rare data | Paper coming soon |
| Xindi Wu | June 23, 2025 | Princeton | COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning | Paper |
| Jaehun Jung | June 30, 2025 | UW & Nvidia | Prismatic Synthesis & G-Vendi Score: How Data Diversification makes R1-32B a Better Teacher than R1-671B | Paper |
| Etash Guha | July 7, 2025 | Stanford & UW | OpenThoughts3 | Paper |
(Links will be updated closer to the event dates)

🎤 Want to Present?

If you’re working on something fun and would enjoy chatting about it with a bunch of thoughtful nerds, we’d love to have you join us.

To suggest a talk, send us a message (email or DM @datologyai) with:

  • Your name + topic
  • A rough title
  • Any weeks that are particularly good or bad for your schedule

We’ll take it from there.


We’re excited for a summer of curiosity, great conversations, and rabbit holes we didn’t expect to fall into.

Stay data-obsessed. 🤓🚀