
🌞 Summer of Data Seminar: Diving Deep into Data Research

Welcome to the Summer of Data Seminar series, hosted by Datology AI. Each week, we invite friends, collaborators, and curious minds doing interesting research in data and pretraining to share what they’re excited about—with a small group of us at the office (or virtually).
Whether it’s how to filter data at scale, how to reimagine pretraining from scratch, or simply what makes a dataset “good”, this seminar is a space to think out loud, ask questions, and hang out with others who are data-obsessed.
🧠 About The Seminar
The Summer of Data Seminar is a casual internal series where we bring in folks doing great research around:
- Dataset design, pretraining, or scaling laws
- Synthetic data and data-centric alignment
- Data contamination, memorization, unlearning
- Anything else weird and interesting about data
Each session includes a short talk (30–40 mins) followed by open-ended discussion, questions, and some good old-fashioned geeking out. We record these sessions to share with the broader community on YouTube, while keeping the live discussions cozy with just our team.
The joy of research is in sharing it, and in asking the hard questions together.
📅 Event Schedule
Stay tuned for upcoming talks. Here's who we've hosted so far:
| Date | Affiliation | Topic | Resources |
|---|---|---|---|
| May 5, 2025 | UC Berkeley | Scaling Test-Time Compute & Predicting Emergent Capabilities by Finetuning | |
| May 7, 2025 | ETH Zurich | Mixtera: A Data Plane for Foundation Model Training | Paper |
| May 19, 2025 | NVIDIA Research | CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training | Paper |
| June 2, 2025 | University of Edinburgh | Learning to Reason for Long-Form Story Generation | Paper |
| June 9, 2025 | CMU | Echo embeddings & Overtrained Language Models Are Harder to Fine-Tune | |
| June 16, 2025 | Stanford | Standard fine-tuning inefficiently uses rare data | Paper coming soon |
| June 23, 2025 | Princeton | COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning | Paper |
| June 30, 2025 | UW & NVIDIA | Prismatic Synthesis & G-Vendi Score: How Data Diversification makes R1-32B a Better Teacher than R1-671B | Paper |
| July 7, 2025 | Stanford & UW | OpenThoughts3 | Paper |
🎤 Want to Present?
If you’re working on something fun and would enjoy chatting about it with a bunch of thoughtful nerds, we’d love to have you join us.
To suggest a talk, send us a message (email or DM @datologyai) with:
- Your name + topic
- A rough title
- Any weeks that are particularly good or bad for your schedule
We’ll take it from there.
We’re excited for a summer of curiosity, great conversations, and rabbit holes we didn’t expect to fall into.
Stay data-obsessed. 🤓🚀