Designing and implementing data pipelines for scientific discovery

Location
London, Bristol, Manchester, Newcastle, Edinburgh.
Cohort Size
Maximum 15 participants per cohort.
Course Fees
The standard course fee is £400.

A reduced rate of £200 is available for current PhD and MSc students at select UK universities and research groups. Please contact us to find out if your group/institution is eligible.

To help you decide with confidence, we allow full payment to be made after the first session of the course.
Duration
12 hours spread over 2 days. Please see the course schedule below.
Dates
The course runs weekly in London and biweekly in other locations.
Level
Minimal programming and mathematics background required.
Introducing our unique approach to data pipelines, summarised in our philosophy: “Pipelines as Artefacts”.
This course offers a fresh perspective on one of the most overlooked yet foundational parts of modern research: the data pipeline.
Data pipelines are often undervalued in academic research. Even researchers with programming experience may see them as little more than tools for cleaning or preparing data. For those newer to data-intensive work, pipelines may go entirely unrecognised, built on the fly without clear structure or long-term value.

This stands in stark contrast to their role in industry and leading technology research labs, where data pipelines are treated as foundational infrastructure and viewed as strategic assets, central to delivering consistent, high-quality research at scale.

Well-crafted pipelines can spark new research ideas, enable collaboration across disciplines, and even serve as the foundation for entire ecosystems of research tools and platforms. Not least, they boost visibility and credibility, and open up opportunities for industrial careers.

However, the skills to design, craft, and implement them are rarely taught well or coherently in academic settings. As a result, many researchers are left to piece together best practices informally, often through years of trial, error, and word-of-mouth.

Developed through years of collaboration with researchers at leading institutions, including the University of Cambridge and Imperial College London, as well as top industrial labs, this course helps scientists bridge the gap between academic research practices and industrial best practices, distilling knowledge and strategies for building high-quality data pipelines and using them to accelerate scientific discovery.

Finely tuned through extensive delivery and many iterations, this course is field-agnostic and designed to benefit researchers at all levels. Whether you’re just starting out with data workflows or refining your established practices, it provides structured, relevant and concise guidance delivered in a fun, anecdotal style by seasoned data pipeline experts.

The course also equips you with skills that make your work more attractive to non-traditional funders outside academia, such as tech companies, innovation labs, and public and private digital initiatives, and can help you shape a niche academic consultancy profile.

One of the course highlights is “Your Data in Focus: Expert Consultation,” a supervisory group session where you bring your datasets and research questions to discuss with the instructors. This guided dialogue serves as a mini-supervision meeting, providing tailored advice to help you directly apply what you’ve learned to your research.

This is more than a technical course: it’s a rethinking of how you approach data, discovery, research impact, and even your career trajectory as a researcher.
Register interest
Course structure
Day 1
10:00-12:00
Lecture: Introduction to Data Pipelines
What do academics often get wrong about pipelines, and what are the common blind spots in how they approach, develop, and market them? How can you bridge the gap between best practices used in leading industrial labs and typical academic workflows? And how can you get more mileage out of your pipelines to boost visibility, collaboration, and momentum, wherever you are in your academic career?
12:00-13:00
Lab: Case studies of high-profile pipelines from a range of fields and industries, selected to align with the cohort’s research interests.
13:00-14:00
Break
14:00-16:00
Lecture: How to Design and Structure Data Pipelines to Maximise Scientific Discovery: the "craft" of pipeline design.

This session explores how to think of pipelines as artefacts, approaching pipeline design as both a scientific and technical craft. We’ll cover the essential components, choices, and trade-offs involved. Crucially, the session focuses on coupling pipelines with the discovery process: moving beyond simple data preprocessing to using pipelines as a way to inspire new hypotheses and directions, allowing the pipeline to lead your research instead of merely supporting it.
16:00-17:00
Lab: In-depth exploration of how different pipeline design choices impact discovery, using real-world examples tailored to the cohort’s research interests.

Day 2

10:00-12:00
Lecture: How to Publish Data Pipelines

What does it mean to "publish" a data pipeline? What are the different "levels" of publishing pipelines, and how can you strategically align each level with various goals, including academic recognition, funding impact, commercialisation, and long-term research value? We’ll also explore how to effectively market your pipeline to reach wider audiences and foster both vertical collaboration within your field and horizontal collaboration across disciplines.
12:00-13:00
Lab: Publishing workflows (cohort-specific examples) featuring demos and hands-on exercises covering the process end-to-end.
13:00-14:00
Break
14:00-17:00
Your Data in Focus: Expert Consultation

In this session, participants hold an open discussion with the instructors about their own data and how they can apply what they have learnt in the course to their own research.
Meet the instructors
Ahmad is an experienced educator and AI practitioner. He has held senior research and teaching roles at institutions including the University of Cambridge, Imperial College London, and LSE.

His expertise combines deep technical skill in machine learning with a strong focus on research impact and effective communication across disciplines.
Steffen is an AI practitioner with a track record of applying machine learning across sectors, from FTSE 100 companies to startups and academic research.

He has held roles such as Senior AI Engineer at Rolls-Royce and has recently focused on applying his knowledge in the AI startup ecosystem. Alongside his applied work, he teaches and consults on AI across universities and industry.
Jan is an expert in big data, scientific computing, and AI, with senior roles across industry and academia. He has led large-scale projects at organisations such as Royal Mail and Citi Bank’s Innovation Lab, where he serves as Senior Vice President and Lead Software Engineer.

Known for bridging research and real-world application, Jan has lectured widely on AI, and has mentored over 100 students by integrating industry best practices into academic training.
Register your interest
FAQs
Who is this course for?
I’m not sure whether this course is right for me. Can I speak to someone?
Is the course in-person? Can I attend remotely?
How do I enrol in the course?
What do participants receive upon completion?
What is the minimum level of programming required to participate in the course?
Do I need to be familiar with machine learning to participate?