University group project analyzing 1.9 million train records. I handled the statistical part: comparing delay patterns across train types using ANOVA and descriptive analysis.
Introduction
Academic data science project analyzing railway reliability using a real-world dataset of 1.9 million Deutsche Bahn observations from October 2025 (source: Hugging Face).
Brief
As part of a six-member interdisciplinary team, I was responsible for the statistical analysis of delay structure across train types. My contribution included descriptive analysis of average delays by train category, distribution comparisons via boxplots, and a one-way ANOVA test (F = 4638.44, p < 0.001) confirming that mean delays differ significantly across train types. I also calculated the effect size (Eta² = 0.11), which shows that train type explains about 11% of total delay variance.
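To illustrate how the F statistic and Eta² relate, here is a minimal sketch of a one-way ANOVA written with the Python standard library only. It is not the project's actual code: the function name `one_way_anova`, the train-type labels, and the delay values are hypothetical toy data chosen so the arithmetic is easy to follow.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (F, eta_squared) for a dict mapping group label -> list of values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = fmean(all_values)
    group_means = {label: fmean(values) for label, values in groups.items()}

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(values) * (group_means[label] - grand_mean) ** 2
        for label, values in groups.items()
    )
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(
        (x - group_means[label]) ** 2
        for label, values in groups.items()
        for x in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Eta squared: share of total variance attributable to group membership.
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


# Hypothetical toy delays in minutes, NOT taken from the Deutsche Bahn dataset.
delays = {
    "ICE": [5.0, 7.0, 9.0],
    "RE": [2.0, 3.0, 4.0],
    "S": [1.0, 1.0, 4.0],
}

f_stat, eta_sq = one_way_anova(delays)
print(f"F = {f_stat:.3f}, Eta^2 = {eta_sq:.3f}")
```

Because Eta² is the ratio of between-group to total sum of squares, it can be read directly as the share of variance explained, which is how the 11% figure above is interpreted. On a dataset of this size a library routine such as `scipy.stats.f_oneway` would be the more practical choice for the F test.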
The broader project combined regression modeling, hypothesis testing, clustering (hierarchical, with PCA), and logistic regression to assess systemic operational instability in the German rail network.