Customer Churn Prediction

Built a machine learning project on 72,000+ ISP customer records to predict churn. Compared 6 classification models, with Random Forest performing best at 93% accuracy. Done as a self-learning project on Kaggle.

Introduction

A complete machine learning project built on a real-world Internet Service Provider dataset (72,274 customer records, 11 columns) to predict customer churn, that is, to identify which customers are likely to cancel their subscription.

The project covers the full data science pipeline from raw data to model comparison:

Exploratory Data Analysis: Investigated churn distribution, TV and movie package subscription patterns, service failure frequency, subscription age trends, and download/upload behavior. Used countplots, histograms, jointplots, boxplots, pairplots, and a correlation heatmap to surface key patterns. Key finding: remaining contract duration showed the strongest negative correlation with churn.
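The correlation check behind that key finding can be sketched as follows. This is a minimal illustration on synthetic data, not the project's notebook: the column names (remaining_contract, subscription_age, churn) and the generated values are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and values are assumptions,
# not the real dataset.
rng = np.random.default_rng(0)
n = 1000
remaining_contract = rng.uniform(0, 2, n)
# Make churn less likely as the remaining contract grows,
# to mimic the pattern described in the text.
churn = (rng.uniform(0, 1, n) > remaining_contract / 2).astype(int)

df = pd.DataFrame({
    "remaining_contract": remaining_contract,
    "subscription_age": rng.uniform(0, 10, n),  # unrelated to churn here
    "churn": churn,
})

# Pearson correlation of each predictor with the target,
# sorted so the most negative correlation comes first.
churn_corr = df.corr()["churn"].drop("churn").sort_values()
print(churn_corr)
```

The full matrix from df.corr() is what a correlation heatmap would display, for example via seaborn's heatmap function.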

Data Cleaning & Feature Engineering: Handled missing values using median imputation for the remaining-contract column (29% missing) and mean imputation for the download and upload averages. Removed the irrelevant ID column to streamline the feature set.
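A small sketch of this cleaning step, assuming pandas and hypothetical column names (id, remaining_contract, download_avg, upload_avg, churn). The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names and values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "remaining_contract": [0.5, np.nan, 1.2, np.nan, 0.1],
    "download_avg": [10.0, 20.0, np.nan, 30.0, 40.0],
    "upload_avg": [1.0, np.nan, 3.0, 2.0, 2.0],
    "churn": [0, 1, 0, 1, 1],
})

# Median imputation for the heavily missing contract column;
# the median is less sensitive to outliers than the mean.
contract_median = df["remaining_contract"].median()
df["remaining_contract"] = df["remaining_contract"].fillna(contract_median)

# Mean imputation for the usage averages.
for col in ["download_avg", "upload_avg"]:
    df[col] = df[col].fillna(df[col].mean())

# The ID carries no predictive signal, so drop it.
df = df.drop(columns=["id"])

print(df)
```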

Feature Selection: Applied ExtraTreesClassifier feature importance scoring across all 9 predictors. Remaining contract duration dominated with a 52.7% importance score. Dropped the two lowest-importance features (service failure count and download over limit) to retain the top 7 predictors. Applied StandardScaler to standardize the features (zero mean, unit variance) before modeling.
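The selection and scaling steps might look roughly like this with scikit-learn. The data is synthetic, the feature names f0 to f4 are placeholders, and hyperparameters such as n_estimators=100 are assumptions rather than the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f0 drives the label (placeholder names).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + 0.1 * rng.normal(size=n) > 0).astype(int)

# Fit the tree ensemble and read off impurity-based importances,
# which sum to 1 across all features.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Keep everything except the two lowest-scoring features.
keep = importances.index[:-2]
X_selected = X[keep]

# Standardize each kept feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_selected)
```

In a train/test workflow the scaler is usually fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.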

Model Comparison: Trained and evaluated 6 classification algorithms (Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Support Vector Machine, and Gaussian Naive Bayes), computing Accuracy, Precision, and Recall for each from its confusion matrix.
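A comparison loop of this kind can be sketched as below. It runs on a synthetic dataset from make_classification, so the printed scores will not match the project's results. The split ratio, random seeds, and any hyperparameters beyond k=10 and the sigmoid kernel are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 7 features (stand-in for the real data).
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (k=10)": KNeighborsClassifier(n_neighbors=10),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (sigmoid)": SVC(kernel="sigmoid"),
    "Gaussian NB": GaussianNB(),
}

results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # For binary labels the 2x2 matrix unpacks as tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[name] = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Print models from highest to lowest accuracy.
for name, m in sorted(results.items(), key=lambda kv: -kv[1]["accuracy"]):
    print(f"{name:<20} acc={m['accuracy']:.3f} "
          f"prec={m['precision']:.3f} rec={m['recall']:.3f}")
```

Deriving all three metrics from the same confusion matrix keeps them consistent with each other and makes the trade-off between false positives and false negatives visible per model.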

Results:

– Random Forest → 93.45% accuracy (best performer)
– Decision Tree → 91.02%
– KNN (k=10) → 89.90%
– Logistic Regression → 83.50%
– Gaussian NB → 78.86%
– SVM (sigmoid) → 74.32%

Random Forest, Decision Tree, and KNN were the three most accurate models (all at roughly 90% accuracy or above), making them the strongest candidates for deployment on this classification problem.

Skills

  • Python
  • Pandas
  • NumPy
  • Matplotlib/Seaborn
  • Machine Learning
  • Jupyter Notebook