Supervised learning is a widely used machine learning approach that involves training models on labeled data. However, practitioners often encounter common pitfalls that can degrade the performance and reliability of their models. Understanding these issues and applying simple diagnostic calculations can help mitigate their impact.
Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including noise, leading to poor generalization on new data. Underfitting happens when the model is too simple to capture underlying patterns. Calculations such as the training and validation error rates can help identify these issues.
For example, comparing the training error (Etrain) and validation error (Eval) can indicate overfitting if Etrain is very low while Eval is high. Conversely, high errors on both suggest underfitting.
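The comparison above can be sketched as a small diagnostic helper. This is a minimal illustration: the threshold values (a 0.1 gap and a 0.3 "high error" level) are arbitrary assumptions chosen for the example, not universal rules.

```python
def diagnose_fit(e_train, e_val, gap_threshold=0.1, high_error=0.3):
    """Heuristic fit diagnosis from training and validation error rates.

    Thresholds are illustrative assumptions; appropriate values depend
    on the task and the noise level of the data.
    """
    if e_train < high_error and (e_val - e_train) > gap_threshold:
        return "overfitting"    # low Etrain, much higher Eval
    if e_train > high_error and e_val > high_error:
        return "underfitting"   # high error on both sets
    return "reasonable fit"

print(diagnose_fit(0.02, 0.25))  # overfitting
print(diagnose_fit(0.35, 0.40))  # underfitting
```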
Class Imbalance
Class imbalance occurs when some classes are underrepresented in the dataset, leading to biased models. Calculating class distribution percentages helps identify imbalance.
Suppose the dataset has 1000 samples, with 900 belonging to class A and 100 to class B. The class distribution percentages are:
Class A: (900/1000) * 100 = 90%
Class B: (100/1000) * 100 = 10%
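The same percentages can be computed for any label list. A short sketch using the standard library:

```python
from collections import Counter

def class_distribution(labels):
    """Return the percentage of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100 * n / total for cls, n in counts.items()}

# 900 samples of class A and 100 of class B, as in the example above.
labels = ["A"] * 900 + ["B"] * 100
print(class_distribution(labels))  # {'A': 90.0, 'B': 10.0}
```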
Evaluating Model Performance
Metrics such as accuracy, precision, recall, and F1-score are essential for assessing model performance. Calculations involve confusion matrix components:
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
For example, precision is calculated as:
Precision = TP / (TP + FP)
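A minimal sketch computing precision, recall, and F1-score directly from confusion matrix counts, guarding against division by zero (the zero-count convention of returning 0.0 is an assumption of this example):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(p)  # 0.8
```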
Handling Noisy Data
Noisy data can distort model training. Calculations such as the noise ratio, the fraction of samples identified as noisy, help quantify data quality.
Suppose the dataset contains 100 noisy samples out of 1000 total samples. The noise ratio is:
Noise Ratio = (Number of noisy samples) / (Total samples) = 100 / 1000 = 0.1 or 10%
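The calculation above amounts to a one-line ratio; a short sketch with a guard against an empty dataset:

```python
def noise_ratio(num_noisy, total):
    """Fraction of samples flagged as noisy (0.0 to 1.0)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return num_noisy / total

# 100 noisy samples out of 1000, as in the example above.
print(noise_ratio(100, 1000))  # 0.1
```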