Part 1: Basic Concepts

Bias-Variance Tradeoff

Decomposition of Expected Test Error:

$$ \begin{align*} \underbrace{\mathbb{E}_{x, y, D} \left[ \left( h_D(x) - y \right)^2 \right]}_{\text{Expected Test Error}} &= \underbrace{\mathbb{E}_{x, D} \left[ \left( h_D(x) - \bar{h}(x) \right)^2 \right]}_{\text{Variance}} + \underbrace{\mathbb{E}_{x, y} \left[ \left( \bar{y}(x) - y \right)^2 \right]}_{\text{Noise}} + \underbrace{\mathbb{E}_x \left[ \left( \bar{h}(x) - \bar{y}(x) \right)^2 \right]}_{\text{Bias}^2} \end{align*} $$
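
To make the decomposition concrete, here is a minimal simulation sketch, assuming NumPy and scikit-learn are available; the true function, noise level, and model class are illustrative choices, not from the notes. It retrains the same model class on many resampled datasets $D$ and estimates each term empirically.

```python
# Sketch: empirically estimate bias^2 and variance for a fixed model class
# by retraining on many resampled datasets D. Setup is illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
sigma = 0.3                                # label noise std

def f(x):                                  # true function \bar{y}(x)
    return np.sin(3 * x)

x_test = np.linspace(0, 2, 200)[:, None]
n_datasets, n_train = 200, 50
preds = np.empty((n_datasets, len(x_test)))

for d in range(n_datasets):
    x_tr = rng.uniform(0, 2, (n_train, 1))
    y_tr = f(x_tr).ravel() + rng.normal(0, sigma, n_train)
    model = DecisionTreeRegressor(max_depth=3).fit(x_tr, y_tr)
    preds[d] = model.predict(x_test)       # h_D(x) for this dataset D

h_bar = preds.mean(axis=0)                 # \bar{h}(x): average prediction over D
variance = preds.var(axis=0).mean()        # E_{x,D}[(h_D(x) - \bar{h}(x))^2]
bias_sq = ((h_bar - f(x_test).ravel()) ** 2).mean()
noise = sigma ** 2                         # E[(\bar{y}(x) - y)^2]

print(f"variance={variance:.3f}  bias^2={bias_sq:.3f}  noise={noise:.3f}")
print(f"expected test error ≈ {variance + bias_sq + noise:.3f}")
```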

Underfitting vs. Overfitting

| Characteristic | Overfitting | Underfitting |
| --- | --- | --- |
| Training Error | Low | High |
| Validation/Test Error | High | High |
| Model Complexity | Too complex | Too simple |
| Generalization | Poor | Poor |
| Bias vs. Variance | Low bias, high variance | High bias, low variance |

Cross-Validation
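
Cross-validation estimates generalization error by holding out one fold at a time for validation while training on the remaining folds, then averaging the fold scores. A minimal k-fold sketch, assuming scikit-learn and a toy dataset (both illustrative choices):

```python
# Minimal k-fold cross-validation sketch (scikit-learn assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5 folds: each fold serves as the validation set exactly once.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())   # mean and spread of validation accuracy
```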

Fixing High Bias (Underfitting)

  1. Increase Model Complexity
  2. Decrease Regularization
  3. Add Features
  4. Increase Training Time
  5. Use Non-linear Models
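
A minimal sketch of remedies 1–3 above: give the model more capacity (non-linear features) and weaken the regularization. The data-generating process and parameter values are illustrative assumptions.

```python
# Sketch: reduce underfitting by increasing model capacity and decreasing
# regularization. Data generation and hyperparameters are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X).ravel() + 0.5 * X.ravel() ** 2 + rng.normal(0, 0.2, 300)  # non-linear target

# High-bias baseline: linear model with heavy regularization.
simple = Ridge(alpha=100.0)
# More capacity, lighter regularization: polynomial features, smaller alpha.
richer = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1e-2))

for name, model in [("linear, alpha=100", simple), ("poly deg=5, alpha=0.01", richer)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```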

Fixing High Variance (Overfitting)

  1. Decrease Model Complexity
  2. Increase Regularization
  3. Reduce Feature Space
  4. Increase Training Data
  5. Use Early Stopping
  6. Use Ensemble Methods
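
A minimal sketch of remedies 2 and 5 above, using a stronger L2 penalty and built-in early stopping; the dataset and parameter values are illustrative assumptions.

```python
# Sketch: curb overfitting with stronger regularization and early stopping.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Larger alpha = stronger L2 shrinkage; early_stopping halts training when
# the internal validation score stops improving.
model = SGDClassifier(loss="log_loss",      # logistic loss ("log" in older scikit-learn)
                      penalty="l2", alpha=1e-2,
                      early_stopping=True, validation_fraction=0.2,
                      n_iter_no_change=5, random_state=0)

print(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```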

Balance: Bias-Variance Tradeoff

The solution lies in striking a balance between bias and variance. You can experiment iteratively with model complexity, regularization strength, feature set size, and the amount of training data, using training versus validation error to judge each change, as in the sketch below.
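
A sketch of that iteration: sweep one complexity/regularization knob and compare training and validation scores at each setting. The dataset, estimator, and parameter grid are illustrative assumptions.

```python
# Sketch: locate the bias-variance sweet spot with a validation curve.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_range = np.logspace(-3, 2, 6)  # SVC regularization strength C

train_scores, val_scores = validation_curve(
    SVC(kernel="rbf", gamma="scale"), X, y,
    param_name="C", param_range=param_range, cv=5)

for C, tr, va in zip(param_range, train_scores.mean(1), val_scores.mean(1)):
    # Large gap (train >> val) = overfitting; both low = underfitting.
    print(f"C={C:g}  train={tr:.3f}  val={va:.3f}")
```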

Ensemble Methods

Pros & Cons

Pros: ensembles usually generalize better than any single base model (bagging mainly reduces variance, boosting mainly reduces bias). Cons: higher training and inference cost, more hyperparameters to tune, and the combined model is harder to interpret.

1. Bagging (Bootstrap Aggregating): Train several models in parallel, each on a bootstrap sample of the training data, and aggregate their predictions (averaging for regression, majority vote for classification). Primarily reduces variance.

2. Boosting: Train models sequentially, with each new model focusing on the examples the current ensemble gets wrong (or fitting its residual errors). Primarily reduces bias.

3. Stacking (Stacked Generalization): Train several different base models, then train a meta-model that takes their predictions as inputs and learns how best to combine them.
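
As a concrete picture of bagging (item 1 above), a from-scratch sketch: each tree sees a bootstrap sample and the final prediction is the average. The data setup is an illustrative assumption.

```python
# Minimal from-scratch bagging sketch: bootstrap samples + averaged predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

n_models = 50
trees = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), len(X))       # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

X_new = np.linspace(-3, 3, 5)[:, None]
bagged_pred = np.mean([t.predict(X_new) for t in trees], axis=0)  # aggregate
print(bagged_pred)
```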

Ensemble Algorithms

  1. Random Forest: Bagging applied to decision trees
  2. AdaBoost: Boosting algorithm that combines decision trees or stumps
  3. Gradient Boosting Machines (GBM): Sequential training in which each new model is fit to the negative gradient of the loss (the residuals, for squared loss) on the current ensemble's predictions
  4. XGBoost, LightGBM, CatBoost: Optimized implementations of gradient boosting with faster training and improved accuracy
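
The listed algorithms share a fit/predict interface; a usage sketch with scikit-learn's implementations (dataset and hyperparameters are illustrative; XGBoost, LightGBM, and CatBoost expose analogous APIs):

```python
# Sketch: bagging, boosting, and stacking via scikit-learn estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())  # accuracy per ensemble
```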

Loss Functions

Loss functions provide a mathematical framework to quantify the error between a model's predictions and the true labels.

Loss Functions for Classification

  1. Least Square Loss: $(h_\theta(x) - y)^2$

  2. Zero-One Loss: $\mathbb{1}\{h_\theta(x) \cdot y \leq 0\}$

  3. Logistic Loss: $\log(1 + \exp(-h_\theta(x) \cdot y))$

  4. Hinge Loss: $\max(0,\ 1 - h_\theta(x) \cdot y)$

  5. Exponential Loss: $\exp(-h_\theta(x) \cdot y)$

  6. Cross-Entropy Loss: $-\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$ for binary labels $y \in \{0, 1\}$ and predicted probability $\hat{y}$; for $C$ classes, $-\sum_{c=1}^{C} y_c \log \hat{y}_c$ (see the sketch after this list)
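
The margin-based losses above (items 2–5) assume labels $y \in \{-1, +1\}$ and a real-valued score $h_\theta(x)$. A plain NumPy sketch of them, purely illustrative:

```python
# Classification losses as functions of the margin m = h_theta(x) * y,
# with labels y in {-1, +1}; cross-entropy uses labels in {0, 1}.
import numpy as np

def zero_one(score, y):     return (score * y <= 0).astype(float)
def logistic(score, y):     return np.log1p(np.exp(-score * y))
def hinge(score, y):        return np.maximum(0.0, 1.0 - score * y)
def exponential(score, y):  return np.exp(-score * y)

def cross_entropy(p, y01):  # y01 in {0, 1}, p = predicted probability
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

score, y = np.array([2.0, -0.5]), np.array([1, 1])
for loss in (zero_one, logistic, hinge, exponential):
    print(loss.__name__, loss(score, y))
print("cross_entropy", cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))
```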

Loss Functions for Regression

  1. Root Mean Square Error (RMSE): $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$

  2. Mean Absolute Error (MAE): $MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$

  3. RMSE vs. MAE: RMSE squares the residuals, so it penalizes large errors more heavily and is more sensitive to outliers; MAE weights all errors linearly and is more robust to outliers, but it is not differentiable at zero. The sketch below illustrates the difference with a single outlier.
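
A short NumPy comparison of the two metrics; the values are an illustrative toy example.

```python
# RMSE and MAE side by side; a single outlier moves RMSE much more than MAE.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred))

y_pred_outlier = np.array([2.5, 5.0, 4.0, 20.0])   # one large error
print(rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```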

MLE vs. Loss Functions

Maximum likelihood estimation and loss minimization are two views of the same objective: maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which for common noise models reduces to a familiar loss. A Gaussian noise model recovers the squared loss, and a Bernoulli likelihood recovers the logistic/cross-entropy loss.
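
A one-line check of the Gaussian case (a standard derivation, written in the notation of the losses above):

$$ \begin{align*} -\log \prod_{i=1}^n \mathcal{N}\!\left(y_i;\, h_\theta(x_i),\, \sigma^2\right) &= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( h_\theta(x_i) - y_i \right)^2 + \frac{n}{2} \log\left(2\pi\sigma^2\right) \end{align*} $$

so for fixed $\sigma$, maximizing the likelihood and minimizing the sum of squared errors select the same $\theta$.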

Batch Learning vs. Online Learning

  1. Batch Learning: The model is trained on the entire dataset at once (offline), and incorporating new data typically means retraining from scratch. Simple and stable, but slow and memory-intensive for very large or frequently changing datasets.

  2. Online Learning: The model is updated incrementally as individual examples (or small mini-batches) arrive, which suits streaming data and non-stationary distributions; the trade-off is sensitivity to the learning rate and to the order of the data, as in the sketch below.
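
A minimal online-learning sketch using scikit-learn's partial_fit; the simulated stream and model settings are illustrative assumptions.

```python
# Sketch: incremental updates from a simulated data stream via partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")     # logistic loss ("log" in older scikit-learn)
classes = np.array([0, 1])                 # must be declared on the first call

for step in range(10):                     # each iteration = one incoming mini-batch
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(model.score(X_test, y_test))         # accuracy on held-out stream data
```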

Confusion Matrix

|                 | Predicted Positive | Predicted Negative |
| --------------- | ------------------ | ------------------ |
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

  1. Accuracy: $(TP + TN) / (TP + TN + FP + FN)$

  2. Recall (Sensitivity/True Positive Rate): $TP / (TP + FN)$

  3. Precision: $TP / (TP + FP)$

| Scenario | Error Profile | Typical Use |
| --- | --- | --- |
| High recall, low precision | Many false positives but few false negatives | Good when missing a positive is costly, e.g. screening/detection |
| Low recall, high precision | Few false positives but many false negatives | Good when positive predictions must be trustworthy, e.g. spam filtering |

  4. F1 Score: $2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$

  5. Specificity (True Negative Rate): $TN / (TN + FP)$
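
A quick check of these formulas on a toy confusion matrix, with scikit-learn's metric functions as a cross-check; the labels are an illustrative example.

```python
# Compute the metrics above from a confusion matrix (scikit-learn assumed).
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
recall      = tp / (tp + fn)             # sensitivity / TPR
precision   = tp / (tp + fp)
specificity = tn / (tn + fp)             # TNR
f1          = 2 * recall * precision / (recall + precision)

print(accuracy, recall, precision, specificity, f1)
# Cross-check against the library implementations:
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred), f1_score(y_true, y_pred))
```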

ROC Curves, PR Curves

ROC (Receiver Operating Characteristic) Curves

An ROC curve plots the true positive rate (recall) against the false positive rate as the classification threshold is varied. The area under the curve (AUC) summarizes performance across all thresholds: 1.0 is perfect and 0.5 corresponds to random guessing.

Precision-Recall Curves

A precision-recall curve plots precision against recall as the threshold is varied. Because it ignores true negatives, it is usually more informative than the ROC curve when the positive class is rare (heavy class imbalance).
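
A minimal sketch of both curves computed from predicted probabilities, assuming scikit-learn and an illustrative train/test split.

```python
# Sketch: ROC and precision-recall curves from predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, probs)                   # ROC: TPR vs. FPR per threshold
precision, recall, _ = precision_recall_curve(y_te, probs)

print("ROC AUC:", roc_auc_score(y_te, probs))
print("PR  AUC (average precision):", average_precision_score(y_te, probs))
```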