Part 2: Machine Learning Algorithms

ML Problem Settings

Basic Assumptions and Definitions

  1. Probabilistic Model: The dataset is assumed to be drawn from an unknown joint probability distribution over input-output pairs.

  2. Definition of ML:

    Given a sample $D$ drawn from the unknown distribution, compute a function $f$ whose expected error $\epsilon$ with respect to a loss function $L$ is low

  3. Structure of ML Algorithm:

Optimization (Gradient Descent)

Gradient Descent is an iterative optimization algorithm that finds a (local) minimum of a differentiable function by repeatedly stepping in the direction of the negative gradient
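
A minimal sketch of the update rule, assuming a hand-coded gradient for a simple one-dimensional quadratic (the function and step size are illustrative, not from the notes above):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, n_iters=100):
    """Repeatedly step against the gradient to approach a (local) minimum."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * grad(x)   # move in the direction of the negative gradient
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))  # approaches 3.0
```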

Deterministic & Probabilistic Problems

[TODO]

Parametric vs. Nonparametric Models

Parametric Models

Nonparametric Models

Diagnose ML model

[TODO]

Supervised Learning

| Algorithm | Hypothesis | Loss Function | Optimization Method |
|---|---|---|---|
| Linear Regression | Linear | Any regression loss | Any |
| SVM | Linear or kernel | Hinge loss | Any |
| Logistic Regression | (Usually) linear | Logistic loss | Any |
| NN (Neural Networks) | Composed non-linear function | Any | (Usually) gradient descent |
| DT (Decision Tree) | Axis-aligned halfplanes | Log probability under Bernoulli model | Greedy search |
| Naive Bayes | Linear | Joint probability of data and output | Analytic solution |
| Gradient Boosting | Ensemble of other models | Any | Gradient descent |

Decision Tree (DT)

Pros

Algorithm

  1. Search for the best attribute (splitting criterion) to split the data
  2. Use the chosen attribute to split data into subsets
  3. Repeat steps 1 and 2 for each subset until stopping criteria are met
  4. Optionally prune the tree to prevent overfitting
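
A minimal scikit-learn sketch of these steps, using the iris data as a stand-in dataset; the parameter names are scikit-learn's (criterion for step 1, max_depth / min_samples_leaf as stopping criteria for step 3, ccp_alpha for pruning in step 4):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",     # splitting criterion (step 1)
    max_depth=4,             # stopping criterion (step 3)
    min_samples_leaf=5,      # stopping criterion (step 3)
    ccp_alpha=0.01,          # cost-complexity pruning (step 4)
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```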

Splitting Criteria in Decision Trees

Splitting criteria determine the best way to split data at each node of the tree.

  1. Information Gain

  2. Gini Index

  3. Error Rate
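
A sketch of the three criteria as impurity measures over a vector of class labels, assuming the usual definitions; information gain is then the parent impurity minus the size-weighted impurity of the child nodes:

```python
import numpy as np

def class_probs(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = class_probs(labels)
    return -np.sum(p * np.log2(p))

def gini(labels):
    p = class_probs(labels)
    return 1.0 - np.sum(p ** 2)

def error_rate(labels):
    return 1.0 - class_probs(labels).max()

def information_gain(parent, left, right):
    """Entropy of the parent minus size-weighted entropy of the children."""
    n = len(left) + len(right)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children
```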

Stopping Criteria

Pruning

  1. Prepare a validation set to evaluate the impact of pruning
  2. Greedily remove nodes that improve validation accuracy
  3. Stop pruning when further pruning is detrimental
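
The steps above describe reduced-error pruning; as a stand-in, the sketch below uses scikit-learn's cost-complexity pruning path and keeps the pruning strength that maximizes validation accuracy, following the same validate-and-stop idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def prune_with_validation(X_train, y_train, X_val, y_val):
    # Candidate pruning strengths along the cost-complexity path
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
    best_tree, best_acc = None, -np.inf
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        tree.fit(X_train, y_train)
        acc = tree.score(X_val, y_val)   # evaluate the pruned tree on held-out data
        if acc > best_acc:               # keep the pruning level that helps most
            best_tree, best_acc = tree, acc
    return best_tree
```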

Inductive Bias in Decision Trees


Random Forest (RF)

Random Forest is an ensemble method that combines multiple decision trees using bagging and feature randomness to create a forest of uncorrelated trees

Random Forest Process

Out-of-Bag (OOB) Score

Interpretability and Feature Importance

Advantages of Random Forest

  1. Handles missing values and outliers
  2. Less prone to overfitting; Low bias and moderate variance
  3. Generalizes to high-dimensional data and correlated features
  4. Robust to data variations and non-linear patterns
  5. Can be trained in parallel for efficiency
  6. Does not require cross-validation to estimate error
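
A minimal scikit-learn sketch, using a built-in dataset as a stand-in; `oob_score=True` provides the built-in generalization estimate mentioned in point 6 above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging (bootstrap sample per tree) plus feature randomness (max_features)
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", oob_score=True, n_jobs=-1, random_state=0
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)                  # error estimate without CV
print("Top features:", forest.feature_importances_.argsort()[::-1][:5])
```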

Boosting Regression Trees (Gradient Boosting)

Steps to Build Boosted Regression Trees

  1. Initialize: set the model to $f(x) = 0$ and the residuals to $r_i = y_i$ for every training instance $i$
  2. For each boosting iteration $b = 1, \dots, B$:
    1. Fit a regression tree $T_b$ to the training data $(X, r)$
      • The tree predicts $f_b(x)$ for an instance $x$
    2. Update the prediction model:
      • $f(x) \leftarrow f(x) + \lambda f_b(x)$, where $\lambda$ is the learning rate
    3. Update the residuals:
      • $r_i \leftarrow r_i - \lambda f_b(x_i)$
  3. Output the final boosted regression model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda f_b(x)$
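
A minimal NumPy/scikit-learn sketch of this loop for squared-error loss, assuming training arrays `X` and `y`; `max_depth` plays the role of $h_\text{max}$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=100, lam=0.1, max_depth=2):
    trees = []
    r = y.astype(float).copy()          # residuals start as the targets since f(x) = 0
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                  # step 2.1: fit a small tree to the residuals
        r -= lam * tree.predict(X)      # step 2.3: shrink the update, refresh residuals
        trees.append(tree)
    return trees, lam

def boosted_predict(trees, lam, X):
    # Step 3: f(x) = sum over b of lam * f_b(x)
    return lam * np.sum([t.predict(X) for t in trees], axis=0)
```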

Hyperparameters

  1. Number of Trees ($B$): more trees reduce training error, but a very large $B$ can overfit slowly; choose it with a validation set or cross-validation

  2. Learning Rate ($\lambda$): shrinks each tree's contribution; smaller values (e.g., 0.01) usually generalize better but require more trees

  3. Regularization Parameter ($h_\text{max}$ or $n_\text{min}$): limits tree complexity through the maximum depth $h_\text{max}$ or the minimum number of observations per leaf $n_\text{min}$

Practical Tips


Boosting Classification Trees (AdaBoost)

AdaBoost Process


KNN (K-Nearest Neighbors) Classification

Algorithm

  1. Training: Store the dataset $D$

  2. Prediction: Assign the most common label (majority vote) of the $K$-nearest points in $D$ to a given input based on the computed distances

Distance Functions

  1. Euclidean Distance:

    $$ g(u, v) = \sqrt{\sum_{m=1}^{M} (u_m - v_m)^2} $$
  2. Manhattan Distance:

    $$ g(u, v) = \sum_{m=1}^{M} |u_m - v_m| $$
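
A minimal NumPy sketch of the prediction step with either distance, assuming a stored training set `X_train`, `y_train`:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=5, metric="euclidean"):
    diff = X_train - x
    if metric == "euclidean":
        dists = np.sqrt(np.sum(diff ** 2, axis=1))
    else:                                       # Manhattan distance
        dists = np.sum(np.abs(diff), axis=1)
    nearest = np.argsort(dists)[:k]             # indices of the K closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote
```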

Tuning

  1. Use different distance functions
  2. Increase $K$
  3. Remove the farthest point
  4. Apply weighted voting (give closer points higher weight)

Inductive Bias in KNN

  1. Assumes that nearby points are likely to have similar labels
  2. Assumes equal importance for all features (can degrade performance in high-dimensional datasets and lead to the curse of dimensionality)

Perceptron

Perceptron is a foundational machine learning model designed for binary classification tasks

Algorithm

  1. Initialize: set the weight vector $w = 0$ and bias $b = 0$
  2. While not converged: for each training example $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, predict $\hat{y} = \mathrm{sign}(w \cdot x_i + b)$; on a mistake, update $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
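
A minimal NumPy sketch of this loop, assuming labels in $\{-1, +1\}$:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i                     # nudge the hyperplane toward x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:                          # converged: a full pass with no errors
            break
    return w, b
```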

Perceptron Mistake Bound

Inductive Bias in Perceptron

Extensions

  1. Voted Perceptron: stores every intermediate weight vector together with how many examples it survived, and predicts by a weighted vote of their outputs
  2. Averaged Perceptron: predicts with the average of all intermediate weight vectors, a cheaper approximation of the voted version
  3. Kernel Perceptron: replaces dot products with a kernel function, allowing non-linear decision boundaries
  4. Structured Perceptron: generalizes the mistake-driven update to structured outputs such as sequences

Linear Regression

Assumptions

  1. Linear Relationship: Independent variables are assumed to have a linear relationship with the dependent variable
  2. Normal Residuals: Residuals should follow a normal distribution (check via Q-Q plot)
  3. No Multicollinearity: Independent variables should not be highly correlated to ensure stable coefficient estimates
  4. Homoscedasticity: Error variance should be constant across all predictor levels
  5. Independent Residuals: Observations should be randomly and independently sampled

Data Preparation

  1. Address non-linear relationships (e.g., log transformation for an exponential relationship)
  2. Clean data to remove noise
  3. Remove highly correlated variables
  4. Standardize variables for better predictions

Intercept Term

The intercept term in a regression model ensures that:

  1. Residuals have zero mean: The average difference between the observed values and the predicted values is zero, which prevents systematic overestimation or underestimation of the dependent variable

  2. Unbiased slope estimates: The inclusion of an intercept allows the regression line to fit the data accurately, ensuring that the estimated slope coefficients are not distorted by forcing the line to pass through the origin

Model Evaluation

Regularization

Regularization involves adding a penalty term to the loss function in regression models to prevent overfitting and enhance generalization

  1. Ridge Regression (L2 Regularization): adds a penalty $\lambda \sum_j \beta_j^2$, shrinking coefficients toward zero without setting them exactly to zero

  2. Lasso Regression (L1 Regularization): adds a penalty $\lambda \sum_j |\beta_j|$, which can set some coefficients exactly to zero and thus performs feature selection
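
A minimal scikit-learn sketch on a built-in dataset; note that scikit-learn calls the penalty strength `alpha` rather than $\lambda$, and that the predictors should be standardized so the penalty treats them comparably:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)        # put predictors on comparable scales

ridge = Ridge(alpha=1.0).fit(X, y)           # L2: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)           # L1: can zero out coefficients

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)    # some entries are exactly 0
```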

Practical Tips


Logistic Regression

Logistic Regression models the probability of a binary outcome by passing a linear combination of the independent variables through the logistic (sigmoid) function; equivalently, the log-odds (logit) of the outcome are a linear function of the inputs

Algorithm
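
A minimal NumPy sketch of a gradient-descent fit under the logistic loss, assuming a feature matrix `X` and labels `y` in $\{0, 1\}$ (maximum likelihood via any convex optimizer works equally well):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)             # predicted P(y = 1 | x)
        grad_w = X.T @ (p - y) / len(y)    # gradient of the mean logistic loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, X):
    return sigmoid(X @ w + b)
```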

Multi-class Logistic Regression


Neural Network


Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem, assuming conditional independence among features

Key Concepts

Algorithm

  1. Define Distributions: choose a form for the class prior $P(Y)$ and the class-conditional distributions $P(X_m \mid Y)$ for each feature $m$

  2. Estimate Parameters: fit the prior and the conditionals from the training data, typically by maximum likelihood (counting and normalizing for categorical features)

  3. Predict for Input $X$: compute $P(Y = y) \prod_m P(x_m \mid Y = y)$ for every class $y$

  4. Select the Class: output the class with the highest score, i.e., the maximum a posteriori class

Example:

Laplace Smoothing

Laplace Smoothing addresses the issue of zero probabilities in Naive Bayes classification. If a feature value never occurs with a certain class in the training data, the estimated probability is zero, which can cause the entire probability product $P(X|Y)$ to become zero, making predictions impossible.

Example:
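
As a hypothetical illustration, with add-one smoothing the estimate becomes $P(x_m = v \mid y) = \frac{\text{count}(x_m = v, y) + 1}{\text{count}(y) + V}$, where $V$ is the number of possible values of feature $m$:

```python
def laplace_smoothed_prob(count_value_and_class, count_class, n_values):
    """Add-one estimate of P(x_m = v | y); never returns zero."""
    return (count_value_and_class + 1) / (count_class + n_values)

# A feature value never seen with a class still gets nonzero probability:
print(laplace_smoothed_prob(0, 50, 3))   # 1/53 instead of 0
```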


Support Vector Machine (SVM)

SVM aims to find the hyperplane that maximizes the margin between two classes

Algorithm

Hyperparameters in SVM

  1. Regularization Parameter ($C$): Balances the trade-off between maximizing the margin and minimizing the classification error

  2. Gamma: Defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'
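
A minimal scikit-learn sketch that tunes both hyperparameters with a small grid search, using an RBF kernel so that `gamma` applies and a built-in dataset as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale features, then search over C (margin vs. error) and gamma (kernel reach)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.001]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```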

Comparison

Advantages


K-Means Clustering


Time Series Forecasting

