Part 2: Machine Learning Algorithms

ML Problem Settings
- Basic Assumptions and Definitions
- Optimization (Gradient Descent)
Deterministic & Probabilistic Problems
Paramertric vs. Nonparametric Models
- Parametric Models
- Nonparametric Models
Diagnose ML model
Supervised Learning
1. Decision Tree (DT)
2. KNN (K-Nearest Neighbors) Classification
3. Perceptron
4. Linear Regression
  - Regularization
    - Ridge Regression (L2 Regularization)
    - Lasso Regression (L1 Regularization)
5. Logistic Regression
6. Neural Network
7. Naive Bayes
8. Support Vector Machine (SVM)
9. K-Means Clustering
10. Time Series Forecasting

ML Problem Settings

Basic Assumptions and Definitions

Probabilistic Model: The dataset is considered to be generated from an unknown probability distribution of input-output pairs.
Definition of ML:

Compute a function $f$ that has low expected error $\epsilon$ over sample $D$ (from unknown distribution) with respect to loss function $L$
- Key Assumption: The target function is unknown
- Model: Defines the hypothesis space for the learning algorithm to search
- Model Parameters: Numerical values or structures used to define hypotheses
- Learning Algorithm: Searches the hypothesis space to minimize loss
- Hyperparameters: Tunable model aspects not selected by the learning algorithm
- Inductive Bias: Assumptions used by the learner to predict outputs of given inputs for unseen data
Structure of ML Algorithm:
- Inputs (features): $x^i \in R^n$, $ i = 1, \ldots, m$
- Outputs (targets): $y^i \in Y, $ $i = 1, \ldots, m$
- Parameters: $ \theta \in R^d$
- Hypothesis Function: $ h_\theta: R^n \to \hat{Y}$
- Loss Function: $ \ell: \hat{Y} \times Y \to R_+$
- Canonical Optimization Problem: $\min_{\theta \in \Theta} \frac{1}{m} \sum_{i=1}^m \ell(h_\theta(x^i), y^i)$

Optimization (Gradient Descent)

Gradient Descent is an iterative optimization algorithm used to find the minimum of a function

Algorithm: Repeatedly update weights by calculating the negative partial derivatives for each coordinate
- Starting point ($w_{\text{old}}$): random or zero
- Stopping criterion: Stop when the gradient ($\frac{\partial J}{\partial w}$) becomes close to zero or when the change in $J(w)$ is below a predefined threshold
- Learning rate ($\alpha$): Step size
Useful in cases where closed-form solutions are unavailable or infeasible
- Total complexity for OLS: $O(M^2N + M^{2.373})$, where $M$ is the number of features and $N$ is the number of data points
Pros:
- Simple to implement
- Widely applicable for a variety of optimization problems
Cons:
- Slow convergence on large datasets
- May find local minima instead of global minima
- May not be suitable for non-convex problems

GD Variants

Method	Data Processed At a Time	Gradient Quality	Speed	Stability
Batch GD	Entire dataset	High (accurate gradients)	Slowest	Very Stable
Mini-Batch GD	Batch of data points	Balanced	Balanced	Balanced
Stochastic GD (SGD)	Single data point	Noisy (approximate gradients)	Fastest	Less Stable

Appendix: Complexity of matrix operations:
- Inner product $x^T y$: $O(n)$
- Matrix-vector product $Ax$: $O(n^2)$
- Matrix-matrix product $AB$: $O(n^3)$
- Matrix inverse $A^{-1}$: $O(n^3)$

Deterministic & Probabilistic Problems

[TODO]

Paramertric vs. Nonparametric Models

Parametric Models

Characteristics:
- Assume a specific distribution for the data
- Have a fixed number of parameters
- Complexity of the model is bounded, regardless of the amount of data
Examples:
- Linear Regression
- Logistic Regression
- Linear SVM
- K-Means Clustering
- PCA

Nonparametric Models

Characteristics:
- Do not make strong assumptions about the underlying function
- The number of parameters grows with the size of the training data
- More flexible but require more training data
Examples:
- K-Nearest Neighbor (k-NN)
- Decision Trees
- RBF Kernel SVMs
- Gaussian Processes (GPs)

Diagnose ML model

[TODO]

Supervised Learning

Algorithm	Hypothesis	Loss Function	Optimization Method
Linear Regression	Linear	Any regression loss	Any
SVM	Linear or kernel	Hinge loss	Any
Logistic Regression	(Usually) linear	Logistic loss	Any
NN (Neural Networks)	Composed non-linear function	Any	(Usually) gradient descent
DT (Decision Tree)	Axis-aligned halfplanes	Log probability under Bernoulli model	Greedy search
Naive Bayes	Linear	Joint probability of data and output	Analytic solution
Gradient Boosting	Ensemble of other models	Any	Gradient descent

Decision Tree (DT)

Pros

Easy to interpret and visualize
Computationally efficient in both time and memory
Widely applicable to classification, regression, and density estimation

Algorithm

Search for the best attribute (splitting criterion) to split the data
Use the chosen attribute to split data into subsets
Repeat steps 1 and 2 for each subset until stopping criteria are met
Optionally prune the tree to prevent overfitting

Complexity:
- Training: $O(MN \log N)$
- Prediction: $O(2^M \log N)$

Splitting Criteria in Decision Trees

Splitting criteria determine the best way to split data at each node of the tree.

Information Gain
- Measures the reduction in entropy (uncertainty)
- Goal: Choose the attribute that maximizes the information gain after the split
  
  $$ IG(S, A) = Entropy(S) - \sum_{i} \frac{|S_i|}{|S|} \cdot Entropy(S_i) $$
  
  where:
  - $S$: Original dataset
  - $A$: Attribute being split
  - $S_i$: Subset created by splitting $S$ on $A$
- Entropy (for classification problems):
  
  $$ Entropy(S) = -\sum_{j} p_j \log_2(p_j) $$
  
  where $p_j$ is the proportion of class $j$ in the dataset
- (+): Works particularly well for multi-class problems
Gini Index
- Measures the impurity of a dataset
- Goal: Choose the attribute that minimizes the Gini Index after the split
- Formula:
  
  $$ Gini(S) = 1 - \sum_{j} p_j^2 $$
  
  where $p_j$ is the proportion of class $j$ in the dataset
- (+): Computationally efficient and widely used in practical implementations, especially in CART (Classification and Regression Trees)
Error Rate
- Measures the misclassification error of a split
- Goal: Choose the attribute that minimizes the error rate after the split
- Formula:
  
  $$ Error\ Rate(S) = 1 - \max(p_j) $$
  
  where $p_j$ is the proportion of the majority class in the dataset
- (-): Less sensitive to changes in data distribution compared to Information Gain or Gini Index and rarely used in practice

Stopping Criteria

Splitting criterion falls below a threshold
Maximum tree depth is achieved
Split is not statistically significant
Full tree is grown, then pruned

Pruning

Prepare a validation set to evaluate the impact of pruning
Greedily remove nodes that improve validation accuracy
Stop pruning when further pruning is detrimental

Inductive Bias in Decision Trees

ID3 Algorithm (Greedy Search):
- Smallest Tree: Assumes that smaller trees are better generalizations and less likely to overfit
- High Mutual Information: Favors splitting on attributes that provide the most information gain (reduce uncertainty the most) near the root of the tree, ensuring that the most informative decisions are made early

Random Forest (RF)

Random Forest is an ensemble method that combines multiple decision trees using bagging and feature randomness to create a forest of uncorrelated trees

Random Forest Process

Bootstrap Sampling (Bagging):
- Draw a bootstrap dataset $D_b$ of size $n$ from training data $D$
- Grow a decision tree $T_b$ using $D_b$ by:
  1. Selecting $p$ (typically $p \approx \sqrt{d}$) random features from $d$
    - Equivalent to vanilla bagging when $p = d$
  2. Choosing the best feature among $p$ to split
  3. Recursively splitting until the stopping criterion is met (e.g., minimum leaf size)
Output: An ensemble of trees ${T_1, T_2, \dots, T_B}$

Out-of-Bag (OOB) Score

Bootstrap Sampling: Each decision tree is trained on a randomly sampled subset of the training data (with replacement)
On average, 63% of the original data is included in each bootstrap sample for training, while 37% of the data is left out as out-of-bag (OOB) samples
OOB score is calculated as:
- the mean squared error (MSE) (for regression)
- the accuracy (for classification)
OOB score serves as an unbiased estimate of the Random Forest's generalization error without the need for additional validation datasets

Interpretability and Feature Importance

Trade-off: Bagging improves prediction accuracy but reduces interpretability
Feature Importance
- For regression trees: Calculated as the total reduction in RSS (Residual Sum of Squares) due to splits involving a feature, averaged across all $B$ trees
- For classification trees: Measured by the total decrease in impurity (e.g., Gini index) caused by splits involving a feature, averaged over all $B$ trees
Features are ranked by the average contribution to the measures, where higher values indicate greater importance

Advantages of Random Forest

Handles missing values and outliers
Less prone to overfitting; Low bias and moderate variance
Generalizes to high-dimensional data and correlated features
Robust to data variations and non-linear patterns
Can be trained in parallel for efficiency
Does not require cross-validation to estimate error

Boosting Regression Trees (Gradient Boosting)

Boosting mirrors human learning by iteratively improving performance on challenging examples
Examples: XGBoost, LightGBM, CatBoost

Steps to Build Boosted Regression Trees

Initialize:
- $f(x) = 0$ and residuals $r_i = y_i$ for all instances $i$ in the training set $(X, y)$
For each boosting iteration $b = 1, \dots, B$:
1. Fit a regression tree $T_b$ to the training data $(X, r)$
  - The tree predicts $f_b(x)$ for an instance $x$
2. Update the prediction model:
  - $f(x) \leftarrow f(x) + \lambda f_b(x)$, where $\lambda$ is the learning rate
3. Update the residuals:
  - $r_i \leftarrow r_i - \lambda f_b(x_i)$
Output the final boosted regression model:
- $f(x) = \sum_{b=1}^B \lambda f_b(x)$

Hyperparameters

Number of Trees ($B$):
- Controls the size of the ensemble
- Too large $B$ may lead to overfitting
Learning Rate ($\lambda$):
- Controls how much each tree contributes to the final model
- Small $\lambda$ requires more iterations but improves model performance
Regularization Parameter ($h_\text{max}$ or $n_\text{min}$):
- Limits tree complexity (e.g., maximum depth or minimum leaf size)
- Smaller trees reduce overfitting but may require a larger $B$

Practical Tips

Hyperparameter tuning often involves cross-validation over:
- A 3D grid of $B$, $\lambda$, and $h_\text{max}$
- Alternatively, fixing $B$ and searching over $\lambda$ and $h_\text{max}$ (2D grid)

Boosting Classification Trees (AdaBoost)

AdaBoost Process

Initialize the weights $ \beta_i^{(0)} = 1/n $ for training instances $ i = 1, \dots, n $ where $ y_i \in {-1, +1} $
For $ b = 1, \dots, B $:
1. Fit a decision tree $ T_b $ to training data using weights $ \beta_i^{(b)} $. Denote the prediction function of $ T_b $ as $ f_b : \mathbf{x} \to {-1, +1} $
2. Compute training error rate for $ f_b $ on the weighted data:
  $$ \text{error}_b = \frac{\sum_{i=1}^n \beta_i \cdot \mathbb{I}(y_i \neq f_b(x_i))}{\sum_{i=1}^n \beta_i} $$
3. Compute classifier weight $ \alpha_b $ for $ f_b $:
  $$ \alpha_b = \frac{1}{2} \log \left( \frac{1 - \text{error}_b}{\text{error}_b} \right) $$
4. Update weights of training instances:
  $$ \beta_i^{(b)} \gets \frac{\beta_i^{(b-1)} \cdot \exp \left[ -\alpha_b \cdot y_i \cdot f_b(x_i) \right]}{\sum_{i=1}^n \beta_i^{(b-1)} \cdot \exp \left[ -\alpha_b \cdot y_i \cdot f_b(x_i) \right]}, \quad i = 1, \dots, n $$
Output the boosted classification model:

$$ f(\mathbf{x}) = \text{sign} \left( \sum_{b=1}^B \alpha_b f_b(\mathbf{x}) \right) $$

KNN (K-Nearest Neighbors) Classification

Algorithm

Training: Store the dataset $D$
- $O(1)$
Prediction: Assign the most common label (majority vote) of the $K$-nearest points in $D$ to a given input based on the computed distances
- $O(MN)$, where $M$ is the number of features and $N$ is the size of the dataset
- Behavior with $K$:
  - Small $K$: Model may overfit
  - Large $K$: Majority vote in $D$ dominates

Distance Functions

Euclidean Distance:

$$ g(u, v) = \sqrt{\sum_{m=1}^{M} (u_m - v_m)^2} $$
Manhattan Distance:

$$ g(u, v) = \sum_{m=1}^{M} |u_m - v_m| $$

Tuning

Use different distance functions
Increase $K$
Remove the farthest point
Apply weighted voting (give closer points higher weight)

Inductive Bias in KNN

Assumes that nearby points are likely to have similar labels
Assumes equal importance for all features (can degrade performance in high-dimensional datasets and lead to the curse of dimensionality)

Perceptron

Perceptron is a foundational machine learning model designed for binary classification tasks

Uses weights for features to learn a linear decision boundary
Features with larger absolute weights have greater influence on prediction (requires feature scaling)

Algorithm

Initialize:
- Set all weights ($\theta$) to zero
While not converged:
- For each training example $(x^i, y^i)$:
  - Predict $\hat{y} = h_\theta(x) = \text{sign}(\theta^T x)$
  - If $\hat{y} \neq y$, update weights: $\theta \gets \theta + y^i x^i$

Perceptron Mistake Bound

If data has a margin $\gamma$ and is within a radius $R$, the Perceptron makes at most $(R / \gamma)^2$ mistakes
Scaling Invariance: Scaling all inputs does not affect the number of mistakes

Inductive Bias in Perceptron

Assumes linear decision boundaries
Prioritizes correcting the most recent mistakes (suitable for online learning)

Extensions

Voted Perceptron:
- Stores all weight vectors during training; Allows weighted voting
- Improves generalization but memory-intensive
Averaged Perceptron:
- Uses running averages of weights for memory efficiency
Kernel Perceptron:
- Uses the kernel trick to handle non-linear decision boundaries
Structured Perceptron:
- Designed for tasks with exponentially large output spaces

Linear Regression

Linear Regression models the relationship between variables with an affine combination (linear combination where the coefficients (weights) sum to 1)
Closed-form solution:

$$ \theta = (X^T X)^{-1} X^T y $$

Assumptions

Linear Relationship: Independent variables are assumed to have a linear relationship with the dependent variable
Normal Residuals: Residuals should follow a normal distribution (check via Q-Q plot)
No Multicollinearity: Independent variables should not be highly correlated to ensure stable coefficient estimates
Homoscedasticity: Error variance should be constant across all predictor levels
Independent Residuals: Observations should be randomly and independently sampled

Data Preparation

Address non-linear relationships (e.g., log transformation for an exponential relationship)
Clean data to remove noise
Remove highly correlated variables
Standardize variables for better predictions

Intercept Term

The intercept term in a regression model ensures that:

Residuals have zero mean: The average difference between the observed values and the predicted values is zero, which prevents systematic overestimation or underestimation of the dependent variable
Unbiased slope estimates: The inclusion of an intercept allows the regression line to fit the data accurately, ensuring that the estimated slope coefficients are not distorted by forcing the line to pass through the origin

Model Evaluation

R-squared/Adjusted R-squared
F-test
- Null Hypothese $H_0$: All regression coefficients are equal to zero
RMSE

Regularization

Regularization involves adding a penalty term to the loss function in regression models to prevent overfitting and enhance generalization

Ridge Regression (L2 Regularization)
- Penalizes the sum of squared coefficients ($|\mathbf{w}|_2^2 = \sum_j w_j^2$)
- Loss function:
  
  $$ \ell(\mathbf{w}) = \frac{1}{2} \|\mathbf{Xw} - \mathbf{y}\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 $$
- Closed-form solution:
  
  $$ \mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} $$
Lasso Regression (L1 Regularization)
- Penalizes the absolute sum of coefficients ($|\mathbf{w}|_1 = \sum_j |w_j|$)
- Encourages sparsity by driving some coefficients to zero, enabling feature selection
- No closed-form solution due to the non-differentiability of $|\mathbf{w}|_1$ at $w_j = 0$
- Requires subgradient optimization techniques for implementation

Practical Tips

Proper selection of $\lambda$ balances bias and variance, optimizing model performance
- $\lambda = 0$: No regularization, reduces to standard least squares regression, might lead to overfitting
- $\lambda \to \infty$: Coefficients approach zero, might lead to underfitting

Logistic Regression

Logistic Regression predicts the probability of a binary outcome using a linear combination of independent variables transformed by a logit function

Commonly known as the logit model, a probabilistic classifier optimized through MLE
Shares similarities with SVMs in separating hyperplanes, but differs in loss function

Algorithm

Hypothesis Function:

$$ p(y=1|x) = \frac{1}{1 + e^{-\theta^T x}} $$
- Models conditional probability using logits ($\theta^T x$)
Loss Function:
- Logistic Loss (log-loss): $\log(1 + e^{-h_\theta(x) \cdot y})$
- Optimized using gradient descent

Multi-class Logistic Regression

"One-vs-All" Method: Creates $k$ binary classifiers for $k$ classes
Loss Function:
- Softmax Loss: $p(y|x) = \frac{e^{\theta_k^T x}}{\sum_{k=1}^K e^{\theta_k^T x}}$
- Optimized using MLE

Neural Network

Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem, assuming conditional independence among features

Key Concepts

Naive Bayes Assumption: Features $X_i$ are conditionally independent given $Y$:
$$ P(X|Y) = \prod_{i=1}^n P(X_i|Y) $$
Goal: Compute $P(Y|X)$ using Bayes' theorem:
$$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$$

Algorithm

Define Distributions
- Specify the prior distribution $P(Y)$ for the class labels $Y$
- Specify the likelihood distributions $P(X_i | Y=y)$ for each feature $X_i$, conditioned on the class $Y=y$
- The distributions depend on the type of data:
  - Categorical Data: Use multinomial or Bernoulli distributions
  - Continuous Data: Use Gaussian (Normal) distributions
Estimate Parameters
- Use the MLE or Maximum A Posteriori (MAP) to estimate the parameters of the distributions:
  - For $P(Y)$:
    $$ P(Y=y) = \frac{\text{Number of instances with } Y=y}{\text{Total number of instances}} $$
  - For $P(X_i|Y=y)$:
    - If the feature is categorical:
      $$ P(X_i = x | Y = y) = \frac{\text{Count of } X_i = x \text{ and } Y = y + \alpha}{\text{Count of } Y = y + \alpha K} $$
      
      where
      - $\alpha$: Laplace smoothing factor
      - $K$: Number of unique categories for $X_i$
    - If the feature is continuous:
      $$ P(X_i | Y = y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp \left( -\frac{(X_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right) $$
      
      where
      - $\mu_{y,i}$: Mean of $X_i$ in class $Y = y$
      - $\sigma_{y,i}^2$: Variance of $X_i$ in class $Y = y$
Predict for Input $X$
- Compute the posterior probability for each class $y$ given the input $X = {X_1, X_2, \dots, X_n}$:
  $$ P(Y=y|X) \propto P(Y=y) \prod_{i=1}^n P(X_i|Y=y) $$
- To avoid numerical underflow, compute the log probabilities:
  $$ \log P(Y=y|X) \propto \log P(Y=y) + \sum_{i=1}^n \log P(X_i|Y=y) $$
Select the Class
- Identify the class with the highest posterior probability:
  $$ y^* = \arg \max_{y \in \mathcal{Y}} P(Y=y|X) $$
- If using log probabilities:
  $$ y^* = \arg \max_{y \in \mathcal{Y}} \left( \log P(Y=y) + \sum_{i=1}^n \log P(X_i|Y=y) \right) $$

Example:

Assume you want to classify an email as Spam or Not Spam
$X$ represents the words in the email, and $Y$ represents the class (Spam or Not Spam)
Compute $P(\text{Spam})$, $P(\text{Not Spam})$, $P(\text{Word}|\text{Spam})$, and $P(\text{Word}|\text{Not Spam})$ based on training data
For a new email, calculate $P(\text{Spam|Email})$ and $P(\text{Not Spam|Email})$ and choose the class with the highest probability

Laplace Smoothing

Laplace Smoothing addresses the issue of zero probabilities in Naive Bayes classification. If a feature value never occurs with a certain class in the training data, the estimated probability is zero, which can cause the entire probability product $P(X|Y)$ to become zero, making predictions impossible.

For a categorical feature $j$ with $K$ unique values:
$$ \phi_{j,k|y=c} = P(x_j = v_k | y = c) = \frac{\sum_{i=1}^n \mathbb{I}(x_{ij} = v_k \text{ and } y_i = c) + 1}{\sum_{i=1}^n \mathbb{I}(y_i = c) + K} $$
- $\phi_{j,k|y=c}$: Smoothed conditional probability
- $\mathbb{I}$: Indicator function (1 if true, 0 otherwise)
- $K$: Number of unique values for feature $j$
- +1: Adds a constant (pseudo-count) for each feature value, ensuring $\phi_{j,k|y=c} \neq 0$

Example:

If a new word ("neurips") appears in an email and was not seen during training, $P(\text{"neurips"}|Y) = 0$ without smoothing
Laplace smoothing ensures $P(\text{"neurips"}|Y) \neq 0$, allowing predictions to continue without failure

Support Vector Machine (SVM)

SVM aims to find the hyperplane that maximizes the margin between two classes

Algorithm

Hypothesis: Linear
Loss: Hinge loss + Regularization term
$$ \text{minimize } \sum_{i=1}^m \max(1 - y^{(i)} \cdot \theta^T x^{(i)}, 0) + \frac{\lambda}{2} \|\theta\|^2 $$
Optimization: Quadratic programming

Hyperparameters in SVM

Regularization Parameter ($C$): Balances the trade-off between maximizing the margin and minimizing the classification error
- High $C$:
  - Penalizes misclassified points heavily
  - Results in a smaller margin, which fits the training data more closely (risk of overfitting)
- Low $C$:
  - Allows for more misclassified points
  - Results in a larger margin, which generalizes better to unseen data (less risk of overfitting)
Gamma: Defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'
- Low Gamma:
  - Each training example has a further influence
  - The model is more flexible and captures global patterns (risk of underfitting)
- High Gamma:
  - Each training example has a closer influence
  - The model fits closely to the training data and captures fine details (risk of overfitting)

Comparison

SVM vs. Perceptron

SVM Perceptron

Focus Maximizes the margin Reduces error

Online Training Not supported Supported

	SVM	Perceptron
Focus	Maximizes the margin	Reduces error
Online Training	Not supported	Supported

SVM vs. Logistic Regression

	SVM	Logistic Regression
Margin Maximization	Yes	No
Sensitivity to Outliers	Less sensitive	More sensitive
Output Type	Discrete (0 or 1)	Probabilities
Non-linear Decision Boundaries	Supported using kernel functions	Not supported

Advantages

Effective in high-dimensional spaces
Can handle non-linear decision boundaries using kernel functions

Part 2: Machine Learning Algorithms

Table of Contents

ML Problem Settings

Basic Assumptions and Definitions

Optimization (Gradient Descent)

Deterministic & Probabilistic Problems

Paramertric vs. Nonparametric Models

Parametric Models

Nonparametric Models

Diagnose ML model

Supervised Learning

Decision Tree (DT)

Pros

Algorithm

Splitting Criteria in Decision Trees

Stopping Criteria

Pruning

Inductive Bias in Decision Trees

Random Forest (RF)

Random Forest Process

Out-of-Bag (OOB) Score

Interpretability and Feature Importance

Advantages of Random Forest

Boosting Regression Trees (Gradient Boosting)

Steps to Build Boosted Regression Trees

Hyperparameters

Boosting Classification Trees (AdaBoost)

AdaBoost Process

KNN (K-Nearest Neighbors) Classification

Algorithm

Distance Functions

Tuning

Inductive Bias in KNN

Perceptron

Algorithm

Perceptron Mistake Bound

Inductive Bias in Perceptron

Extensions

Linear Regression

Assumptions

Data Preparation

Intercept Term

Model Evaluation

Regularization

Logistic Regression

Algorithm

Multi-class Logistic Regression

Neural Network

Naive Bayes

Key Concepts

Algorithm

Laplace Smoothing

Support Vector Machine (SVM)

Algorithm

Hyperparameters in SVM

Comparison

Advantages

K-Means Clustering

Time Series Forecasting