Data Preprocessing Algorithms

Problems with Data

  • Outliers: Extreme values that deviate significantly from the rest of the data

    • Causes: data entry errors, measurement errors, natural variability

    • Solutions: Detect and remove outliers using statistical methods (z-scores, interquartile range). Use algorithms like tree-based models that are less sensitive to outliers
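The IQR rule mentioned above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the `iqr_outliers` helper and the multiplier `k=1.5` (the common rule of thumb) are choices for this example, not a fixed standard.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

data = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])  # 95.0 is an obvious outlier
mask = iqr_outliers(data)
print(data[mask])   # the flagged outliers
print(data[~mask])  # data with the outliers removed
```

Z-score filtering works the same way, except the fences come from the mean and standard deviation instead of the quartiles.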

  • Imbalanced Data: In classification, imbalanced datasets (where one class is very rare) can cause models to be biased toward the majority class

    • Causes: Natural rarity of certain events, sampling error

    • Solutions: Resampling (e.g., SMOTE), or use techniques designed to handle imbalance (cost-sensitive learning)

      • Downsample: if the minority class is only 0.5% of the data, downsample the majority class until the classes are balanced; if accuracy holds up after rebalancing, the model is genuinely accurate
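A minimal downsampling sketch, assuming NumPy. The toy labels (990 negatives, 10 positives) are made up for illustration; the idea is just to sample the majority class down to the minority-class size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced labels: 990 negatives, 10 positives (~1% minority class).
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Downsample the majority class to the size of the minority class.
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([keep, minority_idx])

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))  # balanced class counts: [10 10]
```

Evaluation should still happen on the original (imbalanced) distribution, so that accuracy reflects real-world conditions.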

  • Multicollinearity/correlation between features: occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to distinguish their individual effects

    • Redundant features or variables
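One simple way to spot redundant features is to scan the feature correlation matrix for highly correlated pairs. A minimal sketch assuming NumPy; the synthetic features and the 0.9 threshold are illustrative choices, not a fixed standard.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 * 2.0 + rng.normal(scale=0.01, size=n)  # nearly a linear copy of x1
x3 = rng.normal(size=n)                         # independent feature
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)  # 3x3 feature correlation matrix
# Flag feature pairs whose |r| exceeds the threshold.
i, j = np.triu_indices_from(corr, k=1)
for a, b in zip(i, j):
    if abs(corr[a, b]) > 0.9:
        print(f"features {a} and {b} are highly correlated: r={corr[a, b]:.3f}")
```

Dropping one feature from each flagged pair (or combining them, e.g. via PCA) removes the redundancy.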

  • Data leakage: when information from outside the training dataset is used to create the model, leading to overly optimistic performance during training

    • Causes: mistakenly using future data, or including variables that would not be available at prediction time in the real world

    • Solutions: ensure the training set only includes data available at the time of prediction. Conduct a thorough review of features
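For time-stamped data, the usual safeguard against look-ahead leakage is a chronological split rather than a random one. A minimal sketch; the toy records and the 80% cutoff are illustrative.

```python
# Split chronologically so the model never trains on records
# dated after its test records (avoids look-ahead leakage).
records = [
    {"date": "2023-01", "y": 1},
    {"date": "2023-02", "y": 0},
    {"date": "2023-03", "y": 1},
    {"date": "2023-04", "y": 0},
    {"date": "2023-05", "y": 1},
]

records.sort(key=lambda r: r["date"])  # ensure chronological order
cut = int(len(records) * 0.8)          # train on the earliest 80%
train, test = records[:cut], records[cut:]

# Every training record predates every test record.
assert max(r["date"] for r in train) <= min(r["date"] for r in test)
print(len(train), len(test))  # 4 1
```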

  • High dimensionality: having too many features can lead to overfitting, where models learn the noise in the data

    • Causes: datasets with a large number of variables relative to the number of observations

    • Solutions: PCA, feature selection

  • Time series data: autocorrelation or seasonality can hurt a model's accuracy

    • Causes: time-related patterns that need to be accounted for

    • Solutions: use time series models (ARIMA, SARIMA, LSTMs)

      • Add time-related features

    • Seasonality: a recurring seasonal event can be mistaken for an outlier, which can affect the model
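Adding time-related features usually means deriving calendar fields from a timestamp. A minimal sketch using the standard library; the `time_features` helper and its particular fields are illustrative choices.

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Derive simple calendar features a model can use to capture seasonality."""
    return {
        "month": ts.month,
        "day_of_week": ts.weekday(),  # 0 = Monday
        "is_weekend": ts.weekday() >= 5,
        "quarter": (ts.month - 1) // 3 + 1,
    }

print(time_features(datetime(2024, 12, 25)))
# {'month': 12, 'day_of_week': 2, 'is_weekend': False, 'quarter': 4}
```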

Basic Descriptive Statistics

  • Descriptive Stats:

    • Frequency distributions

    • Proportions

      • Chi-square test

    • Mean and variance

  • Graphical Rep:

    • Box plot

  • Covariance: When mean and variance alone can't describe how two variables relate, use covariance. A positive covariance means the two variables tend to move in the same direction. A negative one means they tend to move in opposite directions

    • Sample formula: $cov_{x,y} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{N-1}$

    • On a covariance matrix, the diagonal is always variance: $cov(x,x) = E(x^2) - (E(x))^2$

  • Correlation: measures linear association; derived from covariance divided by the standard deviations of x and y

  • The covariance of a variable with itself is its variance
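These covariance and correlation relationships can be checked numerically. A minimal sketch assuming NumPy; the two short arrays are arbitrary example data.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

C = np.cov(x, y)  # 2x2 sample covariance matrix (N-1 in the denominator)
# The diagonal holds variances: cov(x, x) == var(x).
print(np.isclose(C[0, 0], np.var(x, ddof=1)))  # True

# Correlation = cov(x, y) / (std(x) * std(y)).
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r, C[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))))  # True
```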

Feature Engineering

Transforming raw data into meaningful features

  • Curse of dimensionality: various challenges due to high dimensional spaces

  • Data sparsity: in high-dimensional spaces, data points become sparse; the data may not cover the space sufficiently, making it difficult to detect reliable patterns

  • Distance measures lose their meaning: in high dimensions, differences in distance become less meaningful, affecting algorithms that rely on distance (e.g., kNN)

PCA

  • Trying to find a new axis that maximizes the spread (variance) of the data

  1. Find the vector that maximizes the variance of your data projected onto that vector

    1. Compute the covariance matrix of the normalized data

    2. Compute eigenvalues and eigenvectors of the covariance matrix by using SVD

    3. The eigenvector with the largest eigenvalue captures the most variance

    4. Sort eigenvectors by eigenvalue and select the top k eigenvectors to reduce the data to dimension k

  2. Project data on the new feature space

We can use a scree plot to see how much variance is explained by each number of principal components
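The steps above can be sketched directly with NumPy. A minimal illustration on synthetic data; the column scalings (3, 1, 0.1) are made up so the components have clearly different variances. It uses `eigh` on the covariance matrix (SVD, covered next, is an equivalent route).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])  # synthetic data

# 1. Center the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# 2-4. Eigendecompose, sort by eigenvalue, keep the top-k eigenvectors.
vals, vecs = np.linalg.eigh(C)  # eigh: C is symmetric
order = np.argsort(vals)[::-1]
k = 2
W = vecs[:, order[:k]]          # top-k principal directions

# Project the data onto the new feature space.
Z = Xc @ W
print(Z.shape)  # (100, 2)

# Scree-plot data: fraction of variance explained by each component.
explained = vals[order] / vals.sum()
print(np.round(explained, 3))
```

Plotting `explained` against the component index gives the scree plot used to pick k.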

SVD

  • Decomposes any matrix X (it doesn't have to be square) into three matrices: $X = U \Sigma V^T$

  • Used to extract eigenvalues and eigenvectors for PCA
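A minimal sketch of both points, assuming NumPy: the decomposition works on a non-square matrix, and on centered data the singular values/right singular vectors give the covariance eigenpairs used by PCA. The small 3x2 matrix is arbitrary example data.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # non-square, as noted above

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (3, 2) (2,) (2, 2)
# The three factors reconstruct X exactly.
print(np.allclose(U @ np.diag(s) @ Vt, X))  # True

# Link to PCA: on centered data, the covariance eigenvalues are s**2 / (N - 1)
# and the rows of Vt are the corresponding eigenvectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
C = Vt.T @ np.diag(s**2 / (len(X) - 1)) @ Vt
print(np.allclose(C, np.cov(Xc, rowvar=False)))  # True
```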

Combining Datasets

  • Enriched analysis: can introduce additional features

  • Holistic understanding

  • Enhanced BI

Comparing Data

Similarity: Higher values indicate greater similarity.

Distance: Higher values indicate greater dissimilarity.
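A minimal sketch of the contrast, assuming NumPy, using cosine similarity and Euclidean distance (the vectors are arbitrary example data, with `b` a scaled copy of `a`):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Cosine similarity: higher = more similar; 1.0 means identical direction
# (b is just a scaled copy of a).
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 3))  # 1.0

# Euclidean distance: higher = more dissimilar, even though the
# direction is the same.
dist = np.linalg.norm(a - b)
print(round(dist, 3))  # 3.742
```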
