Bite Sized Learning: Data Preparation

 Importance of Data Preparation & Common Methods

 

Data preparation covers the various methods and steps taken to transform data so that it is suitable for ML algorithms. Here, quality takes precedence over quantity, so it is essential to transform the raw data into a more informative format for a better modelling outcome. Below are the most common data preparation tasks in the data science industry:

 

  1. Data Cleaning
  2. Feature Engineering
  3. Data Transformation
  4. Feature Extraction

 
Data Cleaning:

Data cleaning is the process of detecting and treating inaccurate data points in a dataset. This is the most time-consuming and critical part of ML; it simply goes "garbage in = garbage out". Below are some of the methods to clean data (a short pandas sketch follows the list):

  • Find the number and percentage of missing rows or null values (NaN) and treat them
  • Identify the duplicate rows within the dataset and treat them
  • Detect the outliers using statistics and treat them using domain knowledge
  • Identify the data types and delete columns with zero variance
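
All four checks take only a few lines with pandas. Below is a minimal sketch, assuming a hypothetical data.csv and deliberately simple treatments (median imputation, IQR capping); the right treatment always depends on domain knowledge.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset; replace "data.csv" with your own file
df = pd.read_csv("data.csv")

# 1. Number and percentage of missing values per column, then treat them
missing = df.isna().sum()
missing_pct = df.isna().mean() * 100
print(pd.DataFrame({"count": missing, "percent": missing_pct}))
df = df.fillna(df.median(numeric_only=True))  # one simple treatment: median imputation

# 2. Duplicate rows
print("Duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

# 3. Outliers via the IQR rule (capping shown; adapt to your domain)
num_cols = df.select_dtypes(include=np.number).columns
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
df[num_cols] = df[num_cols].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr, axis=1)

# 4. Data types and zero-variance columns
print(df.dtypes)
zero_var = [c for c in num_cols if df[c].nunique() <= 1]
df = df.drop(columns=zero_var)
```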

 Data Cleaning article (medium.com) - Link
 
Feature Engineering:

Feature engineering is the process of creating new input variables from the available data. Domain knowledge is key here. Some common feature engineering techniques are listed below (a short example follows the list):

  • Binning - create buckets of data, such as an age group instead of a raw age
  • Log Transformation - replace each value x with log(x); it reduces or removes the skewness of the original data
  • Feature split - extract the usable parts of a column into new features, such as pulling only the house number or location out of an address column
  • Combining sparse classes - combine classes that have very few data points. There is no hard and fast rule here; it depends on the size of the dataset. For example, given four classes with value counts C1: 100, C2: 100, C3: 10, C4: 2, we combine C3 and C4 into an "Other" class. As a rough rule of thumb, if a class has < 50 instances it is better to combine it
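
Each of these techniques is a one- or two-liner in pandas. A small illustrative sketch, using made-up customer data and an arbitrary rarity threshold:

```python
import pandas as pd
import numpy as np

# Hypothetical customer data for illustration
df = pd.DataFrame({
    "age": [22, 35, 47, 68, 81],
    "income": [20000, 45000, 1200000, 56000, 30000],
    "address": ["12 Baker St", "7 Elm Rd", "3 Oak Ave", "9 Pine Ln", "5 Hill Dr"],
    "segment": ["C1", "C1", "C2", "C3", "C4"],
})

# Binning: age groups instead of raw age
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 70, 120],
                         labels=["young", "middle", "senior", "elder"])

# Log transformation: reduce skew in income (log1p also handles zeros safely)
df["log_income"] = np.log1p(df["income"])

# Feature split: pull the house number out of the address
df["house_number"] = df["address"].str.split().str[0]

# Combining sparse classes: lump rare segments into "Other"
counts = df["segment"].value_counts()
rare = counts[counts < 2].index.tolist()  # threshold is a judgment call; ~50 on real data
df["segment"] = df["segment"].replace(rare, "Other")
```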

 Feature engineering article (elitedatascience.com & medium.com) - Link1 | Link2
 
Data Transformation:

Data transformation is the process of changing the format, structure, or values of data (a short example follows the list):

  • Dummy variables or one-hot encoding - ML models need data in numerical format, hence we need to transform categorical data into numerical data. Example: "Yes" as 1 and "No" as 0
  • Scaling - columns in a dataset may vary in their degrees of magnitude, range, and units. For example, one column may range from 0 to 100 and another from -100 to 1,000,000. Hence it is essential to scale the data and bring all numerical columns to the same level
    •  Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1 [MinMaxScaler in Python]
    •  Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation [StandardScaler in Python]
    •  If there are outliers, go for RobustScaler in Python
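
All three scalers share the same fit_transform interface in scikit-learn, and pandas' get_dummies handles the encoding. A minimal sketch on made-up data; the scalers are alternatives, so pick one per pipeline:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical data: two numeric columns on very different scales, one categorical
df = pd.DataFrame({
    "score": [10, 40, 70, 100],
    "balance": [-100, 5000, 250000, 1000000],
    "active": ["Yes", "No", "Yes", "Yes"],
})

# One-hot encoding / dummy variables for the categorical column ("Yes" -> 1, "No" -> 0)
df = pd.get_dummies(df, columns=["active"], drop_first=True)

num_cols = ["score", "balance"]

# Normalization: rescale values into the [0, 1] range
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Standardization: zero mean, unit standard deviation (alternative to the above)
# df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Robust scaling: uses median and IQR, so outliers have less influence
# df[num_cols] = RobustScaler().fit_transform(df[num_cols])
```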
Data transformation article (towardsdatascience.com) - Link
Scaling methods (stack overflow) - Link
 
Feature Extraction:

Feature extraction is the process of selecting a subset of the existing features or reducing the dimensionality of the dataset by applying dimensionality reduction algorithms. It can be accomplished by either feature selection or feature extraction (a short example of both follows the list):

  • Feature Selection - aims to rank the importance of the existing features in the dataset and discard the less important ones
  • Feature Extraction - creates a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data
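
Both routes take only a few lines in scikit-learn. A minimal sketch on the built-in iris dataset, using SelectKBest for selection and PCA for extraction (just one of many possible algorithm choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 4 features, used here purely for illustration

# Feature selection: rank features by ANOVA F-score and keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

# Feature extraction: project the data onto 2 principal components
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```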

Feature extraction article (towardsdatascience.com) - Link

