Bite Sized Learning: Data Preparation
Importance of Data Preparation & Common Methods
Data preparation covers the methods and steps used to transform raw data so that it is suitable for ML algorithms. Here quality takes precedence over quantity, so it is essential to transform the raw data into a more informative format before modelling. Below are the most commonly followed data preparation tasks in the data science industry:
- Data Cleaning
- Feature Engineering
- Data Transformation
- Feature Extraction
Data Cleaning:
Data cleaning is the process of detecting and treating inaccurate data points in a data set. It is often the most time-consuming and one of the most critical parts of an ML project. The rule is simple: "garbage in = garbage out". Below are some of the methods used to clean data:
- Find the number and percentage of missing or null values (NaN) and treat them
- Identify duplicate rows within the dataset and treat them
- Detect outliers using statistics and treat them using domain knowledge
- Check the data types, and identify columns with zero variance and delete them
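The cleaning steps above can be sketched in plain Python. This is only a minimal illustration: the toy rows, the column names, and the 1.5 * IQR outlier rule are assumptions made for the example, not something prescribed by this post. In practice a library such as pandas offers equivalents (for example `isna` and `drop_duplicates`).

```python
import statistics

# Hypothetical toy dataset: (age, income, country); None marks a missing value.
columns = ["age", "income", "country"]
rows = [
    (25, 40000, "IN"),
    (32, None, "IN"),
    (25, 40000, "IN"),   # exact duplicate of the first row
    (41, 52000, "IN"),
    (38, 48000, "IN"),
    (29, 900000, "IN"),  # suspiciously large income
    (45, 55000, "IN"),
    (36, 45000, "IN"),
    (27, 42000, "IN"),
    (50, 50000, "IN"),
]

# 1. Count and percentage of missing values per column.
missing = {c: sum(r[i] is None for r in rows) for i, c in enumerate(columns)}
missing_pct = {c: 100 * n / len(rows) for c, n in missing.items()}

# 2. Drop exact duplicate rows, keeping the first occurrence.
seen, deduped = set(), []
for r in rows:
    if r not in seen:
        seen.add(r)
        deduped.append(r)

# 3. Flag outliers in "income" with the 1.5 * IQR rule (an assumed convention).
incomes = [r[1] for r in deduped if r[1] is not None]
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in incomes if x < low or x > high]

# 4. Columns with zero variance: only one distinct non-missing value.
zero_variance = [
    c for i, c in enumerate(columns)
    if len({r[i] for r in deduped if r[i] is not None}) <= 1
]

print(missing_pct, len(deduped), outliers, zero_variance)
```

Whether a flagged value is really an error, and whether a missing value should be dropped or imputed, still depends on domain knowledge.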
Feature Engineering:
Feature engineering is a process of creating new input variables from the available data. Domain knowledge is key here. Some common feature engineering techniques are:
- Binning - create buckets of data, such as an age group instead of the exact age
- Log Transformation - replace each value x of a variable with log(x). This reduces or removes the skewness of the original data (it applies only to positive values)
- Feature split - extract the usable parts of a column into new features, such as extracting only the house number or the location from an address column
- Combining sparse classes - combine classes that have very few data points. There is no hard and fast rule here; it depends on the size of the dataset. For example, given 4 classes with value counts C1:100, C2:100, C3:10, C4:2, we combine C3 and C4 as "Other". As a rough rule of thumb, if a class has fewer than 50 instances it is better to combine it
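A small sketch of the four techniques above, using only the Python standard library. The age boundaries, the sample address, and the helper name `age_group` are hypothetical choices for illustration. The class counts and the threshold of 50 come from the example in the text.

```python
import math
from collections import Counter

# Binning: replace an exact age with an age group (boundaries are assumed).
def age_group(age):
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

ages = [5, 17, 18, 40, 70]
age_groups = [age_group(a) for a in ages]

# Log transformation: compresses a right-skewed range (values must be > 0).
incomes = [1_000, 10_000, 100_000, 1_000_000]
log_incomes = [math.log(x) for x in incomes]

# Feature split: pull the house number and city out of an address string.
address = "221B Baker Street, London"
street_part, city = [part.strip() for part in address.split(",", 1)]
house_number = street_part.split()[0]

# Combining sparse classes: classes with fewer than MIN_COUNT rows become "Other".
labels = ["C1"] * 100 + ["C2"] * 100 + ["C3"] * 10 + ["C4"] * 2
counts = Counter(labels)
MIN_COUNT = 50
combined = [label if counts[label] >= MIN_COUNT else "Other" for label in labels]

print(age_groups, house_number, city, Counter(combined))
```

Real address data is rarely this regular, so a split like this usually needs extra handling for missing commas or house numbers.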
Data Transformation:
Data transformation is the process of changing the format, structure, or values of data
- Dummy variables or one-hot encoding - most ML models need data in numerical format, so categorical data must be transformed into numerical data. Example: "Yes" as 1 and "No" as 0
- Scaling - the columns of a dataset may vary in magnitude, range, and units. For example, one column may range from 0 to 100 and another from -100 to 1000000. Hence it is essential to scale the data and bring all numerical columns to the same level
- Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1 [MinMaxScaler in Python's scikit-learn]
- Standardization is another scaling technique in which the values are centered around the mean with a unit standard deviation [StandardScaler in Python's scikit-learn]
- If there are outliers, use RobustScaler from scikit-learn
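To make the arithmetic behind these transformations concrete, here is a rough standard-library sketch. The function names and sample values are hypothetical. The formulas are intended to mirror what scikit-learn's MinMaxScaler, StandardScaler, and RobustScaler compute with their default settings, but this is not their implementation.

```python
import statistics

# One-hot encoding: one 0/1 column per category (toy data, assumed).
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# A numeric column with one large outlier.
values = [10.0, 20.0, 30.0, 40.0, 1000.0]

def min_max_scale(xs):
    """Normalization: rescale so the values range from 0 to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def robust_scale(xs):
    """Robust scaling: center on the median and divide by the IQR."""
    median = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

print(one_hot)
print(min_max_scale(values))
print(standardize(values))
print(robust_scale(values))
```

With min-max scaling the outlier squeezes the other four values into a narrow band near 0, while robust scaling keeps them spread between -1 and 0.5. That is the reason for preferring a robust scaler when outliers are present.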
Scaling methods (stack overflow) - Link
Feature Extraction:
Feature extraction, in the broad sense used here, is the process of selecting a subset of the existing features or reducing the dimensionality of the dataset by applying dimensionality reduction algorithms. It can be accomplished by either feature selection or feature extraction:
- Feature Selection - this aims to rank the importance of the existing features in the dataset and discard less important ones
- Feature Extraction - it creates a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data.
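The difference between the two approaches can be illustrated with a toy example. The sketch below makes several assumptions not stated in the text: the data is made up, feature selection is done by ranking features on their absolute Pearson correlation with the target (one simple filter method among many), and feature extraction is a two-feature PCA computed in closed form rather than with a library.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: three candidate features and a target (house price).
features = {
    "size": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "noise": [3, 1, 4, 1, 5],
}
target = [150, 180, 240, 310, 360]

# Feature selection: rank by |correlation with target| and keep the top two.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], target)),
    reverse=True,
)
selected = ranking[:2]

# Feature extraction: project two features onto their first principal
# component, reducing 2 dimensions to 1 (PCA for the 2-D case).
xs, ys = features["size"], features["rooms"]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
cx = [x - mx for x in xs]
cy = [y - my for y in ys]
n = len(xs)
var_x = sum(v * v for v in cx) / n
var_y = sum(v * v for v in cy) / n
cov_xy = sum(a * b for a, b in zip(cx, cy)) / n
# Angle of the principal axis of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
projected = [a * math.cos(theta) + b * math.sin(theta) for a, b in zip(cx, cy)]

print(selected)
print(projected)
```

Selection keeps the original, interpretable columns, whereas extraction produces a new combined column. Here the features are not scaled first, so "size" dominates the projection. In practice scaling is usually applied before PCA.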
Data transformation article (towardsdatascience.com) - Link