Data Science

Posts

Showing posts from August, 2020

Auto ML - PDF Table Extraction

August 14, 2020

Image: excalibur-py.readthedocs.io

Extracting tables from PDFs is not easy. Simple copy and paste from a PDF does not preserve table structure, so automatically detecting the structure and preserving the format is critical. Machine learning comes to the rescue here as well. Let us look at some of the Python libraries available for this task.

Excalibur: A web interface for extracting tabular data from PDFs. It is powered by Camelot and works only with text-based PDFs, not scanned documents. (Links: Installation Guide, Tutorial)

PDF Table Extraction: A parser that extracts tables from PDF documents using RetinaNet. (Links: Github/Installation Guide, Tutorial)

Camelot: A Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files. (Links: Installation Guide, Tutorial, Github)

Tabula: A free tool for extracting data from PDF files into CSV and Excel files. Tabula only works on text-based PDFs, not scanned docu...

Bite Sized Learning - Statistics P1

August 09, 2020

Essential Statistics for Data Science

The goal of inferential statistics is to use a sample to learn about the population. The steps involved are sampling, hypothesis testing, and inference. A sample is typically selected so that it is an unbiased representation of the entire population. Measuring the entire population is nearly always impractical, and the Central Limit Theorem helps solve this problem: for a sufficiently large sample, the distribution of the sample mean is approximately normal around the population mean, so conclusions about the population can be drawn from a sample.

Some business use cases are estimating the effectiveness of a medicine or vaccine and estimating the expected number of consumers for a new product/service.
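To make the Central Limit Theorem idea concrete, here is a minimal simulation sketch. It is not from the original post: the exponential population, the sample size of 50, and the helper name `sample_means` are all assumptions chosen for illustration. It draws many samples from a skewed population and compares the spread of the sample means with the value the theorem predicts (population standard deviation divided by the square root of the sample size).

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples (with replacement) and return their means."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population (exponential, not normal).
population = [rng.expovariate(1.0) for _ in range(20_000)]

sample_size = 50
means = sample_means(population, sample_size, n_samples=2_000, rng=rng)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
predicted_se = pop_sd / sample_size ** 0.5  # standard error predicted by the CLT
observed_se = statistics.stdev(means)       # spread actually seen in the sample means

print(f"population mean      : {pop_mean:.3f}")
print(f"mean of sample means : {statistics.fmean(means):.3f}")
print(f"predicted std. error : {predicted_se:.3f}")
print(f"observed std. error  : {observed_se:.3f}")
```

Even though the population itself is far from normal, the sample means should cluster around the population mean with roughly the predicted spread, which is what lets a single sample support statements about the whole population.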

Bite Sized Learning: Data Preparation

August 07, 2020

Importance of Data Preparation & Common Methods

Data preparation covers the methods and steps used to transform data so that it is suitable for ML algorithms. Here quality takes precedence over quantity, so it is essential to transform the raw data into a more informative format for better modelling. The most commonly followed data preparation tasks in the data science industry are data cleaning, feature engineering, data transformation, and feature extraction.

Data Cleaning: Data cleaning is the process of detecting and treating inaccurate data points in a data set. It is the most time-consuming and critical part of ML; the rule is simply "garbage in = garbage out". Below are some of the methods to clean data:
Find the number and percentage of missing rows or null values (NaN) and treat them
Identify duplicate rows within the dataset and treat them
Detect outliers using statistics and treat them using doma...
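The data cleaning checks named in this excerpt (missing values, duplicate rows, statistical outliers) can be sketched roughly as follows. This is an illustrative sketch, not the post's own code: the toy records, the helper names, and the choice of the 1.5 x IQR rule for outliers are assumptions, and it uses only the Python standard library so it runs without extra installs. In practice a library such as pandas would usually be used for the same steps.

```python
import math
import statistics

# Hypothetical toy records; None and NaN both stand for a missing value.
rows = [
    {"id": 1, "age": 34.0},
    {"id": 2, "age": None},
    {"id": 3, "age": 29.0},
    {"id": 3, "age": 29.0},   # exact duplicate of the previous row
    {"id": 4, "age": 31.0},
    {"id": 5, "age": 36.0},
    {"id": 6, "age": 33.0},
    {"id": 7, "age": 240.0},  # implausible value
]


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_report(rows, column):
    """Return (count, percentage) of rows whose `column` value is missing."""
    missing = sum(is_missing(r[column]) for r in rows)
    return missing, 100.0 * missing / len(rows)


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


missing_count, missing_pct = missing_report(rows, "age")
unique_rows = drop_duplicates(rows)
ages = [r["age"] for r in unique_rows if not is_missing(r["age"])]
outliers = iqr_outliers(ages)

print(f"missing 'age' values: {missing_count} ({missing_pct:.1f}%)")
print(f"rows after removing duplicates: {len(unique_rows)}")
print(f"outlier candidates: {outliers}")
```

The sketch only detects problems; how to treat them (impute, drop, cap) depends on the data set and is left to the analyst.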

Machine Learning Part VII - KNN Classifier

August 02, 2020

Video Link, Sample Code