This blog post demonstrates how I perform data cleaning and preparation using Python libraries.
Jupyter Notebook and datasets
The process of dealing with missing, incorrect, incomplete, insignificant, improperly formatted, or duplicated data in a dataset is called data cleaning or scrubbing. Machine learning (ML) algorithms generally produce better results when the dataset they use is well-formatted and error-free. So, before choosing or applying an ML algorithm, a data scientist or data analyst performs the following two steps:
1. Data Exploration - Understand the source data in detail and logically associate the attributes
2. Data Scrubbing - Correct errors found in the source data and format the data for the best performance of the selected ML algorithm
This project is divided into two parts:
1. In Part 1, the students use the Assignment#1_Part1_Motor_Insurance_Fraud.csv dataset
2. In Part 2, the students use the Assignment#1_Part2_Online_Activity.csv dataset
Libraries used: pandas, NumPy, seaborn, matplotlib
Part 2 - Data Scrubbing
Data Exploration
After loading the data into a pandas DataFrame, we can see that the dataset "Assignment#1_Part2_Online_Activity.csv" contains 11 rows and 15 columns.
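Here is a minimal sketch of the loading step, not the exact notebook code. To keep it runnable without the original file, it reads a tiny made-up CSV from a string; the sample rows and the "Y"/"N" values outside "Online_Gaming" are assumptions for illustration only. In the real notebook, the file name would be passed to `pd.read_csv` directly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. In the notebook this would be:
#   df = pd.read_csv("Assignment#1_Part2_Online_Activity.csv")
csv_text = """Read_News,Online_Shopping,Online_Gaming,Other_Social_Network
Y,Y,N,
N,,Y,Reddit
,Y,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# df.shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows and {n_cols} columns")
```

On the real dataset, the same `df.shape` call is what reports the 11 rows and 15 columns mentioned above.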
Column types and column names:
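One way to list the column names and types is sketched below. The small DataFrame, including the "Age" column, is a hypothetical example and not the real dataset.

```python
import pandas as pd

# Hypothetical sample data; "Age" is an invented numeric column
df = pd.DataFrame({
    "Age": [21, 34, 29],
    "Online_Gaming": ["Y", None, "N"],
})

column_names = list(df.columns)  # column names in order
print(column_names)
print(df.dtypes)                 # one dtype per column

# info() prints names, non-null counts, and dtypes in a single summary
df.info()
```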
Data Missing Values
I also wanted to identify the missing values in each column.
A few missing values occur in "Read_News" (1 missing value), "Online_Shopping" (2 missing values), "Online_Gaming" (3 missing values), and "Other_Social_Network" (7 missing values).
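Counts like these can be obtained by combining `isnull()` with `sum()`. The sketch below uses a small made-up DataFrame, so its counts differ from the real ones.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few gaps
df = pd.DataFrame({
    "Read_News": ["Y", np.nan, "N", "Y"],
    "Online_Gaming": ["Y", np.nan, np.nan, "N"],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = df.isnull().sum()

# Show only the columns that actually have missing values
print(missing_per_column[missing_per_column > 0])
```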
The "Online_Gaming" column takes only two values: "Y" and "N". Let's assume that people who didn't answer this question aren't into online gaming. Therefore, I will replace these missing values with "N".
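A minimal sketch of this replacement with `fillna`, again on made-up sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second and third answers are missing
df = pd.DataFrame({"Online_Gaming": ["Y", np.nan, np.nan, "N"]})

# Treat an unanswered question as "N" (the assumption stated above)
df["Online_Gaming"] = df["Online_Gaming"].fillna("N")

print(df["Online_Gaming"].tolist())
```

Keep in mind that filling with "N" is a modeling assumption. If non-response could mean something else, a separate category such as "Unknown" might be a safer choice.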
There is only 1 missing value in "Read_News" and 2 missing values in "Online_Shopping", so I will drop the rows that contain missing values in these columns.
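Dropping rows based on specific columns can be sketched with `dropna` and its `subset` parameter. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Read_News": ["Y", None, "N", "Y"],
    "Online_Shopping": ["Y", "N", None, "Y"],
    "Online_Gaming": ["N", "Y", "N", None],
})

# Drop a row only if it is missing a value in one of the listed columns;
# gaps in other columns (here "Online_Gaming") do not trigger a drop
cleaned = df.dropna(subset=["Read_News", "Online_Shopping"])

print(cleaned)
```

Using `subset` matters here: a plain `df.dropna()` would remove every row with a gap in any column, which can discard far more data than intended.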
Last, I want to reduce redundancy in the dataset by dropping columns that are not useful for the analysis. For example, "Other_Social_Network" doesn't contain much useful information.
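Removing a column can be sketched with `drop(columns=...)`. The sample values are made up, and the column name is written as "Other_Social_Network".

```python
import pandas as pd

# Hypothetical sample data; the last column is mostly empty
df = pd.DataFrame({
    "Read_News": ["Y", "N", "Y"],
    "Other_Social_Network": [None, "Reddit", None],
})

# drop() returns a new DataFrame and leaves the original untouched
reduced = df.drop(columns=["Other_Social_Network"])

print(list(reduced.columns))
```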
This post is one example of how I prepare a dataset.
The datasets in Part 1 and Part 2 were provided by Dr. Samir Chatterjee, Professor & Fletcher Jones Chair of Technology & Management.
Thank you for reading and enjoy analyzing!
SEE PART 1: