
Data Scraping and Preparation Part 2

Updated: Feb 13, 2022


This blog post demonstrates how I perform data cleaning and preparation using Python libraries.

Jupyter Notebook and datasets

 

The process of dealing with missing, incorrect, incomplete, insignificant, improperly formatted, or duplicated data in a dataset is called data cleaning or scrubbing. Machine learning (ML) algorithms produce better results when the dataset they use is well-formatted and error-free. So, before selecting and applying an ML algorithm, data scientists and data analysts perform the following two steps:

1. Data Exploration - Understand the source data in detail and logically associate the attributes

2. Data Scrubbing – Correct errors found in the source data and format the data for the best performance of the selected ML algorithm


This project is divided into two parts:

1. In Part 1, the students use the Assignment#1_Part1_Motor_Insurance_Fraud.csv dataset

2. In Part 2, the students use the Assignment#1_Part2_Online_Activity.csv dataset


Libraries used: pandas, NumPy, seaborn, matplotlib


Part 2 - Data Scrubbing


Data Exploration

After loading the data and storing it in a data frame, we can see that the dataset "Assignment#1_Part2_Online_Activity.csv" contains 11 rows and 15 columns.

Column names and types:
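The exploration step above can be sketched as follows. Since the CSV file isn't available here, a small made-up sample frame (with invented values for four of the real columns) stands in for it; in the notebook, the commented-out `read_csv` line does the actual loading.

```python
import pandas as pd
import numpy as np

# In the notebook the data is loaded from the CSV file:
# df = pd.read_csv("Assignment#1_Part2_Online_Activity.csv")
# Here a small made-up sample stands in for it.
df = pd.DataFrame({
    "Read_News": ["Y", "N", np.nan, "Y"],
    "Online_Shopping": ["Y", np.nan, "N", "Y"],
    "Online_Gaming": ["Y", "N", np.nan, np.nan],
    "Other_Social_Network": [np.nan, np.nan, "Y", np.nan],
})

print(df.shape)   # number of (rows, columns)
print(df.dtypes)  # column names and their types
```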



Missing Values

I also wanted to identify the missing values in each column.


A few missing values occur in "Read_News" (1 missing value), "Online_Shopping" (2 missing values), "Online_Gaming" (3 missing values), and "Other_Social_Network" (7 missing values).
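Per-column missing-value counts can be obtained with `isnull().sum()`. The sample frame below is invented for illustration; the real counts come from the full dataset.

```python
import pandas as pd
import numpy as np

# Small made-up sample with the same kinds of gaps as the real dataset
df = pd.DataFrame({
    "Read_News": ["Y", "N", np.nan],
    "Online_Shopping": ["Y", np.nan, np.nan],
    "Online_Gaming": ["Y", np.nan, np.nan],
})

# Count missing values in each column
print(df.isnull().sum())
```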


In "Online_Gaming", only two input values occur: "Y" and "N". Let's assume that people who didn't answer this question aren't into online gaming. Therefore, I will replace these missing values with "N".
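That replacement is a one-liner with `fillna`. Again, the sample values are made up:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Online_Gaming": ["Y", "N", np.nan, np.nan]})

# Treat non-responses as "not a gamer"
df["Online_Gaming"] = df["Online_Gaming"].fillna("N")
print(df["Online_Gaming"].tolist())  # ['Y', 'N', 'N', 'N']
```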


There is only 1 missing value in "Read_News" and 2 missing values in "Online_Shopping", so instead of imputing them, I will drop the rows that contain missing values in these columns.
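Dropping those rows can be done with `dropna` restricted to the two columns via `subset` (sample values invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Read_News": ["Y", np.nan, "N", "Y"],
    "Online_Shopping": ["Y", "N", np.nan, np.nan],
})

# Drop the few rows missing either of these values
df = df.dropna(subset=["Read_News", "Online_Shopping"])
print(len(df))  # 1 row survives in this sample
```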


Last, I want to reduce redundancy in the dataset by dropping columns that are useless for the analysis. For example, "Other_Social_Network" doesn't contain much useful information.
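Removing a mostly-empty column is a `drop` call (sample frame invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Read_News": ["Y", "N"],
    "Other_Social_Network": [np.nan, np.nan],
})

# Remove the mostly-empty, uninformative column
df = df.drop(columns=["Other_Social_Network"])
print(df.columns.tolist())  # ['Read_News']
```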


This post is one example of how I prepare a dataset.

The datasets in Part 1 and Part 2 were provided by Dr. Samir Chatterjee, Professor & Fletcher Jones Chair of Technology & Management.


Thank you for reading and enjoy analyzing!


SEE PART 1:










