
Data Scraping and Preparation Part 1

Updated: Feb 13, 2022


This blog demonstrates how I perform data cleaning and preparation using Python libraries.

Jupyter Notebook and datasets

 

The process of dealing with missing, incorrect, incomplete, insignificant, improperly formatted, or duplicated data in a dataset is called data cleaning or scrubbing. Machine learning (ML) algorithms produce better results when the dataset they use is well-formatted and error-free. So, before selecting and applying an ML algorithm, data scientists and data analysts perform the following two steps:

1. Data Exploration - Understand the source data in detail and logically associate the attributes

2. Data Scrubbing – Correct errors found in the source data and format the data for the best performance of the selected ML algorithm


This project is divided into two parts:

1. In Part 1, we use the Assignment#1_Part1_Motor_Insurance_Fraud.csv dataset

2. In Part 2, we use the Assignment#1_Part1_Online_Activity.csv dataset


Libraries used: pandas, NumPy, seaborn, matplotlib


Part 1 – Data Exploration


Data Exploration

After loading the data into a data frame, we can see that the dataset "Assignment#1_Part1_Motor_Insurance_Fraud.csv" contains 500 rows and 14 columns.
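A minimal sketch of this loading-and-inspecting step. Since the real CSV is not reproduced here, the example below builds a tiny stand-in file in memory; the column names are assumed from the blog and may differ slightly in the actual dataset:

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (in the notebook this would be
# pd.read_csv("Assignment#1_Part1_Motor_Insurance_Fraud.csv")).
csv_text = """Id,Injury Type,Marital Status,Num Claims,Claim Amount Received
1,Broken Limb,Married,2,1500.0
2,Soft Tissue,,1,800.0
3,Back,Single,3,2200.0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (3, 5) for this sample; the real dataset is (500, 14)
print(df.dtypes)  # column names and inferred types
```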

Column names and types:


One example of what the column values look like is "Injury Type."

The column "Injury Type" shows 177 cases of Broken Limb, 172 cases of Soft Tissue, 116 cases of Back, and 35 cases of Serious.
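Counts like these come straight from pandas' `value_counts`. A sketch on a toy column (the real column has 500 values with the counts quoted above):

```python
import pandas as pd

# Toy stand-in for the "Injury Type" column.
df = pd.DataFrame(
    {"Injury Type": ["Broken Limb"] * 3 + ["Soft Tissue"] * 2 + ["Back"]}
)

# Frequency of each category, most common first.
counts = df["Injury Type"].value_counts()
print(counts)
```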


I also created a new column called "Revenue," which is the sum of "Num Claims" and "Claim Amount Received."



The result shows that the dataset now has the new column "Revenue," calculated from "Num Claims" and "Claim Amount Received."
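Creating a derived column in pandas is a one-liner. A sketch with toy values (column names assumed from the blog):

```python
import pandas as pd

# Toy values; the real frame has 500 rows.
df = pd.DataFrame({
    "Num Claims": [2, 1, 3],
    "Claim Amount Received": [1500.0, 800.0, 2200.0],
})

# New column as the element-wise sum of the two existing columns.
df["Revenue"] = df["Num Claims"] + df["Claim Amount Received"]
print(df["Revenue"].tolist())  # [1502.0, 801.0, 2203.0]
```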



Data Missing Values

I also wanted to identify missing values in the columns.

Missing values occur in two columns: "Marital Status" (330 missing values) and "Num Soft Tissue" (10 missing values).

For "Marital Status," I will replace the missing values with "Other," a new category alongside the existing "Married," "Divorce," and "Single."


For the "Num Soft Tissue," I will remove the rows containing missing values since only 10 rows are impacted.
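Both treatments above map onto standard pandas calls: `isnull().sum()` to count the gaps, `fillna` for the new category, and `dropna` for the affected rows. A sketch on a toy frame (column names assumed from the blog):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the two columns named in the post.
df = pd.DataFrame({
    "Marital Status": ["Married", np.nan, "Single", np.nan],
    "Num Soft Tissue": [1.0, 2.0, np.nan, 0.0],
})
print(df.isnull().sum())  # missing-value count per column

# Replace missing Marital Status with a new "Other" category.
df["Marital Status"] = df["Marital Status"].fillna("Other")

# Drop the rows where Num Soft Tissue is missing.
df = df.dropna(subset=["Num Soft Tissue"])
print(df)
```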

Odd data values

When examining the column "Insurance Type," I found that it has only one data value, "CI".

This column would matter if "CI" were the key linking this dataset to another dataset containing more than one insurance type. However, this assignment uses only one dataset with a single insurance type, so the column carries no information and will be removed.
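Detecting and dropping a constant column can be sketched like this (toy frame, column names assumed from the blog):

```python
import pandas as pd

df = pd.DataFrame({
    "Insurance Type": ["CI", "CI", "CI"],
    "Claim Amount Received": [1500.0, 800.0, 2200.0],
})

# A column with a single unique value carries no information.
print(df["Insurance Type"].nunique())  # 1

df = df.drop(columns=["Insurance Type"])
print(df.columns.tolist())  # ['Claim Amount Received']
```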


Correlation Matrix

I want to see how the attributes vary or relate to each other. In this step, I will calculate and visualize correlations using the correlation matrix. I will use sklearn in this section, including ColumnTransformer, OneHotEncoder, LabelEncoder, and preprocessing.


First, I will encode categorical attributes such as "Marital Status", "Injury Type", and "Overnight Hospital Stay".
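A minimal sketch of the encoding step using sklearn's `LabelEncoder`, which maps each category to an integer (categories are assigned codes in sorted order). The toy values are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Marital Status": ["Married", "Single", "Other", "Married"]})

# Fit the encoder and replace the text categories with integer codes.
le = LabelEncoder()
df["Marital Status"] = le.fit_transform(df["Marital Status"])

print(df["Marital Status"].tolist())  # [0, 2, 1, 0]
print(list(le.classes_))              # ['Married', 'Other', 'Single']
```

The same pattern is repeated for each categorical column before computing the correlation matrix, since correlations are only defined over numeric values.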


After that, I can finally run the correlation matrix and plot the result into the heatmap using seaborn.

Red marks represent negative relationships, while blue marks represent positive relationships. The stronger the color, the higher the correlation magnitude.
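The correlation-plus-heatmap step can be sketched as follows. The toy frame and the `coolwarm_r` colormap (chosen so blue reads positive and red reads negative, matching the description above) are assumptions, not the blog's exact settings:

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Toy numeric frame standing in for the fully encoded dataset.
df = pd.DataFrame({
    "Num Claims": [1, 2, 3, 4, 5],
    "Claim Amount Received": [800.0, 1500.0, 2100.0, 2900.0, 3600.0],
    "Id": [10, 7, 3, 8, 1],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))

# Heatmap with blue = positive, red = negative, stronger color = stronger correlation.
sns.heatmap(corr, cmap="coolwarm_r", vmin=-1, vmax=1, annot=True)
plt.savefig("corr_heatmap.png")
```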

From the result, the columns "Id", "Num Claimants", "Overnight Hospital Stay", and "Injury Type" show no meaningful correlation with the other variables.

For indexing and identification purposes, I recommend keeping the "Id" column.

I also recommend keeping "Num Claimants", which identifies how many dependents are on the insurance policy, even though it has no correlation with the other variables. Every other column that has no correlation with the rest will be dropped.
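The pruning decision above can be sketched as a simple `drop` call; the column list is a hypothetical stand-in for the full 14-column frame:

```python
import pandas as pd

# Hypothetical subset of the dataset's columns after the earlier steps.
df = pd.DataFrame(columns=[
    "Id", "Num Claimants", "Overnight Hospital Stay", "Injury Type", "Revenue",
])

# Keep "Id" and "Num Claimants" despite their lack of correlation;
# drop the remaining uncorrelated columns.
df = df.drop(columns=["Overnight Hospital Stay", "Injury Type"])
print(df.columns.tolist())  # ['Id', 'Num Claimants', 'Revenue']
```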


Thank you for following,


Part 2: SEE NEXT BLOG!







