You are given some data by a collaborator and asked to build a two-class classifier with n = 1000 observations and p = 500 features, to predict the risk of a customer defaulting on a loan. Unfortunately, about 25% of the features are missing at random (and not the same 25% each time). The result is that nearly every observation has some missing features. How would you deal with this? Suppose you later learn that some of the features, like monthly income, are not missing at random, but are more likely to be missing because the mortgage company has lost track of the customer. How would you deal with this issue?
Solution
In this case a large share of the data (about 25%) is missing, so the first step is to try to complete the data if possible.
We cannot simply delete every row that contains a missing value, because nearly every observation has one and we would lose almost all of the data.
We could also replace each missing value with the mean or median of its feature, but this is not recommended; it is not a clever way to handle missing data in an analysis.
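To see why these two shortcuts are weak here, a small simulation can help. This is only a sketch on made-up data: the matrix `X` is hypothetical, and it assumes that each entry is missing independently with probability 0.25, which is one reading of "25% missing at random". It counts how many rows survive row deletion and compares the spread of one feature before and after median imputation.

```r
set.seed(100)
n <- 1000
p <- 500

# Hypothetical feature matrix; each cell is set to NA with probability 0.25
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X[matrix(runif(n * p) < 0.25, nrow = n)] <- NA

# Row deletion: count the rows that have no missing values at all
n_complete <- sum(complete.cases(X))
n_complete

# Median imputation on the first feature
x_obs <- X[, 1]
x_med <- x_obs
x_med[is.na(x_med)] <- median(x_obs, na.rm = TRUE)

# Compare the spread before and after filling in the median
sd_obs <- sd(x_obs, na.rm = TRUE)
sd_med <- sd(x_med)
c(observed = sd_obs, median_imputed = sd_med)
```

Under these assumptions, row deletion leaves essentially nothing to train on, and the median-imputed feature has a smaller standard deviation than the observed values, because a quarter of its entries now sit at a single point.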
There is an R package called mice (Multivariate Imputation by Chained Equations) for this.
Install the mice package and try running the following code:
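install.packages("mice")  # one-time install
library(mice)
# DataName is your data frame; m = 5 creates five imputed datasets
x1 <- mice(DataName, m = 5, seed = 100)
# complete() returns the first imputed dataset by default
x2 <- complete(x1)
View(x2)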
The missing values are now filled in: x2 holds the first of the five imputed datasets.
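For the second part of the question, where monthly income is more likely to be missing because the company lost track of the customer, the fact that a value is missing may itself carry information about default risk. One common option, shown here only as a sketch, is to record the missingness as an extra 0/1 feature before imputing. The data frame `loans` and its column names are hypothetical, made up for illustration.

```r
set.seed(100)

# Hypothetical toy data: monthly income plus a default label
loans <- data.frame(
  monthly_income = rnorm(10, mean = 4000, sd = 1000),
  default        = rbinom(10, size = 1, prob = 0.3)
)
loans$monthly_income[c(2, 5, 9)] <- NA  # pretend these customers were lost track of

# Add an indicator column: 1 if income was missing, 0 otherwise
loans$income_missing <- as.integer(is.na(loans$monthly_income))

# The indicator can then be used as a predictor alongside the imputed income
table(loans$income_missing)
```

The indicator should be created before imputation, since the information about which values were originally missing is lost once they are filled in.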
If you have any doubts about this, please comment.
