Creating A Special Category
A categorical feature, such as gender, has a fixed set of possible values. Because the classes are predetermined, we can designate an additional class for the missing values. Here, the missing values for Cabin and Embarked can be replaced with a new category, such as U for "unknown." This strategy adds information to the dataset and changes its variance. Since these features are categorical, we must one-hot encode them so that the algorithm can work with them in numeric form.
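A minimal sketch of this approach, using a small hypothetical frame with the Cabin and Embarked columns mentioned above (the values are illustrative, not real Titanic rows):

```python
import pandas as pd

# Illustrative frame with the two categorical columns discussed above
df = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None],
    "Embarked": ["S", "C", None, "Q"],
})

# Replace missing categorical values with a new "unknown" class, U
df[["Cabin", "Embarked"]] = df[["Cabin", "Embarked"]].fillna("U")

# One-hot encode so the algorithm can consume the categories numerically
encoded = pd.get_dummies(df, columns=["Cabin", "Embarked"])
print(sorted(encoded.columns))
```

After encoding, U appears as its own indicator column (e.g. Embarked_U), so the "missingness" itself becomes a signal the model can use.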
Pros:
- Because the feature is categorical, one extra category keeps the number of possibilities small, so variance stays low after one-hot encoding.
- Prevents data loss: rows with missing values are kept rather than dropped.
Cons:
- Reduces the variance of the feature.
- Adds a new feature to the model during encoding, which may hurt performance.
Speculating About The Missing Values
Using a machine learning algorithm, we can predict the nulls from the features that have no missing values. If the missing value is not expected to have very high variance, this approach can produce more accurate results. Here we use linear regression on the other available features to replace the nulls in the Age feature. Rather than committing to a single algorithm, you can experiment with several and keep the one that gives the highest accuracy.
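The regression-based imputation described above can be sketched as follows; the frame and its values are hypothetical, and Fare and Pclass stand in for whichever complete features are available:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame: Age has nulls; Fare and Pclass are complete (hypothetical values)
df = pd.DataFrame({
    "Age":    [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "Pclass": [3, 1, 3, 1, 3, 1],
})

features = ["Fare", "Pclass"]
known = df[df["Age"].notna()]
unknown = df[df["Age"].isna()]

# Fit on rows where Age is present, then predict it for rows where it is null
model = LinearRegression().fit(known[features], known["Age"])
df.loc[df["Age"].isna(), "Age"] = model.predict(unknown[features])
print(df["Age"].isna().sum())  # 0 remaining nulls
```

Swapping LinearRegression for another estimator (e.g. a tree-based model) requires changing only one line, which makes it easy to compare accuracies.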
Pros:
- As long as the bias introduced by imputation is smaller than the omitted-variable bias, imputing the missing variable is an improvement.
- Enables the model parameters to be accurately estimated.
Cons:
- Bias can also arise when an incomplete conditioning set is used for a categorical variable.
- The imputed values are only a stand-in for the real values.
Utilizing Methods That Support Missing Values
KNN is a machine learning algorithm built on the concept of a distance measure, and it can be applied when the dataset contains nulls. KNN fills in a missing value using the majority (or average) of the K nearest values while running the algorithm. In this particular dataset, we assume that people with the same values for features such as age, sex, and class will have the same kind of fare.
Unfortunately, the K-Nearest Neighbors estimators in Python's scikit-learn library do not accept missing values in the input data.
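scikit-learn does, however, ship a dedicated KNNImputer that fills each null from the K nearest rows. A minimal sketch, on a hypothetical numeric matrix whose columns could be Pclass, Age, and Fare:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric matrix with nulls (columns: Pclass, Age, Fare)
X = np.array([
    [3, 22.0, 7.25],
    [1, 38.0, np.nan],
    [3, np.nan, 7.92],
    [1, 35.0, 53.10],
    [3, 28.0, 8.05],
])

# Each null is replaced by the mean of that column across the
# k nearest rows, with distance computed on the non-missing features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0
```

Note that the features should be on comparable scales before imputing, since KNN's distance measure is sensitive to feature magnitude.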
Random forest is another algorithm that may be applied here. Its robust output comes from its ability to handle categorical and non-linear data. On large datasets it produces better results by adapting to the data structure while balancing bias against variance.
Pros:
- Does not require building a predictive model for every attribute with missing data.
- The correlation structure of the data is disregarded.
Cons:
- It is time-consuming, which can be critical when mining large databases.
- Distance functions such as Euclidean and Manhattan are available, but the results are sensitive to this choice and are not always reliable.