Issues of Dealing with Missing Values
Many data analysis programs cannot distinguish between three kinds of values:
· Missing Values
· Blanks
· Zero
This weakness in data analysis programs extends to many data analysts, who also fail to distinguish between these values. As a result, the values are not handled separately, and the data are not analyzed with these differences in mind.
Some may think that these differences are not very important and leave them to the data analysis program to handle, but in most cases this gives catastrophic results that many people do not notice.
I will attempt to illustrate these differences through some examples:
1. Suppose we want to analyze the average household income in a country suffering from a crisis, and more than 40% of the surveyed families report that they have no income of any kind. If data analysts treat these cases as missing values, the results will be utterly different from the real situation of society. The socio-economic indicators might show, for example, that only 10% of households (HHs) are below the extreme poverty line, when the true figure is more than 50%. A household without any income must be recorded as having an income of zero rather than a missing value, because missing values are excluded from the calculations while zeros are included, and this changes both the percentages and the overall average income. The opposite applies when asking about the monthly salary: the salary of a person who does not have a job should be treated as a missing value rather than a zero, because the question does not apply to an unemployed person and their case should not enter the salary calculations as a zero.
2. Many programs do not treat blanks in text questions as missing values. SPSS, for example, by default treats an empty cell in a text question as a valid value rather than a missing one. If Gender is stored as a text question, the program counts the empty cells as a category of their own, which significantly distorts results such as percentages. Respondents who did not indicate their gender (male or female) should be treated as missing values.
3. In SPSS, when computing a new data column from other columns, some formulas handle missing values effectively and some do not. Take the total number of family members, computed from the number of members in each category. The SUM formula returns a total even when one of the categories has a missing value, whereas adding the columns manually (with +) returns a missing value for every case in which any category is missing.
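The first and third examples can be reproduced in a few lines. The sketch below is not SPSS; it uses plain Python with None standing in for a missing value. The income figures, the poverty line of 100, the household counts, and the helper names are all invented for illustration.

```python
from statistics import mean

# Hypothetical monthly incomes for 10 households; None marks "no income".
incomes = [None, None, None, None, 50, 300, 400, 500, 600, 700]
POVERTY_LINE = 100  # assumed threshold, for illustration only


def summarize(values):
    """Return (mean, share below the poverty line), skipping missing values."""
    valid = [v for v in values if v is not None]
    share_poor = sum(v < POVERTY_LINE for v in valid) / len(valid)
    return mean(valid), share_poor


# Treatment A: "no income" left as missing, so those households are dropped.
mean_missing, poor_missing = summarize(incomes)

# Treatment B: "no income" recoded to zero, so those households are counted.
as_zero = [0 if v is None else v for v in incomes]
mean_zero, poor_zero = summarize(as_zero)

print(mean_missing, round(poor_missing, 2))  # 425 0.17
print(mean_zero, round(poor_zero, 2))        # 255 0.5


def sum_skip_missing(*parts):
    """Behaves like a SUM-style formula: adds the valid values only."""
    return sum(p for p in parts if p is not None)


def sum_propagate_missing(*parts):
    """Behaves like manual addition: any missing part makes the total missing."""
    if any(p is None for p in parts):
        return None
    return sum(parts)


# Hypothetical household: 2 adults, 3 children, elderly count left empty.
members = (2, 3, None)
print(sum_skip_missing(*members))       # 5
print(sum_propagate_missing(*members))  # None
```

With the same ten households, the share below the poverty line moves from about 17% to 50% purely because of how "no income" was coded, and the family total is either 5 or missing depending on which kind of sum is used.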
The cases in which defining missing values causes problems are countless. I do not advise leaving either the data analysis program or the data analyst alone to guess how to handle these values; the appropriate definition and treatment of each empty value must be determined explicitly. As explained above, in the income case the empty value must be treated as zero, while in the salary case it must be treated as a missing value. In the third example, the empty cells of any category of family members must be treated as zero, and data collectors must be told from the beginning that if a family has no members in a certain category, they should enter a zero rather than leave the cell empty.
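One way to avoid guessing is to write the agreed treatment down as an explicit rule per column and apply it before any analysis. The following is a minimal sketch in plain Python, not a feature of any particular program; the column names, the RULES table, and the clean helper are hypothetical.

```python
# Hypothetical per-column rules; column names are invented for this sketch.
# "zero"    -> an empty answer means a true zero
# "missing" -> an empty answer stays missing (None)
RULES = {
    "income": "zero",
    "salary": "missing",
    "gender": "missing",
    "children": "zero",
}


def clean(record):
    """Apply the agreed rule to every empty cell in one survey record."""
    cleaned = {}
    for column, value in record.items():
        # Treat None and blank or whitespace-only text as empty.
        is_empty = value is None or (isinstance(value, str) and value.strip() == "")
        if not is_empty:
            cleaned[column] = value
        elif RULES[column] == "zero":
            cleaned[column] = 0
        else:
            cleaned[column] = None
    return cleaned


raw = {"income": None, "salary": None, "gender": "  ", "children": ""}
print(clean(raw))
# {'income': 0, 'salary': None, 'gender': None, 'children': 0}

# Effect of blanks on a percentage, using made-up gender answers.
genders = ["female", "male", "", "female", ""]
valid = [g for g in genders if g.strip()]
share_female_all = genders.count("female") / len(genders)  # blanks counted
share_female_valid = valid.count("female") / len(valid)    # blanks excluded
print(share_female_all, round(share_female_valid, 2))      # 0.4 0.67
```

Keeping the rules in one table means the decision is made once, by people who know the questionnaire, instead of being left to each program's defaults.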
By: Ghaith Albahr, CEO of INDICATORS