Jaykumar Patel: Monthly Provisional Counts Of Deaths

Jaykumar patel

Monthly Provisional Counts Of Deaths

In this blog I have used Python packages in the Rstudio and done the EDA as well as Summary Statistics on the cause of the death in the united state of america in the year 2019 and 2020.

Author

Affiliation

Jaykumar patel

Published

May 6, 2021

Citation

patel, 2021

Analysis of the Monthly Provisional Death counts

In this analysis we going to look into the cause of death in the United State in the year of 2019 and 2020.

Importing Required Libraries

import pandas as pd
import numpy as np

Reading Data

death = pd.read_csv('Data/causes_of_death.csv', index_col= 0)

death.head()

   Date.Of.Death.Year  ...  Jurisdiction.of.Occurrence
1                2019  ...               United States
2                2019  ...               United States
3                2019  ...               United States
4                2019  ...               United States
5                2019  ...               United States

[5 rows x 23 columns]

Summary Statistics for Numerical Columns

The describe() function computes a summary of statistics pertaining to the DataFrame columns.
Describe function gives the mean, std and IQR values.

AllCause

death['AllCause'].describe()

count     2849.000000
mean      2280.159354
std       6028.864306
min         10.000000
25%         86.000000
50%        271.000000
75%       1498.000000
max      53242.000000
Name: AllCause, dtype: float64

This column tells us about total number of people died in the month of particular year in the United State of America.

From the count row we can find that we have 2849 non null values in our data and remaining rows contain null values.
In this column we can see that the mean value is 2280.159354 and median (50%) is 271, that means our data is right skewed distributed.
- So the frequency of people died at very large number is not high we can say that there are some outliears which leads to this difference.
Standard deviation of this column is 6028.864306. It indicates data are more spread out. A high standard deviation means data is not closely bound to the mean value.
Data is not containing close Continuous number for the people died in the United State of America. It has very large range.
As we can see the range is 0-53242 and we have only 2849 data, This is why we have high standard deviation and large difference between mean and median.

NaturalCause

death['NaturalCause'].describe()

count     2717.000000
mean      2194.944056
std       5945.745418
min          0.000000
25%         70.000000
50%        216.000000
75%       1294.000000
max      52054.000000
Name: NaturalCause, dtype: float64

This column tells us about total number of people died in the month of particular year in the United State of America due to natural cause.

From the count row we can find that we have 2717 non null values in our data and remaining rows contain null values.
In this column we can see that the mean value is 2194.944056 and median (50%) is 216, that means our data is right skewed distributed.
- We have the closest mean and median to the all cause’s mean and median. It means the both are having the some what similar numbers for observation.
Standard deviation of this column is 5945.745418. It indicates data are more spread out. A high standard deviation means data is not closely bound to the mean value.
- It has very close stadard deviation as allcause column which also add support to the statement that they have some identical numbers or nearest numbers

Septicemia (A40-A41)

death['Septicemia..A40.A41.'].describe()

count    1736.000000
mean       44.580069
std        88.269978
min         0.000000
25%         0.000000
50%        10.000000
75%        36.000000
max       484.000000
Name: Septicemia..A40.A41., dtype: float64

This column tells us about number of people died in the month of particular year in the United State of America due to Septicemia (A40-A41) disease.

From the count row we can find that we have 1736 non null values in our data and remaining rows contain null values.
In this column we can see that the mean value is 44.580069 and median (50%) is 10, that means our data is right skewed distributed.
- Around 44 people die every month from this disease.
Standard deviation of this column is 88.269978. It has the lowest standard deviation of all.

Malignant neoplasms (C00-C97)

death['Malignant.neoplasms..C00.C97.'].describe()

count    2249.000000
mean      549.167185
std      1277.215975
min         0.000000
25%        20.000000
50%        69.000000
75%       338.000000
max      6498.000000
Name: Malignant.neoplasms..C00.C97., dtype: float64

This column tells us about total number of people died in the month of particular year in the United State of America due to Malignant neoplasms.

From the count row we can find that we have 2249 non null values in our data and remaining rows contain null values.
In this column we can see that the mean value is 549.167185 and median (50%) is 69, that means our data is right skewed distributed.
- We can say that around 550 people die every month. As compared to the Septicemia more than 12 times more people died due to this disease.
Standard deviation of this column is 1277.215975. It indicates data are more spread out. A high standard deviation means data is not closely bound to the mean value.
- It has not as large standard deviation as of allcause column but still it has a high standard deviation which means data is not close to it mean so we can not say that our statement for death for everymonth is accurate.

For the remaining columns

We can see that the every column have the low median value and the high mean value. Which means all the column are right skewed.
We can say that the data frame is not closely bound to its mean values. Our data have very distinct values and its difference from minmum to maximum values are very high.
- This difference means there are very few deaths have high frequency and number of deaths for months are comparatively smallar than it mean values.
- Here, we can not state that the number of people died in the month is equal to the mean value and reason is that, as we can see we have very small median and at the same time we have high mean.
- This means that most of the data is less than the mean value there are some outliers. Those outliers are the reason we are having a large mean value.

Summary Statistics for Categorical Data

Race/Ethnicity

This column tells us about total number of people died as per their race in the month of particular year in the United State of America.

print(death.groupby(['Race.Ethnicity']).size())

Race.Ethnicity
Hispanic                                         500
Non-Hispanic American Indian or Alaska Native    500
Non-Hispanic Asian                               500
Non-Hispanic Black                               500
Non-Hispanic White                               500
Other                                            500
dtype: int64

Our data have uniform distribution when we consider the races.
It is good to have uniform distribution it help use to train a model which is not bias.

Sex

This column tells us about total number of people died as per their sex in the month of particular year in the United State of America.

print(death.groupby(['Sex']).size())

Sex
F         720
Female    780
M         720
Male      780
dtype: int64

Here also,
* Our data have uniform distribution when we look for the sex. * Having uniform distribution in the data is help us to see how each category affects

AgeGroup

This column tells us about total number of people died as per their agegroup in the month of particular year in the United State of America.

print (death.groupby(['AgeGroup']).size())

AgeGroup
0-4 years            300
15-24 years          300
25-34 years          300
35-44 years          300
45-54 years          300
5-14 years           300
55-64 years          300
65-74 years          300
75-84 years          300
85 years and over    300
dtype: int64

Same as above data column it also have uniform distribution so if we want to train our model to predict the cause of the death.

We can use following columns:

Race/Ethnicity
Sex
AgeGroup

Exploratory Data Analysis Numeric Data

Plotting histogram for AllCause Column

py$death %>% ggplot(aes(AllCause))+    
  geom_histogram(aes(y = stat(density)), color = "#13B4FA",fill = "#FF6F91", bins = 40) +
    geom_density(fill = "#845EC2", alpha = 0.5, color = NaN)+
  labs(title = "Distribution of Total Death",
         x = "Total Death",
         y = "density") +
    theme_minimal()+
  themes()

We can see here that our data have a right skewness so if we are going to use this data column for the modeling than we need to first remove the skewnees then we can do modeling.

Removing Skewness

p1 <- py$death %>%
    ggplot(aes(AllCause)) +
    geom_histogram(aes(y = stat(density)),bins = 28, color = "#13B4FA",fill = "#FF6F91") +
    geom_density(fill = "#845EC2", alpha = 0.5, color = NaN)+
    labs(title = "Total Death",
         x = "Total Death",
         y = "density") +
    theme_minimal() + themes()

p2 <- py$death %>%
    ggplot(aes(log10(AllCause))) +
    geom_histogram(aes(y = stat(density)),bins = 28, color = "#13B4FA",fill = "#FFC75F") +
    geom_density(fill = "#845EC2", alpha = 0.5, color = NaN)+
    labs(title = "Total Death (log10 based)",
         x = "Total Death",
         y = "density") +
    theme_minimal()+ themes()

p3 <- py$death %>%
    ggplot(aes(sqrt(AllCause))) +
    geom_histogram(aes(y = stat(density)),bins = 28, color = "#13B4FA",fill = "#845EC2") +
    geom_density(fill = "#845EC2", alpha = 0.5, color = NaN)+
    labs(title = "Total Death(Sqrt. based)",
         x = "Total Death",
         y = "density") +
    theme_minimal() + themes()

grid.arrange(p1,p2,p3, ncol = 2)

Now if we look at the distribution of the AllCause we get the normal distribution. By applying log10, we can convert skewness to normal distribution.

Checking for the outliers for All Cause column

py$death %>% 
  ggplot(aes(y = AllCause))+
  geom_boxplot(outlier.colour = "#651a34")+
  theme_minimal()+
  themes()

We can clear see that there are a lot of outliers present in this column. There is a way to remove those that is by applying the Log.

Outliears after applying Log

py$death %>% 
  ggplot(aes(y = log10(AllCause)))+
  geom_boxplot(outlier.colour = "#651a34")+
  theme_minimal()+
  themes()

As we can see now, there is no outliers present in the column so now, we can use this to train out model.

Conclusion of numerical data exploration

We have seen the method to remove skewnees and outliers from the data. We always keep this in mind that before moving to the modeling part we always need to do the exploration of the data, so we can know our data better.
- Its behavior, is there any pattern present in the data.

Exploratory Data Analysis chategorical Data

In this part we are going to plot some graph to check skewness of the chategorical data and outliers present in that data.
Unlike numerical data we can not use histogram to find out skewness, we need to plot bar graph.

Plotting Bar Graph for AgeGroup column

Age <- py$death %>% 
          select(AgeGroup) %>% 
          group_by(AgeGroup) %>% 
          summarize(frequency = n())

Age %>% 
  ggplot(aes(AgeGroup,frequency))+
  geom_col(stat = "identity", fill = "#651a34" )+
  coord_flip()+
  theme_minimal()+
  labs( title = "Distribution of Age Group"
  )+
  themes()

We can see here that our data is evenly distributed in the age group. Every age group have same number of death. So there is no need to do any transformation.

Citation

For attribution, please cite this work as

patel (2021, May 6). Jaykumar Patel: Monthly Provisional Counts Of Deaths. Retrieved from https://jaykumar-patel.netlify.app/python/2021-05-06-monthly-provisional-counts-of-deaths/

BibTeX citation

@misc{patel2021monthly,
  author = {patel, Jaykumar},
  title = {Jaykumar Patel: Monthly Provisional Counts Of Deaths},
  url = {https://jaykumar-patel.netlify.app/python/2021-05-06-monthly-provisional-counts-of-deaths/},
  year = {2021}
}

Monthly Provisional Counts Of Deaths

Author

Affiliation

Published

Citation

Analysis of the Monthly Provisional Death counts

Importing Required Libraries

Reading Data

Summary Statistics for Numerical Columns

AllCause

NaturalCause

Septicemia (A40-A41)

Malignant neoplasms (C00-C97)

For the remaining columns

Summary Statistics for Categorical Data

Race/Ethnicity

Sex

AgeGroup

Exploratory Data Analysis Numeric Data

Plotting histogram for AllCause Column

Removing Skewness

Checking for the outliers for All Cause column

Outliears after applying Log

Conclusion of numerical data exploration

Exploratory Data Analysis chategorical Data

Plotting Bar Graph for AgeGroup column

Footnotes

Citation