In this post I use Python packages inside RStudio to run exploratory data analysis (EDA) and summary statistics on causes of death in the United States of America in 2019 and 2020.
import pandas as pd
import numpy as np
death = pd.read_csv('Data/causes_of_death.csv', index_col=0)
death.head()
Date.Of.Death.Year ... Jurisdiction.of.Occurrence
1 2019 ... United States
2 2019 ... United States
3 2019 ... United States
4 2019 ... United States
5 2019 ... United States
[5 rows x 23 columns]
death['AllCause'].describe()
count 2849.000000
mean 2280.159354
std 6028.864306
min 10.000000
25% 86.000000
50% 271.000000
75% 1498.000000
max 53242.000000
Name: AllCause, dtype: float64
This column gives the total number of deaths in a given month of a given year in the United States of America.
The count row shows 2,849 non-null values; the remaining rows are null.
The mean is 2280.159354 and the median (50%) is 271. A mean this far above the median means the distribution is right-skewed.
The standard deviation is 6028.864306, more than twice the mean, so the values are widely spread rather than tightly clustered around the mean.
The death counts do not fall in a narrow band; they cover a very large range.
The values run from 10 to 53,242 across only 2,849 observations, which is why the standard deviation is high and the gap between mean and median is large.
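The mean-versus-median reading above can also be checked numerically. The sketch below is only an illustration on invented values: the `counts` series is hypothetical and is not taken from the dataset. It compares the mean with the median and uses pandas' `Series.skew()`, where a positive value indicates a longer right tail.

```python
import pandas as pd

# Invented monthly counts (NOT the real data): mostly small values plus
# a few very large ones, loosely mimicking the shape of AllCause.
counts = pd.Series([10, 40, 86, 150, 271, 400, 1498, 9000, 53242])

mean = counts.mean()
median = counts.median()
skewness = counts.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean:.1f}, median={median:.1f}, skewness={skewness:.2f}")

# Both signals agree when the distribution is right-skewed.
is_right_skewed = bool(mean > median and skewness > 0)
print("right skewed:", is_right_skewed)
```

On the real data the same three calls could be applied to `death['AllCause']` directly.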
death['NaturalCause'].describe()
count 2717.000000
mean 2194.944056
std 5945.745418
min 0.000000
25% 70.000000
50% 216.000000
75% 1294.000000
max 52054.000000
Name: NaturalCause, dtype: float64
This column gives the total number of deaths from natural causes in a given month of a given year in the United States of America.
The count row shows 2,717 non-null values; the remaining rows are null.
The mean is 2194.944056 and the median (50%) is 216, so this distribution is also right-skewed.
The standard deviation is 5945.745418, again more than twice the mean, so the values are widely spread rather than tightly clustered around the mean.
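The "count" row of `describe()` only counts non-null values, so the number of missing rows has to be worked out separately. A minimal sketch, assuming a toy frame with invented values (the `toy` data is hypothetical), shows how the two numbers relate:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only; NaN marks a missing count.
toy = pd.DataFrame({"NaturalCause": [120.0, np.nan, 35.0, np.nan, 0.0]})
col = toy["NaturalCause"]

non_null = col.count()       # same number describe() reports as "count"
missing = col.isna().sum()   # rows where the value is null

print(f"rows={len(col)}, non-null={non_null}, missing={missing}")
```

Applied to the real frame, `death['NaturalCause'].isna().sum()` would give the number of null rows behind the count of 2,717.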
death['Septicemia..A40.A41.'].describe()
count 1736.000000
mean 44.580069
std 88.269978
min 0.000000
25% 0.000000
50% 10.000000
75% 36.000000
max 484.000000
Name: Septicemia..A40.A41., dtype: float64
This column gives the number of deaths from septicemia (A40-A41) in a given month of a given year in the United States of America.
The count row shows 1,736 non-null values; the remaining rows are null.
The mean is 44.580069 and the median (50%) is 10, so this distribution is right-skewed as well.
The standard deviation is 88.269978, the lowest of the four columns summarized in this post.
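Comparing spread across columns does not require running `describe()` one column at a time. The following sketch uses a toy frame with invented values and a shortened, hypothetical column name (`Septicemia`) to show how `DataFrame.std()` and `idxmin()` pick out the column with the smallest standard deviation:

```python
import pandas as pd

# Toy stand-ins for the real columns; the values are invented for illustration.
toy = pd.DataFrame({
    "AllCause": [10, 86, 271, 1498, 53242],
    "NaturalCause": [0, 70, 216, 1294, 52054],
    "Septicemia": [0, 0, 10, 36, 484],
})

spread = toy.std()        # sample standard deviation per column (ddof=1)
lowest = spread.idxmin()  # label of the column with the smallest spread

print(spread.round(1))
print("lowest spread:", lowest)
```

Note that a raw standard deviation depends on the scale of each column, so a column with small counts will tend to have a small standard deviation too.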
death['Malignant.neoplasms..C00.C97.'].describe()
count 2249.000000
mean 549.167185
std 1277.215975
min 0.000000
25% 20.000000
50% 69.000000
75% 338.000000
max 6498.000000
Name: Malignant.neoplasms..C00.C97., dtype: float64
This column gives the total number of deaths from malignant neoplasms (C00-C97) in a given month of a given year in the United States of America.
The count row shows 2,249 non-null values; the remaining rows are null.
The mean is 549.167185 and the median (50%) is 69, so this distribution is right-skewed too.
The standard deviation is 1277.215975, more than twice the mean, so the values are widely spread rather than tightly clustered around the mean.
Every column examined has a low median and a high mean, which means all of these columns are right-skewed.
The values are not clustered closely around their means. They vary widely, and the difference between the minimum and the maximum is very large.
This gap means that only a few records have very high death counts, while most monthly counts are much smaller than the mean.
So we cannot say that the typical number of deaths in a month equals the mean: the median is very small while the mean is high.
Most values lie below the mean, and a few outliers are what pull the mean up.
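To see how a handful of outliers can inflate the mean, one common convention (not necessarily the one used in this analysis) is Tukey's 1.5 x IQR rule, the same rule boxplots use by default for their whiskers. The sketch below applies it to invented values; the `counts` series is hypothetical.

```python
import pandas as pd

# Invented monthly counts (NOT the real data).
counts = pd.Series([10, 40, 86, 150, 271, 400, 1498, 9000, 53242])

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged as high outliers

outliers = counts[counts > upper_fence]
mean_all = counts.mean()
mean_without = counts[counts <= upper_fence].mean()

print("upper fence:", upper_fence)
print("outliers:", outliers.tolist())
print(f"mean with outliers={mean_all:.1f}, without={mean_without:.1f}")
```

Dropping the flagged values is shown only to make the effect on the mean visible; whether such records should actually be removed depends on the question being asked.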
print(death.groupby(['Race.Ethnicity']).size())
Race.Ethnicity
Hispanic 500
Non-Hispanic American Indian or Alaska Native 500
Non-Hispanic Asian 500
Non-Hispanic Black 500
Non-Hispanic White 500
Other 500
dtype: int64
print(death.groupby(['Sex']).size())
Sex
F 720
Female 780
M 720
Male 780
dtype: int64
Here also,
* The data are uniformly distributed by sex: once the duplicate labels are merged (F with Female, M with Male), each sex has 1,500 rows.
* A uniform distribution helps us see the effect of each category.
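The Sex output lists four labels (F, Female, M, Male) for what are really two categories. A minimal sketch of merging them is shown below; the toy series reproduces the four counts from that output, and `label_map` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy column reproducing the four labels and counts from the Sex output.
sex = pd.Series(
    ["F"] * 720 + ["Female"] * 780 + ["M"] * 720 + ["Male"] * 780,
    name="Sex",
)

# Map the short labels onto the long ones; other values are left unchanged.
label_map = {"F": "Female", "M": "Male"}
sex_clean = sex.replace(label_map)

print(sex_clean.value_counts())
```

On the real frame the same idea would be `death['Sex'] = death['Sex'].replace(label_map)`, run before any grouping by sex.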
print(death.groupby(['AgeGroup']).size())
AgeGroup
0-4 years 300
15-24 years 300
25-34 years 300
35-44 years 300
45-54 years 300
5-14 years 300
55-64 years 300
65-74 years 300
75-84 years 300
85 years and over 300
dtype: int64
Like the columns above, AgeGroup is uniformly distributed. So if we want to train a model to predict the cause of death, we can use the following columns:
* Race/Ethnicity
* Sex
* AgeGroup
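All three columns are categorical, so most models would need them converted to numbers first. One possible approach, sketched here as an assumption rather than the method used in this post, is one-hot encoding with `pd.get_dummies`. The toy rows are invented; only the column names come from the dataset.

```python
import pandas as pd

# Three invented rows using the dataset's column names.
toy = pd.DataFrame({
    "Race.Ethnicity": ["Hispanic", "Non-Hispanic Asian", "Other"],
    "Sex": ["Female", "Male", "Female"],
    "AgeGroup": ["0-4 years", "25-34 years", "85 years and over"],
})

# One indicator column per category value, e.g. "Sex_Female", "Sex_Male".
features = pd.get_dummies(toy, columns=["Race.Ethnicity", "Sex", "AgeGroup"])

print(features.shape)
print(list(features.columns))
```

Each row ends up with exactly one active indicator per original column, which is the form most estimators expect for unordered categories.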
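# Packages used by the R chunks; reticulate exposes the Python object as py$death
library(reticulate)
library(dplyr)
library(ggplot2)
library(gridExtra)
# themes() is not defined in this post; it is assumed to be the author's own theme helper
# Histogram of AllCause on a density scale with a density curve on top
py$death %>%
  ggplot(aes(AllCause)) +
  geom_histogram(aes(y = after_stat(density)), color = "#13B4FA", fill = "#FF6F91", bins = 40) +
  geom_density(fill = "#845EC2", alpha = 0.5, color = NA) +
  labs(title = "Distribution of Total Death",
       x = "Total Death",
       y = "density") +
  theme_minimal() +
  themes()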
# p1: AllCause on its original scale
p1 <- py$death %>%
  ggplot(aes(AllCause)) +
  geom_histogram(aes(y = after_stat(density)), bins = 28, color = "#13B4FA", fill = "#FF6F91") +
  geom_density(fill = "#845EC2", alpha = 0.5, color = NA) +
  labs(title = "Total Death",
       x = "Total Death",
       y = "density") +
  theme_minimal() + themes()
# p2: log10 transform, which compresses the long right tail
p2 <- py$death %>%
  ggplot(aes(log10(AllCause))) +
  geom_histogram(aes(y = after_stat(density)), bins = 28, color = "#13B4FA", fill = "#FFC75F") +
  geom_density(fill = "#845EC2", alpha = 0.5, color = NA) +
  labs(title = "Total Death (log10 based)",
       x = "log10(Total Death)",
       y = "density") +
  theme_minimal() + themes()
# p3: square-root transform, a milder correction than log10
p3 <- py$death %>%
  ggplot(aes(sqrt(AllCause))) +
  geom_histogram(aes(y = after_stat(density)), bins = 28, color = "#13B4FA", fill = "#845EC2") +
  geom_density(fill = "#845EC2", alpha = 0.5, color = NA) +
  labs(title = "Total Death (Sqrt. based)",
       x = "sqrt(Total Death)",
       y = "density") +
  theme_minimal() + themes()
# Show the three versions side by side (grid.arrange comes from gridExtra)
grid.arrange(p1, p2, p3, ncol = 2)
# Boxplot of AllCause on its original scale
py$death %>%
  ggplot(aes(y = AllCause)) +
  geom_boxplot(outlier.colour = "#651a34") +
  theme_minimal() +
  themes()
# Same boxplot after a log10 transform
py$death %>%
  ggplot(aes(y = log10(AllCause))) +
  geom_boxplot(outlier.colour = "#651a34") +
  theme_minimal() +
  themes()
# Number of rows per age group
Age <- py$death %>%
  select(AgeGroup) %>%
  group_by(AgeGroup) %>%
  summarize(frequency = n())
# geom_col() already plots the values as given, so no stat argument is needed
Age %>%
  ggplot(aes(AgeGroup, frequency)) +
  geom_col(fill = "#651a34") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Distribution of Age Group") +
  themes()
For attribution, please cite this work as
Patel (2021, May 6). Jaykumar Patel: Monthly Provisional Counts Of Deaths. Retrieved from https://jaykumar-patel.netlify.app/python/2021-05-06-monthly-provisional-counts-of-deaths/
BibTeX citation
@misc{patel2021monthly,
  author = {Patel, Jaykumar},
  title = {Jaykumar Patel: Monthly Provisional Counts Of Deaths},
  url = {https://jaykumar-patel.netlify.app/python/2021-05-06-monthly-provisional-counts-of-deaths/},
  year = {2021}
}