Using Machine Learning to Determine the
Causal Relationship of Age on COVID-19
Symptom Progression Period, Infection Rate
Percentage, and Recurring Symptoms
Aditya K. Mittal¹, Henry Zhao¹, Aadhi T. Kumaraswamy¹, Megan S. Jacob¹, and Rohan S. Ayyagari¹
Advisor: Keshav Rao¹
¹Aspiring Scholars Directed Research Program, 43505 Mission Blvd, Fremont, CA 94539
Due to rising COVID-19 rates and a shortage of researchers,
more research is needed to combat future COVID-19 resurgences
and to reduce the number of people infected. Using datasets
from California and Kaggle, machine learning models were used
to study COVID-19 outcomes across age groups. Using k-means
clustering, the average symptom progression period before
hospitalization was found for each age group, and polynomial
regression was then used to predict each age group’s
percentage of total positive cases. Afterward, a patient
dataset was analyzed to look at the symptoms reported for each
patient, and graphs were created to show how many patients had
a specific symptom. The project showed evidence of three
relationships between age and COVID-19: older patients have a
longer symptom progression period, the young adult age range
has a higher likelihood of infection, and symptoms are similar
across all age ranges. This project proved a causal
relationship for age on symptom progression period and
infection percentage by age group, which will provide
invaluable information to future researchers developing a cure.
age groups | COVID-19 | k-means clustering | polynomial regression |
symptoms
Correspondence: keshav.rao@fremontstem.com
Introduction
The purpose of the project is to determine the direct causal
relationship between the age of COVID-19 patients and symptom
type, the likelihood of infection, and the symptom progression
period. In California alone, as of August 10, 2020, there had
been 563,000 total confirmed cases and 10,365 total deaths.
Studies have shown that COVID-19 is deadliest for older people
with pre-existing health conditions. However, more research
must be done to see how that information can contribute to a
cure for the growing pandemic. Because the virus is airborne,
the chance of infection increases drastically, making the
disease more dangerous. If the future spread of the virus
could be predicted, the necessary steps could be taken to
combat COVID-19.
Through this experiment, a three-pronged approach
was implemented to determine age’s general effect on
COVID-19. First, using unsupervised machine learning al-
gorithms to create age range clusters, the average symptom
progression period from symptom onset to hospitalization
was calculated and analyzed. Second, looking specifically at
California, each age group’s share of the total number of
infected people across all ages was determined for each day
and examined using graphical analysis; future percentages
were predicted using polynomial regression. Lastly,
the most recurring symptoms for each group were determined
from a patient dataset containing the patients’ age and current
symptoms. Using this three-pronged approach, it is now pos-
sible to predict certain factors of a patient depending on age,
a finding that will be discussed in greater detail in the study.
These predictions can help correct misinformation and help the
general public make better-informed decisions regarding
COVID-19. They form the basis for assessing whether there is a
causal relationship of age on symptom progression period,
infection percentage, and specific symptoms of COVID-19. By
learning about specific symptoms for each age group, more
individuals will be able to recognize whether they may have
the disease and help stop it from spreading further. COVID-19
has already taken hundreds of thousands of lives and infected
millions of others, which is why this research is vital to
containing the spread of the virus.
Methods
K-Means Clustering
An unsupervised machine learning algorithm, k-means clustering
aims to find clusters within an n-dimensional space of points
using distance as a heuristic [1]. The algorithm attempts to
minimize the sum of squared distances between each point and
the centroid of its assigned cluster. The number of centroids
is chosen beforehand as a hyperparameter and is largely
dependent on the type of data; centroids form the center of
each cluster and were used to find the average symptom
progression period for the study. The study’s data was plotted
in a two-dimensional space of age against number of days from
symptom onset to hospitalization. Outliers were removed
beforehand so that they would not distort the clusters, and an
elbow curve was plotted to determine the number of clusters
and centroids necessary [2].
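As an illustration of the elbow method described above, the following is a minimal sketch and not the study's actual code. The data is synthetic, and the variable names and the range of k values tried are assumptions; only the use of sklearn and the choice of six clusters come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: columns are (age, days from onset to hospitalization).
rng = np.random.default_rng(0)
ages = rng.uniform(18, 80, size=200)
days = rng.uniform(1, 35, size=200)
X = np.column_stack([ages, days])

# Inertia is the within-cluster sum of squared distances that k-means minimizes.
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Plotting `inertias` against k (for example with matplotlib) gives the elbow
# curve; the "elbow" is where adding clusters stops reducing inertia much.
final = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
centroids = final.cluster_centers_  # one (age, days) pair per cluster
```

Each row of `centroids` holds a cluster's mean age and mean number of days, which is how a centroid can be read as an average symptom progression period for an age band.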
338 | ASDRP Summer 2020 Effect of Age on COVID-19 Symptoms
K-means clustering helped identify the optimal age ranges from
the data provided. Advantages of k-means clustering are its
computational speed in determining clusters and its easy
implementation. However, it was difficult to determine optimal
clusters, as the clusters changed depending on where the
initial centroids were placed [3]. To address this, the
clustering algorithm was run many times with different initial
centroids, and the age ranges for each cluster were determined
from the aggregate clusters. Additionally, it was difficult to
determine the number of clusters, even with an elbow curve, as
the optimal score-cluster ratio changes for every dataset.
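The text does not say how the runs were aggregated, so the following is only one possible sketch: run k-means from several random starts, sort each run's centroids by age, and average them. The function name `aggregate_centroids`, the number of runs, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (age, days) points standing in for the patient data.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(18, 80, 300), rng.uniform(1, 35, 300)])

def aggregate_centroids(X, n_clusters=4, n_runs=20):
    """Average age-sorted centroids over runs with different random starts."""
    runs = []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, init="random", n_init=1,
                    random_state=seed).fit(X)
        centers = km.cluster_centers_
        # Sort by age so that row i refers to a comparable age band in every run.
        runs.append(centers[np.argsort(centers[:, 0])])
    return np.mean(runs, axis=0)

mean_centroids = aggregate_centroids(X)
```

In practice, sklearn's `n_init` parameter already repeats the algorithm from several starts and keeps the run with the lowest inertia, which is a simpler way to reduce sensitivity to the initial centroids.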
For the study, the sklearn and matplotlib libraries for Python
were used to run the algorithm and plot the results. The
sklearn library provided a ready-made implementation of
k-means, which simplified the analysis. Additionally, the
matplotlib library was used to graph the elbow curve and the
final clusters, which helped with the visualization of the
final result of the algorithm.
Polynomial Regression
A supervised machine learning algorithm, polynomial regression
is largely used to predict trends from data that can be easily
visualized on scatter plots. To run the model, an nth-degree
polynomial is fitted to the scatter plot so as to reduce the
error between the points and their predicted values on the
polynomial curve. This error is calculated as the sum of the
squared distances between each real point and its predicted
point, and the polynomial’s coefficients are chosen to
minimize it. Although a common and simple-to-use algorithm,
polynomial regression carries the risk of overfitting, in
which the model fits the training data well but does not
generalize to testing data. Nevertheless, polynomial
regression provided an efficient way to predict from time
series data, and multiple models were created to reduce the
risk of overfitting.
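The squared-error criterion described above can be written out explicitly. The notation here is an assumption introduced for illustration and does not come from the study: $m$ data points $(x_i, y_i)$, a polynomial of degree $n$, and coefficients $\beta_0, \dots, \beta_n$.

```latex
\hat{y}(x) = \sum_{j=0}^{n} \beta_j x^{j},
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m} \bigl( y_i - \hat{y}(x_i) \bigr)^{2}
```

Because $\hat{y}$ is linear in the coefficients $\beta_j$, this is an ordinary least-squares problem in the features $1, x, x^2, \dots, x^n$, which is why a linear regression routine can be used to fit a polynomial.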
The process of implementing polynomial regression
was done using the matplotlib and sklearn modules. Using
these modules, we were able to use the infection percent-
age dataset that was derived from the California datasets and
graph the existing and predicted lines for each age group pro-
vided in the dataset.
First, it was necessary to change the unit in which the dates
would be measured. Individual dates would not work in their
original format (MM-DD-YYYY), since mathematical operations
cannot be performed directly on date strings. To work around
this drawback, each row was relabeled as ‘Day X’ after
04-02-2020, the very first date, so that each day number is
the same as its row index and increases by one per row, giving
the algorithm a numeric value for every date.
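The date conversion described above can be sketched with Python's standard library. The function name `day_index` is hypothetical, and counting the first date as day 0 is an assumption based on the statement that each number matches its row index.

```python
from datetime import date, datetime

START = date(2020, 4, 2)  # first date in the dataset, per the text

def day_index(date_string):
    """Convert an MM-DD-YYYY string to the number of days after START."""
    parsed = datetime.strptime(date_string, "%m-%d-%Y").date()
    return (parsed - START).days

print(day_index("04-02-2020"))  # 0
print(day_index("09-24-2020"))  # 175
```

The second call reproduces the figure used later in the paper: September 24th falls 175 days after April 2nd.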
Afterward, we imported the sklearn module, specifically its
linear regression and polynomial regression components, along
with the matplotlib module for making scatter plots. A
function was then created that takes the age ranges, graphs
them, and plots the polynomial regression line on top of the
data. In this case, the regression model used a degree of
four. This function allowed us to graph all of our data along
with their respective polynomial regression lines and also
predicted the percentages for the next “X” days, as inputted.
After running the function with all of the age groups, a new
data frame was created in a format similar to the infection
percentage dataset, with the predicted values from the
polynomial regression model appended to the existing data.
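A degree-four fit and forecast of this kind might look like the sketch below. It is not the study's function: the name `fit_and_extend`, the use of a pipeline with `PolynomialFeatures`, and the made-up series are all assumptions. Plotting is left out so that the example stays focused on the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_and_extend(days, percents, future_days, degree=4):
    """Fit a polynomial to (day, percent) and predict `future_days` past the end."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(days.reshape(-1, 1), percents)
    next_days = np.arange(days[-1] + 1, days[-1] + 1 + future_days)
    return next_days, model.predict(next_days.reshape(-1, 1))

# Made-up series for one age group: a slow downward drift plus noise.
rng = np.random.default_rng(0)
days = np.arange(0, 130)
percents = 67 - 0.015 * days + rng.normal(0, 0.2, size=days.size)

next_days, predicted = fit_and_extend(days, percents, future_days=45)
```

A degree-four polynomial can swing sharply once it leaves the range of the observed days, so extrapolated percentages of this kind are best read as rough trends rather than precise forecasts.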
Using a for loop, we then graphed the original dataset
along with the prediction dataset together with different col-
ors so that we can see the original values and the predictions
that were given by polynomial regression.
Data Mining
Data mining was applied to both datasets used in the project.
Both datasets were in columnar format, which allowed for
easier manipulation of the data in a Pandas data frame. The
original datasets had many columns that were not of use for
this project, for example, the location of the case and the
web link to the case report. On top of this, there were many
NaNs and missing values. In order to compare just age and
symptoms, data mining was used to create a new dataset from
the original one containing only the data that was needed.
Data mining is the discovery of structures and patterns in
large and complex data sets. There are two aspects to data
mining: model building and pattern detection [4]. For the
purpose of this project, both aspects of data mining were
incorporated. Model building was used for the k-means
clustering method, and pattern detection was used for the
polynomial regression.
A variety of Python functions were used to create the new
dataset. First, all rows with missing values were dropped from
the dataset using row-wise deletion through Pandas [5].
Afterward, all the columns besides the age of the patients and
the list of symptoms were dropped as well. Lastly, a function
sorted the data frame by age and stored the results in a new
dataset.
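The cleaning steps above can be sketched in Pandas as follows. The column names and the toy rows are hypothetical. Note one deliberate difference from the order given in the text: this sketch selects the two columns first, so that a row is only dropped when its age or symptoms are missing, not because an unused column is empty.

```python
import pandas as pd

# Tiny stand-in for the raw patient table; column names are hypothetical.
raw = pd.DataFrame({
    "age": [61, None, 34, 72],
    "symptom": ["fever, cough", "fever", None, "joint pain"],
    "location": ["A", "B", "C", "D"],
    "link": ["u1", "u2", "u3", "u4"],
})

cleaned = (
    raw[["age", "symptom"]]   # keep only the columns of interest
    .dropna()                 # drop rows missing age or symptoms
    .sort_values("age")       # order patients by age
    .reset_index(drop=True)
)
```

Here two of the four toy rows survive, since one is missing an age and another is missing its symptom list.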
The data mining that was applied to the original dataset
helped greatly with the machine learning algorithms’ accu-
racies. Due to this process, the data was converted into an
easier format for model building and pattern detection.
Datasets Extracted
Due to the time and budget constraints of this project, the
datasets used were compiled by external parties, which avoided
the effort and computation required to build such datasets.
There were two main datasets that were extracted: one to
analyze the infection percentage for each age group and one to
analyze the symptom progression period and recurrent symptoms
in each age group. The first dataset can be found here:
https://data.ca.gov/dataset/covid-19-cases/resource/339d1c4d-77ab-44a2-9b40-745e64e335f2.
Additionally, the second dataset can be found here:
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset?select=COVID19_line_list_data.csv.
Both were displayed in tabular format, which allowed for easy
manipulation of the data into a Pandas data frame and the
running of k-means clustering on certain columns.
Results
Symptom Progression Period
Using k-means clustering, the symptom progression period was
determined from a patient dataset with ages ranging from 18 to
80 years old. Using an elbow curve, as shown in Figure 1, the
optimal number of clusters for the dataset provided was
determined to be six. The symptom progression period was
calculated as the number of days from the patient’s symptom
onset date to hospitalization. Although this is not an exact
measurement of symptom progression, worse symptoms of COVID-19
generally necessitate hospitalization, while milder symptoms
can be handled at home.
Fig. 1. Elbow curve used to determine the optimal number of clusters needed to
cluster age based on symptom progression period.
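The symptom progression period defined above is a difference between two dates per patient. A minimal Pandas sketch follows; the column names, the date format, and the three example patients are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical patients; dates are written as MM-DD-YYYY strings.
patients = pd.DataFrame({
    "age": [25, 58, 70],
    "symptom_onset": ["01-10-2020", "01-15-2020", "01-20-2020"],
    "hosp_visit_date": ["01-19-2020", "01-26-2020", "02-03-2020"],
})

# Parse both date columns so that they can be subtracted.
for column in ("symptom_onset", "hosp_visit_date"):
    patients[column] = pd.to_datetime(patients[column], format="%m-%d-%Y")

# Days from symptom onset to hospitalization.
patients["progression_days"] = (
    patients["hosp_visit_date"] - patients["symptom_onset"]
).dt.days
```

The resulting `age` and `progression_days` columns are the two dimensions on which the clustering operates.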
The symptom progression period varied from 1 to 35 days across
all of the age ranges. Of the six clusters produced by k-means
clustering, four significant clusters were extracted from the
data, as the other two contained only a few data points. The
first cluster had a mean symptom progression period of 9.5
days for the age range of 18-33 years old. The second cluster
had a mean symptom progression period of 10.5 days for the age
range of 34-49 years old. The third cluster had a mean symptom
progression period of 11 days for the age range of 50-64 years
old. Lastly, the fourth cluster had a mean symptom progression
period of 14.4 days for the age range of 65-80 years old. From
plotting a trendline through this data and analyzing the
clusters, it was determined that there was a positive
correlation between age and symptom progression period. A
graph of the clusters can be seen in Figure 2.
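To make the trendline concrete, the sketch below fits a straight line through the four reported cluster means. Using the midpoint of each age range as the x value is an assumption, since the study's trendline was fitted to the underlying data rather than to these four summary points.

```python
import numpy as np

# Mean progression period per cluster, as reported above.
mean_days = np.array([9.5, 10.5, 11.0, 14.4])
# Midpoints of the age ranges 18-33, 34-49, 50-64, and 65-80 (an assumption).
age_midpoints = np.array([25.5, 41.5, 57.0, 72.5])

# Least-squares line through the four points and their correlation coefficient.
slope, intercept = np.polyfit(age_midpoints, mean_days, deg=1)
r = np.corrcoef(age_midpoints, mean_days)[0, 1]
```

Both `slope` and `r` come out positive for these values, in line with the reported positive correlation, although four summary points are too few to say anything about its strength.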
Most Recurring Symptoms
Using the Kaggle dataset, symptoms were extracted for each
of the patients and compiled to create a comprehensive list
of symptoms for each age group. Given the availability of
data, it was found optimal to bucket the age ranges into three
groups: 18-49 years old, 50-65 years old, and 65+ years old.
For the 18-49 year old age group, the symptoms were generally
commonplace, including fever, sore throat, malaise, and runny
nose. Fever was the most frequently reported symptom for this
age group, and it has already been shown to be the most common
symptom of COVID-19 [6]. Therefore, these results largely
confirmed existing findings and did not provide any
breakthroughs. Additionally, sore throat, malaise, and runny
nose have all been shown to be common symptoms of COVID-19
[7]. For the 50-65 year old age group, the situation is
similar, as many of the most common symptoms, such as fever,
cough, malaise, and headache, have already been shown to be
common symptoms of COVID-19.
Fig. 2. K-means clustering to determine the clusters based on symptom progression
period and age range.
Fig. 3. The top most recurring symptoms for the 65+ age group from the Kaggle
dataset.
However, for the 65+ years old age group, joint pain,
shortness of breath, and vomiting become common symptoms, in
contrast to the first two age groups. Joint pain is common in
older people, which may explain why that symptom was reported
while in the hospital [8]. Additionally, shortness of breath
is commonplace for older people, while vomiting is common for
older patients suffering from COVID-19 [9]. The most recurring
symptoms for the 65+ years old age group can be seen in
Figure 3.
Percent Infected by Age Range
Using polynomial regression, the percent infected per age
range was determined from a California dataset. The data was
plotted on two axes: the percentage of total cases for each of
four age ranges, and the number of days after April 2, 2020.
Using polynomial regression, we were able to predict how each
age group’s percentage of total cases will change. The age
ranges were 0-17, 18-49, 50-64, and 65+. The ranges were
color-coded and graphed on top of each other in Figure 4.
Fig. 4. This graph implements polynomial regression that predicts the percent
infected by age range for the next fifty days.
From April 2nd to the present day, the age ranges’ percentages
stayed roughly the same. For the 0-17 age range, the share of
cases in California was under one percent. For the 18-49 age
range, the share of cases was between 65 percent and 67
percent and has been slowly decreasing. For the 50-64 range,
the share of cases has stayed at a steady 20 percent of total
cases in California. Finally, for the 65+ age range, the share
of cases increased from 11 percent to 13 percent. The
polynomial regression also modeled what the future could hold
for COVID-19 in California. The model predicts up to 175 days
after April 2nd, which would be September 24th. The 0-17 age
range is predicted to increase to 4 percent of total cases.
The 18-49 age range is predicted to decrease to 51 percent of
total cases. The 50-64 age range is predicted to increase
slightly to the 21 percent mark. Finally, the 65+ age range is
predicted to increase to 23 percent of total cases in
California.
Discussion
Symptom Progression Period
Using k-means clustering, it was determined that a positive
correlation existed between age and symptom progression
period. Although symptom progression period increasing with
age seems to be against the common consensus, especially with
the majority of COVID-19 patients being elderly, viewing this
issue from a biological perspective proves a causal
relationship between the two factors. As the human body ages,
it encounters more and more diseases, which allows the immune
system to create antibodies against a wealth of diseases.
Therefore, many common symptoms that occur, such as dry cough
and fever, would be abated due to the strength of the immune
system [10]. Due to this, hospitalization would occur at a
later date, as the symptoms are weaker and progress more
slowly.
However, although this may seem to be the case, studies show
that although immune systems vary heavily between individuals,
they stay relatively the same for a single individual. In this
case, a single variation in age would not be the cause of a
weaker or stronger immune system [11]. This would suggest that
the present study may have had an inherent flaw, possibly
caused by a smaller sample size than the one needed to provide
a non-biased conclusion. Because the cited immunology research
has substantial biological backing, it is safer to assume that
the present study’s finding is the less reliable of the two.
However, a biological study must be done to further test age’s
effect on symptom progression period.
Most Recurring Symptoms
Across all age groups, the top five symptoms included fever
and joint pain. As previously discussed, these have been shown
to be common symptoms of COVID-19 and can likely be
disregarded as a significant outcome. After analyzing the
entire list of symptoms provided by the dataset, joint pain
seems to be the only symptom that does not recur for other age
groups. Because the likelihood of joint pain increases with
age, this also does not appear to be a significant finding.
Reasons for these findings vary, but they could be due to the
small sample size of the dataset, similar to the symptom
progression period dataset. Obscure and infrequent symptoms
would not be picked up in this dataset, as only a few thousand
patients were analyzed. As a result, this data does not
provide a comprehensive list of symptoms and cannot be relied
upon to provide accurate symptom data. Lastly, due to
region-specific symptoms, it is impossible to accurately
determine the most recurring symptoms globally. This dataset
primarily consisted of data from Southeast Asia and Europe, as
the data was taken in January and February. Therefore, it does
not account for the symptoms elsewhere, which skews the
findings.
Within each age group, study participants faced varying
symptoms on a case-by-case basis. The number of people
surveyed also differed for each age range, which affected some
of the statistics. In the 0-17 age range 12 individuals were
surveyed, in the 18-49 age range 165 individuals were
surveyed, in the 50-65 age range 130 individuals were
surveyed, and in the 65+ age range 355 individuals were
surveyed. Consequently, the symptom analysis for the 65+ age
range used data from the most study participants. This may
have been due to the higher risk faced by the elderly with
regard to COVID-19. Due to the larger amount of data, it can
also be reasonably inferred that the conclusions drawn from
the 65+ age range are more accurate than those from the other
age ranges. On the other hand, the 0-17 age range only
contained data from 12 participants and is therefore likely to
have the least accurate results of all the age ranges
analyzed.
Along with impacting the degree of accuracy of the results,
the amount of data available for each age range also impacts
the comparison between the various age ranges. In the 0-17 age
range 5 out of 12 individuals experienced fever symptoms, in
the 18-49 age range 5 out of 165 individuals experienced fever
symptoms, in the 50-65 age range 56 out of 130 individuals
experienced fever symptoms, and in the 65+ age range 142 out
of 355 individuals experienced fever symptoms. In this
scenario it is difficult to compare the numbers alone. Age
range 0-17 had 5 cases of fever reported and age range 65+ had
142 cases reported. Compared by count alone, the 65+ age range
seems to have a higher likelihood of facing fever as a symptom
than the 0-17 age range. However, once converted to
percentages, it is found that only 40% of patients reported
fever in the 65+ age range, while 42% reported fever in the
0-17 age range. Consequently, although the raw counts differ
greatly, the results show that both age ranges have a similar
likelihood of reporting fever as a symptom. This principle
applies to all of the age ranges and all of the symptoms as
they are compared against each other.
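The conversion from counts to percentages can be reproduced directly from the figures quoted above. The dictionary layout and variable names in this short sketch are illustrative choices.

```python
# (patients reporting fever, patients in group), as given in the text.
fever_counts = {
    "0-17": (5, 12),
    "18-49": (5, 165),
    "50-65": (56, 130),
    "65+": (142, 355),
}

# Share of each group reporting fever, in percent, rounded to one decimal place.
fever_rate = {
    group: round(100 * with_fever / total, 1)
    for group, (with_fever, total) in fever_counts.items()
}
print(fever_rate)
```

The 0-17 and 65+ groups come out at about 41.7% and 40.0%, matching the 42% and 40% quoted above, even though their raw counts are 5 and 142.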
Percent Infected by Age Range
Using polynomial regression, we were able to draw numerous
conclusions about the percentage of infected people across
different age groups. Since April 2nd, most of the cases have
been in the age range of 18-49. While this may at first seem
like an extensive range, our machine learning model of the
percent infected by age range predicts that this share will
decrease from 65% to only 51% by September 24th (175 days
after April 2nd). Furthermore, the share of cases among people
in the 0-17 age range is predicted to rise from less than 1%
to 4% of total cases. The share for the 65+ age range almost
doubles, from around 12% to 23%. The remaining age range,
50-64, did not experience any significant change in its
infected percentage.
We can see that closer to the start of the outbreak, it was
generally older people getting affected by the virus, but as
time progresses, infected rates for younger age groups are
slowly increasing and accounting for a larger share of total
cases. While the 18-49 age group covers an extensive range
compared to other ranges such as 50-64, which spans only 14
years, its share has decreased overall since the start of the
pandemic, whereas the share of the youngest age range has been
increasing rapidly. This rate of increase in the percentage of
infected is further evidence that young adults are getting
infected more often and have a higher likelihood of infection.
These changes in percentages between different age groups are
notable because they open new possibilities and potential
findings in subsequent research, which will help in developing
a cure for the disease.
Conclusion
The aim of this study was to determine age’s overall
relationship with COVID-19 symptoms and infection rate. After
conducting the research, it was determined that a causal
relationship could be found between age and COVID-19 symptoms.
Three major findings were extracted from the data: older
patients are likely to experience slower symptom progression
than younger patients, the young adult age range has the
highest likelihood of getting infected with COVID-19, and
symptoms generally remain constant across all age groups.
Using these results, it is now possible to estimate when
life-threatening symptoms will occur for certain age groups,
so that hospitals can determine the most efficient way to
treat patients. Additionally, knowing which age ranges are
affected the most will provide future researchers with more
data to work with when trying to develop a cure for the
disease. More research must be done to further prove these
causal relationships, but from preliminary analysis, a
relationship can be determined.
AUTHOR INFORMATION
Aditya Mittal contributed towards the use of k-means clustering in the symp-
tom progression period section and the use of data science libraries in the most
recurring symptoms section. Henry Zhao contributed towards the use of polyno-
mial regression in the percent infected by age range section. Aadhi Kumaraswamy
contributed data mining, extraction, and manipulation of the California and Kaggle
datasets. Megan Jacob contributed towards the finding of datasets and the usage
of datasets in machine learning algorithms. Rohan Ayyagari contributed towards
the finding of datasets and the creation of the structure of the research paper.
ACKNOWLEDGEMENTS
Thanks to Keshav Rao for guiding the team through the research paper
and teaching the various components of how to write it. Thanks to the Aspiring
Scholars Directed Research Program for providing the resources for the team to
complete the study.
Bibliography
1. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An
efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.
2. Purnima Bholowalia and Arvind Kumar. EBK-means: A clustering technique based on elbow
method and k-means in WSN. International Journal of Computer Applications, 105(9), 2014.
3. Marina Santini. Advantages & disadvantages of k-means and hierarchical clustering
(unsupervised learning). URL: http://santini.se/teaching/ml/2016/Lect_10/10c_UnsupervisedMethods.pdf
(Accessed 17.04.2019), 2016.
4. David J Hand and Niall M Adams. Data mining. Wiley StatsRef: Statistics Reference Online,
pages 1–7, 2014.
5. Wes McKinney et al. pandas: a foundational Python library for data analysis and statistics.
Python for High Performance and Scientific Computing, 14(9), 2011.
6. Min Cao, Dandan Zhang, Youhua Wang, Yunfei Lu, Xiangdong Zhu, Ying Li, Honghao Xue,
Yunxiao Lin, Min Zhang, Yiguo Sun, Zongguo Yang, Jia Shi, Yi Wang, Chang Zhou, Yidan
Dong, Ping Liu, Steven M Dudek, Zhen Xiao, Hongzhou Lu, and Longping Peng. Clinical
features of patients infected with the 2019 novel coronavirus (COVID-19) in Shanghai, China.
March 2020. doi: 10.1101/2020.03.04.20030395.
7. Suxin Wan, Yi Xiang, Wei Fang, Yu Zheng, Boqun Li, Yanjun Hu, Chunhui Lang, Daoqiu
Huang, Qiuyan Sun, Yan Xiong, Xia Huang, Jinglong Lv, Yaling Luo, Li Shen, Haoran Yang,
Gu Huang, and Ruishan Yang. Clinical features and treatment of COVID-19 patients in
Northeast Chongqing. Journal of Medical Virology, 92(7):797–806, April 2020.
doi: 10.1002/jmv.25783.
8. I.P. Donald. A longitudinal study of joint pain in older people. Rheumatology, 43(10):1256–
1260, July 2004. doi: 10.1093/rheumatology/keh298.
9. Yuan Tian, Long Rong, Weidong Nian, and Yan He. Review article: gastrointestinal fea-
tures in COVID-19 and the possibility of faecal transmission. Alimentary Pharmacology &
Therapeutics, 51(9):843–851, March 2020. doi: 10.1111/apt.15731.
10. Federico Licastro, Giuseppina Candore, Domenico Lio, Elisa Porcellini, Giuseppina
Colonna-Romano, Claudio Franceschi, and Calogero Caruso. Immunity & Ageing, 2(1):
8, 2005. doi: 10.1186/1742-4933-2-8.
11. Petter Brodin and Mark M. Davis. Human immune system variation. Nature Reviews Im-
munology, 17(1):21–29, December 2016. doi: 10.1038/nri.2016.125.