Myopia (near-sightedness or short-sightedness) is the most common eye problem and is estimated to affect 1.5 billion people (22% of the population). It is a condition of the eye where light focuses in front, instead of on the retina. This causes distant objects to be blurry while close objects appear normal. Other symptoms may include headaches and eye strain. Severe near-sightedness increases the risk of retinal detachment, cataracts, and glaucoma.
The underlying cause is believed to be a combination of genetic and environmental variables. A family history of the condition is likely to play a role. Environmental risk factors include doing work that involves focusing on close objects and greater time spent indoors. There is tentative evidence that near-sightedness can be prevented by having young children spend more time outside. This may be related to natural light exposure.
Regarding environmental variables, there are two hypotheses:
“Near work” hypothesis
Also referred to as the “use-abuse theory”, it states that spending time involved in near work strains the eyes and increases the risk of myopia. Some studies support the hypothesis while other studies do not.
“Visual stimuli” hypothesis
The visual stimuli hypothesis adds another layer of complexity. There is evidence that lack of normal visual stimuli causes improper development of the eyeball, namely, elongation. In this case, “normal” refers to the environmental stimuli that the eyeball evolved for over hundreds of millions of years, like oceans, jungles, forests. Modern humans who spend most of their time indoors, in dimly or fluorescently lit buildings are not giving their eyes the appropriate stimuli.
Experiments where animals such as kittens and monkeys had their eyes sewn shut for long periods of time also show eyeball elongation, demonstrating that complete lack of stimuli also causes improper growth trajectories of the eyeball. Further research shows that people, and children especially, who spend more time doing physical exercise and outdoor activities have lower rates of myopia. There is preliminary evidence that the protective effect of outdoor activities on the development of myopia is due, at least in part, to the effect of daylight on the production and the release of retinal dopamine, which deters eyeball elongation.
Here we study myopia’s various likely influencing variables using logistic regression. We examine physiological variables (age, gender, eyeball parameters), environmental variables (time spent on near-work and outdoor activities) as well as hereditary variables (myopic mother and father). By doing the analysis, we examine the validity of the aforementioned hypotheses.
The dataset is from Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression: Third Edition, and is copyrighted by John Wiley | Sons Inc.
myopia = read.csv('myopia.csv', row.names = 'ID')
myopia$MYOPIC = factor(myopia$MYOPIC)
myopia$GENDER = factor(myopia$GENDER)
myopia$MOMMY = factor(myopia$MOMMY)
myopia$DADMY = factor(myopia$DADMY)
head(myopia)
## STUDYYEAR MYOPIC AGE GENDER SPHEQ AL ACD LT VCD SPORTHR
## 1 1992 1 6 1 -0.052 21.89 3.690 3.498 14.70 45
## 2 1995 0 6 1 0.608 22.38 3.702 3.392 15.29 4
## 3 1991 0 6 1 1.179 22.49 3.462 3.514 15.52 14
## 4 1990 1 6 1 0.525 22.20 3.862 3.612 14.73 18
## 5 1995 0 5 0 0.697 23.29 3.676 3.454 16.16 14
## 6 1995 0 6 0 1.744 22.14 3.224 3.556 15.36 10
## READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
## 1 8 0 0 10 34 1 1
## 2 0 1 1 7 12 1 1
## 3 0 2 0 10 14 0 0
## 4 11 0 0 4 37 0 1
## 5 0 0 0 4 4 1 0
## 6 6 2 1 19 44 0 1
The columns are:
Column | Description | Value/Unit | Name |
---|---|---|---|
1 | Year subject entered the study | year | STUDYYEAR |
2 | Myopia within the first five years of follow up | 0 = No; 1 = Yes | MYOPIC |
3 | Age at first visit | years | AGE |
4 | Gender | 0 = Male; 1 = Female | GENDER |
5 | Spherical Equivalent Refraction | diopter | SPHEQ |
6 | Axial Length | mm | AL |
7 | Anterior Chamber Depth | mm | ACD |
8 | Lens Thickness | mm | LT |
9 | Vitreous Chamber Depth | mm | VCD |
10 | Time spent engaging in sports/outdoor activities | hours per week | SPORTHR |
11 | Time spent reading for pleasure | hours per week | READHR |
12 | Time spent playing video/computer games or working on the computer | hours per week | COMPHR |
13 | Time spent reading or studying for school assignments | hours per week | STUDYHR |
14 | Time spent watching television | hours per week | TVHR |
15 | Composite of near-work activities | hours per week | DIOPTERHR |
16 | Was the subject’s mother myopic? | 0 = No; 1 = Yes | MOMMY |
17 | Was the subject’s father myopic? | 0 = No; 1 = Yes | DADMY |
Column 2: MYOPIC
is defined as SPHEQ
<= -0.75 D.
Column 5: A measure of the eye’s effective focusing power. Eyes that are “normal” (don’t require glasses or contact lenses) have spherical equivalents between -0.25 diopters (D) and +1.00 D. The more negative the spherical equivalent, the more myopic the subject.
Column 6: The length of eye from front to back.
Column 7: The length from front to back of the aqueous-containing space of the eye between the cornea and the iris.
Column 8: The length from front to back of the crystalline lens.
Column 9: The length from front to back of the aqueous-containing space of the eye in front of the retina.
Column 15: the composite is defined as DIOPTERHR
= 3 × (READHR
+ STUDYHR
) + 2 × COMPHR
+ TVHR
.
Besides the response MYOPIC
, there are three categorical variables GENDER
, MOMMY
and DADMY
. Let’s visualize them:
par(mfrow=c(1,3))
plot(myopia$MYOPIC, myopia$GENDER, xlab = "MYOPIC", ylab = "GENDER")
plot(myopia$MYOPIC, myopia$MOMMY, xlab = "MYOPIC", ylab = "MOMMY")
plot(myopia$MYOPIC, myopia$DADMY, xlab = "MYOPIC", ylab = "DADMY")
Continuing last section, we look at a fit with all categorical variables GENDER
, MOMMY
and DADMY
.
summary(glm(MYOPIC ~ GENDER + MOMMY + DADMY, data = myopia,
family = binomial(logit)))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1509152 0.3065982 -10.277019 8.945123e-25
## GENDER1 0.3949800 0.2457213 1.607431 1.079599e-01
## MOMMY1 0.8575581 0.2565950 3.342069 8.315652e-04
## DADMY1 0.9520906 0.2584177 3.684309 2.293242e-04
It appears that GENDER
is not statistically prominent. Look at the contingency table with MOMMY
and DADMY
:
myopia.table <- xtabs(~ MOMMY + DADMY + MYOPIC, data = myopia)
ftable(myopia.table)
## MYOPIC 0 1
## MOMMY DADMY
## 0 0 147 5
## 1 132 21
## 1 0 138 20
## 1 120 35
summary(myopia.table)
## Call: xtabs(formula = ~MOMMY + DADMY + MYOPIC, data = myopia)
## Number of cases in table: 618
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 25.022, df = 4, p-value = 4.979e-05
Here we do a fit that includes every column other than MYOPIC
.
summary(glm(MYOPIC ~ ., data = myopia, family = binomial(logit)))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.509693e+02 219.95543974 -1.1410004 2.538697e-01
## STUDYYEAR 1.279753e-01 0.11011207 1.1622275 2.451431e-01
## AGE 8.781767e-02 0.25295253 0.3471706 7.284632e-01
## GENDER1 6.250700e-01 0.34451742 1.8143351 6.962615e-02
## SPHEQ -4.145231e+00 0.47041751 -8.8118133 1.231375e-18
## AL -3.009826e+01 39.02156386 -0.7713238 4.405150e-01
## ACD 3.124416e+01 39.08467746 0.7993966 4.240605e-01
## LT 2.915362e+01 39.11136522 0.7454003 4.560297e-01
## VCD 2.971146e+01 39.04372518 0.7609793 4.466695e-01
## SPORTHR -4.641564e-02 0.02154816 -2.1540418 3.123690e-02
## READHR 8.966185e-02 0.05029143 1.7828457 7.461142e-02
## COMPHR 4.649699e-02 0.04620587 1.0063006 3.142710e-01
## STUDYHR -1.737402e-01 0.09849706 -1.7639123 7.774675e-02
## TVHR -9.562692e-03 0.02928495 -0.3265394 7.440163e-01
## MOMMY1 7.239397e-01 0.32175693 2.2499585 2.445158e-02
## DADMY1 7.786300e-01 0.32212940 2.4171344 1.564324e-02
There are a few observations:
We immediately find out that the p-value of SPHEQ
is very small. Looking at the column notes above, MYOPIC
is defined according to SPHEQ
. Therefore, obviously, the correlation is prominent.
DIOPTERHR
’s row disappears, because it is a linearly combination of READHR
, STUDYHR
, COMPHR
and TVHR
.
STUDYYEAR
is related to the chronological trend and sampling process, and should not be a causal variable. The p-value of STUDYYEAR
is indeed large, confirming our belief.
AL
, ACD
, LT
and VCD
have similar p-values. A closer examination shows that similar to DIOPTERHR
, there is a linear functional relationship. AL
= ACD
+ LT
+ VCD
. The maximum absolute difference between two sides of the equation is 0.008.
MYOPIC
~ (AL
, ACD
, LT
, VCD
) model selectionOne term of the four must be gone. We use step
to evaluate the 2-term interaction models with one variable gone.
m.eyeball.no.AL <- step(glm(MYOPIC ~ .^2 , data =
myopia[, c('MYOPIC', 'ACD', 'LT', 'VCD')], family = binomial(logit)))
m.eyeball.no.ACD <- step(glm(MYOPIC ~ .^2 , data =
myopia[, c('MYOPIC', 'AL', 'LT', 'VCD')], family = binomial(logit)))
m.eyeball.no.LT <- step(glm(MYOPIC ~ .^2 , data =
myopia[, c('MYOPIC', 'AL', 'ACD', 'VCD')], family = binomial(logit)))
m.eyeball.no.VCD <- step(glm(MYOPIC ~ .^2 , data =
myopia[, c('MYOPIC', 'AL', 'ACD', 'LT')], family = binomial(logit)))
summary(m.eyeball.no.AL)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.008358 1.9355795 -3.620806 0.0002936864
## ACD 1.418993 0.5318269 2.668148 0.0076270647
summary(m.eyeball.no.ACD)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.696102 5.9892258 -0.7840917 0.43298632
## AL 1.383147 0.5663913 2.4420341 0.01460477
## LT -1.692605 0.9125328 -1.8548430 0.06361865
## VCD -1.453938 0.6078946 -2.3917599 0.01676781
summary(m.eyeball.no.LT)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.008358 1.9355795 -3.620806 0.0002936864
## ACD 1.418993 0.5318269 2.668148 0.0076270647
summary(m.eyeball.no.VCD)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.008358 1.9355795 -3.620806 0.0002936864
## ACD 1.418993 0.5318269 2.668148 0.0076270647
Interestingly, removing ACD
results in all other three variables statistically prominent (m.eyeball.no.ACD
), but keeping ACD
, eventually only ACD
alone is prominent. So instead of removing just one term, we remove AL
, LT
and VCD
, while keeping ACD
.
In the next iteration, we rule out these terms.
summary(m <- glm(MYOPIC ~ . - STUDYYEAR - SPHEQ - AL - LT - VCD - DIOPTERHR,
data = myopia, family = binomial(logit)))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.83456507 2.38380127 -4.1255809 3.698001e-05
## AGE 0.21000681 0.19649483 1.0687651 2.851755e-01
## GENDER1 0.57936005 0.27389999 2.1152248 3.441080e-02
## ACD 1.51780587 0.58595138 2.5903273 9.588472e-03
## SPORTHR -0.04562130 0.01853642 -2.4611719 1.384840e-02
## READHR 0.09406765 0.03867674 2.4321507 1.500946e-02
## COMPHR 0.03197313 0.03852656 0.8298984 4.065962e-01
## STUDYHR -0.10743429 0.07524283 -1.4278343 1.533395e-01
## TVHR 0.01150788 0.02341332 0.4915098 6.230659e-01
## MOMMY1 0.77858551 0.26536308 2.9340385 3.345827e-03
## DADMY1 0.99341915 0.26803610 3.7062886 2.103186e-04
It’s curious to see that STUDYHR
has an opposite effect as READHR
.
Now we need to categorize the column variables. Our goal is to find out the causal variables for myopia.
AGE
and GENDER
are physiological variables. Our life experience suggests they should not directly play a role in causing myopia, but may compound with other variables. AL
, ACD
, LT
and VCD
are physiological variables as well. They represent the state of the eyeballs.
SPORTHR
, READHR
, COMPHR
, STUDYHR
and TVHR
are environmental variables, documenting the time spent on near-work and outdoor activities. They depend on the life style of the subject. We should give more attention to these.
MOMMY
and DADMY
are hereditary variables. They should play a heavy role.
Then we use step
to automatically select the best fitted model:
m.s <- step(m)
summary(m.s)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.40390128 2.11593176 -3.971726 7.135375e-05
## GENDER1 0.48110897 0.25986819 1.851358 6.411811e-02
## ACD 1.52298094 0.57271705 2.659221 7.832168e-03
## SPORTHR -0.04589204 0.01817867 -2.524499 1.158633e-02
## READHR 0.08480289 0.03749779 2.261544 2.372562e-02
## MOMMY1 0.79097301 0.26396109 2.996552 2.730520e-03
## DADMY1 0.95349257 0.26338355 3.620168 2.944123e-04
Looks like some automatic magic has already removed some of the less prominent variables. AGE
is not playing a prominent role, while physiological variables GENDER
and ACD
remain. Among the environmental variables, only SPORTHR
and READHR
remain. Hereditary variables MOMMY
and DADMY
remain as well.
It is also beneficial to test the 2-term interaction models with step
using the not-yet-automatically-reduced model m
.
m.2.s <- step(glm(MYOPIC ~ .^2, data = myopia[, c('MYOPIC', 'AGE', 'GENDER', 'ACD',
'SPORTHR', 'READHR', 'COMPHR', 'STUDYHR', 'TVHR', 'MOMMY', 'DADMY')],
family = binomial(logit)))
summary(m.2.s)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.94113853 3.570229198 -3.3446420 0.0008238884
## GENDER1 -0.12705162 0.459476771 -0.2765137 0.7821535499
## ACD 2.66480887 0.966186231 2.7580696 0.0058143817
## SPORTHR -0.09972152 0.033018635 -3.0201587 0.0025264230
## READHR 0.09859844 0.038897135 2.5348511 0.0112495130
## STUDYHR -0.25443169 0.133301082 -1.9086994 0.0563008856
## MOMMY1 0.79888783 0.268852651 2.9714709 0.0029637695
## DADMY1 7.48000321 4.277633351 1.7486312 0.0803547867
## GENDER1:SPORTHR 0.05919397 0.037492925 1.5788038 0.1143810609
## ACD:DADMY1 -1.79355896 1.170278646 -1.5325914 0.1253765675
## SPORTHR:STUDYHR 0.01376141 0.008046762 1.7101799 0.0872326077
Using the automatically-reduced model m.s
, we have:
m.s.2.s <- step(glm(MYOPIC ~ .^2, data = myopia[, c('MYOPIC', 'GENDER', 'ACD',
'SPORTHR', 'READHR', 'MOMMY', 'DADMY')], family = binomial(logit)))
summary(m.s.2.s)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.05623675 3.52238211 -3.4227510 0.0006199084
## GENDER1 -0.12254835 0.45992349 -0.2664538 0.7898897627
## ACD 2.62554816 0.95311095 2.7547141 0.0058743445
## SPORTHR -0.08092241 0.03057828 -2.6464013 0.0081353244
## READHR 0.08678812 0.03785472 2.2926628 0.0218674267
## MOMMY1 0.78878105 0.26529472 2.9732255 0.0029468783
## DADMY1 7.31529476 4.24697814 1.7224705 0.0849843137
## GENDER1:SPORTHR 0.05745745 0.03783491 1.5186360 0.1288541322
## ACD:DADMY1 -1.75149758 1.16160592 -1.5078243 0.1315995117
There isn’t much of a difference except by including STUDYHR
, we also introduce 2-term interaction SPORTHR:STUDYHR
. The observation about the coefficient of STUDYHR
in Section 2 is also seen here, only more negative. Considering the fact that the subjects are between 5 to 9 years old, their homework assignment that counts as STUDYHR
could very much be non-near-work activities. They do not necessarily equate activities counting as READHR
. For this reason, we remove the confounding variable STUDYHR
and stick with the simpler model.
In addition, the p-value for GENDER
is very large, we remove it while keeping the 2-term interactions GENDER
:SPORTHR
. The p-value for ACD
:DADMY
is large too so we remove it as well. Refit the data:
m.s.2.s.no.gender <- glm(formula = MYOPIC ~ ACD + SPORTHR + READHR + MOMMY + DADMY +
GENDER:SPORTHR, data = myopia[, c("MYOPIC", "GENDER", "ACD", "SPORTHR", "READHR",
"MOMMY", "DADMY")], family = binomial(logit))
summary(m.s.2.s.no.gender)
##
## Call:
## glm(formula = MYOPIC ~ ACD + SPORTHR + READHR + MOMMY + DADMY +
## GENDER:SPORTHR, family = binomial(logit), data = myopia[,
## c("MYOPIC", "GENDER", "ACD", "SPORTHR", "READHR", "MOMMY",
## "DADMY")])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1793 -0.5794 -0.4115 -0.2690 2.6477
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.06676 2.03873 -3.957 7.6e-05 ***
## ACD 1.51107 0.56344 2.682 0.007321 **
## SPORTHR -0.07525 0.02325 -3.237 0.001209 **
## READHR 0.08596 0.03761 2.285 0.022296 *
## MOMMY1 0.77605 0.26445 2.935 0.003340 **
## DADMY1 0.94621 0.26359 3.590 0.000331 ***
## SPORTHR:GENDER1 0.04993 0.02122 2.352 0.018648 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 480.08 on 617 degrees of freedom
## Residual deviance: 431.84 on 611 degrees of freedom
## AIC: 445.84
##
## Number of Fisher Scoring iterations: 5
Some goodness-of-fit tests:
# Model validity
with(pchisq(deviance, df.residual), data = m.s.2.s.no.gender)
## [1] 5.487704e-09
# H_0 independent hypothesis
with(pchisq(null.deviance - deviance, df.null - df.residual,
lower.tail = F), data = m.s.2.s.no.gender)
## [1] 1.057106e-08
par(mfrow=c(1,2))
plot(predict(m.s.2.s.no.gender), resid(m.s.2.s.no.gender), ylab = 'Deviance residual')
plot(predict(m.s.2.s.no.gender, type = 'response'), resid(m.s.2.s.no.gender), ylab = 'Deviance residual')
The gap in the residual plot look interesting. We tried to use different links, square and square root transformations but the situation did not improve.
ROC curve:
result <- roc(myopia$MYOPIC,predict(m.s.2.s.no.gender,type = "response"))
plot(result, print.auc = TRUE)
With the current model, the influencing variables are ACD
, SPORTHR
, READHR
, MOMMY
, DADMY
and SPORTHR
:GENDER
. All the p-values are \(<\) 0.05. Here we examine every variable.
ACD
contributes to myopia. From the introduction we know that the elongation of the eyeball (total length AL) is the physiological cause of myopia, but more fitting suggets ACD
is the most prominent variable. One reason could be that ACD
measures the first layer of the front of the eye, whose shape is critical in refracting light. This is also why LASIK surgury, done on this layer, is effective.
SPORTHR
lowers the chance while READHR
increases it. This proves what we already know, both by the “near work” hypothesis (READHR
), and the “visual stimuli” hypothesis (SPORTHR
). The 2-term correlation variabel SPORTHR
:GENDER
suggests for females (GENDER
= 1), the coefficient for SPROTHR
is -0.08 + 0.05 = -0.03, less negative than for males (GENDER
= 0), -0.08. In other words, it benefits men/boys more to have more outdoors time in term of not developing myopia. This appears to be consistent with evolution, because for 200 thousand years men have been hunters, and have been under a stronger selection with outdoors activities. If this is true, the “visual stimuli” hypothesis must have more descriptive/predictive power for males than the “near work” hypothesis, while less for females.
MOMMY
and DADMY
are also strong positive influencing variables. Intuitively, with myopic parents, children are more likely to get myopia. They differ by a small value but the difference is not statistically prominent.
se = sqrt(vcov(m.s.2.s.no.gender)['MOMMY1', 'MOMMY1'] +
vcov(m.s.2.s.no.gender)['DADMY1', 'DADMY1'] -
2 * vcov(m.s.2.s.no.gender)['DADMY1', 'MOMMY1'])
pnorm(m.s.2.s.no.gender$coefficients[['MOMMY1']] -
m.s.2.s.no.gender$coefficients[['DADMY1']], se)
## [1] 0.2968098
Assuming ACD
and READHR
taking their average values, we plot a few scenarios.
SPORTHR.seq = seq(min(myopia$SPORTHR), max(myopia$SPORTHR))
newdata = data.frame(SPORTHR = SPORTHR.seq,
male.parents_good = predict(m.s.2.s.no.gender, newdata = with(myopia,
data.frame(GENDER = factor(0), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
READHR = mean(READHR), MOMMY = factor(0), DADMY = factor(0))), type = 'response'),
female.parents_good = predict(m.s.2.s.no.gender, newdata = with(myopia,
data.frame(GENDER = factor(1), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
READHR = mean(READHR), MOMMY = factor(0), DADMY = factor(0))), type = 'response'),
male.parents_myopic = predict(m.s.2.s.no.gender, newdata = with(myopia,
data.frame(GENDER = factor(0), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
READHR = mean(READHR), MOMMY = factor(1), DADMY = factor(1))), type = 'response'),
female.parents_myopic = predict(m.s.2.s.no.gender, newdata = with(myopia,
data.frame(GENDER = factor(1), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
READHR = mean(READHR), MOMMY = factor(1), DADMY = factor(1))), type = 'response'))
ggplot(data = newdata, aes(x = SPORTHR)) +
geom_line(aes(y = male.parents_good,col = 'male, parents good')) +
geom_line(aes(y = female.parents_good, col = 'female, parents good')) +
geom_line(aes(y = male.parents_myopic, col = 'male, parents myopic')) +
geom_line(aes(y = female.parents_myopic, col = 'female, parents myopic')) +
ylab("Probability")
This suggests an effective way to raise children, in order to prevent them from developing myopia.
We studied various likely influencing variables of myopia using logistic regression. Physiological variables, environmental variables and hereditary variables were examined. We examined the validity of the aforementioned hypotheses and concluded that the “visual stimuli” hypothesis have more descriptive/predictive power for males than the “near work” hypothesis, while less for females. This suggests an effective way to raise children, in order to prevent them from developing myopia.