1. Introduction 1

Myopia (near-sightedness or short-sightedness) is the most common eye problem and is estimated to affect 1.5 billion people (22% of the population). It is a condition of the eye where light focuses in front, instead of on the retina. This causes distant objects to be blurry while close objects appear normal. Other symptoms may include headaches and eye strain. Severe near-sightedness increases the risk of retinal detachment, cataracts, and glaucoma.

Eyeball anatomy

The underlying cause is believed to be a combination of genetic and environmental variables. A family history of the condition is likely to play a role. Environmental risk factors include doing work that involves focusing on close objects and greater time spent indoors. There is tentative evidence that near-sightedness can be prevented by having young children spend more time outside. This may be related to natural light exposure.

Regarding environmental variables, there are two hypotheses:

Here we study myopia’s various likely influencing variables using logistic regression. We examine physiological variables (age, gender, eyeball parameters), environmental variables (time spent on near-work and outdoor activities) as well as hereditary variables (myopic mother and father). By doing the analysis, we examine the validity of the aforementioned hypotheses.

2. Dataset

The dataset is from Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression: Third Edition, and is copyrighted by John Wiley | Sons Inc.

myopia = read.csv('myopia.csv', row.names = 'ID')
myopia$MYOPIC = factor(myopia$MYOPIC)
myopia$GENDER = factor(myopia$GENDER)
myopia$MOMMY = factor(myopia$MOMMY)
myopia$DADMY = factor(myopia$DADMY)
head(myopia)
##   STUDYYEAR MYOPIC AGE GENDER  SPHEQ    AL   ACD    LT   VCD SPORTHR
## 1      1992      1   6      1 -0.052 21.89 3.690 3.498 14.70      45
## 2      1995      0   6      1  0.608 22.38 3.702 3.392 15.29       4
## 3      1991      0   6      1  1.179 22.49 3.462 3.514 15.52      14
## 4      1990      1   6      1  0.525 22.20 3.862 3.612 14.73      18
## 5      1995      0   5      0  0.697 23.29 3.676 3.454 16.16      14
## 6      1995      0   6      0  1.744 22.14 3.224 3.556 15.36      10
##   READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
## 1      8      0       0   10        34     1     1
## 2      0      1       1    7        12     1     1
## 3      0      2       0   10        14     0     0
## 4     11      0       0    4        37     0     1
## 5      0      0       0    4         4     1     0
## 6      6      2       1   19        44     0     1

The columns are:

Column Description Value/Unit Name
1 Year subject entered the study year STUDYYEAR
2 Myopia within the first five years of follow up 0 = No; 1 = Yes MYOPIC
3 Age at first visit years AGE
4 Gender 0 = Male; 1 = Female GENDER
5 Spherical Equivalent Refraction diopter SPHEQ
6 Axial Length mm AL
7 Anterior Chamber Depth mm ACD
8 Lens Thickness mm LT
9 Vitreous Chamber Depth mm VCD
10 Time spent engaging in sports/outdoor activities hours per week SPORTHR
11 Time spent reading for pleasure hours per week READHR
12 Time spent playing video/computer games or working on the computer hours per week COMPHR
13 Time spent reading or studying for school assignments hours per week STUDYHR
14 Time spent watching television hours per week TVHR
15 Composite of near-work activities hours per week DIOPTERHR
16 Was the subject’s mother myopic? 0 = No; 1 = Yes MOMMY
17 Was the subject’s father myopic? 0 = No; 1 = Yes DADMY

Column 2: MYOPIC is defined as SPHEQ <= -0.75 D.
Column 5: A measure of the eye’s effective focusing power. Eyes that are “normal” (don’t require glasses or contact lenses) have spherical equivalents between -0.25 diopters (D) and +1.00 D. The more negative the spherical equivalent, the more myopic the subject.
Column 6: The length of eye from front to back.
Column 7: The length from front to back of the aqueous-containing space of the eye between the cornea and the iris.
Column 8: The length from front to back of the crystalline lens.
Column 9: The length from front to back of the aqueous-containing space of the eye in front of the retina.
Column 15: the composite is defined as DIOPTERHR = 3 × (READHR + STUDYHR) + 2 × COMPHR + TVHR.

Besides the response MYOPIC, there are three categorical variables GENDER, MOMMY and DADMY. Let’s visualize them:

par(mfrow=c(1,3))
plot(myopia$MYOPIC, myopia$GENDER, xlab = "MYOPIC", ylab = "GENDER")
plot(myopia$MYOPIC, myopia$MOMMY, xlab = "MYOPIC", ylab = "MOMMY")
plot(myopia$MYOPIC, myopia$DADMY, xlab = "MYOPIC", ylab = "DADMY")

3. Analysis

1. Initial exploratory fit

Continuing last section, we look at a fit with all categorical variables GENDER, MOMMY and DADMY.

summary(glm(MYOPIC ~ GENDER + MOMMY + DADMY, data = myopia, 
            family = binomial(logit)))$coefficients
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -3.1509152  0.3065982 -10.277019 8.945123e-25
## GENDER1      0.3949800  0.2457213   1.607431 1.079599e-01
## MOMMY1       0.8575581  0.2565950   3.342069 8.315652e-04
## DADMY1       0.9520906  0.2584177   3.684309 2.293242e-04

It appears that GENDER is not statistically prominent. Look at the contingency table with MOMMY and DADMY:

myopia.table <- xtabs(~ MOMMY + DADMY + MYOPIC, data = myopia)
ftable(myopia.table)
##             MYOPIC   0   1
## MOMMY DADMY               
## 0     0            147   5
##       1            132  21
## 1     0            138  20
##       1            120  35
summary(myopia.table)
## Call: xtabs(formula = ~MOMMY + DADMY + MYOPIC, data = myopia)
## Number of cases in table: 618 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 25.022, df = 4, p-value = 4.979e-05

Here we do a fit that includes every column other than MYOPIC.

summary(glm(MYOPIC ~ ., data = myopia, family = binomial(logit)))$coefficients
##                  Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept) -2.509693e+02 219.95543974 -1.1410004 2.538697e-01
## STUDYYEAR    1.279753e-01   0.11011207  1.1622275 2.451431e-01
## AGE          8.781767e-02   0.25295253  0.3471706 7.284632e-01
## GENDER1      6.250700e-01   0.34451742  1.8143351 6.962615e-02
## SPHEQ       -4.145231e+00   0.47041751 -8.8118133 1.231375e-18
## AL          -3.009826e+01  39.02156386 -0.7713238 4.405150e-01
## ACD          3.124416e+01  39.08467746  0.7993966 4.240605e-01
## LT           2.915362e+01  39.11136522  0.7454003 4.560297e-01
## VCD          2.971146e+01  39.04372518  0.7609793 4.466695e-01
## SPORTHR     -4.641564e-02   0.02154816 -2.1540418 3.123690e-02
## READHR       8.966185e-02   0.05029143  1.7828457 7.461142e-02
## COMPHR       4.649699e-02   0.04620587  1.0063006 3.142710e-01
## STUDYHR     -1.737402e-01   0.09849706 -1.7639123 7.774675e-02
## TVHR        -9.562692e-03   0.02928495 -0.3265394 7.440163e-01
## MOMMY1       7.239397e-01   0.32175693  2.2499585 2.445158e-02
## DADMY1       7.786300e-01   0.32212940  2.4171344 1.564324e-02

There are a few observations:

  • We immediately find out that the p-value of SPHEQ is very small. Looking at the column notes above, MYOPIC is defined according to SPHEQ. Therefore, obviously, the correlation is prominent.

  • DIOPTERHR’s row disappears, because it is a linearly combination of READHR, STUDYHR, COMPHR and TVHR.

  • STUDYYEAR is related to the chronological trend and sampling process, and should not be a causal variable. The p-value of STUDYYEAR is indeed large, confirming our belief.

  • AL, ACD, LT and VCD have similar p-values. A closer examination shows that similar to DIOPTERHR, there is a linear functional relationship. AL = ACD + LT + VCD. The maximum absolute difference between two sides of the equation is 0.008.

2. MYOPIC ~ (AL, ACD, LT, VCD) model selection

One term of the four must be gone. We use step to evaluate the 2-term interaction models with one variable gone.

m.eyeball.no.AL <- step(glm(MYOPIC ~ .^2 , data = 
            myopia[, c('MYOPIC', 'ACD', 'LT', 'VCD')], family = binomial(logit)))
m.eyeball.no.ACD <- step(glm(MYOPIC ~ .^2 , data = 
            myopia[, c('MYOPIC', 'AL', 'LT', 'VCD')], family = binomial(logit)))
m.eyeball.no.LT <- step(glm(MYOPIC ~ .^2 , data = 
            myopia[, c('MYOPIC', 'AL', 'ACD', 'VCD')], family = binomial(logit)))
m.eyeball.no.VCD <- step(glm(MYOPIC ~ .^2 , data = 
            myopia[, c('MYOPIC', 'AL', 'ACD', 'LT')], family = binomial(logit)))
summary(m.eyeball.no.AL)$coefficients
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -7.008358  1.9355795 -3.620806 0.0002936864
## ACD          1.418993  0.5318269  2.668148 0.0076270647
summary(m.eyeball.no.ACD)$coefficients
##              Estimate Std. Error    z value   Pr(>|z|)
## (Intercept) -4.696102  5.9892258 -0.7840917 0.43298632
## AL           1.383147  0.5663913  2.4420341 0.01460477
## LT          -1.692605  0.9125328 -1.8548430 0.06361865
## VCD         -1.453938  0.6078946 -2.3917599 0.01676781
summary(m.eyeball.no.LT)$coefficients
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -7.008358  1.9355795 -3.620806 0.0002936864
## ACD          1.418993  0.5318269  2.668148 0.0076270647
summary(m.eyeball.no.VCD)$coefficients
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -7.008358  1.9355795 -3.620806 0.0002936864
## ACD          1.418993  0.5318269  2.668148 0.0076270647

Interestingly, removing ACD results in all other three variables statistically prominent (m.eyeball.no.ACD), but keeping ACD, eventually only ACD alone is prominent. So instead of removing just one term, we remove AL, LT and VCD, while keeping ACD.

In the next iteration, we rule out these terms.

summary(m <- glm(MYOPIC ~ . - STUDYYEAR - SPHEQ - AL - LT - VCD - DIOPTERHR, 
                  data = myopia, family = binomial(logit)))$coefficients
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -9.83456507 2.38380127 -4.1255809 3.698001e-05
## AGE          0.21000681 0.19649483  1.0687651 2.851755e-01
## GENDER1      0.57936005 0.27389999  2.1152248 3.441080e-02
## ACD          1.51780587 0.58595138  2.5903273 9.588472e-03
## SPORTHR     -0.04562130 0.01853642 -2.4611719 1.384840e-02
## READHR       0.09406765 0.03867674  2.4321507 1.500946e-02
## COMPHR       0.03197313 0.03852656  0.8298984 4.065962e-01
## STUDYHR     -0.10743429 0.07524283 -1.4278343 1.533395e-01
## TVHR         0.01150788 0.02341332  0.4915098 6.230659e-01
## MOMMY1       0.77858551 0.26536308  2.9340385 3.345827e-03
## DADMY1       0.99341915 0.26803610  3.7062886 2.103186e-04

It’s curious to see that STUDYHR has an opposite effect as READHR.

3. Categorization & automated fit

Now we need to categorize the column variables. Our goal is to find out the causal variables for myopia.

  • AGE and GENDER are physiological variables. Our life experience suggests they should not directly play a role in causing myopia, but may compound with other variables. AL, ACD, LT and VCD are physiological variables as well. They represent the state of the eyeballs.

  • SPORTHR, READHR, COMPHR, STUDYHR and TVHR are environmental variables, documenting the time spent on near-work and outdoor activities. They depend on the life style of the subject. We should give more attention to these.

  • MOMMY and DADMY are hereditary variables. They should play a heavy role.

Then we use step to automatically select the best fitted model:

m.s <- step(m)
summary(m.s)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -8.40390128 2.11593176 -3.971726 7.135375e-05
## GENDER1      0.48110897 0.25986819  1.851358 6.411811e-02
## ACD          1.52298094 0.57271705  2.659221 7.832168e-03
## SPORTHR     -0.04589204 0.01817867 -2.524499 1.158633e-02
## READHR       0.08480289 0.03749779  2.261544 2.372562e-02
## MOMMY1       0.79097301 0.26396109  2.996552 2.730520e-03
## DADMY1       0.95349257 0.26338355  3.620168 2.944123e-04

Looks like some automatic magic has already removed some of the less prominent variables. AGE is not playing a prominent role, while physiological variables GENDER and ACD remain. Among the environmental variables, only SPORTHR and READHR remain. Hereditary variables MOMMY and DADMY remain as well.

It is also beneficial to test the 2-term interaction models with step using the not-yet-automatically-reduced model m.

m.2.s <- step(glm(MYOPIC ~ .^2, data = myopia[, c('MYOPIC', 'AGE', 'GENDER', 'ACD', 
              'SPORTHR', 'READHR', 'COMPHR', 'STUDYHR', 'TVHR', 'MOMMY', 'DADMY')], 
              family = binomial(logit)))
summary(m.2.s)$coefficients
##                     Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)     -11.94113853 3.570229198 -3.3446420 0.0008238884
## GENDER1          -0.12705162 0.459476771 -0.2765137 0.7821535499
## ACD               2.66480887 0.966186231  2.7580696 0.0058143817
## SPORTHR          -0.09972152 0.033018635 -3.0201587 0.0025264230
## READHR            0.09859844 0.038897135  2.5348511 0.0112495130
## STUDYHR          -0.25443169 0.133301082 -1.9086994 0.0563008856
## MOMMY1            0.79888783 0.268852651  2.9714709 0.0029637695
## DADMY1            7.48000321 4.277633351  1.7486312 0.0803547867
## GENDER1:SPORTHR   0.05919397 0.037492925  1.5788038 0.1143810609
## ACD:DADMY1       -1.79355896 1.170278646 -1.5325914 0.1253765675
## SPORTHR:STUDYHR   0.01376141 0.008046762  1.7101799 0.0872326077

Using the automatically-reduced model m.s, we have:

m.s.2.s <- step(glm(MYOPIC ~ .^2, data = myopia[, c('MYOPIC', 'GENDER', 'ACD', 
    'SPORTHR', 'READHR', 'MOMMY', 'DADMY')], family = binomial(logit)))
summary(m.s.2.s)$coefficients
##                     Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)     -12.05623675 3.52238211 -3.4227510 0.0006199084
## GENDER1          -0.12254835 0.45992349 -0.2664538 0.7898897627
## ACD               2.62554816 0.95311095  2.7547141 0.0058743445
## SPORTHR          -0.08092241 0.03057828 -2.6464013 0.0081353244
## READHR            0.08678812 0.03785472  2.2926628 0.0218674267
## MOMMY1            0.78878105 0.26529472  2.9732255 0.0029468783
## DADMY1            7.31529476 4.24697814  1.7224705 0.0849843137
## GENDER1:SPORTHR   0.05745745 0.03783491  1.5186360 0.1288541322
## ACD:DADMY1       -1.75149758 1.16160592 -1.5078243 0.1315995117

There isn’t much of a difference except by including STUDYHR, we also introduce 2-term interaction SPORTHR:STUDYHR. The observation about the coefficient of STUDYHR in Section 2 is also seen here, only more negative. Considering the fact that the subjects are between 5 to 9 years old, their homework assignment that counts as STUDYHR could very much be non-near-work activities. They do not necessarily equate activities counting as READHR. For this reason, we remove the confounding variable STUDYHR and stick with the simpler model.

In addition, the p-value for GENDER is very large, we remove it while keeping the 2-term interactions GENDER:SPORTHR. The p-value for ACD:DADMY is large too so we remove it as well. Refit the data:

m.s.2.s.no.gender <- glm(formula = MYOPIC ~ ACD + SPORTHR + READHR + MOMMY + DADMY + 
     GENDER:SPORTHR, data = myopia[, c("MYOPIC", "GENDER", "ACD", "SPORTHR", "READHR", 
                                       "MOMMY", "DADMY")], family = binomial(logit))
summary(m.s.2.s.no.gender)
## 
## Call:
## glm(formula = MYOPIC ~ ACD + SPORTHR + READHR + MOMMY + DADMY + 
##     GENDER:SPORTHR, family = binomial(logit), data = myopia[, 
##     c("MYOPIC", "GENDER", "ACD", "SPORTHR", "READHR", "MOMMY", 
##         "DADMY")])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1793  -0.5794  -0.4115  -0.2690   2.6477  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -8.06676    2.03873  -3.957  7.6e-05 ***
## ACD              1.51107    0.56344   2.682 0.007321 ** 
## SPORTHR         -0.07525    0.02325  -3.237 0.001209 ** 
## READHR           0.08596    0.03761   2.285 0.022296 *  
## MOMMY1           0.77605    0.26445   2.935 0.003340 ** 
## DADMY1           0.94621    0.26359   3.590 0.000331 ***
## SPORTHR:GENDER1  0.04993    0.02122   2.352 0.018648 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 480.08  on 617  degrees of freedom
## Residual deviance: 431.84  on 611  degrees of freedom
## AIC: 445.84
## 
## Number of Fisher Scoring iterations: 5

Some goodness-of-fit tests:

# Model validity
with(pchisq(deviance, df.residual), data = m.s.2.s.no.gender)
## [1] 5.487704e-09
# H_0 independent hypothesis
with(pchisq(null.deviance - deviance, df.null - df.residual, 
            lower.tail = F), data = m.s.2.s.no.gender)
## [1] 1.057106e-08
par(mfrow=c(1,2))
plot(predict(m.s.2.s.no.gender), resid(m.s.2.s.no.gender), ylab = 'Deviance residual')
plot(predict(m.s.2.s.no.gender, type = 'response'), resid(m.s.2.s.no.gender), ylab = 'Deviance residual')

The gap in the residual plot look interesting. We tried to use different links, square and square root transformations but the situation did not improve.

ROC curve:

result <- roc(myopia$MYOPIC,predict(m.s.2.s.no.gender,type = "response"))
plot(result, print.auc = TRUE)

4. Model interpretation

With the current model, the influencing variables are ACD, SPORTHR, READHR, MOMMY, DADMY and SPORTHR:GENDER. All the p-values are \(<\) 0.05. Here we examine every variable.

  • ACD contributes to myopia. From the introduction we know that the elongation of the eyeball (total length AL) is the physiological cause of myopia, but more fitting suggets ACD is the most prominent variable. One reason could be that ACD measures the first layer of the front of the eye, whose shape is critical in refracting light. This is also why LASIK surgury, done on this layer, is effective.

  • SPORTHR lowers the chance while READHR increases it. This proves what we already know, both by the “near work” hypothesis (READHR), and the “visual stimuli” hypothesis (SPORTHR). The 2-term correlation variabel SPORTHR:GENDER suggests for females (GENDER = 1), the coefficient for SPROTHR is -0.08 + 0.05 = -0.03, less negative than for males (GENDER = 0), -0.08. In other words, it benefits men/boys more to have more outdoors time in term of not developing myopia. This appears to be consistent with evolution, because for 200 thousand years men have been hunters, and have been under a stronger selection with outdoors activities. If this is true, the “visual stimuli” hypothesis must have more descriptive/predictive power for males than the “near work” hypothesis, while less for females.

  • MOMMY and DADMY are also strong positive influencing variables. Intuitively, with myopic parents, children are more likely to get myopia. They differ by a small value but the difference is not statistically prominent.

se = sqrt(vcov(m.s.2.s.no.gender)['MOMMY1', 'MOMMY1'] + 
          vcov(m.s.2.s.no.gender)['DADMY1', 'DADMY1'] - 
      2 * vcov(m.s.2.s.no.gender)['DADMY1', 'MOMMY1'])
pnorm(m.s.2.s.no.gender$coefficients[['MOMMY1']] - 
      m.s.2.s.no.gender$coefficients[['DADMY1']], se)
## [1] 0.2968098

Assuming ACD and READHR taking their average values, we plot a few scenarios.

SPORTHR.seq = seq(min(myopia$SPORTHR), max(myopia$SPORTHR))
newdata = data.frame(SPORTHR = SPORTHR.seq,
  male.parents_good = predict(m.s.2.s.no.gender, newdata = with(myopia, 
    data.frame(GENDER = factor(0), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
    READHR = mean(READHR), MOMMY = factor(0), DADMY = factor(0))), type = 'response'),
  female.parents_good = predict(m.s.2.s.no.gender, newdata = with(myopia, 
    data.frame(GENDER = factor(1), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
    READHR = mean(READHR), MOMMY = factor(0), DADMY = factor(0))), type = 'response'),
  male.parents_myopic = predict(m.s.2.s.no.gender, newdata = with(myopia, 
    data.frame(GENDER = factor(0), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
    READHR = mean(READHR), MOMMY = factor(1), DADMY = factor(1))), type = 'response'),
  female.parents_myopic = predict(m.s.2.s.no.gender, newdata = with(myopia, 
    data.frame(GENDER = factor(1), ACD = mean(ACD), SPORTHR = SPORTHR.seq,
    READHR = mean(READHR), MOMMY = factor(1), DADMY = factor(1))), type = 'response'))

ggplot(data = newdata, aes(x = SPORTHR)) + 
  geom_line(aes(y = male.parents_good,col = 'male, parents good')) + 
  geom_line(aes(y = female.parents_good, col = 'female, parents good')) + 
  geom_line(aes(y = male.parents_myopic, col = 'male, parents myopic')) + 
  geom_line(aes(y = female.parents_myopic, col = 'female, parents myopic')) + 
  ylab("Probability")

This suggests an effective way to raise children, in order to prevent them from developing myopia.

5. Conclusion

We studied various likely influencing variables of myopia using logistic regression. Physiological variables, environmental variables and hereditary variables were examined. We examined the validity of the aforementioned hypotheses and concluded that the “visual stimuli” hypothesis have more descriptive/predictive power for males than the “near work” hypothesis, while less for females. This suggests an effective way to raise children, in order to prevent them from developing myopia.


  1. https://en.wikipedia.org/wiki/Near-sightedness