Taking Least Squares From the Ordinary to the General(ized)


Basic Princples of Ordinary Least Squares


  • Y is determined by X: p(Y \(|\) X=x) 

  • The relationship between X and Y is Linear 

  • The residuals of \(\widehat{Y_i} = \beta_0 + \beta_1 X_i + \epsilon\) are normally distributed 
    (i.e., \(\epsilon_i \sim\) N(0,\(\sigma\)))

Basic Principles of Ordinary Least Squares

\(\widehat{Y} = \beta_0 + \beta_1 X + \epsilon\) where \(\beta_0\) = intercept, \(\beta_1\) = slope

Minimize Residuals defined as \(SS_{residuals} = \sum(Y_{i} - \widehat{Y})^2\)

But…what about this?

Variance-Mean Relationship

Or…what about this?

Different precision for different data points

Weighted Least Squares




Minimize Weighted \(SS_{residuals} = W_i\sum(Y_{i} - \widehat{Y})^2\)


Why weight?

  • Minimized influence of heteroskedasticity

  • Increases precision of estimates

  • Decreases influence of imprecision in measurements
     
  • Minimize sampling bias

  • Other reasons to count some measurements more than others

Weighted Least Squares

\[Y_i = \beta X_i + \epsilon_i\] \[\epsilon_i \sim \mathcal{N}(0, \sigma_i)\]
OLS: \(\sigma_1 = \sigma_2 = ...\) is constant
WLS: \(\sigma_i\) is variable

Generalized Least Squares

\[Y_i = \beta X_i + \epsilon_i\] \[\epsilon_i \sim \mathcal{N}(0, \sigma_i)\]
In OLS, this implies \(\sigma_i = \sigma I_n\)
(\(I_n\) is an identity matrix with n x n dimensions)

In WLS, the diagonal of the \(\sigma\) matrix can be anything
In Generalized Least Squares, the off diagonals can be non-zero, too

How can we use weighting?

  1. Weighting by variance

  2. Mean-Variance relationship

  3. Unequal variance between groups

SQUID!

Somatic v. Reproductive Tissues: Variance

Somatic v. Reproductive Tissues: N

Weighting by Data Quality

  • We typically weight by 1/variance (precision)
    - Or N, or other estimate of sample precision

  • This is different than variance weighted meta-analysis
    • We are still estimating a residual error for fits

Implementation



sq_sum <- lm(MEAN_Testisweight ~ MEAN_DML, data=squid_summary,
             weights=1/VAR_DML)

Did it do anything?

No weights

term estimate std.error statistic p.value
(Intercept) -4.5177727 2.1081505 -2.143003 0.0445948
MEAN_DML 0.0376229 0.0077701 4.842035 0.0000989

Precision Weighted

term estimate std.error statistic p.value
(Intercept) -4.518775 2.2041869 -2.050087 0.0544130
MEAN_DML 0.037140 0.0074578 4.980010 0.0000831

Sample Size Weighted

term estimate std.error statistic p.value
(Intercept) -5.2808150 2.8213670 -1.871722 0.0759446
MEAN_DML 0.0417104 0.0110373 3.779050 0.0011787

How can we use weighting?

  1. Weighting by variance

  2. Mean-Variance relationship

  3. Unequal variance between groups

Looking at Individual Squid

Oh, Yeah, That’s Heteroskedasticity

Oh, Yeah, That’s Heteroskedasticity

OK, But, We Need a test

  • Not every case is clear-cut

  • Breusch-Pagan/Godfrey/Koenker test
    • Variant of White’s test (you’ll see this)
    • Get squared residuals
    • Regress on one or more predictors
    • Results fail \(\chi^2\) test of heteroskedastic

BP Test

library(lmtest)

squid_lm <- lm(Testisweight ~ DML, data=squid)

bptest(squid_lm)

    studentized Breusch-Pagan test

data:  squid_lm
BP = 155.31, df = 1, p-value < 2.2e-16

Can use BP to Look at Multiple Predictors

squid_lm2 <- lm(Testisweight ~ DML + MONTH, data=squid)

bptest(squid_lm2)

    studentized Breusch-Pagan test

data:  squid_lm2
BP = 144.22, df = 2, p-value < 2.2e-16

Can Look at Contribution of Individual Predictors

bptest(squid_lm2, varformula = ~ DML, data=squid)

    studentized Breusch-Pagan test

data:  squid_lm2
BP = 140.74, df = 1, p-value < 2.2e-16
bptest(squid_lm2, varformula = ~ MONTH, data=squid)

    studentized Breusch-Pagan test

data:  squid_lm2
BP = 0.086946, df = 1, p-value = 0.7681

So How do we Weight This?

Weighting by a Predictor

  • Need to determine the direction variance increases
  • Weight by X or 1/X

  • Is it a linear or nonlinear relationship between variance and X?

  • Is more than one predictor influencing variance (from BP Test/Graphs)

WLS with LM


squid_wls <- lm(Testisweight ~ DML, data=squid,
                weights = 1/squid$DML)


Variance increases with DML, so weight decreases

LM is LiMiting

  • Need to hand-code the weights

  • Diagnostic plots will still look odd
    • DHARMa can help

  • Cannot easily combine weights from multiple predictors

Enter NLME

  • NonLinear Mixed Effects Model package

  • We’ll get to what that means

  • BUT - also has likelihood-based methods to fit WLS and GLS

  • Flexible weighting specification

WLS via GLS in NLME


squid_gls <- gls(Testisweight ~ DML, 
                 
                 data=squid,
                 
                 weights = varFixed(~ DML))

Note: higher value = higher variance - opposite of lm

Compare WLS and GLS

WLS with lm


t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -5.6239370  0.3382932 -16.624 < 2.2e-16 ***
DML          0.0430654  0.0014061  30.627 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GLS with gls


z test of coefficients:

              Estimate Std. Error z value  Pr(>|z|)    
(Intercept) -5.6239370  0.3382932 -16.624 < 2.2e-16 ***
DML          0.0430654  0.0014061  30.627 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Different Variance Structures

  • VarFixed - Linear continuous variance
  • VarPower - Variance increases by a power
  • VarExp - Variance exponentiated
  • VarConstPower - Variance is constant + power
  • VarIdent - Variance differs by groups
  • VarComb - Combines different variance functions

Multiple Sources of Heteroskedasticity

Let’s add month! If it was another driver…

squid_gls_month <- gls(Testisweight ~ DML, data=squid,

                       weights = varComb(varFixed(~DML),
                                         
                                         varFixed(~MONTH)))

Weighted v. Naieve Fit

Weighted v. Naieve Fit

How can we use weighting?

  1. Weighting by variance

  2. Mean-Variance relationship

  3. Unequal variance between groups

Testis Weight by Month

squid <- mutate(squid, fMONTH = factor(MONTH))

A Classic ANOVA problem

Wowsers - Unequal Variances

MONTH var_weight
1 8.216093
2 9.491376
3 9.456260
4 6.663934
5 5.513464
6 10.884092
7 6.635753
8 2.274724
9 41.502657
10 46.016877
11 17.488115
12 17.734229

Is this a probem?

bptest(squid_month)

    studentized Breusch-Pagan test

data:  squid_month
BP = 68.375, df = 11, p-value = 2.485e-10

Solution: Weight by Month

vMonth <- varIdent(form = ~ 1 | fMONTH)
  • Note 1 | x form, different variance for different strata

Fit and Estimate SD

squid_month_gls <- gls(Testisweight ~ fMONTH, 
                       
                       data=squid, 
                       
                       weights=vMonth)

Summary of results

Generalized least squares fit by REML
  Model: Testisweight ~ fMONTH 
  Data: squid 
       AIC      BIC   logLik
  4307.961 4419.034 -2129.98

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | fMONTH 
 Parameter estimates:
        2         9        12        11         8        10         5 
1.0000000 2.0910010 1.3669006 1.3573650 0.4895507 2.2018372 0.7621603 
        7         6         4         1         3 
0.8361282 1.0708259 0.8379144 0.9303731 0.9981328 

Coefficients:
                Value Std.Error    t-value p-value
(Intercept) 10.200311 0.4272917  23.872011       0
fMONTH2     -5.391282 0.6795202  -7.933953       0
fMONTH3     -4.094684 0.5555744  -7.370183       0
fMONTH4     -5.609615 0.5722338  -9.803013       0
fMONTH5     -6.514285 0.5724285 -11.380085       0
fMONTH6     -7.577601 0.6848334 -11.064882       0
fMONTH7     -8.847744 0.6016011 -14.706993       0
fMONTH8     -9.093561 0.4757355 -19.114743       0
fMONTH9     -3.975147 0.7016302  -5.665587       0
fMONTH10    -4.109998 0.7252503  -5.667006       0
fMONTH11    -4.596277 0.6174994  -7.443371       0
fMONTH12    -4.373843 0.7482711  -5.845265       0

 Correlation: 
         (Intr) fMONTH2 fMONTH3 fMONTH4 fMONTH5 fMONTH6 fMONTH7 fMONTH8
fMONTH2  -0.629                                                        
fMONTH3  -0.769  0.484                                                 
fMONTH4  -0.747  0.470   0.574                                         
fMONTH5  -0.746  0.469   0.574   0.557                                 
fMONTH6  -0.624  0.392   0.480   0.466   0.466                         
fMONTH7  -0.710  0.447   0.546   0.530   0.530   0.443                 
fMONTH8  -0.898  0.565   0.691   0.671   0.670   0.560   0.638         
fMONTH9  -0.609  0.383   0.468   0.455   0.455   0.380   0.433   0.547 
fMONTH10 -0.589  0.370   0.453   0.440   0.440   0.368   0.418   0.529 
fMONTH11 -0.692  0.435   0.532   0.517   0.517   0.432   0.491   0.622 
fMONTH12 -0.571  0.359   0.439   0.426   0.426   0.356   0.406   0.513 
         fMONTH9 fMONTH10 fMONTH11
fMONTH2                           
fMONTH3                           
fMONTH4                           
fMONTH5                           
fMONTH6                           
fMONTH7                           
fMONTH8                           
fMONTH9                           
fMONTH10  0.359                   
fMONTH11  0.421   0.408           
fMONTH12  0.348   0.336    0.395  

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-3.5056004 -0.6728254 -0.3252821  0.4762647  4.9030307 

Residual standard error: 3.080871 
Degrees of freedom: 768 total; 756 residual

Compare with Unweighted Fit

squid_lm <- gls(Testisweight ~ fMONTH, data=squid)

anova(squid_lm, squid_month_gls)
                Model df      AIC      BIC    logLik   Test L.Ratio
squid_lm            1 13 4555.853 4616.017 -2264.926               
squid_month_gls     2 24 4307.961 4419.034 -2129.980 1 vs 2 269.892
                p-value
squid_lm               
squid_month_gls  <.0001

A Clammy Example

  1. Clams.txt has data on the length-AFDM relationship of Wedge clames in Argentina each month.
    - Use read.delim to read it in

  2. Evaluate the length-biomass relationship
    - Test for Heteroskedasticity
    - Correct for it

  3. Do the same for just month
    - You’ll have to make it a factor

  4. Combine them into one model!
    - use varComb