Jellybean mistakes 3: Model symmetries

One of the most common and most often overlooked mistakes in basic statistics is assuming that a fitted regression model is symmetric, i.e. that it does not matter whether you fit x ~ y or y ~ x. Based purely on the method and definition of linear model fitting, this is a very wrong assumption, and I think one of the main causes is that linear models are usually interpreted together with correlation values, where this directional relationship is not present.

What is a linear (regression) model?

A linear regression model is a very simple thing. It all starts with the model function \(y = \alpha + \beta x\), which is a line with slope \(\beta\) and an intercept on the y axis at \(\alpha\). Since most real-world observations do not lie on a perfect line due to some error (unobserved variables, stochastic effects etc.), we extend this equation for each observation pair \((x_i, y_i)\) as \(y_i = \alpha + \beta x_i + \epsilon_i\).

And here we are: we have the regression model. The goal is to find estimates \(\hat{\alpha}\) and \(\hat{\beta}\) for \(\alpha\) and \(\beta\) based on our measured data pairs.

The standard solution to this problem is the ordinary least squares (OLS) method. You do not need to know how to calculate it by hand, just understand what is going on: we want to find \(\hat{\alpha}\) and \(\hat{\beta}\) that minimize the sum of the squared error terms, \(\sum_i \hat{\epsilon}_i^2 = \sum_i (y_i - \hat{\alpha} - \hat{\beta} x_i)^2\). In simpler words, we would like to find the model with the smallest error. Sounds reasonable and easy, but let's see a visual representation.

tibble(s = seq(-5, 5, 0.2)) %>%
  mutate(order = seq_along(s),
         x = list(anscombe$x1),
         y = list(anscombe$y1),
         i = mean(anscombe$y1) - s*mean(anscombe$x1)) %>%  # intercepts chosen so each candidate line passes through the centroid (red point)
  unnest() %>%
  mutate(pred_y = s*x+i) %>%
  group_by(order) %>%
  mutate(se = sum((y-pred_y)^2)) %>%  # sum of squared vertical residuals for each candidate line
  ggplot() +
  geom_smooth(aes(x, y), method = "lm", se = FALSE, color = "lightblue", alpha = 0.6) + 
  geom_abline(aes(slope = s, intercept = i)) +
  geom_segment(aes(x = x, y = y, xend = x, yend = pred_y, color = se)) +
  geom_point(aes(x, y)) +
  geom_point(aes(mean(x), mean(y)), color = "red") +
  coord_cartesian(xlim = c(-10, 30), ylim = c(-10, 30)) +
  labs("Ordinary least square regression as x ~ y") +
  transition_states(
    order,
    transition_length = 2,
    state_length = 1
  ) 

You can see the actual fitted line (fixed) and the “searching” line with all the errors connected to the data points. You can also see that those lines (the residuals) move only along the y axis and are fixed in x. This means we minimize the error only in the y direction, which makes sense, since we are predicting y based on x. The residual is the bit of information about y that x cannot explain.
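
As a quick sanity check, here is a minimal sketch of the closed-form OLS estimates on the same data; they should match the coefficients returned by lm() (the function geom_smooth() uses above):

x <- anscombe$x1
y <- anscombe$y1
beta_hat  <- cov(x, y) / var(x)           # OLS slope: covariance over variance of the predictor
alpha_hat <- mean(y) - beta_hat * mean(x) # OLS intercept: the line passes through the centroid
c(alpha_hat, beta_hat)
coef(lm(y1 ~ x1, anscombe))               # same values from lm()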

But what if we turn the tables?

What if we want to predict x based on y? Then we just change the equation to \(x_i = \alpha + \beta y_i + \epsilon_i\), which means we are now minimizing the squared residuals in the horizontal direction.

Here I simply swapped the x and y data and finally used coord_flip() to bring the plot back to its original orientation.

tibble(s = seq(-5, 5, 0.2)) %>%
  mutate(order = seq_along(s),
         x = list(anscombe$y1),
         y = list(anscombe$x1),
         i = mean(anscombe$x1) - s*mean(anscombe$y1)) %>%
  unnest() %>%
  mutate(pred_y = s*x+i) %>%
  group_by(order) %>%
  mutate(se = sum((y-pred_y)^2)) %>%
  ggplot() +
  geom_smooth(aes(x, y), method = "lm", se = FALSE, color = "lightblue", alpha = 0.6) + 
  geom_abline(aes(slope = s, intercept = i)) +
  geom_segment(aes(x = x, y = y, xend = x, yend = pred_y, color = se)) +
  geom_point(aes(x, y)) +
  geom_point(aes(mean(x), mean(y)), color = "red") +
  coord_flip(xlim = c(-10, 30), ylim = c(-10, 30)) +
  labs("Ordinary least square regression as y ~ x") +
  labs(x = "y", y = "x") +
  transition_states(
    order,
    transition_length = 2,
    state_length = 1
  ) 

You can easily see how many things changed:

1. y became the independent variable.
2. x became the dependent one.
3. The regression line is different (actually mirrored).
4. The residuals are now horizontal lines.
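
To put point 3 in numbers, here is a small sketch comparing the two fits; note that the x ~ y slope is not the reciprocal of the y ~ x slope, the two slopes multiply to \(r^2\) rather than to 1:

fit_yx <- lm(y1 ~ x1, anscombe)   # predict y from x
fit_xy <- lm(x1 ~ y1, anscombe)   # predict x from y
coef(fit_yx)
coef(fit_xy)
coef(fit_yx)["x1"] * coef(fit_xy)["y1"]   # equals cor(anscombe$x1, anscombe$y1)^2, not 1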

Which one is correct?

The correct choice of independent and dependent variables is based on your experimental setup. When you fix a condition, like temperature, soil, light, or chemical concentration, that will be your predictor. Then you measure something (a phenotype, a reaction, etc.) that will depend on the predictor.

A symmetric solution - orthogonal regression

There are cases when one would like to have a symmetric solution to the problem, where neither variable is strictly dependent on the other. Just think of when you would like to plot the expression of two genes across several conditions and model their linear (or logarithmic) relationship. Without further information about the gene regulatory hierarchy, there is no way to place one above the other, and we also cannot fix one gene's expression while measuring the other. Which one would you choose as a predictor? I would choose neither and fit an orthogonal regression instead.

Orthogonal regression sounds very “mathematical”, but the fitted line is simply the first eigenvector of the system, which is the first principal component axis in a PCA. A lot of different names, but they basically mean the same thing. In fact, there is no easier way to calculate this line in R than to run a PCA and compute the slope and the intercept from the rotation and center information.
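
If you would like to see the eigenvector explicitly, here is a minimal sketch using eigen() on the covariance matrix; the slope and intercept should match the prcomp()-based calculation in the plotting code below:

v   <- eigen(cov(anscombe[, c("x1", "y1")]))$vectors[, 1]   # first eigenvector = first principal component direction
slp <- v[2] / v[1]                                          # slope of the orthogonal regression line
int <- mean(anscombe$y1) - slp * mean(anscombe$x1)          # the line passes through the centroid
c(int, slp)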

In this case we use the total least squares method, where we minimize the sum of the squared perpendicular distances between the data points and the fitted line (the perpendicular residuals account for error in both variables, hence “total” least squares). Maybe it is a bit difficult to grasp, so here is a visual representation to help:

pca <- prcomp(~x1+y1, anscombe)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])

tibble(s = seq(-5, 5, 0.2)) %>%
  mutate(order = seq_along(s),
         x = list(anscombe$x1),
         y = list(anscombe$y1),
         i = mean(anscombe$y1) - s*mean(anscombe$x1)) %>%
  unnest() %>%
  mutate(s2 = -1/s,                                    # slope of the perpendicular through each point
         i2 = y - s2*x,                                # intercept of that perpendicular
         pred_x = if_else(s == 0, x, (i2-i)/(s-s2)),   # foot of the perpendicular on the candidate line
         pred_y = s*pred_x + i) %>%
  group_by(order) %>%
  mutate(se = sum((pred_x-x)^2 + (pred_y-y)^2)) %>%    # sum of squared perpendicular distances
  ggplot() +
  geom_abline(slope = slp, intercept = int, color = "lightblue", alpha = 0.6) +
  geom_abline(aes(slope = s, intercept = i, color = se)) +
  geom_segment(aes(x = x, y = y, xend = pred_x, yend = pred_y, color = se)) +
  geom_point(aes(x, y)) +
  geom_point(aes(mean(x), mean(y)), color = "red") +
  labs("Orthogonal regression for (x,y)") +
  coord_cartesian(xlim = c(-10, 30), ylim = c(-10, 30)) +
  transition_states(
    order,
    transition_length = 1,
    state_length = 1
  ) 

Let’s see all of them at once!

To make the comparison even more convincing, let's see all three regression lines in one graph.

pca <- prcomp(~x1+y1, anscombe)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
lmxy <- lm(y1 ~ x1, anscombe)
lmyx <- lm(x1 ~ y1, anscombe)

ggplot(anscombe) +
  geom_point(aes(x1,y1)) + 
  geom_abline(intercept = int, slope = slp, color = "limegreen") +
  geom_abline(intercept = lmxy$coefficients[1], slope = lmxy$coefficients[2], color = "hotpink") +
  # back-transform x = a + b*y into y = -a/b + (1/b)*x so the x ~ y fit can be drawn in the same coordinates
  geom_abline(intercept = - lmyx$coefficients[1]/lmyx$coefficients[2], slope = 1/lmyx$coefficients[2], color = "dodgerblue") +
  xlim(c(0, 20)) + ylim(c(0, 20)) +
  labs(title = "Linear regression types", 
       x = "x", 
       y = "y", 
       caption = "Blue: y~x\nPink: x~y\nGreen: orthogonal")

So where is correlation?

Actually, there is no direct relation between correlation and regression here. Indeed, correlation is insensitive to the direction of the relationship in the data; it is a measure of the strength of the relationship. But more about this in another post.
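
Just to illustrate this symmetry, swapping the arguments leaves the correlation unchanged, unlike the regression coefficients above:

cor(anscombe$x1, anscombe$y1)
cor(anscombe$y1, anscombe$x1)   # identical to the line above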

Session info

To produce the graphs, I used the tidyverse and gganimate packages and the anscombe dataset (only x1 and y1). My plotting theme is set with ggthemr::ggthemr("fresh").

devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       macOS Mojave 10.14.1        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Asia/Tokyo                  
##  date     2019-02-15                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                          
##  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)                  
##  backports     1.1.3   2018-12-14 [1] CRAN (R 3.5.0)                  
##  bindr         0.1.1   2018-03-13 [1] CRAN (R 3.5.0)                  
##  bindrcpp    * 0.2.2   2018-03-29 [1] CRAN (R 3.5.0)                  
##  blogdown      0.10    2019-01-09 [1] CRAN (R 3.5.2)                  
##  bookdown      0.9     2018-12-21 [1] CRAN (R 3.5.0)                  
##  broom         0.5.1   2018-12-05 [1] CRAN (R 3.5.0)                  
##  callr         3.1.1   2018-12-21 [1] CRAN (R 3.5.0)                  
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.5.0)                  
##  class         7.3-14  2015-08-30 [1] CRAN (R 3.5.2)                  
##  classInt      0.3-1   2018-12-18 [1] CRAN (R 3.5.0)                  
##  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)                  
##  colorspace    1.4-0   2019-01-13 [1] CRAN (R 3.5.2)                  
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)                  
##  DBI           1.0.0   2018-05-02 [1] CRAN (R 3.5.0)                  
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)                  
##  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.2)                  
##  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)                  
##  dplyr       * 0.7.8   2018-11-10 [1] CRAN (R 3.5.0)                  
##  e1071         1.7-0.1 2019-01-21 [1] CRAN (R 3.5.2)                  
##  evaluate      0.13    2019-02-12 [1] CRAN (R 3.5.2)                  
##  farver        1.1.0   2018-11-20 [1] CRAN (R 3.5.0)                  
##  forcats     * 0.3.0   2018-02-19 [1] CRAN (R 3.5.0)                  
##  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)                  
##  generics      0.0.2   2018-11-29 [1] CRAN (R 3.5.0)                  
##  gganimate   * 1.0.0   2019-01-02 [1] CRAN (R 3.5.2)                  
##  ggplot2     * 3.1.0   2018-10-25 [1] CRAN (R 3.5.0)                  
##  ggthemr     * 1.1.0   2019-02-14 [1] Github (cttobin/ggthemr@0a31bb5)
##  gifski        0.8.6   2018-09-28 [1] CRAN (R 3.5.0)                  
##  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)                  
##  gtable        0.2.0   2016-02-26 [1] CRAN (R 3.5.0)                  
##  haven         2.0.0   2018-11-22 [1] CRAN (R 3.5.0)                  
##  hms           0.4.2   2018-03-10 [1] CRAN (R 3.5.0)                  
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)                  
##  httr          1.4.0   2018-12-11 [1] CRAN (R 3.5.0)                  
##  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.0)                  
##  knitr         1.21    2018-12-10 [1] CRAN (R 3.5.2)                  
##  labeling      0.3     2014-08-23 [1] CRAN (R 3.5.0)                  
##  lattice       0.20-38 2018-11-04 [1] CRAN (R 3.5.2)                  
##  lazyeval      0.2.1   2017-10-29 [1] CRAN (R 3.5.0)                  
##  lpSolve       5.6.13  2015-09-19 [1] CRAN (R 3.5.0)                  
##  lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)                  
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)                  
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)                  
##  modelr        0.1.3   2019-02-05 [1] CRAN (R 3.5.2)                  
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.0)                  
##  nlme          3.1-137 2018-04-07 [1] CRAN (R 3.5.2)                  
##  pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)                  
##  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)                  
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)                  
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)                  
##  plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.0)                  
##  png           0.1-7   2013-12-03 [1] CRAN (R 3.5.0)                  
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)                  
##  processx      3.2.1   2018-12-05 [1] CRAN (R 3.5.0)                  
##  progress      1.2.0   2018-06-14 [1] CRAN (R 3.5.0)                  
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.0)                  
##  purrr       * 0.3.0   2019-01-27 [1] CRAN (R 3.5.2)                  
##  R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)                  
##  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)                  
##  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.5.0)                  
##  readxl        1.2.0   2018-12-19 [1] CRAN (R 3.5.0)                  
##  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)                  
##  rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)                  
##  rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.0)                  
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)                  
##  rstudioapi    0.9.0   2019-01-09 [1] CRAN (R 3.5.2)                  
##  rvest         0.3.2   2016-06-17 [1] CRAN (R 3.5.0)                  
##  scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.0)                  
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)                  
##  sf            0.7-2   2018-12-20 [1] CRAN (R 3.5.0)                  
##  stringi       1.3.1   2019-02-13 [1] CRAN (R 3.5.2)                  
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.5.2)                  
##  tibble      * 2.0.1   2019-01-12 [1] CRAN (R 3.5.2)                  
##  tidyr       * 0.8.2   2018-10-28 [1] CRAN (R 3.5.0)                  
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)                  
##  tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.5.0)                  
##  transformr    0.1.1   2018-12-09 [1] CRAN (R 3.5.0)                  
##  tweenr        1.0.1   2018-12-14 [1] CRAN (R 3.5.0)                  
##  units         0.6-2   2018-12-05 [1] CRAN (R 3.5.0)                  
##  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)                  
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)                  
##  xfun          0.4     2018-10-23 [1] CRAN (R 3.5.0)                  
##  xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.0)                  
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)                  
## 
## [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library