Influential Points in Regression

A random sample of size n is generated from a bivariate normal distribution with a fixed mean, unit variances, and correlation coefficient ρ. The sample correlation is shown, along with the Cook's distance corresponding to the locator point. Several methods of fitting the regression line are available.
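As a rough illustration of the sampling step, the following Python sketch draws correlated pairs with unit variances and computes their sample correlation. It is not the Demonstration's own code: the function names, the default mean of (0, 0), and the seed handling are assumptions made for this example.

```python
import math
import random

def bivariate_normal_sample(n, rho, seed=0, mean=(0.0, 0.0)):
    """Draw n points with unit variances and correlation rho.

    The default mean (0, 0) is an assumption for this sketch.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # y mixes z1 and an independent z2 so that corr(x, y) = rho.
        x = mean[0] + z1
        y = mean[1] + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        points.append((x, y))
    return points

def sample_correlation(points):
    """Pearson sample correlation of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

sample = bivariate_normal_sample(n=5000, rho=0.5, seed=1)
r = sample_correlation(sample)
```

Changing `seed` while holding `n` and `rho` fixed gives a different sample each time, which is the stochastic variation the random-seed control explores.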

The LS (least-squares) method uses Mathematica's built-in function LinearModelFit. See the Details section for more information about L1 (least absolute deviation) and RLINE (resistant line). Cook's distances indicate which points have a large influence on the slope of the LS regression. As a rough rule, points whose Cook's distance exceeds a cutoff that depends on the sample size n may be influential. The recommended practice is to look at a plot of all the Cook's distances. The Cook's distances are computed from the LS regression fitted with LinearModelFit, and two plots of them are available. See Details for more information.

Use the "zoom" slider to zoom out, then move the locator some distance away to explore its influence on the regression, the correlation, and the Cook's distance. The effects of sample size and correlation may also be explored. By varying the random seed, you can explore the stochastic variation for a fixed initial data configuration.

Contributed by: Ian McLeod (March 2011)
(University of Western Ontario)
Open content licensed under CC BY-NC-SA


Details

For the definition of Cook's distance, see [1]. For discussion of its use in detecting influential points in regression, see [2, 3].

Pages 67–68 of [2] suggest that observations whose Cook's distances exceed a cutoff depending on the sample size n may be influential, but that it is better to look at a plot of the Cook's distances with a benchmark line drawn at that cutoff.
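To make the idea concrete, here is a minimal Python sketch that computes Cook's distances for a simple linear LS fit from the residuals and leverages. It does not reproduce what LinearModelFit does internally. The data are made up for illustration, and the cutoff 4/(n - 2) is an assumption: it is a commonly quoted rule of thumb for simple linear regression and may differ from the value this Demonstration uses.

```python
def cooks_distances(xs, ys):
    """Cook's distance of each point for the LS line y = a + b*x."""
    n = len(xs)
    p = 2  # number of fitted parameters (intercept and slope)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    distances = []
    for x, e in zip(xs, residuals):
        h = 1.0 / n + (x - mx) ** 2 / sxx  # leverage of this point
        distances.append(e * e / (p * s2) * h / (1.0 - h) ** 2)
    return distances

# Illustrative data: nine points near y = x plus one far-away point.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 5.0]
distances = cooks_distances(xs, ys)

# Assumed rule-of-thumb cutoff; not taken from the Demonstration.
cutoff = 4.0 / (len(xs) - 2)
flagged = [i for i, d in enumerate(distances) if d > cutoff]
```

With these data only the last point, which combines high leverage with a large residual, is flagged. Plotting all the distances against the cutoff, rather than relying on the flag alone, follows the advice above.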

Page 70 of [3] suggests looking at the half-normal plot of the Cook's distances to see those that are relatively large compared with the rest.
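The coordinates of such a half-normal plot can be sketched in Python as follows. The sorted values are paired with quantiles of the half-normal distribution. The plotting positions (n + i)/(2n + 1) are an assumption, one common convention, and the function name is hypothetical.

```python
from statistics import NormalDist

def half_normal_points(values):
    """Pair sorted values with half-normal quantiles for plotting.

    Assumes plotting positions (n + i) / (2n + 1), i = 1..n.
    """
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [
        std_normal.inv_cdf((n + i) / (2 * n + 1)) for i in range(1, n + 1)
    ]
    return list(zip(quantiles, ordered))

# Made-up Cook's distances with one relatively large value.
points = half_normal_points([0.02, 0.15, 0.01, 0.05, 1.30, 0.08])
```

Points that sit well above the trend of the others at the right-hand end of the plot are the relatively large Cook's distances.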

L1 regression minimizes the sum of the absolute errors. It is computed using linear programming; see eqn. (3) in [4]. L1 regression is more robust than LS when moderate outliers are present, but it is still sensitive to extreme outliers.
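For a straight-line fit, the L1 criterion can be illustrated without a linear-programming solver. The Python sketch below does not use the linear-programming formulation of [4]. It relies instead on the fact that some optimal L1 line passes through at least two data points, and searches all such pairs by brute force. That is practical only for small samples, and the function name and data are illustrative.

```python
from itertools import combinations

def l1_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of absolute errors.

    Brute force over lines through pairs of data points; O(n^3) time.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line is not a regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]

# Four points on y = x and one outlier at x = 4.
intercept, slope = l1_line([0, 1, 2, 3, 4], [0, 1, 2, 3, 10])
```

Here the L1 line is y = x, ignoring the outlier entirely, whereas the LS slope for the same data is 2.2.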

RLINE: the resistant regression line, discussed in §5 of [5], is based on medians.
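A simplified version of the median-based idea is sketched below in Python. It is not the RLINE procedure of [5], which is more elaborate. This version sorts the points by x, takes the medians of the left and right thirds to obtain a slope, and uses the median of y - slope*x as the intercept. The group sizes and the intercept rule are assumptions chosen for brevity.

```python
from statistics import median

def resistant_line(xs, ys):
    """Return (intercept, slope) of a simple three-group median line."""
    pts = sorted(zip(xs, ys))
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")
    k = n // 3  # size of each outer group (a simplifying assumption)
    left, right = pts[:k], pts[n - k:]
    x_left = median([x for x, _ in left])
    y_left = median([y for _, y in left])
    x_right = median([x for x, _ in right])
    y_right = median([y for _, y in right])
    if x_right == x_left:
        raise ValueError("outer groups share the same median x")
    slope = (y_right - y_left) / (x_right - x_left)
    # Intercept: median of the residuals from a line through the origin.
    intercept = median([y - slope * x for x, y in pts])
    return intercept, slope

# Points on y = 1 + 2x, with the last y replaced by an outlier.
xs = list(range(9))
ys = [1, 3, 5, 7, 9, 11, 13, 15, 100]
intercept, slope = resistant_line(xs, ys)
```

Because only group medians enter the calculation, the single outlier leaves the fitted line at y = 1 + 2x.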

[1] Cook's distance, Wikipedia.

[2] S. J. Sheather, A Modern Approach to Regression with R, New York: Springer, 2009.

[3] J. J. Faraway, Linear Models with R, Boca Raton: Chapman & Hall/CRC, 2005.

[4] S. C. Narula and J. F. Wellington, "The Minimum Sum of Absolute Errors Regression: A State of the Art Survey," International Statistical Review, 50(2), 1982, pp. 317–326.

[5] P. F. Velleman and D. C. Hoaglin, Applications, Basics and Computing of Exploratory Data Analysis, Boston: Duxbury Press, 1981.


