Python Rocks! and other rants
Weblog of Kent S Johnson

2006-10-04 18:16:36

Using R in Biostatistics

I am taking a course in biostatistics based on the book Biostatistics: A Foundation for Analysis in the Health Sciences by Wayne W. Daniel. At the same time I am learning to use the R statistics package. This page collects my observations.

Getting started

The main R website is http://www.r-project.org/.

A number of books are listed here: http://cran.r-project.org/doc/FAQ/R-FAQ.html#What-documentation-exists-for-R_003f

The book I bought is Using R for Introductory Statistics by John Verzani. This book has helped me learn my way around R. Verzani's book grew out of a fairly extensive document called Simple R which is available on-line. In part it was this document that convinced me to buy the book.

Another good online introduction is Using R for Data Analysis and Graphics by J H Maindonald. I think this is also available in a book-length version.

The documentation and help that come with R are extensive, but I didn't find them very accessible as a beginner; the other tutorials provide a better route in my opinion. Once you start to know your way around a little, the built-in help is very useful.

It can be hard to find what you need: the table of contents for the base package is over seven pages, and the stats package adds another six. You can search the help with help.search('search term'), and you can search the R website and help mailing list with the menu command Help / search.r-project.org.... These are helpful when you just need to find the name of the function to use.

If you know the name of a function, type ?func for full help, or args(func) for a list of arguments. To search the help for func, type help.search('func'). For a list of the contents of a library, type library(help=libname). Note: In what follows, I cite R functions in the form library:function(), for example BSDA:zsum.test().

Chapter 2 - Descriptive Statistics

The library UsingR has a very nice function simple.hist.and.boxplot() which draws a histogram with rug and an accompanying box-and-whisker plot. It will also draw a frequency polygon with simple.freqpoly(), but the endpoints are not right; they are half an interval short. For a histogram or box plot alone use hist() and boxplot().

Rcmdr:stem.leaf() draws a stem-and-leaf plot like the ones in the book. The default stem() function draws a simpler version.

mean() and var() compute mean and variance. summary() shows quartiles.
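As a quick sketch of these functions (the data values here are made up for illustration):

```r
x <- c(14, 9, 18, 26, 12)  # hypothetical sample values

mean(x)     # arithmetic mean: 15.8
var(x)      # sample variance (n-1 denominator): 43.2
sd(x)       # sample standard deviation
summary(x)  # minimum, quartiles, median, mean and maximum
```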

Chapter 4 - Probability Distributions

R has built-in functions to calculate many standard distributions. For each distribution xx, R has four functions: dxx, pxx, qxx and rxx. A few of the available distributions are

  • binom - binomial distribution
  • norm - normal distribution
  • pois - Poisson distribution
  • t - Student's t distribution
  • chisq - chi-square distribution
  • f - F distribution

The 'd' function computes the density of the distribution at a given quantile. This is the height of the curve for a given x value. For example, the height of the normal distribution at z=0 is 0.3989423:

> dnorm(0)
[1] 0.3989423

The 'p' function computes the probability at a given quantile. This is the area under the curve on one side of the quantile. The default is to give the area to the left of the quantile; with the argument lower.tail=FALSE it gives the area to the right. For example, z=0 divides the normal distribution in half and z=1.64 leaves about 95% of the area to the left:

> pnorm(0)
[1] 0.5
> pnorm(1.64)
[1] 0.9494974

The 'q' function is the inverse of the 'p' function. Given a probability, it returns the corresponding quantile:

> qnorm(.95)
[1] 1.644854

The 'r' function picks random values:

> rnorm(3)
[1]  1.6623277 -1.0355470 -0.9379454

You can pass vectors to these functions. This gives a simple way to find the area between two z values. For example, P(.84 <= z <= 2.45) can be computed as:

> diff(pnorm(c(.84, 2.45)))
[1] 0.1933114

The normal distribution functions are not constrained to the standard normal; you can give them the mean and standard deviation of the distribution. For example, the probability of 3 or less in a distribution with mean 5.4 and standard deviation 1.3 (Example 4.7.1) is easily obtained:

> pnorm(3, 5.4, 1.3)
[1] 0.03243494

Chapter 5 - Sampling Distributions

Distribution of the sample mean (Example 5.3.2)

What is the probability that a random sample of size 10 from a population with mean 185.6 and standard deviation 12.7 will have a mean greater than 190? R doesn't have a function to compute standard error, but it does have a built-in square root function:

> pnorm(190, 185.6, 12.7/sqrt(10), lower=F)
[1] 0.1366286

Alternatively, you can use BSDA:zsum.test(). In this case the null hypothesis is that mu=185.6 and the p-value is the desired answer:

> library(BSDA)
> zsum.test(190, 12.7, 10, mu=185.6, alt='g')

      One-sample z-Test

data:  Summarized x
z = 1.0956, p-value = 0.1366
alternative hypothesis: true mean is greater than 185.6
95 percent confidence interval:
 183.3941       NA
sample estimates:
mean of x
      190

Distribution of the difference between two sample means (Example 5.4.2)

R doesn't give much support for this; you have to calculate the difference of means and the standard error yourself, then use pnorm() to compute the desired probability:

> pnorm(20, 45-30, sqrt(15^2/35 + 20^2/40), lower=F)
[1] 0.1086783

There may be a way to do this with BSDA:zsum.test() but I haven't figured it out.

Distribution of the sample proportion and difference between two sample proportions

These are similar to the two examples above: compute the standard error yourself, then use pnorm() for the final answer. Questions about a single sample proportion can also be answered with prop.test(); for example, here is Example 5.5.2:

> prop.test(.45*200, 200, .51, alt='l',corr=F)

      1-sample proportions test without continuity correction

data:  0.45 * 200 out of 200, null probability 0.51
X-squared = 2.8812, df = 1, p-value = 0.04481
alternative hypothesis: true p is less than 0.51
95 percent confidence interval:
 0.0000000 0.5081466
sample estimates:
   p
0.45

To match the book result I turned off the Yates continuity correction.

Chapter 6 - Estimation

Confidence interval for a population mean

Several functions are available to create confidence intervals for the value of a mean. The choice of which to use depends on whether you have raw or summary data, and whether you want to use a z-statistic or t-statistic. The functions are summarized here:

Type of data   z-statistic        t-statistic
Raw            BSDA:z.test()      t.test()
Summarized     BSDA:zsum.test()   BSDA:tsum.test()

Example 6.2.1 uses a z-statistic with summarized data. Note that the problem supplies the variance and zsum.test() needs the standard deviation:

> zsum.test(22,sqrt(45),10)
95 percent confidence interval:
 17.84229 26.15771

Example 6.2.2 is similar but uses a different confidence level:

> zsum.test(84.3, sqrt(144), 15, conf.level=.99)
99 percent confidence interval:
 76.31908 92.28092

In Example 6.3.1 the sample standard deviation is given so a t-statistic is used:

> tsum.test(250.8, 130.9, 19)
95 percent confidence interval:
 187.7082 313.8918

To find reliability factors directly use qnorm() and qt().
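For example (a quick sketch, not from the book; the t factor uses df = 18, matching the 19 observations of Example 6.3.1):

```r
# Two-sided 95% reliability factors
qnorm(0.975)        # z factor: 1.959964
qt(0.975, df = 18)  # t factor for n = 19 (Example 6.3.1): about 2.10
```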

Confidence interval for the difference between population means

The same four functions are used to compute confidence intervals for the difference between two means. z.test(), t.test(), zsum.test() and tsum.test() can all take two sets of values as parameters.

Example 6.4.1:

> zsum.test(4.5, sqrt(1), 12, 3.4, sqrt(1.5), 15)
95 percent confidence interval:
 0.2607936 1.9392064

Here the confidence interval is for the difference between the two means.

Example 6.4.3: tsum.test() has a parameter var.equal which specifies whether the variances are assumed equal:

> tsum.test(4.7, 9.3, 18, 8.8, 11.5, 10, var.equal=T)
95 percent confidence interval:
 -12.301022   4.101022

Example 6.4.4: When the variances are not equal, R uses the Welch method to compute the confidence intervals. (The Welch method is briefly described in the Wikipedia entry on Welch's t-test.) It gives a different result than the Cochran method used in the text:

> tsum.test(4.7, 9.3, 18, 8.8, 11.5, 10, var.equal=F)
95 percent confidence interval:
 -13.118585   4.918585

Confidence interval for a population proportion

The confidence interval for a population proportion or the difference between two population proportions is computed with prop.test(). There are some differences between prop.test() and the method shown in the text.

First, prop.test() uses a different formula for the standard error. A paper by Newcombe [1] compares seven methods. He calls the method used in the text the simple asymptotic method, and the method used by R the Wilson 'score' method; in both cases he allows a variation using a continuity correction. In his discussion he strongly recommends that the simple asymptotic method be limited to simple uses such as teaching and sample size estimation, and finds that the Wilson 'score' method performs well.

Second, R applies the continuity correction (p. 147 in the textbook) automatically in some cases. You can control this explicitly with the correct parameter to prop.test().

With those caveats, the confidence interval is easy to calculate. Note that the arguments to prop.test() are raw counts, not the sample proportion. Example 6.5.1 looks like this:

> prop.test(.18*1220, 1220)
95 percent confidence interval:
 0.1590672 0.2029797
[1] Newcombe, R.G., 'Two-sided confidence intervals for the single proportion: comparison of seven methods', Statistics in Medicine, 17, 857-872 (1998).

Confidence interval for the difference between population proportions

A confidence interval for the difference between population proportions is also calculated using prop.test(). The above notes about prop.test() apply here as well. The sample counts are given as two vectors; the c() function can be used to build them. Example 6.6.1 looks like this:

> prop.test(c(31, 53), c(68, 255), conf.level=.99)
99 percent confidence interval:
 0.06994455 0.42613388

Determining sample size

The BSDA:nsize() function can be used to determine required sample size for estimating means or proportions. Example 6.7.1 becomes:

> nsize(5, 20)

The required sample size (n) to estimate the population
mean with a 0.95 confidence interval so that the margin
of error is no more than 5 is 62 .

and Example 6.8.1 is:

> nsize(0.05, p=.35, type='pi')

The required sample size (n) to estimate the population
proportion of successes with a 0.95 confidence interval
so that the margin of error is no more than 0.05 is 350 .

Confidence interval for a variance or ratio of variances

As far as I know R doesn't have a built-in function to compute a confidence interval for a single variance. It does have qchisq() which can be used to find the correct reliability coefficients. For example, using values from Example 6.9.1:

> alpha=.05
> qchisq(1-alpha/2, df=6)
[1] 14.44938
> qchisq(alpha/2, df=6)
[1] 1.237344

R does have var.test() to compute an F-test for ratio of two variances but it works only from raw data so it cannot be used to solve Example 6.10.1. The qf() function gives reliability coefficients from the F-distribution:

> qf(.025, 15, 3)
[1] 0.2408012
> qf(.975, 15, 3)
[1] 14.25271

Chapter 7 - Hypothesis Testing

Hypothesis testing a single population mean

The functions used to create confidence intervals for the mean are also used for hypothesis testing the value of a mean. This is not too surprising as one way to test a hypothesis is to create a confidence interval for the sampled mean. The table above also applies here. An additional parameter gives the value of the mean in the null hypothesis.

Example 7.2.1 uses a z-statistic with summarized data. This is computed with BSDA:zsum.test(). The parameters are the mean, standard deviation and count of the sample, and the mean being tested against in the null hypothesis:

> zsum.test(27, sqrt(20), 10, mu=30)

        One-sample z-Test

data:  Summarized x
z = -2.1213, p-value = 0.03389
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 24.22819 29.77181
sample estimates:
mean of x
       27

The output shows the computed z-statistic -2.1213, the p-value and the confidence interval. R does not show the rejection region; the null hypothesis is accepted or rejected by checking whether the hypothesized mean lies within the computed confidence interval.

Example 7.2.2 is a one-sided hypothesis test with the same data. This is easily computed by adding alt='less' to the parameters to zsum.test():

> zsum.test(27, sqrt(20), 10, mu=30, alt='less')

        One-sample z-Test

data:  Summarized x
z = -2.1213, p-value = 0.01695
alternative hypothesis: true mean is less than 30
95 percent confidence interval:
       NA 29.32617
sample estimates:
mean of x
       27

Example 7.2.3 uses a t-statistic with raw data. This is computed with t.test(). The scan() function is a convenient way to enter small data sets by hand. (Larger data sets are better read from a file using read.table() or read.csv().)

> days = scan()
1: 14 9 18 26 12
6: 0 10 4 8 21
11: 28 24 24 2 3 14 9
18:
Read 17 items

> t.test(days, mu=15)

        One Sample t-test

data:  days
t = -0.7915, df = 16, p-value = 0.4402
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
  8.72508 17.86315
sample estimates:
mean of x
 13.29412

In this case the null hypothesis cannot be rejected because the mean of 15 falls in the 95% confidence interval for the true mean of the data set.

Hypothesis testing the difference between two means

Once again z.test(), t.test(), zsum.test() and tsum.test() are used. This is similar to calculating confidence intervals for the difference between two means.

Example 7.3.2 uses the same data as Example 6.4.1, and the same command is used for the hypothesis test. Since the null hypothesis is that the difference is 0, it does not need to be specified. The full output is:

> zsum.test(4.5, sqrt(1), 12, 3.4, sqrt(1.5), 15)

        Two-sample z-Test

data:  Summarized x and y
z = 2.569, p-value = 0.01020
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2607936 1.9392064
sample estimates:
mean of x mean of y
      4.5       3.4

Paired comparison

Example 7.4.1

The data for this example can be downloaded from the Student Companion website and read using read.csv():

> data = read.csv(file='EXA_C07_S04_01.csv')

Then this becomes a single-sample t-test on the difference between the two data sets:

> with(data, t.test(POSTOP-PREOP, alt='greater'))

        One Sample t-test

data:  POSTOP - PREOP
t = 1.9159, df = 11, p-value = 0.04086
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 1.131919      Inf
sample estimates:
mean of x
   18.075

Alternatively, use the paired=TRUE parameter to t.test(), which yields the same result:

> with(data, t.test(POSTOP, PREOP, paired=T, alt='greater'))

Hypothesis testing a population proportion or difference in proportions

prop.test() may be used for hypothesis testing of a single population proportion or the difference between two proportions. Here is Example 7.5.1:

> prop.test(24, 301, p=.063, alt='g', corr=F)

        1-sample proportions test without continuity correction

data:  24 out of 301, null probability 0.063
X-squared = 1.4279, df = 1, p-value = 0.1161
alternative hypothesis: true p is greater than 0.063
95 percent confidence interval:
 0.05763846 1.00000000
sample estimates:
         p
0.07973422

Example 7.6.1:

> prop.test(c(11, 24), c(29, 44), alt='l', corr=F)

        2-sample test for equality of proportions without continuity
        correction

data:  c(11, 24) out of c(29, 44)
X-squared = 1.9333, df = 1, p-value = 0.0822
alternative hypothesis: less
95 percent confidence interval:
 -1.00000000  0.02675495
sample estimates:
   prop 1    prop 2
0.3793103 0.5454545

Hypothesis testing a variance or ratio of variances

Here again R doesn't give much help. qchisq() and qf() can be used to obtain critical values, and var.test() will test the ratio of two variances given the raw data.

Chapter 8 - Analysis of Variance

One-way analysis of variance

The functions oneway.test() and aov() both perform an F test as described in Section 8.2. The output of aov() is very similar to Table 8.2.4.

Both functions need the data as a single array and a factor. The stack() function will convert a data frame containing multiple datasets into the required form. Below is Example 8.2.1. (Note: The data set available online differs from the one in the book so the answer here differs from the book.)

# Read the data set
> sel=read.csv('EXA_C08_S02_01.csv')

# Convert to stacked form
> d=stack(sel)
> names(d)
[1] "values" "ind"

# ANOVA using oneway.test()

> oneway.test(values~ind, d,var.equal=T)

        One-way analysis of means

data:  values and ind
F = 22.6142, num df = 3, denom df = 140, p-value = 5.345e-12

# ANOVA using aov()
> res = aov(values~ind, d)
> summary(res)
             Df Sum Sq Mean Sq F value    Pr(>F)
ind           3  18935    6312  22.614 5.345e-12 ***
Residuals   140  39074     279

oneway.test() can also use the Welch approximation to perform an F test without the assumption of equal variances.

Tukey's HSD test

The Tukey HSD test can be computed from the result of aov() using TukeyHSD():

> TukeyHSD(aov(values~ind, d), ordered=T)
  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = values ~ ind, data = d)

$ind
             diff       lwr      upr     p adj
RRB-VEN  2.646857 -7.737053 13.03077 0.9108599
SQU-VEN 11.541505  2.567647 20.51536 0.0057777
NRB-VEN 36.170840 24.160828 48.18085 0.0000000
SQU-RRB  8.894648 -1.030122 18.81942 0.0961032
NRB-RRB 33.523982 20.787762 46.26020 0.0000000
NRB-SQU 24.629335 13.014007 36.24466 0.0000010

The first column is the difference between means, the second and third columns are the lower and upper bounds of the confidence interval for the difference, and the last column is an adjusted p-value. If the confidence interval includes 0, the means may be equal. The confidence intervals may be plotted using plot(TukeyHSD(...)).

Two-way ANOVA

Two-way analysis of variance is very similar to the one-way analysis. An extra term is added to the model formula to represent the second factor. Here is Example 8.3.1, with data entered by hand:

# Data for example 8.3.1
> method =rep(c('A', 'B','C'), 5)
> age=rep(c('U20', '20', '30', '40', '50'), rep(3,5))
> days=scan()
1: 7 9 10 8 9 10 9 9 12 10 9 12 11 12 14
16:
Read 15 items
> data = data.frame(list(days=days, age=age, method=method))

# xtabs() shows data as a table
> xtabs(days ~ age + method, data)
     method
age    A  B  C
  20   8  9 10
  30   9  9 12
  40  10  9 12
  50  11 12 14
  U20  7  9 10

# Two-way ANOVA
> res = aov(days ~ method + age, data)
> summary(res)
            Df  Sum Sq Mean Sq F value    Pr(>F)
method       2 18.5333  9.2667  21.385 0.0006165 ***
age          4 24.9333  6.2333  14.385 0.0010017 **
Residuals    8  3.4667  0.4333

> TukeyHSD(res, "method")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = days ~ method + age, data = data)

$method
    diff        lwr      upr     p adj
B-A  0.6 -0.5896489 1.789649 0.3666717
C-A  2.6  1.4103511 3.789649 0.0006358
C-B  2.0  0.8103511 3.189649 0.0034083

Example 8.4.1 is similar. Note that the raw category data from read.csv() must be converted to factors:

> health =read.csv('EXA_C08_S04_01.csv')
> summary(aov(FUNC ~ factor(SUBJ) + factor(TIME), data=health))
            Df  Sum Sq Mean Sq F value    Pr(>F)
SUBJ        17 20237.5  1190.4  8.1998 2.179e-09 ***
TIME         3  2395.8   798.6  5.5008  0.002371 **
Residuals   51  7404.2   145.2

For a factorial experiment where the factors may interact, a different model formula is used, with A*B instead of A+B. Here is Example 8.5.2:

> minutes = read.csv('EXA_C08_S05_02.csv')
> summary(aov(HOME ~ factor(A) * factor(B), minutes))
                    Df  Sum Sq Mean Sq F value    Pr(>F)
factor(A)            3 2992.45  997.48 67.9427 < 2.2e-16 ***
factor(B)            3 1201.05  400.35 27.2695 1.763e-11 ***
factor(A):factor(B)  9  608.45   67.61  4.6049 0.0001047 ***
Residuals           64  939.60   14.68

Chapter 9 - Simple Linear Regression and Correlation

R has great support for linear regression and related operations such as plotting. The lm() function creates a linear model. Its argument is a formula. For example here is the code to reproduce Example 9.3.1:

# Read the data set
> at=read.csv('EXA_C09_S03_01.csv')

# Scatter plot
> plot(Y ~ X, data=at)

# Compute the linear model
> res=lm(Y~X, data=at)
> res

Call:
lm(formula = Y ~ X, data = at)

Coefficients:
(Intercept)            X
   -215.981        3.459

# More information is available in the summary:
> summary(res)

Call:
lm(formula = Y ~ X, data = at)

Residuals:
     Min       1Q   Median       3Q      Max
-107.288  -19.143   -2.939   16.376   90.342

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -215.9815    21.7963  -9.909   <2e-16 ***
X              3.4589     0.2347  14.740   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.06 on 107 degrees of freedom
Multiple R-Squared:  0.67,      Adjusted R-squared: 0.667
F-statistic: 217.3 on 1 and 107 DF,  p-value: < 2.2e-16

# Add the regression line to the plot
> abline(res)

Evaluating the regression equation

The coefficient of determination (Multiple R-Squared), adjusted coefficient of determination (Adjusted R-squared) and F-statistic (Example 9.4.1) are all shown at the end of the summary above. The full analysis of variance table can be shown using aov():

> summary(aov(Y~X,at))
             Df Sum Sq Mean Sq F value    Pr(>F)
X             1 237549  237549  217.28 < 2.2e-16 ***
Residuals   107 116982    1093

The t-statistic and p-value for beta (Example 9.4.2) are shown in the second line of the coefficients table.

The standard error of the slope (0.2347), used to create a confidence interval for the slope, is shown in the above output under the heading Std. Error.

Confidence interval for Y and mu

Predicted values of y and mu with associated confidence intervals are computed with predict(). For a confidence interval on Y, use interval='prediction'. For a confidence interval on the mean of Y, use interval='confidence':

> new <- data.frame(X = 100)
> predict(res, new, se.fit=T, interval='prediction')
$fit
          fit      lwr      upr
[1,] 129.9045 63.94943 195.8595

$se.fit
[1] 3.693391

> predict(res, new, interval='confidence')
          fit      lwr      upr
[1,] 129.9045 122.5827 137.2262

Resistant line

The line() function computes a resistant line using a method credited to Tukey, but the results differ considerably from Figure 9.5.2. The functions MASS:lqs() and MASS:rlm() offer several methods to compute a resistant regression; the results vary widely when used with the dataset of Example 9.3.1. The closest match to the results given in Figure 9.5.2 is from MASS:lqs() with method='S':

> lqs(Y~X, data=at, method='S')
Call:
lqs.formula(formula = Y ~ X, data = at, method = "S")

Coefficients:
(Intercept)            X
   -208.833        3.293

Scale estimates 25.89

Correlation coefficient

The correlation coefficient, t-statistic and confidence interval are computed by cor.test(). Here are Example 9.7.1 and Example 9.7.2:

> sep=read.csv('EXA_C09_S07_01.csv')
> cor.test(~ HEIGHT + CV, sep)

        Pearson's product-moment correlation

data:  HEIGHT and CV
t = 19.7813, df = 153, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7967316 0.8869721
sample estimates:
     cor
0.847883

The correlation coefficient can also be computed as the square root of the Multiple R-Squared value from the summary of the linear model.
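As a quick check, using the Multiple R-Squared value of 0.67 from the summary of Example 9.3.1 above:

```r
r2 <- 0.67  # Multiple R-Squared from the model summary
sqrt(r2)    # correlation coefficient, about 0.8185
```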

Chapter 10 - Multiple Regression and Correlation

Computing a multiple linear regression is similar to a simple linear regression; the only change is to add another variable to the model formula. Here is Example 10.3.1:

> cda=read.csv('EXA_C10_S03_01.csv')
> res = lm(CDA ~ AGE + EDLEVEL, cda)
> summary(res)

Call:
lm(formula = CDA ~ AGE + EDLEVEL, data = cda)

Residuals:
     Min       1Q   Median       3Q      Max
-5.98039 -2.21247 -0.07607  2.28236  9.12296

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.49407    4.44297   1.237 0.220498
AGE         -0.18412    0.04851  -3.795 0.000316 ***
EDLEVEL      0.61078    0.13565   4.503 2.70e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.134 on 68 degrees of freedom
Multiple R-Squared: 0.3706,     Adjusted R-squared: 0.3521
F-statistic: 20.02 on 2 and 68 DF,  p-value: 1.454e-07

Evaluating the regression equation

The values for r-squared (Example 10.4.1), the F-statistic for r-squared (Example 10.4.2), the t-statistic and p-value for the coefficients (Example 10.4.3) and the standard error of the coefficients (for constructing confidence intervals) may all be read from the above summary.

R does not compute the ANOVA table in the same form as it is shown in the text (or at least I don't know how to do it). Here is the table for the cda data; compare with Figure 10.3.1:

> summary(aov(CDA ~ AGE + EDLEVEL, cda))
            Df Sum Sq Mean Sq F value    Pr(>F)
AGE          1 194.24  194.24  19.774 3.306e-05 ***
EDLEVEL      1 199.15  199.15  20.273 2.703e-05 ***
Residuals   68 667.97    9.82

To obtain the ANOVA table in Figure 10.3.1, the Sum Sq terms for AGE and EDLEVEL must be combined, then divided by the summed degrees of freedom.
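As a sketch, combining the values from the table above (the small discrepancy against the 20.02 reported by summary() is due to rounding of the printed values):

```r
# Combine the AGE and EDLEVEL rows of the aov() table into a single
# regression line, as in Figure 10.3.1
ss_reg <- 194.24 + 199.15  # combined regression sum of squares
df_reg <- 1 + 1            # combined degrees of freedom
ms_reg <- ss_reg / df_reg  # regression mean square
ms_reg / 9.82              # F statistic, about 20.03
```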

Confidence interval for Y and mu

As for the single regression, predicted values of y and mu with associated confidence intervals are computed with predict(). Here is Example 10.5.1:

> new = data.frame(AGE=68, EDLEVEL=12)
> predict(res, new, se.fit=T, interval='prediction')
$fit
           fit       lwr      upr
[1,] 0.3029173 -6.093481 6.699316

$se.fit
[1] 0.672222

> predict(res, new, se.fit=T, interval='confidence')
$fit
           fit       lwr      upr
[1,] 0.3029173 -1.038481 1.644315

$se.fit
[1] 0.672222

Multiple correlation

The correlation coefficient may be computed from the Multiple R-Squared value in the model summary. The F statistic is reported directly. Here is Example 10.6.1:

> bone=read.csv('EXA_C10_S06_01.csv')
> res=lm(W ~ P + S, bone)
> summary(res)

Call:
lm(formula = W ~ P + S, data = bone)

Residuals:
     Min       1Q   Median       3Q      Max
-33.9074 -19.5941  -0.5171  10.1592  76.8129

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.6138    29.1296   1.223  0.23245
P             1.4509     2.7632   0.525  0.60397
S             2.3960     0.7301   3.282  0.00294 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.42 on 26 degrees of freedom
Multiple R-Squared: 0.2942,     Adjusted R-squared: 0.2399
F-statistic: 5.419 on 2 and 26 DF,  p-value: 0.01078

> sqrt(0.2942)
[1] 0.5424021

Partial correlation coefficients may be computed from the residuals of the pairwise regressions using the method of Figure 10.6.3. This method yields p-values as well as the partial correlation coefficients. The function Rcmdr:partial.cor() computes the coefficients directly but does not include the p-values. Here is Example 10.6.2:

> cor.test(residuals(lm(W~P,bone)), residuals(lm(S~P, bone)))

        Pearson's product-moment correlation

data:  residuals(lm(W ~ P, bone)) and residuals(lm(S ~ P, bone))
t = 3.3442, df = 27, p-value = 0.002433
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2179076 0.7574559
sample estimates:
      cor
0.5411914

> cor.test(residuals(lm(W~S,bone)), residuals(lm(P~S, bone)))

        Pearson's product-moment correlation

data:  residuals(lm(W ~ S, bone)) and residuals(lm(P ~ S, bone))
t = 0.5351, df = 27, p-value = 0.597
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2743707  0.4519729
sample estimates:
      cor
0.1024358

> cor.test(residuals(lm(S~W,bone)), residuals(lm(P~W, bone)))

        Pearson's product-moment correlation

data:  residuals(lm(S ~ W, bone)) and residuals(lm(P ~ W, bone))
t = -0.6413, df = 27, p-value = 0.5267
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4679843  0.2554876
sample estimates:
       cor
-0.1224876

> library(Rcmdr)
> partial.cor(bone)
          W          P          S
W 0.0000000  0.1024358  0.5411914
P 0.1024358  0.0000000 -0.1224876
S 0.5411914 -0.1224876  0.0000000

Chapter 11 - Regression Analysis - Additional Techniques

Qualitative Independent Variables

A single qualitative variable coded with a number is treated the same as any other numeric variable. Here is Example 11.2.1:

> d11=read.csv('EXA_C11_S02_01.csv')
> res.d11 = lm(GRAMS ~ WEEKS + SMOKE, data=d11)
> summary(res.d11)

Call:
lm(formula = GRAMS ~ WEEKS + SMOKE, data = d11)

Residuals:
      Min        1Q    Median        3Q       Max
-1260.450  -265.799     9.262   311.526  1016.586

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1724.42     558.84  -3.086  0.00265 **
WEEKS         130.05      14.52   8.957 2.39e-14 ***
SMOKE        -294.40     135.78  -2.168  0.03260 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 484.6 on 97 degrees of freedom
Multiple R-squared: 0.4636,   Adjusted R-squared: 0.4525
F-statistic: 41.92 on 2 and 97 DF,  p-value: 7.594e-14

By the way, Figure 11.2.3 can be reproduced with the commands

> with(d11, plot(WEEKS, GRAMS, pch=SMOKE+16))
> coeff=res.d11$coefficients
> abline(coeff[1], coeff[2])
> abline(coeff[1]+coeff[3], coeff[2])

A variable with multiple response levels must be split into multiple variables and the cross terms included in the model formula. Here is Example 11.2.3:

> d13=read.csv('EXA_C11_S02_03.csv')
> d13$X2 = ifelse(d13$METHOD=='A', 1, 0)
> d13$X3 = ifelse(d13$METHOD=='B', 1, 0)
> summary(lm(EFFECT ~ AGE + X2 + X3 + AGE*X2 + AGE*X3, data=d13))

Call:
lm(formula = EFFECT ~ AGE + X2 + X3 + AGE * X2 + AGE * X3, data = d13)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4366 -2.7637  0.1887  2.9075  6.5634

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.21138    3.34964   1.854 0.073545 .
AGE          1.03339    0.07233  14.288 6.34e-15 ***
X2          41.30421    5.08453   8.124 4.56e-09 ***
X3          22.70682    5.09097   4.460 0.000106 ***
AGE:X2      -0.70288    0.10896  -6.451 3.98e-07 ***
AGE:X3      -0.50971    0.11039  -4.617 6.85e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.925 on 30 degrees of freedom
Multiple R-squared: 0.9143,   Adjusted R-squared: 0.9001
F-statistic: 64.04 on 5 and 30 DF,  p-value: 4.264e-15

Stepwise regression

The stats:step() and MASS:stepAIC() functions perform step-wise refinement of a model in a manner similar to that described in Section 11.3. The difference is the selection procedure: the book uses the F-statistic, while the R functions use the Akaike information criterion (AIC). The input to step() is a model such as one created by lm(). Here is Example 11.3.1 run with the stats:step() function:

> d14=read.csv('EXA_C11_S03_01.csv')
> step(lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data=d14))
Start:  AIC=106.22
Y ~ X1 + X2 + X3 + X4 + X5 + X6

       Df Sum of Sq     RSS    AIC
- X5    1      1.34  650.26 104.28
- X4    1      6.79  655.70 104.53
<none>               648.92 106.22
- X1    1     93.28  742.19 108.25
- X2    1     98.13  747.04 108.45
- X6    1    126.76  775.68 109.58
- X3    1   1172.30 1821.22 135.18

Step:  AIC=104.29
Y ~ X1 + X2 + X3 + X4 + X6

       Df Sum of Sq     RSS    AIC
- X4    1      5.58  655.84 102.54
<none>               650.26 104.28
- X2    1     97.40  747.66 106.47
- X1    1     97.75  748.01 106.49
- X6    1    126.85  777.11 107.63
- X3    1   1192.66 1842.92 133.54

Step:  AIC=102.54
Y ~ X1 + X2 + X3 + X6

       Df Sum of Sq     RSS    AIC
<none>               655.84 102.54
- X1    1     95.23  751.07 104.61
- X2    1     99.01  754.84 104.76
- X6    1    135.14  790.98 106.16
- X3    1   1466.29 2122.13 135.77

Call:
lm(formula = Y ~ X1 + X2 + X3 + X6, data = d14)

Coefficients:
(Intercept)           X1           X2           X3           X6
    23.7359       0.2142      -0.1901       0.8254      -0.4282

Logistic regression

Logistic regression uses generalized linear models and the stats:glm() function. (See also rms:lrm() and rms:residuals(), which seem to offer many more options.)

To use summary data, the dependent variable of the glm formula is coded as a two-column matrix whose columns give the numbers of successes and failures, in that order. To duplicate the book's results, "OCAD not present" is coded as a success. Here is Example 11.4.1 worked using glm():

> present = c(92, 15)
> not.present = c(21, 20)
> ocad = cbind(not.present=not.present, present=present)
> sex = c(0, 1)
> cbind(ocad, sex)
     not.present present sex
[1,]          21      92   0
[2,]          20      15   1

> res.glm = glm(ocad ~ sex, family = binomial)
> summary(res.glm)

Call:
glm(formula = ocad ~ sex, family = binomial)

Deviance Residuals:
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.4773     0.2418  -6.108 1.01e-09 ***
sex           1.7649     0.4185   4.217 2.47e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  1.8362e+01  on 1  degrees of freedom
Residual deviance: -6.8834e-15  on 0  degrees of freedom
AIC: 12.686

Number of Fisher Scoring iterations: 3

The B value (estimate) for sex is the log odds ratio. To compute the odds ratio use exp():

> exp(1.765)
[1] 5.8416

The B value for sex means that the odds of a woman not having OCAD are 5.84 times the odds of a man not having OCAD. The book states this as the odds of a man having OCAD are 5.84 times the odds of a woman having it; this is an equivalent statement.

To confirm the odds ratio from the raw data:

> pMNotPresent = 21/(21+92)
> oMNotPresent = pMNotPresent / (1-pMNotPresent)
> oMNotPresent
[1] 0.22826

> pFNotPresent = 20/(20+15)
> oFNotPresent = pFNotPresent / (1-pFNotPresent)
> oFNotPresent
[1] 1.3333

> oFNotPresent/oMNotPresent
[1] 5.8413

Chapter 12 - The Chi-Square Distribution

Goodness-of-fit

To perform a goodness-of-fit test in R, the expected relative frequency table must be created. Then the chisq.test() and pchisq() functions compute the desired statistics. Note that the p= parameter to chisq.test() is the expected relative frequency, not the expected frequency. Here is Example 12.3.1:

# Enter the data
> actual=scan()
1: 1 3 8 18 6 4 4 3
9:
Read 8 items

# Create the expected data
> endpts = seq(125, 275, by=25)
> endpts
[1] 125 150 175 200 225 250 275
> expected=c(pnorm(125, 198.67, 41.31),
+ diff(pnorm(endpts, 198.67, 41.31)),
+ 1-pnorm(275, 198.67, 41.31))
> expected
[1] 0.03726504 0.08210035 0.16396211 0.22951446 0.22521804
[6] 0.15492430 0.07469547 0.03232023

> chisq.test(actual, p=expected)

        Chi-squared test for given probabilities

data:  actual
X-squared = 10.3246, df = 7, p-value = 0.1709

# The p-value is incorrect because the degrees of freedom is wrong:
# chisq.test() uses df = 7, but two parameters (mean and sd) were
# estimated from the data, so the correct df is 8 - 1 - 2 = 5.
# Here is the correct p-value
> 1-pchisq(10.3246, 5)
[1] 0.06654326

Example 12.3.2 is similar:

> actual = scan()
1: 11 8 10 10 15 17 10 10 9 0
11:
Read 10 items

> expected = c(pbinom(1, 25, .2),
+ diff(pbinom(seq(1, 9), 25, .2)), 1-pbinom(9, 25, .2))
> expected
 [1] 0.02738973 0.07083550 0.13576804 0.18668105 0.19601510
 [6] 0.16334592 0.11084187 0.06234855 0.02944237 0.01733187

> chisq.test(actual, p=expected)

        Chi-squared test for given probabilities

data:  actual
X-squared = 47.6777, df = 9, p-value = 2.934e-07

# Again chisq.test() reports too many degrees of freedom; correct df is 8
> 1-pchisq(47.6777, 8)
[1] 1.138480e-07

Tests of independence and homogeneity

The chisq.test() function can be used to test for independence of entries in a contingency table. The data must be in a matrix. Here is Example 12.4.1:

> data = matrix(c(260, 15, 7, 299, 41, 14), ncol=2)
> data
     [,1] [,2]
[1,]  260  299
[2,]   15   41
[3,]    7   14
> chisq.test(data)

        Pearson's Chi-squared test

data:  data
X-squared = 9.0913, df = 2, p-value = 0.01061

The TeachingDemos:chisq.detail() function shows the details of the calculation, including marginal totals, the expected frequency table and the contribution of each table entry to the chi-squared statistic:

> library(TeachingDemos)
> chisq.detail(data)

observed
expected

                    Total
      260    299      559
      247.86 311.14

       15     41       56
       24.83  31.17

        7     14       21
        9.31  11.69

Total 282    354      636

Cell Contributions

 0.59 + 0.47 +
 3.89 + 3.10 +
 0.57 + 0.46 = 9.09

df = 2  P-value = 0.011

Calculation of the chi-squared test for a 2x2 contingency table is similar. By default R applies the Yates continuity correction. Here is Example 12.4.2:

> data = matrix(c(131, 14, 52, 36), ncol=2)
> data
     [,1] [,2]
[1,]  131   52
[2,]   14   36
> chisq.test(data, correct=F)

Pearson's Chi-squared test

data:  data
X-squared = 31.7391, df = 1, p-value = 1.763e-08

> chisq.test(data)

Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 29.9118, df = 1, p-value = 4.522e-08

Not surprisingly, a chi-squared test for homogeneity is done the same way. Here is Example 12.5.1:

> data = matrix(c(21, 19, 75, 77), ncol=2)
> chisq.test(data, correct=F)

        Pearson's Chi-squared test

data:  data
X-squared = 0.1263, df = 1, p-value = 0.7223

The Fisher exact test

R uses the function fisher.test() to compute the Fisher exact test. The presentation of the results of this test is quite different from the presentation in the text; I'm not sure how to compare the two.
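For reference, fisher.test() accepts the same 2x2 matrix as chisq.test(). A minimal sketch, reusing the data from Example 12.5.1 (my choice of input, not the book's Fisher example):

```r
# 2x2 contingency table, rows = groups, columns = outcomes
data <- matrix(c(21, 19, 75, 77), ncol=2)
res <- fisher.test(data)
res$p.value   # two-sided p-value
res$estimate  # conditional MLE of the odds ratio
```

One likely source of the difference in presentation: fisher.test() reports a conditional maximum-likelihood estimate of the odds ratio rather than the simple cross-product ratio used in introductory texts, so its numbers need not match the book's.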

Relative risk and odds ratio

R has no functions to compute relative risk and odds ratios as shown in the text. The chi-squared statistic needed to compute confidence intervals for relative risk and odds ratios can be computed using chisq.test() in the same manner as shown above.

Note there is an error in the text of Example 12.7.1; the sum of the first column of Table 12.7.2 is 40, not 44, so the computed value of chi-squared is incorrect.

The epitools library contains functions riskratio() and oddsratio() but they seem to use more sophisticated methods than those in the text.
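The point estimates themselves are simple ratios, so they are easy to compute directly. A minimal sketch with hypothetical cell counts (the layout, rows = exposure status and columns = disease status, is my assumption, not the book's data):

```r
# Hypothetical 2x2 counts
n11 <- 30; n12 <- 70   # exposed:   diseased, not diseased
n21 <- 10; n22 <- 90   # unexposed: diseased, not diseased

# Relative risk: ratio of the two row proportions
relative.risk <- (n11 / (n11 + n12)) / (n21 / (n21 + n22))
relative.risk   # 3

# Odds ratio: the cross-product ratio, as in the text
odds.ratio <- (n11 * n22) / (n12 * n21)
odds.ratio      # 3.857143
```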

Mantel-Haenszel statistic

The Mantel-Haenszel statistic and common odds ratio may be computed with the mantelhaen.test() function. The data must be in a three-dimensional array with the last dimension corresponding to the strata. Here is Example 12.7.3; the dimension names are not required but aid in verifying the data entry:

> data = array(c(21, 16, 11, 6, 50, 18, 14, 6),
+   dim=c(2, 2, 2),
+   dimnames=list(Hypertension=c('Present', 'Absent'),
+                 OCAD=c('Case', 'Noncase'),
+                 Stratum=c('55 and under', 'over 55')))
> data
, , Stratum = 55 and under

            OCAD
Hypertension Case Noncase
     Present   21      11
     Absent    16       6

, , Stratum = over 55

            OCAD
Hypertension Case Noncase
     Present   50      14
     Absent    18       6

> mantelhaen.test(data, corr=F)

Mantel-Haenszel chi-squared test without continuity correction

data:  data
Mantel-Haenszel X-squared = 0.0243, df = 1, p-value = 0.8762
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.4184851 2.1018264
sample estimates:
common odds ratio
        0.9378609

Tests for equal variance and normality

Levene's test for equal variance

The text shows how to test the ratio of two variances using the F distribution. Another way to test the null hypothesis that the variances of two samples are equal is with Levene's test (car:levene.test(); in recent versions of the car package this is named leveneTest()). levene.test() requires a single data vector and a factor, so the stack() function can be helpful. For example, using the data from Example 8.2.1:

> sel=read.csv('EXA_C08_S02_01.csv')
> levene.test(values ~ ind, stack(sel))
Levene's Test for Homogeneity of Variance
       Df F value    Pr(>F)
group   3  10.623 2.452e-06 ***
      140

This test rejects the null hypothesis that the variances are equal.

Tests for normality

Two common tests for normality of a data set are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. The Shapiro-Wilk test is preferred for small data sets (<30 items) and the Kolmogorov-Smirnov is preferred for larger data sets. These tests are both available in the stats package as stats:shapiro.test() and stats:ks.test().

The Kolmogorov-Smirnov test compares the distribution of two samples or compares a sample to a specified distribution. You must specify the distribution to compare against. For example:

> ks.test(data, 'pnorm', mean, sd)

Note that if the mean and sd of the sample data are used, the p-value computed by ks.test() is not valid. Instead, use nortest:lillie.test() which computes the same test statistic with the correct p-value.
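A minimal sketch of both tests on simulated data (rnorm() stands in for a real sample; the mean and sd are arbitrary):

```r
set.seed(42)                       # reproducible sample
x <- rnorm(25, mean=100, sd=15)

# Shapiro-Wilk: needs no distribution parameters
shapiro.test(x)

# Kolmogorov-Smirnov against a fully specified normal distribution
ks.test(x, 'pnorm', 100, 15)

# Plugging in mean(x) and sd(x) here instead would invalidate the
# p-value; use nortest's lillie.test(x) in that case
```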

 
© Kent S Johnson Creative Commons License
