2006-10-04 18:16:36
Using R in Biostatistics
I am taking a course in biostatistics based on the book Biostatistics: A Foundation for Analysis in the Health Sciences by Wayne W. Daniel. At the same time I am learning to use the R statistics package. This page collects my observations.
The main R website is http://www.r-project.org/.
A number of books are listed here:
http://cran.r-project.org/doc/FAQ/R-FAQ.html#What-documentation-exists-for-R_003f
The book I bought is Using R for Introductory Statistics by John Verzani. This book has helped me learn my way around R. Verzani's book grew out of a fairly extensive document called Simple R which is available on-line. In part it was this document that convinced me to buy the book.
Another good online introduction is Using R for Data Analysis and Graphics by J H Maindonald. I think this is also available in a book-length version.
The documentation and help that come with R are extensive, but I didn't find them very accessible as a beginner. The other tutorials provide a better route in my opinion. Once you start to know your way around a little, the help is very useful.
It can be hard to find what you need. The table of contents for the base package is over seven pages; the stats package is another six. You can search the help with help.search('search term'). You can search the R website and help mailing list with the menu command Help / search.r-project.org.... These are helpful when you just need to know the name of the function to use.
If you know the name of a function, type ?func for full help, or args(func) for a list of arguments. To search the help for func, type help.search('func'). For a list of the contents of a library, type library(help=libname). Note: In what follows, I cite R functions in the form library:function(), for example BSDA:zsum.test().
The library UsingR has a very nice function simple.hist.and.boxplot() which draws a histogram with rug and an accompanying box-and-whisker plot. It will also draw a frequency polygon with simple.freqpoly(), but the endpoints are not right: they are half an interval short. For a histogram or box plot alone use hist() and boxplot().
Rcmdr:stem.leaf() draws a stem-and-leaf plot like the ones in the book. The default stem() function draws a simpler version.
mean() and var() compute mean and variance. summary() shows quartiles.
R has built-in functions to calculate many standard distributions. For each distribution xx, R has four functions: dxx, pxx, qxx and rxx. A few of the distributions available are
- binom - binomial distribution
- norm - normal distribution
- pois - Poisson distribution
- t - Student's t distribution
- chisq - chi-square distribution
- f - F distribution
The 'd' function computes the density of the distribution at a given quantile. This is the height of the curve for a given x value. For example, the height of the normal distribution at z=0 is 0.3989423:
> dnorm(0)
[1] 0.3989423
The 'p' function computes the probability at a given quantile. This is the area under the curve on one side of the quantile. The default is to give the area to the left of the quantile, but if you give the argument lower.tail=FALSE it will give the area to the right. For example, z=0 divides the normal distribution in half and z=1.64 leaves about 95% of the area to the left:
> pnorm(0)
[1] 0.5
> pnorm(1.64)
[1] 0.9494974
The 'q' function is the inverse of the 'p' function. Given a probability, it returns the corresponding quantile:
> qnorm(.95)
[1] 1.644854
The 'r' function picks random values:
> rnorm(3)
[1] 1.6623277 -1.0355470 -0.9379454
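The same four-function pattern applies to every distribution in the list. Here is a quick sketch using the binomial distribution (my own numbers, not a book example):

```r
# Binomial distribution with n = 10 trials, success probability 0.5
dbinom(3, size = 10, prob = 0.5)    # density: P(X = 3), 0.1171875
pbinom(3, size = 10, prob = 0.5)    # probability: P(X <= 3), 0.171875
qbinom(0.5, size = 10, prob = 0.5)  # quantile: smallest k with P(X <= k) >= 0.5, which is 5
rbinom(5, size = 10, prob = 0.5)    # five random draws
```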
You can pass vectors to these functions. This gives a simple way to find the area between two z values. For example P(.84 <= z <= 2.45) can be computed as:
> diff(pnorm(c(.84, 2.45)))
[1] 0.1933114
The normal distribution functions are not constrained to the standard normal, you can give them the mean and standard deviation of the distribution. For example the probability of 3 or less in a distribution with mean 5.4 and standard deviation 1.3 (Example 4.7.1) is easily obtained:
> pnorm(3, 5.4, 1.3)
[1] 0.03243494
What is the probability that a random sample of size 10 from a population with mean 185.6 and standard deviation 12.7 will have a mean greater than 190? R doesn't have a function to compute standard error, but it does have a built-in square root function:
> pnorm(190, 185.6, 12.7/sqrt(10), lower=F)
[1] 0.1366286
Alternatively you can use BSDA:zsum.test(). In this case the null hypothesis is that mu=185.6 and the p-value is the desired answer:
> library(BSDA)
> zsum.test(190, 12.7, 10, mu=185.6, alt='g')
One-sample z-Test
data: Summarized x
z = 1.0956, p-value = 0.1366
alternative hypothesis: true mean is greater than 185.6
95 percent confidence interval:
183.3941 NA
sample estimates:
mean of x
190
For the difference between two sample means R doesn't give much support; you have to calculate the difference of means and the standard error yourself, then use pnorm() to compute the desired probability:
> pnorm(20, 45-30, sqrt(15^2/35 + 20^2/40), lower=F)
[1] 0.1086783
There may be a way to do this with BSDA:zsum.test() but I haven't figured it out.
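Whether or not zsum.test() can be coaxed into it, the underlying calculation is easy to make explicit; this sketch just restates the pnorm() call above in terms of a z-score:

```r
# P(xbar1 - xbar2 >= 20) when the true difference of means is 45 - 30 = 15
se <- sqrt(15^2/35 + 20^2/40)   # standard error of the difference
z  <- (20 - 15) / se            # z-score of the observed difference
pnorm(z, lower.tail = FALSE)    # 0.1086783, matching the result above
```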
These are similar to the two examples above: compute the standard error yourself, then use pnorm() for the final answer. Questions about a single sample proportion can also be answered with prop.test(); for example here is Example 5.5.2:
> prop.test(.45*200, 200, .51, alt='l',corr=F)
1-sample proportions test without continuity correction
data: 0.45 * 200 out of 200, null probability 0.51
X-squared = 2.8812, df = 1, p-value = 0.04481
alternative hypothesis: true p is less than 0.51
95 percent confidence interval:
0.0000000 0.5081466
sample estimates:
p
0.45
To match the book result I turned off the Yates continuity correction.
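As a cross-check on prop.test(), its X-squared statistic is just the square of the hand-computed z statistic from the text (numbers from Example 5.5.2):

```r
# One-sample proportion z statistic, computed by hand
phat <- 0.45; p0 <- 0.51; n <- 200
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
pnorm(z)   # one-sided p-value, about 0.0448
z^2        # about 2.88, matching prop.test's X-squared
```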
Several functions are available to create confidence intervals for the value of a mean. The choice of which to use depends on whether you have raw or summary data, and whether you want to use a z-statistic or t-statistic. The functions are summarized here:
Type of data | z-statistic      | t-statistic
Raw          | BSDA:z.test()    | t.test()
Summarized   | BSDA:zsum.test() | BSDA:tsum.test()
Example 6.2.1 uses a z-statistic with summarized data. Note that the problem supplies the variance and zsum.test() needs the standard deviation:
> zsum.test(22,sqrt(45),10)
95 percent confidence interval:
17.84229 26.15771
Example 6.2.2 is similar but uses a different confidence level:
> zsum.test(84.3, sqrt(144), 15, conf.level=.99)
99 percent confidence interval:
76.31908 92.28092
In Example 6.3.1 the sample standard deviation is given so a t-statistic is used:
> tsum.test(250.8, 130.9, 19)
95 percent confidence interval:
187.7082 313.8918
To find reliability factors directly use qnorm() and qt().
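For instance, the 95% factors behind Example 6.2.1 (z-statistic) and Example 6.3.1 (t-statistic, n = 19):

```r
qnorm(0.975)        # 1.959964, the z reliability factor for a 95% interval
qt(0.975, df = 18)  # 2.100922, the t factor for 95% with n = 19
```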
The same four functions are used to compute confidence intervals for the difference between two means. z.test(), t.test(), zsum.test() and tsum.test() can all take two sets of values as parameters.
Example 6.4.1:
> zsum.test(4.5, sqrt(1), 12, 3.4, sqrt(1.5), 15)
95 percent confidence interval:
0.2607936 1.9392064
Here the confidence interval is for the difference between the two means.
Example 6.4.3: tsum.test() has a parameter var.equal which specifies whether the variances are assumed equal:
> tsum.test(4.7, 9.3, 18, 8.8, 11.5, 10, var.equal=T)
95 percent confidence interval:
-12.301022 4.101022
Example 6.4.4: When the variances are not equal, R uses the Welch method to compute the confidence intervals. The Welch method is briefly described in the Wikipedia entry on Welch's t test, and it gives a different result than the Cochran method used in the text:
> tsum.test(4.7, 9.3, 18, 8.8, 11.5, 10, var.equal=F)
95 percent confidence interval:
-13.118585 4.918585
The confidence interval for a population proportion or the difference between two population proportions is computed with prop.test(). There are some differences between prop.test() and the method shown in the text.
First, prop.test() uses a different formula for the standard error. A paper by Newcombe compares seven methods: he calls the method used in the text the simple asymptotic method and the method used by R the Wilson 'score' method, each with an optional continuity correction. In his discussion he strongly recommends that the simple asymptotic method be limited to simple uses such as teaching and sample size estimation, and finds that the Wilson 'score' method performs well.
Second, R applies the continuity correction (p. 147 in the textbook) automatically in some cases. You can control this explicitly with the correct= parameter to prop.test().
With those caveats, the confidence interval is easy to calculate. Note that the arguments to prop.test() are raw counts, not the sample proportion. Example 6.5.1 looks like this:
> prop.test(.18*1220, 1220)
95 percent confidence interval:
0.1590672 0.2029797
A confidence interval for the difference between population proportions is also calculated using prop.test(). The above notes about prop.test() apply here as well. The sample numbers are given as two vectors; the c() function can be used to build them. Example 6.6.1 looks like this:
> prop.test(c(31, 53), c(68, 255), conf.level=.99)
99 percent confidence interval:
0.06994455 0.42613388
The BSDA:nsize() function can be used to determine required sample size for estimating means or proportions. Example 6.7.1 becomes:
> nsize(5, 20)
The required sample size (n) to estimate the population
mean with a 0.95 confidence interval so that the margin
of error is no more than 5 is 62 .
and Example 6.8.1 is:
> nsize(0.05, p=.35, type='pi')
The required sample size (n) to estimate the population
proportion of successes with a 0.95 confidence interval
so that the margin of error is no more than 0.05 is 350 .
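nsize() presumably just implements the textbook formulas n = (z*sigma/E)^2 for a mean and n = z^2*p*(1-p)/E^2 for a proportion; both answers above are easy to verify directly:

```r
z <- qnorm(0.975)                     # 95% reliability factor
ceiling((z * 20 / 5)^2)               # mean: sigma = 20, margin of error 5, gives 62
ceiling(z^2 * 0.35 * 0.65 / 0.05^2)   # proportion: p = .35, margin .05, gives 350
```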
As far as I know R doesn't have a built-in function to compute a confidence interval for a single variance. It does have qchisq() which can be used to find the correct reliability coefficients. For example, using values from Example 6.9.1:
> alpha=.05
> qchisq(1-alpha/2, df=6)
[1] 14.44938
> qchisq(alpha/2, df=6)
[1] 1.237344
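Those coefficients combine with the sample variance in the usual way. Here is a small helper; the s2 value below is made up, not the one from Example 6.9.1:

```r
# CI for a population variance: (df * s2 / chi.upper, df * s2 / chi.lower)
ci.var <- function(s2, df, alpha = 0.05) {
  df * s2 / qchisq(c(1 - alpha/2, alpha/2), df)
}
ci.var(s2 = 100, df = 6)   # uses the same df = 6 coefficients shown above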
R does have var.test() to compute an F-test for ratio of two variances but it works only from raw data so it cannot be used to solve Example 6.10.1. The qf() function gives reliability coefficients from the F-distribution:
> qf(.025, 15, 3)
[1] 0.2408012
> qf(.975, 15, 3)
[1] 14.25271
The functions used to create confidence intervals for the mean are also used for hypothesis testing the value of a mean. This is not too surprising as one way to test a hypothesis is to create a confidence interval for the sampled mean. The table above also applies here. An additional parameter gives the value of the mean in the null hypothesis.
Example 7.2.1 uses a z-statistic with summarized data. This is computed with BSDA:zsum.test(). The parameters are the mean, standard deviation and count of the sample, and the mean being tested against in the null hypothesis:
> zsum.test(27, sqrt(20), 10, mu=30)
One-sample z-Test
data: Summarized x
z = -2.1213, p-value = 0.03389
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
24.22819 29.77181
sample estimates:
mean of x
27
The output shows the computed z-statistic -2.1213, the p-value and the confidence interval. R does not show the rejection region; the null hypothesis is accepted or rejected by checking whether the hypothesized mean lies within the computed confidence interval.
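The reported z and p-value are easy to reproduce by hand from the summary numbers (sample variance 20, n = 10):

```r
z <- (27 - 30) / sqrt(20 / 10)   # -2.1213, matching zsum.test()
2 * pnorm(z)                     # two-sided p-value, 0.03389
```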
Example 7.2.2 is a one-sided hypothesis test with the same data. This is easily computed by adding alt='less' to the parameters to zsum.test():
> zsum.test(27, sqrt(20), 10, mu=30, alt='less')
One-sample z-Test
data: Summarized x
z = -2.1213, p-value = 0.01695
alternative hypothesis: true mean is less than 30
95 percent confidence interval:
NA 29.32617
sample estimates:
mean of x
27
Example 7.2.3 uses a t-statistic with raw data. This is computed with t.test(). The scan() function is a convenient way to enter small data sets by hand. (Larger data sets are better read from a file using read.table() or read.csv().)
> days = scan()
1: 14 9 18 26 12
6: 0 10 4 8 21
11: 28 24 24 2 3 14 9
18:
Read 17 items
> t.test(days, mu=15)
One Sample t-test
data: days
t = -0.7915, df = 16, p-value = 0.4402
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
8.72508 17.86315
sample estimates:
mean of x
13.29412
In this case the null hypothesis cannot be rejected because the mean of 15 falls in the 95% confidence interval for the true mean of the data set.
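As a cross-check, the t statistic can be rebuilt by hand from the same 17 values:

```r
days <- c(14, 9, 18, 26, 12, 0, 10, 4, 8, 21, 28, 24, 24, 2, 3, 14, 9)
t.stat <- (mean(days) - 15) / (sd(days) / sqrt(length(days)))
t.stat                    # -0.7915, as reported by t.test()
2 * pt(t.stat, df = 16)   # 0.4402, the two-sided p-value
```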
Once again z.test(), t.test(), zsum.test() and tsum.test() are used. This is similar to calculating confidence intervals for the difference between two means.
Example 7.3.2 uses the same data as Example 6.4.1, and the same command is used for the hypothesis test. Since the null hypothesis is that the difference is 0, it does not need to be specified. The full output is:
> zsum.test(4.5, sqrt(1), 12, 3.4, sqrt(1.5), 15)
Two-sample z-Test
data: Summarized x and y
z = 2.569, p-value = 0.01020
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2607936 1.9392064
sample estimates:
mean of x mean of y
4.5 3.4
Example 7.4.1
The data for this example can be downloaded from the Student Companion website and read using read.csv():
> data = read.csv(file='EXA_C07_S04_01.csv')
Then this becomes a single-sample t-test on the difference between the two data sets:
> with(data, t.test(POSTOP-PREOP, alt='greater'))
One Sample t-test
data: POSTOP - PREOP
t = 1.9159, df = 11, p-value = 0.04086
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
1.131919 Inf
sample estimates:
mean of x
18.075
Alternatively, use the paired=TRUE parameter to t.test(), which yields the same result:
with(data, t.test(POSTOP, PREOP, paired=T, alt='greater'))
prop.test() may be used for hypothesis testing of a single population proportion or the difference between two proportions. Here is Example 7.5.1:
> prop.test(24, 301, p=.063, alt='g', corr=F)
1-sample proportions test without continuity correction
data: 24 out of 301, null probability 0.063
X-squared = 1.4279, df = 1, p-value = 0.1161
alternative hypothesis: true p is greater than 0.063
95 percent confidence interval:
0.05763846 1.00000000
sample estimates:
p
0.07973422
Example 7.6.1:
> prop.test(c(11, 24), c(29, 44), alt='l', corr=F)
2-sample test for equality of proportions without continuity
correction
data: c(11, 24) out of c(29, 44)
X-squared = 1.9333, df = 1, p-value = 0.0822
alternative hypothesis: less
95 percent confidence interval:
-1.00000000 0.02675495
sample estimates:
prop 1 prop 2
0.3793103 0.5454545
Here again R doesn't give much help. qchisq() and qf() can be used to obtain critical values, and var.test() will test the ratio of two variances given the raw data.
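var.test() itself is straightforward when raw data are available. A sketch on simulated data (not a book example):

```r
# F test of H0: the two population variances are equal
set.seed(1)
x <- rnorm(20, sd = 2)   # simulated sample 1
y <- rnorm(25, sd = 3)   # simulated sample 2
var.test(x, y)           # reports F, df, p-value and a CI for the ratio
```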
The functions oneway.test() and aov() both perform an F test as described in Section 8.2. The output of aov() is very similar to Table 8.2.4.
Both functions need the data as a single array and a factor. The stack() function will convert a data frame containing multiple datasets into the required form. Below is Example 8.2.1. (Note: The data set available online differs from the one in the book so the answer here differs from the book.)
# Read the data set
> sel=read.csv('EXA_C08_S02_01.csv')
# Convert to stacked form
> d=stack(sel)
> names(d)
[1] "values" "ind"
# ANOVA using oneway.test()
> oneway.test(values~ind, d,var.equal=T)
One-way analysis of means
data: values and ind
F = 22.6142, num df = 3, denom df = 140, p-value = 5.345e-12
# ANOVA using aov()
> res = aov(values~ind, d)
> summary(res)
Df Sum Sq Mean Sq F value Pr(>F)
ind 3 18935 6312 22.614 5.345e-12 ***
Residuals 140 39074 279
oneway.test() can also use the Welch approximation to perform an F test without the assumption of equal variances.
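A toy sketch (made-up numbers, not the book data) showing stack() and both forms of the test:

```r
# Three groups in wide form, then stacked into values/ind
sel <- data.frame(A = c(5, 7, 6, 9), B = c(8, 10, 11, 9), C = c(12, 14, 13, 15))
d <- stack(sel)                                  # columns: values, ind
oneway.test(values ~ ind, d, var.equal = TRUE)   # classical one-way ANOVA F
oneway.test(values ~ ind, d)                     # Welch approximation (the default)
```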
The Tukey HSD test can be computed from the result of aov() using TukeyHSD():
> TukeyHSD(aov(values~ind, d), ordered=T)
Tukey multiple comparisons of means
95% family-wise confidence level
factor levels have been ordered
Fit: aov(formula = values ~ ind, data = d)
$ind
diff lwr upr p adj
RRB-VEN 2.646857 -7.737053 13.03077 0.9108599
SQU-VEN 11.541505 2.567647 20.51536 0.0057777
NRB-VEN 36.170840 24.160828 48.18085 0.0000000
SQU-RRB 8.894648 -1.030122 18.81942 0.0961032
NRB-RRB 33.523982 20.787762 46.26020 0.0000000
NRB-SQU 24.629335 13.014007 36.24466 0.0000010
The first column is the difference between means, the second and third columns are the lower and upper bounds of the confidence interval for the difference, and the last column is an adjusted p-value. If the confidence interval includes 0 the means may be equal. The confidence intervals may be plotted using plot(TukeyHSD(...)).
Two-way analysis of variance is very similar to the one-way analysis. An extra term is added to the model formula to represent the second factor. Here is Example 8.3.1, with data entered by hand:
# Data for example 8.3.1
> method =rep(c('A', 'B','C'), 5)
> age=rep(c('U20', '20', '30', '40', '50'), rep(3,5))
> days=scan()
1: 7 9 10 8 9 10 9 9 12 10 9 12 11 12 14
16:
Read 15 items
> data = data.frame(list(days=days, age=age, method=method))
# xtabs() shows data as a table
> xtabs(days ~ age + method, data)
method
age A B C
20 8 9 10
30 9 9 12
40 10 9 12
50 11 12 14
U20 7 9 10
# Two-way ANOVA
> res = aov(days ~ method + age, data)
> summary(res)
Df Sum Sq Mean Sq F value Pr(>F)
method 2 18.5333 9.2667 21.385 0.0006165 ***
age 4 24.9333 6.2333 14.385 0.0010017 **
Residuals 8 3.4667 0.4333
> TukeyHSD(res, "method")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = days ~ method + age, data = data)
$method
diff lwr upr p adj
B-A 0.6 -0.5896489 1.789649 0.3666717
C-A 2.6 1.4103511 3.789649 0.0006358
C-B 2.0 0.8103511 3.189649 0.0034083
Example 8.4.1 is similar. Note that the raw category data from read.csv() must be converted to factors:
> health =read.csv('EXA_C08_S04_01.csv')
> summary(aov(FUNC ~ factor(SUBJ) + factor(TIME), data=health))
Df Sum Sq Mean Sq F value Pr(>F)
SUBJ 17 20237.5 1190.4 8.1998 2.179e-09 ***
TIME 3 2395.8 798.6 5.5008 0.002371 **
Residuals 51 7404.2 145.2
For a factorial experiment where the factors may interact, a different model formula is used, with A*B instead of A+B. Here is Example 8.5.2:
> minutes =read.csv('EXA_C08_S04_01.csv')
> summary(aov(HOME ~ factor(A) * factor(B), minutes))
Df Sum Sq Mean Sq F value Pr(>F)
factor(A) 3 2992.45 997.48 67.9427 < 2.2e-16 ***
factor(B) 3 1201.05 400.35 27.2695 1.763e-11 ***
factor(A):factor(B) 9 608.45 67.61 4.6049 0.0001047 ***
Residuals 64 939.60 14.68
R has great support for linear regression and related operations such as plotting. The lm() function creates a linear model. Its argument is a formula. For example here is the code to reproduce Example 9.3.1:
# Read the data set
> at=read.csv('EXA_C09_S03_01.csv')
# Scatter plot
> plot(Y ~ X, data=at)
# Compute the linear model
> res=lm(Y~X, data=at)
> res
Call:
lm(formula = Y ~ X, data = at)
Coefficients:
(Intercept) X
-215.981 3.459
# More information is available in the summary:
> summary(res)
Call:
lm(formula = Y ~ X, data = at)
Residuals:
Min 1Q Median 3Q Max
-107.288 -19.143 -2.939 16.376 90.342
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -215.9815 21.7963 -9.909 <2e-16 ***
X 3.4589 0.2347 14.740 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.06 on 107 degrees of freedom
Multiple R-Squared: 0.67, Adjusted R-squared: 0.667
F-statistic: 217.3 on 1 and 107 DF, p-value: < 2.2e-16
# Add the regression line to the plot
> abline(res)
The coefficient of determination (Multiple R-Squared), adjusted coefficient of determination (Adjusted R-squared) and F-statistic (Example 9.4.1) are all shown at the end of the summary above. The full analysis of variance table can be shown using aov():
> summary(aov(Y~X,at))
Df Sum Sq Mean Sq F value Pr(>F)
X 1 237549 237549 217.28 < 2.2e-16 ***
Residuals 107 116982 1093
The t-statistic and p-value for beta (Example 9.4.2) are shown in the second line of the coefficients table.
The standard error of the slope (0.2347), used to create a confidence interval for the slope, is shown in the above output under the heading Std. Error.
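For the slope interval itself, confint() does the arithmetic. Here is a sketch on made-up data (the book's dataset is not reproduced here):

```r
# Simulated regression with a known true slope of 2
set.seed(42)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- lm(y ~ x)
confint(fit, "x", level = 0.95)  # 95% confidence interval for the slope
```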
Predicted values of y and mu with associated confidence intervals are computed with predict(). For a confidence interval on Y, use interval='prediction'. For a confidence interval on the mean of Y, use interval='confidence':
> new <- data.frame(X = 100)
> predict(res, new, se.fit=T, interval='prediction')
$fit
fit lwr upr
[1,] 129.9045 63.94943 195.8595
$se.fit
[1] 3.693391
> predict(res, new, interval='confidence')
fit lwr upr
[1,] 129.9045 122.5827 137.2262
The line() function computes a resistant line using a method credited to Tukey, but the results differ considerably from Figure 9.5.2. The functions MASS:lqs() and MASS:rlm() offer several methods to compute a resistant regression. The results vary widely when used with the dataset of Example 9.3.1. The closest match to the results given in Figure 9.5.2 is from MASS:lqs() with method='S':
> lqs(Y~X, data=at, method='S')
Call:
lqs.formula(formula = Y ~ X, data = at, method = "S")
Coefficients:
(Intercept) X
-208.833 3.293
Scale estimates 25.89
The correlation coefficient, t-statistic and confidence interval are computed by cor.test(). Here are Example 9.7.1 and Example 9.7.2:
> sep=read.csv('EXA_C09_S07_01.csv')
> cor.test(~ HEIGHT + CV, sep)
Pearson's product-moment correlation
data: HEIGHT and CV
t = 19.7813, df = 153, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7967316 0.8869721
sample estimates:
cor
0.847883
The correlation coefficient can also be computed as the square root of the Multiple R-Squared value from the summary of the linear model.
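A quick check of that identity on made-up numbers:

```r
# For a single predictor, R-squared is exactly the squared correlation
x  <- c(1, 2, 3, 4, 5)
y  <- c(2.1, 3.9, 6.2, 8.1, 9.8)
r  <- cor(x, y)
r2 <- summary(lm(y ~ x))$r.squared
all.equal(r, sqrt(r2))   # TRUE when the slope is positive
```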
Computing a multiple linear regression is similar to a simple linear regression; the only change is to add another variable to the model formula. Here is Example 10.3.1:
> cda=read.csv('EXA_C10_S03_01.csv')
> res = lm(CDA ~ AGE + EDLEVEL, cda)
> summary(res)
Call:
lm(formula = CDA ~ AGE + EDLEVEL, data = cda)
Residuals:
Min 1Q Median 3Q Max
-5.98039 -2.21247 -0.07607 2.28236 9.12296
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.49407 4.44297 1.237 0.220498
AGE -0.18412 0.04851 -3.795 0.000316 ***
EDLEVEL 0.61078 0.13565 4.503 2.70e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.134 on 68 degrees of freedom
Multiple R-Squared: 0.3706, Adjusted R-squared: 0.3521
F-statistic: 20.02 on 2 and 68 DF, p-value: 1.454e-07
The values for r-squared (Example 10.4.1), the F-statistic for r-squared (Example 10.4.2), the t-statistic and p-value for the coefficients (Example 10.4.3) and the standard error of the coefficients (for constructing confidence intervals) may all be read from the above summary.
R does not compute the ANOVA table in the same form as it is shown in the text (or at least I don't know how to do it). Here is the table for the cda data; compare with Figure 10.3.1:
> summary(aov(CDA ~ AGE + EDLEVEL, cda))
Df Sum Sq Mean Sq F value Pr(>F)
AGE 1 194.24 194.24 19.774 3.306e-05 ***
EDLEVEL 1 199.15 199.15 20.273 2.703e-05 ***
Residuals 68 667.97 9.82
To obtain the ANOVA table in Figure 10.3.1, the Sum Sq terms for AGE and EDLEVEL must be combined, then divided by the summed degrees of freedom.
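The arithmetic is simple enough to do inline with the numbers from the table above:

```r
ss.reg <- 194.24 + 199.15   # combined regression sum of squares, 393.39
ms.reg <- ss.reg / 2        # regression mean square (df = 1 + 1 = 2)
ms.reg / 9.82               # F statistic, about 20.03 (the summary showed 20.02)
```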
As for the single regression, predicted values of y and mu with associated confidence intervals are computed with predict(). Here is Example 10.5.1:
> new = data.frame(AGE=68, EDLEVEL=12)
> predict(res, new, se.fit=T, interval='prediction')
$fit
fit lwr upr
[1,] 0.3029173 -6.093481 6.699316
$se.fit
[1] 0.672222
> predict(res, new, se.fit=T, interval='confidence')
$fit
fit lwr upr
[1,] 0.3029173 -1.038481 1.644315
$se.fit
[1] 0.672222
The correlation coefficient may be computed from the Multiple R-Squared value in the model summary. The F statistic is reported directly. Here is Example 10.6.1:
> bone=read.csv('EXA_C10_S06_01.csv')
> res=lm(W ~ P + S, bone)
> summary(res)
Call:
lm(formula = W ~ P + S, data = bone)
Residuals:
Min 1Q Median 3Q Max
-33.9074 -19.5941 -0.5171 10.1592 76.8129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.6138 29.1296 1.223 0.23245
P 1.4509 2.7632 0.525 0.60397
S 2.3960 0.7301 3.282 0.00294 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.42 on 26 degrees of freedom
Multiple R-Squared: 0.2942, Adjusted R-squared: 0.2399
F-statistic: 5.419 on 2 and 26 DF, p-value: 0.01078
> sqrt(0.2942)
[1] 0.5424021
Partial correlation coefficients may be computed from the residuals of the pairwise regressions using the method of Figure 10.6.3. This method yields p-values as well as the partial correlation coefficients. The function Rcmdr:partial.cor() computes the coefficients directly but does not include the p-values. Here is Example 10.6.2:
> cor.test(residuals(lm(W~P,bone)), residuals(lm(S~P, bone)))
Pearson's product-moment correlation
data: residuals(lm(W ~ P, bone)) and residuals(lm(S ~ P, bone))
t = 3.3442, df = 27, p-value = 0.002433
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2179076 0.7574559
sample estimates:
cor
0.5411914
> cor.test(residuals(lm(W~S,bone)), residuals(lm(P~S, bone)))
Pearson's product-moment correlation
data: residuals(lm(W ~ S, bone)) and residuals(lm(P ~ S, bone))
t = 0.5351, df = 27, p-value = 0.597
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2743707 0.4519729
sample estimates:
cor
0.1024358
> cor.test(residuals(lm(S~W,bone)), residuals(lm(P~W, bone)))
Pearson's product-moment correlation
data: residuals(lm(S ~ W, bone)) and residuals(lm(P ~ W, bone))
t = -0.6413, df = 27, p-value = 0.5267
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4679843 0.2554876
sample estimates:
cor
-0.1224876
> library(Rcmdr)
> partial.cor(bone)
W P S
W 0.0000000 0.1024358 0.5411914
P 0.1024358 0.0000000 -0.1224876
S 0.5411914 -0.1224876 0.0000000
A single qualitative variable coded as a number is treated the same as any other numeric variable. Here is Example 11.2.1:
> d11=read.csv('EXA_C11_S02_01.csv')
> res.d11 = lm(GRAMS ~ WEEKS + SMOKE, data=d11)
> summary(res.d11)
Call:
lm(formula = GRAMS ~ WEEKS + SMOKE, data = d11)
Residuals:
Min 1Q Median 3Q Max
-1260.450 -265.799 9.262 311.526 1016.586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1724.42 558.84 -3.086 0.00265 **
WEEKS 130.05 14.52 8.957 2.39e-14 ***
SMOKE -294.40 135.78 -2.168 0.03260 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 484.6 on 97 degrees of freedom
Multiple R-squared: 0.4636, Adjusted R-squared: 0.4525
F-statistic: 41.92 on 2 and 97 DF, p-value: 7.594e-14
By the way, Figure 11.2.3 can be reproduced with the commands
> with(d11, plot(WEEKS, GRAMS, pch=SMOKE+16))
> coeff=res.d11$coefficients
> abline(coeff[1], coeff[2])
> abline(coeff[1]+coeff[3], coeff[2])
A variable with multiple response levels must be split into multiple variables and the cross terms included in the model formula. Here is Example 11.2.3:
> d13=read.csv('EXA_C11_S02_03.csv')
> d13$X2 = ifelse(d13$METHOD=='A', 1, 0)
> d13$X3 = ifelse(d13$METHOD=='B', 1, 0)
> summary(lm(EFFECT ~ AGE + X2 + X3 + AGE*X2 + AGE*X3, data=d13))
Call:
lm(formula = EFFECT ~ AGE + X2 + X3 + AGE * X2 + AGE * X3, data = d13)
Residuals:
Min 1Q Median 3Q Max
-6.4366 -2.7637 0.1887 2.9075 6.5634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.21138 3.34964 1.854 0.073545 .
AGE 1.03339 0.07233 14.288 6.34e-15 ***
X2 41.30421 5.08453 8.124 4.56e-09 ***
X3 22.70682 5.09097 4.460 0.000106 ***
AGE:X2 -0.70288 0.10896 -6.451 3.98e-07 ***
AGE:X3 -0.50971 0.11039 -4.617 6.85e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.925 on 30 degrees of freedom
Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
The stats:step() and MASS:stepAIC() functions perform step-wise refinement of a model in a manner similar to that described in section 11.3. The difference is the selection procedure: the book uses the F-statistic, while the R functions use the Akaike information criterion (AIC). The input to step() is a model such as created by lm(). Here is Example 11.3.1 run with the stats:step() function:
> d14=read.csv('EXA_C11_S03_01.csv')
> step(lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data=d14))
Start: AIC=106.22
Y ~ X1 + X2 + X3 + X4 + X5 + X6
Df Sum of Sq RSS AIC
- X5 1 1.34 650.26 104.28
- X4 1 6.79 655.70 104.53
<none> 648.92 106.22
- X1 1 93.28 742.19 108.25
- X2 1 98.13 747.04 108.45
- X6 1 126.76 775.68 109.58
- X3 1 1172.30 1821.22 135.18
Step: AIC=104.29
Y ~ X1 + X2 + X3 + X4 + X6
Df Sum of Sq RSS AIC
- X4 1 5.58 655.84 102.54
<none> 650.26 104.28
- X2 1 97.40 747.66 106.47
- X1 1 97.75 748.01 106.49
- X6 1 126.85 777.11 107.63
- X3 1 1192.66 1842.92 133.54
Step: AIC=102.54
Y ~ X1 + X2 + X3 + X6
Df Sum of Sq RSS AIC
<none> 655.84 102.54
- X1 1 95.23 751.07 104.61
- X2 1 99.01 754.84 104.76
- X6 1 135.14 790.98 106.16
- X3 1 1466.29 2122.13 135.77
Call:
lm(formula = Y ~ X1 + X2 + X3 + X6, data = d14)
Coefficients:
(Intercept) X1 X2 X3 X6
23.7359 0.2142 -0.1901 0.8254 -0.4282
Logistic regression uses generalized linear models and the stats:glm() function. (See also rms:lrm() and rms:residuals(), which seem to include many more options.) Here is Example 11.4.1 worked using glm():
To use summary data, the dependent variable of the glm formula is coded as a two-column matrix, with the first column giving the number of successes and the second the number of failures. To duplicate the book results, "OCAD not present" is coded as a success.
> present = c(92, 15)
> not.present = c(21, 20)
> ocad = cbind(not.present=not.present, present=present)
> sex = c(0, 1)
> cbind(ocad, sex)
not.present present sex
[1,] 21 92 0
[2,] 20 15 1
> res.glm = glm(ocad ~ sex, family = binomial)
> summary(res.glm)
Call:
glm(formula = ocad ~ sex, family = binomial)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4773 0.2418 -6.108 1.01e-09 ***
sex 1.7649 0.4185 4.217 2.47e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.8362e+01 on 1 degrees of freedom
Residual deviance: -6.8834e-15 on 0 degrees of freedom
AIC: 12.686
Number of Fisher Scoring iterations: 3
The B value (estimate) for sex is the log odds ratio. To compute the odds ratio use exp():
> exp(1.765)
[1] 5.8416
The B value for sex means that the odds of a woman not having OCAD are 5.84 times the odds of a man not having OCAD. The book states this as the odds of a man having OCAD are 5.84 times the odds of a woman having it; this is an equivalent statement.
To confirm the odds ratio from the raw data:
> pMNotPresent = 21/(21+92)
> oMNotPresent = pMNotPresent / (1-pMNotPresent)
> oMNotPresent
[1] 0.22826
> pFNotPresent = 20/(20+15)
> oFNotPresent = pFNotPresent / (1-pFNotPresent)
> oFNotPresent
[1] 1.3333
> oFNotPresent/oMNotPresent
[1] 5.8413
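A Wald confidence interval for the odds ratio can be had by exponentiating the interval for the coefficient (estimate and standard error copied from the summary output above):

```r
est <- 1.7649   # coefficient for sex, from summary(res.glm)
se  <- 0.4185   # its standard error
exp(est + c(-1, 1) * qnorm(0.975) * se)   # roughly 2.57 to 13.27
```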
To perform a goodness-of-fit test in R, the expected relative frequency table must be created. Then the chisq.test() and pchisq() functions compute the desired statistics. Note that the p= parameter to chisq.test() is the expected relative frequency, not the expected frequency. Here is Example 12.3.1:
# Enter the data
> actual=scan()
1: 1 3 8 18 6 4 4 3
9:
Read 8 items
# Create the expected data
> endpts = seq(125, 275, by=25)
> endpts
[1] 125 150 175 200 225 250 275
> expected=c(pnorm(125, 198.67, 41.31),
+ diff(pnorm(endpts, 198.67, 41.31)),
+ 1-pnorm(275, 198.67, 41.31))
> expected
[1] 0.03726504 0.08210035 0.16396211 0.22951446 0.22521804
[6] 0.15492430 0.07469547 0.03232023
> chisq.test(actual, p=expected)
Chi-squared test for given probabilities
data: actual
X-squared = 10.3246, df = 7, p-value = 0.1709
# The p-value is incorrect because the degrees of freedom are wrong:
# two parameters (the mean and sd) were estimated from the data,
# so df = 8 - 1 - 2 = 5. Here is the correct p-value
> 1-pchisq(10.3246, 5)
[1] 0.06654326
Example 12.3.2 is similar:
> actual = scan()
1: 11 8 10 10 15 17 10 10 9 0
11:
Read 10 items
> expected = c(pbinom(1, 25, .2),
+ diff(pbinom(seq(1, 9), 25, .2)), 1-pbinom(9, 25, .2))
> expected
[1] 0.02738973 0.07083550 0.13576804 0.18668105 0.19601510
[6] 0.16334592 0.11084187 0.06234855 0.02944237 0.01733187
> chisq.test(actual, p=expected)
Chi-squared test for given probabilities
data: actual
X-squared = 47.6777, df = 9, p-value = 2.934e-07
# Again the df must be reduced because the binomial p was
# estimated from the data: df = 10 - 1 - 1 = 8
> 1-pchisq(47.6777, 8)
[1] 1.138480e-07
The chisq.test() function can be used to test for independence of entries in a contingency table. The data must be in a matrix. Here is Example 12.4.1:
> data = matrix(c(260, 15, 7, 299, 41, 14), ncol=2)
> data
     [,1] [,2]
[1,]  260  299
[2,]   15   41
[3,]    7   14
> chisq.test(data)
Pearson's Chi-squared test
data: data
X-squared = 9.0913, df = 2, p-value = 0.01061
The TeachingDemos:chisq.detail() function shows the details of the calculation, including marginal totals, the expected frequency table, and the contribution of each table entry to the chi-squared statistic:
> library(TeachingDemos)
> chisq.detail(data)
observed
 expected
                      Total
   260     299    559
247.86  311.14

    15      41     56
 24.83   31.17

     7      14     21
  9.31   11.69

Total  282    354    636

Cell Contributions
0.59 + 0.47 +
3.89 + 3.10 +
0.57 + 0.46  =  9.09

df = 2   P-value = 0.011
Calculation of the chi-squared test for a 2x2 contingency table is similar. By default R applies the Yates correction. Here is Example 12.4.2:
> data = matrix(c(131, 14, 52, 36), ncol=2)
> data
     [,1] [,2]
[1,]  131   52
[2,]   14   36
> chisq.test(data, correct=F)
Pearson's Chi-squared test
data: data
X-squared = 31.7391, df = 1, p-value = 1.763e-08
> chisq.test(data)
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 29.9118, df = 1, p-value = 4.522e-08
Not surprisingly, a chi-squared test for homogeneity is done the same way. Here is Example 12.5.1:
> data = matrix(c(21, 19, 75, 77), ncol=2)
> chisq.test(data, correct=F)
Pearson's Chi-squared test
data: data
X-squared = 0.1263, df = 1, p-value = 0.7223
R uses the function fisher.test() to compute the Fisher exact test. The presentation of the results of this test is quite different from the presentation in the text; I'm not sure how to compare the two.
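For example, applying it to the 2x2 table from Example 12.4.2 (shown only to illustrate the call; fisher.test() reports an odds ratio with a confidence interval rather than the statistics shown in the text):
> data = matrix(c(131, 14, 52, 36), ncol=2)
> fisher.test(data)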
R has no functions to compute relative risk and odds ratios as shown in the text. The chi-squared statistic needed to compute confidence intervals for relative risk and odds ratios can be computed using chisq.test() in the same manner as shown above.
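The computation is easy to do by hand. Here is a sketch using the 2x2 table from Example 12.4.2 as hypothetical input (rows are exposure present/absent, columns are disease present/absent; the interval formula OR^(1 ± z/sqrt(X²)) is the test-based interval I believe the text uses):
> data = matrix(c(131, 14, 52, 36), ncol=2)
# odds ratio from the cross-product of the table entries
> or = (data[1,1] * data[2,2]) / (data[1,2] * data[2,1])
# chi-squared statistic without the Yates correction
> x2 = chisq.test(data, correct=F)$statistic
# test-based 95% confidence interval for the odds ratio
> or^(1 + c(-1, 1) * 1.96 / sqrt(x2))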
Note there is an error in the text of Example 12.7.1; the sum of the first column of Table 12.7.2 is 40, not 44, so the computed value of chi-squared is incorrect.
The epitools library contains functions riskratio() and oddsratio() but they seem to use more sophisticated methods than those in the text.
The Mantel-Haenszel statistic and common odds ratio may be computed with the mantelhaen.test() function. The data must be in a three-dimensional array with the last dimension corresponding to the strata. Here is Example 12.7.3; the dimension names are not required but aid in verifying the data entry:
> data = array(c(21, 16, 11, 6, 50, 18, 14, 6),
+ dim=c(2, 2, 2),
+ dimnames=list(Hypertension=c('Present', 'Absent'),
+ OCAD=c('Case', 'Noncase'),
+ Stratum=c('55 and under', 'over 55')))
> data
, , Stratum = 55 and under
OCAD
Hypertension Case Noncase
Present 21 11
Absent 16 6
, , Stratum = over 55
OCAD
Hypertension Case Noncase
Present 50 14
Absent 18 6
> mantelhaen.test(data, corr=F)
Mantel-Haenszel chi-squared test without continuity correction
data: data
Mantel-Haenszel X-squared = 0.0243, df = 1, p-value = 0.8762
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
0.4184851 2.1018264
sample estimates:
common odds ratio
0.9378609
The text shows how to test the ratio of two variances using the F distribution. Another way to test the null hypothesis that the variances of two samples are equal is with Levene's test (car:levene.test()). levene.test() requires a single data array and a factor, so the stack() function can be helpful. For example, using the data from Example 8.2.1:
> sel=read.csv('EXA_C08_S02_01.csv')
> levene.test(values ~ ind, stack(sel))
Levene's Test for Homogeneity of Variance
       Df F value    Pr(>F)
group   3  10.623 2.452e-06 ***
      140
This test rejects the null hypothesis that the variances are equal.
Two common tests for normality of a data set are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. The Shapiro-Wilk test is preferred for small data sets (fewer than about 30 items) and the Kolmogorov-Smirnov test for larger ones. Both are available in the stats package as stats:shapiro.test() and stats:ks.test().
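For example, applying the Shapiro-Wilk test to a small simulated sample (the data here are made up for illustration, so the exact output will vary):
> x = rnorm(25, mean=100, sd=15)
> shapiro.test(x)
A small p-value rejects the null hypothesis that the sample comes from a normal distribution.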
The Kolmogorov-Smirnov test compares the distribution of two samples or compares a sample to a specified distribution. You must specify the distribution to compare against. For example:
> ks.test(data, 'pnorm', mean(data), sd(data))
Note that if the mean and sd of the sample data are used, the p-value computed by ks.test() is not valid. Instead, use nortest:lillie.test() which computes the same test statistic with the correct p-value.
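For example, comparing the two on the same simulated sample (this assumes the nortest package is installed; the data are made up for illustration):
> library(nortest)
> x = rnorm(50)
> ks.test(x, 'pnorm', mean(x), sd(x))  # p-value not valid here
> lillie.test(x)                       # same statistic, corrected p-value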
© Kent S Johnson