• 设为首页
  • 点击收藏
  • 手机版
    手机扫一扫访问
    迪恩网络手机版
  • 关注官方公众号
    微信扫一扫关注
    迪恩网络公众号

R语言学习笔记之十

原作者: [db:作者] 来自: [db:来源] 收藏 邀请

摘要: 仅用于记录R语言学习过程:

内容提要:

描述性统计;t检验;数据转换;方差分析;卡方检验;回归分析与模型诊断;生存分析;COX回归

写在正文前的话,关于基础知识,此篇为终结篇,笔记来自医学方的课程,仅用于学习R的过程。

正文:

  描述性统计

n  如何去生成table1 用table()函数,快速汇总频数

u  生成四格表:table(行名,列名)

> table(tips$sex,tips$smoker)

      No Yes

Female 54  33

      Male   97  60

 

addmargins()函数:对生成的table表格进行计算

> table(esoph$agegp,esoph$ncases)

      

         0  1  2  3  4  5  6  8  9 17

  25-34 14  1  0  0  0  0  0  0  0  0

  35-44 10  2  2  1  0  0  0  0  0  0

  45-54  3  2  2  2  3  2  2  0  0  0

  55-64  0  0  2  4  3  2  2  1  2  0

  65-74  1  4  2  2  2  2  1  0  0  1

  75+    1  7  3  0  0  0  0  0  0  0

> tt <- table(esoph$agegp,esoph$ncases)

> addmargins(tt,margin = c(1,2))  # margin 1表示行,2表示列

      

         0  1  2  3  4  5  6  8  9 17 Sum

  25-34 14  1  0  0  0  0  0  0  0  0  15

  35-44 10  2  2  1  0  0  0  0  0  0  15

  45-54  3  2  2  2  3  2  2  0  0  0  16

  55-64  0  0  2  4  3  2  2  1  2  0  16

  65-74  1  4  2  2  2  2  1  0  0  1  15

  75+    1  7  3  0  0  0  0  0  0  0  11

      Sum   29 16 11  9  8  6  5  1  2  1  88

n  xtabs()函数:在数据集中取子集,生成表格

> hightip <- tips[,\'tip\'] > mean(tips[,\'tip\'] )

> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker==\'No\'))

#as.data.frame转换成数据框;~后面的公式类似table括号中的内容,为分类变量;~左边需添加的是连续型变量;有一个子集subset可进行提取

  tips.sex hightip Freq

1   Female   FALSE   31

2     Male   FALSE   49

3   Female    TRUE   23

4     Male    TRUE   48

 

> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c(\'Sun\',\'Sat\')))

  tips.sex hightip Freq

1   Female   FALSE   21

2     Male   FALSE   53

3   Female    TRUE   25

4     Male    TRUE   64

 

示例2:

> xtabs(ncontrols~agegp + alcgp,data = esoph)

       alcgp

agegp   0-39g/day 40-79 80-119 120+

  25-34        61    45      5    5

  35-44        89    80     20   10

  45-54        78    81     39   15

  55-64        89    84     43   26

  65-74        71    53     29    8

  75+          27    12      2    3

> addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))

       alcgp

agegp   0-39g/day 40-79 80-119 120+ Sum

  25-34        61    45      5    5 116

  35-44        89    80     20   10 199

  45-54        78    81     39   15 213

  55-64        89    84     43   26 242

  65-74        71    53     29    8 161

  75+          27    12      2    3  44

  Sum         415   355    138   67 975

 

#cbind(数值型,数值型)

> xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)

, ,  = ncases

 

       alcgp

agegp   0-39g/day 40-79 80-119 120+

  25-34         0     0      0    1

  35-44         1     4      0    4

  45-54         1    20     12   13

  55-64        12    22     24   18

  65-74        11    25     13    6

  75+           4     4      2    3

 

, ,  = ncontrols

 

       alcgp

agegp   0-39g/day 40-79 80-119 120+

  25-34        61    45      5    5

  35-44        89    80     20   10

  45-54        78    81     39   15

  55-64        89    84     43   26

  65-74        71    53     29    8

  75+          27    12      2    3

n  ftable()函数:扁平表格,接受xtabs对象,进行调整

> ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))

                 ncases ncontrols

agegp alcgp                     

25-34 0-39g/day       0        61

      40-79           0        45

      80-119          0         5

      120+            1         5

35-44 0-39g/day       1        89

      40-79           4        80

      80-119          0        20

      120+            4        10

45-54 0-39g/day       1        78

      40-79          20        81

      80-119         12        39

      120+           13        15

55-64 0-39g/day      12        89

      40-79          22        84

      80-119         24        43

      120+           18        26

65-74 0-39g/day      11        71

      40-79          25        53

      80-119         13        29

      120+            6         8

75+   0-39g/day       4        27

      40-79           4        12

      80-119          2         2

      120+            3         3

n  数据汇总可结合前面学习的函数,如:

u  summary(数据集)函数

u  describe(数据集)函数(psych包里的)

u  describe.data.frame(数据集)函数 (Hmisc包里的)

u  apply()家族等

  t检验

n  假设检验:服从正态分布方差齐性(如果不齐,可以用t’检验),

n  示例:

u  生成随机数据,并检验是否符合正态分布:(shapiro.test()函数)

> data1 <- sample(1:100,50)

>

> #检验正态性  shapiro.test()函数

> shapiro.test(data1)

 

       Shapiro-Wilk normality test

 

data:  data1

W = 0.96424, p-value = 0.1338  #不能拒绝原假设,服从正态分布

除此之外,还有以下5个函数可用于检验是否符合正态分布:

library(nortest)

lillie.test(data1)

ad.test(data1)

cvm.test(data1)

pearson.test(data1)

sf.test(data1)

如:

> lillie.test(data1)

 

       Lilliefors (Kolmogorov-Smirnov) normality test

 

data:  data1

D = 0.064492, p-value = 0.8693

 

> ad.test(data1)

 

       Anderson-Darling normality test

 

data:  data1

A = 0.40918, p-value = 0.333

 

> cvm.test(data1)

 

       Cramer-von Mises normality test

 

data:  data1

W = 0.054212, p-value = 0.4437

 

> pearson.test(data1)

 

       Pearson chi-square normality test

 

data:  data1

P = 4, p-value = 0.7798

 

> sf.test(data1)

 

       Shapiro-Francia normality test

 

data:  data1

W = 0.97496, p-value = 0.3075

生成服从正态分布的数据:rnorm()函数

> #生成服从正态分布的数据

> data3 <- rnorm(100,3,5)

> data4 <- rnorm(200,3.4,8)

>

> shapiro.test(data3)

 

       Shapiro-Wilk normality test

 

data:  data3

W = 0.99369, p-value = 0.9258

 

>

> shapiro.test(data4)

 

       Shapiro-Wilk normality test

 

data:  data4

W = 0.98946, p-value = 0.149

检验方差齐性:var.test()函数:仅适用于两组

> var.test(data3,data4)

 

       F test to compare two variances

 

data:  data3 and data4

F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

 0.2893355 0.5739700

sample estimates:

ratio of variances

         0.4034794   #p值小于0.05,说明方差不齐性

u  进行t检验函数:t.test()函数

两组均数t检验

> t.test(data3,data4,var.equal = F)   #方差不齐则var.equal设置为F,默认也为FALSE,若其设置为TRUE,如为FALSE,则会使用t’检验

 

       Welch Two Sample t-test

 

data:  data3 and data4

t = -1.3681, df = 281.41, p-value = 0.1724   #说明两组均数无显著统计学差异

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -2.587004  0.465494

sample estimates:

mean of x mean of y

      2.468549  3.529304

样本均数与总体均数t检验

> #样本均数与总体均数的比较

> t.test(data3,mu =3.2)

 

       One Sample t-test

 

data:  data3

t = -1.4117, df = 99, p-value = 0.1612  #样本与总体无统计学差异

alternative hypothesis: true mean is not equal to 3.2

95 percent confidence interval:

 1.440424 3.496675

sample estimates:

mean of x

 2.468549

配对数据的t检验

data3 <- rnorm(200,3,5)

> data4 <- rnorm(200,3.4,5)

>

> #配对数据

> #进行配对t检验

> t.test(data3,data4,paired = TRUE)

 

       Paired t-test

 

data:  data3 and data4

t = 0.59079, df = 199, p-value = 0.5553  #p大于0.05,说明无显著差异

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -0.6946177  1.2888581

sample estimates:

mean of the differences

              0.2971202

 

  数据变换

n  数据不满足正态分布时该如何处理:有时可采用取log()、sqrt()、1/x等等方式,

n  runif()平均分布函数

> mydata <- runif(100,min = 1,max = 2)

> shapiro.test(mydata)

 

       Shapiro-Wilk normality test

 

data:  mydata

W = 0.93149, p-value = 6.05e-05

 

> shapiro.test(log(mydata))

 

       Shapiro-Wilk normality test

 

data:  log(mydata)

W = 0.93082, p-value = 5.54e-05

 

> shapiro.test(sqrt(mydata))

 

       Shapiro-Wilk normality test

 

data:  sqrt(mydata)

W = 0.93236, p-value = 6.794e-05

 

> shapiro.test(1/mydata)

 

       Shapiro-Wilk normality test

 

data:  1/mydata

W = 0.9195, p-value = 1.325e-05

#几种方式均不可行。

boxcox转换---对公式的,加载MASS包,运用里面的boxcox()函数,#boxcox()转换,核心为找到lambda的值进行相应的转换

bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,

             lambda = seq(-0.25,0.25,length =10))

#Volume为拟操作的变量,Height和Girth为用此两个数据进行估计,但要取log,trees为数据集,lambda后面的公式为取最佳值

u  用公式:(x^lambda-1)/lambda  进行数据转换(lambda 不等于1),新的Volume_bc就服从正态分布了。

> Volume_bc <- (trees$Volume^lambda-1)/lambda

> shapiro.test(Volume_bc)

 

       Shapiro-Wilk normality test

 

data:  Volume_bc

W = 0.96431, p-value = 0.3776

u  可用qqnorm(Volume_bc)和qqline(Volume_bc)c查看图,是否符合正态分布

boxcox转换---对数值的,加载forecast包,利用包中的BoxCox.lambda()函数

> BoxCox.lambda(trees$Volume)

[1] -0.4954451   #lambda的值  ,采用1/sqrt(x)

分以下几种情况:

ü  lambda接近-1时    1/x

ü             -0.5     1/sqrt(x)

ü              0       ln(x) 或log(x)

ü              0.5      sqrt()

ü               1       不用变换了

完整示例:

> BoxCox.lambda(trees$Volume)

[1] -0.4954451

> new_volum <- 1/sqrt(trees$Volume)

> shapiro.test(new_volum)

 

       Shapiro-Wilk normality test

 

data:  new_volum

W = 0.94523, p-value = 0.1152

利用car包中powerTransform得到lambda值,再进行正态分布分析

> library(car)

> powerTransform(trees$Volume)

Estimated transformation parameter

trees$Volume

 -0.07476608

> new_volum <- trees$Volume^-0.07476608

> shapiro.test(new_volum)

 

       Shapiro-Wilk normality test

 

data:  new_volum

W = 0.96428, p-value = 0.3768

  方差分析

n  用于多组均值的比较。两组均值的t检验与方差分析等价,但是对于>=3组的均数比较,t检验不适用,需用方差分析。

n  假设检验,需满足的条件:

u  正态性

u  方差齐性

u  独立性

n  方差齐性的检验

u  安装multcomp包,加载cholesterol数据集(里面包含response组和trt组)

u  方差齐性的检验:

l  R语言的内置包:bartlett.test()

> bartlett.test(response~trt,data = cholesterol)

 

       Bartlett test of homogeneity of variances

 

data:  response by trt

Bartlett\'s K-squared = 0.57975, df = 4, p-value =

0.9653

 

> #p = 0.9653   方差是齐性的

>

> #正态性检验

> shapiro.test(cholesterol$response)

 

       Shapiro-Wilk normality test

 

data:  cholesterol$response

W = 0.97722, p-value = 0.4417

l  R语言的内置包:fligner.test() 与bartlett.test()检验原理不同,但公式一样

> #方差齐性检验

> fligner.test(response~trt,data = cholesterol)

 

       Fligner-Killeen test of homogeneity of variances

 

data:  response by trt

Fligner-Killeen:med chi-squared = 0.74277, df = 4,

p-value = 0.946

car包中的ncvTest()函数:

> ncvTest(lm(response~trt,data = cholesterol))

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 0.1338923    Df = 1     p = 0.71443

car包中的leveneTest()函数:

> leveneTest(response~trt,data = cholesterol)

Levene\'s Test for Homogeneity of Variance (center = median)

      Df F value Pr(>F)

group  4  0.0755 0.9893

      45

 

n  方差分析

单因素方差分析aov()函数

> aov(response~trt,data = cholesterol)

Call:

   aov(formula = response ~ trt, data = cholesterol)

 

Terms:

                      trt Residuals

Sum of Squares  1351.3690  468.7504

Deg. of Freedom         4        45

 

Residual standard error: 3.227488

Estimated effects may be unbalanced

> summary(fit)

            Df Sum Sq Mean Sq F value   Pr(>F)   

trt          4 1351.4   337.8   32.43 9.82e-13 ***

Residuals   45  468.8    10.4                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#用gplots包中的plotmeans可视化

plotmeans(response~trt,data = cholesterol)

单因素方差分析:oneway.test()函数

> oneway.test(response~trt,data = cholesterol)

 

       One-way analysis of means (not assuming equal

       variances)

 

data:  response and trt

F = 30.709, num df = 4.00, denom df = 22.46, p-value =

8.227e-09

 

> oneway.test(response~trt,data = cholesterol,var.equal = T)

 

       One-way analysis of means

 

data:  response and trt

F = 32.433, num df = 4, denom df = 45, p-value =

9.819e-13

两两比较:TukeyHSD()函数

> TukeyHSD(fit)

  Tukey multiple comparisons of means

    95% family-wise confidence level

 

Fit: aov(formula = response ~ trt, data = cholesterol)

 

$trt

                  diff        lwr       upr     p adj

2times-1time   3.44300 -0.6582817  7.544282 0.1380949

4times-1time   6.59281  2.4915283 10.694092 0.0003542

drugD-1time    9.57920  5.4779183 13.680482 0.0000003

drugE-1time   15.16555 11.0642683 19.266832 0.0000000

4times-2times  3.14981 -0.9514717  7.251092 0.2050382

drugD-2times   6.13620  2.0349183 10.237482 0.0009611

drugE-2times  11.72255  7.6212683 15.823832 0.0000000

drugD-4times   2.98639 -1.1148917  7.087672 0.2512446

drugE-4times   8.57274  4.4714583 12.674022 0.0000037

drugE-drugD    5.58635  1.4850683  9.687632 0.0030633

#可视化

plot(TukeyHSD(fit))

多因素方差分析:aov()函数

> data(\'ToothGrowth\')

> head(\'ToothGrowth\')

len supp dose

1  4.2   VC  0.5

2 11.5   VC  0.5

3  7.3   VC  0.5

4  5.8   VC  0.5

5  6.4   VC  0.5

6 10.0   VC  0.5

> levels(ToothGrowth$supp)

[1] "OJ" "VC"

> levels(ToothGrowth$dose)

NULL

> levels(as.factor(ToothGrowth$dose))

[1] "0.5" "1"   "2" 

> table(ToothGrowth$supp,ToothGrowth$dose)

   

     0.5  1  2

  OJ  10 10 10

  VC  10 10 10

 

#公式的写法   关注交互作用

 

> fangcha <- aov (len~supp+dose,data = ToothGrowth)

> summary(fangcha)

            Df Sum Sq Mean Sq F value   Pr(>F)   

supp         1  205.4   205.4   11.45   0.0013 **

dose         1 2224.3  2224.3  123.99 6.31e-16 ***

Residuals   57 1022.6    17.9                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#交互作用

> fangcha <- aov (len~supp*dose,data = ToothGrowth)

> summary(fangcha)

            Df Sum Sq Mean Sq F value   Pr(>F)   

supp         1  205.4   205.4  12.317 0.000894 ***

dose         1 2224.3  2224.3 133.415  < 2e-16 ***

supp:dose    1   88.9    88.9   5.333 0.024631 *    #交互作用不能忽视

Residuals   56  933.6    16.7                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#可视化  交互作用图

>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = \'b\',

+                  pch = c(1,10),col = c(\'red\',\'green\'))

单因素协方差分析:aov()函数

> data(\'litter\')

> head(litter)

  dose weight gesttime number

1    0  28.05     22.5     15

2    0  33.33     22.5     14

3    0  36.37     22.0     14

4    0  35.52     22.0     13

5    0  36.77     21.5     15

6    0  29.60     23.0      5

 

> facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)

> summary(facn)

              Df Sum Sq Mean Sq F value  Pr(>F)  

gesttime       1  134.3  134.30   8.289 0.00537 **  #协变量确实有影响,该如何去除?

dose           3  137.1   45.71   2.821 0.04556 *

gesttime:dose  3   81.9   27.29   1.684 0.17889  

Residuals     66 1069.4   16.20                  

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

去除协变量的影响:加载effects包,用其中的effect()函数

> effect(\'dose\',facn)

NOTE: dose is not a high-order term in the model

 

 dose effect

dose

       0        5       50      500

32.30722 28.50625 30.61078 29.29005

 

#常见研究设计的表达式

 

 

  卡方检验

n  对分类变量的检验方法

n  分类:

u  拟合优度检验:用chisq.test()函数

(针对样本数据的分布与已知总体分布是否一致)

> men <- c(11,120,60,45)

> women <- c(20,102,39,30)

> df <- as.data.frame(rbind(men,women))

> colnames(df) <- c(\'AB\',\'O\',\'A\',\'B\')

> chisq.test(men)

 

       Chi-squared test for given probabilities

 

data:  men

X-squared = 105.46, df = 3, p-value < 2.2e-16

 

> chisq.test(men,p = c(0.1,0.5,0.2,0.2))  #p 中为总体人群中各血型的比例

 

       Chi-squared test for given probabilities

 

data:  men

X-squared = 10.335, df = 3, p-value = 0.01592

u  卡方齐性检验:用于比较不同分类水平下,各个类型的比例是否一致。

> chisq.test(df)  #行变量与列变量的相关性检验

 

       Pearson\'s Chi-squared test

 

data:  df

X-squared = 6.8607, df = 3, p-value = 0.07647 #男女不同血型的分布一致

u  卡方独立性检验:

> chisq.test(df)

 

       Pearson\'s Chi-squared test

 

data:  df

X-squared = 6.8607, df = 3, p-value = 0.07647 # 行变量和列变量无关联,血型分布和性别无关

u  CMH检验:分层检验 三维,行变量为无序,列变量为有序数据;判断是否有混杂因素

#采用mantelhaen.test()函数

> Rabbits <-

+   array(c(0,0,6,5,

+           3,0,3,6,

+           6,2,0,4,

+           5,6,1,0,

+           2,5,0,0),

+         dim = c(2,2,5),

+         dimnames = list(

+           Delay = c(\'None\',\'1.5h\'),

+           Response = c(\'Cured\',\'Died\'),

+           Penicillin.Level = c(\'1/8\',\'1/4\',\'1/2\',\'1\',\'4\')))

> Rabbits

, , Penicillin.Level = 1/8

 

      Response

Delay  Cured Died

  None     0    6

  1.5h     0    5

 

, , Penicillin.Level = 1/4

 

      Response

Delay  Cured Died

  None     3    3

  1.5h     0    6

 

, , Penicillin.Level = 1/2

 

      Response

Delay  Cured Died

  None     6    0

  1.5h     2    4

 

, , Penicillin.Level = 1

 

      Response

Delay  Cured Died

  None     5    1

  1.5h     6    0

 

, , Penicillin.Level = 4

 

      Response

Delay  Cured Died

  None     2    0

  1.5h     5    0

 

> mantelhaen.test(Rabbits)

 

       Mantel-Haenszel chi-squared test with continuity

       correction

 

data:  Rabbits

Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =

0.04747

alternative hypothesis: true common odds ratio is not equal to 1

95 percent confidence interval:

  1.026713 47.725133

sample estimates:

common odds ratio

                7    #可以判定盘尼西林是否为混杂因素

u  CMH检验:有序分类的卡方检验:连续型变量

#采用mantelhaen.test()函数

> Satisfaction <-

+   as.table(array(c(1,2,0,0,3,3,1,2,

+                    11,17,8,4,2,3,5,2,

+                    1,0,0,0,1,3,0,1,

+                    2,5,7,9,1,1,3,6),

+                  dim = c(4,4,2),

+                  dimnames =

+                    list(Income =

+                           c(\'<5000\',\'5000-15000\',

+                             \'15000-25000\',\'>25000\'),

+                         \'Job Satisfaction\' =

+                           c(\'V_D\',\'L_S\',\'M_S\',\'V_S\'),

+                         Gender = c(\'Female\',\'Male\'))))

> Satisfaction

, , Gender = Female

 

             Job Satisfaction

Income        V_D L_S M_S V_S

  <5000         1   3  11   2

  5000-15000    2   3  17   3

  15000-25000   0   1   8   5

  >25000        0   2   4   2

 

, , Gender = Male

 

             Job Satisfaction

Income        V_D L_S M_S V_S

  <5000         1   1   2   1

  5000-15000    0   3   5   1

  15000-25000   0   0   7   3

  >25000        0   1   9   6

 

> mantelhaen.test(Satisfaction)

 

       Cochran-Mantel-Haenszel test

 

data:  Satisfaction

Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =

0.3345   #工资水平与满意度无关

u  配对四格表的卡方检验:采用mcnemar.test()函数

> paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c(\'A\',\'B\'),control = c(\'A\',\'B\'))))

> paired

    control

case   A   B

   A 157  69

   B  24  18

> mcnemar.test(paired)

 

       McNemar\'s Chi-squared test with continuity correction

 

data:  paired

McNemar\'s chi-squared = 20.817, df = 1, p-value =

5.053e-06   

> chisq.test(paired)

 

       Pearson\'s Chi-squared test with Yates\' continuity

       correction

 

data:  paired

X-squared = 1.9244, df = 1, p-value = 0.1654

u  的

  回归分析与模型诊断

前提:是否适合做线性回归,是否满足正态分布

只要因变量是连续型变量即可做线性回归,因变量是分类变量需要做Logistic回归;自变量是连续型或者分类变量无影响

n  一元回归分析:lm(因变量~自变量)

x为连续型变量

> x<- seq(1,5,len =100)

> noise <- rnorm(n=100,mean = 0,sd = 1)

> beta0 = 1

> beta1 <-2

> y <- beta0 + beta1*x + noise

> plot(y~x)   #看一下是否适合做线性回归

> model <- lm(y~x)

> summary(model)

 

Call:

lm(formula = y ~ x)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.49732 -0.64794  0.00215  0.75355  3.06579

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  0.77938    0.28733   2.712  0.00789 **

x            2.05561    0.08927  23.027  < 2e-16 *** 

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.041 on 98 degrees of freedom

Multiple R-squared:  0.844,    Adjusted R-squared:  0.8424 #模型可解释度

F-statistic: 530.3 on 1 and 98 DF,  p-value: < 2.2e-16  #模型可靠性

x为分类变量

> x <- factor(rep(c(0,1,2),each = 20))

> y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))

> model <- lm(y~x)

> summary(model)

 

Call:

lm(formula = y ~ x)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.50566 -0.82826  0.01137  0.87966  2.41338

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept)  0.05338    0.24331   0.219    0.827   

x1           0.81708    0.34409   2.375    0.021 * 

x2           1.99903    0.34409   5.810 2.94e-07 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.088 on 57 degrees of freedom

Multiple R-squared:  0.3745,      Adjusted R-squared:  0.3525

F-statistic: 17.06 on 2 and 57 DF,  p-value: 1.558e-06

 

> exp( 1.99903)  #转化成OR值

[1] 7.381892

n  模型诊断

非正态性:shapiro.test()判断

> #检验残差是否服从正态分布

> data(LMdata,package = \'rinds\')

> model <- lm(y~x,data = LMdata$NonL)

> #找残差

>

> res1 <- residuals(model)

> shapiro.test(res1)   #残差不符合正态分布,需要重新做

 

       Shapiro-Wilk normality test

 

data:  res1

W = 0.93524, p-value = 1e-04   #残差不符合正态分布,需要重新做

u  非线性

> #2.非线性

> model2 <- lm(y~x,data = LMdata$NonL)

> plot(model2)

#y 与x2的关系是否成线性

model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

summary(model2)

#踢掉x

> model3 <- update(model2,y~.-x)

> summary(model3)

 

Call:

lm(formula = y ~ 1, data = LMdata$NonL)

 

Residuals:

    Min      1Q  Median      3Q     Max

-19.456 -12.907  -2.464  10.725  28.697

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   21.829      1.429   15.28   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 14.29 on 99 degrees of freedom

> model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

> summary(model2)

Call:

lm(formula = y ~ x + I(x^2), data = LMdata$NonL)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.32348 -0.60702  0.00701  0.56018  2.27346

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  0.98702    0.62216   1.586    0.116   

x            0.11085    0.45405   0.244    0.808   

I(x^2)       1.97966    0.07456  26.552   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.907 on 97 degrees of freedom

Multiple R-squared:  0.9961,      Adjusted R-squared:  0.996

F-statistic: 1.224e+04 on 2 and 97 DF,  p-value: < 2.2e-16

 

> model3 <- update(model2,y~.-x)

> summary(model3)

 

Call:

lm(formula = y ~ I(x^2), data = LMdata$NonL)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.34289 -0.60692  0.01722  0.58186  2.29601

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  1.13378    0.15963   7.103 1.97e-10 ***

I(x^2)       1.99760    0.01271 157.195  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.9026 on 98 degrees of freedom

Multiple R-squared:  0.996,       Adjusted R-squared:  0.996

F-statistic: 2.471e+04 on 1 and 98 DF,  p-value: < 2.2e-16

 

> AIC(model,model2,model3)

       df      AIC

model   3 478.4558

model2  4 269.2121

model3  3 267.2736  #拟合效果越来越好

 

plot(model3$residuals~LMdata$NonL)  #在0左右分布

u  异方差:考虑的是噪声对模型的影响

model4 <- lm(y~x,data = LMdata$Hetero)

plot(model4$residuals~LMdata$Hetero$x)

#处理方法:使用加权最小二乘法:对于不同的样本点给予不同的权重

> summary(model5)

 

Call:

lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)

 

Weighted Residuals:

     Min       1Q   Median       3Q      Max

-2.25132 -0.68409 -0.03924  0.63997  2.61098

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   1.2611     0.5139   2.454   0.0159 * 

x             1.8973     0.2317   8.190 9.98e-13 ***

---

Signif. codes: 

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.025 on 98 degrees of freedom

Multiple R-squared:  0.4063,      Adjusted R-squared:  0.4003

F-statistic: 67.07 on 1 and 98 DF,  p-value: 9.977e-13

 

> summary(model4)

 

Call:

lm(formula = y ~ x, data = LMdata$Hetero)

 

Residuals:

    Min      1Q  Median      3Q     Max

-8.3371 -1.6383 -0.1533  1.4075 12.3423

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   1.6175     1.0019   1.615     0.11   

x             1.7671     0.3113   5.677  1.4e-07 ***

---

Signif. codes: 

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 3.63 on 98 degrees of freedom

Multiple R-squared:  0.2475,      Adjusted R-squared:  0.2398

F-statistic: 32.23 on 1 and 98 DF,  p-value: 1.396e-07

 

#处理方法:使用迭代重复加权最小二乘法  采用nlme包中的gls()函数,找到最合适的weights公式,得到最小AIC

> glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)

> summary(glsmodel)

Generalized least squares fit by REML

  Model: y ~ x

  Data: LMdata$Hetero

       AIC      BIC    logLik

  517.5286 525.2835 -255.7643

 

Variance function:

 Structure: fixed weights

 Formula: ~x

 

Coefficients:

               Value Std.Error  t-value p-value

(Intercept) 1.421324 0.7091780 2.004184  0.0478

x           1.832543 0.2603653 7.038351  0.0000

 

 Correlation:

  (Intr)

x -0.908

 

Standardized residuals:

        Min          Q1         Med          Q3

-2.17451000 -0.55322766 -0.03454023  0.51060224

        Max

 2.93969900

 

Residual standard error: 1.890127

Degrees of freedom: 100 total; 98 residual

u  自相关:利用lmtest包中的DW检验函数:dwtest()函数

> model6 <- lm(y~x,data = LMdata$AC)

> dwtest(model6)

 

       Durbin-Watson test

 

data:  model6

DW = 0.65556, p-value = 2.683e-12  #存在自相关,不能使用最小二乘法,可使用gls()函数,实现广义最小二乘法,注意gls中的correlation 参数,设置为corAR1()

alternative hypothesis: true autocorrelation is greater than 0

 

> glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)

> summary(glsmodel2)

Generalized least squares fit by REML

  Model: y ~ x

  Data: LMdata$AC

       AIC      BIC    logLik

  268.7617 279.1016 -130.3809

 

Correlation Structure: AR(1)

 Formula: ~1

 Parameter estimate(s):

      Phi

0.7041665

 

Coefficients:

               Value Std.Error  t-value p-value

(Intercept) 1.335059 0.7792019 1.713367  0.0898

x           2.072936 0.2405513 8.617438  0.0000

 

 Correlation:

  (Intr)

x -0.926

 

Standardized residuals:

         Min           Q1          Med           Q3

-1.667086282 -0.571601900  0.001724114  0.568360354

         Max

 2.306177988

 

Residual standard error: 1.25329

Degrees of freedom: 100 total; 98 residual

> summary(model6)

 

Call:

lm(formula = y ~ x, data = LMdata$AC)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.10131 -0.72894 -0.03329  0.66138  2.77461

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   1.2626     0.3303   3.822 0.000232 ***

x             2.1118     0.1026  20.578  < 2e-16 ***

---

Signif. codes: 

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.197 on 98 degrees of freedom

Multiple R-squared:  0.8121,      Adjusted R-squared:  0.8101

F-statistic: 423.4 on 1 and 98 DF,  p-value: < 2.2e-16

u  异常值:离群点,杠杆点、高影响点

model7 <- lm(y~x,data = LMdata$Outlier)

plot(y~x,data = LMdata$Outlier)

abline(model7)  #画出回归直线

#利用car包中的infuencePlot()函数进行判断,圆圈越大,对模型影响越大,做线性回归模型时需踢掉该点,用update函数

model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)

plot(y~x,data = LMdata$Outlier)   #比较去除前后的差异

abline(model7,col = \'red\')   #比较去除前后的差异

abline(model8,col = \'green\')  #比较去除前后的差异

u  多重共线性的判断:自变量之间存在线性关系

#x1,x2,x3与y之间p值无显著差异,但是总体上的p值和R值非常显著,说明x1,x2,x3之间可能是存在相关性的,但与y没有相关性,不能进行回归分析,需要用函数进行检验确认:vif()函数计算方差膨胀因子

> model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)

> summary(model9)

 

Call:

lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)

 

Residuals:

     Min       1Q   Median       3Q      Max

-1.29100 -0.26369  0.00141  0.32529  0.91182

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|) 

(Intercept)   0.3848     0.5813   0.662   0.5095 

x1            7.2022     4.8552   1.483   0.1412  

x2          -14.0916    12.1385  -1.161   0.2486 

x3            8.2312     4.8559   1.695   0.0933 .

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.499 on 96 degrees of freedom

Multiple R-squared:  0.9984,      Adjusted R-squared:  0.9984

F-statistic: 2.057e+04 on 3 and 96 DF,  p-value: < 2.2e-16

#vif计算方差膨胀因子,一般而言,大于10则认为存在共线性

> vif(model9)

        x1         x2         x3

  7560.819 214990.752 222630.742

#运用自动回归判断是x中的哪些变量存在共线性:step()函数

> model10 <- step(model9)

Start:  AIC=-135.11

y ~ x1 + x2 + x3

 

       Df Sum of Sq    RSS     AIC

- x2    1   0.33560 24.241 -135.71

<none>              23.905 -135.11

- x1    1   0.54795 24.453 -134.84

- x3    1   0.71550 24.621 -134.16

 

Step:  AIC=-135.71

y ~ x1 + x3

 

       Df Sum of Sq     RSS     AIC

<none>                 24.2 -135.71

- x1    1     189.2   213.4   79.81

- x3    1   15276.4 15300.7  507.05

> summary(model10)

 

Call:

lm(formula = y ~ x1 + x3, data = LMdata$Mult)

 

Residuals:

     Min       1Q   Median       3Q      Max

-1.24046 -0.27519 -0.02751  0.32824  0.91344

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  0.38158    0.58230   0.655    0.514   

x1           1.56614    0.05692  27.514   <2e-16 ***

x3           2.59393    0.01049 247.241   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.4999 on 97 degrees of freedom

Multiple R-squared:  0.9984,      Adjusted R-squared:  0.9984

F-statistic: 3.074e+04 on 2 and 97 DF,  p-value: < 2.2e-16

 

n  Logistic回归

与线性回归的区别:y为分类变量。

u  使用glm()函数:下载HSAUR2包的数据集plasma

> library(HSAUR2)

> data(\'plasma\')

> head(plasma)

  fibrinogen globulin      ESR

1       2.52       38 ESR < 20

2       2.56       31 ESR < 20

3       2.19       33 ESR < 20

4       2.18       31 ESR < 20

5       3.41       37 ESR < 20

6       2.46       36 ESR < 20

> fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,

+             family = binomial())  #family


鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
热门推荐
热门话题
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap