[問題] Logistic regression 的變數設定(2)

看板Statistics作者gsuper (Logit(odds))時間12年前 (2012/06/22 01:50)推噓1(1推 0噓 29→)

留言30則, 2人參與討論串1/1

假設我要在[R]中計算單變量logistic regression DATA = Y X=基因型 ------------------ 1 AA 1 AA 1 AG 1 AA 1 -- (missing value) 0 GG 0 GG 0 GG 0 AG 0 -- -------------------- attach(DATA) 以下為可能的各種組合截距 X1 X2 X3 ============================================================================== glm(Y~X) | -- , AA , AG , GG glm(Y~factor(X=="AA")) | not AA , AA glm(Y~factor(X=="AG")) | not AG , AG glm(Y~factor(X=="GG")) | not GG , GG glm(Y~factor(X=="AA")+factor(X=="AG")) | GG與-- , AA , AG glm(Y~factor(X=="AA")+factor(X=="GG")) | AG與-- , AA , GG glm(Y~factor(X=="AG")+factor(X=="GG")) | AA與-- , AG , GG glm(Y~factor(X=="AA")+factor(X=="AG")+factor(X=="GG"))| -- , AA , AG , GG ============================================================================== 所以說 1. 一般是否習慣將 Low risk genotype 設定為截距項 (Baseline risk)? 2. 若我要定義單變量顯著 , 是要看下表的某項斜率檢定(見下表) 又或是要做 The likelihood ratio test, G ? > summary(glm(PHENO~as.factor(C[,5]=="AA")+as.factor(C[,5]=="AG"))) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.36390 0.02594 14.026 <2e-16 *** as.factor(C[, 5] == "AA")TRUE 0.01706 0.07916 0.215 0.829 as.factor(C[, 5] == "AG")TRUE 0.02072 0.03971 0.522 0.602 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 0.2349203) Null deviance: 152.29 on 650 degrees of freedom Residual deviance: 152.23 on 648 degrees of freedom AIC: 909.47 =========================================================================== G = 152.29 - 152.23 = 0.06 = 卡方 (where df = #predictors added to the model) 當我計算 glm(Y~X) , 共有四個子變項 , -- , AA , AG , GG 請問 df 等於 3 或 4 或 1? ======================================================================= 排版超亂..... 我盡力了~~ -- 祭頌后靈的騎士道與白主教穩守黃金鄉邊境兵躁動自高自大妄入堡壘黑主教冷靜人格分裂籠城王與雙子戰塔一籌莫展掌握無限的魔女喚醒躺下的靈魂 -- ※ 發信站: 批踢踢實業坊(ptt.cc) ◆ From: 140.113.239.247 ※ 編輯: gsuper 來自: 140.113.239.247 (06/22 02:04) ※ 編輯: gsuper 來自: 140.113.239.247 (06/22 02:05) ※ 編輯: gsuper 來自: 140.113.239.247 (06/22 02:05) ※ 編輯: gsuper 來自: 140.113.239.247 (06/22 02:06) ※ 編輯: gsuper 來自: 140.113.239.247 (06/22 02:06)

→

06/22 14:20, , 1^F

06/22 14:20, 1^F

→

06/22 14:22, , 2^F

06/22 14:22, 2^F

→

06/22 14:22, , 3^F

06/22 14:22, 3^F

→

06/22 14:24, , 4^F

06/22 14:24, 4^F

我手上有 114 個因子 (factors , features , SNPs) 想做 Stepwise Insert , multiple logistic regression 建 prediction model 因此要先 filter 掉一些完全無用的因子必須初步做 single logistic regression , 檢定 "單項因子是否有價值" ==================================================================== 而在觀察 summary(glm(簡單回歸)) 的結果時發現有以下兩種現存的檢定 , 但不知到哪種檢定符合我的要求 1. 看 summary(glm(單項回歸式))$coefficients 中 , 是否認一項顯著 2. null devaince - model deviance 的 LRT 主要想問的是這個... ※ 編輯: gsuper 來自: 140.113.239.247 (06/22 22:35)

→

06/23 00:01, , 5^F

06/23 00:01, 5^F

→

06/23 00:02, , 6^F

06/23 00:02, 6^F

→

06/23 00:02, , 7^F

06/23 00:02, 7^F

→

06/23 00:36, , 8^F

06/23 00:36, 8^F

→

06/23 00:36, , 9^F

06/23 00:36, 9^F

Anova(model)$p.value = 1-pchisq(mod$null.deviance-mod$deviance, 2) = LR test 至少在簡單邏輯回歸上述應該是成立的 LR test 在複邏輯回歸 1. 觀察整個 model 是否有效 (compare with null deviance) 2. 新增 factor , 對舊 model 是否有顯著的進步 (compare with old deviance) 我是這樣理解的也不知道對不對

→

06/23 01:00, , 10^F

06/23 01:00, 10^F

先釐清名詞 x變項 = factor(因子) , factor內含有 3 個 levels (AA,AG,GG) =============================================================== =============================================================== 主要目的有兩項 1. 先決定 x 是否有用 (是否具備基礎的預測能力) 2. x 是三元變項 , 但是否轉換成二元變項 , 預測能力會更佳 (e.g. AA vs AG+GG) ============================================================== ============================================================== 先設定資料 y = 1 = case , 共 20 人 x = AA , AG , GG y <- c(1,1,1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0,0,0) x <- c("AA","AA","AA","AA","AA","AA","AA","AA","AA","AG", "AG","AG","GG","GG","GG","GG","GG","GG","GG","GG") 初步觀察資料 AA 與 case 相近 AG 稍微偏向 control GG 與 control 相近 ============================================================== ============================================================== 組四種 models , 第一個3元 , 後3個2元 m123 <- glm(y~factor(x,level=c("AA","AG","GG")),family=binomial) m1_23 <- glm(y~factor(x=="AA") ,family=binomial) m2_13 <- glm(y~factor(x=="AG") ,family=binomial) m3_12 <- glm(y~factor(x=="GG") ,family=binomial) ============================================================== ============================================================== 從 Deviance 來看最小代表最好 (因子解釋力最強) > summary(m123)$deviance [1] 3.819085 > summary(m1_23)$deviance [1] 6.701994 > summary(m2_13)$deviance [1] 27.32723 > summary(m3_12)$deviance [1] 10.81347 #結論 1 > 2 > 4 > 3 ============================================================== ============================================================== 從 LR test P.value 來看 (至少有一個 level 顯著) > library(car) > Anova(m123)[[3]] [1] 6.437302e-06 > Anova(m1_23)[[3]] [1] 4.535914e-06 > Anova(m2_13)[[3]] [1] 0.5277844 > Anova(m3_12)[[3]] [1] 3.914466e-05 #結論 : 2 > 1 > 4 >>>>>> 3 #結論 : model 3 不可用 , 僅 model 1 2 4 有預測能力 ============================================================= ============================================================= 觀察顯著的 effect size (斜率) 以max(顯著的斜率群)互相比較 > summary(m123) $coefficients[,c(1,4)] Estimate Pr(>|z|) (Intercept) 21.56607 0.9982341 factor(x, level = c("AA", "AG", "GG"))AG -22.25922 0.9981773 factor(x, level = c("AA", "AG", "GG"))GG -43.13214 0.9975772 > summary(m1_23)$coefficients[,c(1,4)] Estimate Pr(>|z|) (Intercept) -2.302585 0.02813286 factor(x == "AA")TRUE 22.868654 0.99691267 > summary(m2_13)$coefficients[,c(1,4)] Estimate Pr(>|z|) (Intercept) 0.1177830 0.8084737 factor(x == "AG")TRUE -0.8109302 0.5382557 > summary(m3_12)$coefficients[,c(1,4)] Estimate Pr(>|z|) (Intercept) 1.609438 0.03773005 factor(x == "GG")TRUE -21.175506 0.99555629 #結論 2 > 4 >>>>> (3與1都沒有顯著的斜率) ============================================================= ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:09) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:10) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:12) 整理上述結果 Deviance : 1 > 2 > 4 > 3 LR test : 2 > 1 > 4 >>>>>>>> (3不顯著) 斜率觀察 : 2 > 4 >>>>> (3與1都沒有顯著的斜率) ---------------------------------------------------- 1. 先決定 x 是否有用首先我知道了 model 3 不可用 (LR test) 為何從斜率來看 , 3元的模型 (model 1) , 三個因子都不顯著? 照理說 AA 和 GG 的預測能力都應該很好才對 2. 四個 model 裡面 , 如何判定哪種最佳? 照理說該看 Deviance , 所以 model 1 最佳但從 LR test 來看 , model 2 稍佳 (LR test 和 Deviance 為何會不一致?) 而從斜率來看 , model 1 沒有斜率顯著 , 所以不可用 model 1 總覺得不能用 model 1 非常奇怪... 3. 三元模型的 Deviance 是否衡大於二元模型? p.s. 總覺得一定沒人看得懂我在說啥了 0rz... ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:34) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:38) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:38) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:42) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:48) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:52) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:54) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 01:58)

推

06/23 02:01, , 11^F

06/23 02:01, 11^F

→

06/23 02:03, , 12^F

06/23 02:03, 12^F

→

06/23 02:03, , 13^F

06/23 02:03, 13^F

→

06/23 02:12, , 14^F

06/23 02:12, 14^F

→

06/23 02:13, , 15^F

06/23 02:13, 15^F

→

06/23 02:13, , 16^F

06/23 02:13, 16^F

所以天秤的兩端一邊是解釋力 ( 降低殘差 ) 另一端是什麼? ( 提高 LR test 顯著性? 提高 AUC of ROC ? )

→

06/23 02:17, , 17^F

06/23 02:17, 17^F

→

06/23 02:19, , 18^F

06/23 02:19, 18^F

所謂"後者"是什麼? 是指一般是用 LR test 選有效因子, 而非 devaiance 最小者?

→

06/23 02:19, , 19^F

06/23 02:19, 19^F

→

06/23 02:20, , 20^F

06/23 02:20, 20^F

→

06/23 02:22, , 21^F

06/23 02:22, 21^F

第二解和第三解是在說下面這個 table 嗎? Deviance : 1 > 2 > 4 > 3 LR test : 2 > 1 > 4 >>>>>>>> (3不顯著) 斜率觀察 : 2 > 4 >>>>> (3與1都沒有顯著的斜率) 我後來發現三元的 model , 全斜率不顯著的原因是因為細格有0所導致 (odds ratio 分子分母都不可為0) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 16:09) ※ 編輯: gsuper 來自: 140.113.239.247 (06/23 16:36)

→

06/25 02:57, , 22^F

06/25 02:57, 22^F

→

06/25 03:00, , 23^F

06/25 03:00, 23^F

→

06/25 03:01, , 24^F

06/25 03:01, 24^F

→

06/25 03:02, , 25^F

06/25 03:02, 25^F

→

06/25 03:03, , 26^F

06/25 03:03, 26^F

→

06/25 03:04, , 27^F

06/25 03:04, 27^F

→

06/25 03:05, , 28^F

06/25 03:05, 28^F

→

06/25 03:06, , 29^F

06/25 03:06, 29^F

→

06/25 03:08, , 30^F

06/25 03:08, 30^F

補個程式筆記 (自用,與本篇內容無關) str <- paste(paste("FACTOR[[",1:5,"]]",sep=""),collapse="*") code <- paste("glm(PHENO~",str,",family=binomial)") mod <- eval(parse(text=code)) ※ 編輯: gsuper 來自: 140.113.239.247 (07/08 17:54)

‣ 返回看板[ Statistics ] 統計

‣ 更多 gsuper 的文章

文章代碼(AID): #1FurxkiE (Statistics)