[Question] Chinese encoding (MS950) problem in R

Board: R_Language | Author: AmuroRai (SIEG ZEON!!!!!!) | Time: 2015/12/29 22:03

9 comments, 3 participants; post 1 of 3 in this thread

[Question type]: Programming help (I want to do something in R but don't know how to write it)

[Software familiarity]: Beginner (I have written other languages and am just unfamiliar with the syntax)

[Problem description]:
I recently started learning to write web scrapers in R. This afternoon, while trying to scrape the Taiwan Stock Exchange's list of stock codes, I found that the site appears to use the MS950 encoding, which R does not recognize (see the code section). I then tried forcing utf-8 and big5, and also tried converting with tmcn, but the Chinese text still comes out as garbage. Is there any way to get around this? (Downloading the page, saving it as CSV, converting the encoding and then feeding it to R is not an option I am considering.)

[Code example]:
Only the first attempt, using MS950, is attached; the results with big5 and utf-8 are much the same. For some reason the output of res and ress at the end could not be pasted in full, but in any case every piece of Chinese text is garbled.

> rm(list=ls())
>
> library(tmcn)
> library(httr)
> library(rvest)
> library(stringr)
> library(magrittr)
>
> r<-GET("http://isin.twse.com.tw/isin/C_public.jsp?strMode=2")
> r
Response [http://isin.twse.com.tw/isin/C_public.jsp?strMode=2]
  Date: 2015-12-29 10:15
  Status: 200
  Content-Type: text/html;charset=MS950
  Size: 2.69 MB
Unknown encoding MS950. Defaulting to latin1 (ISO-8859-1).
<link rel="stylesheet" href="http://www.tse.com.tw/style1.css" type=...
<body><table align=center><h2><strong><font class='h1'>￥蠊
> res<-r$content%>%read_html(encoding="MS950")%>%
+ html_node(".h4")%>%html_nodes(xpath="tr")%>%html_text()
> ress<-toUTF8(res)
> res[1:5]
[1] "礎糧罈羅\xc3砥国並怏代繡繒瞻\xc3缃名繙\xc3\x99 簞礙罈\xc3斾砥国並怏螢姻砥＇蜃嗽碼(ISIN Code)瞻W瞼竄瞻矇瞼竄糧繭禮O簡瞿繚~禮OCFICode糧\xc3屡腕\xb9"
[2] " 穠\xc3＇笨\xbc "
[3] "1101 癒@瞼x穠dTW00011010041962/02/09瞻W瞼竄瞻繫穠d瞻u繚~ESVUFR"
[4] "1102 癒@穡\xc3鱿泥TW00011020021962/06/08瞻W瞼竄瞻繫穠d瞻u繚~ESVUFR"
[5] "1103 癒@繒\xc3埘泥TW00011030001969/11/14瞻W瞼竄瞻繫穠d瞻u繚~ESVUFR"
> ress[1:5]
[1] "| 3 瞼N繡繒瞻\xc3缃名繙\xc3\x99 簞礙罈\xc3斾
[2] " ªÑ2 ¼ "
[3] "1101 !
[4] "1102 !
[5] "1103 !

[Environment]: Win7 64-bit, R 3.2.3 (2015/12/10)

[Keywords]: Chinese encoding, MS950, tmcn

--
※ Origin: PTT (ptt.cc), From: 140.109.122.150
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1451397792.A.742.html
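One direction that may be worth trying, sketched here under assumptions rather than confirmed against the TWSE page: as far as I know, MS950 is Java's name for Windows code page 950 (Microsoft's Big5 variant), and iconv builds usually know that code page as "CP950" rather than "MS950". The helper `pick_encoding` and the sample bytes below are made up for illustration; the idea is to decode the raw bytes directly with base R's iconv() so they never get read as latin1 first.

```r
# Big5/CP950 bytes for the two characters U+4E2D U+6587 ("Chinese text").
big5_bytes <- as.raw(c(0xA4, 0xA4, 0xA4, 0xE5))

# Hypothetical helper: return the first Big5-family name this iconv build knows.
pick_encoding <- function(candidates = c("CP950", "BIG5")) {
  known <- toupper(iconvlist())
  hit <- candidates[toupper(candidates) %in% known]
  if (length(hit) == 0) stop("no Big5-family encoding available in this iconv build")
  hit[1]
}

enc <- pick_encoding()

# iconv() accepts a list of raw vectors, so the bytes are decoded as-is.
txt <- iconv(list(big5_bytes), from = enc, to = "UTF-8")
print(txt)
```

With an httr response, `r$content` is already a raw vector, so the same call shape would be `iconv(list(r$content), from = enc, to = "UTF-8")`, and the resulting UTF-8 string could then be handed to read_html(). Whether that decodes every character on this particular page is untested.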

→ Wush978, 12/29 22:05, 1F
→ Wush978, 12/29 22:05, 2F
→ AmuroRai, 12/29 22:06, 3F
→ AmuroRai, 12/29 22:06, 4F

Following up on the above: after defining r, I experimentally tried the parsing approaches below, and the results differ slightly.

If I use content and force MS950, I get the unknown-encoding message, but switching to Big5 is clearly no more correct (res1 and res2 below). If I use read_html, the result is the same whether I pass Big5 or MS950 (res3 and res4). And if I pass the URL straight into read_html with Big5, it fails outright (res5). The same thing happened over the last few days on another site that definitely uses Big5; that one was eventually solved by using content with Big5 plus tmcn's toUTF8, but the page I am scraping this time cannot be handled that way.

> res1<-content(r,encoding="MS950")
Unknown encoding MS950. Defaulting to latin1 (ISO-8859-1).
> res2<-content(r,encoding="Big5")
> res3<-r%>%read_html(encoding="Big5")
> res4<-r%>%read_html(encoding="MS950")
> res5<-read_html("http://isin.twse.com.tw/isin/C_public.jsp?strMode=2",
+                 encoding="Big5")
Error: input conversion failed due to input error, bytes 0xF9 0xD6 0x3C 0x2F [6003]
> res2
> res3
{xml_document}
<html>
[1] <head><link rel="stylesheet" href="http://www.tse.com.tw/style1.css" t ...
[2] <body><table align="center"><h2><strong><font class="h1">驍冕祁上瞼竄\xc3 砥国並 ...
> res4
{xml_document}
<html>
[1] <head><link rel="stylesheet" href="http://www.tse.com.tw/style1.css" t ...
[2] <body><table align="center"><h2><strong><font class="h1">驍冕祁上瞼竄\xc3

※ Edited: AmuroRai (140.109.122.150), 12/29/2015 22:17:04
※ Edited: AmuroRai (140.109.122.150), 12/29/2015 22:18:42
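The res5 error names the bytes 0xF9 0xD6 followed by 0x3C 0x2F, which is "</" in ASCII. My understanding, stated as an assumption, is that 0xF9D6 falls in the vendor-extension range that code page 950 covers but a strict Big5 table may reject, which would explain why a "Big5" conversion aborts at that point. A minimal sketch for probing this in base R follows; `decode_lenient` is a hypothetical helper, and `sub = "byte"` makes iconv() mark undecodable bytes as <xx> instead of failing on the whole input.

```r
# The four bytes quoted in the error message: 0xF9 0xD6 followed by "</".
bad_bytes <- as.raw(c(0xF9, 0xD6, 0x3C, 0x2F))

# Hypothetical helper: decode with `enc`, marking undecodable bytes as <xx>;
# fall back to "BIG5" if this iconv build does not know `enc` at all.
decode_lenient <- function(bytes, enc = "CP950") {
  tryCatch(
    iconv(list(bytes), from = enc, to = "UTF-8", sub = "byte"),
    error = function(e) {
      iconv(list(bytes), from = "BIG5", to = "UTF-8", sub = "byte")
    }
  )
}

decoded <- decode_lenient(bad_bytes)
print(decoded)
```

If the printed string starts with "<f9><d6>", the encoding in use does not cover that byte pair; if it starts with a single Chinese character, it does. The output depends on the platform's iconv, so no expected result is shown here.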

→ celestialgod, 12/29 23:06, 5F
→ celestialgod, 12/29 23:07, 6F
→ celestialgod, 12/29 23:28, 7F