[Question] BS (BeautifulSoup) fails to recognize Chinese text that has already appeared earlier

Board: Python | Author: ansem (DoubleA) | Time: 2016/07/07 06:07 | Score: 1 (1 push, 0 boo, 14 →)

15 comments, 3 participants, thread 1/1

While scraping data from a web page, I found that text which has already appeared once does not seem to be recognized when it appears again.

import urllib
from bs4 import BeautifulSoup

#url ='http://mops.twse.com.tw/mops/web/ajax_t164sb04?encodeURIComponent=1&step=1&firstin=1&off=1&keyword4=&code1=&TYPEK2=&checkbtn=&queryName=co_id&TYPEK=all&isnew=false&co_id=2330&year=102&season=01'
url='http://mops.twse.com.tw/server-java/t164sb01?step=1&CO_ID=2330&SYEAR=2013&SSEASON=1&REPORT_ID=C'
response = urllib.urlopen(url)
html= response.read()
sp = BeautifulSoup(html,"lxml") #cp950
trs=sp.find_all('tr',attrs={'class':["odd","even"]})
for tr in trs:
    # any label that repeats earlier text is not recognized
    tds=tr.find_all('td')
    for td in tds:
        if (td.get_text().strip().encode('utf8')=="營業收入合計"):
            if (tds[1].get_text().strip()!=''):
                print('Earning','102','1',tds[1].get_text().strip().encode('utf8'))
                print('Earning','102','1',tds[2].get_text().strip().encode('utf8'))
        if (td.get_text().strip().encode('utf8')=="基本每股盈餘合計"):
            if (tds[1].get_text().strip()!=''):
                print('EPS','102','1',tds[1].get_text().strip().encode('utf8'))
                print('EPS','102','1',tds[2].get_text().strip().encode('utf8'))

The program has no trouble at all picking up the first label, 營業收入合計 (total operating revenue), but it does nothing when it should pick up 基本每股盈餘合計 (total basic earnings per share). I suspect this is because the string "基本每股盈餘" (the same label without 合計) has already appeared earlier on the page. Is this a bug in the library, or is there something wrong with my own code? Any advice would be appreciated.

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 60.250.205.229
※ Article URL: https://www.ptt.cc/bbs/Python/M.1467871646.A.402.html
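One way to investigate a label that silently fails to match is to look at the exact characters in each cell and to compare normalized text instead of raw bytes. The following is only a minimal sketch under stated assumptions: it targets Python 3 (the post uses Python 2), the helper names `normalize_label` and `find_row` are hypothetical, and the sample rows and their numbers are made up for illustration rather than taken from the actual page. Whether hidden whitespace is really the cause in this case is not confirmed.

```python
import unicodedata


def normalize_label(text):
    """Return text in NFKC form with every whitespace character removed.

    Note: this also removes spaces inside a label, which is acceptable
    for the Chinese labels used here but may not suit other data.
    """
    # NFKC folds full-width forms and maps U+00A0 / U+3000 to plain spaces.
    folded = unicodedata.normalize("NFKC", text)
    # str.split() with no argument splits on any Unicode whitespace.
    return "".join(folded.split())


def find_row(rows, label):
    """Return the first row whose first cell equals label after normalizing."""
    wanted = normalize_label(label)
    for row in rows:
        if row and normalize_label(row[0]) == wanted:
            return row
    return None


# Made-up cell texts, shaped like what td.get_text() might return.
sample_rows = [
    ["營業收入合計", "100", "90"],
    ["基本每股盈餘", "", ""],
    ["\u3000基本每股盈餘合計\xa0", "1.5", "1.2"],
]

for row in sample_rows:
    # repr() makes otherwise invisible characters visible.
    print(repr(row[0]))

eps_row = find_row(sample_rows, "基本每股盈餘合計")
print(eps_row[1:])
```

With BeautifulSoup, the rows could be built from the `trs` in the post as `[[td.get_text() for td in tr.find_all('td')] for tr in trs]`. Printing `repr()` of each label first shows whether the text on the page really is the string being compared against.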

推

s860134

07/07 19:13, 1F

→

ansem

07/07 22:57, 2F

→

s860134

07/08 08:04, 3F