[Question] Web scraping problem

Board: Python, Author: MAGICXX (逢甲阿法), Posted: 7 years ago (2018/05/21 02:20), edited 7 years ago, Push/boo: 4 (4 push, 0 boo, 10 →)

14 comments, 6 participants, latest 7 years ago, thread 3/5

Good morning, everyone. I'm new to web scraping... I'm trying to scrape daily reservoir data and got stuck partway through... Here is my code:

# -*- coding: utf-8 -*-
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import _uniout
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
driver=webdriver.Firefox()
url='http://fhy.wra.gov.tw/ReservoirPage_2011/StorageCapacity.aspx'
driver.get(url)
sel = Select(driver.find_element_by_id('ctl00_cphMain_cboSearch'))
sel.select_by_index(2)
data=pd.read_html(url)
print data

I have two questions.

1. When I run it, it prints a string of garbled characters... https://i.imgur.com/gsXnlWz.png

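I searched online for a fix and already put # -*- coding: utf-8 -*- at the top, but the output is still garbled. In case the government site uses something like Big5, I also checked with chardet:

>>> import urllib
>>> data=urllib.urlopen('http://fhy.wra.gov.tw/ReservoirPage_2011/StorageCapacity.aspx').read()
>>> import chardet
>>> chardet.detect(data)
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}

So the page is definitely utf-8, yet the output is still garbled.. How should I handle this...?

2. I want to get the 「水庫及攔河堰」 (Reservoirs and Weirs) data for 2018/05/20. I have already used selenium to switch the dropdown to its third option, but what gets read in the end is still the first option. What should I pass in data=pd.read_html(url)?

Thanks in advance, everyone...

--
※ Origin: 批踢踢實業坊 (ptt.cc), from: 140.134.166.72
※ Article URL: https://www.ptt.cc/bbs/Python/M.1526869252.A.DB8.html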
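On question 2, one likely explanation is that pd.read_html(url) downloads the page again by itself, so it never sees the dropdown change made in the selenium browser. A minimal sketch of one way around that, written for Python 3 and Selenium 4 (not the Python 2 setup in the post): parse driver.page_source instead of the URL. The helper names tables_from_html and fetch_selected_tables are made up for illustration, and the sketch assumes the page has finished updating right after the selection; the real site may need an explicit wait or a button click, which is not handled here.

```python
import io

import pandas as pd


def tables_from_html(html):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    # Wrapping the markup in StringIO makes pandas parse it as HTML text
    # instead of treating the argument as a URL or file path to fetch.
    return pd.read_html(io.StringIO(html))


def fetch_selected_tables(url, select_id, index):
    """Pick a dropdown option in a real browser, then parse the rendered page."""
    # Imported here so tables_from_html works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        Select(driver.find_element(By.ID, select_id)).select_by_index(index)
        # Assumption: the page is already updated at this point. If it is
        # not, a wait (or a click on a search button) has to go here.
        return tables_from_html(driver.page_source)
    finally:
        driver.quit()


if __name__ == "__main__":
    tables = fetch_selected_tables(
        "http://fhy.wra.gov.tw/ReservoirPage_2011/StorageCapacity.aspx",
        "ctl00_cphMain_cboSearch",
        2,  # third option, as in the post
    )
    for table in tables:
        print(table.head())
```

Keeping the parsing step in its own function means it can be tried on any saved HTML string without opening a browser.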

Push

TitanEric

05/21 10:41, 7 years ago, 1F

→

TitanEric

05/21 10:41, 7 years ago, 2F

Sorry, the data I've scraped is a list, and it doesn't seem to have a str attribute. Or do I need to loop over it and convert it to str?
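A possible reading of this, assuming the list here is what pd.read_html returned: that function gives back a list of DataFrames, and the .str accessor belongs to a string column (a Series), so the list itself has no such attribute. A small sketch in Python 3 with made-up sample markup and column names:

```python
import io

import pandas as pd

# Made-up sample markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>level</th></tr>
  <tr><td>A</td><td>1.5</td></tr>
  <tr><td>B</td><td>2.5</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(SAMPLE_HTML))  # a list of DataFrames
first = tables[0]                                # one DataFrame from the list

# .str is available on a string column, not on the list.
names = first["name"].str.lower()

# If plain text is wanted, convert each table rather than the list itself.
as_text = [table.to_string(index=False) for table in tables]
```

So a loop is only needed when every table in the list should be converted; for a single table, indexing into the list is enough.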

→

TitanEric

05/21 10:42, 7 years ago, 3F