[Question] How to handle https when scraping images from Dcard

Board: Python | Author: craig1122321 (半醉夜貓) | Time: 2017/05/11 18:47 | Votes: 1 (1 push, 0 boo, 15 →)

16 comments, 5 participants, thread 1/2

As the title says. The code is below:

    import requests, threading
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'
    }
    url = ('http://www.dcard.tw/f/photography/p/226364232.html')
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    imgs = soup.select('img')
    for img in imgs:
        try:
            fn = img['src']
            print(fn)
            img = urlopen(fn)
        except Exception as e:
            print(e)
            continue
        with open('./imgs/' + str(fn), 'wb') as f:
            f.write(img.read())

The URL above is only a test URL. I searched Google and found one approach that goes like this:

    if re.match(r'^https?://(i.)?(m.)?imgur.com', link['href']):

However, Dcard keeps its image URLs in the src attribute, so I don't know how to adapt it.

This is my first post, so please point out anything I got wrong. Thanks, everyone.

--
※ Origin: PTT (ptt.cc), From: 123.205.57.171
※ Article URL: https://www.ptt.cc/bbs/Python/M.1494499637.A.4E1.html
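A rough sketch of how the imgur-style `re.match` check could be moved onto `img['src']`, offered as one possible direction rather than a confirmed fix. Two things in the posted code look likely to cause trouble: `src` may be protocol-relative or relative, and `'./imgs/' + str(fn)` uses the whole `src` as the file name, so any slashes in it point `open()` at subdirectories that do not exist. The helper names (`absolute_image_url`, `local_filename`, `download_images`) and the list of image extensions are assumptions made for illustration, and whether Dcard actually puts `<img>` tags in the HTML that `requests` receives has not been verified here.

```python
import os
import re
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

# Assumption: only these extensions count as images worth saving.
IMAGE_EXT = re.compile(r'\.(?:jpe?g|png|gif)$', re.IGNORECASE)


def absolute_image_url(src, page_url):
    """Resolve src against the page URL; return None if it is not an http(s) image link."""
    if not src:
        return None
    # urljoin fills in the scheme for '//host/a.jpg' and the host for '/a.jpg'.
    full = urljoin(page_url, src)
    parts = urlparse(full)
    if parts.scheme not in ('http', 'https'):
        return None  # skips data: URIs and similar
    if not IMAGE_EXT.search(parts.path):
        return None
    return full


def local_filename(url):
    """Use only the last path segment, so the file name contains no slashes."""
    return os.path.basename(urlparse(url).path)


def download_images(srcs, page_url, out_dir='./imgs', user_agent='Mozilla/5.0'):
    """Download every src that passes the filter; return the saved file paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for src in srcs:
        url = absolute_image_url(src, page_url)
        if url is None:
            continue
        try:
            req = Request(url, headers={'User-Agent': user_agent})
            with urlopen(req, timeout=10) as resp:
                data = resp.read()
        except OSError as e:  # URLError and socket timeouts are OSError subclasses
            print(url, e)
            continue
        path = os.path.join(out_dir, local_filename(url))
        with open(path, 'wb') as f:
            f.write(data)
        saved.append(path)
    return saved
```

With the `soup` and `url` from the post, this could be called as `download_images([img.get('src') for img in soup.select('img')], url)`. Using `img.get('src')` instead of `img['src']` avoids a `KeyError` on tags that have no `src` at all.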

→ uranusjr 05/11 22:02, 1F
→ zerof 05/11 22:33, 2F
→ craig1122321 05/12 13:03, 3F