[Question] How do I handle https when scraping images from Dcard?
As the title says. My code is below.
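import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.request import urlopen

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'
}
url = 'http://www.dcard.tw/f/photography/p/226364232.html'
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
imgs = soup.select('img')

os.makedirs('./imgs', exist_ok=True)  # open() does not create the folder
for img in imgs:
    try:
        fn = img['src']           # raises KeyError if the tag has no src
        print(fn)
        data = urlopen(fn).read() # urlopen accepts both http and https URLs
    except Exception as e:
        print(e)
        continue
    # fn is a full URL containing ':' and '/', so it cannot be used as a
    # file name directly; keep only the last segment of the URL path.
    name = os.path.basename(urlparse(fn).path) or 'unnamed'
    with open(os.path.join('./imgs', name), 'wb') as f:
        f.write(data)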
The URL above is only a test URL.
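A `src` value is not always a complete `https://...` URL; it may be scheme-relative (`//host/x.jpg`) or relative to the page. The following is a minimal sketch of normalizing it before downloading, assuming that is part of the problem here. The helper name `to_download_target` and the sample image URLs are made up for illustration and are not taken from the actual Dcard page.

```python
import os
from urllib.parse import urljoin, urlparse

def to_download_target(page_url, src):
    """Return (absolute_url, local_file_name) for one <img> src value."""
    # urljoin leaves absolute URLs alone and resolves //host/... and /path
    # forms against the page URL.
    absolute = urljoin(page_url, src)
    # Use only the last path segment as the file name; fall back to a
    # placeholder when the path ends in '/'.
    name = os.path.basename(urlparse(absolute).path) or 'unnamed'
    return absolute, name

page = 'http://www.dcard.tw/f/photography/p/226364232.html'
print(to_download_target(page, 'https://i.imgur.com/abc123.jpg'))
print(to_download_target(page, '//i.imgur.com/abc123.jpg'))
```

The returned URL can be passed to `urlopen` (or `requests.get`) and the returned name appended to the `./imgs/` folder path.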
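I searched on Google and read a few posts.
One approach I found looks like this: if re.match(r'^https?://(i.)?(m.)?imgur.com', link['href']):
However, Dcard keeps its image URLs in the src attribute, so I am not sure how to adapt it. This is my first post, so please point out anything I got wrong.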
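The imgur check quoted above can be applied to `src` values in the same way as to `href`. A minimal sketch, assuming the images are hosted on imgur and that the `src` strings have already been collected from the `<img>` tags; the helper name `filter_imgur` and the sample URLs are hypothetical. The dots in the pattern are escaped here so that they match literal dots only.

```python
import re

# Same idea as the quoted pattern, with '.' escaped and a trailing '/'
# so that hosts such as imgur.com.example.org do not match.
IMGUR_RE = re.compile(r'^https?://(i\.)?(m\.)?imgur\.com/')

def filter_imgur(srcs):
    """Keep only the src values that point at imgur over http or https."""
    return [s for s in srcs if s and IMGUR_RE.match(s)]

srcs = [
    'https://i.imgur.com/abc123.jpg',
    'http://imgur.com/def456.png',
    'https://example.com/logo.png',
    None,  # an <img> tag without a src attribute
]
print(filter_imgur(srcs))
```

With BeautifulSoup, the input list could be built as `[img.get('src') for img in soup.select('img')]`, since `get` returns `None` instead of raising when the attribute is missing.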
Thanks, everyone.
--
※ Origin: PTT (ptt.cc), From: 123.205.57.171
※ Article URL: https://www.ptt.cc/bbs/Python/M.1494499637.A.4E1.html