[Question] Web crawler question (BeautifulSoup)

Board: Python  Author: (消費券收購商)  Time: 6 years ago (2017/09/02 16:20)  Score: 1 (1 push, 0 boo, 14 arrows)
15 comments, 4 participants, thread 1/1
I'm writing a crawler that computes the link "level" from page A to the links on other sites. The approach is simply to grab every 'href' on each page. When I set the maximum level to 2 there is no problem at all, but when I set it to 3 I get this error:

UnboundLocalError: local variable 'soup' referenced before assignment

Could an expert help me see where the problem is? Thanks! The code is as follows:

import lxml.html
import urllib.request
from bs4 import BeautifulSoup

stopLevel = 3   # no problem at all when this is set to 2
rootUrls = ['http://ps.ucdavis.edu/']

foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl: 0})

def getProtocolAndDomainName(url):
    protocolAndOther = url.split('://')             # split the url by '://' and return a list
    protocol = protocolAndOther[0]
    domainName = protocolAndOther[1].split('/')[0]
    return protocol + '://' + domainName            # returns only e.g. 'https://xxxxx.com'

def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url)
                soup = BeautifulSoup(openedUrl, 'lxml')
            except:
                print('cannot read for :' + url)
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case where a link is a relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check whether the url has already been visited
                    if href not in foundUrls:
                        foundUrls.update({href: level})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)

crawl(rootUrls, stopLevel)
print(foundUrls)

--
※ Posted from: PTT (ptt.cc), from: 36.225.30.106
※ Article URL: https://www.ptt.cc/bbs/Python/M.1504340414.A.5F1.html

09/02 22:05, 1F: The error message already tells you: you used soup before assigning it a value.

09/02 22:07, 2F: Just think about which situation would leave soup unassigned and you'll see where the problem is.
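A minimal sketch of the failure mode being hinted at here (the url below is a placeholder for any address urlopen cannot read):

import urllib.request
from bs4 import BeautifulSoup

def fetch_links(url):
    try:
        openedUrl = urllib.request.urlopen(url)
        soup = BeautifulSoup(openedUrl, 'lxml')
    except:
        print('cannot read for :' + url)
    # execution falls through to here even when the try block failed,
    # so soup may never have been assigned in this call
    return soup.find_all('a', href=True)   # UnboundLocalError

fetch_links('http://no-such-host.invalid/')   # hypothetical unreadable url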

09/02 22:59, 3F: The exception handling could also be improved.

09/03 15:15, 4F: It's probably that a few of the converted urls can't be opened with BeautifulSoup.

09/03 15:25, 5F: The way I'm handling it now is to add one extra line inside the crawl function:

09/03 15:26, 6F: global soup. It does fix the error, but I'm not sure it's a good approach.

09/03 15:26, 7F: I've also thought about, under openedUrl = urllib.request.urlopen(url),

09/03 15:27, 8F: putting a try-except there instead of using global soup.
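A sketch of what the idea in 7F-8F could look like (not the poster's actual code): let the except branch skip to the next url with continue, so soup is only used after a successful parse.

def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            try:
                openedUrl = urllib.request.urlopen(url)
                soup = BeautifulSoup(openedUrl, 'lxml')
            except:
                print('cannot read for :' + url)
                continue   # skip this url; soup was never assigned for it
            for a in soup.find_all('a', href=True):
                ...        # rest of the loop unchanged from the original
        crawl(nextUrls, stopLevel, level + 1)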

09/03 15:27, 9F: Could any experts offer some advice? Thanks!

09/03 16:04, 10F: Try printing out the processed urls to see which one fails?

09/03 16:14, 11F: Using global just puts soup into the global symbol table,

09/03 16:14, 12F: so it looks like it "solves" the problem only because it happens not to raise an error.
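A sketch of that coincidence, assuming one successful fetch happens before a failing one (the second url is a placeholder):

import urllib.request
from bs4 import BeautifulSoup

soup = None   # module-level name, as the global declaration implies

def read(url):
    global soup
    try:
        soup = BeautifulSoup(urllib.request.urlopen(url), 'lxml')
    except:
        print('cannot read for :' + url)

read('http://ps.ucdavis.edu/')          # succeeds: soup holds this page
read('http://no-such-host.invalid/')    # fails: soup STILL holds the old page
# code that uses soup now quietly re-processes the previous page
# instead of raising UnboundLocalError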

09/03 16:15, 13F: The more correct approach is: when fetching or parsing a page fails,

09/03 16:15, 14F: handle the control flow properly and don't go on to use the soup variable,

09/03 16:15, 15F: because it was never filled with the content it should have.
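One way to write that control flow (a sketch reusing foundUrls and getProtocolAndDomainName from the original post) is a try/except/else, so the parsing loop runs only when soup was actually assigned:

import urllib.request
from bs4 import BeautifulSoup

def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            try:
                openedUrl = urllib.request.urlopen(url)
                soup = BeautifulSoup(openedUrl, 'lxml')
            except Exception:
                print('cannot read for :' + url)
            else:
                # reached only when fetch and parse both succeeded,
                # so soup is guaranteed to be bound here
                for a in soup.find_all('a', href=True):
                    href = a['href']
                    if '://' not in href:   # relative link
                        href = getProtocolAndDomainName(url) + href
                    if href not in foundUrls:
                        foundUrls[href] = level
                        nextUrls.append(href)
        crawl(nextUrls, stopLevel, level + 1)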
Article ID (AID): #1Pgcc-Nn (Python)