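[Question] Web crawler question (BeautifulSoup)
I am writing a program that computes the level (link depth) of every link reachable from web page A.
My approach is to extract each 'href' directly from every page.
With the maximum level set to 2 everything works,
but with it set to 3 I get the following error:
UnboundLocalError: local variable 'soup' referenced before assignment
Could someone help me find where the problem is? Thanks!
The code is as follows: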
import lxml.html
import urllib.request
from bs4 import BeautifulSoup
stopLevel = 3  # no problem occurs when this is changed to 2
rootUrls = ['http://ps.ucdavis.edu/']
foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl : 0})
def getProtocolAndDomainName(url):
    # split the url on '://' and return a list
    protocolAndOther = url.split('://')
    protocol = protocolAndOther[0]
    domainName = protocolAndOther[1].split('/')[0]
    # this only returns something like 'https://xxxxx.com'
    return protocol + '://' + domainName
def crawl(urls, stopLevel = 5, level=1):
    nextUrls = []
    if (level <= stopLevel):
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url)
                soup = BeautifulSoup(openedUrl, 'lxml')
            except:
                print('cannot read for :' + url)
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case where the link is a relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check whether the url has already been visited
                    if href not in foundUrls:
                        foundUrls.update({href : level})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)
crawl(rootUrls, stopLevel)
print(foundUrls)
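A likely cause, sketched under assumptions: when urlopen or BeautifulSoup raises inside the try, the except branch only prints, and the loop then still reaches soup.find_all. If the failure happens on the first URL of a crawl() call, 'soup' has never been assigned in that call, which is what UnboundLocalError reports. If it happens on a later URL, 'soup' silently still holds the previous page. Presumably level 3 is just the first depth at which a page fails to open for this site. The sketch below reproduces both behaviours offline and shows one possible fix. The names fake_fetch, links_buggy and links_fixed are hypothetical and do not appear in the original code; fake_fetch stands in for urlopen plus BeautifulSoup.

```python
def fake_fetch(url):
    """Hypothetical stand-in for urlopen + BeautifulSoup: fails for 'bad' URLs."""
    if 'bad' in url:
        raise OSError('unreachable')
    return '<parsed ' + url + '>'


def links_buggy(urls, fetch=fake_fetch):
    """Same try/except shape as the loop in crawl()."""
    parsed = []
    for url in urls:
        try:
            soup = fetch(url)
        except Exception:
            print('cannot read for :' + url)
        # Still runs after a failure: 'soup' is unbound if the first URL
        # failed, or left over from the previous URL if a later one failed.
        parsed.append((url, soup))
    return parsed


def links_fixed(urls, fetch=fake_fetch):
    """One possible fix: skip the rest of the iteration when the fetch fails."""
    parsed = []
    for url in urls:
        try:
            soup = fetch(url)
        except Exception:
            print('cannot read for :' + url)
            continue  # go to the next URL; nothing below runs for this one
        parsed.append((url, soup))
    return parsed
```

In the original crawl(), the equivalent change would be a 'continue' after the print in the except block, or moving the link loop into an 'else:' clause of the try.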
--
※ Origin: 批踢踢實業坊(ptt.cc), From: 36.225.30.106
※ Article URL: https://www.ptt.cc/bbs/Python/M.1504340414.A.5F1.html
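A separate point about the relative-link handling in the post: joining getProtocolAndDomainName(url) and href by plain concatenation only works for hrefs that start with '/'. An href such as 'faculty.html' would lose its separator, and a 'mailto:' link would be treated as a relative path because it contains no '://'. A minimal sketch of an alternative, assuming the standard library's urllib.parse.urljoin is acceptable here; resolve_link is a hypothetical helper name, and the page paths are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None for other schemes."""
    absolute = urljoin(page_url, href)
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None  # e.g. mailto: links, which a crawler cannot fetch
    return absolute


# Hypothetical page path under the root URL from the post.
page = 'http://ps.ucdavis.edu/people/index.html'
print(resolve_link(page, 'faculty.html'))   # http://ps.ucdavis.edu/people/faculty.html
print(resolve_link(page, '/about'))         # http://ps.ucdavis.edu/about
print(resolve_link(page, 'https://example.com/x'))       # https://example.com/x
print(resolve_link(page, 'mailto:someone@example.com'))  # None
```

Inside crawl(), a call like this could replace the '://' check, with hrefs that resolve to None simply skipped.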