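[Question] Filtering by a specific string with lxml
Hello everyone,
I recently started learning Python because I need to scrape some data.
So far I can fetch a page's HTML source and read the text inside an element.
However, much of that information is not what I need, so I would like to ask how to filter it with lxml.
Below is part of the page source; I have removed part of the site URL.
Every id ends with a descriptive string, and I want to use that string to filter for the data I need.
For example, given id="ctl00_ContentPlaceHolder1_gvStock_ctl02_hyNumber",
I want to use ctl00_ContentPlaceHolder1_gvStock as the filter condition and hyNumber as the category.
Any pointers would be appreciated. Thank you.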
<a id="ctl00_ContentPlaceHolder1_gvStock_ctl02_hyNumber"
href="http://corpinfom.aspx?stockno=2429"
target="_blank">2429</a></td><td>
<a id="ctl00_ContentPlaceHolder1_gvStock_ctl02_hyStock"
href="http://corpinfom.aspx?stockno=2429"
target="_blank">銘旺科</a></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblPrice"
style="color:White;background-color:Red;">45.15</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblDifference"
style="color:White;background-color:Red;">▲2.95</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblPercent"
style="color:White;background-color:Red;">+6.99%</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblWeekPercent"
style="color:Red;">21.70%</span></td><td>
<span
id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblAmplitude">5.09%</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblHigh"
style="color:Red;">45.15</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblLow"
style="color:Red;">43.00</span></td><td>
<span
id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblLast">42.20</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblVolume"
title="17,000 股">17</span></td><td>
<span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblMoney"
title="17,000 股">0.01</span></td>
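
As a rough sketch of the filtering described above: XPath's starts-with() function can select elements whose id begins with a given prefix, with no regular expression involved. The markup below is a shortened copy of the sample (href and style attributes dropped). The ctl00_Menu_lnkHome link, the SAMPLE and PREFIX names, and the select_by_id_prefix helper are made up for illustration and do not come from the real page.

```python
from lxml import etree

# Shortened copy of the sample markup; the last link is a hypothetical
# element added only to show that non-matching ids are skipped.
SAMPLE = """
<table><tr>
<td><a id="ctl00_ContentPlaceHolder1_gvStock_ctl02_hyNumber">2429</a></td>
<td><span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblPrice">45.15</span></td>
<td><a id="ctl00_Menu_lnkHome">Home</a></td>
</tr></table>
"""

PREFIX = "ctl00_ContentPlaceHolder1_gvStock"


def select_by_id_prefix(html, prefix):
    """Return the <a> and <span> elements whose id starts with prefix."""
    tree = etree.HTML(html)
    # $prefix is an XPath variable; lxml fills it from the keyword argument.
    query = ("//a[starts-with(@id, $prefix)]"
             " | //span[starts-with(@id, $prefix)]")
    return tree.xpath(query, prefix=prefix)


matched = select_by_id_prefix(SAMPLE, PREFIX)
ids = [el.get("id") for el in matched]
texts = [el.text for el in matched]
print(ids)
print(texts)
```

Passing the prefix as an XPath variable avoids building the query by string concatenation, which keeps quoting problems out of the expression.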
--
※ Posted from: 批踢踢實業坊 (ptt.cc)
◆ From: 59.124.199.162
→ 06/11 12:54, 1F
→ 06/11 19:23, 2F
→ 06/11 19:24, 3F
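I did not use BS (BeautifulSoup) because the articles I read said its performance is rather poor.
I did consider RE (regular expressions), but since I am already using lxml, filtering again with RE feels redundant.
My current program is as follows: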
from urllib.request import urlopen  # Python 3 location of urlopen
from lxml import etree

html = urlopen('http:// ').read()
tree = etree.HTML(html)
# Select every <a> element in the document
elements = tree.xpath(u"//a")
for element in elements:
    print(element.text)
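I would like to add the filter string directly inside the xpath(u"//a") call.
Given ctl00_ContentPlaceHolder1_gvStock_ctl02_lblMoney,
I want to use ctl00_ContentPlaceHolder1_gvStock as the filter condition
and lblMoney as the category.
If RE is the only way to filter, please let me know, and also tell me how RE performs. Thank you.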
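
One possible way to do both steps without RE, offered as a sketch and not as the definitive answer: filter on the prefix with XPath's starts-with(), then treat whatever follows the last underscore of the id as the category. The second row (ctl03) and its values are invented purely to show the grouping, and SAMPLE, PREFIX, and group_by_suffix are hypothetical names.

```python
from collections import defaultdict

from lxml import etree

# The first row follows the sample markup; the second row (ctl03) is
# made-up data, added only so each category holds more than one value.
SAMPLE = """
<table>
<tr>
<td><a id="ctl00_ContentPlaceHolder1_gvStock_ctl02_hyNumber">2429</a></td>
<td><span id="ctl00_ContentPlaceHolder1_gvStock_ctl02_lblMoney">0.01</span></td>
</tr>
<tr>
<td><a id="ctl00_ContentPlaceHolder1_gvStock_ctl03_hyNumber">1234</a></td>
<td><span id="ctl00_ContentPlaceHolder1_gvStock_ctl03_lblMoney">0.50</span></td>
</tr>
</table>
"""

PREFIX = "ctl00_ContentPlaceHolder1_gvStock"


def group_by_suffix(html, prefix):
    """Map each id suffix (hyNumber, lblMoney, ...) to its list of texts."""
    tree = etree.HTML(html)
    groups = defaultdict(list)
    # Filter: keep any element whose id starts with the prefix.
    for el in tree.xpath("//*[starts-with(@id, $prefix)]", prefix=prefix):
        # Classify: the part after the last underscore names the field.
        suffix = el.get("id").rsplit("_", 1)[-1]
        groups[suffix].append("".join(el.itertext()).strip())
    return dict(groups)


fields = group_by_suffix(SAMPLE, PREFIX)
print(fields["hyNumber"])
print(fields["lblMoney"])
```

Splitting the id in Python keeps the XPath expression simple; the values in each list stay in document order, so the entries at the same position belong to the same table row.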
※ Edited: u9211008 from: 59.124.199.162 (06/13 13:04)
→ 06/13 13:16, 4F
→ 06/13 17:39, 5F
→ 06/13 17:40, 6F
推 06/13 18:04, 7F
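Thanks to k's pointers I can now fetch the data for the relevant ids, although some data was occasionally missing.
After asking k, I learned that content inside child nodes may not be captured; k helped me solve this problem.
Here is the complete code: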
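# -*- coding: utf-8 -*-
from urllib.request import urlopen  # Python 3 location of urlopen

import lxml.etree

html = urlopen('http:// ').read()
tree = lxml.etree.HTML(html)

# EXSLT regular-expression namespace, bound to the "re" prefix below
regexpNS = "http://exslt.org/regular-expressions"

# Select every <a> and <span> whose id matches the gvStock pattern
hrefs = tree.xpath(
    "//a[re:match(@id, 'ctl00_ContentPlaceHolder1_gvStock(.*)')]"
    " | //span[re:match(@id, 'ctl00_ContentPlaceHolder1_gvStock(.*)')]",
    namespaces={'re': regexpNS})

for href in hrefs:
    # itertext() also collects text held in child nodes
    print("".join(href.itertext()))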
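
The child-node problem mentioned above can be reproduced in a few lines, which may help explain why the loop joins itertext() instead of reading .text. The markup here is hypothetical (the real page may nest its text differently): .text only holds the text that appears before the first child element, while itertext() walks the element and all of its descendants.

```python
from lxml import etree

# Hypothetical markup: the value sits inside a child <b> element.
html = '<span id="x_lblPrice"><b>45.15</b> TWD</span>'
el = etree.HTML(html).xpath("//span")[0]

# .text is the text before the first child; there is none here.
direct = el.text

# itertext() yields the text of the element and every descendant,
# including the tail text that follows each child.
full = "".join(el.itertext())

print(repr(direct))
print(repr(full))
```

Here direct is None because the span starts with its child element, while full is '45.15 TWD'.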
※ Edited: u9211008 from: 124.219.26.45 (06/17 20:18)
→ 06/18 10:29, 8F