Re: [Question] About scraping web page content (original post deleted)

Board: Python
Author: Neisseria (Neisseria)
Time: 2014/09/01 16:29

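※ Quoting jenocool ():
: However, I noticed that when a tag contains other tags, the text seems to
: get split onto separate lines?
: For example, the original title is: 台北潮肉壽喜燒肉@*~洪小玥~*－iPeen 愛評網
: but in the HTML it is
: 台北
: <b>潮肉</b>
: 壽喜燒肉@*~洪小玥~*－iPeen 愛評網
: so the scraped result becomes
: 台北
: 潮肉
: 壽喜燒肉@*~洪小玥~*－iPeen 愛評網
: Is there some way to merge them back together?

Here is one approach, although it uses Perl's Mojo::UserAgent.
You can of course try to solve this with Python modules as well; treat this as a reference.

The idea is to let the program perform the HTTP steps you would otherwise do by hand (a POST in this case),
then pull the part you want out of the DOM tree.
Whatever few HTML tags remain are simply stripped with a brute-force regex.

    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new();
    my $tx = $ua->post('http://tw.search.yahoo.com/search' =>
                       form => {p => 'linux'});

    binmode STDOUT, ":utf8";

    # $res stands for http response
    if (my $res = $tx->success) {
        # get the elements with css selector
        $res->dom('h3 > a')->each(sub {
            my ($e, $count) = @_;
            # stringify the element and strip any remaining tags
            $e =~ s{<[^>]+?>}{}g;
            print "$e\n";
        });
    }

: Also, scraping Google search results seems to cause problems.
: I looked it up and apparently there is some protection against scraping.
: Is there any way to get around it?

Google search does not allow POST or OPTIONS; only GET works.

    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new();
    # with GET, the form parameters are sent in the query string
    my $tx = $ua->get('http://www.google.com.tw/search' =>
                      form => {q => 'linux'});

    # the decoded body may contain non-ASCII characters
    binmode STDOUT, ":utf8";

    if (my $res = $tx->success) {
        # modify the steps here to fit your need
        print $res->text, "\n";
    }

I tried the relevant Python modules, such as urllib2 and httplib2,
but could not get them to work: Google flagged every attempt as a violation, so I did not dare keep trying.
I do not have much experience writing web-scraping scripts in Python, so you may need to ask someone more knowledgeable.

: I have been looking at this all day and I am still completely lost ..
: Thanks, everyone

This is not Python code, but hopefully it is of some use as a reference.

--
Happy Computing
Tips and Recipes for Unix and programming
http://cwchen123.tw/

--
※ Origin: 批踢踢實業坊(ptt.cc), From: 59.105.57.75
※ Article URL: http://www.ptt.cc/bbs/Python/M.1409588974.A.CD1.html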
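
For the merging question quoted in the post, here is a minimal Python sketch of the same idea (drop the tags, keep the text, join the pieces). It is not the Perl approach from the reply: it uses the standard library's html.parser instead of a regex, and the names TextJoiner and join_text are made up for illustration. It assumes the text fragments should be joined with no separator, which suits the Chinese title in the question but would glue English words together.

```python
from html.parser import HTMLParser


class TextJoiner(HTMLParser):
    """Collects text nodes and ignores the tags between them (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.parts.append(data)

    def text(self):
        # Trim the newlines around each fragment, then join without a separator.
        # Assumption: no separator is wanted; use " ".join(...) for English text.
        return "".join(part.strip() for part in self.parts)


def join_text(fragment):
    """Return the text of an HTML fragment with nested tags removed."""
    parser = TextJoiner()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# The title from the question, split across lines by a nested <b> tag
title_html = "台北\n<b>潮肉</b>\n壽喜燒肉@*~洪小玥~*－iPeen 愛評網"
print(join_text(title_html))
```

If the page is already being parsed with a third-party HTML library, that library's own text-extraction call is probably the simpler route; this sketch only shows that the merge itself needs nothing beyond the standard library.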

Article ID (AID): #1K19xkpH (Python)