[Question] Scraping text and images into a single file

Board: Python, Author: s4028600 (佑), Time: 2019/12/25 00:28, Score: 0 (0 pushes, 0 boos, 22 →)

22 comments, 8 participants, thread 1/1

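Everything I found by searching covers scraping only images or only text. I already know how to scrape text or images with requests and bs4, but I want the output to keep text and images interleaved, so plain txt won't work. For a format that supports mixed text and images, I plan to use Word or EPUB, but I don't know how to scrape the text and images together. With bs4 I only get the image links. I'm out of ideas...

--

※ Origin: 批踢踢實業坊 (ptt.cc), From: 125.224.161.174 (Taiwan)

※ Article URL: https://www.ptt.cc/bbs/Python/M.1577204893.A.478.html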

→ Hsins 12/25 00:59, 1F

→ Hsins 12/25 01:00, 2F

→ Hsins 12/25 01:01, 3F

→ Hsins 12/25 01:02, 4F

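It's not as if I use this feature very often... I do know a lot more than I did two years ago, but scraping is something I only started trying recently, and the answers I found all cover scraping only images or only text. As for HTML... I tried, but I can still only scrape either the images or the text, and I can't write something that scrapes both together.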

→ junwh 12/25 02:35, 5F

What is this? A lightweight markup language?

※ Edited: s4028600 (125.224.161.174 Taiwan), 12/25/2019 05:59:01

→ dennisxkimo 12/25 09:36, 6F

→ dennisxkimo 12/25 09:38, 7F

→ dennisxkimo 12/25 09:40, 8F

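    import requests
    from bs4 import BeautifulSoup
    import os

    url = 'https://ericjhang.github.io/archives/ad5450f3.html'
    html = requests.get(url).content
    # Save the raw page bytes; the with block closes the file, so no f.close() is needed
    with open('123.html', 'wb') as f:
        f.write(html)

Saved this way, the images show up as broken-image X's. What is a good way to embed them?

※ Edited: s4028600 (125.224.161.174 Taiwan), 12/25/2019 11:08:54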
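One possible reason for the X's is that the saved page still points at image paths that only resolve on the original site. A minimal sketch of one way to embed the images, assuming it is acceptable to inline them as base64 data URIs: the names `inline_images`, `save_page`, and the `fetch` callable are hypothetical, and the regular expression is a simplification that a real HTML parser such as BeautifulSoup would handle more robustly.

```python
import base64
import mimetypes
import re
from urllib.parse import urljoin

# Simplified pattern for the src attribute of an <img> tag (assumes quoted values).
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2',
                     re.IGNORECASE | re.DOTALL)


def inline_images(html, page_url, fetch):
    """Return html with each <img src> replaced by a base64 data URI.

    fetch is a hypothetical callable: absolute URL -> image bytes.
    """
    def replace(match):
        prefix, quote, src = match.groups()
        if src.startswith('data:'):
            return match.group(0)  # already embedded, leave untouched
        absolute = urljoin(page_url, src)  # resolve relative paths against the page
        mime = mimetypes.guess_type(absolute)[0] or 'image/jpeg'  # fallback is a guess
        payload = base64.b64encode(fetch(absolute)).decode('ascii')
        return f'{prefix}{quote}data:{mime};base64,{payload}{quote}'

    return IMG_SRC.sub(replace, html)


def save_page(url, out_path):
    """Download a page with requests and save a self-contained copy."""
    import requests  # imported here so inline_images works without it

    page = requests.get(url)
    result = inline_images(page.text, url, lambda u: requests.get(u).content)
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(result)
```

Calling `save_page(url, '123.html')` with the URL from the post would produce a single HTML file that opens offline. The trade-off is size: base64 makes each image roughly a third larger than the original bytes.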

→ Hsins 12/25 12:40, 9F

→ Hsins 12/25 12:41, 10F

→ Hsins 12/25 12:41, 11F

→ Hsins 12/25 12:42, 12F

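Two years ago I posted only one article, and that was about file handling; I haven't used any of this in between, okay? Besides, I'm asking precisely because I don't understand. Even when I can't follow an answer, I can still pick it apart and find more to look into. Your HTML suggestion was very useful. True, I still haven't figured it out, but it helps far more than criticizing my poor English.

※ Edited: s4028600 (175.183.44.67 Taiwan), 12/25/2019 14:14:52

https://blog.csdn.net/he_string/article/details/78574198

Following this article I can put images into a Word document, but only at the very end...

※ Edited: s4028600 (175.183.44.67 Taiwan), 12/25/2019 14:18:34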
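Images landing at the end of the Word file usually means all the text was appended first and the pictures afterwards, since python-docx adds each new paragraph or picture at the end of the document. A rough sketch of keeping the original order, assuming the page is walked once from top to bottom: `OrderedContent`, `write_docx`, and the `fetch` callable are hypothetical names, and the standard-library `HTMLParser` stands in for bs4 here only to keep the example self-contained.

```python
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class OrderedContent(HTMLParser):
    """Collect ('text', str) and ('image', url) items in document order."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.items = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1
        elif tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.items.append(('image', urljoin(self.page_url, src)))

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.items.append(('text', text))


def write_docx(items, fetch, out_path):
    """Append items to a Word file in the order given.

    Requires python-docx; fetch is a hypothetical callable: URL -> image bytes.
    """
    from docx import Document
    from docx.shared import Inches

    doc = Document()
    for kind, value in items:
        if kind == 'text':
            doc.add_paragraph(value)
        else:
            # The 5-inch width is an arbitrary choice for this sketch
            doc.add_picture(io.BytesIO(fetch(value)), width=Inches(5))
    doc.save(out_path)
```

Feeding the page text to `OrderedContent(url).feed(...)` and passing its `items` to `write_docx` with something like `lambda u: requests.get(u).content` should interleave paragraphs and pictures as they appear on the page. Image formats that python-docx cannot read would still need converting first.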

→ kobe8112 12/25 16:54, 13F

→ s4028600 12/25 19:00, 14F

→ dennisxkimo 12/25 19:29, 15F

→ dennisxkimo 12/25 19:30, 16F

→ dennisxkimo 12/25 19:31, 17F

→ dennisxkimo 12/25 19:31, 18F

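I did think of that, and I tried it. The main obstacle was that I couldn't batch-rewrite the links inside the HTML, which made merging hard. I have just found a way, though it is still crude.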

→ s860134 12/25 19:46, 19F

True, I'm still at the copy-and-paste level. But where I used to only run code as-is, I can now modify it myself, even if progress is slow...

→ vi000246 12/26 00:27, 20F

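All right. After studying the EPUB structure, I agree I was trying to get there in a single leap, so I'll follow Hsins's suggestion and start from HTML. I admit I'm not good at starting from small sub-problems; I tend to start from the big problem and take it one step at a time... I can download the images now.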

→ vi000246 12/26 00:28, 21F

Do you mean the formatting before print? And why didn't my app show any of your comments... Anyway, I found a workable approach, though it relies heavily on calibre:

    import zipfile
    import requests
    from bs4 import BeautifulSoup

    url = ''
    res = requests.get(url)
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text, 'html.parser')
    content = soup.select('#contentmain')[0]

    with zipfile.ZipFile('test.zip', mode='w', compression=zipfile.ZIP_DEFLATED) as out_zip:
        # Download each image into the zip and point its <img> tag at the local copy
        # (assumes the src values are absolute URLs)
        for a, img in enumerate(content.select('img'), start=1):
            filename = 'images/%02d.jpg' % a
            print(img['src'])
            out_zip.writestr(filename, requests.get(img['src']).content)
            img['src'] = filename
        # Serialize after the tags are updated, so no string replacement is needed
        html = content.prettify()
        print(html)
        out_zip.writestr('index.html', html)

Then I convert the zip with calibre.

※ Edited: s4028600 (36.232.106.188 Taiwan), 12/26/2019 13:47:45

→ jiyu520 12/26 14:57, 22F


Article ID (AID): #1U0ZoTHu (Python)