[Question] Saving output to a file and looping over files

Board: Python  Author: ibgvdlbj (:))  Time: 6 years ago (2019/08/18 14:54)  Push/boo: 1 (1 push, 0 boo, 3 →)

4 comments, 4 participants, thread 1/1

Hi everyone, I'm back with another question.

I want to use Python to recognize the text in PDF files so that I can run keyword searches on them, i.e. optical character recognition (OCR).

Yesterday a friend told me that pytesseract can only recognize images, not PDF files, so I manually saved one of the PDFs as an image for testing and wrote some code. It successfully prints the recognized text in cmd.

Now I'm wondering whether I can save the output as a text file (will the formatting get messed up?), and then have the program pick up the next PDF in the folder, convert it to an image by itself, and run again.

If that is possible, how should I write it? Thanks in advance ^^"

Code below:

from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)

# show the output images
cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)

--
※ Origin: PTT (ptt.cc), From: 99.241.153.151 (Canada)
※ Article URL: https://www.ptt.cc/bbs/Python/M.1566111292.A.D7C.html
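One possible way to do what the post asks (loop over every PDF in a folder, turn each one into page images, OCR them, and save the text) is sketched below. This is only a minimal sketch, not something taken from the thread: the helper names `ocr_pdf_folder` and `main` are made up for illustration, and the use of the third-party `pdf2image` package (which needs Poppler installed) to render PDF pages is an assumption. The PDF renderer and the OCR function are passed in as arguments so that either one can be swapped out. Plain text output keeps the line breaks that Tesseract returns, but fonts and page layout are not preserved.

```python
from pathlib import Path


def ocr_pdf_folder(folder, pdf_to_images, image_to_text):
    """OCR every *.pdf in `folder` and write the text to a .txt file next to it.

    pdf_to_images: callable taking a PDF path and returning a list of page images.
    image_to_text: callable taking one page image and returning its text.
    Both callables are supplied by the caller (hypothetical plug-in points).
    """
    written = []
    # sorted() gives a stable, predictable processing order
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        pages = pdf_to_images(pdf_path)
        # OCR each page and join the page texts with a newline
        text = "\n".join(image_to_text(page) for page in pages)
        out_path = pdf_path.with_suffix(".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written


def main(folder):
    # Assumption: pdf2image and pytesseract are installed, together with
    # the Poppler and Tesseract programs they rely on.
    from pdf2image import convert_from_path
    import pytesseract

    for out_path in ocr_pdf_folder(folder, convert_from_path,
                                   pytesseract.image_to_string):
        print("wrote", out_path)
```

Calling `main("pdfs")` (where `pdfs` is a placeholder folder name) would leave one `.txt` file beside each PDF, which can then be searched for keywords with ordinary string operations. The thresholding or blur step from the original script could be applied to each page inside a custom `image_to_text` function.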

Push

eamansf96xs

08/18 19:28, 6 years ago, 1F


→

mirror0227

08/18 20:33, 6 years ago, 2F


→

s860134

08/18 23:32, 6 years ago, 3F
