Extract images and text from PDF files

2025-04-12 11:37:09 +08:00 · 2023-07-29 15:01:50 +08:00 · 2023-07-29 15:01:50 +08:00 · 1b246234b5
commit 1b246234b5
parent c1a1f73e7b
2 changed files with 25 additions and 1 deletion
--- a/pdf处理/README.md
+++ b/pdf处理/README.md
@@ -0,0 +1,24 @@
+# Extract images and text from PDF files
+
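+## Extracting images with pdfimages
+
+pdfimages is part of the Poppler utilities and extracts the images embedded in a PDF file. To use pdfimages, install the Poppler utilities first.
+
+1. Install the Poppler utilities on Ubuntu or Debian: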
+```shell
+sudo apt-get update
+sudo apt-get install poppler-utils
+```
+
+2. Extract all images and save them as PNG files:
+```shell
+pdfimages -png ./file_test/1.pdf ./target_img/output_image
+```
+
+## Extracting text with pdfplumber
+```python
+import pdfplumber
+with pdfplumber.open(pdf_path) as pdf:  # pdf_path: path to the input PDF
+    page = pdf.pages[0]
+    text = page.extract_text()  # extract the text of the first page
+```
--- a/pdf处理/program/image_text_ocr.py
+++ b/pdf处理/program/image_text_ocr.py
@@ -61,5 +61,5 @@ if __name__ == '__main__':
        '签发日期': '',
        '标志': ''
    }
-    res = image_text_ocr.run(text_dict, '../target_img/image_2.jpg')
+    res = image_text_ocr.run(text_dict, '../target_img/output_image-003.png')
    print(res)
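
The two README steps and the path change in `image_text_ocr.py` fit together: pdfimages names its outputs `<root>-NNN.<ext>`, numbering from 000, which is where `output_image-003.png` comes from. The following is a minimal sketch of driving both steps from Python, not part of this commit. The helper names (`pdfimages_command`, `expected_image_name`, `extract_images`, `extract_text`) are hypothetical, and it assumes `poppler-utils` and `pdfplumber` are installed.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """Build the README's command line: pdfimages -png <pdf> <root>."""
    return ["pdfimages", "-png", str(pdf_path), str(image_root)]


def expected_image_name(image_root, index):
    """pdfimages names its outputs <root>-NNN.<ext>, numbering from 000."""
    return f"{image_root}-{index:03d}.png"


def extract_images(pdf_path, image_root):
    """Run pdfimages and return the PNG files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install poppler-utils first")
    root = Path(image_root)
    # pdfimages does not create the target directory itself
    root.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, root), check=True)
    return sorted(root.parent.glob(f"{root.name}-*.png"))


def extract_text(pdf_path):
    """Return the text of every page, one string per page."""
    # Third-party dependency, imported lazily so the helpers above
    # remain usable when pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # Pages without a text layer may yield None or an empty string
        return [page.extract_text() or "" for page in pdf.pages]
```

Calling `extract_images("./file_test/1.pdf", "./target_img/output_image")` would run the same command as the README. `expected_image_name("../target_img/output_image", 3)` reproduces the path now passed to `image_text_ocr.run`.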