Comparing Python PDF extractors for PDF-to-TXT conversion

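This post compares three Python PDF extractors: PyPDF2, pdfplumber, and PyMuPDF. All three can extract the content of a PDF file and convert it to a TXT file, which can then be used for AI analysis and further script processing.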

Prerequisites

First, install the three PDF extractors.

pip install PyPDF2 pdfplumber PyMuPDF

Then prepare a PDF paper [1] and a short Python extraction script, as follows:

import PyPDF2
import pdfplumber
import fitz  # PyMuPDF
import argparse

def extract_with_pypdf2(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += (page.extract_text() or "") + "\n"
    return text

def extract_with_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += (page.extract_text() or "") + "\n"
    return text

def extract_with_pymupdf(pdf_path):
    text = ""
    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text() + "\n"
    return text

def extract_pdf_text(pdf_path, method='pypdf2'):
    methods = {
        'pypdf2': extract_with_pypdf2,
        'pdfplumber': extract_with_pdfplumber,
        'pymupdf': extract_with_pymupdf
    }
    if method not in methods:
        raise ValueError(f"Method must be one of: {', '.join(methods.keys())}")
    return methods[method](pdf_path)

def save_text_to_file(text, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)

def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description='Extract text from PDF files')
    parser.add_argument('-i', '--input', required=True, help='Input PDF file path')
    parser.add_argument('-o', '--output', default='output.txt', help='Output text file path')
    parser.add_argument('-m', '--method',
                       choices=['pypdf2', 'pdfplumber', 'pymupdf'],
                       default='pymupdf',
                       help='Method to use for extraction (default: pymupdf)')

    args = parser.parse_args()

    try:
        text = extract_pdf_text(args.input, args.method)

        save_text_to_file(text, args.output)

        print(f"Successfully extracted text from {args.input} to {args.output} using {args.method}")
        print(f"Extracted {len(text)} characters")

    except Exception as e:
        # Exit with a non-zero status so calling scripts can detect the failure
        raise SystemExit(f"Error: {e}")

if __name__ == "__main__":
    main()

Run the following commands to generate the text produced by each of the three methods: pypdf2, pdfplumber, and pymupdf.

python pdf-text-extraction.py -i xxx.pdf -m pypdf2 -o pypdf.txt
python pdf-text-extraction.py -i xxx.pdf -m pdfplumber -o pdfplumber.txt
python pdf-text-extraction.py -i xxx.pdf -m pymupdf -o pymupdf.txt
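
Before reading the three outputs by eye, a quick numeric summary can help spot large differences. The following is a minimal sketch, not part of the original script: the helper names `summarize_outputs` and `similarity` are hypothetical, and the file names are assumed to match the commands above. It uses only the Python standard library.

```python
import difflib
from pathlib import Path


def summarize_outputs(paths):
    """Return {file name: (character count, non-empty line count)}."""
    summary = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        non_empty = [line for line in text.splitlines() if line.strip()]
        summary[Path(path).name] = (len(text), len(non_empty))
    return summary


def similarity(text_a, text_b):
    """Rough similarity in [0, 1], compared on whitespace-separated tokens."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    # File names assumed from the commands above; missing files are skipped
    files = ["pypdf.txt", "pdfplumber.txt", "pymupdf.txt"]
    existing = [f for f in files if Path(f).exists()]
    for name, (chars, lines) in summarize_outputs(existing).items():
        print(f"{name}: {chars} characters, {lines} non-empty lines")
```

A low token similarity between two outputs does not say which one is right, only that the extractors disagree, for example on reading order in a two-column layout. The comparison is quadratic in the worst case, so it may be slow on very long documents.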

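Comparison results

One passage from the paper was selected for comparison, containing both formulas and explanatory text.

So far, only pymupdf has correctly handled the two-column paper, with no mixed-up text and no ligature artifacts, which makes its output easy to pass to an AI for further analysis. It is the recommended method.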
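
The ligature observation can be checked roughly with a short script. This is a sketch under an assumption: it takes "ligature artifacts" to mean typographic ligature characters (such as U+FB01, the single-glyph "fi") left in the extracted text. If the problem is instead words run together without spaces, this check does not apply. The function names `count_ligatures` and `expand_ligatures` are hypothetical.

```python
import unicodedata

# Latin ligature code points U+FB00..U+FB06
# (Unicode "Alphabetic Presentation Forms" block)
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04\ufb05\ufb06"


def count_ligatures(text):
    """Count ligature characters remaining in extracted text."""
    return sum(text.count(ch) for ch in LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with their plain-letter equivalents.

    Only the characters in LIGATURES are normalized, so other symbols
    (for example in formulas) are left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "e\ufb03cient \ufb01eld"
    print(count_ligatures(sample))
    print(expand_ligatures(sample))
```

Running `count_ligatures` on each of the three output files gives a simple number to compare. `expand_ligatures` can be applied as a post-processing step to any extractor's output, whichever one is chosen.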

References
  1. S. Saarelma, J. Connor, P. Bílková, P. Bohm, C. Bowman, A. Field, L. Frassinetti, R. Friedström, S. Henderson, K. Imada and O. , Density pedestal prediction model for tokamak plasmas, Nuclear Fusion, vol. 64, no. 7, 076025, IOP Publishing, 2024.
Author: Myron
Repost link: https://phyiscs.com/best-python-pdf-extractors-pypdf2-vs-pdfplumber-vs-pymupdf.html