对比python的pdf刮削器:pypdf2、pdfplumber和pymupdf。三者都能提取pdf文件的信息并转换成txt文件,可用于ai分析和后续脚本处理。
准备条件
首先安装三个pdf刮削器。
pip install PyPDF2 pdfplumber PyMuPDF
然后准备一个pdf论文[1],和一个简短的提取python脚本如下:
import PyPDF2
import pdfplumber
import fitz # PyMuPDF
import argparse
def extract_with_pypdf2(pdf_path):
text = ""
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for page in reader.pages:
text += page.extract_text() + "\n"
return text
def extract_with_pdfplumber(pdf_path):
text = ""
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
text += page.extract_text() + "\n"
return text
def extract_with_pymupdf(pdf_path):
text = ""
doc = fitz.open(pdf_path)
for page in doc:
text += page.get_text() + "\n"
doc.close()
return text
def extract_pdf_text(pdf_path, method='pypdf2'):
methods = {
'pypdf2': extract_with_pypdf2,
'pdfplumber': extract_with_pdfplumber,
'pymupdf': extract_with_pymupdf
}
if method not in methods:
raise ValueError(f"Method must be one of: {', '.join(methods.keys())}")
return methods[method](pdf_path)
def save_text_to_file(text, output_path):
with open(output_path, 'w', encoding='utf-8') as f:
f.write(text)
def main():
# Set up argument parser
parser = argparse.ArgumentParser(description='Extract text from PDF files')
parser.add_argument('-i', '--input', required=True, help='Input PDF file path')
parser.add_argument('-o', '--output', default='output.txt', help='Output text file path')
parser.add_argument('-m', '--method',
choices=['pypdf2', 'pdfplumber', 'pymupdf'],
default='pymupdf',
help='Method to use for extraction (default: pymupdf)')
args = parser.parse_args()
try:
text = extract_pdf_text(args.input, args.method)
save_text_to_file(text, args.output)
print(f"Successfully extracted text from {args.input} to {args.output} using {args.method}")
print(f"Extracted {len(text)} characters")
except Exception as e:
print(f"Error: {str(e)}")
if __name__ == "__main__":
main()
使用下列命令来生成对应pypdf2、pdfplumber、pymupdf三种方式产生的文本。
python pdf-text-extraction.py -i xxx.pdf -m pypdf2 -o pypdf.txt
python pdf-text-extraction.py -i xxx.pdf -m pdfplumber -o pdfplumber.txt
python pdf-text-extraction.py -i xxx.pdf -m pymupdf -o pymupdf.txt
对比结果
选取其中一段,包括公式和讲解的部分。
目前只有pymupdf正确识别出来了双列的论文,并且没有混淆和连字,可以方便给AI进行进一步分析,建议选用该方法。
参考文献
- Density pedestal prediction model for tokamak plasmas, IOP Publishing, Nuclear Fusion vol. 64, 2024. no. 7, 076025. ,