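PDFMinerLoader integration

This guide provides a quick overview for getting started with the PDFMiner document loader. For detailed documentation of all PDFMinerLoader features and configurations, see the API reference.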

Overview

Integration details

Class	Package	Local	Serializable	JS support
`PDFMinerLoader`	`langchain-community`	✅	❌	❌

Loader features

Source	Lazy document loading	Native async support	Extract images	Extract tables
`PDFMinerLoader`	✅	❌	✅	✅

Setup

Credentials

No credentials are required to use PDFMinerLoader. For automated, best-in-class tracing of your model calls, you can also set your LangSmith API key with the code below:

import getpass
import os
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"
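
If the snippet above runs more than once, or in a script where the key is already exported, prompting every time can be inconvenient. The following is a minimal sketch of one way to wrap the same two assignments. The helper name `enable_langsmith_tracing` is hypothetical and not part of LangChain or LangSmith; only the two environment variable names come from this guide.

```python
import getpass
import os
from typing import Optional


def enable_langsmith_tracing(api_key: Optional[str] = None) -> None:
    """Set the two LangSmith variables, prompting only when no key is available.

    Hypothetical convenience helper, not a LangChain or LangSmith API.
    """
    # Prompt only if no key was passed in and none is already exported.
    if api_key is None and "LANGSMITH_API_KEY" not in os.environ:
        api_key = getpass.getpass("Enter your LangSmith API key: ")
    # An explicitly supplied (or freshly entered) key overrides the environment.
    if api_key is not None:
        os.environ["LANGSMITH_API_KEY"] = api_key
    os.environ["LANGSMITH_TRACING"] = "true"
```

Calling `enable_langsmith_tracing()` with no argument keeps an existing key untouched, while passing a string sets the key without any prompt.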

Installation

Install langchain-community and pdfminer.six.

pip install -qU langchain-community pdfminer.six
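
After installing, it can help to confirm that both packages are importable from the current environment. The sketch below uses only the standard library. The helper `missing_modules` and the `REQUIRED_MODULES` list are illustrative names, not part of either package. Note that the import names differ from the pip package names.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


# Import names differ from the pip package names:
# langchain-community -> langchain_community, pdfminer.six -> pdfminer.
REQUIRED_MODULES = ["langchain_community", "pdfminer"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks up top-level modules, so it does not import or initialize either package.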

Initialization

Now we can instantiate the loader object and load documents:

from langchain_community.document_loaders import PDFMinerLoader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PDFMinerLoader(file_path)

Load

docs = loader.load()
docs[0]

Document(metadata={'author': '', 'creationdate': '2021-06-22T01:27:10+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'producer': 'pdfTeX-1.40.21', 'subject': '', 'title': '', 'trapped': 'False', 'total_pages': 16, 'source': './example_data/layout-parser-paper.pdf'}, page_content='1\n2\n0\n2\n\nn\nu\nJ\n\n1\n2\n\n]\n\nV\nC\n.\ns\nc\n[\n\n2\nv\n8\n4\n3\n5\n1\n.\n3\n0\n1\n2\n:\nv\ni\nX\nr\na\n\nLayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\n\nZejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\n\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities.
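
The loader features table lists lazy document loading, which is useful when a PDF is too large to hold in memory at once. The sketch below shows one possible way to consume `loader.lazy_load()` in fixed-size batches. `iter_batches` and `count_characters_in_batches` are hypothetical helpers written for this example. The `mode="page"` argument is an assumption about the installed `langchain-community` version; without it the loader yields the whole file as a single `Document`, as in the output above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_characters_in_batches(file_path: str, batch_size: int = 10) -> int:
    """Stream a PDF through PDFMinerLoader and total the extracted characters."""
    # Imported lazily so iter_batches can be used without langchain-community.
    from langchain_community.document_loaders import PDFMinerLoader

    # Assumption: the installed version accepts mode="page" and then yields
    # one Document per page.
    loader = PDFMinerLoader(file_path, mode="page")
    total = 0
    for batch in iter_batches(loader.lazy_load(), batch_size):
        # Replace this sum with a real per-batch operation, such as indexing.
        total += sum(len(doc.page_content) for doc in batch)
    return total
```

With the example file from this guide, a call would look like `count_characters_in_batches("./example_data/layout-parser-paper.pdf")`. Only one batch of documents is held in memory at a time.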
