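This guide provides a quick overview for getting started with the PDFMiner document loader. For detailed documentation of all PDFMinerLoader features and configurations, see the API reference.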

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| --- | --- | --- | --- | --- |
| PDFMinerLoader | langchain-community | | | |

Loader features

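| Source | Document lazy loading | Native async support | Extract images | Extract tables |
| --- | --- | --- | --- | --- |
| PDFMinerLoader | | | | |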

Setup

Credentials

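No credentials are needed to use PDFMinerLoader. If you want automated, best-in-class tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:
import getpass
import os

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"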
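
In non-interactive settings such as CI jobs, `getpass` would block waiting for input. The following is a minimal sketch of an alternative, assuming the key is either passed in by the caller or already present in the environment. `enable_langsmith_tracing` is a hypothetical helper name used only for this example; it is not part of LangChain or LangSmith.

```python
import os
from typing import MutableMapping, Optional


def enable_langsmith_tracing(
    api_key: Optional[str] = None,
    env: MutableMapping[str, str] = os.environ,
) -> bool:
    """Hypothetical helper: turn tracing on only when a key is available.

    Returns True if tracing was enabled, False if no key could be found.
    """
    # Prefer an explicitly passed key, then fall back to the environment.
    key = api_key or env.get("LANGSMITH_API_KEY")
    if not key:
        return False  # leave the environment untouched
    env["LANGSMITH_API_KEY"] = key
    env["LANGSMITH_TRACING"] = "true"
    return True
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real process environment.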

Installation

Install langchain-community and pdfminer.six:
pip install -qU langchain-community pdfminer.six

Initialization

Now we can instantiate the loader object and load documents:
from langchain_community.document_loaders import PDFMinerLoader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PDFMinerLoader(file_path)
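
Loading does not have to happen all at once. LangChain loaders also provide `lazy_load()`, which yields `Document` objects one at a time, so a large input can be handled in fixed-size batches. Below is a minimal sketch of such batching. `in_batches` is a hypothetical helper written for this example, and it only pays off when the loader yields several documents (for example one per page, which depends on how the loader is configured in your installed version).

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, consuming `items` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls only the next `size` items from the iterator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# With the loader created above, usage might look like this:
# for batch in in_batches(loader.lazy_load(), 10):
#     ...  # e.g. send the batch to an index
```

Because the helper accepts any iterable, it can be checked with plain Python values before it is pointed at a real loader.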

Load

docs = loader.load()
docs[0]
Document(metadata={'author': '', 'creationdate': '2021-06-22T01:27:10+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'producer': 'pdfTeX-1.40.21', 'subject': '', 'title': '', 'trapped': 'False', 'total_pages': 16, 'source': './example_data/layout-parser-paper.pdf'}, page_content='1\n2\n0\n2\n\nn\nu\nJ\n\n1\n2\n\n]\n\nV\nC\n.\ns\nc\n[\n\n2\nv\n8\n4\n3\n5\n1\n.\n3\n0\n1\n2\n:\nv\ni\nX\nr\na\n\nLayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\n\nZejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\n\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model configurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\nefforts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities.
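
The extracted `page_content` keeps the PDF's line breaks, so words hyphenated at a line end appear split (for example `im-\nportant` in the abstract above). Here is a minimal sketch of one possible cleanup step. It is an illustrative post-processing helper, not something PDFMinerLoader does for you, and `join_hyphenated_breaks` is a hypothetical name.

```python
import re

# A word character, a hyphen at the end of a line, then a lowercase letter.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n([a-z])")


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


# Usage with the documents loaded above:
# cleaned = join_hyphenated_breaks(docs[0].page_content)
```

This heuristic also removes the hyphen from a genuine compound that happens to break at its hyphen, so treat it as a starting point rather than a complete normalizer.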
