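This notebook provides a quick overview for getting started with the WebPDFLoader document loader. For detailed documentation of all WebPDFLoader features and configurations, see the API reference.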
Integration details
Loader features
| Source | Web loader | Node envs only |
|---|---|---|
| WebPDFLoader | ✅ | ❌ |
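You can use this version of the popular PDFLoader in web environments.
By default, one document is created for each page in the PDF file. You can change this behavior by setting the splitPages option to false.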
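To make the effect of splitPages concrete, here is a minimal sketch that models it in plain TypeScript. It is not the loader's implementation: the `PageDoc` type and the `toDocuments` function are hypothetical names, and joining pages with a blank line in the single-document case is an assumption made for illustration.

```typescript
// Simplified model of the splitPages option (hypothetical, for illustration only).
interface PageDoc {
  pageContent: string;
  metadata: { loc?: { pageNumber: number }; totalPages: number };
}

function toDocuments(pages: string[], splitPages = true): PageDoc[] {
  if (splitPages) {
    // One document per page, each tagged with its 1-based page number.
    return pages.map((text, i) => ({
      pageContent: text,
      metadata: { loc: { pageNumber: i + 1 }, totalPages: pages.length },
    }));
  }
  // A single document for the whole file; the "\n\n" separator is an assumption.
  return [
    { pageContent: pages.join("\n\n"), metadata: { totalPages: pages.length } },
  ];
}

const pages = ["Cover page", "Table of contents", "Item 1. Business"];
const perPage = toDocuments(pages);
const wholeFile = toDocuments(pages, false);
console.log(perPage.length, wholeFile.length); // prints "3 1"
```

With the real loader, the corresponding switch is the splitPages option passed in the second constructor argument.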
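To access the WebPDFLoader document loader, you need to install the @langchain/community integration package along with the pdf-parse package.
If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: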
# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"
The LangChain WebPDFLoader integration lives in the @langchain/community package:
npm install @langchain/community @langchain/core pdf-parse
Instantiation
Now we can instantiate the loader object and load documents:
import fs from "fs/promises";
import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";
const nike10kPDFPath = "../../../../data/nke-10k-2023.pdf";
// Read the file into a buffer
const buffer = await fs.readFile(nike10kPDFPath);
// Create a Blob from the buffer
const nike10kPDFBlob = new Blob([buffer], { type: "application/pdf" });
const loader = new WebPDFLoader(nike10kPDFBlob, {
// required params = ...
// optional params = ...
});
const docs = await loader.load();
docs[0]
Document {
pageContent: 'Table of Contents\n' +
'UNITED STATES\n' +
'SECURITIES AND EXCHANGE COMMISSION\n' +
'Washington, D.C. 20549\n' +
'FORM 10-K\n' +
'(Mark One)\n' +
'☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
'FOR THE FISCAL YEAR ENDED MAY 31, 2023\n' +
'OR\n' +
'☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
'FOR THE TRANSITION PERIOD FROM TO .\n' +
'Commission File No. 1-10635\n' +
'NIKE, Inc.\n' +
'(Exact name of Registrant as specified in its charter)\n' +
'Oregon93-0584541\n' +
'(State or other jurisdiction of incorporation)(IRS Employer Identification No.)\n' +
'One Bowerman Drive, Beaverton, Oregon 97005-6453\n' +
'(Address of principal executive offices and zip code)\n' +
'(503) 671-6453\n' +
"(Registrant's telephone number, including area code)\n" +
'SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:\n' +
'Class B Common StockNKENew York Stock Exchange\n' +
'(Title of each class)(Trading symbol)(Name of each exchange on which registered)\n' +
'SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:\n' +
'NONE\n' +
'Indicate by check mark:YESNO\n' +
'•if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.þ ̈\n' +
'•if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. ̈þ\n' +
'•whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding\n' +
'12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the\n' +
'past 90 days.\n' +
'þ ̈\n' +
'•whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T\n' +
'(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).\n' +
'þ ̈\n' +
'•whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company or an emerging growth company. See the definitions of “large accelerated filer,”\n' +
'“accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n' +
'Large accelerated filerþAccelerated filer☐Non-accelerated filer☐Smaller reporting company☐Emerging growth company☐\n' +
'•if an emerging growth company, if the registrant has elected not to use the extended transition period for complying with any new or revised financial\n' +
'accounting standards provided pursuant to Section 13(a) of the Exchange Act.\n' +
' ̈\n' +
"•whether the registrant has filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial\n" +
'reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit\n' +
'report.\n' +
'þ\n' +
'•if securities are registered pursuant to Section 12(b) of the Act, whether the financial statements of the registrant included in the filing reflect the\n' +
'correction of an error to previously issued financial statements.\n' +
' ̈\n' +
'•whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the\n' +
"registrant's executive officers during the relevant recovery period pursuant to § 240.10D-1(b).\n" +
' ̈\n' +
'•\n' +
'whether the registrant is a shell company (as defined in Rule 12b-2 of the Act).☐þ\n' +
"As of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:\n" +
'Class A$7,831,564,572 \n' +
'Class B136,467,702,472 \n' +
'$144,299,267,044 ',
metadata: {
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
},
id: undefined
}
console.log(docs[0].metadata)
{
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}
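The loc.pageNumber and pdf.totalPages fields shown in the metadata above can be used to select documents after loading. The following is a minimal sketch of that idea: `LoadedPage` and `pagesInRange` are hypothetical names, the interface covers only the fields visible in the output above, and the sample data stands in for the result of loader.load().

```typescript
// Minimal shape of a loaded document, limited to fields seen in the output above.
interface LoadedPage {
  pageContent: string;
  metadata: {
    pdf: { totalPages: number };
    loc: { pageNumber: number };
  };
}

// Keep only the documents whose page number falls in [first, last] (inclusive).
function pagesInRange(docs: LoadedPage[], first: number, last: number): LoadedPage[] {
  return docs.filter((d) => {
    const n = d.metadata.loc.pageNumber;
    return n >= first && n <= last;
  });
}

// Stand-in data; with the real loader you would pass the loaded documents instead.
const sampleDocs: LoadedPage[] = [1, 2, 3, 4].map((n) => ({
  pageContent: `text of page ${n}`,
  metadata: { pdf: { totalPages: 107 }, loc: { pageNumber: n } },
}));

const firstTwo = pagesInRange(sampleDocs, 1, 2);
console.log(firstTwo.map((d) => d.metadata.loc.pageNumber)); // [ 1, 2 ]
```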
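Usage, custom pdfjs build
By default we use the pdfjs build bundled with pdf-parse, which is compatible with most environments, including Node.js and modern browsers. If you want to use a more recent version of pdfjs-dist, or a custom build of pdfjs-dist, you can do so by providing a custom pdfjs function that returns a promise that resolves to the PDFJS object.
In the following example we use the "legacy" build of pdfjs-dist (see the pdfjs docs), which includes several polyfills not included in the default build.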
import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";
const blob = new Blob(); // e.g. from a file input
const customBuildLoader = new WebPDFLoader(blob, {
// you may need to add `.then(m => m.default)` to the end of the import
// @lc-ts-ignore
pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
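The note about `.then(m => m.default)` exists because, depending on the bundler and module format, a dynamic import may resolve either to the library object itself or to a wrapper that holds it under `default`. A small helper can cover both cases. This is only a sketch: `unwrapDefault` is a hypothetical helper, not part of LangChain or pdfjs-dist, and the `FakePdfjs` object below is a stand-in for the real module.

```typescript
// Return mod.default when it is present, otherwise the module itself.
function unwrapDefault<T>(mod: T | { default: T }): T {
  const candidate = (mod as { default?: T }).default;
  return candidate !== undefined ? candidate : (mod as T);
}

// Stand-in for the pdfjs module object (assumption, for illustration only).
interface FakePdfjs {
  getDocument: (src: unknown) => unknown;
  version: string;
}
const fakePdfjs: FakePdfjs = { getDocument: () => ({}), version: "stub" };

// Both module shapes resolve to the same object.
const fromNamespace = unwrapDefault<FakePdfjs>(fakePdfjs);
const fromDefault = unwrapDefault<FakePdfjs>({ default: fakePdfjs });
console.log(fromNamespace === fromDefault); // true
```

Under these assumptions, the pdfjs option might be written as `() => import("pdfjs-dist/legacy/build/pdf.js").then(unwrapDefault)`, which avoids having to know in advance which shape the import produces.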
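Eliminating extra spaces
PDFs come in many varieties, which makes reading them a challenge. By default, the loader parses individual text elements and joins them with a space, but if you are seeing excessive spaces, this may not be the desired behavior. In that case, you can override the separator with an empty string like this: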
import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";
// new Blob(); e.g. from a file input
const eliminatingExtraSpacesLoader = new WebPDFLoader(new Blob(), {
parsedItemSeparator: "",
});
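To see why the separator matters, the sketch below models the joining step in isolation. It is a simplified illustration and not the loader's code: `joinItems` is a hypothetical function, its space default follows the description above, and the fragments are made-up sample data for a PDF that stores words as several small text items.

```typescript
// Join parsed text items with a configurable separator (simplified model).
function joinItems(items: string[], separator = " "): string {
  return items.join(separator);
}

// Some PDFs emit one text item per fragment rather than per word.
const fragments = ["Ann", "ual", " Re", "port"];

const withSpaces = joinItems(fragments);
const withoutSeparator = joinItems(fragments, "");
console.log(withSpaces);       // "Ann ual  Re port"
console.log(withoutSeparator); // "Annual Report"
```

When the PDF already encodes its own spacing inside the text items, as in this sample, an empty separator preserves the intended words.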
API reference
For detailed documentation of all WebPDFLoader features and configurations, see the API reference.