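Apify Actors are cloud programs designed for a wide range of web scraping, crawling, and data extraction tasks. They automate data collection from the web so that users can extract, process, and store information efficiently. Typical uses include scraping product details from e-commerce sites, monitoring price changes, and collecting search engine results. Actors integrate with Apify Datasets, so the structured data they collect can be stored, managed, and exported in formats such as JSON, CSV, or Excel for further analysis or use.
This notebook walks you through using Apify Actors with LangChain to automate web scraping and data extraction. The langchain-apify package integrates Apify's cloud-based tools with LangChain agents, enabling efficient data collection and processing for AI applications.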
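As a rough illustration of the export step mentioned above, the sketch below serializes a list of records to JSON and CSV with the Python standard library. It assumes dataset items arrive as a list of dictionaries; the sample records and the helper names `to_json` and `to_csv` are hypothetical and not part of the Apify or LangChain APIs.

```python
import csv
import io
import json

# Hypothetical records standing in for items from an Apify dataset.
items = [
    {"title": "Widget", "price": 9.99},
    {"title": "Gadget", "price": 24.5},
]


def to_json(records):
    """Serialize records as a JSON array string."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records as CSV, using the union of keys as the header."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_csv(items))
```

Records that lack a key are written with an empty cell for that column, which keeps the CSV rectangular when Actor output is not uniform.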
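Integration details
Tool features
| Returns artifact | Native async | Return data | Pricing |
|---|---|---|---|
| ❌ | ✅ | Actor output (varies by Actor) | Pay per usage, with a free tier |
This integration lives in the langchain-apify package, which can be installed with pip.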
pip install langchain-apify
Prerequisites
import os
os.environ["APIFY_TOKEN"] = "your-apify-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
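Before running the rest of the notebook, it can help to confirm that both variables are actually set. The following is a minimal sketch of such a check; the helper name `missing_env_vars` is hypothetical and only uses the standard library.

```python
import os

# Variable names taken from the setup code above.
REQUIRED_VARS = ("APIFY_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.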
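Apify uses a pay-per-usage pricing model with a free tier.
Pricing varies by Actor: some Actors are free (you pay only for platform usage), while others charge per result or per event.
Instantiation
Here we instantiate ApifyActorsTool to call the RAG Web Browser Apify Actor. This Actor provides web browsing for AI and LLM applications, similar to the web browsing feature in ChatGPT. Any Actor from Apify Store can be used this way.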
from langchain_apify import ApifyActorsTool
tool = ApifyActorsTool("apify/rag-web-browser")
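ApifyActorsTool takes a single argument, run_input, a dictionary that is passed to the Actor as its run input. The run input schema is documented in the Input section of the Actor's detail page; see the RAG Web Browser input schema.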
tool.invoke({"run_input": {"query": "what is apify?", "maxResults": 2}})
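The payload shape used in the call above can be wrapped in a small helper so that basic mistakes are caught before an Actor run starts. This is only a sketch: `build_rag_browser_input` is a hypothetical name, and the only input fields it uses are `query` and `maxResults`, as shown in the call above.

```python
def build_rag_browser_input(query, max_results=2):
    """Wrap Actor input in the {"run_input": ...} shape passed to tool.invoke."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    return {"run_input": {"query": query, "maxResults": max_results}}


payload = build_rag_browser_input("what is apify?")
# The payload can then be passed on, for example: tool.invoke(payload)
```

Validating locally is a design choice for convenience; the Actor's own input schema remains the authoritative definition of accepted fields.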
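Chaining
We can pass the tool we created to an agent. When asked to search for information, the agent calls the Apify Actor, which searches the web and retrieves the results.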
pip install langgraph langchain-openai
from langchain.messages import ToolMessage
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
model = ChatOpenAI(model="gpt-5-mini")
tools = [tool]
graph = create_agent(model, tools=tools)
inputs = {"messages": [("user", "search for what is Apify")]}
for s in graph.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    # skip tool messages
    if isinstance(message, ToolMessage):
        continue
    message.pretty_print()
================================ Human Message =================================
search for what is Apify
================================== Ai Message ==================================
Tool Calls:
apify_actor_apify_rag-web-browser (call_27mjHLzDzwa5ZaHWCMH510lm)
Call ID: call_27mjHLzDzwa5ZaHWCMH510lm
Args:
run_input: {"run_input":{"query":"Apify","maxResults":3,"outputFormats":["markdown"]}}
================================== Ai Message ==================================
Apify is a comprehensive platform for web scraping, browser automation, and data extraction. It offers a wide array of tools and services that cater to developers and businesses looking to extract data from websites efficiently and effectively. Here's an overview of Apify:
1. **Ecosystem and Tools**:
- Apify provides an ecosystem where developers can build, deploy, and publish data extraction and web automation tools called Actors.
- The platform supports various use cases such as extracting data from social media platforms, conducting automated browser-based tasks, and more.
2. **Offerings**:
- Apify offers over 10,000 ready-made scraping tools and code templates.
- Users can also build custom solutions or hire Apify's professional services for more tailored data extraction needs.
3. **Technology and Integration**:
- The platform supports integration with popular tools and services like Zapier, GitHub, Google Sheets, Pinecone, and more.
- Apify supports open-source tools and technologies such as JavaScript, Python, Puppeteer, Playwright, Selenium, and its own Crawlee library for web crawling and browser automation.
4. **Community and Learning**:
- Apify hosts a community on Discord where developers can get help and share expertise.
- It offers educational resources through the Web Scraping Academy to help users become proficient in data scraping and automation.
5. **Enterprise Solutions**:
- Apify provides enterprise-grade web data extraction solutions with high reliability, 99.95% uptime, and compliance with SOC2, GDPR, and CCPA standards.
For more information, you can visit [Apify's official website](https://apify.com/) or their [GitHub page](https://github.com/apify) which contains their code repositories and further details about their projects.
More Actor examples
Apify Store contains thousands of pre-built Actors. Here are examples of other popular Actors:
Instagram Scraper
from langchain_apify import ApifyActorsTool
instagram_tool = ApifyActorsTool("apify/instagram-scraper")
# Scrape Instagram posts
result = instagram_tool.invoke({
    "run_input": {
        "directUrls": ["https://www.instagram.com/humansofny/"],
        "resultsLimit": 10
    }
})
Google Search Results Scraper
google_search_tool = ApifyActorsTool("apify/google-search-scraper")
# Scrape Google Search results
result = google_search_tool.invoke({
    "run_input": {
        "queries": "langchain python tutorial",
        "maxPagesPerQuery": 1
    }
})
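The `queries` field above is a single string. A possible way to send several search terms is to join them into one string, as sketched below. This assumes the Actor accepts one query per line, which should be checked against the Actor's input schema; `build_google_search_input` is a hypothetical helper.

```python
def build_google_search_input(queries, max_pages_per_query=1):
    """Build a run_input payload from a list of search terms.

    Assumption: the Actor reads one query per line from the "queries" string.
    """
    cleaned = [q.strip() for q in queries if q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    return {
        "run_input": {
            "queries": "\n".join(cleaned),
            "maxPagesPerQuery": max_pages_per_query,
        }
    }


payload = build_google_search_input(
    ["langchain python tutorial", "apify actors"]
)
# The payload can then be passed on, for example: google_search_tool.invoke(payload)
```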
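Browse Apify Store to discover more Actors for your use case.
When to use Apify
Apify is a good fit when you need:
- Access to thousands of pre-built Actors for various platforms (social media, e-commerce, search engines, and more)
- Custom web scraping and automation workflows that go beyond simple search
- Scraping without managing infrastructure (the serverless platform handles scaling and maintenance)
- A flexible Actor ecosystem: run any Actor from Apify Store
API reference
For more information on how to use this integration, see the git repository or the Apify integration documentation.
Using the Apify MCP server
Not sure which Actor to use or what parameters it requires?
The Apify MCP (Model Context Protocol) server lets you discover available Actors, explore their input schemas, and understand their parameter requirements.
To use the Apify MCP server with LangChain: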
import os
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain.agents import create_agent
client = MultiServerMCPClient({
    "apify": {
        "transport": "http",
        "url": "https://mcp.apify.com",
        "headers": {
            "Authorization": f"Bearer {os.environ['APIFY_TOKEN']}",
        },
    }
})
tools = await client.get_tools()
agent = create_agent("gpt-5-mini", tools)
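The connection dictionary passed to MultiServerMCPClient above can also be built by a small function, which keeps the token out of inline literals and fails early when it is missing. This is a sketch only; `build_apify_mcp_config` is a hypothetical helper, and the keys and URL are copied from the example above.

```python
import os


def build_apify_mcp_config(token, url="https://mcp.apify.com"):
    """Return the server configuration mapping for the Apify MCP server."""
    if not token:
        raise ValueError("An Apify API token is required")
    return {
        "apify": {
            "transport": "http",
            "url": url,
            "headers": {"Authorization": f"Bearer {token}"},
        }
    }


# Falls back to a placeholder so the sketch runs without a real token.
config = build_apify_mcp_config(os.environ.get("APIFY_TOKEN", "your-apify-token"))
```

The resulting `config` would be passed as `MultiServerMCPClient(config)`. Note that the top-level `await` in the example above works in a notebook; in a regular script, the call needs to sit inside an `async` function run with `asyncio.run`.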
For more information, see the LangChain MCP documentation and the Apify MCP server.