|
@@ -3,76 +3,243 @@ identity:
|
|
|
author: Richards Tu
|
|
|
label:
|
|
|
en_US: Crawl
|
|
|
- zh_Hans: 爬取
|
|
|
+ zh_Hans: 深度爬取
|
|
|
description:
|
|
|
human:
|
|
|
- en_US: Extract data from a website by crawling through a URL.
|
|
|
- zh_Hans: 通过URL从网站中提取数据。
|
|
|
+ en_US: Recursively search through a urls subdomains, and gather the content.
|
|
|
+ zh_Hans: 递归爬取一个网址的子域名,并收集内容。
|
|
|
llm: This tool initiates a web crawl to extract data from a specified URL. It allows configuring crawler options such as including or excluding URL patterns, generating alt text for images using LLMs (paid plan required), limiting the maximum number of pages to crawl, and returning only the main content of the page. The tool can return either a list of crawled documents or a list of URLs based on the provided options.
|
|
|
parameters:
|
|
|
- name: url
|
|
|
type: string
|
|
|
required: true
|
|
|
label:
|
|
|
- en_US: URL to crawl
|
|
|
- zh_Hans: 要爬取的URL
|
|
|
+ en_US: Start URL
|
|
|
+ zh_Hans: 起始URL
|
|
|
human_description:
|
|
|
- en_US: The URL of the website to crawl and extract data from.
|
|
|
- zh_Hans: 要爬取并提取数据的网站URL。
|
|
|
+ en_US: The base URL to start crawling from.
|
|
|
+ zh_Hans: 要爬取网站的起始URL。
|
|
|
llm_description: The URL of the website that needs to be crawled. This is a required parameter.
|
|
|
form: llm
|
|
|
+ - name: wait_for_results
|
|
|
+ type: boolean
|
|
|
+ default: true
|
|
|
+ label:
|
|
|
+ en_US: Wait For Results
|
|
|
+ zh_Hans: 等待爬取结果
|
|
|
+ human_description:
|
|
|
+ en_US: If you choose not to wait, it will directly return a job ID. You can use this job ID to check the crawling results or cancel the crawling task, which is usually very useful for a large-scale crawling task.
|
|
|
+ zh_Hans: 如果选择不等待,则会直接返回一个job_id,可以通过job_id查询爬取结果或取消爬取任务,这通常对于一个大型爬取任务来说非常有用。
|
|
|
+ form: form
|
|
|
+############## Crawl Options #######################
|
|
|
- name: includes
|
|
|
type: string
|
|
|
required: false
|
|
|
label:
|
|
|
en_US: URL patterns to include
|
|
|
zh_Hans: 要包含的URL模式
|
|
|
+ placeholder:
|
|
|
+ en_US: Use commas to separate multiple tags
|
|
|
+ zh_Hans: 多个标签时使用半角逗号分隔
|
|
|
human_description:
|
|
|
- en_US: Specify URL patterns to include during the crawl. Only pages matching these patterns will be crawled, you can use ',' to separate multiple patterns.
|
|
|
- zh_Hans: 指定爬取过程中要包含的URL模式。只有与这些模式匹配的页面才会被爬取。
|
|
|
+ en_US: |
|
|
|
+ Only pages matching these patterns will be crawled. Example: blog/*, about/*
|
|
|
+ zh_Hans: 只有与这些模式匹配的页面才会被爬取。示例:blog/*, about/*
|
|
|
form: form
|
|
|
- default: ''
|
|
|
- name: excludes
|
|
|
type: string
|
|
|
- required: false
|
|
|
label:
|
|
|
en_US: URL patterns to exclude
|
|
|
zh_Hans: 要排除的URL模式
|
|
|
+ placeholder:
|
|
|
+ en_US: Use commas to separate multiple tags
|
|
|
+ zh_Hans: 多个标签时使用半角逗号分隔
|
|
|
+ human_description:
|
|
|
+ en_US: |
|
|
|
+ Pages matching these patterns will be skipped. Example: blog/*, about/*
|
|
|
+ zh_Hans: 匹配这些模式的页面将被跳过。示例:blog/*, about/*
|
|
|
+ form: form
|
|
|
+ - name: returnOnlyUrls
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: return Only Urls
|
|
|
+ zh_Hans: 仅返回URL
|
|
|
+ human_description:
|
|
|
+ en_US: |
|
|
|
+ If true, returns only the URLs as a list on the crawl status. Attention: the return response will be a list of URLs inside the data, not a list of documents.
|
|
|
+ zh_Hans: 只返回爬取到的网页链接,而不是网页内容本身。
|
|
|
+ form: form
|
|
|
+ - name: maxDepth
|
|
|
+ type: number
|
|
|
+ label:
|
|
|
+ en_US: Maximum crawl depth
|
|
|
+ zh_Hans: 爬取深度
|
|
|
+ human_description:
|
|
|
+ en_US: Maximum depth to crawl relative to the entered URL. A maxDepth of 0 scrapes only the entered URL. A maxDepth of 1 scrapes the entered URL and all pages one level deep. A maxDepth of 2 scrapes the entered URL and all pages up to two levels deep. Higher values follow the same pattern.
|
|
|
+ zh_Hans: 相对于输入的URL,爬取的最大深度。maxDepth为0时,仅抓取输入的URL。maxDepth为1时,抓取输入的URL以及所有一级深层页面。maxDepth为2时,抓取输入的URL以及所有两级深层页面。更高值遵循相同模式。
|
|
|
+ form: form
|
|
|
+ min: 0
|
|
|
+ - name: mode
|
|
|
+ type: select
|
|
|
+ required: false
|
|
|
+ form: form
|
|
|
+ options:
|
|
|
+ - value: default
|
|
|
+ label:
|
|
|
+ en_US: default
|
|
|
+ - value: fast
|
|
|
+ label:
|
|
|
+ en_US: fast
|
|
|
+ default: default
|
|
|
+ label:
|
|
|
+ en_US: Crawl Mode
|
|
|
+ zh_Hans: 爬取模式
|
|
|
human_description:
|
|
|
- en_US: Specify URL patterns to exclude during the crawl. Pages matching these patterns will be skipped, you can use ',' to separate multiple patterns.
|
|
|
- zh_Hans: 指定爬取过程中要排除的URL模式。匹配这些模式的页面将被跳过。
|
|
|
+ en_US: The crawling mode to use. Fast mode crawls 4x faster websites without sitemap, but may not be as accurate and shouldn't be used in heavy js-rendered websites.
|
|
|
+ zh_Hans: 使用fast模式将不会使用其站点地图,比普通模式快4倍,但是可能不够准确,也不适用于大量js渲染的网站。
|
|
|
+ - name: ignoreSitemap
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: ignore Sitemap
|
|
|
+ zh_Hans: 忽略站点地图
|
|
|
+ human_description:
|
|
|
+ en_US: Ignore the website sitemap when crawling.
|
|
|
+ zh_Hans: 爬取时忽略网站站点地图。
|
|
|
form: form
|
|
|
- default: 'blog/*'
|
|
|
- name: limit
|
|
|
type: number
|
|
|
required: false
|
|
|
label:
|
|
|
- en_US: Maximum number of pages to crawl
|
|
|
+ en_US: Maximum pages to crawl
|
|
|
zh_Hans: 最大爬取页面数
|
|
|
human_description:
|
|
|
en_US: Specify the maximum number of pages to crawl. The crawler will stop after reaching this limit.
|
|
|
zh_Hans: 指定要爬取的最大页面数。爬虫将在达到此限制后停止。
|
|
|
form: form
|
|
|
min: 1
|
|
|
- max: 20
|
|
|
default: 5
|
|
|
+ - name: allowBackwardCrawling
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: allow Backward Crawling
|
|
|
+ zh_Hans: 允许向后爬取
|
|
|
+ human_description:
|
|
|
+ en_US: Enables the crawler to navigate from a specific URL to previously linked pages. For instance, from 'example.com/product/123' back to 'example.com/product'
|
|
|
+ zh_Hans: 使爬虫能够从特定URL导航到之前链接的页面。例如,从'example.com/product/123'返回到'example.com/product'
|
|
|
+ form: form
|
|
|
+ - name: allowExternalContentLinks
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: allow External Content Links
|
|
|
+ zh_Hans: 允许爬取外链
|
|
|
+ human_description:
|
|
|
+ en_US: Allows the crawler to follow links to external websites.
|
|
|
+ zh_Hans:
|
|
|
+ form: form
|
|
|
+############## Page Options #######################
|
|
|
+ - name: headers
|
|
|
+ type: string
|
|
|
+ label:
|
|
|
+ en_US: headers
|
|
|
+ zh_Hans: 请求头
|
|
|
+ human_description:
|
|
|
+ en_US: |
|
|
|
+ Headers to send with the request. Can be used to send cookies, user-agent, etc. Example: {"cookies": "testcookies"}
|
|
|
+ zh_Hans: |
|
|
|
+ 随请求发送的头部。可以用来发送cookies、用户代理等。示例:{"cookies": "testcookies"}
|
|
|
+ placeholder:
|
|
|
+ en_US: Please enter an object that can be serialized in JSON
|
|
|
+ zh_Hans: 请输入可以json序列化的对象
|
|
|
+ form: form
|
|
|
+ - name: includeHtml
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: include Html
|
|
|
+ zh_Hans: 包含HTML
|
|
|
+ human_description:
|
|
|
+ en_US: Include the HTML version of the content on page. Will output a html key in the response.
|
|
|
+ zh_Hans: 返回中包含一个HTML版本的内容,将以html键返回。
|
|
|
+ form: form
|
|
|
+ - name: includeRawHtml
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: include Raw Html
|
|
|
+ zh_Hans: 包含原始HTML
|
|
|
+ human_description:
|
|
|
+ en_US: Include the raw HTML content of the page. Will output a rawHtml key in the response.
|
|
|
+ zh_Hans: 返回中包含一个原始HTML版本的内容,将以rawHtml键返回。
|
|
|
+ form: form
|
|
|
+ - name: onlyIncludeTags
|
|
|
+ type: string
|
|
|
+ label:
|
|
|
+ en_US: only Include Tags
|
|
|
+ zh_Hans: 仅抓取这些标签
|
|
|
+ placeholder:
|
|
|
+ en_US: Use commas to separate multiple tags
|
|
|
+ zh_Hans: 多个标签时使用半角逗号分隔
|
|
|
+ human_description:
|
|
|
+ en_US: |
|
|
|
+ Only include tags, classes and ids from the page in the final output. Use comma separated values. Example: script, .ad, #footer
|
|
|
+ zh_Hans: |
|
|
|
+ 仅在最终输出中包含HTML页面的这些标签,可以通过标签名、类或ID来设定,使用逗号分隔值。示例:script, .ad, #footer
|
|
|
+ form: form
|
|
|
- name: onlyMainContent
|
|
|
type: boolean
|
|
|
- required: false
|
|
|
+ default: false
|
|
|
label:
|
|
|
- en_US: Only return the main content of the page
|
|
|
- zh_Hans: 仅返回页面的主要内容
|
|
|
+ en_US: only Main Content
|
|
|
+ zh_Hans: 仅抓取主要内容
|
|
|
human_description:
|
|
|
- en_US: If enabled, the crawler will only return the main content of the page, excluding headers, navigation, footers, etc.
|
|
|
- zh_Hans: 如果启用,爬虫将仅返回页面的主要内容,不包括标题、导航、页脚等。
|
|
|
+ en_US: Only return the main content of the page excluding headers, navs, footers, etc.
|
|
|
+ zh_Hans: 只返回页面的主要内容,不包括头部、导航栏、尾部等。
|
|
|
+ form: form
|
|
|
+ - name: removeTags
|
|
|
+ type: string
|
|
|
+ label:
|
|
|
+ en_US: remove Tags
|
|
|
+ zh_Hans: 要移除这些标签
|
|
|
+ human_description:
|
|
|
+ en_US: |
|
|
|
+ Tags, classes and ids to remove from the page. Use comma separated values. Example: script, .ad, #footer
|
|
|
+ zh_Hans: |
|
|
|
+ 要在最终输出中移除HTML页面的这些标签,可以通过标签名、类或ID来设定,使用逗号分隔值。示例:script, .ad, #footer
|
|
|
+ placeholder:
|
|
|
+ en_US: Use commas to separate multiple tags
|
|
|
+ zh_Hans: 多个标签时使用半角逗号分隔
|
|
|
+ form: form
|
|
|
+ - name: replaceAllPathsWithAbsolutePaths
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: All AbsolutePaths
|
|
|
+ zh_Hans: 使用绝对路径
|
|
|
+ human_description:
|
|
|
+ en_US: Replace all relative paths with absolute paths for images and links.
|
|
|
+ zh_Hans: 将所有图片和链接的相对路径替换为绝对路径。
|
|
|
+ form: form
|
|
|
+ - name: screenshot
|
|
|
+ type: boolean
|
|
|
+ default: false
|
|
|
+ label:
|
|
|
+ en_US: screenshot
|
|
|
+ zh_Hans: 截图
|
|
|
+ human_description:
|
|
|
+ en_US: Include a screenshot of the top of the page that you are scraping.
|
|
|
+ zh_Hans: 提供正在抓取的页面的顶部的截图。
|
|
|
+ form: form
|
|
|
+ - name: waitFor
|
|
|
+ type: number
|
|
|
+ min: 0
|
|
|
+ label:
|
|
|
+ en_US: wait For
|
|
|
+ zh_Hans: 等待时间
|
|
|
+ human_description:
|
|
|
+ en_US: Wait x amount of milliseconds for the page to load to fetch content.
|
|
|
+ zh_Hans: 等待x毫秒以使页面加载并获取内容。
|
|
|
form: form
|
|
|
- options:
|
|
|
- - value: 'true'
|
|
|
- label:
|
|
|
- en_US: 'Yes'
|
|
|
- zh_Hans: 是
|
|
|
- - value: 'false'
|
|
|
- label:
|
|
|
- en_US: 'No'
|
|
|
- zh_Hans: 否
|
|
|
- default: 'false'
|