From 1c66d05c447bdd29b4d7abdd6ebb4ead0d331795 Mon Sep 17 00:00:00 2001
From: coderhxl
Date: Sun, 12 Mar 2023 19:55:48 +0800
Subject: [PATCH] Update: Docs

---
 README.md            | 48 ++++++++++++++++++++++++++---------
 docs/cn.md           | 59 ++++++++++++++++++++++++++++++--------------
 package.json         |  4 +--
 publish/README.md    | 48 ++++++++++++++++++++++++++---------
 publish/package.json |  5 ++--
 5 files changed, 119 insertions(+), 45 deletions(-)

diff --git a/README.md b/README.md
index ea2db72a..371eb715 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-x-crawl is a flexible nodejs crawler library. Used to crawl pages, batch network requests, and batch download file resources. Crawl data in asynchronous or synchronous mode, 3 ways to get results, and 5 ways to write requestConfig. Runs on nodejs, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages and control them, send batch network requests, and batch download file resources. It supports crawling data in asynchronous or synchronous mode. It runs on nodejs, is flexible and simple to use, and is friendly to JS/TS developers.
 
 If you feel good, you can support [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.
 
@@ -11,8 +11,8 @@ If you feel good, you can support [x-crawl repository](https://github.com/coder-
 - Crawls data in asynchronous/synchronous ways.
 - Supports Promise, Callback, and Promise + Callback: three ways of obtaining results.
 - requestConfig can be written in 5 ways.
-- The anthropomorphic request interval time.
-- In a simple configuration, you can capture pages, JSON, file resources, and so on.
+- Flexible request interval.
+- Operations such as crawling pages, batch network requests, and batch downloading of file resources can be performed with simple configuration.
 - Polling function, crawl regularly.
 - Built-in puppeteer crawls the page and uses the JSDOM library to parse the page, or you can parse it yourself.
 - Written in TypeScript, with type hints and generics.
@@ -214,38 +214,64 @@ myXCrawl.crawlPage('https://xxx.com').then(res => {
 
 #### jsdom instance
 
-Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+It is an instance object of [JSDOM](https://github.com/jsdom/jsdom). For specific usage, please refer to [jsdom](https://github.com/jsdom/jsdom).
+
+**Note:** The jsdom instance only parses the content of the [page instance](#page-instance). If you use the page instance for event operations, you may need to parse the latest page content yourself. For details, see "Parse the page by yourself" under [page instance](#page-instance).
 
 #### browser instance
 
-The browser instance is a headless browser without a UI shell. What he does is to bring **all modern network platform functions** provided by the browser rendering engine to the code.
+It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser). For specific usage, please refer to [Browser](https://pptr.dev/api/puppeteer.browser).
 
-**Purpose of calling close:** The browser instance will always be running internally, causing the file not to be terminated. Do not call [crawlPage](#crawlPage) or [page](#page) if you need to use it later. When you modify the properties of a browser instance, it will affect the browser instance inside the crawlPage API of the crawler instance, the page instance that returns the result, and the browser instance, because the browser instance is shared within the crawlPage API of the same crawler instance.
+The browser instance is a headless browser without a UI shell. What it does is bring **all modern network platform functions** provided by the browser rendering engine to the code.
 
-Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
+**Note:** The browser instance keeps an internal event loop running, so the process will not terminate on its own. If you want to stop it, you can execute browser.close() to close it. Do not call it if you still need [crawlPage](#crawlPage) or [page](#page) later, because the browser instance is shared within the crawlPage API of the same crawler instance: modifying its properties affects the browser instance inside that API, as well as the page instance and browser instance in the returned results.
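+
+For example, a minimal sketch of closing the shared browser once all crawling is finished (the URL is a placeholder):
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
+  const { browser, jsdom } = res
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+
+  // Nothing below uses crawlPage or page again,
+  // so close the shared browser instance to let the process exit
+  await browser.close()
+})
+```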
 
 #### page instance
 
+It is an instance object of [Page](https://pptr.dev/api/puppeteer.page). The instance can also perform interactive operations such as events. For specific usage, please refer to [page](https://pptr.dev/api/puppeteer.page).
+
+**Parse the page by yourself**
+
+Take the jsdom library as an example:
+
+```js
+import xCrawl from 'x-crawl'
+import { JSDOM } from 'jsdom'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://www.xxx.com').then(async (res) => {
+  const { page } = res
+
+  // Get the latest page content
+  const content = await page.content()
+
+  // Use the jsdom library to parse it yourself
+  const jsdom = new JSDOM(content)
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+})
+```
+
 **Take Screenshot**
 
 ```js
 import xCrawl from 'x-crawl'
 
-const testXCrawl = xCrawl({ timeout: 10000 })
+const myXCrawl = xCrawl({ timeout: 10000 })
 
-testXCrawl
+myXCrawl
   .crawlPage('https://xxx.com')
   .then(async (res) => {
     const { page } = res
 
+    // Get a screenshot of the rendered page
     await page.screenshot({ path: './upload/page.png' })
 
     console.log('Screen capture is complete')
   })
 ```
 
-The page instance can also perform interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
-
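+**Event interaction**
+
+A minimal sketch of interacting with the page through events (the selectors and the input text are hypothetical and need to be adapted to the target page):
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
+  const { page } = res
+
+  // Hypothetical selectors: type a keyword, then click the search button
+  await page.type('#search-input', 'x-crawl')
+  await page.click('#search-button')
+})
+```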
 
 ### Crawl interface
 
 Crawl interface data through [crawlData()](#crawlData)
 
diff --git a/docs/cn.md b/docs/cn.md
index e5a4787c..41623685 100644
--- a/docs/cn.md
+++ b/docs/cn.md
@@ -2,7 +2,7 @@
 
 [English](https://github.com/coder-hxl/x-crawl#x-crawl) | 简体中文
 
-x-crawl 是一个灵活的 nodejs 爬虫库。用来爬取页面、批量网络请求以及批量下载文件资源。异步或同步模式爬取数据，3 种获取结果的写法，有 5 种 requestConfig 的写法。跑在 nodejs 上，对 JS/TS 开发者友好。
+x-crawl 是一个灵活的 nodejs 爬虫库。可以爬取页面并控制页面、批量网络请求以及批量下载文件资源等操作。支持 异步/同步 模式爬取数据。跑在 nodejs 上，用法灵活简单，对 JS/TS 开发者友好。
 
 如果感觉不错，可以给 [x-crawl 存储库](https://github.com/coder-hxl/x-crawl) 点个 Star 支持一下。
 
@@ -11,8 +11,8 @@ x-crawl 是一个灵活的 nodejs 爬虫库。用来爬取页面、批量网络
 - 支持 异步/同步 方式爬取数据。
 - 支持 Promise、Callback 以及 Promise + Callback 这 3 种方式获取结果。
 - requestConfig 拥有 5 种写法。
-- 拟人化的请求间隔时间。
-- 只需简单的配置即可抓取页面、JSON、文件资源等等。
+- 灵活的请求间隔时间。
+- 只需简单的配置即可抓取页面、批量网络请求以及批量下载文件资源等操作。
 - 轮询功能，定时爬取。
 - 内置 puppeteer 爬取页面，并采用 jsdom 库对页面解析，也可自行解析。
 - 使用 TypeScript 编写，拥有类型提示，提供泛型。
@@ -30,9 +30,7 @@ crawlPage API 内部使用 [puppeteer](https://github.com/puppeteer/puppeteer)
 # 目录
 
 - [安装](#安装)
-
 - [示例](#示例)
-
 - [核心概念](#核心概念)
   * [创建应用](#创建应用)
     + [一个爬虫应用实例](#一个爬虫应用实例)
@@ -41,14 +39,13 @@ crawlPage API 内部使用 [puppeteer](https://github.com/puppeteer/puppeteer)
   * [爬取页面](#爬取页面)
     + [jsdom 实例](#jsdom-实例)
     + [browser 实例](#browser-实例)
-    + [page-实例](#page-实例)
+    + [page 实例](#page-实例)
   * [爬取接口](#爬取接口)
   * [爬取文件](#爬取文件)
   * [启动轮询](#启动轮询)
   * [请求间隔时间](#请求间隔时间)
   * [requestConfig 选项的多种写法](#requestConfig-选项的多种写法)
   * [获取结果的多种方式](#获取结果的多种方式)
-
 - [API](#API)
   * [xCrawl](#xCrawl)
     + [类型](#类型-1)
@@ -65,7 +62,6 @@ crawlPage API 内部使用 [puppeteer](https://github.com/puppeteer/puppeteer)
   * [startPolling](#startPolling)
     + [类型](#类型-5)
     + [示例](#示例-5)
-
 - [类型](#类型-6)
   * [AnyObject](#AnyObject)
   * [Method](#Method)
@@ -82,8 +78,7 @@ crawlPage API 内部使用 [puppeteer](https://github.com/puppeteer/puppeteer)
   * [CrawlResCommonV1](#CrawlResCommonV1)
   * [CrawlResCommonArrV1](#CrawlResCommonArrV1)
   * [FileInfo](#FileInfo)
-  * [CrawlPage](#CrawlPage)
-
+  * [CrawlPage](#CrawlPage)
 - [更多](#更多)
 
 ## 安装
@@ -113,7 +108,7 @@ const myXCrawl = xCrawl({
 myXCrawl.startPolling({ d: 1 }, () => {
   // 调用 crawlPage API 爬取 Page
   myXCrawl.crawlPage('https://www.bilibili.com/guochuang/').then((res) => {
-    const { browser, jsdom } = res // 默认使用了 JSDOM 库解析 Page
+    const { jsdom } = res // 默认使用了 JSDOM 库解析 Page
 
     // 获取轮播图片元素
     const imgEls = jsdom.window.document.querySelectorAll('.chief-recom-item img')
@@ -211,38 +206,64 @@ myXCrawl.crawlPage('https://xxx.com').then(res => {
 
 #### jsdom 实例
 
-具体使用参考 [jsdom](https://github.com/jsdom/jsdom) 。
+它是 [JSDOM](https://github.com/jsdom/jsdom) 的实例对象，具体使用可以参考 [jsdom](https://github.com/jsdom/jsdom) 。
+
+**注意：** jsdom 实例只是对 [page 实例](#page-实例) 的 content 进行了解析，如果您使用 page 实例进行了事件操作，可能需要自行解析最新的页面内容，具体操作可查看 [page 实例](#page-实例) 的自行解析页面。
 
 #### browser 实例
 
-browser 实例他是个无头浏览器，并无 UI 外壳，他做的是将浏览器渲染引擎提供的**所有现代网络平台功能**带到代码中。
+它是 [Browser](https://pptr.dev/api/puppeteer.browser) 的实例对象，具体使用可以参考 [Browser](https://pptr.dev/api/puppeteer.browser) 。
 
-**调用 close 的目的：** browser 实例内部会一直处于运行，造成文件不会终止。如果后面还需要用到 [crawlPage](#crawlPage) 或者 [page](#page) 请勿调用。当您修改 browser 实例的属性时，会对该爬虫实例 crawlPage API 内部的 browser 实例和返回结果的 page 实例以及 browser 实例造成影响，因为 browser 实例在同一个爬虫实例的 crawlPage API 内是共享的。
+browser 实例是个无头浏览器，并无 UI 外壳，它做的是将浏览器渲染引擎提供的**所有现代网络平台功能**带到代码中。
 
-具体使用参考 [browser](https://pptr.dev/api/puppeteer.browser) 。
+**注意：** browser 实例内部会一直产生事件循环，造成文件不会终止，如果想停止可以执行 browser.close() 关闭。如果后面还需要用到 [crawlPage](#crawlPage) 或者 [page](#page) 请勿调用。因为 browser 实例在同一个爬虫实例的 crawlPage API 内是共享的，当您修改 browser 实例的属性时，会对该爬虫实例 crawlPage API 内部的 browser 实例，以及返回结果中的 page 实例和 browser 实例造成影响。
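+
+例如，一个在爬取结束后关闭共享 browser 实例的简单示例（URL 为占位符）：
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
+  const { browser, jsdom } = res
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+
+  // 后续不再使用 crawlPage 或 page 时，关闭共享的 browser 实例，让进程得以退出
+  await browser.close()
+})
+```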
 
 #### page 实例
 
+它是 [Page](https://pptr.dev/api/puppeteer.page) 的实例对象，实例还可以做事件之类的交互操作，具体使用可以参考 [page](https://pptr.dev/api/puppeteer.page) 。
+
+**自行解析页面**
+
+以使用 jsdom 库为例：
+
+```js
+import xCrawl from 'x-crawl'
+import { JSDOM } from 'jsdom'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://www.xxx.com').then(async (res) => {
+  const { page } = res
+
+  // 获取最新的页面内容
+  const content = await page.content()
+
+  // 使用 jsdom 库自行解析
+  const jsdom = new JSDOM(content)
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+})
+```
+
 **获取屏幕截图**
 
 ```js
 import xCrawl from 'x-crawl'
 
-const testXCrawl = xCrawl({ timeout: 10000 })
+const myXCrawl = xCrawl({ timeout: 10000 })
 
-testXCrawl
+myXCrawl
   .crawlPage('https://xxx.com')
   .then(async (res) => {
     const { page } = res
 
+    // 获取页面渲染后的截图
     await page.screenshot({ path: './upload/page.png' })
 
     console.log('获取屏幕截图完毕')
   })
 ```
 
-page 实例还可以做事件之类的交互操作，具体使用参考 [page](https://pptr.dev/api/puppeteer.page) 。
-
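+**事件交互**
+
+一个通过事件与页面交互的简单示例（选择器与输入内容为假设值，需按目标页面调整）：
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
+  const { page } = res
+
+  // 假设的选择器：输入关键字并点击搜索按钮
+  await page.type('#search-input', 'x-crawl')
+  await page.click('#search-button')
+})
+```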
 
 ### 爬取接口
 
 通过 [crawlData()](#crawlData) 爬取接口数据
 
diff --git a/package.json b/package.json
index bdb46c0f..d1d0c34f 100644
--- a/package.json
+++ b/package.json
@@ -1,9 +1,9 @@
 {
   "private": true,
   "name": "x-crawl",
-  "version": "3.2.4",
+  "version": "3.2.5",
   "author": "coderHXL",
-  "description": "x-crawl is a flexible nodejs crawler library. ",
+  "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",
   "main": "src/index.ts",
   "scripts": {
diff --git a/publish/README.md b/publish/README.md
index ea2db72a..371eb715 100644
--- a/publish/README.md
+++ b/publish/README.md
@@ -2,7 +2,7 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-x-crawl is a flexible nodejs crawler library. Used to crawl pages, batch network requests, and batch download file resources. Crawl data in asynchronous or synchronous mode, 3 ways to get results, and 5 ways to write requestConfig. Runs on nodejs, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages and control them, send batch network requests, and batch download file resources. It supports crawling data in asynchronous or synchronous mode. It runs on nodejs, is flexible and simple to use, and is friendly to JS/TS developers.
 
 If you feel good, you can support [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.
 
@@ -11,8 +11,8 @@ If you feel good, you can support [x-crawl repository](https://github.com/coder-
 - Crawls data in asynchronous/synchronous ways.
 - Supports Promise, Callback, and Promise + Callback: three ways of obtaining results.
 - requestConfig can be written in 5 ways.
-- The anthropomorphic request interval time.
-- In a simple configuration, you can capture pages, JSON, file resources, and so on.
+- Flexible request interval.
+- Operations such as crawling pages, batch network requests, and batch downloading of file resources can be performed with simple configuration.
 - Polling function, crawl regularly.
 - Built-in puppeteer crawls the page and uses the JSDOM library to parse the page, or you can parse it yourself.
 - Written in TypeScript, with type hints and generics.
@@ -214,38 +214,64 @@ myXCrawl.crawlPage('https://xxx.com').then(res => {
 
 #### jsdom instance
 
-Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+It is an instance object of [JSDOM](https://github.com/jsdom/jsdom). For specific usage, please refer to [jsdom](https://github.com/jsdom/jsdom).
+
+**Note:** The jsdom instance only parses the content of the [page instance](#page-instance). If you use the page instance for event operations, you may need to parse the latest page content yourself. For details, see "Parse the page by yourself" under [page instance](#page-instance).
 
 #### browser instance
 
-The browser instance is a headless browser without a UI shell. What he does is to bring **all modern network platform functions** provided by the browser rendering engine to the code.
+It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser). For specific usage, please refer to [Browser](https://pptr.dev/api/puppeteer.browser).
 
-**Purpose of calling close:** The browser instance will always be running internally, causing the file not to be terminated. Do not call [crawlPage](#crawlPage) or [page](#page) if you need to use it later. When you modify the properties of a browser instance, it will affect the browser instance inside the crawlPage API of the crawler instance, the page instance that returns the result, and the browser instance, because the browser instance is shared within the crawlPage API of the same crawler instance.
+The browser instance is a headless browser without a UI shell. What it does is bring **all modern network platform functions** provided by the browser rendering engine to the code.
 
-Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
+**Note:** The browser instance keeps an internal event loop running, so the process will not terminate on its own. If you want to stop it, you can execute browser.close() to close it. Do not call it if you still need [crawlPage](#crawlPage) or [page](#page) later, because the browser instance is shared within the crawlPage API of the same crawler instance: modifying its properties affects the browser instance inside that API, as well as the page instance and browser instance in the returned results.
 
 #### page instance
 
+It is an instance object of [Page](https://pptr.dev/api/puppeteer.page). The instance can also perform interactive operations such as events. For specific usage, please refer to [page](https://pptr.dev/api/puppeteer.page).
+ +**Parse the page by yourself** + +Take the jsdom library as an example: + +```js +import xCrawl from 'x-crawl' +import { JSDOM } from 'jsdom' + +const myXCrawl = xCrawl({ timeout: 10000 }) + +myXCrawl.crawlPage('https://www.xxx.com').then(async (res) => { + const { page } = res + + // Get the latest page content + const content = await page.content() + + // Use the jsdom library to parse it yourself + const jsdom = new JSDOM(content) + + console.log(jsdom.window.document.querySelector('title').textContent) +}) +``` + **Take Screenshot** ```js import xCrawl from 'x-crawl' -const testXCrawl = xCrawl({ timeout: 10000 }) +const myXCrawl = xCrawl({ timeout: 10000 }) -testXCrawl +myXCrawl .crawlPage('https://xxx.com') .then(async (res) => { const { page } = res + // Get a screenshot of the rendered page await page.screenshot({ path: './upload/page.png' }) console.log('Screen capture is complete') }) ``` -The page instance can also perform interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page). - ### Crawl interface Crawl interface data through [crawlData()](#crawlData) diff --git a/publish/package.json b/publish/package.json index 6a1aa4c2..7a913093 100644 --- a/publish/package.json +++ b/publish/package.json @@ -1,6 +1,6 @@ { "name": "x-crawl", - "version": "3.2.4", + "version": "3.2.5", "author": "coderHXL", "description": "x-crawl is a flexible nodejs crawler library.", "license": "MIT", @@ -9,7 +9,8 @@ "typescript", "crawl", "crawler", - "spider" + "spider", + "flexible" ], "main": "dist/index.js", "types": "dist/index.d.ts",