doc: the crawler page

alikia2x (寒寒) 2025-04-06 02:51:25 +08:00
parent a8292d7b6b
commit 37fa6791d5
Signed by: alikia2x
GPG Key ID: 56209E0CCD8420C6
2 changed files with 67 additions and 10 deletions


@@ -1,4 +1,66 @@
# Crawler
Automation is at the core of CVSA's technical architecture. The `crawler` is built to orchestrate data collection tasks efficiently using a message queue system powered by [BullMQ](https://bullmq.io/). This design enables concurrent processing across multiple stages of the data collection lifecycle.
State management and data persistence are handled using a combination of Redis (for caching and real-time data) and PostgreSQL (as the primary database).
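To make the setup concrete, here is a minimal sketch of how such a queue could be wired up; the queue and job names are illustrative, not the crawler's actual identifiers:

```ts
import { Queue } from "bullmq";
import { Redis } from "ioredis";

// BullMQ persists job state in Redis, so a Redis connection is all a queue needs.
const connection = new Redis({ maxRetriesPerRequest: null });

// A hypothetical queue for video-related jobs.
const videoQueue = new Queue("video", { connection });

// Enqueue a job; a BullMQ worker elsewhere picks it up and processes it
// concurrently with jobs from other queues.
await videoQueue.add("getVideoInfo", { aid: 170001 });
```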
## `crawler/db`
This module handles all database interactions for the crawler, including creation, updates, and data retrieval.
- `init.ts`: Initializes the PostgreSQL connection pool.
- `redis.ts`: Sets up the Redis client.
- `withConnection.ts`: Exports `withDatabaseConnection`, a helper that provides a database context to any function (see the sketch after this list).
- Other files: Contain table-specific functions, with each file corresponding to a database table.
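As a rough illustration, `withDatabaseConnection` could be shaped like the following; this is a sketch under assumptions, and the real signature in `withConnection.ts` may differ:

```ts
import { Pool, PoolClient } from "pg";

// The pool created in init.ts; connection settings come from the environment.
const pool = new Pool();

// Acquire a client from the pool, pass it to the given function,
// and always release it afterwards.
export async function withDatabaseConnection<T>(
  fn: (client: PoolClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    return await fn(client);
  } finally {
    client.release();
  }
}
```

Table-specific functions can then focus on their queries, e.g. `withDatabaseConnection((client) => client.query("SELECT ..."))`, without managing connections themselves.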
## `crawler/ml`
This module handles machine learning tasks, such as content classification.
- `manager.ts`: Defines a base class `AIManager` for managing ML models.
- `akari.ts`: Implements our primary classification model, `AkariProto`, which extends `AIManager`. It filters videos to determine if they should be included as songs.
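The relationship between the two can be sketched as follows; the method names are hypothetical, and only the class names come from the codebase:

```ts
// Hypothetical base class: owns model lifecycle and shared inference helpers.
abstract class AIManager {
  protected models: Map<string, unknown> = new Map();
  abstract init(): Promise<void>;
}

// AkariProto specializes AIManager for song classification.
class AkariProto extends AIManager {
  async init(): Promise<void> {
    // load the classification model(s) here
  }

  // Decide whether a video should be included as a song.
  async isSong(title: string, description: string): Promise<boolean> {
    // run inference and threshold the score (stubbed here)
    return true;
  }
}
```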
## `crawler/mq`
This module manages task queuing and processing through BullMQ.
## `crawler/mq/exec`
Contains the functions executed by BullMQ workers. Examples include `getVideoInfoWorker` and `takeBulkSnapshotForVideosWorker`.
> **Terminology note:**
> In this documentation:
> - Functions in `crawler/mq/exec` are called **workers**.
> - Functions in `crawler/mq/workers` are called **BullMQ workers**.
**Design detail:**
Since BullMQ requires one handler per queue, we use a `switch` statement inside each BullMQ worker to route jobs based on their name to the correct function in `crawler/mq/exec`.
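For example, a BullMQ worker for a hypothetical `video` queue might dispatch jobs like this (the queue name and import path are illustrative):

```ts
import { Job, Worker } from "bullmq";
import { Redis } from "ioredis";
import { getVideoInfoWorker, takeBulkSnapshotForVideosWorker } from "mq/exec";

const connection = new Redis({ maxRetriesPerRequest: null });

// One BullMQ worker per queue; the switch routes each job by name
// to the matching worker function in crawler/mq/exec.
export const videoWorker = new Worker(
  "video",
  async (job: Job) => {
    switch (job.name) {
      case "getVideoInfo":
        return await getVideoInfoWorker(job);
      case "takeBulkSnapshotForVideos":
        return await takeBulkSnapshotForVideosWorker(job);
      default:
        throw new Error(`Unknown job name: ${job.name}`);
    }
  },
  { connection },
);
```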
## `crawler/mq/workers`
Houses the BullMQ worker functions. Each function handles jobs for a specific queue.
## `crawler/mq/task`
To keep worker functions clean and focused, reusable logic is extracted into this directory as **tasks**. These tasks are then imported and used by the worker functions.
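A hypothetical example of this split (both the function and module names are illustrative):

```ts
import { Job } from "bullmq";

// In crawler/mq/task: reusable, queue-agnostic logic.
export async function collectVideoInfo(aid: number): Promise<void> {
  // fetch metadata via crawler/net, then persist it via crawler/db
}

// In crawler/mq/exec: the worker function stays a thin wrapper around the task.
export async function getVideoInfoWorker(job: Job): Promise<void> {
  await collectVideoInfo(job.data.aid);
}
```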
## `crawler/net`
This module handles all data fetching operations. Its core component is the `NetworkDelegate`, defined in `net/delegate.ts`.
## `crawler/net/delegate.ts`
Implements robust network request handling, including:
- Rate limiting by task type and proxy
- Support for serverless functions to dynamically rotate requesting IPs
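A rough sketch of the idea behind the delegate (the real `NetworkDelegate` interface is richer, and all names below are illustrative):

```ts
type TaskType = "getVideoInfo" | "snapshotVideo";

interface ProxyEndpoint {
  url: string; // e.g. the URL of a serverless function that forwards requests
}

class NetworkDelegate {
  constructor(private proxies: ProxyEndpoint[]) {}

  async request(url: string, task: TaskType): Promise<Response> {
    // Pick a proxy whose rate limit for this task type still has headroom.
    const proxy = this.pickProxy(task);
    // Routing through a serverless function lets the requesting IP rotate.
    return fetch(`${proxy.url}?target=${encodeURIComponent(url)}`);
  }

  private pickProxy(task: TaskType): ProxyEndpoint {
    // The real logic tracks per-task, per-proxy counters; stubbed as random here.
    return this.proxies[Math.floor(Math.random() * this.proxies.length)];
  }
}
```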
## `crawler/utils`
A collection of utility functions shared across the crawler modules.
## `crawler/src`
Contains the main entry point of the crawler.
We use [concurrently](https://www.npmjs.com/package/concurrently) to run multiple scripts in parallel, so the crawler's long-running processes can be started and supervised together.
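For instance, concurrently can also be driven from its programmatic API; the script names below are hypothetical:

```ts
import { concurrently } from "concurrently";

// Start the main worker and the ML filter worker side by side;
// if either fails, kill the other so the process exits cleanly.
concurrently(
  [
    { command: "node dist/worker.js", name: "worker" },
    { command: "node dist/filterWorker.js", name: "filter" },
  ],
  { killOthers: ["failure"] },
);
```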


@@ -31,12 +31,7 @@ cvsa
**Package Breakdown:**
- **`backend`**: This package houses the server-side logic, built with the [Hono](https://hono.dev/) web framework. It's responsible for interacting with the database and exposing data through REST and GraphQL APIs for consumption by the frontend, internal applications, and third-party developers.
- **`frontend`**: The user-facing web interface of CVSA is developed using [Astro](https://astro.build/). This package handles the presentation layer, displaying information fetched from the database.
- **`crawler`**: This automated data collection system is a key component of CVSA. It's designed to automatically discover and gather new song data from bilibili, as well as track relevant statistics over time.
- **`core`**: This package contains reusable and generic code that is utilized across multiple workspaces within the CVSA monorepo.