From 37fa6791d5069555499da9314084f423b8fc1864 Mon Sep 17 00:00:00 2001
From: alikia2x
Date: Sun, 6 Apr 2025 02:51:25 +0800
Subject: [PATCH] doc: the crawler page

---
 doc/en/architecure/crawler.md  | 64 +++++++++++++++++++++++++++++++++-
 doc/en/architecure/overview.md | 13 +++----
 2 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/doc/en/architecure/crawler.md b/doc/en/architecure/crawler.md
index e60f132..0634ec3 100644
--- a/doc/en/architecure/crawler.md
+++ b/doc/en/architecure/crawler.md
@@ -1,4 +1,66 @@
 # Crawler
 
-A central aspect of CVSA's technical design is its emphasis on automation. The data collection process within the `crawler` is orchestrated using a message queue powered by [BullMQ](https://bullmq.io/). This enables concurrent processing of various tasks involved in the data lifecycle. State management and data persistence are handled by a combination of Redis for caching and real-time data, and PostgreSQL as the primary database.
+Automation is at the core of CVSA's technical architecture. The `crawler` orchestrates its data collection tasks through a message queue powered by [BullMQ](https://bullmq.io/), enabling concurrent processing across multiple stages of the data collection lifecycle.
+State management and data persistence are handled by a combination of Redis (for caching and real-time data) and PostgreSQL (as the primary database).
+
+## `crawler/db`
+
+This module handles all database interactions for the crawler, including creation, updates, and retrieval of data.
+
+- `init.ts`: Initializes the PostgreSQL connection pool.
+- `redis.ts`: Sets up the Redis client.
+- `withConnection.ts`: Exports `withDatabaseConnection`, a helper that provides a database context to any function.
+- Other files: Contain table-specific functions, with each file corresponding to a database table.
+
+## `crawler/ml`
+
+This module handles machine learning tasks, such as content classification.
+
+- `manager.ts`: Defines `AIManager`, a base class for managing ML models.
+- `akari.ts`: Implements our primary classification model, `AkariProto`, which extends `AIManager`. It filters videos to determine whether they should be included as songs.
+
+## `crawler/mq`
+
+This module manages task queuing and processing through BullMQ.
+
+## `crawler/mq/exec`
+
+Contains the functions executed by BullMQ workers, such as `getVideoInfoWorker` and `takeBulkSnapshotForVideosWorker`.
+
+> **Terminology note:**
+> In this documentation:
+> - Functions in `crawler/mq/exec` are called **workers**.
+> - Functions in `crawler/mq/workers` are called **BullMQ workers**.
+
+**Design detail:**
+Since BullMQ allows only one handler per queue, each BullMQ worker uses a `switch` statement on the job name to route incoming jobs to the correct function in `crawler/mq/exec`, as sketched below.
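+
+The following minimal sketch illustrates this routing pattern. The queue name, job names, and Redis connection details here are illustrative assumptions, not the crawler's actual configuration, and the two handlers stand in for the real functions in `crawler/mq/exec`:
+
+```typescript
+import { Job, Worker } from "bullmq";
+
+// Stand-ins for the real workers exported from crawler/mq/exec.
+async function getVideoInfoWorker(job: Job): Promise<void> {
+  // fetch and persist metadata for the video referenced by the job
+}
+async function takeBulkSnapshotForVideosWorker(job: Job): Promise<void> {
+  // record statistics for a batch of videos
+}
+
+// One BullMQ worker owns the whole queue, so it dispatches on job.name.
+const videoWorker = new Worker(
+  "video", // hypothetical queue name
+  async (job: Job) => {
+    switch (job.name) {
+      case "getVideoInfo":
+        return getVideoInfoWorker(job);
+      case "takeBulkSnapshotForVideos":
+        return takeBulkSnapshotForVideosWorker(job);
+      default:
+        throw new Error(`Unexpected job name: ${job.name}`);
+    }
+  },
+  { connection: { host: "localhost", port: 6379 } } // assumed local Redis
+);
+```
+
+Because routing happens on `job.name`, supporting a new task only requires a new `case` and a corresponding function in `crawler/mq/exec`.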
+
+## `crawler/mq/workers`
+
+Houses the BullMQ worker functions. Each function handles the jobs of a specific queue.
+
+## `crawler/mq/task`
+
+To keep the worker functions clean and focused, reusable logic is extracted into this directory as **tasks**, which the workers then import and call.
+
+## `crawler/net`
+
+This module handles all data fetching operations. Its core component is the `NetworkDelegate`, defined in `net/delegate.ts`.
+
+## `crawler/net/delegate.ts`
+
+Implements robust network request handling, including:
+
+- Rate limiting by task type and proxy
+- Support for serverless functions to dynamically rotate requesting IPs
+
+## `crawler/utils`
+
+A collection of utility functions shared across the crawler modules.
+
+## `crawler/src`
+
+Contains the main entry point of the crawler.
+
+We use [concurrently](https://www.npmjs.com/package/concurrently) to run multiple scripts in parallel, so the crawler's various processes can execute side by side.
diff --git a/doc/en/architecure/overview.md b/doc/en/architecure/overview.md
index fc694fe..cafdb28 100644
--- a/doc/en/architecure/overview.md
+++ b/doc/en/architecure/overview.md
@@ -31,12 +31,7 @@ cvsa
 
 **Package Breakdown:**
 
-* **`backend`**: This package houses the server-side logic, built with the [Hono](https://hono.dev/) web framework. It's responsible for interacting with the database and exposing data through REST and GraphQL APIs for consumption by the frontend, internal applications, and third-party developers.
-* **`frontend`**: The user-facing web interface of CVSA is developed using [Astro](https://astro.build/). This package handles the presentation layer, displaying information fetched from the database.
-* **`crawler`**: This automated data collection system is a key component of CVSA. It's designed to automatically discover and gather new song data from bilibili, as well as track relevant statistics over time.
-* **`core`**: This package contains reusable and generic code that is utilized across multiple workspaces within the CVSA monorepo.
-
-### Crawler
-
-Automation is the biggest highlight of CVSA's technical design. The data collection process within the `crawler` is orchestrated using a message queue powered by [BullMQ](https://bullmq.io/). This enables concurrent processing of various tasks involved in the data collection lifecycle. State management and data persistence are handled by a combination of Redis for caching and real-time data, and PostgreSQL as the primary database.
-
+- **`backend`**: This package houses the server-side logic, built with the [Hono](https://hono.dev/) web framework. It's responsible for interacting with the database and exposing data through REST and GraphQL APIs for consumption by the frontend, internal applications, and third-party developers.
+- **`frontend`**: The user-facing web interface of CVSA is developed using [Astro](https://astro.build/). This package handles the presentation layer, displaying information fetched from the database.
+- **`crawler`**: This automated data collection system is a key component of CVSA. It's designed to automatically discover and gather new song data from bilibili, as well as track relevant statistics over time.
+- **`core`**: This package contains reusable and generic code that is utilized across multiple workspaces within the CVSA monorepo.