merge: branch 'gitbook' of https://github.com/alikia2x/cvsa into gitbook

merge: branch 'main' into gitbook
doc: GitBook - No subject
2025-03-31 05:34:55 +08:00 · 2025-03-31 05:34:24 +08:00 · 2025-03-30 21:29:02 +00:00 · 2025-03-15 13:42:19 +00:00
13 changed files with 79 additions and 62 deletions
--- a/doc/en/SUMMARY.md
+++ b/doc/en/SUMMARY.md
@@ -9,12 +9,11 @@

 ## Architecture

- [Overview](architecure/overview.md)
- [Database Structure](architecure/database-structure/README.md)
-  - [Type of Song](architecure/database-structure/type-of-song.md)
- [Message Queue](architecure/message-queue/README.md)
-  - [VideoTagsQueue](architecure/message-queue/videotagsqueue.md)
- [Artificial Intelligence](architecure/artificial-intelligence.md)
+* [Overview](architecure/overview.md)
+* [Database Structure](architecure/database-structure/README.md)
+  * [Type of Song](architecure/database-structure/type-of-song.md)
+* [Message Queue](architecure/message-queue.md)
+* [Artificial Intelligence](architecure/artificial-intelligence.md)

 ## API Doc

--- a/doc/en/about/scope-of-inclusion.md
+++ b/doc/en/about/scope-of-inclusion.md
@@ -7,13 +7,23 @@ For a **song**, it must meet the following conditions to be included in CVSA:

 ### Category 30

-In principle, the songs featured in CVSA must be included in a video categorized under VOCALOID·UTAU (ID 30) that is
-posted on Bilibili. In some special cases, this rule may not be enforced.&#x20;
+In principle, a song must be featured in a video categorized under VOCALOID·UTAU (ID 30) on [Bilibili](https://en.wikipedia.org/wiki/Bilibili) to be picked up by our [automation program](../architecure/overview.md#crawler). Editors are welcome to manually add songs that have not been uploaded to Bilibili or are not categorized under VOCALOID·UTAU.

-### At Leats One Line of Chinese
+#### NEWS

-The lyrics of the song must contain at least one line in Chinese. This means that even if a voicebank that only supports
-Chinese is used, if the lyrics of the song do not contain Chinese, it will not be included in the CVSA.
+Recently, Bilibili appears to be retiring this sub-category. The VOCALOID·UTAU category can no longer be reached from the frontend, and producers can no longer upload videos to it (they can only choose the parent category "Music").
+
+According to our experiments, Bilibili still retains the sub-category logic in its backend: newly published songs may still land in the VOCALOID·UTAU sub-category, and the related APIs still work normally. However, there are [reports](https://www.bilibili.com/opus/1041223385394184199) that some new songs have been placed under the "Music General" sub-category.\
+We are waiting for Bilibili's follow-up actions and may adjust the crawling scope of our automation program in the future.
+
+### At Least One Line of Chinese / Chinese Virtual Singer
+
+The lyrics of the song must contain at least one line in Chinese. If they do not, the song is included in CVSA only if a Chinese virtual singer is used.
+
+We define a **Chinese virtual singer** as follows:
+
+1. The singer primarily uses a Chinese voicebank (i.e., the singer's most widely used voicebank is Chinese).
+2. The singer is operated by a company, organization, individual or group located in Mainland China, Hong Kong, Macau or Taiwan.

 ### Using Vocal Synthesizer

--- a/doc/en/architecure/artificial-intelligence.md
+++ b/doc/en/architecure/artificial-intelligence.md
@@ -9,6 +9,10 @@ The AI systems we currently use are:
 Located at `/filter/` under project root dir, it classifies a video in the
 [category 30](../about/scope-of-inclusion.md#category-30) into the following categories:

- 0: Not related to Chinese vocal synthesis
- 1: A original song with Chinese vocal synthesis
- 2: A cover/remix song with Chinese vocal synthesis
+* 0: Not related to Chinese vocal synthesis
+* 1: An original song with Chinese vocal synthesis
+* 2: A cover/remix song with Chinese vocal synthesis
+
+### The Predictor
+
+Located at `/pred/` under the project root dir, it predicts the future views of a video. It is a regression model whose input features are the video's historical view trend, other contextual information (such as the current time), and the future time point to predict; its output is the increment in the video's view count from "now" to that future time point.
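As a rough illustration of the input/output shape described for the predictor, the TypeScript sketch below assembles a feature vector from a view history, the current time, and a target time, and turns a predicted increment back into an absolute view count. All names (`ViewSnapshot`, `buildFeatures`, `applyIncrement`) and the specific features are assumptions made for illustration; they are not taken from the code in `/pred/`.

```typescript
// Hypothetical types and feature choices; not the actual code in /pred/.
interface ViewSnapshot {
  timestamp: number; // Unix time in seconds
  views: number;     // total view count at that time
}

// Build a flat feature vector: [latest views, views per hour, UTC hour of "now", horizon in hours].
function buildFeatures(history: ViewSnapshot[], now: number, target: number): number[] {
  if (history.length < 2) {
    throw new Error("need at least two snapshots to estimate a trend");
  }
  const sorted = [...history].sort((a, b) => a.timestamp - b.timestamp);
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  const elapsedHours = (last.timestamp - first.timestamp) / 3600;
  const viewsPerHour = elapsedHours > 0 ? (last.views - first.views) / elapsedHours : 0;
  const hourOfDay = Math.floor((now % 86400) / 3600); // contextual feature (UTC)
  const horizonHours = (target - now) / 3600;         // how far ahead to predict
  return [last.views, viewsPerHour, hourOfDay, horizonHours];
}

// The model outputs an increment from "now" to the target time, so the
// absolute forecast is the current view count plus that increment.
function applyIncrement(currentViews: number, predictedIncrement: number): number {
  return currentViews + predictedIncrement;
}
```

Predicting an increment rather than an absolute count keeps the regression target on a comparable scale for videos with very different totals, which is one plausible reason for the design described above.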
--- a/doc/en/architecure/database-structure/README.md
+++ b/doc/en/architecure/database-structure/README.md
@@ -5,8 +5,10 @@ CVSA uses [PostgreSQL](https://www.postgresql.org/) as our database.
 All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the
 following tables:

- songs: stores the main information of songs
- bili\_user: stores snapshots of Bilibili user information
- all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#category-30).
- labelling\_result: Contains label of videos in `all_data`tagged by our
-  [AI system](../artificial-intelligence.md#the-filter).
+* songs: stores the main information of songs
+* bili\_user: stores snapshots of Bilibili user information
+* all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#category-30).
+* labelling\_result: contains the labels of videos in `all_data` tagged by our [AI system](../artificial-intelligence.md#the-filter).
+* video\_snapshot: statistical data of videos (e.g., view counts) that is fetched regularly; we call each such fetch a "snapshot".
+* snapshot\_schedule: The scheduling information for video snapshots.
+
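To make the relationship between `video_snapshot` and `snapshot_schedule` more concrete, here is a small TypeScript sketch. The row shapes and column names (`aid`, `views`, `createdAt`, `scheduledAt`, `done`) are assumptions for illustration only; the real columns are not documented in this section.

```typescript
// Hypothetical row shapes; the real column names may differ.
interface VideoSnapshotRow {
  aid: number;       // video id
  views: number;     // view count at fetch time
  createdAt: number; // Unix seconds when the snapshot was taken
}

interface SnapshotScheduleRow {
  aid: number;
  scheduledAt: number; // Unix seconds when the snapshot should be taken
  done: boolean;       // whether the snapshot has already been taken
}

// Select schedule entries that are due at `now`, earliest first.
function dueSchedules(schedules: SnapshotScheduleRow[], now: number): SnapshotScheduleRow[] {
  return schedules
    .filter((s) => !s.done && s.scheduledAt <= now)
    .sort((a, b) => a.scheduledAt - b.scheduledAt);
}

// View growth between two snapshots of the same video.
function viewIncrement(older: VideoSnapshotRow, newer: VideoSnapshotRow): number {
  return newer.views - older.views;
}
```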
--- a/doc/en/architecure/message-queue.md
+++ b/doc/en/architecure/message-queue.md
@@ -0,0 +1,7 @@
+# Message Queue
+
+We rely on message queues to manage the various tasks that [the crawler](overview.md#crawler) needs to perform.
+
+### Code Path
+
+Currently, the code related to message queues is located in `lib/mq` and `src`.
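The idea of routing named crawler tasks through a queue can be sketched in a few lines of TypeScript. This is a plain in-memory stand-in, not BullMQ and not the code in `lib/mq`; the class `InMemoryQueue` and the job name used in the usage note are hypothetical.

```typescript
// A minimal in-memory stand-in for a job queue (NOT BullMQ): jobs are
// added under a name and dispatched to the handler registered for it.
type JobData = Record<string, unknown>;
type Handler = (data: JobData) => string;

interface QueuedJob {
  name: string;
  data: JobData;
}

class InMemoryQueue {
  private jobs: QueuedJob[] = [];
  private handlers = new Map<string, Handler>();

  // Associate a handler with a job name.
  register(name: string, handler: Handler): void {
    this.handlers.set(name, handler);
  }

  // Enqueue a job; nothing runs until drain() is called.
  add(name: string, data: JobData): void {
    this.jobs.push({ name, data });
  }

  // Process queued jobs in FIFO order and collect each handler's result.
  drain(): string[] {
    const results: string[] = [];
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      const handler = this.handlers.get(job.name);
      results.push(handler ? handler(job.data) : `no handler for ${job.name}`);
    }
    return results;
  }
}
```

For example, registering a hypothetical `snapshotVideo` handler, adding one job with `{ aid: 42 }`, and calling `drain()` would invoke that handler once. A real queue additionally provides persistence, retries, and concurrent workers, which this sketch leaves out.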
--- a/doc/en/architecure/message-queue/README.md
+++ b/doc/en/architecure/message-queue/README.md
@@ -1 +0,0 @@
-# Message Queue
--- a/doc/en/architecure/message-queue/videotagsqueue.md
+++ b/doc/en/architecure/message-queue/videotagsqueue.md
@@ -1,12 +0,0 @@
-# VideoTagsQueue
-
-### Jobs
-
-The VideoTagsQueue contains two jobs: `getVideoTags`and `getVideosTags`. The former is used to fetch the tags of a
-video, and the latter is responsible for scheduling the former.
-
-### Return value
-
-The return values across two jobs follows the following table:
-
-<table><thead><tr><th width="168">Return Value</th><th>Description</th></tr></thead><tbody><tr><td>0</td><td>In <code>getVideoTags</code>: the tags was successfully fetched<br>In <code>getVideosTags</code>: all null-tags videos have a corresponding job successfully queued.</td></tr><tr><td>1</td><td>Used in <code>getVideoTags</code>: occured <code>fetch</code>error during the job</td></tr><tr><td>2</td><td>Used in <code>getVideoTags</code>: we've reached the rate limit set in NetScheduler</td></tr><tr><td>3</td><td>Used in <code>getVideoTags</code>: did't provide aid in the job data</td></tr><tr><td>4</td><td>Used in<code>getVideosTags</code>: There's no video with NULL as `tags`</td></tr><tr><td>1xx</td><td>Used in<code>getVideosTags</code>:  the number of tasks in the queue has exceeded the limit, thus <code>getVideosTags</code> stops adding tasks. <code>xx</code> is the number of jobs added to the queue during execution.</td></tr></tbody></table>
--- a/doc/en/architecure/overview.md
+++ b/doc/en/architecure/overview.md
@@ -1,5 +1,4 @@
 ---
-icon: globe-pointer
 layout:
  title:
    visible: true
@@ -15,5 +14,14 @@ layout:

 # Overview

-Automation is the biggest highlight of CVSA's technical design. To achieve this, we use a message queue powered by
-[BullMQ](https://bullmq.io/) to concurrently process various tasks in the data collection life cycle.
+The whole CVSA system can be separated into three parts:
+
+* Frontend
+* API
+* Crawler
+
+The frontend is built with [Astro](https://astro.build/) and displays the final CVSA pages. The API is built with [Hono](https://hono.dev) and queries the database to provide REST/GraphQL APIs that can be called by our website, applications, or third parties. The crawler is our automatic data collector; it collects new songs from Bilibili, tracks their statistics, and so on.
+
+### Crawler
+
+Automation is the biggest highlight of CVSA's technical design. To achieve this, we use a message queue powered by [BullMQ](https://bullmq.io/) to concurrently process various tasks in the data collection life cycle.
--- a/doc/zh/SUMMARY.md
+++ b/doc/zh/SUMMARY.md
@@ -9,12 +9,12 @@

 ## Technical Architecture <a href="#architecture" id="architecture"></a>

- [Overview](architecture/overview.md)
- [Database Structure](architecture/database-structure/README.md)
-  - [Type of Song](architecture/database-structure/type-of-song.md)
- [Artificial Intelligence](architecture/artificial-intelligence.md)
- [Message Queue](architecture/message-queue/README.md)
-  - [VideoTagsQueue](architecture/message-queue/video-tags-queue.md)
+* [Overview](architecture/overview.md)
+* [Database Structure](architecture/database-structure/README.md)
+  * [Type of Song](architecture/database-structure/type-of-song.md)
+* [Artificial Intelligence](architecture/artificial-intelligence.md)
+* [Message Queue](architecture/message-queue/README.md)
+  * [LatestVideosQueue](architecture/message-queue/latestvideosqueue-dui-lie.md)

 ## API Doc <a href="#api-doc" id="api-doc"></a>

--- a/doc/zh/architecture/database-structure/README.md
+++ b/doc/zh/architecture/database-structure/README.md
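@@ -4,7 +4,11 @@ CVSA uses [PostgreSQL](https://www.postgresql.org/) as its database.

 All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the following tables:

- songs: stores the main information of songs
- bili\_user: stores snapshots of Bilibili user information
- all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#vocaloiduatu-fen-qu).
- labelling\_result: contains the labels of videos in `all_data` tagged by our AI system.
+* songs: stores the main information of songs
+* bilibili\_user: stores snapshots of Bilibili user information
+* bilibili\_metadata: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#vocaloiduatu-fen-qu)
+* labelling\_result: contains the labels of videos in `all_data` tagged by our AI system.
+* latest\_video\_snapshot: stores the latest snapshot of each video
+* video\_snapshot: stores video snapshots, including a video's statistics at a specific time (view count, like count, etc.)
+* snapshot\_schedule: scheduling information for video snapshots; an auxiliary table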
+
--- a/doc/zh/architecture/message-queue/latestvideosqueue-dui-lie.md
+++ b/doc/zh/architecture/message-queue/latestvideosqueue-dui-lie.md
@@ -0,0 +1,2 @@
+# LatestVideosQueue
+
--- a/doc/zh/architecture/message-queue/video-tags-queue.md
+++ b/doc/zh/architecture/message-queue/video-tags-queue.md
@@ -1,15 +0,0 @@
---
-description: Information about the VideoTagsQueue.
---
-
-# VideoTagsQueue
-
-### Jobs
-
-The VideoTagsQueue contains two jobs: `getVideoTags` and `getVideosTags`. The former fetches the tags of a video, and the latter schedules the former.
-
-### Return Value
-
-The return values of the two jobs follow the table below:
-
-<table><thead><tr><th width="168">Return Value</th><th>Description</th></tr></thead><tbody><tr><td>0</td><td>In <code>getVideoTags</code>: the tags were fetched successfully<br>In <code>getVideosTags</code>: the corresponding jobs for all videos without tags were queued successfully.</td></tr><tr><td>1</td><td>In <code>getVideoTags</code>: a <code>fetch</code> error occurred during the job</td></tr><tr><td>2</td><td>In <code>getVideoTags</code>: the rate limit set in NetScheduler was reached</td></tr><tr><td>3</td><td>In <code>getVideoTags</code>: no aid was provided in the job data</td></tr><tr><td>4</td><td>In <code>getVideosTags</code>: no video has NULL as its `tags`</td></tr><tr><td>1xx</td><td>In <code>getVideosTags</code>: the number of jobs in the queue exceeded the limit, so <code>getVideosTags</code> stopped adding jobs. <code>xx</code> is the number of jobs added to the queue during execution.</td></tr></tbody></table>
--- a/doc/zh/architecture/overview.md
+++ b/doc/zh/architecture/overview.md
@@ -1,5 +1,4 @@
 ---
-icon: globe-pointer
 layout:
  title:
    visible: true
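@@ -15,4 +14,14 @@ layout:

 # Overview

-Automation is the biggest highlight of CVSA's technical design. To achieve it, we use a message queue powered by BullMQ to concurrently process the various tasks in the data collection life cycle.
+The whole CVSA project is divided into three components: **crawler**, **frontend**, and **backend**.
+
+### **crawler**
+
+Located under `packages/crawler` in the project directory, it is responsible for the following:
+
+* Fetching new videos and including their songs in CVSA
+* Continuously monitoring video statistics such as view counts
+
+The whole crawler is driven by BullMQ message queues and uses Redis and PostgreSQL to manage state.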
+
Author	SHA1	Message	Date
alikia2x	28772fcd9f	merge: branch 'gitbook' of https://github.com/alikia2x/cvsa into gitbook	2025-03-31 05:34:55 +08:00
alikia2x	4d2b002264	merge: branch 'main' into gitbook	2025-03-31 05:34:24 +08:00
alikia2x	834f81eff0	doc: GitBook - No subject	2025-03-30 21:29:02 +00:00
alikia2x	35b84787ad	doc: GitBook - No subject	2025-03-15 13:42:19 +00:00