From 35b84787adcea98941ace4fc236a588cd6b2c25c Mon Sep 17 00:00:00 2001
From: alikia2x
Date: Sat, 15 Mar 2025 13:42:19 +0000
Subject: [PATCH] doc: GitBook - No subject

---
 doc/en/SUMMARY.md | 3 +--
 doc/en/about/scope-of-inclusion.md | 18 +++++++++++++++---
 doc/en/architecure/artificial-intelligence.md | 4 ++++
 .../architecure/database-structure/README.md | 2 ++
 doc/en/architecure/message-queue.md | 7 +++++++
 doc/en/architecure/message-queue/README.md | 2 --
 .../message-queue/videotagsqueue.md | 11 -----------
 doc/en/architecure/overview.md | 11 ++++++++++-
 8 files changed, 39 insertions(+), 19 deletions(-)
 create mode 100644 doc/en/architecure/message-queue.md
 delete mode 100644 doc/en/architecure/message-queue/README.md
 delete mode 100644 doc/en/architecure/message-queue/videotagsqueue.md

diff --git a/doc/en/SUMMARY.md b/doc/en/SUMMARY.md
index 5137229..536f1f7 100644
--- a/doc/en/SUMMARY.md
+++ b/doc/en/SUMMARY.md
@@ -12,8 +12,7 @@
 * [Overview](architecure/overview.md)
 * [Database Structure](architecure/database-structure/README.md)
   * [Type of Song](architecure/database-structure/type-of-song.md)
-* [Message Queue](architecure/message-queue/README.md)
-  * [VideoTagsQueue](architecure/message-queue/videotagsqueue.md)
+* [Message Queue](architecure/message-queue.md)
 * [Artificial Intelligence](architecure/artificial-intelligence.md)
 
 ## API Doc
diff --git a/doc/en/about/scope-of-inclusion.md b/doc/en/about/scope-of-inclusion.md
index d893e33..214b141 100644
--- a/doc/en/about/scope-of-inclusion.md
+++ b/doc/en/about/scope-of-inclusion.md
@@ -6,11 +6,23 @@ For a **song**, it must meet the following conditions to be included in CVSA:
 
 ### Category 30
 
-In principle, the songs featured in CVSA must be included in a video categorized under VOCALOID·UTAU (ID 30) that is posted on Bilibili. In some special cases, this rule may not be enforced.
+In principle, a song must be featured in a video categorized under the VOCALOID·UTAU (ID 30) category on [Bilibili](https://en.wikipedia.org/wiki/Bilibili) in order to be observed by our [automation program](../architecure/overview.md#crawler). Editors are welcome to manually add songs that have not been uploaded to Bilibili or are not categorized under this category.
 
-### At Leats One Line of Chinese
+#### NEWS
 
-The lyrics of the song must contain at least one line in Chinese. This means that even if a voicebank that only supports Chinese is used, if the lyrics of the song do not contain Chinese, it will not be included in the CVSA.
+Recently, Bilibili appears to be taking this sub-category offline. This means the VOCALOID·UTAU category can no longer be entered from the frontend, and producers can no longer upload videos to this category (instead, they can only choose the parent category "Music").
+
+According to our experiments, Bilibili still retains the sub-category logic in the backend: newly published songs may still be placed in the VOCALOID·UTAU sub-category, and the related APIs still work normally. However, there are [reports](https://www.bilibili.com/opus/1041223385394184199) that some new songs have been placed under the "Music General" sub-category.\
+We are still waiting for Bilibili's follow-up actions, and we may adjust the crawling scope of our automation program in the future.
+
+### At Least One Line of Chinese / Chinese Virtual Singer
+
+The lyrics of the song must contain at least one line in Chinese. If the lyrics do not contain any Chinese, the song will be included in CVSA only if a Chinese virtual singer is used.
+
+We define a **Chinese virtual singer** as follows:
+
+1. The singer primarily uses a Chinese voicebank (i.e., the singer's most widely used voicebank is Chinese).
+2. The singer is operated by a company, organization, individual or group located in Mainland China, Hong Kong, Macau or Taiwan.
 
 ### Using Vocal Synthesizer
 
diff --git a/doc/en/architecure/artificial-intelligence.md b/doc/en/architecure/artificial-intelligence.md
index 849cb27..1d560d9 100644
--- a/doc/en/architecure/artificial-intelligence.md
+++ b/doc/en/architecure/artificial-intelligence.md
@@ -11,3 +11,7 @@ Located at `/filter/` under project root dir, it classifies a video in the [cate
 * 0: Not related to Chinese vocal synthesis
 * 1: A original song with Chinese vocal synthesis
 * 2: A cover/remix song with Chinese vocal synthesis
+
+### The Predictor
+
+Located at `/pred/` under the project root dir, it predicts the future views of a video. This is a regression model whose feature inputs are the historical view trend of a video, other contextual information (such as the current time), and the future time point to predict. It outputs the increment in the video's view count from "now" to the specified future time point.
diff --git a/doc/en/architecure/database-structure/README.md b/doc/en/architecure/database-structure/README.md
index 96704b7..f9f738e 100644
--- a/doc/en/architecure/database-structure/README.md
+++ b/doc/en/architecure/database-structure/README.md
@@ -8,4 +8,6 @@ All public data of CVSA (excluding users' personal data) is stored in a database
 * bili\_user: stores snapshots of Bilibili user information
 * all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#category-30).
 * labelling\_result: Contains label of videos in `all_data`tagged by our [AI system](../artificial-intelligence.md#the-filter).
+* video\_snapshot: Statistical data of videos (e.g., number of views) that is fetched regularly. We call this fetch process a "snapshot".
+* snapshot\_schedule: The scheduling information for video snapshots.
 
diff --git a/doc/en/architecure/message-queue.md b/doc/en/architecure/message-queue.md
new file mode 100644
index 0000000..4fa4877
--- /dev/null
+++ b/doc/en/architecure/message-queue.md
@@ -0,0 +1,7 @@
+# Message Queue
+
+We rely on message queues to manage the various tasks that [the crawler](overview.md#crawler) needs to perform.
+
+### Code Path
+
+Currently, the code related to message queues is located at `lib/mq` and `src`.
diff --git a/doc/en/architecure/message-queue/README.md b/doc/en/architecure/message-queue/README.md
deleted file mode 100644
index d0a8349..0000000
--- a/doc/en/architecure/message-queue/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Message Queue
-
diff --git a/doc/en/architecure/message-queue/videotagsqueue.md b/doc/en/architecure/message-queue/videotagsqueue.md
deleted file mode 100644
index bdddddb..0000000
--- a/doc/en/architecure/message-queue/videotagsqueue.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# VideoTagsQueue
-
-### Jobs
-
-The VideoTagsQueue contains two jobs: `getVideoTags`and `getVideosTags`. The former is used to fetch the tags of a video, and the latter is responsible for scheduling the former.
-
-### Return value
-
-The return values across two jobs follows the following table:
-
-| Return Value | Description |
-| --- | --- |
-| 0 | In getVideoTags: the tags were successfully fetched<br>In getVideosTags: all null-tags videos have a corresponding job successfully queued. |
-| 1 | Used in getVideoTags: a fetch error occurred during the job |
-| 2 | Used in getVideoTags: we've reached the rate limit set in NetScheduler |
-| 3 | Used in getVideoTags: `aid` was not provided in the job data |
-| 4 | Used in getVideosTags: There's no video with NULL as `tags` |
-| 1xx | Used in getVideosTags: the number of tasks in the queue has exceeded the limit, thus getVideosTags stops adding tasks. xx is the number of jobs added to the queue during execution. |
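The return values listed in the removed VideoTagsQueue page can be read as a small lookup. The sketch below is only an illustration of that table: the function name `describeReturnValue` and the type `TagsJob` are hypothetical and not part of the CVSA code base, and it assumes that `1xx` means `100 + xx`, which the table does not state explicitly.

```typescript
// Hypothetical helper, not taken from the CVSA repository.
type TagsJob = "getVideoTags" | "getVideosTags";

function describeReturnValue(job: TagsJob, code: number): string {
  // 0 means success in both jobs, with a job-specific meaning.
  if (code === 0) {
    return job === "getVideoTags"
      ? "tags fetched successfully"
      : "a job was queued for every video with NULL tags";
  }
  if (job === "getVideoTags") {
    switch (code) {
      case 1:
        return "fetch error during the job";
      case 2:
        return "rate limit set in NetScheduler reached";
      case 3:
        return "no aid provided in the job data";
    }
  } else {
    if (code === 4) {
      return "no video has NULL tags";
    }
    // Assumption: "1xx" is read as 100 + (number of jobs added).
    if (code >= 100 && code <= 199) {
      return `queue limit exceeded after adding ${code - 100} job(s)`;
    }
  }
  return "unknown return value";
}
```

For example, under the assumption above, `describeReturnValue("getVideosTags", 112)` would report that the queue limit was hit after 12 jobs had been added.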
diff --git a/doc/en/architecure/overview.md b/doc/en/architecure/overview.md
index d80036e..e46c887 100644
--- a/doc/en/architecure/overview.md
+++ b/doc/en/architecure/overview.md
@@ -1,5 +1,4 @@
 ---
-icon: globe-pointer
 layout:
   title:
     visible: true
@@ -15,4 +14,14 @@ layout:
 
 # Overview
 
+The whole CVSA system can be separated into three parts:
+
+* Frontend
+* API
+* Crawler
+
+The frontend is driven by [Astro](https://astro.build/) and is used to display the final CVSA page. The API is driven by [Hono](https://hono.dev) and is used to query the database and provide REST/GraphQL APIs that can be called by our website, applications, or third parties. The crawler is our automatic data collector, used to automatically collect new songs from Bilibili, track their statistics, etc.
+
+### Crawler
+
 Automation is the biggest highlight of CVSA's technical design. To achieve this, we use a message queue powered by [BullMQ](https://bullmq.io/) to concurrently process various tasks in the data collection life cycle.