Compare commits

4 Commits

13 changed files with 79 additions and 62 deletions

View File

@ -9,12 +9,11 @@
## Architecture
- [Overview](architecure/overview.md)
- [Database Structure](architecure/database-structure/README.md)
- [Type of Song](architecure/database-structure/type-of-song.md)
- [Message Queue](architecure/message-queue/README.md)
- [VideoTagsQueue](architecure/message-queue/videotagsqueue.md)
- [Artificial Intelligence](architecure/artificial-intelligence.md)
* [Overview](architecure/overview.md)
* [Database Structure](architecure/database-structure/README.md)
* [Type of Song](architecure/database-structure/type-of-song.md)
* [Message Queue](architecure/message-queue.md)
* [Artificial Intelligence](architecure/artificial-intelligence.md)
## API Doc

View File

@ -7,13 +7,23 @@ For a **song**, it must meet the following conditions to be included in CVSA:
### Category 30
In principle, the songs featured in CVSA must be included in a video categorized under VOCALOID·UTAU (ID 30) that is
posted on Bilibili. In some special cases, this rule may not be enforced. 
In principle, a song must be featured in a video categorized under VOCALOID·UTAU (ID 30) on [Bilibili](https://en.wikipedia.org/wiki/Bilibili) in order to be observed by our [automation program](../architecure/overview.md#crawler). We welcome editors to manually add songs that have not been uploaded to Bilibili or are not categorized under this category.
### At Least One Line of Chinese
#### NEWS
The lyrics of the song must contain at least one line in Chinese. This means that even if a voicebank that only supports
Chinese is used, if the lyrics of the song do not contain Chinese, it will not be included in the CVSA.
Recently, Bilibili appears to be taking this sub-category offline. The VOCALOID·UTAU category can no longer be entered from the frontend, and producers can no longer upload videos to it (they can only choose the parent category "Music").
According to our experiments, Bilibili still retains the sub-category logic in its backend: newly published songs may still be placed in the VOCALOID·UTAU sub-category, and the related APIs still work normally. However, there are [reports](https://www.bilibili.com/opus/1041223385394184199) that some new songs have been placed under the "Music General" sub-category.\
We are still waiting for Bilibili's follow-up actions, and we may adjust the crawling scope of our automation program in the future.
### At Least One Line of Chinese / Chinese Virtual Singer
The lyrics of the song must contain at least one line in Chinese. If the lyrics do not contain any Chinese, the song will be included in CVSA only if a Chinese virtual singer has been used.
We define a **Chinese virtual singer** as follows:
1. The singer primarily uses a Chinese voicebank (i.e. the most widely used voicebank for the singer is Chinese)
2. The singer is operated by a company, organization, individual or group located in Mainland China, Hong Kong, Macau or Taiwan.
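The language condition above can be sketched as a small predicate. This is only an illustration, not code from the CVSA repository: the `Singer` and `Song` types, the field names, and the `"zh"` language tag are all hypothetical. It also assumes that both points of the definition must hold for a singer to count as a Chinese virtual singer, which the text does not state explicitly.

```typescript
// Hypothetical types for illustration; none of these names come from the CVSA codebase.
interface Singer {
  name: string;
  primaryVoicebankLanguage: string; // language of the singer's most widely used voicebank
  operatorRegion: string; // where the operating company, organization, individual or group is located
}

interface LyricLine {
  text: string;
  language: string; // assumed to be tagged upstream, e.g. "zh" or "ja"
}

interface Song {
  lyricLines: LyricLine[];
  singers: Singer[];
}

const CHINESE_REGIONS = new Set(["Mainland China", "Hong Kong", "Macau", "Taiwan"]);

// Assumption: a singer must satisfy BOTH points of the definition.
function isChineseVirtualSinger(singer: Singer): boolean {
  return (
    singer.primaryVoicebankLanguage === "zh" &&
    CHINESE_REGIONS.has(singer.operatorRegion)
  );
}

// A song passes if it has at least one Chinese lyric line,
// or, failing that, if it uses at least one Chinese virtual singer.
function meetsLanguageCondition(song: Song): boolean {
  const hasChineseLine = song.lyricLines.some((line) => line.language === "zh");
  return hasChineseLine || song.singers.some(isChineseVirtualSinger);
}
```

Under these assumptions, a song sung entirely in Japanese by a singer whose main voicebank is Chinese and whose operator is in Taiwan would pass, while the same lyrics sung only by a singer with a Japanese main voicebank would not.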
### Using Vocal Synthesizer

View File

@ -9,6 +9,10 @@ The AI systems we currently use are:
Located at `/filter/` under project root dir, it classifies a video in the
[category 30](../about/scope-of-inclusion.md#category-30) into the following categories:
- 0: Not related to Chinese vocal synthesis
- 1: An original song with Chinese vocal synthesis
- 2: A cover/remix song with Chinese vocal synthesis
* 0: Not related to Chinese vocal synthesis
* 1: An original song with Chinese vocal synthesis
* 2: A cover/remix song with Chinese vocal synthesis
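For code that consumes the filter's output, the three labels above could be modelled as a small lookup. This is a sketch with hypothetical names (`FILTER_LABELS`, `isFilterLabel`, `isCandidateSong`); the actual output format of the filter is not described here, and treating labels 1 and 2 as song candidates is an assumption.

```typescript
// Hypothetical mapping of the filter's numeric labels to their meanings.
const FILTER_LABELS = {
  0: "Not related to Chinese vocal synthesis",
  1: "An original song with Chinese vocal synthesis",
  2: "A cover/remix song with Chinese vocal synthesis",
} as const;

type FilterLabel = keyof typeof FILTER_LABELS; // 0 | 1 | 2

// Narrow an arbitrary number (e.g. a value read from the database) to a known label.
function isFilterLabel(value: number): value is FilterLabel {
  return value === 0 || value === 1 || value === 2;
}

// Assumption: labels 1 and 2 both mark a video as a candidate song; label 0 does not.
function isCandidateSong(label: FilterLabel): boolean {
  return label !== 0;
}
```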
### The Predictor
Located at `/pred/` under the project root dir, it predicts the future views of a video. It is a regression model whose feature inputs are the video's historical view trend, other contextual information (such as the current time), and the future time point to predict. Its output is the increment in the video's view count from "now" to that future time point.

View File

@ -5,8 +5,10 @@ CVSA uses [PostgreSQL](https://www.postgresql.org/) as our database.
All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the
following tables:
- songs: stores the main information of songs
- bili\_user: stores snapshots of Bilibili user information
- all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#category-30).
- labelling\_result: contains the labels of videos in `all_data`, tagged by our
[AI system](../artificial-intelligence.md#the-filter).
* songs: stores the main information of songs
* bili\_user: stores snapshots of Bilibili user information
* all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#category-30).
* labelling\_result: contains the labels of videos in `all_data`, tagged by our [AI system](../artificial-intelligence.md#the-filter).
* video\_snapshot: statistical data of videos (e.g., number of views) that is fetched regularly. We call each such fetch a "snapshot".
* snapshot\_schedule: the scheduling information for video snapshots.

View File

@ -0,0 +1,7 @@
# Message Queue
We rely on message queues to manage the various tasks that [the crawler](overview.md#crawler) needs to perform.
### Code Path
Currently, the code related to message queues is located at `lib/mq` and `src`.

View File

@ -1 +0,0 @@
# Message Queue

View File

@ -1,12 +0,0 @@
# VideoTagsQueue
### Jobs
The VideoTagsQueue contains two jobs: `getVideoTags` and `getVideosTags`. The former fetches the tags of a
video, and the latter is responsible for scheduling the former.
### Return value
The return values of the two jobs are listed in the following table:
<table><thead><tr><th width="168">Return Value</th><th>Description</th></tr></thead><tbody><tr><td>0</td><td>In <code>getVideoTags</code>: the tags were successfully fetched<br>In <code>getVideosTags</code>: every video with null tags has a corresponding job successfully queued.</td></tr><tr><td>1</td><td>Used in <code>getVideoTags</code>: a <code>fetch</code> error occurred during the job</td></tr><tr><td>2</td><td>Used in <code>getVideoTags</code>: we've reached the rate limit set in NetScheduler</td></tr><tr><td>3</td><td>Used in <code>getVideoTags</code>: <code>aid</code> was not provided in the job data</td></tr><tr><td>4</td><td>Used in <code>getVideosTags</code>: there's no video with NULL as <code>tags</code></td></tr><tr><td>1xx</td><td>Used in <code>getVideosTags</code>: the number of tasks in the queue has exceeded the limit, thus <code>getVideosTags</code> stops adding tasks. <code>xx</code> is the number of jobs added to the queue during execution.</td></tr></tbody></table>

View File

@ -1,5 +1,4 @@
---
icon: globe-pointer
layout:
title:
visible: true
@ -15,5 +14,14 @@ layout:
# Overview
Automation is the biggest highlight of CVSA's technical design. To achieve this, we use a message queue powered by
[BullMQ](https://bullmq.io/) to concurrently process various tasks in the data collection life cycle.
The whole CVSA system can be separated into three parts:
* Frontend
* API
* Crawler
The frontend is driven by [Astro](https://astro.build/) and is used to display the final CVSA page. The API is driven by [Hono](https://hono.dev) and is used to query the database and provide REST/GraphQL APIs that can be called by our website, applications, or third parties. The crawler is our automatic data collector: it collects new songs from Bilibili, tracks their statistics, and so on.
### Crawler
Automation is the biggest highlight of CVSA's technical design. To achieve this, we use a message queue powered by [BullMQ](https://bullmq.io/) to concurrently process various tasks in the data collection life cycle.

View File

@ -9,12 +9,12 @@
## Architecture <a href="#architecture" id="architecture"></a>
- [Overview](architecture/overview.md)
- [Database Structure](architecture/database-structure/README.md)
- [Type of Song](architecture/database-structure/type-of-song.md)
- [Artificial Intelligence](architecture/artificial-intelligence.md)
- [Message Queue](architecture/message-queue/README.md)
- [VideoTagsQueue Queue](architecture/message-queue/video-tags-queue.md)
* [Overview](architecture/overview.md)
* [Database Structure](architecture/database-structure/README.md)
* [Type of Song](architecture/database-structure/type-of-song.md)
* [Artificial Intelligence](architecture/artificial-intelligence.md)
* [Message Queue](architecture/message-queue/README.md)
* [LatestVideosQueue Queue](architecture/message-queue/latestvideosqueue-dui-lie.md)
## API Doc <a href="#api-doc" id="api-doc"></a>

View File

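@ -4,7 +4,11 @@ CVSA uses [PostgreSQL](https://www.postgresql.org/) as its database.
All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the following tables:
- songs: stores the main information of songs
- bili\_user: stores snapshots of Bilibili user information
- all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#vocaloiduatu-fen-qu).
- labelling\_result: contains the labels of videos in `all_data`, tagged by our AI system.
* songs: stores the main information of songs
* bilibili\_user: stores snapshots of Bilibili user information
* bilibili\_metadata: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#vocaloiduatu-fen-qu)
* labelling\_result: contains the labels of videos in `all_data`, tagged by our AI system.
* latest\_video\_snapshot: stores the latest snapshot of each video
* video\_snapshot: stores video snapshots, i.e. a video's statistics at a specific time (views, likes, etc.)
* snapshot\_schedule: scheduling information for video snapshots; this is an auxiliary table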

View File

@ -0,0 +1,2 @@
# LatestVideosQueue Queue

View File

@ -1,15 +0,0 @@
---
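description: Information about the VideoTagsQueue queue.
---
# VideoTagsQueue Queue
### Jobs
The VideoTagsQueue contains two jobs: `getVideoTags` and `getVideosTags`. The former fetches the tags of a video, and the latter is responsible for scheduling the former.
### Return value
The return values of the two jobs are listed in the following table:
<table><thead><tr><th width="168">Return Value</th><th>Description</th></tr></thead><tbody><tr><td>0</td><td>In <code>getVideoTags</code>: the tags were successfully fetched<br>In <code>getVideosTags</code>: every video without tags has a corresponding job successfully queued.</td></tr><tr><td>1</td><td>In <code>getVideoTags</code>: a <code>fetch</code> error occurred during the job</td></tr><tr><td>2</td><td>In <code>getVideoTags</code>: the rate limit set in NetScheduler has been reached</td></tr><tr><td>3</td><td>In <code>getVideoTags</code>: <code>aid</code> was not provided in the job data</td></tr><tr><td>4</td><td>In <code>getVideosTags</code>: there is no video with NULL as <code>tags</code></td></tr><tr><td>1xx</td><td>In <code>getVideosTags</code>: the number of tasks in the queue has exceeded the limit, thus <code>getVideosTags</code> stops adding tasks. <code>xx</code> is the number of jobs added to the queue during execution.</td></tr></tbody></table>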

View File

@ -1,5 +1,4 @@
---
icon: globe-pointer
layout:
title:
visible: true
@ -15,4 +14,14 @@ layout:
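# Overview
Automation is the biggest highlight of CVSA's technical design. To achieve it, we use a message queue powered by BullMQ to concurrently process the various tasks in the data collection life cycle.
The whole CVSA project is divided into three components: **crawler**, **frontend** and **backend**.
### **crawler**
Located at `packages/crawler` in the project directory, it is responsible for the following:
* Fetching new videos and adding the works they contain to the archive
* Continuously monitoring video statistics such as view counts
The whole crawler is driven by BullMQ message queues and uses Redis and PostgreSQL to manage state.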