Compare commits

6 Commits

28 changed files with 479 additions and 174 deletions

View File

@ -0,0 +1,107 @@
openapi: 3.1.0
info:
  title: CVSA API
  version: v1
servers:
  - url: https://api.projectcvsa.com
paths:
  /video/{id}/snapshots:
    get:
      summary: Get list of video snapshots
      description: Get a list of video snapshots by video ID. The ID can be "av" followed by a number, "BV" followed by 10 alphanumeric characters (12 characters in total), or a bare integer that is the video's av number on bilibili.
      parameters:
        - in: path
          name: id
          required: true
          schema:
            type: string
          description: "The ID of the video (e.g. av78977256, BV1KJ411C7CW, 78977256)"
        - in: query
          name: ps
          schema:
            type: integer
            minimum: 1
            default: 1000
          description: The number of snapshots returned per page (page size). Defaults to 1000.
        - in: query
          name: pn
          schema:
            type: integer
            minimum: 1
          description: The page number, used for pagination. Only one of offset and pn can be specified.
        - in: query
          name: offset
          schema:
            type: integer
            minimum: 1
          description: The offset for offset-based queries. Only one of offset and pn can be specified.
        - in: query
          name: reverse
          schema:
            type: boolean
          description: If true, return snapshots in reverse order, from oldest to newest. Defaults to false.
      responses:
        "200":
          description: Successfully retrieved snapshots
          content:
            application/json:
              schema:
                type: array
                items:
                  type: object
                  properties:
                    id:
                      type: integer
                      description: Snapshot ID (not the same as the video ID)
                    aid:
                      type: integer
                      description: The av number of the video
                    views:
                      type: integer
                      description: The number of views the video has
                    coins:
                      type: integer
                      description: The number of coins the video has
                    likes:
                      type: integer
                      description: The number of likes the video has
                    favorites:
                      type: integer
                      description: The number of favorites the video has
                    shares:
                      type: integer
                      description: The number of shares the video has
                    danmakus:
                      type: integer
                      description: The number of danmakus the video has
                    replies:
                      type: integer
                      description: The number of replies the video has
        "400":
          description: Invalid query parameters
          content:
            application/json:
              schema:
                type: object
                properties:
                  message:
                    type: string
                    description: Error message
                  errors:
                    type: object
                    description: Detailed error information
        "500":
          description: Internal server error
          content:
            application/json:
              schema:
                type: object
                properties:
                  message:
                    type: string
                    description: Error message
                  error:
                    type: object
                    description: Detailed error information
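
As an illustration of how a client might respect the constraints declared in this spec, here is a minimal TypeScript sketch that builds a request URL for `/video/{id}/snapshots`. The helper name `buildSnapshotsUrl` and the `SnapshotQuery` type are hypothetical and not part of the CVSA codebase; only the base URL, the parameter names, the `minimum: 1` bounds, and the "only one of offset and pn" rule come from the spec.

```typescript
// Query options mirroring the parameters declared in the spec.
interface SnapshotQuery {
  ps?: number; // page size; the spec's default is 1000
  pn?: number; // page number; mutually exclusive with `offset`
  offset?: number; // offset; mutually exclusive with `pn`
  reverse?: boolean;
}

// Taken from the `servers` entry of the spec.
const BASE_URL = "https://api.projectcvsa.com";

// Hypothetical helper: builds the URL and enforces the documented constraints client-side.
function buildSnapshotsUrl(id: string | number, query: SnapshotQuery = {}): string {
  if (query.pn !== undefined && query.offset !== undefined) {
    throw new Error("Only one of pn and offset can be specified.");
  }
  const params: string[] = [];
  const addInt = (name: string, value: number | undefined): void => {
    if (value === undefined) return;
    // The spec declares `minimum: 1` for ps, pn and offset.
    if (!Number.isInteger(value) || value < 1) {
      throw new Error(`${name} must be an integer >= 1.`);
    }
    params.push(`${name}=${value}`);
  };
  addInt("ps", query.ps);
  addInt("pn", query.pn);
  addInt("offset", query.offset);
  if (query.reverse !== undefined) params.push(`reverse=${query.reverse}`);
  const path = `/video/${encodeURIComponent(String(id))}/snapshots`;
  return BASE_URL + path + (params.length > 0 ? `?${params.join("&")}` : "");
}
```

Validating on the client is optional, since the server is expected to answer with a `400` for invalid query parameters, but it avoids a round trip for requests that can never succeed.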

View File

@ -17,9 +17,8 @@ layout:
Welcome to the CVSA Documentation!
This doc contains various information about the CVSA project, including technical architecture, tutorials for visitors,
etc.
This doc contains various information about the CVSA project, including technical architecture, tutorials for visitors, etc.
### Jump right in
<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>About CVSA</strong></td><td>Some information you might want to know about.</td><td></td><td></td><td><a href="about/this-project.md">this-project.md</a></td></tr><tr><td><strong>Architecture</strong></td><td>The technical details about how CVSA was built.</td><td></td><td></td><td><a href="broken-reference">Broken link</a></td></tr><tr><td><strong>API Doc</strong></td><td>Documentation about APIs provided by CVSA.</td><td></td><td></td><td><a href="broken-reference">Broken link</a></td></tr></tbody></table>
<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>About this project</strong></td><td>Some information you might want to know about.</td><td></td><td></td><td><a href="about/this-project.md">this-project.md</a></td></tr><tr><td><strong>Architecture</strong></td><td>The technical details about how CVSA was built.</td><td></td><td></td><td><a href="broken-reference">Broken link</a></td></tr><tr><td><strong>API Doc</strong></td><td>Documentation about APIs provided by CVSA.</td><td></td><td></td><td><a href="broken-reference/">broken-reference</a></td></tr><tr><td><strong>Source Code</strong></td><td>View this project on GitHub</td><td></td><td></td><td><a href="https://github.com/alikia2x/cvsa">https://github.com/alikia2x/cvsa</a></td></tr><tr><td>🇨🇳 中文版本</td><td>浏览本文档的中文版本</td><td></td><td></td><td><a href="https://app.gitbook.com/s/pv6AFgCPzXeRmP9slTBR/">欢迎</a></td></tr></tbody></table>

View File

@ -4,16 +4,16 @@
## About
* [About CVSA Project](about/this-project.md)
* [About the CVSA Project](about/this-project.md)
* [Scope of Inclusion](about/scope-of-inclusion.md)
## Architecure
## Architecture
* [Overview](architecure/overview.md)
* [Crawler](architecure/crawler.md)
* [Database Structure](architecure/database-structure/README.md)
* [Type of Song](architecure/database-structure/type-of-song.md)
* [Artificial Intelligence](architecure/artificial-intelligence.md)
* [Overview](architecture/overview.md)
* [Crawler](architecture/crawler.md)
* [Database Structure](architecture/database-structure/README.md)
* [Type of a Song](architecture/database-structure/type-of-song.md)
* [Artificial Intelligence](architecture/artificial-intelligence.md)
## API Doc

View File

@ -1,48 +1,34 @@
# Scope of Inclusion
CVSA contains many aspects of Chinese Vocal Synthesis, including songs, albums, artists (publisher, manipulators,
arranger, etc), singers and voice engines / voicebanks.&#x20;
CVSA covers many aspects of Chinese vocal synthesis, including songs, albums, artists (publishers, manipulators, arrangers, etc.), singers, and voice engines / voicebanks.
For a **song**, it must meet the following conditions to be included in CVSA:
### Category 30
In principle, the songs must be featured in a video that is categorized under the VOCALOID·UTAU (ID 30) category in
[Bilibili](https://en.wikipedia.org/wiki/Bilibili) in order to be observed by our
[automation program](../architecure/overview.md#crawler). We welcome editors to manually add songs that have not been
uploaded to bilibili / categorized under this category.
#### NEWS
Recently, Bilibili seems to be offlining the sub-category. This means the VOCALOID·UTAU category can no longer be
entered from the frontend, and producers can no longer upload videos to this category (instead, they can only choose the
parent category "Music").&#x20;
According to our experiments, Bilibili still retains the code logic of sub-categories in the backend, and newly
published songs may still be in the VOCALOID·UTAU sub-category, and the related APIs can still work normally. However,
there are [reports](https://www.bilibili.com/opus/1041223385394184199) that some of the new songs have been placed under
the "Music General" sub-category.\
We are still waiting for Bilibili's follow-up actions, and in the future, we may adjust the scope of our automated
program's crawling.
For a **song**, it must meet the following two conditions to be included in CVSA:
### At Least One Line of Chinese / Chinese Virtual Singer
The lyrics of the song must contain at least one line in Chinese. Otherwise, if the lyrics of the song do not contain
Chinese, it will only be included in the CVSA only if a Chinese virtual singer has been used.
The lyrics of the song must contain at least one line in Chinese. If the lyrics contain no Chinese, the song is included in CVSA only if a Chinese virtual singer is used.
We define a **Chinese virtual singer** as follows:
1. The singer primarily uses a Chinese voicebank (i.e. the most widely used voicebank for the singer is Chinese)
2. The singer is operated by a company, organization, individual or group located in Mainland China, Hong Kong, Macau or
2. The singer is operated by a company, organization, individual or group located in Mainland China, Hong Kong, Macau or\
Taiwan.
### Using Vocal Synthesizer
To be included in CVSA, at least one line of the song must be produced by a Vocal Synthesizer (including harmony
vocals).
To be included in CVSA, at least one line of the song must be produced by a Vocal Synthesizer (including harmony vocals).
We define a vocal synthesizer as a software or system that generates synthesized singing voices by algorithmically
modeling vocal characteristics and producing audio from input parameters such as lyrics, pitch, and dynamics,
encompassing both waveform-concatenation-based (e.g., VOCALOID, UTAU) and AI-based (e.g., Synthesizer V, ACE Studio)
approaches, **but excluding voice conversion tools that solely alter the timbre of pre-existing recordings** (e.g.,
[so-vits svc](https://github.com/svc-develop-team/so-vits-svc)).
We define a vocal synthesizer as software or a system that generates synthesized singing voices by algorithmically modeling vocal characteristics and producing audio from input parameters such as lyrics, pitch, and dynamics. This covers both waveform-concatenation-based (e.g., VOCALOID 1\~5, UTAU) and AI-based (e.g., Synthesizer V, ACE Studio) approaches, **but excludes voice conversion tools that solely alter the timbre of pre-existing recordings** (e.g., [so-vits svc](https://github.com/svc-develop-team/so-vits-svc)).
In addition, the songs must be featured in a video that is categorized under the VOCALOID·UTAU (ID 30) category in [Bilibili](https://en.wikipedia.org/wiki/Bilibili) in order to be observed by our [automation program](../architecture/overview.md#crawler). We welcome editors to manually add songs that have not been uploaded to bilibili / categorized under this category.
#### NEWS
Recently, Bilibili appears to be taking this sub-category offline. The VOCALOID·UTAU category can no longer be entered from the frontend, and producers can no longer upload videos to it (they can only choose the parent category "Music").
According to our experiments, Bilibili still retains the sub-category logic in its backend: newly published songs may still be placed in the VOCALOID·UTAU sub-category, and the related APIs still work normally. However, there are [reports](https://www.bilibili.com/opus/1041223385394184199) that some new songs have been placed under\
the "Music General" sub-category.
We are still waiting for Bilibili's follow-up actions, and we may adjust the crawling scope of our automated program in the future.

View File

@ -1,13 +1,13 @@
# About CVSA Project
# About the CVSA Project
CVSA (Chinese Vocal Synthesis Archive) aims to collect as much content as possible about the Chinese Vocal Synthesis
community in a highly automation-assisted way.&#x20;
CVSA (Chinese Vocal Synthesis Archive) aims to collect as much content as possible about the Chinese Vocal Synthesis\
community in a highly automation-assisted way.
Unlike existing projects such as [VocaDB](https://vocadb.net), CVSA collects and displays the following content in an
Unlike existing projects such as [VocaDB](https://vocadb.net), CVSA collects and displays the following content in an\
automated and manually edited way:
- Metadata of songs (name, duration, publisher, singer, etc.)
- Descriptive information of songs (content introduction, creation background, lyrics, etc.)
- Engagement data snapshots of songs, i.e. historical snapshots of their engagement data (including views, favorites,
likes, etc.) on the [Bilibili](https://en.wikipedia.org/wiki/Bilibili) website.
- Information about artists, albums, vocal synthesizers, and voicebanks.
* Metadata of songs (name, duration, publisher, singer, etc.)
* Descriptive information of songs (content introduction, creation background, lyrics, etc.)
* Statistical data snapshots of songs, i.e. historical snapshots of their statistical data (including the number of views, favorites, likes, etc.) on the [bilibili](https://en.wikipedia.org/wiki/Bilibili) website.
* Information about artists, albums, vocal synthesizers, and voicebanks.

View File

@ -1,3 +1,6 @@
# Songs
Not implemented yet.
{% openapi src="../.gitbook/assets/api-doc.yaml" path="/video/{id}/snapshots" method="get" %}
[api-doc.yaml](../.gitbook/assets/api-doc.yaml)
{% endopenapi %}
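
The `id` path parameter of this endpoint accepts three forms: `av` plus a number, a `BV` identifier, or a bare integer. The sketch below shows one way a client could classify such an ID before sending a request. The function name `parseVideoId` and the `VideoId` type are hypothetical, and the BV pattern (`BV` followed by 10 alphanumeric characters) is an assumption based on the example `BV1KJ411C7CW`; the server's actual validation may be stricter or looser.

```typescript
// Result of classifying a user-supplied video ID.
type VideoId =
  | { kind: "aid"; aid: number }
  | { kind: "bvid"; bvid: string };

// Hypothetical helper: returns null when the input matches none of the three documented forms.
function parseVideoId(id: string): VideoId | null {
  // "av" + number, e.g. av78977256
  if (/^av\d+$/.test(id)) return { kind: "aid", aid: Number(id.slice(2)) };
  // Bare integer, interpreted as the av number, e.g. 78977256
  if (/^\d+$/.test(id)) return { kind: "aid", aid: Number(id) };
  // Assumed BV shape: "BV" + 10 alphanumeric characters, e.g. BV1KJ411C7CW
  if (/^BV[0-9A-Za-z]{10}$/.test(id)) return { kind: "bvid", bvid: id };
  return null;
}
```

Normalizing `av78977256` and `78977256` to the same `aid` lets a client use a single cache key for both spellings.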

View File

@ -0,0 +1,23 @@
# Artificial Intelligence
CVSA's automated workflow relies heavily on artificial intelligence for information extraction and classification.
The AI systems we currently use are:
### The Filter (codename Akari)
Located at `/ml/filter/` under the project root directory, it classifies a video in [category 30](../about/scope-of-inclusion.md#category-30) into one of the following categories:
* 0: Not related to Chinese vocal synthesis
* 1: An original song with Chinese vocal synthesis
* 2: A cover/remix song with Chinese vocal synthesis
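
For code that consumes the Filter's output, the three labels can be modeled as a small lookup table. This is a sketch only: the names `FILTER_LABELS`, `FilterLabel`, and `isChineseVocalSynthesisSong` are hypothetical and do not come from the CVSA repository; the numeric labels and their meanings are those listed above.

```typescript
// Label values and meanings as listed in the documentation.
const FILTER_LABELS = {
  0: "Not related to Chinese vocal synthesis",
  1: "An original song with Chinese vocal synthesis",
  2: "A cover/remix song with Chinese vocal synthesis",
} as const;

type FilterLabel = keyof typeof FILTER_LABELS; // 0 | 1 | 2

// Hypothetical helper: labels 1 and 2 both describe Chinese vocal synthesis songs.
function isChineseVocalSynthesisSong(label: FilterLabel): boolean {
  return label !== 0;
}
```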
We also have some experimental work that is not yet in production:
### The Predictor
Located at `/ml/pred/` under the project root directory, it predicts the future views of a video. It is a regression model that takes the historical view trend of a video, other contextual information (such as the current time), and the future time point to predict as feature inputs, and outputs the increase in the video's view count from "now" to that future time point.
### Lyrics Alignment
Located at `/ml/lab/` under the project root directory, it uses [MMS wav2vec](https://huggingface.co/docs/transformers/en/model_doc/mms) and [Whisper](https://github.com/openai/whisper) models for phoneme-level and line-level alignment, respectively. The original purpose of this work is to drive the live lyrics feature in our other project: [AquaVox](https://github.com/alikia2x/aquavox).

View File

@ -0,0 +1,66 @@
# Crawler
Automation is at the core of CVSA's technical architecture. The `crawler` is built to efficiently orchestrate data collection tasks using a message queue system powered by [BullMQ](https://bullmq.io/). This design enables concurrent processing across multiple stages of the data collection lifecycle.
State management and data persistence are handled using a combination of Redis (for caching and real-time data) and PostgreSQL (as the primary database).
## `crawler/db`
This module handles all database interactions for the crawler, including creation, updates, and data retrieval.
- `init.ts`: Initializes the PostgreSQL connection pool.
- `redis.ts`: Sets up the Redis client.
- `withConnection.ts`: Exports `withDatabaseConnection`, a helper that provides a database context to any function.
- Other files: Contain table-specific functions, with each file corresponding to a database table.
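
To make the idea of "providing a database context to any function" concrete, here is a minimal sketch of what a helper like `withDatabaseConnection` could look like. It is not the project's actual implementation: the `DbClient` and `DbPool` interfaces are stand-ins for a real PostgreSQL pool and client, and the signature is an assumption.

```typescript
// Minimal shapes standing in for a real PostgreSQL pool and client (assumed, not the project's types).
interface DbClient {
  release(): void;
}
interface DbPool<C extends DbClient> {
  connect(): Promise<C>;
}

// Hypothetical sketch: wraps `operation` so that it receives a client which is always released afterwards.
function withDatabaseConnection<C extends DbClient, A extends unknown[], R>(
  pool: DbPool<C>,
  operation: (client: C, ...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const client = await pool.connect();
    try {
      return await operation(client, ...args);
    } finally {
      client.release(); // runs whether the operation resolved or threw
    }
  };
}
```

The `try`/`finally` is the point of the pattern: callers never handle connections directly, so a thrown error cannot leak a client from the pool.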
## `crawler/ml`
This module handles machine learning tasks, such as content classification.
- `manager.ts`: Defines a base class `AIManager` for managing ML models.
- `akari.ts`: Implements our primary classification model, `AkariProto`, which extends `AIManager`. It filters videos to determine if they should be included as songs.
## `crawler/mq`
This module manages task queuing and processing through BullMQ.
## `crawler/mq/exec`
Contains the functions executed by BullMQ workers. Examples include `getVideoInfoWorker` and `takeBulkSnapshotForVideosWorker`.
> **Terminology note:**
> In this documentation:
> - Functions in `crawler/mq/exec` are called **workers**.
> - Functions in `crawler/mq/workers` are called **BullMQ workers**.
**Design detail:**
Since BullMQ requires one handler per queue, we use a `switch` statement inside each BullMQ worker to route jobs based on their name to the correct function in `crawler/mq/exec`.
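
The routing described above can be sketched as follows. This is an illustration rather than the project's code: `JobLike` is a minimal stand-in for a BullMQ job, the job names `"getVideoInfo"` and `"bulkSnapshotVideo"` and the handler name `snapshotQueueHandler` are hypothetical, and the bodies of the two exec functions are stubs. Only the names `getVideoInfoWorker` and `takeBulkSnapshotForVideosWorker` come from the text.

```typescript
// Minimal stand-in for a BullMQ job; only the fields used here.
interface JobLike {
  name: string;
  data: unknown;
}
type ExecFn = (job: JobLike) => Promise<string>;

// Stubs standing in for functions in `crawler/mq/exec` (bodies invented for the sketch).
const getVideoInfoWorker: ExecFn = async (job) => `video info for ${JSON.stringify(job.data)}`;
const takeBulkSnapshotForVideosWorker: ExecFn = async (job) => `bulk snapshot for ${JSON.stringify(job.data)}`;

// The job name selects the exec function. Both job names are hypothetical.
function resolveExec(jobName: string): ExecFn {
  switch (jobName) {
    case "getVideoInfo":
      return getVideoInfoWorker;
    case "bulkSnapshotVideo":
      return takeBulkSnapshotForVideosWorker;
    default:
      throw new Error(`Unknown job name: ${jobName}`);
  }
}

// The single handler that would be registered for the queue as its BullMQ worker.
async function snapshotQueueHandler(job: JobLike): Promise<string> {
  return resolveExec(job.name)(job);
}
```

Throwing on an unknown job name makes a misrouted job fail visibly instead of being silently dropped.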
## `crawler/mq/workers`
Houses the BullMQ worker functions. Each function handles jobs for a specific queue.
## `crawler/mq/task`
To keep worker functions clean and focused, reusable logic is extracted into this directory as **tasks**. These tasks are then imported and used by the worker functions.
## `crawler/net`
This module handles all data fetching operations. Its core component is the `NetworkDelegate`, defined in `net/delegate.ts`.
## `crawler/net/delegate.ts`
Implements robust network request handling, including:
- Rate limiting by task type and proxy
- Support for serverless functions to dynamically rotate requesting IPs
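
As a rough illustration of rate limiting by task type and proxy, the sketch below keeps a sliding window of request timestamps per (task, proxy) pair. It is not the actual `NetworkDelegate` logic, which is not shown in this documentation: the class name `KeyedRateLimiter`, the sliding-window strategy, and the idea of passing the clock in as `now` are all assumptions made for the example.

```typescript
// Sliding-window limiter keyed by (task type, proxy). A sketch, not the project's implementation.
class KeyedRateLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the key is under its limit at time `now` (in ms).
  tryAcquire(task: string, proxy: string, now: number): boolean {
    const key = `${task}|${proxy}`;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => now - t < this.windowMs);
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}
```

Because each proxy has its own window, a task that is throttled on one proxy can still be attempted through another, which is the situation that rotating requesting IPs is meant to exploit.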
## `crawler/utils`
A collection of utility functions shared across the crawler modules.
## `crawler/src`
Contains the main entry point of the crawler.
We use [concurrently](https://www.npmjs.com/package/concurrently) to run multiple scripts in parallel, enabling efficient execution of various processes.

View File

@ -0,0 +1,14 @@
# Database Structure
CVSA uses [PostgreSQL](https://www.postgresql.org/) as our database.
All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the\
following tables:
* songs: stores the main information of songs.
* bilibili\_user: stores snapshots of Bilibili user information.
* bilibili\_metadata: stores metadata of all videos we have collected from bilibili.
* labelling\_result: contains the labels of videos in `bilibili_metadata`, as tagged by our [AI system](../artificial-intelligence.md#the-filter).
* video\_snapshot: stores statistical data of videos (e.g., number of views) that is fetched regularly; we call each such fetch a "snapshot".
* snapshot\_schedule: stores the scheduling information for video snapshots.

View File

@ -31,12 +31,7 @@ cvsa
**Package Breakdown:**
* **`backend`**: This package houses the server-side logic, built with the [Hono](https://hono.dev/) web framework. It's responsible for interacting with the database and exposing data through REST and GraphQL APIs for consumption by the frontend, internal applications, and third-party developers.
* **`frontend`**: The user-facing web interface of CVSA is developed using [Astro](https://astro.build/). This package handles the presentation layer, displaying information fetched from the database.
* **`crawler`**: This automated data collection system is a key component of CVSA. It's designed to automatically discover and gather new song data from bilibili, as well as track relevant statistics over time.
* **`core`**: This package contains reusable and generic code that is utilized across multiple workspaces within the CVSA monorepo.
### Crawler
Automation is the biggest highlight of CVSA's technical design. The data collection process within the `crawler` is orchestrated using a message queue powered by [BullMQ](https://bullmq.io/). This enables concurrent processing of various tasks involved in the data collection lifecycle. State management and data persistence are handled by a combination of Redis for caching and real-time data, and PostgreSQL as the primary database.
- **`backend`**: This package houses the server-side logic, built with the [Hono](https://hono.dev/) web framework. It's responsible for interacting with the database and exposing data through REST and GraphQL APIs for consumption by the frontend, internal applications, and third-party developers.
- **`frontend`**: The user-facing web interface of CVSA is developed using [Astro](https://astro.build/). This package handles the presentation layer, displaying information fetched from the database.
- **`crawler`**: This automated data collection system is a key component of CVSA. It's designed to automatically discover and gather new song data from bilibili, as well as track relevant statistics over time.
- **`core`**: This package contains reusable and generic code that is utilized across multiple workspaces within the CVSA monorepo.

View File

@ -1,21 +0,0 @@
# Artificial Intelligence
CVSA's automated workflow relies heavily on artificial intelligence for information extraction and classification.
The AI systems we currently use are:
### The Filter
Located at `/filter/` under project root dir, it classifies a video in the
[category 30](../about/scope-of-inclusion.md#category-30) into the following categories:
- 0: Not related to Chinese vocal synthesis
- 1: A original song with Chinese vocal synthesis
- 2: A cover/remix song with Chinese vocal synthesis
### The Predictor
Located at `/pred/`under the project root dir, it predicts the future views of a video. This is a regression model that
takes historical view trends of a video, other contextual information (such as the current time), and future time points
to be predicted as feature inputs, and outputs the increment in the video's view count from "now" to the specified
future time point.

View File

@ -1,4 +0,0 @@
# Crawler
A central aspect of CVSA's technical design is its emphasis on automation. The data collection process within the `crawler` is orchestrated using a message queue powered by [BullMQ](https://bullmq.io/). This enables concurrent processing of various tasks involved in the data lifecycle. State management and data persistence are handled by a combination of Redis for caching and real-time data, and PostgreSQL as the primary database.

View File

@ -1,15 +0,0 @@
# Database Structure
CVSA uses [PostgreSQL](https://www.postgresql.org/) as our database.
All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the
following tables:
- songs: stores the main information of songs
- bili\_user: stores snapshots of Bilibili user information
- all\_data: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#category-30).
- labelling\_result: Contains label of videos in `all_data`tagged by our
[AI system](../artificial-intelligence.md#the-filter).
- video\_snapshot: Statistical data of videos that are fetched regularly (e.g., number of views, etc.), we call this
fetch process as "snapshot".
- snapshot\_schedule: The scheduling information for video snapshots.

View File

@ -1,6 +1,6 @@
---
icon: hand-wave
description: CVSA (Chinese Vocal Synthesis Archive) is a website that archives information about the Chinese vocal synthesis community.
icon: hand-wave
layout:
title:
visible: true
@ -16,10 +16,10 @@ layout:
# Welcome
Welcome to the CVSA documentation!
Welcome to the Chinese Vocal Synthesis Archive (CVSA) documentation!
This documentation contains various information about the CVSA project, including information about the project itself, its technical architecture, a visitor guide, and API documentation.
### Navigation
<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>About this project</strong></td><td>Some things you might want to know…</td><td></td><td></td><td><a href="about/this-project.md">this-project.md</a></td></tr><tr><td><strong>Architecture</strong></td><td>Technical details about this project</td><td></td><td></td><td><a href="broken-reference">Broken link</a></td></tr><tr><td><strong>API Doc</strong> </td><td>Documentation for CVSA's public APIs</td><td></td><td></td><td><a href="broken-reference">Broken link</a></td></tr><tr><td><strong>Source Code</strong></td><td>View this project on <a href="https://github.com/alikia2x/cvsa">GitHub</a> or <a href="https://gitee.com/alikia/cvsa">Gitee</a></td><td></td><td></td><td><a href="https://gitee.com/alikia/cvsa">https://gitee.com/alikia/cvsa</a></td></tr><tr><td>🇺🇸 English Version</td><td>Hint: There's a language switcher on the top-left corner, just to the right of the logo.</td><td></td><td></td><td><a href="https://app.gitbook.com/o/ZRcyqFK0ovlJduZb50X0/s/89Gi0XfqMigoQkEYJZZl/">CVSA Doc English</a></td></tr></tbody></table>
<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>About this project</strong></td><td>Some things you might want to know…</td><td><a href="about/this-project.md">this-project.md</a></td></tr><tr><td><strong>Architecture</strong></td><td>Technical details about this project</td><td><a href="broken-reference">Broken link</a></td></tr><tr><td><strong>API Doc</strong> </td><td>Documentation for CVSA's public APIs</td><td><a href="broken-reference">Broken link</a></td></tr><tr><td>🇺🇸 English Version</td><td>Tip: There is a language selector in the header.</td><td><a href="https://app.gitbook.com/o/ZRcyqFK0ovlJduZb50X0/s/89Gi0XfqMigoQkEYJZZl/">CVSA Doc English</a></td></tr><tr><td><strong>Source Code</strong></td><td>View this project on <a href="https://github.com/alikia2x/cvsa">GitHub</a> or <a href="https://gitee.com/alikia/cvsa">Gitee</a></td><td><a href="https://gitee.com/alikia/cvsa">https://gitee.com/alikia/cvsa</a></td></tr><tr><td><strong>Website</strong></td><td>Our newly launched test site, where you can browse the information currently in the database</td><td><a href="https://projectcvsa.com">https://projectcvsa.com</a></td></tr></tbody></table>

View File

@ -9,12 +9,12 @@
## Architecture <a href="#architecture" id="architecture"></a>
- [Overview](architecture/overview.md)
- [Database Structure](architecture/database-structure/README.md)
- [Type of Song](architecture/database-structure/type-of-song.md)
- [Artificial Intelligence](architecture/artificial-intelligence.md)
- [Message Queue](architecture/message-queue/README.md)
- [The LatestVideosQueue Queue](architecture/message-queue/latestvideosqueue-dui-lie.md)
* [Overview](architecture/overview.md)
* [Introduction to the Crawler Module](architecture/crawler.md)
* [Database Structure](architecture/database-structure/README.md)
* [Type of Song](architecture/database-structure/type-of-song.md)
* [The snapshot\_schedule Table](architecture/database-structure/table-snapshot_schedule.md)
* [Machine Learning](architecture/machine-learning.md)
## API Doc <a href="#api-doc" id="api-doc"></a>

View File

@ -1,22 +1,32 @@
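# Scope of Inclusion
CVSA covers many aspects of Chinese vocal synthesis, including songs, albums, artists (publishers, tuners, arrangers, etc.), singers, and voice engines / voicebanks.&#x20;
CVSA covers many aspects of Chinese vocal synthesis, including songs, albums, artists (publishers, tuners, arrangers, etc.), singers, and voice engines / voicebanks.
For a **song** to be included in CVSA, it must meet the following conditions:
For a **song** to be included in CVSA, it must meet the following two conditions:
#### VOCALOID·UATU Category
### At Least One Line of Chinese / Chinese Virtual Singer
In principle, songs included in CVSA must be featured in a video under bilibili's VOCALOID·UTAU
category (category ID 30). In some special cases, this rule may not be mandatory.
The lyrics of the song must contain at least one line in Chinese. If the lyrics contain no Chinese, the song is included in CVSA only if a Chinese virtual singer is used.
#### At Least One Line of Chinese
We define a **Chinese virtual singer** as follows:
The lyrics of the song must contain at least one line in Chinese. This means that even if a Chinese-only voicebank is used, the song will not be included in CVSA if its lyrics contain no Chinese (for example, cross-lingual tuning).
1. The singer primarily uses a Chinese voicebank (i.e. the singer's most widely used voicebank is Chinese).
2. The singer is operated by a company, organization, individual, or group located in Mainland China, Hong Kong, Macau, or Taiwan.
#### Using a Vocal Synthesizer
### Using a Vocal Synthesizer
To be included in CVSA, at least one line of the song must be generated by a vocal synthesizer (including harmony parts).
To be included in CVSA, at least one line of the song must be synthesized by a vocal synthesizer (including harmony vocals).
We define a vocal synthesizer as software or a system that algorithmically models vocal characteristics and generates audio from input parameters such as lyrics and pitch, covering both waveform-concatenation-based (e.g.,
VOCALOID, UTAU) and AI-based (e.g., Synthesizer V, ACE Studio) approaches, **but excluding AI voice converters that only alter the timbre of existing vocals** (e.g.,
[so-vits svc](https://github.com/svc-develop-team/so-vits-svc)).
We define a vocal synthesizer as software or a system that algorithmically models vocal characteristics and generates audio from input parameters such as lyrics and pitch, covering both waveform-concatenation-based (e.g., VOCALOID 1\~5, UTAU) and AI-based (e.g., Synthesizer V, ACE Studio) approaches, **but excluding AI voice converters that only alter the timbre of existing vocals** (e.g., [so-vits svc](https://github.com/svc-develop-team/so-vits-svc)).
&#x20;
In addition, the song must be featured in a video published under the VOCALOID·UTAU category on bilibili in order to be observed by our automation program. We welcome editors to manually add songs that have not been uploaded to bilibili or are not categorized under this category.
**News**
Recently, bilibili appears to be taking its sub-categories offline. This means the VOCALOID·UTAU category can no longer be entered from the frontend, and creators can no longer upload videos to it (they can only choose the "Music" category).
According to our experiments, bilibili still retains the sub-category logic in its backend: newly published songs may still be in the VOCALOID·UTAU category, and the related APIs still work normally. There are [reports](https://www.bilibili.com/opus/1041223385394184199) that some new songs have been placed under the "Music General" sub-category. In addition, we have observed that bilibili does not actually respect the category chosen by the creator at upload time, and instead assigns a category to each video automatically by some method. We have already observed a [video](https://www.bilibili.com/video/av114163368068672/) that was classified outside the Music category.
We are still waiting for bilibili's follow-up actions, and we may adjust the crawling scope of our automation program in the future.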

View File

@ -6,33 +6,28 @@
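Across the internet, the following websites collect information about "Chinese vocal synthesis" or "Chinese virtual singers" (often abbreviated as 中V or VC) in a relatively systematic and comprehensive way:
- [Moegirlpedia (萌娘百科)](https://zh.moegirl.org.cn/):
Contains a large amount of information about Chinese vocal synthesis songs and singers, presented as a traditional wiki (based on [MediaWiki](https://www.mediawiki.org/)).
- [VCPedia](https://vcpedia.cn/):
Built by some members of the former Moegirlpedia Chinese vocal synthesis editing team, a site dedicated to integrating information about Chinese vocal synthesis[^1], presented as a traditional wiki (based on [MediaWiki](https://www.mediawiki.org/)).
- [VocaDB](https://vocadb.net/):
[A collaborative database around Vocaloid, UTAU, and other singing synthesizers, with artists, discographies, PVs, and more](#user-content-fn-2)[^2], which includes a large number of Chinese vocal synthesis works.
- [天钿Daily](https://tdd.bunnyxt.com/): A website for exchanging and sharing VC-related data. It crawls VC-related data regularly and presents the dimensions of it that are meaningful.
* [Moegirlpedia (萌娘百科)](https://zh.moegirl.org.cn/): Contains a large amount of information about Chinese vocal synthesis songs and singers, presented as a traditional wiki (based on [MediaWiki](https://www.mediawiki.org/)).
* [VCPedia](https://vcpedia.cn/): Built by some members of the former Moegirlpedia Chinese vocal synthesis editing team, a site dedicated to integrating information about Chinese vocal synthesis, presented as a traditional wiki (based on [MediaWiki](https://www.mediawiki.org/)).
* [VocaDB](https://vocadb.net/): [A collaborative database around Vocaloid, UTAU, and other singing synthesizers, with artists, discographies, PVs, and more](#user-content-fn-1)[^1], which includes a large number of Chinese vocal synthesis works.
* [天钿Daily](https://tdd.bunnyxt.com/): A website for exchanging and sharing VC-related data. It crawls VC-related data regularly and presents the dimensions of it that are meaningful.
Each of the websites above has shortcomings to some degree, for example:
- Moegirlpedia and VCPedia are limited by the traditional wiki format: the vast majority of their content depends on manual editing.
- VocaDB is built on a structured database, so some information can be generated programmatically, but **entry inclusion** still relies entirely on manual work.
- VocaDB mainly focuses on presenting metadata; it has little descriptive text about songs, authors, and so on, and lacks descriptive background information.
- 天钿Daily only shows the statistics and historical trends of songs, and collects no other information about them.
* Moegirlpedia and VCPedia are limited by the traditional wiki format: the vast majority of their content depends on manual editing.
* VocaDB is built on a structured database, so some information can be generated programmatically, but **entry inclusion** still relies entirely on manual work.
* VocaDB mainly focuses on presenting metadata; it has little descriptive text about songs, authors, and so on, and lacks descriptive background information.
* 天钿Daily only shows the statistics and historical trends of songs, and collects no other information about them.
Therefore, **CVSA** draws on the experience of its predecessors and aims to overcome the shortcomings above by achieving:
- Fully automated song inclusion (discovering songs and creating entries)
- Highly automated extraction of song metadata
- Fully automated collection of song statistics
- Welcoming and encouraging contributors to edit (mainly descriptive content) or correct errors, alongside program assistance
- Citing data from the sources above under appropriate license notices, to make the content more comprehensive and richer.
* Fully automated song inclusion (discovering songs and creating entries)
* Highly automated extraction of song metadata
* Fully automated collection of song statistics
* Welcoming and encouraging contributors to edit (mainly descriptive content) or correct errors, alongside program assistance
* Citing data from the sources above under appropriate license notices, to make the content more comprehensive and richer.
---
***
This article is provided under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).
[^1]: Quoted from [VCPedia](https://vcpedia.cn/%E9%A6%96%E9%A1%B5), provided under the [Creative Commons Attribution-NonCommercial-ShareAlike 3.0 China Mainland (CC BY-NC-SA 3.0 CN) license](https://creativecommons.org/licenses/by-nc-sa/3.0/cn/).
[^2]: Adapted from [VocaDB](https://vocadb.net/), provided under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
[^1]: Adapted from [VocaDB](https://vocadb.net/), provided under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).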

View File

@ -1,6 +1,6 @@
# Video Snapshots
{% openapi src="../.gitbook/assets/1.yaml" path="/video/{id}/snapshots" method="get" %}
[1.yaml](../.gitbook/assets/1.yaml)
{% openapi src="../.gitbook/assets/API-doc.yaml" path="/video/{id}/snapshots" method="get" %}
[API-doc.yaml](../.gitbook/assets/API-doc.yaml)
{% endopenapi %}

View File

@ -1,13 +0,0 @@
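# Artificial Intelligence
CVSA's automated workflow relies heavily on artificial intelligence for information extraction and classification.
The AI systems we currently use are:
#### Filter
Located at `/filter/` under the project root directory, it classifies videos in [category 30](../about/scope-of-inclusion.md#vocaloiduatu-fen-qu) into the following categories:
- 0: Not related to Chinese vocal synthesis
- 1: An original song with Chinese vocal synthesis
- 2: A cover/remix song with Chinese vocal synthesis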

View File

@ -0,0 +1,68 @@
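# Introduction to the Crawler Module
Automation is the core design principle of CVSA's technical architecture. The `crawler` module is responsible for the entire data collection pipeline. It manages tasks through message queues built on [BullMQ](https://bullmq.io/), which allows many collection tasks to be processed concurrently.
For data storage and state management, the system combines Redis (for caching and real-time data) with PostgreSQL (as the primary database), aiming for both stability and efficiency.
***
### Module Structure Overview
#### `crawler/db`: Database Operations
Handles interaction with the database, providing creation, update, and query functions.
* `init.ts`: Initializes the PostgreSQL connection pool.
* `redis.ts`: Configures the Redis client.
* `withConnection.ts`: Exports the `withDatabaseConnection` function, which wraps database operation functions and provides them with a database context.
* Other files: Each file corresponds to one database table and encapsulates the operations on that table.
#### `crawler/ml`: Machine Learning
Handles the logic related to machine learning models, mainly text classification of video content.
* `manager.ts`: Defines `AIManager`, a base class for managing models.
* `akari.ts`: Implements `AkariProto`, the classification model used to filter song videos, which extends `AIManager`.
#### `crawler/mq`: Message Queue
Integrates BullMQ to implement task scheduling and asynchronous processing.
**`crawler/mq/exec`**
This directory contains the handler functions for the various tasks. Although these functions are not "workers" as defined by BullMQ itself, this documentation still refers to them collectively as **workers** (for example, `getVideoInfoWorker` and `takeBulkSnapshotForVideosWorker`).
> **Note:**
>
> * Functions in `crawler/mq/exec` are called **workers**.
> * Functions in `crawler/mq/workers` are called **BullMQ workers**.
**Architecture design note:**\
Because BullMQ allows only one handler function per queue, we use a `switch` statement inside each BullMQ worker to distinguish job types and dispatch each job to the corresponding exec function.
**`crawler/mq/workers`**
This directory defines the actual BullMQ workers, which consume jobs from their queues and call the specific execution logic.
**`crawler/mq/task`**
To keep worker functions concise and maintainable, some complex logic is extracted into standalone "task" functions, which are collected in this directory.
#### `crawler/net`: Network Requests
This module communicates with external systems and is responsible for wrapping and managing all network requests. Its core is the `NetworkDelegate` class defined in `net/delegate.ts`.
**`crawler/net/delegate.ts`**
This is our main implementation for large-scale requests. It supports the following features:
* Rate limiting based on task type and proxy
* Dynamically switching the source IP of requests according to a strategy, in combination with a serverless architecture
#### `crawler/utils`: Utility Functions
Holds general-purpose utility functions used by the other modules.
#### `crawler/src`: Main Entry Point
This directory contains the crawler's startup scripts. We use [concurrently](https://www.npmjs.com/package/concurrently) to run multiple task files at the same time for parallel processing.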

View File

@ -2,14 +2,21 @@
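CVSA uses [PostgreSQL](https://www.postgresql.org/) as its database.
CVSA is designed with two
CVSA is designed with two databases, `cvsa_main` and `cvsa_cred`. The former stores publicly available data, while the latter stores users' personal information (such as login credentials and account management information).
All public data of CVSA (excluding users' personal data) is stored in a database named `cvsa_main`, which contains the following tables:
- songs: stores the main information of songs
- bilibili\_user: stores snapshots of Bilibili user information
- bilibili\_metadata: metadata of all videos in [category 30](../../about/scope-of-inclusion.md#vocaloiduatu-fen-qu)
- labelling\_result: contains the labels of videos in `all_data` tagged by our AI system.
- latest\_video\_snapshot: stores the latest snapshot of each video
- video\_snapshot: stores snapshots of videos, including a video's statistics (views, likes, etc.) at a specific time
- snapshot\_schedule: scheduling information for video snapshots; an auxiliary table
* songs: stores the main information of songs.
* bilibili\_user: stores metadata of bilibili uploaders (UP主).
* bilibili\_metadata: metadata of all bilibili videos we have collected.
* labelling\_result: contains the labels of videos in `bilibili_metadata` tagged by our machine learning model.
* latest\_video\_snapshot: stores the latest snapshot of each video.
* video\_snapshot: stores snapshots of videos, including a video's statistics (views, likes, etc.) at a specific time.
* snapshot\_schedule: scheduling information for video snapshots; an auxiliary table.
> **Snapshot:**
>
> We regularly collect statistics of bilibili videos, such as views, likes, and favorites. The statistics of a video at a given point in time constitute one snapshot of that video.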

View File

@ -0,0 +1,43 @@
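# The snapshot\_schedule Table
This table records the scheduling information of video snapshot tasks.
### Columns
| Column | Type | Nullable | Default | Description |
| ------------- | -------------------------- | -------- | ------------------------------------- | ------------------------------------ |
| `id` | `bigint` | No | `nextval('snapshot_schedule_id_seq')` | Primary key, auto-increment ID |
| `aid` | `bigint` | No | None | AV number of the bilibili video |
| `type` | `text` | Yes | None | Snapshot type. |
| `created_at` | `timestamp with time zone` | No | `CURRENT_TIMESTAMP` | Time the record was created |
| `started_at` | `timestamp with time zone` | Yes | None | Scheduled start time of the snapshot |
| `finished_at` | `timestamp with time zone` | Yes | None | Time the snapshot task finished |
| `status` | `text` | No | `'pending'` | Status of the snapshot task. |
### Column Values (to be completed)
#### The `type` Column
Identifies the type of the snapshot, such as periodic archiving, milestone tracking, or first inclusion.
* `archive`: A periodic snapshot of all videos in the `bilibili_metadata` table, taken at regular intervals.
* `milestone`: Scheduled when a song is detected to be approaching a milestone (殿堂/传说/神话, i.e. Hall of Fame / Legendary / Mythical).
* `new`: When a song is newly observed, its initial view growth is tracked continuously for up to 48 hours.
* `normal`: Periodic snapshots of all songs in the `songs` table, taken at a dynamic interval (6 to 72 hours) that depends on how fast the view count is growing.
#### The `status` Column
Identifies the current state of the snapshot task.
* `completed`: The snapshot task has completed.
* `failed`: The snapshot task failed for an unknown reason.
* `no_proxy`: The snapshot task was executed, but no proxy was available to take the snapshot.
* `pending`: The snapshot task has been scheduled but has not started yet.
* `processing`: The snapshot is being fetched.
* `timeout`: The snapshot task was not picked up within a certain time and was therefore discarded.
* `bili_error`: bilibili returned a status code indicating that the request failed.
### Notes
* The `started_at` column is the planned start time of the snapshot. The actual execution time may deviate from it slightly; for execution details, consult other logs or task tables.
* For each av number, multiple snapshot tasks of different types may be in the `pending` state at the same time, but only one `pending` task is allowed per type.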
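
The row shape and the pending-task rule described on this page can be sketched in TypeScript as follows. The type and function names (`SnapshotSchedule`, `canSchedule`) are hypothetical; mapping `bigint` columns to `number` and timestamps to `Date` is an assumption made for brevity, since a real PostgreSQL driver may return `bigint` values as strings.

```typescript
// Allowed values, taken from the `type` and `status` sections of this page.
type SnapshotType = "archive" | "milestone" | "new" | "normal";
type SnapshotStatus =
  | "completed"
  | "failed"
  | "no_proxy"
  | "pending"
  | "processing"
  | "timeout"
  | "bili_error";

// One row of snapshot_schedule; nullability follows the table on this page.
interface SnapshotSchedule {
  id: number; // bigint in the database; number is an assumption here
  aid: number; // bigint in the database; number is an assumption here
  type: SnapshotType | null;
  created_at: Date;
  started_at: Date | null;
  finished_at: Date | null;
  status: SnapshotStatus;
}

// Hypothetical check for the documented rule: per av number, at most one pending task per type.
function canSchedule(existing: SnapshotSchedule[], aid: number, type: SnapshotType): boolean {
  return !existing.some((s) => s.aid === aid && s.type === type && s.status === "pending");
}
```

In production this rule would more likely be enforced in the database or inside a transaction, because an in-memory check like this one cannot prevent two concurrent schedulers from inserting the same task.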

View File

@ -0,0 +1,27 @@
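# Machine Learning
CVSA's automated workflow relies heavily on machine learning for information extraction and classification.
The machine learning systems we currently use are:
#### Filter (codename Akari)
Located at `/ml/filter/` under the project root directory, it is a classification model that sorts videos from bilibili into the following categories:
* 0: Not related to Chinese vocal synthesis
* 1: An original song with Chinese vocal synthesis
* 2: A cover/remix song with Chinese vocal synthesis
It takes three channels of plain text as input: the video's title, description, and tags. A modified [model2vec](https://github.com/MinishLab/model2vec) model (derived from [jina-embedding-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) produces a 1024-dimensional embedding for each of the three channels as its representation. The embeddings are adjusted by learnable channel weights and fed into a single-layer fully connected network with a hidden dimension of 1296, which connects to a three-class classifier as the output. We use a custom loss function, `AdaptiveRecallLoss`, to optimize recall for vocal synthesis works (that is, to make the precision of class 0 as high as possible).
In addition, we have some experimental work that is not yet in production:
#### Predictor
Located at `/ml/pred/` under the project root directory, it predicts the future view count of a video. It is a regression model that takes the video's historical view trend, other contextual information (such as the current time), and the future time increment to predict as feature inputs, and outputs the increase in the video's view count from "now" to the specified future time point.
#### Lyrics Alignment
Located at `/ml/lab/` under the project root directory, it uses [MMS wav2vec](https://huggingface.co/docs/transformers/en/model_doc/mms) and [Whisper](https://github.com/openai/whisper) models for phoneme-level and line-level alignment, respectively. The original purpose of this work is to drive the live lyrics feature in our other project, [AquaVox](https://github.com/alikia2x/aquavox).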

View File

@ -1 +0,0 @@
# Message Queue

View File

@ -1 +0,0 @@
# The LatestVideosQueue Queue

View File

@ -14,13 +14,30 @@ layout:
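# Overview
The CVSA project is divided into three components: **crawler**, **frontend**, and **backend.**
CVSA is a [monorepo](https://en.wikipedia.org/wiki/Monorepo) codebase that uses [Deno workspace](https://docs.deno.com/runtime/fundamentals/workspaces/) as its monorepo management tool. TypeScript is the primary development language.
### **crawler**
**Project structure:**
Located in the project directory `packages/crawler`, it is responsible for the following: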
```
cvsa
├── deno.json
├── ml
│ ├── filter
│ ├── lab
│ └── pred
├── packages
│ ├── backend
│ ├── core
│ ├── crawler
│ └── frontend
└── README.md
```
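- Crawling new videos and including the works
- Continuously monitoring statistics such as video view counts
**Of these, `packages` is the main root directory of the monorepo and contains CVSA's main program logic:**
The entire crawler is driven by BullMQ message queues and uses Redis and PostgreSQL to manage state.
* **`backend`**: This module contains the server-side logic, built with the [Hono](https://hono.dev/) framework. It interacts with the database and exposes data through REST and GraphQL APIs for the frontend website, applications, and third parties.
* **`frontend`**: The CVSA website is powered by [Astro](https://astro.build/). This module contains the complete Astro frontend project.
* **`crawler`**: This module contains CVSA's automated data collection system. It is designed to automatically discover and collect new song data from bilibili and to track related statistics (such as view counts).
* **`core`**: This module contains reusable, generic code.
`ml` contains the machine-learning packages; see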