ML Training Service

A FastAPI-based ML training service for dataset building, embedding generation, and experiment management.

Architecture

The service is organized into modular components:

ml_new/training/
├── main.py              # FastAPI application entry point
├── models.py            # Pydantic data models
├── config_loader.py     # Configuration loading from TOML
├── database.py          # Database connection and operations
├── embedding_service.py # Embedding generation service
├── dataset_service.py   # Dataset building logic
├── api_routes.py        # API endpoint definitions
├── embedding_models.toml # Embedding model configurations
└── requirements.txt     # Python dependencies

Key Components

1. Main Application (main.py)

  • FastAPI app initialization
  • CORS middleware configuration
  • Service dependency injection
  • Startup/shutdown event handlers

2. Data Models (models.py)

  • DatasetBuildRequest: Request model for dataset building
  • DatasetBuildResponse: Response model for dataset building
  • DatasetRecord: Individual dataset record structure
  • EmbeddingModelInfo: Embedding model configuration

3. Configuration (config_loader.py)

  • Loads embedding model configurations from TOML
  • Manages model parameters (dimensions, API endpoints, etc.)

4. Database Layer (database.py)

  • PostgreSQL connection management
  • CRUD operations for video metadata, user labels, and embeddings
  • Optimized batch queries to avoid N+1 problems

5. Embedding Service (embedding_service.py)

  • Integration with OpenAI-compatible embedding APIs
  • Text preprocessing and checksum generation
  • Batch embedding generation with rate limiting

6. Dataset Building (dataset_service.py)

  • Complete dataset construction workflow:
    1. Pull raw text from database
    2. Text preprocessing (placeholder)
    3. Batch embedding generation with deduplication
    4. Embedding storage and caching
    5. Final dataset compilation with labels

7. API Routes (api_routes.py)

  • GET /api/v1/health: Health check
  • GET /api/v1/models/embedding: List available embedding models
  • POST /api/v1/dataset/build: Build a new dataset
  • GET /api/v1/dataset/{id}: Retrieve a built dataset
  • GET /api/v1/datasets: List all datasets
  • DELETE /api/v1/dataset/{id}: Delete a dataset

Dataset Building Flow

  1. Model Selection: Choose embedding model from TOML configuration
  2. Data Retrieval: Pull video metadata and user labels from PostgreSQL
  3. Text Processing: Combine title, description, and tags
  4. Deduplication: Generate checksums to avoid duplicate embeddings
  5. Batch Processing: Generate embeddings for new texts only
  6. Storage: Store embeddings in database with caching
  7. Final Assembly: Combine embeddings with labels using consensus mechanism

Configuration

Embedding Models (embedding_models.toml)

[text-embedding-3-large]
name = "text-embedding-3-large"
dimensions = 3072
type = "openai"
api_endpoint = "https://api.openai.com/v1/embeddings"
max_tokens = 8192
max_batch_size = 100

Environment Variables

  • DATABASE_URL: PostgreSQL connection string
  • OPENAI_API_KEY: OpenAI API key for embedding generation

Usage

Start the Service

cd ml_new/training
python main.py

Build a Dataset

curl -X POST "http://localhost:8322/api/v1/dataset/build" \
  -H "Content-Type: application/json" \
  -d '{
    "aid_list": [170001, 170002, 170003],
    "embedding_model": "text-embedding-3-large",
    "force_regenerate": false
  }'

Check Health

curl "http://localhost:8322/api/v1/health"

List Embedding Models

curl "http://localhost:8322/api/v1/models/embedding"

Features

  • High Performance: Optimized database queries with batch operations
  • Deduplication: Text-level deduplication using MD5 checksums
  • Consensus Labels: Majority vote mechanism for user annotations
  • Batch Processing: Efficient embedding generation and storage
  • Error Handling: Comprehensive error handling and logging
  • Async Support: Fully asynchronous operations for scalability
  • CORS Enabled: Ready for frontend integration

Production Considerations

  • Replace in-memory dataset storage with database
  • Add authentication and authorization
  • Implement rate limiting for API endpoints
  • Add monitoring and metrics collection
  • Configure proper logging levels
  • Set up database connection pooling
  • Add API documentation with OpenAPI/Swagger