ML Training Service
A FastAPI-based ML training service for dataset building, embedding generation, and experiment management.
Architecture
The service is organized into modular components:
ml_new/training/
├── main.py # FastAPI application entry point
├── models.py # Pydantic data models
├── config_loader.py # Configuration loading from TOML
├── database.py # Database connection and operations
├── embedding_service.py # Embedding generation service
├── dataset_service.py # Dataset building logic
├── api_routes.py # API endpoint definitions
├── embedding_models.toml # Embedding model configurations
└── requirements.txt # Python dependencies
Key Components
1. Main Application (main.py)
- FastAPI app initialization
- CORS middleware configuration
- Service dependency injection
- Startup/shutdown event handlers
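The wiring might look roughly like the following sketch; the imported names (Database, load_embedding_models, router) and the app.state usage follow the module layout above and are assumptions, not the actual implementation:

# main.py -- minimal sketch, not the actual implementation
import os
from contextlib import asynccontextmanager

import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from api_routes import router
from config_loader import load_embedding_models
from database import Database


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: open the DB pool and load the model catalogue,
    # then expose them to the routes via app.state
    app.state.db = Database(os.environ["DATABASE_URL"])
    await app.state.db.connect()
    app.state.embedding_models = load_embedding_models()
    yield
    # Shutdown: release resources
    await app.state.db.close()


app = FastAPI(title="ML Training Service", lifespan=lifespan)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
app.include_router(router)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8322)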
2. Data Models (models.py)
- DatasetBuildRequest: Request model for dataset building
- DatasetBuildResponse: Response model for dataset building
- DatasetRecord: Individual dataset record structure
- EmbeddingModelInfo: Embedding model configuration
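A sketch of what these models might contain; the request fields mirror the curl example in the Usage section, the EmbeddingModelInfo fields mirror embedding_models.toml, and the response/record fields are illustrative assumptions:

# models.py -- sketch; response and record fields are assumptions
from typing import Optional
from pydantic import BaseModel


class DatasetBuildRequest(BaseModel):
    aid_list: list[int]
    embedding_model: str
    force_regenerate: bool = False


class DatasetBuildResponse(BaseModel):
    dataset_id: str        # assumed identifier returned to the caller
    record_count: int


class DatasetRecord(BaseModel):
    aid: int
    text: str
    embedding: list[float]
    label: Optional[int] = None   # consensus label, if available


class EmbeddingModelInfo(BaseModel):
    name: str
    dimensions: int
    type: str
    api_endpoint: str
    max_tokens: int
    max_batch_size: int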
3. Configuration (config_loader.py)
- Loads embedding model configurations from TOML
- Manages model parameters (dimensions, API endpoints, etc.)
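A minimal loader could be as simple as the following sketch; it assumes Python 3.11+ for the standard-library tomllib, and the function name is illustrative:

# config_loader.py -- sketch; on Python < 3.11 the tomli package
# provides the same load() interface
import tomllib
from pathlib import Path

from models import EmbeddingModelInfo

CONFIG_PATH = Path(__file__).parent / "embedding_models.toml"


def load_embedding_models(path: Path = CONFIG_PATH) -> dict[str, EmbeddingModelInfo]:
    """Parse embedding_models.toml into a name -> model-info mapping."""
    with path.open("rb") as f:
        raw = tomllib.load(f)
    return {name: EmbeddingModelInfo(**params) for name, params in raw.items()}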
4. Database Layer (database.py)
- PostgreSQL connection management
- CRUD operations for video metadata, user labels, and embeddings
- Optimized batch queries to avoid N+1 problems
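For example, a batched lookup can fetch all requested videos in a single round trip instead of one query per ID; asyncpg and the table/column names in this sketch are assumptions:

# database.py -- sketch of a batched lookup; the actual driver and
# schema may differ
import asyncpg


class Database:
    def __init__(self, dsn: str):
        self._dsn = dsn
        self._pool: asyncpg.Pool | None = None

    async def connect(self) -> None:
        self._pool = await asyncpg.create_pool(self._dsn)

    async def close(self) -> None:
        if self._pool:
            await self._pool.close()

    async def fetch_video_metadata(self, aid_list: list[int]) -> list[asyncpg.Record]:
        # One query for the whole batch avoids the N+1 pattern
        query = """
            SELECT aid, title, description, tags
            FROM video_metadata
            WHERE aid = ANY($1::bigint[])
        """
        async with self._pool.acquire() as conn:
            return await conn.fetch(query, aid_list)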
5. Embedding Service (embedding_service.py)
- Integration with OpenAI-compatible embedding APIs
- Text preprocessing and checksum generation
- Batch embedding generation with rate limiting
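A sketch of the checksum and batching logic, assuming an OpenAI-compatible embeddings endpoint as configured in embedding_models.toml; the use of httpx, the helper names, and the fixed sleep between batches are illustrative:

# embedding_service.py -- sketch of checksum generation and batched
# calls to an OpenAI-compatible embeddings API
import asyncio
import hashlib
import os

import httpx

from models import EmbeddingModelInfo


def text_checksum(text: str) -> str:
    """MD5 checksum used for text-level deduplication."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


async def embed_batch(texts: list[str], model: EmbeddingModelInfo) -> list[list[float]]:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    vectors: list[list[float]] = []
    async with httpx.AsyncClient(timeout=60) as client:
        for i in range(0, len(texts), model.max_batch_size):
            chunk = texts[i : i + model.max_batch_size]
            resp = await client.post(
                model.api_endpoint,
                headers=headers,
                json={"model": model.name, "input": chunk},
            )
            resp.raise_for_status()
            vectors.extend(item["embedding"] for item in resp.json()["data"])
            await asyncio.sleep(0.5)  # crude rate limiting between batches
    return vectors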
6. Dataset Building (dataset_service.py)
- Complete dataset construction workflow:
- Pull raw text from database
- Text preprocessing (placeholder)
- Batch embedding generation with deduplication
- Embedding storage and caching
- Final dataset compilation with labels
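A high-level sketch of how dataset_service.py might orchestrate these steps; the db helper methods (fetch_embeddings, store_embeddings, fetch_labels) are assumed names, not the actual API:

# dataset_service.py -- sketch of the build flow
from embedding_service import embed_batch, text_checksum
from models import DatasetRecord, EmbeddingModelInfo


async def build_dataset(db, model: EmbeddingModelInfo, aid_list: list[int],
                        force_regenerate: bool = False) -> list[DatasetRecord]:
    # 1. Pull raw text, 2. preprocess (simple concatenation placeholder)
    rows = await db.fetch_video_metadata(aid_list)
    texts = {r["aid"]: f"{r['title']}\n{r['description']}\n{r['tags']}" for r in rows}

    # 3. Deduplicate via checksums; reuse cached embeddings unless forced
    checksums = {aid: text_checksum(t) for aid, t in texts.items()}
    cached = {} if force_regenerate else await db.fetch_embeddings(list(checksums.values()))
    missing = [aid for aid, c in checksums.items() if c not in cached]

    # 4. Generate embeddings for new texts only, then 5. store and cache them
    if missing:
        vectors = await embed_batch([texts[aid] for aid in missing], model)
        new = {checksums[aid]: vec for aid, vec in zip(missing, vectors)}
        await db.store_embeddings(model.name, new)
        cached.update(new)

    # 6. Assemble the final records with consensus labels
    labels = await db.fetch_labels(aid_list)
    return [
        DatasetRecord(aid=aid, text=texts[aid],
                      embedding=cached[checksums[aid]],
                      label=labels.get(aid))
        for aid in texts
    ]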
7. API Routes (api_routes.py)
- GET /api/v1/health: Health check
- GET /api/v1/models/embedding: List available embedding models
- POST /api/v1/dataset/build: Build a new dataset
- GET /api/v1/dataset/{id}: Retrieve a built dataset
- GET /api/v1/datasets: List all datasets
- DELETE /api/v1/dataset/{id}: Delete a dataset
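A sketch of how the first few routes might be declared; the app.state wiring and the delegation to build_dataset follow the earlier sketches and are assumptions:

# api_routes.py -- sketch of the route layer (remaining routes omitted)
from fastapi import APIRouter, HTTPException, Request

from dataset_service import build_dataset
from models import DatasetBuildRequest

router = APIRouter(prefix="/api/v1")


@router.get("/health")
async def health():
    return {"status": "ok"}


@router.get("/models/embedding")
async def list_embedding_models(request: Request):
    # Catalogue parsed from embedding_models.toml at startup
    return list(request.app.state.embedding_models.values())


@router.post("/dataset/build")
async def build_dataset_endpoint(body: DatasetBuildRequest, request: Request):
    models = request.app.state.embedding_models
    if body.embedding_model not in models:
        raise HTTPException(status_code=404, detail="Unknown embedding model")
    records = await build_dataset(
        request.app.state.db,
        models[body.embedding_model],
        body.aid_list,
        body.force_regenerate,
    )
    return {"record_count": len(records)}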
Dataset Building Flow
- Model Selection: Choose embedding model from TOML configuration
- Data Retrieval: Pull video metadata and user labels from PostgreSQL
- Text Processing: Combine title, description, and tags
- Deduplication: Generate checksums to avoid duplicate embeddings
- Batch Processing: Generate embeddings for new texts only
- Storage: Store embeddings in database with caching
- Final Assembly: Combine embeddings with labels using consensus mechanism
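The consensus step reduces multiple user annotations per video to a single label by majority vote; a minimal sketch of that helper follows, where the tie-handling policy (treat ties as unlabeled) is an assumption:

# Majority-vote consensus over user labels for one video
from collections import Counter


def consensus_label(votes: list[int]) -> int | None:
    """Return the most common label, or None when there are no votes or a tie."""
    if not votes:
        return None
    (label, count), *rest = Counter(votes).most_common(2)
    if rest and rest[0][1] == count:
        return None  # no clear majority
    return label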
Configuration
Embedding Models (embedding_models.toml)
[text-embedding-3-large]
name = "text-embedding-3-large"
dimensions = 3072
type = "openai"
api_endpoint = "https://api.openai.com/v1/embeddings"
max_tokens = 8192
max_batch_size = 100
Environment Variables
- DATABASE_URL: PostgreSQL connection string
- OPENAI_API_KEY: OpenAI API key for embedding generation
Usage
Start the Service
cd ml_new/training
python main.py
Build a Dataset
curl -X POST "http://localhost:8322/v1/dataset/build" \
-H "Content-Type: application/json" \
-d '{
"aid_list": [170001, 170002, 170003],
"embedding_model": "text-embedding-3-large",
"force_regenerate": false
}'
Check Health
curl "http://localhost:8322/v1/health"
List Embedding Models
curl "http://localhost:8322/v1/models/embedding"
Features
- High Performance: Optimized database queries with batch operations
- Deduplication: Text-level deduplication using MD5 checksums
- Consensus Labels: Majority vote mechanism for user annotations
- Batch Processing: Efficient embedding generation and storage
- Error Handling: Comprehensive error handling and logging
- Async Support: Fully asynchronous operations for scalability
- CORS Enabled: Ready for frontend integration
Production Considerations
- Replace in-memory dataset storage with database
- Add authentication and authorization
- Implement rate limiting for API endpoints
- Add monitoring and metrics collection
- Configure proper logging levels
- Set up database connection pooling
- Add API documentation with OpenAPI/Swagger