Upload a video or audio file together with a caption file (SRT or VTT) to extract speech clips for TTS training. YouTube download is also supported.
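As a rough illustration of how a caption file can drive clip extraction, the sketch below parses SRT text into timed cues, each of which could mark one candidate clip. This is not the tool's actual implementation, which is not shown here. The names `Cue` and `parse_srt` are hypothetical, and the sketch assumes well-formed SRT input and does not handle VTT headers or styling tags.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One caption entry: start/end in seconds plus its text (hypothetical type)."""
    start: float
    end: float
    text: str


# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm"; a "." separator is also accepted.
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into a list of cues, skipping blocks without text."""
    cues: List[Cue] = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # SRT blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match is None:
                continue
            groups = match.groups()
            # Everything after the timing line is caption text.
            text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
            if text:
                cues.append(
                    Cue(_to_seconds(*groups[:4]), _to_seconds(*groups[4:]), text)
                )
            break
    return cues
```

Each resulting cue gives a start time, an end time, and a transcript, which is the pairing a TTS dataset typically needs. The audio itself would still have to be cut at those times with a separate tool.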