brew install youtube-dl
pip install pysrt beautifulsoup4 lxml
pip install --pre ttconv
Download the subtitles in ttml format and rename the file to subtitles.ttml.
youtube-dl --write-subs https://www.bbc.com/news/world-us-canada-65452940
Convert the subtitles to srt format. [1]
tt convert -i subtitles.ttml -o subtitles.srt
Read subtitles from srt file, remove all formatting (e.g. font tags) and save as plain text.
import pysrt
from bs4 import BeautifulSoup
subs = pysrt.open("subtitles.srt")
html_text = "\n".join([sub.text for sub in subs])
soup = BeautifulSoup(html_text, 'lxml')
plain_text = soup.get_text()
with open("subtitles.txt", "w") as text_file:
    text_file.write(plain_text)

[1] youtube-dl provides --convert-subs, which could be used to extract subtitles in srt format, but ttconv automatically removes unnecessary line breaks.
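The tag-stripping step can be tried on its own, without a subtitle file or any third-party parser. The sketch below is not the BeautifulSoup approach used above: it swaps in Python's standard-library `html.parser` to do the same job on an in-memory string. The names `TagStripper` and `strip_tags` and the sample text are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content only, dropping tags such as <font> or <i>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the markup with all tags removed (hypothetical helper)."""
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


# Made-up sample resembling subtitle text with formatting tags.
sample = '<font color="#ffffff">Hello, <i>world</i></font>\nSecond line'
print(strip_tags(sample))
```

This can be handy for checking what the cleaned text will look like before running the full pipeline. For real subtitle files the BeautifulSoup version above is likely the more forgiving choice with malformed markup.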