Created
June 14, 2020 14:43
-
-
Save Snarp/ce3e86ab9349f9cd2bb8b02d6d6116a5 to your computer and use it in GitHub Desktop.
Extracts all UTF-8 strings from YouTube's closed caption ("timedtext") JSON object to generate a rough transcript.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require 'json' | |
# Extracts all UTF-8 strings from YouTube's closed caption (`timedtext`) JSON | |
# object to generate a rough transcript. | |
# | |
# Instructions for acquiring the JSON: | |
# | |
# 1. Open the video in a new Firefox tab. Don't click "CC" yet. | |
# | |
# 2. Open Firefox Developer Tools (right-click somewhere and select "Inspect | |
# Element") and open the "Network" tab. (Shortcut: Ctrl+Shift+E) | |
# | |
# 4. Click "CC" to turn on closed captions. | |
# | |
# 5. Right click on the "?timedtext(somethingsometing)" object that appears | |
# in the list. | |
# | |
# 6. Select "Copy response". | |
# | |
# 7. Paste the JSON into your text editor / console / whatever. | |
# | |
def extract_closed_captions(json_string, output_fname="video_transcript.txt") | |
json = JSON::parse(json_string, symbolize_names: true) | |
words = Array.new | |
json[:events].each do |event| | |
next unless event[:segs].is_a?(Array) | |
event[:segs].each do |segment| | |
next unless segment.is_a?(Hash) | |
words.push(segment[:utf8]) | |
end | |
end | |
transcript = words.join | |
File.write(output_fname, transcript) if output_fname | |
return transcript | |
end |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment