here's a little ted talk transcript downloader i threw together. watching videos is a timekill (sounds like the author learned this the hard way :) and sometimes we'd rather just skim a document: focus on the words, not the delivery.<p>below is a filter, the way UNIX utilities are supposed to be written, remember? it's bourne shell, sed and curl (no bash needed). no ruby/python/perl nonsense; no release on github; just simple stuff. i'm not endorsing curl as an http client, but people seem to like curl, so that's what's used.<p>first, the filter fetches an index of all the speaker names and saves it as "ted.idx", unless ted.idx already exists. you then feed speaker names to the filter on stdin and it writes the transcript text to stdout (if there is a transcript), with each transcript separated by a line of 80 dashes. each line of transcript text is prefixed with a colon and a space.
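to make the output layout concrete, here's a minimal offline sketch of it. format_transcript is a made-up helper, not part of the filter; it only mimics the format described above (80 dashes, then the title and every text line prefixed with ": ").

```shell
#!/bin/sh
# hypothetical helper, not part of the filter: prints one transcript
# in the layout described above. $1 is the title, body lines come from stdin.
format_transcript() {
    # separator: 80 zeros from printf, turned into 80 dashes by tr
    printf '%080d\n' 0 | tr 0 -
    # the title gets the same ": " prefix as the text
    printf ': %s\n' "$1"
    # prefix every body line with a colon and a space
    sed 's/^/: /'
}

# made-up sample input, just to show the shape of the output
printf 'first line\nsecond line\n' | format_transcript 'Some Talk'
```

the fixed separator and prefix are what make the file easy to search and split later with plain less, grep or sed.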
you read the output file however you want; maybe something like this:<p><pre><code> less -Gp--------- file.txt
</code></pre>
then you can jump from transcript to transcript by pressing "n" (next match) or "N" (previous match).<p>would you like to download all the ted transcripts? this will do it ("filter" being whatever you named the command):<p><pre><code> filter < ted.idx > file.txt
</code></pre>
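since every transcript starts with a separator line, the big file is also easy to take apart without less. a rough sketch, assuming the layout described above; count_transcripts and nth_transcript are made-up names, and awk and mktemp are my additions here (the filter itself only uses sh, sed and curl).

```shell
#!/bin/sh
# hypothetical helpers for a file produced by the filter.

# count transcripts: one separator line (10 or more leading dashes) each
count_transcripts() {
    grep -c '^----------' "$1"
}

# print the Nth transcript ($1 = N, $2 = file), separator line included
nth_transcript() {
    # i counts separator lines seen so far; print lines while i equals n
    awk -v n="$1" '/^----------/ { i++ } i == n' "$2"
}

# made-up sample data so the sketch runs without the network
sample=$(mktemp)
sep=$(printf '%080d' 0 | tr 0 -)
printf '%s\n: Talk one\n: hello\n%s\n: Talk two\n: world\n' "$sep" "$sep" > "$sample"

count_transcripts "$sample"      # prints 2
nth_transcript 2 "$sample"       # prints the separator, ": Talk two", ": world"
rm -f "$sample"
```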
here's the filter:<p>(note: \&#039 does not need the backslash in sed; that's only to get past the HN forum software without being translated)<p><pre><code> read -p'what should we call this command? ' d
[ x$d = x"" ]||
echo don\'t forget to place $d in your PATH
[ x$d = x"" ]||
cat > $d << done
c=http://www.ted.com/speakers
[ -f ted.idx ]||{
echo fetching ted.idx... >&2
b=\$(curl -s \$c/atoz |sed '
/Showing page 1 of/!d;
s/.* //;
s/<.*//;
')
curl -s \$c/atoz/page/[1-"\$b"] | sed '
/href=.*speakers\/.*html/!d;
s/</\\
/g;' |sed '
/speakers/!d;
s,.*=\",,;
s,\".*,,;
s,.*/,,;
s,\.html\$,,' > ted.idx
}
echo >&2;
echo "usage: less ted.idx" >&2;
echo "usage: grep speaker_name ted.idx |\$0 > file" >&2;
echo "usage: sed 'line-no!d' ted.idx |\$0 > file" >&2;
echo "usage: \$0 < ted.idx > file" >&2;
echo "usage: less -Gp----- file" >&2;
while read a;do
curl -s \$c/\$a.html |sed '
/notranslate.*href=.\/talks\//!d;
s,.*href=.,http://www.ted.com,;
s/\">.*//;
s/.*/url = \"&\"/;
' | curl -sK - | sed -n '
s/\&#039;/'"'"'/g;
/<title>/s//--------------------------------------------------------------------------------\\
: /;
s/<.title>//;
/transcriptLink/{
s/.*nofollow\">/: /;
s/<\/a>//;};
/^----------/p;
/^: /p;
' ; done
done</code></pre>