Washington state has been holding a lot of press conferences with updates about the COVID-19 situation recently. The information has always been summarized in a few slides during the video, but those slides and explanatory text are only posted separately several hours to a day later.
youtube-dl will download videos off Twitter given just
the URL of the tweet, like this one. Then clone and run
./slide-detector.py video.mp4 473 105 727 397
video.mp4 is the
filename of the video, and the remaining arguments select a 727x397
rectangle whose top-left corner is at the coordinates (473, 105). That
is the correct rectangle to crop the linked video down to just the main
video section (i.e. omitting the ASL interpreter, who is always on screen).
Omit the numbers to process the video uncropped.
The script will output the slides as image files in the current
directory with names like
static_at_3:55.jpg for the slide that
appears on the screen 3 minutes and 55 seconds into the video.
When a slide is being shown, it is static, so detecting slides can be done by detecting when there is no motion. Motion detection is a common, simple kind of video analysis, so I started with a demo of using OpenCV to detect motion. Then basically all I needed to do was output a frame whenever no motion was detected for a long enough period of time.
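The core of motion detection by frame differencing can be sketched in a few lines. This is a NumPy illustration of the idea, not the script's actual code; the real script does the equivalent work with OpenCV calls like cv2.absdiff and cv2.threshold:

```python
import numpy as np

def motion_pixels(frame_a, frame_b, threshold=25):
    """Count pixels whose grayscale difference exceeds the threshold."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return int(np.count_nonzero(diff > threshold))

# Two identical 8-bit grayscale "frames": no motion detected.
still = np.full((10, 10), 128, dtype=np.uint8)
assert motion_pixels(still, still) == 0

# A frame with a bright changed region: motion detected.
moved = still.copy()
moved[2:5, 2:5] = 255
assert motion_pixels(still, moved) > 0
```

If the count of changed pixels stays at (or near) zero for long enough, the frame is static and is a slide candidate.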
Since slides are very low motion (although, due to video artifacts, not quite zero), I could tune the detector to be very sensitive to motion. Additionally, slides stay still for several seconds, longer than video of a person ever would, so they are easy to distinguish from live video. (I was originally thinking of trying to detect large sections of solid color, as I thought the background of a slide would be fairly distinctive, but this ended up being unnecessary: the motion detection was much simpler and already implemented.)
The original demo looked for motion relative to an initial frame by converting both frames to grayscale, applying a Gaussian blur, and taking their difference, thresholded so any difference greater than 25 (out of a max of 255) was counted as a change. I lowered the threshold to 10 to be more sensitive to motion (and stopped there, because experimenting with even lower thresholds showed they didn't work).
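That pipeline (grayscale, blur, difference, threshold) can be sketched as follows. This is a hedged NumPy stand-in for the actual cv2.cvtColor / cv2.GaussianBlur / cv2.absdiff / cv2.threshold calls; in particular, the box blur below is a cheap substitute for the Gaussian blur:

```python
import numpy as np

THRESHOLD = 10  # lowered from the demo's 25 to be more sensitive

def box_blur(gray, k=3):
    """Cheap stand-in for GaussianBlur: mean over a k x k neighborhood."""
    padded = np.pad(gray.astype(np.float32), k // 2, mode="edge")
    out = np.zeros(gray.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return out / (k * k)

def changed_mask(gray_a, gray_b):
    """Binary mask of pixels whose blurred difference exceeds THRESHOLD."""
    diff = np.abs(box_blur(gray_a) - box_blur(gray_b))
    return diff > THRESHOLD

a = np.zeros((8, 8), dtype=np.uint8)
b = a.copy()
b[4, 4] = 200  # a single bright changed pixel still survives the blur
assert changed_mask(a, a).sum() == 0
assert changed_mask(a, b).sum() > 0
```

The blur smears single-pixel compression noise out before thresholding, which is what lets a low threshold like 10 work without constant false positives.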
Instead of always comparing against the initial frame, I reset the comparison's anchor frame to the current frame whenever motion was detected. If that happens at least 3 seconds after the previous time motion was detected, then the anchor frame is considered to be a slide and is output to a file named with its timestamp, to make it easy to find that part of the video.
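The anchor-reset logic can be sketched like this. It is a simplified model of the approach described above, not the script's actual code; has_motion is a hypothetical comparator standing in for the thresholded-difference check:

```python
MIN_STILL_SECONDS = 3.0  # how long a frame must hold still to count as a slide

def find_slides(frames, fps, has_motion):
    """Return (timestamp_seconds, frame) pairs for frames that held still
    for at least MIN_STILL_SECONDS before motion resumed."""
    slides = []
    anchor, anchor_t = frames[0], 0.0  # frame we compare against
    last_motion_t = 0.0
    for i, frame in enumerate(frames[1:], start=1):
        t = i / fps
        if has_motion(anchor, frame):
            if t - last_motion_t >= MIN_STILL_SECONDS:
                slides.append((anchor_t, anchor))  # anchor held still: a slide
            anchor, anchor_t = frame, t            # re-anchor on motion
            last_motion_t = t
    return slides
```

For example, with one frame per second and integer "frames" compared by equality, a value that holds for five seconds is reported as a slide:

```python
frames = [1, 2, 2, 2, 2, 2, 3]
assert find_slides(frames, fps=1, has_motion=lambda a, b: a != b) == [(1.0, 2)]
```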
Finding the crop rectangle
Because the video always has an ASL interpreter on screen, there's never a part of the entire frame without motion. So the video first has to be cropped to just the inset with the main video, where the slides appear. The code for cropping frames is straightforward:
cropped_frame = frame[y:y+h, x:x+w]  # NumPy slicing: rows (y) first, then columns (x)
To get quicker feedback while refining the crop rectangle, I added key commands for adjusting it, which also print the new values so I can save them for future runs of the program. Before, each guess at a new rectangle meant editing the command line, rerunning the program, moving the multiple debug windows around so I could see the right one, and getting far enough into the video to check the result. Now I get immediate feedback and could easily fine-tune the selection to the rectangle given above.
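The adjustment logic amounts to mapping key presses to rectangle tweaks and printing the result. Here is a sketch under assumed bindings and step size (the actual keys in the script may differ); in the real tool the key codes come from cv2.waitKey in the display loop:

```python
STEP = 4  # pixels per key press (assumed value)

def adjust_rect(rect, key):
    """rect is (x, y, w, h); return it adjusted according to the key."""
    x, y, w, h = rect
    moves = {
        "a": (-STEP, 0, 0, 0), "d": (STEP, 0, 0, 0),         # move left/right
        "w": (0, -STEP, 0, 0), "s": (0, STEP, 0, 0),         # move up/down
        "-": (0, 0, -STEP, -STEP), "+": (0, 0, STEP, STEP),  # shrink/grow
    }
    dx, dy, dw, dh = moves.get(key, (0, 0, 0, 0))
    new = (x + dx, y + dy, w + dw, h + dh)
    print("crop rectangle:", *new)  # copy these values for future runs
    return new
```

Wiring functions like this into the frame-display loop is what turns crop selection from an edit-rerun cycle into a live adjustment.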