Poor man’s intelligent reframing for GoPro videos

Maxim Leonovich
13 min read · Jan 6, 2024


It’s been a while since I last touched this blog (3 years?) — time flies! A lot has changed; my startup is long gone, and I joined cube.dev engineering team. While life has generally been busy, there were almost two weeks of quiet time during the holidays, so I thought I’d work on a hobby project.

Outside of work, my biggest hobbies are, perhaps, snowboarding and buying gadgets. Hence, I own a few action cameras and have a ton of lame GoPro footage. Sometimes (often), I post it on Instagram. Sometimes, I make little edits with cheesy music.

Anyway, one of the cameras I own is the almighty Insta 360 One X II, which makes shooting action videos as easy as it can possibly be. You don’t even have to point it; just shoot and reframe later in their mobile app. While the app is generally ok, there’s one killer feature that makes it awesome — it’s their computer-vision (ok, AI) powered subject tracking. You simply select a person or an object of interest, and it will keep it centered as long as it’s visible somewhere in the 360 frame. This creates great-looking videos composition-wise. However, there’s one downside to it: the picture quality of a 360 camera. After cropping in from a 5.7k sphere into something flat-ish looking, there are not enough pixels left to keep it sharp and vibrant. This is where good old GoPro comes to help. It shines in the picture quality department, but you need a good cameraman who can follow you and keep the camera pointed at all times — a rare breed, to say the least.

Insta 360 reframing interface

GoPro is not a 360 camera, but it still captures a lot of information, especially in its SuperView mode. Unfortunately, its app doesn’t offer anything that resembles Insta 360's reframing. If you want to create a vertical video for Instagram, but all you have is a shaky 16:9 or 8:7, your best shot at keeping things centered is manually keyframing the cropping window in something like FCPX to move it around. This feels wrong in an age of AI, and I thought, “How hard could it be to create an intelligent reframing feature that would keep the subject centered as much as possible completely automatically? What can I do with just off-the-shelf tools and zero CV knowledge?”

Here’s some background and assumptions. I’ve been doing software for most of my life, but I’ve never worked with video or computer vision. I know how to python, but I’ll use ChatGPT and Google for the rest of it. For our purpose, we’ll assume we’re reframing a “follow-cam” video with one subject that should be centered in the frame most of the time. Also, performance doesn’t matter — just a proof of concept that it’s hackable in a couple of evenings by an average engineer. Let’s go!

As with many things these days, I started by consulting ChatGPT.

After a bit of Googling and reading some blogs, I settled on OpenCV, YOLO for object detection, and DeepSORT with a CLIP embedder for tracking. I also asked ChatGPT to generate the app structure for me, but since the code has changed many times since that initial draft, I'd rather tell the rest of the story from the final version.

Here’s the main idea:

  • We’ll use OpenCV to read the source video frame by frame, “do AI,” and crop each frame to the new dimensions.
  • We’ll detect all people in each frame using YOLO, then track them across frames using DeepSORT and CLIP.
  • We’ll analyze the obtained tracking information, de-noise it, and apply some heuristics to pick the subject for each frame.
  • We’ll then crop the original video to a smaller size and “ease” the cropping window towards the subject’s center.
  • We’ll re-encode the final video and combine it with the original audio using FFmpeg.

Let’s start with the basic structure:

import argparse
import pickle
import tempfile
from collections import defaultdict

import cv2
import ffmpeg
import imageio
import numpy as np
from deep_sort_realtime.deepsort_tracker import DeepSort
from ultralytics import YOLO

# BGR colors used for the preview overlays
GREEN, WHITE = (0, 255, 0), (255, 255, 255)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process a GoPro video.")
    parser.add_argument('video_path', type=str,
                        help='Path to the GoPro video file')
    parser.add_argument('command', type=str,
                        help='Action to perform', choices=['track', 'reframe'])
    parser.add_argument('--preview', dest='preview', action='store_true',
                        help='Display the processed video in a window', default=False)
    args = parser.parse_args()

    subjects_fn = f'{args.video_path.split(".")[0]}_subjects.pickle'

    if args.command == 'track':
        track(args.video_path, subjects_fn, args.preview)
    elif args.command == 'reframe':
        reframe(args.video_path, subjects_fn, args.preview)
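
If you save the whole thing as, say, reframe.py (a file name I'm picking purely for illustration), the two stages are invoked like this; the video file name is a placeholder too:

python reframe.py GX010042.MP4 track --preview
python reframe.py GX010042.MP4 reframe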

The tracking portion proved to be computationally heavy and slow. Also, I had to iterate a few times on easing parameters and overall cropping logic. Hence, we’re going to split the process into two stages: track and reframe. The tracking stage will run YOLO and DeepSort, apply subject detection heuristics, and save the result into an intermediary file. The reframing stage will then pick up the file and complete the reframing/reencoding.

Let’s take a closer look at the tracking stage.

def detect_people(frame, model):
    # Apply the YOLOv8 detector to the frame and keep only people (class_id == 0).
    # "mps" here means it will use the hardware acceleration on macOS.
    # Change it to "cpu" if you're on Linux or "cuda" if you have an Nvidia GPU.
    detections = model(frame, device="mps")[0]
    for data in detections.boxes.data.tolist():
        confidence = data[4]
        class_id = data[5]
        if confidence >= 0.5 and class_id == 0:
            xmin, ymin, xmax, ymax = int(data[0]), int(data[1]), int(data[2]), int(data[3])
            yield [[xmin, ymin, xmax - xmin, ymax - ymin], confidence, class_id]


def track(video_path, subjects_fn, preview):
    # Open the source file using OpenCV
    cap = cv2.VideoCapture(video_path)
    # Initialize the YOLOv8 detector.
    # It will automatically download the model weights on the first run.
    detector = YOLO("yolov8l.pt")
    # Also initialize the DeepSort tracker.
    # The embedder parameter specifies the model to use for feature extraction.
    # In our case we're going to use one of the pre-trained variants of a CLIP model.
    tracker = DeepSort(max_age=10, embedder='clip_ViT-B/32',
                       embedder_gpu=False)

    frame_count = 0
    detections = []
    tracks = []

    # For our subject detection logic we'll need to know the total duration of each track.
    # Because of that, we can't pick the subject online; we'll have to do a second pass.
    # On the first pass we'll just accumulate tracks and their durations in this dictionary.
    track_durations = defaultdict(int)
    tracks_per_frame = []
    subjects = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame_count += 1

        # Run detection only on every other frame to speed up processing
        if frame_count % 2 == 0:
            # Detect people
            detections = list(detect_people(frame, detector))
            tracks = tracker.update_tracks(detections, frame=frame)

        tracks_per_frame.append([])

        for track in tracks:
            # If the track is not confirmed, ignore it
            if not track.is_confirmed():
                continue

            # Update track durations and save some per-frame info for the second pass
            track_durations[track.track_id] = track.age
            tracks_per_frame[-1].append({
                'track_id': track.track_id,
                'bbox': track.to_ltrb(),
            })

            # Draw the bounding box and the track id on the frame
            # to display in the preview window and track progress
            track_id = track.track_id
            ltrb = track.to_ltrb()
            xmin, ymin, xmax, ymax = int(ltrb[0]), int(ltrb[1]), int(ltrb[2]), int(ltrb[3])
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), GREEN, 2)
            cv2.rectangle(frame, (xmin, ymin - 20), (xmin + 20, ymin), GREEN, -1)
            cv2.putText(frame, str(track_id), (xmin + 5, ymin - 8),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, WHITE, 2)

        # Display the frame
        if preview:
            cv2.imshow('Processed Frame', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

    # After all tracking is done, run subject detection and dump the result
    # into an intermediary file
    subjects = find_subjects(tracks_per_frame, track_durations)
    pickle.dump(subjects, open(subjects_fn, 'wb'))

    cap.release()
    cv2.destroyAllWindows()

DeepSORT works in conjunction with YOLO object detection. For each frame, YOLO detects bounding boxes of all objects of interest (in our case, people) and passes them over to DeepSORT. It then updates the list of currently active tracks using YOLO detections, people’s appearance embeddings from an external embedder, and some math.

The DeepSORT algorithm assigns a unique track_id to each detected identity and aims to minimize the number of identity switches. An identity switch happens when the tracker loses someone (they get occluded, leave the frame, or change their pose too much) and later picks them up again under a new ID. While the ultimate goal is to track each person uniquely, in practice YOLO + DeepSORT generates tens of thousands of track IDs for a 5-minute video with fewer than 100 real people. The fewer track IDs we have, the easier it will be to detect the subject afterward. An “embedder” model generates a vector representation of whatever it sees, and DeepSORT uses this information to reduce the number of ID switches. I tried a few supported embedders and found that OpenAI’s CLIP performed the best for our case. It is painfully slow because it doesn’t run on Apple Silicon GPUs yet, but I can live with that. Fingers crossed, they’re going to bring full MPS support to PyTorch someday 🤞
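
If you're curious just how fragmented the tracking gets, the track_durations dictionary built during the tracking stage already has everything you need for a quick sanity check. Here's a throwaway sketch (not part of the pipeline) that prints a few stats:

def print_track_stats(track_durations):
    # track_durations maps track_id -> track age in frames (built in track())
    durations = sorted(track_durations.values(), reverse=True)
    total_tracks = len(durations)
    total_frames = sum(durations)
    print(f"unique track IDs: {total_tracks}")
    if total_tracks and total_frames:
        print(f"longest track: {durations[0]} frames")
        top = durations[:max(1, total_tracks // 5)]
        print(f"top 20% of tracks cover {sum(top) / total_frames:.0%} of all tracked frames")

With tens of thousands of IDs for fewer than a hundred real people, stats like these make it obvious why the subject detection below filters tracks by duration before doing anything else.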

Moving on to subject detection. The primary assumption is that the person or object that appears most frequently across all video frames is the "subject" of the video. In an ideal world where tracking works perfectly, we could sort all tracks by duration, select the longest one, and consider it the subject. However, identities often change and there is a lot of noise in the resulting tracks, making this approach unreliable. To improve accuracy, we can use various heuristics such as scoring people not only by track duration but also by their proximity to the frame center and the size of their bounding box. Nevertheless, these metrics are also noisy and it is challenging to blend them effectively with the duration metric. In the end, I opted for a simple approach that works reasonably well:

  • Filter out the bottom 80% of tracks by duration, as they’re likely noise.
  • If the frame has more than eight people in it, give up and just focus on the center.
  • Otherwise, pick the track with the longest total duration among those appearing in the current frame and center on it.

def bbox_center(bbox):
    return (int((bbox[0] + bbox[2]) // 2), int((bbox[1] + bbox[3]) // 2))


def filter_top_percent_tracks(track_durations, top_percent):
    # Calculate the number of tracks to keep (top N%)
    num_tracks_to_keep = int(len(track_durations) * top_percent)

    sorted_tracks = sorted(track_durations.items(),
                           key=lambda item: item[1], reverse=True)
    top_tracks = sorted_tracks[:num_tracks_to_keep]

    # Create a new dictionary with only the top tracks
    filtered_track_durations = {
        track_id: duration for track_id, duration in top_tracks}

    return filtered_track_durations


def find_subjects(frames, track_durations):
    subjects = []
    if len(track_durations) > 100:
        # The percentage of tracks to keep varies depending on how crowded the video is
        track_durations = filter_top_percent_tracks(track_durations, 0.2)

    for frame in frames:
        longest_duration = 0
        subject_center = None
        # Don't even try to re-center frames with more than 8 people
        if len(frame) <= 8:
            for track in frame:
                track_id = track['track_id']
                duration = track_durations.get(track_id, 0)

                if duration > longest_duration:
                    longest_duration = duration
                    subject_center = bbox_center(track['bbox'])

        subjects.append(subject_center)

    return subjects
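
By the way, the intermediary file is nothing fancy: a pickled list with one entry per frame, holding either the subject’s (x, y) center or None when no subject was found. It’s handy to peek at while tuning the heuristics; here’s a tiny sketch (the file name is just a placeholder matching the _subjects.pickle naming above):

import pickle

with open('GX010042_subjects.pickle', 'rb') as f:
    subjects = pickle.load(f)

print(f"{len(subjects)} frames, "
      f"{sum(s is None for s in subjects)} without a subject")
print(subjects[:5])  # each entry is None or an (x, y) center in source-frame pixels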

Finally, let’s add reframing. We’ll go frame by frame again and try to focus the camera on the subject if there is one, or drift back to the middle of the frame otherwise. One challenge is that the subject position “jitters” a lot because YOLO detection is fuzzy. To help with this, we apply easing so the resulting video isn’t too jerky. We’ll write cropped frames to a new video, combine it with the original audio, and compress it to a reasonable size using FFmpeg. Theoretically, we could use OpenCV’s own VideoWriter for saving the cropped video, but it was segfaulting for me, and I couldn’t find a workaround. Instead, I’m using another FFmpeg wrapper, imageio, for that task. It’s a bit messy to have this many libraries, but it does the job 👌

def ease_camera_towards_subject(current_pos, target_pos, damping_factor):
    # Calculate the distance vector between the current position and the target
    distance_vector = np.array(target_pos) - np.array(current_pos)

    # Apply damping to the distance vector
    eased_vector = distance_vector * damping_factor

    # Update the current position
    new_pos = np.array(current_pos) + eased_vector
    return tuple(new_pos.astype(int))


def center_subject_in_frame(frame, new_size, subject_position, last_position, damping_factor):
    original_height, original_width = frame.shape[:2]
    new_width, new_height = new_size

    # Calculate the desired top-left corner for a centered subject
    subject_center_x, subject_center_y = subject_position
    desired_x = max(0, min(original_width - new_width,
                           subject_center_x - new_width // 2))
    desired_y = max(0, min(original_height - new_height,
                           subject_center_y - new_height // 2))

    # Apply easing towards the subject
    new_x, new_y = ease_camera_towards_subject(
        last_position, (desired_x, desired_y), damping_factor)

    # Ensure the new position is within bounds
    new_x = max(0, min(new_x, original_width - new_width))
    new_y = max(0, min(new_y, original_height - new_height))

    # Crop the frame to the new dimensions
    cropped_frame = frame[new_y:new_y + new_height, new_x:new_x + new_width]

    return cropped_frame, (new_x, new_y), (new_x, new_y, new_width, new_height)


def round_to_multiple(number, multiple):
    return round(number / multiple) * multiple


def reframe(video_path, subjects_fn, preview):
    # We could parametrize this one too, but I'm just using this script for my
    # vertical IG videos, so 9:16 it is :P
    target_aspect_ratio = (9, 16)
    cap = cv2.VideoCapture(video_path)
    # Get the original video dimensions
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Determine the base dimension (shortest side)
    base_dimension = min(width, height)

    # Calculate target dimensions maintaining the aspect ratio
    target_aspect_ratio_width, target_aspect_ratio_height = target_aspect_ratio
    if width < height:  # Source is already portrait (taller than wide)
        new_width = int(base_dimension)
        new_height = int(
            base_dimension * target_aspect_ratio_height / target_aspect_ratio_width)
    else:  # Landscape or square source
        new_height = int(base_dimension)
        new_width = int(base_dimension *
                        target_aspect_ratio_width / target_aspect_ratio_height)

    # Ensure the new dimensions do not exceed the original dimensions
    new_width = int(min(round_to_multiple(new_width, 16), width))
    new_height = int(min(round_to_multiple(new_height, 16), height))

    frame_center = (int(width // 2), int(height // 2))

    # Create two temporary files to store the reframed video and the original audio
    with tempfile.NamedTemporaryFile(suffix='.mp3') as temp_audio, \
            tempfile.NamedTemporaryFile(suffix='.mp4') as temp_video:

        # I tried using OpenCV's VideoWriter but it segfaults on macOS, hence imageio
        writer = imageio.get_writer(
            temp_video.name, fps=fps, format='mp4', codec='libx264', quality=10)

        # Load tracking information dumped by the tracking stage
        subjects = pickle.load(open(subjects_fn, 'rb'))

        frame_count = 0
        last_crop_position = (0, 0)
        last_subject_position = (int(width // 2), int(height // 2))
        lost_subject_for = 0

        while cap.isOpened():
            ret, frame = cap.read()

            if not ret:
                break

            # If no subject is found, just stick with the last position for a few seconds
            # hoping it will reappear. If not, ease back to the center
            if not subjects[frame_count]:
                subject = last_subject_position
                lost_subject_for += 1
            else:
                subject = subjects[frame_count]
                last_subject_position = subject
                lost_subject_for = 0

            # Drift back towards the center if the subject is lost for too long
            LOST_SUBJECT_THRESHOLD_SEC = 3
            if lost_subject_for > LOST_SUBJECT_THRESHOLD_SEC * fps:
                subject = frame_center

            # The last parameter is the damping factor.
            # It determines how quickly the camera moves towards the subject.
            # I found 0.1 to be a good overall value.
            cropped_frame, last_crop_position, crop_bbox = center_subject_in_frame(
                frame, (new_width, new_height), subject, last_crop_position, 0.1
            )

            # Write the new frame to the output video
            writer.append_data(cv2.cvtColor(cropped_frame, cv2.COLOR_BGR2RGB))

            # Also draw some markers on the original frame for the live preview
            cv2.rectangle(frame, (int(subject[0]) - 5, int(subject[1]) - 5),
                          (int(subject[0]) + 5, int(subject[1]) + 5), GREEN, 2)
            cv2.rectangle(frame, (crop_bbox[0], crop_bbox[1]), (crop_bbox[0] + crop_bbox[2],
                          crop_bbox[1] + crop_bbox[3]), GREEN, 2)

            if lost_subject_for > 0:
                cv2.putText(frame, f"Lost subject for {lost_subject_for / fps} seconds",
                            (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, WHITE, 2)

            # Display the live preview
            if preview:
                cv2.imshow('Processed Frame', frame)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

            frame_count += 1

        cap.release()
        writer.close()

        # Extract audio from the original video
        ffmpeg.input(video_path).output(temp_audio.name,
                                        q=0, map='a').run(overwrite_output=True)

        # Combine the new video with the original audio
        input_video_stream = ffmpeg.input(temp_video.name)
        input_audio_stream = ffmpeg.input(temp_audio.name)
        # Specify your desired codec here. hevc_videotoolbox is the hardware-accelerated codec on macOS
        ffmpeg.output(input_video_stream, input_audio_stream, f"{video_path.split('.')[0]}_reframed.mp4",
                      codec='aac', vcodec='libx264', pix_fmt='yuv420p', vf='format=yuv420p', profile='main', level='4.0').run(overwrite_output=True)

Here’s the full script: https://gist.github.com/bsod90/fbeca5fd3d021e43aead278d176f07fb

I didn’t include a requirements.txt as I figured dependency management in the ML world is a whole separate topic. I’d say just look at the imports and use conda for whatever it can solve, then install the rest with pip and you should be good. For my experiments, I used Python 3.11.
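
For reference, a pip install along these lines should cover most of it; the package names are my best guess from the imports, so treat them as a starting point rather than a pinned list:

pip install opencv-python ultralytics deep-sort-realtime imageio imageio-ffmpeg ffmpeg-python numpy

The CLIP embedder for DeepSort pulls in a few extra dependencies (PyTorch and a CLIP implementation), and ffmpeg-python expects the ffmpeg binary itself to be installed separately.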

The result

I reframed two videos: one of my buddy learning the park and another one of my own sloppy riding.

16:9 wide -> 9:16
4:3 linear -> 9:16

On the left, you can see the result of just exporting a “Linear” 9:16 video from a GoPro app. On the right is our reframed video, and in the middle is the reframing preview illustrating how the cropping window moves and what it considers the current subject.

In my first video, I used a 16:9 WideView as the source, whereas for the second video, I tried cropping a narrower 4:3 video with a taller aspect ratio. Initially, I didn’t realize that GoPro actually uses the 8:7 (true sensor size) when exporting a vertical video from the app. This results in a more zoomed-out view with minimal information cropped from the sides. This works fine for most social media applications, and perhaps this is why GoPro doesn't bother introducing smart reframing in their own Quik. However, if you zoom in on the GoPro export, you'll lose important details, which is where our smart reframing shines.

Frames like the one below are where you save “priceless” information by moving the crop window. Notice that you could zoom in on the right version a lot more, and I’d still be in the frame. The downside, however, is that the edges of a GoPro video are still heavily distorted, even though “Linear” lens correction is applied.

I'm not particularly impressed with this result, except for the technical aspect, of course. However, I believe shooting in 5.3k resolution and reframing from the full 8:7 frame while simultaneously cropping for 9:16 and zooming in on the subject is the way to go. This will allow for both horizontal and vertical movement of the cropping frame and still result in a solid 4k image after cropping. I’ll try it out soon and update this article. I also think this would be much more useful with more close-up shots from something narrower like a DSLR. I’ll keep playing :)

Conclusion. We live in an amazing era where technology that used to be military-grade strategic know-how (think target tracking) is available to anyone with average Python knowledge and a decent laptop. Building “intelligent” (non-if-else-driven) tools is easier than ever, and we should all be doing it more. I’m now curious what else is possible. Posture estimation for highlighting riding mistakes? Trick recognition? How about a multi-modal, ChatGPT-driven personal coach that feeds on your GoPro collection and gives you personalized riding advice? Why not 🤷‍♂

P.S.

If any real CV engineers are reading this, I’d be curious to hear how this should’ve been done “the right way”, with better accuracy and faster speed. As they now say, let me know what you think in the comments section below ✌️
