Poor man’s intelligent reframing for GoPro videos

Maxim Leonovich
13 min read · Jan 6, 2024


It’s been a while since I last touched this blog (3 years?) — time flies! A lot has changed; my startup is long gone, and I joined cube.dev engineering team. While life has generally been busy, there were almost two weeks of quiet time during the holidays, so I thought I’d work on a hobby project.

Outside of work, my biggest hobbies are, perhaps, snowboarding and buying gadgets. Hence, I own a few action cameras and have a ton of lame GoPro footage. Sometimes (often), I post it on Instagram. Sometimes, I make little edits with cheesy music.

Anyway, one of the cameras I own is the almighty Insta 360 One X II, which makes shooting action videos as easy as it can possibly be. You don’t even have to point it; just shoot and reframe later in their mobile app. While the app is generally ok, there’s one killer feature that makes it awesome — it’s their computer-vision (ok, AI) powered subject tracking. You simply select a person or an object of interest, and it will keep it centered as long as it’s visible somewhere in the 360 frame. This creates great-looking videos composition-wise. However, there’s one downside to it: the picture quality of a 360 camera. After cropping in from a 5.7k sphere into something flat-ish looking, there are not enough pixels left to keep it sharp and vibrant. This is where good old GoPro comes to help. It shines in the picture quality department, but you need a good cameraman who can follow you and keep the camera pointed at all times — a rare breed, to say the least.

Insta 360 reframing interface

GoPro is not a 360 camera, but it still captures a lot of information, especially in its SuperView mode. Unfortunately, its app doesn’t offer anything that resembles Insta 360's reframing. If you want to create a vertical video for Instagram, but all you have is a shaky 16:9 or 8:7, your best shot at keeping things centered is manually keyframing the cropping window in something like FCPX to move it around. This feels wrong in an age of AI, and I thought, “How hard could it be to create an intelligent reframing feature that would keep the subject centered as much as possible completely automatically? What can I do with just off-the-shelf tools and zero CV knowledge?”

Here’s some background and assumptions. I’ve been doing software for most of my life, but I’ve never worked with video or computer vision. I know how to python, but I’ll use ChatGPT and Google for the rest of it. For our purpose, we’ll assume we’re reframing a “follow-cam” video with one subject that should be centered in the frame most of the time. Also, performance doesn’t matter — just a proof of concept that it’s hackable in a couple of evenings by an average engineer. Let’s go!

As with many things these days, I started by consulting ChatGPT.

After a bit of Googling and reading some blogs, I settled on OpenCV, YOLO for object detection, and DeepSORT with a CLIP embedder for tracking. I also asked ChatGPT to generate the app structure for me, but since the code has changed many times since that initial draft, I'd rather tell the rest of the story from the final version.

Here’s the main idea:

  • We’ll use OpenCV to read the source video frame by frame, “do AI,” and crop each frame to the new dimensions.
  • We’ll detect all people in each frame using YOLO, then track them across frames using DeepSORT and CLIP.
  • We’ll analyze the obtained tracking information, de-noise it, and apply some heuristics to pick the subject for each frame.
  • We’ll then crop the original video to a smaller size and “ease” the cropping window towards the subject’s center.
  • We’ll re-encode the final video and combine it with the original audio using FFmpeg.

Let’s start with the basic structure:

import argparse
import pickle
import tempfile
from collections import defaultdict

import cv2
import ffmpeg
import imageio
import numpy as np
from deep_sort_realtime.deepsort_tracker import DeepSort
from ultralytics import YOLO

# BGR colors used for the preview overlays
GREEN, WHITE = (0, 255, 0), (255, 255, 255)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process a GoPro video.")
    parser.add_argument('video_path', type=str,
                        help='Path to the GoPro video file')
    parser.add_argument('command', type=str,
                        help='Action to perform', choices=['track', 'reframe'])
    parser.add_argument('--preview', dest='preview', action='store_true',
                        help='Display the processed video in a window', default=False)
    args = parser.parse_args()

    subjects_fn = f'{args.video_path.split(".")[0]}_subjects.pickle'

    if args.command == 'track':
        track(args.video_path, subjects_fn, args.preview)
    elif args.command == 'reframe':
        reframe(args.video_path, subjects_fn, args.preview)
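
If you save the whole thing as, say, reframe.py (a file name I'm picking purely for illustration), the two stages are invoked like this; the video file name is a placeholder too:

python reframe.py GX010042.MP4 track --preview
python reframe.py GX010042.MP4 reframe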

The tracking portion proved to be computationally heavy and slow. Also, I had to iterate a few times on easing parameters and overall cropping logic. Hence, we’re going to split the process into two stages: track and reframe. The tracking stage will run YOLO and DeepSort, apply subject detection heuristics, and save the result into an intermediary file. The reframing stage will then pick up the file and complete the reframing/reencoding.

Let’s take a closer look at the tracking stage.

def detect_people(frame, model):
    # Apply the YOLOv8 detector to the frame and keep only people (class_id == 0).
    # "mps" here means it will use the hardware acceleration on macOS.
    # Change it to "cpu" if you're on Linux or "cuda" if you have an Nvidia GPU.
    detections = model(frame, device="mps")[0]
    for data in detections.boxes.data.tolist():
        confidence = data[4]
        class_id = data[5]
        if confidence >= 0.5 and class_id == 0:
            xmin, ymin, xmax, ymax = int(data[0]), int(data[1]), int(data[2]), int(data[3])
            yield [[xmin, ymin, xmax - xmin, ymax - ymin], confidence, class_id]


def track(video_path, subjects_fn, preview):
    # Open the source file using OpenCV
    cap = cv2.VideoCapture(video_path)
    # Initialize the YOLOv8 detector.
    # It will automatically download the model weights on the first run.
    detector = YOLO("yolov8l.pt")
    # Also initialize the DeepSort tracker.
    # The embedder parameter specifies the model to use for feature extraction.
    # In our case we're going to use one of the pre-trained variants of a CLIP model.
    tracker = DeepSort(max_age=10, embedder='clip_ViT-B/32',
                       embedder_gpu=False)

    frame_count = 0
    detections = []
    tracks = []

    # For our subject detection logic we'll need to know the total duration of each track.
    # Because of that, we can't pick the subject online; we'll have to do a second pass.
    # On the first pass we'll just accumulate tracks and their durations in this dictionary.
    track_durations = defaultdict(int)
    tracks_per_frame = []
    subjects = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame_count += 1

        # Run detection only on every other frame to speed up processing
        if frame_count % 2 == 0:
            # Detect people
            detections = list(detect_people(frame, detector))
            tracks = tracker.update_tracks(detections, frame=frame)

        tracks_per_frame.append([])

        for track in tracks:
            # If the track is not confirmed, ignore it
            if not track.is_confirmed():
                continue

            # Update track durations and save some per-frame info for the second pass
            track_durations[track.track_id] = track.age
            tracks_per_frame[-1].append({
                'track_id': track.track_id,
                'bbox': track.to_ltrb(),
            })

            # Draw the bounding box and the track id on the frame
            # to display in the preview window and track progress
            track_id = track.track_id
            ltrb = track.to_ltrb()
            xmin, ymin, xmax, ymax = int(ltrb[0]), int(ltrb[1]), int(ltrb[2]), int(ltrb[3])
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), GREEN, 2)
            cv2.rectangle(frame, (xmin, ymin - 20), (xmin + 20, ymin), GREEN, -1)
            cv2.putText(frame, str(track_id), (xmin + 5, ymin - 8),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, WHITE, 2)

        # Display the frame
        if preview:
            cv2.imshow('Processed Frame', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

    # After all tracking is done, run subject detection and dump the result
    # into an intermediary file
    subjects = find_subjects(tracks_per_frame, track_durations)
    pickle.dump(subjects, open(subjects_fn, 'wb'))

    cap.release()
    cv2.destroyAllWindows()

DeepSORT works in conjunction with YOLO object detection. For each frame, YOLO detects bounding boxes of all objects of interest (in our case, people) and passes them over to DeepSORT. It then updates the list of currently active tracks using YOLO detections, people’s appearance embeddings from an external embedder, and some math.

The DeepSORT algorithm assigns a unique track_id to each detected identity and aims to minimize the number of identity switches. An identity switch happens when the tracker loses someone (they get occluded, leave the frame, or change their pose too much) and later picks them up again under a new ID. While the ultimate goal is to track each person uniquely, in practice YOLO + DeepSORT generates tens of thousands of track IDs for a 5-minute video with fewer than 100 real people. The fewer track IDs we have, the easier it will be to detect the subject afterward. An “embedder” model generates a vector representation of whatever it sees, and DeepSORT uses this information to reduce the number of ID switches. I tried a few supported embedders and found that OpenAI’s CLIP performed the best for our case. It is painfully slow because it doesn’t run on Apple Silicon GPUs yet, but I can live with that. Fingers crossed, they’re going to bring full MPS support to PyTorch someday 🤞
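
If you're curious just how fragmented the tracking gets, the track_durations dictionary built during the tracking stage already has everything you need for a quick sanity check. Here's a throwaway sketch (not part of the pipeline) that prints a few stats:

def print_track_stats(track_durations):
    # track_durations maps track_id -> track age in frames (built in track())
    durations = sorted(track_durations.values(), reverse=True)
    total_tracks = len(durations)
    total_frames = sum(durations)
    print(f"unique track IDs: {total_tracks}")
    if total_tracks and total_frames:
        print(f"longest track: {durations[0]} frames")
        top = durations[:max(1, total_tracks // 5)]
        print(f"top 20% of tracks cover {sum(top) / total_frames:.0%} of all tracked frames")

With tens of thousands of IDs for fewer than a hundred real people, stats like these make it obvious why the subject detection below filters tracks by duration before doing anything else.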

Moving on to subject detection. The primary assumption is that the person or object that appears most frequently across all video frames is the "subject" of the video. In an ideal world where tracking works perfectly, we could sort all tracks by duration, select the longest one, and consider it the subject. However, identities often change and there is a lot of noise in the resulting tracks, making this approach unreliable. To improve accuracy, we can use various heuristics such as scoring people not only by track duration but also by their proximity to the frame center and the size of their bounding box. Nevertheless, these metrics are also noisy and it is challenging to blend them effectively with the duration metric. In the end, I opted for a simple approach that works reasonably well:

  • Filter out the bottom 80% of tracks by duration, as they’re likely noise.
  • If the frame has more than eight people in it, give up and just focus on the center.
  • Otherwise, pick the track with the longest total duration among those appearing in the current frame and center on it.

def bbox_center(bbox):
    return (int((bbox[0] + bbox[2]) // 2), int((bbox[1] + bbox[3]) // 2))


def filter_top_percent_tracks(track_durations, top_percent):
    # Calculate the number of tracks to keep (top N%)
    num_tracks_to_keep = int(len(track_durations) * top_percent)

    sorted_tracks = sorted(track_durations.items(),
                           key=lambda item: item[1], reverse=True)
    top_tracks = sorted_tracks[:num_tracks_to_keep]

    # Create a new dictionary with only the top tracks
    filtered_track_durations = {
        track_id: duration for track_id, duration in top_tracks}

    return filtered_track_durations


def find_subjects(frames, track_durations):
    subjects = []
    if len(track_durations) > 100:
        # The percentage of tracks to keep varies depending on how crowded the video is
        track_durations = filter_top_percent_tracks(track_durations, 0.2)

    for frame in frames:
        longest_duration = 0
        subject_center = None
        # Don't even try to re-center frames with more than 8 people
        if len(frame) <= 8:
            for track in frame:
                track_id = track['track_id']
                duration = track_durations.get(track_id, 0)

                if duration > longest_duration:
                    longest_duration = duration
                    subject_center = bbox_center(track['bbox'])

        subjects.append(subject_center)

    return subjects
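
By the way, the intermediary file is nothing fancy: a pickled list with one entry per frame, holding either the subject’s (x, y) center or None when no subject was found. It’s handy to peek at while tuning the heuristics; here’s a tiny sketch (the file name is just a placeholder matching the _subjects.pickle naming above):

import pickle

with open('GX010042_subjects.pickle', 'rb') as f:
    subjects = pickle.load(f)

print(f"{len(subjects)} frames, "
      f"{sum(s is None for s in subjects)} without a subject")
print(subjects[:5])  # each entry is None or an (x, y) center in source-frame pixels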

Finally, let’s add reframing. We’ll go frame by frame again and try to focus the camera on the subject if there is one, or drift back to the middle of the frame otherwise. One challenge is that the subject position “jitters” a lot because YOLO detection is fuzzy. To help with this, we apply easing so the resulting video isn’t too jerky. We’ll write cropped frames to a new video, combine it with the original audio, and compress it to a reasonable size using FFmpeg. Theoretically, we could use OpenCV’s own VideoWriter for saving the cropped video, but it was segfaulting for me, and I couldn’t find a workaround. Instead, I’m using another FFmpeg wrapper, imageio, for that task. It’s a bit messy to have this many libraries, but it does the job 👌

def ease_camera_towards_subject(current_pos, target_pos, damping_factor):
    # Calculate the distance vector between the current position and the target
    distance_vector = np.array(target_pos) - np.array(current_pos)

    # Apply damping to the distance vector
    eased_vector = distance_vector * damping_factor

    # Update the current position
    new_pos = np.array(current_pos) + eased_vector
    return tuple(new_pos.astype(int))


def center_subject_in_frame(frame, new_size, subject_position, last_position, damping_factor):
    original_height, original_width = frame.shape[:2]
    new_width, new_height = new_size

    # Calculate the desired top-left corner for a centered subject
    subject_center_x, subject_center_y = subject_position
    desired_x = max(0, min(original_width - new_width,
                           subject_center_x - new_width // 2))
    desired_y = max(0, min(original_height - new_height,
                           subject_center_y - new_height // 2))

    # Apply easing towards the subject
    new_x, new_y = ease_camera_towards_subject(
        last_position, (desired_x, desired_y), damping_factor)

    # Ensure the new position is within bounds
    new_x = max(0, min(new_x, original_width - new_width))
    new_y = max(0, min(new_y, original_height - new_height))

    # Crop the frame to the new dimensions
    cropped_frame = frame[new_y:new_y + new_height, new_x:new_x + new_width]

    return cropped_frame, (new_x, new_y), (new_x, new_y, new_width, new_height)


def round_to_multiple(number, multiple):
    return round(number / multiple) * multiple


def reframe(video_path, subjects_fn, preview):
    # We could parametrize this one too, but I'm just using this script for my
    # vertical IG videos, so 9:16 it is :P
    target_aspect_ratio = (9, 16)
    cap = cv2.VideoCapture(video_path)
    # Get the original video dimensions
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Determine the base dimension (shortest side)
    base_dimension = min(width, height)

    # Calculate target dimensions maintaining the aspect ratio
    target_aspect_ratio_width, target_aspect_ratio_height = target_aspect_ratio
    if width < height:  # Source is already portrait (taller than wide)
        new_width = int(base_dimension)
        new_height = int(
            base_dimension * target_aspect_ratio_height / target_aspect_ratio_width)
    else:  # Landscape or square source
        new_height = int(base_dimension)
        new_width = int(base_dimension *
                        target_aspect_ratio_width / target_aspect_ratio_height)

    # Ensure the new dimensions do not exceed the original dimensions
    new_width = int(min(round_to_multiple(new_width, 16), width))
    new_height = int(min(round_to_multiple(new_height, 16), height))

    frame_center = (int(width // 2), int(height // 2))

    # Create two temporary files to store the reframed video and the original audio
    with tempfile.NamedTemporaryFile(suffix='.mp3') as temp_audio, \
            tempfile.NamedTemporaryFile(suffix='.mp4') as temp_video:

        # I tried using OpenCV's VideoWriter but it segfaults on macOS, hence imageio
        writer = imageio.get_writer(
            temp_video.name, fps=fps, format='mp4', codec='libx264', quality=10)

        # Load tracking information dumped by the tracking stage
        subjects = pickle.load(open(subjects_fn, 'rb'))

        frame_count = 0
        last_crop_position = (0, 0)
        last_subject_position = (int(width // 2), int(height // 2))
        lost_subject_for = 0

        while cap.isOpened():
            ret, frame = cap.read()

            if not ret:
                break

            # If no subject is found, just stick with the last position for a few seconds
            # hoping it will reappear. If not, ease back to the center
            if not subjects[frame_count]:
                subject = last_subject_position
                lost_subject_for += 1
            else:
                subject = subjects[frame_count]
                last_subject_position = subject
                lost_subject_for = 0

            # Drift back towards the center if the subject is lost for too long
            LOST_SUBJECT_THRESHOLD_SEC = 3
            if lost_subject_for > LOST_SUBJECT_THRESHOLD_SEC * fps:
                subject = frame_center

            # The last parameter is the damping factor.
            # It determines how quickly the camera moves towards the subject.
            # I found 0.1 to be a good overall value.
            cropped_frame, last_crop_position, crop_bbox = center_subject_in_frame(
                frame, (new_width, new_height), subject, last_crop_position, 0.1
            )

            # Write the new frame to the output video
            writer.append_data(cv2.cvtColor(cropped_frame, cv2.COLOR_BGR2RGB))

            # Also draw some markers on the original frame for the live preview
            cv2.rectangle(frame, (int(subject[0]) - 5, int(subject[1]) - 5),
                          (int(subject[0]) + 5, int(subject[1]) + 5), GREEN, 2)
            cv2.rectangle(frame, (crop_bbox[0], crop_bbox[1]), (crop_bbox[0] + crop_bbox[2],
                          crop_bbox[1] + crop_bbox[3]), GREEN, 2)

            if lost_subject_for > 0:
                cv2.putText(frame, f"Lost subject for {lost_subject_for / fps} seconds",
                            (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, WHITE, 2)

            # Display the live preview
            if preview:
                cv2.imshow('Processed Frame', frame)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

            frame_count += 1

        cap.release()
        writer.close()

        # Extract audio from the original video
        ffmpeg.input(video_path).output(temp_audio.name,
                                        q=0, map='a').run(overwrite_output=True)

        # Combine the new video with the original audio
        input_video_stream = ffmpeg.input(temp_video.name)
        input_audio_stream = ffmpeg.input(temp_audio.name)
        # Specify your desired codec here. hevc_videotoolbox is the hardware-accelerated codec on macOS
        ffmpeg.output(input_video_stream, input_audio_stream, f"{video_path.split('.')[0]}_reframed.mp4",
                      codec='aac', vcodec='libx264', pix_fmt='yuv420p', vf='format=yuv420p', profile='main', level='4.0').run(overwrite_output=True)

Here’s the full script: https://gist.github.com/bsod90/fbeca5fd3d021e43aead278d176f07fb

I didn’t include a requirements.txt as I figured dependency management in the ML world is a whole separate topic. I’d say just look at the imports and use conda for whatever it can solve, then install the rest with pip and you should be good. For my experiments, I used Python 3.11.
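
For reference, a pip install along these lines should cover most of it; the package names are my best guess from the imports, so treat them as a starting point rather than a pinned list:

pip install opencv-python ultralytics deep-sort-realtime imageio imageio-ffmpeg ffmpeg-python numpy

The CLIP embedder for DeepSort pulls in a few extra dependencies (PyTorch and a CLIP implementation), and ffmpeg-python expects the ffmpeg binary itself to be installed separately.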

The result

I reframed two videos: one of my buddy learning the park and another one of my own sloppy riding.

16:9 wide -> 9:16
4:3 linear -> 9:16

On the left, you can see the result of just exporting a “Linear” 9:16 video from a GoPro app. On the right is our reframed video, and in the middle is the reframing preview illustrating how the cropping window moves and what it considers the current subject.

In my first video, I used a 16:9 WideView as the source, whereas for the second video, I tried cropping a narrower 4:3 video with a taller aspect ratio. Initially, I didn’t realize that GoPro actually uses the 8:7 (true sensor size) when exporting a vertical video from the app. This results in a more zoomed-out view with minimal information cropped from the sides. This works fine for most social media applications, and perhaps this is why GoPro doesn't bother introducing smart reframing in their own Quik. However, if you zoom in on the GoPro export, you'll lose important details, which is where our smart reframing shines.

Frames like the one below are where you save “priceless” information by moving the crop window. Notice that you could zoom in on the right version a lot more, and I’d still be in the frame. The downside, however, is that the edges of a GoPro video are still heavily distorted, even though “Linear” lens correction is applied.

I'm not particularly impressed with this result, except for the technical aspect, of course. However, I believe shooting in 5.3k resolution and reframing from the full 8:7 frame while simultaneously cropping for 9:16 and zooming in on the subject is the way to go. This will allow for both horizontal and vertical movement of the cropping frame and still result in a solid 4k image after cropping. I’ll try it out soon and update this article. I also think this would be much more useful with more close-up shots from something narrower like a DSLR. I’ll keep playing :)

Conclusion. We live in an amazing era where technology that used to be military-grade strategic know-how (think target tracking) is available to anyone with average Python knowledge and a decent laptop. Building “intelligent” (non-if-else-driven) tools is easier than ever, and we should all be doing it more. I’m now curious what else is possible. Posture estimation for highlighting riding mistakes? Trick recognition? How about a multi-modal, ChatGPT-driven personal coach that feeds on your GoPro collection and gives you personalized riding advice? Why not 🤷‍♂

P.S.

If any real CV engineers are reading this, I’d be curious to hear how this should’ve been done “the right way”, with better accuracy and faster speed. As they now say, let me know what you think in the comments section below ✌️
