A new way to consume video over the internet

Arnaud AUBRY
6 min read · Jul 26, 2021

A few days ago, I woke up with an idea to bring more context to the video we consume over the internet. Most of the videos I watch online come from random people and are shot on smartphones. They are shared almost instantly on social platforms like Instagram, TikTok, and others. This is even more true with the so-called Stories.

I find it a bit sad that in 2021 we only consume plain video while our modern smartphones are loaded with sensors that could give more context to the content we are actually watching. Using those sensors would also allow us to enable the multiplayer mode our content deserves. And because I am more comfortable with iOS, I am going to start this journey with my iPhone, trying to enhance that experience a bit.

Start simple

How do we start? We go with the obvious, to prove that we can actually add interesting metadata to a video file. Let’s add location and heading to our video. To get this data, I am going to use Apple’s official CoreLocation framework.
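A minimal sketch of that part could look like the following; the SensorTracker name and its setup are mine, just for illustration, and the real app may of course wire things differently.

```swift
import CoreLocation

// Keep the last known location and heading around so the recorder can
// attach them to every frame. The class name is a placeholder.
final class SensorTracker: NSObject, CLLocationManagerDelegate {
    private let manager = CLLocationManager()

    private(set) var lastLocation: CLLocation?
    private(set) var lastHeading: CLHeading?

    override init() {
        super.init()
        manager.delegate = self
        manager.desiredAccuracy = kCLLocationAccuracyBest
        manager.requestWhenInUseAuthorization()
        manager.startUpdatingLocation()
        manager.startUpdatingHeading()
    }

    func locationManager(_ manager: CLLocationManager, didUpdateLocations locations: [CLLocation]) {
        lastLocation = locations.last
    }

    func locationManager(_ manager: CLLocationManager, didUpdateHeading newHeading: CLHeading) {
        lastHeading = newHeading
    }
}
```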

But first, let’s start with two basic view controllers: the first one is the player and the second one the camera. The camera is presented modally over the player and automatically triggered when the application starts.

I took a nasty shortcut to show the user’s location, but it works, don’t judge me.

Recording the video

I have to go the hard way for the recording: I need to build a recorder that can ping me each time a frame is recorded so that I can associate the sensor data with it. Using AVAssetWriter, I can run a session backed by one AVAssetWriterInput for audio and one for video. By rebuilding a basic camera with these pieces, the didOutput sampleBuffer delegate method gets called every time a frame is recorded. On a background thread, I constantly record the last known GPS location and the last magnetic heading reported by the phone. Each time a frame is recorded, I store these two values together with the frame's timecode. It is not extremely accurate, but it is a fair enough solution to get a location and heading for each video frame.
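A rough sketch of that delegate, assuming the SensorTracker above; the FrameMetadata and Recorder names are placeholders and the AVAssetWriter plumbing is left out for brevity.

```swift
import AVFoundation
import CoreLocation

// Per-frame sensor sample, keyed by the frame's presentation timecode.
struct FrameMetadata {
    let time: CMTime
    let coordinate: CLLocationCoordinate2D
    let heading: CLLocationDirection
}

final class Recorder: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let tracker = SensorTracker()              // from the previous sketch
    private(set) var metadata: [FrameMetadata] = []

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // The real recorder also appends the buffer to an AVAssetWriterInput;
        // only the metadata bookkeeping is shown here.
        guard let location = tracker.lastLocation,
              let heading = tracker.lastHeading else { return }

        let time = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
        metadata.append(FrameMetadata(time: time,
                                      coordinate: location.coordinate,
                                      heading: heading.magneticHeading))
    }
}
```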

I built a “Video” class that handles the metadata along with the recorded video.
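As a guess at its shape, it could be as small as a Codable type that bundles the recorded file with the per-frame samples; the field names are assumptions.

```swift
import Foundation

// The recorded file plus the per-frame sensor samples, ready to be
// serialized to JSON.
struct Video: Codable {
    struct Sample: Codable {
        let time: Double      // seconds from the start of the clip (converted from CMTime)
        let latitude: Double
        let longitude: Double
        let heading: Double   // magnetic heading in degrees
    }

    let id: UUID
    let fileURL: URL
    let samples: [Sample]
}
```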

Playing it smoothly

On the player side, I have the same issue: I cannot use the “automagic” pieces built by Apple and must take a more manual approach. AVPlayer is still a good solution to play such a custom asset. On the view controller, I put an AVPlayer on top and a map view right beside it. Using an MKMapCamera, I can rotate and move the map according to the geographical data I captured while recording.
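The map update boils down to something like this, reusing the hypothetical Video.Sample above; the 500-meter camera altitude is an arbitrary value for the sketch.

```swift
import MapKit

// Point the map camera at the recorded coordinate and rotate it to the
// recorded heading so the map follows the footage.
func updateMap(_ mapView: MKMapView, with sample: Video.Sample) {
    let camera = MKMapCamera(
        lookingAtCenter: CLLocationCoordinate2D(latitude: sample.latitude,
                                                longitude: sample.longitude),
        fromDistance: 500,          // arbitrary altitude in meters
        pitch: 0,
        heading: sample.heading)
    mapView.setCamera(camera, animated: true)
}
```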

getClosestMetadata is basically a loop that returns the first metadata item whose timecode is greater than or equal to the current playback time.

Originally, the periodic time observer is used to drive a slider component that follows the player time. I reuse that callback to fetch the closest metadata object from the time offset. It is a quantitative approach rather than a qualitative one: I am aware that a given frame is not perfectly in sync with the location and heading I got from CoreLocation.
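Put together, the lookup and the observer could look roughly like this, building on the hypothetical Video.Sample and updateMap from the earlier sketches.

```swift
import AVFoundation
import MapKit

// Return the first sample at or after the current player time.
func getClosestMetadata(in samples: [Video.Sample], at time: CMTime) -> Video.Sample? {
    let seconds = CMTimeGetSeconds(time)
    return samples.first { $0.time >= seconds }
}

// Drive the map from playback. Keep the returned token around and pass it
// to removeTimeObserver(_:) when the player goes away.
func observePlayback(of player: AVPlayer, video: Video, mapView: MKMapView) -> Any {
    let interval = CMTime(seconds: 0.5, preferredTimescale: 600)
    return player.addPeriodicTimeObserver(forInterval: interval, queue: .main) { time in
        if let sample = getClosestMetadata(in: video.samples, at: time) {
            updateMap(mapView, with: sample)   // from the previous sketch
        }
    }
}
```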

I needed a cool way to render the field of view of the recorder. I know I was not supposed to use a fixed image view placed on top of the map, but annotations turn out to be a bit annoying to rotate. To hack around that and keep the prototype simple, I move the map camera while the image stays centered on the map view, always on top and fixed. It is a bit gross, but it works and took me five minutes to code.

The interface is pretty rough but we can feel the potential

Sharing is caring

That’s a good start, but it is all very local and a bit boring. What if we built an API on top of it to share these videos publicly on the internet? Let’s do it!

A walk through the architecture

We could use an S3 bucket on AWS with an ElasticSearch instance on top for indexing. We could eventually add a MongoDB instance somewhere to store any product logic we might need in the future, like video ownership, user accounts, etc. All of this would be served by an API that ties everything together.

The API handles the index on ES and manages access to the bucket

There is no limit to the number of objects you can store in an S3 bucket, and it is pretty cheap and reliable. We can store the mp4 file containing the movie directly in the bucket and use the MongoDB instance to store the JSON metadata. Using a hash string as the filename and as the document ID in the database lets us match each file with its associated data. Putting ElasticSearch on top allows me to build an index to retrieve my videos geographically.

Storing files

I built a simple REST API written in TypeScript that exposes a post-video endpoint. Here are the steps (a client-side sketch follows the list):

  • It creates a unique UUID;
  • It creates a document in MongoDB keyed by that UUID and stores the JSON object generated by the iOS application;
  • I use MongoDB triggers to start a function that indexes the UUID and geographical data in the ElasticSearch instance;
  • It creates a signed URL on S3 and returns it to the iOS application for the video upload;
  • It monitors the upload of the video to S3 and keeps an associated status to make sure the video has been uploaded properly.
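From the app's point of view, that flow could look like the sketch below. The /videos route and the uploadUrl field are made up for the example; the real API shape is not shown in this post.

```swift
import Foundation

// Hypothetical response of the post-video endpoint: the signed S3 URL.
struct PostVideoResponse: Codable {
    let uploadUrl: URL
}

// Send the JSON metadata first, then PUT the mp4 straight to S3 using the
// signed URL the API returned.
func upload(_ video: Video, movieData: Data) async throws {
    var request = URLRequest(url: URL(string: "https://api.example.com/videos")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(video)

    let (data, _) = try await URLSession.shared.data(for: request)
    let response = try JSONDecoder().decode(PostVideoResponse.self, from: data)

    var uploadRequest = URLRequest(url: response.uploadUrl)
    uploadRequest.httpMethod = "PUT"
    _ = try await URLSession.shared.upload(for: uploadRequest, from: movieData)
}
```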

Exploring files

ElasticSearch natively supports the geo_point field type.

As stated in the official documentation: “Fields of type geo_point accept latitude-longitude pairs, which can be used to find geo-points within a bounding box, within a certain distance of a central point, or within a polygon or geo_shape query.”

What we have to do here is create a GET REST endpoint that takes latitude and longitude parameters. Doing so, we can leverage the geo_point field type in ElasticSearch to get all the UUIDs within a given range. Once we have these UUIDs, we can fetch the metadata directly from MongoDB and return it to the client.
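On the iOS side, that read path could be as simple as the following, again with a hypothetical route and query parameter names.

```swift
import Foundation

// Ask the API for every video recorded around a coordinate and decode the
// metadata it pulled back from MongoDB.
func fetchVideos(latitude: Double, longitude: Double) async throws -> [Video] {
    var components = URLComponents(string: "https://api.example.com/videos")!
    components.queryItems = [
        URLQueryItem(name: "lat", value: String(latitude)),
        URLQueryItem(name: "lng", value: String(longitude)),
    ]
    let (data, _) = try await URLSession.shared.data(from: components.url!)
    return try JSONDecoder().decode([Video].self, from: data)
}
```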

What’s next?

What if we pushed our little experiment a bit further and tried to recognize multiple videos taken at the same time, at the same place, and showing the same scene? That would allow people to review a particular event from multiple points of view.

The art of regrouping similar items

How do we recognize that two videos are looking at the same thing at some point? For now, we have the heading and GPS coordinates of each frame of the video, which is a good clue to start with. What if we reviewed videos that appear to be close to each other, to see whether at some point the users who took them were actually looking in the same direction, at the same time, and from a reasonable distance?
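A naive first pass at that check could look like this; the thresholds are arbitrary placeholders rather than tuned values.

```swift
import CoreLocation

// Two frames plausibly "see the same scene" if the phones were close,
// recorded at roughly the same wall-clock time, and pointed in roughly
// the same direction.
func looksAtSameScene(locationA: CLLocation, headingA: CLLocationDirection, dateA: Date,
                      locationB: CLLocation, headingB: CLLocationDirection, dateB: Date) -> Bool {
    let closeEnough = locationA.distance(from: locationB) < 100      // meters
    let sameMoment = abs(dateA.timeIntervalSince(dateB)) < 30        // seconds
    let headingDelta = abs(headingA - headingB)
    let sameDirection = min(headingDelta, 360 - headingDelta) < 30   // degrees
    return closeEnough && sameMoment && sameDirection
}
```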

Going further

On recent smartphones, there is usually more than one camera. This is very good news for us, because such devices can measure the distance in meters to a subject. Just as having two eyes gives your brain the ability to judge distances thanks to depth, the iPhone can do the same as long as it has access to multiple cameras simultaneously.

A camera module from the iPhone 11

Why is that cool for us? Because it can fix the case where there is a wall between multiple shots that appear to be close to each other on a map. By capturing the maximum distance the camera is filming, we can add more context to our metadata and therefore compute whether two videos could be filming the same subject or not. This would allow us to be more accurate when clustering our events.
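As a sketch of how the depth reading could feed into the matching: project each camera's position forward along its heading by the measured distance and compare the projected “subject” points instead of the phones themselves. The flat-earth math and the threshold below are simplifications of mine.

```swift
import CoreLocation

// Offset a coordinate along a compass heading by a distance in meters.
// A flat-earth approximation, which is fine at these small distances.
func projectedSubject(latitude: Double, longitude: Double,
                      heading: Double, distance: Double) -> CLLocation {
    let earthRadius = 6_371_000.0
    let bearing = heading * .pi / 180
    let deltaLat = (distance * cos(bearing)) / earthRadius
    let deltaLon = (distance * sin(bearing)) / (earthRadius * cos(latitude * .pi / 180))
    return CLLocation(latitude: latitude + deltaLat * 180 / .pi,
                      longitude: longitude + deltaLon * 180 / .pi)
}

// Two clips plausibly film the same subject when their projected points
// land close to each other (the threshold is an arbitrary placeholder).
func filmsSameSubject(_ a: CLLocation, _ b: CLLocation) -> Bool {
    a.distance(from: b) < 50
}
```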

Would you support this?

Whether you are a developer, a future user, or even an investor and you think this kind of technology would be dope, please reach out to me. I would love to make it happen :)
