This is a demonstration of the abilities of OpenAI's CLIP neural network. The tool will download a YouTube video, extract frames at regular intervals and precompute the feature vectors of each frame using CLIP.
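Picking which frames to extract at a fixed interval is simple arithmetic; here is a minimal sketch (the function name `frame_indices` is mine, not from the demo's code):

```python
def frame_indices(total_frames, fps, interval_seconds):
    """Indices of the frames to sample, one every `interval_seconds`."""
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, sampled every 2 seconds:
print(frame_indices(300, 30, 2))  # → [0, 60, 120, 180, 240]
```

Each selected frame would then be passed through CLIP's image encoder once, so the expensive part is done ahead of the search.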
You can then use natural language search queries to find a particular frame of the video. The results are really amazing in my opinion...
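The search itself boils down to comparing the query's text feature against the precomputed frame features by cosine similarity. A minimal numpy sketch (toy 4-dimensional features standing in for CLIP's real 512-dimensional ones; `best_frames` is a hypothetical helper name):

```python
import numpy as np

def best_frames(text_feature, frame_features, k=3):
    """Rank frames by cosine similarity to the query's text feature."""
    # Normalize so the dot product equals cosine similarity.
    t = text_feature / np.linalg.norm(text_feature)
    f = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    sims = f @ t
    return np.argsort(-sims)[:k]

# Toy features for 5 frames; the query is closest to frame 2.
frames = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]], dtype=float)
query = np.array([0.1, 0.0, 0.9, 0.0])
print(best_frames(query, frames, k=1))  # → [2]
```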
Just a heads up: the demo's setup block installs PyTorch 1.7.0 + cu101 as a dependency of CLIP, and then the very next command uninstalls it to install PyTorch 1.7.1, which takes at least 5 minutes. If 1.7.1 isn't actually needed, removing the manual PyTorch installation line would save some time.
Not yet, but I had this idea as well. You basically get a score describing how well a phrase is matching each of the images so it won't be difficult to do. I'll look into that!
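Turning those similarities into per-image scores is a small step; CLIP's zero-shot setup scales the cosine similarities and softmaxes them. A hedged numpy sketch (the `temperature` value here is an assumption standing in for CLIP's learned logit scale, not the trained value):

```python
import numpy as np

def match_scores(text_feature, image_features, temperature=100.0):
    """Softmax over scaled cosine similarities: one score per image.

    `temperature` approximates CLIP's learned logit scale (an assumption).
    """
    t = text_feature / np.linalg.norm(text_feature)
    f = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    logits = temperature * (f @ t)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Two toy image features; the text feature aligns with the first one.
imgs = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
scores = match_scores(text, imgs)
print(scores)  # scores sum to 1 and the first image dominates
```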
I think it can. However, you will likely need a bigger model. Currently, OpenAI shares only their small model and I hope they will soon release bigger ones!
If it could take an image set as input, then perhaps we could use this to identify ourselves in a random Internet video, e.g. a lengthy tourist video that you suspect might have captured you because you were at that place on that day.
There are people already looking for such a solution (I've added a link to that discussion on my profile).
I think this is a valuable application, but I don't think CLIP is well suited for it. The power of CLIP comes from training a model to jointly "understand" text and images. If you are looking at identifying a particular person there are more suitable designs for face recognition.
Wondering whether it would be more efficient to extract only frames where the content has changed (e.g. a frame difference over a threshold, and/or all I-frames)?
Also, could this be used to identify event types in videos? I'd love to run my 25 years of home videos through this and have it annotate: "Christmas, birthday, park, camping...".
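The threshold idea above could be sketched as a simple change detector: keep a frame only if it differs enough from the last kept frame. A minimal numpy version, assuming pixel values in [0, 1] (the function name and the 0.1 default are illustrative, not from the demo):

```python
import numpy as np

def changed_frames(frames, threshold=0.1):
    """Keep frames whose mean absolute difference from the last kept
    frame exceeds `threshold`."""
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i] - frames[kept[-1]]).mean()
        if diff > threshold:
            kept.append(i)
    return kept

# Three identical dark frames followed by two bright ones:
frames = np.stack([np.zeros((4, 4))] * 3 + [np.ones((4, 4))] * 2)
print(changed_frames(frames))  # → [0, 3]
```

For I-frames specifically, a video tool such as ffmpeg would have to do the selection before CLIP ever sees the frames.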
No, that wouldn't work too well. It's for YouTubers who stream their desktop screens and I need to extract some information to automatically process it. The desktop streams always look very similar so I don't need advanced AI/neural nets to extract that.
There is one caveat to be aware of: the image is cropped to a square in the center and scaled down to 224x224 pixels, so small details will be lost, for example if you want to run it on scanned documents. Photos work great, though.
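The center crop itself is just array slicing; a minimal sketch of what that preprocessing step loses (the scale-down to 224x224 would follow, which needs an image library and is omitted here):

```python
import numpy as np

def center_crop_square(image):
    """Crop an H x W (x C) image to a centered square, as CLIP's
    preprocessing does before scaling down to 224x224."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return image[top:top + side, left:left + side]

# A 6x10 "image" loses 2 columns on each side:
img = np.arange(60).reshape(6, 10)
print(center_crop_square(img).shape)  # → (6, 6)
```

Anything outside that central square, such as the margins of a scanned page, never reaches the model.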
For info, the same tool works well with 2 million images found in the Unsplash dataset [1]. Features only have to be computed once for the dataset, and only the feature vector for the user query has to be computed on the fly. Then matching features can be done in a manner that scales well.
So the present tool does not scale, because the videos are part of the user query; but a company with easy access to the videos and the computational power to pre-encode the frames as features could create a search engine based on CLIP.
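The scalable part is the matching: with features precomputed and L2-normalized, a query is a single matrix-vector product followed by a top-k selection. A numpy sketch with random vectors standing in for real CLIP features:

```python
import numpy as np

# Precompute once: one L2-normalized feature vector per image.
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 512)).astype(np.float32)
features /= np.linalg.norm(features, axis=1, keepdims=True)

def search(query_feature, k=5):
    """Top-k matches via a single matrix-vector product."""
    q = query_feature / np.linalg.norm(query_feature)
    sims = features @ q
    top = np.argpartition(-sims, k)[:k]       # unordered top-k
    return top[np.argsort(-sims[top])]        # sorted best-first

# Searching with image 7's own feature must return image 7 first:
print(search(features[7], k=1))  # → [7]
```

At larger scale the same dot-product search is what approximate nearest-neighbor libraries accelerate; the per-query cost stays tiny because only the text feature is computed on the fly.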
If you want to experiment with it yourself, I prepared a Colab notebook that can easily be run: https://colab.research.google.com/github/haltakov/natural-la...