This is a demonstration of the abilities of OpenAI's CLIP neural network. The tool will download a YouTube video, extract frames at regular intervals and precompute the feature vectors of each frame using CLIP.
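Picking which frames to extract at a fixed interval is simple arithmetic; here is a minimal sketch (the function name `frame_indices` is mine, not from the demo's code):

```python
def frame_indices(total_frames, fps, interval_seconds):
    """Indices of the frames to sample, one every `interval_seconds`."""
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, sampled every 2 seconds:
print(frame_indices(300, 30, 2))  # → [0, 60, 120, 180, 240]
```

Each selected frame would then be passed through CLIP's image encoder once, so the expensive part is done ahead of the search.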
You can then use natural language search queries to find a particular frame of the video. The results are really amazing in my opinion...
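The search itself boils down to comparing the query's text feature against the precomputed frame features by cosine similarity. A minimal numpy sketch (toy 4-dimensional features standing in for CLIP's real 512-dimensional ones; `best_frames` is a hypothetical helper name):

```python
import numpy as np

def best_frames(text_feature, frame_features, k=3):
    """Rank frames by cosine similarity to the query's text feature."""
    # Normalize so the dot product equals cosine similarity.
    t = text_feature / np.linalg.norm(text_feature)
    f = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    sims = f @ t
    return np.argsort(-sims)[:k]

# Toy features for 5 frames; the query is closest to frame 2.
frames = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]], dtype=float)
query = np.array([0.1, 0.0, 0.9, 0.0])
print(best_frames(query, frames, k=1))  # → [2]
```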
Just a heads up: the demo's setup block installs PyTorch 1.7.0 + cu101 as a dependency of CLIP, and then the very next command uninstalls it to install PyTorch 1.7.1, which takes at least 5 minutes. If 1.7.1 isn't actually needed, removing the manual PyTorch installation line would save some time.
Not yet, but I had this idea as well. You basically get a score describing how well a phrase is matching each of the images so it won't be difficult to do. I'll look into that!
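Turning those similarities into per-image scores is a small step; CLIP's zero-shot setup scales the cosine similarities and softmaxes them. A hedged numpy sketch (the `temperature` value here is an assumption standing in for CLIP's learned logit scale, not the trained value):

```python
import numpy as np

def match_scores(text_feature, image_features, temperature=100.0):
    """Softmax over scaled cosine similarities: one score per image.

    `temperature` approximates CLIP's learned logit scale (an assumption).
    """
    t = text_feature / np.linalg.norm(text_feature)
    f = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    logits = temperature * (f @ t)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Two toy image features; the text feature aligns with the first one.
imgs = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
scores = match_scores(text, imgs)
print(scores)  # scores sum to 1 and the first image dominates
```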
I think it can. However, you will likely need a bigger model. Currently, OpenAI shares only their small model and I hope they will soon release bigger ones!
If it could take an image set as input, then perhaps we could use this to identify ourselves in a random Internet video, e.g. a lengthy tourist video that you suspect might have captured you because you were at that place on that day.
There are people already looking for such a solution (I've added a link to that discussion on my profile).
I think this is a valuable application, but I don't think CLIP is well suited for it. The power of CLIP comes from training a model to jointly "understand" text and images. If you are looking at identifying a particular person there are more suitable designs for face recognition.
Wondering whether it would be more efficient to extract only frames where the content has changed (e.g. a frame difference over a threshold, and/or all I-frames)?
Also, could this be used to identify event types in videos? I'd love to run my 25 years of home videos through this and have it annotate: "Christmas, birthday, park, camping...".
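The threshold idea above could be sketched as a simple change detector: keep a frame only if it differs enough from the last kept frame. A minimal numpy version, assuming pixel values in [0, 1] (the function name and the 0.1 default are illustrative, not from the demo):

```python
import numpy as np

def changed_frames(frames, threshold=0.1):
    """Keep frames whose mean absolute difference from the last kept
    frame exceeds `threshold`."""
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i] - frames[kept[-1]]).mean()
        if diff > threshold:
            kept.append(i)
    return kept

# Three identical dark frames followed by two bright ones:
frames = np.stack([np.zeros((4, 4))] * 3 + [np.ones((4, 4))] * 2)
print(changed_frames(frames))  # → [0, 3]
```

For I-frames specifically, a video tool such as ffmpeg would have to do the selection before CLIP ever sees the frames.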
No, that wouldn't work too well. It's for YouTubers who stream their desktop screens and I need to extract some information to automatically process it. The desktop streams always look very similar so I don't need advanced AI/neural nets to extract that.
There is one caveat to be aware of: the image is cropped to a square in the center and scaled down to 224x224 pixels, so small details will be lost, for example if you want to run it on scanned documents. Photos work great, though.
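The center crop itself is just array slicing; a minimal sketch of what that preprocessing step loses (the scale-down to 224x224 would follow, which needs an image library and is omitted here):

```python
import numpy as np

def center_crop_square(image):
    """Crop an H x W (x C) image to a centered square, as CLIP's
    preprocessing does before scaling down to 224x224."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return image[top:top + side, left:left + side]

# A 6x10 "image" loses 2 columns on each side:
img = np.arange(60).reshape(6, 10)
print(center_crop_square(img).shape)  # → (6, 6)
```

Anything outside that central square, such as the margins of a scanned page, never reaches the model.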
For info, the same tool works well with 2 million images found in the Unsplash dataset [1]. Features only have to be computed once for the dataset, and only the feature vector for the user query has to be computed on the fly. Then matching features can be done in a manner that scales well.
So the present tool does not scale, because the videos are part of the user query; but a company with easy access to the videos and the computational power to pre-encode the frames as features could create a search engine based on CLIP.
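The scalable part is the matching: with features precomputed and L2-normalized, a query is a single matrix-vector product followed by a top-k selection. A numpy sketch with random vectors standing in for real CLIP features:

```python
import numpy as np

# Precompute once: one L2-normalized feature vector per image.
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 512)).astype(np.float32)
features /= np.linalg.norm(features, axis=1, keepdims=True)

def search(query_feature, k=5):
    """Top-k matches via a single matrix-vector product."""
    q = query_feature / np.linalg.norm(query_feature)
    sims = features @ q
    top = np.argpartition(-sims, k)[:k]       # unordered top-k
    return top[np.argsort(-sims[top])]        # sorted best-first

# Searching with image 7's own feature must return image 7 first:
print(search(features[7], k=1))  # → [7]
```

At larger scale the same dot-product search is what approximate nearest-neighbor libraries accelerate; the per-query cost stays tiny because only the text feature is computed on the fly.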
If you want to experiment with it yourself, I prepared a Colab notebook that can easily be run: https://colab.research.google.com/github/haltakov/natural-la...