
This is a demonstration of the abilities of OpenAI's CLIP neural network. The tool will download a YouTube video, extract frames at regular intervals and precompute the feature vectors of each frame using CLIP.

You can then use natural language search queries to find a particular frame of the video. The results are really amazing in my opinion...

If you want to experiment with it yourself, I prepared a Colab notebook that can easily be run: https://colab.research.google.com/github/haltakov/natural-la...
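The "extract frames at regular intervals" step can be sketched in a few lines. This is an illustrative sketch, not the notebook's actual code; the FPS, frame count, and interval below are made-up numbers (in practice they would come from the video decoder, e.g. OpenCV's VideoCapture).

```python
# Sketch: pick the frame indices to sample at a fixed time interval.
# The video stats here are hypothetical, not from the actual notebook.

def sample_frame_indices(total_frames, fps, interval_seconds):
    """Return indices of the frames to extract, one every interval_seconds."""
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))

# A hypothetical 2-minute clip at 30 fps, sampled every 5 seconds:
indices = sample_frame_indices(total_frames=3600, fps=30, interval_seconds=5)
print(len(indices))  # 24
```

Each of those frames is then fed through CLIP's image encoder once, so the expensive part is proportional to the number of sampled frames, not to the number of searches.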



Just a heads up: in the demo's setup block, it installs PyTorch 1.7.0 + cu101 from CLIP (since it requires it), and then the next command immediately uninstalls it to re-install PyTorch 1.7.1, which takes at least 5 minutes. If 1.7.1 isn't really needed, removing the manual PyTorch installation line would save some time.


Yes, I know that this is a bit slow. The problem is that you really need 1.7.1, because 1.7.0 leads to some strange issues and broken results:

https://github.com/openai/CLIP/issues/13#issuecomment-771143...


Ah, got it. I just noticed that 1.7.1 is already in CLIP's requirements.txt; weird that Colab would still install 1.7.0 in the first place.

Edit: I just realized that 1.7.0 comes with Colab and isn't installed by CLIP.


Yes, this is the actual problem...


This is very cool! Does this produce an occurrence index by any chance? It would be neat to explore a word map of a video.


I pushed a small update and you will now see a heatmap displaying the score of the search query for each frame.


Looks great! Adds a whole dimension to the video. Thanks.


Not yet, but I had this idea as well. You basically get a score describing how well a phrase matches each of the images, so it won't be difficult to do. I'll look into it!
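The per-frame score mentioned above is just a cosine similarity between feature vectors. Here is a minimal NumPy sketch; CLIP's real encoders are omitted, and the random `frame_features` / `query_feature` arrays stand in for the L2-normalized vectors CLIP would produce for the frames and the text query.

```python
import numpy as np

# Stand-ins for precomputed CLIP features (100 frames, 512-dim each).
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(100, 512)).astype(np.float32)
frame_features /= np.linalg.norm(frame_features, axis=1, keepdims=True)

# Stand-in for the encoded text query, also L2-normalized.
query_feature = rng.normal(size=512).astype(np.float32)
query_feature /= np.linalg.norm(query_feature)

# For unit vectors, cosine similarity reduces to a dot product.
scores = frame_features @ query_feature   # one score per frame
best_frame = int(np.argmax(scores))       # index of the best-matching frame
```

The `scores` array is exactly what a heatmap or word-occurrence index would be built from: one number per frame per query phrase.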


Can it work for more advanced queries like, say, "traffic violation", where it spots a car running a red light or a pedestrian not using a crosswalk, etc.?

It could be very useful to help with law enforcement.


I think it can. However, you will likely need a bigger model. Currently, OpenAI has shared only their small model, and I hope they will release bigger ones soon!


Excellent work.

If it could take an image set as input, then perhaps we could use this to identify ourselves in a random Internet video, e.g. a lengthy tourist video in which you suspect you might have been captured because you were at that place on that day.

There are people already looking for such a solution (I've added the link to that discussion on my profile).


I think this is a valuable application, but I don't think CLIP is well suited for it. The power of CLIP comes from training a model to jointly "understand" text and images. If you are looking at identifying a particular person there are more suitable designs for face recognition.


Great demo.

Wondering whether it would be more efficient to extract frames only where the content has changed (e.g. a difference over a threshold and/or all I-frames)?

Also, could this be used to identify event types in videos? I'd love to run my 25 years of home videos through this and have it annotate: "Christmas, birthday, park, camping...".


Yes, this is definitely possible. You could try computing some kind of image distance between frames, or use a keyframe extraction method.

Once you compute the features, the search is very efficient! I tried it for searching in the 2M photos dataset from Unsplash and it takes like 2-3 seconds: https://github.com/haltakov/natural-language-image-search

I plan to run my personal photos through it :)


Awesome! I'm currently working on the exact same thing (but with OCR added). Thank you for releasing this.


> the exact same thing (but with OCR added)

Hmmm... what does "with OCR added" mean? If there is text in the video (e.g. a street sign), can it also be searched?


No, that wouldn't work too well. It's for YouTubers who stream their desktop screens and I need to extract some information to automatically process it. The desktop streams always look very similar so I don't need advanced AI/neural nets to extract that.


This is amazing. I'm going to get this running on my Dropbox. Curious what it gets out of scanned documents as well.


There is one caveat to be aware of - the image is cropped to a square in the center and scaled down to 224x224. So small details will be lost, for example if you want to run it on scanned documents. Photos work great though.
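For intuition, the crop-and-downscale described above looks roughly like this. A minimal NumPy sketch: the nearest-neighbor downscale is only a stand-in for CLIP's actual resize, and the frame dimensions are hypothetical.

```python
import numpy as np

def center_crop_square(img):
    """Crop the largest centered square out of an image array."""
    h, w = img.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return img[top:top + side, left:left + side]

def downscale_nn(img, size=224):
    """Crude nearest-neighbor downscale of a square image to size x size."""
    idx = np.arange(size) * img.shape[0] // size
    return img[np.ix_(idx, idx)]

frame = np.zeros((480, 640), dtype=np.uint8)  # a hypothetical video frame
small = downscale_nn(center_crop_square(frame))
print(small.shape)  # (224, 224)
```

A 640-pixel-wide frame loses 80 pixels on each side to the crop before the downscale, which is why small text in documents tends to become unreadable to the model.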

I tried it on the 2M photos from the Unsplash dataset: https://github.com/haltakov/natural-language-image-search


> The tool will download a YouTube video, extract frames at regular intervals

That should be able to scale well :-)


For info, the same tool works well with 2 million images found in the Unsplash dataset [1]. Features only have to be computed once for the dataset, and only the feature vector for the user query has to be computed on the fly. Then matching features can be done in a manner that scales well.

So the present tool does not scale, because the videos are part of the user query, but a company with easy access to the videos and the computational power to pre-encode the frames as features could create a search engine based on CLIP.

[1] https://github.com/haltakov/natural-language-image-search


Thanks for sharing! :)

Yes, the feature computation on the images has to be done only once, and the representation is very efficient - 512 float16 values per image.
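A quick back-of-the-envelope on that representation, using the 2M-image Unsplash figure from earlier in the thread:

```python
# 512 float16 values per image, 2 bytes each.
bytes_per_image = 512 * 2                  # 1024 bytes = 1 KiB per image
total_bytes = 2_000_000 * bytes_per_image  # index for ~2M Unsplash photos
print(bytes_per_image)          # 1024
print(total_bytes / 2**30)      # ~1.9 GiB for the whole index
```

So the entire 2M-photo index fits comfortably in RAM, which is consistent with the 2-3 second search times mentioned above.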


Yes, I know. :D Your previous project with Unsplash made me try a similar approach [1] for banners of video games on Steam.

[1] https://github.com/woctezuma/steam-image-search


Cool application! I wasn't aware that there are so many images available on Steam...


Yes, Steam has grown a lot. Last I checked, there were ~50k apps.

Edit: 50,630 apps according to https://www.gamedatacrunch.com/

As I focused on vertical banners, the list was smaller (~30k apps). This is equivalent to the *lite* version of Unsplash's dataset.


Wow, what sort of business ideas do you think could come of this?



