Molly is a video editing tool that uses a combination of whisper and llama (hosted by NIVIDIA NIMS) to help pear down long videos into clips. It does this by getting clips of when people are talking and then sorting through what it thinks are the most interesting. I will likely return to this project in the future and work on how it makes the selection but for now it is a good start that is most effective on videos over an hour long.
This is my submission to the LLamaIndex dev contest hosted by Nividia. Sup Nividia engineers? I hope you are having a good day.
IMPORTANT, I have only tested this on one computer, my own. I used a NIVIDIA graphics card with cuda 12.1 + cudnn on python 3.10. Also I had a stuffed animal on my desk, not sure if that'll help.
Some prerequisite include:
Clone this repo, or download it as a zip if you like that better.
Not nessesarily critical but setting up a venv or conda environment is recommended. I used python 3.10.
Install Pytorch for your version of cuda.
Open a terminal in the cloned directory and use this to install the rest of the libraries. ... pip install -r requirments.txt ...
Go into Molly.py and on line 64 add your NIMS Api key for a language model. I have it set to llama3 70B but you could make it any on the service.
To run just use: ... python app.py ...
It is pretty self explanatory but let me give a brief overview of the controls you have in the GUI. There are 4 inputs:
- Path to file. Where you can give the path to the video file you want to edit.
- Ouput folder. Path to the folder you want the transcript and final timeline to be put.
- Initial description. Injects a prompt into the start of the language model that helps it understand the content a bit before going in. Just more context for it to work with.
- Chunk size. The number of max number of words that are proccessed from the transcript at a time. Generally increasing will result in less clips being selected if the video is above a certain length. Don't expect much to happen if you change it for shorter videos. Since more api calls will be made at lower numbers (since it'll process only that many words) I recommend keeping it in the triple digits.
Hope its useful. Let me know if it isn't and if so what WOULD be.