YOHO performs object detection on street-view imagery and provides audible scene descriptions. With YOHO, a visually impaired person can take out their smartphone, put in their earbuds, and take a walk down the street while their phone keeps them informed about their surroundings.
YOHO was created via transfer learning: a YOLOv2 architecture pre-trained on the COCO dataset was fine-tuned on the Berkeley DeepDrive dataset, and an inference engine, Tell a Vision, was placed on top of the model to turn its predictions into audible output.
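As a rough illustration of the second stage, the sketch below shows how per-frame detections from the fine-tuned model could be turned into a spoken scene description using an off-the-shelf text-to-speech engine (pyttsx3). The detection format, labels, and function names here are assumptions for illustration, not the actual YOHO or Tell a Vision code.

```python
# Illustrative sketch only: the detector output format and labels below are
# assumptions. It shows how detections from one street-view frame could be
# summarized and read aloud with pyttsx3 (pip install pyttsx3).

from collections import Counter

import pyttsx3


def describe_detections(detections):
    """Turn a list of (label, confidence, bbox) tuples into a short sentence.

    `detections` is assumed to come from the fine-tuned YOLOv2 model,
    e.g. [("car", 0.91, (x, y, w, h)), ("person", 0.84, (x, y, w, h))].
    """
    counts = Counter(label for label, conf, _ in detections if conf > 0.5)
    if not counts:
        return "Nothing detected ahead."
    parts = [f"{n} {label}{'s' if n > 1 else ''}" for label, n in counts.items()]
    return "Ahead of you: " + ", ".join(parts) + "."


def speak(text):
    """Read a description aloud through the device speakers."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


if __name__ == "__main__":
    # Hypothetical detector output for a single frame.
    frame_detections = [
        ("car", 0.91, (120, 200, 80, 60)),
        ("car", 0.76, (300, 210, 90, 65)),
        ("person", 0.84, (50, 180, 40, 110)),
    ]
    speak(describe_detections(frame_detections))
```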
This is the architecture of the system:
You may also download the notebook and the report from this repository.
If you have any questions or recommendations, you can reach out to me at [email protected].