Feat/voice clone agent #129
base: main
WalkthroughThe pull request introduces a new Changes
Actionable comments posted: 1
🧹 Nitpick comments (5)
backend/director/agents/clone_voice.py (4)
19-19: Fix grammar in parameter description.
“urser” should be changed to “user” to maintain clarity and correctness.
- "description": "List of audio file URLs to given by the urser to clone",
+ "description": "List of audio file URLs provided by the user to clone",
71-71: Consider broadening MIME type check for MP3 files.
Using an exact check for 'audio/mpeg' may exclude valid MP3 files if their Content-Type differs slightly. You could check for 'audio/' to allow for variations.
- if 'audio/mpeg' not in response.headers.get('Content-Type', ''):
+ if not response.headers.get('Content-Type', '').startswith('audio'):
127-127: Correct the spelling in error message.
Change “Could'nt process the sample audioss” to a more grammatically correct phrasing.
- return AgentResponse(status=AgentStatus.ERROR, message="Could'nt process the sample audioss")
+ return AgentResponse(status=AgentStatus.ERROR, message="Couldn't process the sample audios")
131-131: Remove extraneous f prefixes.
These f-strings contain no placeholders and can be regular strings, improving clarity.
- f"Using previously generated cloned voice"
+ "Using previously generated cloned voice"
- f"Cloning the voice"
+ "Cloning the voice"
- f"Synthesising the given text"
+ "Synthesising the given text"
Also applies to: 137-137, 146-146
🧰 Tools
🪛 Ruff (0.8.2)
131-131: f-string without any placeholders. Remove extraneous f prefix (F541)
backend/director/tools/elevenlabs.py (1)
5-5: Remove unused import.
play is not referenced in the code, which may trigger lint warnings and clutter the import list.
- from elevenlabs import VoiceSettings, Voice, play
+ from elevenlabs import VoiceSettings, Voice
🧰 Tools
🪛 Ruff (0.8.2)
5-5: elevenlabs.play imported but unused. Remove unused import: elevenlabs.play (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
backend/director/agents/clone_voice.py (1 hunks)
backend/director/handler.py (2 hunks)
backend/director/tools/elevenlabs.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
backend/director/tools/elevenlabs.py
5-5: elevenlabs.play imported but unused. Remove unused import: elevenlabs.play (F401)
backend/director/agents/clone_voice.py
131-131: f-string without any placeholders. Remove extraneous f prefix (F541)
137-137: f-string without any placeholders. Remove extraneous f prefix (F541)
146-146: f-string without any placeholders. Remove extraneous f prefix (F541)
🔇 Additional comments (1)
backend/director/handler.py (1)
28-28: Integration looks good.
The CloneVoiceAgent import and registration follow the existing pattern. No issues found.
Also applies to: 70-70
sample_audios: list[str],
text_to_synthesis: str,
name_of_voice: str,
is_authorized_to_clone_voice: str,
🛠️ Refactor suggestion
Parameter type mismatch.
You declared is_authorized_to_clone_voice as a string, but the JSON schema specifies a boolean. Convert it to a boolean type to ensure consistency.
-def run(self, sample_audios: list[str], text_to_synthesis: str, name_of_voice: str, is_authorized_to_clone_voice: str, ...
+def run(self, sample_audios: list[str], text_to_synthesis: str, name_of_voice: str, is_authorized_to_clone_voice: bool, ...
Committable suggestion skipped: line range outside the PR's diff.
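The mismatch matters because any non-empty string is truthy in Python, so a string "False" would pass a plain truthiness check. A minimal sketch of a strict conversion follows; the helper name parse_authorization_flag and the accepted spellings are assumptions for illustration, not part of the PR.

```python
def parse_authorization_flag(value: object) -> bool:
    """Strictly interpret a client-supplied flag as a boolean.

    Hypothetical helper: the accepted string spellings are an assumption.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in {"true", "1", "yes"}:
            return True
        if normalized in {"false", "0", "no", ""}:
            return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean flag")


# A non-empty string is truthy, so the naive check accepts "False".
naive_result = bool("False")
# The strict helper reads the same input as a refusal.
strict_result = parse_authorization_flag("False")
```

With the annotation changed to bool and the schema already declaring a boolean, such a helper would only be needed if string values can still reach the method.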
Actionable comments posted: 4
🧹 Nitpick comments (4)
backend/director/agents/clone_voice.py (4)
93-113: Audio file download logic can be improved with more robust content type validation.
The current content type check only verifies if it "starts with 'audio'", which may not catch all valid audio formats or might allow non-audio content that happens to have "audio" in its content type.
Consider using a more specific content type check:
- if not response.headers.get('Content-Type', '').startswith('audio'):
-     raise ValueError(f"The URL does not point to an MP3 file: {audio_url}")
+ content_type = response.headers.get('Content-Type', '')
+ if not content_type.startswith('audio/') and not content_type in ['application/octet-stream']:
+     raise ValueError(f"The URL does not point to an audio file. Content-Type: {content_type}, URL: {audio_url}")
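One way to make the check above less sensitive to header formatting is to strip parameters and normalise case before comparing. This is a sketch only: the function name is_acceptable_audio_type and the allow-list default are assumptions, not code from the PR.

```python
def is_acceptable_audio_type(
    content_type_header: str,
    extra_allowed: tuple = ("application/octet-stream",),
) -> bool:
    """Return True if the header names an audio/* type or an allowed fallback.

    Hypothetical helper; the fallback allow-list is an assumption.
    """
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type.startswith("audio/") or media_type in extra_allowed
```

Matching on "audio/" with the slash rejects types such as "audiobook/zip" that merely begin with the same letters.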
137-171: Good download and extraction logic, but consider adding timeout parameters.
The download and extraction logic is well-structured, but network requests don't have timeout parameters, which could lead to hanging requests if the server doesn't respond.
Add timeouts to network requests to ensure they don't hang indefinitely:
- response = requests.get(audio_url, stream=True)
+ response = requests.get(audio_url, stream=True, timeout=30)  # 30 seconds timeout
Similarly for other requests.get() calls in the code.
260-271: Fix HTML indentation in the download link.
The HTML string has inconsistent indentation which will lead to unnecessary whitespace in the rendered output.
  text_content = TextContent(
      agent_name=self.agent_name,
      status=MsgStatus.success,
      status_message="Here is your generated audio",
-     text=f"""Click <a href='{data_url}' download='{output_file_name}' target='_blank'>here</a> to download the audio
-     """,
+     text=f"""Click <a href='{data_url}' download='{output_file_name}' target='_blank'>here</a> to download the audio""",
  )
283-293: Consider adding file cleanup logic to prevent disk space issues.
The code downloads files but doesn't clean them up after use, which could lead to disk space issues over time.
Consider adding a cleanup function to remove temporary files after they're no longer needed:
def _cleanup_temp_files(self, *file_paths):
    """Remove temporary files to free up disk space."""
    for file_path in file_paths:
        if file_path and os.path.exists(file_path):
            try:
                os.remove(file_path)
                logger.debug(f"Removed temporary file: {file_path}")
            except Exception as e:
                logger.warning(f"Failed to remove temporary file {file_path}: {e}")
Then call this method after processing is complete, e.g., after line 273, add:
self._cleanup_temp_files(sample_file, output_path)
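If the cleanup call sits after the processing code, an exception raised midway would skip it. A possible variant wraps the work in try/finally so removal runs on every exit path. The names remove_if_present and process_with_cleanup are hypothetical, and the byte payload stands in for a downloaded sample.

```python
import os
import tempfile


def remove_if_present(*file_paths):
    """Delete each given path if it exists; return the paths actually removed."""
    removed = []
    for file_path in file_paths:
        if not file_path:
            continue
        try:
            os.remove(file_path)
            removed.append(file_path)
        except FileNotFoundError:
            # Already gone, nothing to clean up.
            pass
    return removed


def process_with_cleanup(payload: bytes) -> int:
    """Write a temporary sample, measure it, and always delete it afterwards."""
    fd, sample_path = tempfile.mkstemp(suffix=".mp3")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # Stand-in for the real cloning and synthesis work.
        return os.path.getsize(sample_path)
    finally:
        remove_if_present(sample_path)
```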
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/director/agents/clone_voice.py (1 hunks)
backend/director/tools/videodb_tool.py (1 hunks)
🔇 Additional comments (3)
backend/director/tools/videodb_tool.py (1)
125-125: Good enhancement to include URL in audio retrieval.
Adding the URL generation to the returned audio data provides direct access to the audio file, which is essential for the voice cloning functionality where audio files need to be downloaded and processed.
backend/director/agents/clone_voice.py (2)
14-83: Parameter schema looks well-structured.
The parameter schema for the CloneVoiceAgent is comprehensive and clearly defines the required and optional parameters with appropriate descriptions. The schema properly handles the two different types of audio sources (direct URL or video extraction) through the oneOf construct.
200-205: ❓ Verification inconclusive
Consider additional server-side validation for authorization.
The method relies solely on a client-provided flag to determine if the user is authorized to clone voices. This could be bypassed if a malicious client sets the flag to True.
Consider implementing additional server-side validation to ensure the user truly has permission:
🏁 Script executed:
#!/bin/bash
# Check if there's any server-side authorization validation for voice cloning
# Look for authorization checks in related files
rg -A 5 "authorized.*clone" --glob "*.py"
Length of output: 3051
Server-Side Authorization Validation Required
The validation relying solely on the client-provided is_authorized_to_clone_voice flag (in backend/director/agents/clone_voice.py at lines 200-205) has been verified. The API schema and inline comments confirm that no additional server-side checks are present to ensure that the caller genuinely has the permission to clone voices. A malicious client could potentially bypass this check by simply setting the flag to True.
- The API schema explicitly accepts a boolean flag without any further verification.
- The existing code only returns an error if the flag is false, without cross-checking user permissions against a secure server-side context.
Recommendation:
Consider integrating an explicit server-side validation mechanism. For example, use an authentication context or permission service to verify if the user is authorized before allowing the cloning operation:
# Example pseudocode snippet
if not validate_user_permissions(user_context, 'clone_voice'):
    return AgentResponse(status=AgentStatus.ERROR, message="User does not have permission to clone voice")
This validation would complement the current logic by ensuring that the authorization status reflects the actual user's permissions, rather than solely relying on client input.
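The pseudocode above leaves validate_user_permissions and user_context undefined. A self-contained sketch of the idea could look like the following, where UserContext, its permissions set, and can_clone_voice are all hypothetical and do not correspond to anything in the repository.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UserContext:
    """Hypothetical server-side record of who is calling and what they may do."""
    user_id: str
    permissions: frozenset = field(default_factory=frozenset)


def validate_user_permissions(user_context: UserContext, permission: str) -> bool:
    """Check the permission against server-held data, never client input."""
    return permission in user_context.permissions


def can_clone_voice(user_context: UserContext, client_flag: bool) -> bool:
    """Require both the client's consent flag and a server-side permission."""
    return client_flag is True and validate_user_permissions(user_context, "clone_voice")
```

The client flag still records consent, but it can no longer grant access by itself.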
output_file_name = f"audio_clone_voice_output_{str(uuid.uuid4())}.mp3"
output_path = f"{DOWNLOADS_PATH}/{output_file_name}"

with open(output_path, "wb") as f:
    for chunk in synthesised_audio:
        if chunk:
            f.write(chunk)
🛠️ Refactor suggestion
Use os.path.join for file path construction.
Using string concatenation for file paths can lead to issues on different operating systems. It's better to use os.path.join() for consistency, as done elsewhere in the code.
output_file_name = f"audio_clone_voice_output_{str(uuid.uuid4())}.mp3"
- output_path = f"{DOWNLOADS_PATH}/{output_file_name}"
+ output_path = os.path.join(DOWNLOADS_PATH, output_file_name)
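A short standalone illustration of the suggested change, assuming a placeholder value for DOWNLOADS_PATH (the real constant is defined elsewhere in the project) and a hypothetical build_output_path helper:

```python
import os
import uuid

# Placeholder value; the project's actual DOWNLOADS_PATH is an assumption here.
DOWNLOADS_PATH = "downloads"


def build_output_path(directory: str) -> str:
    """Build a unique output path using the platform's path separator."""
    output_file_name = f"audio_clone_voice_output_{uuid.uuid4()}.mp3"
    return os.path.join(directory, output_file_name)


output_path = build_output_path(DOWNLOADS_PATH)
```

os.path.join also avoids a doubled separator when the directory already ends with one, which the f-string form would produce.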
if "audio_url" in audio_source:
    sample_file = self._download_audio_file(audio_source["audio_url"])

if "video_id" in audio_source:
    sample_file = self._download_audio_from_video(audio_source)

if not sample_file:
    return AgentResponse(status=AgentStatus.ERROR, message="Could'nt process the sample audios")
🛠️ Refactor suggestion
Fix typo in error message and add validation for audio_url.
There's a typo in the error message and there's no validation for the audio_url case unlike the video_id case.
if "audio_url" in audio_source:
+ if not audio_source["audio_url"]:
+ return AgentResponse(status=AgentStatus.ERROR, message="Audio URL is missing or empty")
sample_file = self._download_audio_file(audio_source["audio_url"])
if "video_id" in audio_source:
sample_file = self._download_audio_from_video(audio_source)
if not sample_file:
- return AgentResponse(status=AgentStatus.ERROR, message="Could'nt process the sample audios")
+ return AgentResponse(status=AgentStatus.ERROR, message="Couldn't process the sample audios")
def _download_video_file(self, video_url: str) -> str | None:
    os.makedirs(DOWNLOADS_PATH, exist_ok=True)

    try:
        response = requests.get(video_url, stream=True)
        response.raise_for_status()

        if not response.headers.get('Content-Type', '').startswith('video'):
            raise ValueError(f"The URL does not point to a video file: {video_url}")

        download_file_name = f"video_download_{str(uuid.uuid4())}.mp4"
        local_path = os.path.join(DOWNLOADS_PATH, download_file_name)

        with open(local_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=65536):
                file.write(chunk)

        return local_path

    except Exception as e:
        print(f"Failed to download {video_url}: {e}")
        return None
🛠️ Refactor suggestion
Replace print with logger for consistency in error handling.
The error handling in this method uses print() while other methods use the logger. This inconsistency makes debugging and log monitoring more difficult.
- print(f"Failed to download {video_url}: {e}")
+ logger.error(f"Failed to download {video_url}: {e}")
def run(
    self,
    audio_source: dict,
    text_to_synthesis: str,
    name_of_voice: str,
    is_authorized_to_clone_voice: bool,
    collection_id: str,
    description="",
    cloned_voice_id=None,
    *args,
    **kwargs) -> AgentResponse:
    """
    Clone the given audio file and synthesis the given text

    :param list sample_audios: The urls of the video given to clone
    :param str text_to_synthesis: The given text which needs to be synthesised in the cloned voice
    :param bool is_authorized_to_clone_voice: The flag which tells whether the user is authorised to clone the audio or not
    :param str name_of_voice: The name to be given to the cloned voice
    :param str descrption: The description about how the voice sounds like
    :param str collection_id: The collection id to store generated voice
    :param str cloned_voice_id: The voice ID generated from the previously given voice which can be used for cloning
    :param args: Additional positional arguments.
    :param kwargs: Additional keyword arguments.
    :return: The response containing information about voice cloning.
    :rtype: AgentResponse
    """
🛠️ Refactor suggestion
Docstring parameter list doesn't match the actual method parameters.
The docstring references sample_audios, which doesn't exist in the method signature, and it's missing documentation for the audio_source parameter, which is actually used.
"""
Clone the given audio file and synthesis the given text
- :param list sample_audios: The urls of the video given to clone
+ :param dict audio_source: The source of the audio, either containing an audio_url or video_id with timing parameters
:param str text_to_synthesis: The given text which needs to be synthesised in the cloned voice
:param bool is_authorized_to_clone_voice: The flag which tells whether the user is authorised to clone the audio or not
:param str name_of_voice: The name to be given to the cloned voice
:param str descrption: The description about how the voice sounds like
:param str collection_id: The collection id to store generated voice
:param str cloned_voice_id: The voice ID generated from the previously given voice which can be used for cloning
:param args: Additional positional arguments.
:param kwargs: Additional keyword arguments.
:return: The response containing information about voice cloning.
:rtype: AgentResponse
"""
Actionable comments posted: 0
🧹 Nitpick comments (3)
backend/director/agents/clone_voice.py (3)
47-80: Clarify usage of collection_id in both schema and top-level parameters.
collection_id appears both within the second object of "audio_source" and again as a top-level parameter. This duplication may be confusing for consumers of this API. Consider consolidating or clarifying how/when each collection_id is used.
101-102: Validate content-type beyond simple prefix matching.
Relying solely on response.headers.get('Content-Type', '').startswith('audio') might fail if the server doesn't set (or sets an unexpected) HTTP header. Consider additional checks or fallback logic to handle possible discrepancies.
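One possible fallback, when the header is missing or unhelpful, is to inspect the first bytes of the downloaded data for well-known audio signatures. The sketch below is a heuristic under stated assumptions: looks_like_audio is a hypothetical name, only a handful of formats are covered, and the bare MPEG frame test can produce false positives (for example on a UTF-16 byte order mark).

```python
def looks_like_audio(first_bytes: bytes) -> bool:
    """Heuristically detect common audio containers from leading bytes."""
    # ID3v2 tag (typical MP3), Ogg container, and FLAC stream markers.
    if first_bytes.startswith((b"ID3", b"OggS", b"fLaC")):
        return True
    # WAV: RIFF container whose form type at offset 8 is "WAVE".
    if first_bytes[:4] == b"RIFF" and first_bytes[8:12] == b"WAVE":
        return True
    # Bare MPEG audio frame: the first 11 bits are all set (frame sync).
    return (
        len(first_bytes) >= 2
        and first_bytes[0] == 0xFF
        and (first_bytes[1] & 0xE0) == 0xE0
    )
```

In a streaming download, the first chunk could be passed to such a check before the rest of the body is written to disk.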
223-223: Correct spelling in error message.
"Could'nt" should be "Couldn't" in the error string for clarity and correctness.
- return AgentResponse(status=AgentStatus.ERROR, message="Could'nt process the sample audios")
+ return AgentResponse(status=AgentStatus.ERROR, message="Couldn't process the sample audios")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/director/agents/clone_voice.py (1 hunks)
🔇 Additional comments (4)
backend/director/agents/clone_voice.py (4)
136-136: Replace print with logger for consistency in error handling.
Similar to a past review comment, please use logger.error(...) instead of print(...) to maintain consistent logging across methods.
193-194: Fix docstring to match actual parameters.
A previous review comment noted that the docstring references sample_audios, but the method signature uses audio_source. Update the docstring to avoid confusion.
219-219: Confirm single source logic.
Currently, the code checks if "audio_url" in audio_source: and then again if "video_id" in audio_source:. If both keys are present, both blocks run sequentially. The schema is designed to accept only one or the other, yet there's no explicit elif or logic to guard against both existing.
Could you confirm that the schema fully enforces exclusivity such that only one key can exist, preventing undesired double processing? If not, use elif to ensure only one path is taken.
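A minimal sketch of exclusive dispatch, independent of whether the schema enforces it. The function select_audio_source is hypothetical; it only decides which branch applies and leaves the actual download to the existing methods.

```python
def select_audio_source(audio_source: dict) -> str:
    """Return which single source key to use, rejecting ambiguous input."""
    has_url = bool(audio_source.get("audio_url"))
    has_video = bool(audio_source.get("video_id"))
    if has_url and has_video:
        raise ValueError("Provide either audio_url or video_id, not both")
    if has_url:
        return "audio_url"
    elif has_video:
        return "video_id"
    raise ValueError("audio_source must contain audio_url or video_id")
```

Raising on the ambiguous case is one design choice; silently preferring audio_url via elif would also prevent double processing.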
246-246: Use os.path.join for file path construction.
This is identical to a past review comment. Instead of:
output_path = f"{DOWNLOADS_PATH}/{output_file_name}"
use:
output_path = os.path.join(DOWNLOADS_PATH, output_file_name)
Fixes #114
Summary by CodeRabbit
New Features
Improvements
Technical Updates
CloneVoiceAgent in the chat handler system.
ElevenLabsTool with methods for audio cloning, voice retrieval, and text synthesis.