-
Notifications
You must be signed in to change notification settings - Fork 69
[feat]: update weaviate interactions to async and add user profiling logic for user indexing and retrieval #84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis update introduces asynchronous GitHub user profiling and storage into Weaviate upon user authentication. It adds new models for pull requests, extends user profile schemas, and refactors Weaviate client usage and scripts for async operations. Health checks and environment variable support for GitHub tokens are also incorporated. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant AuthAPI
participant GitHub
participant Profiler
participant Weaviate
User->>AuthAPI: OAuth callback
AuthAPI->>GitHub: Verify user
GitHub-->>AuthAPI: User info
AuthAPI->>Profiler: (async) profile_user_from_github(user_id, github_username)
Profiler->>GitHub: Fetch user, repos, PRs
Profiler->>Weaviate: store_user_profile(profile)
Weaviate-->>Profiler: Ack
Profiler-->>AuthAPI: Success/Failure (logged)
Suggested labels
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
backend/app/api/v1/auth.py (1)
74-79: Consider task lifecycle management for production robustness.The background task implementation is correct for fire-and-forget execution. However, consider implementing task lifecycle management for production environments to handle task monitoring and cleanup.
For enhanced production robustness, consider:
- try: - asyncio.create_task(profile_user_from_github(str(verified_user.id), github_username)) - logger.info(f"User profiling started in background for: {verified_user.id}") - except Exception as e: - logger.error(f"Error starting user profiling: {verified_user.id}: {str(e)}") + try: + task = asyncio.create_task(profile_user_from_github(str(verified_user.id), github_username)) + # Optional: Store task reference for monitoring if needed + logger.info(f"User profiling started in background for: {verified_user.id}") + except Exception as e: + logger.error(f"Error starting user profiling: {verified_user.id}: {str(e)}")backend/app/db/weaviate/weaviate_operations.py (1)
94-110: Simplify control flow by removing unnecessary else block.The else block is not needed after the return statement.
Apply this diff to improve code readability:
- if existing_uuid: - logger.info(f"Updating existing profile for user_id: {profile.user_id}") - return await self.update_user_profile(existing_uuid, profile) - else: - logger.info(f"Creating new profile for user_id: {profile.user_id}") - return await self.create_user_profile(profile) + if existing_uuid: + logger.info(f"Updating existing profile for user_id: {profile.user_id}") + return await self.update_user_profile(existing_uuid, profile) + + logger.info(f"Creating new profile for user_id: {profile.user_id}") + return await self.create_user_profile(profile)backend/app/db/weaviate/user_profiling.py (3)
54-74: Simplify the control flow in error handling.The elif chain can be simplified since each branch returns.
Apply this diff to improve readability:
async with self.session.get(url, params=params) as response: if response.status == 200: return await response.json() - elif response.status == 404: + if response.status == 404: logger.warning(f"GitHub API 404: {url}") return None - elif response.status == 403: + if response.status == 403: logger.error(f"GitHub API rate limit exceeded: {url}") return None - else: - logger.error(f"GitHub API error {response.status}: {url}") - return None + logger.error(f"GitHub API error {response.status}: {url}") + return None
88-110: Remove unnecessary else block after return.Apply this diff:
if repos and isinstance(repos, list): logger.info(f"Successfully fetched {len(repos)} repositories for {github_username}") return repos - else: - logger.info(f"No repositories found for {github_username}") - return [] + + logger.info(f"No repositories found for {github_username}") + return []
284-284: Address the TODO for profile text generation.The
profile_text_for_embeddingfield is currently set to an empty string with a TODO comment. This field is crucial for semantic search capabilities.Would you like me to implement the profile text generation logic or create a GitHub issue to track this task?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (9)
backend/.env.example(1 hunks)backend/app/api/v1/auth.py(2 hunks)backend/app/db/weaviate/user_profiling.py(1 hunks)backend/app/db/weaviate/weaviate_client.py(1 hunks)backend/app/db/weaviate/weaviate_operations.py(1 hunks)backend/app/model/weaviate/models.py(3 hunks)backend/app/scripts/weaviate/create_schemas.py(3 hunks)backend/app/scripts/weaviate/populate_db.py(7 hunks)backend/main.py(5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
backend/app/api/v1/auth.py (1)
backend/app/db/weaviate/user_profiling.py (1)
profile_user_from_github(296-310)
backend/main.py (1)
backend/app/db/weaviate/weaviate_client.py (1)
get_weaviate_client(19-32)
backend/app/scripts/weaviate/populate_db.py (2)
backend/app/db/weaviate/weaviate_client.py (1)
get_weaviate_client(19-32)backend/app/scripts/weaviate/create_schemas.py (1)
main(51-53)
backend/app/db/weaviate/user_profiling.py (2)
backend/app/model/weaviate/models.py (3)
WeaviateUserProfile(32-129)WeaviateRepository(5-15)WeaviatePullRequest(17-30)backend/app/db/weaviate/weaviate_operations.py (1)
store_user_profile(131-136)
🪛 Pylint (3.3.7)
backend/app/scripts/weaviate/create_schemas.py
[error] 3-3: No name 'classes' in module 'weaviate'
(E0611)
backend/app/model/weaviate/models.py
[refactor] 17-17: Too few public methods (0/2)
(R0903)
backend/app/db/weaviate/weaviate_operations.py
[refactor] 101-106: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
backend/app/db/weaviate/user_profiling.py
[refactor] 58-68: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 101-106: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 210-210: Too many local variables (22/15)
(R0914)
🔇 Additional comments (24)
backend/.env.example (1)
11-11: LGTM! Environment variable addition is appropriate.The addition of
GITHUB_TOKENis correctly formatted and necessary for the GitHub user profiling functionality implemented in this PR.backend/app/api/v1/auth.py (2)
5-5: LGTM! Import addition is correct.The import of
profile_user_from_githubis properly added to support the new user profiling functionality.
8-8: LGTM! Asyncio import is necessary.The
asyncioimport is correctly added to support background task creation for user profiling.backend/app/db/weaviate/weaviate_client.py (4)
2-4: LGTM! Proper imports for async context management.The imports for
asynccontextmanager,AsyncGenerator, andloggingare correctly added to support the async client refactoring.
6-8: LGTM! Logger and global client initialization.The logger setup and global
_clientvariable for lazy initialization follow proper patterns for singleton client management.
11-16: LGTM! Proper lazy initialization pattern.The
get_client()function correctly implements lazy initialization with the global singleton pattern, creating the async client only when needed.
18-32: Excellent async context manager implementation.The
get_weaviate_client()context manager properly handles the complete client lifecycle with:
- Automatic connection establishment
- Resource cleanup in finally block
- Comprehensive error handling and logging
- Proper exception propagation
This follows async best practices and ensures reliable resource management.
backend/app/model/weaviate/models.py (4)
17-30: Excellent pull request model design.The
WeaviatePullRequestmodel is well-structured with comprehensive fields capturing essential PR metadata including:
- Descriptive fields (title, body, state)
- Repository context and timestamps
- Collaboration metadata (labels, URLs)
The field descriptions are clear and the optional timestamp fields properly handle different PR states.
46-47: LGTM! Logical extension of user profile.The addition of the
pull_requestsfield toWeaviateUserProfilelogically extends the user profiling capabilities to include contribution patterns and collaboration style insights.
50-50: Good clarification of languages field scope.The updated description specifying "top 5 languages by usage frequency" provides clear expectations about the data scope and prevents unlimited list growth.
96-119: Comprehensive example data enhances documentation.The added pull request examples are realistic and demonstrate:
- Different PR states (closed/merged vs open)
- Various metadata combinations
- Proper timestamp formats
- Realistic repository targets
This greatly improves the model's documentation value.
backend/app/scripts/weaviate/create_schemas.py (8)
1-1: LGTM! Asyncio import added for async execution.The
asyncioimport is correctly added to support the async refactoring of the schema creation script.
5-10: LGTM! Proper async function conversion.The
create_schemafunction is correctly converted to async with proper await usage for the collection creation.
12-12: LGTM! User profile schema function converted to async.The function signature is properly updated to async to support the async client operations.
24-24: LGTM! Pull requests property added to schema.The addition of the
pull_requestsproperty withTEXTdata type (JSON string) is consistent with the new model field and follows the same pattern as therepositoriesproperty.
34-34: LGTM! Schema creation function updated to async.The
create_schemacall is properly awaited to support the async execution pattern.
36-49: Excellent async client lifecycle management.The
create_all_schemasfunction properly implements:
- Async client connection before operations
- Comprehensive error handling with try-except-finally
- Guaranteed client closure in finally block
- Clear success/error messaging
This follows async best practices for resource management.
51-57: LGTM! Proper async script execution pattern.The
mainfunction correctly usesasyncio.run()to execute the async schema creation, following standard Python async script patterns.
3-3: Verify Weaviate import compatibility.The static analysis tool flagged that
weaviate.classesmight not exist. Please verify that the current Weaviate library version supports this import path.What is the correct import path for weaviate.classes.config in the latest version of the Weaviate Python client?backend/main.py (2)
13-13: Excellent async refactoring of Weaviate client usage!The migration to asynchronous Weaviate client management is well-implemented. The connection test during startup ensures the service is ready before proceeding with other operations.
Also applies to: 31-31, 42-42, 53-62
106-126: Well-structured health check endpoint!The health check endpoint properly uses the async context manager and provides comprehensive status information for both Weaviate and Discord bot services.
backend/app/scripts/weaviate/populate_db.py (2)
1-11: Clean async migration with proper resource management!The script has been successfully migrated to async operations with proper error handling and the new
asyncio.runpattern for script execution.Also applies to: 275-306
38-51: Good addition of pull request data!The new
pull_requestsfield is properly structured with comprehensive metadata that will enable rich user profiling capabilities.Also applies to: 93-106, 139-152, 191-204, 244-257
backend/app/db/weaviate/weaviate_operations.py (1)
112-129: Well-implemented data preparation with proper serialization!The
_prepare_profile_datamethod correctly handles JSON serialization of nested objects and timezone-aware datetime formatting.
| async def build_user_profile(self, user_id: str, github_username: str) -> Optional[WeaviateUserProfile]: | ||
| """ | ||
| Build a complete user profile for Weaviate indexing | ||
| """ | ||
| logger.info(f"Building user profile for GitHub user: {github_username}") | ||
|
|
||
| # Run user data, repositories, and pull requests fetch concurrently | ||
| user_task = self.get_user_data(github_username) | ||
| repos_task = self.get_user_repositories(github_username) | ||
| prs_task = self.get_user_pull_requests(github_username) | ||
|
|
||
| try: | ||
| user_data, repos_data, pull_requests = await asyncio.gather( | ||
| user_task, repos_task, prs_task, return_exceptions=True | ||
| ) | ||
| except Exception as e: | ||
| logger.error(f"Error in concurrent data fetching: {str(e)}") | ||
| return None | ||
|
|
||
| if isinstance(user_data, Exception) or not user_data: | ||
| logger.error(f"Could not fetch user data for {github_username}") | ||
| return None | ||
|
|
||
| if isinstance(repos_data, Exception): | ||
| logger.warning(f"Error fetching repositories: {repos_data}") | ||
| repos_data = [] | ||
|
|
||
| if isinstance(pull_requests, Exception): | ||
| logger.warning(f"Error fetching pull requests: {pull_requests}") | ||
| pull_requests = [] | ||
|
|
||
| logger.info(f"Found {len(repos_data)} repositories and {len(pull_requests)} pull requests for {github_username}") | ||
|
|
||
| repository_tasks = [self._process_repository(repo) for repo in repos_data] | ||
|
|
||
| repositories = [] | ||
| if repository_tasks: | ||
| try: | ||
| repo_results = await asyncio.gather(*repository_tasks, return_exceptions=True) | ||
| repositories = [r for r in repo_results if r is not None and not isinstance(r, Exception)] | ||
| except Exception as e: | ||
| logger.warning(f"Error processing repositories: {str(e)}") | ||
|
|
||
| all_languages = set() | ||
| all_topics = set() | ||
| total_stars = 0 | ||
| total_forks = 0 | ||
|
|
||
| for repo_obj in repositories: | ||
| all_languages.update(repo_obj.languages) | ||
| total_stars += repo_obj.stars | ||
| total_forks += repo_obj.forks | ||
|
|
||
| for repo_data in repos_data: | ||
| topics = repo_data.get("topics", []) | ||
| if topics: | ||
| all_topics.update(topics) | ||
|
|
||
| top_languages = self.analyze_language_frequency(repositories) | ||
|
|
||
| profile = WeaviateUserProfile( | ||
| user_id=user_id, | ||
| github_username=github_username, | ||
| display_name=user_data.get("name"), | ||
| bio=user_data.get("bio"), | ||
| location=user_data.get("location"), | ||
| repositories=repositories, | ||
| pull_requests=pull_requests, | ||
| languages=top_languages, | ||
| topics=list(all_topics), | ||
| followers_count=user_data.get("followers", 0), | ||
| following_count=user_data.get("following", 0), | ||
| total_stars_received=total_stars, | ||
| total_forks=total_forks, | ||
| profile_text_for_embedding="", # TODO: Invoke agent/llm to generate this | ||
| last_updated=datetime.now() | ||
| ) | ||
|
|
||
| logger.info( | ||
| f"Successfully built profile for {github_username}: " | ||
| f"{len(repositories)} repos, {len(top_languages)} top languages, " | ||
| f"{len(pull_requests)} pull requests analyzed" | ||
| ) | ||
| return profile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider refactoring this method to reduce complexity.
This method has 22 local variables (pylint recommends max 15). Consider extracting some logic into helper methods, such as:
- Repository processing logic (lines 243-266)
- Language and topic aggregation (lines 253-268)
Would you like me to help refactor this method into smaller, more focused helper methods?
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 210-210: Too many local variables (22/15)
(R0914)
🤖 Prompt for AI Agents
In backend/app/db/weaviate/user_profiling.py from lines 210 to 293, the
build_user_profile method is too complex with 22 local variables, exceeding
pylint's recommended maximum of 15. To fix this, extract the repository
processing logic (lines 243-266) into a separate helper method that handles
processing repositories and filtering results. Also, move the language and topic
aggregation logic (lines 253-268) into another helper method that computes all
languages, topics, stars, and forks totals. Then call these helper methods from
build_user_profile to simplify it and reduce local variables.
|
@chandansgowda, could you please review and merge? Planning to work either on restructuring the codebase or Falkor DB part or both concurrently xD. Will introduce GitHub agent once these are finalized. PS: initially I tried working around PyGitHub but it didn't have any support for async calls and when being wrapped as async using asyncio threads or thread pool some functions internally were still having sync calls leading to blocking behavior which initially resulted in whole process to be blocking and taking around ~20 seconds. But using GitHub REST API reduced this profiling workflow to entirely async with time taken being around ~3 seconds to profile out a user. Async wrapper alternatives exists for this but flexibility ends up a cap in those. |
Attached interactions
Summary by CodeRabbit
New Features
Improvements
Chores