Performance Tips

GitFleet is designed with performance in mind, leveraging Rust's speed and memory safety for core Git operations while providing a convenient Python interface. This guide provides tips and best practices for optimizing GitFleet's performance in your applications.

Performance Architecture

GitFleet's performance architecture is built on several key components:

  1. Rust Implementation for Critical Operations:
     - Repository cloning
     - Git blame analysis
     - Commit history extraction
     - Core repository management

  2. Asynchronous Processing:
     - Non-blocking I/O via asyncio and Tokio
     - Parallel operations where appropriate

  3. Efficient Memory Management:
     - Streaming results for large repositories
     - Temporary directory cleanup

General Performance Tips

Use Asynchronous APIs

Always prefer asynchronous APIs when available. GitFleet's async functions leverage Rust's Tokio runtime for optimal performance:

# Good: Using async/await
async def process_repos():
    repos = await github_client.fetch_repositories("octocat")
    return repos

# Bad: Blocking the main thread
def process_repos_blocking():
    # Hypothetical synchronous version would block
    repos = github_client.fetch_repositories_sync("octocat")
    return repos

Batch Operations

Group related operations into batches rather than making multiple individual calls:

# Good: Batch operation
results = await repo_manager.bulk_blame(repo_path, ["file1.py", "file2.py", "file3.py"])

# Less efficient: Individual operations
result1 = await repo_manager.blame(repo_path, "file1.py")
result2 = await repo_manager.blame(repo_path, "file2.py")
result3 = await repo_manager.blame(repo_path, "file3.py")

Process Data Incrementally

When working with large repositories, process data incrementally rather than loading everything into memory:

# Process commits in batches
async for commit_batch in repo_manager.yield_commits(repo_path, batch_size=100):
    # Process each batch without loading everything into memory
    process_batch(commit_batch)

Repository Cloning Performance

Optimize Clone Options

When cloning repositories, consider using these options for better performance:

# Clone with minimal history (faster)
repo_manager = RepoManager(
    urls=["https://github.com/example/repo.git"],
    clone_options={"depth": 1}  # Only fetch the latest commit
)

Efficient Clone Monitoring

When monitoring clone progress, be mindful of polling frequency:

# Good: Reasonable refresh rate
while not clone_future.done():
    clone_tasks = await repo_manager.fetch_clone_tasks()
    # Process tasks...
    await asyncio.sleep(1)  # 1 second interval

# Bad: Excessive polling
while not clone_future.done():
    clone_tasks = await repo_manager.fetch_clone_tasks()
    # Process tasks...
    await asyncio.sleep(0.01)  # 10ms interval is too frequent
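
If clone times vary widely, polling with a gentle backoff keeps the loop cheap without noticeably delaying completion detection. A minimal sketch, reusing the clone_future and fetch_clone_tasks API from the example above:

import asyncio

async def monitor_clones(repo_manager, clone_future):
    interval = 0.5  # Start with a short interval...
    while not clone_future.done():
        clone_tasks = await repo_manager.fetch_clone_tasks()
        # Process tasks (e.g., update a progress display)...
        await asyncio.sleep(interval)
        interval = min(interval * 2, 5.0)  # ...and back off toward a 5-second ceiling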

Clean Up Temporary Directories

Always clean up temporary directories when you're done with them:

# Clean up after operations complete
cleanup_results = repo_manager.cleanup()
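
To guarantee cleanup even when an operation raises, wrap the work in try/finally. A minimal sketch around the extract_commits and cleanup calls shown in this guide:

async def analyze_and_cleanup(repo_manager, repo_path):
    try:
        # Do the actual analysis work
        return await repo_manager.extract_commits(repo_path)
    finally:
        # Runs whether the analysis succeeded or raised
        repo_manager.cleanup()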

API Interaction Performance

Rate Limit Management

Efficiently manage rate limits with TokenManager:

# Create a token manager with multiple tokens
token_manager = TokenManager()
for token in github_tokens:
    token_manager.add_token(token, ProviderType.GITHUB)

# Use token manager for auto-rotation
github = GitHubClient(
    token=github_tokens[0],  # Default token
    token_manager=token_manager  # Will rotate tokens as needed
)

Local Caching

Implement caching for frequently accessed data:

# Simple in-memory cache
repo_cache = {}

async def get_repo_info(repo_name):
    if repo_name in repo_cache:
        return repo_cache[repo_name]

    # Fetch from API
    repo_info = await github_client.fetch_repository_details("owner", repo_name)
    repo_cache[repo_name] = repo_info
    return repo_info
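
A plain dict never expires entries, so long-running processes can serve stale data. A sketch adding a simple time-to-live on top of the same fetch_repository_details call; the 300-second TTL is an arbitrary illustration:

import time

repo_cache = {}  # repo_name -> (fetched_at, repo_info)
CACHE_TTL = 300  # Seconds before an entry is considered stale

async def get_repo_info_ttl(repo_name):
    entry = repo_cache.get(repo_name)
    if entry is not None:
        fetched_at, repo_info = entry
        if time.time() - fetched_at < CACHE_TTL:
            return repo_info  # Still fresh

    # Miss or stale: fetch from the API and refresh the cache
    repo_info = await github_client.fetch_repository_details("owner", repo_name)
    repo_cache[repo_name] = (time.time(), repo_info)
    return repo_info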

Memory Optimization

Stream Large Repository Data

For large repositories, use streaming operations where available:

# Process large files incrementally
async def process_large_file(repo_path, filename):
    results = []
    async for blame_batch in repo_manager.stream_blame(repo_path, filename, batch_size=1000):
        # Process each batch
        results.extend(analyze_blame(blame_batch))
    return results

Parallel Processing

Concurrent API Operations

Use concurrent operations for independent API tasks:

import asyncio

async def analyze_multiple_repos():
    # Start multiple operations concurrently
    tasks = [
        github_client.fetch_repositories("user1"),
        github_client.fetch_repositories("user2"),
        github_client.fetch_repositories("user3"),
    ]

    # Wait for all to complete
    results = await asyncio.gather(*tasks)
    return results
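
With many users, an unbounded gather() can flood the API and burn through rate limits. A sketch bounding concurrency with asyncio.Semaphore; the limit of 5 is an arbitrary example:

import asyncio

async def fetch_all_users(usernames, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(username):
        # At most max_concurrent fetches run at any one time
        async with semaphore:
            return await github_client.fetch_repositories(username)

    return await asyncio.gather(*(fetch_one(u) for u in usernames))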

Parallel Git Operations

For operations on multiple repositories, use concurrent processing:

async def analyze_multiple_repos(repo_paths):
    # Extract commits from multiple repositories concurrently
    tasks = [
        repo_manager.extract_commits(repo_path)
        for repo_path in repo_paths
    ]

    # Wait for all to complete
    results = await asyncio.gather(*tasks)
    return results
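
A bare gather() raises on the first failure and discards the results of the repositories that succeeded. A sketch that keeps both, using return_exceptions=True:

import asyncio

async def analyze_repos_robust(repo_paths):
    tasks = [repo_manager.extract_commits(p) for p in repo_paths]
    # Exceptions are returned in place instead of aborting the whole batch
    results = await asyncio.gather(*tasks, return_exceptions=True)

    succeeded, failed = {}, {}
    for path, result in zip(repo_paths, results):
        if isinstance(result, Exception):
            failed[path] = result
        else:
            succeeded[path] = result
    return succeeded, failed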

Git Operation Performance

Blame Analysis Optimization

When using blame operations, consider these optimizations:

# Limit to specific files of interest rather than entire repo
file_patterns = ["*.py", "*.rs", "*.js"]
relevant_files = []

# Find matching files first instead of blaming every file in the repository
for pattern in file_patterns:
    pattern_files = await repo_manager.find_files(repo_path, pattern)
    relevant_files.extend(pattern_files)

# Then perform blame on the filtered set
blame_results = await repo_manager.bulk_blame(repo_path, relevant_files)

Commit Extraction Optimization

For commit extraction, filter to reduce processing time:

# Extract only recent commits (faster)
recent_commits = await repo_manager.extract_commits(
    repo_path, 
    max_count=100,  # Only most recent 100 commits
    since="2023-01-01"  # Only commits after this date
)

Monitoring and Profiling

Memory Usage Monitoring

Monitor memory usage for large operations:

import psutil
import os

def log_memory_usage(label):
    process = psutil.Process(os.getpid())
    mem = process.memory_info().rss / 1024 / 1024  # Convert to MB
    print(f"Memory usage ({label}): {mem:.2f} MB")

async def memory_intensive_operation():
    log_memory_usage("Before operation")
    result = await repo_manager.extract_commits(repo_path)
    log_memory_usage("After operation")
    return result
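
psutil reports process-wide RSS; to see which Python allocation sites dominate, the standard library's tracemalloc can snapshot allocations. A minimal sketch:

import tracemalloc

async def trace_operation():
    tracemalloc.start()
    result = await repo_manager.extract_commits(repo_path)

    snapshot = tracemalloc.take_snapshot()
    # Print the ten source lines that allocated the most memory
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)

    tracemalloc.stop()
    return result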

Performance Timing

Time critical operations to identify bottlenecks:

import time

async def measure_execution_time(coroutine, *args, **kwargs):
    # perf_counter is monotonic, making it better suited to interval timing than time.time
    start_time = time.perf_counter()
    result = await coroutine(*args, **kwargs)
    execution_time = time.perf_counter() - start_time

    print(f"{coroutine.__name__} took {execution_time:.2f} seconds")

    return result

# Example usage
results = await measure_execution_time(
    repo_manager.bulk_blame, repo_path, file_paths
)
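
For repeated measurements, the same idea fits naturally into a decorator. A minimal sketch:

import functools
import time

def timed(coro_func):
    @functools.wraps(coro_func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = await coro_func(*args, **kwargs)
        print(f"{coro_func.__name__} took {time.perf_counter() - start:.2f} seconds")
        return result
    return wrapper

@timed
async def blame_all(repo_path, file_paths):
    return await repo_manager.bulk_blame(repo_path, file_paths)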

Implementation-Specific Optimizations

Rust vs Python Implementation

GitFleet implements performance-critical operations in Rust, while providing Python implementations for higher-level functionality:

Rust-powered operations (leverage these for best performance):

- Repository cloning
- Git blame analysis
- Commit history extraction

Python-implemented features:

- API clients
- Token management
- Credential handling
- Data conversion utilities

When to Process Locally vs. Via API

For best performance, consider whether to use local Git operations (faster, implemented in Rust) or API operations (more convenient, implemented in Python):

# Fast local operation using Rust implementation
blame_info = await repo_manager.bulk_blame(local_repo_path, ["file.py"])

# Convenient but potentially slower API operation
file_content = await github_client.fetch_file_content("owner", "repo", "file.py")