This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "oprational_expenses" vs "cost_of_operation").
- AI-based column matching: Uses lightweight Sentence Transformers to find semantically similar columns
- Manual matching interface: Visual interface to manually match unmatched columns using mouse selection
- Sorted recommendations: "Best Potential Matches" section with columns sorted by similarity scores
- Real-time similarity scoring: Shows similarity scores with sample values for better decisions
- Performance optimized: Caching system prevents redundant similarity calculations
- Sorted dropdowns: Dropdowns show columns sorted by similarity when making manual matches
- Auto-refreshing UI: Dropdowns automatically update to remove matched columns after applying matches
- Automatic merging: Merges CSV files based on both AI and manual matches
- Schema generation: Creates a JSON schema describing the unified columns
- Web interface: Simple Streamlit UI for easy file upload and processing
- Export functionality: Exports both merged CSV and schema JSON
- Two output modes: "Mapped columns only" and "Full data with mapped + unmapped columns"
- Proper column naming: Unmapped columns get proper suffixes with actual file names
- Correct column ordering: Mapped → unmapped_file1 → unmapped_file2
- Clone or download the repository
- Install dependencies:
pip install -r requirements.txt
streamlit run main.pyfrom csv_matcher import CSVIngestor
from embeddings_matcher import EmbeddingsMatcher
from csv_merger import CSVMerger
# Load CSV files
ingestor = CSVIngestor()
df1 = ingestor.load_csv("file1.csv", is_file1=True)
df2 = ingestor.load_csv("file2.csv", is_file1=False)
# Get column information with samples
file1_info = ingestor.get_column_info(is_file1=True)
file2_info = ingestor.get_column_info(is_file1=False)
# Perform matching
matcher = EmbeddingsMatcher()
matches = matcher.match_columns(file1_info, file2_info, threshold=0.5)
# Merge files and create schema
merger = CSVMerger()
merged_df = merger.merge_csv_files(df1, df2, matches)
schema = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches)
# Export results
csv_path, schema_path = merger.export_results(merged_df, schema, "merged.csv", "schema.json")- Upload: Upload two CSV files with potentially different column names
- Analysis: The tool analyzes column names and sample values using AI embeddings
- Matching: Identifies semantically similar columns across the two files
- Merging: Creates a unified dataset based on the similiarity-mapped columns
- Export: Generates both merged CSV and schema JSON files
- File 1 has columns:
Cost_Center,Expense_Total - File 2 has columns:
Org_Code,Actual_Spend - Tool recognizes:
Cost_Center≈Org_CodeandExpense_Total≈Actual_Spend - Merges the files accordingly
ai_csv_column_matching_tool/
├── main.py # Streamlit web interface
├── csv_matcher.py # CSV ingestion and parsing
├── embeddings_matcher.py # Sentence Transformers matching logic
├── run_app.py # Application launcher script
├── government_projects.csv # Sample test file 1
├── commercial_engagements.csv # Sample test file 2
├── merged.csv # Sample merged output
├── schema.json # Sample schema output
├── project_progress.md # Project progress tracking
├── csv_merger.py # Merge and export functionality
├── generate_test_data.py # Generate synthetic test files
├── requirements.txt # Python dependencies
├── README.md # This file
└── UpWork_Final_Proposal_with_First_Prototype.md # Final proposal documentation
- Similarity Threshold: Adjust the minimum score for column matches (0.0 to 1.0)
- Primary Model: By default, uses "sentence-transformers/all-MiniLM-L6-v2" model
merged.csv: Unified dataset with matched columnsschema.json: Detailed schema describing unified columns and their sources
- Python 3.8+ (Recommended v3.11)
- See
requirements.txtfor full dependency list
The project includes synthetic test data to verify functionality. Run tests with:
python test_app.pyThe original requirements called for a simple tool to match CSV columns using AI embeddings with a basic UI. Our solution exceeds these requirements by providing:
- Enhanced AI matching: Not only basic embeddings, adjustable threshold, but also manual verification
- Human in the Loop: [Optional] based on best possible suggestions, with manual override capabilities
- Superior UI experience: Sorted recommendations, sample previews, and real-time similarity scoring
- Performance optimization: Caching system that prevents redundant calculations
- Intuitive workflow: Best matches highlighted with sample values for better decision-making
- Robust architecture: Modular design that's easy to extend and maintain
© 2025 Amit Kumar. MIT License.