Skip to content

This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "oprational_expenses" vs "cost_of_operation").

License

Notifications You must be signed in to change notification settings

amitportal/similar_column_merge_tool

Repository files navigation

AI CSV Column Matching Tool

Overview

This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "oprational_expenses" vs "cost_of_operation").

Features

  • AI-based column matching: Uses lightweight Sentence Transformers to find semantically similar columns
  • Manual matching interface: Visual interface to manually match unmatched columns using mouse selection
  • Sorted recommendations: "Best Potential Matches" section with columns sorted by similarity scores
  • Real-time similarity scoring: Shows similarity scores with sample values for better decisions
  • Performance optimized: Caching system prevents redundant similarity calculations
  • Sorted dropdowns: Dropdowns show columns sorted by similarity when making manual matches
  • Auto-refreshing UI: Dropdowns automatically update to remove matched columns after applying matches
  • Automatic merging: Merges CSV files based on both AI and manual matches
  • Schema generation: Creates a JSON schema describing the unified columns
  • Web interface: Simple Streamlit UI for easy file upload and processing
  • Export functionality: Exports both merged CSV and schema JSON
  • Two output modes: "Mapped columns only" and "Full data with mapped + unmapped columns"
  • Proper column naming: Unmapped columns get proper suffixes with actual file names
  • Correct column ordering: Mapped → unmapped_file1 → unmapped_file2

Installation

  1. Clone or download the repository
  2. Install dependencies:
    pip install -r requirements.txt

Usage

Run the Web Interface

streamlit run main.py

Command Line Usage (example)

from csv_matcher import CSVIngestor
from embeddings_matcher import EmbeddingsMatcher
from csv_merger import CSVMerger

# Load CSV files
ingestor = CSVIngestor()
df1 = ingestor.load_csv("file1.csv", is_file1=True)
df2 = ingestor.load_csv("file2.csv", is_file1=False)

# Get column information with samples
file1_info = ingestor.get_column_info(is_file1=True)
file2_info = ingestor.get_column_info(is_file1=False)

# Perform matching
matcher = EmbeddingsMatcher()
matches = matcher.match_columns(file1_info, file2_info, threshold=0.5)

# Merge files and create schema
merger = CSVMerger()
merged_df = merger.merge_csv_files(df1, df2, matches)
schema = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches)

# Export results
csv_path, schema_path = merger.export_results(merged_df, schema, "merged.csv", "schema.json")

How It Works

  1. Upload: Upload two CSV files with potentially different column names
  2. Analysis: The tool analyzes column names and sample values using AI embeddings
  3. Matching: Identifies semantically similar columns across the two files
  4. Merging: Creates a unified dataset based on the similiarity-mapped columns
  5. Export: Generates both merged CSV and schema JSON files

Example

  • File 1 has columns: Cost_Center, Expense_Total
  • File 2 has columns: Org_Code, Actual_Spend
  • Tool recognizes: Cost_CenterOrg_Code and Expense_TotalActual_Spend
  • Merges the files accordingly

Project Structure

ai_csv_column_matching_tool/
├── main.py                 # Streamlit web interface
├── csv_matcher.py          # CSV ingestion and parsing
├── embeddings_matcher.py   # Sentence Transformers matching logic
├── run_app.py                # Application launcher script
├── government_projects.csv          # Sample test file 1
├── commercial_engagements.csv       # Sample test file 2
├── merged.csv                # Sample merged output
├── schema.json               # Sample schema output
├── project_progress.md       # Project progress tracking
├── csv_merger.py           # Merge and export functionality
├── generate_test_data.py   # Generate synthetic test files
├── requirements.txt        # Python dependencies
├── README.md              # This file
└── UpWork_Final_Proposal_with_First_Prototype.md  # Final proposal documentation

Configuration

  • Similarity Threshold: Adjust the minimum score for column matches (0.0 to 1.0)
  • Primary Model: By default, uses "sentence-transformers/all-MiniLM-L6-v2" model

Output Files

  • merged.csv: Unified dataset with matched columns
  • schema.json: Detailed schema describing unified columns and their sources

Requirements

  • Python 3.8+ (Recommended v3.11)
  • See requirements.txt for full dependency list

Testing

The project includes synthetic test data to verify functionality. Run tests with:

python test_app.py

How Our Solution Exceeds Requirements

The original requirements called for a simple tool to match CSV columns using AI embeddings with a basic UI. Our solution exceeds these requirements by providing:

  • Enhanced AI matching: Not only basic embeddings, adjustable threshold, but also manual verification
  • Human in the Loop: [Optional] based on best possible suggestions, with manual override capabilities
  • Superior UI experience: Sorted recommendations, sample previews, and real-time similarity scoring
  • Performance optimization: Caching system that prevents redundant calculations
  • Intuitive workflow: Best matches highlighted with sample values for better decision-making
  • Robust architecture: Modular design that's easy to extend and maintain

License & attribution

© 2025 Amit Kumar. MIT License.

About

This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "oprational_expenses" vs "cost_of_operation").

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages