AI CSV Column Matching Tool

Overview

This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "oprational_expenses" vs "cost_of_operation").

Features

AI-based column matching: Uses lightweight Sentence Transformers to find semantically similar columns
Manual matching interface: Visual interface to manually match unmatched columns using mouse selection
Sorted recommendations: "Best Potential Matches" section with columns sorted by similarity scores
Real-time similarity scoring: Shows similarity scores with sample values for better decisions
Performance optimized: Caching system prevents redundant similarity calculations
Sorted dropdowns: Dropdowns show columns sorted by similarity when making manual matches
Auto-refreshing UI: Dropdowns automatically update to remove matched columns after applying matches
Automatic merging: Merges CSV files based on both AI and manual matches
Schema generation: Creates a JSON schema describing the unified columns
Web interface: Simple Streamlit UI for easy file upload and processing
Export functionality: Exports both merged CSV and schema JSON
Two output modes: "Mapped columns only" and "Full data with mapped + unmapped columns"
Proper column naming: Unmapped columns get proper suffixes with actual file names
Correct column ordering: Mapped → unmapped_file1 → unmapped_file2

Installation

Clone or download the repository
Install dependencies:
```
pip install -r requirements.txt
```

Usage

Run the Web Interface

streamlit run main.py

Command Line Usage (example)

from csv_matcher import CSVIngestor
from embeddings_matcher import EmbeddingsMatcher
from csv_merger import CSVMerger

# Load CSV files
ingestor = CSVIngestor()
df1 = ingestor.load_csv("file1.csv", is_file1=True)
df2 = ingestor.load_csv("file2.csv", is_file1=False)

# Get column information with samples
file1_info = ingestor.get_column_info(is_file1=True)
file2_info = ingestor.get_column_info(is_file1=False)

# Perform matching
matcher = EmbeddingsMatcher()
matches = matcher.match_columns(file1_info, file2_info, threshold=0.5)

# Merge files and create schema
merger = CSVMerger()
merged_df = merger.merge_csv_files(df1, df2, matches)
schema = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches)

# Export results
csv_path, schema_path = merger.export_results(merged_df, schema, "merged.csv", "schema.json")

How It Works

Upload: Upload two CSV files with potentially different column names
Analysis: The tool analyzes column names and sample values using AI embeddings
Matching: Identifies semantically similar columns across the two files
Merging: Creates a unified dataset based on the similiarity-mapped columns
Export: Generates both merged CSV and schema JSON files

Example

File 1 has columns: Cost_Center, Expense_Total
File 2 has columns: Org_Code, Actual_Spend
Tool recognizes: Cost_Center ≈ Org_Code and Expense_Total ≈ Actual_Spend
Merges the files accordingly

Project Structure

ai_csv_column_matching_tool/
├── main.py                 # Streamlit web interface
├── csv_matcher.py          # CSV ingestion and parsing
├── embeddings_matcher.py   # Sentence Transformers matching logic
├── run_app.py                # Application launcher script
├── government_projects.csv          # Sample test file 1
├── commercial_engagements.csv       # Sample test file 2
├── merged.csv                # Sample merged output
├── schema.json               # Sample schema output
├── project_progress.md       # Project progress tracking
├── csv_merger.py           # Merge and export functionality
├── generate_test_data.py   # Generate synthetic test files
├── requirements.txt        # Python dependencies
├── README.md              # This file
└── UpWork_Final_Proposal_with_First_Prototype.md  # Final proposal documentation

Configuration

Similarity Threshold: Adjust the minimum score for column matches (0.0 to 1.0)
Primary Model: By default, uses "sentence-transformers/all-MiniLM-L6-v2" model

Output Files

merged.csv: Unified dataset with matched columns
schema.json: Detailed schema describing unified columns and their sources

Requirements

Python 3.8+ (Recommended v3.11)
See requirements.txt for full dependency list

Testing

The project includes synthetic test data to verify functionality. Run tests with:

python test_app.py

How Our Solution Exceeds Requirements

The original requirements called for a simple tool to match CSV columns using AI embeddings with a basic UI. Our solution exceeds these requirements by providing:

Enhanced AI matching: Not only basic embeddings, adjustable threshold, but also manual verification
Human in the Loop: [Optional] based on best possible suggestions, with manual override capabilities
Superior UI experience: Sorted recommendations, sample previews, and real-time similarity scoring
Performance optimization: Caching system that prevents redundant calculations
Intuitive workflow: Best matches highlighted with sample values for better decision-making
Robust architecture: Modular design that's easy to extend and maintain

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

AI CSV Column Matching Tool

Overview

Features

Installation

Usage

Run the Web Interface

Command Line Usage (example)

How It Works

Example

Project Structure

Configuration

Output Files

Requirements

Testing

How Our Solution Exceeds Requirements

License & attribution

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
USAGE_INSTRUCTIONS.md		USAGE_INSTRUCTIONS.md
commercial_engagements.csv		commercial_engagements.csv
csv_matcher.py		csv_matcher.py
csv_merger.py		csv_merger.py
embeddings_matcher.py		embeddings_matcher.py
fallback_matcher.py		fallback_matcher.py
generate_test_data.py		generate_test_data.py
government_projects.csv		government_projects.csv
main.py		main.py
merged_output.csv		merged_output.csv
requirements.txt		requirements.txt
run_app.py		run_app.py
schema_output.json		schema_output.json

License

amitportal/similar_column_merge_tool

Folders and files

Latest commit

History

Repository files navigation

AI CSV Column Matching Tool

Overview

Features

Installation

Usage

Run the Web Interface

Command Line Usage (example)

How It Works

Example

Project Structure

Configuration

Output Files

Requirements

Testing

How Our Solution Exceeds Requirements

License & attribution

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages