This project implements a complete classification pipeline for ECG heartbeat signals using the MIT-BIH Arrhythmia dataset.
The goal is to classify heartbeats into five arrhythmia types using rhythmic features, data balancing, augmentation techniques, and an SVM classifier with an RBF kernel.
The dataset contains ECG beats labeled into 5 classes:
- N: Normal
- S: Atrial premature contraction (supraventricular ectopic beat)
- V: Premature ventricular contraction
- F: Fusion of ventricular and normal beats
- Q: Fusion of paced and normal beats (includes unclassifiable beats)
Each row contains 187 ECG samples + 1 class label.
Dataset size:
- Training set: 87,553 samples
- Test set: 21,891 samples
Initial operations included:
- Loading the CSV datasets using Pandas
- Inspecting structure and label distribution
- Visualizing ECG waveforms
- Identifying severe class imbalance (class N dominates)
Waveform visualization was performed using Matplotlib by sampling signals at fixed intervals.
Due to computational constraints, 10% of the original training and testing sets were extracted using:
train_test_split(..., stratify=labels)
Resulting shapes:
- Train: 8755 × 188
- Test: 2189 × 188
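
The stratified 10% extraction can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual loading code: the helper name `stratified_subset`, the seed, and the synthetic array are assumptions. Only `train_test_split` with `stratify` comes from the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subset(data, fraction=0.10, seed=0):
    """Return a random subset of rows that preserves the label proportions.

    `data` is a 2-D array whose last column holds the class label.
    """
    labels = data[:, -1]
    # Keep `fraction` of the rows; the remainder is discarded.
    subset, _ = train_test_split(
        data, train_size=fraction, stratify=labels, random_state=seed
    )
    return subset

# Synthetic stand-in for the CSV: 1000 rows of 187 samples + 1 label,
# with class 0 dominating as class N does in the real data.
rng = np.random.default_rng(0)
signals = rng.random((1000, 187))
labels = rng.choice(5, size=1000, p=[0.8, 0.05, 0.05, 0.05, 0.05])
demo = np.column_stack([signals, labels])

subset = stratified_subset(demo)
print(subset.shape)  # (100, 188)
```

With `train_size=0.10`, scikit-learn rounds the subset size down, which agrees with the 8755 and 2189 rows reported above.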
Signals were normalized to the range (-1, 1) with MinMaxScaler:
- fit_transform() → training set
- transform() → test set
The last column (the class label) was separated from the 187 ECG samples.
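
A possible shape for this step is shown below. The function name `split_and_scale` and the synthetic arrays are illustrative assumptions. The sketch separates the labels before scaling so that they are not rescaled, which is one reasonable ordering rather than a confirmed detail of the project.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def split_and_scale(train, test):
    """Separate labels from signals, then scale signals to (-1, 1)."""
    X_train, y_train = train[:, :-1], train[:, -1].astype(int)
    X_test, y_test = test[:, :-1], test[:, -1].astype(int)

    scaler = MinMaxScaler(feature_range=(-1, 1))
    # Fit on the training signals only, so no test statistics leak in.
    X_train = scaler.fit_transform(X_train)
    # Reuse the training minima and maxima for the test signals.
    X_test = scaler.transform(X_test)
    return X_train, y_train, X_test, y_test

# Synthetic stand-in data: 187 samples + 1 label per row.
rng = np.random.default_rng(0)
train = np.column_stack([rng.random((50, 187)), rng.integers(0, 5, 50)])
test = np.column_stack([rng.random((20, 187)), rng.integers(0, 5, 20)])

X_train, y_train, X_test, y_test = split_and_scale(train, test)
print(X_train.shape, X_test.shape)  # (50, 187) (20, 187)
```

Because the scaler is fitted on the training set only, test values can fall slightly outside (-1, 1).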
Feature extraction was performed through the compute_feature_vector() function, which produces the following rhythmic and statistical descriptors:
- Mean
- Standard deviation
- Frame-wise ZCR
- Mean of ZCR
- Standard deviation of ZCR
- Magnitude spectrum
- Spectral flux using librosa.onset.onset_strength
- Mean and standard deviation of the spectral flux
All features (statistics + frame features) are concatenated into a single vector.
Final shapes:
- Train feature matrix: (8755, 30)
- Test feature matrix: (2189, 30)
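
The sketch below shows one way a 30-element vector of this kind could be assembled. It is a NumPy-only approximation and not the project's compute_feature_vector(): the spectral flux is computed by hand instead of with librosa.onset.onset_strength, and the frame length of 64 and hop of 8 are assumptions (they would match the N and H values listed in the final configuration if those denote frame and hop length). The magnitude spectrum is used only as an intermediate for the flux. With centered framing these choices give 24 frames, so 2 + 24 + 2 + 2 = 30 values.

```python
import numpy as np

FRAME_LENGTH = 64  # assumed frame length
HOP_LENGTH = 8     # assumed hop length

def frame_signal(signal, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH):
    """Cut a 1-D signal into overlapping, centered frames."""
    padded = np.pad(signal, frame_length // 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, frame_length)
    return windows[::hop_length]

def compute_feature_vector(signal):
    """Concatenate global statistics, frame-wise ZCR, and flux statistics."""
    signal = np.asarray(signal, dtype=float)
    frames = frame_signal(signal)

    # Frame-wise zero-crossing rate: share of adjacent pairs that change sign.
    negative = np.signbit(frames)
    zcr = (negative[:, 1:] != negative[:, :-1]).mean(axis=1)

    # Magnitude spectrum of each Hann-windowed frame.
    window = np.hanning(frames.shape[1])
    magnitude = np.abs(np.fft.rfft(frames * window, axis=1))

    # Spectral flux: summed positive change in magnitude between frames.
    flux = np.maximum(np.diff(magnitude, axis=0), 0.0).sum(axis=1)

    return np.concatenate([
        [signal.mean(), signal.std()],
        zcr,
        [zcr.mean(), zcr.std()],
        [flux.mean(), flux.std()],
    ])

# A 5-cycle sine wave as a stand-in for one normalized 187-sample beat.
beat = np.sin(np.linspace(0.0, 10.0 * np.pi, 187))
features = compute_feature_vector(beat)
print(features.shape)  # (30,)
```

The zero-crossing rate is only informative after scaling to (-1, 1), since an unscaled signal that never goes negative has no sign changes.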
To remove bias toward class N, a perfectly balanced dataset was created:
- Train: 641 samples per class → 3205 samples
- Test: 162 samples per class → 810 samples
This was done via random sampling for each class.
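
Per-class random sampling of this kind can be sketched as below. The helper name `balance_by_class`, the seed, and the synthetic data are assumptions. Sampling is done without replacement, so every class needs at least `n_per_class` rows.

```python
import numpy as np

def balance_by_class(X, y, n_per_class, seed=0):
    """Draw the same number of random rows from every class."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        candidates = np.flatnonzero(y == label)
        picked.append(rng.choice(candidates, size=n_per_class, replace=False))
    # Shuffle so the classes are not stored in contiguous blocks.
    picked = rng.permutation(np.concatenate(picked))
    return X[picked], y[picked]

# Synthetic imbalanced data: class 0 dominates, as class N does above.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(5), [700, 60, 120, 40, 80])
X = rng.normal(size=(len(y), 30))

smallest = np.bincount(y).min()
X_bal, y_bal = balance_by_class(X, y, n_per_class=smallest)
print(np.bincount(y_bal))  # [40 40 40 40 40]
```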
Two augmentation techniques were implemented:
- Stretch: non-linear resampling followed by padding/truncation to 187 samples.
- Amplify: amplitude scaling with a factor α ∈ [-0.5, 0.5].
The perform() method applies stretch and/or amplify with 50% probability.
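
A minimal sketch of such an augmenter follows, under several stated assumptions: the class name `Augmenter` is hypothetical, the stretch uses plain linear interpolation with a length factor drawn from [0.8, 1.2] in place of the project's non-linear resampling, amplification multiplies the signal by (1 + α), and each transform is applied independently with probability 0.5.

```python
import numpy as np

TARGET_LENGTH = 187

class Augmenter:
    """Hypothetical augmenter with stretch, amplify, and perform methods."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def stretch(self, signal):
        # Resample to a random length (assumed range), then fix the length.
        factor = self.rng.uniform(0.8, 1.2)
        new_length = max(2, int(round(len(signal) * factor)))
        old_positions = np.linspace(0.0, 1.0, num=len(signal))
        new_positions = np.linspace(0.0, 1.0, num=new_length)
        resampled = np.interp(new_positions, old_positions, signal)
        if new_length >= TARGET_LENGTH:
            return resampled[:TARGET_LENGTH]           # truncate
        return np.pad(resampled, (0, TARGET_LENGTH - new_length))  # zero-pad

    def amplify(self, signal):
        # Assumed form: scale by (1 + alpha) with alpha in [-0.5, 0.5].
        alpha = self.rng.uniform(-0.5, 0.5)
        return signal * (1.0 + alpha)

    def perform(self, signal):
        out = np.array(signal, dtype=float)  # work on a copy
        if self.rng.random() < 0.5:
            out = self.stretch(out)
        if self.rng.random() < 0.5:
            out = self.amplify(out)
        return out

beat = np.sin(np.linspace(0.0, 10.0 * np.pi, TARGET_LENGTH))
augmenter = Augmenter(seed=42)
augmented = np.stack([augmenter.perform(beat) for _ in range(100)])
print(augmented.shape)  # (100, 187)
```

With independent coin flips, roughly a quarter of the calls return the signal unchanged, so a real pipeline might redraw in that case.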
For each class:
- 100 augmented samples were generated
- Added to the balanced dataset
Final augmented training size:
- 3705 samples
Normalization was re-applied afterward.
Classifier setup:
- Kernel: RBF
- C: 10 (initial run)
- Test accuracy: 67.53%
- Train accuracy: 65.75%
→ Test accuracy slightly exceeds train accuracy, so there is no sign of overfitting.
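
The train/test comparison can be reproduced in outline as below. The data is synthetic and the helper `make_demo` is hypothetical, so the printed accuracies are unrelated to the figures reported here. Only the RBF kernel and C = 10 are taken from the setup above.

```python
import numpy as np
from sklearn.svm import SVC

def make_demo(n_per_class, rng, n_features=30, n_classes=5):
    """Synthetic, balanced stand-in for the feature matrices."""
    X = np.vstack([
        rng.normal(loc=0.5 * c, scale=1.0, size=(n_per_class, n_features))
        for c in range(n_classes)
    ])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_demo(100, rng)
X_test, y_test = make_demo(40, rng)

clf = SVC(kernel="rbf", C=10)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
# A large positive gap (train much higher than test) would suggest overfitting.
gap = train_acc - test_acc
print(f"train={train_acc:.4f} test={test_acc:.4f} gap={gap:+.4f}")
```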
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| N | 0.59 | 0.69 | 0.64 |
| S | 0.84 | 0.67 | 0.75 |
| V | 0.80 | 0.71 | 0.75 |
| F | 0.79 | 0.89 | 0.84 |
| Q | 0.89 | 0.91 | 0.90 |
- Overall accuracy: 78%
- Macro F1: 0.78
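
As a quick consistency check, the macro scores can be recomputed from the per-class rows of the report. Macro F1 is the unweighted mean of the per-class F1 scores, and on a test set with the same number of samples in every class, accuracy equals macro recall. The dictionary below simply copies the rounded table values.

```python
# Per-class scores copied from the classification report above.
report = {
    "N": {"precision": 0.59, "recall": 0.69, "f1": 0.64},
    "S": {"precision": 0.84, "recall": 0.67, "f1": 0.75},
    "V": {"precision": 0.80, "recall": 0.71, "f1": 0.75},
    "F": {"precision": 0.79, "recall": 0.89, "f1": 0.84},
    "Q": {"precision": 0.89, "recall": 0.91, "f1": 0.90},
}

def macro_average(report, metric):
    """Unweighted mean of one metric over all classes."""
    values = [scores[metric] for scores in report.values()]
    return sum(values) / len(values)

macro_f1 = macro_average(report, "f1")
macro_recall = macro_average(report, "recall")
print(f"macro F1 = {macro_f1:.3f}, macro recall = {macro_recall:.3f}")
# macro F1 = 0.776, macro recall = 0.774
```

Both values agree with the reported 0.78 and 78% once the rounding of the per-class entries is taken into account.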
Compared to the original imbalanced dataset, minority class performance improved dramatically.
Several C values were tested:
| C | Accuracy | Macro F1 | Macro Recall | False Negatives |
|---|---|---|---|---|
| 0.1 | 67.53% | 0.6666 | 0.6753 | 263 |
| 1 | 72.72% | 0.7252 | 0.7272 | 221 |
| 100 | 79.26% | 0.7926 | 0.7926 | 168 |
| 1000 | 80.00% | 0.8003 | 0.8000 | 162 |
| 10000 | 77.28% | 0.7732 | 0.7728 | 184 |
| 100000 | 77.04% | 0.7722 | 0.7704 | 186 |
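
A sweep of this kind can be sketched as follows. The data is synthetic and `make_demo` is a hypothetical helper, so the numbers it prints will not match the table. The `misclassified` count plays the role of the False Negatives column, on the assumption that the column counts all wrongly labeled test samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.svm import SVC

def make_demo(n_per_class, rng, n_features=30, n_classes=5):
    """Synthetic, balanced stand-in for the feature matrices."""
    X = np.vstack([
        rng.normal(loc=0.5 * c, scale=1.0, size=(n_per_class, n_features))
        for c in range(n_classes)
    ])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_demo(100, rng)
X_test, y_test = make_demo(40, rng)

results = []
for C in [0.1, 1, 100, 1000]:
    clf = SVC(kernel="rbf", C=C, gamma="auto").fit(X_train, y_train)
    pred = clf.predict(X_test)
    results.append({
        "C": C,
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
        "macro_recall": recall_score(y_test, pred, average="macro"),
        "misclassified": int((pred != y_test).sum()),
    })

for row in results:
    print(row)
```

On a balanced test set, accuracy and macro recall coincide, which is why those two columns match in every row of the table. The error count is likewise the test-set size times (1 − accuracy), for example 810 × 0.20 = 162 at C = 1000.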
- Kernel: RBF
- C: 1000
- Gamma: auto
- N: 64
- H: 8
- Accuracy: 80.49%
- Balancing the dataset is crucial.
- Rhythmic features significantly help distinguish arrhythmias.
- Augmentation improves generalization.
- SVM with RBF delivers robust results (~80%).
- 1D CNNs for automatic feature extraction
- Additional augmentations (noise, shifts, warping)
- PCA or feature selection
- Class-weighted SVM or focal loss