Data harvesting tool that automates the collection, validation, and structuring of vulnerability reports from sources like HackerOne, Bugcrowd, ExploitDB, and CVE databases. It transforms raw data into clean, deduplicated, quality-scored JSONL datasets for AI/ML use.

RafalW3bCraft/SecuDataExtractor

SecuDataExtractor

Cybersecurity vulnerability data harvesting tool for AI model training datasets.

Overview

SecuDataExtractor automatically collects, validates, and transforms vulnerability reports from multiple authoritative sources into structured training datasets optimized for fine-tuning AI and machine learning models. The platform aggregates data from HackerOne, Bugcrowd, ExploitDB, and CVE databases, generating high-quality JSONL files in instruction/input/output format.

Features

  • Multi-Source Data Collection: Automated harvesting from HackerOne, Bugcrowd, ExploitDB, and CVE databases
  • AI-Ready Processing: Transforms vulnerability reports into JSONL format optimized for ML training
  • Data Quality Assurance: Built-in validation, deduplication, and quality scoring mechanisms
  • Modern Web Interface: Responsive dashboard with real-time progress monitoring
  • Database Integration: PostgreSQL backend for persistent storage and data management
  • High Performance: Multi-threaded background processing with rate limiting

Prerequisites

  • Python 3.11 or higher
  • PostgreSQL database
  • Modern web browser (for web interface)

Installation

1. Clone Repository

git clone <repository-url>
cd SecuDataExtractor

2. Install Dependencies

Using pip:

pip install -r requirements.txt

Or using uv (recommended):

uv sync

3. Set Environment Variables

Create a .env file or export environment variables:

DATABASE_URL=postgresql://user:password@host:port/database
SECRET_KEY=your-secret-key-here

Required:

  • DATABASE_URL - PostgreSQL connection string

Optional:

  • SECRET_KEY - Flask session security key (uses default if not set)
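As a rough illustration of how these two variables might be read at startup, the sketch below loads them from a mapping such as the process environment. The function name load_settings and the placeholder fallback value are hypothetical and are not taken from the project's code.

```python
import os


def load_settings(env=None):
    """Read DATABASE_URL (required) and SECRET_KEY (optional) from a mapping.

    `load_settings` and the placeholder default below are hypothetical;
    the application's real startup code may differ.
    """
    env = os.environ if env is None else env

    database_url = env.get("DATABASE_URL")
    if not database_url:
        # The README marks this variable as required, so fail early.
        raise RuntimeError("DATABASE_URL must be set")

    # Assumed placeholder: the actual default used by the app is not documented.
    secret_key = env.get("SECRET_KEY", "insecure-default-change-me")
    return {"DATABASE_URL": database_url, "SECRET_KEY": secret_key}
```

Calling load_settings() with no arguments reads from os.environ; passing a plain dict makes the behaviour easy to check without touching the real environment.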

Usage

Start the Application

python app.py

The web interface will be available at http://localhost:5000

Web Interface

  1. Configure Sources: Select vulnerability data sources (HackerOne, Bugcrowd, ExploitDB, CVE)
  2. Set Parameters: Choose harvest mode (Unlimited, 5000, 1000, 500, or 100 entries per source)
  3. Start Extraction: Initiate the data collection process
  4. Monitor Progress: Real-time tracking of scraping progress and data quality
  5. Download Results: Export generated JSONL datasets for AI training

Programmatic Usage

from scrapers.hackerone_scraper import HackerOneScraper
from utils.data_processor import DataProcessor

# Collect up to 1000 raw reports from HackerOne.
scraper = HackerOneScraper()
raw_data = scraper.scrape(max_entries=1000)

# Convert the raw reports into instruction/input/output training entries.
processor = DataProcessor()
training_data = processor.process_entries(raw_data, 'hackerone')

Data Format

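SecuDataExtractor generates AI-ready datasets in JSONL format. Each line of the output file is one JSON object with instruction, input, and output fields (shown here pretty-printed for readability):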

{
  "instruction": "Analyze this vulnerability report and provide security recommendations",
  "input": "SQL injection vulnerability in user authentication system...",
  "output": "This is a critical SQL injection vulnerability that allows attackers to bypass authentication..."
}
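To make the one-object-per-line layout concrete, here is a minimal sketch that writes and reads records in this shape using only the standard library. The helper names write_jsonl and read_jsonl are illustrative assumptions and not part of SecuDataExtractor.

```python
import io
import json


def write_jsonl(records, fh):
    """Write each record as one compact JSON object per line."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Yield one dict per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


record = {
    "instruction": "Analyze this vulnerability report and provide security recommendations",
    "input": "SQL injection vulnerability in user authentication system...",
    "output": "This is a critical SQL injection vulnerability...",
}

# An in-memory buffer stands in for a file in the datasets/ directory.
buffer = io.StringIO()
write_jsonl([record], buffer)
buffer.seek(0)
loaded = list(read_jsonl(buffer))
```

The same functions work with a real file handle opened via open(path, "w", encoding="utf-8").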

Quality Metrics

  • Deduplication: Content-based hashing prevents duplicate entries
  • Validation: Automatic field validation and format checking
  • Scoring: Quality scores based on completeness and relevance
  • Filtering: Advanced filtering for cybersecurity-specific content
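The deduplication and scoring ideas above can be sketched as follows. This is only one plausible approach, assuming SHA-256 over normalised field text and a score equal to the fraction of filled fields; the project's actual hashing scheme and scoring formula are not described in this README, and all names here are hypothetical.

```python
import hashlib

REQUIRED_FIELDS = ("instruction", "input", "output")


def content_hash(entry):
    """SHA-256 over lower-cased, whitespace-collapsed field values."""
    normalised = "\x1f".join(
        " ".join(str(entry.get(field, "")).lower().split())
        for field in REQUIRED_FIELDS
    )
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def completeness_score(entry):
    """Fraction of required fields that are non-empty strings (0.0 to 1.0)."""
    filled = sum(
        1
        for field in REQUIRED_FIELDS
        if isinstance(entry.get(field), str) and entry[field].strip()
    )
    return filled / len(REQUIRED_FIELDS)


def deduplicate(entries):
    """Keep the first entry seen for each content hash, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        digest = content_hash(entry)
        if digest not in seen:
            seen.add(digest)
            unique.append(entry)
    return unique
```

Normalising case and whitespace before hashing means trivially reformatted copies of the same report collapse into one entry, at the cost of treating case-only differences as duplicates.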

Configuration

Database Configuration

The application uses PostgreSQL for data storage. Ensure your DATABASE_URL environment variable is properly configured:

postgresql://username:password@hostname:port/database_name
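A quick way to sanity-check a connection string before starting the application is to split it into its parts. The sketch below uses the standard library's urllib.parse; the function name describe_database_url is an illustrative assumption and not part of the project.

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Split a PostgreSQL URL into its components without exposing the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "username": parts.username,
        "hostname": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "database": parts.path.lstrip("/"),
        "has_password": parts.password is not None,
    }


info = describe_database_url("postgresql://user:secret@db.example.com:5432/vulns")
```

Here info contains the host, port, and database name, which is usually enough to spot a malformed URL.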

Application Settings

Optional configuration via environment variables:

  • FLASK_ENV - Set to development to enable debug mode, or to production for deployment
  • SECRET_KEY - Flask secret key for session security

Scraping Configuration

The application includes built-in rate limiting and ethical scraping practices:

  • Respects robots.txt directives
  • Configurable rate limits between requests (default: 1-2 seconds)
  • Automatic retry logic for failed requests
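The three practices listed above could be combined roughly as shown below. This is a sketch under stated assumptions, not the project's implementation: the user agent string, the retry count, and the function names are hypothetical, and the caller supplies the actual fetch function.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(min_seconds=1.0, max_seconds=2.0):
    """Pick a random pause in the 1-2 second range mentioned above."""
    return random.uniform(min_seconds, max_seconds)


def fetch_with_retry(fetch, url, rules, user_agent="example-bot",
                     retries=3, sleep=time.sleep):
    """Fetch url if robots.txt allows it, pausing before each attempt.

    Returns None when the URL is disallowed; re-raises the last error
    when every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    if not rules.can_fetch(user_agent, url):
        return None
    last_error = None
    for _ in range(retries):
        sleep(polite_delay())
        try:
            return fetch(url)
        except OSError as exc:  # network errors such as URLError derive from OSError
            last_error = exc
    raise last_error
```

Passing the sleep function as a parameter keeps the delay logic easy to disable in tests while leaving the default behaviour unchanged.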

Data Sources

| Source       | Type                      | Data Quality | Status |
| HackerOne    | Bug Bounty Reports        | ⭐⭐⭐⭐⭐        | Active |
| Bugcrowd     | Vulnerability Disclosures | ⭐⭐⭐⭐         | Active |
| ExploitDB    | Exploit Database          | ⭐⭐⭐⭐⭐        | Active |
| CVE Database | Official CVE Records      | ⭐⭐⭐⭐⭐        | Active |

Output Management

Generated datasets are stored in the datasets/ directory with the following naming convention:

cybersec_dataset_<mode>_<timestamp>.jsonl

Examples:

  • cybersec_dataset_unlimited_20250114_153045.jsonl
  • cybersec_dataset_1000_20250114_153045.jsonl
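For scripts that need to produce or predict these file names, a small helper along the following lines may be useful. The timestamp layout (YYYYMMDD_HHMMSS) is inferred from the examples above, and the function name dataset_filename is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def dataset_filename(mode, now=None, directory="datasets"):
    """Build a path like datasets/cybersec_dataset_<mode>_<timestamp>.jsonl."""
    now = now or datetime.now()
    # Assumed format, inferred from the example names: date, underscore, time.
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"cybersec_dataset_{str(mode).lower()}_{stamp}.jsonl"


path = dataset_filename("Unlimited", datetime(2025, 1, 14, 15, 30, 45))
```

With the fixed datetime shown, path.name matches the first example name listed above.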

Security & Compliance

  • Rate Limiting: Respectful scraping with configurable delays
  • Robots.txt Compliance: Automatic checking of scraping permissions
  • Data Privacy: No personal information collection
  • Legal Compliance: Designed for educational and research purposes

Disclaimer: SecuDataExtractor is designed for educational and research purposes only. Users are responsible for ensuring compliance with target websites' Terms of Service and adhering to applicable laws and regulations.

License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2025 RafalW3bCraft

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Credits

Built and developed by RafalW3bCraft


SecuDataExtractor - Transform vulnerability data into high-quality training datasets for AI model fine-tuning
