SecuDataExtractor

Cybersecurity vulnerability data harvesting tool for AI model training datasets.

Overview

SecuDataExtractor automatically collects, validates, and transforms vulnerability reports from multiple authoritative sources into structured training datasets optimized for fine-tuning AI and machine learning models. The platform aggregates data from HackerOne, Bugcrowd, ExploitDB, and CVE databases, generating high-quality JSONL files in instruction/input/output format.

Features

Multi-Source Data Collection: Automated harvesting from HackerOne, Bugcrowd, ExploitDB, and CVE databases
AI-Ready Processing: Transforms vulnerability reports into JSONL format optimized for ML training
Data Quality Assurance: Built-in validation, deduplication, and quality scoring mechanisms
Modern Web Interface: Responsive dashboard with real-time progress monitoring
Database Integration: PostgreSQL backend for persistent storage and data management
High Performance: Multi-threaded background processing with rate limiting

Prerequisites

Python 3.11 or higher
PostgreSQL database
Modern web browser (for web interface)

Installation

1. Clone Repository

git clone <repository-url>
cd SecuDataExtractor

2. Install Dependencies

Using pip:

pip install -r requirements.txt

Or using uv (recommended):

uv sync

3. Set Environment Variables

Create a .env file or export environment variables:

DATABASE_URL=postgresql://user:password@host:port/database
SECRET_KEY=your-secret-key-here

Required:

DATABASE_URL - PostgreSQL connection string

Optional:

SECRET_KEY - Flask session security key (uses default if not set)
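The required/optional split above can be sketched as a small loader. This is a minimal illustration, not the project's actual startup code: the function name load_settings and the fallback key value are assumptions made for the example.

```python
import os


def load_settings(env=None):
    """Read DATABASE_URL (required) and SECRET_KEY (optional) from the environment.

    `load_settings` is a hypothetical helper, not part of SecuDataExtractor.
    """
    env = os.environ if env is None else env

    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required but not set")

    # Placeholder fallback for illustration only; the project's real default
    # is not documented here. Always set SECRET_KEY explicitly in production.
    secret_key = env.get("SECRET_KEY", "dev-only-insecure-key")

    return {"DATABASE_URL": database_url, "SECRET_KEY": secret_key}
```

Failing fast on a missing DATABASE_URL surfaces a configuration mistake at startup rather than at the first database query.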

Usage

Start the Application

python app.py

The web interface will be available at http://localhost:5000

Web Interface

Configure Sources: Select vulnerability data sources (HackerOne, Bugcrowd, ExploitDB, CVE)
Set Parameters: Choose harvest mode (Unlimited, 5000, 1000, 500, or 100 entries per source)
Start Extraction: Initiate the data collection process
Monitor Progress: Real-time tracking of scraping progress and data quality
Download Results: Export generated JSONL datasets for AI training

Command Line Usage

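The scrapers and the data processor can also be driven directly from Python:

from scrapers.hackerone_scraper import HackerOneScraper
from utils.data_processor import DataProcessor

# Collect raw reports from HackerOne, capped at 1,000 entries
scraper = HackerOneScraper()
raw_data = scraper.scrape(max_entries=1000)

# Convert the raw reports into instruction/input/output training records
processor = DataProcessor()
training_data = processor.process_entries(raw_data, 'hackerone')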

Data Format

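SecuDataExtractor writes datasets in JSONL format: one JSON object per line, each with instruction, input, and output fields. A single record, pretty-printed here for readability: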

{
  "instruction": "Analyze this vulnerability report and provide security recommendations",
  "input": "SQL injection vulnerability in user authentication system...",
  "output": "This is a critical SQL injection vulnerability that allows attackers to bypass authentication..."
}
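Because JSONL stores one JSON object per line, a dataset can be written and read back with the standard library alone. The helpers below, write_jsonl and read_jsonl, are hypothetical names used for illustration and are not part of the project's API.

```python
import io
import json


def write_jsonl(records, fp):
    """Write each record as one compact JSON object per line."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fp):
    """Yield one dict per non-empty line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


sample = [
    {
        "instruction": "Analyze this vulnerability report and provide security recommendations",
        "input": "SQL injection vulnerability in user authentication system...",
        "output": "This is a critical SQL injection vulnerability...",
    }
]

# Round-trip through an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_jsonl(sample, buffer)
buffer.seek(0)
loaded = list(read_jsonl(buffer))
```

Streaming line by line keeps memory use flat even for datasets collected in the Unlimited harvest mode.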

Quality Metrics

Deduplication: Content-based hashing prevents duplicate entries
Validation: Automatic field validation and format checking
Scoring: Quality scores based on completeness and relevance
Filtering: Advanced filtering for cybersecurity-specific content
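Validation and content-based deduplication can be sketched as follows. This is an assumption-laden illustration: the normalization rule (lowercasing and collapsing whitespace), the choice of SHA-256, and the function names are not taken from the project's code.

```python
import hashlib

REQUIRED_FIELDS = ("instruction", "input", "output")


def is_valid(record):
    """True when every required field is a non-empty string."""
    return all(
        isinstance(record.get(field), str) and record[field].strip()
        for field in REQUIRED_FIELDS
    )


def content_hash(record):
    """SHA-256 over lowercased, whitespace-normalized field contents."""
    normalized = "\x1f".join(
        " ".join(record[field].lower().split()) for field in REQUIRED_FIELDS
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Drop invalid records and keep the first record for each content hash."""
    seen = set()
    unique = []
    for record in records:
        if not is_valid(record):
            continue
        digest = content_hash(record)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Hashing normalized text rather than raw text means two reports that differ only in spacing or letter case count as duplicates.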

Configuration

Database Configuration

The application uses PostgreSQL for data storage. Ensure your DATABASE_URL environment variable is properly configured:

postgresql://username:password@hostname:port/database_name
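A connection string in this shape can be sanity-checked before the application starts. The sketch below uses only urllib.parse from the standard library; check_database_url is a hypothetical helper, and falling back to port 5432 (the PostgreSQL default) when no port is given is an assumption of the example.

```python
from urllib.parse import urlsplit


def check_database_url(url):
    """Split a PostgreSQL URL into its parts, raising ValueError if malformed."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")

    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("DATABASE_URL needs both a hostname and a database name")

    return {
        "user": parts.username,
        "host": parts.hostname,
        # Assumption: use PostgreSQL's default port when none is given.
        "port": parts.port or 5432,
        "database": database,
    }
```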

Application Settings

Optional configuration via environment variables:

FLASK_ENV - Set to development to enable debug mode, or to production for deployment
SECRET_KEY - Flask secret key for session security

Scraping Configuration

The application includes built-in rate limiting and ethical scraping practices:

Respects robots.txt directives
Configurable rate limits between requests (default: 1-2 seconds)
Automatic retry logic for failed requests
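The three practices above can be sketched with the standard library. This is an illustration, not the project's scraper code: the function names, the retry count of 3, and the choice to retry only on OSError are assumptions. The fetch argument stands in for whatever HTTP call the scraper makes.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_with_retry(fetch, url, retries=3, delay_range=(1.0, 2.0), sleep=time.sleep):
    """Call fetch(url), waiting a random delay before every attempt.

    Retries on OSError (which includes connection errors) and re-raises
    the last error once all attempts are used up.
    """
    last_error = None
    for _ in range(retries):
        # Rate limit: pause 1-2 seconds (by default) before each request.
        sleep(random.uniform(*delay_range))
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
    raise last_error
```

Passing sleep as a parameter lets tests substitute a stub so they run without real delays.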

Data Sources

Source	Type	Data Quality	Status
HackerOne	Bug Bounty Reports	⭐⭐⭐⭐⭐	Active
Bugcrowd	Vulnerability Disclosures	⭐⭐⭐⭐	Active
ExploitDB	Exploit Database	⭐⭐⭐⭐⭐	Active
CVE Database	Official CVE Records	⭐⭐⭐⭐⭐	Active

Output Management

Generated datasets are stored in the datasets/ directory with the following naming convention:

cybersec_dataset_<mode>_<timestamp>.jsonl

Examples:

cybersec_dataset_unlimited_20250114_153045.jsonl
cybersec_dataset_1000_20250114_153045.jsonl
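A filename matching this convention can be built with datetime.strftime. The helper name dataset_path is hypothetical, and whether the project stamps files in local time or UTC is not stated, so the sketch simply uses whatever datetime it is given (local time by default).

```python
from datetime import datetime
from pathlib import Path


def dataset_path(mode, now=None, directory="datasets"):
    """Return datasets/cybersec_dataset_<mode>_<YYYYMMDD_HHMMSS>.jsonl."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"cybersec_dataset_{mode}_{stamp}.jsonl"


# Reproduces the second example filename above.
example = dataset_path("1000", datetime(2025, 1, 14, 15, 30, 45))
```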

Security & Compliance

Rate Limiting: Respectful scraping with configurable delays
Robots.txt Compliance: Automatic checking of scraping permissions
Data Privacy: No personal information collection
Legal Compliance: Designed for educational and research purposes

Disclaimer: SecuDataExtractor is designed for educational and research purposes only. Users are responsible for ensuring compliance with target websites' Terms of Service and adhering to applicable laws and regulations.

License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2025 RafalW3bCraft

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Credits

Built and developed by RafalW3bCraft

SecuDataExtractor - Transform vulnerability data into high-quality training datasets for AI model fine-tuning
