Cybersecurity vulnerability data harvesting tool for AI model training datasets.
SecuDataExtractor automatically collects, validates, and transforms vulnerability reports from multiple authoritative sources into structured training datasets optimized for fine-tuning AI and machine learning models. The platform aggregates data from HackerOne, Bugcrowd, ExploitDB, and CVE databases, generating high-quality JSONL files in instruction/input/output format.
- Multi-Source Data Collection: Automated harvesting from HackerOne, Bugcrowd, ExploitDB, and CVE databases
- AI-Ready Processing: Transforms vulnerability reports into JSONL format optimized for ML training
- Data Quality Assurance: Built-in validation, deduplication, and quality scoring mechanisms
- Modern Web Interface: Responsive dashboard with real-time progress monitoring
- Database Integration: PostgreSQL backend for persistent storage and data management
- High Performance: Multi-threaded background processing with rate limiting
- Python 3.11 or higher
- PostgreSQL database
- Modern web browser (for web interface)
git clone <repository-url>
cd SecuDataExtractor
Using pip:
pip install -r requirements.txt
Or using uv (recommended):
uv sync
Create a .env file or export environment variables:
DATABASE_URL=postgresql://user:password@host:port/database
SECRET_KEY=your-secret-key-here
Required:
DATABASE_URL - PostgreSQL connection string
Optional:
SECRET_KEY - Flask session security key (a default is used if not set)
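The required/optional split can be enforced at startup. The following is a minimal sketch, not the project's actual loader: the function name `load_config` and the fallback value for `SECRET_KEY` are assumptions made for illustration.

```python
import os


def load_config(env=None):
    """Read settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # DATABASE_URL is required: fail early with a clear message.
    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required but not set")

    # SECRET_KEY is optional. The fallback below is a hypothetical placeholder;
    # the application's real default is not documented here.
    secret_key = env.get("SECRET_KEY", "dev-only-insecure-key")

    return {"DATABASE_URL": database_url, "SECRET_KEY": secret_key}
```

Relying on a built-in default key is only reasonable for local development; for any shared deployment, set `SECRET_KEY` explicitly.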
python app.py
The web interface will be available at http://localhost:5000
- Configure Sources: Select vulnerability data sources (HackerOne, Bugcrowd, ExploitDB, CVE)
- Set Parameters: Choose harvest mode (Unlimited, 5000, 1000, 500, or 100 entries per source)
- Start Extraction: Initiate the data collection process
- Monitor Progress: Real-time tracking of scraping progress and data quality
- Download Results: Export generated JSONL datasets for AI training
from scrapers.hackerone_scraper import HackerOneScraper
from utils.data_processor import DataProcessor
scraper = HackerOneScraper()
raw_data = scraper.scrape(max_entries=1000)
processor = DataProcessor()
training_data = processor.process_entries(raw_data, 'hackerone')
SecuDataExtractor generates AI-ready datasets in JSONL format, with one JSON object per line. A single record, pretty-printed here for readability:
{
  "instruction": "Analyze this vulnerability report and provide security recommendations",
  "input": "SQL injection vulnerability in user authentication system...",
  "output": "This is a critical SQL injection vulnerability that allows attackers to bypass authentication..."
}
- Deduplication: Content-based hashing prevents duplicate entries
- Validation: Automatic field validation and format checking
- Scoring: Quality scores based on completeness and relevance
- Filtering: Advanced filtering for cybersecurity-specific content
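The deduplication, validation, and scoring steps listed above can be sketched as follows. This is an illustrative approximation, not the project's `DataProcessor`: the function names, the whitespace/case normalization, and the length-based completeness heuristic (including the `target_chars` threshold) are all assumptions.

```python
import hashlib

REQUIRED_FIELDS = ("instruction", "input", "output")


def content_hash(entry):
    """SHA-256 over the text fields after lowercasing and collapsing whitespace."""
    normalized = "\n".join(
        " ".join(str(entry.get(field, "")).lower().split())
        for field in REQUIRED_FIELDS
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_valid(entry):
    """Every required field must be a non-empty string."""
    return all(
        isinstance(entry.get(field), str) and entry[field].strip()
        for field in REQUIRED_FIELDS
    )


def completeness_score(entry, target_chars=200):
    """Crude 0..1 score: how close input and output are to a target length."""
    ratios = [
        min(len(entry[field].strip()) / target_chars, 1.0)
        for field in ("input", "output")
    ]
    return round(sum(ratios) / len(ratios), 2)


def clean(entries, min_score=0.0):
    """Drop invalid, duplicate, and low-scoring entries, preserving order."""
    seen = set()
    kept = []
    for entry in entries:
        if not is_valid(entry):
            continue
        digest = content_hash(entry)
        if digest in seen:
            continue
        seen.add(digest)
        if completeness_score(entry) >= min_score:
            kept.append(entry)
    return kept
```

Hashing a normalized form means that entries differing only in case or spacing collapse into one. A relevance score would need domain-specific signals (for example, security keyword matching) that are not shown here.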
The application uses PostgreSQL for data storage. Ensure your DATABASE_URL environment variable is properly configured:
postgresql://username:password@hostname:port/database_name
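A connection string in this shape can be checked before the application tries to connect. The sketch below uses only the standard library; the helper name `parse_database_url` is hypothetical, and falling back to 5432 (PostgreSQL's default port) when no port is given is an assumption about desired behaviour.

```python
from urllib.parse import urlsplit


def parse_database_url(url):
    """Split a PostgreSQL URL into its parts, raising ValueError if malformed."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")

    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("hostname and database name are both required")

    return {
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        # parts.port raises ValueError itself if the port is not numeric.
        "port": parts.port or 5432,
        "database": database,
    }
```

Passwords containing characters such as `@` or `/` must be percent-encoded in the URL for this kind of parsing to work.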
Optional configuration via environment variables:
FLASK_ENV - Set to development for debug mode or production for production deployment
SECRET_KEY - Flask secret key for session security
The application includes built-in rate limiting and ethical scraping practices:
- Respects robots.txt directives
- Configurable rate limits between requests (default: 1-2 seconds)
- Automatic retry logic for failed requests
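The three practices above can be combined in a single fetch helper. This is a sketch under stated assumptions rather than the scrapers' real code: the `USER_AGENT` string, the function names, and the choice of three attempts are hypothetical, and `fetch` stands in for whatever HTTP call the scraper actually makes.

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "SecuDataExtractorBot"  # hypothetical user-agent string


def build_robots(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch(url, fetch, robots, delay_range=(1.0, 2.0),
                 max_attempts=3, sleep=time.sleep):
    """Fetch url via fetch(url), honouring robots.txt, delays, and retries.

    Returns None when robots.txt disallows the URL. Re-raises the last
    OSError (which covers connection and URL errors) once attempts run out.
    """
    if not robots.can_fetch(USER_AGENT, url):
        return None

    for attempt in range(1, max_attempts + 1):
        # Wait a random 1-2 seconds before every request, including retries.
        sleep(random.uniform(*delay_range))
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts:
                raise
    return None
```

Passing `fetch` and `sleep` as parameters keeps the helper testable without network access or real delays.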
| Source | Type | Data Quality | Status |
|---|---|---|---|
| HackerOne | Bug Bounty Reports | ⭐⭐⭐⭐⭐ | Active |
| Bugcrowd | Vulnerability Disclosures | ⭐⭐⭐⭐ | Active |
| ExploitDB | Exploit Database | ⭐⭐⭐⭐⭐ | Active |
| CVE Database | Official CVE Records | ⭐⭐⭐⭐⭐ | Active |
Generated datasets are stored in the datasets/ directory with the following naming convention:
cybersec_dataset_<mode>_<timestamp>.jsonl
Examples:
cybersec_dataset_unlimited_20250114_153045.jsonl
cybersec_dataset_1000_20250114_153045.jsonl
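A file following this naming convention could be produced as sketched below. The helper names are hypothetical, and the timestamp pattern `%Y%m%d_%H%M%S` is inferred from the example filenames rather than taken from the project's code.

```python
import json
from datetime import datetime
from pathlib import Path


def dataset_path(mode, when=None, directory="datasets"):
    """Build datasets/cybersec_dataset_<mode>_<timestamp>.jsonl."""
    when = when or datetime.now()
    return Path(directory) / f"cybersec_dataset_{mode}_{when:%Y%m%d_%H%M%S}.jsonl"


def write_jsonl(entries, path):
    """Write one JSON object per line, creating the directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return path
```

For example, `dataset_path("1000", datetime(2025, 1, 14, 15, 30, 45))` yields a path ending in `cybersec_dataset_1000_20250114_153045.jsonl`.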
- Rate Limiting: Respectful scraping with configurable delays
- Robots.txt Compliance: Automatic checking of scraping permissions
- Data Privacy: No personal information collection
- Legal Compliance: Designed for educational and research purposes
Disclaimer: SecuDataExtractor is designed for educational and research purposes only. Users are responsible for ensuring compliance with target websites' Terms of Service and adhering to applicable laws and regulations.
This project is licensed under the MIT License.
MIT License
Copyright (c) 2025 RafalW3bCraft
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Built and developed by RafalW3bCraft
SecuDataExtractor - Transform vulnerability data into high-quality training datasets for AI model fine-tuning