# Automated Retraining System
The Automated Retraining System is the core of Soteria’s evolving intelligence. By leveraging a background "Watchdog" process, the system monitors for new data entries, processes structural changes, and updates the underlying machine learning models without requiring manual intervention.
## Continuous Intelligence Lifecycle
Soteria does not rely on a static model. Instead, it follows a continuous loop of ingestion, normalization, and optimization to stay ahead of evolving obfuscation techniques.
- **Data Ingestion:** New Python snippets are added to the `backend/data` directory (Clean or Corrupted) or imported via external datasets like Hugging Face.
- **Structural Normalization:** The `dataPipeline_AST` engine parses the code into an Abstract Syntax Tree (AST), anonymizes identifiers, and generates a unique SHA-256 "Structural DNA" hash to prevent duplicate bias.
- **Automated Trigger:** The `watch_data.py` service detects changes in the data repository and signals the training pipeline.
- **Model Optimization:** The system refreshes the Random Forest Ensemble, Neural Network, or Hybrid Stacking models based on the new feature matrix.
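The normalization step can be sketched in a few lines. The actual `dataPipeline_AST` implementation is not shown in this document, so the class and function names below (`_Anonymizer`, `structural_dna`) are illustrative only; the sketch collapses all identifiers to fixed placeholders before hashing, so that renamed-but-structurally-identical code yields the same "Structural DNA":

```python
import ast
import hashlib

class _Anonymizer(ast.NodeTransformer):
    """Collapse identifiers so renamed-but-identical code hashes the same.
    (Illustrative sketch; the real pipeline's normalization may differ.)"""

    def visit_FunctionDef(self, node):
        node.name = "FUNC"
        self.generic_visit(node)  # recurse into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "ARG"
        return node

    def visit_Name(self, node):
        node.id = "VAR"
        return node

def structural_dna(source: str) -> str:
    """Return the SHA-256 hash of a snippet's anonymized AST dump."""
    tree = _Anonymizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

# Renaming identifiers does not change the Structural DNA:
assert structural_dna("def f(x): return x + 1") == structural_dna("def g(y): return y + 1")
```

Because the hash is taken over the anonymized AST rather than the raw text, trivially renamed duplicates are caught before they can bias the training set.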
## The Watchdog Process
The Watchdog is a background utility that monitors the file system for new training samples. It ensures that the model is always synchronized with the latest known "malicious" patterns.
### Configuration
The Watchdog looks for files within the following structure:
- `backend/data/clean/`: Trusted, benign code samples.
- `backend/data/corrupted/`: Known malicious injections or backdoors.
- `backend/data/external/`: CSV imports (e.g., `huggingface_raw.csv`).
When a file is added or modified in these directories, the system automatically executes the `hardened_Dataset_with_Normalization` routine to update the master numeric feature set.
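The internals of `watch_data.py` are not shown here, so the following is a minimal standard-library sketch of the watching behavior: snapshot the modification times of everything under the data directories, then compare snapshots on a polling interval. The `trigger` callback stands in for whatever retraining entry point the real service invokes (a real implementation might instead use filesystem events):

```python
import os
import time

# Directories from the Configuration section above.
WATCH_DIRS = ["backend/data/clean", "backend/data/corrupted", "backend/data/external"]

def snapshot(dirs):
    """Map every file under the watched directories to its mtime."""
    state = {}
    for d in dirs:
        for root, _, files in os.walk(d):
            for name in files:
                path = os.path.join(root, name)
                state[path] = os.stat(path).st_mtime
    return state

def changed(old, new):
    """Paths that were added or modified between two snapshots."""
    return sorted(p for p, mtime in new.items() if old.get(p) != mtime)

def watch(trigger, dirs=WATCH_DIRS, interval=5.0):
    """Poll `dirs` and call `trigger(paths)` whenever samples change.
    `trigger` is a stand-in for the retraining entry point."""
    state = snapshot(dirs)
    while True:
        time.sleep(interval)
        current = snapshot(dirs)
        delta = changed(state, current)
        if delta:
            trigger(delta)
        state = current
```

Polling trades a small detection delay for portability; it needs no third-party packages and behaves identically across operating systems.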
## Training Terminal & Neural Engine
Administrators can monitor and trigger retraining manually through the Neural Engine dashboard in the Cyber Sentinel interface. This view provides high-level metrics on the retraining progress.
### Available Training Pipelines
| Pipeline | Model Type | Use Case |
| :--- | :--- | :--- |
| AST Ensemble | Random Forest / Gradient Boosting | General-purpose, high-speed detection of structural anomalies. |
| Neural Engine | PyTorch Deep Learning | Deep pattern recognition for complex, multi-stage backdoors. |
| Hybrid Stacking | Stacked Classifier | Maximum accuracy; uses NN predictions as features for a meta-classifier. |
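The Hybrid Stacking idea, base learners whose predictions become features for a meta-classifier, can be sketched with scikit-learn's `StackingClassifier`. This is not Soteria's actual trainer: the real Neural Engine is PyTorch-based, so the `MLPClassifier` here is only a stand-in, and the synthetic dataset substitutes for the AST-derived feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy feature matrix standing in for the AST-derived numeric features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners feed their out-of-fold predictions to a meta-classifier,
# mirroring the "Hybrid Stacking" row in the table above.
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
print(f"hold-out accuracy: {stack.score(X_te, y_te):.2f}")
```

Stacking typically buys a small accuracy gain over the best single learner at the cost of training every base model, which is why the table reserves it for "maximum accuracy" scenarios.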
### Usage: Triggering a Manual Refresh
While the system is automated, developers and admins can manually trigger the pipeline via the CLI to force a model export:
```bash
# Navigate to the backend source
cd backend/src

# Run the data pipeline to refresh numeric features
python3 dataPipeline_AST.py

# Manually trigger the hybrid trainer
python3 trainerModel_Hybrid.py
```
Upon completion, the system exports a new `acidModel.pkl` (Ensemble) or `acidModel_neural.pt` (Neural Network) to the `ML_master` directory. The Intelligence Engine automatically hot-reloads the new model for real-time scanning.
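The Intelligence Engine's internals are not documented here, but a common way to implement hot-reloading is to compare the exported file's modification time on each access and re-deserialize only when it changes. The class name below is hypothetical, purely a sketch of that pattern for the pickled ensemble export:

```python
import os
import pickle

class HotReloadingModel:
    """Reload a pickled model whenever its file changes on disk.
    (Hypothetical sketch of mtime-based hot reloading, not the
    actual Intelligence Engine implementation.)"""

    def __init__(self, path="ML_master/acidModel.pkl"):
        self.path = path
        self.mtime = None   # mtime of the currently loaded export
        self.model = None

    def get(self):
        mtime = os.stat(self.path).st_mtime
        if mtime != self.mtime:          # a new export was written
            with open(self.path, "rb") as f:
                self.model = pickle.load(f)
            self.mtime = mtime
        return self.model
```

Scanning code then calls `get()` before each prediction and transparently picks up every fresh export without a restart.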
## Data Purity & De-duplication
To ensure high model integrity, the retraining system implements Structural De-duplication.
Before a function is converted into a numerical vector, its normalized AST form is hashed. If the hash exists in the current dataset, the system skips the entry. This prevents "Dataset Leakage," where the model achieves artificially high accuracy by memorizing duplicate functions rather than learning behavioral patterns.
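The skip-on-duplicate-hash step described above reduces to a small filter. The `dedupe` and `normalize` names here are illustrative; `normalize` stands in for whatever canonicalization the pipeline applies (such as the anonymized-AST form) before hashing:

```python
import hashlib

def dedupe(samples, normalize):
    """Yield only structurally novel samples.
    `normalize` maps source code to its canonical form; its SHA-256
    digest plays the role of the "Structural DNA" described above."""
    seen = set()
    for src in samples:
        digest = hashlib.sha256(normalize(src).encode()).hexdigest()
        if digest in seen:
            continue            # duplicate structure: skip to avoid leakage
        seen.add(digest)
        yield src
```

Filtering before vectorization means duplicate functions never reach the feature matrix, so evaluation accuracy reflects learned behavioral patterns rather than memorized repeats.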