# Automated Retraining System
The Automated Retraining System is the core of Soteria’s evolving intelligence. By leveraging a background "Watchdog" process, the system monitors for new data entries, processes structural changes, and updates the underlying machine learning models without requiring manual intervention.
## Continuous Intelligence Lifecycle
Soteria does not rely on a static model. Instead, it follows a continuous loop of ingestion, normalization, and optimization to stay ahead of evolving obfuscation techniques.
- **Data Ingestion:** New Python snippets are added to the `backend/data` directory (Clean or Corrupted) or imported via external datasets like Hugging Face.
- **Structural Normalization:** The `dataPipeline_AST` engine parses the code into an Abstract Syntax Tree (AST), anonymizes identifiers, and generates a unique SHA-256 "Structural DNA" hash to prevent duplicate bias.
- **Automated Trigger:** The `watch_data.py` service detects changes in the data repository and signals the training pipeline.
- **Model Optimization:** The system refreshes the Random Forest Ensemble, Neural Network, or Hybrid Stacking models based on the new feature matrix.
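The normalization step can be sketched in a few lines. The actual `dataPipeline_AST` implementation is not shown in this document, so the class and function names below (`_Anonymizer`, `structural_dna`) are illustrative only; the sketch collapses all identifiers to fixed placeholders before hashing, so that renamed-but-structurally-identical code yields the same "Structural DNA":

```python
import ast
import hashlib

class _Anonymizer(ast.NodeTransformer):
    """Collapse identifiers so renamed-but-identical code hashes the same.
    (Illustrative sketch; the real pipeline's normalization may differ.)"""

    def visit_FunctionDef(self, node):
        node.name = "FUNC"
        self.generic_visit(node)  # recurse into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "ARG"
        return node

    def visit_Name(self, node):
        node.id = "VAR"
        return node

def structural_dna(source: str) -> str:
    """Return the SHA-256 hash of a snippet's anonymized AST dump."""
    tree = _Anonymizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

# Renaming identifiers does not change the Structural DNA:
assert structural_dna("def f(x): return x + 1") == structural_dna("def g(y): return y + 1")
```

Because the hash is taken over the anonymized AST rather than the raw text, trivially renamed duplicates are caught before they can bias the training set.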
## The Watchdog Process
The Watchdog is a background utility that monitors the file system for new training samples. It ensures that the model is always synchronized with the latest known "malicious" patterns.
### Configuration
The Watchdog looks for files within the following structure:
- `backend/data/clean/`: Trusted, benign code samples.
- `backend/data/corrupted/`: Known malicious injections or backdoors.
- `backend/data/external/`: CSV imports (e.g., `huggingface_raw.csv`).
When a file is added or modified in these directories, the system automatically executes the `hardened_Dataset_with_Normalization` routine to update the master numeric feature set.
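The internals of `watch_data.py` are not shown here, so the following is a minimal standard-library sketch of the watching behavior: snapshot the modification times of everything under the data directories, then compare snapshots on a polling interval. The `trigger` callback stands in for whatever retraining entry point the real service invokes (a real implementation might instead use filesystem events):

```python
import os
import time

# Directories from the Configuration section above.
WATCH_DIRS = ["backend/data/clean", "backend/data/corrupted", "backend/data/external"]

def snapshot(dirs):
    """Map every file under the watched directories to its mtime."""
    state = {}
    for d in dirs:
        for root, _, files in os.walk(d):
            for name in files:
                path = os.path.join(root, name)
                state[path] = os.stat(path).st_mtime
    return state

def changed(old, new):
    """Paths that were added or modified between two snapshots."""
    return sorted(p for p, mtime in new.items() if old.get(p) != mtime)

def watch(trigger, dirs=WATCH_DIRS, interval=5.0):
    """Poll `dirs` and call `trigger(paths)` whenever samples change.
    `trigger` is a stand-in for the retraining entry point."""
    state = snapshot(dirs)
    while True:
        time.sleep(interval)
        current = snapshot(dirs)
        delta = changed(state, current)
        if delta:
            trigger(delta)
        state = current
```

Polling trades a small detection delay for portability; it needs no third-party packages and behaves identically across operating systems.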
## Training Terminal & Neural Engine
Administrators can monitor and trigger retraining manually through the Neural Engine dashboard in the Cyber Sentinel interface. This view provides high-level metrics on the retraining progress.
### Available Training Pipelines
| Pipeline | Model Type | Use Case |
| :--- | :--- | :--- |
| AST Ensemble | Random Forest / Gradient Boosting | General-purpose, high-speed detection of structural anomalies. |
| Neural Engine | PyTorch Deep Learning | Deep pattern recognition for complex, multi-stage backdoors. |
| Hybrid Stacking | Stacked Classifier | Maximum accuracy; uses NN predictions as features for a meta-classifier. |
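The Hybrid Stacking idea, base learners whose predictions become features for a meta-classifier, can be sketched with scikit-learn's `StackingClassifier`. This is not Soteria's actual trainer: the real Neural Engine is PyTorch-based, so the `MLPClassifier` here is only a stand-in, and the synthetic dataset substitutes for the AST-derived feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy feature matrix standing in for the AST-derived numeric features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners feed their out-of-fold predictions to a meta-classifier,
# mirroring the "Hybrid Stacking" row in the table above.
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
print(f"hold-out accuracy: {stack.score(X_te, y_te):.2f}")
```

Stacking typically buys a small accuracy gain over the best single learner at the cost of training every base model, which is why the table reserves it for "maximum accuracy" scenarios.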
### Usage: Triggering a Manual Refresh
While the system is automated, developers and admins can manually trigger the pipeline via the CLI to force a model export:
```bash
# Navigate to the backend source
cd backend/src

# Run the data pipeline to refresh numeric features
python3 dataPipeline_AST.py

# Manually trigger the hybrid trainer
python3 trainerModel_Hybrid.py
```
Upon completion, the system exports a new `acidModel.pkl` (Ensemble) or `acidModel_neural.pt` (Neural Network) to the `ML_master` directory. The Intelligence Engine automatically hot-reloads the new model for real-time scanning.
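The Intelligence Engine's internals are not documented here, but a common way to implement hot-reloading is to compare the exported file's modification time on each access and re-deserialize only when it changes. The class name below is hypothetical, purely a sketch of that pattern for the pickled ensemble export:

```python
import os
import pickle

class HotReloadingModel:
    """Reload a pickled model whenever its file changes on disk.
    (Hypothetical sketch of mtime-based hot reloading, not the
    actual Intelligence Engine implementation.)"""

    def __init__(self, path="ML_master/acidModel.pkl"):
        self.path = path
        self.mtime = None   # mtime of the currently loaded export
        self.model = None

    def get(self):
        mtime = os.stat(self.path).st_mtime
        if mtime != self.mtime:          # a new export was written
            with open(self.path, "rb") as f:
                self.model = pickle.load(f)
            self.mtime = mtime
        return self.model
```

Scanning code then calls `get()` before each prediction and transparently picks up every fresh export without a restart.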
## Data Purity & De-duplication
To ensure high model integrity, the retraining system implements Structural De-duplication.
Before a function is converted into a numerical vector, its normalized AST form is hashed. If the hash exists in the current dataset, the system skips the entry. This prevents "Dataset Leakage," where the model achieves artificially high accuracy by memorizing duplicate functions rather than learning behavioral patterns.
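The skip-on-duplicate-hash step described above reduces to a small filter. The `dedupe` and `normalize` names here are illustrative; `normalize` stands in for whatever canonicalization the pipeline applies (such as the anonymized-AST form) before hashing:

```python
import hashlib

def dedupe(samples, normalize):
    """Yield only structurally novel samples.
    `normalize` maps source code to its canonical form; its SHA-256
    digest plays the role of the "Structural DNA" described above."""
    seen = set()
    for src in samples:
        digest = hashlib.sha256(normalize(src).encode()).hexdigest()
        if digest in seen:
            continue            # duplicate structure: skip to avoid leakage
        seen.add(digest)
        yield src
```

Filtering before vectorization means duplicate functions never reach the feature matrix, so evaluation accuracy reflects learned behavioral patterns rather than memorized repeats.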