ML Model Architectures
Soteria’s Intelligence Engine uses three distinct machine learning architectures to analyze the "Structural DNA" of Python code. By transforming Abstract Syntax Trees (ASTs) into numerical feature matrices, the system can identify malicious patterns that traditional signature-based scanners miss.
Feature Extraction & Normalization
Before reaching the models, all source code passes through a hardening pipeline. This ensures the models analyze logic, not formatting:
- AST Normalization: Variable names, constants, and function names are anonymized to prevent "renaming obfuscation."
- Vectorization: Code is converted into a feature matrix based on the distribution of node types (e.g., `BinOp`, `Attribute`, `Call`, `Assign`).
- SHA-256 Deduplication: Ensures the training set contains unique logical patterns, preventing bias from duplicate boilerplate code.
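The pipeline steps above can be sketched with the standard-library `ast` module. This is a minimal illustration, not Soteria's actual implementation: the tracked node vocabulary and the helper names (`Anonymizer`, `extract_features`, `dedup_key`) are assumptions.

```python
import ast
import hashlib
from collections import Counter

# Illustrative node vocabulary; the real feature set is an assumption here.
NODE_TYPES = ["Module", "FunctionDef", "Assign", "Call", "Attribute",
              "BinOp", "Name", "Constant", "Import", "Expr"]

class Anonymizer(ast.NodeTransformer):
    """AST Normalization: strip identifiers and literals, keep structure."""
    def visit_Name(self, node):
        node.id = "VAR"
        return node
    def visit_Constant(self, node):
        node.value = "CONST"
        return node

def extract_features(source: str) -> list[float]:
    """Vectorization: relative frequency of each tracked AST node type."""
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    total = sum(counts.values()) or 1
    return [counts.get(t, 0) / total for t in NODE_TYPES]

def dedup_key(source: str) -> str:
    """SHA-256 Deduplication: hash the normalized AST, not the raw text."""
    tree = Anonymizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```

Because hashing happens after normalization, `x = 1` and `y = 2` collapse to the same key, which is exactly the "renaming obfuscation" resistance described above.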
1. Ensemble Architecture (Voting Classifier)
The default architecture uses a multi-model voting strategy. This is designed for high reliability and general-purpose vulnerability detection.
Technical Specification
The ensemble model combines three different classification algorithms using a Soft Voting approach (averaging predicted probabilities):
| Classifier | Configuration | Role |
| :--- | :--- | :--- |
| Random Forest | 300 estimators, depth 15 | Captures non-linear relationships in node distribution. |
| Gradient Boosting | 200 estimators, learning rate 0.1 | Focuses on minimizing residuals from difficult-to-classify samples. |
| Logistic Regression | L2 regularization, SAGA solver | Provides a stable linear baseline for standardized features. |
Usage Scenario: Best for general scanning where speed and interpretability are required. It is resistant to outliers in the AST data.
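A minimal scikit-learn construction of this soft-voting setup might look like the following sketch. Hyperparameters come from the table above; the estimator labels and the scaler placement in front of the SAGA solver are assumptions.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# voting="soft" averages the predicted probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, max_depth=15)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)),
        # SAGA converges best on standardized features, hence the scaler.
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(penalty="l2", solver="saga",
                                                max_iter=5000))),
    ],
    voting="soft",
)
```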
2. Neural Engine (Deep Learning)
The Neural Engine is a deep feedforward neural network built with PyTorch. It is designed to capture complex, deep-seated behavioral patterns in code structure.
Architecture: VulnerabilityNet
The network consists of a multi-layer perceptron (MLP) with the following topology:
- Input Layer: Sized to match the number of unique AST node types.
- Hidden Layers: Four dense layers (512 → 256 → 128 → 64 units).
- Regularization: `BatchNorm1d` and `Dropout(0.3)` are applied after each hidden layer to prevent overfitting.
- Activation: `ReLU` for hidden layers; `Sigmoid` for the final output (probability of malice).
```python
# Model definition summary; `nn` is torch.nn
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(input_size, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Dropout(0.3),
    # ... successive layers ...
    nn.Linear(64, 1),
    nn.Sigmoid()
)
```
Usage Scenario: Ideal for detecting sophisticated backdoors where malicious intent is hidden across complex control flows.
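Under the topology stated above (four hidden layers of 512 → 256 → 128 → 64 units, each followed by `BatchNorm1d`, `ReLU`, and `Dropout(0.3)`), the elided layers can be spelled out as a sketch; the builder function name is illustrative.

```python
import torch
import torch.nn as nn

def make_vulnerability_net(input_size: int, dropout: float = 0.3) -> nn.Sequential:
    """Sketch of VulnerabilityNet with all four hidden layers expanded."""
    sizes = [input_size, 512, 256, 128, 64]
    layers: list[nn.Module] = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        layers += [nn.Linear(n_in, n_out),
                   nn.BatchNorm1d(n_out),   # regularization after each hidden layer
                   nn.ReLU(),
                   nn.Dropout(dropout)]
    layers += [nn.Linear(64, 1), nn.Sigmoid()]  # probability of malice
    return nn.Sequential(*layers)
```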
3. Hybrid Stacking Model
The Hybrid architecture represents the "high-fidelity" tier of the Soteria pipeline. It uses a Stacking Classifier approach that combines the strengths of deep learning and classical ensemble methods.
The Stacking Process
- Base Layer: The Neural Engine processes the AST features and outputs a "Vulnerability Probability."
- Augmentation: This probability is appended to the original feature matrix as a new feature (the "Neural Insight" column).
- Meta-Classifier: A final `StackingClassifier` (using Random Forest and Gradient Boosting) analyzes the augmented matrix.
- Final Decision: A Logistic Regression meta-learner makes the final call based on the consensus of the base estimators.
Usage Scenario: Enterprise-grade security audits where maximum accuracy is required and the cost of a False Negative is high.
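The augmentation and stacking steps above can be sketched as follows. The helper and variable names are illustrative; the base-estimator hyperparameters are carried over from the ensemble table.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

def augment_with_neural_insight(X: np.ndarray, neural_proba: np.ndarray) -> np.ndarray:
    """Append the Neural Engine's probability as the 'Neural Insight' column."""
    return np.hstack([X, neural_proba.reshape(-1, 1)])

# Base estimators analyze the augmented matrix; a Logistic Regression
# meta-learner makes the final call from their out-of-fold predictions.
hybrid = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, max_depth=15)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
```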
Model Comparison
| Feature | Ensemble | Neural Engine | Hybrid |
| :--- | :--- | :--- | :--- |
| Inference Speed | Fast | Moderate | Slow |
| Complexity | Medium | High | Very High |
| Best For | Known Patterns | Hidden Behaviors | Maximum Precision |
| Framework | Scikit-Learn | PyTorch | Scikit-Learn + PyTorch |
Model Persistence
Models are serialized using `joblib` (for Scikit-Learn) and `torch.save` (for PyTorch). The system automatically loads the `.pkl` or `.pt` files from the `/backend/ML_master` directory during the API startup sequence.
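A save/load round-trip under this scheme might look like the sketch below. The directory comes from the docs above, but the file names (`ensemble.pkl`, `neural.pt`) and function names are assumptions.

```python
import joblib
import torch

def save_models(sk_model, torch_model, directory="/backend/ML_master"):
    """Persist both model families; file names here are illustrative."""
    joblib.dump(sk_model, f"{directory}/ensemble.pkl")
    torch.save(torch_model.state_dict(), f"{directory}/neural.pt")

def load_torch_model(model, path):
    """Restore weights into a freshly constructed network for inference."""
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()  # freeze Dropout / BatchNorm statistics for inference
    return model
```

Saving only the `state_dict` (rather than the whole module) keeps the `.pt` file decoupled from the class definition, so the network can be rebuilt and re-hydrated at startup.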