ML Model Architectures
Soteria’s Intelligence Engine uses three distinct machine learning architectures to analyze the "Structural DNA" of Python code. By transforming Abstract Syntax Trees (ASTs) into numerical feature matrices, the system can identify malicious patterns that traditional signature-based scanners miss.
Feature Extraction & Normalization
Before reaching the models, all source code passes through a hardening pipeline. This ensures the models analyze logic, not formatting:
- AST Normalization: Variable names, constants, and function names are anonymized to prevent "renaming obfuscation."
- Vectorization: Code is converted into a feature matrix based on the distribution of node types (e.g., `BinOp`, `Attribute`, `Call`, `Assign`).
- SHA-256 Deduplication: Ensures the training set contains unique logical patterns, preventing bias from duplicate boilerplate code.
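The pipeline steps above can be sketched with the standard-library `ast` module. This is a minimal illustration, not Soteria's actual implementation: the tracked node vocabulary and the helper names (`Anonymizer`, `extract_features`, `dedup_key`) are assumptions.

```python
import ast
import hashlib
from collections import Counter

# Illustrative node vocabulary; the real feature set is an assumption here.
NODE_TYPES = ["Module", "FunctionDef", "Assign", "Call", "Attribute",
              "BinOp", "Name", "Constant", "Import", "Expr"]

class Anonymizer(ast.NodeTransformer):
    """AST Normalization: strip identifiers and literals, keep structure."""
    def visit_Name(self, node):
        node.id = "VAR"
        return node
    def visit_Constant(self, node):
        node.value = "CONST"
        return node

def extract_features(source: str) -> list[float]:
    """Vectorization: relative frequency of each tracked AST node type."""
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    total = sum(counts.values()) or 1
    return [counts.get(t, 0) / total for t in NODE_TYPES]

def dedup_key(source: str) -> str:
    """SHA-256 Deduplication: hash the normalized AST, not the raw text."""
    tree = Anonymizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```

Because hashing happens after normalization, `x = 1` and `y = 2` collapse to the same key, which is exactly the "renaming obfuscation" resistance described above.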
1. Ensemble Architecture (Voting Classifier)
The default architecture uses a multi-model voting strategy. This is designed for high reliability and general-purpose vulnerability detection.
Technical Specification
The ensemble model combines three different classification algorithms using a Soft Voting approach (averaging predicted probabilities):
| Classifier | Configuration | Role |
| :--- | :--- | :--- |
| Random Forest | 300 estimators, depth 15 | Captures non-linear relationships in node distribution. |
| Gradient Boosting | 200 estimators, learning rate 0.1 | Focuses on minimizing residuals from difficult-to-classify samples. |
| Logistic Regression | L2 regularization, SAGA solver | Provides a stable linear baseline for standardized features. |
Usage Scenario: Best for general scanning where speed and interpretability are required. It is resistant to outliers in the AST data.
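A minimal scikit-learn construction of this soft-voting setup might look like the following sketch. Hyperparameters come from the table above; the estimator labels and the scaler placement in front of the SAGA solver are assumptions.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# voting="soft" averages the predicted probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, max_depth=15)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)),
        # SAGA converges best on standardized features, hence the scaler.
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(penalty="l2", solver="saga",
                                                max_iter=5000))),
    ],
    voting="soft",
)
```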
2. Neural Engine (Deep Learning)
The Neural Engine is a deep feedforward neural network built with PyTorch. It is designed to capture complex, deep-seated behavioral patterns in code structure.
Architecture: VulnerabilityNet
The network consists of a multi-layer perceptron (MLP) with the following topology:
- Input Layer: Sized to match the number of unique AST node types.
- Hidden Layers: Four dense layers (512 → 256 → 128 → 64 units).
- Regularization: `BatchNorm1d` and `Dropout(0.3)` are applied after each hidden layer to prevent overfitting.
- Activation: `ReLU` for hidden layers; `Sigmoid` for the final output (probability of malice).
```python
# Model definition summary; `nn` is torch.nn
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(input_size, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Dropout(0.3),
    # ... successive layers ...
    nn.Linear(64, 1),
    nn.Sigmoid()
)
```
Usage Scenario: Ideal for detecting sophisticated backdoors where malicious intent is hidden across complex control flows.
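Under the topology stated above (four hidden layers of 512 → 256 → 128 → 64 units, each followed by `BatchNorm1d`, `ReLU`, and `Dropout(0.3)`), the elided layers can be spelled out as a sketch; the builder function name is illustrative.

```python
import torch
import torch.nn as nn

def make_vulnerability_net(input_size: int, dropout: float = 0.3) -> nn.Sequential:
    """Sketch of VulnerabilityNet with all four hidden layers expanded."""
    sizes = [input_size, 512, 256, 128, 64]
    layers: list[nn.Module] = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        layers += [nn.Linear(n_in, n_out),
                   nn.BatchNorm1d(n_out),   # regularization after each hidden layer
                   nn.ReLU(),
                   nn.Dropout(dropout)]
    layers += [nn.Linear(64, 1), nn.Sigmoid()]  # probability of malice
    return nn.Sequential(*layers)
```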
3. Hybrid Stacking Model
The Hybrid architecture represents the "high-fidelity" tier of the Soteria pipeline. It uses a Stacking Classifier approach that combines the strengths of deep learning and classical ensemble methods.
The Stacking Process
- Base Layer: The Neural Engine processes the AST features and outputs a "Vulnerability Probability."
- Augmentation: This probability is appended to the original feature matrix as a new feature (the "Neural Insight" column).
- Meta-Classifier: A final `StackingClassifier` (using Random Forest and Gradient Boosting) analyzes the augmented matrix.
- Final Decision: A Logistic Regression meta-learner makes the final call based on the consensus of the base estimators.
Usage Scenario: Enterprise-grade security audits where maximum accuracy is required and the cost of a False Negative is high.
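The augmentation and stacking steps above can be sketched as follows. The helper and variable names are illustrative; the base-estimator hyperparameters are carried over from the ensemble table.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

def augment_with_neural_insight(X: np.ndarray, neural_proba: np.ndarray) -> np.ndarray:
    """Append the Neural Engine's probability as the 'Neural Insight' column."""
    return np.hstack([X, neural_proba.reshape(-1, 1)])

# Base estimators analyze the augmented matrix; a Logistic Regression
# meta-learner makes the final call from their out-of-fold predictions.
hybrid = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, max_depth=15)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
```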
Model Comparison
| Feature | Ensemble | Neural Engine | Hybrid |
| :--- | :--- | :--- | :--- |
| Inference Speed | Fast | Moderate | Slow |
| Complexity | Medium | High | Very High |
| Best For | Known Patterns | Hidden Behaviors | Maximum Precision |
| Framework | Scikit-Learn | PyTorch | Scikit-Learn + PyTorch |
Model Persistence
Models are serialized using `joblib` (for Scikit-Learn) and `torch.save` (for PyTorch). The system automatically loads the `.pkl` or `.pt` files from the `/backend/ML_master` directory during the API startup sequence.
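A save/load round-trip under this scheme might look like the sketch below. The directory comes from the docs above, but the file names (`ensemble.pkl`, `neural.pt`) and function names are assumptions.

```python
import joblib
import torch

def save_models(sk_model, torch_model, directory="/backend/ML_master"):
    """Persist both model families; file names here are illustrative."""
    joblib.dump(sk_model, f"{directory}/ensemble.pkl")
    torch.save(torch_model.state_dict(), f"{directory}/neural.pt")

def load_torch_model(model, path):
    """Restore weights into a freshly constructed network for inference."""
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()  # freeze Dropout / BatchNorm statistics for inference
    return model
```

Saving only the `state_dict` (rather than the whole module) keeps the `.pt` file decoupled from the class definition, so the network can be rebuilt and re-hydrated at startup.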