Vulnerability Intelligence DB
The Vulnerability Intelligence DB is the core knowledge base that powers Soteria’s detection engine. Unlike traditional signature-based databases that store specific strings or regex patterns, this system stores Structural DNA: a normalized, mathematical representation of Python code behavior derived from Abstract Syntax Trees (AST).
Structural Normalization
To prevent attackers from bypassing detection through simple renaming or obfuscation, the database does not store raw source code. Every contribution to the intelligence layer passes through the codeNormalizer (a custom ast.NodeTransformer).
- Anonymization: Variable names, function names, and constant values are stripped and replaced with generic placeholders.
- Logic Extraction: The system focuses exclusively on the sequence of operations (e.g., a Call followed by an Attribute access), which represents the "true intent" of the code.
- Resilience: This allows the engine to recognize a backdoor even if the variables are renamed from backdoor_shell to calculate_total.
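The normalization idea above can be illustrated with a minimal sketch. This is not the project's codeNormalizer; the class name SketchNormalizer, the placeholder strings FUNC, VAR, and CONST, and the helper normalize are all assumptions chosen for illustration. It relies only on the standard-library ast module.

```python
import ast


class SketchNormalizer(ast.NodeTransformer):
    """Hypothetical stand-in for codeNormalizer (names and placeholders are assumed)."""

    def visit_FunctionDef(self, node):
        # Replace the function name, then keep walking into arguments and body.
        node.name = "FUNC"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        # Parameter names become a generic placeholder.
        node.arg = "VAR"
        return node

    def visit_Name(self, node):
        # Every identifier reference is replaced; the load/store context is kept.
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        # Literal values (numbers, strings, docstrings) collapse to one placeholder.
        return ast.copy_location(ast.Constant(value="CONST"), node)


def normalize(source: str) -> str:
    """Return a structural dump of the source with names and constants anonymized."""
    tree = SketchNormalizer().visit(ast.parse(source))
    return ast.dump(tree)


renamed_a = "def backdoor_shell(x):\n    y = x + 1\n    return y\n"
renamed_b = "def calculate_total(n):\n    total = n + 2\n    return total\n"
different = "def calculate_total(n):\n    return n.value()\n"

print(normalize(renamed_a) == normalize(renamed_b))  # same structure
print(normalize(renamed_a) == normalize(different))  # different structure
```

In this sketch, attribute names such as the value in n.value are left untouched, since the text above lists only variable names, function names, and constants as anonymized. Whether the real normalizer also rewrites attributes is not stated.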
Data Sources & Ingestion
The intelligence layer is fed by a multi-source pipeline managed via dataPipeline_AST.py. It categorizes code into two primary states:
- Clean Dataset: Known-safe implementation patterns, library usages, and standard algorithmic logic.
- Corrupted Dataset: Known malicious patterns, including SQL injection vectors, unauthorized socket connections, and credential exfiltration logic.
External Intelligence Integration
The system supports ingesting bulk datasets from external research sources, such as Hugging Face. These are processed through a hardened pipeline that strips markdown formatting and ensures the code is syntactically valid before normalization.
# Example: Adding external vulnerability data via CSV
# The pipeline automatically handles AST parsing and normalization
hf_csv_path = basePath / "external" / "vulnerability_samples.csv"
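As a rough illustration of the cleaning step described above (stripping markdown formatting and rejecting syntactically invalid code), the following sketch works on plain strings. The function names strip_markdown, is_valid_python, and clean_samples are hypothetical, and the sketch does not read the CSV itself because the column layout of the external datasets is not documented here.

```python
import ast
import re

# Build the fence marker programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown(sample: str) -> str:
    """Return the code inside the first fenced block, or the whole sample if none exists."""
    match = FENCE_RE.search(sample)
    return (match.group(1) if match else sample).strip()


def is_valid_python(code: str) -> bool:
    """Check that the code parses; it is never executed."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def clean_samples(samples):
    """Keep only non-empty samples that parse as Python after markdown is stripped."""
    cleaned = (strip_markdown(s) for s in samples)
    return [code for code in cleaned if code and is_valid_python(code)]


raw = [
    FENCE + "python\nx = 1\n" + FENCE,  # fenced snippet
    "def broken(:",                     # invalid syntax, dropped
    "print('ok')",                      # already plain code
]
print(clean_samples(raw))
```

Validating with ast.parse rather than running the sample keeps potentially malicious snippets inert during ingestion.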
Data Integrity & Deduplication
To keep duplicate samples from skewing the machine learning models, the database uses SHA-256 fingerprinting. Every function ingested is hashed based on its raw content. If a function's fingerprint already exists in the "seen_hashes" registry, the function is discarded. This prevents "Dataset Leakage," where the model simply memorizes duplicate entries rather than learning generalized behavioral patterns.
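A minimal sketch of this deduplication step, assuming the registry is an in-memory set and that functions arrive as source strings. The helper names fingerprint and deduplicate are illustrative; only the seen_hashes name comes from the description above.

```python
import hashlib


def fingerprint(function_source: str) -> str:
    """SHA-256 hex digest of the raw function text."""
    return hashlib.sha256(function_source.encode("utf-8")).hexdigest()


def deduplicate(functions):
    """Return functions in their original order, skipping any already-seen fingerprint."""
    seen_hashes = set()  # assumed to be in memory; the real registry may be persisted
    unique = []
    for source in functions:
        digest = fingerprint(source)
        if digest in seen_hashes:
            continue  # exact duplicate: discard
        seen_hashes.add(digest)
        unique.append(source)
    return unique


samples = [
    "def f():\n    return 1\n",
    "def g():\n    return 2\n",
    "def f():\n    return 1\n",  # byte-for-byte duplicate of the first entry
]
print(len(deduplicate(samples)))
```

Because the hash is taken over raw content, two functions that differ only in whitespace or naming produce different fingerprints and are both kept.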
The Feature Matrix
The "Intelligence DB" ultimately manifests as a numerical feature matrix. This matrix quantifies the frequency and distribution of specific AST nodes.
| Feature | Description |
| :--- | :--- |
| Assign | Frequency of variable assignments. |
| Call | Number of function or method invocations. |
| BinOp | Binary operations (often used in obfuscated payload assembly). |
| Attribute | Accessing object attributes (common in unauthorized library access). |
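One row of such a matrix could be computed along the following lines. This is a sketch under assumptions: the function name feature_vector and the restriction to the four features in the table are illustrative, and the actual feature set may be larger.

```python
import ast
from collections import Counter

# Only the four node types listed in the table above; the real matrix may track more.
FEATURES = ("Assign", "Call", "BinOp", "Attribute")


def feature_vector(source: str) -> dict:
    """Count how often each tracked AST node type occurs in the source."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return {name: counts.get(name, 0) for name in FEATURES}


example = "import os\ntotal = os.path.join('a', 'b')\nsize = len(total) + 1\n"
print(feature_vector(example))
# {'Assign': 2, 'Call': 2, 'BinOp': 1, 'Attribute': 2}
```

The chained access os.path.join counts as two Attribute nodes, which is why the Attribute count exceeds the number of dotted expressions in the snippet.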
Management for Administrators
Administrators can manage and update the intelligence database through the Neural Engine dashboard or by manually interacting with the backend data directories.
Adding New Vulnerability Patterns
To train the system on a new type of exploit:
- Navigate to backend/data/corrupted/.
- Add a new .txt or .py file containing the malicious code snippet.
- Trigger the watch_data.py process or run the trainer manually.
- The system will automatically parse the file, extract the functions, normalize them, and update the numericFeatures.csv master file used for model retraining.
Model Exporting
After the database is updated and the model is retrained, the intelligence is serialized into an acidModel.pkl (Ensemble) or acidModel_neural.pt (Neural Network) file. These artifacts represent the "frozen" state of the intelligence database ready for production deployment.