Vulnerability Intelligence DB

The Vulnerability Intelligence DB is the core knowledge base that powers Soteria’s detection engine. Unlike traditional signature-based databases that store specific strings or regex patterns, this system stores Structural DNA: a normalized, mathematical representation of Python code behavior derived from Abstract Syntax Trees (AST).

Structural Normalization

To prevent attackers from bypassing detection via simple renaming or obfuscation, the database does not store raw source code. Every contribution to the intelligence layer passes through the codeNormalizer (a custom ast.NodeTransformer subclass).

Anonymization: Variable names, function names, and constant values are stripped and replaced with generic placeholders.
Logic Extraction: The system focuses exclusively on the sequence of operations (e.g., a Call followed by an Attribute access), which represents the "true intent" of the code.
Resilience: This allows the engine to recognize a backdoor even if the variables are renamed from backdoor_shell to calculate_total.
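
The three properties above can be illustrated with a minimal sketch built on Python's standard ast module. This is not the actual codeNormalizer: the class name SketchNormalizer, the helper structural_form, and the placeholder strings FUNC, VAR, and CONST are all assumptions chosen for illustration.

```python
import ast


class SketchNormalizer(ast.NodeTransformer):
    """Hypothetical stand-in for codeNormalizer; placeholders are assumptions."""

    def visit_FunctionDef(self, node):
        # Replace the function name, then keep walking into args and body.
        node.name = "FUNC"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        # Parameter names carry no structural information.
        node.arg = "VAR"
        return node

    def visit_Name(self, node):
        # Every identifier collapses to one placeholder; the context
        # (Load/Store) is preserved because it is part of the structure.
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        # Literal values (ports, strings, sizes) are stripped as well.
        return ast.copy_location(ast.Constant(value="CONST"), node)


def structural_form(source: str) -> str:
    """Return a textual dump of the normalized tree for comparison."""
    tree = SketchNormalizer().visit(ast.parse(source))
    return ast.dump(tree)


suspicious = "def backdoor_shell(host):\n    conn = connect(host, 4444)\n    return conn.recv(1024)\n"
renamed = "def calculate_total(items):\n    total = connect(items, 9999)\n    return total.recv(2048)\n"

# Renaming identifiers and changing constants does not alter the structure.
same_structure = structural_form(suspicious) == structural_form(renamed)
```

In this sketch attribute names such as recv are left intact, so the sequence of calls and attribute accesses still distinguishes one behavior from another. Whether the real normalizer keeps or strips them is not stated here.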

Data Sources & Ingestion

The intelligence layer is fed by a multi-source pipeline managed via dataPipeline_AST.py. It categorizes code into two primary states:

Clean Dataset: Known-safe implementation patterns, library usages, and standard algorithmic logic.
Corrupted Dataset: Known malicious patterns, including SQL injection vectors, unauthorized socket connections, and credential exfiltration logic.

External Intelligence Integration

The system supports ingesting bulk datasets from external research sources, such as Hugging Face. These are processed through a hardened pipeline that strips markdown formatting and ensures the code is syntactically valid before normalization.

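# Example: Adding external vulnerability data via CSV
# The pipeline automatically handles AST parsing and normalization
from pathlib import Path

basePath = Path("backend/data")  # assumption: data root; adjust to your deployment
hf_csv_path = basePath / "external" / "vulnerability_samples.csv"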
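
The cleaning step described above (stripping markdown formatting, then checking syntactic validity) might look roughly like the following sketch. The function name clean_sample and the fence-matching pattern are assumptions, and the real pipeline may handle more formats than a single fenced block.

```python
import ast
import re
from typing import Optional

# A markdown code fence, built indirectly so the pattern stays easy to read.
TICKS = "`" * 3
_FENCE = re.compile(TICKS + r"[\w+-]*[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def clean_sample(raw: str) -> Optional[str]:
    """Return parseable Python source from a raw dataset cell, or None."""
    # Prefer the body of the first fenced block; otherwise use the cell as-is.
    match = _FENCE.search(raw)
    code = (match.group(1) if match else raw).strip()
    if not code:
        return None
    try:
        # ast.parse only checks syntax; it never executes the sample.
        ast.parse(code)
    except (SyntaxError, ValueError):
        return None
    return code
```

Samples that return None would simply be skipped before normalization. Because ast.parse never runs the code, malicious samples can be validated without executing them.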

Data Integrity & Deduplication

To keep the machine learning models from being biased toward over-represented samples, the database uses SHA-256 fingerprinting. Every ingested function is hashed based on its raw content. If a function's fingerprint already exists in the "seen_hashes" registry, the function is discarded. This prevents "Dataset Leakage," where the model simply memorizes duplicate entries rather than learning generalized behavioral patterns.
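
A minimal sketch of this deduplication step, using the standard hashlib module. Only the seen_hashes name comes from the description above; the function names fingerprint and deduplicate, and the choice of an in-memory set as the registry, are assumptions.

```python
import hashlib


def fingerprint(function_source: str) -> str:
    """SHA-256 hex digest of a function's raw source text."""
    return hashlib.sha256(function_source.encode("utf-8")).hexdigest()


def deduplicate(functions, seen_hashes=None):
    """Keep only functions whose fingerprint has not been seen before.

    seen_hashes is updated in place so it can be shared across batches.
    """
    seen_hashes = set() if seen_hashes is None else seen_hashes
    unique = []
    for source in functions:
        digest = fingerprint(source)
        if digest in seen_hashes:
            continue  # exact duplicate: discard
        seen_hashes.add(digest)
        unique.append(source)
    return unique


seen_hashes = set()
batch = ["def a(): pass", "def b(): pass", "def a(): pass"]
kept = deduplicate(batch, seen_hashes)
```

Hashing raw content only catches byte-for-byte duplicates. Two functions that differ in whitespace or naming would receive different fingerprints under this sketch.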

The Feature Matrix

The Intelligence DB ultimately takes the form of a numerical feature matrix that quantifies the frequency and distribution of specific AST node types across the ingested functions.
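
One simple way to build such a matrix is to count AST node types per function, as in the sketch below. The actual feature set is not documented here, so the column choice (one column per node type) and the names node_counts and feature_matrix are assumptions.

```python
import ast
from collections import Counter


def node_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in a snippet."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def feature_matrix(sources):
    """Return (columns, rows): one row per snippet, one column per node type."""
    counts = [node_counts(source) for source in sources]
    # The column set is the union of all node types seen, sorted for stability.
    columns = sorted(set().union(*counts))
    rows = [[count.get(column, 0) for column in columns] for count in counts]
    return columns, rows


columns, rows = feature_matrix(["x = f(1)", "y = 2"])
```

Each row has the same length, so the result can be written to a CSV file or passed to a model as a dense array.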

Management for Administrators

Administrators can manage and update the intelligence database through the Neural Engine dashboard or by manually interacting with the backend data directories.

Adding New Vulnerability Patterns

To train the system on a new type of exploit:

Navigate to backend/data/corrupted/.
Add a new .txt or .py file containing the malicious code snippet.
Trigger the watch_data.py process or run the trainer manually.
The system will automatically parse the file, extract the functions, normalize them, and update the numericFeatures.csv master file used for model retraining.
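
The function-extraction part of that process can be sketched with the standard ast module. This is an illustration under assumptions, not the code in watch_data.py or the trainer; the name extract_functions is hypothetical.

```python
import ast


def extract_functions(source: str):
    """Return the source text of every function definition in a snippet."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


snippet = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class K:\n"
    "    def b(self):\n"
    "        return 2\n"
)
functions = extract_functions(snippet)
```

For a file placed in backend/data/corrupted/, the source string could be obtained with pathlib.Path(...).read_text(). Methods nested inside classes are included because ast.walk visits the whole tree.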

Model Exporting

After the database is updated and the model is retrained, the intelligence is serialized into an acidModel.pkl (Ensemble) or acidModel_neural.pt (Neural Network) file. These artifacts represent the "frozen" state of the intelligence database ready for production deployment.