# Feature Extraction Pipeline
The Feature Extraction Pipeline is the core of Soteria’s "Structural DNA" analysis. Unlike traditional static analysis tools that rely on fragile regex patterns or keyword blacklists, this pipeline deconstructs Python source code into its underlying logic to identify malicious intent, even when the code is obfuscated.
## The "Structural DNA" Concept
Soteria treats code as a collection of behaviors rather than strings. By converting a function into a numerical feature matrix, the system can identify "malicious shapes"—patterns of logic common in backdoors, such as excessive network calls combined with dynamic attribute lookups, regardless of the variable names used.
## 1. Structural Normalization
Before features are extracted, the code passes through a custom AST `NodeTransformer`. This stage strips away the "identity" of the code to focus purely on its "skeleton":
- Anonymization: Variable names, function names, and constants are replaced with generic identifiers.
- Obfuscation Resistance: Since the model analyzes the structure (e.g., a `For` loop containing an `If` statement and a `Call`), renaming a variable from `send_data` to `x123` does not change the resulting feature vector.
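The normalization step can be illustrated with a minimal sketch using Python's standard `ast.NodeTransformer`. The `Anonymizer` class and `skeleton` helper below are hypothetical, not Soteria's actual transformer, but they show how two differently named functions collapse to the same structural skeleton:

```python
import ast

class Anonymizer(ast.NodeTransformer):
    """Hypothetical normalizer: replaces names, arguments, and
    constants with generic placeholders so only structure remains."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "ARG"
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)

    def visit_FunctionDef(self, node):
        node.name = "FUNC"
        self.generic_visit(node)
        return node

def skeleton(source: str) -> str:
    """Dump the anonymized AST as a comparable string."""
    return ast.dump(Anonymizer().visit(ast.parse(source)))

# Renaming identifiers does not change the normalized structure:
print(skeleton("def send_data(x): return x + 1") ==
      skeleton("def f123(y): return y + 99"))  # True
```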
## 2. The Vectorization Engine
The pipeline transforms raw Python functions into a numerical distribution based on Abstract Syntax Tree (AST) node types. The engine counts the frequency of specific operations within a function's scope.
Key features tracked include:
- `Assign`: Frequency of variable assignments.
- `Call`: Number of function invocations (often high in shellcode or injectors).
- `BinOp`: Mathematical operations (common in encryption/obfuscation routines).
- `Attribute`: Accessing object properties (e.g., `os.system`).
- `Expr`: Standalone expressions.
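A frequency counter over these node types can be sketched in a few lines; the `FEATURES` ordering and `vectorize` helper here are illustrative assumptions, not the engine's actual API:

```python
import ast
from collections import Counter

# Assumed feature ordering, mirroring the list above
FEATURES = ["Assign", "Call", "BinOp", "Attribute", "Expr"]

def vectorize(source: str) -> list:
    """Count AST node types in a snippet and emit a fixed-order vector."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [float(counts.get(name, 0)) for name in FEATURES]

code = """
def run(cmd):
    import os
    result = os.system(cmd)
    return result
"""
print(vectorize(code))  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

Note how the single `os.system` attribute access and call show up directly in the vector, regardless of what the function or its variables are called.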
### Feature Matrix Example
The transformation results in a structured format ready for the Machine Learning models:
| Function Logic | Assign | Call | BinOp | Attribute | Label |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Standard Calculation | 2.0 | 1.0 | 3.0 | 0.0 | Clean (0) |
| Backdoor/Payload | 1.0 | 5.0 | 0.0 | 4.0 | Malicious (1) |
## 3. Hardened Data Pipeline
To ensure high-quality training and prevent model bias, the pipeline implements a "Hardened" workflow:
- Function-Level Granularity: The pipeline automatically splits large `.py` files into individual `FunctionDef` nodes. This allows the scanner to pinpoint the exact location of a vulnerability within a large codebase.
- SHA-256 Deduplication: Every function is hashed before being added to the dataset. If a function's structural logic is identical to one already processed, it is discarded to prevent "dataset poisoning" and overfitting.
- External Ingestion: The pipeline supports ingesting raw JSON or CSV data from external sources (like Hugging Face), automatically cleaning markdown blocks and stripping non-Python artifacts before analysis.
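The deduplication idea can be sketched as follows. The `structural_hash` and `ingest` helpers are hypothetical names, and a real pipeline would hash the *normalized* AST so that renamed copies also collide:

```python
import ast
import hashlib

def structural_hash(source: str) -> str:
    """SHA-256 over the AST dump: identical logic -> identical digest.
    (Illustrative: hashing the normalized AST would also catch renames.)"""
    return hashlib.sha256(ast.dump(ast.parse(source)).encode()).hexdigest()

seen = set()

def ingest(source: str) -> bool:
    """Keep the function only if its structural hash is new."""
    h = structural_hash(source)
    if h in seen:
        return False
    seen.add(h)
    return True

print(ingest("def f(a):\n    return a + 1"))  # True: first occurrence
print(ingest("def f(a):\n    return a + 1"))  # False: duplicate discarded
```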
## Usage in Development
If you are extending the data pipeline or adding new features to the vectorizer, the primary interface is located in `backend/src/dataPipeline_AST.py`.
### Extracting Features from a File
To process a directory of Python scripts into a normalized dataset, use the `hardened_Dataset_with_Normalization` function:
```python
from dataPipeline_AST import hardened_Dataset_with_Normalization

# Processes raw .txt or .py files in /data into a master CSV
# Handles: Parsing -> Normalizing -> Hashing -> Exporting
hardened_Dataset_with_Normalization(data_dir="./custom_samples")
```
## Visualizing the Pipeline
- Input: Raw Python function.
- AST Parsing: Generate a tree representation using the `ast` standard library.
- Normalization: Visit each node and anonymize identifiers.
- Vectorization: Count node occurrences.
- Output: A single row in a `numericFeatures.csv` file, tagged with a source and label.
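End to end, these steps amount to roughly the following sketch. All names here (`to_csv_row`, the `FEATURES` ordering, the column layout) are assumptions for illustration, and the normalization step is omitted for brevity:

```python
import ast
import csv
import io
from collections import Counter

FEATURES = ["Assign", "Call", "BinOp", "Attribute", "Expr"]

def to_csv_row(source: str, origin: str, label: int) -> list:
    """Parse one function, count node types, and tag with source/label.
    (Normalization omitted for brevity.)"""
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    return [counts.get(f, 0) for f in FEATURES] + [origin, label]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(FEATURES + ["source", "label"])
writer.writerow(to_csv_row("def f(a):\n    b = a + 1\n    return b",
                           "sample.py", 0))
print(buf.getvalue())
```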