# Feature Extraction Pipeline
The Feature Extraction Pipeline is the core of Soteria’s "Structural DNA" analysis. Unlike traditional static analysis tools that rely on fragile regex patterns or keyword blacklists, this pipeline deconstructs Python source code into its underlying logic to identify malicious intent, even when the code is obfuscated.
## The "Structural DNA" Concept
Soteria treats code as a collection of behaviors rather than strings. By converting a function into a numerical feature matrix, the system can identify "malicious shapes"—patterns of logic common in backdoors, such as excessive network calls combined with dynamic attribute lookups, regardless of the variable names used.
## 1. Structural Normalization
Before features are extracted, the code passes through a custom AST `NodeTransformer`. This stage strips away the "identity" of the code to focus purely on its "skeleton":
- Anonymization: Variable names, function names, and constants are replaced with generic identifiers.
- Obfuscation Resistance: Since the model analyzes the structure (e.g., a `For` loop containing an `If` statement and a `Call`), renaming a variable from `send_data` to `x123` does not change the resulting feature vector.
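The normalization step can be illustrated with a minimal sketch using Python's standard `ast.NodeTransformer`. The `Anonymizer` class and `skeleton` helper below are hypothetical, not Soteria's actual transformer, but they show how two differently named functions collapse to the same structural skeleton:

```python
import ast

class Anonymizer(ast.NodeTransformer):
    """Hypothetical normalizer: replaces names, arguments, and
    constants with generic placeholders so only structure remains."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "ARG"
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)

    def visit_FunctionDef(self, node):
        node.name = "FUNC"
        self.generic_visit(node)
        return node

def skeleton(source: str) -> str:
    """Dump the anonymized AST as a comparable string."""
    return ast.dump(Anonymizer().visit(ast.parse(source)))

# Renaming identifiers does not change the normalized structure:
print(skeleton("def send_data(x): return x + 1") ==
      skeleton("def f123(y): return y + 99"))  # True
```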
## 2. The Vectorization Engine
The pipeline transforms raw Python functions into a numerical distribution based on Abstract Syntax Tree (AST) node types. The engine counts the frequency of specific operations within a function's scope.
Key features tracked include:
- `Assign`: Frequency of variable assignments.
- `Call`: Number of function invocations (often high in shellcode or injectors).
- `BinOp`: Mathematical operations (common in encryption/obfuscation routines).
- `Attribute`: Accessing object properties (e.g., `os.system`).
- `Expr`: Standalone expressions.
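A frequency counter over these node types can be sketched in a few lines; the `FEATURES` ordering and `vectorize` helper here are illustrative assumptions, not the engine's actual API:

```python
import ast
from collections import Counter

# Assumed feature ordering, mirroring the list above
FEATURES = ["Assign", "Call", "BinOp", "Attribute", "Expr"]

def vectorize(source: str) -> list:
    """Count AST node types in a snippet and emit a fixed-order vector."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [float(counts.get(name, 0)) for name in FEATURES]

code = """
def run(cmd):
    import os
    result = os.system(cmd)
    return result
"""
print(vectorize(code))  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

Note how the single `os.system` attribute access and call show up directly in the vector, regardless of what the function or its variables are called.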
### Feature Matrix Example
The transformation results in a structured format ready for the Machine Learning models:
| Function Logic | Assign | Call | BinOp | Attribute | Label |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Standard Calculation | 2.0 | 1.0 | 3.0 | 0.0 | Clean (0) |
| Backdoor/Payload | 1.0 | 5.0 | 0.0 | 4.0 | Malicious (1) |
## 3. Hardened Data Pipeline
To ensure high-quality training and prevent model bias, the pipeline implements a "Hardened" workflow:
- Function-Level Granularity: The pipeline automatically splits large `.py` files into individual `FunctionDef` nodes. This allows the scanner to pinpoint the exact location of a vulnerability within a large codebase.
- SHA-256 Deduplication: Every function is hashed before being added to the dataset. If a function's structural logic is identical to one already processed, it is discarded to prevent "dataset poisoning" and overfitting.
- External Ingestion: The pipeline supports ingesting raw JSON or CSV data from external sources (like Hugging Face), automatically cleaning markdown blocks and stripping non-Python artifacts before analysis.
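The deduplication idea can be sketched as follows. The `structural_hash` and `ingest` helpers are hypothetical names, and a real pipeline would hash the *normalized* AST so that renamed copies also collide:

```python
import ast
import hashlib

def structural_hash(source: str) -> str:
    """SHA-256 over the AST dump: identical logic -> identical digest.
    (Illustrative: hashing the normalized AST would also catch renames.)"""
    return hashlib.sha256(ast.dump(ast.parse(source)).encode()).hexdigest()

seen = set()

def ingest(source: str) -> bool:
    """Keep the function only if its structural hash is new."""
    h = structural_hash(source)
    if h in seen:
        return False
    seen.add(h)
    return True

print(ingest("def f(a):\n    return a + 1"))  # True: first occurrence
print(ingest("def f(a):\n    return a + 1"))  # False: duplicate discarded
```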
## Usage in Development
If you are extending the data pipeline or adding new features to the vectorizer, the primary interface is located in `backend/src/dataPipeline_AST.py`.
### Extracting Features from a File
To process a directory of Python scripts into a normalized dataset, use the `hardened_Dataset_with_Normalization` function:
```python
from dataPipeline_AST import hardened_Dataset_with_Normalization

# Processes raw .txt or .py files in /data into a master CSV
# Handles: Parsing -> Normalizing -> Hashing -> Exporting
hardened_Dataset_with_Normalization(data_dir="./custom_samples")
```
## Visualizing the Pipeline
- Input: Raw Python function.
- AST Parsing: Generate a tree representation using the `ast` standard library.
- Normalization: Visit each node and anonymize identifiers.
- Vectorization: Count node occurrences.
- Output: A single row in a `numericFeatures.csv` file, tagged with a source and label.
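End to end, these steps amount to roughly the following sketch. All names here (`to_csv_row`, the `FEATURES` ordering, the column layout) are assumptions for illustration, and the normalization step is omitted for brevity:

```python
import ast
import csv
import io
from collections import Counter

FEATURES = ["Assign", "Call", "BinOp", "Attribute", "Expr"]

def to_csv_row(source: str, origin: str, label: int) -> list:
    """Parse one function, count node types, and tag with source/label.
    (Normalization omitted for brevity.)"""
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    return [counts.get(f, 0) for f in FEATURES] + [origin, label]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(FEATURES + ["source", "label"])
writer.writerow(to_csv_row("def f(a):\n    b = a + 1\n    return b",
                           "sample.py", 0))
print(buf.getvalue())
```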