# Data Ingestion & Cleaning
The Soteria pipeline transforms raw Python source code into a structured, numerical format suitable for machine learning. This process involves multi-source ingestion, rigorous deduplication, and Abstract Syntax Tree (AST) normalization to ensure the model learns behavioral patterns rather than specific variable names or comments.
## Overview of Data Sources
Soteria ingests data from two primary channels to create a balanced dataset of "Clean" and "Malicious" samples:
- Local Directory Ingestion: The pipeline scans the `backend/data/` directory:
  - `clean/`: Python files containing safe, standard functional logic (Label: 0).
  - `corrupted/`: Python files containing backdoors, injections, or malicious logic (Label: 1).
- External Datasets: The system supports high-volume ingestion from Hugging Face CSV exports located at `backend/data/external/huggingface_raw.csv`.
## The Ingestion Workflow
The ingestion engine, powered by `dataPipeline_AST.py`, follows a strict sequence to ensure data purity:
### 1. SHA-256 Deduplication
To prevent data leakage and training bias, every function is hashed with the SHA-256 algorithm before entering the dataset.
- Mechanism: The raw text of a function is converted into a unique hex fingerprint.
- Action: If a fingerprint already exists in the `seen_hashes` set, the function is discarded. This ensures that duplicate code snippets, which are common in large datasets, do not skew the model's accuracy.
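As a sketch, the deduplication check can be expressed with Python's standard `hashlib`. The `is_duplicate` helper below is hypothetical; only the `seen_hashes` set is named by the pipeline itself.

```python
import hashlib

seen_hashes = set()  # mirrors the pipeline's seen_hashes set

def is_duplicate(function_source: str) -> bool:
    """Return True if this exact function body has already been ingested."""
    # Convert the raw text into a unique 64-character hex fingerprint
    fingerprint = hashlib.sha256(function_source.encode("utf-8")).hexdigest()
    if fingerprint in seen_hashes:
        return True
    seen_hashes.add(fingerprint)
    return False
```

Note that hashing happens on the *raw* text, before normalization, so two functions that differ only in variable names are still treated as distinct at this stage.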
### 2. AST Parsing and Granular Splitting
Unlike simple text scanners, Soteria parses code into an Abstract Syntax Tree (AST).
- Function-Level Focus: The pipeline automatically identifies `FunctionDef` nodes within a file.
- Isolation: Each function is extracted and processed as an independent "Module," allowing for precise detection even if a malicious function is hidden inside a mostly clean file.
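A minimal sketch of this splitting step using the standard `ast` module (the `extract_functions` name is illustrative, not the pipeline's actual API):

```python
import ast

def extract_functions(source: str) -> list[str]:
    """Split a source file into independent function-level snippets."""
    tree = ast.parse(source)
    functions = []
    # ast.walk visits every node, so nested functions are found as well
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Recover the exact source text for this node (Python 3.8+)
            functions.append(ast.get_source_segment(source, node))
    return functions
```

Each returned snippet can then be hashed, normalized, and labeled independently of the file it came from.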
### 3. Structural Normalization
This is the most critical step for resisting obfuscation. The `codeNormalizer` (a custom `ast.NodeTransformer`) anonymizes the code:
- Variable Anonymization: Local variables and arguments are renamed to generic placeholders.
- Constant Stripping: Hardcoded strings and numbers are removed or normalized.
- Logic Preservation: The structural "skeleton"—the arrangement of loops, function calls, and assignments—remains intact.
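The three bullets above can be sketched as a small `ast.NodeTransformer`. This is an illustrative stand-in, not the real `codeNormalizer`, which may use a different naming scheme and cover more node types:

```python
import ast

class SketchNormalizer(ast.NodeTransformer):
    """Illustrative anonymizer: renames identifiers, strips literals."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, original: str) -> str:
        # Map each distinct identifier to VAR_0, VAR_1, ... in order seen
        if original not in self.names:
            self.names[original] = f"VAR_{len(self.names)}"
        return self.names[original]

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)   # variable anonymization
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)  # argument anonymization
        return node

    def visit_Constant(self, node):
        # Constant stripping: keep only the structural position of literals
        node.value = "STR" if isinstance(node.value, str) else 0
        return node

def normalize(source: str) -> str:
    tree = SketchNormalizer().visit(ast.parse(source))
    return ast.unparse(tree)  # Python 3.9+
```

Because only names and literal values change, the loops, calls, and assignments (the structural "skeleton") survive intact, so two behaviorally identical functions with different identifiers normalize to the same text.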
## Processing Hugging Face Data
Soteria includes built-in support for processing raw CSV data typically found on AI research platforms. The pipeline automatically:
- Cleans Markdown artifacts (e.g., stripping `` ```python `` fences).
- Handles missing or null code snippets.
- Integrates external sources into the master `CSV_master/numericFeatures.csv` file alongside local data.
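A sketch of the fence-stripping and null-handling steps, using only the standard library (the `clean_snippet` helper is hypothetical; the pipeline's actual cleaning code is not shown here):

```python
FENCE = "`" * 3  # a literal triple-backtick Markdown fence

def clean_snippet(cell):
    """Strip Markdown code fences and drop empty or null CSV cells."""
    if cell is None or not str(cell).strip():
        return None  # missing/null snippet: skip this row
    text = str(cell).strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a "python" tag
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)].rstrip()
    return text or None
```

Cells that survive cleaning flow into the same deduplication and normalization steps as locally ingested files.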
## Usage: Triggering the Pipeline
To rebuild the dataset and prepare it for training, you can invoke the pipeline programmatically within the backend environment.
### Example: Building a Hardened Dataset
```python
from backend.src.dataPipeline_AST import hardened_Dataset_with_Normalization

# This will:
# 1. Scan local /data folders
# 2. Ingest Hugging Face CSVs
# 3. Deduplicate and normalize
# 4. Save the results to /CSV_master/
hardened_Dataset_with_Normalization()
```
## Data Output Format
After ingestion and cleaning, the data is saved in a tabular format. Each row represents a single function's "Structural DNA":
| Feature | Description |
| :--- | :--- |
| `rawCode` | The original source code, kept for reference. |
| `normalizedCode` | The anonymized, AST-transformed code. |
| `label` | `0` for Clean, `1` for Malicious. |
| `source` | The origin filename and function name. |
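To make the schema concrete, the sketch below serializes one illustrative row with the columns from the table above (the row values themselves are invented for illustration):

```python
import csv
import io

# Column names taken from the table above; values are illustrative only
FIELDS = ["rawCode", "normalizedCode", "label", "source"]

row = {
    "rawCode": "def f(x):\n    return x + 1",
    "normalizedCode": "def f(VAR_0):\n    return VAR_0 + 0",
    "label": 0,
    "source": "clean/utils.py::f",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)  # csv quotes the embedded newlines automatically
header = buf.getvalue().splitlines()[0]
```

Note that the `csv` module quotes the multi-line `rawCode` cell, so the file stays parseable row-by-row.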
Once the data is cleaned and saved, it is ready for the Vectorization Engine to convert these structures into the numerical matrices used by the Random Forest and Neural Network classifiers.