# Data Ingestion & Cleaning
The Soteria pipeline transforms raw Python source code into a structured, numerical format suitable for machine learning. This process involves multi-source ingestion, rigorous deduplication, and Abstract Syntax Tree (AST) normalization to ensure the model learns behavioral patterns rather than specific variable names or comments.
## Overview of Data Sources
Soteria ingests data from two primary channels to create a balanced dataset of "Clean" and "Malicious" samples:
- Local Directory Ingestion: The pipeline scans the `backend/data/` directory:
  - `clean/`: Python files containing safe, standard functional logic (Label: 0).
  - `corrupted/`: Python files containing backdoors, injections, or malicious logic (Label: 1).
- External Datasets: The system supports high-volume ingestion from Hugging Face CSV exports located at `backend/data/external/huggingface_raw.csv`.
## The Ingestion Workflow
The ingestion engine, powered by `dataPipeline_AST.py`, follows a strict sequence to ensure data purity:
### 1. SHA-256 Deduplication
To prevent data leakage and training bias, every function is hashed with the SHA-256 algorithm before entering the dataset.
- Mechanism: The raw text of a function is converted into a unique hex fingerprint.
- Action: If a fingerprint already exists in the `seen_hashes` set, the function is discarded. This ensures that duplicate code snippets, which are common in large datasets, do not skew the model's accuracy.
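As a sketch, the deduplication check can be expressed with Python's standard `hashlib`. The `is_duplicate` helper below is hypothetical; only the `seen_hashes` set is named by the pipeline itself.

```python
import hashlib

seen_hashes = set()  # mirrors the pipeline's seen_hashes set

def is_duplicate(function_source: str) -> bool:
    """Return True if this exact function body has already been ingested."""
    # Convert the raw text into a unique 64-character hex fingerprint
    fingerprint = hashlib.sha256(function_source.encode("utf-8")).hexdigest()
    if fingerprint in seen_hashes:
        return True
    seen_hashes.add(fingerprint)
    return False
```

Note that hashing happens on the *raw* text, before normalization, so two functions that differ only in variable names are still treated as distinct at this stage.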
### 2. AST Parsing and Granular Splitting
Unlike simple text scanners, Soteria parses code into an Abstract Syntax Tree (AST).
- Function-Level Focus: The pipeline automatically identifies `FunctionDef` nodes within a file.
- Isolation: Each function is extracted and processed as an independent "Module," allowing for precise detection even if a malicious function is hidden inside a mostly clean file.
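A minimal sketch of this splitting step using the standard `ast` module (the `extract_functions` name is illustrative, not the pipeline's actual API):

```python
import ast

def extract_functions(source: str) -> list[str]:
    """Split a source file into independent function-level snippets."""
    tree = ast.parse(source)
    functions = []
    # ast.walk visits every node, so nested functions are found as well
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Recover the exact source text for this node (Python 3.8+)
            functions.append(ast.get_source_segment(source, node))
    return functions
```

Each returned snippet can then be hashed, normalized, and labeled independently of the file it came from.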
### 3. Structural Normalization
This is the most critical step for resisting obfuscation. The `codeNormalizer` (a custom `ast.NodeTransformer`) anonymizes the code:
- Variable Anonymization: Local variables and arguments are renamed to generic placeholders.
- Constant Stripping: Hardcoded strings and numbers are removed or normalized.
- Logic Preservation: The structural "skeleton"—the arrangement of loops, function calls, and assignments—remains intact.
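The three bullets above can be sketched as a small `ast.NodeTransformer`. This is an illustrative stand-in, not the real `codeNormalizer`, which may use a different naming scheme and cover more node types:

```python
import ast

class SketchNormalizer(ast.NodeTransformer):
    """Illustrative anonymizer: renames identifiers, strips literals."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, original: str) -> str:
        # Map each distinct identifier to VAR_0, VAR_1, ... in order seen
        if original not in self.names:
            self.names[original] = f"VAR_{len(self.names)}"
        return self.names[original]

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)   # variable anonymization
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)  # argument anonymization
        return node

    def visit_Constant(self, node):
        # Constant stripping: keep only the structural position of literals
        node.value = "STR" if isinstance(node.value, str) else 0
        return node

def normalize(source: str) -> str:
    tree = SketchNormalizer().visit(ast.parse(source))
    return ast.unparse(tree)  # Python 3.9+
```

Because only names and literal values change, the loops, calls, and assignments (the structural "skeleton") survive intact, so two behaviorally identical functions with different identifiers normalize to the same text.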
## Processing Hugging Face Data
Soteria includes built-in support for processing raw CSV data typically found on AI research platforms. The pipeline automatically:
- Cleans Markdown artifacts (e.g., stripping `` ```python `` fences).
- Handles missing or null code snippets.
- Integrates external sources into the master `CSV_master/numericFeatures.csv` file alongside local data.
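A sketch of the fence-stripping and null-handling steps, using only the standard library (the `clean_snippet` helper is hypothetical; the pipeline's actual cleaning code is not shown here):

```python
FENCE = "`" * 3  # a literal triple-backtick Markdown fence

def clean_snippet(cell):
    """Strip Markdown code fences and drop empty or null CSV cells."""
    if cell is None or not str(cell).strip():
        return None  # missing/null snippet: skip this row
    text = str(cell).strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a "python" tag
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)].rstrip()
    return text or None
```

Cells that survive cleaning flow into the same deduplication and normalization steps as locally ingested files.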
## Usage: Triggering the Pipeline
To rebuild the dataset and prepare it for training, you can invoke the pipeline programmatically within the backend environment.
### Example: Building a Hardened Dataset
```python
from backend.src.dataPipeline_AST import hardened_Dataset_with_Normalization

# This will:
# 1. Scan local /data folders
# 2. Ingest Hugging Face CSVs
# 3. Deduplicate and normalize
# 4. Save the results to /CSV_master/
hardened_Dataset_with_Normalization()
```
## Data Output Format
After ingestion and cleaning, the data is saved in a tabular format. Each row represents a single function's "Structural DNA":
| Feature | Description |
| :--- | :--- |
| `rawCode` | The original source code, kept for reference. |
| `normalizedCode` | The anonymized, AST-transformed code. |
| `label` | `0` for Clean, `1` for Malicious. |
| `source` | The origin filename and function name. |
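To make the schema concrete, the sketch below serializes one illustrative row with the columns from the table above (the row values themselves are invented for illustration):

```python
import csv
import io

# Column names taken from the table above; values are illustrative only
FIELDS = ["rawCode", "normalizedCode", "label", "source"]

row = {
    "rawCode": "def f(x):\n    return x + 1",
    "normalizedCode": "def f(VAR_0):\n    return VAR_0 + 0",
    "label": 0,
    "source": "clean/utils.py::f",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)  # csv quotes the embedded newlines automatically
header = buf.getvalue().splitlines()[0]
```

Note that the `csv` module quotes the multi-line `rawCode` cell, so the file stays parseable row-by-row.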
Once the data is cleaned and saved, it is ready for the Vectorization Engine to convert these structures into the numerical matrices used by the Random Forest and Neural Network classifiers.