Introduction
Soteria (A.C.I.D) is a machine learning-powered security pipeline designed to detect malicious code injections and backdoors within Python environments. Unlike traditional security tools that rely on easily bypassable keyword-based signatures, Soteria analyzes the "Structural DNA" of code using Abstract Syntax Trees (AST). By focusing on behavioral logic rather than variable naming or formatting, the system identifies dangerous patterns that manual reviews or simple regex scanners often miss.
Core Value Proposition
Modern malware often uses obfuscation (renaming variables, encrypting strings, or injecting junk code) to evade detection. Soteria treats code as a structural blueprint. By transforming Python functions into normalized numerical feature matrices, it can identify a "backdoor shell" or a "credential exfiltrator" even when names, literals, and formatting have been changed to hide its intent.
- Resilience to Obfuscation: Uses a custom NodeTransformer to anonymize logic, making it immune to simple renaming tactics.
- Behavioral Detection: Focuses on the distribution of operations (e.g., the ratio of system calls to variable assignments).
- Real-Time Intelligence: Provides an end-to-end dashboard for scanning snippets, managing datasets, and monitoring model performance.
Key Features
1. Structural Normalization
At the heart of the pipeline is the codeNormalizer. This engine traverses the AST of a Python file, stripping away superficial metadata (like local variable names and constant values) and replacing them with generic placeholders. This ensures the ML model learns the shape of the logic rather than the vocabulary of the developer.
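
The normalization step described above can be sketched with the standard library's `ast.NodeTransformer`. This is a minimal illustration, not Soteria's actual normalizer: the class name `NameAnonymizer`, the `var_N` and `CONST` placeholders, and the choice to leave builtins and attribute names untouched are all assumptions made for the example. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


class NameAnonymizer(ast.NodeTransformer):
    """Hypothetical stand-in for the normalizer; not Soteria's real code."""

    def __init__(self):
        self.names = {}  # original identifier -> generic placeholder

    def _placeholder(self, name):
        # Reuse one placeholder per identifier so data flow stays visible.
        return self.names.setdefault(name, f"var_{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = "func"          # drop the author's function name
        self.generic_visit(node)    # then normalize arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        # Rename only identifiers bound inside the snippet (assigned or
        # passed as arguments); names such as `sum` or `os` are kept.
        if isinstance(node.ctx, ast.Store) or node.id in self.names:
            node.id = self._placeholder(node.id)
        return node

    def visit_Constant(self, node):
        # Replace every literal with the same generic marker.
        return ast.copy_location(ast.Constant(value="CONST"), node)


def normalize(source: str) -> str:
    """Return the source with local names and constants anonymized."""
    tree = NameAnonymizer().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


a = "def total(prices, tax):\n    subtotal = sum(prices)\n    return subtotal * tax + 1\n"
b = "def zz(q, w):\n    k = sum(q)\n    return k * w + 99\n"
print(normalize(a) == normalize(b))  # the two variants collapse to one shape
```

Because both snippets differ only in identifiers and literal values, they normalize to the same text, which is the property a downstream model would rely on.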
2. Function-Level Granularity
Large source files are automatically decomposed into individual functions. This allows security teams to pinpoint exactly which part of a codebase is "corrupted" without flagging entire files as false positives, enabling faster remediation.
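
One way to perform this decomposition is to walk the AST and pull out each function's source text. The sketch below assumes a hypothetical helper named `split_functions`; how Soteria itself slices files is not specified here.

```python
import ast


def split_functions(source: str):
    """Return (name, line number, source text) for every function found.

    Hypothetical helper for illustration; nested functions and methods
    are reported as separate entries.
    """
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment uses the node's position to slice the text.
            segment = ast.get_source_segment(source, node)
            found.append((node.name, node.lineno, segment))
    return found


example = (
    "import os\n"
    "\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class C:\n"
    "    def run(self):\n"
    "        os.system('id')\n"
)
for name, lineno, _ in split_functions(example):
    print(name, lineno)
```

Keeping the line number alongside each segment lets a report point at the exact function that was flagged rather than at the whole file.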
3. Hybrid ML Intelligence
Soteria utilizes a multi-layered detection approach:
- Random Forest & Gradient Boosting: For rapid, ensemble-based classification based on AST node distribution.
- Neural Networks: A PyTorch-powered deep learning layer for identifying complex, non-linear vulnerability patterns.
- Stacking Classifier: A hybrid model that combines traditional ML and Neural Network predictions for maximum accuracy.
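
The stacking idea can be sketched with scikit-learn's `StackingClassifier`. Everything specific here is an assumption for illustration: the training data is synthetic, the hyperparameters are arbitrary, and `MLPClassifier` stands in for the PyTorch network, whose real architecture is not described in this document.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic node-count vectors [Assign, Call, BinOp, Attribute]; not real data.
rng = np.random.default_rng(0)
clean = rng.poisson(lam=[3, 1, 3, 0.5], size=(40, 4))
malicious = rng.poisson(lam=[1, 4, 0.5, 2], size=(40, 4))
X = np.vstack([clean, malicious]).astype(float)
y = np.array([0] * 40 + [1] * 40)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        # Stand-in for the PyTorch layer; an assumption of this sketch.
        ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    ],
    # The meta-model learns how to weigh the base models' predictions.
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(stack.predict([[1.0, 4.0, 0.0, 2.0]]))
```

The base models are trained on the feature vectors, and the final estimator is trained on their cross-validated predictions, so it can learn when to trust each one.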
The "Structural DNA" Approach
The pipeline converts Python logic into a numerical matrix. Below is a simplified representation of how Soteria views the difference between clean and malicious logic:
| Feature | calculate_total (Clean) | backdoor_shell (Malicious) |
| :--- | :--- | :--- |
| Assignments (Assign) | High (2.0) | Low (1.0) |
| Calls (Call) | Low (1.0) | High (4.0) |
| Binary Ops (BinOp) | High (3.0) | Low (0.0) |
| Attributes (Attribute) | Low (0.0) | High (2.0) |
| Soteria Label | 0 (Clean) | 1 (Malicious) |
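
A feature row like those in the table above can be produced by counting AST node types. The sketch below is an assumption about how such a vector might be built, and the two function bodies are hypothetical examples written so that their counts land on the table's values; they are not the functions behind the original table. The source is only parsed, never executed.

```python
import ast
from collections import Counter

FEATURES = ("Assign", "Call", "BinOp", "Attribute")  # rows of the table above


def node_counts(source: str) -> list:
    """Count selected AST node types and return them in FEATURES order."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [float(counts[name]) for name in FEATURES]


CLEAN = """
def calculate_total(price, qty, tax):
    subtotal = price * qty
    total = subtotal + subtotal * tax
    return round(total, 2)
"""

MALICIOUS = """
def backdoor_shell(sock):
    cmd = sock.recv(1024)
    exec(compile(cmd, "x", "exec"))
    os.system(cmd)
"""

print(node_counts(CLEAN))      # [2.0, 1.0, 3.0, 0.0]
print(node_counts(MALICIOUS))  # [1.0, 4.0, 0.0, 2.0]
```

Note that `Call` counts every call expression, so `round(...)` and `os.system(...)` both contribute; telling them apart would need extra features beyond raw node counts.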
System Architecture
Soteria is built as a split-deployment application to ensure high performance during heavy ML inference tasks:
- The Intelligence Engine (Backend): A Flask-based API built on scikit-learn and PyTorch. It handles code parsing, AST vectorization, and model retraining.
- The Cyber Sentinel Dashboard (Frontend): A React/TypeScript interface styled with Tailwind CSS and Framer Motion, providing real-time scanning results and visual data graphs.
- Data Pipeline: A hardened ingestion system that integrates with Hugging Face datasets and uses SHA-256 hashing to drop duplicate samples so they do not bias training.
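
The hash-based deduplication in the data pipeline can be sketched as follows. The helper name `deduplicate` is hypothetical, and hashing the raw snippet text (rather than, say, a normalized form) is an assumption; the document does not say which representation Soteria hashes.

```python
import hashlib


def deduplicate(samples):
    """Keep the first occurrence of each distinct snippet, keyed by SHA-256."""
    seen = set()
    unique = []
    for code in samples:
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique


print(deduplicate(["a = 1", "b = 2", "a = 1"]))  # ['a = 1', 'b = 2']
```

Storing digests instead of full snippets keeps the lookup set small even when the dataset contains large files.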
Use Cases
- CI/CD Integration: Automatically scan incoming Pull Requests for suspicious structural patterns.
- Security Auditing: Rapidly audit large legacy codebases for dormant backdoors.
- Developer Education: Use the gamified "Knowledge Graph" and "Scanner" to help developers understand why certain code structures are considered high-risk.