Data Persistence & Models
Overview
Soteria uses a dual-layer persistence strategy. Application-level data, such as user profiles and authentication credentials, is stored in a relational database and accessed through SQLAlchemy. Machine learning artifacts, including serialized models, feature scalers, and processed datasets, are kept as flat files so the Intelligence Engine can load them quickly.
User Models & Authentication
The backend utilizes SQLite (via Flask-SQLAlchemy) for managing user accounts and administrative access. The architecture is designed to support secure sessions through JWT (JSON Web Tokens) and hashed credentials.
The User Model
The User model is the primary identity entity within the system. It represents both standard users and administrative accounts.
| Field | Type | Description |
| :--- | :--- | :--- |
| id | Integer | Primary key. |
| name | String | The display name for the user. |
| email | String | Unique identifier used for login. |
| password_hash | String | Bcrypt-hashed password. |
| is_admin | Boolean | Flag for access to the Neural Engine and Admin Dashboard. |
| created_at | DateTime | Automatic timestamp of registration. |
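For orientation, the fields above map roughly onto the SQLite table sketched below. This uses Python's standard `sqlite3` module rather than the project's Flask-SQLAlchemy model, and the table name `users`, the `NOT NULL` constraints, and the column defaults are all assumptions for illustration, not details taken from the codebase.

```python
import sqlite3

# Assumed DDL mirroring the documented fields. The real schema is generated
# by Flask-SQLAlchemy and may differ in table name, lengths, and constraints.
USER_TABLE_DDL = """
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    name          TEXT     NOT NULL,
    email         TEXT     NOT NULL UNIQUE,
    password_hash TEXT     NOT NULL,
    is_admin      BOOLEAN  NOT NULL DEFAULT 0,
    created_at    DATETIME DEFAULT CURRENT_TIMESTAMP
)
"""


def create_demo_db() -> sqlite3.Connection:
    """Create an in-memory database containing only the sketched table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(USER_TABLE_DDL)
    return conn


conn = create_demo_db()
conn.execute(
    "INSERT INTO users (name, email, password_hash) VALUES (?, ?, ?)",
    ("Dev", "dev@example.com", "<bcrypt hash goes here>"),
)

# The UNIQUE constraint on email rejects a second account with the same address.
try:
    conn.execute(
        "INSERT INTO users (name, email, password_hash) VALUES (?, ?, ?)",
        ("Other", "dev@example.com", "<bcrypt hash goes here>"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Because `email` is the login identifier, enforcing uniqueness at the database level (rather than only in application code) keeps two registrations from racing each other into the same address.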
Data Security
- Password Hashing: Soteria never stores raw passwords. It uses `flask-bcrypt` to generate salted hashes.
- Session Management: Once authenticated via `/api/auth/login`, the system issues a JWT containing the `user_id` and `is_admin` status, which must be included in the `Authorization: Bearer <token>` header for protected routes.
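To make the token flow concrete, here is a minimal standard-library sketch of an HS256-signed JWT carrying `user_id` and `is_admin`. It is not the project's implementation, which most likely relies on a JWT library; the function names, the `exp` claim, the one-hour lifetime, and the secret handling are assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id: int, is_admin: bool, secret: bytes,
                ttl_seconds: int = 3600) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "user_id": user_id,
        "is_admin": is_admin,
        "exp": int(time.time()) + ttl_seconds,  # assumed expiry claim
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Return the claims if the signature and expiry check out."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims


def bearer_header(token: str) -> dict:
    """Header a client would attach to requests for protected routes."""
    return {"Authorization": f"Bearer {token}"}
```

Note that the payload is only signed, not encrypted: anyone holding the token can read `user_id` and `is_admin`, so nothing secret belongs in the claims.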
Machine Learning Model Persistence
The "Intelligence Engine" relies on pre-trained models stored in the backend/ML_master/ directory. Depending on the active engine configuration, the system loads different serialized artifacts.
1. Ensemble Model (acidModel.pkl)
The default classifier, a Voting Classifier that combines Random Forest, Gradient Boosting, and Logistic Regression.
- Storage Format: Joblib Serialized (`.pkl`)
- Persistence Logic: Exported after 5-fold cross-validation in `trainerModel_AST.py`.
2. Neural Network (acidModel_neural.pt)
A deep learning model built with PyTorch for complex feature extraction.
- Storage Format: PyTorch State Dictionary (`.pt`)
- Metadata: Stores `input_size`, `hidden_sizes`, and `dropout` rates to ensure the architecture matches during inference.
3. Hybrid Stacking Model
A stacking ensemble that uses the Neural Network's predictions as augmented features for a final meta-classifier.
- Dependency: Requires `acidModel_scaler.pkl` to normalize input vectors before they reach the network.
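Since each engine depends on a different set of files, a start-up check along the following lines can report missing artifacts before any model is loaded. This is a sketch only: the engine keys (`ensemble`, `neural`, `hybrid`) and the helper name are hypothetical, and the file name of the stacking meta-classifier is not given in this document, so it is left out of the map.

```python
from pathlib import Path

MODEL_DIR = Path("backend/ML_master")

# Engine keys are hypothetical; file names are the ones documented above.
ENGINE_ARTIFACTS = {
    "ensemble": ["acidModel.pkl"],
    "neural": ["acidModel_neural.pt"],
    # The hybrid model consumes the network's predictions and needs the
    # scaler. Its meta-classifier file is not named here, so it is omitted.
    "hybrid": ["acidModel_neural.pt", "acidModel_scaler.pkl"],
}


def missing_artifacts(engine: str, model_dir: Path = MODEL_DIR) -> list[str]:
    """Return the artifact file names that are absent for the given engine."""
    if engine not in ENGINE_ARTIFACTS:
        raise ValueError(f"unknown engine: {engine!r}")
    return [
        name for name in ENGINE_ARTIFACTS[engine]
        if not (model_dir / name).is_file()
    ]
```

Calling `missing_artifacts("hybrid")` at start-up and refusing to serve predictions when the list is non-empty gives a clearer failure than a deserialization error at the first request.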
Data Pipeline & "Structural DNA"
To keep duplicated samples from biasing the model, the data pipeline applies strict persistence rules to training samples.
Deduplication via SHA-256
Before any Python function is added to the training set, it is passed through a hashing utility. This ensures that duplicate code snippets—which could skew model accuracy—are ignored.
```python
from hashlib import sha256


def get_Code_Hash(codeText):
    # Generates a SHA-256 fingerprint used to detect duplicate functions
    return sha256(codeText.encode('utf-8')).hexdigest()
```
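As an illustration of how such a hash can drive deduplication, the sketch below keeps a set of digests already seen and skips any function whose digest is in it. The helper `drop_duplicates` and the sample strings are hypothetical, and the document does not say whether hashing happens before or after normalization, so this version hashes the raw source text.

```python
from hashlib import sha256


def get_Code_Hash(codeText):
    # Same fingerprint function as documented above.
    return sha256(codeText.encode("utf-8")).hexdigest()


def drop_duplicates(functions):
    """Keep the first occurrence of each distinct source string."""
    seen = set()
    unique = []
    for code in functions:
        digest = get_Code_Hash(code)
        if digest in seen:
            continue  # identical text was already added to the training set
        seen.add(digest)
        unique.append(code)
    return unique


samples = [
    "def f(x):\n    return x + 1\n",
    "def g(y):\n    return y * 2\n",
    "def f(x):\n    return x + 1\n",  # exact duplicate of the first sample
]
unique_samples = drop_duplicates(samples)
```

A hash over raw text only catches byte-identical duplicates: two copies that differ in whitespace or a comment produce different digests and would both be kept.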
The CSV_master Repository
Once functions are normalized (variable names removed, constants anonymized) and vectorized, they are persisted in numericFeatures.csv.
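One way to picture the normalization step is an `ast.NodeTransformer` that rewrites identifiers and literals to fixed placeholders. The sketch below is an assumption about the general approach, not the project's code: the placeholder strings `VAR` and `CONST`, the class and function names, and the choice to rename every `Name` node (including called functions) are all illustrative. It requires Python 3.9+ for `ast.unparse`.

```python
import ast


class _Anonymizer(ast.NodeTransformer):
    """Replace identifiers and literals with fixed placeholders."""

    def visit_Name(self, node):
        # Every referenced name becomes the same placeholder.
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        # Function parameters are renamed too; annotations are still visited.
        self.generic_visit(node)
        node.arg = "VAR"
        return node

    def visit_Constant(self, node):
        # All literals collapse to one placeholder value.
        return ast.copy_location(ast.Constant(value="CONST"), node)


def normalize(code: str) -> str:
    """Return source text with names and constants anonymized."""
    tree = _Anonymizer().visit(ast.parse(code))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


first = normalize("def f(a):\n    return a + 1\n")
second = normalize("def f(b):\n    return b + 2\n")
same_structure = first == second
```

After normalization, two functions that differ only in variable names and literal values become identical, which is what lets the feature vector describe structure rather than surface naming.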
Schema for numericFeatures.csv:
- AST Node Columns (e.g., `Assign`, `Call`, `BinOp`): Integers representing the frequency of these nodes in the function.
- LABEL: `0` for Clean, `1` for Malicious.
- SOURCE: A string identifier (Filename + Function Name) for traceability.
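The schema above can be illustrated by counting node types with the standard `ast` module and writing one CSV row per function. This is a sketch under assumptions: the three-column subset, the helper name `feature_row`, and the `file:function` format used for SOURCE are illustrative, and the real file presumably has many more node columns.

```python
import ast
import csv
import io
from collections import Counter

# Illustrative subset; the real file likely covers many more node types.
NODE_COLUMNS = ["Assign", "Call", "BinOp"]


def feature_row(code, label, source):
    """Count AST node types in `code` and attach LABEL and SOURCE."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))
    row = {name: counts.get(name, 0) for name in NODE_COLUMNS}
    row["LABEL"] = label    # 0 = Clean, 1 = Malicious
    row["SOURCE"] = source  # assumed "file:function" format
    return row


code = (
    "def add(a, b):\n"
    "    total = a + b\n"
    "    print(total)\n"
    "    return total\n"
)
row = feature_row(code, 0, "utils.py:add")

# Write to an in-memory buffer here; a real pipeline would append to the CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=NODE_COLUMNS + ["LABEL", "SOURCE"])
writer.writeheader()
writer.writerow(row)
```

Fixing the column order in one list (`NODE_COLUMNS`) matters because the trained models expect features in the same positions at inference time as during training.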
Usage Example: Database Access
For developers looking to interact with the persistence layer directly, the app.py context provides the standard SQLAlchemy interface:
```python
# Example: Querying the database within the Flask application context
from app import app, db, User

with app.app_context():
    # Find all admin users
    admins = User.query.filter_by(is_admin=True).all()

    # Check a password; first() returns None when no user matches
    user = User.query.filter_by(email="dev@example.com").first()
    if user is not None and user.check_password("secure_password_123"):
        print("Authenticated")
```