Data Persistence & Models

Overview

Soteria uses a two-layer persistence strategy. Application data, such as user profiles and authentication credentials, is stored in a relational database through SQLAlchemy. Machine learning artifacts (serialized models, feature scalers, and processed datasets) are kept as flat files so the Intelligence Engine can load them quickly.

User Models & Authentication

The backend uses SQLite (via Flask-SQLAlchemy) to manage user accounts and administrative access. Sessions are secured with JSON Web Tokens (JWT), and credentials are stored only as hashes.

The `User` Model

The `User` model is the primary identity entity in the system. It represents both standard users and administrative accounts.

Data Security

Password Hashing: Soteria never stores raw passwords. It uses flask-bcrypt to generate salted hashes.
Session Management: After a successful login via /api/auth/login, the system issues a JWT containing the user_id and is_admin status. Clients must send this token in the Authorization: Bearer <token> header on every protected route.
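
The token flow described above can be sketched with the standard library alone. This is a minimal illustration of an HS256-signed JWT, not Soteria's actual implementation: the backend may well use a JWT library instead, and `SECRET_KEY`, `issue_token`, `verify_token`, `token_from_header`, and the one-hour expiry are all assumptions made for the example. Only the `user_id` and `is_admin` claims come from the description above.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"change-me"  # hypothetical; a real deployment loads this from configuration


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    """HMAC-SHA256 over the 'header.payload' string."""
    return hmac.new(SECRET_KEY, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(user_id: int, is_admin: bool, ttl_seconds: int = 3600) -> str:
    """Build a signed token carrying user_id, is_admin, and an expiry time."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"user_id": user_id, "is_admin": is_admin,
               "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input))}"


def verify_token(token: str) -> dict:
    """Return the payload if signature and expiry are valid; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(f"{header_b64}.{payload_b64}")
    # Constant-time comparison avoids leaking how many bytes matched
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload


def token_from_header(authorization: str) -> str:
    """Extract the token from an 'Authorization' header value of the form 'Bearer <token>'."""
    scheme, _, token = authorization.partition(" ")
    if scheme != "Bearer" or not token:
        raise ValueError("expected 'Bearer <token>'")
    return token
```

A protected route would call `verify_token(token_from_header(value))` on the incoming header and reject the request if a `ValueError` is raised.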

Machine Learning Model Persistence

The "Intelligence Engine" relies on pre-trained models stored in the backend/ML_master/ directory. Depending on the active engine configuration, the system loads different serialized artifacts.

1. Ensemble Model (`acidModel.pkl`)

The default classifier is a Voting Classifier that combines Random Forest, Gradient Boosting, and Logistic Regression.

Storage Format: Joblib Serialized (.pkl)
Persistence Logic: Exported after 5-fold cross-validation in trainerModel_AST.py.
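
As a rough sketch of what "cross-validate, then export" can look like with scikit-learn and joblib, consider the following. It is not the content of trainerModel_AST.py: the synthetic data, the hyperparameters, and the choice of soft voting are assumptions, and the example serializes to an in-memory buffer where a real export would pass a path such as the acidModel.pkl file.

```python
import io

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # assumption: the source does not say whether voting is hard or soft
)

# 5-fold cross-validation gives an accuracy estimate before the final fit
scores = cross_val_score(ensemble, X, y, cv=5)

# Fit on all data, then serialize; joblib.dump also accepts a filename
ensemble.fit(X, y)
buffer = io.BytesIO()
joblib.dump(ensemble, buffer)

# Loading restores a ready-to-use estimator
buffer.seek(0)
restored = joblib.load(buffer)
```

Because joblib files are pickle-based, they should only be loaded from trusted locations.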

2. Neural Network (`acidModel_neural.pt`)

A deep learning model built with PyTorch for complex feature extraction.

Storage Format: PyTorch State Dictionary (.pt)
Metadata: The file also stores input_size, hidden_sizes, and dropout rates so that the network can be rebuilt with a matching architecture at inference time.
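
One common way to keep architecture metadata next to the weights is to save a dictionary that holds both. The sketch below assumes a plain feed-forward stack with a single dropout rate and a single output logit. The `build_network` helper, the `"state_dict"` key, and the hyperparameter values are hypothetical, and the real layout of acidModel_neural.pt may differ.

```python
import io

import torch
from torch import nn


def build_network(input_size, hidden_sizes, dropout):
    """Rebuild the layer stack from the stored hyperparameters."""
    layers = []
    previous = input_size
    for width in hidden_sizes:
        layers += [nn.Linear(previous, width), nn.ReLU(), nn.Dropout(dropout)]
        previous = width
    layers.append(nn.Linear(previous, 1))  # assumption: one logit for clean vs. malicious
    return nn.Sequential(*layers)


# Hypothetical hyperparameters for illustration
config = {"input_size": 16, "hidden_sizes": [32, 8], "dropout": 0.3}
model = build_network(**config)

# Save the weights together with the values needed to rebuild the network
buffer = io.BytesIO()  # a real export would pass a file path instead
torch.save({"state_dict": model.state_dict(), **config}, buffer)

# At inference time: read the metadata first, rebuild, then load the weights
buffer.seek(0)
checkpoint = torch.load(buffer)
restored = build_network(
    checkpoint["input_size"], checkpoint["hidden_sizes"], checkpoint["dropout"]
)
restored.load_state_dict(checkpoint["state_dict"])
restored.eval()  # disables dropout for inference
```

If the rebuilt layers do not match the saved tensors in name or shape, `load_state_dict` raises an error, which is why the metadata must travel with the weights.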

3. Hybrid Stacking Model

A stacking ensemble that uses the Neural Network's predictions as augmented features for a final meta-classifier.

Dependency: Requires acidModel_scaler.pkl to normalize input vectors before they reach the network.

Data Pipeline & "Structural DNA"

To keep duplicated or inconsistently formatted samples from skewing the model, the data pipeline applies strict rules before any training sample is persisted.

Deduplication via SHA-256

Before a Python function is added to the training set, its source text is hashed with SHA-256. A function whose hash has already been seen is skipped, because duplicate code snippets could skew model accuracy.

from hashlib import sha256

def get_Code_Hash(codeText):
    # Generates a SHA-256 fingerprint (64 hex characters) to detect duplicate functions
    return sha256(codeText.encode('utf-8')).hexdigest()
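
To show how such a hash can drive deduplication, here is a small sketch that keeps only the first occurrence of each function. The `deduplicate` helper and its in-memory `seen` set are assumptions for illustration; the actual pipeline may track hashes differently, for example across runs.

```python
from hashlib import sha256


def get_Code_Hash(codeText):
    # Same fingerprint function as above, repeated so the example is self-contained
    return sha256(codeText.encode("utf-8")).hexdigest()


def deduplicate(functions):
    """Return the sources with exact duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for source in functions:
        digest = get_Code_Hash(source)
        if digest not in seen:
            seen.add(digest)
            unique.append(source)
    return unique


samples = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",  # exact duplicate of the first sample
]
unique_samples = deduplicate(samples)
```

Only byte-identical text produces the same hash, so two functions that differ by a single space count as distinct unless they are normalized before hashing.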

The `CSV_master` Repository

Once functions are normalized (variable names removed, constants anonymized) and vectorized, they are persisted in numericFeatures.csv.
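
One plausible reading of the normalization step can be sketched with Python's `ast` module. The `Normalizer` class, the `normalize` helper, and the `VAR` placeholder are hypothetical; the source does not specify how names and constants are actually anonymized.

```python
import ast


class Normalizer(ast.NodeTransformer):
    """Replace identifiers and literal values with fixed placeholders."""

    def visit_Name(self, node):
        # Every variable reference becomes the same placeholder name
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        # Function parameters are renamed as well
        node.arg = "VAR"
        return node

    def visit_Constant(self, node):
        # Keep the literal's type but drop its value (42 -> 0, "abc" -> "")
        return ast.copy_location(ast.Constant(value=type(node.value)()), node)


def normalize(source):
    """Parse the source, anonymize names and constants, and return the new source text."""
    tree = Normalizer().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9 or later


first = normalize("def f(a, b):\n    return a + 42")
second = normalize("def f(x, y):\n    return x + 7")
```

Here `first` and `second` are identical, because the two functions differ only in names and constant values. Note that this simple version also renames called functions such as `print`, which a real pipeline might prefer to keep.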

Schema for numericFeatures.csv:

AST Node Columns (e.g., Assign, Call, BinOp): Integer counts of how often each node type appears in the function.
LABEL: 0 for Clean, 1 for Malicious.
SOURCE: A string identifier (Filename + Function Name) for traceability.
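
The schema above can be illustrated with a short sketch that counts AST node types and writes one row per function. The three-column subset, the `vectorize` and `write_rows` helpers, and the `file:function` format used for SOURCE are assumptions; the real numericFeatures.csv presumably has many more node columns.

```python
import ast
import csv
import io
from collections import Counter

NODE_COLUMNS = ["Assign", "Call", "BinOp"]  # small subset for illustration


def vectorize(source, label, origin):
    """Count AST node types in a function and build one CSV row."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    row = {name: counts.get(name, 0) for name in NODE_COLUMNS}
    row["LABEL"] = label    # 0 for Clean, 1 for Malicious
    row["SOURCE"] = origin  # hypothetical "file:function" identifier
    return row


def write_rows(rows, handle):
    """Write a header plus one line per row to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=NODE_COLUMNS + ["LABEL", "SOURCE"])
    writer.writeheader()
    writer.writerows(rows)


row = vectorize("def f(a):\n    b = a + 1\n    return g(b)", 0, "demo.py:f")
output = io.StringIO()  # a real pipeline would open numericFeatures.csv instead
write_rows([row], output)
```

The sample function contains one assignment, one call, and one binary operation, so each of the three count columns in `row` is 1.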

Usage Example: Database Access

Developers who need to work with the persistence layer directly can use the standard SQLAlchemy interface exposed by app.py, inside the Flask application context:

# Example: Querying the database within the Flask application context
from app import app, db, User

with app.app_context():
    # Find all admin users
    admins = User.query.filter_by(is_admin=True).all()

    # Look up a user by email; .first() returns None when no row matches
    user = User.query.filter_by(email="dev@example.com").first()
    # Guard against a missing user before checking the password
    if user is not None and user.check_password("secure_password_123"):
        print("Authenticated")