Reference AI-Native GovTech Architecture

Reference AI-Native GovTech Architecture

This article describes a reference architecture for a distributed system designed for typical GovTech projects. The key feature of the architecture is the AI-Native approach: sovereign artificial intelligence is embedded as a cross-cutting component at every layer – from data enrichment to analytics and decision-making.

Sovereign AI – the ability of a state to develop, deploy, and control artificial intelligence technologies using its own infrastructure, data, personnel, and business ecosystems that are independent of external platforms.

System Requirements

Functional Requirements

FR-1: Continuous Data Ingestion. The system ingests data from heterogeneous sources (State Information System, State Information Resource, external APIs, file storage, IoT sensors, manual input). Supported modes include batch loading, streaming, and incremental synchronization (Change Data Capture). Sources may have limited bandwidth and unstable connectivity.

FR-2: Unified Data Storage. All collected data is consolidated in a single data lake to leverage the benefits of shared data usage. The lake supports structured, semi-structured, and unstructured formats. The storage layer provides ACID transactions, versioning (time-travel), and access control.

FR-3: Multi-Tier Processing Pipelines. Data passes through three quality tiers: Bronze (raw data as-is), Silver (cleansed, deduplicated, standardized), and Gold (aggregated, enriched, ready for consumption). Sovereign AI may be applied at each tier: classification, entity extraction, metadata generation, anomaly detection.

FR-4: AI Enrichment and Target Logic. The platform enables the application of a sovereign LLM and proprietary ML models to solve applied tasks: semantic search (RAG), document classification, text recognition (OCR), response generation, predictive analytics, and decision support.

FR-5: Analytics and Visualization. The system provides OLAP analytics, interactive dashboards, and report generation for leadership. Supported capabilities include ad-hoc SQL queries, multidimensional analysis, drill-down, and export to standard formats. An AI assistant helps formulate queries in natural language.

FR-6: Model Training and Management. The platform provides a full MLOps lifecycle: experiment versioning, model registry, fine-tuning (LoRA), A/B testing, drift monitoring (data drift monitoring tracks changes in the properties of incoming data), and automatic rollback.

Non-Functional Requirements

NFR-1: Deployment Mode. On-premise, air-gapped environment (full or partial internet isolation). Updates are delivered via Data Diode or local repository. Controlled through architectural audit.

NFR-2: Availability and Consistency. 99.9% for analytics and dashboard services; 99.5% for AI assistants. Strong consistency for mission-critical documents and eventual consistency for pipeline-generated data. Controlled via Prometheus, Grafana, and regular SLA reporting.

NFR-3: Response Time. Dashboards and OLAP queries: < 3 sec; RAG queries to LLM: < 5 sec to first token; Batch processing: < 4 hours for a daily batch. Controlled via OpenTelemetry and Jaeger.

NFR-4: Throughput. ≥ 100 RPS (requests per second) at API Gateway; ≥ 500 concurrent users; ≥ 10 million documents uploaded per month. Controlled via k6 or Locust.
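
For illustration, a minimal Locust sketch that exercises NFR-4; the host and endpoint paths are assumptions, not part of the reference architecture.

```python
# locustfile.py: minimal load profile for NFR-4 (hypothetical endpoint paths).
# Run: locust -f locustfile.py --host https://api.example.gov --users 500 --spawn-rate 50
from locust import HttpUser, task, between

class DashboardUser(HttpUser):
    # Simulated think time between requests per user.
    wait_time = between(1, 3)

    @task(3)
    def query_dashboard(self):
        # Assumed analytics endpoint; replace with the real API Gateway route.
        self.client.get("/api/v1/dashboards/appeals-by-region")

    @task(1)
    def upload_document(self):
        # Assumed ingestion endpoint backing the 10M documents/month requirement.
        self.client.post("/api/v1/documents", files={"file": ("scan.pdf", b"%PDF-1.4 ...")})
```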

NFR-5: Scalability. Horizontal scaling of all stateless components. Linear throughput growth when adding nodes. Controlled via Kubernetes HPA and utilization monitoring.

NFR-6: Fault Tolerance. RPO (Recovery Point Objective) ≤ 1 hour; RTO (Recovery Time Objective) ≤ 4 hours. Data is replicated in 3 copies. Controlled via quarterly DR testing.

NFR-7: Security. Data encryption at-rest (AES-256) and in-transit (TLS 1.3). RBAC at row and column level. Audit of all data operations. Controlled via penetration testing and log auditing.

NFR-8: Resilience Under Resource Constraints. System capacity is expanded in increments (procurement occurs once a year). The system operates under partial GPU node loss (graceful degradation): AI functions degrade, but analytics remains available. Controlled via chaos engineering (Litmus).

NFR-9: Observability. Logs, metrics, traces. Unified monitoring console. Alerts with escalation. Controlled via ELK/Loki + Prometheus + Jaeger.

NFR-10: Compatibility. API – RESTful (OpenAPI 3.0) and gRPC. Data formats are Parquet, JSON, Avro. Storage is S3-compatible. SQL is ANSI SQL via Trino. Controlled via contract testing (Pact).

General Architecture Concept

The proposed architecture is a multi-tier platform that supports the full lifecycle of AI solutions – from data collection and processing to model inference and integration with application systems.

Note: The specific technology stack may vary depending on project requirements. The key invariant is the architectural layers, not the specific frameworks, which may be changed or replaced with proprietary solutions. The emphasis is on the use of open-source technologies.

Diagram 1: Architectural layers and technology stack of the GovTech AI-Native platform

Architecture Layer Descriptions

L1. Data Ingestion Layer

This layer addresses FR-1: continuous collection of heterogeneous data from multiple government and external systems under conditions of limited bandwidth and unstable connectivity.

| Ingestion Mode | Technology | When Applied |
|---|---|---|
| Batch | Apache Spark + Airflow | Scheduled exports from SIS, SIR. File arrays. Historical loads. |
| Streaming | Kafka Connect | Event streams: transactions, logs, IoT sensor data in real time. |
| CDC (Change Data Capture) | Debezium → Kafka | Incremental synchronization with source relational DBs without adding load to them (WAL reading). |
| Files | SFTP / NFS → Spark | File exchanges with third-party systems. |
| Manual | Web forms → API Gateway → Kafka | Manual data entry by operators, document uploads. |

For sources with unstable connectivity, the following mechanisms are in place: a guaranteed-delivery queue (Kafka with acknowledgement), request retries (retry with exponential backoff), and a source-side buffer (lightweight agent based on Filebeat). When the receiver is temporarily unavailable, data accumulates in the queue and is loaded upon recovery.
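
As a sketch of the guaranteed-delivery mechanics described above, the following producer (using the confluent-kafka client; broker address and topic name are assumptions) requests acknowledgement from all replicas and retries idempotently:

```python
# Minimal sketch of guaranteed delivery from an unstable source (assumed topic/broker names).
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-1:9092",  # assumption: local broker address
    "acks": "all",                        # wait for all in-sync replicas
    "enable.idempotence": True,           # no duplicates on retry
    "retries": 10,
    "retry.backoff.ms": 500,              # backoff between retries
})

def on_delivery(err, msg):
    if err is not None:
        # On permanent failure the record stays in the source-side buffer (e.g. Filebeat spool).
        print(f"delivery failed: {err}")

producer.produce("ingest.documents", value=b'{"doc_id": "..."}', callback=on_delivery)
producer.flush()  # block until the local queue is drained or errors are reported
```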

L2. Data Storage Layer

This layer addresses FR-2: consolidation of all data types in a single managed storage.

| Component | Technology | Purpose |
|---|---|---|
| Object Storage | SeaweedFS (S3-compatible) | Unstructured data: documents, PDF scans, audio, video, ML model weights. Scalable storage for the Bronze layer. |
| Data Lakehouse | Apache Iceberg + SeaweedFS | Primary analytical storage. ACID transactions, time-travel (return to any data version), hidden partitioning. Contains Bronze, Silver, and Gold layer tables. |
| RDBMS | PostgreSQL | Structured Gold-level data, application and low-code solution metadata. |
| Vector Database | Qdrant | Vector embeddings of documents and texts for semantic search (RAG). Metadata filtering. |
| Message Queue | Apache Kafka | Event bus covering all inter-component interactions, CDC streams, and notifications. Provides loose coupling and replay capability. |

SeaweedFS stores data, Iceberg manages table metadata, Spark and Trino perform computations. This allows storage volume and processing capacity to be scaled independently.
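
A minimal sketch of this separation, assuming a file-based Iceberg catalog named `lake` and a SeaweedFS S3 endpoint; all names and paths are illustrative only:

```python
# Sketch: Spark writes an Iceberg table to S3-compatible storage (SeaweedFS).
# Assumes the Iceberg Spark runtime jar is on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")                 # file-based catalog for brevity
    .config("spark.sql.catalog.lake.warehouse", "s3a://lakehouse/")  # warehouse on SeaweedFS
    .config("spark.hadoop.fs.s3a.endpoint", "http://seaweedfs:8333") # assumed S3 endpoint
    .getOrCreate()
)

# Bronze table: raw documents plus technical metadata; ACID and time-travel come from Iceberg.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.bronze")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.bronze.documents (
        doc_id STRING, source STRING, load_ts TIMESTAMP, payload BINARY
    ) USING iceberg
""")
```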

L3. Data Processing Layer

This layer addresses FR-3 and FR-4: multi-tier data transformation pipelines with sovereign AI integration.

| Tier | Actions | Technologies |
|---|---|---|
| Bronze | Data is saved as-is; an exact copy of the source is made. Minimal processing: adding technical metadata (load date, source, batch ID). Schema-on-read approach. | Spark, Flink, Iceberg |
| Silver | Data cleansing: deduplication, format standardization (dates, addresses, names), type casting. Quality validation. Text extraction from documents. AI is used for OCR, named entity recognition (NER), document type classification, and data normalization. Data quality is ensured via the Great Expectations framework. | Spark, Flink, Airflow, LLM/VLM |
| Gold | Aggregation, business metric calculation, building analytical data marts and OLAP cubes. Data enrichment for decision-making. Sovereign AI and proprietary ML solutions are applied for citizen appeal sentiment analysis, risk scoring, anomaly detection, predictive models, and embedding generation for RAG. | Spark, Trino, MLflow |

Diagram 2: Medallion Architecture pipeline

Apache Airflow is a stateless orchestrator that manages scheduling and dependencies of batch pipelines (DAGs). Temporal is used for long-running processes with state persistence and the ability to resume from the last successful step. Both orchestrators provide retry logic and failure notifications. In general, Airflow is better suited for orchestrating Bronze→Silver→Gold transformations, while Temporal is better for reliable event-driven Bronze-layer data consumption and critical downstream notifications.
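
A minimal Airflow sketch of the daily Bronze→Silver→Gold chain; DAG and task names are illustrative, and the task bodies stand in for Spark job submissions:

```python
# Sketch of the daily medallion DAG (task bodies are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def promote(tier: str) -> None:
    # Placeholder: in practice each step submits a Spark job against Iceberg tables.
    print(f"promoting data to {tier}")

with DAG(
    dag_id="medallion_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="bronze", python_callable=promote, op_args=["bronze"])
    silver = PythonOperator(task_id="silver", python_callable=promote, op_args=["silver"])
    gold = PythonOperator(task_id="gold", python_callable=promote, op_args=["gold"])
    bronze >> silver >> gold  # dependency chain; NFR-3 bounds the whole run at 4 hours
```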

Resilience under resource constraints (NFR-8) is achieved through graceful degradation. If GPU nodes are temporarily unavailable, AI pipeline stages enter graceful degradation mode: documents are saved in Bronze and Silver layers, AI enrichment is deferred and executed upon GPU recovery. Analytics on the Gold layer continues to operate on previously prepared data.

Diagram 3: OCR pipeline structure

OCR process sequence:

  1. Document scans for digitization are placed into the Data Lakehouse.

  2. The OCR Producer service periodically retrieves file metadata from S3 and sends it to Kafka.

  3. The OCR Workflow service, acting as a process orchestrator, consumes messages from Kafka and initiates digitization.

  4. In the Preprocessing stage, the scan is split into pages and each page is standardized.

  5. In the Classification stage, the document type is determined using a predefined classifier.

  6. In the Extraction stage, document content is extracted using the most suitable model for that document type.

  7. In the Validation stage, the result is verified via Pydantic schemas, checksums, LLM, and Guardrails AI (a schema sketch follows this list).

  8. In the Fixing Errors stage, a human corrects mistakes by annotating data in a labeling client such as Label Studio.

  9. In the Save Data stage, the result is saved to the database and vector store.

  10. In the Transform stage, the text is transformed according to the project's business logic.
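
A minimal sketch of the Validation stage from step 7, assuming Pydantic v2; the document fields and rules are hypothetical:

```python
# Sketch of the Validation stage (step 7): field names and rules are hypothetical.
from datetime import date
from pydantic import BaseModel, Field, ValidationError

class CitizenAppeal(BaseModel):
    appeal_id: str = Field(pattern=r"^AP-\d{8}$")  # assumed ID format
    region: str
    submitted_on: date
    text: str = Field(min_length=10)

def validate_extraction(raw: dict) -> CitizenAppeal | None:
    try:
        return CitizenAppeal.model_validate(raw)
    except ValidationError:
        # Invalid extractions are routed to the Fixing Errors stage (Label Studio queue).
        return None
```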

L4. AI & ML Layer

This layer addresses FR-4: target data processing logic via sovereign AI and proprietary ML models, as well as FR-6: the full MLOps lifecycle.

| Subsystem | Components | Purpose |
|---|---|---|
| LLM Inference | vLLM (production), llama.cpp (edge / fallback), NeMo Guardrails (generation safety) | LLM inference, including national models. Response generation, summarization, classification. Guardrails control output safety. |
| RAG Pipeline | LangChain / LangGraph, Embedding Model, Reranker, Semantic Cache (Redis / Valkey), Langfuse | Semantic document search: query → embedding → vector search (Qdrant) → reranking → context assembly → LLM → response. Semantic Cache reduces GPU load by 30–40% for repeated queries. |
| ML Training & Serving | MLflow (Tracking, Registry, Serving), Kubeflow (distributed training), Feast (Feature Store) | Experiment versioning, model registry, LoRA fine-tuning orchestration on national data, ML feature management. |
| ML Observability | Evidently AI | Data drift and model drift monitoring, automatic quality report generation, retraining triggers. |

Diagram 4: RAG pipeline structure

RAG pipeline step-by-step (a code sketch follows the list):

  1. The user submits a natural language query via UI or API.

  2. The AI Gateway authenticates the request, applies rate limiting, and routes it to the RAG service.

  3. A check is performed to see whether a semantically similar query already exists in the cache: the query is embedded, and a nearest-neighbor search is run against cached queries.

  4. If a cache hit occurs, the response is returned directly to the client (3a → 3b); otherwise, control is passed to the RAG pipeline.

  5. The user query is converted to an embedding, using the same model that was used to index the documents.

  6. A vector search is performed to identify the Top-K nearest documents by cosine similarity. Optional metadata filtering is applied.

  7. The Top-K candidates are passed through a cross-encoder model that more accurately evaluates the relevance of each query-document pair. The Top-N best are selected.

  8. A prompt is assembled: system prompt + context from Top-N documents + user query. Prompt templates from the Prompt Registry are applied.

  9. The assembled prompt is sent to the LLM. vLLM performs inference with streaming (token-by-token).

  10. The LLM response passes through safety filters: toxicity check, hallucination detection, policy compliance. If a violation is detected, the response is blocked or modified.

  11. The response is saved to cache with a defined TTL.

  12. The final response is returned to the user.
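
Condensing steps 5–8 into code, a hedged sketch using qdrant-client and sentence-transformers; the embedding model, collection name, and payload fields are assumptions, and the cross-encoder reranking of step 7 is omitted for brevity:

```python
# Condensed sketch of steps 5-8: embed -> vector search -> prompt assembly.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-large")  # assumption: same model as indexing
qdrant = QdrantClient(url="http://qdrant:6333")

def build_prompt(question: str, top_k: int = 20, top_n: int = 5) -> str:
    vector = embedder.encode(question).tolist()                   # step 5
    hits = qdrant.search(collection_name="documents",
                         query_vector=vector, limit=top_k)        # step 6
    # Step 7 (cross-encoder reranking) omitted; we simply take the best top_n hits.
    context = "\n\n".join(h.payload["text"] for h in hits[:top_n])  # assumes "text" in payload
    return f"Answer strictly from the context.\n\nContext:\n{context}\n\nQuestion: {question}"  # step 8
```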

It should be noted that NeMo Guardrails and Guardrails AI can perform validation not only after the main LLM request, but at all stages of the request lifecycle; a configuration sketch follows the stage-by-stage breakdown below.

Stage 1. User Request:

  • NeMo – Input Rails: jailbreak attempt detection; toxic and offensive content filtering; off-topic request blocking; personal data detection and masking in the query.

  • Guardrails AI – Input Validation: prompt injection checks; input length and format validation; input schema compliance verification; personal data detection in user query.

Stages 5–7. Vector Search over Knowledge Base:

  • NeMo – Retrieval Rails: filtering of irrelevant chunks from results; source credibility and admissibility verification; context restriction to permitted documents only; blocking transmission of chunks with prohibited content to LLM.

  • NeMo – Dialog Rails: dialog flow management via Colang scripts; deterministic responses to specific phrases; fallback scenario handling (what to do when LLM doesn't know); topic control throughout the session.

Stage 8. Prompt Construction for LLM:

  • NeMo – Execution Rails: control of permitted tool calls; whitelist of allowed external APIs; blocking of unauthorized agent actions; logging of all tool calls for audit.

Stages 9–10. Response Generation:

  • NeMo – Output Rails: hallucination check; personal data removal from final response; toxic content filtering; off-topic response blocking.

  • Guardrails AI – Output Validation: parsing LLM response into structured format; response schema validation; toxicity, relevance, and factual accuracy checks; automatic query retry upon validation failure.
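
A minimal sketch of wiring these rails in, assuming a NeMo Guardrails configuration directory with config.yml and Colang files (contents not shown):

```python
# Sketch: wrapping the LLM call with NeMo Guardrails (config directory contents are assumed).
from nemoguardrails import LLMRails, RailsConfig

# The directory holds config.yml plus Colang files defining input/dialog/output rails.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "Summarize the appeals statistics for the last quarter."}
])
print(response["content"])  # output rails have already filtered PII and toxicity here
```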

Diagram 5: Model Training pipeline

Model Training pipeline step-by-step (an MLflow logging sketch follows the list):

  1. Create experiment. Name, hyperparameters, and dataset reference are defined.

  2. Launch training pipeline as a DAG in Kubernetes.

  3. Request training features. Feast ensures point-in-time correct data sampling, preventing data leakage.

  4. Read historical features from the offline store (Iceberg) and fresh features from the online store (PostgreSQL or Redis).

  5. Prepare data by forming train/validation/test splits, augmentation, normalization, and class balancing.

  6. Train the model. For LLMs – LoRA fine-tuning of the base model. For ML – training via XGBoost / CatBoost / scikit-learn.

  7. Log metrics at each epoch (loss, accuracy, F1), hyperparameters, and artifacts (checkpoints, charts).

  8. Evaluate the model on the test set. Calculate target metrics. For LLMs – MMLU benchmarks and user dataset evaluations.

  9. If metrics exceed the threshold, the model is registered in the registry (version, metadata, metrics) with Staging status.

  10. Promote the model from Staging to Production after review and/or A/B testing.

  11. Deploy the new version: LLM to vLLM (LoRA adapter update), ML to MLflow Serving or Triton. Rolling update with zero downtime.

  12. Continuous monitoring by comparing the distribution of input data and predictions against a reference dataset.

  13. Upon detection of data drift or model drift, an automatic retraining process is triggered. Data drift refers to changes in the statistical characteristics of input data over time, which can negatively impact model performance. Model drift refers to changes in model behavior or quality over time.
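
A condensed MLflow sketch of steps 7–9 (logging and threshold-gated registration); the experiment name, metrics, and threshold are illustrative:

```python
# Sketch of steps 7-9: metric logging and conditional model registration.
import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")  # assumption: on-prem tracking server
mlflow.set_experiment("appeal-classifier")

with mlflow.start_run() as run:
    mlflow.log_params({"lr": 3e-4, "lora_rank": 16})
    for epoch, (loss, f1) in enumerate([(0.61, 0.78), (0.44, 0.85)]):  # stand-in training loop
        mlflow.log_metrics({"loss": loss, "f1": f1}, step=epoch)       # step 7

    if f1 > 0.8:  # step 9: register only above the quality threshold
        # Assumes the trained model was logged to this run under the "model" artifact path.
        mlflow.register_model(f"runs:/{run.info.run_id}/model", "appeal-classifier")
```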

L5. Presentation Layer

This layer addresses FR-5: analytics, OLAP, and leadership dashboards.

| Component | Technology | Purpose |
|---|---|---|
| Web Client with AI Assistant | React / Angular, Chat UI | Interface for dialogue with the corporate LLM: data queries, SQL generation in natural language, report summarization, etc. |
| AI / API Gateway | Kong AI Gateway | Single entry point for external consumers: authentication, authorization, rate limiting, routing. OpenAPI 3.0. Centralized limit management and token quota distribution, protection against cascading failures. |
| BI & Dashboards | Apache Superset | Interactive dashboards for leadership: KPIs, trends, drill-down. Connections to Trino (Gold layer) and PostgreSQL. Auto-refresh scheduling. Automatic PDF/Excel report generation and distribution on schedule. |
| OLAP | Trino | Federated SQL queries to the Gold layer (Iceberg), PostgreSQL, S3. Multidimensional analysis, pivot tables. |

The AI assistant allows leaders to formulate analytical queries in natural language: "Show the trend of citizen appeals by region for the last quarter." The LLM translates the query into SQL, Trino executes it on the Gold layer, and Superset visualizes the result.
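
A hedged sketch of this text-to-SQL flow, assuming vLLM's OpenAI-compatible endpoint and the trino Python client; host names, schema, and the system prompt are illustrative:

```python
# Sketch: LLM generates SQL, Trino executes it on the Gold layer (names are assumptions).
import trino
from openai import OpenAI

llm = OpenAI(base_url="http://vllm:8000/v1", api_key="unused")

def ask(question: str):
    sql = llm.chat.completions.create(
        model="sovereign-llm",
        messages=[
            {"role": "system",
             "content": "Translate the question into ANSI SQL over gold.appeals(region, created_at)."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    conn = trino.dbapi.connect(host="trino", port=8080, user="analyst",
                               catalog="lake", schema="gold")
    cur = conn.cursor()
    cur.execute(sql)  # in production the generated SQL must be validated and sandboxed first
    return cur.fetchall()
```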

L6. Security, Governance & Observability

This cross-cutting layer ensures compliance with non-functional requirements NFR-1, NFR-7, NFR-9.

| Domain | Components | Purpose |
|---|---|---|
| Identity & Access | Keycloak (IAM), RBAC/ABAC | Unified authentication (SSO), row- and column-level authorization, LDAP integration. |
| Secrets & Encryption | HashiCorp Vault | Key, certificate, and secret management. Key rotation. At-rest encryption. |
| Network Security | Calico / Cilium | Kubernetes network policies: micro-segmentation, zero-trust between pods. |
| Data Catalog & Lineage | DataHub | Cataloging of all data, lineage tracking from source through Bronze → Silver → Gold to dashboard. Metadata search. |
| Data Quality | Great Expectations | Automated validation at the Bronze/Silver boundary: completeness, format, ranges, business rules. Alerting on violations. |
| Observability | ELK / Loki (logs), Prometheus + Grafana (metrics), OpenTelemetry + Jaeger (traces) | Logs, metrics, traces. Unified console. SLA dashboard with availability, latency, and error metrics. |
| CI/CD & GitOps | ArgoCD (GitOps), Harbor (Container Registry), Ansible / Terraform (IaC) | Declarative configuration management. Local image registry for air-gapped environment. Reproducible deployments. |

Key Architectural Decisions

Event-driven architecture on Kafka. All inter-component interactions – including data ingestion, index updates, and notifications – are implemented through a message queue. This ensures loose coupling and a complete audit trail of all operations.

Storage-Compute Separation. The Data Lakehouse based on Apache Iceberg and S3-compatible storage allows storage and processing to be scaled independently. This is critical as the data lake grows to millions of documents per month, where analytical compute capacity must not compete with inference capacity.

Medallion Architecture. The quality tier separation ensures reproducibility (Bronze – source copy), manageability (Silver – cleansed data), and readiness for use (Gold – analytics). AI enrichment is embedded at the transitions between tiers.

Graceful Degradation. When GPUs are unavailable, AI pipeline stages are deferred, but core data processing and analytics continue to operate. The system automatically resumes AI enrichment upon resource recovery.

Caching. Session caching, caching of intermediate LLM results, and rate limiting at the gateway. Caching of semantically similar LLM queries reduces GPU load by 30–40% and decreases latency for common questions.
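
A minimal in-memory sketch of semantic caching; the similarity threshold and embedding model are assumptions, and a production deployment would use Redis/Valkey with a vector index rather than a Python list:

```python
# Minimal semantic-cache sketch (threshold and model are assumptions).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-large")
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached LLM response)

def cached_answer(query: str, threshold: float = 0.92) -> str | None:
    v = embedder.encode(query, normalize_embeddings=True)
    for emb, response in cache:
        if float(np.dot(v, emb)) >= threshold:  # cosine similarity of normalized vectors
            return response                     # cache hit: skip the GPU entirely
    return None

def remember(query: str, response: str) -> None:
    cache.append((embedder.encode(query, normalize_embeddings=True), response))
```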

Air-gapped CI/CD. ArgoCD + Harbor + local Nexus provide a complete deployment cycle without internet access. Updates are delivered via Data Diode and undergo security verification.

Data Lineage (DataHub). End-to-end tracking of data origin from source through all Medallion Architecture layers to a specific dashboard. Required for audit and regulatory compliance.

Architecture Decision Records

ADR-001: Apache Kafka as the Unified Event Bus

Status: Accepted

Context
The system integrates dozens of heterogeneous data sources. Components must be loosely coupled. Environment: air-gapped, on-premise.

| Option | Pros | Cons |
|---|---|---|
| Apache Kafka | Event replay, high throughput, CDC integration via Debezium, mature ecosystem | Operational complexity |
| RabbitMQ | Simplicity, low latency | No replay, weak CDC support |

Decision: Kafka

Rationale
Event replay capability is critical for the Bronze layer: if AI enrichment fails, data is not lost and the pipeline restarts from the correct point. Debezium is built on Kafka Connect, so CDC integrates natively with Kafka.

Consequences

➕ Loose coupling of all components, full audit trail of operations

➕ CDC without load on source databases

➖ Requires a dedicated cluster and operational expertise

ADR-002: Apache Iceberg as the Data Lakehouse Format

Status: Accepted

Context
A storage format is needed for Medallion Architecture with ACID support, time-travel, and independent scaling of storage and compute. On-premise, S3-compatible storage.

| Option | Pros | Cons |
|---|---|---|
| Apache Iceberg | ACID, time-travel, hidden partitioning, support for Spark, Trino, and Flink | Younger format, more complex than Delta Lake in some scenarios |
| Delta Lake | Mature, excellent Spark integration | Tied to Databricks ecosystem, weaker with Trino |
| Apache Hudi | Good for CDC upserts | More complex to operate, smaller community |
| Parquet + Hive | Simplicity | No ACID, no time-travel |

Decision: Apache Iceberg

Rationale
Iceberg is an engine-agnostic standard: it is simultaneously read by Spark (processing), Trino (analytics), and Flink (streaming) without data copying. Delta Lake creates a dependency on Databricks, which is unacceptable for a sovereign environment.

Consequences

➕ Storage-Compute Separation: storage and compute scale independently

➕ Time-travel for audit and reproducibility

➖ Requires a metadata catalog (Hive Metastore or Nessie)

ADR-003: SeaweedFS as Object Storage

Status: Accepted

Context
An S3-compatible, open-source object storage is needed for on-premise deployment. It stores documents, PDF scans, ML model weights, and Bronze-layer data. Volume: tens of millions of files.

| Option | Pros | Cons |
|---|---|---|
| SeaweedFS | Efficient small-file storage, built-in Filer, low metadata overhead, S3-compatible | Smaller community, less documentation |
| MinIO | De-facto S3 on-premise standard, large community, excellent documentation | High overhead with millions of small files; developer discontinued Community Edition development |
| Ceph | Versatility (block + object + file) | High operational complexity |

Decision: SeaweedFS

Rationale
Fully open-source. SeaweedFS is architecturally optimized for storing large numbers of small files.

Consequences

➕ Efficiency when storing millions of documents

➖ Smaller community and ecosystem compared to MinIO

ADR-004: vLLM + llama.cpp as Two-Tier LLM Inference

Status: Accepted

Context
LLM inference is needed in an air-gapped environment. GPU resources are limited and expanded in increments. Graceful degradation is required upon GPU node loss.

| Option | Pros | Cons |
|---|---|---|
| vLLM (primary) | PagedAttention, continuous batching, high throughput, streaming | Requires GPU, high memory consumption |
| llama.cpp (fallback) | CPU inference, minimal requirements, quantization | Low speed on large models |
| Triton Inference Server | Versatility, multi-model | Configuration complexity, overkill for LLM |
| Ollama | Simplicity | Not production-ready for high load |

Decision: vLLM as primary + llama.cpp as edge/fallback

Rationale
The two-tier approach addresses NFR-8 (resilience under resource constraints): when GPU is unavailable, the system does not fail but degrades to CPU inference at reduced speed. vLLM provides production throughput via PagedAttention and continuous batching.
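
Because both vLLM and llama.cpp's server expose OpenAI-compatible APIs, failover can be sketched as a change of base URL; endpoints and the model name are assumptions:

```python
# Sketch of two-tier failover across OpenAI-compatible endpoints (names are assumptions).
from openai import OpenAI, APIConnectionError

PRIMARY = OpenAI(base_url="http://vllm:8000/v1", api_key="unused")       # GPU tier
FALLBACK = OpenAI(base_url="http://llamacpp:8080/v1", api_key="unused")  # CPU tier

def complete(prompt: str) -> str:
    for client in (PRIMARY, FALLBACK):
        try:
            resp = client.chat.completions.create(
                model="sovereign-llm",
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        except APIConnectionError:
            continue  # GPU tier down: degrade to CPU inference (NFR-8)
    raise RuntimeError("all inference tiers unavailable")
```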

Consequences

➕ Graceful degradation: if GPU is unavailable, llama.cpp continues operation

➕ LoRA adapters are updated in vLLM without restart

➖ Two different runtimes require maintaining two configurations

ADR-005: Langfuse for RAG Observability

Status: Accepted

Context
A tool is needed for tracing and quality evaluation of the RAG pipeline: call chain logging, relevance assessment, prompt A/B testing. Environment: air-gapped.

| Option | Pros | Cons |
|---|---|---|
| Langfuse | Open-source, self-hosted, full tracing functionality | Fewer integrations than LangSmith |
| LangSmith | Deep LangChain integration, rich UI | Cloud service, incompatible with air-gapped |

Decision: Langfuse

Rationale
LangSmith is a cloud service, fundamentally incompatible with an air-gapped environment and data sovereignty requirements. Langfuse is deployed on-premise, has open-source code, and covers all required scenarios: tracing, evaluation, and Prompt Registry.

Consequences

➕ Full control over tracing data

➕ Prompt Registry for prompt version management

➖ Requires self-deployment and maintenance

ADR-006: Airflow and Temporal for Orchestration

Status: Accepted

Context
The system contains two classes of processes: regular batch data transformations (Bronze→Silver→Gold) and long-running event-driven processes (OCR, downstream notifications). A single orchestrator handles both scenarios poorly.

| Option | Pros | Cons |
|---|---|---|
| Airflow only | Single tool, mature, large community | Stateless: no resume-from-failure for long-running processes |
| Temporal only | Stateful, durable execution, retry from any point | Overkill for simple DAG scheduling |
| Airflow + Temporal | Each tool optimized for its task | Two tools in the stack |
| Prefect | Modern, simpler than Airflow | Smaller community, weaker for stateful workflows |

Decision: Airflow for batch pipelines + Temporal for event-driven processes

Rationale
The OCR pipeline can run for hours and must resume from the point of failure without losing progress – this is Temporal's domain. Bronze→Silver→Gold transformations are classic scheduled DAGs – Airflow's domain. Separation of concerns eliminates compromises.
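
A minimal Temporal sketch of the durable OCR workflow; stage and activity names are illustrative, and activity bodies are omitted:

```python
# Sketch of a durable OCR workflow in Temporal (names are assumptions).
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def run_ocr_stage(stage: str, doc_id: str) -> None:
    ...  # calls the actual Preprocessing / Classification / Extraction services

@workflow.defn
class OcrWorkflow:
    @workflow.run
    async def run(self, doc_id: str) -> None:
        # Each stage is retried independently; completed stages are never re-executed
        # after a crash, because Temporal persists workflow state.
        for stage in ("preprocess", "classify", "extract", "validate"):
            await workflow.execute_activity(
                run_ocr_stage, args=[stage, doc_id],
                start_to_close_timeout=timedelta(minutes=30),
                retry_policy=RetryPolicy(maximum_attempts=5),
            )
```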

Consequences

➕ Each orchestrator is optimal for its class of tasks

➕ Reliability for long-running processes (OCR, notifications)

➖ The team must be proficient in both tools

ADR-007: Qdrant as the Vector Database

Status: Accepted

Context
The RAG pipeline requires storage and search of document vector embeddings. Volume: tens of millions of chunks. On-premise, air-gapped. Metadata filtering is required.

| Option | Pros | Cons |
|---|---|---|
| Qdrant | Rust implementation (performance), rich payload filtering, on-premise, active development | Smaller ecosystem than Weaviate |
| pgvector | PostgreSQL already in stack, simplicity | Degrades at large volumes (>10M vectors) |
| Weaviate | Rich functionality, GraphQL API | Go implementation, higher memory consumption |
| Milvus | High performance | Deployment complexity, etcd dependency |
| ElasticSearch (kNN) | Already in stack | Not optimized for vector search |

Decision: Qdrant

Rationale
At 10M+ documents per month, pgvector loses performance. Qdrant is written in Rust, demonstrates the best latency/throughput ratio in benchmarks, supports metadata filtering (critical for document-level RBAC), and deploys as a simple single binary.
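
A sketch of the document-level RBAC filtering mentioned above, using qdrant-client; the payload field and collection name are assumptions:

```python
# Sketch of document-level access control via Qdrant payload filtering (field names assumed).
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny

qdrant = QdrantClient(url="http://qdrant:6333")

def search_for_user(vector: list[float], allowed_departments: list[str]):
    # Only chunks whose payload department is in the user's allowed list are returned,
    # enforcing RBAC at retrieval time rather than after the LLM has seen the context.
    return qdrant.search(
        collection_name="documents",
        query_vector=vector,
        query_filter=Filter(must=[
            FieldCondition(key="department", match=MatchAny(any=allowed_departments))
        ]),
        limit=20,
    )
```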

Consequences

➕ High performance at large volumes

➕ Metadata filtering for access control

➖ Additional component in the stack (PostgreSQL cannot be reused)

ADR-008: Medallion Architecture

Status: Accepted

Context
Data arrives from dozens of heterogeneous sources of varying quality. A storage strategy is needed that ensures reproducibility, quality manageability, and analytics readiness. AI enrichment may be unavailable (no GPU).

| Option | Pros | Cons |
|---|---|---|
| Medallion (Bronze/Silver/Gold) | Reproducibility, quality isolation, AI graceful degradation | Data stored in multiple copies |
| Data Vault | Auditability, flexibility | High modeling complexity |
| Kimball DWH | Mature approach, familiar to analysts | Rigid schema, poor fit for unstructured data |
| Single storage | Simplicity | No quality isolation, difficult to roll back |

Decision: Medallion Architecture

Rationale
Bronze as an immutable source copy is insurance against transformation and AI enrichment errors. When GPU is unavailable, documents accumulate in Bronze/Silver, and AI enrichment is executed upon resource recovery. The Gold layer is always available for analytics on previously prepared data.

Consequences

➕ Reproducibility: any stage can be restarted

➕ Graceful degradation: analytics does not depend on AI availability

➖ Storing data in three copies increases disk space requirements

ADR-009: NeMo Guardrails + Guardrails AI as Two-Tier LLM Protection

Status: Accepted

Context
The government system handles citizens' personal data. The LLM may generate toxic content, hallucinations, or personal data leaks. Protection is needed at all stages of the request lifecycle.

| Option | Pros | Cons |
|---|---|---|
| NeMo Guardrails | Colang scripts for dialog flow; input/output/retrieval/execution rails; from NVIDIA | Ties to NVIDIA ecosystem |
| Guardrails AI | Structured output validation, Pydantic schemas, retry on failure | Does not cover dialog flow |
| Prompt engineering only | Simplicity | Unreliable, easily bypassed |
| Custom filters | Full control | High development and maintenance cost |

Decision: NeMo Guardrails (dialog/flow protection) + Guardrails AI (structural validation)

Rationale
The tools cover different aspects: NeMo manages dialog logic and blocks undesirable scenarios at the flow level; Guardrails AI validates the structure and content of output against schemas. Together they form a layered defense critical for a government system handling personal data.

Consequences

➕ Protection at all stages: input → retrieval → prompt → output

➕ Full auditability of all blocking decisions

➖ Latency overhead at each stage (~100–200ms total)

➖ Requires maintaining Colang scripts when business logic changes