laputski

Advanced RAG Trade-offs

Alexander Laputski — Fri, 27 Mar 2026 22:22:27 GMT

Abstract

Classic RAG addresses contextual search in small, homogeneous knowledge bases. Advanced RAG extends this model to production scale: it adds hybrid search, multi-layer filtering, prompt management, and protective mechanisms. GitHub example.

A special role belongs to Sovereign AI – deploying a language model entirely within an organization's own infrastructure, including air-gapped environments with no internet access. This approach is mandatory wherever data cannot leave the security perimeter.

Solution Architecture

Gateway & Auth

The system authenticates and authorizes every request, enforces rate limiting per user and role, validates input data, and protects against prompt injection attacks.

Query Processing

Before retrieval, the user's query goes through three optional transformations. Rewriting reformulates the query to better match the terminology of the knowledge base. This is especially important when users phrase questions in conversational language. Decomposition breaks a complex, multi-part question into several sub-queries, each processed independently with results merged afterward. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query and uses its embedding for retrieval – this improves recall in highly specialized corpora where the query and the document are phrased very differently.

Embedding Model

Converts query text into a vector representation for dense retrieval. The model is deployed locally (Nomic, BGE, or equivalent). In air-gapped environments, model weights are pre-loaded before network isolation. Embedding quality directly determines dense retrieval quality, and switching models requires a full re-indexing of the corpus because vectors produced by different models are not comparable.

Hybrid Retrieval

Search runs in parallel across two indexes. Dense retrieval searches vector embeddings via an HNSW index in the vector database and captures semantic similarity well. Sparse retrieval via BM25 searches by keywords: it is precise on exact terms, abbreviations, and numbers, where semantic search often returns irrelevant results.

RRF Fusion

Reciprocal Rank Fusion merges the results of dense and sparse retrieval into a single ranked list. Each document receives a final score based on its positions in both lists, without the need to normalize heterogeneous scores. This enables correct merging of results from two fundamentally different retrieval methods.
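
The rank-based scoring described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Raw retrieval scores are never compared.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([dense, sparse])
print(fused)
```

Documents that appear near the top of both lists (here `doc_a` and `doc_c`) rise above documents found by only one retriever, which is the behavior the fusion step relies on.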

Reranker

A cross-encoder reranker jointly re-scores each query–chunk pair (unlike bi-encoder embeddings, which encode the query and document independently). This yields significantly more accurate ranking at the cost of additional compute. It is applied to the Top-K results after fusion, not to the entire corpus.

Guardrails

Before returning a response, the system checks it for hallucinations, toxicity, and policy compliance. If a check fails, the request is sent for re-generation.

Semantic Cache

Similar queries receive a response from cache without invoking the LLM. TTL policies manage data freshness and prevent stale responses from being served.
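
A semantic cache can be approximated as a nearest-neighbor lookup over query embeddings with an expiry time per entry. The sketch below assumes an in-memory list, cosine similarity, and a similarity threshold of 0.92; the class name, threshold, and TTL values are illustrative, and a production deployment would typically use a dedicated store.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.92, ttl_seconds=3600, clock=time.monotonic):
        self.threshold = threshold      # minimum similarity that counts as a hit
        self.ttl = ttl_seconds          # lifetime of an entry
        self.clock = clock              # injectable for testing
        self.entries = []               # (embedding, response, expires_at)

    def get(self, embedding):
        now = self.clock()
        # Drop expired entries first so stale responses are never served.
        self.entries = [e for e in self.entries if e[2] > now]
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, response):
        self.entries.append((embedding, response, self.clock() + self.ttl))
```

The threshold is the main tuning knob: too low and users receive answers to a different question, too high and the hit rate collapses.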

Prompt Registry

A centralized store of versioned prompt templates. It enables A/B testing, rollbacks, and separation of templates by task type, all without changing service code.
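
To make the idea of versioned templates concrete, here is a minimal in-memory sketch. The class and method names (`PromptRegistry`, `publish`, `rollback`, `render`) are hypothetical and not taken from any specific product; a real registry would persist versions in an external store.

```python
class PromptRegistry:
    def __init__(self):
        self._versions = {}  # template name -> list of template strings
        self._active = {}    # template name -> index of the active version

    def publish(self, name, template):
        """Store a new version and make it active. Returns the 1-based version."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        self._active[name] = len(versions) - 1
        return len(versions)

    def rollback(self, name, version):
        """Re-activate an earlier version without deleting newer ones."""
        if not 1 <= version <= len(self._versions[name]):
            raise ValueError(f"unknown version {version} for {name!r}")
        self._active[name] = version - 1

    def render(self, name, **variables):
        """Fill the active template; the calling service never sees the text."""
        template = self._versions[name][self._active[name]]
        return template.format(**variables)
```

Because services call only `render`, a template change or rollback is a registry operation rather than a code deployment.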

Streaming Response

The response is delivered to the user token by token as it is generated, without waiting for the full output to complete. This reduces perceived latency. An important constraint: guardrails in streaming mode cannot inspect the full response before delivery begins, so full-response checks require either buffering the output or moving some checks to a post-processing layer.
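
One way to apply the buffering option is to hold tokens until a sentence boundary and check each sentence before releasing it. The sketch below assumes that `is_allowed` is some local policy check and that sentences end with `.`, `!`, or `?`; both are simplifications chosen for illustration.

```python
def guarded_stream(tokens, is_allowed, boundary=(".", "!", "?")):
    """Yield text sentence by sentence, checking each one before release."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(boundary):
            sentence = "".join(buffer)
            buffer.clear()
            if not is_allowed(sentence):
                # Stop the stream; text already delivered cannot be recalled.
                yield "[response withheld by policy]"
                return
            yield sentence
    if buffer:
        # Flush whatever remains after the last boundary.
        tail = "".join(buffer)
        yield tail if is_allowed(tail) else "[response withheld by policy]"
```

The trade-off is visible in the code: latency grows from one token to one sentence, and checks that need the whole answer (for example, faithfulness to the retrieved context) still have to run after the stream ends.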

Observability

Self-hosted Langfuse traces every request across all pipeline stages, from gateway to streaming response. RAGAS metrics (faithfulness, answer relevance, context recall) evaluate response quality automatically. The feedback loop collects explicit user ratings and links them to specific traces. In air-gapped environments, the entire observability stack is deployed inside the perimeter.

Sovereign AI & Air-Gapped Deployment

The language model, embedding model, and reranker are deployed on-premise. Container images and model weights are pre-loaded before network isolation. The vector database, cache, Prompt Registry, and observability stack are hosted inside the perimeter. Guardrails run locally with no calls to external moderation APIs.

Trade-offs: Classic vs Advanced RAG

Classic RAG starts quickly and is easy to maintain. It performs well on knowledge bases of up to 10–50K documents with homogeneous content and straightforward queries. Answer accuracy stays around 60–70%, latency is minimal, and the infrastructure requires no specialized expertise.

Advanced RAG addresses problems that Classic RAG cannot solve by design. Dense-only search handles exact terms and abbreviations poorly; sparse retrieval fills that gap. Without a reranker, Top-K results contain irrelevant chunks that degrade generation quality. Without guardrails, hallucinations reach the user. Without semantic cache, every repeated query consumes LLM budget.

The cost of these improvements is measured in latency and operational complexity. The full stack adds 300–700 ms to the base LLM call. The reranker contributes 100–300 ms, guardrails another 50–150 ms, and both require threshold tuning. With proper configuration of semantic cache and async reranking, P95 latency stays within 2–3 seconds, and the number of LLM calls drops by 60–70%. Hybrid retrieval improves accuracy by 15–25% over dense-only, but requires maintaining two synchronized indexes.

Sovereign AI adds operational overhead for managing a GPU cluster and manually updating model weights. Air-gapped deployment requires pre-loading all artifacts before network isolation and eliminates any external dependencies including moderation APIs and cloud-based observability. PII data in traces requires explicit masking when configuring self-hosted Langfuse. Air-gapped environments are particularly resource-sensitive: every component competes for the same CPUs and GPUs inside the perimeter.

Implementation recommendation. Start with Classic RAG. Run it in a dev environment, load real data, and identify the bottlenecks – where accuracy falls short, where latency is unacceptable, where users receive hallucinations. Only once a specific problem is confirmed should you add the corresponding Advanced RAG component. This is especially relevant for air-gapped environments, where resources are constrained and every additional service requires justification.

Integration with Existing Systems

Both Classic and Advanced RAG are deployed as an isolated module behind a dedicated AI Gateway. The rest of the system calls a single API and has no knowledge of which implementation is currently active. This provides several practical advantages.

Switching between implementations is done via blue/green deployment with no changes on the calling service side. Both variants can run simultaneously, with different request types routed to each – for example, Classic RAG for standard FAQ scenarios and Advanced RAG for complex analytical queries. The choice of implementation is made at the AI Gateway level based on request header, user role, or query type.
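
The routing decision can be sketched as a small pure function. The header name `X-RAG-Mode`, the role name `analyst`, and the length heuristic for query type are all assumptions made for this example; real gateways express the same rules in their own routing configuration.

```python
def choose_backend(headers, user_role, query):
    """Return 'classic' or 'advanced' for a single request."""
    # 1. Explicit header override, useful for testing and blue/green switching.
    override = headers.get("X-RAG-Mode")
    if override in ("classic", "advanced"):
        return override
    # 2. Role-based routing (hypothetical role name).
    if user_role == "analyst":
        return "advanced"
    # 3. Crude query-type heuristic: long or multi-part questions.
    if len(query.split()) > 20 or query.count("?") > 1:
        return "advanced"
    return "classic"
```

Keeping the rule in one place is what lets the rest of the system stay unaware of which implementation served the request.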

Both variants are stateless services. State is held in external stores: the vector database, cache, and Prompt Registry. This makes horizontal scaling and switching between implementations a configuration-level operation, not a refactoring effort.

Failure Modes and Degradation

Advanced RAG is more complex than Classic RAG and has more potential failure points. It is important to define upfront how the system behaves when each component degrades.

Semantic Cache unavailable. All requests go directly to the LLM. Latency returns to baseline, and call costs grow proportionally with traffic. The system continues to function but loses the primary economic advantage of Advanced RAG.

Guardrails false positive cascade. Guardrails begin blocking valid responses, so the user receives a rejection where the answer was correct. The system enters a re-generation loop and latency spikes sharply. The fix is threshold tuning and a circuit breaker that disables guardrails when the error rate exceeds a defined threshold.
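
A circuit breaker for this case can track the recent block rate and bypass the check once it exceeds a limit. The sketch below is deliberately simplified: the class name, window size, and thresholds are assumptions, and it has no cool-down or half-open state, which a production breaker would need in order to close again.

```python
from collections import deque


class GuardrailBreaker:
    def __init__(self, window=50, max_block_rate=0.5, min_samples=10):
        self.results = deque(maxlen=window)  # True = response was blocked
        self.max_block_rate = max_block_rate
        self.min_samples = min_samples

    def record(self, blocked):
        self.results.append(bool(blocked))

    @property
    def open(self):
        """True when the recent block rate is implausibly high."""
        if len(self.results) < self.min_samples:
            return False
        return sum(self.results) / len(self.results) > self.max_block_rate


def deliver(response, check, breaker):
    """Run the guardrail unless the breaker is open."""
    if breaker.open:
        # Bypassed responses should be flagged in traces for later review.
        return response, "guardrails_bypassed"
    ok = check(response)
    breaker.record(not ok)
    return (response, "passed") if ok else (None, "blocked")
```

Disabling guardrails trades safety for availability, so whether a bypass is acceptable at all depends on the domain; in regulated settings the breaker might route to a static fallback answer instead.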

Reranker degrades or goes down. Search results are returned without reranking, in the order produced by retrieval. Top-K quality drops, but responses continue to be generated. The reranker is a good candidate for graceful degradation: when unavailable, simply skip the step.
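
Skipping the step can be as simple as catching the failure and falling back to retrieval order. In this sketch `rerank` stands for any callable that reorders candidates; its signature is an assumption.

```python
def rank_with_fallback(query, candidates, rerank, top_n=5):
    """Return (top documents, mode) and never fail because of the reranker."""
    try:
        return rerank(query, candidates)[:top_n], "reranked"
    except Exception:
        # Reranker unavailable: keep the order produced by retrieval.
        return candidates[:top_n], "retrieval_order"
```

Returning the mode alongside the documents makes the degradation visible in traces, so a drop in answer quality can be linked to the reranker outage.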

Dense and sparse index desynchronization. Documents are indexed in one store but not the other. RRF Fusion operates on incomplete data, and search quality degrades silently, with no error visible to the user. The problem is detectable only through monitoring of index sizes and RAGAS metrics.
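
Beyond comparing index sizes, a periodic job can compare the document IDs held by each index. The sketch assumes both indexes can list their IDs; the function name and return shape are illustrative.

```python
def index_drift(dense_ids, sparse_ids):
    """Report documents present in one index but missing from the other."""
    dense_ids, sparse_ids = set(dense_ids), set(sparse_ids)
    return {
        "missing_in_sparse": sorted(dense_ids - sparse_ids),
        "missing_in_dense": sorted(sparse_ids - dense_ids),
    }


report = index_drift(["d1", "d2", "d3"], ["d2", "d3", "d4"])
print(report)
```

An ID comparison catches the case where both indexes hold the same number of documents but not the same documents, which a size check alone would miss.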

General degradation principle. If Advanced RAG is too slow, unstable, or generating cascading errors – switch to Classic RAG. This is precisely why both variants are kept behind a single AI Gateway. I would say Classic RAG is not a last-resort fallback; it is a fully legitimate operating mode when the complex stack is under stress.

Resource Estimation

Providing specific recommendations for CPU, GPU, and memory is not meaningful as consumption depends too heavily on data volume, the choice of embedding model, request frequency, and the vector database in use. Actually, universal numbers do not exist.

Instead, use an iterative approach. Load 1% of your real data into the vector store, then measure latency, resource consumption, and Advanced RAG response quality. Repeat the test at 10%, then 50%, then 100%. At each step, record how system behavior changes. This produces a real scaling curve for your specific data and allows you to forecast production resource requirements before you get there.

Business Cases

🏦 Financial Services & Banking

Banks process sensitive customer data and are subject to regulatory requirements (GDPR, PCI DSS, Basel III). Sovereign AI eliminates data transfer to external LLM APIs. Guardrails enforce compliance policy adherence. Hybrid retrieval accurately locates regulatory documents by exact terms and codes.

🏥 Healthcare & MedTech

Medical data falls into an especially sensitive category under HIPAA and local data protection regulations. Hallucination detection is critical because an incorrect response about dosage or diagnosis is unacceptable. Sparse retrieval accurately locates medical terms and ICD codes where semantic search returns irrelevant results. Air-gapped deployment is mandatory for clinical systems.

⚖️ Legal Tech

Legal systems require precise search across case law and regulatory documents. Prompt Registry stores versioned templates for different types of legal queries. The reranker surfaces relevant precedents above general documents. The feedback loop accumulates attorney ratings and incrementally improves ranking quality.

🏛️ Government & Defense

Government systems require complete isolation from public networks. Air-gapped deployment with pre-loaded model weights is the only acceptable option. Input Rails protect against prompt injection attacks. All components, including observability, operate inside the secured perimeter with no external dependencies.

🏢 Enterprise Knowledge Management

Enterprise knowledge bases contain tens of thousands of documents in varied formats. Semantic Cache reduces costs by 60–70% for recurring employee queries. Metadata enrichment enables filtering by department, date, or document type. Hybrid retrieval performs equally well on technical specifications and unstructured text.

🔬 R&D / Scientific Research

Research organizations work with proprietary scientific data and patents. Query decomposition breaks complex scientific questions into sub-queries and improves recall. HyDE generates hypothetical answers to improve retrieval in highly specialized corpora. Sovereign AI preserves competitive advantage – data never leaves for external providers.

The selection rule in one sentence: Advanced RAG justifies the investment in complexity if at least one condition holds – data is confidential, the knowledge base exceeds 50K documents, or answer accuracy is critical to the business or a regulator.

Reference AI-Native GovTech Architecture

Alexander Laputski — Sat, 21 Mar 2026 15:38:46 GMT

This article describes a reference architecture for a distributed system designed for typical GovTech projects. The key feature of the architecture is the AI-Native approach: sovereign artificial intelligence is embedded as a cross-cutting component at every layer – from data enrichment to analytics and decision-making.

Sovereign AI – the ability of a state to develop, deploy, and control artificial intelligence technologies using its own infrastructure, data, personnel, and business ecosystems that are independent of external platforms.

System Requirements

Functional Requirements

FR-1: Continuous Data Ingestion. The system ingests data from heterogeneous sources (State Information System, State Information Resource, external APIs, file storage, IoT sensors, manual input). Supported modes include batch loading, streaming, and incremental synchronization (Change Data Capture). Sources may have limited bandwidth and unstable connectivity.

FR-2: Unified Data Storage. All collected data is consolidated in a single data lake to leverage the benefits of shared data usage. The lake supports structured, semi-structured, and unstructured formats. The storage layer provides ACID transactions, versioning (time-travel), and access control.

FR-3: Multi-Tier Processing Pipelines. Data passes through three quality tiers: Bronze (raw data as-is), Silver (cleansed, deduplicated, standardized), and Gold (aggregated, enriched, ready for consumption). Sovereign AI may be applied at each tier: classification, entity extraction, metadata generation, anomaly detection.

FR-4: AI Enrichment and Target Logic. The platform enables the application of a sovereign LLM and proprietary ML models to solve applied tasks: semantic search (RAG), document classification, text recognition (OCR), response generation, predictive analytics, and decision support.

FR-5: Analytics and Visualization. The system provides OLAP analytics, interactive dashboards, and report generation for leadership. Supported capabilities include ad-hoc SQL queries, multidimensional analysis, drill-down, and export to standard formats. An AI assistant helps formulate queries in natural language.

FR-6: Model Training and Management. The platform provides a full MLOps lifecycle: experiment versioning, model registry, fine-tuning (LoRA), A/B testing, drift monitoring (tracking changes in the properties of incoming data), and automatic rollback.

Non-Functional Requirements

NFR-1: Deployment Mode. On-premise, air-gapped environment (full or partial internet isolation). Updates are delivered via Data Diode or local repository. Controlled through architectural audit.

NFR-2: Availability and Consistency. 99.9% for analytics and dashboard services; 99.5% for AI assistants. Strong consistency for mission-critical documents and eventual consistency for pipeline-generated data. Controlled via Prometheus, Grafana, and regular SLA reporting.

NFR-3: Response Time. Dashboards and OLAP queries: < 3 sec; RAG queries to LLM: < 5 sec to first token; Batch processing: < 4 hours for a daily batch. Controlled via OpenTelemetry and Jaeger.

NFR-4: Throughput. ≥ 100 RPS (requests per second) at API Gateway; ≥ 500 concurrent users; ≥ 10 million documents uploaded per month. Controlled via k6 or Locust.

NFR-5: Scalability. Horizontal scaling of all stateless components. Linear throughput growth when adding nodes. Controlled via Kubernetes HPA and utilization monitoring.

NFR-6: Fault Tolerance. RPO (Recovery Point Objective) ≤ 1 hour; RTO (Recovery Time Objective) ≤ 4 hours. Data is replicated in 3 copies. Controlled via quarterly DR testing.

NFR-7: Security. Data encryption at-rest (AES-256) and in-transit (TLS 1.3). RBAC at row and column level. Audit of all data operations. Controlled via penetration testing and log auditing.

NFR-8: Resilience Under Resource Constraints. System capacity is expanded in increments (procurement occurs once a year). The system operates under partial GPU node loss (graceful degradation): AI functions degrade, but analytics remains available. Controlled via chaos engineering (Litmus).

NFR-9: Observability. Logs, metrics, traces. Unified monitoring console. Alerts with escalation. Controlled via ELK/Loki + Prometheus + Jaeger.

NFR-10: Compatibility. API – RESTful (OpenAPI 3.0) and gRPC. Data formats are Parquet, JSON, Avro. Storage is S3-compatible. SQL is ANSI SQL via Trino. Controlled via contract testing (Pact).
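
To make the availability and throughput targets in NFR-2 and NFR-4 tangible, they can be converted into a downtime budget and an average ingest rate. The calculation below assumes a 30-day month; the helper names are illustrative.

```python
def downtime_budget_minutes(availability_pct, period_days=30):
    """Allowed downtime per period for a given availability target."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def average_rate_per_second(items_per_month, period_days=30):
    """Average arrival rate implied by a monthly volume."""
    return items_per_month / (period_days * 24 * 3600)


print(round(downtime_budget_minutes(99.9), 1))      # analytics and dashboards
print(round(downtime_budget_minutes(99.5), 1))      # AI assistants
print(round(average_rate_per_second(10_000_000), 2))  # document uploads
```

Under these assumptions, 99.9% allows about 43 minutes of downtime per month and 99.5% about 3.6 hours, while 10 million documents per month averages just under 4 documents per second. Peak load will be higher than the average, so the average is a lower bound for capacity planning.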

General Architecture Concept

The proposed architecture is a multi-tier platform that supports the full lifecycle of AI solutions – from data collection and processing to model inference and integration with application systems.

Note: The specific technology stack may vary depending on project requirements. The key invariant is the architectural layers, not the specific frameworks, which may be changed or replaced with proprietary solutions. The emphasis is on the use of open-source technologies.

Diagram 1: Architectural layers and technology stack of the GovTech AI-Native platform

Architecture Layer Descriptions

L1. Data Ingestion Layer

This layer addresses FR-1: continuous collection of heterogeneous data from multiple government and external systems under conditions of limited bandwidth and unstable connectivity.

Ingestion Mode	Technology	When Applied
Batch	Apache Spark + Airflow	Scheduled exports from SIS, SIR. File arrays. Historical loads.
Streaming	Kafka Connect	Event streams: transactions, logs, IoT sensor data in real time.
CDC (Change Data Capture)	Debezium → Kafka	Incremental synchronization with source relational DBs without load on them (WAL log reading).
Files	SFTP / NFS → Spark	File exchanges with third-party systems.
Manual	Web forms → API Gateway → Kafka	Manual data entry by operators, document uploads.

For sources with unstable connectivity, the following mechanisms are in place: a guaranteed-delivery queue (Kafka with acknowledgement), request retries (retry with exponential backoff), and a source-side buffer (lightweight agent based on Filebeat). When the receiver is temporarily unavailable, data accumulates in the queue and is loaded upon recovery.
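
The retry mechanism can be sketched as follows. This is a generic illustration with jitter added to avoid synchronized retries from many agents; the function name, the delay values, and the choice of `ConnectionError` as the retryable error are assumptions.

```python
import random
import time


def send_with_retry(send, payload, max_attempts=5, base_delay=0.5,
                    max_delay=30.0, sleep=time.sleep):
    """Call send(payload), retrying on connection errors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up; the caller keeps the payload in its buffer
            # Delay ceiling doubles each attempt: 0.5 s, 1 s, 2 s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

When all attempts fail, the exception propagates so that the source-side buffer retains the data until connectivity returns.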

L2. Data Storage Layer

This layer addresses FR-2: consolidation of all data types in a single managed storage.

Component	Technology	Purpose
Object Storage	SeaweedFS (S3-compatible)	Unstructured data: documents, PDF scans, audio, video, ML model weights. Scalable storage for the Bronze layer.
Data Lakehouse	Apache Iceberg + SeaweedFS	Primary analytical storage. ACID transactions, time-travel (return to any data version), hidden partitioning. Contains Bronze, Silver, and Gold layer tables.
RDBMS	PostgreSQL	Structured Gold-level data, application and low-code solution metadata.
Vector Database	Qdrant	Vector embeddings of documents and texts for semantic search (RAG). Metadata filtering.
Message Queue	Apache Kafka	Event bus covers all inter-component interactions, CDC streams and notifications. Provides loose coupling and replay capability.

SeaweedFS stores data, Iceberg manages table metadata, Spark and Trino perform computations. This allows storage volume and processing capacity to be scaled independently.

L3. Data Processing Layer

This layer addresses FR-3 and FR-4: multi-tier data transformation pipelines with sovereign AI integration.

Tier	Actions	Technologies
Bronze	Data is saved as-is, an exact copy of the source is made. Minimal processing: adding technical metadata (load date, source, batch ID). Schema-on-read approach.	Spark, Flink, Iceberg
Silver	Data cleansing: deduplication, format standardization (dates, addresses, names), type casting. Quality validation. Text extraction from documents. AI is used for OCR, named entity extraction (NER), document type classification, data normalization. Data quality is ensured via Great Expectations Framework.	Spark, Flink, Airflow, LLM/VLM
Gold	Aggregation, business metric calculation, building analytical data marts and OLAP cubes. Data enrichment for decision-making. Sovereign AI and proprietary ML solutions are applied for citizen appeal sentiment analysis, risk scoring, anomaly detection, predictive models, and embedding generation for RAG.	Spark, Trino, MLflow

Diagram 2: Medallion Architecture pipeline

Apache Airflow manages scheduling and dependencies of batch pipelines (DAGs); it does not persist the intermediate state of a running task. Temporal is used for long-running processes with state persistence and the ability to resume from the last successful step. Both orchestrators have retry logic and failure notifications. In general, Airflow is better suited for orchestrating Bronze→Silver→Gold transformations, while Temporal is better for reliable event-driven Bronze-layer data consumption and critical downstream notifications.

Resilience under resource constraints (NFR-8) is achieved through graceful degradation. If GPU nodes are temporarily unavailable, documents are still saved in the Bronze and Silver layers, while AI enrichment is deferred and executed once GPU capacity recovers. Analytics on the Gold layer continues to operate on previously prepared data.

Diagram 3: OCR pipeline structure

OCR process sequence:

Document scans for digitization are placed into the Data Lakehouse.
The OCR Producer service periodically retrieves file metadata from S3 and sends it to Kafka.
The OCR Workflow service, acting as a process orchestrator, consumes messages from Kafka and initiates digitization.
In the Preprocessing stage, the scan is split into pages and each page is standardized.
In the Classification stage, the document type is determined using a predefined classifier.
In the Extraction stage, document content is extracted using the most suitable model for that document type.
In the Validation stage, the result is verified via Pydantic schemas, checksums, LLM, and Guardrails AI.
In the Fixing Errors stage, a human corrects mistakes by annotating data in a labeling client such as Label Studio.
In the Save Data stage, the result is saved to the database and vector store.
In the Transform stage, the text is transformed according to the project's business logic.

L4. AI & ML Layer

This layer addresses FR-4: target data processing logic via sovereign AI and proprietary ML models, as well as FR-6: the full MLOps lifecycle.

Subsystem	Components	Purpose
LLM Inference	vLLM (production), llama.cpp (edge / fallback), NeMo Guardrails (generation safety)	LLM inference, including national models. Response generation, summarization, classification. Guardrails control output safety.
RAG Pipeline	LangChain / LangGraph, Embedding Model, Reranker, Semantic Cache (Redis / Valkey), Langfuse	Semantic document search: query → embedding → vector search (Qdrant) → reranking → context assembly → LLM → response. Semantic Cache reduces GPU load by 30–40% for repeated queries.
ML Training & Serving	MLflow (Tracking, Registry, Serving), Kubeflow (distributed training), Feast (Feature Store)	Experiment versioning, model registry, LoRA fine-tuning orchestration on national data, ML feature management.
ML Observability	Evidently AI	Data drift and model drift monitoring, automatic quality report generation, retraining triggers.

Diagram 4: RAG pipeline structure

RAG pipeline step-by-step:

1. The user submits a natural language query via UI or API.
2. The AI Gateway authenticates the request, applies rate limiting, and routes it to the RAG service.
3. The query is converted to an embedding, and the cache is searched for a semantically similar earlier query via nearest-neighbor lookup.
4. If a cache hit occurs, the response is returned directly to the client (3a -> 3b); otherwise, control is passed to the RAG pipeline.
5. The user query is converted to an embedding, using the same model that was used to index the documents.
6. A vector search is performed to identify the Top-K nearest documents by cosine similarity. Optional metadata filtering is applied.
7. The Top-K candidates are passed through a cross-encoder model that more accurately evaluates the relevance of each query-document pair. The Top-N best are selected.
8. A prompt is assembled: system prompt + context from Top-N documents + user query. Prompt templates from the Prompt Registry are applied.
9. The assembled prompt is sent to the LLM. vLLM performs inference with streaming (token-by-token).
10. The LLM response passes through safety filters: toxicity check, hallucination detection, policy compliance. If a violation is detected, the response is blocked or modified.
11. The response is saved to cache with a defined TTL.
12. The final response is returned to the user.
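
The control flow of the steps above can be sketched as one function with every component injected as a callable. This is an outline under stated assumptions, not the platform's code: all parameter names are hypothetical, authentication is left to the gateway, streaming is omitted, and a single embedding is reused for both the cache lookup and retrieval.

```python
def answer(query, *, embed, cache_get, cache_put, search, rerank,
           render_prompt, generate, is_safe, top_k=20, top_n=5):
    """Run one RAG request end to end and return the response text."""
    vector = embed(query)                         # query -> embedding
    cached = cache_get(vector)                    # semantic cache lookup
    if cached is not None:
        return cached                             # cache hit: skip the LLM
    candidates = search(vector, top_k)            # vector search, Top-K
    context = rerank(query, candidates)[:top_n]   # cross-encoder, Top-N
    prompt = render_prompt(context=context, question=query)
    response = generate(prompt)                   # LLM inference
    if not is_safe(response):                     # output safety filters
        return "The response was blocked by policy."
    cache_put(vector, response)                   # store for later queries
    return response
```

Note that a blocked response is not written to the cache, so a later fix to the guardrail policy takes effect immediately for the same query.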

It should be noted that NeMo Guardrails and Guardrails AI can perform validation not only after the main LLM request, but at all stages of the request lifecycle.

Stage 1. User Request:

NeMo – Input Rails: jailbreak attempt detection; toxic and offensive content filtering; off-topic request blocking; personal data detection and masking in the query.
Guardrails AI – Input Validation: prompt injection checks; input length and format validation; input schema compliance verification; personal data detection in user query.

Stages 5–7. Vector Search over Knowledge Base:

NeMo – Retrieval Rails: filtering of irrelevant chunks from results; source credibility and admissibility verification; context restriction to permitted documents only; blocking transmission of chunks with prohibited content to LLM.
NeMo – Dialog Rails: dialog flow management via Colang scripts; deterministic responses to specific phrases; fallback scenario handling (what to do when LLM doesn't know); topic control throughout the session.

Stage 8. Prompt Construction for LLM:

NeMo – Execution Rails: control of permitted tool calls; whitelist of allowed external APIs; blocking of unauthorized agent actions; logging of all tool calls for audit.

Stages 9–10. Response Generation:

NeMo – Output Rails: hallucination check; personal data removal from final response; toxic content filtering; off-topic response blocking.
Guardrails AI – Output Validation: parsing LLM response into structured format; response schema validation; toxicity, relevance, and factual accuracy checks; automatic query retry upon validation failure.

Diagram 5: Model Training pipeline

Model Training pipeline step-by-step:

Create experiment. Name, hyperparameters, and dataset reference are defined.
Launch training pipeline as a DAG in Kubernetes.
Request training features. Feast ensures point-in-time correct data sampling, preventing data leakage.
Read historical features from the offline store (Iceberg) and fresh features from the online store (PostgreSQL or Redis).
Prepare data by forming train/validation/test splits, augmentation, normalization, and class balancing.
Train the model. For LLMs – LoRA fine-tuning of the base model. For ML – training via XGBoost / CatBoost / scikit-learn.
Log metrics at each epoch (loss, accuracy, F1), hyperparameters, and artifacts (checkpoints, charts).
Evaluate the model on the test set. Calculate target metrics. For LLMs – MMLU benchmarks and user dataset evaluations.
If metrics exceed the threshold, the model is registered in the registry (version, metadata, metrics) with Staging status.
Promote the model from Staging to Production after review and/or A/B testing.
Deploy the new version: LLM to vLLM (LoRA adapter update), ML to MLflow Serving or Triton. Rolling update with zero downtime.
Continuous monitoring by comparing the distribution of input data and predictions against a reference dataset.
Upon detection of data drift or model drift, an automatic retraining process is triggered. Data drift refers to changes in the statistical characteristics of input data over time, which can negatively impact model performance. Model drift refers to changes in model behavior or quality over time.
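
A simple way to quantify data drift for one numeric feature is the Population Stability Index (PSI), which compares binned distributions of a reference sample and a current sample. The sketch below is a generic stdlib illustration and does not represent how Evidently AI computes its reports; the bin count and the rule-of-thumb alert threshold of 0.25 are assumptions.

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps log() defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = list(range(100))
shifted = [v + 50 for v in reference]
needs_retraining = psi(reference, shifted) > 0.25  # assumed alert threshold
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the reference, which makes it usable as a retraining trigger.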

L5. Presentation Layer

This layer addresses FR-5: analytics, OLAP, and leadership dashboards.

Component	Technology	Purpose
Web Client with AI Assistant	React / Angular, Chat UI	Interface for dialogue with the corporate LLM: data queries, SQL generation in natural language, report summarization, etc.
AI / API Gateway	Kong AI Gateway	Single entry point for external consumers: authentication, authorization, rate limiting, routing. OpenAPI 3.0. Centralized limit management and token quota distribution, protection against cascading failures.
BI & Dashboards	Apache Superset	Interactive dashboards for leadership: KPIs, trends, drill-down. Connections to Trino (Gold layer) and PostgreSQL. Auto-refresh scheduling. Automatic PDF/Excel report generation and distribution on schedule.
OLAP	Trino	Federated SQL queries to the Gold layer (Iceberg), PostgreSQL, S3. Multidimensional analysis, pivot tables.

The AI assistant allows leaders to formulate analytical queries in natural language: "Show the trend of citizen appeals by region for the last quarter." The LLM translates the query into SQL, Trino executes it on the Gold layer, and Superset visualizes the result.

L6. Security, Governance & Observability

This cross-cutting layer ensures compliance with non-functional requirements NFR-1, NFR-7, NFR-9.

Domain	Components	Purpose
Identity & Access	Keycloak (IAM), RBAC/ABAC	Unified authentication (SSO), row- and column-level authorization, LDAP integration.
Secrets & Encryption	HashiCorp Vault	Key, certificate, and secret management. Key rotation. At-rest encryption.
Network Security	Calico / Cilium	Kubernetes network policies: micro-segmentation, zero-trust between pods.
Data Catalog & Lineage	DataHub	Cataloging of all data, lineage tracking from source through Bronze → Silver → Gold to dashboard. Metadata search.
Data Quality	Great Expectations	Automated validation at the Bronze/Silver boundary: completeness, format, ranges, business rules. Alerting on violations.
Observability	ELK / Loki (logs), Prometheus + Grafana (metrics), OpenTelemetry + Jaeger (traces)	Logs, metrics, traces. Unified console. SLA dashboard with availability, latency, and error metrics.
CI/CD & GitOps	ArgoCD (GitOps), Harbor (Container Registry), Ansible / Terraform (IaC)	Declarative configuration management. Local image registry for air-gapped environment. Reproducible deployments.

Key Architectural Decisions

Event-driven architecture on Kafka. All inter-component interactions – including data ingestion, index updates, and notifications – are implemented through a message queue. This ensures loose coupling and a complete audit trail of all operations.

Storage-Compute Separation. The Data Lakehouse based on Apache Iceberg and S3-compatible storage allows storage and processing to be scaled independently. This is critical as the data lake grows to millions of documents per month, where analytical compute capacity must not compete with inference capacity.

Medallion Architecture. The quality tier separation ensures reproducibility (Bronze – source copy), manageability (Silver – cleansed data), and readiness for use (Gold – analytics). AI enrichment is embedded at the transitions between tiers.

Graceful Degradation. When GPUs are unavailable, AI pipeline stages are deferred, but core data processing and analytics continue to operate. The system automatically resumes AI enrichment upon resource recovery.

Caching. Session caching, intermediate LLM result caching, rate limiting. Caching of semantically similar LLM queries reduces GPU load by 30–40% and decreases latency for common questions.

Air-gapped CI/CD. ArgoCD + Harbor + local Nexus provide a complete deployment cycle without internet access. Updates are delivered via Data Diode and undergo security verification.

Data Lineage (DataHub). End-to-end tracking of data origin from source through all Medallion Architecture layers to a specific dashboard. Required for audit and regulatory compliance.

Architecture Decision Records

ADR-001: Apache Kafka as the Unified Event Bus

Status: Accepted

Context
The system integrates dozens of heterogeneous data sources. Components must be loosely coupled. Environment: air-gapped, on-premise.

Option	Pros	Cons
Apache Kafka	Event replay, high throughput, CDC integration via Debezium, mature ecosystem	Operational complexity
RabbitMQ	Simplicity, low latency	No replay, weak CDC support

Decision: Kafka

Rationale
Event replay capability is critical for the Bronze layer: if AI enrichment fails, data is not lost and the pipeline restarts from the correct point. Debezium is built on Kafka Connect, so Kafka is its native and best-supported target.
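The replay argument can be illustrated with a small in-memory model, assuming the usual log-plus-committed-offset semantics. `EventLog` and `Consumer` are hypothetical names used only for this sketch; they are not Kafka client classes.

```python
from typing import Any, Callable


class EventLog:
    """Append-only log: events are never removed after being read."""

    def __init__(self) -> None:
        self._events: list[Any] = []

    def append(self, event: Any) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int) -> list[tuple[int, Any]]:
        return [(i, e) for i, e in enumerate(self._events) if i >= offset]


class Consumer:
    """Commits an offset only after the handler succeeds for that event."""

    def __init__(self, log: EventLog, handler: Callable[[Any], None]) -> None:
        self.log = log
        self.handler = handler
        self.committed = 0  # next offset to process

    def poll(self) -> int:
        processed = 0
        for offset, event in self.log.read_from(self.committed):
            self.handler(event)          # may raise, e.g. enrichment is down
            self.committed = offset + 1  # reached only on success
            processed += 1
        return processed
```

If the handler raises, `committed` still points at the failed event, so the next `poll` resumes from exactly that point and nothing is skipped or lost.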

Consequences

➕ Loose coupling of all components, full audit trail of operations

➕ CDC without load on source databases

➖ Requires a dedicated cluster and operational expertise

ADR-002: Apache Iceberg as the Data Lakehouse Format

Status: Accepted

Context
A storage format is needed for Medallion Architecture with ACID support, time-travel, and independent scaling of storage and compute. On-premise, S3-compatible storage.

Option	Pros	Cons
Apache Iceberg	ACID, time-travel, hidden partitioning, support for Spark, Trino, and Flink	Younger format, more complex than Delta Lake in some scenarios
Delta Lake	Mature, excellent Spark integration	Tied to Databricks ecosystem, weaker with Trino
Apache Hudi	Good for CDC upserts	More complex to operate, smaller community
Parquet + Hive	Simplicity	No ACID, no time-travel

Decision: Apache Iceberg

Rationale
Iceberg is an engine-agnostic standard: it is simultaneously read by Spark (processing), Trino (analytics), and Flink (streaming) without data copying. Delta Lake creates a dependency on Databricks, which is unacceptable for a sovereign environment.

Consequences

➕ Storage-Compute Separation: storage and compute scale independently

➕ Time-travel for audit and reproducibility

➖ Requires a metadata catalog (Hive Metastore or Nessie)

ADR-003: SeaweedFS as Object Storage

Status: Accepted

Context
An S3-compatible, open-source object storage is needed for on-premise deployment. It stores documents, PDF scans, ML model weights, and Bronze-layer data. Volume: tens of millions of files.

Option	Pros	Cons
SeaweedFS	Efficient small file storage, built-in Filer, low metadata overhead, S3-compatible	Smaller community, less documentation
MinIO	De-facto S3 on-premise standard, large community, excellent documentation	High overhead with millions of small files; the vendor has discontinued development of the Community Edition
Ceph	Versatility (block + object + file)	High operational complexity

Decision: SeaweedFS

Rationale
Fully open-source. SeaweedFS is architecturally optimized for storing large numbers of small files.

Consequences

➕ Efficiency when storing millions of documents

➖ Smaller community and ecosystem compared to MinIO

ADR-004: vLLM + llama.cpp as Two-Tier LLM Inference

Status: Accepted

Context
LLM inference is needed in an air-gapped environment. GPU resources are limited and expanded in increments. Graceful degradation is required upon GPU node loss.

Option	Pros	Cons
vLLM (primary)	PagedAttention, continuous batching, high throughput, streaming	Requires GPU, high memory consumption
llama.cpp (fallback)	CPU inference, minimal requirements, quantization	Low speed on large models
Triton Inference Server	Versatility, multi-model	Configuration complexity, overkill for LLM
Ollama	Simplicity	Not production-ready for high load

Decision: vLLM as primary + llama.cpp as edge/fallback

Rationale
The two-tier approach addresses NFR-8 (resilience under resource constraints): when GPU is unavailable, the system does not fail but degrades to CPU inference at reduced speed. vLLM provides production throughput via PagedAttention and continuous batching.
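A minimal sketch of the two-tier routing idea, assuming each runtime is wrapped in a plain callable. `TwoTierInference` and `BackendUnavailable` are hypothetical names; the actual vLLM and llama.cpp client calls are deliberately left out and replaced by whatever callables the caller passes in.

```python
from typing import Callable


class BackendUnavailable(Exception):
    """Raised by a backend wrapper when its runtime cannot serve requests."""


class TwoTierInference:
    """Tries the GPU-backed primary first, then degrades to the CPU fallback."""

    def __init__(self, primary: Callable[[str], str], fallback: Callable[[str], str]):
        self.primary = primary    # e.g. a wrapper around the vLLM endpoint
        self.fallback = fallback  # e.g. a wrapper around llama.cpp

    def generate(self, prompt: str) -> tuple[str, str]:
        """Returns (tier, completion) so callers can log which tier answered."""
        try:
            return "primary", self.primary(prompt)
        except BackendUnavailable:
            return "fallback", self.fallback(prompt)
```

Returning the tier alongside the completion makes degraded operation visible in traces and dashboards instead of hiding it behind a slower response.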

Consequences

➕ Graceful degradation: if GPU is unavailable, llama.cpp continues operation

➕ LoRA adapters are updated in vLLM without restart

➖ Two different runtimes require maintaining two configurations

ADR-005: Langfuse for RAG Observability

Status: Accepted

Context
A tool is needed for tracing and quality evaluation of the RAG pipeline: call chain logging, relevance assessment, prompt A/B testing. Environment: air-gapped.

Option	Pros	Cons
Langfuse	Open-source, self-hosted, full tracing functionality	Fewer integrations than LangSmith
LangSmith	Deep LangChain integration, rich UI	Cloud service – incompatible with air-gapped

Decision: Langfuse

Rationale
LangSmith is a cloud service, fundamentally incompatible with an air-gapped environment and data sovereignty requirements. Langfuse is deployed on-premise, has open-source code, and covers all required scenarios: tracing, evaluation, and Prompt Registry.

Consequences

➕ Full control over tracing data

➕ Prompt Registry for prompt version management

➖ Requires self-deployment and maintenance

ADR-006: Airflow and Temporal for Orchestration

Status: Accepted

Context
The system contains two classes of processes: regular batch data transformations (Bronze→Silver→Gold) and long-running event-driven processes (OCR, downstream notifications). A single orchestrator handles both scenarios poorly.

Option	Pros	Cons
Airflow only	Single tool, mature, large community	Stateless: no resume-from-failure for long-running processes
Temporal only	Stateful, durable execution, retry from any point	Overkill for simple DAG scheduling
Airflow + Temporal	Each tool optimized for its task	Two tools in the stack
Prefect	Modern, simpler than Airflow	Smaller community, weaker for stateful workflows

Decision: Airflow for batch pipelines + Temporal for event-driven processes

Rationale
The OCR pipeline can run for hours and must resume from the point of failure without losing progress – this is Temporal's domain. Bronze→Silver→Gold transformations are classic scheduled DAGs – Airflow's domain. Separation of concerns eliminates compromises.
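The resume-from-failure requirement can be sketched as per-step checkpointing. This is not Temporal's API, only a plain-Python illustration of the idea behind durable execution; `run_ocr_job`, `ocr_page`, and the `checkpoints` dictionary are hypothetical, and a real store would be persistent.

```python
from typing import Callable, Sequence


def run_ocr_job(
    job_id: str,
    pages: Sequence[str],
    ocr_page: Callable[[str], str],
    checkpoints: dict[str, dict[int, str]],
) -> list[str]:
    """Processes pages in order, skipping any page already checkpointed."""
    done = checkpoints.setdefault(job_id, {})
    for index, page in enumerate(pages):
        if index in done:
            continue  # finished in an earlier attempt, do not redo it
        # If ocr_page raises, results for earlier pages stay in `done`.
        done[index] = ocr_page(page)
    return [done[i] for i in range(len(pages))]
```

Calling the function again with the same `job_id` after a failure continues from the first unfinished page, which is the behavior a stateless scheduled DAG does not give for free.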

Consequences

➕ Each orchestrator is optimal for its class of tasks

➕ Reliability for long-running processes (OCR, notifications)

➖ The team must be proficient in both tools

ADR-007: Qdrant as the Vector Database

Status: Accepted

Context
The RAG pipeline requires storage and search of document vector embeddings. Volume: tens of millions of chunks. On-premise, air-gapped. Metadata filtering is required.

Option	Pros	Cons
Qdrant	Rust implementation (performance), rich payload filtering, on-premise, active development	Smaller ecosystem than Weaviate
pgvector	PostgreSQL already in stack, simplicity	Degrades at large volumes (>10M vectors)
Weaviate	Rich functionality, GraphQL API	Go implementation, higher memory consumption
Milvus	High performance	Deployment complexity, etcd dependency
ElasticSearch (kNN)	Already in stack	Not optimized for vector search

Decision: Qdrant

Rationale
At 10M+ documents per month, pgvector loses performance. Qdrant is written in Rust, demonstrates the best latency/throughput ratio in benchmarks, supports metadata filtering (critical for document-level RBAC), and deploys as a simple single binary.
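To show why metadata filtering matters for document-level RBAC, here is a small sketch that filters by role before ranking by similarity. It does not use the Qdrant client; `Chunk`, `allowed_roles`, and `search` are illustrative assumptions about how access metadata might be attached to chunks.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    vector: tuple[float, ...]
    allowed_roles: frozenset[str]  # hypothetical payload field for RBAC


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    chunks: list[Chunk],
    query_vector: tuple[float, ...],
    user_roles: frozenset[str],
    top_k: int = 3,
) -> list[Chunk]:
    """Drops chunks the user may not see, then ranks the rest by similarity."""
    visible = [c for c in chunks if c.allowed_roles & user_roles]
    ranked = sorted(visible, key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return ranked[:top_k]
```

Filtering before ranking means a restricted chunk never reaches the prompt, even when it is the closest match to the query.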

Consequences

➕ High performance at large volumes

➕ Metadata filtering for access control

➖ Additional component in the stack (PostgreSQL cannot be reused)

ADR-008: Medallion Architecture

Status: Accepted

Context
Data arrives from dozens of heterogeneous sources of varying quality. A storage strategy is needed that ensures reproducibility, quality manageability, and analytics readiness. AI enrichment may be unavailable (no GPU).

Option	Pros	Cons
Medallion (Bronze/Silver/Gold)	Reproducibility, quality isolation, AI graceful degradation	Data stored in multiple copies
Data Vault	Auditability, flexibility	High modeling complexity
Kimball DWH	Mature approach, familiar to analysts	Rigid schema, poor fit for unstructured data
Single storage	Simplicity	No quality isolation, difficult to roll back

Decision: Medallion Architecture

Rationale
Bronze as an immutable source copy is insurance against transformation and AI enrichment errors. When GPU is unavailable, documents accumulate in Bronze/Silver, and AI enrichment is executed upon resource recovery. The Gold layer is always available for analytics on previously prepared data.

Consequences

➕ Reproducibility: any stage can be restarted

➕ Graceful degradation: analytics does not depend on AI availability

➖ Storing data in three copies increases disk space requirements

ADR-009: NeMo Guardrails + Guardrails AI as Two-Tier LLM Protection

Status: Accepted

Context
The government system handles citizens' personal data. The LLM may generate toxic content, hallucinations, or personal data leaks. Protection is needed at all stages of the request lifecycle.

Option	Pros	Cons
NeMo Guardrails	Colang scripts for dialog flow, input/output/retrieval/execution rails, from NVIDIA	Ties to NVIDIA ecosystem
Guardrails AI	Structured output validation, Pydantic schemas, retry on failure	Does not cover dialog flow
Prompt engineering only	Simplicity	Unreliable, easily bypassed
Custom filters	Full control	High development and maintenance cost

Decision: NeMo Guardrails (dialog/flow protection) + Guardrails AI (structural validation)

Rationale
The tools cover different aspects: NeMo manages dialog logic and blocks undesirable scenarios at the flow level; Guardrails AI validates the structure and content of output against schemas. Together they form a layered defense critical for a government system handling personal data.
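The layering can be sketched in plain Python: one check before the model is called and one structural check on its output, with a retry. This does not use the NeMo Guardrails or Guardrails AI APIs; the blocked pattern, the required fields, and the function names are hypothetical placeholders for real rails and schemas.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical input rule; real rails would be far broader than one pattern.
BLOCKED_INPUT = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def input_rail(query: str) -> bool:
    """Returns True when the query may proceed to retrieval and generation."""
    return BLOCKED_INPUT.search(query) is None


def validate_output(raw: str) -> Optional[dict]:
    """Returns the parsed output if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            return None
    return data


def guarded_answer(query: str, llm: Callable[[str], str], max_attempts: int = 2) -> dict:
    if not input_rail(query):
        return {"blocked": True, "reason": "input rail"}
    for _ in range(max_attempts):  # retry when the output fails validation
        data = validate_output(llm(query))
        if data is not None:
            return {"blocked": False, **data}
    return {"blocked": True, "reason": "output validation"}
```

Every blocked result carries a reason, which is what makes blocking decisions auditable afterwards.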

Consequences

➕ Protection at all stages: input → retrieval → prompt → output

➕ Full auditability of all blocking decisions

➖ Latency overhead at each stage (~100–200ms total)

➖ Requires maintaining Colang scripts when business logic changes

Spiral Dynamics of Architectures and Conway's Law

Alexander Laputski — Sun, 04 Jan 2026 19:57:16 GMT

Conway's Law

Conway's Law: "Organizations design systems that copy the communication structure of the organization."

Martin Fowler notes that ignoring Conway's Law can distort a system's architecture. If the architecture is designed at odds with the structure of the development organization, tensions appear in the structure of the software. Module interactions that were meant to be simple become complicated, because the teams responsible for them communicate poorly with each other. Useful design alternatives are not even considered, because the necessary groups of developers do not talk to each other.

There is also the Inverse Conway Maneuver, which says that developers' communication patterns should be changed to encourage the desired software architecture.

So we need to design not only architectures but also communication structures. One can first sketch the architecture from the functional and non-functional requirements, and then build a communication structure that fits the project. Obviously, a company's organizational structure does not change at the snap of a finger: there is corporate culture, "the way things are done here". It helps to be armed with a systemic view of the business's organizational structure in order to perform the Inverse Conway Maneuver effectively.

Spiral Dynamics for Business

One of the most thoroughly developed systems for ordering the structure and dynamics of business organizations is the adaptation of Spiral Dynamics to modern companies (see the book Spiral Dynamics for Business, Спиральная динамика для бизнеса).

It should be noted that this is not a rigorous scientific theory. Rather, it is a set of psychological behavior patterns typical of various corporate situations. But Conway's Law itself connects two essentially different worlds: the world of human communication and the world of technological architectural rigor. It is like trying to connect table talk with Fermat's theorem. And yet the connection exists, because it is people who build projects, and it is often their lack of coordination that brings systems down. The problem will probably disappear when AI starts creating complex distributed systems from the initial diagram through to automated tests, but as long as people take part in development, they bring in an irreducible factor of interpersonal interaction.

The basic hypothesis is that organizations are better at building systems that match their dominant level in Spiral Dynamics for Business (hereafter SDB).

The organizational and communication structure dictates what the project should look like, while the technology architecture chosen at the very beginning may not match that image at all. If a project requires a technology architecture of a higher SDB level than the current one, it is better to raise the level of the business first and then implement the project.

If the business ignores Conway's Law, a drama of mismatch between the organizational and technology architectures unfolds. If the actual SDB level is lower than the level of the project's technology architecture, the business will unconsciously destroy the architecture, dragging it down to its own level. If it is higher, the business will raise the architecture through a series of refactorings.

SDB Levels

More on the SDB levels can be found in the book. Here we give only the necessary minimum.

Most organizations are at the red, blue, and orange levels.

SDB Principles

Taken from here.

Dominant. An organization usually has no single coherent culture, but there is always a dominant culture that sets the principles by which the organization is managed.
Competitiveness. There are no good or bad types of culture. There are competitive and uncompetitive cultures.
Sequence. Corporate culture develops sequentially; a level cannot be "skipped".
Foundation. A new level of corporate culture is introduced on top of well-functioning tools of the previous level; these tools are the foundation for the next level.
Crisis. As an organization develops, its type of corporate culture changes, and this happens through crises of manageability.
Obstacle. The principles that made a corporate culture develop and grow stronger eventually become the main obstacle that destroys the organization.
Background. When an organization moves to a new type of corporate culture, the old type does not go anywhere; it stays as an ordinary, familiar phenomenon (the background) that serves as the basis for the other types of culture.
Pendulum. Development moves from an individualistic culture to a collective one, and then back.
Mimicry. An organization can degrade (mimic) to lower levels of culture as a result of failing to get through crises. It can even "fall" two levels down at once. For example, this can happen because old, proven tools from previous levels of development end up undeservedly forgotten.
Leadership. For the organization to develop, the leader must be 0.5 to 1 level above the organization's culture. The corporate culture develops only through improvement in the management practiced by the organization's leaders.

Observations on the Link Between SDB and Architectures

Glass ceiling. An organizationally immature company cannot move to a higher technological level because it does not see the benefits. Tech leads can move to higher SDB levels on their own initiative and introduce the corresponding architectures only with the tacit consent of management, which does not fully understand the point of the reforms but is, for some reason, loyal to them.

Conflict of two worlds. Fluctuations of corporate cultures and transitions between SDB levels are individual to each company. There is no general tendency for every business to inevitably reach the blue, then the orange, then the green level and beyond. IT architectures worldwide, by contrast, show a permanent tendency toward greater complexity, which creates a need for a matching corporate culture. So architecture pulls culture along with it, if it aspires to be fully implemented.

Matryoshka. SDB levels nest inside each other like matryoshka dolls: the next level does not cancel the previous ones but absorbs the best of them. A similar effect holds for the evolution of architectures. Best practices and frameworks remain as the base, and those that answer the challenges of higher SDB levels are added to them.

Natural Mapping of SDB to Architectures

Below is an attempt to indicate which architectural practices are most natural for each SDB level. The mapping is not rigid: any architecture can be introduced at any level, but there is no guarantee that it will take root.

Beige Level

There is no communication structure, nor any separation of responsibilities.

The program runs, and that is good enough.

Risks: hardcoding, spaghetti code, design anti-patterns.

Example: a Proof-of-Concept to check whether an idea is viable.

Purple Level

The communication structure is chaotic: "I think Vasya built that feature, but I'm not sure".

Every developer who joins the project does things "as they see fit", ignoring the opinions of others; there is neither a common center for coordinating efforts nor any processes.

Design patterns (structural, creational, behavioral) are not used or are used to a limited extent. Documentation is absent as a class; it lives only in the heads of the initiated.

Can microservices be introduced at the purple level? Certainly, anything can be introduced, but there is no guarantee that anyone will follow someone else's approach.

Risks: tightly coupled code (high coupling); every serious change requires structural and functional refactoring.

Example: a large legacy system with no plans for expansion.

Red Level

The communication structure is shaped like a star or an octopus (the boss in the center, everyone else at the edges).

Centralized control of the system dominates; everything depends entirely on the character and qualifications of the person at the top. Architecturally this is not necessarily a monolith, but a centralized control block will certainly appear that matches the boss's view of the world (by the boss's will or even against it). Nobody cares about code quality, modularity, or fault tolerance unless the boss specifically pays attention to it.

Example: I had the chance to observe this in a company with a pronounced red management style that had introduced microservices and shared libraries for use across many projects. As it turned out, each of them contained not only generic code but also hardcoded logic specific to individual systems. There was simply nobody to pay down the technical debt and perform the decomposition, because all decisions were made by one person who "knows better". The management model was such that the head personally assessed employees' workload and shuffled them between projects, bypassing the team leads. If the projects had had truly independent leads, they would not have allowed their code to be polluted with other projects' logic.

Documentation, as at the purple level, still lives only in heads, but now mostly in one single head.

Risks are conveniently classified using Adizes' PAEI management model:

P dominant - short-term decisions, resistance to innovation, and as a result growing tech debt and the emergence of a single point of failure
A dominant - excessive bureaucracy, no Agile/Lean, resistance to innovation, and as a result growing tech debt and low system adaptability
E dominant - risky, unjustified decisions; the system evolves too fast and in unpredictable directions, and the architecture cannot keep up
I dominant - functional and non-functional requirements are subordinated to the teams' working comfort (this is the case where microservices are created per team so that the teams do not quarrel, while DDD, load, and data flows may be ignored)

Example: a startup built around the competencies of a CTO and CEO in one person.

Blue Level

The communication structure is hierarchical; processes and roles are in place, and everything has a specification and detailed documentation.

Working to standards is natural at this level: Twelve-Factor App, design patterns, clear agreed APIs and interaction protocols, Service Oriented Architecture, Reactive Manifesto, event buses, versioning, CI/CD pipelines, SDLC, zero-trust.

With blue as the dominant level, the organization is content with its current level of automation. Mechanisms for introducing innovation that could overcome the bureaucracy are absent or limited. Large-scale refactorings are unlikely.

Agile and Scrum are not set up to be truly iterative; system reliability matters more and is achieved through a bureaucratic ban on poorly predictable changes. Sooner or later everything slides into waterfall development with rare, large releases.

Risks: excessive bureaucracy, lack of innovation, heavy waterfall development.

Example: a mobile client for a large bank that runs on a monolith.

Orange Level

The communication structure is a matrix, with horizontal (cross-functional teams) and vertical (departments) links.

This is a highly competitive environment marked by meritocracy and rivalry among autonomous teams (you build it, you run it).

It is at this level that microservices become the natural architectural base, mirroring the team topology. Clear API contracts are introduced as the language of interaction between teams.

A real need for a modular, low-coupling architecture appears; Agile and Scrum are truly iterative; tech debt is paid down regularly; the growth culture shows in innovations being introduced regularly and through the backlog; IaaS/PaaS/SaaS/DaaS (Data-as-a-Service) appear.

It is at this level that adopting SRE practices becomes a necessity: clear performance metrics, tracing, and observability appear. One must be able to measure effectively not only success but also competition. The system is optimized for business metrics and scaling. At lower levels SLO/SLA may not be clearly articulated; there the system only has to be "just delivered", "just kept from falling over", "just maintained".

Risks: subsystems may compete with each other; there is no high-level semantic orchestration of processes and no common direction for the system's evolution.

Example: a corporation on the scale of Amazon or Netflix with many subsystems and internally competing products.

Green Level

The communication structure is peer-to-peer and network-centric.

The goal of such a system is to support the company's mission. Basic customer focus dictates the need for a truly flexible and adaptable architecture.

Natural here are Event-Driven Architecture, Data Mesh (Data as a Product), Service Mesh (Data Plane + Control Plane), Event Sourcing, API management platforms (such as WSO2), and FinOps.

The system is built around data flows rather than functions. Platform teams (Platform Engineering) are created to serve the product teams.

The "mission" of the system itself may be high-level orchestration of processes by issuing instructions for reconfiguring the system (policy-as-code). The software core takes the form of an IDP (Internal Developer Platform): a parameterized, extensible framework with enough degrees of freedom to give the company's mission room to maneuver.

Risks: if the company changes direction too often, this leads to architectural pivots, which paralyzes strategic planning and ultimately makes the architecture chaotic.

Example: Spotify with its Squads model.

Yellow and Turquoise Levels

The communication structure is "neuronal", namely:

decentralized management,
smart dynamic links between actors that allow a single signal to travel across the communication network,
dynamic ad-hoc teams created specifically for a task,
high EQ and a culture of interaction among engineers,
competence is valued above job title.

Natural here are Evolutionary Architecture, basic decentralization of processes and control points (for example via blockchain or Edge Computing), and the Serverless approach. The system must be fundamentally resilient to failures (Chaos Engineering) and to changing requirements. A need for self-healing appears.

This is perhaps the first level where swarm intelligence of AI agents truly belongs, with agents acting as points of local decision-making that can transform the system evolutionarily. Everything before this can get by without AI agents. At the red level there is only one agent: the boss. The blue level is the realm of agreed instructions, where AI hallucinations in strategic matters will not be tolerated. At the orange level agents may be used in pursuit of better metrics, but that is also risky when a guaranteed result is required. At the green level the mission dictates decisions. The yellow level, thanks to its high baseline adaptability, is not afraid of errors and failures in subsystems. Perhaps true self-healing will be implemented through agents.

Risks: hard to reach and even harder to hold on to.

Examples: yellow - Google and Googleyness; turquoise - Valve (at least according to their Handbook for New Employees).

Conclusions

We build architecture according to the customer's requirements, but it often turns out later that our organization is not ready for that kind of project because of its internal structure.

First, stakeholders should know about this.

Second, using Spiral Dynamics for Business, we can perform the Inverse Conway Maneuver deliberately.

If I Were Starting Now

Alexander Laputski — Thu, 01 Jan 2026 10:56:25 GMT

Everyone is panicking. AI has radically changed the job market for juniors, cutting the number of vacancies by 50%. I have no recipe for finding a job other than studying longer and building your own pet projects. This is the basic strategy for leveling up on absolutely any rung of the professional ladder.

In general, someone else's experience does not apply to mine. Nobody's experience is relevant to another person. The most you can do is dissect someone else's experience and take some ingredients for yourself in the form of behavior models. But without context they are worth nothing. And context is always unique. Still, what would I advise myself if I were starting my first job in 2026?

It is better to pick up knowledge right in the course of work. If you need to hook Kafka up to Spring Boot and process incoming messages but have no such experience, the typical work situation is: find a relevant article, apply it, move on to the next task. Nobody will give you time to study a new technology, but that is exactly what is worth doing right away, on the spot. In your free time, but do it. Otherwise, in a couple of years you will find that you have worked with many technologies but do not really know any of them.
Keep the customer's real problem in mind, not just the perfection of the code. Sometimes the customer does not fully understand their own main problem. You need to help them with that. From their side it may look like criticism of their sacred ideas, but it is better than simply staying silent.
Writing code so that the customer cannot let you go, because nobody else can make sense of it, is rather low. It is better to do your work so that replacing you is extremely easy, because anyone who takes your place will immediately see how everything works. The same goes for managing teams and projects: what you leave behind should be a system, not a set of crutches that collapses as soon as the main prop is pulled out. This may seem like a losing strategy, but why would you want an employer who does not understand the value of your work?
Some activities drain your inner battery and others recharge it. Persistent negative emotions drain it heavily. Like a memory leak, they may go unnoticed in the moment, but the effect is sure to accumulate, and then a deep recharge will be needed. You need to scan yourself for stress factors and eliminate them where possible. What recharges best is work you love. Even if it is hard physical or cognitive labor, it still raises the charge through the right emotions.
Why keep learning new things, keep your finger on the pulse, and so on, if you see no worthy goal? Not losing your job is not a goal, it is survival. In outsourcing you solve other people's problems and bring someone else's dreams closer; the result of your work does not belong to you. That in itself is a permanent energy leak. But if you figure out how to integrate this work into a personal mission, then you are not rowing a galley but using it as public transport that stops where you need it. You help both yourself and the company.
The stance "we are not fighting each other, we are fighting the problem" is not pointless; it is worth sticking to even when colleagues or customers are clearly fighting against you. First, the problem itself is more likely to be solved. Second, the people who bring destructive behavior will become obvious.

Books of 2025

Alexander Laputski — Sat, 27 Dec 2025 19:30:38 GMT

Suleyman - The Coming Wave. The author predicts that the new technological wave, besides unprecedented discoveries driven by the mutual reinforcement of AI and SynBio, will carry left and right risks. Risks on the left are the danger of uncontrolled proliferation of technologies that become available to everyone. Risks on the right are systems of total digital surveillance with social scoring and all the attractions. Suleyman proposes ten steps toward containing the wave, an approach similar to the nuclear non-proliferation treaty.

Thiel - Zero to One. Seven questions every entrepreneur should ask themselves. Should I go build a monopoly, then?

Mozhenkov - The Team Gene (Ген команды). On building effective teams. Although everything is described using sales teams as the example (Audi car sales), the advice can be treated as universal.

Adizes - How to Manage in Times of Crisis. On creating a team in which all PAEI roles are balanced. Better to read "The Ideal Executive: Why You Cannot Be One and What to Do About It" first.

Gupta, Sucher - The Power of Trust. Uber did not trust its employees and users and paid for it; don't be like Uber.

Gladwell - Blink. Thin slices (first impressions) are decisive.

Lencioni - The Five Dysfunctions of a Team. Top managers are a team too. The five dysfunctions: mutual distrust, avoidance of conflict, lack of commitment, failure to hold others accountable, indifference to the common result.

Lencioni - The Truth About Employee Engagement. People hate their jobs for three reasons: anonymity (the employee as a resource), irrelevance (it is unclear whom my work helps), and immeasurement (there is no clear metric of success).

Logan, King, Fischer-Wright - Tribal Leadership. An outstanding book, no filler, every paragraph is to the point. Now I want to work only in stage-five companies, or build them. Nothing less than stage four. Unfortunately, such companies are vanishingly rare.

Strebulaev, Dang - The Venture Mindset. So that is how it works. Valuable information, above all for startup founders.

Ries - The Lean Startup. Development through fast and cheap hypothesis testing.

Kawasaki - Quick Start (Быстрый старт). Advice for startup founders.

Collins, Lazier - Beyond Entrepreneurship. Better to read "Good to Great" first.

HBR - Leadership. A collection of classic articles.

HBR - Personal Effectiveness. A collection of classic articles.

Dobelli - The Art of Thinking Clearly. A collection of cognitive biases; some seem debatable because they are formulated in a vacuum, without reference to context.

Strelecky - The Cafe on the Edge of the World. Why are you here?

Elmore - The Eight Paradoxes of Great Leadership. Quite subtly observed contradictions in what is expected of a leader. But this is also the key to interpreting the leader's role correctly.

Bekhterev, Bekhtereva - Spiral Dynamics for Business (Спиральная динамика для бизнеса). Like Adizes' books and like "Tribal Leadership", it has the effect of a light bulb switching on in the brain and puts everything in order. The concept of Spiral Dynamics really works: you begin to understand many corporate processes that used to seem illogical. Above all, it becomes clear where to develop from the current state.

Sartre - Nausea. About an IT guy on a galley and what he goes through. It does not matter that the events take place in 1932; it is exactly the same now.

Jünger - At the Wall of Time. In his works Jünger operates with distinctive images unlike anything else: Gestalts. The Unknown Soldier, the Worker, the Loner, the Anarch. These are composite philosophical, social, and anthropological images that express the essence of an era. This essay also has a central image: the wall of time, behind which something fundamentally new awaits.

Each of the books I read is interesting in its own way, but if I pick a personal top, among the achievement-oriented books it is "Tribal Leadership" and "Spiral Dynamics for Business", and among the philosophical ones it is definitely Jünger's "At the Wall of Time".

An Interpretation of the Essay "At the Wall of Time"

Time is intuitively associated with a flowing substance, such as a river. As a wall, however, it marks the boundary of an epoch, beyond which lies a situation fundamentally unlike the present one. Beyond the wall awaits the realm of the Titans, risen from the abyss of Tartarus. The Titans are technology, which creates itself out of unorganized matter through the medium of man.

The new epoch will be a reign of collectivism, a direct analogy with Guénon's reign of quantity. Individualism and the integrity of the person will be marginalized and give way to "collective intelligence" as the new subject of the historical process. In the legal, sociological, and theological sense, these will truly be new atoms from which the masses are constructed.

The essay was written in 1959, but it becomes more relevant with each passing year. Today's anxious wait for AGI is akin to waiting for a rising Titan whose heat can already be felt at the surface.

Engraving "Prometheus commits fire to GitHub":

Prometheus (the one who foresees) stole fire from Olympus and brought it to people. For this, Zeus sent people Pandora, who stuck her nose where she should not have, and humanity was struck by countless misfortunes. In principle, if the reasoning function emulated through token prediction is the stolen fire of reason, then it is fairly clear what will crawl out of Pandora's box: mass redundancy, the precariat as the norm, unlimited life extension for the elites and UBI for the masses, human data, social credit scoring. All of these are the risks, both left and right, from the book by Mustafa ~~Mond~~ Suleyman.

Giant data centers, like the smelting furnaces of Hephaestus from which the fire is stolen, are accessible only to Olympus, that is, to big tech. Also, according to the myth, Prometheus molded man from clay and brought him to life with fire. In the new version this closely resembles Nick Bostrom's posthuman.

While a left-leaning Prometheus hands out open-source, decentralized fire to all and sundry, the right-wing version of Prometheus hands it only to the MAGA-positive with a gold card. In any case, decentralization becomes the nerve of the new epoch. In crossing the wall of time, hierarchies collapse and are compromised by network-centric structures.

GITEX Global 2025 🌍

Alexander Laputski — Sun, 19 Oct 2025 17:10:44 GMT

🔧 It seems that literally every country is striving to create its own sovereign AI ecosystem. These will be closed autonomous AI-based computing infrastructures using open-source AI to ensure data sovereignty. This is perhaps the main trend in GovTech. Countries that succeed in this race will build more adaptive and less bureaucratized governance systems, giving them a productivity boost in literally every area of public administration. Machines have bureaucracy too (like BPMN), but unlike systems consisting of people, such systems adapt their behavior much more easily.

🤖 Agent autonomy is good, but reliability is more important. Agents are increasingly refining their specialization to the stage where they can perform sufficiently reliable atomic operations that don't require continuous human control. It's necessary to descend to this level to feel solid ground under your feet, push off from it, and start building multi-agent systems whose reliability is no lower than that of the weakest link.

🎨🧠 Previously, technology only extended human physicality, but for the first time it has reached cognitive function as well. It's now evident that the objectified reasoning function implemented in LLMs is quietly becoming the foundation of a new technological paradigm. This process forcibly localizes everything unpredictable and non-deterministic in humans within the intentional act, which, more than ever before, deserves to be called creativity. After all, creativity is precisely what translates chaos into order. AI only repeats after humans. As a result, natural sciences are becoming more like humanities, with a clearly defined creative act at the center, while humanities are becoming mechanized, allowing the most predictable creative tasks to be translated into technical ones (Midjourney, Sora, etc.). Somewhere in the middle, they will meet. Apparently, professions will now be divided not so much into humanities and technical fields, but rather into algorithmizable and non-algorithmizable ones.

Digital Transformation of Government Agencies

Alexander Laputski — Sun, 05 Oct 2025 22:01:47 GMT

Abstract

This paper explores the main challenges that government agencies face and suggests a distributed platform architecture for data integration, cleaning, and analysis, incorporating artificial intelligence and machine learning algorithms. It discusses the platform's infrastructure and software core, as well as the idea of using enterprise AI with vector data storage. We describe scenarios where users analyze data using AI agents and where AI agents manage business processes with a controlled feedback loop. The solution is designed for partially or fully isolated environments, using an open-source-first approach and planned capacity allocation.

Introduction

The complexity and fragmentation of government information systems require a method to unify data from multiple sources, enabling analytics and management capabilities that were previously out of reach. The proposed platform creates a unified pipeline for ingesting, storing, cleaning, normalizing, and analyzing data, as well as executing and optimizing processes. It includes enterprise AI that offers contextually relevant answers and secure access to action tools. We focus on practical design choices suitable for single-cluster, resource-constrained, and isolated deployments.

Current Challenges

Based on years of experience building systems for government agencies, we have identified the following priorities:

Data integration and processing. There is a need to combine and process data from fragmented systems to enhance decision-making and gain a complete analytical view that no single system can provide.
Document digitization. Digitizing, analyzing, and classifying large volumes of documents, including long-term archives. This requires strong OCR capabilities, structured entity extraction, topic classification, deduplication, and preservation of legal validity.
Process unification. Linking agency workflows on a single platform to boost operational efficiency and improve decision-making quality.
Document routing optimization. Improving the routing of documents through approvals, reviews, and administrative procedures by using machine learning and AI-assisted process design.
All solutions must work with limited or no internet connectivity, prioritize open-source components, and follow planned resource allocation.

We propose a complete solution that combines reliable production components with new enterprise AI features.

Architecture Overview

The architecture includes:

🖥️ A software and infrastructure core for a distributed, fault-tolerant system.
🌊 A centralized data lake that integrates internal and external sources.
🔄 Multiple ingestion modes: API subscriptions, streaming and batch loads, and user-driven uploads.
🧹 Data cleansing and transformation, including ML-driven enrichment and classification.
🤖 An enterprise, agent-based AI that works with platform data and supports user-defined business scenarios.
📊 An analytics platform.

The infrastructure core is designed as a distributed, fault-tolerant system with interconnected tools that offer system administration, information security, authentication and authorization, and a low-code editor for managing data structures. The microservice architecture ensures horizontal scalability and isolates failures [1][2][3]. It includes built-in GitLab CI/CD and observability features like metrics, logs, and traces.

The centralized data lake combines internal and external sources, with distinct areas for raw data, cleaned data, and consumer data marts. Metadata is added to the data lake along with the main data stream and is utilized in the low-code editor.

ETL and ELT channels are set up using API subscriptions, event buses, streaming, batch loading, and user uploads with validation [4]. Cleaning and transformation include normalization, enrichment, deduplication, feature extraction, and ML classification and regression tasks. We regularly address technical debt in data, models, and infrastructure [5]. Data loaders, the ML model software core, and CI/CD are developed following an Agile approach within the software development lifecycle (SDLC).
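
As a rough illustration of the normalization and deduplication steps named above, here is a minimal sketch in plain Python. The record fields (`org_name`, `doc_number`) and the use of a content hash as the deduplication key are assumptions made for the example, not the platform's actual schema or method.

```python
import hashlib
import unicodedata


def normalize(record: dict) -> dict:
    """Unicode-normalize, collapse whitespace, and case-fold every string field."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = unicodedata.normalize("NFKC", value)
            value = " ".join(value.split()).casefold()
        cleaned[key] = value
    return cleaned


def fingerprint(record: dict) -> str:
    """Stable hash of a normalized record, used as the deduplication key."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized record."""
    seen = set()
    unique = []
    for raw in records:
        rec = normalize(raw)
        key = fingerprint(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical input: the first two records differ only in spacing and case.
raw_records = [
    {"org_name": "  Ministry of  Transport ", "doc_number": "A-17"},
    {"org_name": "ministry of transport", "doc_number": "a-17"},
    {"org_name": "Ministry of Finance", "doc_number": "B-02"},
]
clean_records = deduplicate(raw_records)
```

Exact-match hashing only catches duplicates that become identical after normalization; near-duplicates (typos, reordered fields in free text) would need a fuzzier comparison.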

Enterprise AI relies on LLMs, vector data storage, and the n8n process orchestrator. Using LLMs greatly reduces training time for specific business tasks thanks to their few-shot learning capability [6]: a model can perform a task well given only a few examples, so it adapts to new tasks without much extra training. Currently, the n8n process orchestrator can use tools such as microservice API calls and email sending. The feedback loop lets system administrators make controlled changes to business process rules using AI agents, supporting ongoing development.
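
In the sense used in [6], the few examples are placed directly in the prompt and the model's weights stay unchanged. The sketch below shows what such a prompt might look like for a document classification task. The task wording, the sample documents, and the category names are illustrative assumptions, and the call to an actual LLM is deliberately left out.

```python
def build_few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a prompt with labeled examples followed by the unlabeled query."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Document: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    lines.append(f"Document: {query}")
    lines.append("Category:")  # the model is expected to continue from here
    return "\n".join(lines)


# Hypothetical examples; a real deployment would draw them from platform data.
examples = [
    ("Request to transport 12 tons of grain from the warehouse.", "freight request"),
    ("Invoice No. 15 for transportation services.", "invoice"),
]
prompt = build_few_shot_prompt(
    "Classify each document into one category.",
    examples,
    "Certificate confirming completed delivery.",
)
print(prompt)
```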

The analytics subsystem offers tools for managing data representation through diagrams and graphs, report generation, dashboards, and OLAP data marts, all with artifact versioning and regulatory compliance.

Overall, the architecture follows the 12-Factor methodology, which includes configuration through the environment, immutable builds, build/release/run separation, and logging as event streams [7]. It also adheres to The Reactive Manifesto, focusing on asynchronous messaging, elasticity, resilience, and responsiveness [8].
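
To make "configuration through the environment" concrete, here is a minimal sketch of a service reading its settings from environment variables and failing fast when a required one is missing. The variable names (`MONGO_URI`, `QDRANT_URL`, `LOG_LEVEL`) and the example values are assumptions for illustration only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    qdrant_url: str
    log_level: str


def load_settings(env=os.environ) -> Settings:
    """Read configuration from the environment; raise if a required value is absent."""
    try:
        return Settings(
            mongo_uri=env["MONGO_URI"],
            qdrant_url=env["QDRANT_URL"],
            log_level=env.get("LOG_LEVEL", "INFO"),  # optional, with a default
        )
    except KeyError as missing:
        raise RuntimeError(
            f"Missing required environment variable: {missing.args[0]}"
        ) from None


# A plain dict stands in for the process environment here (hypothetical values).
settings = load_settings(
    {"MONGO_URI": "mongodb://mongo:27017", "QDRANT_URL": "http://qdrant:6333"}
)
```

Because nothing environment-specific is baked into the build, the same image can be promoted unchanged between environments, which is the point of the build/release/run separation.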

Technologies Overview

The platform core uses a microservice architecture built on Kubernetes, Java, Spring Framework, TypeScript, Angular, Keycloak, Gradle, and Docker. It also includes a proprietary framework with a special constructor (low-code structure editor) that allows users to create and visualize data structures and integrate them into business processes without extra programming.

The data lake uses MinIO and MongoDB for raw and intermediate data and metadata, PostgreSQL for cleaned data, Elasticsearch for log storage, and Redis for caching. Data is added to the lake through Kafka, RabbitMQ, Spring Cloud Data Flow, and gRPC. A separate channel for critical data handles document digitization from various file formats, including image-based ones such as PDF, JPG, and PNG.

Data cleaning and transformation are carried out using Python, PyTorch, scikit-learn, transformers, BERT, PyCaret, CatBoost, XGBoost, Optuna, and Apache Airflow for process orchestration.

Enterprise AI is built on Spring AI, using Mistral-24b and gpt-oss-20b as LLMs, llama.cpp for cloud deployment, Ollama for local deployment, Qdrant for embedding storage, Docker, and Kubernetes. Currently, working prototypes are being developed using LangGraph, n8n, and pgvector technologies for hybrid search scenarios.

The analytics platform employs Stimulsoft Reports, Stimulsoft Dashboards, JasperReports, BIRT, and proprietary developments.

Architecture Decision Records

ADR-001. Microservices on Kubernetes. We adopt a microservices architecture orchestrated by Kubernetes to achieve horizontal scalability, fault isolation, and independent releases. Services are built in Java with Spring, containerized with Docker, and delivered via GitLab CI/CD. This decision balances operational control with elasticity suitable for a single-tenant, isolated government cluster.

ADR-003. MongoDB-centered lake. We do not introduce Apache Iceberg because resources are constrained and the data is predominantly government-specific JSON. We choose MongoDB to support fast ingestion, flexible schema evolution, convenient APIs, and multi-document transactions for consistent data handling, even with many temporary artifacts. We implement our own custom Bronze/Silver/Gold layering on MongoDB and MinIO.

ADR-004. RAG as the foundational enterprise AI pattern. We standardize on Retrieval-Augmented Generation with Qdrant as the vector store to keep knowledge fresh without retraining LLMs. This pattern provides controllable grounding of responses and supports offline operation, provided that rigorous offline/online evaluations are conducted to manage hallucination risk.

ADR-005. BPMN for business orchestration and n8n for technical workflows. We use BPMN (e.g., Camunda) to orchestrate end-to-end business processes and n8n to execute technical steps, with an AI agent proposing changes that are subject to human approval. This separation maintains business traceability while allowing quick updates to integrations and automations.

ADR-006. Security for isolated government environments. We adopt a zero-trust approach with OIDC via Keycloak, encryption in transit and at rest with managed keys, and data masking for PII. Trust boundaries are explicitly defined to reflect isolated clusters, and external interfaces are minimized while maintaining necessary interoperability.

ADR-007. In-house ML pipeline. We create our own model training and inference pipeline, storing intermediate datasets and features in MongoDB to save resources and maintain full control over execution, scheduling, and debugging. This approach reduces external dependencies, keeps costs predictable, and aligns with the need to self-host all components within the isolated perimeter.

ADR-008. Observability. We implement observability using Prometheus for metrics, Grafana for visualization, Zipkin for tracing, and ELK for logs, ensuring end-to-end visibility across APIs, data pipelines, and AI components. We deliberately exclude Istio because the deployment targets a single Kubernetes cluster in an isolated environment, and adding a service mesh would increase complexity without proportional operational benefit.
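
The PII data masking mentioned in ADR-006 can be sketched, under simplifying assumptions, as a regex pass over text before it is logged or indexed. The two patterns below (email addresses and phone-like digit sequences) and the placeholder tokens are illustrative, not the platform's actual rules.

```python
import re

# Assumed patterns: a pragmatic email matcher and a loose phone-number matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like sequences with fixed placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


masked = mask_pii("Contact Ivan at ivan.petrov@example.org or +375 29 123-45-67.")
print(masked)
```

Pattern-based masking only covers identifiers with a recognizable shape; the personal name in the sample sentence passes through untouched, so names and addresses would need a different technique such as entity extraction.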

SLO/SLI

All monitoring is implemented with self-hosted, open-source tools and stored within an isolated environment.

Platform Availability SLO: 99.5% monthly in single-cluster isolated mode. SLI is measured using uptime probes and request success rates.
Data Ingestion SLO: 95th percentile end-to-end latency is under 30 minutes for daily batch loads and under 8 hours for historical batch loads (previous years' data). SLIs are measured from the first byte received to when the data is stored.
AI Response SLO: 95th percentile inference latency is under 5 seconds for retrieval and generation on local models. SLI includes vector search latency and token generation rate.
Operations: Mean Time to Recovery is under 2 hours. Change failure rate is under 15%.
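
To show what these targets mean in practice, the sketch below converts the availability SLO into a monthly downtime budget and checks a latency sample against the AI response SLO. The 30-day month, the nearest-rank percentile method, and the sample latencies are assumptions for the example.

```python
import math


def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime budget implied by an availability SLO over a period of `days`."""
    return (1.0 - slo) * days * 24 * 60


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the sample."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# 99.5% over a 30-day month leaves roughly 216 minutes (3.6 hours) of downtime.
budget = allowed_downtime_minutes(0.995)

# Hypothetical end-to-end AI response latencies in seconds.
latencies_s = [1.2, 0.8, 4.1, 2.3, 1.9, 3.7, 0.9, 2.8, 1.1, 6.4]
p95 = percentile(latencies_s, 95)
ai_slo_met = p95 < 5.0  # AI Response SLO: p95 under 5 seconds
```

With only ten samples, a single slow request decides the 95th percentile, so the SLI needs a large enough measurement window to be meaningful.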

Enterprise AI

Let's explore how AI is used in the platform. The process starts with centralized data being ingested into the data lake, while content is indexed in vector storage at the same time. Indexing considers different source types like text, tables, and mixed documents.

Users create prompts for the LLM, specifying business scenarios and indicating which documents to consider directly in the user interface. The agent searches the vector storage, gathers relevant fragments, and generates results in the desired format.
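
The retrieve-then-generate step described above can be sketched as follows. This is a toy stand-in: a bag-of-words vector replaces the embedding model, an in-memory list replaces the vector storage, and the final LLM call is omitted. All function names and sample fragments are assumptions for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, fragments: list[str]) -> str:
    """Ground the question in the retrieved fragments."""
    context = "\n".join(f"- {f}" for f in fragments)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


chunks = [
    "invoice 15 covers freight transportation of grain",
    "the certificate confirms work completion for route 7",
    "office opening hours are nine to five",
]
query = "which invoice covers grain transportation"
top = retrieve(query, chunks)
prompt = build_prompt(query, top)
```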

The agent can then use system tools through MCP interfaces: calling microservice REST APIs, saving results to databases, sending email reports, starting the next process stage, or passing control to agents with different master-prompt instructions. This method allows for automatic system updates based on a thorough understanding of its state.

Currently, we use n8n as an experimental workflow automation tool. Users create n8n schemes to implement business scenarios. A specialized process orchestration agent suggests corrections to these n8n schemes (such as changing the order of steps, parameters, and routing), which are then reviewed by system administrators. This creates a managed feedback loop where the system evolves based on data analysis and AI recommendations.
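
A minimal sketch of such a managed feedback loop is shown below: the agent's suggestion is stored as a pending proposal and only changes the active workflow after an administrator approves it. The class names, the workflow identifier, and the simplified `steps` structure are hypothetical and do not reflect the real n8n JSON format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Proposal:
    workflow_id: str
    description: str
    new_definition: dict
    status: Status = Status.PENDING


@dataclass
class WorkflowRegistry:
    active: dict = field(default_factory=dict)   # workflow_id -> current definition
    history: list = field(default_factory=list)  # audit trail of reviewed proposals

    def review(self, proposal: Proposal, reviewer: str, approve: bool) -> None:
        """Apply a proposal only if a human reviewer approves it."""
        if proposal.status is not Status.PENDING:
            raise ValueError("proposal already reviewed")
        proposal.status = Status.APPROVED if approve else Status.REJECTED
        if approve:
            self.active[proposal.workflow_id] = proposal.new_definition
        self.history.append((proposal.workflow_id, reviewer, proposal.status.value))


registry = WorkflowRegistry(active={"quote-calc": {"steps": ["load", "price", "notify"]}})
proposal = Proposal("quote-calc", "Notify before pricing", {"steps": ["load", "notify", "price"]})

# The administrator rejects the change, so the active workflow stays as it was.
registry.review(proposal, reviewer="admin", approve=False)
```

Keeping every decision in `history` gives an audit trail, which matters when the changes touch regulated business processes.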

Example Use Case

Consider a test example that implements a freight transportation data scenario. The test data corpus includes one thousand documents of each of three types: freight transportation requests, work completion certificates, and invoices.

The business process is implemented as a BPMN scheme (request, carrier selection, transportation, work completion certificate, invoice) and can be integrated into the platform using frameworks like Camunda. Several n8n schemes with AI participation are implemented that analyze transportation requests and initiate cooperation, calculate quotes based on cargo data and current tariffs, send transportation status messages to social networks, generate transportation documents, and distribute them to counterparties. Here BPMN acts as the overall process orchestrator, while n8n executes specific technical actions that can be initiated by BPMN handlers, external actors (customer, carrier), or simply on schedule.
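
The stage sequence of this business process can be pictured as a simple linear state machine. The sketch below is only a stand-in for a BPMN engine such as Camunda: the stage identifiers are derived from the list above, while the `Shipment` class and its methods are hypothetical.

```python
# Stages taken from the process description: request, carrier selection,
# transportation, work completion certificate, invoice.
STAGES = [
    "request",
    "carrier_selection",
    "transportation",
    "work_completion_certificate",
    "invoice",
]


class Shipment:
    """Tracks one shipment as it moves through the stages in order."""

    def __init__(self, shipment_id: str):
        self.shipment_id = shipment_id
        self.stage_index = 0
        self.log = [STAGES[0]]  # stages visited so far

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self) -> str:
        """Move to the next stage; a technical workflow could be triggered here."""
        if self.stage_index == len(STAGES) - 1:
            raise RuntimeError("process already completed")
        self.stage_index += 1
        self.log.append(self.stage)
        return self.stage


shipment = Shipment("S-001")
while shipment.stage != "invoice":
    shipment.advance()
```

A real BPMN model would also cover branching, timers, and compensation; the linear version only illustrates where stage transitions give technical workflows a hook.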

The business scenario agent receives a user prompt to analyze three document types and formulate a cost optimization strategy. The result should be saved as a report. The user works with this agent until obtaining an acceptable strategy.

The orchestration agent analyzes customer behavior (correspondence, work reviews), transportation incidents, efficiency of each transportation stage for specific routes, n8n integration fault tolerance, and formulates proposals for updating BPMN and n8n schemes. The result includes a text report and new versions of .bpmn and .json files.

Conclusion

We have examined the current challenges of government agencies and proposed a solution: a software platform that uses artificial intelligence. The platform provides comprehensive data source integration, document digitization and intelligent processing, and AI-based business process management. The architecture relies on proven distributed system design practices and supports a managed feedback loop for systematic improvement of data and process orchestration. This development creates a technological foundation for regulated digital transformation of government agencies.

References

[1] Bass L., Clements P., Kazman R. Software Architecture in Practice. – 4th ed. – Boston: Addison-Wesley, 2021. – 624 p.
[2] Tanenbaum A. S., van Steen M. Distributed Systems: Principles and Paradigms. – 2nd ed. – Upper Saddle River, NJ: Pearson; Prentice Hall, 2007. – 765 p.
[3] Newman S. Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith. – Sebastopol, CA: O’Reilly Media, 2019. – 272 p.
[4] Kleppmann M. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. – Beijing; Boston; Farnham; Sebastopol; Tokyo: O’Reilly Media, 2017. – 616 p.
[5] Sculley D., Holt G., Golovin D., Davydov E., Phillips T., Ebner D., Chaudhary V. [et al.]. Hidden Technical Debt in Machine Learning Systems / D. Sculley [et al.] // Advances in Neural Information Processing Systems. – 2015. – Vol. 28. – P. 2503–2511.
[6] Brown T. B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A. [et al.]. Language Models are Few-Shot Learners / T. B. Brown [et al.] // Advances in Neural Information Processing Systems. – 2020. – Vol. 33. – P. 1877–1901.
[7] Wiggins A. The Twelve-Factor App. – URL: https://12factor.net.
[8] Bonér J., Farley D., Kuhn R., Thompson M. The Reactive Manifesto. – URL: https://www.reactivemanifesto.org.

Job Security: Not Just Learning AI

Alexander Laputski — Wed, 10 Sep 2025 13:53:51 GMT

Everyone keeps saying it’s not AI that will replace you, but the person who uses it. And so you’re told: use AI, keep developing nonstop, and then you won’t get fired. But to me, the real subject here is not so much the employee as the business owner and the bugs in his head.

How many stakeholders out there are actually ready to reshape the business and declare new ambitious goals so there’s enough meaningful work for everyone and the team can be kept? Or just admit that now we can work less and, say, move to a four‑day week? Such people are clearly a minority. And those will be teams where they ALREADY value real collaboration and fight not each other, but the PROBLEM.

Accelerated tech learning (a squirrel spinning its wheel faster and faster) guarantees nothing. It’s only a necessary condition for staying employed. The sufficient one is belonging to a Stage 4 team (see the book “Tribal Leadership”), where people direct ambition not toward competing with colleagues but toward attacking the problem.