
Beyond the dashboard: orchestrating consistent data flows at scale
Dashboards are easy. Orchestration is hard. When your intelligence data lives in Notion, Reddit, Typeform, and PDFs simultaneously, keeping it aligned is an engineering problem, not a product one. Here is how we build the central nervous system.
The distributed research problem
A typical B2B intelligence project does not have one data source. It has eight.
There is the structured survey data from Typeform with Likert-scale responses. There is the qualitative thread from a private Reddit community with sentiment that contradicts the survey. There is the competitor analysis sitting in a Notion workspace. There are SEC filings downloaded as PDFs. There is a Slack channel where the product team has been sharing anecdotal customer feedback for six months. And there is the CRM export that nobody has cleaned since Q2.
Each of these sources speaks a different language. The survey has numbers on a scale. Reddit has unstructured text. The PDF has paragraphs. The CRM has freeform notes and inconsistent field names. Notion has hierarchical documents.
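One way to make that concrete is a discriminated union over the raw record shapes, so every downstream transformer is forced to handle each source explicitly. The types below are a hypothetical sketch, not Citium's actual schema:

```typescript
// Hypothetical illustration: one discriminated union for heterogeneous inputs.
// All field and type names here are assumptions, not the production schema.
type RawSourceRecord =
  | { kind: 'survey'; questionId: string; likertScore: number; scaleMax: number }
  | { kind: 'reddit'; threadId: string; body: string; upvotes: number }
  | { kind: 'pdf'; documentId: string; page: number; paragraph: string }
  | { kind: 'crm'; recordId: string; fields: Record<string, string> };

// The `kind` discriminant lets transformers switch exhaustively: adding a
// ninth source without handling it becomes a compile-time error.
function describe(record: RawSourceRecord): string {
  switch (record.kind) {
    case 'survey':
      return `survey answer ${record.likertScore}/${record.scaleMax}`;
    case 'reddit':
      return `reddit comment (${record.upvotes} upvotes)`;
    case 'pdf':
      return `PDF document ${record.documentId}, p.${record.page}`;
    case 'crm':
      return `CRM record ${record.recordId}`;
  }
}
```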
Building a dashboard that displays each of these in isolation is straightforward. Building a system where a claim from the Reddit thread can be compared against a score from the survey, with confidence that both are measuring the same construct at the same point in time, is an engineering problem. This article describes how we solve it.
The hub-and-spoke orchestration model
```
 [Reddit API]      [Typeform]      [Notion API]
      │                │                │
      ▼                ▼                ▼
┌─────────────────────────────────────────────────┐
│           NORMALISATION TRANSFORMERS            │
│  • Sentiment → [-1.0, +1.0] continuous scale    │
│  • Likert 5pt → normalised 0.0–1.0              │
│  • Unstructured text → typed claim records      │
│  • Dates → UTC ISO 8601                         │
└─────────────────────────────────────────────────┘
                         │
                         ▼
      ┌──────────────────────────────────────┐
      │   CENTRAL NERVOUS SYSTEM (NestJS)    │
      │                                      │
      │  • Conflict detection engine         │
      │  • Temporal alignment                │
      │  • Source weighting registry         │
      │  • Chain of Custody ledger           │
      └──────────────────────────────────────┘
           ┌─────────────┴─────────────┐
           │                           │
           ▼                           ▼
┌─────────────────────┐    ┌────────────────────────┐
│ Unified Data Store  │    │  Conflict Resolution   │
│    (PostgreSQL)     │    │    Queue (BullMQ)      │
└─────────────────────┘    └────────────────────────┘
           │
           ▼
┌─────────────────────┐
│ Presentation Layer  │
│  (Svelte 5 / PDF)   │
└─────────────────────┘
```

The NestJS backend is the hub. Every source is a spoke. No source communicates with another source directly. Everything passes through the Central Nervous System, which is where normalisation, conflict detection, and chain of custody tracking occur.
This architecture mirrors the stateless inference engine described in our article on decoupled research systems. The presentation layer, whether a Svelte frontend, a PDF export, or an API endpoint, always reads from the unified store and never queries the sources directly.
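That read-path constraint can be enforced structurally: the presentation layer depends only on a store interface that exposes queries, so no spoke client is even reachable from it. The names below (`UnifiedStore`, `loadTopic`) are illustrative, not the actual Citium API:

```typescript
// Hypothetical read-side gateway. The presentation layer sees only the
// unified store; no Reddit, Typeform, or Notion client is importable here.
interface UnifiedRecord {
  topic: string;
  normalisedValue: number;
  sourceId: string;
}

// The only read interface the presentation layer may depend on.
interface UnifiedStore {
  query(topic: string): UnifiedRecord[];
}

// A presentation-layer loader: it can reach the unified store and nothing else.
function loadTopic(store: UnifiedStore, topic: string): UnifiedRecord[] {
  return store.query(topic);
}
```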
Making data comparable: the normalisation problem
The hardest part of multi-source orchestration is not the ingestion. It is making data comparable across sources that were never designed to be compared.
Consider a concrete example: a Typeform survey asks participants to rate their satisfaction with a vendor’s support quality on a 5-point Likert scale. A Reddit thread from the same month contains dozens of comments about the same vendor’s support quality, ranging from effusive praise to descriptions of multi-day response failures.
How do you put these in the same chart? How do you determine whether the survey is more positive than Reddit, or whether they are measuring different populations? The answer starts with normalisation.
```typescript
// normalisation/normalised-score.ts
// Shared result type (fields inferred from the two normalisers below)
export interface NormalisedScore {
  normalisedValue: number;   // 0.0 = most negative, 1.0 = most positive
  originalValue: number;
  sourceId: string;
  normalisationMethod: string;
  schemaVersion: string;
  scaleMin?: number;         // present for scale-based sources
  scaleMax?: number;
  rawText?: string;          // present for text-based sources
}

// normalisation/likert.normaliser.ts
export function normaliseLikert(
  rawScore: number,
  scaleMin: number,
  scaleMax: number,
  sourceId: string,
): NormalisedScore {
  // Map to [0.0, 1.0] regardless of original scale (5pt, 7pt, 10pt)
  const normalised = (rawScore - scaleMin) / (scaleMax - scaleMin);
  return {
    normalisedValue: normalised,
    originalValue: rawScore,
    sourceId,
    scaleMin,
    scaleMax,
    normalisationMethod: 'min-max-linear',
    schemaVersion: 'normalised-score/v1',
  };
}
```

```typescript
// normalisation/sentiment.normaliser.ts
// sentimentAnalyser is an injected analysis client (implementation elided)
export async function normaliseSentiment(
  rawText: string,
  sourceId: string,
): Promise<NormalisedScore> {
  // Sentiment analysis produces a value in [-1.0, +1.0];
  // map it to [0.0, 1.0] to match the Likert normalisation
  const sentimentResult = await sentimentAnalyser.analyse(rawText);
  const normalised = (sentimentResult.score + 1) / 2;
  return {
    normalisedValue: normalised,
    originalValue: sentimentResult.score,
    rawText,
    sourceId,
    normalisationMethod: 'sentiment-linear-remap',
    schemaVersion: 'normalised-score/v1',
  };
}
```

Both normalisers produce a NormalisedScore with the same interface. A survey response of 4/5 produces a normalisedValue of 0.75. A Reddit comment analysed as mildly positive (sentiment score: +0.4) produces a normalisedValue of 0.70. These numbers can now be compared, not as equivalent signals, but as comparable signals with documented provenance.
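As a quick sanity check, both mappings can be reproduced in a few lines (restated inline here so the snippet stands alone):

```typescript
// Self-contained check of the two mappings described in the text.
// toUnitInterval restates the min-max formula used by the Likert normaliser.
function toUnitInterval(rawScore: number, scaleMin: number, scaleMax: number): number {
  return (rawScore - scaleMin) / (scaleMax - scaleMin);
}

// A 4/5 answer on a 1–5 Likert scale lands at 0.75
const surveyScore = toUnitInterval(4, 1, 5);

// A mildly positive sentiment of +0.4 in [-1, +1], remapped via (s + 1) / 2,
// lands at 0.70
const redditScore = (0.4 + 1) / 2;
```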
The normalisationMethod and schemaVersion fields are part of the chain of custody. If the normalisation algorithm changes in a future release, historical records retain their original method identifiers. The comparison is always performed between records with the same schemaVersion.
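A guard at the comparison boundary makes that rule mechanical rather than procedural. This is an illustrative sketch; `compareScores` is not part of the article's code:

```typescript
// Hypothetical guard: records normalised under different schema versions
// are never compared directly.
interface VersionedScore {
  normalisedValue: number;
  schemaVersion: string;
}

// Returns the signed difference, or throws if the versions differ.
function compareScores(a: VersionedScore, b: VersionedScore): number {
  if (a.schemaVersion !== b.schemaVersion) {
    throw new Error(
      `Cannot compare ${a.schemaVersion} against ${b.schemaVersion}`,
    );
  }
  return a.normalisedValue - b.normalisedValue;
}
```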
Conflict detection and resolution
The most operationally interesting case is when sources disagree. A survey says 78% of respondents are satisfied. The Reddit thread from the same month reads as predominantly negative. What does the intelligence system do?
The wrong answer is to silently average them, present whichever is more recent, or leave it to the analyst to notice the discrepancy. The correct answer is to detect the conflict explicitly, classify it, and surface it as a named data event.
```typescript
// conflict/conflict-detector.service.ts
import { Injectable } from '@nestjs/common';

export interface ConflictingSource {
  sourceId: string;
  normalisedValue: number;
  normalisationMethod: string;
}

export interface DataConflict {
  conflictId: string;
  sources: ConflictingSource[];
  conflictType: 'directional' | 'magnitude' | 'temporal';
  detectedAt: string;
  resolutionStrategy: 'source-weight' | 'recency' | 'manual-review' | 'unresolved';
  resolutionOutcome?: ResolvedValue;
}

@Injectable()
export class ConflictDetectorService {
  // sourceWeightRegistry and getTimeDeltaDays are injected/implemented
  // elsewhere in the service (elided here)

  async detectConflict(
    scores: NormalisedScore[],
    topic: string,
    dateRange: DateRange,
  ): Promise<DataConflict | null> {
    if (scores.length < 2) return null;

    const values = scores.map(s => s.normalisedValue);
    const range = Math.max(...values) - Math.min(...values);

    // Directional conflict: sources point in opposite directions
    const hasDirectionalConflict =
      values.some(v => v > 0.5) && values.some(v => v < 0.5);
    // Magnitude conflict: same direction, very different intensity (>0.3 gap)
    const hasMagnitudeConflict = !hasDirectionalConflict && range > 0.3;

    if (!hasDirectionalConflict && !hasMagnitudeConflict) return null;

    return {
      conflictId: crypto.randomUUID(),
      sources: scores.map(s => ({
        sourceId: s.sourceId,
        normalisedValue: s.normalisedValue,
        normalisationMethod: s.normalisationMethod,
      })),
      conflictType: hasDirectionalConflict ? 'directional' : 'magnitude',
      detectedAt: new Date().toISOString(),
      resolutionStrategy: this.selectResolutionStrategy(scores),
    };
  }

  private selectResolutionStrategy(
    scores: NormalisedScore[],
  ): DataConflict['resolutionStrategy'] {
    const hasRegisteredWeights = scores.every(s =>
      this.sourceWeightRegistry.has(s.sourceId),
    );
    if (hasRegisteredWeights) return 'source-weight';

    const timeDelta = this.getTimeDeltaDays(scores);
    if (timeDelta > 60) return 'recency'; // sources more than 60 days apart
    return 'manual-review'; // same period, no registered weights → human decision
  }
}
```

The conflict record is stored alongside the underlying data. When an analyst opens a research output that contains a conflict, they see it explicitly: “Reddit sentiment and survey satisfaction scores are directionally opposed for this topic in Q3. Resolution strategy: manual review.” They are not shown a blended average with the conflict hidden underneath.
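For the 'source-weight' strategy, resolution can be an explicit, recorded weighted mean rather than a silent blend. The sketch below assumes a simple Map-backed registry; the article does not show the source weighting registry's API, so these names are illustrative:

```typescript
// Hypothetical 'source-weight' resolution: a weighted mean over registered
// source weights. Names and weights are assumptions for illustration.
interface WeightedScore {
  sourceId: string;
  normalisedValue: number;
}

function resolveBySourceWeight(
  scores: WeightedScore[],
  weights: Map<string, number>,
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const s of scores) {
    const w = weights.get(s.sourceId) ?? 0; // unregistered sources contribute nothing
    weightedSum += w * s.normalisedValue;
    totalWeight += w;
  }
  if (totalWeight === 0) {
    throw new Error('No registered weights for any conflicting source');
  }
  return weightedSum / totalWeight;
}
```

The point is not the arithmetic but the bookkeeping: the resolved value would be stored in resolutionOutcome next to the conflict record, so the blend is documented rather than hidden.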
This is the system-level equivalent of what good research methodology demands of individual moderators: notice the contradiction, name it, and let the stakeholder decide how to interpret it.
The chain of custody: from raw API response to PDF report
The concept of a data chain of custody originates in legal and forensic practice. In research intelligence, especially in regulated sectors like banking and insurance, the same standard applies. A finding in a board-level report must be traceable, step by step, back to the raw data that produced it.
Citium implements chain of custody as a ledger: an append-only record of every transformation a data point undergoes from the moment it enters the system.
```typescript
// custody/chain-of-custody.service.ts
import { Injectable } from '@nestjs/common';

export interface CustodyEvent {
  eventId: string;
  dataPointId: string;
  eventType: 'ingested' | 'normalised' | 'conflict-detected' | 'conflict-resolved' | 'cited' | 'exported';
  performedBy: string;  // service identifier or user ID
  performedAt: string;  // ISO 8601
  inputHash: string;    // hash of the input state
  outputHash: string;   // hash of the output state
  metadata: Record<string, unknown>;
}

@Injectable()
export class ChainOfCustodyService {
  // serviceIdentifier, custodyLedger, and the sha256 helper are provided
  // elsewhere in the service (elided here)

  async recordEvent(
    dataPointId: string,
    eventType: CustodyEvent['eventType'],
    input: unknown,
    output: unknown,
    metadata?: Record<string, unknown>,
  ): Promise<void> {
    const event: CustodyEvent = {
      eventId: crypto.randomUUID(),
      dataPointId,
      eventType,
      performedBy: this.serviceIdentifier,
      performedAt: new Date().toISOString(),
      inputHash: sha256(JSON.stringify(input)),
      outputHash: sha256(JSON.stringify(output)),
      metadata: metadata ?? {},
    };
    await this.custodyLedger.append(event);
  }
}
```

Every CustodyEvent is appended to the ledger and never updated. When a data point is cited in a research output, the deterministic link between the orchestration layer and the audit trail records a cited event. When the output is exported to PDF, an exported event is recorded. The complete provenance chain is always available as an ordered sequence of events, each carrying both input and output hashes.
In an audit scenario (a regulator, a client’s legal team, or an internal compliance review), the chain of custody provides a cryptographically verifiable answer to the question: “Was this finding altered between collection and delivery?” The answer is in the ledger.
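How would an auditor actually check that? One simple invariant: walk the events in order and confirm that each event's inputHash equals the previous event's outputHash. A minimal sketch, assuming the ledger hashes JSON-serialised state with SHA-256; `verifyChain` is an illustrative helper, not part of the article's code:

```typescript
import { createHash } from 'node:crypto';

// Hash helper matching the ledger's sha256-over-JSON convention.
function sha256(value: unknown): string {
  return createHash('sha256').update(JSON.stringify(value)).digest('hex');
}

interface LedgerEvent {
  inputHash: string;
  outputHash: string;
}

// A chain is intact when every event's inputHash equals the previous event's
// outputHash: no state was altered between recorded transformations.
function verifyChain(events: LedgerEvent[]): boolean {
  for (let i = 1; i < events.length; i++) {
    if (events[i].inputHash !== events[i - 1].outputHash) {
      return false;
    }
  }
  return true;
}
```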
Orchestration is not a feature
Product teams often treat data orchestration as a feature to be added once the core product is built. This is the wrong sequence.
The normalisation logic, conflict detection, and chain of custody described here are not optional enhancements. They are the preconditions for any intelligence product that needs to be trusted by people making consequential decisions. Without them, a dashboard is a collection of numbers with uncertain provenance and unknown conflicts. With them, it is a research system.
Self-healing pipelines ensure that data arrives intact. Deterministic RAG ensures that synthesis is traceable, and the decoupled architecture ensures that records remain immutable. The orchestration layer described here guarantees that all of these inputs are aligned before any synthesis begins.
Consistency is not a dashboard property. It is a systems property.