
Evaluation & Optimization with Langfuse

ThreatForest integrates with Langfuse for tracing, SME review, and dataset export. This guide walks through the full evaluation pipeline — from connecting Langfuse to exporting scored datasets for prompt optimization.


Prerequisites

  • ThreatForest installed (pipx install . or pip install .)
  • A Langfuse account with API keys (self-hosted or cloud)
  • strands-agents[otel] installed (included in ThreatForest dependencies)
  • Sample applications or your own project to evaluate

Step 1: Configure Langfuse

Interactive Setup (CLI)

threatforest config langfuse

The wizard prompts for your public key, secret key, and host URL. On successful connection, score definitions are automatically registered with Langfuse.

Direct Setup

threatforest config langfuse \
  --enable \
  --public-key pk-lf-xxxx \
  --secret-key sk-lf-xxxx \
  --host https://cloud.langfuse.com \
  --test

The --test flag verifies the connection and auto-registers score configs.

Console UI Setup

Navigate to the Configure page in the ThreatForest web console. The Langfuse Tracing section lets you:

  • Toggle tracing on/off
  • Enter your public key, secret key, and host
  • Test the connection
  • Save settings

Disable Tracing

threatforest config langfuse --disable

When disabled, ThreatForest runs normally without any tracing overhead.


Step 2: Score Definitions

Score definitions are automatically registered when you configure Langfuse (via CLI, wizard, or --test). You can also register them manually:

threatforest config langfuse --register-scores

ThreatForest defines 16 evaluation dimensions across four capabilities:

Threat Statement Generation (5 dimensions)

| Score Config Name | Description |
| --- | --- |
| threat_overall_quality | Holistic assessment of generated threats |
| threat_relevance_to_context | Match to application context |
| threat_completeness | Coverage of threat categories |
| threat_technical_accuracy | Technical correctness |
| threat_hallucination_score | Absence of fabricated content |

Attack Tree Generation (6 dimensions)

| Score Config Name | Description |
| --- | --- |
| attack_tree_overall_quality | Holistic assessment of the attack tree |
| attack_tree_structural_quality | Depth, branching, organization |
| attack_tree_technical_realism | Feasibility of attack techniques |
| attack_tree_attack_path_logic | Logical progression from access to impact |
| attack_tree_completeness | Coverage of attack vectors and phases |
| attack_tree_actionability | Usefulness for defenders |

TTP Matching (1 dimension)

| Score Config Name | Description |
| --- | --- |
| ttp_mapping_quality | Quality of MITRE ATT&CK technique mapping |

Mitigation Quality (4 dimensions)

| Score Config Name | Description |
| --- | --- |
| mitigation_actionability | Whether the mitigation provides concrete, implementable steps |
| mitigation_specificity | How specific the mitigation is to the identified threat and tech stack |
| mitigation_coverage | Whether mitigations address all identified attack paths |
| mitigation_technical_accuracy | Correctness of recommended controls and configurations |

Scoring Scale

All dimensions (except TTP) use a 5-point categorical scale:

| Category | Value | Meaning |
| --- | --- | --- |
| Excellent | 1.00 | Exceptional quality, no issues |
| Good | 0.75 | Above average, minor issues |
| Acceptable | 0.50 | Meets minimum requirements |
| Poor | 0.25 | Below expectations, significant issues |
| Unacceptable | 0.00 | Fails to meet requirements |

TTP mapping uses a specialized scale replacing "Unacceptable" with "No Mapping".
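To make the scale concrete, the sketch below builds categorical score-config payloads matching the documented 5-point scale. ThreatForest registers these for you via --register-scores; the field names (dataType, categories with label/value pairs) are assumptions based on Langfuse's public score-config API, shown only to illustrate the payload shape:

```python
# Sketch: categorical score-config payloads for the documented 5-point scale.
# Field names are assumptions from the Langfuse public API; ThreatForest's
# --register-scores handles actual registration.

SCALE = [
    ("Excellent", 1.00),
    ("Good", 0.75),
    ("Acceptable", 0.50),
    ("Poor", 0.25),
    ("Unacceptable", 0.00),
]

def score_config(name: str, description: str, ttp: bool = False) -> dict:
    """Build one categorical score-config payload."""
    categories = [
        # TTP mapping replaces "Unacceptable" with "No Mapping" at value 0.0
        {"label": "No Mapping" if ttp and value == 0.0 else label, "value": value}
        for label, value in SCALE
    ]
    return {"name": name, "dataType": "CATEGORICAL",
            "description": description, "categories": categories}

threat_cfg = score_config("threat_overall_quality",
                          "Holistic assessment of generated threats")
ttp_cfg = score_config("ttp_mapping_quality",
                       "Quality of MITRE ATT&CK technique mapping", ttp=True)
```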

To sync with score configs that already exist in Langfuse:

threatforest config langfuse --sync-scores

Step 3: Run ThreatForest with Tracing

With Langfuse enabled, every ThreatForest run produces two types of traces:

OTEL Traces (Automatic)

Full graph execution traces captured via OpenTelemetry, including:

  • Every LLM call with input/output, token counts, and latency
  • Tool invocations (file reads, structural analysis)
  • Agent event loop cycles

All traces from a single run share a session ID (e.g., tf-a4162895b65c) for grouping.

These are the detailed engineering traces for debugging and performance analysis.
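OTLP exporters authenticate to Langfuse with HTTP Basic auth derived from your key pair. ThreatForest wires this up for you; if you ever point a custom OTLP exporter at Langfuse, the header is constructed roughly like this (a sketch; the exact endpoint path comes from Langfuse's OTLP docs, not ThreatForest settings):

```python
import base64

def langfuse_otlp_headers(public_key: str, secret_key: str) -> dict:
    """Basic-auth header for an OTLP exporter pointed at Langfuse."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

headers = langfuse_otlp_headers("pk-lf-xxxx", "sk-lf-xxxx")
```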

Annotation Traces (Automatic)

Clean input/output pairs per subgraph, pushed via the Langfuse SDK for SME review:

| Trace Name | Input | Output | Tags |
| --- | --- | --- | --- |
| scanner | Task description | scanner_context.json | scanner, annotation |
| threat-generation | Scanner context | threats.json | threat, annotation |
| attack-tree-generation | Threats | attack_trees.json | attack-tree, annotation |
| ttp-mapping | Attack trees | ttp_mappings.json | ttp, annotation |
| mitigation-generation | Attack trees + TTP mappings + scanner context | mitigations.json | mitigation, annotation |

These traces are designed for annotation queues — filter by the annotation tag in Langfuse.
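Scripted filtering follows the same rule as the UI: keep only traces carrying the annotation tag. A minimal sketch over trace records as plain dicts (the dict shape is an assumption for illustration; the Langfuse SDK returns richer objects):

```python
def annotation_traces(traces: list[dict]) -> list[dict]:
    """Keep only traces tagged for SME review."""
    return [t for t in traces if "annotation" in t.get("tags", [])]

traces = [
    {"name": "threat-generation", "tags": ["threat", "annotation"]},
    {"name": "internal-llm-call", "tags": []},  # OTEL engineering trace
]
reviewable = annotation_traces(traces)
# reviewable contains only the threat-generation trace
```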

Using Sample Applications

# S3 Multi-Tenant SaaS (S3, Access Points, Object Lambda, KMS)
threatforest run --project-path sample-applications/s3

# Healthcare Analytics (HIPAA, AWS, S3, Lambda, DynamoDB)
threatforest run --project-path sample-applications/hcls-example

# IoT Device Management (MQTT, Edge Computing, OTA)
threatforest run --project-path sample-applications/iot-device-management

# Connected Vehicle Platform (V2X, Telematics, OBD-II)
threatforest run --project-path sample-applications/vehicle-platform

Using Your Own Project

threatforest run --project-path /path/to/your/project

After each run, traces appear in your Langfuse dashboard under the session ID.
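Session IDs follow the tf-<hex> pattern shown earlier (e.g., tf-a4162895b65c), which makes ThreatForest runs easy to pick out when scripting against trace data. A hedged sketch (the 12-hex-digit suffix length is an assumption generalized from that one example):

```python
import re

# Assumed format: "tf-" followed by 12 lowercase hex digits
SESSION_ID_RE = re.compile(r"^tf-[0-9a-f]{12}$")

def is_threatforest_session(session_id: str) -> bool:
    """True for session IDs matching the tf-<hex> pattern."""
    return bool(SESSION_ID_RE.match(session_id))
```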


Step 4: SME Review in Langfuse

Create Annotation Queues

In the Langfuse UI, create queues for each capability:

  1. Go to your Langfuse project → Annotation Queues
  2. Click + New Queue and create:
| Queue Name | Score Configs to Attach |
| --- | --- |
| threatforest-threat-statements | threat_overall_quality, threat_relevance_to_context, threat_completeness, threat_technical_accuracy, threat_hallucination_score |
| threatforest-attack-trees | attack_tree_overall_quality, attack_tree_structural_quality, attack_tree_technical_realism, attack_tree_attack_path_logic, attack_tree_completeness, attack_tree_actionability |
| threatforest-ttp-matching | ttp_mapping_quality |
| threatforest-mitigations | mitigation_actionability, mitigation_specificity, mitigation_coverage, mitigation_technical_accuracy |

Add Traces to Queues

  1. Go to Traces → filter by tag annotation
  2. Select traces and click Add to Annotation Queue
  3. Choose the matching queue based on the trace name

Score Traces

SMEs open the annotation queue and work through each item:

  1. Review the input context and generated output
  2. Assign scores using the categorical scale
  3. Optionally add free-text feedback

Review Workflow

Focus on one queue at a time. Complete all threat statement reviews before moving to attack trees. This improves consistency across scores.


Step 5: Export to Datasets

Export scored traces to Langfuse Datasets for evaluation analysis or prompt optimization:

# Reviewed threat statement traces
threatforest export traces \
  --trace-type threat_statement \
  --status reviewed \
  --dataset-name threat-statements-v1

# Reviewed attack tree traces
threatforest export traces \
  --trace-type attack_tree \
  --status reviewed \
  --dataset-name attack-trees-v1

# Reviewed TTP matching traces
threatforest export traces \
  --trace-type ttp_matching \
  --status reviewed \
  --dataset-name ttp-matching-v1

Export Options Reference

| Option | Description |
| --- | --- |
| --trace-type, -t | Filter: threat_statement, attack_tree, ttp_matching |
| --status, -s | Filter: pending_review, reviewed |
| --start-date | ISO date lower bound (e.g., 2025-01-01) |
| --end-date | ISO date upper bound |
| --ground-truth-only | Only export ground truth candidates |
| --dataset-name, -d | Target Langfuse Dataset name (required) |
| --dataset-description | Description for new datasets |
| --dry-run | Preview without exporting |

Full Baseline Workflow

# 1. Configure Langfuse (auto-registers score configs)
threatforest config langfuse --test

# 2. Run sample applications
for domain in s3 hcls-example iot-device-management vehicle-platform; do
  threatforest run --project-path sample-applications/$domain
done

# 3. (Manual) SME review in Langfuse annotation queues

# 4. Export reviewed traces to datasets
threatforest export traces --trace-type threat_statement --status reviewed -d baseline-threats-v1
threatforest export traces --trace-type attack_tree --status reviewed -d baseline-attack-trees-v1
threatforest export traces --trace-type ttp_matching --status reviewed -d baseline-ttp-v1

Resilient Tracing

ThreatForest's tracing is designed to never block your workflow:

  • If Langfuse is unreachable, tracing silently falls back to no-op mode
  • All Langfuse API errors are caught and logged without interrupting execution
  • When Langfuse is disabled (--disable), there is zero tracing overhead
  • OTEL spans are flushed before the process exits to ensure delivery
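
The fallback behavior described above is a common wrapper pattern: every call into the tracing client is guarded, so a Langfuse outage degrades to a no-op instead of an exception. A simplified sketch of the idea, not ThreatForest's actual implementation:

```python
import logging

logger = logging.getLogger("tracing")

class ResilientTracer:
    """Wrap a tracing client so its failures never interrupt the pipeline."""

    def __init__(self, client):
        self._client = client  # None => tracing disabled, pure no-op

    def trace(self, name: str, **kwargs) -> None:
        if self._client is None:
            return  # disabled: nothing beyond this check
        try:
            self._client.trace(name=name, **kwargs)
        except Exception as exc:  # swallow all tracing errors
            logger.warning("tracing failed (%s); continuing without it", exc)

class FailingClient:
    def trace(self, **kwargs):
        raise ConnectionError("Langfuse unreachable")

ResilientTracer(FailingClient()).trace("threat-generation")  # no exception
ResilientTracer(None).trace("threat-generation")             # no-op
```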