Metrics
HumaLab provides a comprehensive metrics system for tracking and analyzing your validation experiments.
Metric Types
Standard Metrics
Standard metrics track time-series data across episodes or steps.
import humalab as hl
from humalab.constants import GraphType
# Create a metric
metric = hl.metrics.Metrics(
    graph_type=GraphType.LINE
)
# Add to run
run.add_metric("score", metric)
# Log data points
run.log({"score": 0.95}, x={"score": 0}) # x is optional
run.log({"score": 0.97}, x={"score": 1})
Graph Types
Specify how metrics should be visualized:
from humalab.constants import GraphType
# Line graph (default for time-series)
GraphType.LINE
# Histogram
GraphType.HISTOGRAM
# Bar chart
GraphType.BAR
# Scatter plot (for 2D data)
GraphType.SCATTER
# Gaussian distribution plot
GraphType.GAUSSIAN
# 3D visualization
GraphType.THREE_D_MAP
When to Use Each Graph Type
LINE - Best for continuous time-series data:
# Tracking metrics over episodes or training steps
metric = hl.metrics.Metrics(graph_type=GraphType.LINE)
run.add_metric("loss", metric)
for step in range(1000):
    run.log({"loss": compute_loss()})
BAR - Best for categorical or discrete comparisons:
# Comparing performance across different categories
metric = hl.metrics.Metrics(graph_type=GraphType.BAR)
run.add_metric("category_scores", metric)
# Log scores for different categories
categories = ["easy", "medium", "hard"]
for i, category in enumerate(categories):
    score = evaluate_category(category)
    run.log({"category_scores": score}, x={"category_scores": i})
HISTOGRAM - Best for visualizing distributions:
# Distribution of episode rewards
metric = hl.metrics.Metrics(graph_type=GraphType.HISTOGRAM)
run.add_metric("reward_distribution", metric)
for episode in episodes:
    run.log({"reward_distribution": episode.reward})
SCATTER - Best for 2D relationships (requires 2-element data):
# Plotting accuracy vs learning rate
metric = hl.metrics.Metrics(graph_type=GraphType.SCATTER)
run.add_metric("accuracy_vs_lr", metric)
run.log({"accuracy_vs_lr": [learning_rate, accuracy]})
GAUSSIAN - Best for Gaussian distribution plots:
# Gaussian-distributed parameter tracking
metric = hl.metrics.Metrics(graph_type=GraphType.GAUSSIAN)
run.add_metric("noise_samples", metric)
THREE_D_MAP - Best for 3D visualizations (requires 3-element data):
# 3D position tracking
metric = hl.metrics.Metrics(graph_type=GraphType.THREE_D_MAP)
run.add_metric("robot_position", metric)
run.log({"robot_position": [x, y, z]})
BAR vs HISTOGRAM: When to Use Which?
Use BAR for:
- Comparing values across named categories
- Discrete data points with labels
- Fixed, known set of options
Use HISTOGRAM for:
- Distribution of continuous values
- Many data points without specific labels
- Understanding value frequency
# BAR: Comparing performance across difficulty levels
bar_metric = hl.metrics.Metrics(graph_type=GraphType.BAR)
run.add_metric("difficulty_performance", bar_metric)
difficulties = ["easy", "medium", "hard"]
for i, level in enumerate(difficulties):
    avg_score = evaluate_difficulty(level)
    run.log({"difficulty_performance": avg_score}, x={"difficulty_performance": i})
# HISTOGRAM: Distribution of all episode scores
hist_metric = hl.metrics.Metrics(graph_type=GraphType.HISTOGRAM)
run.add_metric("score_distribution", hist_metric)
for episode in episodes:
    run.log({"score_distribution": episode.score})
Logging Metrics
Episode-Level Logging
Log metrics specific to an episode:
with episode:
    # Log a dictionary of metrics
    episode.log({
        "reward": 100.0,
        "steps": 150,
        "success": True
    })
Run-Level Logging
Log metrics across all episodes:
# Create metric
metric = hl.metrics.Metrics(
    graph_type=GraphType.LINE
)
# Add to run
run.add_metric("cumulative_reward", metric)
# Log across episodes
for i in range(100):
    with run.create_episode() as episode:
        result = validate(episode)
        run.log({"cumulative_reward": result.reward})
Time-Series Data
Track metrics over time or steps:
# Create metric
metric = hl.metrics.Metrics()
run.add_metric("training_loss", metric)
# Log with explicit x-axis values
for step in range(1000):
    loss = train_step()
    run.log(
        {"training_loss": loss},
        x={"training_loss": step}
    )
# Or let the SDK auto-increment steps (starts from 1)
for _ in range(1000):
    loss = train_step()
    run.log({"training_loss": loss})  # x-axis will be 1, 2, 3, ...
Replace vs Append
Control whether to replace the last value or append a new one:
# Append new value (default)
run.log({"score": 0.95})
# Replace last value
run.log({"score": 0.97}, replace=True)
Scenario Statistics
Scenario statistics are automatically tracked when you use distributions in your scenarios. They provide insights into how parameters are being sampled.
# Define scenario with distributions
scenario.init(scenario={
    "gravity": "${uniform(-9.8, -8.8)}",
    "friction": "${gaussian(0.5, 0.1)}"
})
# Statistics are automatically collected
# Each parameter gets its own ScenarioStats metric that tracks:
# - Distribution type
# - Sampled values across episodes
# - Episode status correlation
Scenario statistics are automatically uploaded when you finish the run.
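A sketch of the full flow, using only the run and episode APIs shown elsewhere on this page (validate is a placeholder):
run = hl.Run(scenario=scenario, project="test")
with run:
    for _ in range(50):
        with run.create_episode() as episode:
            # Each episode samples fresh gravity and friction values;
            # the per-parameter statistics are uploaded when the run finishes
            result = validate(episode)  # validate() stands in for your rollout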
Summary Metrics
Summary metrics aggregate data into a single value using various aggregation methods:
from humalab.metrics.summary import Summary
# Create a summary metric with aggregation type
# Supported types: "min", "max", "mean", "last", "first", "none"
summary_max = Summary(summary="max")
summary_mean = Summary(summary="mean")
summary_last = Summary(summary="last")
# Add to run
run.add_metric("max_reward", summary_max)
run.add_metric("avg_success_rate", summary_mean)
run.add_metric("final_score", summary_last)
# Log values - summary will aggregate them
run.log({"max_reward": 95.5})
run.log({"max_reward": 98.2}) # Will keep the max
run.log({"avg_success_rate": 0.85}) # Will compute mean of all logged values
Code Artifacts
Track code versions associated with your runs:
# Read code file
with open("my_agent.py", "r") as f:
    agent_code = f.read()
# Log as artifact
run.log_code(
    key="agent_implementation",
    code_content=agent_code
)
Scenarios are automatically logged as code artifacts:
# Scenario YAML is automatically uploaded
# Access it via run.scenario.yaml
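For example, the uploaded YAML can be inspected directly on the run object:
# Print the scenario YAML artifact attached to this run
print(run.scenario.yaml)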
Metric Finalization
Metrics are finalized and uploaded when you finish a run:
# All metrics are automatically finalized
run.finish()
Manual finalization (usually not needed):
# Get finalized metric data
metric_data = metric.finalize()
print(metric_data) # {"values": [...], "x_values": [...]}
Complete Example
import humalab as hl
from humalab.constants import GraphType
# Initialize
hl.init(api_key="your_api_key")
# Create scenario
scenario = hl.scenarios.Scenario()
scenario.init(scenario={
    "difficulty": "${uniform(0, 1)}"
})
# Create run
run = hl.Run(scenario=scenario, project="test")
# Add run-level metrics
success_rate = hl.metrics.Metrics(
    graph_type=GraphType.LINE
)
run.add_metric("success_rate", success_rate)
avg_reward = hl.metrics.Metrics(
    graph_type=GraphType.LINE
)
run.add_metric("avg_reward", avg_reward)
# Execute episodes
successes = 0
total_reward = 0
with run:
    for i in range(100):
        with run.create_episode() as episode:
            # Run validation
            result = validate(episode)
            # Track episode metrics
            episode.log({
                "reward": result.reward,
                "steps": result.steps,
                "success": result.success
            })
            # Update run metrics
            if result.success:
                successes += 1
            total_reward += result.reward
            # Log aggregated metrics
            run.log({
                "success_rate": successes / (i + 1),
                "avg_reward": total_reward / (i + 1)
            })
# Cleanup
hl.finish()
Best Practices
- Choose Appropriate Graph Types: Match the visualization to your data
  - Use LINE for time-series and continuous data
  - Use BAR for categorical comparisons and discrete data
  - Use HISTOGRAM for value distributions
  - Use SCATTER for 2D relationships (requires 2-element data)
  - Use GAUSSIAN for Gaussian distribution plots
  - Use THREE_D_MAP for 3D visualizations (requires 3-element data)
- Use Meaningful Metric Names: Make it clear what you're tracking
  run.add_metric("episode_success_rate", metric)
  run.add_metric("avg_episode_length", metric)
- Log Consistently: Maintain consistent logging patterns
  # Good: log every episode
  for episode in episodes:
      run.log({"metric": value})
  # Avoid: sparse, inconsistent logging
- Avoid Reserved Names: Don't use reserved metric names
  # These will raise ValueError
  # "scenario", "config", "episode_vals"
- Track What Matters: Focus on metrics that help answer your research questions
  - Success rates
  - Performance metrics
  - Efficiency measures
  - Error rates
- Use Episode vs Run Metrics Appropriately:
  - Episode metrics: Data specific to one execution
  - Run metrics: Aggregated or cumulative data
- Include Units in Names (when appropriate):
  run.add_metric("training_time_seconds", metric)
  run.add_metric("memory_usage_mb", metric)
- Understand Auto-Incrementing Steps: When you don't provide explicit x-values, the SDK automatically increments the step counter starting from 1:
  # First call: x=1, second call: x=2, etc.
  run.log({"metric": value})