
Add evaluation dataset creator#1279

Open
bjcmit wants to merge 9 commits into microsoft:main from bjcmit:feat/1267

Conversation


@bjcmit bjcmit commented Apr 2, 2026

Description

This pull request adds a comprehensive new prompt, eval-dataset-creator.md, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview process to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to user skill level and agent characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.

Key additions and improvements:

Evaluation Dataset Creation Workflow:

  • Introduces a multi-phase, interview-driven process for collecting agent context, capabilities, evaluation scenarios, and user requirements, ensuring high-quality and relevant dataset generation.
  • Mandates a review phase where sample Q&A pairs are validated with the user before finalizing the dataset.

Dataset and Documentation Artifacts:

  • Defines output structure in data/evaluation/ with separate subfolders for datasets (.json, .csv) and documentation (curation-notes.md, metric-selection.md, tool-recommendations.md).
  • Provides detailed JSON and CSV formats for the evaluation dataset, including metadata and balanced scenario distribution.
  • Supplies markdown templates for curation notes, metric selection, and tool recommendations, ensuring standardized and thorough documentation.

Tooling and Persona Guidance:

  • Recommends evaluation tooling tailored to the user's persona (low-code vs. pro-code) and evaluation needs.

Related Issue(s)

Closes #1267

Type of Change

Select all that apply:

Code & Documentation:

  • New feature (non-breaking change adding functionality)

Infrastructure & Configuration:

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot agent (.github/agents/*.agent.md)

Sample Prompts (for AI Artifact Contributions)

User Request:

# Invoke the agent directly:
@eval-dataset-creator create an evaluation dataset

Execution Flow:

Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:


  1. Structured Interview (Phases 1–4)

Purpose: Gather all necessary context before generating any artifacts.

Phase 1: Agent Context

  • The agent asks six questions about the AI agent’s name, business scenario, KPIs, tasks, risks, and user adoption.
  • Decision Point: Wait for user responses before proceeding.

Phase 2: Agent Capabilities

  • Three questions about grounding sources, external tools/APIs, and response format.
  • Decision Point: Wait for user responses before proceeding.

Phase 3: Evaluation Scenarios

  • Five questions about typical, challenging, negative, and safety scenarios, plus limitations and topics to avoid.
  • Decision Point: Wait for user responses before proceeding.

Phase 4: Persona & Tooling

  • Two questions about development mode (low-code vs. pro-code) and evaluation frequency/type.
  • Decision Point: Wait for user responses before proceeding.

  2. Dataset Generation (Phase 5)
  • After the interview, the agent generates evaluation datasets:
    • JSON Format: Includes metadata, Q&A pairs, category, difficulty, tools expected, source references, and notes.
    • CSV Format: Similar structure, tools listed as semicolon-delimited.
  • Tool Usage: Writes files to data/evaluation/datasets/.
  • Decision Point: Ensures minimum 30 Q&A pairs, balanced distribution across scenario types.
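The Phase 5 constraints (minimum size, declared distribution, non-empty expected responses) could be spot-checked with a short script. This is an illustrative sketch, not part of the PR: the field names follow the JSON example later in this description, and the `check_dataset` helper is hypothetical.

```python
import json

MIN_PAIRS = 30  # minimum dataset size stated in the PR description


def check_dataset(path: str) -> list[str]:
    """Return a list of problems found in a generated evaluation dataset."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    problems = []
    meta = data.get("metadata", {})
    pairs = data.get("evaluation_pairs", [])

    if len(pairs) < MIN_PAIRS:
        problems.append(f"only {len(pairs)} pairs; minimum is {MIN_PAIRS}")

    # The declared distribution should sum to the declared total
    dist = meta.get("distribution", {})
    if sum(dist.values()) != meta.get("total_pairs"):
        problems.append("distribution counts do not sum to total_pairs")

    # No pair may ship with an empty expected response
    for i, pair in enumerate(pairs):
        if not str(pair.get("expected_response", "")).strip():
            problems.append(f"pair {i} has an empty expected_response")

    return problems
```

An empty return value would mean the dataset passes all three checks.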

  3. Dataset Review & Feedback (Phase 6)
  • Presents 5–8 representative Q&A pairs (covering easy, hard, grounding, negative, safety).
  • Asks the user for feedback on each pair:
    • Is the expected response accurate?
    • Should it be more/less detailed?
    • Are elements missing or incorrect?
    • Should the pair be modified, kept, or removed?
  • Decision Point: Refines dataset based on feedback. If major changes are needed, offers to regenerate portions.

  4. Documentation & Finalization (Phase 7)
  • Generates three supporting documents in data/evaluation/docs/:
    • Curation Notes: Business context, scope, data sources, review process, dataset balance, maintenance schedule.
    • Metric Selection: Agent characteristics, selected metrics, definitions, rationale.
    • Tool Recommendations: Persona profile, recommended tool, comparison, getting started, next steps.
  • Tool Usage: Writes files to data/evaluation/docs/.
  • Decision Point: Presents summary of all artifacts for user validation.

Decision Points & Tool Usage Summary

  • Interview: Structured Q&A, waits for user input before proceeding.
  • Dataset Generation: Automated file creation (JSON/CSV), ensures balance and completeness.
  • Review: Interactive feedback loop, offers regeneration if needed.
  • Documentation: Automated file creation for curation, metrics, and tooling.
  • Summary: Presents all artifacts for validation.

Output Artifacts:

data/evaluation/
├── datasets/
│   ├── <agent-name>-eval-dataset.json   # Full evaluation dataset (Q&A pairs + metadata)
│   └── <agent-name>-eval-dataset.csv    # Flat CSV version for Copilot Studio/manual review
└── docs/
    ├── <agent-name>-curation-notes.md        # Human-readable dataset rationale & scope
    ├── <agent-name>-metric-selection.md      # Metrics chosen + priorities + rationale
    └── <agent-name>-tool-recommendations.md  # MCS vs Azure AI Foundry guidance
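The layout above could be scaffolded in a few lines of Python. This is a sketch for illustration only (the `scaffold` helper is hypothetical, not part of the PR):

```python
from pathlib import Path


def scaffold(root: str, agent_name: str) -> list[Path]:
    """Create the data/evaluation/ layout described above and return the file paths."""
    base = Path(root) / "data" / "evaluation"
    files = [
        base / "datasets" / f"{agent_name}-eval-dataset.json",
        base / "datasets" / f"{agent_name}-eval-dataset.csv",
        base / "docs" / f"{agent_name}-curation-notes.md",
        base / "docs" / f"{agent_name}-metric-selection.md",
        base / "docs" / f"{agent_name}-tool-recommendations.md",
    ]
    for f in files:
        f.parent.mkdir(parents=True, exist_ok=True)  # create datasets/ and docs/
        f.touch()
    return files
```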

data/evaluation/datasets/<agent-name>-eval-dataset.json

{
  "metadata": {
    "schema_version": "1",
    "agent_name": "example-agent",
    "created_date": "2026-04-02",
    "version": "1.0.0",
    "total_pairs": 30,
    "distribution": {
      "easy": 6,
      "grounding_source_checks": 3,
      "hard": 12,
      "negative": 6,
      "safety": 3
    },
    "persona": "pro-code",
    "evaluation_mode": ["manual", "batch"],
    "recommended_tool": "azure-ai-foundry"
  },
  "evaluation_pairs": [
    {
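The entries of evaluation_pairs are truncated in this excerpt. Based on the fields listed under Dataset Generation (category, difficulty, tools expected, source references, notes), a single pair might look like the following; every field name and value here is hypothetical:

```json
{
  "id": "pair-001",
  "question": "How do I submit a travel expense report?",
  "expected_response": "Submit the report through the expense portal within 30 days, attaching itemized receipts.",
  "category": "easy",
  "difficulty": "easy",
  "tools_expected": ["policy_search"],
  "source_references": ["travel-policy.md"],
  "notes": "Typical happy-path question."
}
```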

data/evaluation/docs/<agent-name>-curation-notes.md

# Curation Notes: Example Agent

## Business Context
Agent answers employee questions about expense and travel policy.

## Agent Scope
### In Scope
- Policy interpretation
- Step-by-step guidance
- Source citation

### Out of Scope
- Approvals
- Financial decisions

data/evaluation/docs/<agent-name>-metric-selection.md

# Metric Selection: Example Agent

## Selected Core Metrics
- Intent Resolution (High)
- Task Adherence (High)
- Groundedness (High)
- Response Completeness (Medium)

## Tool-Based Metrics
- Tool Call Accuracy (N/A)
- Latency (Medium)
- Token Cost (Medium)

data/evaluation/docs/<agent-name>-tool-recommendations.md

# Tool Recommendations: Example Agent

## Persona Profile
- Skill Level: Pro-Code Developer
- Evaluation Mode: Batch

## Recommended Tool
Azure AI Foundry

Selection Rationale:
Supports batch evaluation, groundedness metrics, and tool-call analysis.

Success Indicators:

  • All output artifacts exist and are non-empty
  • Datasets are formatted correctly, contain at least 30 pairs, have no empty expected_response fields, and the JSON and CSV versions contain the same pairs
  • Curation notes reflect business context and scope accurately
  • Metric priorities make sense for KPIs
  • Recommended tool matches the stated persona
  • Reality Check: Dataset imports into either Copilot Studio or Azure AI Foundry
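The "JSON and CSV agree" indicator could be automated with a small consistency check. A sketch under stated assumptions: the `pairs_match` helper is hypothetical, and the expected_response column name is assumed to mirror the JSON field.

```python
import csv
import json


def pairs_match(json_path: str, csv_path: str) -> bool:
    """Check that the JSON and CSV exports agree on pair count and that
    no expected_response field is empty in either file."""
    with open(json_path, encoding="utf-8") as f:
        json_pairs = json.load(f)["evaluation_pairs"]
    with open(csv_path, encoding="utf-8", newline="") as f:
        csv_rows = list(csv.DictReader(f))

    if len(json_pairs) != len(csv_rows):
        return False
    # Column name "expected_response" is an assumption mirroring the JSON field
    return all(str(r.get("expected_response") or "").strip() for r in csv_rows) and \
        all(str(p.get("expected_response") or "").strip() for p in json_pairs)
```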

Testing

  • Ran /prompt-analyze 3 times with all findings addressed
  • Tested agent against an Out-of-Office (OOO) Rescheduler feature
  • All validation commands pass:
    • npm run lint:all
    • npm run lint:md-links
    • npm run validate:copyright ✅ (148/148 files, 100%)
    • npm run spell-check ✅ (281 files, 0 issues)
    • npm run plugin:generate ✅ (14 plugins, 0 errors)
    • npm run plugin:validate ✅ (0 errors)
    • npm run lint:collections-metadata ✅ (0 errors)

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

@bjcmit bjcmit requested a review from a team as a code owner April 2, 2026 17:20
@bjcmit bjcmit self-assigned this Apr 2, 2026

@WilliamBerryiii WilliamBerryiii left a comment


Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.

After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.

Comment on lines +36 to +43
<!-- <interview-phase-1> -->
1. What is the name of the AI agent you are evaluating? If it does not have a name yet, give it one.
2. What specific business problem or scenario does this agent address?
3. What are the business KPIs associated with this agent (for example, increase revenue, decrease costs, transform business process)?
4. What tasks is this agent designed to perform? What is explicitly out of scope?
5. What are key risks (Responsible AI Framework) in implementing this agent (for example, PII vulnerabilities, negative impact from model inaccuracy)?
6. Who are the primary users of this agent? How likely is this agent to be adopted by primary users? What are barriers to adoption?
<!-- </interview-phase-1> -->

The XML comment boundaries (<!-- <interview-phase-1> --> ... <!-- </interview-phase-1> -->) work as section markers, but the pattern used by other agents in this repo is to express the workflow as an enumerated Required Protocol that spells out each rule or constraint as a numbered item. The current Required Protocol section at the bottom of this file has four items, which is a good start.

Consider moving more of the behavioral expectations from the XML-bounded sections into the protocol list or into the phase headings themselves. For examples of how other agents structure this, see:

  • .github/agents/hve-core/subagents/phase-implementor.agent.md — Required Protocol with numbered invariants that are referenced from the Required Steps.
  • .github/agents/hve-core/subagents/prompt-evaluator.agent.md — Required Protocol for evaluation-specific constraints paired with Required Steps.

This would make the constraints directly visible and enumerable rather than embedded in template comment tags.

Author


The workflow is already expressed as an enumerated Required Protocol. It also has XML comment boundaries. I can remove the XML comment boundaries, but it is unclear how to move more of the behavioral expectations into the protocol list or into the phase headings themselves.


codecov-commenter commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.71%. Comparing base (84ddd5d) to head (e65a176).

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1279      +/-   ##
==========================================
- Coverage   87.72%   87.71%   -0.02%     
==========================================
  Files          61       61              
  Lines        9320     9320              
==========================================
- Hits         8176     8175       -1     
- Misses       1144     1145       +1     
Flag: pester | Coverage: 85.31% <ø> (-0.02% ⬇️)

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes



Development

Successfully merging this pull request may close these issues.

feat(skill): add evaluation dataset creator skill
