Benchmark Report

AI Research Performance Benchmarks

Two auditors scored both Sorena Research Copilot and ChatGPT (baseline) against the same requirements across 43 real-world compliance, regulatory research, and document analysis sessions.

January 2026 · Two Auditors · Source Citations · Two Passes

Key Results

43/43
Perfect Sessions

Two auditors scored Sorena at 100% in every session

4,332
Requirements Evaluated

Scored against granular compliance requirements

100%
Factual Accuracy

Sorena made 0 factual errors, versus 183 factual errors in ChatGPT (baseline) responses.

Performance

Benchmark Breakdown

How each tool performed across compliance research sessions

Coverage by Task Type

Scores reflect independent verification against source documentation.

Factual Errors
ChatGPT
183 errors

Incorrect statements presented as fact across all sessions.

Sorena: 0 errors
Requirement Coverage
ChatGPT
25% (avg of 2 passes)

Compliance requirements addressed with accurate information.

Sorena: 100% coverage
Session Details

Results by Research Session

Session-by-session benchmarks for compliance research. Click any row to view the full scenario, score breakdown, and high-level takeaways.

Legend: Scores reflect independent verification against source documentation.
Sorena Research Copilot
ChatGPT (baseline)
Factual errors (ChatGPT)
Incorrect statement presented as fact

Click any row to expand evaluation notes and see the per-auditor breakdown.

For GRC Teams

Why This Matters for Your Organization

Purpose-built AI for compliance research delivers measurable advantages

Complete Coverage

100% coverage across 4,332 requirements, with no surprise gaps left for auditors to find.

Zero Factual Errors

0 factual errors flagged across 43 sessions, reducing the risk of acting on incorrect information.

Audit-Ready Citations

Direct links to exact text passages in legal documents for full traceability.

Specialized Expertise

Purpose-built for regulatory research, not a general-purpose tool stretched thin.

Methodology

How We Evaluated

Transparent two-step scoring process with an independent second review

Evaluation Overview

Period
Jan 2026
Task Categories
6
Total Sessions
43
Requirements Evaluated
4,332
Internet Access
Enabled
Reasoning Effort
High

Scoring Criteria

A requirement was marked correct only if the response:

  • Explicitly addressed the requirement
  • Provided accurate information
  • Cited verifiable sources where applicable
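The all-or-nothing nature of this rule can be sketched as a simple predicate. This is an illustrative sketch only, assuming hypothetical field names; it is not the auditors' actual tooling:

```python
from dataclasses import dataclass

@dataclass
class RequirementResult:
    # All field names here are hypothetical, chosen to mirror the three criteria.
    explicitly_addressed: bool   # criterion 1: requirement explicitly addressed
    information_accurate: bool   # criterion 2: information provided is accurate
    citation_needed: bool        # is a source citation applicable here?
    citation_verifiable: bool = False  # criterion 3, checked only when applicable

def is_correct(r: RequirementResult) -> bool:
    """A requirement counts as correct only if every applicable criterion holds."""
    if not (r.explicitly_addressed and r.information_accurate):
        return False
    # Sources are required only "where applicable".
    return r.citation_verifiable if r.citation_needed else True
```

Note that a response failing any single criterion scores the whole requirement as incorrect; there is no partial credit.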

Independent Dual Review

Each session was scored independently by two auditors. Neither auditor saw the other's evaluation until scoring was complete.

Two independent compliance reviewers

Scores shown are the combined average from both auditors.
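The combining step is a plain arithmetic mean over the two passes. A minimal sketch, with hypothetical example numbers that are not taken from the report:

```python
def combined_coverage(auditor_scores: list[float]) -> float:
    """Average the per-auditor coverage percentages into one reported score."""
    return sum(auditor_scores) / len(auditor_scores)

# Hypothetical example: two auditors score 24% and 26% coverage.
score = combined_coverage([24.0, 26.0])  # -> 25.0
```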

Ready to Experience the Difference?

See how our Research Copilot can transform your compliance research with a personalized demo.