Polybian Labs - Tiered Jurist #017 Results


Benchmarks Applied to All Outputs

1 MMLU

Massive Multitask Language Understanding evaluates knowledge and problem-solving across diverse domains including STEM, humanities, and professional subjects.

Evaluation Criteria:

  • Factual accuracy across multiple domains
  • Reasoning and problem-solving capabilities
  • Domain-specific knowledge depth
  • Error minimization in technical content
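
To make the scoring concrete, here is a minimal sketch of MMLU-style grading: exact-match accuracy over multiple-choice items. The Question structure, the sample items, and the always-"A" stub predictor are illustrative assumptions, not the official harness.

    from dataclasses import dataclass

    @dataclass
    class Question:
        prompt: str
        choices: list[str]  # e.g. ["A) 3", "B) 4"]
        answer: str         # gold choice letter, e.g. "B"

    def mmlu_accuracy(questions: list[Question], predict) -> float:
        # Exact-match accuracy: fraction of items where the predicted
        # letter equals the gold letter.
        correct = sum(1 for q in questions if predict(q) == q.answer)
        return correct / len(questions)

    # Stub predictor that always answers "A" (placeholder for a real model call).
    items = [Question("2 + 2 = ?", ["A) 3", "B) 4"], "B"),
             Question("Capital of France?", ["A) Paris", "B) Rome"], "A")]
    print(mmlu_accuracy(items, lambda q: "A"))  # 0.5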

2 HELM

Holistic Evaluation of Language Models assesses accuracy, robustness, fairness, bias, toxicity, and efficiency in model outputs.

Evaluation Criteria:

  • Output coherence and logical structure
  • Bias and fairness considerations
  • Robustness across diverse inputs
  • Practical utility and implementability
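
As a rough sketch of how multi-dimensional scores like these can be combined, the snippet below takes a weighted mean across dimensions. The dimension names mirror the description above; the equal default weights and the sample numbers are assumptions, not HELM's published methodology.

    DIMENSIONS = ("accuracy", "robustness", "fairness", "bias", "toxicity", "efficiency")

    def helm_composite(scores, weights=None):
        # Weighted mean of per-dimension scores, each assumed to lie in [0, 1].
        weights = weights or {d: 1.0 for d in DIMENSIONS}
        total = sum(weights[d] for d in DIMENSIONS)
        return sum(scores[d] * weights[d] for d in DIMENSIONS) / total

    print(helm_composite({"accuracy": 0.9, "robustness": 0.8, "fairness": 0.85,
                          "bias": 0.7, "toxicity": 0.95, "efficiency": 0.6}))  # 0.8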

3 MT-Bench

Multi-Turn Benchmark evaluates conversational ability, instruction following, and consistency across extended interactions.

Evaluation Criteria:

  • Multi-turn conversational consistency
  • Instruction adherence and comprehension
  • Context retention across interactions
  • User engagement quality
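
A sketch of the multi-turn mechanics: score each turn of a conversation with a judge and average the turn scores. The stub judge and the sample dialogue below are placeholders; in practice MT-Bench uses a strong LLM as the grader.

    def mtbench_score(conversation, judge):
        # conversation: list of (user_prompt, model_reply) turns.
        # Average the judge's 1-10 rating across all turns.
        turn_scores = [judge(prompt, reply) for prompt, reply in conversation]
        return sum(turn_scores) / len(turn_scores)

    # Stub judge that rewards longer replies (placeholder for a real LLM grader).
    def stub_judge(prompt, reply):
        return min(10.0, 1.0 + len(reply) / 40)

    convo = [("Write a haiku about rain.", "Rain taps the window..."),
             ("Now rewrite it as a limerick.", "There once was a cloud full of rain...")]
    print(round(mtbench_score(convo, stub_judge), 1))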

Output Assessments

PS1-QCR1: Claude Sonnet 4 Pro (Research Edition)

Benchmark | Score | Strengths                                    | Weaknesses
MMLU      | 7/10  | Strong factual accuracy in research contexts | Limited domain coverage outside research
HELM      | 6/10  | Clear presentation of research findings      | Lacks strategic context for decision-making
MT-Bench  | 7/10  | Structured research-oriented responses       | Limited conversational depth

PS1-QCR2: Gemini 2.5 Pro (Deep Research Edition)

Benchmark | Score | Strengths                                             | Weaknesses
MMLU      | 8/10  | Exceptional domain coverage in deep research          | Occasional over-specialization in narrow areas
HELM      | 9/10  | Superior strategic analysis for complex topics        | Can be overly technical for non-specialists
MT-Bench  | 8/10  | Strong contextual understanding in research dialogues | Occasional inconsistency in cross-domain topics

PS1-QCR3: Perplexity Pro (Research Edition)

Benchmark | Score | Strengths                                 | Weaknesses
MMLU      | 5/10  | Efficient research sourcing              | Factual inaccuracies in specialized domains
HELM      | 7/10  | Good accessibility for research findings | Inconsistent analysis depth
MT-Bench  | 6/10  | Engaging presentation of research        | Conversational weaknesses in technical depth
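
For a single comparative figure, the three benchmark scores for each output can be averaged; the snippet below reproduces that arithmetic from the tables above (the plain mean is a summary convention for this report, not part of any benchmark suite).

    scores = {
        "PS1-QCR1 (Claude Sonnet 4 Pro)": {"MMLU": 7, "HELM": 6, "MT-Bench": 7},
        "PS1-QCR2 (Gemini 2.5 Pro)":      {"MMLU": 8, "HELM": 9, "MT-Bench": 8},
        "PS1-QCR3 (Perplexity Pro)":      {"MMLU": 5, "HELM": 7, "MT-Bench": 6},
    }
    for output, s in sorted(scores.items(), key=lambda kv: -sum(kv[1].values())):
        print(f"{output}: mean {sum(s.values()) / 3:.2f}/10")
    # PS1-QCR2 leads at 8.33, followed by PS1-QCR1 at 6.67 and PS1-QCR3 at 6.00.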

Analysis Type Alignment

Model to Analysis Type Mapping

Fact-Check Analysis

Claude Sonnet 4 Pro: excels in structured research verification with strong factual accuracy. Ideal for technical validation where research precision is prioritized.

Research Edition Features: Advanced fact verification, academic source integration, research-methodology awareness

Critical Review

Gemini 2.5 Pro: delivers superior deep research analysis. Optimized for complex evaluations requiring contextual understanding and critical insight.

Deep Research Features: Cross-domain knowledge synthesis, source triangulation, strategic insight generation

Accuracy Assessment

Perplexity Pro: performs well in research-focused accuracy evaluations. Effective for presentation-focused assessments where accessibility complements research depth.

Research Edition Features: Source aggregation, citation management, research presentation optimization
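
Operationalized, this mapping amounts to a small routing table; the sketch below is a hypothetical rendering (the ANALYSIS_ROUTES keys and the route helper are ours, with model names mirroring the report).

    ANALYSIS_ROUTES = {
        "fact_check":      "Claude Sonnet 4 Pro (Research Edition)",
        "critical_review": "Gemini 2.5 Pro (Deep Research Edition)",
        "accuracy":        "Perplexity Pro (Research Edition)",
    }

    def route(analysis_type):
        # Look up the recommended model, failing loudly on unknown types.
        if analysis_type not in ANALYSIS_ROUTES:
            raise ValueError(f"Unknown analysis type: {analysis_type!r}")
        return ANALYSIS_ROUTES[analysis_type]

    print(route("critical_review"))  # Gemini 2.5 Pro (Deep Research Edition)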


Performance Insights

Research Accuracy

PS1-QCR1 (Claude Sonnet 4 Pro) shows strong factual precision in research contexts but limited coverage outside academic domains. Excels in technical verification tasks.

Deep Research Analysis

PS1-QCR2 (Gemini 2.5 Pro) demonstrates exceptional strategic depth in complex research domains. Holds the highest HELM score of the three outputs (9/10) despite occasional over-specialization.

Research Presentation

PS1-QCR3 (Perplexity Pro) provides accessible research presentation but shows factual gaps in specialized domains. Its strongest result is on HELM (7/10), reflecting its accessibility.