AI Research Analyst (LLM Evaluation & Benchmarking)

MillionLogics · Posted 2026-04-27

Company Description:

MillionLogics, a trusted Oracle Partner, is a global IT solutions provider with offices in London, UK, and a development hub in Hyderabad, India. The company specializes in delivering smart, scalable, and future-ready IT solutions that empower businesses to evolve and lead. With expertise in Data & AI, cloud migrations, IT consulting, and enterprise application optimization, MillionLogics offers tailored services with a focus on client success. Supported by a team of over 50 AI and Oracle experts, MillionLogics combines cutting-edge technology with a commitment to delivering impactful outcomes. Learn more about our services on our website.

Role Description:

We are seeking a highly analytical and computationally proficient individual with a strong research background to join our team. You will contribute by either crafting challenging, insightful problems in your research domain or devising elegant computational solutions to them.

Responsibilities:

- Build multi-agent benchmark tasks that require reading, analyzing, and synthesizing large document collections
- Curate real-world research corpora (academic papers, case studies, technical reports) and design questions that require comprehensive analysis
- Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
- Design LLM judge prompts that evaluate agent output field-by-field against the oracle
- Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)

Required Qualifications:

- 5+ years of research experience (academic or industry) in any scientific domain
- Strong reading comprehension and the ability to extract structured information from unstructured text
- Experience with JSON and data structures: designing schemas, validating output formats
- Python scripting ability (for judge scripts and data processing)
- Experience with AI coding benchmarks (SWE-bench, Terminal-bench)
- Comfortable with Docker: writing Dockerfiles, building images, debugging container issues
- Attention to detail: building oracles requires exact values, not approximations

Strong plus:

- Experience with systematic reviews, meta-analyses, or large-scale literature surveys
- Familiarity with medical, legal, or scientific document analysis
- Experience with NLP or information extraction tasks
- Knowledge of LLM evaluation and benchmarking (MMLU, GPQA, SimpleQA)
- Experience curating datasets for AI evaluation

Example of what you'll produce:

A task with 1,500 medical case records (500 cardiac, 500 vascular, 500 systemic). The agent must read all cases, identify the relevant ones, extract evidence, and produce a cross-domain diagnosis. The oracle requires exact first/last case IDs per file (proving the agent read each file start to end), verbatim excerpts from specific cases (proving it read individual records), and a cross-domain evidence matrix. The decomposition uses 15 chunk-reader sub-agents, 3 domain synthesizers, and 1 final synthesizer. The oracle answer scores 1.0; a single agent scores 0.15, while the multi-agent setup scores 0.80.

Offer Details:

- Pay: $1,500 per month (net/take-home)
- Mode of work: fully remote
- Commitment: 8 hours per day, with a 4-hour overlap with PST
- Employment type: contractor position (note: this role does not include medical or paid leave)
- Contract duration: 6 months; expected start date is next week

How to Apply:

Please send your updated CV to CV@MILLIONLOGICS.COM with job ID 75046 in the email subject line.
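To make the oracle-and-judge pattern above concrete, here is a minimal, hypothetical sketch in Python: a JSON-style ground-truth oracle holding exact, verifiable values, and a checker that scores an agent's output field-by-field against it. All domain and field names (`first_case_id`, `verbatim_excerpt`, the excerpt strings, and so on) are invented for illustration and are not taken from the posting.

```python
# Hypothetical ground-truth oracle: exact, verifiable values that prove
# the agent actually read the corpus rather than skimming or guessing.
ORACLE = {
    "cardiac": {
        "first_case_id": "C-0001",   # proves the file was read from the start
        "last_case_id": "C-0500",    # ...and all the way to the end
        "verbatim_excerpt": "ejection fraction of 32%",
    },
    "vascular": {
        "first_case_id": "V-0001",
        "last_case_id": "V-0500",
        "verbatim_excerpt": "bilateral carotid stenosis",
    },
}

def score(agent_output: dict, oracle: dict = ORACLE) -> float:
    """Score agent output field-by-field: one point per exact match."""
    total = correct = 0
    for domain, fields in oracle.items():
        for field, expected in fields.items():
            total += 1
            # Exact string equality only; approximations score zero.
            if agent_output.get(domain, {}).get(field) == expected:
                correct += 1
    return correct / total if total else 0.0
```

An output identical to the oracle scores 1.0; an agent that read only the cardiac file would match 3 of the 6 fields and score 0.5. In practice the exact-match checks would be paired with an LLM judge prompt for free-text fields such as the cross-domain evidence matrix.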
