Refactor Heuristic Evaluation Engine to LLM-Based Analysis #2

mohi-devhub · 2025-12-08T05:49:53Z

Summary

This PR replaces the rule-based HeuristicEvaluationEngine with an LLM-driven evaluation pipeline and updates the UIElement model to match real OmniParser output. The system now evaluates H1–H5 using GPT-4 with structured, measurable criteria.

Key Changes

1. OmniParser Alignment (Commit: `7255ae1`)

Updated UIElement to align with the actual OmniParser format:

Includes type, bbox: [x1, y1, x2, y2], interactivity, content.
Removed previously hallucinated fields (hover_state, confirmation, etc.).
Added computed fields: width, height.
Added text alias for backward compatibility.
Implemented from_dict() supporting both old and new formats.
Added infer_heading_level() for text hierarchy calculation.

2. Measurable Criteria for H4 & H5 (Commit: `8957369`)

Added concrete, structured criteria for LLM evaluation of higher-level heuristics:

H4 – Consistency & Standards (e.g., button dimensions, typography, terminology).
H5 – Error Prevention (e.g., input constraints, validation, confirmation prompts).

3. LLM-Based Evaluation Engine (Commit: `ec31f70`)

The core evaluation logic is now LLM-driven:

New methods: _serialize_elements_for_llm(), _evaluate_with_llm(), and _llm_explain_heuristic().
evaluate_heuristic() is now fully LLM-based, supporting H1-H5.
Removed legacy rule-based evaluation for H1–H3 (H4 legacy kept as deprecated for now).
Updated scoring model fields to track LLM usage:
- llm_explanation
- evaluation_version="2.0.0-llm"
- evaluation_method="llm-based"

4. Other Fixes & Integrations

Fixed an unterminated docstring bug.
Merged upstream changes from main.
Added configurable OPENAI_BASE_URL and improved LLM client initialization.

Impact Summary

Metric	Before	After
Evaluation Method	Rule-based	LLM-based
OmniParser Model	Hallucinated attrs	Real format
Heuristics Supported	H1–H3	H1–H5
LOC Change	—	+355 net

Validation

All Python modules compile cleanly.
UIElement serialization confirmed to match real OmniParser output.
Backward compatibility for element access maintained through property aliases.

latishab · 2025-12-09T10:15:04Z

@mohi-devhub

actually, i realized the mock data (omniparser_client.py) was incorrect. it included an attributes={'level': 'h1'} field that the real omniparser does not actually output.

since the real model only returns bounding boxes and text, we cannot rely on font_size or level attributes. please update the logic to infer hierarchy from the element's height (bounds['height']) instead.

here is the example output of what omniparser does:

icon 0: {'type': 'text', 'bbox': [0.31953126192092896, 0.10987482964992523, 0.41015625, 0.13212795555591583], 'interactivity': False, 'content': 'Type here to search'}
icon 1: {'type': 'text', 'bbox': [0.3101562559604645, 0.19332405924797058, 0.3460937440395355, 0.21279555559158325], 'interactivity': False, 'content': 'Pinned'}
icon 2: {'type': 'text', 'bbox': [0.6585937738418579, 0.2990264296531677, 0.694531261920929, 0.3184979259967804], 'interactivity': False, 'content': 'Settings'}
icon 6: {'type': 'icon', 'bbox': [0.4970785975456238, 0.23429809510707855, 0.5733658671379089, 0.34724709391593933], 'interactivity': True, 'content': 'Microsoft '}
icon 7: {'type': 'icon', 'bbox': [0.5727160573005676, 0.4605804979801178, 0.6388594508171082, 0.5722821354866028], 'interactivity': True, 'content': 'Photoshop Express '}

mohi-devhub · 2025-12-09T10:26:02Z

@latishab
Thanks for the clarification ,makes sense. I’ll remove the mock level/font_size fields and update the logic to infer hierarchy purely from the element’s bbox height. I’ll also fix the mock data to match real Omniparser output. Let me know if anything else looks off.

…ierarchy

latishab · 2025-12-09T17:39:46Z

@mohi-devhub

I tested your code and I had two issues:

currently the model name is hardcoded in app/services/heuristic_engine.py, you should be using the model from configurable (hence settings.OPENAI_MODEL)
the method _llm_explain_heuristic() accesses properties that don't exist in the new UIElement format. e.attributes.keys()should not exist, and e.interactive should be e.interactivity based on OmniParser output.

Please make these changes, and possibly provide your API outputs (you can put example raw outputs of OmniParser for testing so you don't have to rely on mock data anymore).

mohi-devhub · 2025-12-10T05:28:03Z

@latishab Have applied the necessary fixes. Everything should now be aligned and working as expected.

latishab

I have tested the code so far. And it works well, thank you for your efforts @mohi-devhub .

One note is that I think you should remove .gitignore from your commits in this pull request because it has nothing to do with the overall codebase ai-heuristics-ruxai-firebase-adminsdk-fbsvc-*.json. You can put your keys in secrets/.

feat: add H4/H5 criteria and H4 rule-based evaluation

8957369

mohi-devhub changed the title ~~feat: add H4/H5 criteria and H4 rule-based evaluation~~ Add H4/H5 criteria and H4 rule-based evaluation Dec 8, 2025

tqmsh self-requested a review December 8, 2025 13:45

tqmsh approved these changes Dec 8, 2025

View reviewed changes

latishab self-requested a review December 9, 2025 08:19

mohi-devhub added 4 commits December 9, 2025 18:30

refactor: align Omniparser with real output format using bbox-based h…

7255ae1

…ierarchy

refactor: replace rule-based evaluation with LLM-based analysis

ec31f70

fix: correct docstring syntax error

babc04f

Merge branch 'ruxailab:main' into feature/h4-h5-criteria

4a59df8

mohi-devhub changed the title ~~Add H4/H5 criteria and H4 rule-based evaluation~~ Refactor Heuristic Evaluation Engine to LLM-Based Analysis Dec 9, 2025

latishab requested a review from tqmsh December 9, 2025 17:21

fix: use configurable model and real OmniParser fields

c1309a3

latishab approved these changes Dec 10, 2025

View reviewed changes

tqmsh approved these changes Dec 10, 2025

View reviewed changes

tqmsh merged commit c1a28ca into ruxailab:main Dec 10, 2025

mohi-devhub deleted the feature/h4-h5-criteria branch December 10, 2025 16:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor Heuristic Evaluation Engine to LLM-Based Analysis #2

Refactor Heuristic Evaluation Engine to LLM-Based Analysis #2

mohi-devhub commented Dec 8, 2025 •

edited

Loading

Uh oh!

latishab commented Dec 9, 2025

Uh oh!

mohi-devhub commented Dec 9, 2025

Uh oh!

latishab commented Dec 9, 2025 •

edited

Loading

Uh oh!

mohi-devhub commented Dec 10, 2025

Uh oh!

latishab left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Refactor Heuristic Evaluation Engine to LLM-Based Analysis #2

Refactor Heuristic Evaluation Engine to LLM-Based Analysis #2

Conversation

mohi-devhub commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

1. OmniParser Alignment (Commit: 7255ae1)

2. Measurable Criteria for H4 & H5 (Commit: 8957369)

3. LLM-Based Evaluation Engine (Commit: ec31f70)

4. Other Fixes & Integrations

Impact Summary

Validation

Uh oh!

latishab commented Dec 9, 2025

Uh oh!

mohi-devhub commented Dec 9, 2025

Uh oh!

latishab commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mohi-devhub commented Dec 10, 2025

Uh oh!

latishab left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mohi-devhub commented Dec 8, 2025 •

edited

Loading

1. OmniParser Alignment (Commit: `7255ae1`)

2. Measurable Criteria for H4 & H5 (Commit: `8957369`)

3. LLM-Based Evaluation Engine (Commit: `ec31f70`)

latishab commented Dec 9, 2025 •

edited

Loading

latishab left a comment •

edited

Loading