How to handle input with characters having more than one byte in UTF-8

Hi,

First of all, thank you for this amazing library.

While playing around with it I stumbled upon this issue.

When matching on strings containing characters that [UTF-8 encodes as more than one byte](https://en.wikipedia.org/wiki/UTF-8#Encoding), the reported end offset is larger than I expect.

See for instance this example:

```python
import hyperscan

matches = []


def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)


expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)


text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)

print(matches)
# [5, 6]
```

The highest end offset is `6`, but `len("test®")` is `5`.

Is there any workaround to this? Am I misunderstanding something?
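
My guess is that the offsets count bytes of the encoded input rather than characters of the original `str`, since `®` takes two bytes in UTF-8. Below is a minimal sketch of a conversion under that assumption. The helper `byte_to_char_offset` is a hypothetical name of my own, not part of hyperscan, and it deliberately does not call the library at all:

```python
text = "test®"
data = text.encode("utf-8")  # this is what gets passed to db.scan()


def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Map a byte offset into UTF-8 `data` to a character offset.

    Hypothetical helper, assuming the offset counts bytes. An offset that
    falls in the middle of a multi-byte character is rounded down, because
    errors="ignore" drops the incomplete trailing sequence.
    """
    return len(data[:byte_offset].decode("utf-8", errors="ignore"))


print(len(text), len(data))          # 5 6
print(byte_to_char_offset(data, 6))  # 5
print(byte_to_char_offset(data, 5))  # 4
```

If that assumption holds, the end offset `5` from my example would point between the two bytes of `®`, which would explain why there are two matches. Decoding the prefix on every match is linear in the offset, so for long inputs a precomputed byte-to-character table would probably be a better choice.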

Thank you!
