@foreverallama foreverallama commented Aug 20, 2025

Spinoff from #23 to read the undocumented datatype mxOPAQUE_CLASS. Part of a series of changes to read and write these types across formats v5 and v7.3.

Some context:

  • Subsystem data is written as a uint8 array. However, it appears to be another MAT-file embedded in the array, which must be converted and read separately. It needs to be parsed before reading any variables in the file.
  • mxOPAQUE_CLASS variables are written with the following headers: Flags, Variable Name, Type Name, Class Name and Metadata
  • MATLAB uses the same subsystem format for both v7 and v7.3 files, so starting with v7 is good enough.

Edit:

Added a new file "MAT_subsys.jl" which contains methods for caching, parsing, and retrieving subsystem data to be assigned to an object. With this it should successfully load classdef objects. Additional context regarding how subsystem data is organized below:

  • MCOS subsystem data is a cell array tagged to a class called "FileWrapper__"
  • The first cell is a metadata array. It contains 9 blocks of metadata. Most of these blocks are to be interpreted as uint32 integers even though they are written as uint8
    -- Block 1 is a version indicator and some offset values
    -- Block 2 is a list of class and property names as uint8 integers (null terminated)
    -- Block 3 is a list of class IDs
    -- Blocks 4 and 6 contain some metadata about linking property names to property values
    -- Block 5 is a list of object ID metadata
    -- Block 7 is a list of dynamic properties attached to the object
    -- Blocks 8 and 9 are unknown
  • Cell 2 is empty (probably reserved?)
  • Cell 3:end-3 are property values (depending on subsystem version it could be up to end-2)
  • The last 2 or 3 cells are some kind of shared class templates. Only the last cell is known - it contains default property values
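
As a rough illustration of the metadata decoding described above, here is a minimal sketch that reinterprets a uint8 block as uint32 integers and splits a block of null-terminated names. It assumes a little-endian byte layout; the helper names `read_u32s` and `read_names` and the sample bytes are hypothetical and are not taken from MAT_subsys.jl.

```julia
# Hypothetical helpers; in a real file the block boundaries come from the
# offsets stored in Block 1 of the metadata array.

# Interpret a run of uint8 values as little-endian uint32 integers.
function read_u32s(bytes::AbstractVector{UInt8})
    @assert length(bytes) % 4 == 0 "block length must be a multiple of 4"
    return ltoh.(reinterpret(UInt32, copy(bytes)))
end

# Split a block of null-terminated names into Strings.
function read_names(bytes::AbstractVector{UInt8})
    return String.(split(String(copy(bytes)), '\0'; keepempty=false))
end

# Made-up sample data: two names and two uint32 values.
name_block = UInt8[codeunits("MyClass")..., 0x00, codeunits("prop")..., 0x00]
id_block   = UInt8[0x04, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00]

names = read_names(name_block)
ids   = read_u32s(id_block)
```

The `copy` calls matter because `String(::Vector{UInt8})` takes ownership of its argument and would otherwise empty the caller's buffer.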

@foreverallama
Author

foreverallama commented Aug 21, 2025

With these changes, full support is added for loading classdef objects in MAT-files in both v5 and HDF5 formats. Classdef objects are returned as a Matrix{Dict{String, Any}}. Each Dict maps property names to values, with an additional key __class__ containing the class name as a String.
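
To make the return shape concrete, here is a small sketch of what a caller might see and how the `__class__` key could be used. The property names and values are made up for illustration, and `is_class_object` is a hypothetical helper, not part of MAT.jl.

```julia
# Hypothetical example of a loaded 1x1 classdef object array.
obj = Dict{String,Any}("__class__" => "datetime", "data" => [1.7e12], "tz" => "")
objects = Matrix{Dict{String,Any}}(undef, 1, 1)
objects[1, 1] = obj

# Helper (not part of MAT.jl) that tells class objects apart from plain struct Dicts.
is_class_object(d::AbstractDict) = haskey(d, "__class__")

# Separate the class name from the actual property map.
classname = objects[1, 1]["__class__"]
props = filter(p -> p.first != "__class__", objects[1, 1])
```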

The changes support different MAT-file and subsystem versions. It also supports loading all types of MCOS classdef objects (which is most of them), including handle class objects. Some other types I've seen are java and handle (for COM objects) which I don't know how to decode yet, but these are quite rare anyways and probably extremely specific to MATLAB.

Some notes:

  1. I didn't really add a separate test because test/v7.3/struct_table_datetime.mat already seems to contain several objects like datetime, string, categorical, and table. I just updated the test there instead.

  2. I just copied the copyright notice template from a different file, but I'm not too sure about adding a copyright notice for MAT_subsys.jl since the code is derived from reverse engineering the file format. Maybe someone else can comment on this?

  3. For most of the classes like datetime or string, you will still need to decode the property map into usable information. It would be good to have some utility functions to do that. I've already documented most of it, I'll get to it some other time though (or maybe someone else may take it up)

Edit:
Squashed and consolidated changes for readability. Adds support for loading mxOPAQUE_CLASS objects from both v7 and v7.3 formats. For a review, the main part of the code would be MAT_subsys.load_subsys! and MAT_subsys.load_mcos_object. I've also highlighted some parts I'm not sure about with FIXME or TODO

* MAT_subsys.jl: New file MAT_subsys with methods to set, parse and retrieve subsystem data
* MAT_v5.jl: New method "read_opaque" to handle mxOPAQUE_CLASS
* MAT_v5.jl: New method "read_subsystem" to handle subsystem data
* MAT.jl (matread): Update to clear subsystem and object cache after load

Support for loading mxOPAQUE_CLASS objects in v7.3 HDF5 format

* MAT_HDF5.jl (matopen): New argument Endian indicator, Reads and parses subsystem on load
* MAT_HDF5.jl (close): Update to write endian header based on system endianness
* MAT_HDF5.jl (m_read::HDF5.Dataset): Update to handle MATLAB_object_decode (mxOPAQUE_CLASS) types
* MAT_HDF5.jl (m_read::HDF5.Group): Update to read subsystem data and function_handles
* MAT.jl (matopen): Update function calls

Updated test for struct_table_datetime.mat to ensure accurate deserialization (including nested properties) in both v7 and v7.3 formats

* test/read.jl: Update tests for "function_handles.mat" and "struct_table_datetime.mat"
@matthijscox
Contributor

Hi there, I like your PR and I hope you still want to work on it.

I have my own PR-207 and I will use Array{Dict{String,Any}} to write struct arrays, and maybe in the future we can also read struct arrays like that. I was wondering if there will be a conflict with your PR, since you will read (not write?) MATLAB classes as Array{Dict{String,Any}}.

Is there a good way to separate normal struct arrays from MATLAB classes in the MAT.jl interface? Would the distinction come from the existence of the "__class__" key inside each Dict in the Array{Dict{String,Any}}?

(I also hope an active contributor/maintainer pops up for MAT.jl to review our work. I am trying to get in contact with them via email)

@foreverallama
Author

Hey!

It does look like there will be a conflict in our PRs. I aimed to provide read-write support in this module, but I only included read support for MATLAB classes in this PR to make it reasonable to review.

Coming to the typing, my opinion is that the same types should be used for both read and write, and the typing should be consistent across the different MAT-file versions. I worked on a Pythonic version of this module, and there I used a wrapper class MatioOpaque to identify MATLAB classes instead. Something like that could be done here as well, or as you mentioned the __class__ key (or some other special key) could be used as an identifier.

Ultimately, it's a design choice that would benefit from a discussion surrounding user requirements. It would be good if a maintainer (@ViralBShah ?) could pitch in their thoughts. As I understand there's not much activity here, but I would be happy to help out in any way I can!

@matthijscox
Contributor

matthijscox commented Nov 11, 2025

I like the idea of a wrapper type, at least for read/write consistency. Having obscure key names like __class__ inside a Dict seems a little dodgy.

I wonder if it would be smart to have another wrapper type for the struct arrays? Some kind of mutable named tuple like type would be great, like: Array{MatlabStruct{(:field1, :field2)}}. And/or maybe actually use StructArrays.jl

Looking at matio for inspiration, I found a struct array test, but I cannot quickly infer what Python type you are reading/writing? How do you differentiate between cell arrays of structs and struct arrays? Right now MAT.jl does not have read/write consistency for struct arrays (they get re-written as a struct with cell arrays per field name).

@ViralBShah
Contributor

@matthijscox is a maintainer and we can add others too.

@foreverallama
Author

> How do you differentiate between cell arrays of structs and struct arrays?

So my main point was to keep it consistent with scipy, which uses numpy arrays to represent MATLAB types. Structs and cell arrays are both numpy arrays. The only difference is that structs use numpy record arrays with each field having an object dtype, whereas cell arrays have dtype=object.

For example,

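import numpy as np

struct_arr_dtype = [('field1', object), ('field2', object)]
cell_arr_dtype = object

# Example input; in practice this is whatever array was loaded from the file
arr = np.empty((1, 1), dtype=struct_arr_dtype)
if arr.dtype.hasobject:
    if arr.dtype.names:
        kind = "struct array"  # record array with named object fields
    else:
        kind = "cell array"  # plain object array without field names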

The only place this doesn't work is for empty structs which don't have field names. In this case, I've used a wrapper class EmptyMatStruct that contains the shape of the empty struct.

> Having obscure key names like __class__ inside a Dict seems a little dodgy

I do agree with this. In fact, I went through several iterations in my Python module before arriving at the current data representations. The main problem I had was ensuring separate types for simple read-write operations whilst keeping it user-friendly and scalable for the newer MATLAB classdef-based types. This led me to use a bunch of wrapper classes that strike a balance between allowing numpy array operations and keeping the types distinguishable.

Something of that sort can be done here if it aligns with the scope of this package, and I would be happy to help. I'm not familiar with Julia typing and will need some time, but in the meantime you can read the type conversions I used in matio, relevant code here. I'll also take a look at MAT.jl and see how we can unify the schema. Once a data representation is fixed it should be pretty straightforward from there.


object_arr = Array{Dict{String,Any}}(undef, convert(Vector{Int}, dims)...)

for i = 1:length(object_arr)
Contributor

minor issue: more Julian syntax would be for (i, oid) in enumerate(object_ids)

end

const subsys_cache = Ref{Union{Nothing,Subsys}}(nothing)
const object_cache = Ref{Union{Nothing, Dict{UInt32, Dict{String,Any}}}}(nothing)
Contributor

The code is pretty complicated, so I cannot fully judge the decisions here, but I am curious why you need this stateful cache. Is it not possible to rewrite the code to just pass the subsys_cache and object_cache through each function?

I think we now run the risk that someone tries to read multiple .mat files in multiple threads and this will break down.

Author

It's been a while so I'm not sure exactly, but I think it had to do with the large number of function signature changes needed to incorporate both caches. There are a lot of recursive calls, so the caches would have to be passed to all functions even if they don't actually use them.

I would prefer a better alternative to handle thread safety, but your suggestion works as well.
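
One possible shape for the "pass the caches through" option discussed in this thread is a small per-file context object, so that only one extra argument travels through the recursive calls. This is only a sketch under that assumption; `ReadContext`, `load_object!`, and `read_file` are hypothetical names and do not reflect the PR's actual code.

```julia
# Hypothetical per-file context; field and function names are illustrative only.
struct ReadContext
    object_cache::Dict{UInt32,Dict{String,Any}}
end
ReadContext() = ReadContext(Dict{UInt32,Dict{String,Any}}())

# Recursive readers take the context as an argument instead of touching a global Ref.
function load_object!(ctx::ReadContext, oid::UInt32)
    return get!(ctx.object_cache, oid) do
        Dict{String,Any}("__class__" => "Example", "id" => oid)
    end
end

# Each file read creates its own context, so concurrent reads share no state.
function read_file(oids::Vector{UInt32})
    ctx = ReadContext()
    return [load_object!(ctx, oid) for oid in oids]
end

# Simulate several files being read from different tasks at once.
results = fetch.([Threads.@spawn(read_file(UInt32[1, 2, 1])) for _ in 1:4])
```

Because nothing is stored globally, there is also no cache to clear after a load finishes.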

@matthijscox
Contributor

I've begun reading the code. The classes seem to be read mostly here as dicts in MAT_subsys.jl R285-R288.

I suppose we could define a MatlabOpaque (or MatlabClass or MatlabObject) type that mirrors a dict:

struct MatlabOpaque{T}
    classname::Symbol
    properties::Vector{String} # or Vector{Symbol}
    values::Vector{T}
end

Optionally I wonder if we want to be able to expose the class "type" to the Julia type system. (These Val types do come with a small performance penalty I remember)

struct MatlabOpaque{C<:Val, T}
    properties::Vector{String} # or Vector{Symbol}
    values::Vector{T}
end
# outer constructor
function MatlabOpaque(classname::Symbol, properties::Vector{String}, values::Vector{T}) where T
   return MatlabOpaque{Val{classname}, T}(properties, values)
end

Now you can distinguish a vector of the same class (class/struct array?) from a vector of different classes (cell array?)

MatlabOpaque{Val{:table}, Any}[] # class array of Matlab tables
MatlabOpaque[] # class (cell) array of different classes
Any[] # cell array of any kind of (matlab) type
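
A self-contained sketch of the Val-parameterised variant, showing how the class name in the type parameter could drive dispatch and array element types. The `describe` function and the sample property names are hypothetical and exist only for this illustration.

```julia
# Val-parameterised wrapper as proposed above.
struct MatlabOpaque{C<:Val,T}
    properties::Vector{String}
    values::Vector{T}
end

# Outer constructor that lifts the class name into the type parameter.
function MatlabOpaque(classname::Symbol, properties::Vector{String}, values::Vector{T}) where {T}
    return MatlabOpaque{Val{classname},T}(properties, values)
end

# Dispatch on the class name carried in the type parameter.
describe(::MatlabOpaque{Val{:table}}) = "a MATLAB table"
describe(::MatlabOpaque) = "some other MATLAB class"

tbl = MatlabOpaque(:table, ["data"], Any[[1.0, 2.0]])
dt  = MatlabOpaque(:datetime, ["tz"], Any[""])

homogeneous = MatlabOpaque{Val{:table},Any}[tbl]  # array of a single class
mixed       = MatlabOpaque[tbl, dt]               # array of different classes
```

Since the class name is only known at read time, constructing these objects involves dynamic dispatch, which is presumably the small performance penalty mentioned above.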

This also means we can write automatic converters:

Base.convert(::Type{DataFrame}, obj::MatlabOpaque{Val{:table}}) = ...

or better yet add the Tables.jl interface:

Tables.istable(::Type{<:MatlabOpaque{Val{:table}}}) = true
Tables.istable(::Type{<:MatlabOpaque}) = false

Of course, instead of the Val type parameter, we could just convert immediately to different known types during reading. Then we have to define the multiple MAT.jl internal types we need, or immediately convert to the most sensible Julia type.

struct MatlabTable 
...
end
Tables.istable(::Type{MatlabTable}) = true

struct MatlabDuration <: Dates.AbstractTime # or use Dates.Millisecond type
...
end

I would propose to first implement the MatlabOpaque type (with or without Val parameter) in this PR and then in future PRs we can think about the (automatic) conversion to Julia types for table, datetime, duration, etc.

@foreverallama
Author

If I'm understanding correctly, we'd have MatlabOpaque{Val{:classname}, Any}[] for MATLAB object arrays. So this resolves the conflict with struct arrays you proposed?

Some points to clarify/get clarified:

  • The MatlabOpaque type would have two vectors for properties and values. Do we retain dictionary behaviour instead of indexing? Idk about the performance point of view, but dict behaviour would be helpful, especially for conversions to Julia types as MATLAB types can contain like 10-20 properties for complex types like timetable.
  • Can we currently load/represent 1x0, 0x1, or 0x0 structs with the Dict type? We need that functionality to write classdef objects to MAT-files.
  • I guess object arrays would then be Matrix{MatlabOpaque{T}}(undef, dims), right?
  • What do you suggest for mxOBJECT_CLASS and mxFUNCTION_CLASS? The serialized data for these types are saved as structs, and MAT.jl currently loads them as Dict with an extra key class.

The rest sounds good. Doesn't sound like it diverges too much from the current state of the PR. To summarize the main edits:

  • Thread safe cache handling
  • Change return type to MatlabOpaque

@matthijscox
Contributor

> If I'm understanding correctly, we'd have MatlabOpaque{Val{:classname}, Any}[] for MATLAB object arrays. So this resolves the conflict with struct arrays you proposed?

Indeed.

Note I now created a MatlabStructArray type in my PR to fully avoid any conflict with your types I hope.

> The MatlabOpaque type would have two vectors for properties and values. Do we retain dictionary behaviour instead of indexing? Idk about the performance point of view, but dict behaviour would be helpful, especially for conversions to Julia types as MATLAB types can contain like 10-20 properties for complex types like timetable.

You mean that you want to index by name object["name"] or object[:name]? Coincidentally I added Dict-like behavior to my MatlabStructArray type (string-based indexing and iteration returns key-value pairs). You can have a look for inspiration: see here for example.

Basically we can define methods like:

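function Base.getindex(object::MatlabOpaque, prop::AbstractString)
    idx = findfirst(isequal(prop), object.properties)
    # findfirst returns nothing when the property does not exist
    idx === nothing && throw(KeyError(prop))
    return object.values[idx]
end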
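
To show how far the Dict-like behaviour could go, here is a self-contained sketch that adds `haskey`, `keys`, `length`, and pair iteration on top of string indexing. It uses a simplified stand-in for the proposed type (values fixed to `Any`); the exact set of methods and the sample property names are assumptions, not what either PR implements.

```julia
# Minimal stand-in for the proposed type, with values fixed to Any.
struct MatlabOpaque
    classname::Symbol
    properties::Vector{String}
    values::Vector{Any}
end

# String-based indexing with a clear error for unknown properties.
function Base.getindex(obj::MatlabOpaque, prop::AbstractString)
    idx = findfirst(isequal(prop), obj.properties)
    idx === nothing && throw(KeyError(prop))
    return obj.values[idx]
end

Base.haskey(obj::MatlabOpaque, prop::AbstractString) = prop in obj.properties
Base.keys(obj::MatlabOpaque) = obj.properties
Base.length(obj::MatlabOpaque) = length(obj.properties)

# Iterating yields property => value pairs, like a Dict.
function Base.iterate(obj::MatlabOpaque, i::Int=1)
    i > length(obj.properties) && return nothing
    return (obj.properties[i] => obj.values[i], i + 1)
end

tt = MatlabOpaque(:timetable, ["rowTimes", "data"], Any[[0.0, 1.0], [10, 20]])
as_dict = Dict{String,Any}(tt)  # works because iteration yields pairs
```

With only 10-20 properties per object, the linear `findfirst` lookup is unlikely to matter, though that is a guess rather than a measurement.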

> Can we currently load/represent 1x0, 0x1, or 0x0 structs with the Dict type? We need that functionality to write classdef objects to MAT-files.

Pff, I will have to check. You also make me worried about these empty sized struct arrays for my PR.

Is this the way to construct them?

>> f = {'a', 'b', 'c'};
>> f{2,1} = {};
>> s = struct(f{:})

s = 

  0×0 empty struct array with fields:

    a
    b
    c

>> reshape(s, 0, 1);
>> reshape(s, 1, 0);

> I guess object arrays would then be Matrix{MatlabOpaque{T}}(undef, dims), right?

Yeah, well Matrix{MatlabOpaque{C,T}}(undef, dims) if we choose the Val-based dispatching option. Though I'm thinking now that we can probably drop the T parameter entirely in the struct definition if it's always of type Any. Which seems likely since we don't often expect the same types between MATLAB properties?

> What do you suggest for mxOBJECT_CLASS and mxFUNCTION_CLASS? The serialized data for these types are saved as structs, and MAT.jl currently loads them as Dict with an extra key class.

I am very new to MAT.jl myself, I admit. I saw the mxOBJECT_CLASS in the MAT_v5.jl file and I'm not sure what it is. Is it the old MATLAB object class created in @classname folders? I don't see it mentioned in MAT_HDF5.jl (v7.3).


It's possible we're going to have merge conflicts with my PR, since I've been refactoring a bit there. So if you have any time to review my PR that would be great.

@matthijscox
Contributor

> Can we currently load/represent 1x0, 0x1, or 0x0 structs with the Dict type? We need that functionality to write classdef objects to MAT-files.

To quickly follow-up on your question:

The bad news

Currently (MAT v0.10) we cannot. An empty 0x0 struct array will be read as Dict{String, Any}("c"=>Matrix{Any}(undef, 0, 0), "b"=>Matrix{Any}(undef, 0, 0), "a"=>Matrix{Any}(undef, 0, 0)) and written back as a 1x1 struct with empty cell arrays in the field:

>> s00

s00 = 

  struct with fields:

    c: {}
    b: {}
    a: {}

Good news

In my new struct array PR I can write them! (reading goes wrong, I'll fix that):

In Julia:

empty_sarr = MAT.MatlabStructArray(["a", "b", "c"], [Matrix{Any}(undef, 0, 0), Matrix{Any}(undef, 0, 0), Matrix{Any}(undef, 0, 0)])
matwrite("test.mat", Dict("s00" => empty_sarr))

In MATLAB:

>> load('test.mat')
>> s00

s00 = 

  0×0 empty struct array with fields:

    a
    b
    c

@foreverallama
Author

> You also make me worried about these empty sized struct arrays for my PR

Haha yeah, sorry. It's an edge case from a user perspective, but MATLAB uses these zero-element dimensions as placeholders, and getting them wrong results in unexpected errors when loading in MATLAB. Ideally if the read_struct and write_struct methods can take care of this I can simply use MatlabStructArray in subsystem. In fact it might be better to finish your PR first and then come back to this.

> we can probably drop the T parameter entirely in the struct definition

Yeah, sounds good.

> I saw the mxOBJECT_CLASS in the MAT_v5.jl file and I'm not sure what it is? Is it the old matlab object class created in @classname folders? I don't see it mentioned in MAT_HDF5.jl (v7.3).

I believe mxOBJECT_CLASS is the legacy class implementation in MATLAB. I guess we can keep it as is, and maybe add MatlabFunction and MatlabObject types later if required.

I did notice some differences between the HDF5 and v5 versions. The HDF5 file isn't as extensive as the v5 file, and I honestly think it needs a rewrite to allow consistency between versions. But that's for another time.

@matthijscox
Contributor

I now support empty struct arrays in my PR #207. Create one easily via MatlabStructArray(["a", "b"], (0,0))

I also created the MatlabClassObject to better support mxOBJECT_CLASS in the v5 format and to read and write the class object in the v7.3 HDF5 format.
I allow class object arrays via MatlabStructArray if a class name is provided.

I think that's all we need to continue with your PR. I will probably merge soon and then if you want I can help with this PR to solve any merge conflicts for you.

@matthijscox matthijscox mentioned this pull request Nov 14, 2025
@foreverallama
Author

Thanks! I'm a bit swamped until the end of the month and will most probably take a look at this afterwards, but I will try to squeeze something in as soon as I can.

merge!(prop_dict, get_properties(subsys, oid))
# cache it
obj = MatlabOpaque(prop_dict, classname)
subsys.object_cache[oid] = obj
Author

The object should be cached before the merge! call. Otherwise get_properties can recurse infinitely for handle class objects.
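
A toy sketch of why the ordering matters: if two handle objects refer to each other, registering the still-empty object in the cache before filling in its properties lets the nested lookup terminate. The `raw_props` table and `load_object!` function are hypothetical stand-ins for the subsystem data and loader, not the PR's code.

```julia
# Hypothetical stand-in for subsystem data: object id => properties, where a
# UInt32 value refers to another object id (two objects pointing at each other).
const raw_props = Dict{UInt32,Dict{String,Any}}(
    1 => Dict{String,Any}("child" => UInt32(2)),
    2 => Dict{String,Any}("parent" => UInt32(1)),
)

function load_object!(cache::Dict{UInt32,Dict{String,Any}}, oid::UInt32)
    haskey(cache, oid) && return cache[oid]
    obj = Dict{String,Any}()
    cache[oid] = obj  # register the empty object first
    for (name, value) in raw_props[oid]
        # Nested references now hit the cache instead of recursing forever.
        obj[name] = value isa UInt32 ? load_object!(cache, value) : value
    end
    return obj
end

cache = Dict{UInt32,Dict{String,Any}}()
a = load_object!(cache, UInt32(1))
```

Moving the `cache[oid] = obj` line below the loop would make this example recurse until the stack overflows, which is the failure mode described above.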

Contributor

I see... Should we add a unit test for these handle class objects?


function get_default_properties(subsys::Subsystem, class_id::UInt32)
prop_vals_class = subsys.prop_vals_defaults[class_id+1, 1]
# is it always a MatlabStructArray?
Author

I don't think prop_vals_class will be a MatlabStructArray. It should always be a scalar 1x1 struct. Did you come across an example that goes against this?

Also, we can get rid of the copy statement. Instead, something like this to identify nested objects should work.

for (k, v) in prop_vals_class
    prop_vals_class[k] = update_nested_props!(v, subsys)
end

Contributor

I had errors for a 0x0 MatlabStructArray, so this case definitely happens somewhere in the test data.

src/MAT_types.jl Outdated
return DateTime[]
end
if !isempty(obj["tz"])
@warn "no timezone conversion yet for datetime objects. timezone ignored"
Author

It might be a good idea to display the timezone the datetime object was saved with in the warning.

Contributor

Sure. Note I didn't want to add TimeZones.jl yet because I remember that it's a massive dependency, adding lots of precompile or artifact download time

if num_strings==1
return first(strings)
else
return reshape(strings, shape...)
Author

looks like the job is failing here. Maybe explicitly cast to a tuple?
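
The thread does not show the actual CI error, so the following is only a guess at what the tuple suggestion could look like: it assumes the shape arrives as a vector of a non-Int integer type and converts it to a tuple of Int before reshaping. The sample data is made up.

```julia
# Assumed situation: the shape was read from the file as a non-Int integer vector.
strings = ["a", "b", "c", "d", "e", "f"]
shape = UInt64[2, 3]

# Convert to a tuple of Int before reshaping, rather than splatting the raw vector.
dims = Tuple(Int.(shape))
reshaped = reshape(strings, dims)
```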

@matthijscox
Contributor

Somehow I increased SnoopCompile invalidations a bit, this might mean package loading is slightly slower, but it's all due to dependencies (mostly HDF5.jl). Not sure I know how to fix this, so maybe we'll just take the hit.
