-
Notifications
You must be signed in to change notification settings - Fork 5
Description
Hello DataKnots team,
I'm looking into the DataKnots project and I'm excited about what I see. It looks like a very powerful tool.
I do have an issue that I'd like to discuss. In my use cases, values are "missing not at random", and I need to treat them with caution. For example, it might be that the lowest true values are always unobserved. Naive behavior when filtering, joining, or aggregating on missing values will lead me to incorrect conclusions.
In base Julia, filter lets me be confident I'm not accidentally dropping significant missing values.
# Note the missing salary.
julia> employee_csv = """
name,department,position,salary
"ANTHONY A","POLICE","POLICE OFFICER",72510
"DANIEL A","FIRE","FIRE FIGHTER-EMT",95484
"JAMES A","FIRE","FIRE ENGINEER-EMT",103350
"JEFFERY A","POLICE","SERGEANT",101442
"NANCY A","POLICE","POLICE OFFICER",80016
"ROBERT K","FIRE","FIRE FIGHTER-EMT",
""" |> IOBuffer |> CSV.File
6-element CSV.File{false}:
CSV.Row: (name = "ANTHONY A", department = "POLICE", position = "POLICE OFFICER", salary = 72510)
CSV.Row: (name = "DANIEL A", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = 95484)
CSV.Row: (name = "JAMES A", department = "FIRE", position = "FIRE ENGINEER-EMT", salary = 103350)
CSV.Row: (name = "JEFFERY A", department = "POLICE", position = "SERGEANT", salary = 101442)
CSV.Row: (name = "NANCY A", department = "POLICE", position = "POLICE OFFICER", salary = 80016)
CSV.Row: (name = "ROBERT K", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = missing)
julia> filter(x->x.salary < 100_000, employee_csv)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
[1] filter(f::var"#11#12", a::CSV.File{false})
@ Base ./array.jl:2522
[2] top-level scope
@ REPL[29]:1
julia> filter(x-> coalesce(x.salary < 100_000, false), employee_csv)
3-element Vector{CSV.Row}:
CSV.Row: (name = "ANTHONY A", department = "POLICE", position = "POLICE OFFICER", salary = 72510)
CSV.Row: (name = "DANIEL A", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = 95484)
CSV.Row: (name = "NANCY A", department = "POLICE", position = "POLICE OFFICER", salary = 80016)On the other hand, currently DataKnots.jl silently drops missing values.
julia> chicago = DataKnot(:employee => employee_csv);
julia> @query chicago begin
employee
filter(salary < 100000)
end
│ employee │
│ name department position salary │
──┼─────────────────────────────────────────────────┼
1 │ ANTHONY A POLICE POLICE OFFICER 72510 │
2 │ DANIEL A FIRE FIRE FIGHTER-EMT 95484 │
3 │ NANCY A POLICE POLICE OFFICER 80016 │Using tools that require me to mentally track missingness and ensure rows aren't silently dropped takes effort I'd rather spend on other parts of my analysis. Tools like Missings.jl's passmissing(f)(x) and f(skipmissing(xs)) make it easier to do this explicitly.
For more discussion, see JuliaData/DataFrames.jl#2499 about joining tables on missing values.