Use DuckDB for querying and analyzing Parquet forecast data (suggestion) #323

@yuvraajnarula

Detailed Description

Currently, Quartz Solar Forecast stores forecast outputs and historical weather data in Parquet and CSV files. While Parquet is efficient for storage, users still need to load entire files into memory or write custom aggregation scripts to analyze the data.

Adding DuckDB as a query layer would allow:

  • Direct SQL queries on Parquet files without loading all data into memory.
  • Fast aggregation, filtering, and joining across multiple Parquet files (e.g., multi-site forecasts or weather data).
  • Easier ad hoc analysis and evaluation of model performance.
  • Integration with Python via Pandas DataFrames while still leveraging SQL efficiency.

Example use cases:

  • Compute hourly average predicted power across all sites.
  • Filter forecasts for specific time ranges or locations.
  • Join forecast data with historical weather datasets for feature engineering.

Context

This change would benefit:

  • Developers and analysts working with large forecast datasets.
  • Users who want to perform quick queries without writing custom scripts for every analysis.
  • Situations where multiple sites or long time periods generate large Parquet datasets.

DuckDB also brings:

  • Fast querying and analytics on local or in-memory datasets (especially PV site data, weather predictions, or time-series CSVs).
  • Effortless integration with Pandas and Polars DataFrames.
  • SQL queries inside Python/Jupyter notebooks, improving ETL and aggregation operations compared to pure Pandas on larger datasets.
  • Easy CSV/Parquet/Arrow file ingestion without complex setup.
  • No server or infrastructure overhead, making setup simple.

Possible Implementation

import duckdb
import pandas as pd

# Load a CSV into a DataFrame (file and column names are illustrative)
df = pd.read_csv('your_data.csv')
con = duckdb.connect()

# Register the DataFrame as a DuckDB view and query it with SQL,
# using a parameterized query for the timestamp filter
con.register('my_df', df)
result = con.execute(
    'SELECT avg(power_output) FROM my_df WHERE timestamp > ?',
    ['2024-01-01'],
).fetchdf()
