Releases: dataflint/spark
Version 0.7.0
Release Notes - v0.7.0
Release Date: December 1, 2025
What's New
Delta Lake Instrumentation
This release introduces comprehensive Delta Lake monitoring and instrumentation capabilities:
- Delta Lake Table Monitoring: New `spark.dataflint.instrument.deltalake` configuration flag to enable Delta Lake-specific instrumentation
- Delta Lake Scan Page: New dedicated UI page showing Delta Lake scan operations and metrics
- Full Table Scan Detection: Automatic alerts for full table scans on Delta Lake tables to help identify performance issues
- Z-Order Cache Tracking: Monitor Z-Order optimization cache usage in table properties
- Delta Log Integration: Direct integration with Delta Lake's cached snapshots for improved performance monitoring
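The Delta Lake instrumentation above is opt-in. A minimal sketch of enabling it at submit time, using the flag named in these notes (the application jar path is a placeholder):

```shell
# Enable DataFlint's Delta Lake instrumentation for a single run.
# Flag name taken from this release's notes; jar path is a placeholder.
spark-submit \
  --conf spark.dataflint.instrument.deltalake=true \
  your-app.jar
```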
Enhanced UI & User Experience
Alerts Tab Improvements
- Grouped Alerts: Alerts are now organized by alert type for better visibility and navigation
- Search Functionality: New description search bar to quickly find specific alerts
- Spill Selector: New UI component to identify and navigate to operations with data spills
- Duration/Alert Navigation: Improved button logic for advancing through alerts by duration or index
SQL Flow Enhancements
- Subquery Differentiation: Better visual differentiation for subqueries in the SQL execution plan
- Union Support: Improved stage identification algorithm for UNION operations and missing nodes with same-stage neighbors
- SQL Text Display: Enhanced SQL text rendering and display
JDBC Support
- JDBC Scan Detection: Better support for JDBC scan operations with dedicated parsing and visualization
- JDBC Examples: New comprehensive JDBC example demonstrating monitoring capabilities
Telemetry & Analytics
- Scarf Pixel Integration: Optional telemetry to help monitor OSS usage patterns
- Can be disabled with the `spark.dataflint.telemetry.enabled=false` flag
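Telemetry is on by default; a sketch of opting out at submit time with the flag named above (the application jar path is a placeholder):

```shell
# Opt out of Scarf pixel telemetry for this run.
# Flag name taken from this release's notes; jar path is a placeholder.
spark-submit \
  --conf spark.dataflint.telemetry.enabled=false \
  your-app.jar
```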
Technical Improvements
Core Enhancements
- Delta Lake Reflection Utils: New utility classes for Delta Lake introspection and monitoring
- Delta Table Path Parser: Robust parsing of Delta Lake table paths and identifiers
- Improved Metrics Processing: Enhanced metric processors for better performance data collection
Bug Fixes
- Fixed `bytesToHumanReadableSize` utility to handle comma-separated values correctly
- Fixed read parser unit tests
- Improved central snapshot deployment configuration
- Better handling of missing nodes in stage identification
Build & CI/CD
- Updated CI/CD workflows for improved reliability
- Enhanced build configuration for Spark 3.x and 4.x compatibility
- Improved artifact publishing process
Full Changelog
Full Changelog: v0.6.1...v0.7.0
Features
- Delta Lake instrumentation and monitoring (multiple commits)
- Alert tab grouping by alert type (bce9c5f)
- Spill selector component (0051e5c)
- Description search bar (e9f1a36)
- Scarf pixel telemetry integration (71d8287)
- SQL text improvements (#39)
- Subquery UI differentiation (4ec35fb)
- JDBC scan support (3e2060f)
- Full scan table alerts for Delta Lake (b8709cc)
- Z-Order cache tracking (8718974)
Bug Fixes & Improvements
- Improved stage identification algorithm for unions (d9e2902)
- Fixed duration/alert navigation logic (f7b8c16)
- Fixed read parser unit tests (b49efb7)
- Fixed bytesToHumanReadableSize comma handling (c7e3695)
- Improved Delta Lake listener implementation (def2b01)
- Refactored listener architecture (3edef71)
- Enhanced Delta Log integration (120a910, 1bc376c)
- Added more supported SQL plan nodes (6e5b64e)
CI/CD & Build
- CI improvements (#38)
- Fixed central snapshot deployment (5e9a966)
- Updated README and documentation (18810dc, #36)
Contributors
Special thanks to:
- @menishmueli - Core development and features
- @cxzl25 - SQL text improvements and CI enhancements
- Daniel Aronovich - Documentation updates
Documentation
For detailed usage instructions, see the README.
For Delta Lake instrumentation setup:
```
spark.conf.set("spark.dataflint.instrument.deltalake", "true")
```
To disable telemetry:
```
spark.conf.set("spark.dataflint.telemetry.enabled", "false")
```
Version 0.6.1
- Fix POM dependencies for users who import DataFlint via Maven
- Improved Delta Lake support: inserts, optimized writes, and the optimize command
Version 0.6.0
2 major changes:
Spark 4 Support
DataFlint now supports Spark 4! Due to breaking changes in Spark, it requires a different DataFlint artifact name:
io.dataflint:dataflint_spark4_2.13:0.6.0
(instead of the regular DataFlint artifact used for Spark 3)
Stage identification by stages with statistics
DataFlint can now also identify SQL node stages via metrics with statistics that mention the stage and task data the min/median/max values came from. This means better DataFlint support for Spark accelerators such as NVIDIA RAPIDS for Spark.
Version 0.5.1
New Features
- Enhanced Navigation & User Experience
- Added 2 new navigation icons for quick access to:
- Slowest node in the execution plan
- Nodes with alerts/issues
- Added visual indicators in the minimap showing where alerts are located
- Added color coding for minimap based on node performance state
- Improved Node Support
- Added support for "ShuffledHashJoin" node (with "d" suffix as used in Databricks)
- Added support for search window nodes that don't have associated stages (common in Databricks window functions outside codegen cluster nodes)
Visual & UI Improvements
- Node Visualization Enhancements
- Unified color system for all nodes to ensure consistency
- Tuned node performance colors for better visual clarity
- Increased metrics text font size in nodes for improved readability
- Added color coding for minimap based on node state
- Stage Distribution Improvements
- Extended shuffle read/write metrics to all stage distributions (not limited to exchange nodes)
- Fixed ordering of read/write nodes in exchange stages for better logical flow
Performance Optimizations
- Real-time Performance
- Significant performance improvements for SQL plan visualization during real-time execution with live updates
- Better aggregation of task attempts for more accurate node duration calculations
- Added capping mechanism for codegen duration that exceeds stage duration
Bug Fixes & Error Handling
- Connection & Error Management
- Added proper error messaging for "server disconnected" modal
- Improved error handling and user feedback
Technical Improvements
- Code Quality & Maintenance
- Standardized color management system across all node types
- Enhanced metrics calculation accuracy
- Improved real-time data processing efficiency
Full Changelog: v0.5.0...v0.5.1
Version 0.5.0
New and updated design for the sql plan nodes
Version 0.4.4
What's Changed
- Adding support for Expand nodes, and calculating the expand ratio
- Improving aggregation node names - Aggregate (partial_count) is now Count Within Partition
- Projecting (i.e. selecting) all fields is shown as * (instead of empty text in Spark UI)
Full Changelog: v0.4.3...v0.4.4
Version 0.4.3
Added query url support for sql_id and node_ids
Version 0.4.2
- Support zorder based pruning metrics in databricks
- Showing python UDF function name in filter/select nodes
- Add support for Generate (explode, inline, etc) node
Version 0.4.1
- Support DBR 15/16
Version 0.4.0
- Support map by pandas and arrow functions
- Added a new flag to silence alerts for a job: `spark.dataflint.alert.disabled`, which accepts a comma-separated list of alerts such as `smallTasks,idleCoresTooHigh`
- Added a short recommendation on top of each alert
- Updated DataFlint logo
- Support better stage identification for various readers
- Shows stage failures with an orange V on the SQL node and a list of complete stage failures
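As a sketch, the alert-silencing flag described above could be passed at submit time like this (alert names taken from these notes; the application jar path is a placeholder):

```shell
# Silence the smallTasks and idleCoresTooHigh alerts for this job.
# Flag and alert names taken from these notes; jar path is a placeholder.
spark-submit \
  --conf spark.dataflint.alert.disabled=smallTasks,idleCoresTooHigh \
  your-app.jar
```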