
Commit c3da148

Creating json spark query execution plan parser
1 parent 44ebf0d commit c3da148

8 files changed: +1407 -0 lines changed
Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Static Query Execution Parser for OpenLineage

## Overview

The **StaticQueryExecutionParser** processes Apache Spark query execution plans provided as JSON files and generates valid OpenLineage events with COMPLETE status. It is designed specifically for in-memory DataFrame operations in dry-run ETL processes.
## Architecture

### Core Components

1. **StaticQueryExecutionParser** (`integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/StaticQueryExecutionParser.java`)
   - Main parser class that processes JSON execution plans
   - Generates OpenLineage events using existing OpenLineage client classes
   - Handles schema extraction and column lineage mapping

2. **StaticQueryExecutionParserMain** (`integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/StaticQueryExecutionParserMain.java`)
   - Demo runner class with a `main` method
   - Processes a single file or batch-processes all example files
   - Includes validation and detailed logging

3. **Test Suite** (`integration/spark/shared/src/test/java/io/openlineage/spark/agent/lifecycle/StaticQueryExecutionParserTest.java`)
   - Comprehensive test coverage for all parser functionality
   - Tests schema extraction, column lineage, and event validation
## Key Features

### 1. JSON Structure Handling
- Processes flat arrays of Spark plan nodes with index-based references
- Handles various Spark operation types (Project, Filter, Join, Aggregate, etc.)
- Correctly identifies `LogicalRDD` nodes as input sources

### 2. Dataset Identification
- Generates unique, deterministic identifiers for in-memory DataFrames
- Uses schema-based hashing for consistent dataset naming
- Assigns the `memory://dataframes` namespace to all datasets

### 3. Schema Extraction
- Extracts complete schema information from execution plan nodes
- Handles nested attribute structures and complex data types
- Supports both `output` and `attributes` field patterns

### 4. Column Lineage Mapping
- Traces column relationships from input to output datasets
- Handles direct column mappings and transformations
- Generates OpenLineage-compliant column lineage facets

### 5. Event Generation
- Creates complete OpenLineage events with COMPLETE status
- Includes all required metadata (runId, eventTime, job information)
- Supports event emission through the existing OpenLineage client
## Usage

### Basic Usage
```java
import io.openlineage.client.OpenLineage;
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.ConsoleTransport;

StaticQueryExecutionParser parser = new StaticQueryExecutionParser();
OpenLineage.RunEvent event = parser.parseExecutionPlanFile("path/to/query_plan.json");

// Emit the event through a console transport
OpenLineageClient client = OpenLineageClient.builder()
    .transport(new ConsoleTransport())
    .build();
parser.emitEvent(event, client);
```

### Command Line Usage
```bash
# Process a specific file; <classpath> must include the compiled parser classes
# and the OpenLineage client jars (the helper script at the end of this commit
# builds a suitable classpath)
java -cp <classpath> io.openlineage.spark.agent.lifecycle.StaticQueryExecutionParserMain path/to/query_plan.json

# Process all files in the query_execution_examples directory
java -cp <classpath> io.openlineage.spark.agent.lifecycle.StaticQueryExecutionParserMain
```
## Implementation Details

### JSON Structure Processing
The parser handles the specific structure found in the provided examples (illustrated below):
- Root level: Array of plan nodes
- Each node: Contains `class`, operation-specific fields, and child references
- LogicalRDD nodes: Contain `output` arrays with schema information
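For orientation, a minimal sketch of this shape. The node layout and field names below (`child`, `name`, `dataType`, `nullable`) are illustrative assumptions rather than excerpts from the example plans:

```json
[
  {
    "class": "org.apache.spark.sql.catalyst.plans.logical.Project",
    "child": 1
  },
  {
    "class": "org.apache.spark.sql.execution.LogicalRDD",
    "output": [
      [
        { "name": "customer_id", "dataType": "string", "nullable": true }
      ]
    ]
  }
]
```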
### Dataset ID Generation
```java
// Generates deterministic IDs based on the schema signature
String schemaSignature = columns.stream().sorted().collect(Collectors.joining(","));
int hash = (nodeClass + schemaSignature + depth).hashCode();
String datasetId = "input_dataset_" + Math.abs(hash);
```
### Schema Extraction Algorithm
1. Try extracting from the `output` field first
2. Fall back to the `attributes` field if no output is found
3. Handle nested array structures for field definitions
4. Extract field names, data types, and nullability

A sketch of this fallback logic is shown below.
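As a minimal illustration of steps 1-3, a hedged sketch using Jackson (already a project dependency); the method name and exact unwrapping logic are assumptions, not the parser's actual code:

```java
import com.fasterxml.jackson.databind.JsonNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: prefer the node's "output" field, fall back to "attributes"
static List<JsonNode> extractSchemaFields(JsonNode planNode) {
  JsonNode source = planNode.has("output") ? planNode.get("output") : planNode.get("attributes");
  List<JsonNode> fields = new ArrayList<>();
  if (source == null) {
    return fields; // no schema information; the caller logs a warning and uses an empty schema
  }
  for (JsonNode entry : source) {
    // Field definitions may arrive wrapped in nested arrays; unwrap one level if so
    if (entry.isArray()) {
      entry.forEach(fields::add);
    } else {
      fields.add(entry);
    }
  }
  return fields;
}
```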
### Column Lineage Tracing
1. Extract output column names from the root node
2. For each output column, search input sources for matching columns
3. Generate lineage relationships with transformation descriptions
4. Create OpenLineage column lineage facets

A sketch of the name matching in step 2 is shown below.
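For illustration, a hedged sketch of that name matching; the signature and the flat `Map` structures are stand-ins for the parser's internal types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: map each output column to "inputDataset.column" entries
// whose names match. Non-direct transformations would need expression analysis.
static Map<String, List<String>> traceColumnLineage(
    List<String> outputColumns, Map<String, List<String>> inputSchemas) {
  Map<String, List<String>> lineage = new HashMap<>();
  for (String column : outputColumns) {
    for (Map.Entry<String, List<String>> input : inputSchemas.entrySet()) {
      if (input.getValue().contains(column)) {
        // A direct name match is recorded as a DIRECT transformation in the facet
        lineage.computeIfAbsent(column, k -> new ArrayList<>())
            .add(input.getKey() + "." + column);
      }
    }
  }
  return lineage;
}
```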
## Handling Edge Cases

### 1. Missing Input Sources
- Creates mock input datasets when no LogicalRDD nodes are found
- Ensures events always have at least one input for compliance

### 2. Complex Data Types
- Handles nested data type structures
- Supports decimal precision specifications
- Gracefully handles unknown types with a fallback (see the sketch below)
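As a sketch of the decimal and fallback handling, assuming a type-normalization helper of this shape (the method name and the recognized type list are illustrative):

```java
// Hypothetical sketch: normalize a Spark dataType string, keeping decimal
// precision/scale specs such as "decimal(38,9)" and falling back to "string"
// for anything unrecognized.
static String normalizeType(String sparkType) {
  if (sparkType == null) {
    return "string"; // fallback when type information is missing entirely
  }
  if (sparkType.matches("decimal\\(\\d+,\\s*\\d+\\)")) {
    return sparkType; // preserve precision and scale
  }
  switch (sparkType) {
    case "integer":
    case "long":
    case "double":
    case "boolean":
    case "string":
    case "date":
    case "timestamp":
      return sparkType;
    default:
      return "string"; // graceful fallback for unknown types
  }
}
```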
### 3. Missing Schema Information
- Provides default empty schemas when extraction fails
- Logs warnings for debugging purposes
## Output Format

The parser generates OpenLineage events with the following structure:

```json
{
  "eventType": "COMPLETE",
  "eventTime": "2024-01-01T12:00:00Z",
  "run": {
    "runId": "uuid-here"
  },
  "job": {
    "namespace": "static_analysis",
    "name": "extracted-from-filename"
  },
  "inputs": [
    {
      "namespace": "memory://dataframes",
      "name": "input_dataset_12345",
      "facets": {
        "schema": {
          "fields": [...]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "memory://dataframes",
      "name": "job-name_output",
      "facets": {
        "schema": {
          "fields": [...]
        },
        "columnLineage": {
          "fields": {
            "column_name": {
              "inputFields": [...],
              "transformationType": "DIRECT"
            }
          }
        }
      }
    }
  ]
}
```
## Testing

The test suite covers the following (a representative test shape is sketched after the list):
- Basic query plan parsing
- Schema extraction from various node types
- Multiple input source handling
- Column lineage generation
- Event validation
- Error handling for edge cases
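For orientation only; the class, method, and file names here are illustrative, and the real assertions live in `StaticQueryExecutionParserTest`:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;

import io.openlineage.client.OpenLineage;
import org.junit.jupiter.api.Test;

class StaticQueryExecutionParserSketchTest {

  @Test
  void parsesPlanAndEmitsCompleteEvent() throws Exception {
    StaticQueryExecutionParser parser = new StaticQueryExecutionParser();
    // Hypothetical example path; any plan from query_execution_examples/ would do
    OpenLineage.RunEvent event =
        parser.parseExecutionPlanFile("query_execution_examples/sample_query_plan.json");

    assertEquals(OpenLineage.RunEvent.EventType.COMPLETE, event.getEventType());
    assertFalse(event.getInputs().isEmpty()); // at least one input, even if mocked
  }
}
```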
## Error Handling

- Comprehensive exception handling with meaningful error messages
- Graceful degradation when schema information is incomplete
- Validation of generated events before emission (sketched below)
- Detailed logging for debugging
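A minimal sketch of what that pre-emission validation can check; these assertions follow the event requirements described in this document and are not the parser's actual method:

```java
// Hypothetical sketch: fail fast when required event metadata is missing
static void validate(OpenLineage.RunEvent event) {
  if (event.getRun() == null || event.getRun().getRunId() == null) {
    throw new IllegalStateException("event is missing a runId");
  }
  if (event.getJob() == null || event.getJob().getName() == null) {
    throw new IllegalStateException("event is missing job information");
  }
  if (event.getInputs() == null || event.getInputs().isEmpty()) {
    throw new IllegalStateException("event must have at least one input dataset");
  }
}
```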
## Benefits

1. **No Runtime Dependencies**: Works with static JSON files without requiring a Spark runtime
2. **Deterministic Output**: Generates consistent dataset IDs for reproducible results
3. **Comprehensive Lineage**: Captures both dataset-level and column-level lineage
4. **OpenLineage Compliant**: Generates valid OpenLineage events that conform to the specification
5. **Extensible**: Modular design allows easy extension for additional Spark operations
6. **Well Tested**: Comprehensive test coverage ensures reliability
## Example Processing

Given the query execution examples in `query_execution_examples/`, the parser:

1. **nu-br-dataset-savings-svr-paid_query_plan.json**
   - Identifies 2 LogicalRDD input sources
   - Extracts schemas with 9 columns each
   - Generates column lineage for the transformations
   - Creates an event with complete metadata

2. **nu-co-dataset-nelson-muntz-label-aggregate-label_query_plan.json**
   - Handles complex aggregation operations
   - Maps column transformations through multiple projection layers
   - Generates comprehensive lineage information

3. **nu-br-dataset-insurance-customer-id-to-gross-income_query_plan.json**
   - Processes large execution plans with many operations
   - Extracts detailed schema information
   - Maintains performance with complex structures
## Integration

The parser integrates seamlessly with existing OpenLineage infrastructure:
- Uses existing OpenLineage client classes
- Compatible with all OpenLineage transports (sketched below)
- Follows the OpenLineage event specification
- Can be embedded in existing Spark applications
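Because events are emitted through the standard client, any transport can replace the console transport from the usage example. A sketch using the HTTP transport, where the URL is a placeholder for a lineage backend such as Marquez:

```java
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpTransport;

// Placeholder URL; point this at your lineage backend
OpenLineageClient client = OpenLineageClient.builder()
    .transport(
        HttpTransport.builder()
            .uri("http://localhost:5000")
            .build())
    .build();
```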
This implementation provides a complete solution for generating OpenLineage events from static Spark query execution plans, enabling comprehensive data lineage tracking in dry-run ETL environments.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
#!/bin/bash

# Build the project first
echo "Building the project..."
./gradlew :shared:compileJava

# Set up classpath
CLASSPATH="shared/build/classes/java/main"

# Add dependencies
for jar in $(find ~/.gradle/caches/modules-2/files-2.1 -name "*.jar" 2>/dev/null | grep -E "(openlineage-client|jackson|slf4j)" | head -20); do
  CLASSPATH="$CLASSPATH:$jar"
done

# Run the static parser
echo "Running Static Query Execution Parser..."
echo "Processing files from query_execution_examples/"

if [ $# -eq 0 ]; then
  # Process all example files
  java -cp "$CLASSPATH" io.openlineage.spark.agent.lifecycle.StaticQueryExecutionParserMain
else
  # Process a specific file
  java -cp "$CLASSPATH" io.openlineage.spark.agent.lifecycle.StaticQueryExecutionParserMain "$1"
fi
