# DataStore Execution Model
Understanding DataStore's lazy evaluation model is key to using it effectively and achieving optimal performance.
## Lazy Evaluation
DataStore uses lazy evaluation - operations are not executed immediately but are recorded and compiled into optimized SQL queries. Execution happens only when results are actually needed.
### Example: Lazy vs Eager
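A minimal sketch of the contrast, using a toy `LazyQuery` class invented for illustration (not DataStore's actual implementation):

```python
rows = [{"city": "NYC", "sales": 100}, {"city": "LA", "sales": 250}]

# Eager: each step materializes an intermediate result immediately.
eager = [r for r in rows if r["sales"] > 150]   # executes now
eager = [r["city"] for r in eager]              # executes now

class LazyQuery:
    """Toy lazy wrapper: records operations; nothing runs until collect()."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, tuple(ops)

    def filter(self, pred):
        return LazyQuery(self.data, self.ops + (("filter", pred),))

    def select(self, key):
        return LazyQuery(self.data, self.ops + (("select", key),))

    def collect(self):
        out = self.data
        for kind, arg in self.ops:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:
                out = [r[arg] for r in out]
        return out

lazy = LazyQuery(rows).filter(lambda r: r["sales"] > 150).select("city")
# No work has happened yet; execution is deferred until:
print(lazy.collect())  # ['LA']
```

In the real library, the recorded operations would be compiled into a single SQL query instead of being replayed step by step.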
### Benefits of Lazy Evaluation
- Query Optimization: Multiple operations compile to a single optimized SQL query
- Filter Pushdown: Filters are applied at the data source level
- Column Pruning: Only needed columns are read
- Deferred Decisions: Execution engine can be chosen at runtime
- Plan Inspection: You can view/debug the query before executing
## Execution Triggers
Execution is triggered automatically when you need actual values.
### Automatic Triggers
| Trigger | Example | Description |
|---|---|---|
| `print()` / `repr()` | `print(ds)` | Display results |
| `len()` | `len(ds)` | Get row count |
| `.columns` | `ds.columns` | Get column names |
| `.dtypes` | `ds.dtypes` | Get column types |
| `.shape` | `ds.shape` | Get dimensions |
| `.index` | `ds.index` | Get row index |
| `.values` | `ds.values` | Get NumPy array |
| Iteration | `for row in ds` | Iterate over rows |
| `to_df()` | `ds.to_df()` | Convert to pandas DataFrame |
| `to_pandas()` | `ds.to_pandas()` | Alias for `to_df()` |
| `to_dict()` | `ds.to_dict()` | Convert to dict |
| `to_numpy()` | `ds.to_numpy()` | Convert to NumPy array |
| `.equals()` | `ds.equals(other)` | Compare DataStores |
Examples:
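To illustrate how such triggers can work, here is a toy `LazyResult` class (invented for this sketch, not DataStore's code) whose `len()` and `repr()` hooks force execution:

```python
class LazyResult:
    """Toy lazy object: dunder accessors act as execution triggers."""
    def __init__(self, compute):
        self._compute = compute   # deferred work
        self._executed = 0        # counts how many times we actually ran

    def _execute(self):
        self._executed += 1
        return self._compute()

    def __len__(self):            # len(ds) triggers execution
        return len(self._execute())

    def __repr__(self):           # print(ds) triggers execution
        return repr(self._execute())

ds = LazyResult(lambda: [1, 2, 3])
assert ds._executed == 0   # still lazy: nothing has run yet
n = len(ds)                # trigger fires here
print(n, ds._executed)     # 3 1
```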
### Operations That Stay Lazy
| Operation | Returns | Description |
|---|---|---|
| `filter()` | `DataStore` | Adds WHERE clause |
| `select()` | `DataStore` | Adds column selection |
| `sort()` | `DataStore` | Adds ORDER BY |
| `groupby()` | `LazyGroupBy` | Prepares GROUP BY |
| `join()` | `DataStore` | Adds JOIN |
| `ds['col']` | `ColumnExpr` | Column reference |
| `ds[['col1', 'col2']]` | `DataStore` | Column selection |
Examples:
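A sketch of the idea with a toy `LazyDS` class (invented here, not DataStore's code): chaining only records a plan, and a scan counter shows the source is read exactly once, at execution time:

```python
reads = {"count": 0}

def scan():
    """Stand-in for the data source; counts how often it is touched."""
    reads["count"] += 1
    return [{"a": 3}, {"a": 1}, {"a": 2}]

class LazyDS:
    def __init__(self, source, plan=()):
        self.source, self.plan = source, plan

    def filter(self, pred):            # returns a NEW lazy object
        return LazyDS(self.source, self.plan + (("filter", pred),))

    def sort(self, key):               # still lazy
        return LazyDS(self.source, self.plan + (("sort", key),))

    def collect(self):                 # the only place work happens
        out = self.source()
        for kind, arg in self.plan:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:
                out = sorted(out, key=arg)
        return out

q = LazyDS(scan).filter(lambda r: r["a"] > 1).sort(lambda r: r["a"])
assert reads["count"] == 0                      # chaining touched nothing
assert q.collect() == [{"a": 2}, {"a": 3}]
assert reads["count"] == 1                      # one scan, at execution time
```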
## Three-Phase Execution
DataStore operations follow a three-phase execution model:
### Phase 1: SQL Query Building (Lazy)
Operations that can be expressed in SQL are accumulated into a pending query.
### Phase 2: Execution Point
When a trigger occurs, the accumulated SQL is executed in a single query.
### Phase 3: DataFrame Operations (if any)
If you chain pandas-only operations after execution, they run on the materialized DataFrame.
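The three phases can be sketched end to end with SQLite standing in for the SQL engine (the table name and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, sales INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("NYC", 100), ("LA", 250), ("SF", 300)])

# Phase 1: build SQL lazily from recorded operations.
where, order = "sales > 150", "sales DESC"
sql = f"SELECT city, sales FROM t WHERE {where} ORDER BY {order}"

# Phase 2: a trigger (here, fetching results) executes the single query.
rows = conn.execute(sql).fetchall()

# Phase 3: a pandas-only-style step runs on the materialized result.
labels = [f"{city}:{sales}" for city, sales in rows]
print(labels)  # ['SF:300', 'LA:250']
```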
## Viewing Execution Plans
Use `explain()` to see the SQL that will be executed, and pass `verbose=True` for more detail.
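The idea behind `explain()` can be sketched with a toy query builder (the `Query` class and its output format are invented here; DataStore's real output will differ):

```python
class Query:
    """Toy builder: accumulates clauses, can render SQL without running it."""
    def __init__(self, table, clauses=None):
        self.table = table
        self.clauses = dict(clauses or {})

    def filter(self, cond):
        q = Query(self.table, self.clauses)
        q.clauses["where"] = cond
        return q

    def sort(self, col):
        q = Query(self.table, self.clauses)
        q.clauses["order"] = col
        return q

    def explain(self, verbose=False):
        sql = f"SELECT * FROM {self.table}"
        if "where" in self.clauses:
            sql += f" WHERE {self.clauses['where']}"
        if "order" in self.clauses:
            sql += f" ORDER BY {self.clauses['order']}"
        if verbose:                       # extra detail in verbose mode
            sql += f"\n-- pending clauses: {self.clauses}"
        return sql

q = Query("events").filter("sales > 100").sort("ts")
print(q.explain())
# SELECT * FROM events WHERE sales > 100 ORDER BY ts
```

Because the plan is just recorded state, inspecting it costs nothing: no query runs until a trigger fires.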
See Debugging: `explain()` for complete documentation.
## Caching
DataStore caches execution results to avoid redundant queries.
### How Caching Works
### Cache Invalidation
The cache is invalidated when operations modify the DataStore.
### Manual Cache Control
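A hedged sketch of the caching behavior, using a toy `CachedDS` class (its method names are invented, not DataStore's API): the first execution populates the cache, repeated reads reuse it, and a mutation or a manual clear invalidates it.

```python
class CachedDS:
    def __init__(self, data):
        self._data = list(data)
        self._cache = None
        self.executions = 0   # counts real executions, for demonstration

    def collect(self):
        if self._cache is None:          # cache miss: execute and store
            self.executions += 1
            self._cache = sorted(self._data)
        return self._cache               # cache hit: no re-execution

    def append(self, value):             # a mutating op invalidates the cache
        self._data.append(value)
        self._cache = None

    def clear_cache(self):               # manual cache control
        self._cache = None

ds = CachedDS([3, 1, 2])
ds.collect()
ds.collect()
assert ds.executions == 1    # second call served from cache
ds.append(0)                 # invalidation
ds.collect()
assert ds.executions == 2    # re-executed after the mutation
```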
## Mixing SQL and Pandas Operations
DataStore intelligently handles pipelines that mix SQL and pandas operations.
### SQL-Compatible Operations
These compile to SQL:
- `filter()`, `where()`
- `select()`
- `groupby()`, `agg()`
- `sort()`, `orderby()`
- `limit()`, `offset()`
- `join()`, `union()`
- `distinct()`
- Column operations (math, comparison, string methods)
### Pandas-Only Operations
These trigger execution and use pandas:
- `apply()` with custom functions
- `pivot_table()` with complex aggregations
- `stack()`, `unstack()`
- Operations on executed DataFrames
### Hybrid Pipelines
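A hybrid pipeline can be sketched with SQLite standing in for the SQL engine: the aggregation compiles to SQL, while a custom per-row function (something SQL cannot express) runs afterwards in Python. The table and the `grade()` helper are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("east", 20), ("west", 5)])

# SQL-compatible part: the GROUP BY is pushed into the engine.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Pandas-only-style part: arbitrary Python applied to the small result set.
def grade(total):
    return "high" if total >= 25 else "low"

graded = [(region, total, grade(total)) for region, total in totals]
print(graded)  # [('east', 30, 'high'), ('west', 5, 'low')]
```

The important point is the ordering: the engine shrinks the data first, and the expensive row-by-row Python only ever sees the aggregated result.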
## Execution Engine Selection
DataStore can execute operations using different engines.
### Auto Mode (Default)
### Force chDB Engine
### Force pandas Engine
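The selection logic can be sketched as a simple dispatch function (the name `choose_engine`, the mode strings, and the operation set are invented for illustration):

```python
# Operations assumed (for this sketch) to compile to SQL.
SQL_OPS = {"filter", "select", "sort", "groupby", "join", "limit"}

def choose_engine(ops, mode="auto"):
    if mode in ("chdb", "pandas"):   # a forced engine always wins
        return mode
    # auto: use the SQL engine only if the whole plan can compile to SQL
    return "chdb" if all(op in SQL_OPS for op in ops) else "pandas"

assert choose_engine(["filter", "sort"]) == "chdb"     # fully SQL-compatible
assert choose_engine(["filter", "apply"]) == "pandas"  # apply() forces pandas
assert choose_engine(["apply"], mode="chdb") == "chdb" # explicit override
```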
See Configuration: Execution Engine for details.
## Performance Implications
### Good: Filter Early
### Bad: Filter Late
### Good: Select Columns Early
### Good: Let SQL Do the Work
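As a rough analogy in plain Python, filtering before a blocking operation (here, a sort) means far less data flows through the rest of the pipeline; this is what filter pushdown buys you at the SQL level:

```python
def scan():
    """Stand-in for a table scan over 10,000 rows."""
    yield from range(10_000)

# Late filter: the sort (a blocking op) must process all 10,000 rows first.
late = [i for i in sorted(scan(), reverse=True) if i < 10]

# Early filter: only 10 rows ever reach the sort.
early = sorted((i for i in scan() if i < 10), reverse=True)

assert late == early   # same answer, very different amount of work
```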
## Best Practices Summary
- Chain operations before executing - Build the full query, then trigger once
- Filter early - Reduce data at the source
- Select only needed columns - Column pruning improves performance
- Use `explain()` to understand execution - Debug before running
- Let SQL handle aggregations - ClickHouse is optimized for this
- Be aware of execution triggers - Avoid accidental early execution
- Use caching wisely - Understand when cache is invalidated