# DataStore Execution Model
Understanding DataStore's lazy evaluation model is key to using it effectively and achieving optimal performance.
## Lazy Evaluation
DataStore uses lazy evaluation - operations are not executed immediately but are recorded and compiled into optimized SQL queries. Execution happens only when results are actually needed.
### Example: Lazy vs Eager
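A minimal sketch of the contrast, using a toy `LazyQuery` class invented for illustration (not DataStore's actual implementation):

```python
rows = [{"city": "NYC", "sales": 100}, {"city": "LA", "sales": 250}]

# Eager: each step materializes an intermediate result immediately.
eager = [r for r in rows if r["sales"] > 150]   # executes now
eager = [r["city"] for r in eager]              # executes now

class LazyQuery:
    """Toy lazy wrapper: records operations; nothing runs until collect()."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, tuple(ops)

    def filter(self, pred):
        return LazyQuery(self.data, self.ops + (("filter", pred),))

    def select(self, key):
        return LazyQuery(self.data, self.ops + (("select", key),))

    def collect(self):
        out = self.data
        for kind, arg in self.ops:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:
                out = [r[arg] for r in out]
        return out

lazy = LazyQuery(rows).filter(lambda r: r["sales"] > 150).select("city")
# No work has happened yet; execution is deferred until:
print(lazy.collect())  # ['LA']
```

In the real library, the recorded operations would be compiled into a single SQL query instead of being replayed step by step.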
### Benefits of Lazy Evaluation
- Query Optimization: Multiple operations compile to a single optimized SQL query
- Filter Pushdown: Filters are applied at the data source level
- Column Pruning: Only needed columns are read
- Deferred Decisions: Execution engine can be chosen at runtime
- Plan Inspection: You can view/debug the query before executing
## Execution Triggers
Execution is triggered automatically when you need actual values.
### Automatic Triggers
| Trigger | Example | Description |
|---|---|---|
| `print()` / `repr()` | `print(ds)` | Display results |
| `len()` | `len(ds)` | Get row count |
| `.columns` | `ds.columns` | Get column names |
| `.dtypes` | `ds.dtypes` | Get column types |
| `.shape` | `ds.shape` | Get dimensions |
| `.index` | `ds.index` | Get row index |
| `.values` | `ds.values` | Get NumPy array |
| Iteration | `for row in ds` | Iterate over rows |
| `to_df()` | `ds.to_df()` | Convert to pandas DataFrame |
| `to_pandas()` | `ds.to_pandas()` | Alias for `to_df()` |
| `to_dict()` | `ds.to_dict()` | Convert to dict |
| `to_numpy()` | `ds.to_numpy()` | Convert to NumPy array |
| `.equals()` | `ds.equals(other)` | Compare DataStores |
Examples:
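To illustrate how such triggers can work, here is a toy `LazyResult` class (invented for this sketch, not DataStore's code) whose `len()` and `repr()` hooks force execution:

```python
class LazyResult:
    """Toy lazy object: dunder accessors act as execution triggers."""
    def __init__(self, compute):
        self._compute = compute   # deferred work
        self._executed = 0        # counts how many times we actually ran

    def _execute(self):
        self._executed += 1
        return self._compute()

    def __len__(self):            # len(ds) triggers execution
        return len(self._execute())

    def __repr__(self):           # print(ds) triggers execution
        return repr(self._execute())

ds = LazyResult(lambda: [1, 2, 3])
assert ds._executed == 0   # still lazy: nothing has run yet
n = len(ds)                # trigger fires here
print(n, ds._executed)     # 3 1
```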
### Operations That Stay Lazy
| Operation | Returns | Description |
|---|---|---|
| `filter()` | `DataStore` | Adds WHERE clause |
| `select()` | `DataStore` | Adds column selection |
| `sort()` | `DataStore` | Adds ORDER BY |
| `groupby()` | `LazyGroupBy` | Prepares GROUP BY |
| `join()` | `DataStore` | Adds JOIN |
| `ds['col']` | `ColumnExpr` | Column reference |
| `ds[['col1', 'col2']]` | `DataStore` | Column selection |
Examples:
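A sketch of the idea with a toy `LazyDS` class (invented here, not DataStore's code): chaining only records a plan, and a scan counter shows the source is read exactly once, at execution time:

```python
reads = {"count": 0}

def scan():
    """Stand-in for the data source; counts how often it is touched."""
    reads["count"] += 1
    return [{"a": 3}, {"a": 1}, {"a": 2}]

class LazyDS:
    def __init__(self, source, plan=()):
        self.source, self.plan = source, plan

    def filter(self, pred):            # returns a NEW lazy object
        return LazyDS(self.source, self.plan + (("filter", pred),))

    def sort(self, key):               # still lazy
        return LazyDS(self.source, self.plan + (("sort", key),))

    def collect(self):                 # the only place work happens
        out = self.source()
        for kind, arg in self.plan:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:
                out = sorted(out, key=arg)
        return out

q = LazyDS(scan).filter(lambda r: r["a"] > 1).sort(lambda r: r["a"])
assert reads["count"] == 0                      # chaining touched nothing
assert q.collect() == [{"a": 2}, {"a": 3}]
assert reads["count"] == 1                      # one scan, at execution time
```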
## Three-Phase Execution
DataStore operations follow a three-phase execution model:
### Phase 1: SQL Query Building (Lazy)
Operations that can be expressed in SQL are accumulated into a pending query.
### Phase 2: Execution Point
When a trigger occurs, the accumulated SQL is executed in a single query.
### Phase 3: DataFrame Operations (if any)
If you chain pandas-only operations after execution, they run on the materialized DataFrame.
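The three phases can be sketched end to end with SQLite standing in for the SQL engine (the table name and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, sales INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("NYC", 100), ("LA", 250), ("SF", 300)])

# Phase 1: build SQL lazily from recorded operations.
where, order = "sales > 150", "sales DESC"
sql = f"SELECT city, sales FROM t WHERE {where} ORDER BY {order}"

# Phase 2: a trigger (here, fetching results) executes the single query.
rows = conn.execute(sql).fetchall()

# Phase 3: a pandas-only-style step runs on the materialized result.
labels = [f"{city}:{sales}" for city, sales in rows]
print(labels)  # ['SF:300', 'LA:250']
```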
## Viewing Execution Plans
Use `explain()` to see the SQL that will be executed, and pass `verbose=True` for more detail.
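The idea behind `explain()` can be sketched with a toy query builder (the `Query` class and its output format are invented here; DataStore's real output will differ):

```python
class Query:
    """Toy builder: accumulates clauses, can render SQL without running it."""
    def __init__(self, table, clauses=None):
        self.table = table
        self.clauses = dict(clauses or {})

    def filter(self, cond):
        q = Query(self.table, self.clauses)
        q.clauses["where"] = cond
        return q

    def sort(self, col):
        q = Query(self.table, self.clauses)
        q.clauses["order"] = col
        return q

    def explain(self, verbose=False):
        sql = f"SELECT * FROM {self.table}"
        if "where" in self.clauses:
            sql += f" WHERE {self.clauses['where']}"
        if "order" in self.clauses:
            sql += f" ORDER BY {self.clauses['order']}"
        if verbose:                       # extra detail in verbose mode
            sql += f"\n-- pending clauses: {self.clauses}"
        return sql

q = Query("events").filter("sales > 100").sort("ts")
print(q.explain())
# SELECT * FROM events WHERE sales > 100 ORDER BY ts
```

Because the plan is just recorded state, inspecting it costs nothing: no query runs until a trigger fires.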
See Debugging: `explain()` for complete documentation.
## Caching
DataStore caches execution results to avoid redundant queries.
### How Caching Works
### Cache Invalidation
The cache is invalidated when operations modify the DataStore.
### Manual Cache Control
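A hedged sketch of the caching behavior, using a toy `CachedDS` class (its method names are invented, not DataStore's API): the first execution populates the cache, repeated reads reuse it, and a mutation or a manual clear invalidates it.

```python
class CachedDS:
    def __init__(self, data):
        self._data = list(data)
        self._cache = None
        self.executions = 0   # counts real executions, for demonstration

    def collect(self):
        if self._cache is None:          # cache miss: execute and store
            self.executions += 1
            self._cache = sorted(self._data)
        return self._cache               # cache hit: no re-execution

    def append(self, value):             # a mutating op invalidates the cache
        self._data.append(value)
        self._cache = None

    def clear_cache(self):               # manual cache control
        self._cache = None

ds = CachedDS([3, 1, 2])
ds.collect()
ds.collect()
assert ds.executions == 1    # second call served from cache
ds.append(0)                 # invalidation
ds.collect()
assert ds.executions == 2    # re-executed after the mutation
```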
## Mixing SQL and Pandas Operations
DataStore intelligently handles pipelines that mix SQL and pandas operations.
### SQL-Compatible Operations
These compile to SQL:
- `filter()`, `where()`
- `select()`
- `groupby()`, `agg()`
- `sort()`, `orderby()`
- `limit()`, `offset()`
- `join()`, `union()`
- `distinct()`
- Column operations (math, comparison, string methods)
### Pandas-Only Operations
These trigger execution and use pandas:
- `apply()` with custom functions
- `pivot_table()` with complex aggregations
- `stack()`, `unstack()`
- Operations on executed DataFrames
### Hybrid Pipelines
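A hybrid pipeline can be sketched with SQLite standing in for the SQL engine: the aggregation compiles to SQL, while a custom per-row function (something SQL cannot express) runs afterwards in Python. The table and the `grade()` helper are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("east", 20), ("west", 5)])

# SQL-compatible part: the GROUP BY is pushed into the engine.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Pandas-only-style part: arbitrary Python applied to the small result set.
def grade(total):
    return "high" if total >= 25 else "low"

graded = [(region, total, grade(total)) for region, total in totals]
print(graded)  # [('east', 30, 'high'), ('west', 5, 'low')]
```

The important point is the ordering: the engine shrinks the data first, and the expensive row-by-row Python only ever sees the aggregated result.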
## Execution Engine Selection
DataStore can execute operations using different engines.
### Auto Mode (Default)
### Force chDB Engine
### Force pandas Engine
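The selection logic can be sketched as a simple dispatch function (the name `choose_engine`, the mode strings, and the operation set are invented for illustration):

```python
# Operations assumed (for this sketch) to compile to SQL.
SQL_OPS = {"filter", "select", "sort", "groupby", "join", "limit"}

def choose_engine(ops, mode="auto"):
    if mode in ("chdb", "pandas"):   # a forced engine always wins
        return mode
    # auto: use the SQL engine only if the whole plan can compile to SQL
    return "chdb" if all(op in SQL_OPS for op in ops) else "pandas"

assert choose_engine(["filter", "sort"]) == "chdb"     # fully SQL-compatible
assert choose_engine(["filter", "apply"]) == "pandas"  # apply() forces pandas
assert choose_engine(["apply"], mode="chdb") == "chdb" # explicit override
```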
See Configuration: Execution Engine for details.
## Performance Implications
### Good: Filter Early
### Bad: Filter Late
### Good: Select Columns Early
### Good: Let SQL Do the Work
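As a rough analogy in plain Python, filtering before a blocking operation (here, a sort) means far less data flows through the rest of the pipeline; this is what filter pushdown buys you at the SQL level:

```python
def scan():
    """Stand-in for a table scan over 10,000 rows."""
    yield from range(10_000)

# Late filter: the sort (a blocking op) must process all 10,000 rows first.
late = [i for i in sorted(scan(), reverse=True) if i < 10]

# Early filter: only 10 rows ever reach the sort.
early = sorted((i for i in scan() if i < 10), reverse=True)

assert late == early   # same answer, very different amount of work
```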
## Best Practices Summary
- Chain operations before executing - Build the full query, then trigger once
- Filter early - Reduce data at the source
- Select only needed columns - Column pruning improves performance
- Use `explain()` to understand execution - Debug before running
- Let SQL handle aggregations - ClickHouse is optimized for this
- Be aware of execution triggers - Avoid accidental early execution
- Use caching wisely - Understand when cache is invalidated