
feat: BigQuery as OLAP engine #9161

Open
k-anshul wants to merge 11 commits into main from bigquery_olap

Conversation

Member

@k-anshul k-anshul commented Apr 1, 2026

closes https://linear.app/rilldata/issue/PLAT-450/metrics-views-on-bigquery

Added

TODOs to be addressed in follow-ups:

  • Exports are broken
  • Remove the conversion of civil.Date to time.Time in the Rill driver and handle the type wherever required

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@k-anshul k-anshul self-assigned this Apr 1, 2026

rangeSQL := fmt.Sprintf(
"SELECT min(%[1]s) as `min`, max(%[1]s) as `max`, %[2]s as `watermark` FROM %[3]s %[4]s",
Member Author


This is not an efficient query, even when it runs on the partition column.

Member Author


An optimization could check whether this is the table's partition column and, if so, read the min/max directly from the partition metadata.
Given that this is a frequently executed query, I think it can be done in a follow-up. @begelundmuller thoughts?

Contributor


If the optimization can be done in a fast/cheap/safe way, then yeah it sounds good to me

@k-anshul k-anshul requested a review from begelundmuller April 2, 2026 13:08
Comment on lines +58 to +61
// MaxBytesBilled is the maximum number of bytes billed for a query. This is a safety mechanism to prevent accidentally running large queries.
// Set this to -1 for project defaults.
// Only applies to dashboard queries and does not apply when ingesting data from BigQuery into Rill.
MaxBytesBilled int64 `mapstructure:"max_bytes_billed"`
Contributor


Is it normal for BI tools to offer this for BigQuery? If we do, I think it should be much higher than 10 GB by default (which at the high usage-based list price is only $0.0625).

Another danger of setting a default limit is that it may break existing projects that use a BigQuery query for partition discovery.
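The "-1 means project defaults" semantics from the doc comment could be kept out of the query path with a tiny helper like the sketch below (a hypothetical function, not code from the PR). The Go BigQuery client documents that a MaxBytesBilled value less than 1 falls back to the project default, so mapping -1 to 0 before assigning it to the query config preserves the intended behavior.

```go
package main

import "fmt"

// resolveMaxBytesBilled maps the connector config value to what gets assigned
// to the BigQuery query config: -1 (or any negative) falls through to the
// project defaults, since the Go client treats values < 1 as "use project
// default"; positive values become a hard per-query billing cap.
// Hypothetical helper sketched for this discussion.
func resolveMaxBytesBilled(cfg int64) int64 {
	if cfg < 0 {
		return 0 // project defaults
	}
	return cfg
}

func main() {
	fmt.Println(resolveMaxBytesBilled(-1), resolveMaxBytesBilled(10<<30))
}
```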

ctx, cancel := context.WithTimeout(ctx, drivers.DefaultQuerySchemaTimeout)
defer cancel()

res, err := c.Query(ctx, &drivers.Statement{
Contributor


Did you check if BigQuery's dry run feature can be used for this? IIRC they have some of the best dry run support of any database.

Comment on lines +596 to +599
if d == DialectBigQuery && isFullJoin {
// BigQuery requires plain equality for FULL joins
// TODO: find a better way to handle this
return fmt.Sprintf("coalesce(CAST(%s AS STRING), '__rill_sentinel__') = coalesce(CAST(%s AS STRING), '__rill_sentinel__')", lhs, rhs)
Contributor


Does this TODO need to be addressed?

(Since optimizing small things like this for BigQuery perhaps doesn't really make sense, consider removing the isFullJoin param and just using this approach for the BigQuery dialect for any join type)

Comment on lines +873 to +874
case DialectBigQuery:
// BigQuery uses UNION ALL for generating time series

Comment on lines +944 to +951
// BigQuery converts time.Time type to TIMESTAMP which is not compatible with DATE type dimensions
// so we need to convert it back to civil.Date if the dimension type is DATE
// TODO: remove conversion of civil.Date in the rill driver and handle it wherever required and remove this conversion here
if result.Schema.Fields[0].Type.Code == runtimev1.Type_CODE_DATE {
if t, ok := dimVal.(time.Time); ok {
dimVal = civil.DateOf(t)
}
}
Contributor


Does this TODO need to be addressed?

Did you consider moving this conversion into the BigQuery driver (i.e. iterate over the args passed to Query and Exec?)

Comment on lines +643 to +645
// Store the time dimension's data type so it's available for downstream queries (e.g. Schema validation).
e.metricsView.Dimensions = append(e.metricsView.Dimensions, &runtimev1.MetricsViewSpec_Dimension{
Name: e.metricsView.TimeDimension,
Contributor


My previous comment applies here as well. I thought we already had this

Comment on lines +536 to +538
if d == DialectBigQuery && typeCode == runtimev1.Type_CODE_DATE {
return "DATE(?)"
}
Contributor


Is this also needed if you cast to civil in Query? Or is there any other way we could push this into the driver and avoid knowing the time type in advance?

ctx := t.Context()

_, currentFile, _, _ := goruntime.Caller(0)
projectPath := filepath.Join(currentFile, "..", "testdata", "ad_bids_bigquery")
Contributor


The testruntime/testdata directory is only used for legacy tests. Is it really necessary here? Or can the same be accomplished with a normal NewInstanceWithOptions call like we use for new tests that defines the file(s) needed for the specific test inline?


inst := &drivers.Instance{
Environment: "test",
OLAPConnector: "bigquery",
Contributor


The hard-coded OLAP connector option predates the time when we supported changing the OLAP connector in rill.yaml. It shouldn't be necessary in new tests – can you just put olap_connector: bigquery in the test's rill.yaml?
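Per the suggestion, the test project's rill.yaml could declare the connector directly instead of the hard-coded instance option (a minimal sketch; any other keys the test needs would sit alongside it):

```yaml
# rill.yaml for the test project
olap_connector: bigquery
```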

@@ -180,33 +181,157 @@ func (q *TableHead) generalExport(ctx context.Context, rt *runtime.Runtime, inst
}

func (q *TableHead) buildTableHeadSQL(ctx context.Context, olap drivers.OLAPStore) (string, error) {
Contributor


It seems like there's a huge complexity increase in this function. Two questions:

  1. We don't run TableHead very often, so is it necessary to optimize it so hard? In general, I would assume people who connect a BI tool to a data warehouse are fine with a SELECT * FROM tbl LIMIT 100 query being run.
  2. If it really is necessary, is it possible to combine it into one nested query and push it into the dialect somehow?
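The simple baseline from point 1 can be sketched as below: a plain head query with BigQuery's backtick quoting and no per-column optimization. A hypothetical helper for comparison, not code from the PR.

```go
package main

import "fmt"

// simpleTableHeadSQL builds the straightforward head query suggested in
// point 1: select everything, limit the rows, and let BigQuery do the rest.
func simpleTableHeadSQL(project, dataset, table string, limit int) string {
	return fmt.Sprintf("SELECT * FROM `%s.%s.%s` LIMIT %d", project, dataset, table, limit)
}

func main() {
	fmt.Println(simpleTableHeadSQL("my-project", "my_dataset", "events", 100))
}
```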

