-
Notifications
You must be signed in to change notification settings - Fork 809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix struct field lookup conversions for PIVOT/UNPIVOT aliases #4868
base: main
Are you sure you want to change the base?
Fix struct field lookup conversions for PIVOT/UNPIVOT aliases #4868
Conversation
Improved the handling of struct field lookups by preventing incorrect conversion of PIVOT/UNPIVOT aliases to Dot expressions. This patch: - Adds early returns to skip columns that don't need conversion - Checks for PIVOT/UNPIVOT aliases in both current and parent scopes - Optimizes performance by checking current scope first before traversing parents - Adds robust type comparison between string and identifier objects - Preserves original conversion logic for actual struct field lookups This fixes an issue where columns referencing PIVOT/UNPIVOT aliases were incorrectly converted to Dot expressions, causing incorrect query results.
@zeta9044 could add the test into the pr? |
if ( | ||
column_table | ||
and column_table not in scope.sources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the pivot aliases just be added in scope.sources
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@georgesittas I've thought about your suggestion to add pivot aliases to scope.sources instead of my approach, but I have some concerns.
When we look at how qualification works, it processes from nested scopes upward. For complex queries, I don't believe we have a structure that can reliably identify and insert these aliases into the correct scope.sources at the right time.
My explicit checking approach ensures that we can identify PIVOT/UNPIVOT aliases regardless of scope hierarchy. While it might look more verbose, it provides more reliable handling, especially for complex nested queries where scope relationships and execution order become critical.
For your suggestion to work in all cases, we would need to guarantee that PIVOT/UNPIVOT aliases are always added to the correct scope's sources at the exact right moment - which could be difficult to ensure in complex nested query scenarios.
I believe my implementation, though more explicit, provides more predictable and robust behavior across different query complexities. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeta9044 I think the scope module should be the source of truth for what sources are available in each scope, instead of implementing lexical scoping here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeta9044 I think the scope module should be the source of truth for what sources are available in each scope, instead of implementing lexical scoping here.
@georgesittas Currently, in the code state, scope source does not include pivot/unpivot aliases. So, are you hoping that I modify the scope processing? Do you want me to modify the scope processing and change this PR to closed? Centralizing the scope logic seems like it would be difficult for me to handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@georgesittas Currently, if there's no issue with merging my PR, how about proceeding with the merge, and then you handle the scope centralization modifications?
@zeta9044 probably best to create a new method for testing pivots within |
@zeta9044 just double-checking, did you pass a schema in from sqlglot import parse_one
from sqlglot.optimizer.scope import build_scope
from sqlglot.optimizer.qualify import qualify
sql = """
SELECT
UNPVT.EMPLOYEE_ID,
UNPVT.QUARTER,
UNPVT.SALES,
CASE
WHEN UNPVT.QUARTER = 'Q1_SALES' THEN 'First Quarter'
WHEN UNPVT.QUARTER = 'Q2_SALES' THEN 'Second Quarter'
WHEN UNPVT.QUARTER = 'Q3_SALES' THEN 'Third Quarter'
WHEN UNPVT.QUARTER = 'Q4_SALES' THEN 'Fourth Quarter'
END AS QUARTER_NAME
FROM EMPLOYEE_SALES
UNPIVOT (
SALES FOR QUARTER IN (
Q1_SALES,
Q2_SALES,
Q3_SALES,
Q4_SALES
)
) UNPVT
WHERE EXISTS (
SELECT 1
FROM SALES S
WHERE S.EMPLOYEE_ID = UNPVT.EMPLOYEE_ID
AND (
(UNPVT.QUARTER = 'Q1_SALES' AND S.QUARTER = 'Q1') OR
(UNPVT.QUARTER = 'Q2_SALES' AND S.QUARTER = 'Q2') OR
(UNPVT.QUARTER = 'Q3_SALES' AND S.QUARTER = 'Q3') OR
(UNPVT.QUARTER = 'Q4_SALES' AND S.QUARTER = 'Q4')
)
)
ORDER BY UNPVT.EMPLOYEE_ID,
CASE
WHEN UNPVT.QUARTER = 'Q1_SALES' THEN 1
WHEN UNPVT.QUARTER = 'Q2_SALES' THEN 2
WHEN UNPVT.QUARTER = 'Q3_SALES' THEN 3
WHEN UNPVT.QUARTER = 'Q4_SALES' THEN 4
END;
"""
expr = parse_one(sql)
# scope = build_scope(expr)
schema = {
"employee_sales": {
"q1_sales": "int",
"q2_sales": "int",
"q3_sales": "int",
"q4_sales": "int"
},
"sales": {
"quarter": "text",
"employee_id": "id",
},
}
print(qualify(expr, schema=schema).sql(pretty=True)) This produces: SELECT
"unpvt"."employee_id" AS "employee_id",
"unpvt"."quarter" AS "quarter",
"unpvt"."sales" AS "sales",
CASE
WHEN "unpvt"."quarter" = 'Q1_SALES'
THEN 'First Quarter'
WHEN "unpvt"."quarter" = 'Q2_SALES'
THEN 'Second Quarter'
WHEN "unpvt"."quarter" = 'Q3_SALES'
THEN 'Third Quarter'
WHEN "unpvt"."quarter" = 'Q4_SALES'
THEN 'Fourth Quarter'
END AS "quarter_name"
FROM "employee_sales" AS "employee_sales"
UNPIVOT("sales" FOR "quarter" IN ("employee_sales"."q1_sales", "employee_sales"."q2_sales", "employee_sales"."q3_sales", "employee_sales"."q4_sales")) AS "unpvt"
WHERE
EXISTS(
SELECT
1 AS "1"
FROM "sales" AS "s"
WHERE
"s"."employee_id" = "unpvt"."employee_id"
AND (
(
"unpvt"."quarter" = 'Q1_SALES' AND "s"."quarter" = 'Q1'
)
OR (
"unpvt"."quarter" = 'Q2_SALES' AND "s"."quarter" = 'Q2'
)
OR (
"unpvt"."quarter" = 'Q3_SALES' AND "s"."quarter" = 'Q3'
)
OR (
"unpvt"."quarter" = 'Q4_SALES' AND "s"."quarter" = 'Q4'
)
)
)
ORDER BY
"unpvt"."employee_id",
CASE
WHEN "unpvt"."quarter" = 'Q1_SALES'
THEN 1
WHEN "unpvt"."quarter" = 'Q2_SALES'
THEN 2
WHEN "unpvt"."quarter" = 'Q3_SALES'
THEN 3
WHEN "unpvt"."quarter" = 'Q4_SALES'
THEN 4
END |
SELECT
employee_id,
quarter,
sales,
CASE
WHEN quarter = 'Q1_SALES' THEN 'First Quarter'
WHEN quarter = 'Q2_SALES' THEN 'Second Quarter'
WHEN quarter = 'Q3_SALES' THEN 'Third Quarter'
WHEN quarter = 'Q4_SALES' THEN 'Fourth Quarter'
END AS quarter_name
FROM EMPLOYEE_SALES
UNPIVOT (
sales FOR quarter IN (
Q1_SALES,
Q2_SALES,
Q3_SALES,
Q4_SALES
)
)
ORDER BY employee_id,
CASE
WHEN quarter = 'Q1_SALES' THEN 1
WHEN quarter = 'Q2_SALES' THEN 2
WHEN quarter = 'Q3_SALES' THEN 3
WHEN quarter = 'Q4_SALES' THEN 4
END;
SELECT
unpvt.employee_id,
unpvt.quarter,
unpvt.sales,
CASE
WHEN unpvt.quarter = 'Q1_SALES' THEN 'First Quarter'
WHEN unpvt.quarter = 'Q2_SALES' THEN 'Second Quarter'
WHEN unpvt.quarter = 'Q3_SALES' THEN 'Third Quarter'
WHEN unpvt.quarter = 'Q4_SALES' THEN 'Fourth Quarter'
END AS quarter_name
FROM EMPLOYEE_SALES
UNPIVOT (
sales FOR quarter IN (
Q1_SALES,
Q2_SALES,
Q3_SALES,
Q4_SALES
)
) unpvt
ORDER BY unpvt.employee_id,
CASE
WHEN unpvt.quarter = 'Q1_SALES' THEN 1
WHEN unpvt.quarter = 'Q2_SALES' THEN 2
WHEN unpvt.quarter = 'Q3_SALES' THEN 3
WHEN unpvt.quarter = 'Q4_SALES' THEN 4
END;
-- The first query and the second query return the same results. Since it's a query built on a single table, there's no issue with qualifying columns regardless of whether the schema is provided or not.
-- However, when qualification is applied, the first query and the second query return completely different results, which became the reason for creating this PR. |
- Restructured test_qualify_columns_with_pivots to use parameterized tests - Added separate expected outputs for cases with and without schema parameter - Addressed UNPIVOT column qualification differences when schema is provided - Reduced code duplication by using a 3-tuple test case structure - Improved maintainability for future test additions - Added clear comments to explain test case organization
@tobymao @georgesittas DBMS Supports Qualified Column Names in UNPIVOT 'in' Clause Notes |
yes, it is the intent of the optimizer to always produce correct runnable sql |
Hey appreciate the patience, I'll take another look at this soon. Been a bit busy these days. |
Yes, the optimizer should produce valid SQL as Toby said. But I'm not sure which tests you're referring to. The optimizer expects a On a separate note, I'd suggest providing shorter examples and including the actual code that you use to reproduce the problematic behaviors so we can accurately test it as well. The snippets so far are quite noisy, so it's hard to iterate with them as inputs. |
@tobymao @georgesittas The following block is errors in optimizer.sql of test fixture. # from 702 line
# title: unpivoted table source, unpivot has column aliases
# execute: false
SELECT * FROM (SELECT * FROM m_sales) AS m_sales(empid, dept, jan, feb) UNPIVOT(sales FOR month IN (jan, feb)) AS unpiv(a, b, c, d);
SELECT
"unpiv"."a" AS "a",
"unpiv"."b" AS "b",
"unpiv"."c" AS "c",
"unpiv"."d" AS "d"
FROM (
SELECT
"m_sales"."empid" AS "empid",
"m_sales"."dept" AS "dept",
"m_sales"."jan" AS "jan",
"m_sales"."feb" AS "feb"
FROM "m_sales" AS "m_sales"
) AS "m_sales"
UNPIVOT("sales" FOR "month" IN ("m_sales"."jan", "m_sales"."feb")) AS "unpiv"("a", "b", "c", "d");
# from 741 line
# title: unpivoted table source with a single value column, unpivot columns can be qualified
# execute: false
# dialect: bigquery
# note: the named columns aren not supported by BQ but we add them here to avoid defining a schema
SELECT * FROM produce AS produce(product, q1, q2, q3, q4) UNPIVOT(sales FOR quarter IN (q1, q2, q3, q4));
SELECT
`_q_0`.`product` AS `product`,
`_q_0`.`quarter` AS `quarter`,
`_q_0`.`sales` AS `sales`
FROM `produce` AS `produce`
UNPIVOT(`sales` FOR `quarter` IN (`produce`.`q1`, `produce`.`q2`, `produce`.`q3`, `produce`.`q4`)) AS `_q_0`;
# from 753 line
# title: unpivoted table source with multiple value columns
# execute: false
# dialect: bigquery
SELECT * FROM produce AS produce(product, q1, q2, q3, q4) UNPIVOT((first_half_sales, second_half_sales) FOR semesters IN ((Q1, Q2) AS 'semester_1', (Q3, Q4) AS 'semester_2'));
SELECT
`_q_0`.`product` AS `product`,
`_q_0`.`semesters` AS `semesters`,
`_q_0`.`first_half_sales` AS `first_half_sales`,
`_q_0`.`second_half_sales` AS `second_half_sales`
FROM `produce` AS `produce`
UNPIVOT((`first_half_sales`, `second_half_sales`) FOR `semesters` IN ((`produce`.`q1`, `produce`.`q2`) AS 'semester_1', (`produce`.`q3`, `produce`.`q4`) AS 'semester_2')) AS `_q_0`; above block, it's needed to valid sql although dialect. # from 702 line
# title: unpivoted table source, unpivot has column aliases
# execute: false
SELECT * FROM (SELECT * FROM m_sales) AS m_sales(empid, dept, jan, feb) UNPIVOT(sales FOR month IN (jan, feb)) AS unpiv(a, b, c, d);
SELECT
"unpiv"."a" AS "a",
"unpiv"."b" AS "b",
"unpiv"."c" AS "c",
"unpiv"."d" AS "d"
FROM (
SELECT
"m_sales"."empid" AS "empid",
"m_sales"."dept" AS "dept",
"m_sales"."jan" AS "jan",
"m_sales"."feb" AS "feb"
FROM "m_sales" AS "m_sales"
) AS "m_sales"
UNPIVOT("m_sales"."sales" FOR "m_sales"."month" IN ("jan", "feb")) AS "unpiv"("a", "b", "c", "d");
# from 741 line
# title: unpivoted table source with a single value column, unpivot columns can be qualified
# execute: false
# dialect: bigquery
# note: the named columns aren not supported by BQ but we add them here to avoid defining a schema
SELECT * FROM produce AS produce(product, q1, q2, q3, q4) UNPIVOT(sales FOR quarter IN (q1, q2, q3, q4));
SELECT
`_q_0`.`product` AS `product`,
`_q_0`.`quarter` AS `quarter`,
`_q_0`.`sales` AS `sales`
FROM `produce` AS `produce`
UNPIVOT(`produce`.`sales` FOR `produce`. `quarter` IN (`q1`, `q2`, `q3`, `q4`)) AS `_q_0`;
# from 753 line
# title: unpivoted table source with multiple value columns
# execute: false
# dialect: bigquery
SELECT * FROM produce AS produce(product, q1, q2, q3, q4) UNPIVOT((first_half_sales, second_half_sales) FOR semesters IN ((q1, q2), (q3, q4)));
SELECT
`produce`.`product` AS `product`,
`_q_0`.`semesters` AS `semesters`,
`_q_0`.`first_half_sales` AS `first_half_sales`,
`_q_0`.`second_half_sales` AS `second_half_sales`
FROM `produce` AS `produce`
UNPIVOT((`produce`.`first_half_sales`, `produce`.`second_half_sales`)
FOR `produce`.`semesters`
IN ((`q1`, `q2`), (`q3`, `q4`))) AS `_q_0`; |
Fix struct field lookup conversions for PIVOT/UNPIVOT aliases
Issue Description
The
_convert_columns_to_dots
function in the qualify module incorrectly converts column references to PIVOT/UNPIVOT aliases into Dot expressions. This causes inappropriate conversion of SQL queries involving PIVOT/UNPIVOT operations and leads to incorrect query results.Approach
The core of this fix involves:
The detection logic first checks the current scope for efficiency, and only traverses parent scopes when necessary, with robust type comparison between string and identifier objects.
Tests
Test cases covering the following scenarios verify the correctness of the fix:
before code
changed
Dependency for Lineage Module Enhancement
This fix is a prerequisite for adding full UNPIVOT support in the lineage.py module. The current incorrect conversion of PIVOT/UNPIVOT aliases prevents proper column lineage tracking when working with UNPIVOT operations.
Future Work
Following the acceptance of this PR, I plan to submit another PR that will provide integrated lineage support for both PIVOT and UNPIVOT operations, enabling complete column lineage tracking through these complex transformations.