Discussion:
improve cardinality estimate
Laurentiu Oprea
2021-03-18 13:51:14 UTC
Permalink
Regarding parallel: to confirm what I said, I tested with the hint /*+ dynamic_sampling(8) */ (the DS level that comes with the parallel hint) and it gives the same estimate.

Yes, there is a relationship: order_procesing_date lies between
business_procesing_date - 1 and business_procesing_date + 1, and there is no
fooling such as NULL values or year 1900/9999.

The pattern is constant: grouping by partition_date for the last 7 days,
the count of rows matching the condition
order_procesing_date >= business_procesing_date ranges from 0 to 130K
(out of a total of 24M).

Also, the columns are not virtual columns (strange expressions).
On Thu, 18 Mar 2021 at 15:14, Jonathan Lewis wrote:

The change for parallel processing is one I'd have to think about - it may
be something to do with the way the predicates are rewritten for the
parallel query slaves - so I'll postpone thinking about that bit.
The questions to ask first of all are:
a) What's the business logic? Is there an approximate relationship
between order_processing_date and business_processing_date (e.g. the former
is usually about 3 days before the latter)?
b) Has the optimizer been fooled by a "silly null" effect - like "not yet
processed" = "31-Dec-9999"?
c) How much does the pattern change over time? E.g. if the partition date
is sysdate - 1, is this a partition where a lot of rows have not yet had
business_processing_date set, but if partition date = sysdate - 7 then
maybe lots of rows now have a business_processing_date set?
Regards
Jonathan Lewis
On Thu, 18 Mar 2021 at 12:26, Laurentiu Oprea wrote:

Hello,
I recently saw a query with a wrong execution plan, and it looks to be
caused by an overestimate for a particular where clause. For example:
select
    count(*)
from
    orders
where
    partition_date = sysdate - 1 and
    order_procesing_date >= business_procesing_date
the count itself is 42K
the count without the second condition: ~24M
the cardinality estimate: ~24M
the cardinality estimate if I add a parallel hint: 1M (this is an
interesting one, and looks to be related to the fact that parallel queries
get DS level 8)
the cardinality estimate if I rewrite the condition as
order_procesing_date - business_procesing_date >= interval '0' minute: 1.4M
I haven't tried it yet, but I suspect that creating extended statistics on
the expression (order_procesing_date - business_procesing_date) could get
me better results.
The question is: what is the logic behind the better cardinality estimates
when I rewrite the where clause?
What would be the best approach for this query?
Thank you.
m***@tempus-consulting-group.com
2021-03-18 15:09:19 UTC
Permalink
Jonathan,
Is creating a virtual column and then gathering a histogram on it different/better than gathering expression statistics via something like:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    user, 'ORDERS',
    method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY FOR COLUMNS (order_processing_date - business_processing_date) SIZE SKEWONLY'
  );
END;
?



On Thursday, March 18, 2021, 10:31:16 AM EDT, Jonathan Lewis <***@gmail.com> wrote:

Are all these dates date-only, or is there a time component allowed? Is this one partition per day?

From what you've said, and assuming YES as the answer to the questions above, creating a virtual column on (order_processing_date - business_processing_date) looks like an obvious thing to do. And then create a (frequency) histogram on the virtual column so that the optimizer can see there are (usually?) only 3 distinct values and the one you're interested in is rare.

Regards
Jonathan Lewis

Jonathan Lewis
2021-03-18 16:21:00 UTC
Permalink
In many ways there's not a lot to choose between them. (You can find
examples where the optimizer misbehaves in either case.)
As a general principle, though, I prefer to keep the "extended stats"
method available for "column group" statistics and use virtual columns for
user-visible expressions.

In passing, I wouldn't use "for all columns skewonly" and generally I'd
avoid skewonly (or anything else that allowed the code to decide whether or
not to create a histogram).

Regards
Jonathan Lewis
Laurentiu Oprea
2021-03-19 07:59:02 UTC
Permalink
order_procesing_date and business_procesing_date are timestamps.
partition_date is a DATE virtual column defined as
nvl(trunc(business_procesing_date), 01-01-1000).

I'm curious why rewriting a >= b as a - b >= 0 helped to get a closer
cardinality estimate.
Jonathan Lewis
2021-03-19 09:03:32 UTC
Permalink
Imagine a few models of the data (sticking with dates only to make it
easier to think about):
e.g. a table with 365 rows: the business_processing_date (bpd) ranges from
1st Jan to 31st Dec, and the order_processing_date (opd) also ranges from
1st Jan to 31st Dec. In how many cases is opd > bpd?
Scenario 1: for two rows (1st Jan, 31st Dec) they match; for every other row opd = bpd + 1.
Scenario 2: for two rows (1st Jan, 31st Dec) they match; for every other row opd = bpd - 1.
Scenario 3: for every row opd = bpd.

So, looking at the stats for the two separate columns, the optimizer has
no idea of the relationship.
Now consider the same scenarios and work out opd - bpd:

Scenario 1: 2 rows = 0, 363 rows = +1
Scenario 2: 2 rows = 0, 363 rows = -1
Scenario 3: 365 rows = 0

The pattern is now very obvious to the optimizer.

The same principle applies with timestamps - you get a range of
non-integral values, but it's easy for the optimizer to estimate the
fraction where the difference is greater than zero, equal to zero, or less
than zero.
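To see why the rewrite helps, the three scenarios above can be simulated outside Oracle. This is a hypothetical sketch in Python (not Oracle's actual code): the per-column min/max statistics come out identical in all three cases, while a frequency histogram on opd - bpd distinguishes them immediately.

```python
from collections import Counter
from datetime import date, timedelta

# 365 rows, one business_processing_date (bpd) per day of a year.
bpd = [date(2021, 1, 1) + timedelta(days=i) for i in range(365)]

def shifted(days):
    # opd = bpd + days, except the two endpoint rows, where opd = bpd.
    return [b if i in (0, 364) else b + timedelta(days=days)
            for i, b in enumerate(bpd)]

scenarios = {
    "scenario 1 (opd = bpd + 1)": shifted(+1),
    "scenario 2 (opd = bpd - 1)": shifted(-1),
    "scenario 3 (opd = bpd)": list(bpd),
}

for name, opd in scenarios.items():
    # Per-column stats the optimizer keeps: identical for all scenarios.
    minmax = (min(opd), max(opd))
    # Reality: how many rows actually satisfy opd > bpd?
    true_rows = sum(o > b for o, b in zip(opd, bpd))
    # A frequency histogram on the derived value (opd - bpd), in days:
    diff_hist = Counter((o - b).days for o, b in zip(opd, bpd))
    print(name, minmax, true_rows, sorted(diff_hist.items()))
```

The difference histograms come out as {0: 2, +1: 363}, {-1: 363, 0: 2} and {0: 365}, so the selectivity of opd > bpd (equivalently opd - bpd > 0) can be read straight off the histogram, whereas min and max per column cannot tell the scenarios apart.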

Two general points about your partitioning: you ought to have an explicit
to_date() with a format mask for your '01-01-1000'; and if you really need
to represent a null through a faked value, you ought to pick it as close to
the real data as possible to minimise the probability of the optimizer
producing very bad estimates for range-based predicates.
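The size of that effect follows from the optimizer's simplest range-predicate model, sel(col >= c) ≈ (high - c) / (high - low), when there is no histogram. A rough illustration (the one-year span of real data is an assumption made up for the example):

```python
from datetime import date

def ge_selectivity(low, high, value):
    # Simplified no-histogram model: values assumed uniformly spread in [low, high].
    return (high - value).days / (high - low).days

high = date(2021, 3, 18)       # newest partition_date in the data
pred = date(2021, 3, 11)       # predicate: partition_date >= sysdate - 7

real_low = date(2020, 3, 18)   # assumed: the real data spans about one year
fake_low = date(1000, 1, 1)    # the faked-null value from nvl(..., 01-01-1000)

sel_real = ge_selectivity(real_low, high, pred)   # about 7/365
sel_fake = ge_selectivity(fake_low, high, pred)   # about 7/373000
print(sel_real, sel_fake, sel_real / sel_fake)
```

The faked low value stretches the assumed column range by a factor of about a thousand, and the estimate for the last week's rows shrinks by the same factor.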

It's also worth thinking about the impact on interval partitioning - if
you've set this up interval partitioned by day then you MAY already have
used up about 365,000 of the available partitions (1000 years * 365 days),
which MIGHT show up as some unexpected side effects on (e.g.) the
dictionary cache or library cache. (It would also be quite interesting to
see what Oracle does about dates in the 11 days when the switch from the
Julian to the Gregorian calendar took place.)
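A quick check of that partition arithmetic (using Python's proleptic Gregorian datetime, which is also why the Julian/Gregorian switch-over days don't appear in it; the limit of 1,048,575 partitions per table is Oracle's documented maximum):

```python
from datetime import date

MAX_PARTITIONS = 1024 * 1024 - 1   # Oracle's limit: 1048575 (sub)partitions per table

# Daily interval partitioning anchored at the faked null date implies roughly
# one potential partition per day between 1000-01-01 and the current data:
days_spanned = (date(2021, 3, 18) - date(1000, 1, 1)).days
print(days_spanned, days_spanned / MAX_PARTITIONS)
```

That comes to roughly 373,000 days, consistent with the "about 365,000" figure: more than a third of the partition limit is spanned before the first row of real data.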

Regards
Jonathan Lewis
pier paolo Bruno
2021-03-18 12:39:08 UTC
Permalink
Are there histograms on orders.order_procesing_date?

Laurentiu Oprea
2021-03-18 12:50:24 UTC
Permalink
partition_date, order_procesing_date and business_procesing_date are all
columns of the orders table with (hybrid) histograms.