Discussion:
Oracle RAC - Any formula to calculate/preempt the right number of LMS processes?
PD Malik
14 years ago
Permalink
Hello,

We are on 10gR2 5 node RAC (a huge and very busy DB) and also very high
number of CPUs (80ish) hence the default number of LMS processes that Oracle
generates for us (num_cpu dependant) is way too high and is causing CPU
consumption issue (due to their high priority) for some time now so we need
to look to reduce (and fix) them to a sensible number and going by RAC
expert Riyaz's advice we shud reduce them down to 4 (one for each 'other'
RAC node) which makes sense to us particularly becs if we were going 11g and
left this parameter to Oracle's default calculation it'll come down to 3.
(Current number of LMS processes : (Due to the varied number of CPUs ) - On
2 nodes, 10, on 1 12 and on the other 2 : 8).

Moreover, we've done the analysis and there are no major RAC wait events (as
in we are not much suffering from RAC) and we "think" we should be fine by
reducing these and obviously we'll take them thru to Performance Test to
gain some confidence.

However the question is, is there any other mathematical way of getting to
just the right number of LMS processes for a RAC instance pls? Is there any
formula to predict them or even to predict if there will be any latency
impact by reducing these?

Thanks in advance.
K Gopalakrishnan
14 years ago
Permalink
Malik,

Are you running them in real time priority? If yes, you can reduce it to 4
or (# of instances-1) and monitor the latencies of the RAC wait events/CPU
usage of the LMS processes. AFAIK there is no magic formula for
calculating the # of LMSs.

One of the most common practices is to schedule the number of RT processes <
# of CPUs to avoid spin/hang on CPUs. If your current settings works fine
for your workload, I would stick to it.

-Gopal
...
Riyaj Shamsudeen
14 years ago
Permalink
Hello Malik

I am no RAC expert, seems like a ocean more to learn.

I would agree with KG's comments with one enhancement. I would go with an
odd number, 3 or 5, prior to 11g version. There is an internal bug (10.2)
where the distribution of workload is non-uniform if # of LMS processes is
an even number. ( I know, it is weird, but I don't have any more details,
can't reproduce it, a RAC developer told me this few years ago, and seems
like it is worth a risk to take making it a prime number. This comment does
not apply to 11g and above. )

As you have figured out, 11.2 by default, sets this parameter to more
meaningful value. If you are looking for an algorithm, why not use 11.2
default algorithm, then? In this case, for 80 CPUs, formula will be
2+(80/32)= 4.5, round that up to 5.So, set gcs_server_processes to 5.


http://download.oracle.com/docs/cd/E14072_01/server.112/e10820/initparams088.htm

Remember that just because LMS is running in RT priority (Solaris lingo)
doesn't mean that it consumes CPU. RT simply bumps the LMS process to the
front of the CPU run queue for immediate scheduling, if LMS needs to use
CPU. You should also review to understand why LMS consuming CPU? CR
fabrication or Current mode buffer creation (most common reasons) or is it
due to some other reason?

Cheers

Riyaj Shamsudeen
Principal DBA,
Ora!nternals - http://www.orainternals.com - Specialists in Performance,
RAC and EBS11i
Blog: http://orainternals.wordpress.com
OakTable member http://www.oaktable.com

Co-author of the books: Expert Oracle
Practices<http://tinyurl.com/book-expert-oracle-practices/>,
Pro Oracle SQL, Expert PL/SQL
Practices<http://tinyurl.com/book-expert-plsql-practices>
...
PD Malik
14 years ago
Permalink
Riyaj & Gopal - Thanks for responding to my query.

Riyaj Bhai - You are an expert - no denying there - for yourself you may
have the entire ocean to cover for people like me who are still learning to
swim - you are an expert :-)

Just to answer your & Gopal's Que on RT to start with : We are on 10gR2 on
AIX 6 and LMS is in _high_priority_processes and _os_sched_high_priority is
also set to default 1 so I'd assume that it is set on high priority.

Gopal, Is this somehow can be done in AIX as well pls do you know?
"One of the most common practices is to schedule the number of RT processes
< # of CPUs to avoid spin/hang on CPUs."
and also how to check whether these settings are working fine or not?

A bit more explanation of our CPU behviour (very broadly) - We've seen that
in past some months our CPU usage has gone very high compared to the
incremental user load on the system and our capacity management departement
has highlighted LMS being one of the top consumers (amongst other
candidates). Its not that it always used to behave earlier no apparanely its
always been bad and has come to our attention now because of this CPU issue.

While also looking at other things it seems that this LMS setting is anyway
too high and looks like Oracle is 'correcting' it in 11g so this is
something we shuold look at doing anyway.

I had a bit of google digging on this CR Fabrication lead that you
provided (Thanks to Gopal and his answer he posted in one of the forums)
and this is probably you meant :

SQL> select inst_id, cr_requests, light_works, data_requests,
fairness_down_converts, (fairness_down_converts/data_requests)*100
INST_ID CR_REQUESTS LIGHT_WORKS DATA_REQUESTS FAIRNESS_DOWN_CONVERTS
RATIO
---------- ----------- ----------- ------------- ----------------------
----------
2 58831118 74411475 58831114 9720672
16.523012
1 350485652 884185071 350485608 59529981
16.9850001
6 14175101 20025830 14175101 2019579
14.2473694
4 13457882 19142318 13457880 1960022
14.5641215
3 440964324 909842437 440964274 79782554
18.0927478

And the ratio looks OK to me (less than 40% - even though aggregated but
still indicative I think) and it seems there isn't much of CR Fabrication
going on BUT I may be looking totally at wrong thing and if thats the case
then I'd request if you can point me to any writeups/whitepaperts etc. that
I can go thru to understand it a bit better - that'll be a great help.

I know, to know RAC I'll need to read Gopal's book which is already in my To
Do list but unfortunately thats not an overnight thing.

So our current approach is take these new settings (5 most probably) thru a
Performance Test and see how these settings are bahiving by looking and gc
waits and stats. I know its not the best approch as its 'trying to fix'
without getting to the root cause (due to lack of total mastery in this
area) but hopefully a performance test will give us a good confidence. Any
comments about our approach pls?

The only other thing that I've been thinking is that in 11g as you know
remastering is at file level instead on object level so is it that this file
block remastering is more 'efficient' than object remastering hence Oracle
has massively downsized the formula to calculate the default number of LMS?
This along with other optimizations in building/pinning/sending a block etc.
in 11g might have reduced the pressure on LMS hence the lower number may be
OK in 11g but it may not be a good idea in 10g yet? Just thinking ...

Sorry this email has gone a bit long ..

Thanks.


On Tue, Jun 21, 2011 at 5:14 PM, Riyaj Shamsudeen <
...
PD Malik
14 years ago
Permalink
And you say ... you are not an expert ;-)

Thanks Riyaj - You've given me the best possible leads that I could ever get
! Really appreciated! I'll get on with the job now ...

Many Thanks again!

On Wed, Jun 22, 2011 at 8:09 PM, Riyaj Shamsudeen <
...
Tanel Poder
14 years ago
Permalink
I disagree with you, Riyaj - you *are *a RAC expert (and there's no way to
escape that ;-)

--
Tanel
http://blog.tanelpoder.com

On Tue, Jun 21, 2011 at 7:14 PM, Riyaj Shamsudeen <
Post by Riyaj Shamsudeen
Hello Malik
I am no RAC expert, seems like a ocean more to learn.
Loading...