Tip:
Highlight text to annotate it
X
My name is Attilio Rao and
I think that we are in time for the presentation
I want to ask sorry for my English because it's not really British English but I will
try to make this
a little bit uncomfortable
Better?
Ok.Thank you.So we are going to speak about the
the locking infrastructure in the FreeBSD kernel which
is a bit interesting topic because
Its going to be with time very widely discussed on our mailing list not only
from developer's perspective but even from user's perspective.
and we will see why later
In this presentation we will specifically see what
was the situation
of the first
FreeBSD implementations
and what changed from that
what specifically what's called the SMPng era
and what
we had prior that
we are going to discuss
specifically
locking primitives that has been introduced with time until now
and
problems linked to
parellelism in general and how we solve that in
the FreeBSD kernel
You can see a table of content a little bit more detailed as
listing precisely what we
some problems like
Priority Inheritance
and Adaptive Spinning that we are going to discuss fruitfullly.
Mostly until FreeBSD 4.x
We had already moved to multitasking.
so the slide is a little bit confusing but multitasking and preemptive system
since
that transition was not very
was not very difficult to implement in such systems because
if you can see then our uniprocessor machine
you can get that
well
the sequential execution was just
stopped by
preemption
and by arrival of interrupts
so you should adjustment in consistency of data structures
about these two issues
more precisely
we were handling
the interrupts and transitions through
a mechanism
called SPL
and for kernel threads, threads running in the kernel we were disabling
preemption
in order to avoid
the corruption of the data structure
This approach while was
pretty good on uniprocessor machines
was actually
impredictable for
the SMP environments
more precisely because we had more coures that
was running thread per time
and so
parallel
accesses to the data structures were possible
in order to
to avoid
big problems in the kernel
we have to just
allow
the entering of
one thread per time into kernel
while that was a pretty good approach
for workloads that were nearly user space
for work loads
requiring a lot of IO for example they were wateful because they wasn't
getting any advantage from the new
SMP architecture
like the parallelism was basically zero
at least in the kernel
in order
to fix that a new project was created called SMP
New generation
or NG
as you can see it from the slide
the entering in the kernel was preempted
by using Big Lock
called BKL basically
With FreeBSD 5.x we had the SMP new generation project
basically it was
a sanitization of the
of all
our kernel and the engineering over lot of mechanism inside our kernel. We could see it
FreeBSD 4.x and FreeBSD 5.x as mainly two different kernels
because of
substantial subsystem were rewritten and
were written with the
idea to use and implement a real parallelism in mind.
we can say that basically it was a major task a very big task
and that it required
a lot of years to be brought
in a good shape at least
In Italy, the people gave
a lot of
complaining about the
un-robustness of FreeBSD 5.x but
probably that's because they couldn't even
see that the changes were really really important and really huge
however for FreeBSD 5.x based this initial SMP system
inheriting from BSD/OS
that kindly
released this code above that
and the
the process was break up in
some precise tasks at least in Italy
Mainly the first things was introducing in the kernel
new set of atomic instruction and locking primitives
Then introducing
an abstraction called interrupt threads that we are going to discuss
rather later but
it was basically restored completely the interrupt mechanism that was in the FreeBSD 4.x
the the BKL
lock was moved to a real
mutex called Giant
that still exists in our kernel
and they were introduced some threading primitives
the
like and and on and
threading primitives
called also KSE
which are actually never used in our kernel
and that being their being exit out in the past year
and the
slowly of the porting of
all the older subsystems to a finer locking was started
I have to say this task is not still completed, its still going on but
we are really good shape about that
just few subsystems remain which are still Giant protected
and with new release that we're going to ship this year, I think that we made
a very huge step forward in this direction
really the the SMPng has been considered closed around the end of
2007
but the
the
the important parts where this initial moving
I rather thing that's not listed here but I can tell you is that
even that if Giant was preventing any parallelism initial parallelism
that were imported new kernel memory allocator that was
that I discovered
and the scheduler was move with a seperate lock
in order to
start getting some
a little bit of concurrency
a real concurrency
the
before to speak about FreeBSD specifics we can start digging in about
what kind of
of locking primitives you can find in our kernel.
from a more historical point of view
we have some versions of mutex which
I assume
people here knows about that but I'm going to give a little explanation
for people that doesn't know
a mutex is basically
a lock allowing to access
to some protected data's thread to just one thread per time
so if a thread
owns the lock,
owns the mutex
other threads
won't be able to
to access to this until
this lock is released
we offer even
some kind of locks called R/W lock Read/Write lock
which are basically a
locks that can be acquired in two different versions
one version
is the write lock which is the same as the mutex just one
in the protected part per time
and other one is the read mode which basically
allows
all the thread willing to acquire to read mode to
concurrently adjust to the structure but prevents the threads from
writing nto the protected path.
while the reader..while they are readers
then we have even
the locks called the Read Mostly
which are basically the same of Read/Write Locks but are
they have some optimization in order to make the Read
part be really fast
and to have like
zero overhead
zero overhead kind of lock
from the read path while
probably the write path is even
heavier than the other one but if you think about cases that
just
where
there are a lot of reader chases and very few writer chases you can find that a
very useful
very useful primitive
then we have
some form of Wait channels
Wait channels
basically are what
generalizations of what other people con call like
condition variable and
they basically let that thread sleep
under some conditions that are
that are previously started with some
some variables
usually
having a Wait channel means that its
chases are controlled through another locking primitive like a mutex
or R/Wlock
and so often the Wait channel is associated to its
to its locking primitive
usually if you have no necessity to use a Wait channel without a primitive
a locking primitive you probably have bad code
but there are some edge cases
with that seem possible
As last thing FreeBSD sub primitive counting semaphore
even if thats considered not featured
as we are going to see I think they're going to see it and
its usage is pretty much discouraged
basically FreeBSD you can consider locking primative divided into three classes
three classes of
of locking 0:11:32.450,0:11:34.090 based mainly in
particular
from an outside perspective
based on the behavior
the contending threads as you regard of the lock 0:11:42.680,0:11:48.100 for example in case of a mutex you can can get that
spinning and blocking mutex do very different things about the contenders
as we are going to see more of this later
usually in the traditional literature,
there are just two
cases of the lock classes mainly
you will find the
spinning lock and the blocking lock
or what they called the sleeping lock
the I think that
as we're going to see why we have three types I think that things will be clear but
if you have any questions please ask us. Thats not a problem
Spinning primitives as I told you
allows the contesting thread to
to check the status of the lock periodically
and the
and they just do busy waiting around
the locking variable
as the spinning primitive FreeBSD just offers mutex
What are the problems linked with this kind of, with this class
of locks? Mainly its that CPU
remains busy without doing really nothing useful
it happens
that if several threads contest on the
on the locks
basically they share the same cache line where the lock is
where the lock is
that means that
contesting or sharing a cache line is a lot underlying activity
on a lot of architectures like for example
having a lot of snoop messages between CPUs
and some buses
some buses traffic
which means in a variety operations
and the last things even the most important you can note is that interrupts
are disabled
while spin locks are held
that was
that happens mainly because there are
there were identified in the past by some
kind of deadlocks possible
if you were going to lead
the spin locks
the interrupts enabled while holding a spin lock. In particular
you could find that there are
some problems with the interrupts angling good in the botom half that was
going to deadlock
Its not very simple to understand the thing so I've left out
but if you want to know
we could speak later probably
with spinning primitives we are even blocking primitives
blocking primitives
allows the
basically the contenders to be
descheduled from the runqueue
to be put on another kind of container
put on another kind of container
and basically
context switch immediately
immediately.
then we put again on runqueue of the scheduler just once the just when the owner
is going to release the lock
and it will be the owner
the owner that was going to
do all the operations about that
we have several primitives implemented as blocking primitives like mutexes
R/W locks and R-M locks
with
basically with
blocking primitives we have a lot of advantages over the spinning mutex
like having the contenders
that
that sleeps
or that blocks avoids CPU busyness
and mainly we can leave the
we can leave the
we can leave that basically the interrupts out
that happens mainly because the interrupts code is just allowed
at least the bottom of one is just allowed
to use spin locks
probably if it was going to use blocking primitives
we wouldnt have been able to disable interrupts here
There are however some
big drawbacks that as you will see
we handle in FreeBSD
in order to make the blobking primitives our
how could I tell
the first choice in terms of blocking
where the problem called Priority Inversion
and we have
the problem that context switches are very heavy in particular
on machines that FreeBSD uses as referral
like E38 and the MD64
but as you're going to see we've used two techniques in order to
to cope with that
another thing is that while you cant
allow
context switches while having
while holding spin lock
it's obvious you cant
acquire a locking primitive while holding a spin lock
that's an important rule in FreeBSD
that sometimes its confused and often its not
observed
that leads to block refusal
usually you will always prefer a blocking primitive for a spin lock
if not in some very particular condition like what
Alrick said about the interrupt and even
about the
some parts that are very very short
we should have some example in the kernel even if I can
I can tell you one right now
I have no idea actually
so that
we're going to see the problemslinked with the blocking primitives the first one is
called Priority Inversion
basically
it could happen that like a thread A
which has a priority
owns a lock. call it L for example
then another thread with another priority than this one
locks on this lock
what happens is that the second thread
the thread B
for example
will need to wait for a lower priority thread
to finish its work load
we
solve this problem actually in the
kernel using a technique called priority propogation
basically
what happens is that priority of thread B
is lent to thread A
until it doesn't release the lock
of its directly implemented in the container
the turnstiles
while that could be done
even on the primitive it has been much convenient to use the container for
that
because
it was going to offer some advantage we are going to see right now
just note that
Read locks
cannot support
priority propogation fixes for read lock that happens because you'd like to
the turnstile should keep track of all the readers
and these would be very very expensive from
from a
from a point of view of the overhead
and even I think I've tried to do something in this regard and I
saw that there was some races that were trying to
acquire a spin lock as base even in fast path so it was a
an impredicable way
I will tell
at least for what we found so far
basically
what happens
about the priority propogation is that the
the threads and the turnstiles
are chained together
the thread
owns the a pointer
to wrench the turnstile is sleeping on
and the turnstile owns a pointer above
the owner of the lock
what happens is that for example in this case we have
a sleeper which is going to sleep on a turnstile
the first lock
which has a priority of one hundred and twenty eight
the turnstile
to the pointer
ts_owner knows which is its owner
and this owner has a priority of two hundred and fifty six
well as you know higher level, higher value means lower priority. so if this is 0:20:31.120,0:20:34.960 a suitable pace for priority propogation
but what happens is that this owner is actually sleeping on another turnstile
and the other owner
of the second turnstile has always the same priority of its sleepers
so
just propogating priority to the first owner was just unuseful because the first
one
could
still
keep the chain to a
lower priority so it's was going to be propogated to the first one
actually running
owner of the chain
this is the situation after the propogation as you can see all of threads in the chain
has the same priority
either possible
in this case the one the last one arriving
there are question about that
no?
yeah when the
when the for example the third owner
the second owner there
when it goes to release the lock
it basically brings back the priority to the
to the
twenty hundred and sixty five to all the chains
he is responsible for
so it just happens at locking operation
and that is what we do about the Priority Inversion
inorder to fix instead the overhead given by the
big amount of context switch we use another technique called adaptive spinning
basically
as the context switch brings a lot of overhead
we prefer to not do
completely a context switch
in the case the lock owner is still running
on a runqueue
because there are very good chance that the owner is going to release the lock very early
that means that for example
we choose just to spin
in order to wait that the state of the
lock changed or the state of the owner
was going to change like the owner going to sleep on another turstile
and the
basically we, there have been very big measurement even in the
another operating system like solice that
where I think we brought in this approach the first time
that we're we're showing
a very big improvement in performance from this technique
apart from the two types of primitives, these are sleeping primitives
now there is a consideration we have to make about that
basically sleeping primitives
should be in theory just the
the wait channels
wait channels should have been the only one implemented using the
container called
sleepqueue
but
due to some legacy
the actually the sleepqueues were used to implement other kind of other
kinds of lock like the
lockmgr
and the sx locks and the
basically the
semaphore's condvars too
that has been this is
going to give some problems actually
because
as we're going to see
and as you can see on the line too
in the FreeBSD
while sleeping threads should not hold any kind of lock
neither blocking nor spinning
thats a simple thing to explain
we just want to enforce very
we just want to enforce
correct semantics of locking
so imagine to keep a lock
a blocking primitive while
sleeping
it's going to waste a lot of time
because all the contenders are going to
are going to start on the
lock owner which is sleeping
basically in fact what
as you should know condition variables do usually is to drop the lock
once it was passed to the primitives
in this case
basically we just dont allow that this means that's the
the same conditions happens even for other kinds of lock
lockmgr and the sx lock
so you cant hold
a mutex for example
of blocking mutex an R/W lock while trying to acquire
a lockmgr and sx
this is going to create some problems because
in some parts that is unavoidable so you have to drop the lock for example and try
to acquire
the other primitive
which is going to
and so can create some raisee problems
as the sleepqueues are born just to serve wait channels
they don't track owner too so they dont care about priority propogation and priority inversion problem
just because sleepqueues entirely should not have work
so for example lockmgr and sx have not priority propogation
systems and the
so they are discouraged to be used even for this thing mainly
sure
it's you mean why it's not
why doesnt blocking primitives exist yeah?
so imagine that for example the
you have a wait channel
condvar a condition variable
or M sleep
M sleep
the primitive that allows you to sleep on
a condition variable for example
however
the you are
using the blocking
the using the turnstile you will go to a
always the mechanism of priority propogation and priority inversion handling.Its
not very
it's pretty
it's not a simple operation
it acquires even some kind of spin locks
in order to avoid some raises
and so
it
so it has an overhead
if you do in this case it will be not to be useful it will be completely unuseful to have
a mechanism like that so
in theory if you just would have used
a sleeping the sleepqueue for wait channels
you are to add
bigperformance boost than just using the turnstile
for the same problem
in theory
but what happened is that other locks are implementedo
using this sleepqueue
that should have not be happened
on the principle
really I'm not sure who introduced the sx lock
I'm actually not sure
and even the lockmgr
but
however
as you could have seen before the three containers create a heirarchy that
should not be broken like
you have spinqueues
you have spin locks you have blocking primitives and sleeping primitives and
you cannot acquire you cannot mix them there are precise rules like
on the top the sleeping primitive
in the mid the blocking primitive and in the end the spinning primitive
the main choice will be to use blocking primitives always
because as you can see we handled a lot of problem that they have
and the practice
they have proven to be very
very helpful
but sometimes
some nasty conditions can happen
for example one of the most widespread is the
using a mallok with a flag M_WAITOK
in FreeBSD that means that if the allocator is pretty busy or going to
to sleep
in order to retreive your memory
and if you do with a lock hold
you're going to violate one of our rules and its not
possible
another one is just we just
said before like call a sleeping lock while
holding a blocking primitive
in the next example in the next I'm going to show you a way to
to handle for example the Mallock case
and similar
but the that usually
usually that
are not very common cases
at least for simple parts
you should even try to avoid the
the
yes
even in the
in the
wait channel
as in the FreeBSD you can differentiate between the condition variables and
Msleep
usually Msleep was
really Msleep was introduced as the first primitive
but it has an interface very very difficult to
to make saner and to understand
at least for
for people
which are
comfortable with
with interface of condition variable that we all saw but they are
newer primitive
mainly there is
so far the newer code
what you should do is just to
use condition variables
and not Msleep
basically
Msleep should be dropped off but they have avery nice feature which
is the the possibility to specify a wake up priority on the sleeping threads
once they are asleep
that condvar still doesnt
maybe if we could port these features to the condition variables we we will be able
to completely drop off Msleep
from the work arena
this is a
simple case that it's going to show a way to
a simple way to deal with the for example
condition I told before the Mallock willing to
to sleep
and the doing that while holding a lock
as you see we have some fake C as some members like flags
and an object called instructful
which needs to be allocated
and that they are protected by an internal lock
you imagine that for example the fake C create
holds lock of the object and does some things
which are not important
then in the end for example it's going to
to allocate
the FC object and that should be protected in
in anatomic part
something you can do is just to set the flag
for that
saying
the allocation is going to happen if you're adjust to this structure concurrently
just keep the allocation
and that's what we do
we check for this flag
and if its present it means that another thread is still
is already allocating and we just keep
so otherwise we set it and then we have locked the mutex
then we allocate the memory for the
for the object
acquire again the lock
and we simply have seen
please note that Ive used the temporary storage for that in order to make
some search on
like the MS
about the
the pointer
it was just a tricky note that you verify that really the structure was not
really allocated
and so that we can get some
kind of session about that
one of the biggest innovation that was brought to FreeBSD
about the locking primitive about the locking primitives
are the interrupts that
mainly
this is pretty simple to explain maybe
As the top half remains basically the same
and was going to handle the ISR for the interrupt line for example
the bottom half changed set and running the interrupts
handler is solid on that line as it was traditionally happened
it was going just to schedule a thread
that was going to run the
the interrupt handler in a
--- context and not the kind of --it was going to happen
traditionally in a lot of unique system
this has the big advantage that in using your own context you can
basically
you're not forced to use spin locks and you can do a lot of other fancy things
this necesity came over because
often
interrupts handlers needs to adjust to some
needs to adjust to some subsystem locks and the
as we were going to use blocking ---around
we had the necessity to support the
the locking of the
the possibilities of wide mutex actually
A similar thing was implemented using taskqueues
previously
and the sometimes it
I think I saw a lenux too
using taskqueues maybe
but the
it was basically something similar but not exactly in this way
a actually FreeBSD
from the release seven
the interrupt threads
are this model is a little bit changed
in order to include the
a new mechanism called the filtering
we have interrupt filters that basically if set then directly
directly
schedule the thread
linked to the parked line
they just check for
they just let run some new thing in the kernel or context
that will decide if
handle directly to requests or just schedule the kernel
it's like if you have the old bottom handler
that add the possibility to register a handler
still running in interrupt context and at the same time
decide if scheduled or not
so that it's no
no more madatory
So I think that the first part is going to finish so if you have some questions we can
handle
right now
this should be material for the second part actually
a new bus for example
some
some drivers that kind of a frequently used I'm not sure but which ones but all
the big ones are compared to finer locking
%um
actually the problem is not which parts are under Giant
well how we could
optimize the locking of some subsystems because
for example we have to virtual memory
which is not on the Giant but its
not locate
optimally and it's going to bring a lot of contention
so
it's not under Giant but it should be optimized
because the parts under Giant are very tiny.New bus for example
some parts relating to the VFS on the mounting but yet a very short parts
I'm not sure about others
sorry
well usually it should be moved completely but
yes
it could
okay although
in the kernel we have a basically
%um
as you should know we already imported the trays for example
and I have wondered, I have submitted by developed
my country
called ---some patches that brings the
the ----- directly in our locking
in order to
allow it to be tracked with the trace.
which is very nice but it's still not completed
we are reviewing
above that we have a very the other useful tool called the lock profiling
that has been very helpful in the past in order to
find the most contended lock
and the to try to propose them to finer locking
so at least for the kernel we have such mechanism
I'm not sure what should
have been the user space.I'm sure we've not something similar
but maybe other systems
have
similar tools
I don't know I just know FreeBSD so
not sure
would you repeat
some voice please. No I cant hear
It seems to me that
you don't you have to do all the work that you do with locking
well if you're not on SMP right?
well no
it's not right because the
you have to protect even against some mechanism like preemption
which is going to be tricky.It is dfferent implemented than FreeBSD 4.x so
it's going to be with preemption its like
from
it's like if you have a real SMP system from our technical point of view
so you have to handle
problems typical of that
really in the kernel we have other kind of synchronization like atomics
I don't, I should have had
a slide about that but it disappeared so I can tell you by voice
its well like we have the possibility to use atomic instruction in the
in FreeBSD kernel directly
but the
to use even memory bytes linked with them
the only pitfall is that you cannot really trust about the
cash coherency
because as long as it's Im be specific you can just
you can just be trust about
what happens in your CPU where use the atomic and where to use the memory byte
you cannot make assumptions about the what happens about if other CPUs
can see your modifiers or not
and if the cache can handle that
we have a specific primitives in order to for example disable preemption
which are the critical sections
critical entry and critical exit
that what you call them you are not to
the preemption is simply allowed
it's that's a very fast primitive so there is not much overhead
so there's not much overhead
we also have a way to disable interrupt which is unofficial.I will tell
that
because you can do that in machine dependant way
with a spin lock entry and spin lock exit
and then
yeah that you can
even disable
some thread migration
using skid primitives
that are very useful
when you are going to adjust for example to per-CPU datas
and you have several chases and you don't want the CPU migrate
from that
thread migrate from that CPU
because you could read different
values from different CPU then
I'm not sure
if there is something else okay
questions? no?
so i'll see you later"