A. Rao - The locking infrastructure in the freebsd kernel #1

My name is Attilio Rao and I think that we are in time for the presentation I want to ask sorry for my English because it's not really British English but I will try to make this a little bit uncomfortable Better? Ok.Thank you.So we are going to speak about the the locking infrastructure in the FreeBSD kernel which is a bit interesting topic because Its going to be with time very widely discussed on our mailing list not only from developer's perspective but even from user's perspective. and we will see why later In this presentation we will specifically see what was the situation of the first FreeBSD implementations and what changed from that what specifically what's called the SMPng era and what we had prior that we are going to discuss specifically locking primitives that has been introduced with time until now and problems linked to parellelism in general and how we solve that in the FreeBSD kernel You can see a table of content a little bit more detailed as listing precisely what we some problems like Priority Inheritance and Adaptive Spinning that we are going to discuss fruitfullly. Mostly until FreeBSD 4.x We had already moved to multitasking. so the slide is a little bit confusing but multitasking and preemptive system since that transition was not very was not very difficult to implement in such systems because if you can see then our uniprocessor machine you can get that well the sequential execution was just stopped by preemption and by arrival of interrupts so you should adjustment in consistency of data structures about these two issues more precisely we were handling the interrupts and transitions through a mechanism called SPL and for kernel threads, threads running in the kernel we were disabling preemption in order to avoid the corruption of the data structure This approach while was pretty good on uniprocessor machines was actually impredictable for the SMP environments more precisely because we had more coures that was running thread per time and so parallel accesses to the data structures were possible in order to to avoid big problems in the kernel we have to just allow the entering of one thread per time into kernel while that was a pretty good approach for workloads that were nearly user space for work loads requiring a lot of IO for example they were wateful because they wasn't getting any advantage from the new SMP architecture like the parallelism was basically zero at least in the kernel in order to fix that a new project was created called SMP New generation or NG as you can see it from the slide the entering in the kernel was preempted by using Big Lock called BKL basically With FreeBSD 5.x we had the SMP new generation project basically it was a sanitization of the of all our kernel and the engineering over lot of mechanism inside our kernel. We could see it FreeBSD 4.x and FreeBSD 5.x as mainly two different kernels because of substantial subsystem were rewritten and were written with the idea to use and implement a real parallelism in mind. we can say that basically it was a major task a very big task and that it required a lot of years to be brought in a good shape at least In Italy, the people gave a lot of complaining about the un-robustness of FreeBSD 5.x but probably that's because they couldn't even see that the changes were really really important and really huge however for FreeBSD 5.x based this initial SMP system inheriting from BSD/OS that kindly released this code above that and the the process was break up in some precise tasks at least in Italy Mainly the first things was introducing in the kernel new set of atomic instruction and locking primitives Then introducing an abstraction called interrupt threads that we are going to discuss rather later but it was basically restored completely the interrupt mechanism that was in the FreeBSD 4.x the the BKL lock was moved to a real mutex called Giant that still exists in our kernel and they were introduced some threading primitives the like and and on and threading primitives called also KSE which are actually never used in our kernel and that being their being exit out in the past year and the slowly of the porting of all the older subsystems to a finer locking was started I have to say this task is not still completed, its still going on but we are really good shape about that just few subsystems remain which are still Giant protected and with new release that we're going to ship this year, I think that we made a very huge step forward in this direction really the the SMPng has been considered closed around the end of 2007 but the the the important parts where this initial moving I rather thing that's not listed here but I can tell you is that even that if Giant was preventing any parallelism initial parallelism that were imported new kernel memory allocator that was that I discovered and the scheduler was move with a seperate lock in order to start getting some a little bit of concurrency a real concurrency the before to speak about FreeBSD specifics we can start digging in about what kind of of locking primitives you can find in our kernel. from a more historical point of view we have some versions of mutex which I assume people here knows about that but I'm going to give a little explanation for people that doesn't know a mutex is basically a lock allowing to access to some protected data's thread to just one thread per time so if a thread owns the lock, owns the mutex other threads won't be able to to access to this until this lock is released we offer even some kind of locks called R/W lock Read/Write lock which are basically a locks that can be acquired in two different versions one version is the write lock which is the same as the mutex just one in the protected part per time and other one is the read mode which basically allows all the thread willing to acquire to read mode to concurrently adjust to the structure but prevents the threads from writing nto the protected path. while the reader..while they are readers then we have even the locks called the Read Mostly which are basically the same of Read/Write Locks but are they have some optimization in order to make the Read part be really fast and to have like zero overhead zero overhead kind of lock from the read path while probably the write path is even heavier than the other one but if you think about cases that just where there are a lot of reader chases and very few writer chases you can find that a very useful very useful primitive then we have some form of Wait channels Wait channels basically are what generalizations of what other people con call like condition variable and they basically let that thread sleep under some conditions that are that are previously started with some some variables usually having a Wait channel means that its chases are controlled through another locking primitive like a mutex or R/Wlock and so often the Wait channel is associated to its to its locking primitive usually if you have no necessity to use a Wait channel without a primitive a locking primitive you probably have bad code but there are some edge cases with that seem possible As last thing FreeBSD sub primitive counting semaphore even if thats considered not featured as we are going to see I think they're going to see it and its usage is pretty much discouraged basically FreeBSD you can consider locking primative divided into three classes three classes of of locking 0:11:32.450,0:11:34.090 based mainly in particular from an outside perspective based on the behavior the contending threads as you regard of the lock 0:11:42.680,0:11:48.100 for example in case of a mutex you can can get that spinning and blocking mutex do very different things about the contenders as we are going to see more of this later usually in the traditional literature, there are just two cases of the lock classes mainly you will find the spinning lock and the blocking lock or what they called the sleeping lock the I think that as we're going to see why we have three types I think that things will be clear but if you have any questions please ask us. Thats not a problem Spinning primitives as I told you allows the contesting thread to to check the status of the lock periodically and the and they just do busy waiting around the locking variable as the spinning primitive FreeBSD just offers mutex What are the problems linked with this kind of, with this class of locks? Mainly its that CPU remains busy without doing really nothing useful it happens that if several threads contest on the on the locks basically they share the same cache line where the lock is where the lock is that means that contesting or sharing a cache line is a lot underlying activity on a lot of architectures like for example having a lot of snoop messages between CPUs and some buses some buses traffic which means in a variety operations and the last things even the most important you can note is that interrupts are disabled while spin locks are held that was that happens mainly because there are there were identified in the past by some kind of deadlocks possible if you were going to lead the spin locks the interrupts enabled while holding a spin lock. In particular you could find that there are some problems with the interrupts angling good in the botom half that was going to deadlock Its not very simple to understand the thing so I've left out but if you want to know we could speak later probably with spinning primitives we are even blocking primitives blocking primitives allows the basically the contenders to be descheduled from the runqueue to be put on another kind of container put on another kind of container and basically context switch immediately immediately. then we put again on runqueue of the scheduler just once the just when the owner is going to release the lock and it will be the owner the owner that was going to do all the operations about that we have several primitives implemented as blocking primitives like mutexes R/W locks and R-M locks with basically with blocking primitives we have a lot of advantages over the spinning mutex like having the contenders that that sleeps or that blocks avoids CPU busyness and mainly we can leave the we can leave the we can leave that basically the interrupts out that happens mainly because the interrupts code is just allowed at least the bottom of one is just allowed to use spin locks probably if it was going to use blocking primitives we wouldnt have been able to disable interrupts here There are however some big drawbacks that as you will see we handle in FreeBSD in order to make the blobking primitives our how could I tell the first choice in terms of blocking where the problem called Priority Inversion and we have the problem that context switches are very heavy in particular on machines that FreeBSD uses as referral like E38 and the MD64 but as you're going to see we've used two techniques in order to to cope with that another thing is that while you cant allow context switches while having while holding spin lock it's obvious you cant acquire a locking primitive while holding a spin lock that's an important rule in FreeBSD that sometimes its confused and often its not observed that leads to block refusal usually you will always prefer a blocking primitive for a spin lock if not in some very particular condition like what Alrick said about the interrupt and even about the some parts that are very very short we should have some example in the kernel even if I can I can tell you one right now I have no idea actually so that we're going to see the problemslinked with the blocking primitives the first one is called Priority Inversion basically it could happen that like a thread A which has a priority owns a lock. call it L for example then another thread with another priority than this one locks on this lock what happens is that the second thread the thread B for example will need to wait for a lower priority thread to finish its work load we solve this problem actually in the kernel using a technique called priority propogation basically what happens is that priority of thread B is lent to thread A until it doesn't release the lock of its directly implemented in the container the turnstiles while that could be done even on the primitive it has been much convenient to use the container for that because it was going to offer some advantage we are going to see right now just note that Read locks cannot support priority propogation fixes for read lock that happens because you'd like to the turnstile should keep track of all the readers and these would be very very expensive from from a from a point of view of the overhead and even I think I've tried to do something in this regard and I saw that there was some races that were trying to acquire a spin lock as base even in fast path so it was a an impredicable way I will tell at least for what we found so far basically what happens about the priority propogation is that the the threads and the turnstiles are chained together the thread owns the a pointer to wrench the turnstile is sleeping on and the turnstile owns a pointer above the owner of the lock what happens is that for example in this case we have a sleeper which is going to sleep on a turnstile the first lock which has a priority of one hundred and twenty eight the turnstile to the pointer ts_owner knows which is its owner and this owner has a priority of two hundred and fifty six well as you know higher level, higher value means lower priority. so if this is 0:20:31.120,0:20:34.960 a suitable pace for priority propogation but what happens is that this owner is actually sleeping on another turnstile and the other owner of the second turnstile has always the same priority of its sleepers so just propogating priority to the first owner was just unuseful because the first one could still keep the chain to a lower priority so it's was going to be propogated to the first one actually running owner of the chain this is the situation after the propogation as you can see all of threads in the chain has the same priority either possible in this case the one the last one arriving there are question about that no? yeah when the when the for example the third owner the second owner there when it goes to release the lock it basically brings back the priority to the to the twenty hundred and sixty five to all the chains he is responsible for so it just happens at locking operation and that is what we do about the Priority Inversion inorder to fix instead the overhead given by the big amount of context switch we use another technique called adaptive spinning basically as the context switch brings a lot of overhead we prefer to not do completely a context switch in the case the lock owner is still running on a runqueue because there are very good chance that the owner is going to release the lock very early that means that for example we choose just to spin in order to wait that the state of the lock changed or the state of the owner was going to change like the owner going to sleep on another turstile and the basically we, there have been very big measurement even in the another operating system like solice that where I think we brought in this approach the first time that we're we're showing a very big improvement in performance from this technique apart from the two types of primitives, these are sleeping primitives now there is a consideration we have to make about that basically sleeping primitives should be in theory just the the wait channels wait channels should have been the only one implemented using the container called sleepqueue but due to some legacy the actually the sleepqueues were used to implement other kind of other kinds of lock like the lockmgr and the sx locks and the basically the semaphore's condvars too that has been this is going to give some problems actually because as we're going to see and as you can see on the line too in the FreeBSD while sleeping threads should not hold any kind of lock neither blocking nor spinning thats a simple thing to explain we just want to enforce very we just want to enforce correct semantics of locking so imagine to keep a lock a blocking primitive while sleeping it's going to waste a lot of time because all the contenders are going to are going to start on the lock owner which is sleeping basically in fact what as you should know condition variables do usually is to drop the lock once it was passed to the primitives in this case basically we just dont allow that this means that's the the same conditions happens even for other kinds of lock lockmgr and the sx lock so you cant hold a mutex for example of blocking mutex an R/W lock while trying to acquire a lockmgr and sx this is going to create some problems because in some parts that is unavoidable so you have to drop the lock for example and try to acquire the other primitive which is going to and so can create some raisee problems as the sleepqueues are born just to serve wait channels they don't track owner too so they dont care about priority propogation and priority inversion problem just because sleepqueues entirely should not have work so for example lockmgr and sx have not priority propogation systems and the so they are discouraged to be used even for this thing mainly sure it's you mean why it's not why doesnt blocking primitives exist yeah? so imagine that for example the you have a wait channel condvar a condition variable or M sleep M sleep the primitive that allows you to sleep on a condition variable for example however the you are using the blocking the using the turnstile you will go to a always the mechanism of priority propogation and priority inversion handling.Its not very it's pretty it's not a simple operation it acquires even some kind of spin locks in order to avoid some raises and so it so it has an overhead if you do in this case it will be not to be useful it will be completely unuseful to have a mechanism like that so in theory if you just would have used a sleeping the sleepqueue for wait channels you are to add bigperformance boost than just using the turnstile for the same problem in theory but what happened is that other locks are implementedo using this sleepqueue that should have not be happened on the principle really I'm not sure who introduced the sx lock I'm actually not sure and even the lockmgr but however as you could have seen before the three containers create a heirarchy that should not be broken like you have spinqueues you have spin locks you have blocking primitives and sleeping primitives and you cannot acquire you cannot mix them there are precise rules like on the top the sleeping primitive in the mid the blocking primitive and in the end the spinning primitive the main choice will be to use blocking primitives always because as you can see we handled a lot of problem that they have and the practice they have proven to be very very helpful but sometimes some nasty conditions can happen for example one of the most widespread is the using a mallok with a flag M_WAITOK in FreeBSD that means that if the allocator is pretty busy or going to to sleep in order to retreive your memory and if you do with a lock hold you're going to violate one of our rules and its not possible another one is just we just said before like call a sleeping lock while holding a blocking primitive in the next example in the next I'm going to show you a way to to handle for example the Mallock case and similar but the that usually usually that are not very common cases at least for simple parts you should even try to avoid the the yes even in the in the wait channel as in the FreeBSD you can differentiate between the condition variables and Msleep usually Msleep was really Msleep was introduced as the first primitive but it has an interface very very difficult to to make saner and to understand at least for for people which are comfortable with with interface of condition variable that we all saw but they are newer primitive mainly there is so far the newer code what you should do is just to use condition variables and not Msleep basically Msleep should be dropped off but they have avery nice feature which is the the possibility to specify a wake up priority on the sleeping threads once they are asleep that condvar still doesnt maybe if we could port these features to the condition variables we we will be able to completely drop off Msleep from the work arena this is a simple case that it's going to show a way to a simple way to deal with the for example condition I told before the Mallock willing to to sleep and the doing that while holding a lock as you see we have some fake C as some members like flags and an object called instructful which needs to be allocated and that they are protected by an internal lock you imagine that for example the fake C create holds lock of the object and does some things which are not important then in the end for example it's going to to allocate the FC object and that should be protected in in anatomic part something you can do is just to set the flag for that saying the allocation is going to happen if you're adjust to this structure concurrently just keep the allocation and that's what we do we check for this flag and if its present it means that another thread is still is already allocating and we just keep so otherwise we set it and then we have locked the mutex then we allocate the memory for the for the object acquire again the lock and we simply have seen please note that Ive used the temporary storage for that in order to make some search on like the MS about the the pointer it was just a tricky note that you verify that really the structure was not really allocated and so that we can get some kind of session about that one of the biggest innovation that was brought to FreeBSD about the locking primitive about the locking primitives are the interrupts that mainly this is pretty simple to explain maybe As the top half remains basically the same and was going to handle the ISR for the interrupt line for example the bottom half changed set and running the interrupts handler is solid on that line as it was traditionally happened it was going just to schedule a thread that was going to run the the interrupt handler in a --- context and not the kind of --it was going to happen traditionally in a lot of unique system this has the big advantage that in using your own context you can basically you're not forced to use spin locks and you can do a lot of other fancy things this necesity came over because often interrupts handlers needs to adjust to some needs to adjust to some subsystem locks and the as we were going to use blocking ---around we had the necessity to support the the locking of the the possibilities of wide mutex actually A similar thing was implemented using taskqueues previously and the sometimes it I think I saw a lenux too using taskqueues maybe but the it was basically something similar but not exactly in this way a actually FreeBSD from the release seven the interrupt threads are this model is a little bit changed in order to include the a new mechanism called the filtering we have interrupt filters that basically if set then directly directly schedule the thread linked to the parked line they just check for they just let run some new thing in the kernel or context that will decide if handle directly to requests or just schedule the kernel it's like if you have the old bottom handler that add the possibility to register a handler still running in interrupt context and at the same time decide if scheduled or not so that it's no no more madatory So I think that the first part is going to finish so if you have some questions we can handle right now this should be material for the second part actually a new bus for example some some drivers that kind of a frequently used I'm not sure but which ones but all the big ones are compared to finer locking %um actually the problem is not which parts are under Giant well how we could optimize the locking of some subsystems because for example we have to virtual memory which is not on the Giant but its not locate optimally and it's going to bring a lot of contention so it's not under Giant but it should be optimized because the parts under Giant are very tiny.New bus for example some parts relating to the VFS on the mounting but yet a very short parts I'm not sure about others sorry well usually it should be moved completely but yes it could okay although in the kernel we have a basically %um as you should know we already imported the trays for example and I have wondered, I have submitted by developed my country called ---some patches that brings the the ----- directly in our locking in order to allow it to be tracked with the trace. which is very nice but it's still not completed we are reviewing above that we have a very the other useful tool called the lock profiling that has been very helpful in the past in order to find the most contended lock and the to try to propose them to finer locking so at least for the kernel we have such mechanism I'm not sure what should have been the user space.I'm sure we've not something similar but maybe other systems have similar tools I don't know I just know FreeBSD so not sure would you repeat some voice please. No I cant hear It seems to me that you don't you have to do all the work that you do with locking well if you're not on SMP right? well no it's not right because the you have to protect even against some mechanism like preemption which is going to be tricky.It is dfferent implemented than FreeBSD 4.x so it's going to be with preemption its like from it's like if you have a real SMP system from our technical point of view so you have to handle problems typical of that really in the kernel we have other kind of synchronization like atomics I don't, I should have had a slide about that but it disappeared so I can tell you by voice its well like we have the possibility to use atomic instruction in the in FreeBSD kernel directly but the to use even memory bytes linked with them the only pitfall is that you cannot really trust about the cash coherency because as long as it's Im be specific you can just you can just be trust about what happens in your CPU where use the atomic and where to use the memory byte you cannot make assumptions about the what happens about if other CPUs can see your modifiers or not and if the cache can handle that we have a specific primitives in order to for example disable preemption which are the critical sections critical entry and critical exit that what you call them you are not to the preemption is simply allowed it's that's a very fast primitive so there is not much overhead so there's not much overhead we also have a way to disable interrupt which is unofficial.I will tell that because you can do that in machine dependant way with a spin lock entry and spin lock exit and then yeah that you can even disable some thread migration using skid primitives that are very useful when you are going to adjust for example to per-CPU datas and you have several chases and you don't want the CPU migrate from that thread migrate from that CPU because you could read different values from different CPU then I'm not sure if there is something else okay questions? no? so i'll see you later"