Re: Understanding the idlezero code

Board: DFBSD_kernel | Posted: 2010/09/24 02:01 | Message 2/2 in thread
Very interesting, thanks!

Tim

On Wed, Sep 22, 2010 at 10:36 AM, Venkatesh Srinivas
<me@endeavour.zapto.org> wrote:
> Hi,
>
> A feature added to DragonFly during the 2.8 development cycle was
> idle-time page zeroing. Stated simply, the system will use some of its
> idle time to zero free pages, possibly saving time when they are
> allocated. Walking through the idle zero code is instructive - it
> provides a view into a number of DragonFly kernel subsystems.
>
> Some background:
>
> The DragonFly (and FreeBSD) virtual memory systems are organized around a
> number of queues, describing all of the page frames in a system. The
> queues are:
>     active   := Pages that are actively mapped and in use
>     inactive := Pages that are dirty; these may be mapped, but will be
>                 reclaimed under memory pressure
>     cache    := Pages that are clean and reusable, but still hold their
>                 contents until needed under pressure
>     free     := Pages not actively holding data, ready for allocation
>
> The cache and free queues are actually divided into a number of
> sub-queues, one for each cache color, but they function as single queues.
> They are also loosely sorted, with zeroed pages at the tail.
>
> Page allocation requests, for example by a user process zero-fill fault,
> need pages of zeroes. The fault handler code will call vm_page_alloc
> (found in /usr/src/sys/vm/vm_page.c) with the VM_ALLOC_ZERO flag set,
> which will take a page from the tail of the free queue, if available. If
> the page was not already zeroed, it will be (by the caller!). Having
> zeroed pages around would save that time.
>
> In DragonFly, the idle zero logic runs in its own LWKT, which runs at
> system idle time. The LWKT is somewhat atypical - it works pretty hard to
> get out of the way, at a cost to its own idle zero rate. (In FreeBSD 4.x,
> it ran as part of the idle loop; in 6.x+, it runs in its own kernel
> thread).
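
The free-queue behaviour described above (zeroed pages kept at the tail, a
zero-fill request taking from the tail, and the caller zeroing the page when
it was not already zeroed) can be pictured with a small userspace sketch.
This is not the DragonFly code: every toy_* name, the single flat queue
(no cache colors), and the 4096-byte page size are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096   /* assumed page size */
#define TOY_PG_ZERO   0x1    /* hypothetical flag: contents are all zeroes */

struct toy_page {
    struct toy_page *prev, *next;
    int flags;
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_freeq {
    struct toy_page *head, *tail;
};

/* Return a page to the queue: zeroed pages at the tail, others at the head. */
static void toy_page_free(struct toy_freeq *q, struct toy_page *m)
{
    assert(q != NULL && m != NULL);
    m->prev = m->next = NULL;
    if (q->head == NULL) {
        q->head = q->tail = m;
    } else if (m->flags & TOY_PG_ZERO) {
        m->prev = q->tail;
        q->tail->next = m;
        q->tail = m;
    } else {
        m->next = q->head;
        q->head->prev = m;
        q->head = m;
    }
}

/* Take a page: from the tail if zeroes are wanted, else from the head. */
static struct toy_page *toy_page_alloc(struct toy_freeq *q, int want_zero)
{
    struct toy_page *m = want_zero ? q->tail : q->head;

    if (m == NULL)
        return NULL;
    if (m->prev) m->prev->next = m->next; else q->head = m->next;
    if (m->next) m->next->prev = m->prev; else q->tail = m->prev;
    m->prev = m->next = NULL;
    return m;
}

/*
 * Caller side of a zero-fill fault. Returns 0 if the page was already
 * zeroed, 1 if the caller had to zero it, -1 if no page was free.
 */
static int toy_zero_fill_fault(struct toy_freeq *q, struct toy_page **out)
{
    struct toy_page *m = toy_page_alloc(q, 1);

    if (m == NULL)
        return -1;
    *out = m;
    if (m->flags & TOY_PG_ZERO)
        return 0;                       /* the saving idle zeroing buys */
    memset(m->data, 0, TOY_PAGE_SIZE);  /* the cost paid at fault time */
    return 1;
}
```

The return value of toy_zero_fill_fault() makes the trade visible: a 0
means the fault path skipped the memset entirely, which is the time the idle
zero thread is trying to save.
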
>
> Code time:
>
> The code is in /usr/src/sys/vm/vm_zeroidle.c
> (http://grok.x12.su/source/xref/dragonfly/sys/vm/vm_zeroidle.c) if you'd
> like to follow along.
>
> Typical of such walkthroughs, we will start at the very last line of the
> file:
>
>     SYSINIT(pagezero, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, pagezero_start, NULL);
>
> SYSINIT is a DFly/FBSD kernel macro, which marks a function to be called
> during boot. This SYSINIT invocation is saying 'call the function
> pagezero_start, when starting the VM daemons (SI_SUB_KTHREAD_VM), at any
> point during the VM daemon startup (SI_ORDER_ANY), with NULL args'.
>
> The pagezero_start function, just above the SYSINIT invocation, looks
> like (simplified):
>
>     static void pagezero_start(void __unused *arg) {
>         struct thread *td;
>
>         idlezero_nocache = bzeront_avail;
>         kthread_create(vm_pagezero, NULL, &td, "pagezero");
>     }
>
> This function captures a flag from the platform specific code - is the
> bzeront function available (on SSE2 i386 systems, we use the MOVNTI
> instruction to zero pages, avoiding polluting a processor's Data Cache
> with lots of zeroes; this flag indicates whether MOVNTI is available). The
> function then kicks off an LWKT, named 'pagezero', running the vm_pagezero
> function. The LWKT starts up with the MP lock held.
>
> The vm_pagezero() function, lurking just above in this file, is the core
> of the idle zero logic. It performs some setup work:
>
>     lwkt_setpri_self(TDPRI_IDLE_WORK);
>     lwkt_setcpu_self(globaldata_find(ncpus - 1));
>
> This sets its priority to just above the idle thread and moves it to the
> last CPU on the system. It then enters its main loop.
>
> The idle zero main loop is constructed as a state machine, with a few
> states - IDLE, GET_PAGE, ZERO_PAGE, and RELEASE_PAGE. The main loop
> switches on the current state, executes a small block of code, then
> transitions states.
> At each transition, it calls lwkt_yield(), to switch to any ready LWKTs
> on the current CPU.
>
> The idle state is the state that the logic starts in:
>
>     case STATE_IDLE:
>         tsleep(&zero_state, 0, "pgzero", sleep_time);
>         if (vm_page_zero_check())
>             npages = idlezero_rate / 10;
>         sleep_time = vm_page_zero_time();
>         if (npages)
>             state = STATE_GET_PAGE;
>         break;
>
> In the idle state, the idle zero LWKT sleeps for 'sleep_time'; when there
> are no pages to zero, sleep_time is a long interval - 'LONG_SLEEP_TIME',
> or ten times the system clock; when there are, we sleep for
> 'DEFAULT_SLEEP_TIME', a tenth of the system clock. When the LWKT wakes
> from its sleep, it calls vm_page_zero_check(), also in this file;
> vm_page_zero_check() will be described later, but it returns true if we
> should be zeroing pages. If so, we compute the number of pages to zero
> and how long to sleep on the next entry to the idle state, and transition
> to the GET_PAGE state. We break between transitions, to attempt
> lwkt_yield() again.
>
> The GET_PAGE state logic looks like:
>
>     case STATE_GET_PAGE:
>         m = vm_page_free_fromq_fast();
>         if (m == NULL) {
>             state = STATE_IDLE;
>         } else {
>             state = STATE_ZERO_PAGE;
>             buf = lwbuf_alloc(m);
>             pg = (char *)lwbuf_kva(buf);
>         }
>         break;
>
> In the GET_PAGE state we attempt to acquire a page to zero, using a
> relatively new interface, vm_page_free_fromq_fast(). This routine, in
> vm_page.c, attempts to get a page from one of the free queues. If it
> fails to get one, we return to the idle state; otherwise, we prepare to
> enter the ZERO_PAGE state. We allocate an lwbuf and bind it to the page
> we wish to zero.
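
The IDLE and GET_PAGE transitions described above can be modelled outside the
kernel as plain state updates. The sketch below is a guess at the shape of
the logic, not the vm_zeroidle.c source: the toy_* names are invented, the
result of vm_page_zero_check() is passed in as the should_zero argument, and
reading "ten times" and "a tenth of" the system clock as hz * 10 and hz / 10
ticks is an assumption.

```c
#include <assert.h>

enum toy_state { TOY_IDLE, TOY_GET_PAGE, TOY_ZERO_PAGE, TOY_RELEASE_PAGE };

struct toy_zeroer {
    enum toy_state state;
    int npages;      /* page budget computed on the last wakeup */
    int sleep_time;  /* ticks to sleep on the next entry to IDLE */
};

/* Short sleep while there is work, long sleep otherwise (assumed values). */
static int toy_sleep_ticks(int should_zero, int hz)
{
    assert(hz > 0);
    return should_zero ? hz / 10 : hz * 10;
}

/* What the IDLE case does after waking from its sleep. */
static void toy_idle_step(struct toy_zeroer *z, int should_zero,
                          int rate, int hz)
{
    if (should_zero)
        z->npages = rate / 10;
    z->sleep_time = toy_sleep_ticks(should_zero, hz);
    if (z->npages)
        z->state = TOY_GET_PAGE;
}

/* GET_PAGE: go zero the page if one was obtained, else back to IDLE. */
static void toy_get_page_step(struct toy_zeroer *z, int have_page)
{
    z->state = have_page ? TOY_ZERO_PAGE : TOY_IDLE;
}
```

Keeping each state as a small step function mirrors the point made in the
walkthrough: every case does a bounded amount of work and then returns to the
top of the loop, where the thread gets another chance to yield.
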
>
> In the ZERO_PAGE state, we actually zero the page:
>
>     case STATE_ZERO_PAGE:
>         while (i < PAGE_SIZE) {
>             if (idlezero_nocache == 1)
>                 bzeront(&pg[i], IDLEZERO_RUN);
>             else
>                 bzero(&pg[i], IDLEZERO_RUN);
>             i += IDLEZERO_RUN;
>             lwkt_yield();
>         }
>         state = STATE_RELEASE_PAGE;
>         break;
>
> We loop across the entire page, zeroing 64 bytes at a time. After each
> 64-byte run, we call lwkt_yield(), in case any LWKTs are waiting to run.
> If the MOVNTI instruction is available, we use it via bzeront();
> otherwise, we use bzero(). When we are done zeroing the page, we enter
> the RELEASE_PAGE state.
>
> In the RELEASE_PAGE state, we tear down the lwbuf and return the page to
> the free queue:
>
>     case STATE_RELEASE_PAGE:
>         lwbuf_free(buf);
>         vm_page_flag_set(m, PG_ZERO);
>         vm_page_free_toq(m);
>         state = STATE_GET_PAGE;
>         ++idlezero_count;
>         break;
>
> We first release the lwbuf; we then mark the page as zeroed and return it
> to the free queue. We transition back to the GET_PAGE state, and bump an
> idlezero counter.
>
> The operation of the idle zero code can be monitored via sysctls - the
> sysctl vm.stats.vm.v_ozfod tracks the total number of zero-fill faults
> which found a zero-filled page waiting for them (vm.stats.vm.v_zfod
> tracks total zfod faults). The vm.idlezero_count sysctl tracks the total
> number of pages the idle zero logic has managed to zero-fill.
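
To make the cost of the ZERO_PAGE loop concrete, here is a userspace sketch
of zeroing one page in 64-byte runs with a yield hook after each run. It is
an illustration only: the toy_* names and the callback standing in for
lwkt_yield() are invented, a 4096-byte page is assumed, and the
non-temporal bzeront() path is not modelled (plain memset is used instead).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE    4096  /* assumed page size */
#define TOY_IDLEZERO_RUN 64    /* bytes zeroed between yields, per the text */

/* Hypothetical stand-in for lwkt_yield(). */
typedef void (*toy_yield_fn)(void *ctx);

/* Zero one page in fixed-size runs; returns the number of runs done. */
static size_t toy_zero_page(unsigned char *pg, toy_yield_fn yield, void *ctx)
{
    size_t runs = 0;

    assert(pg != NULL);
    for (size_t i = 0; i < TOY_PAGE_SIZE; i += TOY_IDLEZERO_RUN) {
        memset(&pg[i], 0, TOY_IDLEZERO_RUN);
        if (yield != NULL)
            yield(ctx);   /* give other work a chance after every run */
        runs++;
    }
    return runs;
}

/* Example hook: just counts how many times it was called. */
static void toy_count_yield(void *ctx)
{
    ++*(size_t *)ctx;
}
```

With these assumed sizes the loop offers 4096 / 64 = 64 yield points per
page, which is how the thread stays out of the way at the expense of its own
zeroing rate.
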
>
> Hopefully this was interesting,
> -- vs
Article ID (AID): #1CcvNeGu (DFBSD_kernel)