Re: Understanding the idlezero code
Very interesting, thanks!
Tim
On Wed, Sep 22, 2010 at 10:36 AM, Venkatesh Srinivas <me@endeavour.zapto.org> wrote:
> Hi,
>
> A feature added to DragonFly during the 2.8 development cycle was
> idle-time page zeroing. Stated simply, the system will use some of its
> idle time to zero free pages, possibly saving time when they are
> allocated. Walking through the idle zero code is instructive - it provides
> a view into a number of DragonFly kernel subsystems.
>
> Some background:
>
> The DragonFly (and FreeBSD) virtual memory systems are organized around a
> number of queues, describing all of the page frames in a system. The
> queues are:
> active := Pages that are actively mapped and in use
> inactive := Pages that are dirty; these may be mapped, but will be
> reclaimed under memory pressure
> cache := Pages that are clean and reusable, but still hold their
> contents until needed under pressure
> free := Pages not actively holding data, ready for allocation
>
> The cache and free queues are actually divided into a number of
> sub-queues, one for each cache color, but they function as single queues.
> They are also loosely sorted, with zeroed pages at the tail.
>
> Page allocation requests, for example by a user process zero-fill fault,
> need pages of zeroes. The fault handler code will call vm_page_alloc
> (found in /usr/src/sys/vm/vm_page.c) with the VM_ALLOC_ZERO flag set,
> which will take a page from the tail of the free queue, if available. If the
> page was not already zeroed, it will be zeroed (by the caller!). Having zeroed
> pages around would save that time.
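
A rough userland sketch of the allocation path described above, for anyone who wants to poke at the idea outside the kernel. Everything prefixed fq_ or FQ_ is a made-up stand-in (fq_page for a page frame, FQ_ALLOC_ZERO for VM_ALLOC_ZERO, the zeroed field for PG_ZERO). Taking non-zero requests from the head of the queue is an assumption of this model, not something the walkthrough states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FQ_PAGE_SIZE  4096   /* assumed page size */
#define FQ_ALLOC_ZERO 0x1    /* hypothetical stand-in for VM_ALLOC_ZERO */

struct fq_page {
    struct fq_page *prev, *next;
    bool zeroed;                          /* stand-in for PG_ZERO */
    unsigned char data[FQ_PAGE_SIZE];
};

struct fq_queue {
    struct fq_page *head, *tail;
};

/* Return a page to the queue: zeroed pages go to the tail, others to the head. */
static void fq_free(struct fq_queue *q, struct fq_page *m)
{
    assert(q != NULL && m != NULL);
    m->prev = m->next = NULL;
    if (q->head == NULL) {
        q->head = q->tail = m;
    } else if (m->zeroed) {
        m->prev = q->tail;
        q->tail->next = m;
        q->tail = m;
    } else {
        m->next = q->head;
        q->head->prev = m;
        q->head = m;
    }
}

/* Take a page: from the tail if the caller wants zeroes, else from the head. */
static struct fq_page *fq_alloc(struct fq_queue *q, int flags)
{
    struct fq_page *m = (flags & FQ_ALLOC_ZERO) ? q->tail : q->head;

    if (m == NULL)
        return NULL;
    if (m->prev != NULL) m->prev->next = m->next; else q->head = m->next;
    if (m->next != NULL) m->next->prev = m->prev; else q->tail = m->prev;
    m->prev = m->next = NULL;
    return m;
}

/* Caller side of a zero-fill fault. Returns true if the page had to be zeroed here. */
static bool fq_zero_fill(struct fq_page *m)
{
    bool needed;

    assert(m != NULL);
    needed = !m->zeroed;
    if (needed)
        memset(m->data, 0, sizeof(m->data));
    m->zeroed = false;   /* the page is about to be written, so the hint is dropped */
    return needed;
}
```

The time saved by idle zeroing is exactly the memset() that fq_zero_fill() skips when the page taken from the tail was already zeroed.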
>
> In DragonFly, the idle zero logic runs in its own LWKT, which runs at
> system idle time. The LWKT is somewhat atypical - it works pretty hard to
> get out of the way, at a cost to its own idle zero rate. (In FreeBSD 4.x,
> it ran as part of the idle loop; in 6.x+, it runs in its own kernel
> thread.)
>
> Code time:
>
> The code is in /usr/src/sys/vm/vm_zeroidle.c
> (http://grok.x12.su/source/xref/dragonfly/sys/vm/vm_zeroidle.c) if you'd
> like to follow along.
>
> Typical of such walkthroughs, we will start at the very last line of the
> file:
>
> > SYSINIT(pagezero, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, pagezero_start, NULL);
>
> SYSINIT is a DFly/FBSD kernel macro, which marks a function to be called
> during boot. This SYSINIT invocation is saying 'call the function
> pagezero_start, when starting the VM daemons (SI_SUB_KTHREAD_VM), at any
> point during the VM daemon startup (SI_ORDER_ANY), with NULL args'.
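
To make the ordering idea concrete, here is a small userland model of "run registered functions sorted by subsystem, then by order". It is only an illustration: the si_* names and the numeric values are invented, and the kernel collects its SYSINIT entries by different means than a hand-built array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SI_MODEL_ORDER_ANY 0xfffffffu   /* hypothetical "run last within the subsystem" */

struct si_entry {
    unsigned subsystem;     /* plays the role of SI_SUB_KTHREAD_VM and friends */
    unsigned order;         /* plays the role of SI_ORDER_ANY and friends */
    void (*func)(void *);
    void *arg;
};

/* Sort key: subsystem first, then order within the subsystem. */
static int si_cmp(const void *a, const void *b)
{
    const struct si_entry *x = a, *y = b;

    if (x->subsystem != y->subsystem)
        return (x->subsystem < y->subsystem) ? -1 : 1;
    if (x->order != y->order)
        return (x->order < y->order) ? -1 : 1;
    return 0;
}

/* "Boot": sort the table, then call each function with its argument. */
static void si_run_all(struct si_entry *tab, size_t n)
{
    assert(tab != NULL);
    qsort(tab, n, sizeof(tab[0]), si_cmp);
    for (size_t i = 0; i < n; i++)
        tab[i].func(tab[i].arg);
}

/* Demo init function: appends the first character of its argument to a log. */
static char si_log[16];
static size_t si_log_len;

static void si_record(void *arg)
{
    assert(arg != NULL && si_log_len < sizeof(si_log) - 1);
    si_log[si_log_len++] = *(const char *)arg;
    si_log[si_log_len] = '\0';
}
```

Registering si_record three times with keys (2, SI_MODEL_ORDER_ANY), (1, 0) and (2, 1) and calling si_run_all() runs them in the order (1, 0), (2, 1), (2, SI_MODEL_ORDER_ANY), regardless of the order in which they appear in the table.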
>
> The pagezero_start function, just above the SYSINIT invocation, looks
> like (simplified):
>
> > static void pagezero_start(void __unused *arg) {
> >     struct thread *td;
> >
> >     idlezero_nocache = bzeront_avail;
> >     kthread_create(vm_pagezero, NULL, &td, "pagezero");
> > }
>
> This function captures a flag from the platform-specific code: whether the
> bzeront function is available. (On SSE2 i386 systems, we use the MOVNTI
> instruction to zero pages, which avoids polluting the processor's data cache
> with lots of zeroes; this flag indicates whether MOVNTI is available.) The
> function then kicks off an LWKT, named 'pagezero', running the vm_pagezero
> function. The LWKT starts up with the MP lock held.
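
For readers curious what zeroing with MOVNTI looks like, here is a userland sketch using the SSE2 compiler intrinsic _mm_stream_si32, which compilers normally emit as MOVNTI. This is not the kernel's bzeront (which I have not reproduced here); nt_bzero is an invented name, and the memset() fallback is only there so the sketch builds on machines without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Zero len bytes at dst, bypassing the data cache where possible.
 * dst must be 4-byte aligned and len a multiple of 4.
 */
static void nt_bzero(void *dst, size_t len)
{
    assert(((uintptr_t)dst % 4) == 0 && (len % 4) == 0);
#if defined(__SSE2__)
    int *p = dst;

    for (size_t i = 0; i < len / 4; i++)
        _mm_stream_si32(&p[i], 0);   /* non-temporal 32-bit store */
    _mm_sfence();                    /* make the streamed stores globally visible in order */
#else
    memset(dst, 0, len);             /* fallback: ordinary cached stores */
#endif
}
```

The result in memory is the same as memset(); the difference is only in whether the zeroes displace useful lines from the cache on the way out.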
>
> The vm_pagezero() function, lurking just above in this file, is the core
> of the idle zero logic. It performs some setup work:
> > lwkt_setpri_self(TDPRI_IDLE_WORK);
> > lwkt_setcpu_self(globaldata_find(ncpus - 1));
>
> This sets its priority to just above the idle thread and moves it to
> the last CPU on the system. It then enters its main loop.
>
> The idle zero main loop is constructed as a state machine, with a few
> states - IDLE, GET_PAGE, ZERO_PAGE, and RELEASE_PAGE. The main loop
> switches on the current state, executes a small block of code, then
> transitions states. At each transition, it calls lwkt_yield(), to switch
> to any ready LWKTs on the current CPU.
>
> The idle state is the state that the logic starts in:
> > case STATE_IDLE:
> >     tsleep(&zero_state, 0, "pgzero", sleep_time);
> >     if (vm_page_zero_check())
> >         npages = idlezero_rate / 10;
> >     sleep_time = vm_page_zero_time();
> >     if (npages)
> >         state = STATE_GET_PAGE;
> >     break;
>
> In the idle state, the idle zero LWKT sleeps for 'sleep_time'; when there
> are no pages to zero, sleep_time is a long interval - 'LONG_SLEEP_TIME', or
> ten times the system clock; when there are, we sleep for
> 'DEFAULT_SLEEP_TIME', a tenth of the system clock. When the LWKT wakes from
> its sleep, it calls vm_page_zero_check(), also in this file;
> vm_page_zero_check() will be described later, but it returns true if we
> should be zeroing pages. If so, we compute the number of pages to zero, how
> long to sleep on the next entry to the idle state, and transition to the
> GET_PAGE state. We break between transitions, to attempt lwkt_yield() again.
>
> The GET_PAGE state logic looks like:
> > case STATE_GET_PAGE:
> >     m = vm_page_free_fromq_fast();
> >     if (m == NULL) {
> >         state = STATE_IDLE;
> >     } else {
> >         state = STATE_ZERO_PAGE;
> >         buf = lwbuf_alloc(m);
> >         pg = (char *)lwbuf_kva(buf);
> >     }
> >     break;
>
> In GET_PAGE state we attempt to acquire a page to zero, using a relatively
> new interface, vm_page_free_fromq_fast(). This routine, in vm_page.c,
> attempts to get a page from one of the free queues. If it fails to get one,
> we return to the idle state; otherwise, we prepare to enter the ZERO_PAGE
> state. We allocate an lwbuf and bind it to the page we wish to zero.
>
> In the ZERO_PAGE state, we actually zero the page:
> > case STATE_ZERO_PAGE:
> >     while (i < PAGE_SIZE) {
> >         if (idlezero_nocache == 1)
> >             bzeront(&pg[i], IDLEZERO_RUN);
> >         else
> >             bzero(&pg[i], IDLEZERO_RUN);
> >         i += IDLEZERO_RUN;
> >         lwkt_yield();
> >     }
> >     state = STATE_RELEASE_PAGE;
> >     break;
>
> We loop across the entire page, zeroing 64 bytes at a time. After each
> 64-byte run, we call lwkt_yield(), in case any LWKTs are waiting to run. If the MOVNTI
> instruction is available, we use it via bzeront(); otherwise, we use
> bzero(). When we are done zeroing the page, we enter the RELEASE_PAGE state.
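
A quick way to see how often this loop offers to give up the CPU: with a 4096-byte page (an assumption, typical for i386) and 64-byte runs, there are 4096 / 64 = 64 yield points per page. The sketch below models that in userland. The zr_ names are invented, and the yield callback merely stands in for lwkt_yield().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ZR_PAGE_SIZE 4096   /* assumed page size */
#define ZR_RUN         64   /* bytes zeroed between yield points, per the text */

/*
 * Zero one page in short runs, offering to yield after each run.
 * Returns the number of runs, which is also the number of yield points.
 */
static unsigned zr_zero_page(unsigned char *pg, void (*yield)(void))
{
    unsigned runs = 0;

    assert(pg != NULL);
    for (size_t i = 0; i < ZR_PAGE_SIZE; i += ZR_RUN) {
        memset(&pg[i], 0, ZR_RUN);
        if (yield != NULL)
            yield();        /* stand-in for lwkt_yield() */
        runs++;
    }
    return runs;
}
```

Shrinking ZR_RUN makes the zeroer more polite at the price of more loop overhead per page, which is the trade-off the LWKT makes when it gets out of the way at a cost to its own zeroing rate.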
>
> In the RELEASE_PAGE state, we tear down the lwbuf and return the page to
> the free queue:
> > case STATE_RELEASE_PAGE:
> >     lwbuf_free(buf);
> >     vm_page_flag_set(m, PG_ZERO);
> >     vm_page_free_toq(m);
> >     state = STATE_GET_PAGE;
> >     ++idlezero_count;
> >     break;
>
> We first release the lwbuf; we then mark the page as zeroed and return it
> to the free queue. We transition back to the GET_PAGE state, and bump an
> idlezero counter.
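
Putting the four states together, here is a compact userland model of the whole loop. It is a sketch under heavy simplification: the zm_ names are invented, a plain array stands in for the free queues, there is no lwbuf mapping, sleeping, yielding, or page budget, and the page is zeroed in one step rather than in 64-byte runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ZM_PAGE_SIZE 4096   /* assumed page size */

enum zm_state { ZM_IDLE, ZM_GET_PAGE, ZM_ZERO_PAGE, ZM_RELEASE_PAGE };

struct zm_page {
    bool zeroed;                        /* stand-in for PG_ZERO */
    unsigned char data[ZM_PAGE_SIZE];
};

struct zm_machine {
    enum zm_state state;
    struct zm_page *pool;               /* stands in for the free queues */
    size_t npool;
    struct zm_page *cur;                /* page currently being zeroed */
    unsigned long zeroed_count;         /* stand-in for idlezero_count */
};

/* Find a free page that still needs zeroing, or NULL if there is none. */
static struct zm_page *zm_get_page(struct zm_machine *z)
{
    for (size_t i = 0; i < z->npool; i++)
        if (!z->pool[i].zeroed)
            return &z->pool[i];
    return NULL;
}

/* Perform one transition. Returns false when the machine falls back to idle. */
static bool zm_step(struct zm_machine *z)
{
    switch (z->state) {
    case ZM_IDLE:                       /* the kernel thread would sleep here */
        z->state = ZM_GET_PAGE;
        break;
    case ZM_GET_PAGE:
        z->cur = zm_get_page(z);
        if (z->cur == NULL) {
            z->state = ZM_IDLE;
            return false;
        }
        z->state = ZM_ZERO_PAGE;
        break;
    case ZM_ZERO_PAGE:
        memset(z->cur->data, 0, ZM_PAGE_SIZE);
        z->state = ZM_RELEASE_PAGE;
        break;
    case ZM_RELEASE_PAGE:
        z->cur->zeroed = true;          /* mark as zeroed and "return" it to the pool */
        z->cur = NULL;
        z->zeroed_count++;
        z->state = ZM_GET_PAGE;
        break;
    }
    return true;
}

/* Run from idle until no page is left to zero; returns the pages zeroed so far. */
static unsigned long zm_run(struct zm_machine *z)
{
    assert(z != NULL && z->pool != NULL);
    z->state = ZM_IDLE;
    while (zm_step(z))
        ;
    return z->zeroed_count;
}
```

With a pool of three pages where one is already marked zeroed, zm_run() zeroes the other two and leaves the machine back in the idle state.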
>
> The operation of the idle zero code can be monitored via sysctls - the
> sysctl vm.stats.vm.v_ozfod tracks the total number of zero-fill faults which
> found a zero-filled page waiting for them (vm.stats.vm.v_zfod tracks total
> zfod faults). The sysctl vm.idlezero_count tracks the total number of pages the
> idle zero logic has managed to zero-fill.
>
> Hopefully this was interesting,
> -- vs
>