Re: Cleanup and untangling of kernel VM initialization
On Thu, Mar 07, 2013 at 06:03:51PM +0100, Andre Oppermann wrote:
> On 01.02.2013 18:09, Alan Cox wrote:
> > On 02/01/2013 07:25, Andre Oppermann wrote:
> >> Rebase auto-sizing of limits on the available KVM/kmem_map instead of
> >> physical memory. Depending on the kernel and architecture configuration
> >> these two can be very different.
> >>
> >> Comments and reviews appreciated.
> >>
> >
> > I would really like to see the issues with the current auto-sizing code
> > addressed before any of the stylistic changes or en-masse conversions to
> > SYSINIT()s are considered. In particular, can we please start with the
> > patch that moves the pipe_map initialization? After that, I think that
> > we should revisit tunable_mbinit() and "maxmbufmem".
>
> OK. I'm trying to describe and explain the big picture for myself and
> other interested observers. The following text and explanations are going
> to be verbose and sometimes redundant. If something is incorrect or incomplete
> please yell; I'm not an expert in all these parts and may easily have missed
> some subtle aspects.
>
> The kernel_map serves as the container of the entire available kernel VM
> address space, including the kernel text, data and bss itself, as well as
> other bootstrapped and pre-VM allocated structures.
>=20
> The kernel_map should cover a reasonably large amount of address space to be
> able to serve the various kernel subsystems' demands in memory allocation.
> The cpu architecture's address range (32 or 64 bits) puts a hard ceiling on
> the total size of the kernel_map. Depending on the architecture, the
> kernel_map covers a specific range within the total addressable range.
>
> * VM_MIN_KERNEL_ADDRESS
> * [KERNBASE]
> * kernel_map [actually mapped KVM range, direct allocations]
> * kernel text, data, bss
> * bootstrap and statically allocated structures [pmap]
> * virtual_avail [start of useable KVM]
> * kmem_map [submap for (most) UMA zones and kernel malloc]
> * exec_map [submap for temporary mapping during process exec()]
> * pipe_map [submap for temporary buffering of data between piped processes]
> * clean_map [submap for buffer_map and pager_map]
> * buffer_map [submap for BIO buffers]
> * pager_map [submap for temporary pager IO holding]
> * memguard_map [submap for debugging of UMA and kernel malloc]
> * ... [kernel_map direct allocations, free and unused space]
> * kernel_map [end of kernel_map]
> * ...
> * virtual_end [end of possible KVM]
> * VM_MAX_KERNEL_ADDRESS
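As a small aside, the boundaries named above can be expressed in a few lines
of C; this is only an illustrative sketch, assuming the usual per-architecture
VM_MIN_KERNEL_ADDRESS/VM_MAX_KERNEL_ADDRESS constants and the
virtual_avail/virtual_end globals set up by the pmap bootstrap:

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_param.h>        /* pulls in machine/vmparam.h */

    extern vm_offset_t virtual_avail;       /* start of usable KVM */
    extern vm_offset_t virtual_end;         /* end of possible KVM */

    /* The architecture's hard ceiling on kernel address space. */
    static vm_size_t
    kvm_ceiling(void)
    {
            return (VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS);
    }

    /* What is left for submaps and direct kernel_map allocations. */
    static vm_size_t
    kvm_usable(void)
    {
            return (virtual_end - virtual_avail);
    }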
>
> Some of the kernel_map's submaps are special in being non-pageable and
> by pre-allocating the necessary pmap structures to avoid page
> faults. The pre-allocation consumes physical memory. Thus a submap's
> pre-allocation should not be larger than a reasonably small fraction
> of available physical memory to leave enough space for other kernel
> and userspace memory demands.
Preallocation is done to ensure that calls to functions like pmap_qenter()
always succeed and never have to sleep in order to succeed.
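To make the pmap_qenter() point concrete, here is a minimal sketch (the
helper names are made up) of the pattern the buffer and pager code rely on:
KVA comes from a submap whose page-table pages were preallocated when the
submap was created, so mapping wired pages into it can neither fail nor
sleep:

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/pmap.h>

    /*
     * Map 'count' wired pages 'ma' into a KVA chunk 'kva' carved out of
     * such a submap.  pmap_qenter() allocates nothing here; the needed
     * page-table pages already exist.
     */
    static void
    io_pages_map(vm_offset_t kva, vm_page_t *ma, int count)
    {
            pmap_qenter(kva, ma, count);
    }

    /* Tear the temporary mapping down again once the I/O is done. */
    static void
    io_pages_unmap(vm_offset_t kva, int count)
    {
            pmap_qremove(kva, count);
    }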
>
> The pseudo-code for a dynamic calculation of a submap size would look like this:
>
>   submap.size = min(physmem.size / pmap.prealloc_max_fraction /
>                     pmap.size_per_page * page_size, kernel_map.free_size)
>=20
> The pmap.prealloc_max_fraction is the largest fraction of physical
> memory we allow the pre-allocated pmap structures of a single submap
> to occupy.
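The same pseudo-code written out as a plain C helper; all parameter names
here are illustrative placeholders rather than existing kernel symbols:

    #include <sys/param.h>          /* PAGE_SIZE, MIN() */

    /*
     * physmem_bytes:         physical memory in bytes
     * prealloc_max_fraction: e.g. 64 to spend at most 1/64 of physical
     *                        memory on preallocated pmap structures
     * pmap_bytes_per_page:   pmap bookkeeping cost per mapped page
     * kernel_map_free:       KVA still unreserved in the kernel_map
     */
    static vm_size_t
    submap_autosize(vm_size_t physmem_bytes, u_int prealloc_max_fraction,
        vm_size_t pmap_bytes_per_page, vm_size_t kernel_map_free)
    {
            vm_size_t npages, size;

            /* How many pages can we afford to preallocate pmap state for? */
            npages = physmem_bytes / prealloc_max_fraction /
                pmap_bytes_per_page;
            size = npages * PAGE_SIZE;

            /* Never ask for more KVA than the parent map can provide. */
            return (MIN(size, kernel_map_free));
    }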
>
> Separate submaps are usually used to segregate certain types of memory
> usage and to have individual limits applied to them (their sizing
> formulas are collected into one C sketch after this list):
>
> kmem_map: tries to be as large as possible. It serves the bulk of
> all dynamically allocated kernel memory usage. It is the memory
> pool used by UMA and kernel malloc. Almost all kernel structures
> come from here: process-, thread-, file descriptors, mbuf's and
> mbuf clusters, network connection control blocks, sockets, etc...
> It is not pageable. Calculation: is currently only partially done
> dynamically and the MD parts can specify particular min, max limits
> and scaling factors. It can likely be generalized, with only very
> special platforms requiring additional limits.
>
> exec_map: is used as temporary storage to set up a process's address
> space and related items. It is very small and by default holds
> only 16 entries. Calculation: (exec_map_entries * round_page(PATH_MAX
> + ARG_MAX)).
>
> pipe_map: is used to move piped data between processes. It is
> pageable memory. Calculation: min(physmem.size, kernel_map.size) /
> 64.
>
> clean_map: overarching submap to contain the buffer_map and
> pager_map. Likely no longer necessary and a leftover from earlier
> incarnations of the kernel VM.
>
> buffer_map: is used for BIO structures to perform IO between the
> kernel VM and storage media (disk). Not pageable. Calculation:
> min(physmem.size, kernel_map.size) / 4 up to 64MB and 1/10
> thereafter.
>
> pager_map: is used for pager IO to a storage media (disk). Not
> pageable. Calculation: MAXPHYS * min(max(nbuf/4, 16), 256).
It is more versatile than that. The space is used for pbufs, and pbufs
currently also serve physio, clustering, and aio needs.
>
> memguard_map: is a special debugging submap substituting for parts of
> kmem_map. Normally not used.
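The explicit sizing formulas quoted above (exec_map, pipe_map, buffer_map,
pager_map) collected into one place as C. This is only a sketch of the
arithmetic: the function and structure are made up, the knee in the
buffer_map case is one reading of the '1/4 up to 64MB and 1/10 thereafter'
rule, and the real code keeps these computations spread across several
subsystems:

    #include <sys/param.h>          /* MIN, MAX, MAXPHYS, round_page */
    #include <sys/syslimits.h>      /* PATH_MAX, ARG_MAX */

    struct submap_sizes {
            vm_size_t exec_sz, pipe_sz, buffer_sz, pager_sz;
    };

    static void
    submap_sizes(vm_size_t physmem_bytes, vm_size_t kernel_map_bytes,
        int nbuf, struct submap_sizes *s)
    {
            vm_size_t base;

            base = MIN(physmem_bytes, kernel_map_bytes);

            /* exec_map: a handful of argument areas, 16 by default. */
            s->exec_sz = 16 * round_page(PATH_MAX + ARG_MAX);

            /* pipe_map: 1/64 of the smaller of physmem and KVM. */
            s->pipe_sz = base / 64;

            /*
             * buffer_map: 1/4 of 'base' until that reaches 64MB (i.e.
             * base = 256MB), 1/10 of everything above that point.
             */
            if (base <= 256 * 1024 * 1024)
                    s->buffer_sz = base / 4;
            else
                    s->buffer_sz = 64 * 1024 * 1024 +
                        (base - 256 * 1024 * 1024) / 10;

            /* pager_map: a bounded number of maximum-sized pbufs. */
            s->pager_sz = MAXPHYS * MIN(MAX(nbuf / 4, 16), 256);
    }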
>
> There is some competition between these maps for physical memory. One
> has to be careful to find a total balance among them wrt. static and
> dynamic physical memory use.
They mostly compete for KVA, not for physical memory.
>
> Within the submaps, especially the kmem_map, we have a number of
> dynamic UMA suballocators where we have to put a ceiling on their
> total memory usage to prevent them from consuming all physical *and/or*
> kmem_map virtual memory. This is done with UMA zone limits.
Note that architectures with a direct map do not use kmem_map for
small allocations. uma_small_alloc() uses the direct map to provide
the VA of the new page. kmem_map is only needed for multi-page
allocations, to provide a contiguous virtual mapping.
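For illustration, roughly what that boils down to on a direct-map
architecture such as amd64; this is a heavily simplified, amd64-specific
sketch with flags, statistics and the wait/retry handling omitted:

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_param.h>
    #include <vm/vm_object.h>
    #include <vm/vm_page.h>

    /*
     * Grab one page straight from the page allocator and return its
     * direct-map address.  No kmem_map KVA is consumed; the direct map
     * already provides a virtual address for every physical page.
     */
    static void *
    small_alloc_sketch(int req)
    {
            vm_page_t m;

            m = vm_page_alloc(NULL, 0, req | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
            if (m == NULL)
                    return (NULL);
            return ((void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)));
    }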
>
> No single externally exploitable UMA zone should be able to consume
> all available physical memory. This applies for example to the
> number of processes, file descriptors, sockets, mbufs and mbuf
> clusters. These need to be limited to a fraction of available physical
> memory that still permits reasonably heavy work-loads. However there is
> going to be overcommit among them, and not all of them can be at their
> limit at the same time. Probably none of these UMA zones should be allowed
> to occupy more than 1/2 of all available physical memory. Often
> individual UMA zone limits have to be put into context and related to
> other concurrent UMA zones. This usually means a reduced UMA zone limit
> for a particular zone. Balancing this takes a slight amount of voodoo
> magic and knowledge of common extreme work-loads. On the other hand,
> for most of these zones allocations are permitted to fail, rendering
> for example an attempt at connection establishment unsuccessful; it can
> be retried later.
>
> Generic pseudo-code: UMA zone limit = min(kmem_map.size, physmem.size)
> / 4 (or other appropriate fraction).
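A sketch of applying that rule to a single zone. The helper and the 1/4
fraction are illustrative; uma_zone_set_max() is the existing knob, and
since it takes an item count the byte budget has to be converted using the
zone's item size:

    #include <sys/param.h>
    #include <sys/limits.h>         /* INT_MAX */
    #include <vm/uma.h>

    static void
    zone_apply_limit(uma_zone_t zone, size_t item_size,
        vm_size_t kmem_bytes, vm_size_t physmem_bytes)
    {
            vm_size_t budget, nitems;

            /* Do not let this zone eat more than 1/4 of the scarcer resource. */
            budget = MIN(kmem_bytes, physmem_bytes) / 4;
            nitems = budget / item_size;
            uma_zone_set_max(zone, nitems > INT_MAX ? INT_MAX : (int)nitems);
    }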
>
> It could be that some of the kernel_map submaps are no longer
> necessary and their purpose could simply be emulated by using an
> appropriately limited UMA zone. For example the exec_map is very small
> and only used for the exec arguments. Putting this into pageable
> memory isn't very useful anymore.
I disagree. Having the strings copied on execve() pageable is good;
the default maximum of around 260KB for the strings would otherwise be
quite a load on the allocator.
>
> Also the interesting construct of the clean_map containing only
> the buffer_map and pager_map doesn't seem necessary anymore and is
> probably a remnant of an earlier incarnation of the VM.
>
> Comments, discussion and additional input welcome.
>
> -- Andre