Lexa Software: Inet-Admins@info.east.ru archive

Inet-Admins mailing list archive (inet-admins@info.east.ru)
Re: [inet-admins] Re: DG ftpd Q (fwd)

To: inet-admins@info.east.ru
Subject: Re: [inet-admins] Re: DG ftpd Q (fwd)
From: Vladimir Vorobyev <bob@turbo.nsk.su>
Date: Thu, 29 Jun 2000 19:34:20 +0700
In-Reply-To: <200006290820.MAA14777@main.piter.net>;from Cyril A. Vechera <cyril@main.piter.net>
References: <200006290820.MAA14777@main.piter.net>
Thu Jun 29 15:20, Cyril A. Vechera <cyril@main.piter.net> wrote:
> Taking FreeBSD as an example:
> poll is still a syscall, and with many sockets (~1000) quite a lot
> is copied into kernel memory on every call.  select passes much
> less, but adding/checking a descriptor costs more than with poll.
> Other OSes should be very similar.
> 
> If a kernel-level thread uses descriptors that block on
> read/write, there are no extra syscalls and select/poll is not used,
> so it should really be faster with a large number of connections.
> 
> But the question is: what overhead does managing the threads
> themselves add?
> 
That depends on the implementation. I recommend reading the message
below; it sobers you up right away ;)
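
To put rough numbers on the poll/select comparison quoted above, here
is a back-of-the-envelope sketch.  It assumes the usual layout of
struct pollfd (int fd; short events; short revents) and one bit per
descriptor number in each select set; real kernels round the select
sets up to a whole machine word, and the helper names poll_bytes and
select_bytes are made up for this illustration.

```python
import struct

# struct pollfd is { int fd; short events; short revents; } on the
# common Unix ABIs, which packs to 8 bytes (assumption stated above).
POLLFD_SIZE = struct.calcsize("ihh")

def poll_bytes(nfds):
    """Bytes copied into the kernel for one poll() call on nfds descriptors."""
    return nfds * POLLFD_SIZE

def select_bytes(max_fd, sets=3):
    """Bytes copied in for one select() call: one bit per descriptor
    number up to max_fd, for each of the read/write/except sets."""
    bytes_per_set = (max_fd + 1 + 7) // 8
    return sets * bytes_per_set

print(poll_bytes(1000))   # 8000
print(select_bytes(999))  # 375
```

So for ~1000 sockets, poll hands the kernel about twenty times more
data per call than select does, which matches the remark above.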

From: Chris Torek <torek@BSDI.COM>
Date: Sun, 30 Jan 2000 01:53:37 -0700
Subject: Re: kernel threads

>Is this the case?  I thought that not having kernel threads meant that
>the kernel itself only used one CPU (ie, for doing the work of scheduling,
>context switching, handling the disk cache, etc... (I could be wrong), but
>how is one supposed to write an MP program if not with pthreads?  Does one
>do a fork/vfork and communicate between processes?  What is the established
>way?  Has anyone written a program using multiple processes personally?
>Also, if not pthreads, is there a standard like it (I think any POSIX
>system has pthreads), especially to be able to have a good chance of
>communicating with processes on other systems?

There are a bunch of questions here that require some definitions
to answer.  Unfortunately, the exact definition of a "thread" seems
to depend on who you ask. :-)

For me, the most basic definition of a "thread" is "an execution
context".  This definition is loose enough that on the i386, a
saved pc (or "eip") and stack pointer (%esp) could qualify.  In
fact, this is almost all that the C library "pthread" code needs
per thread.  (It also stores other register values and an FPU
context.  For the curious, see src/lib/libc/i386/threads/thread_machdep.h.)

A tiny thread context like this has its drawbacks, but also has some
big advantages.  In particular, *the smaller the context, the faster
it is to switch threads*.  This is an incentive to keep the size down.
Other things exert pressure to bump the size up, as we shall see.

At this point I think I should note that the i386 (and in fact all
the machines on which any of the various BSD flavors run) uses a
stack, and the user-space stack is actually part of the thread
context above.  It is just that, instead of saving the stack of
the current thread, the pthreads code simply saves the stack
*pointer* -- the stack itself is implicitly "whatever is in the
existing address space at that address".

Here we have another term that needs a definition.  What is an
"address space"?  Modern Unix-like systems run on machines with
"paging" and "virtual address spaces".  On these machines, whenever
the computer uses a memory address, there is some kind of implied
"address space" context that goes with it.  (On some machines, you
can explicitly specify an address space.  The SPARC has "load
alternate" and "store alternate" instructions, for instance.  This
is not something most people need to think about though.)  The CPU
maps these "virtual addresses", by some CPU-specific method, to
"physical addresses" elsewhere in memory.  If the physical page is
there, everything runs at "full speed ahead".  Otherwise -- for
instance, if you use an address that is not mapped at all, or is
mapped but not physically loaded at the moment ("paged out"), or
is marked "look but don't touch" -- the CPU hands the job off to
the kernel, which can then do a "page in" or kill an errant process
or whatever is needed.

Traditionally, Unix-like systems have bound "address space" to
"process".  Whenever you start a new process (with "fork"), the
system makes a copy -- these days, using a trick called "copy on
write" rather than actually copying everything -- of everything in
your current address space, and then lets the new "child" process
run with the copy.  Thus, if the child process looks at the same
addresses as the parent, it sees the same original data, but the
first time it stores something there, the kernel gets a chance to
step in and copy the "source" page to a new, different, physical
page.  Then it shuffles the translation maps for the child so that
the child "sees" the copy instead of the original, and can only
change the copy.
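
The visible effect of this can be sketched in a few lines of Python
on any Unix-like system.  The sketch shows only what the two
processes observe (a store in the child never reaches the parent);
the copy-on-write trick itself happens inside the kernel and is not
visible here.  The function name fork_and_modify is made up for the
illustration.

```python
import os

def fork_and_modify():
    """Fork, let the child overwrite a buffer, and return what each
    process sees afterwards as (parent_view, child_view)."""
    data = bytearray(b"original")
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this store lands in the child's own copy of the page.
        os.close(read_end)
        data[:] = b"modified"
        os.write(write_end, bytes(data))
        os._exit(0)
    # Parent: collect the child's view, then look at our own buffer.
    os.close(write_end)
    child_view = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    return bytes(data), child_view

print(fork_and_modify())
```

The parent still sees b"original" even though the child, looking at
the same virtual address, stored b"modified" there.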

(Often, the child will soon call "exec", which means "throw away
all of the current address space, and instead load this other
program off the file system."  This undoes the copy-on-write sharing.
There is obviously some performance penalty for marking everything
copy-on-write long enough for the child to do any last minute setup
-- such as closing files or dup2()ing a pipe descriptor for piped
commands -- then having the child say "okay, forget all the work
you just did, go run this other program now".  What surprises me
is that this overhead often turns out to be *lower* than the overhead
on other systems with all-in-one "spawn" calls, in place of
fork+exec.)

But what happens if you *want* the child process to have access to
the same underlying physical memory?  Currently, BSD/OS only allows
this via mmap (well, that and the wretched System V "shm" facility,
but internally this is just another mmap).  In this case, you could
mmap() your memory "shared" rather than "private", and when you
fork, it will still be shared.  You cannot share absolutely everything
this way though -- you can only share *new* mappings, not existing
ones.  This turns out to be just about the opposite of what you
would want for POSIX threads: you might want *not* to share the
new thread's stack, but *do* share everything else.  (In practice,
and maybe by POSIX definition -- I do not have the document handy
-- pthreads implementations even share the per-thread stacks.)
Right now, the only way to do this on BSD/OS is not to fork.
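
As a rough illustration of the mmap route, the following Python
sketch creates a new anonymous MAP_SHARED mapping before forking and
compares it with an ordinary (private) buffer.  It is a generic
Unix example, not BSD/OS-specific code, and the function name
shared_vs_private is made up for the illustration.

```python
import mmap
import os

def shared_vs_private():
    """Return (shared_byte, private_byte) as seen by the parent after
    the child has stored 42 into both buffers."""
    # A new anonymous mapping, explicitly marked shared.
    shared = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
    # Ordinary heap memory: private, so fork() gives the child a copy.
    private = bytearray(1)
    pid = os.fork()
    if pid == 0:
        shared[0] = 42
        private[0] = 42
        os._exit(0)
    os.waitpid(pid, 0)
    result = (shared[0], private[0])
    shared.close()
    return result

print(shared_vs_private())
```

Only the store into the shared mapping is visible to the parent; the
store into the private buffer stays in the child.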

Anyway, let me use these as working definitions now:

 - thread: a context within an address space (pc, sp, some registers)
 - address space: a mapping context for virtual memory => real memory

Now we come to the subject of "kernel threads".  What the heck *is*
a "kernel thread"?  Before I can define this, I need to explain
how Unix-like systems traditionally handle allowing any given
process to make a kernel call.

Since our machines have stacks, and the kernel is coded in C just
like lots of user programs, the kernel needs a stack too.  When a
process makes a system call (by some CPU-specific method), the CPU
hands off control to the kernel, which must select the appropriate
kernel stack.  Thus the kernel actually needs one stack per process,
not just one big stack.  The process might ask the kernel to read
some data from a file, for instance.  The kernel will run through
the code path for "read a file", and build up a stack full of
"things we need to know to read this file" data.  At some point,
the kernel might have to stop and wait for a disk to spin around
for a while and *get* the data.  We call this wait "blocking" or
"sleeping".  While the process is blocked/asleep, it can make no
progress on anything else, *including user-mode code*.

This is one of those drawbacks I mentioned above.  User-level thread
switching is really fast -- but even if you have just the one CPU,
as soon as a user-level thread makes a "blocking" call, the entire
set of user-level threads is stuck.  The BSD/OS pthreads implementation
sidesteps the problem as much as it can by avoiding blocking calls.
Most system calls have non-blocking variants, and/or can be queried
in advance via the select() system call.  User code can thus put
off making the *actual* call until it will finish right away, or
there are no threads marked "I want to run".  There is a lot of
"hair" in this, which makes everything a bit slower, but it keeps
one thread from blocking all the others all the time.
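
The general idea (not the actual BSD/OS pthreads code) can be
sketched with Python generators standing in for user-level threads.
Each "thread" says which descriptor it wants to read, and a small
scheduler uses select() to resume it only when the read will finish
right away.  All names here (reader, run, demo) are made up, and the
sketch assumes each thread waits on its own descriptor.

```python
import os
import select

def reader(name, fd, log):
    """A user-level 'thread': yields the descriptor it wants to read
    instead of issuing a read that might block."""
    while True:
        yield fd                   # "wake me when fd is readable"
        chunk = os.read(fd, 1024)  # returns at once: select said so
        if not chunk:              # end of file
            return
        log.append((name, chunk))

def run(threads):
    """Resume each thread only when its pending read is ready."""
    waiting = {}                   # descriptor -> thread waiting on it
    for t in threads:
        try:
            waiting[next(t)] = t
        except StopIteration:
            pass
    while waiting:
        ready, _, _ = select.select(list(waiting), [], [])
        for fd in ready:
            t = waiting.pop(fd)
            try:
                waiting[next(t)] = t
            except StopIteration:
                pass

def demo():
    r1, w1 = os.pipe()
    r2, w2 = os.pipe()
    log = []
    os.write(w1, b"first")
    os.close(w1)
    os.write(w2, b"second")
    os.close(w2)
    run([reader("a", r1, log), reader("b", r2, log)])
    os.close(r1)
    os.close(r2)
    return log

print(sorted(demo()))
```

The bookkeeping in run() is a small taste of the "hair" mentioned
above: every potentially blocking call has to be routed through the
scheduler first.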

Other systems, such as Solaris, have a different answer for this
problem.  These are the "kernel threads" people are talking about
here.  The kernel still builds up a stack of "read stuff", and
still "blocks" on that read, causing that thread to stop dead until
it is unblocked.  But -- this is the key -- that thread is just
one particular kernel stack; there are other kernel stacks for
the same "process".  The "kernel stack + user address space" pair
is a "kernel thread", and if there is another kernel stack for
the same address space, that other kernel stack can handle other
kernel calls from other user-level threads within that address
space.

Since there are multiple kernel stacks, one thread can make one
system call on one CPU (blocking or not), and another thread can
make another system call on another CPU at the same time.  These
may have the same address space, but they have different kernel
stacks, so the two calls will not collide.  This requires a
"multi-threaded" kernel as well.  BSD/OS does not have this yet
(the "true SMP" system is still in development): not only do you
have just the one kernel stack per process, the kernel also prohibits
more than one process from running most "kernel code" at a time.  In
essence, there is one gigantic lock that protects the entire kernel,
so that kernel code that assumes that "I am the only CPU" still
works.  In order to get into the kernel, you must get the "giant
lock".  There are special exceptions for the scheduler, so that if
one process has the kernel busy, and another process tries to make
a system call, the second CPU can pick a *third* process to run in
user-mode.  That lets it get useful work done unless *every* process
needs kernel service.

Note that systems like this have made a distinction here between
"process" and "thread".  The "process" is still mapped one-to-one
to "address space"; within that space, there are multiple "threads".
There is another way to do this that I think fits better within
the traditional Unix model.  Instead of adding a new "thread"
concept, what happens if we decouple "process" and "address space"?

Imagine taking the existing fork() call, which has no parameters
and just makes a complete *copy* of the current process, and adding
a parameter to it telling it what it should copy, and what it should
leave shared.  This new "shared fork" call can be called as:

	sfork(SHARE_NOTHING)

which is just a traditional fork(), or as:

	sfork(SHARE_EVERYTHING)

which makes a new process that shares *everything* except the kernel
stack.  The latter does everything that the Solaris-style kernel
thread does.  In between the two, you can make calls that, e.g.,
just share the file descriptor table.  Name any object that fork()
copies, and with sfork(), you can decide whether to copy it or share
it.  The plan for BSD/OS is to have an sfork() call, using flags
like those you can find today in <sys/sfork.h>.
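
A toy model may make the copy-or-share choice more concrete.  This
is plain Python pretending to be a kernel, not a binding to any real
system call.  SHARE_NOTHING and SHARE_EVERYTHING are the names used
above; the in-between flags SHARE_MEMORY and SHARE_FDS, and the Proc
class, are invented for the illustration and are not taken from
<sys/sfork.h>.

```python
import copy
from dataclasses import dataclass, field

SHARE_NOTHING = 0
SHARE_MEMORY = 1      # hypothetical flag: share the address space
SHARE_FDS = 2         # hypothetical flag: share the descriptor table
SHARE_EVERYTHING = SHARE_MEMORY | SHARE_FDS

@dataclass
class Proc:
    memory: dict      # stands in for the address space
    fds: dict         # stands in for the file descriptor table
    kernel_stack: list = field(default_factory=list)  # never shared

def sfork(parent, flags):
    """Create a child, sharing or copying each object as flags say.
    The child always gets a fresh kernel stack of its own."""
    if flags & SHARE_MEMORY:
        memory = parent.memory
    else:
        memory = copy.deepcopy(parent.memory)
    if flags & SHARE_FDS:
        fds = parent.fds
    else:
        fds = copy.deepcopy(parent.fds)
    return Proc(memory, fds)

parent = Proc({"x": 1}, {0: "tty"})

child = sfork(parent, SHARE_NOTHING)      # behaves like fork()
child.memory["x"] = 2
print(parent.memory["x"])                 # 1

thread = sfork(parent, SHARE_EVERYTHING)  # behaves like a kernel thread
thread.memory["x"] = 3
print(parent.memory["x"])                 # 3
```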

(The sfork() call is spelled rfork() in Plan 9, and maybe 9th
Edition Unix.  The idea has been around for a while.  It solves a
problem that traditional "kernel threads" have: identifying the
kernel thread.  Since Unix has always had "process IDs", if PIDs
become Thread IDs, everything "just works".)

Just for completeness, there is another meaning for "kernel threads".
There are already some "kernel threads" of a sort in the existing
system: processes 0 and 2 are "swapper" and "pagedaemon", and today
pid 3 is the "filesys_syncer", and there is no user-level context
associated with these -- they just run in the kernel and provide
a convenient way for the kernel to do blocking operations.  The
"asyncdaemon", nfsd, and nfsiod *do* have a user context, but mostly
run in the kernel.  On multi-CPU systems, there are special "idle
processes" so that one or more CPUs can be waiting for work at the
same time, without stepping on each other's kernel stacks when the
work shows up.

(We also use "kernel threads" in the true-SMP system.  When an
interrupt occurs, its handler must obtain various data-structure
locks, and it is possible for these locks to be held by another
CPU at the time.  In this case, a "kernel thread" blocks on the
lock, to hang on to the context needed to make progress on that
interrupt as soon as the lock is released.)

Note that all of these "kernel threads" are really just ordinary
processes, that merely happen to return to user mode rarely if
ever.  They handle requests generated internally within the kernel.
The "kernel threads" people mean when they talk about POSIX threads
exist to handle requests coming directly from user code.

>Doesn't the pthreads_atfork create a new process which would run on
>another processor (likely)?  By the way, does anyone know how the
>kernel decides where to put a new process (on which processor)?

This gets into scheduling, which is another matter entirely.  If
you have Solaris-style threads, you must schedule "by thread".  If
you have traditional Unix-style processes, you schedule "by process".
If you have sfork, a thread *is* a process so you still schedule
by process -- but you may still want "gang scheduling" and other
such things.

I am out of time at this point, so all I will say about this is:
current (non-"true-SMP") BSD/OS versions can only have one process
in the kernel at a time, so you have to fork separate processes
to use both CPUs.  The scheduler is not (yet) modified from what
we were using before we had multiple CPUs, except for that sneaky
bit that lets us schedule a third process whenever the 2nd CPU
would otherwise spin its wheels waiting for the "giant lock" in
the kernel.

Chris

=============================================================================
"inet-admins" Internet access mailing list. Maintained by East Connection ISP.
Mail "unsubscribe inet-admins" to Majordomo@info.east.ru if you want to quit.
Archive is accessible on http://info.east.ru/rus/inetadm.html