This article talks about processor allocation strategies employed by 3 systems, namely, Scheduler Activation, Mach and Prospero Resource Manager.
When computing took birth,
we had single processor machines running one application/program at a time.
Sometime later came around the idea of multi-tasking which gave rise to a very
important concept, which is at the center of multi-tasking systems, called
scheduling. There are a variety of scheduling algorithms for a traditional
multi-tasking system, but most can be broadly classified into two classes:
1) Time-sharing
2) Priority based
What multi-tasking gave us
was now all the processes were progressing simultaneously but each of the
process itself was not running any faster, in fact, could be running slower
because now there are competing processes for CPU cycles.
After this, came
multi-processor machine, which means a machine with multiple processors in it.
On top of multi-processor
machines, if we employ old multi-tasking approach, we would get similar results
as a uni-processor machine (each process would not
run any faster), just that the total time would be divided by n as there are n
processors (not exactly 1/n though, in fact it would be significantly higher
because of overheads involved such as context-switches etc.).
To really run a process
faster than it would run on a simple multi-tasking machine, we needed something
called as multi-threading.
The concept of
multi-threading is based on the fact that sometimes an individual process
itself can have concurrency inside it. What we mean by that is there are
sections in the execution instruction set which could be executed
simultaneously on a multi-processor machine. The degree of such parallelism is
called application concurrency. We
employ multi-threading to achieve it and the amount we are able to achieve is
called as application parallelism.
But before we delve into
scheduling of threads, complications (how information is distributed and
utilized) and choices we have to make there, we need to acknowledge division
that exists in the multi-processor systems. Broadly, there are two types of
multi-processor systems:
1) Shared Memory
Multi-Processors (SMMP): mostly a single chip solution with memories and
processors connected through a bus.
2) Distributed Memory
Multi-Processor (DMMP): a network of independent pairs (each pair formed of a
processor and a memory).
We are going to discuss 3
approaches to scheduling in these 2 different environments.
1) Mach
2) Scheduler Activation
3) Prospero Resource
Management (PRM)
As the paper on Scheduler
Activation suggests, threads can be supported either by the operating system
kernel or by the user-level library code in the application address space. But
neither approach has been fully satisfactory. Let.s briefly look at the
reasons.
Kernel threads are
inherently slower than the user-level threads for two main reasons:
1) First, when context
switches happens between threads of same process, it.s expensive.
2) Second, kernel threads
usage implies a system-wide scheduling policy would be used. But application is
the best entity to take decisions about the scheduling of its own threads,
because it has application specific information, and hence any system-wide
scheduling policy used would not fare well in a generic environment.
Note: The term Process Level Scheduler and
User Level Scheduler have been used interchangeably.
On the other hand user
threads suffer from their own shortcomings. Although user threads would perform
excellently well in an ideal world where there are no I/Os, page faults and
system calls. But in reality such real world activities can make user level
threads perform very poorly.
The research community
realized this problem started working towards a solution which employs support
from both user level and kernel level to achieve the best results. Through
these efforts, systems such as Mach and Scheduler Activations came out.
Before further delving into
details of each one of these, let.s look at the entities involved and
information that they withhold which can be put to use while making scheduling
decisions. There are three entities involved in the system:
1) Kernel
2) Process level scheduler
3) Application/Programmer/User
This is a layered
architecture and Kernel being the lowest layer, Process Level Scheduler being
the second (on top of Kernel) and Application/Programmer/User is the top most.
Just like with any layered system, End-to-End argument would apply here. We
would like to do things as high in the layer as possible unless we have enough
evidence that we need a particular functionality for all possible
implementations above a particular layer and hence we would incorporate that
functionality in that particular layer.
Under the light of
End-to-End argument, if all the scheduling decisions are being taken at the
kernel level (as in the case of first generation multi-tasking schedulers),
such approach will not work best with all the applications. This is because
only Process Level Scheduler has information such as which of its thread has a
particular lock or which of its thread is executing in critical region and
hence needs more CPU cycles for now. Similarly, there are some cases where
application specific or instance specific information is available only to
programmer/user which can be extremely useful while making scheduling
decisions, for example, applications requiring gang-scheduling.
Gang scheduling means simultaneous scheduling of an application.s components/threads.
Now let.s look at each of
the 3 systems and how they extract and use this information.
Mach.s processor allocation
facility achieves these goals by dividing the responsibility among three 3
entities:
1) Operating system or
Kernel, performs the allocation mechanism
2) Server, implements
allocation policy
3) Application, requests
processors and can control their usage once assigned
These 3 entities very well
correspond to the 3 entities we discussed before this section. In Mach, the
ownership/full control remains with the Server entity. This is required to
maintain fairness in the usage by multiple processes and avoid situations where
a malicious process hogs up all the CPU resources.
Mach introduces 2 new
objects, the processor and the processor set. Each application has
threads which are associated to one or more processor sets. Each processor set
could be assigned processors. When a processor set has processors assigned to
it, it executes only those threads assigned to that processor set. From threads
perspective, it can only see processor sets and can control the behavior of
processor sets. So, if a threads wants to have full control, it can requests
for as many processor sets as there are threads in the system and once
processors are assigned to the processor sets, it can control their behavior.
Having full control has some significant advantages which we will discuss a
little later.
All the allocation requests
go through Server and hence Server has the full control over processor
allocation and can take away the processors from a processor set on need basis.
If we breakdown this
approach to understand it, we would find that the broad level scheduling is
done by Server while fine grain scheduling is done by the application. Hence
Mach provides mechanisms for both Application and Server to affect the
scheduling of processors. Server maintains fairness and assigns idle processors
to needy processor sets, while Application, since it has application specific
information, can control scheduling of processors once assigned.
In case of a pre-mature
pre-emption, such as an I/O or a lock, an application can provide something
called as Hint to the Mach Operating
System Kernel. The hint is based on two pieces of information:
1) Thread can not make
progress until synchronization/communication is complete with another thread
2) The thread id of the
thread on which it is blocked
There are two classes of
hints:
1) Discouragement hints: optimistic approach, giving up CPU in the
hope that the thread holding the lock would run
2) Hand-off hints: Specifically asking Kernel to schedule the thread
which is holding the lock.
Hand-off hints performed
very well on experiments.
Note: Mach.s port mechanism
offers location transparency and hence it can work in DMMP environment as well
though our discussion was focused to SMMP only, but similar arguments would
apply there too. There is one limitation here though, which we would discuss
under the discussion of PRM.
The approach taken by Scheduler
Activation requires it to have only two entities in the system:
1) User/Process Level
Scheduler
2) Kernel
While User Level Scheduler
controls the overall scheduling behavior once it has been assigned a certain
number of processors, the Kernel maintains the rights over processors so as to
provide fairness to the overall system. This approach looks very similar to
pure User Level Threads approach. But there is a catch. The problem with older
User Level Threads approach was when any one of the threads of a process makes
as I/O system call, the Kernel blocks the whole process and overall progress of
the process gets halted. Such a scenario is avoided in Scheduler Activation by
something called as scheduler activation.
In scheduler activation,
when such an event (I/O blocking) happens, the Kernel instead of just
pre-empting the whole process, pre-empts only that thread and makes an Up Call in to the user-level scheduler
of the process informing which thread has just been blocked and why. Then the
User Level Scheduler can take appropriate action and if there is any run-able
thread, it can schedule that. Similarly, an application can inform the kernel
when it needs extra processor (through a call like .Add more processors (#).) or if it has an extra processor which is
not required (through calls such as .This
Processor is idle ().).
Note: The 2 calls mentioned
above provide ways for a programmer to affect the scheduling.
This overall collaboration
between User Level Scheduler and Kernel was something which was missing in older
User Level Thread mechanisms and it is provided in Scheduler Activation.
Such collaboration allows
User Level Scheduler to control how many (and which) processors have been
allocated to it and it can have complete control over which of its threads are
running on those processors. Such control allows User Level Scheduler to
explicitly manage cache locality which could be of significant advantage
considering cache sizes of today.s processor have increased significantly.
Some Terminologies:
Node: An
independent system with computing capabilities
Job: A
collection of inter-related tasks which will be communicating with each other
while executing
Task: A
collection of threads
PRM solves what other
systems left out. That is, scalability. PRM employs three types of managers:
1) System Manager
2) Job Manager
3) Node Manager
Before we discuss System
Manager, let discuss Job Manager and Node Manager. Job Manager corresponds to
User Level Scheduler which we discussed in the above systems and Node Manager
corresponds to Kernel.
The Job Manager role is to take care of all the requirements of a
particular job. Job Manager acts as the single point of communication for
requesting resources for the application. The Job Manager provides the abstraction
of a single virtual system to the application executing. Job Managers defines
the execution environment. The possible execution environment could be
multi-processing only, or fault tolerant or real-time or a combination of one
or more. When an application selects a Job Manager, it identifies the
application.s requirement and, using Prospero Directory Service/Configuration
file; it locates executing entities in the distributed system and schedules the
tasks on them.
Node Manager on the other
hand is responsible for a particular computing resource. It loads the
application binaries and executes them once it has received message a authorized Job Manager. The Node Manager notifies the Job
Manager about events such as completion or failures of tasks.
Now once we have a huge
pool of such computing resources, conventional techniques for managing
resources perform poorly. Systems, like Mach which support distributed
computing through the implementation of location transparent Ports, will not
scale beyond a certain number of computing nodes. Identifying the location,
maintaining the availability information and providing fault-tolerant services
becomes a bottleneck. To solve this problem, PRM introduced another manager
called System Manager. System Manager
is not a single box. It.s a hierarchical concept very similar to DNS. Every
System Manager is responsible for a subset of resources (consisting of Node
Managers). And then there are System Managers responsible for a set of such
System Managers. This System Manager Layer flowing through large scale
distributed systems blends the system so well that it is able to scale up to
such numbers.
System Manager keeps track
of all the management information about all the Node Managers and assists the
Job Manager in locating available resources giving a fault-tolerant view to the
applications.
Since Job Manager is a user
level entity, there are a number of different Job Managers available to choose
from and applications, depending upon their needs, can choose from this set.
This approach gives a great level of flexibility to the applications and at the
same time let them choose from a pre-programmed set of Job Managers. If they
need a specific Job Manager, programmers have the freedom to code it.
To sum it up:
-
Scheduler
Activation is a good single system solution.
-
Mach provides
things such as gang scheduling, and although it is good for a single system but
can scale up to few nodes, owned by a single organization.
-
PRM offers
scalability of the order of hundreds of nodes (or more) and will be the choice
in case the distributed system spans multiple organizations and ownership.
References:
All the papers I read as part of my cs-555 course: Advanced Operating Systems. List will be available soon.