Dynamic Scheduler For Multi-Core Processor - Final Report - All 4 Names
Bachelor of Engineering
in Information Technology by
Vidya Pratishthans College of Engineering Baramati 413133, Dist- Pune (M.S.) INDIA April 2012
VPCOE, Baramati
Department of Information Technology
Certificate
This is to certify that the dissertation entitled
Examiner 1:
Examiner 2:
Acknowledgements
Our time at the Department of Information Technology, Vidya Pratishthan's College of Engineering, was highly eventful, and working with such a devoted community of professors remains a most memorable experience of our lives. This acknowledgement is therefore a humble attempt to sincerely thank all those who were directly or indirectly involved in our project and were of immense help to us.
We would personally like to thank Prof. S. A. Takale, HOD of the Information Technology department, who with undying interest reviewed this project report. We take this opportunity to thank our respected project guide, Prof. Dinesh A. Zende, for his generous assistance. Lastly, we would like to thank our Principal, Dr. S. B. Deosarkar, who created a healthy environment for all of us to learn in the best possible way.
Abstract
Many dynamic scheduling algorithms have been proposed in the past. With the advent of multi-core processors, there is a need to schedule multiple tasks on multiple cores, and the scheduling algorithm must utilize all the available cores efficiently. Multicore processors may be SMPs or AMPs with a shared memory architecture. In this work, we propose a dynamic scheduling algorithm in which the scheduler resides on all cores of a multi-core processor and accesses a shared Task Data Structure (TDS) to pick up ready-to-execute tasks. This method is unique in the sense that the processor has the onus of picking up tasks whenever it is idle. We discuss the proposed scheduling algorithm using a set of tasks as an example.
High performance on multicore processors also requires that schedulers be reinvented. Traditional schedulers focus on keeping execution units busy by assigning each core a thread to run. Schedulers ought to focus, however, on high utilization of on-chip memory, rather than of execution cores, to reduce the impact of expensive DRAM and remote cache accesses. A challenge in achieving good use of on-chip memory is that the memory is split up among the cores in the form of many small caches. This motivates scheduling that assigns each object and its operations to a specific core, moving a thread among the cores as it uses different objects.
Contents
Acknowledgements
Abstract
Keywords
Notation and Abbreviations
1 Introduction
  1.1 Introduction
  1.2 Motivation
  1.3 Related Theory
2 Literature Survey
  2.1 Need of the Topic
3 Proposed Work
  3.1 Problem Definition
  3.2 Project Scope
  3.3 Project Objectives
  3.4 Project Constraints
4 Research Methodology
5 Project Design
  5.1 Hardware Requirements
  5.2 Software Requirements
  5.3 Risk Analysis
  5.4 Data Flow Diagrams
  5.5 Project Schedules
  5.6 UML Documentation
6 System Implementations
  6.1 Important Functions
  6.2 Important Algorithms
  6.3 Important Data Structure
List of Tables
5.1 Schedule
7.1 Test Case
8.1 Dependency Table
List of Figures
5.1 Editing the GRUB 2 Menu During Boot
5.2 DFD
5.3 Gantt Chart
5.4 Use Case
5.5 Flow Chart
5.6 Sequence Diagram
Keywords
Dynamic scheduler; multi-core systems; load balancing; work load distribution; affinity scheduling; thread migration; thread scheduling
Chapter 1
Introduction
1.1 Introduction
Multi-core processors have two or more processing elements, or cores, on a single chip. These cores may share a similar architecture (symmetric multiprocessing, SMP) or differ in architecture (asymmetric multiprocessing, AMP). All the cores use a shared memory architecture. Multicore processors existed previously in the form of MPSoCs (Multi-Processor Systems on Chip), but those were limited to a segment of applications such as networking. The easy availability of multicore hardware has forced software programmers to change the way they think about and write their applications. Unfortunately, most applications written so far are sequential in nature. We can extract the inherent parallelism in such applications to exploit the available multi-core architecture. However, converting sequential code to parallel code, or writing parallel applications from scratch, may not alone solve the problem optimally. There is a definite need for scheduling algorithms suited to shared memory architectures to increase the efficiency of multi-core processors in the presence of multiple tasks within an application. Most of the scheduling algorithms proposed for multi-core processors concentrate on scheduling tasks that are independent of each other; that is, the execution of one task neither affects nor depends on the result of other tasks, so the tasks may execute concurrently. To utilize multi-core processors more efficiently for embedded applications, where only a single application executes at any time, the application should be divided into
subtasks. This demands a scheduling algorithm efficient enough to exploit the multicore architecture and achieve an optimal schedule in terms of execution time and processor utilization.
1.2 Motivation
With the emergence of multicore chips, future distributed shared memory (DSM) systems will have less powerful processor cores, but tens of thousands of them. Performance asymmetry in multicore platforms is another trend, due to budget issues such as power consumption and area limitations as well as varying degrees of parallelism in different applications [5]. We call such a system a heterogeneous manycore DSM system. Processor cores belonging to the same level (e.g., the same chip or board) frequently share memory resources; for instance, cores on the same chip may share an L2 or L3 cache. The shared-memory programming model is capable of attaining the benefits of large-scale parallel computing without surrendering much programmability [8]. Using the shared-memory model, a program can be written as if it were running on a large processor-count SMP machine. From the perspective of application developers, all processors provide identical performance, and the memory access time from each processor is uniform. This model has been widely accepted and used for a long time. If we now compare the real architecture with the developer's vision of it, there is a big gap between them, and a number of long-standing assumptions are broken. Instead of a uniform memory access time, there are various memory latencies; the immediate result is that placing threads on arbitrary processors may lead to suboptimal performance when threads access data in common. Heterogeneous cores provide different compute powers, yet developers should still be able to write portable programs regardless of the machine. When the number of user-level threads is greater than the number of kernel threads, affinity-based thread scheduling must be taken into account to maximize program locality.
If a number of cores share a certain level of cache, problems may arise due to resource contention. We hope to find a method to reschedule threads that closes the above gap and improves multithreaded program performance. The scheduling method should be automatic and applicable to a variety of general-purpose programs. Another issue is that multicore chips consist of relatively simple processor cores and will be underutilized if user programs cannot provide sufficient thread-level parallelism. It is the developer's responsibility to write high-performance parallel software that fully utilizes the processor cores. To achieve high performance, we believe that new parallel multicore software should have the following two characteristics. Fine-grain threads: we need a high degree of parallelism to keep every processor core busy; moreover, a core often has a small cache or scratch buffer to work with, which requires developers to decompose a task into smaller tasks. Asynchronous program execution: when there are many processor cores, the presence of a synchronization point can seriously affect program performance, and eliminating unnecessary synchronization points increases the degree of parallelism accordingly. Therefore, we want to adapt the current scheduling approach to design a new dynamic scheduler for multicore architectures. The dynamic scheduling approach places fine-grain computational tasks in a directed acyclic graph (DAG) and schedules them dynamically depending on data dependence, program locality, and the critical path.
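The DAG-driven idea described above can be made concrete with a small dependency-counting structure. The following C sketch is ours, not the report's implementation, and all names (task_t, tds_pick_ready, tds_complete) are illustrative: each task tracks its number of unfinished predecessors, and completing a task releases its successors.

```c
#include <assert.h>

#define MAX_TASKS 16

/* Hypothetical sketch of dynamic DAG scheduling: a task becomes ready
 * once all of its predecessor tasks have finished. */
typedef struct {
    int deps;            /* number of unfinished predecessor tasks */
    int succ[MAX_TASKS]; /* indices of successor tasks */
    int nsucc;
    int done;
} task_t;

typedef struct {
    task_t tasks[MAX_TASKS];
    int ntasks;
} tds_t; /* stands in for the shared Task Data Structure (TDS) */

/* Return the index of any ready-to-execute task, or -1 if none is ready. */
int tds_pick_ready(tds_t *t)
{
    for (int i = 0; i < t->ntasks; i++)
        if (!t->tasks[i].done && t->tasks[i].deps == 0)
            return i;
    return -1;
}

/* Mark task i complete and decrement the dependency count of its
 * successors, possibly making them ready. */
void tds_complete(tds_t *t, int i)
{
    t->tasks[i].done = 1;
    for (int s = 0; s < t->tasks[i].nsucc; s++)
        t->tasks[t->tasks[i].succ[s]].deps--;
}
```

A real scheduler would additionally weigh program locality and the critical path when several tasks are ready, as the text describes.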
1.3 Related Theory
Since IBM released the Power4 (dual core) in 2001 and Sun Microsystems released the UltraSPARC T1 (eight cores) in 2005, a great number of multicore chips have been implemented by various vendors [6]. Traditional microarchitectures typically rely on increasing the complexity of logic, wiring, and design to find more Instruction-Level
Parallelism (ILP), such as out-of-order and speculative instruction execution within a sequential program. This trend cannot continue, due to the diminishing returns for large increases in complexity and the exponentially rising processor clock rates [7]. Compared to traditional microarchitectures, multicore chips have a simpler design, a higher performance-to-area ratio, and better power efficiency. Thus, hardware architects have changed course to rely on multicore architectures. Multicore (or manycore) processors with hundreds of processing cores on a single die are imminent in the near future. Both shared-memory and distributed-memory platforms may consist of multicore systems, and there are two programming models for developing parallel programs: the shared-memory programming model and the distributed-memory programming model. This dissertation first depicts what a future shared-memory multicore machine will look like and then proposes a static scheduling method to improve program performance. Next, it studies how to use the dynamic directed acyclic graph (DAG) scheduling approach to develop new parallel software for both shared-memory and distributed-memory multicore systems.
Chapter 2
Literature Survey
Many researchers have proposed various dynamic scheduling techniques over the past few years. This section gives an overview of some of the prominent work in this area. Megel et al. discuss an improvement over the optimal finish time (OFT) algorithm for reducing pre-emption in embedded real-time applications. Kurzak and Dongarra propose a data-flow based scheduler and discuss provisions for data reuse; their scheduling algorithm is intended for numerical-computation based applications, and the proposed method relies on data dependency analysis between tasks in a sequential representation of programs. Jooya et al.'s method, based on recording application resource utilization and throughput to adaptively change the cores assigned to applications at runtime, is applicable only to heterogeneous multi-core processors; it also saves power by downgrading applications with low resource utilization to weaker cores. Manikandan Baskaran et al. discuss a compile-time technique that dynamically extracts inter-tile dependencies and schedules the parallel tiles on the cores for improved scalability on multicores; the approach is applicable to multi-statement input programs with statements of different dimensionalities. Ali presents a framework for expressing, evaluating, and dynamically executing schedules for FFT computations on hierarchical and shared-memory multicore architectures; it describes an FFT schedule specification language used to generate one-dimensional serial, multi-dimensional serial, and parallel FFT schedules. Wang presents a scheduler modeled to accept and analyze a task graph and then put the tasks in a scheduling queue; the algorithm involves prediction according to the history record of task scheduling, and it rearranges a long task into smaller subtasks to form another task state graph before scheduling them in parallel. Blagojevic et al. examine user-level schedulers that dynamically right-size the dimensions and degrees of parallelism on the Cell Broadband Engine, and they mention a new method that uses sampling of dominant execution phases to converge to the optimal scheduling algorithm.
2.1 Need of the Topic
The aim is to utilize multiple cores of the system and to investigate how to effectively schedule threads to improve program performance on multicore architectures. This work formulates the affinity-based thread scheduling problem on shared-memory multicore systems and proposes a static feedback-directed approach to computing optimized thread schedules that improve effectiveness on every level of a complex memory hierarchy while keeping load balance. The dissertation also studies the dynamic data-availability driven scheduling approach for fine-grain parallel programs and demonstrates the scalability and practicality of the approach on both shared-memory and distributed-memory multicore systems.
Chapter 3
Proposed Work
3.1 Problem Definition
We propose to develop a dynamic scheduler for multicore processor systems: a dynamic scheduling algorithm in which the scheduler resides on all cores of a multi-core processor and accesses a shared Task Data Structure (TDS) to pick up ready-to-execute tasks.
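As a rough sketch of this idea (our own simplified code, not the project's): every core runs the same scheduler loop, and whenever a core is idle it locks the shared TDS and pulls the next ready task itself. Here the TDS is reduced to a mutex-protected index over a fixed task list, and all names are illustrative.

```c
#include <assert.h>
#include <pthread.h>

#define NTASKS 100
#define MAX_CORES 8

/* Simplified shared Task Data Structure: a lock plus the next
 * unclaimed task index. Layout and names are hypothetical. */
typedef struct {
    pthread_mutex_t lock;
    int next;          /* next unclaimed task index */
    int done[NTASKS];  /* set once a core has "executed" the task */
} shared_tds;

/* The per-core scheduler loop: an idle core locks the TDS, claims the
 * next ready task, and executes it. */
static void *core_loop(void *arg)
{
    shared_tds *tds = (shared_tds *)arg;
    for (;;) {
        pthread_mutex_lock(&tds->lock);
        int i = (tds->next < NTASKS) ? tds->next++ : -1;
        pthread_mutex_unlock(&tds->lock);
        if (i < 0)
            break;        /* nothing ready: this core would idle here */
        tds->done[i] = 1; /* stand-in for executing task i */
    }
    return NULL;
}

/* Run `cores` scheduler loops concurrently; returns tasks executed. */
int run_scheduler(int cores)
{
    static shared_tds tds;
    pthread_t th[MAX_CORES];
    pthread_mutex_init(&tds.lock, NULL);
    tds.next = 0;
    for (int i = 0; i < NTASKS; i++)
        tds.done[i] = 0;
    for (int c = 0; c < cores && c < MAX_CORES; c++)
        pthread_create(&th[c], NULL, core_loop, &tds);
    for (int c = 0; c < cores && c < MAX_CORES; c++)
        pthread_join(th[c], NULL);
    int n = 0;
    for (int i = 0; i < NTASKS; i++)
        n += tds.done[i];
    return n;
}
```

Note the onus is on the (simulated) core, not on a central dispatcher: no thread ever pushes work to another; each pulls work when idle.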
3.2 Project Scope
Until now, many applications have been developed that run efficiently on single-core processor systems but not on multicore systems; that is, these applications do not utilize all cores of the processor equally. Generally an application runs on the first core while the other cores go unutilized and the first core is burdened with all the assigned tasks.
3.3 Project Objectives
Multi-core processors have two or more processing elements, or cores, on a single chip. These cores may share a similar architecture (symmetric multiprocessing, SMP) or differ in architecture (asymmetric multiprocessing, AMP). All the cores use a shared memory architecture. Multicore processors existed
previously in the form of MPSoCs (Multi-Processor Systems on Chip), but those were limited to a segment of applications such as networking. The easy availability of multicore hardware has forced software programmers to change the way they think about and write their applications. Unfortunately, most applications written so far are sequential in nature. We can extract the inherent parallelism in such applications to exploit the available multi-core architecture. However, converting sequential code to parallel code, or writing parallel applications from scratch, may not alone solve the problem optimally. There is a definite need for scheduling algorithms suited to shared memory architectures to increase the efficiency of multi-core processors in the presence of multiple tasks within an application. Most of the scheduling algorithms proposed for multi-core processors concentrate on scheduling tasks that are independent of each other; that is, the execution of one task neither affects nor depends on the result of other tasks, so the tasks may execute concurrently. To utilize multi-core processors more efficiently for embedded applications, where only a single application executes at any time, the application should be divided into subtasks. This demands a scheduling algorithm efficient enough to exploit the multicore architecture and achieve an optimal schedule in terms of execution time and processor utilization.
3.4 Project Constraints
The dynamic multicore scheduler exceeds performance expectations in some workloads on multicore systems, but it still shows weaknesses in others; in particular, it can be unresponsive in 3D games. In the currently implemented dynamic scheduling policy we cannot handle deadlocks that occur while scheduling tasks and balancing them across multiple cores.
Chapter 4
Research Methodology
With the emergence of multicore chips, future distributed shared memory (DSM) systems will have less powerful processor cores, but tens of thousands of them. Performance asymmetry in multicore platforms is another trend, due to budget issues such as power consumption and area limitations as well as varying degrees of parallelism in different applications [Balakrishnan et al., 2005; Kumar et al., 2004; Kumar et al., 2006]. We call such a system a heterogeneous manycore DSM system. Processor cores belonging to the same level (e.g., the same chip or board) frequently share memory resources; for instance, cores on the same chip may share an L2 or L3 cache. The shared-memory programming model is capable of attaining the benefits of large-scale parallel computing without surrendering much programmability [Lu et al., 1995]. Using the shared-memory model, a program can be written as if it were running on a large processor-count SMP machine. From the perspective of application developers, all processors provide identical performance, and the memory access time from each processor is uniform. This model has been widely accepted and used for a long time. If we now compare the real architecture with the developer's vision of it, there is a big gap between them, and a number of long-standing assumptions are broken. We hope to find a method to reschedule threads that closes the above gap and improves multithreaded program performance. The scheduling method should be automatic and applicable to a variety of general-purpose programs. Another issue is that multicore chips consist of relatively simple processor cores and will be underutilized if user
programs cannot provide sufficient thread-level parallelism. It is the developer's responsibility to write high-performance parallel software that fully utilizes the processor cores. To achieve high performance, we believe that new parallel multicore software should have the following two characteristics:
1. Fine-grain threads. We need a high degree of parallelism to keep every processor core busy. Moreover, a core often has a small cache or scratch buffer to work with, which requires developers to decompose a task into smaller tasks.
2. Asynchronous program execution. When there are many processor cores, the presence of a synchronization point can seriously affect program performance, and eliminating unnecessary synchronization points increases the degree of parallelism accordingly.
Therefore, we want to adapt the current scheduling approach to design a new dynamic scheduler for multicore architectures. The dynamic scheduling approach places fine-grain computational tasks in a directed acyclic graph and schedules them dynamically depending on data dependence, program locality, and the critical path. The most significant change in the 2.6 Linux kernel that improved scalability on multiprocessor systems was in the kernel process scheduler. The design of the Linux 2.6 scheduler is based on per-CPU runqueues and priority arrays, which allow the scheduler to perform its tasks in O(1) time. This mechanism solved many scalability issues, but the scheduler still did not perform as expected on Hyper-Threaded systems and on higher-end NUMA systems. In the case of Hyper-Threading, more than one logical CPU shares the processor resources, cache, and memory hierarchy; in the case of NUMA, different nodes have different access latencies to memory. These non-uniform relationships between the CPUs in the system pose a significant challenge to the scheduler, which must be aware of these differences and distribute load accordingly. To address this, the 2.6 Linux kernel scheduler introduced a concept called scheduling domains [SD]. The 2.6 kernel uses hierarchical scheduler domains constructed dynamically depending on the CPU topology of the system. Each scheduler domain contains a list of scheduler groups having a common property. The load balancer runs at each domain
level, and scheduling decisions happen between the scheduling groups in that domain. On a high-end NUMA system with processors capable of Hyper-Threading, there will be three scheduling domains, one each for HT, SMP, and NUMA. In the presence of Hyper-Threading, when the system has fewer tasks than logical CPUs, the scheduler must distribute the load uniformly between the physical packages. This avoids scenarios in which one physical package has more than one logical CPU busy while another physical package is completely idle. Uniform load distribution between physical packages leads to lower resource contention and higher throughput, and the Hyper-Threading scheduling domain helps the scheduler achieve this equal distribution. Similarly, the NUMA scheduling domain helps avoid unnecessary task migration from one node to another, ensuring that tasks stay most of the time in their home node (where the task has allocated most of its memory).
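The O(1) pick that per-CPU runqueues and priority arrays enable can be illustrated with a bitmap. This is a simplified sketch of ours, not kernel code: 32 priority levels instead of the kernel's 140, and plain counters standing in for the per-priority task lists. Choosing the next task is a find-first-set on the bitmap rather than a scan of all queued tasks.

```c
#include <assert.h>

#define NPRIO 32

typedef struct {
    unsigned int bitmap; /* bit p set => some task queued at priority p */
    int count[NPRIO];    /* tasks queued per priority (stand-in for lists) */
} prio_array_t;

/* Queue a task at the given priority (0 = highest). */
void enqueue(prio_array_t *a, int prio)
{
    a->count[prio]++;
    a->bitmap |= 1u << prio;
}

/* Dequeue from the highest-priority non-empty level in O(1);
 * returns that priority, or -1 if the array is empty. */
int pick_next_prio(prio_array_t *a)
{
    if (!a->bitmap)
        return -1;
    int p = __builtin_ctz(a->bitmap); /* lowest set bit = best priority */
    if (--a->count[p] == 0)
        a->bitmap &= ~(1u << p);
    return p;
}
```

(`__builtin_ctz` is a GCC/Clang builtin; the kernel uses its own find-first-set primitives.)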
Chapter 5
Project Design
5.1 Hardware Requirements
5.2 Software Requirements
Operating System: Ubuntu xx.xx.xx (any Linux OS)
Application Software:
1. HackBench
2. GCC Compiler
3. GTK
4. GEdit
5. Latest Kernel
5.3 Risk Analysis
While developing and installing a kernel with the dynamic scheduler on the current Linux operating system, a number of problems occur, but they can be solved.
How To Enable the Root User (Super User) in Ubuntu
By default, the root account password is locked in Ubuntu. While compiling the new kernel, the default Linux directory containing all the .o files is created in the /usr/src directory, but ordinary user accounts do not have permission to it. So, when you do su -, you'll get an "Authentication failure" error message as shown below.
$ su -
Password:
su: Authentication failure
First, unlock the root user and set a password for it as shown below.
$ sudo passwd root
[sudo] password for project:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
How Do I Update Ubuntu Linux Software?
In a newly installed Linux OS there is no guarantee that all necessary packages are present; installing a new kernel requires special packages such as ncurses, gtk, qtk, gcc, make, and yum, so these need to be added before starting the actual task. This can be done by updating and upgrading the system, either with GUI tools or with traditional command-line tools such as apt-get:
apt-get update: resynchronize the package index files from their sources via the Internet.
apt-get upgrade: install the newest versions of all packages currently installed on the system.
apt-get install package-name: install is followed by one or more packages desired for installation; if a package is already installed, apt-get will try to update it to the latest version. Note that update and upgrade are apt-get sub-commands, not package names, so passing them to install fails:
$ sudo apt-get install update && sudo apt-get install upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package update
If all the packages update successfully, then the Ubuntu kernel compilation will proceed easily.
Editing the GRUB 2 Menu During Boot
After completing the task we need to reboot the system to see the new kernel, but sometimes problems occur while loading the system, due to problems in the menu.lst file. This can be solved by editing the file at boot time using the following steps. If the menu is displayed, the automatic countdown may be stopped by pressing any key other than the ENTER key. If the menu is not normally displayed during boot, hold down the SHIFT key as the computer attempts to boot; in certain circumstances, if holding the SHIFT key does not display the menu, pressing the ESC key repeatedly may display it. The user can then edit entries in the GRUB 2 menu as follows: with the menu displayed, press any key (except ENTER) to halt the countdown timer, select the desired entry with the up/down arrow keys, and press the e key to reveal the selection's settings.
Use the keyboard to position the cursor. In this example, the cursor has been moved so the user can change or delete the numeral 9. Make one or many changes to any line, but do not use ENTER to move between lines. Tab completion is available, which is especially useful when entering kernel and initrd entries. When complete, determine the next step: CTRL-X boots with the changed settings (highlighted for emphasis); C goes to the command line to perform diagnostics, load modules, change settings, etc.; ESC discards all changes and returns to the main menu. The choices are listed at the bottom of the screen as a reminder. Edits made to the menu in this manner are non-persistent: they remain in effect only for the current boot and must be re-entered on the next boot.
Once successfully booted, the changes can be made permanent by editing the appropriate file, saving it, and running update-grub as root.
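For instance (a hypothetical excerpt for a stock Ubuntu layout; the file location and keys may differ on other setups), a kernel command-line change can be made permanent via /etc/default/grub:

```
# /etc/default/grub (excerpt)
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

# After saving the file, regenerate the menu as root:
#   update-grub
```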
5.4 Data Flow Diagrams
5.5 Project Schedules
Table 5.1: Schedule
#      Task                                                        Days  Start        End          Assigned to
T1     Beginning Phase: track out why this project is needed       12    07-10-2011   21-10-2011
T1.1   Establish Project Scope                                     1     07-10-2011   07-10-2011
T1.2   Establish Project Scope                                     1     10-10-2011   10-10-2011
T1.3   Create Test Plan                                            1     11-10-2011   11-10-2011
T1.4   Create Manufacturing Plan                                   1     12-10-2011   12-10-2011
T1.5   Establish Engineering Requirements                          1     13-10-2011   13-10-2011
T1.6   Establish Communications                                    1     14-10-2011   14-10-2011
T1.7   Establish Project Goals                                     1     17-10-2011   17-10-2011
T1.8   Staff Project                                               1     18-10-2011   18-10-2011
T1.9   Establish Training Requirements                             1     19-10-2011   19-10-2011
T1.10  Establish Engineering Requirements                          1     20-10-2011   20-10-2011
T1.11  Establish Communications                                    1     21-10-2011   21-10-2011
T2     Analysis Phase: analyse the various constraints involved
       in project development                                      11    31-10-2011   14-11-2011
T2.1   Develop Project Specifications                              5     31-10-2011   04-11-2011   Sachin Janani, Abbas Baramatiwala
T2.2   Develop Initial Documentation                               6
T2.3   Conduct User Training                                       1
T2.4   Create Manufacturing Plan                                   4
T2.5   Create Marketing Plan                                       5
T3     Design Phase: actual designing of the project takes place   14
T3.1   Develop Prototype                                           14
T4     Estimating Phase: estimate effort, time, cost, etc.         12
T4.2   Estimate Costs, Savings and/or Revenues                     0     11-11-2011   11-11-2011
T5     Coding Phase: actual development of the project             50    14-11-2011   10-01-2012   Sachin Janani, Abbas Baramatiwala, Vaijnath Jadhav, Balaji Ankamwar
T5.1   Calculate the independent threads in each process,
       for scheduling                                              50    14-11-2011   10-01-2012
T5.2   Complete Open Items                                         1     15-11-2011   16-11-2011
T5.3   Run Performance Tests                                       1
T5.4   Develop Prototype                                           1
T6     Debugging Phase: the bugs in the project are removed        46
T6.1   Finalize Testing                                            1
T6.2   Correct Problems                                            1
T6.3   Conduct Alpha Testing                                       43
T7     Maintenance Phase: after alpha and beta testing, actual
       maintenance of the software takes place                           20-02-2012   23-02-2012
T7.2   Evaluate Systems                                            2     20-02-2012   21-02-2012
T7.3   Conduct Beta Testing                                        3     21-02-2012   23-02-2012
5.6 UML Documentation
Chapter 6
System Implementations
6.1 Important Functions
As the scheduler implemented is for multi-core processors, it should also remain compatible with unicore processors, so we have to place the scheduler code inside a conditional compilation block, i.e.:
#ifdef CONFIG_SMP
-----
------
#endif
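A minimal, self-contained illustration of this pattern (our own example, not the project's code): the SMP-only path compiles in only when CONFIG_SMP is defined, so the same source still builds for a unicore processor.

```c
#include <assert.h>

#define CONFIG_SMP 1 /* comment this out to build the unicore variant */

/* Returns how many runqueues this (hypothetical) scheduler build manages. */
int runqueue_count(void)
{
#ifdef CONFIG_SMP
    return 4; /* e.g., one per-core runqueue on a quad-core */
#else
    return 1; /* single runqueue on a unicore build */
#endif
}
```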
1. void wait_task_inactive(task_t * p)
{
    unsigned long flags;
    runqueue_t *rq;
repeat:
    rq = task_rq(p);
    while (unlikely(rq->curr == p)) {
        cpu_relax();
        barrier();
    }
    rq = lock_task_rq(p, &flags);
    if (unlikely(rq->curr == p)) {
        unlock_task_rq(rq, &flags);
        goto repeat;
    }
    unlock_task_rq(rq, &flags);
}
This function is generally used for SMP or multicore scheduling. It waits for a process to unschedule and is used by the exit() and ptrace() code.
2. static int try_to_wake_up(task_t * p, int synchronous)
{
    unsigned long flags;
    int success = 0;
    runqueue_t *rq;

    rq = lock_task_rq(p, &flags);
    p->state = TASK_RUNNING;
    if (!p->array) {
        activate_task(p, rq);
        if ((rq->curr == rq->idle) || (p->prio < rq->curr->prio))
            resched_task(rq->curr);
        success = 1;
    }
    unlock_task_rq(rq, &flags);
    return success;
}
This function wakes up a process: it puts the process on the runqueue if it is not already there. The current process is always on the runqueue (except when the actual reschedule is in progress), and as such you're allowed to do the simpler current->state = TASK_RUNNING to mark yourself runnable without the overhead of this function.
3. int wake_up_process(task_t * p)
{
    return try_to_wake_up(p, 0);
}
This function simply calls try_to_wake_up() described above.
4. void sched_task_migrated(task_t *new_task)
{
    wait_task_inactive(new_task);
    new_task->cpu = smp_processor_id();
    wake_up_process(new_task);
}
This function is generally used by the SMP message-passing code whenever a new task arrives at the target CPU: we move the new task into the local runqueue, and this function performs the migration. It must be called with interrupts disabled. The function works as
follows: a) The new task first waits for the old task to unschedule, via wait_task_inactive(), which was explained above. b) A CPU is assigned to the new task using the statement new_task->cpu = smp_processor_id(). c) After assigning the CPU, the new process is woken up using wake_up_process(new_task).
5. void kick_if_running(task_t *p)
{
    if (p == task_rq(p)->curr)
        resched_task(p);
}

This function kicks a remote CPU if the task is currently running there. It is used by the signal code to signal user-mode tasks as quickly as possible. (Note that this is done locklessly: if the task does anything while the message is in flight, it will notice the sigpending condition anyway.)
6. static inline unsigned int double_lock_balance(runqueue_t *this_rq,
        runqueue_t *busiest, int this_cpu, int idle, unsigned int nr_running)
{
    if (unlikely(!spin_trylock(&busiest->lock))) {
        if (busiest < this_rq) {
            spin_unlock(&this_rq->lock);
            spin_lock(&busiest->lock);
            spin_lock(&this_rq->lock);
            /* Need to recalculate nr_running */
            if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
                nr_running = this_rq->nr_running;
            else
                nr_running = this_rq->prev_nr_running[this_cpu];
        } else
            spin_lock(&busiest->lock);
    }
    return nr_running;
}

This function locks the busiest runqueue as well; this_rq is locked already. nr_running is recalculated in case the runqueue lock had to be dropped.
7. static void load_balance(runqueue_t *this_rq, int idle)
{
    int imbalance, nr_running, load, max_load, idx, i,
        this_cpu = smp_processor_id();
    task_t *next = this_rq->idle, *tmp;
    runqueue_t *busiest, *rq_src;
    prio_array_t *array;
    list_t *head, *curr;

    /*
     * We search all runqueues to find the most busy one.
     * We do this lockless to reduce cache-bouncing overhead,
     * and we re-check the best source CPU later on again, with
     * the lock held.
     *
     * We fend off statistical fluctuations in runqueue lengths by
     * saving the runqueue length during the previous load-balancing
     * operation and using the smaller one of the current and saved
     * lengths. If a runqueue is long enough for a longer amount of
     * time then we recognize it and pull tasks from it.
     *
     * The current runqueue length is a statistical maximum variable,
     * for that one we take the longer one - to avoid fluctuations in
     * the other direction. So for a load-balance to happen it needs
     * a stable long runqueue on the target CPU and a stable short
     * runqueue on the local CPU.
     *
     * We make an exception if this CPU is about to become idle - in
     * that case we are less picky about moving a task across CPUs and
     * take what can be taken.
     */
    if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
        nr_running = this_rq->nr_running;
    else
        nr_running = this_rq->prev_nr_running[this_cpu];

    busiest = NULL;
    max_load = 1;
    for (i = 0; i < smp_num_cpus; i++) {
        rq_src = cpu_rq(cpu_logical_map(i));
        if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
            load = rq_src->nr_running;
        else
            load = this_rq->prev_nr_running[i];
        this_rq->prev_nr_running[i] = rq_src->nr_running;
    /*
     * It needs at least ~25% imbalance to trigger balancing.
     */
    if (!idle && (imbalance < (max_load + 3) / 4))
        return;

    nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);

    /*
     * Make sure nothing changed since we checked the
     * runqueue length.
     */
    if (busiest->nr_running <= this_rq->nr_running + 1)
        goto out_unlock;

    /*
     * We first consider expired tasks. Those will likely not be
     * executed in the near future, and they are most likely to
     * be cache-cold, thus switching CPUs has the least effect
     * on them.
     */
    if (busiest->expired->nr_active)
        array = busiest->expired;
    else
        array = busiest->active;

new_array:
    /*
     * Load-balancing does not affect RT tasks, so we start the
     * searching at priority 128.
     */
    idx = MAX_RT_PRIO;
skip_bitmap:
    idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
    if (idx == MAX_PRIO) {
        if (array == busiest->expired) {
            array = busiest->active;
            goto new_array;
        }
        goto out_unlock;
    }
    head = array->queue + idx;
    curr = head->prev;
skip_queue:
    tmp = list_entry(curr, task_t, run_list);
    /*
     * We do not migrate tasks that are:
     * 1) running (obviously), or
     * 2) cannot be migrated to this CPU due to cpus_allowed, or
     * 3) are cache-hot on their current CPU.
     */
#define CAN_MIGRATE_TASK(p, rq, this_cpu)                       \
    ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&    \
        ((p) != (rq)->curr) &&                                  \
        ((p)->cpus_allowed & (1 << (this_cpu))))

    if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
        curr = curr->next;
        if (curr != head)
            goto skip_queue;
        idx++;
        goto skip_bitmap;
    }
    next = tmp;
    /*
     * Take the task out of the other runqueue and
     * put it into this one:
     */
    dequeue_task(next, array);
    busiest->nr_running--;
    next->cpu = this_cpu;
    this_rq->nr_running++;
    enqueue_task(next, this_rq->active);
    if (next->prio < current->prio)
        current->work.need_resched = 1;
    if (!idle && --imbalance) {
        if (array == busiest->expired) {
            array = busiest->active;
            goto new_array;
        }
    }
out_unlock:
    spin_unlock(&busiest->lock);
}

This function performs load balancing when a single processor is overloaded. Tasks are pulled out of the busiest runqueue and put into the short one. If the local runqueue already has ready tasks, it is preferable to run those instead of migrating tasks from the busiest runqueue.
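The pull logic above can be condensed into a runnable sketch. This is our own simplification, not kernel code: plain counters stand in for runqueues, keeping only the "find the busiest queue, require roughly 25% imbalance, pull half the difference" shape of load_balance().

```c
#include <assert.h>

/* Simplified pull-model load balancer (our own sketch): find the
 * busiest queue, require roughly 25% imbalance, then pull half the
 * difference over to this CPU. Returns the number of tasks pulled. */
int load_balance_sketch(int nr_running[], int ncpus, int this_cpu)
{
    int busiest = -1, max_load = nr_running[this_cpu];

    for (int i = 0; i < ncpus; i++)
        if (i != this_cpu && nr_running[i] > max_load) {
            max_load = nr_running[i];
            busiest = i;
        }
    if (busiest < 0)
        return 0;

    int imbalance = (max_load - nr_running[this_cpu]) / 2;
    if (imbalance < (max_load + 3) / 4)   /* needs ~25% imbalance */
        return 0;

    nr_running[busiest] -= imbalance;     /* pull tasks to this CPU */
    nr_running[this_cpu] += imbalance;
    return imbalance;
}
```

For example, with loads {8, 0}, CPU 1 pulls 4 tasks and both queues end up at 4; with loads {5, 4} the imbalance is below the threshold and nothing moves.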
8. schedule() is the main scheduler function:

asmlinkage void schedule(void)
{
    task_t *prev = current, *next;
    runqueue_t *rq = this_rq();
    prio_array_t *array;
    list_t *queue;
    int idx;

    if (unlikely(in_interrupt()))
        BUG();
    release_kernel_lock(prev, smp_processor_id());
    spin_lock_irq(&rq->lock);
    switch (prev->state) {
    case TASK_RUNNING:
        prev->sleep_timestamp = jiffies;
        break;
    case TASK_INTERRUPTIBLE:
        if (unlikely(signal_pending(prev))) {
            prev->state = TASK_RUNNING;
            prev->sleep_timestamp = jiffies;
            break;
        }
    default:
        deactivate_task(prev, rq);
    }

    if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
        load_balance(rq, 1);
        if (rq->nr_running)
            goto pick_next_task;
#endif
        next = rq->idle;
        rq->expired_timestamp = 0;
        goto switch_tasks;
    }

pick_next_task:
    array = rq->active;
    if (unlikely(!array->nr_active)) {
        /*
         * Switch the active and expired arrays.
         */
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
        rq->expired_timestamp = 0;
    }
    idx = sched_find_first_bit(array->bitmap);
    queue = array->queue + idx;
    next = list_entry(queue->next, task_t, run_list);

switch_tasks:
    prefetch(next);
    prev->work.need_resched = 0;
    if (likely(prev != next)) {
        rq->nr_switches++;
        rq->curr = next;
        context_switch(prev, next);
        /*
         * The runqueue pointer might be from another CPU
         * if the new task was last running on a different
         * CPU - thus re-load it.
         */
        barrier();
6.2 Important Algorithms
1. Algorithm for process execution
   1. Start
   2. Repeat steps 3-6
   3. If a new process arrives, calculate its process dependencies
   4. Recalculate process priorities depending on the number of dependencies
   5. Take the process for execution
   6. Execute the process
   7. Stop
2. Algorithm for per-tick dependency resolution
   1. Start
   2. For each CPU tick:
      resolve the dependencies of processes
      mark the resolved processes as ready
      recalculate the process priorities
   3. If the burst time of a process > 0, go to step 2
   4. Stop
6.3
1. struct runqueue {
    spinlock_t lock;
    unsigned long nr_running, nr_switches, expired_timestamp;
    task_t *curr, *idle;
    prio_array_t *active, *expired, arrays[2];
    int prev_nr_running[NR_CPUS];
} ____cacheline_aligned;

This structure is used to create the runqueue for each CPU or core in the system. Some code paths require locking multiple runqueues; such lock-acquire operations must be ordered by ascending runqueue address.
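The ascending-order rule can be illustrated in user space with ordinary pthread mutexes. This is our own sketch, not the kernel's code: whichever runqueue has the lower address is locked first, so any two CPUs locking the same pair acquire the locks in the same order and cannot deadlock against each other.

```c
#include <pthread.h>

/* Minimal stand-in for the kernel runqueue (our own sketch). */
struct rq_sketch {
    pthread_mutex_t lock;
    unsigned long nr_running;
};

/* Lock two runqueues in ascending address order, mirroring the
 * ordering rule stated above: AB-BA deadlock becomes impossible
 * because every caller acquires a given pair in the same order. */
void double_rq_lock(struct rq_sketch *a, struct rq_sketch *b)
{
    if (a < b) {
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);
    } else {
        pthread_mutex_lock(&b->lock);
        pthread_mutex_lock(&a->lock);
    }
}

void double_rq_unlock(struct rq_sketch *a, struct rq_sketch *b)
{
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Move one task between two runqueues under both locks. */
void migrate_one(struct rq_sketch *src, struct rq_sketch *dst)
{
    double_rq_lock(src, dst);
    if (src->nr_running > 0) {
        src->nr_running--;
        dst->nr_running++;
    }
    double_rq_unlock(src, dst);
}
```

Note that migrate_one() can be called with the two runqueues in either argument order; the ordering decision is made inside double_rq_lock(), not by the caller.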
2. The scheduler will reside in the shared memory of the multi-core system. This ensures that all the cores share the scheduler code. The same scheduler code will be executing on different cores, and we will maintain a shared Task Data Structure (TDS) that contains task information. The TDS stores information such as status, list of dependent tasks, data and stack pointers, etc. A detailed description of the TDS is shown in Figure 4.2. The scheduler programs executing on different cores (scheduler instances) will share this TDS. Access to the TDS is exclusive to each scheduler instance; exclusivity is achieved through a locking mechanism such as spinlocks or semaphores.

The scheduler executes on each individual core as a separate thread or instance. Whenever a core is idle, the scheduler thread is invoked and it checks the shared TDS for the list of ready-to-execute tasks. The shared TDS has the elements shown in Figure 4.2.
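This per-core loop can be sketched in user space, with POSIX threads standing in for cores. The task list, single mutex, and claim counter below are our own simplification of the TDS and its locking (the report suggests spinlocks or semaphores; a pthread mutex is the user-space stand-in):

```c
#include <pthread.h>

#define NCORES 4
#define NTDS   16

/* Shared TDS access is made exclusive with a single lock. */
static pthread_mutex_t tds_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_ready;            /* index of next ready-to-execute task */
static int claimed_by[NTDS];      /* which "core" executed each task */

/* The scheduler instance: whenever this "core" is idle it locks the
 * shared TDS and picks up the next ready task, as described above. */
static void *scheduler_instance(void *arg)
{
    long core_id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&tds_lock);
        if (next_ready >= NTDS) {             /* no ready tasks left */
            pthread_mutex_unlock(&tds_lock);
            return NULL;
        }
        int task = next_ready++;              /* claim exclusively */
        pthread_mutex_unlock(&tds_lock);
        claimed_by[task] = (int)core_id;      /* "execute" the task */
    }
}

/* Spawn one scheduler instance per core; returns how many tasks
 * were claimed by exactly one valid core. */
int run_schedulers(void)
{
    pthread_t cores[NCORES];
    next_ready = 0;
    for (int i = 0; i < NTDS; i++)
        claimed_by[i] = -1;
    for (long i = 0; i < NCORES; i++)
        pthread_create(&cores[i], NULL, scheduler_instance, (void *)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(cores[i], NULL);
    int ok = 0;
    for (int i = 0; i < NTDS; i++)
        if (claimed_by[i] >= 0 && claimed_by[i] < NCORES)
            ok++;
    return ok;
}
```

Because the claim happens while the lock is held, every task is picked up by exactly one core, no matter how the four scheduler instances interleave.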
Definitions:
Ti     - Task ID
Tis    - Task status of Ti
Tid    - Number of dependencies that must be resolved before execution of Ti can start
Tia(n) - List of tasks that become available due to execution of task Ti
Tip    - Priority number of Ti, based on the tasks that become available due to its execution
Tidp   - Pointer to data required for executing task Ti
Tisp   - Stack pointer for Ti
Tix    - Execution time for Ti
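These fields map naturally onto a C structure. The field names, the status encoding, and the helper below are our own rendering of the notation (the report gives no code for the TDS); the status values follow the worked example of Chapter 8, where -1 means waiting, 1 ready, 2 executing, and 3 completed.

```c
#define MAX_RELEASED 8

/* Task status encoding used in the worked example of Chapter 8. */
enum tds_status {
    TDS_WAITING = -1, TDS_READY = 1, TDS_EXECUTING = 2, TDS_COMPLETED = 3
};

/* One entry of the shared Task Data Structure (our own sketch). */
struct tds_entry {
    int id;                        /* Ti     - task ID */
    enum tds_status status;        /* Tis    - task status */
    int deps_remaining;            /* Tid    - unresolved dependencies */
    int released[MAX_RELEASED];    /* Tia(n) - tasks released by Ti */
    int n_released;
    int prio;                      /* Tip    - priority of Ti */
    void *data_ptr;                /* Tidp   - pointer to task data */
    void *stack_ptr;               /* Tisp   - stack pointer */
    unsigned exec_time;            /* Tix    - execution time */
};

/* Resolve one dependency of a TDS entry; the entry becomes ready
 * when its last dependency resolves. Returns 1 on that transition. */
int tds_resolve_dep(struct tds_entry *t)
{
    if (t->deps_remaining > 0 && --t->deps_remaining == 0) {
        t->status = TDS_READY;
        return 1;
    }
    return 0;
}
```

A scheduler instance that finishes task Ti would call tds_resolve_dep() on each entry in Ti's released list, moving newly unblocked tasks into the ready set.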
Chapter 7
System Testing
Table 7.1: Test Cases

1. Test case: Checking the throughput of the scheduler
   Operation: Check the number of processes that the scheduler can properly balance across the cores.
   Expected output: The scheduler should handle 100 processes.
   Result: Pass

2. Test case: Checking the intelligence of the scheduler
   Operation: Check how the scheduler splits a large process, and to which core it assigns a small process.
   Expected output: A very short process can be assigned directly to any free core; a large process is divided into small threads, each assigned to a different core.
   Result: Pass

3. Test case: Checking the performance of the scheduler
   Operation: Check the overall performance of the scheduler in terms of application speedup and level of parallelism.
   Expected output: Speedup should be greater than 1.5.
   Result: Pass
Chapter 8
Experimental Results
We discuss the proposed scheduling algorithm with the help of the following example. Table 8.1 below shows a dependency table for a set of six tasks. Each number indicates the time unit at which the dependency of a particular task is resolved. This table represents, in simplified form, the output of an offline dependency analysis on sequential code. An entry Tij means that task j can be started only after task i has finished Tij units of execution, where i is the row number and j is the column number.
Table 8.1: Dependency Table

Task    T0    T1    T2    T3    T4
T0       0   100   200   150     0
T1       0     0    50   150     0
T2       0     0     0    50   100
T3       0     0     0     0    50
T4       0     0     0     0     0
T5       0     0     0     0     0
Figure 8.1 shows the simulation output of the dynamic scheduler for the dependencies given in Table 8.1. We have assumed the time unit to be seconds. Column Tx gives the total execution time of the corresponding task. Each cell in the dependency table contains Tij for task j, which means task j can be started only after task i has finished Tij units of execution. For example, task T2 starts only after task T0 finishes 200 s and task T1 finishes 50 s of execution, so T02 = 200 and T12 = 50. At time t = 0 all the cores will try to acquire the lock on the TDS and one of them (P1) gets
that lock and finds that task T0 is ready for execution.

At t = 0:
T0s = 1 (ready), T1s = -1, T2s = -1, T3s = -1, T4s = -1, T5s = -1
T0d = 0, T1d = 1, T2d = 2, T3d = 3, T4d = 2, T5d = 4
T0x = 250, T1x = 300, T2x = 200, T3x = 100, T4x = 200, T5x = 50
T0 will release 3 tasks (T1, T2, T3), so T0p is highest; T0 starts executing at t = 0.

At t = t1 (100) the dependency of task T1 is resolved (T01 = 100), so task T1 is ready for execution:
T0s = 2 (executing), T1s = 1 (ready), T2s = -1, T3s = -1, T4s = -1, T5s = -1
Processor P4 locks the data structure and starts executing task T1.

At t = t2 (200) both dependencies of T2 have been resolved (T02 = 200 and T12 = 50), so task T2 is ready for execution on free processor P1:
T0s = 2 (executing), T1s = 2, T2s = 1, T3s = -1, T4s = -1, T5s = -1

At t = t3 (250) all three dependencies of T3 (T03, T13, T23) have been resolved, so task T3 is ready for execution on free processor P2:
T0s = 3 (completed), T1s = 2, T2s = 2, T3s = 1 (ready), T4s = -1, T5s = -1

Using similar logic, the algorithm will continue to schedule tasks, and it can be
concluded that tasks T4 and T5 will be scheduled for execution on processors P3 and P4 respectively.
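This walkthrough can be checked mechanically. The sketch below replays the T0-T3 portion of Table 8.1 (the columns recoverable from the table); assuming a free core is always available, as in the example, each task starts at the moment its last dependency resolves.

```c
#define NT 4   /* tasks T0..T3 from Table 8.1 */

/* dep[i][j] = Tij: task j may start only after task i has executed
 * for Tij time units; 0 means no dependency. */
static const int dep[NT][NT] = {
    /*        T0   T1   T2   T3 */
    /* T0 */ { 0, 100, 200, 150 },
    /* T1 */ { 0,   0,  50, 150 },
    /* T2 */ { 0,   0,   0,  50 },
    /* T3 */ { 0,   0,   0,   0 },
};

/* A task's start time is the latest point at which any predecessor
 * finishes the required amount of execution (free core assumed). */
void compute_start_times(int start[NT])
{
    for (int j = 0; j < NT; j++) {
        start[j] = 0;
        for (int i = 0; i < j; i++)
            if (dep[i][j] && start[i] + dep[i][j] > start[j])
                start[j] = start[i] + dep[i][j];
    }
}
```

Running this reproduces the timeline of the walkthrough: T0 at t = 0, T1 at t = 100, T2 at t = 200, and T3 at t = 250.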
Chapter 9
Conclusion
The scheduling algorithm discussed attempts to increase the utilization of multi-core processors. This algorithm is different in the sense that the processor owns the responsibility of picking up tasks for execution whenever it is idle. The method gives priority to tasks that resolve more dependencies, and ensures that updates to the ready-to-execute task list are made accordingly. The scheduler resides on each core as a separate thread or instance and hence is specific to a core, so we can conclude that the proposed scheduler will be more efficient and will balance the load properly.

This project covered most of the important aspects of the Linux scheduler. The kernel scheduler is one of the most frequently executed components in a Linux system; hence it has received a great deal of attention from kernel developers, who have strived to put the most optimized algorithms and code into the scheduler. The different algorithms used in the kernel scheduler were discussed in this project. The dynamic multi-core scheduler achieves good performance and responsiveness while being relatively simple compared with previous algorithms such as O(1). It exceeds performance expectations in some workloads on multi-core systems, but it still shows some weaknesses in other workloads, for example some unresponsiveness in the 3D game area.
Chapter 10
Future Scope
Future work includes delving deeper into the scheduling and process code so that we can implement a new scheduling algorithm in the kernel. Though this project gives a vivid overview and the basic steps of configuring and compiling the kernel and implementing scheduling policies such as Dynamic Scheduling and SCHED_IDLE (with a lower priority), there were some challenges associated with it. One of the challenges was interpreting the change in scheduling policy through the process runtime. The goal for the future is to address such challenges and develop efficient techniques for kernel scheduling. In the currently implemented Dynamic Scheduling policy we have considered deadlock handling of processes; our next goal is to improve this Dynamic Scheduling policy by implementing a better deadlock-handling policy.
Appendix A
Appendix
Kernel Compilation

Compiling a custom kernel has its own advantages and disadvantages. However, new Linux users and admins find it difficult to compile a Linux kernel. Compiling a kernel requires understanding a few things and then just typing a couple of commands. This step-by-step how-to covers compiling Linux kernel version 2.6.xx under Debian GNU/Linux; the instructions remain the same for any other distribution except for the apt-get command.
Step # 1 Get the latest Linux kernel code

Visit http://kernel.org/ and download the latest source code. The file name will be linux-x.y.z.tar.bz2, where x.y.z is the actual version number. For example, the file linux-2.6.25.tar.bz2 represents kernel version 2.6.25. Use the wget command to download the kernel source code:

$ cd /tmp
$ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-x.y.z.tar.bz2

Note: Replace x.y.z with the actual version number.
APPENDIX A. APPENDIX
# cd /usr/src

Step # 3 Configure the kernel

Before you configure the kernel, make sure you have the development tools (gcc compiler and related tools) installed on your system. If they are not installed, use the apt-get command under Debian Linux to install them.
# apt-get install gcc

Now you can start the kernel configuration by typing any one of these commands:
$ make menuconfig - Text-based color menus, radiolists and dialogs. This option is also useful on a remote server if you want to compile the kernel remotely.

$ make xconfig - X Windows (Qt) based configuration tool; works best under the KDE desktop.

$ make gconfig - X Windows (Gtk) based configuration tool; works best under the GNOME desktop.

For example, the make menuconfig command launches the configuration screen. You have to select different options as per your need. Each configuration option has a HELP button associated with it, so select the help button to get help.
Step # 4 Compile the kernel

Start compiling to create a compressed kernel image, enter:
$ make
Start compiling the kernel modules:
$ make modules
Install the kernel modules (become the root user, using the su command):
$ su
# make modules_install

Step # 5 Install the kernel

So far we have compiled the kernel and installed the kernel modules. It is time to install the kernel itself.
# make install
It will install three files into the /boot directory, as well as make a modification to your GRUB kernel configuration file:
1. System.map-2.6.25
2. config-2.6.25
3. vmlinuz-2.6.25
# cd /boot
# mkinitrd -o initrd.img-2.6.25 2.6.25
The initrd image contains the device drivers needed to load the rest of the operating system later on. Not all computers require an initrd, but it is safe to create one.
# vi /boot/grub/menu.lst
savedefault
boot

Remember to set up the correct root=/dev/hdXX device. Save and close the file. If you think editing and writing all the lines by hand is too much for you, try the update-grub command to update the lines for each kernel in the /boot/grub/menu.lst file. Just type the command:
# update-grub
Step # 8 Reboot the computer and boot into your new kernel

Just issue the reboot command:
# reboot
References
[1] D. Tam, R. Azimi, and M. Stumm. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 47-58, New York, NY, USA, 2007. ACM.
[2] F. Bellosa and M. Steckermeier. The performance implications of locality information usage in shared-memory multiprocessors. J. Parallel Distrib. Comput., 37(1):113-121, 1996.
[3] M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.
[4] J. Kurzak and J. Dongarra. Fully Dynamic Scheduler for Numerical Computing on Multicore Processors. LAPACK Working Note 220, UT-CS-09-643, June 4, 2009.
[5] S. Balakrishnan, R. Rajwar, M. Upton, and K. K. Lai. The impact of performance asymmetry in emerging multicore architectures. In 32nd International Symposium on Computer Architecture (ISCA 2005), 4-8 June 2005, Madison, Wisconsin, USA, pages 506-517. IEEE Computer Society, 2005.
[6] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, University of California at Berkeley, December 2006.
[7] K. Olukotun, L. Hammond, J. Laudon, Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency, Synthesis Lectures on Computer Architecture, Morgan and Claypool, 2007.
[8] H. Lu, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. Message passing versus distributed shared memory on networks of workstations. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 37. ACM Press, 1995.
[10] http://www.ibiblio.org/pub/Linux/docs/HOWTO/KernelAnalysis-HOWTO, <This link tries to explain the most important components of the Linux Kernel>
[11] http://www.barrelfish.org, <The site is exploring how to structure an OS for future Multi- and Many-Core Systems>
[12] http://www.intel.com/core, <The site is exploring architecture of various Intel Core Processor>
[13] http://www.multicoreinfo.com.