ARM Linux Process Scheduling (2.4.x)
I have recently spent some time studying ARM Linux and would like to share my impressions of its process management. Please point out anything I have got wrong.
----------------------------------------------
Process Creation and Termination
Process Scheduling and Dispatching
Process Switching
Process Synchronization and support for interprocess communication
Management of process control block
------- from "Operating Systems: Internals and Design Principles"
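Process scheduling
Linux 2.4.x is a multitasking, time-sharing operating system whose kernel is not preemptible. User processes are scheduled preemptively, but code running in kernel mode gives up the CPU only cooperatively, when it blocks or calls schedule() itself. If a kernel thread hogs the CPU and never releases it, the system cannot recover, so real-time behavior is weak. Linux 2.6 improves on this by making the kernel itself preemptible (as a build-time option).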
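The kernel provides three scheduling policies, chosen according to how real-time a task is:
1. SCHED_OTHER is for non-real-time tasks and uses conventional time-sharing.
2. SCHED_FIFO is for short real-time tasks and uses first-in, first-out scheduling: the process keeps the CPU until it blocks, yields, or exits, unless a higher-priority process becomes runnable.
3. SCHED_RR is for longer real-time tasks. Because such a task runs for a long time, FIFO is unsuitable and round-robin scheduling is used instead: when the process is scheduled out at the end of its time slice, it is placed at the tail of the run queue so that other real-time processes get a chance to run.
Note that SCHED_FIFO and SCHED_RR do not differ in priority. What differs is whether the task is time-sliced, which is why one suits short tasks and the other long ones. In addition, the policy field of task_struct carries a SCHED_YIELD bit; when it is set, the process is voluntarily giving up the CPU.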
On top of these three policies, the system schedules processes in order of priority. A priority is a simple integer, a weight assigned to a process to make it easy to decide which process should get the CPU: the higher the priority, the better its chance of getting CPU time.
In Linux a non-real-time process has two priorities, a static one and a dynamic one. A real-time process adds a third, the real-time priority.
1. Static priority (priority; the 2.4 code quoted in this post keeps it in the nice field): it is called "static" because it does not change over time and can only be modified by the user. It determines the maximum time slice the process is allowed before it is forced to compete with other processes for the CPU (20).
2. Dynamic priority (counter): counter is the time slice the system allots to a process, and Linux also uses it as the process's dynamic priority. It decreases steadily while the process holds the CPU; when it reaches 0 the process is marked for rescheduling. It gives the amount of time left in the current slice (initially 20).
3. Real-time priority (rt_priority): for a real-time process, goodness() returns 1000 + rt_priority as the weight. A process with a higher weight always runs before one with a lower weight, and the weight of a non-real-time process is far below 1000, so real-time processes always come first.
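On every tick (that is, every timer interrupt) the system decrements the counter of the process that currently holds the CPU. If counter reaches 0, need_resched is set to 1 and scheduling takes place on the way back from the interrupt. update_process_times() is a helper called by the timer interrupt handler: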
void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id(), system = user_tick ^ 1;

    update_one_process(p, user_tick, system, cpu);
    if (p->pid) {
        /* charge one tick to the running task; flag it when the slice is used up */
        if (--p->counter <= 0) {
            p->counter = 0;
            p->need_resched = 1;
        }
        if (p->nice > 0)
            kstat.per_cpu_nice[cpu] += user_tick;
        else
            kstat.per_cpu_user[cpu] += user_tick;
        kstat.per_cpu_system[cpu] += system;
    } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
        kstat.per_cpu_system[cpu] += system;
}
Process scheduling in Linux is implemented in schedule(), which is called from the following ARM assembly fragment:
/*
 * This is the fast syscall return path. We do as little as
 * possible here, and this includes saving r0 back into the SVC
 * stack.
 */
ret_fast_syscall:
        ldr     r1, [tsk, #TSK_NEED_RESCHED]
        ldr     r2, [tsk, #TSK_SIGPENDING]
        teq     r1, #0                  @ need_resched || sigpending
        teqeq   r2, #0
        bne     slow
        fast_restore_user_regs

/*
 * Ok, we need to do extra processing, enter the slow path.
 */
slow:   str     r0, [sp, #S_R0+S_OFF]!  @ returned r0
        b       1f

/*
 * "slow" syscall return path. "why" tells us if this was a real syscall.
 */
reschedule:
        bl      SYMBOL_NAME(schedule)
ENTRY(ret_to_user)
ret_slow_syscall:
        ldr     r1, [tsk, #TSK_NEED_RESCHED]
        ldr     r2, [tsk, #TSK_SIGPENDING]
1:      teq     r1, #0                  @ need_resched => schedule()
        bne     reschedule              @ call schedule() if a reschedule is needed
        teq     r2, #0                  @ sigpending => do_signal()
        blne    __do_signal
        restore_user_regs
This code is executed again and again, on every return from an interrupt or from a system call.
schedule() is reached in the following situations:
1. On a process state change, such as termination or sleep: when a process calls functions like sleep() or exit() that change its state, those functions call schedule() themselves.
2. When a new process is added to the run queue:
ENTRY(ret_from_fork)
        bl      SYMBOL_NAME(schedule_tail)
        get_current_task tsk
        ldr     ip, [tsk, #TSK_PTRACE]  @ check for syscall tracing
        mov     why, #1
        tst     ip, #PT_TRACESYS        @ are we tracing syscalls?
        beq     ret_slow_syscall
        mov     r1, sp
        mov     r0, #1                  @ trace exit [IP = 1]
        bl      SYMBOL_NAME(syscall_trace)
        b       ret_slow_syscall        @ jump into the code fragment above
3. After a timer interrupt: at initialization Linux programs the system timer with a period of 10 ms. When the timer interrupt fires, the service routine timer_interrupt immediately calls do_timer(), which (through update_process_times(), quoted earlier) decrements the current process's counter and sets need_resched if counter has reached 0. schedule() is then called on the way back from the timer interrupt.
4. When a process returns from a system call to user mode: need_resched is tested and, if it is set, schedule() is executed. System calls are implemented with a software interrupt (SWI); below is the SWI handler for the ARM platform.
        .align  5
ENTRY(vector_swi)
        save_user_regs
        zero_fp
        get_scno
        enable_irqs ip
        str     r4, [sp, #-S_OFF]!              @ push fifth arg
        get_current_task tsk
        ldr     ip, [tsk, #TSK_PTRACE]          @ check for syscall tracing
        bic     scno, scno, #0xff000000         @ mask off SWI op-code
        eor     scno, scno, #OS_NUMBER << 20    @ check OS number
        adr     tbl, sys_call_table             @ load syscall table pointer
        tst     ip, #PT_TRACESYS                @ are we tracing syscalls?
        bne     __sys_trace
        adrsvc  al, lr, ret_fast_syscall        @ load the return address, so the sys_* routine
                                                @ returns to ret_fast_syscall in the fragment above
        cmp     scno, #NR_syscalls              @ check upper syscall limit
        ldrcc   pc, [tbl, scno, lsl #2]         @ call sys_* routine
        add     r1, sp, #S_OFF
2:      mov     why, #0                         @ no longer a real syscall
        cmp     scno, #ARMSWI_OFFSET
        eor     r0, scno, #OS_NUMBER << 20      @ put OS number back
        bcs     SYMBOL_NAME(arm_syscall)
        b       SYMBOL_NAME(sys_ni_syscall)     @ not private func
5. When the kernel has finished handling an interrupt and the process returns to user mode.
6. When a process calls schedule() directly to request rescheduling.
----------------------------------------------
Analysis of schedule():
/*
 * 'schedule()' is the scheduler function. It's a very simple and nice
 * scheduler: it's not perfect, but certainly works for most things.
 *
 * The goto is "interesting".
 *
 * NOTE!! Task 0 is the 'idle' task, which gets called when no other
 * tasks can run. It can not be killed, and it cannot sleep. The 'state'
 * information in task[0] is never used.
 */
asmlinkage void schedule(void)
{
    struct schedule_data * sched_data;
    struct task_struct *prev, *next, *p;
    struct list_head *tmp;
    int this_cpu, c;

    spin_lock_prefetch(&runqueue_lock);

    if (!current->active_mm) BUG();
need_resched_back:
    prev = current;
    this_cpu = prev->processor;

    if (unlikely(in_interrupt())) {
        printk("Scheduling in interrupt\n");
        BUG();
    }

    release_kernel_lock(prev, this_cpu);

    /*
     * 'sched_data' is protected by the fact that we can run
     * only one process per CPU.
     */
    sched_data = &aligned_data[this_cpu].schedule_data;

    spin_lock_irq(&runqueue_lock);

    /* move an exhausted RR process to be last.. */
    if (unlikely(prev->policy == SCHED_RR))
        /*
         * With round-robin scheduling, check whether counter has
         * reached 0; if so, refill it and move the task to the tail
         * of the run queue.
         */
        if (!prev->counter) {
            prev->counter = NICE_TO_TICKS(prev->nice);
            move_last_runqueue(prev);
        }
    switch (prev->state) {
    case TASK_INTERRUPTIBLE:
        /*
         * If the task is TASK_INTERRUPTIBLE and a signal that can wake
         * it has already arrived, set its state back to TASK_RUNNING.
         */
        if (signal_pending(prev)) {
            prev->state = TASK_RUNNING;
            break;
        }
        /* otherwise fall through and leave the run queue */
    default:
        del_from_runqueue(prev);
    case TASK_RUNNING:;
    }
    prev->need_resched = 0;

    /*
     * this is the scheduler proper:
     */
repeat_schedule:
    /*
     * Default process to select..
     */
    next = idle_task(this_cpu);
    c = -1000;
    list_for_each(tmp, &runqueue_head) {
        /*
         * Walk the run queue looking for the process with the highest
         * weight; that process gets the CPU.
         */
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p, this_cpu)) {
            /*
             * In goodness(), a real-time process gets
             * weight = 1000 + p->rt_priority, so real-time processes
             * always outrank non-real-time ones.
             */
            int weight = goodness(p, this_cpu, prev->active_mm);
            /* note ">" rather than ">=": on equal weights the earlier task wins */
            if (weight > c)
                c = weight, next = p;
        }
    }
    /* Do we need to re-calculate counters? */
    if (unlikely(!c)) {
        /*
         * The best weight is 0, so every runnable process has used up
         * its slice: recalculate counter for every task in the system
         * (for_each_task), not only those on the run queue.
         */
        struct task_struct *p;

        spin_unlock_irq(&runqueue_lock);
        read_lock(&tasklist_lock);
        for_each_task(p)
            p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
        read_unlock(&tasklist_lock);
        spin_lock_irq(&runqueue_lock);
        goto repeat_schedule;
    }

    /*
     * from this point on nothing can prevent us from
     * switching to the next task, save this fact in sched_data.
     */
    sched_data->curr = next;
    task_set_cpu(next, this_cpu);
    spin_unlock_irq(&runqueue_lock);

    if (unlikely(prev == next)) {
        /* We won't go through the normal tail, so do this by hand */
        prev->policy &= ~SCHED_YIELD;
        goto same_process;
    }

    kstat.context_swtch++;
    /*
     * there are 3 processes which are affected by a context switch:
     *
     * prev == .... ==> (last => next)
     *
     * It's the 'much more previous' 'prev' that is on next's stack,
     * but prev is set to (the just run) 'last' process by switch_to().
     * This might sound slightly confusing but makes tons of sense.
     */
    prepare_to_switch();
    {
        struct mm_struct *mm = next->mm;
        struct mm_struct *oldmm = prev->active_mm;

        if (!mm) {
            /* switching to a kernel thread: no page-table work, borrow oldmm */
            if (next->active_mm) BUG();
            next->active_mm = oldmm;
            atomic_inc(&oldmm->mm_count);
            enter_lazy_tlb(oldmm, next, this_cpu);
        } else {
            if (next->active_mm != mm) BUG();
            /* switching to a user process: switch page tables */
            switch_mm(oldmm, mm, next, this_cpu);
        }

        if (!prev->mm) {
            prev->active_mm = NULL;
            mmdrop(oldmm);
        }
    }

    /*
     * This just switches the register state and the stack.
     */
    switch_to(prev, next, prev);
    __schedule_tail(prev);

same_process:
    reacquire_kernel_lock(current);
    if (current->need_resched)
        goto need_resched_back;
    return;
}
----------------------------------------------
ARM Linux Process Scheduling (3)
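switch_mm() switches the page tables: it loads the physical address of the next process's pgd into register c2 of CP15. The virtual address of a process's pgd is held in the pgd pointer of its mm_struct, and the __virt_to_phys macro converts it to a physical address.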
static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
          struct task_struct *tsk, unsigned int cpu)
{
    if (prev != next)
        cpu_switch_mm(next->pgd, tsk);
}

#define cpu_switch_mm(pgd,tsk) cpu_set_pgd(__virt_to_phys((unsigned long)(pgd)))

#define cpu_get_pgd()                           \
    ({                                          \
        unsigned long pg;                       \
        __asm__("mrc p15, 0, %0, c2, c0, 0"     \
                : "=r" (pg));                   \
        pg &= ~0x3fff;                          \
        (pgd_t *)phys_to_virt(pg);              \
    })
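switch_to() performs the actual context switch by calling the assembly routine __switch_to. The implementation is fairly simple: it saves the context of prev, which is described by the context_save_struct structure and consists of the main registers, and then loads the context of next. The context is kept in thread.save inside task_struct, and TSS_SAVE gives the offset of thread.save within task_struct.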
/*
 * Register switch for ARMv3 and ARMv4 processors
 * r0 = previous, r1 = next, return previous.
 * previous and next are guaranteed not to be the same.
 */
ENTRY(__switch_to)
        stmfd   sp!, {r4 - sl, fp, lr}  @ Store most regs on stack
        mrs     ip, cpsr
        str     ip, [sp, #-4]!          @ Save cpsr_SVC
        str     sp, [r0, #TSS_SAVE]     @ Save sp_SVC
        ldr     sp, [r1, #TSS_SAVE]     @ Get saved sp_SVC
        ldr     r2, [r1, #TSS_DOMAIN]