[Comm] ksoftirqd и 100% CPU

Di denyago на rambler.ru
Вт Май 4 08:25:52 UTC 2010


Доброго времени суток.

Сервер (2.6.27-ovz-smp-alt9 #1 SMP x86_64) используется в т.ч. как 
маршрутизатор. Этим заведует VE с проброшенными eth0|1 . На eth0 
настроен VLAN. Всё это работало несколько месяцев. В последнюю неделю 
(да и раньше) ничего не обновлял...

Посоветуйте кто виноват и что делать, пожалуйста.

Сегодня столкнулся с  проблемой:
     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
       7 root      15  -5     0    0    0 R 85.8  0.0  24:12.35 ksoftirqd/1
      13 root      15  -5     0    0    0 R 70.5  0.0  21:55.17 ksoftirqd/3
       4 root      15  -5     0    0    0 R 61.2  0.0  22:09.86 ksoftirqd/0
      10 root      15  -5     0    0    0 R 51.3  0.0  22:51.07 ksoftirqd/2

За вчера, столкнулся с єтим 2 раза. Первый раз помог reset, второй раз 
охранник, наверно, спал и я отложил перезагрузку на сегодня.

Поискал. Всюду ругались на проблемы драйвера сетевой карты. Пустил меня 
сервер только по ssh и стал я разбираться...
1. Отрубил eth0 (внутренний интерфейс) и выгрузил 8021q - не помогло
2. Понаблюдал, как появляются прерывания. Надеюсь, я правильно понял, 
что означают числа в /proc/interrupts:
	CPU0	CPU1	CPU2	CPU3	Время									
14:	15448	15289	15610	15530	20:13:02	IO-APIC-edge	ide0							
	15616	15490	15806	15721	20:17:15			00:04:13	ср.	3
16:	886216	884792	886101	885861	20:13:02	IO-APIC-fasteoi	aacraid							
	892159	890754	891996	891830	20:17:15			00:04:13	ср.	94
17:	124397	125099	124218	124657	20:13:02	IO-APIC-fasteoi 
ehci_hcd:usb1,uhci_hcd:usb2							
	126660	127303	126480	126884	20:17:15			00:04:13	ср.	35
2296:	1302917	1303284	1301544	1299754	20:13:02	PCI-MSI-edge	eth1							
	1303305	1303680	1301954	1300130	20:17:15			00:04:13	ср.	6
LOC:	4554114	3947750	4023168	3822185	20:13:02	Local	timer	interrupts						
	4620588	4014420	4105265	3890368	20:17:15			00:04:13	ср.	1120
RES:	48597	85163	42248	87307	20:13:02	Rescheduling	interrupts							
	51718	87104	44109	90804	20:17:15			00:04:13	ср.	41
TLB:	2853653	2470564	2986077	3606213	20:13:02	TLB	shootdowns							
	2855765	2471715	2987263	3607910	20:17:15			00:04:13	ср.	24
Посчитал среднее количество за 4 мин. Лидером стал aacraid (аппаратный, 
отдельной платой), оставшаяся сетевуха почти прерываний не создавала.
3. Остановил nagios, внутренний сайт, внешний сайт, collectd - не помогло.
4. Посмотрел, что происходит с внешним интерфейсом:
# ethtool -i eth1
driver: e1000e
version: 0.3.3.3-k6
firmware-version: 1.6-12
bus-info: 0000:06:00.0

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit 
Ethernet Controller (Copper) (rev 01)
           Subsystem: Super Micro Computer Inc Device 0000
           Flags: bus master, fast devsel, latency 0, IRQ 2296
           Memory at d8420000 (32-bit, non-prefetchable) [size=128K]
           I/O ports at 2020 [size=32]
           Capabilities: [c8] Power Management version 2
           Capabilities: [d0] MSI: Mask- 64bit+ Count=1/1 Enable+
           Capabilities: [e0] Express Endpoint, MSI 00
           Capabilities: [100] Advanced Error Reporting
           Capabilities: [140] Device Serial Number 7e-b7-34-ff-ff-48-30-00
           Kernel driver in use: e1000e
           Kernel modules: e1000e

eth1      Link encap:Ethernet  HWaddr ***
             inet addr:***  Bcast:***  Mask:***
             UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
             RX packets:2396787 errors:0 dropped:0 overruns:0 frame:0
             TX packets:6475512 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:100
             RX bytes:213544646 (203.6 MiB)  TX bytes:7619333922 (7.0 GiB)
             Memory:d8420000-d8440000

У кое-кого на машине запущены торренты, но нагрузка должны была 
прекратиться, когда я отключил eth0...

При этом, пинг извне - 20 мс.

Дальше я сделал глупость: решил проверить, что скажет netstat. В итоге 
получил вот такое:
May  3 21:21:35 cita kernel: [24748.676007] BUG: soft lockup - CPU#0 
stuck for 61s! [netstat:12144]
May  3 21:21:35 cita kernel: [24748.676007] Modules linked in: garp 
mptctl mptbase simfs vzethdev vznetdev vzrst vzcpt vzdquota vzmon vzdev 
af_packet xt_tcpudp ipt_ttl iptable_mangle ipt_REJECT rpcsec_gss_krb5 
auth_rpcgss des_generic sunrpc bridge stp nf_conntrack_ftp 
iptable_filter dm_mod sr_mod ppdev joydev pl2303 usbserial i5000_edac sg 
edac_core i2c_i801 psmouse pcspkr ide_cd_mod i2c_core serio_raw cdrom 
e1000e floppy rtc_cmos rtc_core parport_pc parport rtc_lib evdev thermal 
processor container tun ipt_MASQUERADE iptable_nat ip_tables nf_nat 
x_tables nf_conntrack_ipv4 nf_conntrack button ohci_hcd uhci_hcd usbhid 
hid ff_memless ehci_hcd usb_storage libusual usbcore ext3 jbd mbcache 
ata_generic aacraid ata_piix pata_acpi libata dock sd_mod crc_t10dif 
scsi_mod ide_disk piix ide_pci_generic ide_core [last unloaded: 8021q]
May  3 21:21:35 cita kernel: [24748.676007] CPU 0:
May  3 21:21:35 cita kernel: [24748.676007] Modules linked in: garp 
mptctl mptbase simfs vzethdev vznetdev vzrst vzcpt vzdquota vzmon vzdev 
af_packet xt_tcpudp ipt_ttl iptable_mangle ipt_REJECT rpcsec_gss_krb5 
auth_rpcgss des_generic sunrpc bridge stp nf_conntrack_ftp 
iptable_filter dm_mod sr_mod ppdev joydev pl2303 usbserial i5000_edac sg 
edac_core i2c_i801 psmouse pcspkr ide_cd_mod i2c_core serio_raw cdrom 
e1000e floppy rtc_cmos rtc_core parport_pc parport rtc_lib evdev thermal 
processor container tun ipt_MASQUERADE iptable_nat ip_tables nf_nat 
x_tables nf_conntrack_ipv4 nf_conntrack button ohci_hcd uhci_hcd usbhid 
hid ff_memless ehci_hcd usb_storage libusual usbcore ext3 jbd mbcache 
ata_generic aacraid ata_piix pata_acpi libata dock sd_mod crc_t10dif 
scsi_mod ide_disk piix ide_pci_generic ide_core [last unloaded: 8021q]
May  3 21:21:35 cita kernel: [24748.676007] Pid: 12144, comm: netstat 
Not tainted 2.6.27-ovz-smp-alt9 #1 briullov
May  3 21:21:35 cita kernel: [24748.676007] RIP: 
0010:[<ffffffff80438582>]  [<ffffffff80438582>] dst_release+0x22/0x40
May  3 21:21:35 cita kernel: [24748.676007] RSP: 0018:ffffffff807408b8 
EFLAGS: 00000202
May  3 21:21:35 cita kernel: [24748.676007] RAX: 0000000000000002 RBX: 
ffffffff807408c8 RCX: ffff8801285878ce
May  3 21:21:35 cita kernel: [24748.676007] RDX: 0000000000000008 RSI: 
000000000000000e RDI: ffff880139acf048
May  3 21:21:35 cita kernel: [24748.676007] RBP: ffffffff80740830 R08: 
ffff88013d1a30e0 R09: ffff880137574000
May  3 21:21:35 cita kernel: [24748.676007] R10: 000000000000000e R11: 
0000000000000000 R12: ffffffff8020d386
May  3 21:21:35 cita kernel: [24748.676007] R13: ffffffff80740830 R14: 
ffff880137453080 R15: ffff8800820a8c00
May  3 21:21:35 cita kernel: [24748.676007] FS:  00007fe1301f06f0(0000) 
GS:ffffffff80749080(0000) knlGS:0000000000000000
May  3 21:21:35 cita kernel: [24748.676007] CS:  0010 DS: 0000 ES: 0000 
CR0: 000000008005003b
May  3 21:21:35 cita kernel: [24748.676007] CR2: 00007fe12fd05480 CR3: 
00000001259e4000 CR4: 00000000000006e0
May  3 21:21:35 cita kernel: [24748.676007] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
May  3 21:21:35 cita kernel: [24748.676007] DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
May  3 21:21:35 cita kernel: [24748.676007]
May  3 21:21:35 cita kernel: [24748.676007] Call Trace:
May  3 21:21:35 cita kernel: [24748.676007]  <IRQ>  [<ffffffff8044774e>] 
? eth_type_trans+0x2e/0x100
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0420724>] ? 
veth_xmit+0x134/0x2c0 [vzethdev]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff804348cc>] ? 
dev_hard_start_xmit+0x26c/0x2f0
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff8043665e>] ? 
dev_queue_xmit+0x35e/0x580
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033e4c0>] ? 
br_nf_dev_queue_xmit+0x0/0x50 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa03392e7>] ? 
br_dev_queue_push_xmit+0x67/0x90 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033e4df>] ? 
br_nf_dev_queue_xmit+0x1f/0x50 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033ede8>] ? 
br_nf_post_routing+0x168/0x230 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455bf7>] ? 
nf_iterate+0x67/0xa0
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0339280>] ? 
br_dev_queue_push_xmit+0x0/0x90 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455cd3>] ? 
nf_hook_slow+0xa3/0x100
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0339280>] ? 
br_dev_queue_push_xmit+0x0/0x90 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0339351>] ? 
br_forward_finish+0x41/0x60 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033ea08>] ? 
br_nf_forward_finish+0x128/0x130 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033f5d0>] ? 
br_nf_forward_ip+0x230/0x2f0 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa01b5efc>] ? 
tcp_error+0xfc/0x260 [nf_conntrack]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455bf7>] ? 
nf_iterate+0x67/0xa0
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0339310>] ? 
br_forward_finish+0x0/0x60 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455cd3>] ? 
nf_hook_slow+0xa3/0x100
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0339310>] ? 
br_forward_finish+0x0/0x60 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa03393c1>] ? 
__br_forward+0x51/0x90 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa0339458>] ? 
br_forward+0x58/0x70 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033a226>] ? 
br_handle_frame_finish+0x146/0x1e0 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033f198>] ? 
br_nf_pre_routing_finish+0x2e8/0x320 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033eeb0>] ? 
br_nf_pre_routing_finish+0x0/0x320 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455cd3>] ? 
nf_hook_slow+0xa3/0x100
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033eeb0>] ? 
br_nf_pre_routing_finish+0x0/0x320 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff802d1630>] ? 
kmem_cache_alloc+0x80/0x100
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033fa66>] ? 
br_nf_pre_routing+0x3d6/0x7c0 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455bf7>] ? 
nf_iterate+0x67/0xa0
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033a0e0>] ? 
br_handle_frame_finish+0x0/0x1e0 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80455cd3>] ? 
nf_hook_slow+0xa3/0x100
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033a0e0>] ? 
br_handle_frame_finish+0x0/0x1e0 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffffa033a46e>] ? 
br_handle_frame+0x1ae/0x250 [bridge]
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff8045d2eb>] ? 
ip_rcv+0x27b/0x310
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80431ddd>] ? 
netif_receive_skb+0x1ed/0x660
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff804360f4>] ? 
process_backlog+0x74/0xf0
May  3 21:21:35 cita kernel: [24748.676007]  [<ffffffff80435f3e>] ? 
net_rx_action+0xee/0x230
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff8024a0f2>] ? 
__do_softirq+0xc2/0x190
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff8020d93c>] ? 
call_softirq+0x1c/0x30
May  3 21:21:36 cita kernel: [24748.676007]  <EOI>  [<ffffffff8020ef85>] 
? do_softirq+0x45/0x80
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff8024a5f1>] ? 
local_bh_enable_ip+0xa1/0xb0
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff804c1860>] ? 
_read_unlock_bh+0x10/0x20
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff8047d0a8>] ? 
established_get_first+0xf8/0x130
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff8047d430>] ? 
tcp_seq_next+0xb0/0xc0
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff802f87b2>] ? 
seq_read+0x232/0x3a0
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff80324718>] ? 
proc_reg_read+0x78/0xb0
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff802d73f8>] ? 
vfs_read+0xc8/0x180
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff802d7610>] ? 
sys_read+0x50/0xe0
May  3 21:21:36 cita kernel: [24748.676007]  [<ffffffff8020c68b>] ? 
system_call_fastpath+0x16/0x1b
May  3 21:21:36 cita kernel: [24748.676007]




----------- следующая часть -----------
Было удалено вложение не в текстовом формате...
Имя     : syslog.log.bz
Тип     : application/octet-stream
Размер  : 27452 байтов
Описание: отсутствует
Url     : <http://lists.altlinux.org/pipermail/community/attachments/20100504/338bcd44/attachment-0001.obj>


Подробная информация о списке рассылки community