[Comm] ksoftirqd и 100% CPU
Di
denyago на rambler.ru
Вт Май 4 08:25:52 UTC 2010
Доброго времени суток.
Сервер (2.6.27-ovz-smp-alt9 #1 SMP x86_64) используется в т.ч. как
маршрутизатор. Этим заведует VE с проброшенными eth0|1 . На eth0
настроен VLAN. Всё это работало несколько месяцев. В последнюю неделю
(да и раньше) ничего не обновлял...
Посоветуйте кто виноват и что делать, пожалуйста.
Сегодня столкнулся с проблемой:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7 root 15 -5 0 0 0 R 85.8 0.0 24:12.35 ksoftirqd/1
13 root 15 -5 0 0 0 R 70.5 0.0 21:55.17 ksoftirqd/3
4 root 15 -5 0 0 0 R 61.2 0.0 22:09.86 ksoftirqd/0
10 root 15 -5 0 0 0 R 51.3 0.0 22:51.07 ksoftirqd/2
За вчера, столкнулся с єтим 2 раза. Первый раз помог reset, второй раз
охранник, наверно, спал и я отложил перезагрузку на сегодня.
Поискал. Всюду ругались на проблемы драйвера сетевой карты. Пустил меня
сервер только по ssh и стал я разбираться...
1. Отрубил eth0 (внутренний интерфейс) и выгрузил 8021q - не помогло
2. Понаблюдал, как появляются прерывания. Надеюсь, я правильно понял,
что означают числа в /proc/interrupts:
CPU0 CPU1 CPU2 CPU3 Время
14: 15448 15289 15610 15530 20:13:02 IO-APIC-edge ide0
15616 15490 15806 15721 20:17:15 00:04:13 ср. 3
16: 886216 884792 886101 885861 20:13:02 IO-APIC-fasteoi aacraid
892159 890754 891996 891830 20:17:15 00:04:13 ср. 94
17: 124397 125099 124218 124657 20:13:02 IO-APIC-fasteoi
ehci_hcd:usb1,uhci_hcd:usb2
126660 127303 126480 126884 20:17:15 00:04:13 ср. 35
2296: 1302917 1303284 1301544 1299754 20:13:02 PCI-MSI-edge eth1
1303305 1303680 1301954 1300130 20:17:15 00:04:13 ср. 6
LOC: 4554114 3947750 4023168 3822185 20:13:02 Local timer interrupts
4620588 4014420 4105265 3890368 20:17:15 00:04:13 ср. 1120
RES: 48597 85163 42248 87307 20:13:02 Rescheduling interrupts
51718 87104 44109 90804 20:17:15 00:04:13 ср. 41
TLB: 2853653 2470564 2986077 3606213 20:13:02 TLB shootdowns
2855765 2471715 2987263 3607910 20:17:15 00:04:13 ср. 24
Посчитал среднее количество за 4 мин. Лидером стал aacraid (аппаратный,
отдельной платой), оставшаяся сетевуха почти прерываний не создавала.
3. Остановил nagios, внутренний сайт, внешний сайт, collectd - не помогло.
4. Посмотрел, что происходит с внешним интерфейсом:
# ethtool -i eth1
driver: e1000e
version: 0.3.3.3-k6
firmware-version: 1.6-12
bus-info: 0000:06:00.0
# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller (Copper) (rev 01)
Subsystem: Super Micro Computer Inc Device 0000
Flags: bus master, fast devsel, latency 0, IRQ 2296
Memory at d8420000 (32-bit, non-prefetchable) [size=128K]
I/O ports at 2020 [size=32]
Capabilities: [c8] Power Management version 2
Capabilities: [d0] MSI: Mask- 64bit+ Count=1/1 Enable+
Capabilities: [e0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 7e-b7-34-ff-ff-48-30-00
Kernel driver in use: e1000e
Kernel modules: e1000e
eth1 Link encap:Ethernet HWaddr ***
inet addr:*** Bcast:*** Mask:***
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2396787 errors:0 dropped:0 overruns:0 frame:0
TX packets:6475512 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:213544646 (203.6 MiB) TX bytes:7619333922 (7.0 GiB)
Memory:d8420000-d8440000
У кое-кого на машине запущены торренты, но нагрузка должны была
прекратиться, когда я отключил eth0...
При этом, пинг извне - 20 мс.
Дальше я сделал глупость: решил проверить, что скажет netstat. В итоге
получил вот такое:
May 3 21:21:35 cita kernel: [24748.676007] BUG: soft lockup - CPU#0
stuck for 61s! [netstat:12144]
May 3 21:21:35 cita kernel: [24748.676007] Modules linked in: garp
mptctl mptbase simfs vzethdev vznetdev vzrst vzcpt vzdquota vzmon vzdev
af_packet xt_tcpudp ipt_ttl iptable_mangle ipt_REJECT rpcsec_gss_krb5
auth_rpcgss des_generic sunrpc bridge stp nf_conntrack_ftp
iptable_filter dm_mod sr_mod ppdev joydev pl2303 usbserial i5000_edac sg
edac_core i2c_i801 psmouse pcspkr ide_cd_mod i2c_core serio_raw cdrom
e1000e floppy rtc_cmos rtc_core parport_pc parport rtc_lib evdev thermal
processor container tun ipt_MASQUERADE iptable_nat ip_tables nf_nat
x_tables nf_conntrack_ipv4 nf_conntrack button ohci_hcd uhci_hcd usbhid
hid ff_memless ehci_hcd usb_storage libusual usbcore ext3 jbd mbcache
ata_generic aacraid ata_piix pata_acpi libata dock sd_mod crc_t10dif
scsi_mod ide_disk piix ide_pci_generic ide_core [last unloaded: 8021q]
May 3 21:21:35 cita kernel: [24748.676007] CPU 0:
May 3 21:21:35 cita kernel: [24748.676007] Modules linked in: garp
mptctl mptbase simfs vzethdev vznetdev vzrst vzcpt vzdquota vzmon vzdev
af_packet xt_tcpudp ipt_ttl iptable_mangle ipt_REJECT rpcsec_gss_krb5
auth_rpcgss des_generic sunrpc bridge stp nf_conntrack_ftp
iptable_filter dm_mod sr_mod ppdev joydev pl2303 usbserial i5000_edac sg
edac_core i2c_i801 psmouse pcspkr ide_cd_mod i2c_core serio_raw cdrom
e1000e floppy rtc_cmos rtc_core parport_pc parport rtc_lib evdev thermal
processor container tun ipt_MASQUERADE iptable_nat ip_tables nf_nat
x_tables nf_conntrack_ipv4 nf_conntrack button ohci_hcd uhci_hcd usbhid
hid ff_memless ehci_hcd usb_storage libusual usbcore ext3 jbd mbcache
ata_generic aacraid ata_piix pata_acpi libata dock sd_mod crc_t10dif
scsi_mod ide_disk piix ide_pci_generic ide_core [last unloaded: 8021q]
May 3 21:21:35 cita kernel: [24748.676007] Pid: 12144, comm: netstat
Not tainted 2.6.27-ovz-smp-alt9 #1 briullov
May 3 21:21:35 cita kernel: [24748.676007] RIP:
0010:[<ffffffff80438582>] [<ffffffff80438582>] dst_release+0x22/0x40
May 3 21:21:35 cita kernel: [24748.676007] RSP: 0018:ffffffff807408b8
EFLAGS: 00000202
May 3 21:21:35 cita kernel: [24748.676007] RAX: 0000000000000002 RBX:
ffffffff807408c8 RCX: ffff8801285878ce
May 3 21:21:35 cita kernel: [24748.676007] RDX: 0000000000000008 RSI:
000000000000000e RDI: ffff880139acf048
May 3 21:21:35 cita kernel: [24748.676007] RBP: ffffffff80740830 R08:
ffff88013d1a30e0 R09: ffff880137574000
May 3 21:21:35 cita kernel: [24748.676007] R10: 000000000000000e R11:
0000000000000000 R12: ffffffff8020d386
May 3 21:21:35 cita kernel: [24748.676007] R13: ffffffff80740830 R14:
ffff880137453080 R15: ffff8800820a8c00
May 3 21:21:35 cita kernel: [24748.676007] FS: 00007fe1301f06f0(0000)
GS:ffffffff80749080(0000) knlGS:0000000000000000
May 3 21:21:35 cita kernel: [24748.676007] CS: 0010 DS: 0000 ES: 0000
CR0: 000000008005003b
May 3 21:21:35 cita kernel: [24748.676007] CR2: 00007fe12fd05480 CR3:
00000001259e4000 CR4: 00000000000006e0
May 3 21:21:35 cita kernel: [24748.676007] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
May 3 21:21:35 cita kernel: [24748.676007] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
May 3 21:21:35 cita kernel: [24748.676007]
May 3 21:21:35 cita kernel: [24748.676007] Call Trace:
May 3 21:21:35 cita kernel: [24748.676007] <IRQ> [<ffffffff8044774e>]
? eth_type_trans+0x2e/0x100
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0420724>] ?
veth_xmit+0x134/0x2c0 [vzethdev]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff804348cc>] ?
dev_hard_start_xmit+0x26c/0x2f0
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff8043665e>] ?
dev_queue_xmit+0x35e/0x580
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033e4c0>] ?
br_nf_dev_queue_xmit+0x0/0x50 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa03392e7>] ?
br_dev_queue_push_xmit+0x67/0x90 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033e4df>] ?
br_nf_dev_queue_xmit+0x1f/0x50 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033ede8>] ?
br_nf_post_routing+0x168/0x230 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455bf7>] ?
nf_iterate+0x67/0xa0
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0339280>] ?
br_dev_queue_push_xmit+0x0/0x90 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455cd3>] ?
nf_hook_slow+0xa3/0x100
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0339280>] ?
br_dev_queue_push_xmit+0x0/0x90 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0339351>] ?
br_forward_finish+0x41/0x60 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033ea08>] ?
br_nf_forward_finish+0x128/0x130 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033f5d0>] ?
br_nf_forward_ip+0x230/0x2f0 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa01b5efc>] ?
tcp_error+0xfc/0x260 [nf_conntrack]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455bf7>] ?
nf_iterate+0x67/0xa0
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0339310>] ?
br_forward_finish+0x0/0x60 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455cd3>] ?
nf_hook_slow+0xa3/0x100
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0339310>] ?
br_forward_finish+0x0/0x60 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa03393c1>] ?
__br_forward+0x51/0x90 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa0339458>] ?
br_forward+0x58/0x70 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033a226>] ?
br_handle_frame_finish+0x146/0x1e0 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033f198>] ?
br_nf_pre_routing_finish+0x2e8/0x320 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033eeb0>] ?
br_nf_pre_routing_finish+0x0/0x320 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455cd3>] ?
nf_hook_slow+0xa3/0x100
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033eeb0>] ?
br_nf_pre_routing_finish+0x0/0x320 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff802d1630>] ?
kmem_cache_alloc+0x80/0x100
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033fa66>] ?
br_nf_pre_routing+0x3d6/0x7c0 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455bf7>] ?
nf_iterate+0x67/0xa0
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033a0e0>] ?
br_handle_frame_finish+0x0/0x1e0 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80455cd3>] ?
nf_hook_slow+0xa3/0x100
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033a0e0>] ?
br_handle_frame_finish+0x0/0x1e0 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffffa033a46e>] ?
br_handle_frame+0x1ae/0x250 [bridge]
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff8045d2eb>] ?
ip_rcv+0x27b/0x310
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80431ddd>] ?
netif_receive_skb+0x1ed/0x660
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff804360f4>] ?
process_backlog+0x74/0xf0
May 3 21:21:35 cita kernel: [24748.676007] [<ffffffff80435f3e>] ?
net_rx_action+0xee/0x230
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff8024a0f2>] ?
__do_softirq+0xc2/0x190
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff8020d93c>] ?
call_softirq+0x1c/0x30
May 3 21:21:36 cita kernel: [24748.676007] <EOI> [<ffffffff8020ef85>]
? do_softirq+0x45/0x80
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff8024a5f1>] ?
local_bh_enable_ip+0xa1/0xb0
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff804c1860>] ?
_read_unlock_bh+0x10/0x20
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff8047d0a8>] ?
established_get_first+0xf8/0x130
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff8047d430>] ?
tcp_seq_next+0xb0/0xc0
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff802f87b2>] ?
seq_read+0x232/0x3a0
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff80324718>] ?
proc_reg_read+0x78/0xb0
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff802d73f8>] ?
vfs_read+0xc8/0x180
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff802d7610>] ?
sys_read+0x50/0xe0
May 3 21:21:36 cita kernel: [24748.676007] [<ffffffff8020c68b>] ?
system_call_fastpath+0x16/0x1b
May 3 21:21:36 cita kernel: [24748.676007]
----------- следующая часть -----------
Было удалено вложение не в текстовом формате...
Имя : syslog.log.bz
Тип : application/octet-stream
Размер : 27452 байтов
Описание: отсутствует
Url : <http://lists.altlinux.org/pipermail/community/attachments/20100504/338bcd44/attachment-0001.obj>
Подробная информация о списке рассылки community