[5.15] revert e1000e changes by amstewart · Pull Request #78 · ni/linux

amstewart · 2022-09-01T19:05:31Z

This PR reverts #77.

When booting NILRT 5.15 on devices with an e1000e ethernet device (at least a PXIe-8861), the changes in #77 seem to cause an indefinite RCU stall, as below.

Incriminating kernel stack trace:

[    4.238970] ------------[ cut here ]------------
[    4.238972] Voluntary context switch within RCU read-side critical section!
[    4.238974] WARNING: CPU: 2 PID: 1030 at kernel/rcu/tree_plugin.h:316 rcu_note_context_switch+0x4fe/0x560
[    4.238979] Modules linked in: g_ether u_ether libcomposite udc_core mousedev radeon drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm agpgart x86_pkg_temp_thermal coretemp aesni_intel tpm_tis crypto_simd tpm_tis_core igb e1000e i2c_i801 drm i2c_algo_bit tpm i2c_smbus leds_nic78bx button video backlight rng_core
[    4.238990] CPU: 2 PID: 1030 Comm: ifconfig Not tainted 5.15.55-rt48 #1
[    4.238992] Hardware name: National Instruments NI PXIe-8861/NI PXIe-8861, BIOS 1.2.2f0 06/24/2019
[    4.238993] RIP: 0010:rcu_note_context_switch+0x4fe/0x560
[    4.238994] Code: 08 48 89 87 c8 02 00 00 4c 89 8f d0 02 00 00 49 89 11 e9 2c fd ff ff 48 c7 c7 40 71 26 9d c6 05 ea 96 33 01 01 e8 62 3c 70 00 <0f> 0b e9 53 fb ff ff e8 7d 95 f2 ff e9 9f fd ff ff c6 43 15 00 48
[    4.238996] RSP: 0018:ffff9f3841d37af8 EFLAGS: 00010082
[    4.238997] RAX: 0000000000000000 RBX: ffff8dcfb5d28e40 RCX: 000000000000093c
[    4.238998] RDX: 00000000ffffe314 RSI: ffffffff9d461860 RDI: 0000000000000001
[    4.238999] RBP: 0000000000000000 R08: 000000000000003f R09: ffff9f3841d37a90
[    4.239000] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[    4.239000] R13: ffff8dce02c249c0 R14: ffff8dce00e50000 R15: ffff8dce02c249c0
[    4.239001] FS:  00007fa7df250800(0000) GS:ffff8dcfb5d00000(0000) knlGS:0000000000000000
[    4.239003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.239004] CR2: 00007fdb98be0020 CR3: 0000000102ff4004 CR4: 00000000003706e0
[    4.239004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.239005] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.239006] Call Trace:
[    4.239008]  <TASK>
[    4.239008]  ? __schedule+0x77/0x5f0
[    4.239011]  ? clockevents_program_event+0x8d/0xf0
[    4.239014]  ? schedule+0xad/0x110
[    4.239016]  ? schedule_hrtimeout_range_clock+0x9f/0x130
[    4.239017]  ? __hrtimer_init+0xe0/0xe0
[    4.239019]  ? usleep_range_state+0x60/0x90
[    4.239020]  ? e1000e_update_mc_addr_list_generic+0x12a/0x150 [e1000e]
[    4.239026]  ? e1000e_set_rx_mode+0x229/0x5e0 [e1000e]
[    4.239031]  ? __dev_open+0xf4/0x180
[    4.239033]  ? __dev_change_flags+0x1ba/0x240
[    4.239035]  ? dev_change_flags+0x21/0x60
[    4.239037]  ? devinet_ioctl+0x63d/0x810
[    4.239039]  ? inet_ioctl+0x175/0x1b0
[    4.239041]  ? dev_get_by_name_rcu+0xa/0x20
[    4.239043]  ? netdev_name_node_lookup_rcu+0x5e/0x70
[    4.239045]  ? dev_get_by_name_rcu+0xa/0x20
[    4.239047]  ? dev_ioctl+0x293/0x550
[    4.239049]  ? sock_do_ioctl.constprop.0+0x2b/0xd0
[    4.239051]  ? sock_ioctl+0xcc/0x2f0
[    4.239053]  ? handle_mm_fault+0x73/0x1b0
[    4.239056]  ? __x64_sys_ioctl+0x80/0xb0
[    4.239058]  ? do_syscall_64+0x40/0x90
[    4.239060]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[    4.239063]  </TASK>
[    4.239063] ---[ end trace 0000000000000002 ]---

The stall forces the user to recover the target using either NILRT safemode, or a USB provisioning key. Revert the changes until a fix can be developed.

NI AZDO: https://dev.azure.com/ni/DevCentral/_workitems/edit/2135135

amstewart · 2022-09-01T19:07:13Z

@gratian Do you want me to revert the same in the 5.10 mainline? Or can we wait on a fix there because it isn't the default anymore?

gratian · 2022-09-01T19:18:44Z

@amstewart Can you revert just the commit that enables the option in nati_x86_64_defconfig? I think the other change is valid and we can leave it in. I think we can wait on 5.10 until we find a better fix since it's not in use.

This reverts commit 3425c3b. Signed-off-by: Alex Stewart <alex.stewart@ni.com>

amstewart · 2022-09-01T19:25:39Z

Patch V2

Reverted only the defconfig change, per @gratian 's request.

[ Upstream commit 34d01df ] We need to detach from the power domains also on remove, not just on probe fail so a subsequent probe works as expected. Otherwise the following error appears on re-probe: [ 29.452005] sysfs: cannot create duplicate filename '/devices/genpd:0:3000000.remoteproc' [ 29.477121] CPU: 1 PID: 483 Comm: sh Tainted: G W 6.1.0-rc4-00075-g71a113770bda ni#78 [ 29.510319] Hardware name: Fairphone 4 (DT) [ 29.538335] Call trace: [ 29.564470] dump_backtrace.part.0+0xe0/0xf0 [ 29.592602] show_stack+0x18/0x30 [ 29.619616] dump_stack_lvl+0x64/0x80 [ 29.646834] dump_stack+0x18/0x34 [ 29.673541] sysfs_warn_dup+0x60/0x7c [ 29.700592] sysfs_create_dir_ns+0xec/0x110 [ 29.728057] kobject_add_internal+0xb8/0x374 [ 29.755530] kobject_add+0x9c/0x104 [ 29.782072] device_add+0xbc/0x8a0 [ 29.808445] device_register+0x20/0x30 [ 29.835175] genpd_dev_pm_attach_by_id+0xa4/0x190 [ 29.862851] genpd_dev_pm_attach_by_name+0x3c/0xb0 [ 29.890472] dev_pm_domain_attach_by_name+0x20/0x30 [ 29.918212] adsp_probe+0x278/0x580 [ 29.944384] platform_probe+0x68/0xc0 [ 29.970603] really_probe+0xbc/0x2dc [ 29.996662] __driver_probe_device+0x78/0xe0 [ 30.023491] device_driver_attach+0x48/0xac [ 30.050215] bind_store+0xb8/0x114 [ 30.075957] drv_attr_store+0x24/0x3c [ 30.101874] sysfs_kf_write+0x44/0x54 [ 30.127751] kernfs_fop_write_iter+0x120/0x1f0 [ 30.154448] vfs_write+0x1ac/0x380 [ 30.179937] ksys_write+0x70/0x104 [ 30.205274] __arm64_sys_write+0x1c/0x2c [ 30.231060] invoke_syscall+0x48/0x114 [ 30.256594] el0_svc_common.constprop.0+0x44/0xec [ 30.283183] do_el0_svc+0x2c/0xd0 [ 30.308320] el0_svc+0x2c/0x84 [ 30.333059] el0t_64_sync_handler+0xf4/0x120 [ 30.359001] el0t_64_sync+0x18c/0x190 [ 30.384385] kobject_add_internal failed for genpd:0:3000000.remoteproc with -EEXIST, don't try to register things with the same name in the same directory. [ 30.406029] remoteproc remoteproc0: releasing 3000000.remoteproc [ 30.416064] qcom_q6v5_pas: probe of 3000000.remoteproc failed with error -17 Fixes: 17ee2fb ("remoteproc: qcom: pas: Vote for active/proxy power domains") Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20221118090816.100012-2-luca.weiss@fairphone.com Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit bd5dac7 ] Fix the following NULL pointer dereference avoiding to run mt76u_status_worker thread if the device is not running yet. KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 98 Comm: kworker/u2:2 Not tainted 5.14.0+ ni#78 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Workqueue: mt76 mt76u_tx_status_data RIP: 0010:mt76x02_mac_fill_tx_status.isra.0+0x82c/0x9e0 Code: c5 48 b8 00 00 00 00 00 fc ff df 80 3c 02 00 0f 85 94 01 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 34 24 4c 89 f2 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 89 01 00 00 41 8b 16 41 0f b7 RSP: 0018:ffffc900005af988 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffffc900005afae8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff832fc661 RDI: ffffc900005afc2a RBP: ffffc900005afae0 R08: 0000000000000001 R09: fffff520000b5f3c R10: 0000000000000003 R11: fffff520000b5f3b R12: ffff88810b6132d8 R13: 000000000000ffff R14: 0000000000000000 R15: ffffc900005afc28 FS: 0000000000000000(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa0eda6a000 CR3: 0000000118f17000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: mt76x02_send_tx_status+0x1d2/0xeb0 mt76x02_tx_status_data+0x8e/0xd0 mt76u_tx_status_data+0xe1/0x240 process_one_work+0x92b/0x1460 worker_thread+0x95/0xe00 kthread+0x3a1/0x480 ret_from_fork+0x1f/0x30 Modules linked in: --[ end trace 8df5d20fc5040f65 ]-- RIP: 0010:mt76x02_mac_fill_tx_status.isra.0+0x82c/0x9e0 Code: c5 48 b8 00 00 00 00 00 fc ff df 80 3c 02 00 0f 85 94 01 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 34 24 4c 89 f2 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 89 01 00 00 41 8b 16 41 0f b7 RSP: 0018:ffffc900005af988 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffffc900005afae8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff832fc661 RDI: ffffc900005afc2a RBP: ffffc900005afae0 R08: 0000000000000001 R09: fffff520000b5f3c R10: 0000000000000003 R11: fffff520000b5f3b R12: ffff88810b6132d8 R13: 000000000000ffff R14: 0000000000000000 R15: ffffc900005afc28 FS: 0000000000000000(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa0eda6a000 CR3: 0000000118f17000 CR4: 0000000000750ef0 PKRU: 55555554 Moreover move stat_work schedule out of the for loop. Reported-by: Dokyung Song <dokyungs@yonsei.ac.kr> Co-developed-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit b16904f ] With latest upstream llvm18, the following test cases failed: $ ./test_progs -j ni#13/2 bpf_cookie/multi_kprobe_link_api:FAIL ni#13/3 bpf_cookie/multi_kprobe_attach_api:FAIL ni#13 bpf_cookie:FAIL ni#77 fentry_fexit:FAIL ni#78/1 fentry_test/fentry:FAIL ni#78 fentry_test:FAIL ni#82/1 fexit_test/fexit:FAIL ni#82 fexit_test:FAIL ni#112/1 kprobe_multi_test/skel_api:FAIL ni#112/2 kprobe_multi_test/link_api_addrs:FAIL [...] ni#112 kprobe_multi_test:FAIL #356/17 test_global_funcs/global_func17:FAIL #356 test_global_funcs:FAIL Further analysis shows llvm upstream patch [1] is responsible for the above failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c, without [1], the asm code is: 0000000000000400 <bpf_fentry_test7>: 400: f3 0f 1e fa endbr64 404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9> 409: 48 89 f8 movq %rdi, %rax 40c: c3 retq 40d: 0f 1f 00 nopl (%rax) ... and with [1], the asm code is: 0000000000005d20 <bpf_fentry_test7.specialized.1>: 5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5> 5d25: c3 retq ... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7> and this caused test failures for ni#13/ni#77 etc. except #356. For test case #356/17, with [1] (progs/test_global_func17.c)), the main prog looks like: 0000000000000000 <global_func17>: 0: b4 00 00 00 2a 00 00 00 w0 = 0x2a 1: 95 00 00 00 00 00 00 00 exit ... which passed verification while the test itself expects a verification failure. Let us add 'barrier_var' style asm code in both places to prevent function specialization which caused selftests failure. [1] llvm/llvm-project#72903 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 5c1806c ] While fuzzing an arm64 kernel, Alexander Potapenko reported: | BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update | | write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0: | update_fast_timekeeper kernel/time/timekeeping.c:430 [inline] | timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768 | timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344 | update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360 | [...] | | read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1: | __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline] | ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489 | init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263 | init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311 | [...] | | value changed: 0x000002f875d33266 -> 0x000002f877416866 | | Reported by Kernel Concurrency Sanitizer on: | CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78 This is a false positive data race between a seqcount latch writer and a reader accessing stale data. Since its introduction, KCSAN has never understood the seqcount_latch interface (due to being unannotated). Unlike the regular seqlock interface, the seqcount_latch interface for latch writers never has had a well-defined critical section, making it difficult to teach tooling where the critical section starts and ends. Introduce an instrumentable (non-raw) seqcount_latch interface, with which we can clearly denote writer critical sections. This both helps readability and tooling like KCSAN to understand when the writer is done updating all latch copies. Fixes: 88ecd15 ("seqlock, kcsan: Add annotations for KCSAN") Reported-by: Alexander Potapenko <glider@google.com> Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241104161910.780003-4-elver@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>

amstewart requested review from gratian and shruthi-ravi September 1, 2022 19:06

shruthi-ravi approved these changes Sep 1, 2022

View reviewed changes

amstewart force-pushed the dev/5.15/revert-e1000e-changes branch from 3793367 to 5081616 Compare September 1, 2022 19:24

Revert "nati_x86_64_defconfig: enable CONFIG_E1000_DELAY"

378d8fb

This reverts commit 3425c3b. Signed-off-by: Alex Stewart <alex.stewart@ni.com>

amstewart force-pushed the dev/5.15/revert-e1000e-changes branch from 5081616 to 378d8fb Compare September 1, 2022 19:25

gratian approved these changes Sep 1, 2022

View reviewed changes

gratian merged commit 378d8fb into ni:nilrt/master/5.15 Sep 1, 2022

amstewart deleted the dev/5.15/revert-e1000e-changes branch September 1, 2022 19:35

SparkingSpork mentioned this pull request Jan 10, 2023

[5.15] Revert "e1000e: Add delays after writing to registers" #95

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[5.15] revert e1000e changes#78

[5.15] revert e1000e changes#78
gratian merged 1 commit intoni:nilrt/master/5.15from
amstewart:dev/5.15/revert-e1000e-changes

amstewart commented Sep 1, 2022 •

edited

Loading

Uh oh!

amstewart commented Sep 1, 2022

Uh oh!

gratian commented Sep 1, 2022

Uh oh!

amstewart commented Sep 1, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

amstewart commented Sep 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amstewart commented Sep 1, 2022

Uh oh!

gratian commented Sep 1, 2022

Uh oh!

amstewart commented Sep 1, 2022

Patch V2

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

amstewart commented Sep 1, 2022 •

edited

Loading