Skip to content

[5.15] revert e1000e changes#78

Merged
gratian merged 1 commit intoni:nilrt/master/5.15from
amstewart:dev/5.15/revert-e1000e-changes
Sep 1, 2022
Merged

[5.15] revert e1000e changes#78
gratian merged 1 commit intoni:nilrt/master/5.15from
amstewart:dev/5.15/revert-e1000e-changes

Conversation

@amstewart
Copy link

@amstewart amstewart commented Sep 1, 2022

This PR reverts #77.

When booting NILRT 5.15 on devices with an e1000e ethernet device (at least a PXIe-8861), the changes in #77 seem to cause an indefinite RCU stall, as below.

image

Incriminating kernel stack trace:

[    4.238970] ------------[ cut here ]------------
[    4.238972] Voluntary context switch within RCU read-side critical section!
[    4.238974] WARNING: CPU: 2 PID: 1030 at kernel/rcu/tree_plugin.h:316 rcu_note_context_switch+0x4fe/0x560
[    4.238979] Modules linked in: g_ether u_ether libcomposite udc_core mousedev radeon drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm agpgart x86_pkg_temp_thermal coretemp aesni_intel tpm_tis crypto_simd tpm_tis_core igb e1000e i2c_i801 drm i2c_algo_bit tpm i2c_smbus leds_nic78bx button video backlight rng_core
[    4.238990] CPU: 2 PID: 1030 Comm: ifconfig Not tainted 5.15.55-rt48 #1
[    4.238992] Hardware name: National Instruments NI PXIe-8861/NI PXIe-8861, BIOS 1.2.2f0 06/24/2019
[    4.238993] RIP: 0010:rcu_note_context_switch+0x4fe/0x560
[    4.238994] Code: 08 48 89 87 c8 02 00 00 4c 89 8f d0 02 00 00 49 89 11 e9 2c fd ff ff 48 c7 c7 40 71 26 9d c6 05 ea 96 33 01 01 e8 62 3c 70 00 <0f> 0b e9 53 fb ff ff e8 7d 95 f2 ff e9 9f fd ff ff c6 43 15 00 48
[    4.238996] RSP: 0018:ffff9f3841d37af8 EFLAGS: 00010082
[    4.238997] RAX: 0000000000000000 RBX: ffff8dcfb5d28e40 RCX: 000000000000093c
[    4.238998] RDX: 00000000ffffe314 RSI: ffffffff9d461860 RDI: 0000000000000001
[    4.238999] RBP: 0000000000000000 R08: 000000000000003f R09: ffff9f3841d37a90
[    4.239000] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[    4.239000] R13: ffff8dce02c249c0 R14: ffff8dce00e50000 R15: ffff8dce02c249c0
[    4.239001] FS:  00007fa7df250800(0000) GS:ffff8dcfb5d00000(0000) knlGS:0000000000000000
[    4.239003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.239004] CR2: 00007fdb98be0020 CR3: 0000000102ff4004 CR4: 00000000003706e0
[    4.239004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.239005] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.239006] Call Trace:
[    4.239008]  <TASK>
[    4.239008]  ? __schedule+0x77/0x5f0
[    4.239011]  ? clockevents_program_event+0x8d/0xf0
[    4.239014]  ? schedule+0xad/0x110
[    4.239016]  ? schedule_hrtimeout_range_clock+0x9f/0x130
[    4.239017]  ? __hrtimer_init+0xe0/0xe0
[    4.239019]  ? usleep_range_state+0x60/0x90
[    4.239020]  ? e1000e_update_mc_addr_list_generic+0x12a/0x150 [e1000e]
[    4.239026]  ? e1000e_set_rx_mode+0x229/0x5e0 [e1000e]
[    4.239031]  ? __dev_open+0xf4/0x180
[    4.239033]  ? __dev_change_flags+0x1ba/0x240
[    4.239035]  ? dev_change_flags+0x21/0x60
[    4.239037]  ? devinet_ioctl+0x63d/0x810
[    4.239039]  ? inet_ioctl+0x175/0x1b0
[    4.239041]  ? dev_get_by_name_rcu+0xa/0x20
[    4.239043]  ? netdev_name_node_lookup_rcu+0x5e/0x70
[    4.239045]  ? dev_get_by_name_rcu+0xa/0x20
[    4.239047]  ? dev_ioctl+0x293/0x550
[    4.239049]  ? sock_do_ioctl.constprop.0+0x2b/0xd0
[    4.239051]  ? sock_ioctl+0xcc/0x2f0
[    4.239053]  ? handle_mm_fault+0x73/0x1b0
[    4.239056]  ? __x64_sys_ioctl+0x80/0xb0
[    4.239058]  ? do_syscall_64+0x40/0x90
[    4.239060]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[    4.239063]  </TASK>
[    4.239063] ---[ end trace 0000000000000002 ]---

The stall forces the user to recover the target using either NILRT safemode, or a USB provisioning key. Revert the changes until a fix can be developed.

NI AZDO: https://dev.azure.com/ni/DevCentral/_workitems/edit/2135135

@amstewart
Copy link
Author

@gratian Do you want me to revert the same in the 5.10 mainline? Or can we wait on a fix there because it isn't the default anymore?

@gratian
Copy link

gratian commented Sep 1, 2022

@amstewart Can you revert just the commit that enables the option in nati_x86_64_defconfig? I think the other change is valid and we can leave it in. I think we can wait on 5.10 until we find a better fix since it's not in use.

@amstewart amstewart force-pushed the dev/5.15/revert-e1000e-changes branch from 3793367 to 5081616 Compare September 1, 2022 19:24
This reverts commit 3425c3b.

Signed-off-by: Alex Stewart <alex.stewart@ni.com>
@amstewart amstewart force-pushed the dev/5.15/revert-e1000e-changes branch from 5081616 to 378d8fb Compare September 1, 2022 19:25
@amstewart
Copy link
Author

Patch V2

  • Reverted only the defconfig change, per @gratian 's request.

@gratian gratian merged commit 378d8fb into ni:nilrt/master/5.15 Sep 1, 2022
@amstewart amstewart deleted the dev/5.15/revert-e1000e-changes branch September 1, 2022 19:35
mike-petersen-ni pushed a commit to mike-petersen-ni/linux that referenced this pull request Jan 9, 2023
[ Upstream commit 34d01df ]

We need to detach from the power domains also on remove, not just on
probe fail so a subsequent probe works as expected.

Otherwise the following error appears on re-probe:

[   29.452005] sysfs: cannot create duplicate filename '/devices/genpd:0:3000000.remoteproc'
[   29.477121] CPU: 1 PID: 483 Comm: sh Tainted: G        W          6.1.0-rc4-00075-g71a113770bda ni#78
[   29.510319] Hardware name: Fairphone 4 (DT)
[   29.538335] Call trace:
[   29.564470]  dump_backtrace.part.0+0xe0/0xf0
[   29.592602]  show_stack+0x18/0x30
[   29.619616]  dump_stack_lvl+0x64/0x80
[   29.646834]  dump_stack+0x18/0x34
[   29.673541]  sysfs_warn_dup+0x60/0x7c
[   29.700592]  sysfs_create_dir_ns+0xec/0x110
[   29.728057]  kobject_add_internal+0xb8/0x374
[   29.755530]  kobject_add+0x9c/0x104
[   29.782072]  device_add+0xbc/0x8a0
[   29.808445]  device_register+0x20/0x30
[   29.835175]  genpd_dev_pm_attach_by_id+0xa4/0x190
[   29.862851]  genpd_dev_pm_attach_by_name+0x3c/0xb0
[   29.890472]  dev_pm_domain_attach_by_name+0x20/0x30
[   29.918212]  adsp_probe+0x278/0x580
[   29.944384]  platform_probe+0x68/0xc0
[   29.970603]  really_probe+0xbc/0x2dc
[   29.996662]  __driver_probe_device+0x78/0xe0
[   30.023491]  device_driver_attach+0x48/0xac
[   30.050215]  bind_store+0xb8/0x114
[   30.075957]  drv_attr_store+0x24/0x3c
[   30.101874]  sysfs_kf_write+0x44/0x54
[   30.127751]  kernfs_fop_write_iter+0x120/0x1f0
[   30.154448]  vfs_write+0x1ac/0x380
[   30.179937]  ksys_write+0x70/0x104
[   30.205274]  __arm64_sys_write+0x1c/0x2c
[   30.231060]  invoke_syscall+0x48/0x114
[   30.256594]  el0_svc_common.constprop.0+0x44/0xec
[   30.283183]  do_el0_svc+0x2c/0xd0
[   30.308320]  el0_svc+0x2c/0x84
[   30.333059]  el0t_64_sync_handler+0xf4/0x120
[   30.359001]  el0t_64_sync+0x18c/0x190
[   30.384385] kobject_add_internal failed for genpd:0:3000000.remoteproc with -EEXIST, don't try to register things with the same name in the same directory.
[   30.406029] remoteproc remoteproc0: releasing 3000000.remoteproc
[   30.416064] qcom_q6v5_pas: probe of 3000000.remoteproc failed with error -17

Fixes: 17ee2fb ("remoteproc: qcom: pas: Vote for active/proxy power domains")
Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Luca Weiss <luca.weiss@fairphone.com>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20221118090816.100012-2-luca.weiss@fairphone.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
mike-petersen-ni pushed a commit to mike-petersen-ni/linux that referenced this pull request Jan 9, 2023
[ Upstream commit bd5dac7 ]

Fix the following NULL pointer dereference avoiding to run
mt76u_status_worker thread if the device is not running yet.

KASAN: null-ptr-deref in range
[0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 98 Comm: kworker/u2:2 Not tainted 5.14.0+ ni#78 Hardware
name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Workqueue: mt76 mt76u_tx_status_data
RIP: 0010:mt76x02_mac_fill_tx_status.isra.0+0x82c/0x9e0
Code: c5 48 b8 00 00 00 00 00 fc ff df 80 3c 02 00 0f 85 94 01 00 00
48 b8 00 00 00 00 00 fc ff df 4d 8b 34 24 4c 89 f2 48 c1 ea 03 <0f>
b6
04 02 84 c0 74 08 3c 03 0f 8e 89 01 00 00 41 8b 16 41 0f b7
RSP: 0018:ffffc900005af988 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffc900005afae8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff832fc661 RDI: ffffc900005afc2a
RBP: ffffc900005afae0 R08: 0000000000000001 R09: fffff520000b5f3c
R10: 0000000000000003 R11: fffff520000b5f3b R12: ffff88810b6132d8
R13: 000000000000ffff R14: 0000000000000000 R15: ffffc900005afc28
FS:  0000000000000000(0000) GS:ffff88811aa00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa0eda6a000 CR3: 0000000118f17000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 mt76x02_send_tx_status+0x1d2/0xeb0
 mt76x02_tx_status_data+0x8e/0xd0
 mt76u_tx_status_data+0xe1/0x240
 process_one_work+0x92b/0x1460
 worker_thread+0x95/0xe00
 kthread+0x3a1/0x480
 ret_from_fork+0x1f/0x30
Modules linked in:
--[ end trace 8df5d20fc5040f65 ]--
RIP: 0010:mt76x02_mac_fill_tx_status.isra.0+0x82c/0x9e0
Code: c5 48 b8 00 00 00 00 00 fc ff df 80 3c 02 00 0f 85 94 01 00 00
48 b8 00 00 00 00 00 fc ff df 4d 8b 34 24 4c 89 f2 48 c1 ea 03 <0f>
b6
04 02 84 c0 74 08 3c 03 0f 8e 89 01 00 00 41 8b 16 41 0f b7
RSP: 0018:ffffc900005af988 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffc900005afae8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff832fc661 RDI: ffffc900005afc2a
RBP: ffffc900005afae0 R08: 0000000000000001 R09: fffff520000b5f3c
R10: 0000000000000003 R11: fffff520000b5f3b R12: ffff88810b6132d8
R13: 000000000000ffff R14: 0000000000000000 R15: ffffc900005afc28
FS:  0000000000000000(0000) GS:ffff88811aa00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa0eda6a000 CR3: 0000000118f17000 CR4: 0000000000750ef0
PKRU: 55555554

Moreover move stat_work schedule out of the for loop.

Reported-by: Dokyung Song <dokyungs@yonsei.ac.kr>
Co-developed-by: Deren Wu <deren.wu@mediatek.com>
Signed-off-by: Deren Wu <deren.wu@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Sasha Levin <sashal@kernel.org>
gratian pushed a commit to gratian/linux that referenced this pull request Feb 28, 2024
[ Upstream commit b16904f ]

With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  ni#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  ni#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  ni#13      bpf_cookie:FAIL
  ni#77      fentry_fexit:FAIL
  ni#78/1    fentry_test/fentry:FAIL
  ni#78      fentry_test:FAIL
  ni#82/1    fexit_test/fexit:FAIL
  ni#82      fexit_test:FAIL
  ni#112/1   kprobe_multi_test/skel_api:FAIL
  ni#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  ni#112     kprobe_multi_test:FAIL
  #356/17  test_global_funcs/global_func17:FAIL
  #356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for ni#13/ni#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c)), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
gratian pushed a commit that referenced this pull request Jan 8, 2025
[ Upstream commit 5c1806c ]

While fuzzing an arm64 kernel, Alexander Potapenko reported:

| BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update
|
| write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0:
|  update_fast_timekeeper kernel/time/timekeeping.c:430 [inline]
|  timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768
|  timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344
|  update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360
|  [...]
|
| read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1:
|  __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline]
|  ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489
|  init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263
|  init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311
|  [...]
|
| value changed: 0x000002f875d33266 -> 0x000002f877416866
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78

This is a false positive data race between a seqcount latch writer and a reader
accessing stale data. Since its introduction, KCSAN has never understood the
seqcount_latch interface (due to being unannotated).

Unlike the regular seqlock interface, the seqcount_latch interface for latch
writers never has had a well-defined critical section, making it difficult to
teach tooling where the critical section starts and ends.

Introduce an instrumentable (non-raw) seqcount_latch interface, with
which we can clearly denote writer critical sections. This both helps
readability and tooling like KCSAN to understand when the writer is done
updating all latch copies.

Fixes: 88ecd15 ("seqlock, kcsan: Add annotations for KCSAN")
Reported-by: Alexander Potapenko <glider@google.com>
Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-4-elver@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants