summaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
AgeCommit message (Collapse)Author
2020-12-08drm/amdgpu: fix debugfs creation/removal, againArnd Bergmann
There is still a warning when CONFIG_DEBUG_FS is disabled: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function] 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) Change the code again to make the compiler actually drop this code but not warn about it. Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: fix the issue of reserving bad pages failedDennis Li
In amdgpu_ras_reset_gpu, because bad pages may not be freed, it has high probability to reserve bad pages failed. Change to reserve bad pages when freeing VRAM. v2: 1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c 2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr reserve bad page failed, it will put it into pending list, otherwise put it into processed list; 3. remove amdgpu_ras_release_bad_pages, because retired page's info has been moved into amdgpu_vram_mgr v3: 1. formate code style; 2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation; 3. rename scope_pending as reservations_pending; 4. rename scope_processed as reserved_pages; 5. change to iterate over all the pending ones and try to insert them with drm_mm_reserve_node(); v4: 1. rename amdgpu_vram_mgr_reserve_scope as amdgpu_vram_mgr_reserve_range; 2. remove unused include "amdgpu_ras.h"; 3. rename amdgpu_vram_mgr_check_and_reserve as amdgpu_vram_mgr_do_reserve; 4. refine amdgpu_vram_mgr_reserve_range to call amdgpu_vram_mgr_do_reserve. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: remove redundant GPU resetDennis Li
Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: change to save bad pages in UMC error interrupt callbackDennis Li
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14drm/amdgpu: bypass querying ras error count registersGuchun Chen
Once ras recovery is issued by ras sync flood interrupt or ras controller interrupt, add this guard to bypass or execute ras error count register harvest of all IPs. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: break GPU recovery once it's in bad state(v4)Guchun Chen
When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: skip bad page reservation once issuing from eeprom writeGuchun Chen
Once the ras recovery is issued from eeprom write itself, bad page reservation should be ignored, otherwise, recursive calling of writting to eeprom would happen. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: validate bad page threshold in ras(v3)Guchun Chen
Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15drm/amdgpu: RAS emergency restart logic refineWenhui Sheng
If we are in RAS triggered situation and BACO isn't support, emergency restart is needed, and this code is only needed for some specific cases(vega20 with given smu fw version). After we add smu mode1 reset for sienna cichlid, we need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset, so in amdgpu_device_gpu_recover, we need differentiate which mode1 reset we are using, then decide if it's a full reset and then decide if emergency restart is needed, the logic will become much more complex. After discussion with Hawking, move emergency restart logic to an independent function. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01drm/amdgpu: disable ras query and iject during gpu resetJohn Clements
added flag to ras context to indicate if ras query functionality is ready Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10drm/amdgpu: add function to creat all ras debugfs nodeTao Zhou
centralize all debugfs creation in one place for ras this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpuGuchun Chen
BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05drm/amdgpu: clear err_event_athub flag after reset exitLe Ma
Otherwise next err_event_athub error cannot call gpu reset. And following resume sequence will not be affected by this flag. v2: create function to clear amdgpu_ras_in_intr for modularity of ras driver Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05drm/amdgpu: export amdgpu_ras_find_obj to use externallyLe Ma
Change it to external interface. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03drm/amdgpu: update parameter of ras_ih_cbTao Zhou
change struct ras_err_data *err_data to void *err_data, align with umc code and the callback's declaration in each ras block could pay no attention to the structure type Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: fix ras ctrl debugfs node leakGuchun Chen
Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Fix mutex lock from atomic context.Andrey Grodzovsky
Problem: amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu because writing to EEPROM during ASIC reset was unstable. But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called directly from ISR context and so locking is not allowed. Also it's irrelevant for this partilcular interrupt as this is generic RAS interrupt and not memory errors specific. Fix: Avoid calling amdgpu_ras_reserve_bad_pages if not in task context. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: move the call of ras recovery_init and bad page reserve to ↵Tao Zhou
proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: save umc error recordsTao Zhou
save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: change ras bps type to eeprom table record structureTao Zhou
change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13dmr/amdgpu: Add system auto reboot to RAS.Andrey Grodzovsky
In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: Avoid HW GPU reset for RAS.Andrey Grodzovsky
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: add helper function to do common ras_late_init/fini (v3)Hawking Zhang
In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-27drm/amdgpu: Add RAS EEPROM table.Andrey Grodzovsky
Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-23drm/amdgpu: correct ras error count typeGuchun Chen
Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31drm/amdgpu: add define for gfx ras subblockDennis Li
Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31drm/amdgpu: allow ras interrupt callback to return error dataTao Zhou
add error data as parameter for ras interrupt cb and process it Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31drm/amdgpu: add support for recording ras error addressTao Zhou
more than one error address may be recorded in one query Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31drm/amdgpu: move some ras data structure to amdgpu_ras.hHawking Zhang
These are common structures that can be included by IP specific source files Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-13amdgpu: no need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: xinhui pan <xinhui.pan@amd.com> Cc: Evan Quan <evan.quan@amd.com> Cc: Feifei Xu <Feifei.Xu@amd.com> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-11drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()Dan Carpenter
The "block" variable can be set by the user through debugfs, so it can be quite large which leads to shift wrapping here. This means we report a "block" as supported when it's not, and that leads to array overflows later on. This bug is not really a security issue in real life, because debugfs is generally root only. Fixes: 36ea1bd2d084 ("drm/amdgpu: add debugfs ctrl node") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24drm/amdgpu: ras support suspend/resumexinhui pan
add ras suspend function. rename ras_post_init to amdgpu_ras_resume. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Tested-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24drm/amdgpu: add badpages sysfs interafcexinhui pan
add badpages node. it will output badpages list in format gpu pfn : gpu page size : flags example 0x00000000 : 0x00001000 : R 0x00000001 : 0x00001000 : R 0x00000002 : 0x00001000 : R 0x00000003 : 0x00001000 : R 0x00000004 : 0x00001000 : R 0x00000005 : 0x00001000 : R 0x00000006 : 0x00001000 : R 0x00000007 : 0x00001000 : P 0x00000008 : 0x00001000 : P 0x00000009 : 0x00001000 : P flags can be one of below characters R: reserved. P: pending for reserve. F: failed to reserve for some reasons. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24drm/amdgpu: handle ras resetxinhui pan
add another flag to allow IP do a gpu reset after device init. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24drm/amdgpu: Revert "drm/amdgpu: skip gpu reset when ras error occured"xinhui pan
Enable this now to reset the GPU on RAS errors. This reverts commit 138352e5752aa3e694951d70c8fe8730219f4edf. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-04-10drm/amdgpu: Introduce another ras enable functionxinhui pan
Many parts of the whole SW stack can program the ras enablement state during the boot. Now we handle that case by adding one function which check the ras flags and choose different code path. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: xinhui pan <xinhui.pan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-27drm/amdgpu: Fix amdgpu ras to ta enums conversionxinhui pan
Add helpes to transalte the two enums. And it will catch bugs easily. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19drm/amdgpu: add new ras workflow control flagsxinhui pan
add ras post init function. Do some initialization after all IP have finished their late init. Add new member flags which will control the ras work flow. For now, vbios enable ras for us on boot. That might change in the future. So there should be a flag from vbios to tell us if ras is enabled or not on boot. Looks like there is no such info now. Other bits of the flags are reserved to control other parts of ras. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19drm/amdgpu: add new member hw_supportedxinhui pan
Currently, it is not clear how ras is supported. Both software and hardware can set the supported. That is confusing. Fix it by adding new member hw_supported. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19drm/amdgpu: skip gpu reset when ras error occuredxinhui pan
gpu reset is not stable on vega20 A1. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19drm/amdgpu: add debugfs ctrl nodexinhui pan
allow userspace enable/disable ras Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19drm/amdgpu: add amdgpu_ras.c to support ras (v2)xinhui pan
add obj management. add feature control. add debugfs infrastructure. add sysfs infrastructure. add IH infrastructure. add recovery infrastructure. It is a framework. Other IPs need call amdgpu_ras_xxx function instead of psp_ras_xxx functions. v2: squash in warning fixes Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>