summaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: Use delayed work to collect RAS error countersLuben Tuikov2021-05-271-0/+5
| | | | | | | | | | | | | | | | | | | | On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL. v2: Cancel pending delayed work at ras_fini(). v3: Remove conditionals when dealing with delayed work manipulation as they're inherently racy. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Fix RAS function interfaceLuben Tuikov2021-05-271-2/+3
| | | | | | | | | | | | | | | | | | | | | The correctable and uncorrectable errors are calculated at each invocation of this function. Therefore, it is highly inefficient to return just one of them based on a Boolean input. If the caller wants both, twice the work would be done. (And this work is O(n^3) on Vega20.) Fix this "interface" to simply return what it had calculated--both values. Let the caller choose what it wants to record, inspect, use. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Conditionally reset RAS counters on bootJohn Clements2021-05-191-0/+3
| | | | | | | | Only clear RAS error counters if perestent EDC harvesting is not supported Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Rename to ras_*_enabledLuben Tuikov2021-05-101-1/+1
| | | | | | | | | | | | | | | | | Rename, ras_hw_supported --> ras_hw_enabled, and ras_features --> ras_enabled, to show that ras_enabled is a subset of ras_hw_enabled, which itself is a subset of the ASIC capability. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Move up ras_hw_supportedLuben Tuikov2021-05-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | Move ras_hw_supported into struct amdgpu_dev. The dependency is: struct amdgpu_ras <== struct amdgpu_dev <== ASIC, read as "struct amdgpu_ras depends on struct amdgpu_dev, which depends on the hardware." This can be loosely understood as, "if RAS is supported, which is property of the ASIC (struct amdgpu_dev), then we can access struct amdgpu_ras." v2: Fix a typo: must binary AND in ternary cond in amdgpu_ras.c Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Remove redundant ras->supportedLuben Tuikov2021-05-101-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | Remove redundant ras->supported, as this value is also stored in adev->ras_features. Use adev->ras_features, as that supercedes "ras", since the latter is its member. The dependency goes like this: ras <== adev->ras_features <== hw_supported, and is read as "ras depends on ras_features, which depends on hw_supported." The arrows show the flow of information, i.e. the dependency update. "hw_supported" should also live in "adev". Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: fix send ras disable cmd when asic not support rasStanley.Yang2021-03-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: harvest edc status when connected to host via xGMIDennis Li2021-03-231-1/+4
| | | | | | | | | | | | | | | | | When connected to a host via xGMI, system fatal errors may trigger warm reset, driver has no change to query edc status before reset. Therefore in this case, driver should harvest previous error loging registers during boot, instead of only resetting them. v2: 1. IP's ras_manager object is created when its ras feature is enabled, so change to query edc status after amdgpu_ras_late_init called 2. change to enable watchdog timer after finishing gfx edc init Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reivewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: remove unnecessary reading for epprom headerDennis Li2021-02-261-2/+0
| | | | | | | | | | | | If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: do not keep debugfs dentryNirmoy Das2021-02-181-4/+0
| | | | | | | | | | | | Cleanup unnecessary debugfs dentries and surrounding functions. v3: remove return value check for debugfs_create_file() v2: remove ttm_debugfs_entries array. do not init variables. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: fix debugfs creation/removal, againArnd Bergmann2020-12-081-6/+0
| | | | | | | | | | | | | | | There is still a warning when CONFIG_DEBUG_FS is disabled: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function] 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) Change the code again to make the compiler actually drop this code but not warn about it. Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: fix the issue of reserving bad pages failedDennis Li2020-10-301-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In amdgpu_ras_reset_gpu, because bad pages may not be freed, it has high probability to reserve bad pages failed. Change to reserve bad pages when freeing VRAM. v2: 1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c 2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr reserve bad page failed, it will put it into pending list, otherwise put it into processed list; 3. remove amdgpu_ras_release_bad_pages, because retired page's info has been moved into amdgpu_vram_mgr v3: 1. formate code style; 2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation; 3. rename scope_pending as reservations_pending; 4. rename scope_processed as reserved_pages; 5. change to iterate over all the pending ones and try to insert them with drm_mm_reserve_node(); v4: 1. rename amdgpu_vram_mgr_reserve_scope as amdgpu_vram_mgr_reserve_range; 2. remove unused include "amdgpu_ras.h"; 3. rename amdgpu_vram_mgr_check_and_reserve as amdgpu_vram_mgr_do_reserve; 4. refine amdgpu_vram_mgr_reserve_range to call amdgpu_vram_mgr_do_reserve. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: remove redundant GPU resetDennis Li2020-10-301-9/+1
| | | | | | | | | Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: change to save bad pages in UMC error interrupt callbackDennis Li2020-10-301-0/+1
| | | | | | | | | Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: bypass querying ras error count registersGuchun Chen2020-08-141-0/+3
| | | | | | | | | | | Once ras recovery is issued by ras sync flood interrupt or ras controller interrupt, add this guard to bypass or execute ras error count register harvest of all IPs. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: break GPU recovery once it's in bad state(v4)Guchun Chen2020-08-041-0/+2
| | | | | | | | | | | | | | | | | When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: skip bad page reservation once issuing from eeprom writeGuchun Chen2020-08-041-3/+11
| | | | | | | | | | Once the ras recovery is issued from eeprom write itself, bad page reservation should be ignored, otherwise, recursive calling of writting to eeprom would happen. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: validate bad page threshold in ras(v3)Guchun Chen2020-08-041-0/+3
| | | | | | | | | | | | | | | | | Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: RAS emergency restart logic refineWenhui Sheng2020-07-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | If we are in RAS triggered situation and BACO isn't support, emergency restart is needed, and this code is only needed for some specific cases(vega20 with given smu fw version). After we add smu mode1 reset for sienna cichlid, we need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset, so in amdgpu_device_gpu_recover, we need differentiate which mode1 reset we are using, then decide if it's a full reset and then decide if emergency restart is needed, the logic will become much more complex. After discussion with Hawking, move emergency restart logic to an independent function. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: disable ras query and iject during gpu resetJohn Clements2020-04-011-0/+4
| | | | | | | | added flag to ras context to indicate if ras query functionality is ready Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add function to creat all ras debugfs nodeTao Zhou2020-03-101-0/+2
| | | | | | | | | | | | centralize all debugfs creation in one place for ras this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpuGuchun Chen2019-12-181-2/+1
| | | | | | | | | | BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: clear err_event_athub flag after reset exitLe Ma2019-12-051-0/+5
| | | | | | | | | | | Otherwise next err_event_athub error cannot call gpu reset. And following resume sequence will not be affected by this flag. v2: create function to clear amdgpu_ras_in_intr for modularity of ras driver Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: export amdgpu_ras_find_obj to use externallyLe Ma2019-12-051-0/+3
| | | | | | | | Change it to external interface. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: update parameter of ras_ih_cbTao Zhou2019-10-031-1/+1
| | | | | | | | | | change struct ras_err_data *err_data to void *err_data, align with umc code and the callback's declaration in each ras block could pay no attention to the structure type Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: fix ras ctrl debugfs node leakGuchun Chen2019-09-161-2/+0
| | | | | | | | | Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Fix mutex lock from atomic context.Andrey Grodzovsky2019-09-161-1/+3
| | | | | | | | | | | | | | | | | | Problem: amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu because writing to EEPROM during ASIC reset was unstable. But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called directly from ISR context and so locking is not allowed. Also it's irrelevant for this partilcular interrupt as this is generic RAS interrupt and not memory errors specific. Fix: Avoid calling amdgpu_ras_reserve_bad_pages if not in task context. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: move the call of ras recovery_init and bad page reserve to ↵Tao Zhou2019-09-131-0/+5
| | | | | | | | | | | | | | | | proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: save umc error recordsTao Zhou2019-09-131-1/+1
| | | | | | | | | | | | save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: change ras bps type to eeprom table record structureTao Zhou2019-09-131-6/+5
| | | | | | | | | change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* dmr/amdgpu: Add system auto reboot to RAS.Andrey Grodzovsky2019-09-131-1/+1
| | | | | | | | | | | | | In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Avoid HW GPU reset for RAS.Andrey Grodzovsky2019-09-131-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add helper function to do common ras_late_init/fini (v3)Hawking Zhang2019-09-131-0/+7
| | | | | | | | | | | | | | | | | | In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Add RAS EEPROM table.Andrey Grodzovsky2019-08-271-0/+3
| | | | | | | | | | | | | | | | | | | | | | Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: correct ras error count typeGuchun Chen2019-08-231-1/+1
| | | | | | | | | Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add define for gfx ras subblockDennis Li2019-07-311-0/+230
| | | | | | | Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: allow ras interrupt callback to return error dataTao Zhou2019-07-311-18/+19
| | | | | | | | | add error data as parameter for ras interrupt cb and process it Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add support for recording ras error addressTao Zhou2019-07-311-0/+2
| | | | | | | | | more than one error address may be recorded in one query Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: move some ras data structure to amdgpu_ras.hHawking Zhang2019-07-311-1/+68
| | | | | | | | | These are common structures that can be included by IP specific source files Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* amdgpu: no need to check return value of debugfs_create functionsGreg Kroah-Hartman2019-06-131-2/+2
| | | | | | | | | | | | | | | | | | | When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: xinhui pan <xinhui.pan@amd.com> Cc: Evan Quan <evan.quan@amd.com> Cc: Feifei Xu <Feifei.Xu@amd.com> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()Dan Carpenter2019-06-111-0/+2
| | | | | | | | | | | | | | The "block" variable can be set by the user through debugfs, so it can be quite large which leads to shift wrapping here. This means we report a "block" as supported when it's not, and that leads to array overflows later on. This bug is not really a security issue in real life, because debugfs is generally root only. Fixes: 36ea1bd2d084 ("drm/amdgpu: add debugfs ctrl node") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: ras support suspend/resumexinhui pan2019-05-241-1/+3
| | | | | | | | | | add ras suspend function. rename ras_post_init to amdgpu_ras_resume. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Tested-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add badpages sysfs interafcexinhui pan2019-05-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | add badpages node. it will output badpages list in format gpu pfn : gpu page size : flags example 0x00000000 : 0x00001000 : R 0x00000001 : 0x00001000 : R 0x00000002 : 0x00001000 : R 0x00000003 : 0x00001000 : R 0x00000004 : 0x00001000 : R 0x00000005 : 0x00001000 : R 0x00000006 : 0x00001000 : R 0x00000007 : 0x00001000 : P 0x00000008 : 0x00001000 : P 0x00000009 : 0x00001000 : P flags can be one of below characters R: reserved. P: pending for reserve. F: failed to reserve for some reasons. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: handle ras resetxinhui pan2019-05-241-0/+3
| | | | | | | | add another flag to allow IP do a gpu reset after device init. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Revert "drm/amdgpu: skip gpu reset when ras error occured"xinhui pan2019-05-241-3/+0
| | | | | | | | | | Enable this now to reset the GPU on RAS errors. This reverts commit 138352e5752aa3e694951d70c8fe8730219f4edf. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Introduce another ras enable functionxinhui pan2019-04-101-0/+3
| | | | | | | | | | Many parts of the whole SW stack can program the ras enablement state during the boot. Now we handle that case by adding one function which check the ras flags and choose different code path. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: xinhui pan <xinhui.pan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: Fix amdgpu ras to ta enums conversionxinhui pan2019-03-271-0/+56
| | | | | | | | | | Add helpes to transalte the two enums. And it will catch bugs easily. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add new ras workflow control flagsxinhui pan2019-03-191-0/+3
| | | | | | | | | | | | | | | | | add ras post init function. Do some initialization after all IP have finished their late init. Add new member flags which will control the ras work flow. For now, vbios enable ras for us on boot. That might change in the future. So there should be a flag from vbios to tell us if ras is enabled or not on boot. Looks like there is no such info now. Other bits of the flags are reserved to control other parts of ras. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: add new member hw_supportedxinhui pan2019-03-191-0/+3
| | | | | | | | | | | Currently, it is not clear how ras is supported. Both software and hardware can set the supported. That is confusing. Fix it by adding new member hw_supported. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
* drm/amdgpu: skip gpu reset when ras error occuredxinhui pan2019-03-191-0/+3
| | | | | | | | gpu reset is not stable on vega20 A1. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>