drm/amdgpu: Check fence emitted count to identify bad jobs

In SRIOV, when the host driver performs a MODE 1 reset and notifies the guest driver of an FLR, there is a small chance that no job is running on the hardware but the driver has not yet updated the pending list, which causes the driver not to respond to the FLR request. Modify the has_job_running function to check whether any job is actually still running.

v2: Use amdgpu_fence_count_emitted to determine job running status.
v3: Remove the timeout wait in has_job_running.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Shikang Fan <shikang.fan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
author: Shikang Fan <shikang.fan@amd.com> 2024-11-21 17:06:30 +0800
committer: Alex Deucher <alexander.deucher@amd.com> 2024-12-10 10:26:48 -0500
commit: 0859eb540f1412cced6234922626c8b1e6072126 (patch)
tree: 690a494a01bbb985eaa6f6e4569fccdb0e0c3a36 /drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
parent: 9aa879da796fde31533e72884276a440c8c1d886 (diff)
1 files changed, 6 insertions, 8 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 312c507ef5f9..07c84e4cef5a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5238,16 +5238,18 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 }
 
 /**
- * amdgpu_device_has_job_running - check if there is any job in mirror list
+ * amdgpu_device_has_job_running - check if there is any unfinished job
  *
  * @adev: amdgpu_device pointer
  *
- * check if there is any job in mirror list
+ * Check if there is any job running on the device when the guest driver
+ * receives an FLR notification from the host driver. If jobs are still
+ * running, the guest driver does not respond to the FLR. Instead, the job
+ * is left to hit its timeout and the guest driver then issues the reset.
  */
 bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
 {
 	int i;
-	struct drm_sched_job *job;
 
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -5255,11 +5257,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
 		if (!amdgpu_ring_sched_ready(ring))
 			continue;
 
-		spin_lock(&ring->sched.job_list_lock);
-		job = list_first_entry_or_null(&ring->sched.pending_list,
-					       struct drm_sched_job, list);
-		spin_unlock(&ring->sched.job_list_lock);
-		if (job)
+		if (amdgpu_fence_count_emitted(ring))
 			return true;
 	}
 	return false;
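
The difference between the two checks can be illustrated with a small userspace model. This is only a sketch, not the kernel code: struct model_ring, MODEL_MAX_RINGS and the model_* functions are hypothetical names invented for illustration, and the sketch assumes the emitted-fence count is derived from two sequence counters (last emitted and last signaled). It shows how a stale pending list can report a running job after the hardware has already signaled every fence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_RINGS 4 /* hypothetical; stands in for AMDGPU_MAX_RINGS */

/* Hypothetical, simplified stand-in for one ring's bookkeeping. */
struct model_ring {
	bool sched_ready;          /* models the amdgpu_ring_sched_ready() result */
	uint32_t sync_seq;         /* sequence number of the last fence emitted */
	uint32_t last_seq;         /* sequence number of the last fence signaled */
	unsigned int pending_jobs; /* models the scheduler pending list (updated late) */
};

/* Fences emitted but not yet signaled; unsigned math handles wraparound. */
static uint32_t model_fence_count_emitted(const struct model_ring *ring)
{
	assert(ring != NULL);
	return ring->sync_seq - ring->last_seq;
}

/* Old-style check: trusts the pending list, which may lag the hardware. */
static bool model_has_job_running_pending(struct model_ring *const *rings,
					  size_t count)
{
	for (size_t i = 0; i < count; ++i) {
		const struct model_ring *ring = rings[i];

		if (!ring || !ring->sched_ready)
			continue;
		if (ring->pending_jobs)
			return true;
	}
	return false;
}

/* New-style check: asks whether any emitted fence is still unsignaled. */
static bool model_has_job_running_fence(struct model_ring *const *rings,
					size_t count)
{
	for (size_t i = 0; i < count; ++i) {
		const struct model_ring *ring = rings[i];

		if (!ring || !ring->sched_ready)
			continue;
		if (model_fence_count_emitted(ring))
			return true;
	}
	return false;
}
```

In this model, a ring with sync_seq equal to last_seq but a nonzero pending_jobs reproduces the race from the commit message: the pending-list check reports a running job, while the fence-based check reports none.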