author     Matthew Brost <matthew.brost@intel.com>     2024-10-25 14:43:29 -0700
committer  Lucas De Marchi <lucas.demarchi@intel.com>  2024-10-30 22:14:06 -0700
commit     35d25a4a0012e690ef0cc4c5440231176db595cc
tree       de6f27d71ca40e9754cafa1165dc56759a2d15e7 /drivers/gpu/drm/xe
parent     5a710196883e0ac019ac6df2a6d79c16ad3c32fa
drm/xe: Don't short circuit TDR on jobs not started
Short-circuiting the TDR on jobs that have not started is an optimization
which is not required. On LNL we are facing an issue where jobs do not get
scheduled by the GuC if it misses a GGTT page update. When this occurs, let
the TDR fire, toggle the scheduling (which may get the job unstuck), and
print a warning message. If the TDR fires twice on a job that hasn't
started, time out the job.
v2:
- Add warning message (Paulo)
- Add fixes tag (Paulo)
- Time out a job which hasn't started after the TDR fires twice
v3:
- Include local change
v4:
- Short circuit check_timeout on job not started
- Use warn level rather than notice (Paulo)
Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out")
Cc: stable@vger.kernel.org
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Diffstat (limited to 'drivers/gpu/drm/xe')
 drivers/gpu/drm/xe/xe_guc_submit.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d2dcc9afd223..ad194fd297dc 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -991,12 +991,22 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
 {
 	struct xe_gt *gt = guc_to_gt(exec_queue_to_guc(q));
-	u32 ctx_timestamp = xe_lrc_ctx_timestamp(q->lrc[0]);
-	u32 ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
+	u32 ctx_timestamp, ctx_job_timestamp;
 	u32 timeout_ms = q->sched_props.job_timeout_ms;
 	u32 diff;
 	u64 running_time_ms;
 
+	if (!xe_sched_job_started(job)) {
+		xe_gt_warn(gt, "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, not started",
+			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
+			   q->guc->id);
+
+		return xe_sched_invalidate_job(job, 2);
+	}
+
+	ctx_timestamp = xe_lrc_ctx_timestamp(q->lrc[0]);
+	ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
+
 	/*
 	 * Counter wraps at ~223s at the usual 19.2MHz, be paranoid catch
 	 * possible overflows with a high timeout.
@@ -1126,10 +1136,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		exec_queue_killed_or_banned_or_wedged(q) ||
 		exec_queue_destroyed(q);
 
-	/* Job hasn't started, can't be timed out */
-	if (!skip_timeout_check && !xe_sched_job_started(job))
-		goto rearm;
-
 	/*
 	 * If devcoredump not captured and GuC capture for the job is not ready
 	 * do manual capture first and decide later if we need to use it
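For readers unfamiliar with the mechanism, the new early return defers the
timeout decision to xe_sched_invalidate_job(job, 2), which the commit message
describes as timing the job out only once the TDR has fired twice on it. The
sketch below is illustrative only, not the driver code: toy_job and its
tdr_hits field are hypothetical stand-ins for the per-job state the scheduler
actually keeps, used here just to show the retry-then-timeout policy.

/*
 * Illustrative sketch only (not the driver code): toy_job and tdr_hits
 * are hypothetical stand-ins for the scheduler's per-job state.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_job {
	bool started;	/* has the hardware actually begun executing the job? */
	int tdr_hits;	/* how many times the TDR has fired on this job */
};

/* Returns true when the job should be timed out, false to retry/rearm. */
static bool toy_check_timeout(struct toy_job *job)
{
	if (!job->started) {
		/*
		 * First TDR hit on a not-started job: warn, let the caller
		 * toggle GuC scheduling (which may unstick the job) and
		 * rearm the timer.  Second hit: give up and time it out.
		 */
		fprintf(stderr, "job not started at TDR\n");
		return ++job->tdr_hits >= 2;
	}

	/* Started jobs fall through to the normal timestamp-based check. */
	return false; /* placeholder for the real running-time comparison */
}

The design point is that a missed GGTT page update can leave a submitted job
invisible to the GuC; skipping the TDR for not-started jobs (the old goto
rearm path) would leave such a job stuck forever, whereas letting the TDR run
gives the scheduling toggle a chance to recover it before the second firing
finally declares a timeout.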