sparc64: Measure receiver forward progress to avoid send mondo timeout

A large sun4v SPARC system may have moments of intensive xcall activity, usually caused by unmapping many pages on many CPUs concurrently. This can flood receivers with CPU mondo interrupts for an extended period, causing some unlucky senders to hit the send-mondo timeout. The problem gets worse as the CPU count increases, because sometimes mappings must be invalidated on all CPUs, and sometimes all CPUs may gang up on a single CPU.

But a busy system is not a broken system. In the above scenario, as long as the receiver is making forward progress processing mondo interrupts, the sender should continue to retry.

This patch implements the receiver's forward progress meter by introducing a per-cpu counter 'cpu_mondo_counter[cpu]', where 'cpu' is in the range 0..NR_CPUS-1. The receiver increments its counter as soon as it receives a mondo, and the sender tracks the receiver's counter. If the receiver has stopped making forward progress when the retry limit is reached, the sender declares a send-mondo timeout and panics; otherwise, the receiver is allowed to keep making forward progress.

In addition, it has been observed that PCIe hotplug events generate Correctable Errors that are handled by the hypervisor and then the OS. The hypervisor 'borrows' a guest cpu strand briefly to provide the service. If that cpu strand is simultaneously the only cpu targeted by a mondo, it may not be available for the mondo within 20 msec, causing a SUN4V mondo timeout. It appears that 1 second is the agreed wait time between the hypervisor and the guest OS, so this patch adjusts the timeout accordingly.

Orabug: 25476541
Orabug: 26417466

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
author: Jane Chu <jane.chu@oracle.com> 2017-07-11 12:00:54 -0600
committer: David S. Miller <davem@davemloft.net> 2017-07-14 11:18:02 -0700
commit: 9d53caec84c7c5700e7c1ed744ea584fff55f9ac (patch)
tree: 1992caee0d236ac95ca9bcfa0ac5f403358974f4 /arch/sparc/include/asm
parent: 2ad67141f1e47dc063b202993835361a06239aaa (diff)
1 file changed, 1 insertion, 0 deletions
diff --git a/arch/sparc/include/asm/trap_block.h b/arch/sparc/include/asm/trap_block.h
index ec9c04de3664..ff05992dae7a 100644
--- a/arch/sparc/include/asm/trap_block.h
+++ b/arch/sparc/include/asm/trap_block.h
@@ -54,6 +54,7 @@ extern struct trap_per_cpu trap_block[NR_CPUS];
 void init_cur_cpu_trap(struct thread_info *);
 void setup_tba(void);
 extern int ncpus_probed;
+extern u64 cpu_mondo_counter[NR_CPUS];
 
 unsigned long real_hard_smp_processor_id(void);
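The diff above only exports the counter; the sender and receiver changes are not shown on this page. As a rough illustration of the forward-progress check described in the commit message, the following user-space sketch may help. It is not the kernel implementation: `RETRY_LIMIT`, `receive_mondo()`, `send_mondo_with_progress()`, `try_send_fn`, `struct sim`, and `sim_try_send()` are hypothetical names invented here, `NR_CPUS` is set to a small value, and `u64` is spelled `uint64_t`.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS     4   /* assumption: small value for illustration only */
#define RETRY_LIMIT 3   /* hypothetical retry budget, not the kernel's value */

/* Mirrors the declaration exported by the patch (u64 in the kernel). */
uint64_t cpu_mondo_counter[NR_CPUS];

/* Receiver side: bump the counter as soon as a mondo is received. */
void receive_mondo(int cpu)
{
    cpu_mondo_counter[cpu]++;
}

/* Hypothetical hook: one delivery attempt, true if the receiver accepted it. */
typedef bool (*try_send_fn)(int cpu, void *ctx);

/*
 * Sender side: retry until delivered. Each time the retry limit is reached,
 * compare the receiver's counter with the last value seen. If it has not
 * moved, the receiver is stalled and the send is reported as timed out
 * (the kernel would panic at this point). If it has moved, the receiver is
 * merely busy, so the retry budget is reset.
 */
bool send_mondo_with_progress(int cpu, try_send_fn try_send, void *ctx)
{
    uint64_t last_seen = cpu_mondo_counter[cpu];
    int retries = 0;

    for (;;) {
        if (try_send(cpu, ctx))
            return true;            /* delivered */
        if (++retries < RETRY_LIMIT)
            continue;               /* budget not yet exhausted */

        uint64_t now = cpu_mondo_counter[cpu];
        if (now == last_seen)
            return false;           /* no forward progress: timeout */
        last_seen = now;            /* busy but alive: keep retrying */
        retries = 0;
    }
}

/* Simulated receiver used only to exercise the sketch. */
struct sim {
    int  rejects_left;     /* attempts to reject before accepting */
    bool making_progress;  /* whether it still processes other mondos */
};

bool sim_try_send(int cpu, void *ctx)
{
    struct sim *s = ctx;

    if (s->making_progress)
        receive_mondo(cpu);         /* handles a mondo from some other sender */
    if (s->rejects_left > 0) {
        s->rejects_left--;
        return false;               /* still too busy for this sender */
    }
    return true;
}
```

With a simulated receiver that rejects ten attempts but keeps counting, the sender outlasts several retry budgets and eventually delivers; with one that rejects and never counts, the sender gives up after a single budget. A real implementation would also need to read the counter in a way that the compiler cannot cache across loop iterations, which this single-threaded sketch ignores.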