path: root/drivers/misc/habanalabs
Commit message | Author | Age | Files | Lines
* habanalabs/gaudi: Fix a potential use after free in gaudi_memset_device_memoryLv Yunlong2021-05-081-1/+3
| | | | | | | | | | | | | | | | | | | Our code analyzer reported a use-after-free (uaf). In gaudi_memset_device_memory, cb is obtained via hl_cb_kernel_create() with a refcount of 2. If hl_cs_allocate_job() fails, execution goes to the release_cb branch. One ref of cb is dropped by hl_cb_put(cb), and cb could be freed if another thread also drops a ref. cb is then accessed via cb->id, which is a potential uaf. This patch adds a variable 'id' that stores the value of cb->id before hl_cb_put(cb) is called, to avoid the potential uaf. Fixes: 423815bf02e25 ("habanalabs/gaudi: remove PCI access to SM block") Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
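The fix described above follows a common pattern: copy any field you still need out of a refcounted object before dropping your reference. The sketch below is a minimal userspace illustration, not the driver code; `demo_cb`, `demo_cb_create`, `demo_cb_put` and `demo_release_cb` are hypothetical names, and the plain integer refcount stands in for the kernel's reference counting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's command buffer object. */
struct demo_cb {
	uint64_t id;
	int refcount;
};

/* Create an object that starts with two references, as in the commit text. */
static struct demo_cb *demo_cb_create(uint64_t id)
{
	struct demo_cb *cb = malloc(sizeof(*cb));

	assert(cb != NULL);
	cb->id = id;
	cb->refcount = 2;
	return cb;
}

/* Drop one reference; returns 1 if this call freed the object. */
static int demo_cb_put(struct demo_cb *cb)
{
	if (--cb->refcount == 0) {
		free(cb);
		return 1;
	}
	return 0;
}

/*
 * Error path: copy the id first, then drop the reference. After the put,
 * only the local copy is used; cb itself is never dereferenced again.
 */
static uint64_t demo_release_cb(struct demo_cb *cb)
{
	uint64_t id = cb->id;

	demo_cb_put(cb);
	return id;
}
```

Reading `cb->id` after `demo_cb_put(cb)` would be safe only as long as some other holder keeps its reference, which is exactly the assumption the commit message says cannot be relied on.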
* habanalabs: wait for interrupt wrong timeout calculationOfir Bitton2021-05-081-1/+1
| | | | | | | | | Wait for interrupt timeout calculation is wrong, hence timeout occurs when user waits on an interrupt with certain timeout values. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: ignore f/w status errorOded Gabbay2021-05-083-1/+16
| | | | | | | | | | | | | In case firmware has a bug and erroneously reports a status error (e.g. device unusable) during boot, allow the user to tell the driver to continue the boot regardless of the error status. This will be done via kernel parameter which exposes a mask. The user that loads the driver can decide exactly which status error to ignore and which to take into account. The bitmask is according to defines in hl_boot_if.h Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
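As a rough illustration of how such a mask could be applied, the sketch below filters a reported error word through a user-supplied mask. The polarity (a set bit means the error is taken into account, a cleared bit means it is ignored) is an assumption made for this example, and the `DEMO_ERR_*` bits are hypothetical; the real bit definitions are in hl_boot_if.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical error bits; the real ones are defined in hl_boot_if.h. */
#define DEMO_ERR_DEVICE_UNUSABLE (1u << 0)
#define DEMO_ERR_OTHER           (1u << 1)

/*
 * Return the reported errors that the driver should still act on.
 * Assumed polarity: a set bit in honor_mask keeps the error, a cleared
 * bit makes the driver ignore it.
 */
static uint32_t demo_effective_errors(uint32_t reported, uint32_t honor_mask)
{
	return reported & honor_mask;
}

/* Boot continues only if no honored error remains. */
static int demo_boot_may_continue(uint32_t reported, uint32_t honor_mask)
{
	assert(honor_mask != 0 || reported == demo_effective_errors(reported, ~0u));
	return demo_effective_errors(reported, honor_mask) == 0;
}
```

With an all-ones mask every reported error stops the boot; clearing a single bit lets the boot continue past that one error while the others are still honored.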
* habanalabs: change error level of security not readyOded Gabbay2021-05-081-5/+2
| | | | | | | | This error indicates a problem in the security initialization inside the f/w so we need to stop the device loading because it won't be usable. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: skip reading f/w errors on bad statusOded Gabbay2021-05-081-2/+7
| | | | | | | If we read all FF from the boot status register, then something is totally wrong and there is no point in reading specific errors. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: expose ASIC specific PLL indexBharat Jauhari2021-05-087-114/+94
| | | | | | | | | | | | Currently the user cannot interpret the PLL information based on the index, as it is exposed as a plain integer. This commit exposes ASIC-specific PLL indexes and maps them to a generic FW-compatible index. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: Fix uninitialized return code rc when read size is zeroColin Ian King2021-04-161-1/+1
| | | | | | | | | | | | | In the case where size is zero the while loop never assigns rc and the return value is uninitialized. Fix this by initializing rc to zero. Fixes: 639781dcab82 ("habanalabs/gaudi: add debugfs to DMA from the device") Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Colin Ian King <colin.king@canonical.com> Addresses-Coverity: ("Uninitialized scalar variable") Link: https://lore.kernel.org/r/20210412161012.1628202-1-colin.king@canonical.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
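The bug class described above is easy to reproduce in isolation: a return code that is only assigned inside a loop stays uninitialized when the loop body never runs. The sketch below is a generic illustration with hypothetical names (`demo_read`, `demo_read_chunk`), not the driver function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-chunk transfer; always succeeds in this sketch. */
static int demo_read_chunk(size_t offset, size_t len)
{
	(void)offset;
	(void)len;
	return 0;
}

/* Read `size` bytes in pieces of at most `chunk` bytes. */
static int demo_read(size_t size, size_t chunk)
{
	size_t pos = 0;
	int rc = 0; /* without this initializer, size == 0 returns garbage */

	assert(chunk > 0);
	while (pos < size) {
		size_t len = size - pos < chunk ? size - pos : chunk;

		rc = demo_read_chunk(pos, len);
		if (rc)
			break;
		pos += len;
	}
	return rc;
}
```

Initializing `rc` to zero makes the zero-size case a well-defined success instead of returning whatever happened to be on the stack.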
* habanalabs: print f/w boot unknown errorOded Gabbay2021-04-091-16/+68
| | | | | | | | | | | | | We need to print a message to the kernel log in case we encounter an unknown error in the f/w boot, to help the user understand what happened. In addition, we shouldn't print an unknown error in case of known errors. Moreover, in case of warnings/info, we shouldn't return -EIO, which would fail the initialization and mark the device as disabled. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: update to latest F/W communication headerOhad Sharabi2021-04-092-1/+200
| | | | | | | | Update files to the latest version from the F/W team. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: skip iATU if F/W security is enabledOfir Bitton2021-04-094-1/+101
| | | | | | | | | | As part of securing GAUDI, the F/W will configure the PCI iATU regions. If the driver identifies a secured PCI ID, it will know to skip iATU configuration at a very early stage. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: derive security status from pci idOfir Bitton2021-04-098-8/+35
| | | | | | | | | | As the F/W security indication must be available before the driver approaches the PCI bus, F/W security should be derived from the PCI ID rather than fetched during the boot handshake with the F/W. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: move dram scrub to free sequenceBharat Jauhari2021-04-091-39/+48
| | | | | | | | | | | | DRAM scrubbing can take time hence it adds to latency during allocation. To minimize latency during initialization, scrubbing is moved to release call. In case scrubbing fails it means the device is in a bad state, hence HARD reset is initiated. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: send dynamic msi-x indexes to f/wOhad Sharabi2021-04-095-19/+131
| | | | | | | | | In order to minimize hard coded values between F/W and the driver, we send msi-x indexes dynamically to the F/W. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: clear QM errors only if not in stop_on_err modeTomer Tayar2021-04-091-1/+2
| | | | | | | | | | Clearing QM errors by the driver will prevent these H/W blocks from stopping in case they are configured to stop on errors, so perform this clearing only if this mode is not in use. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: support DEVICE_UNUSABLE error indication from FWKoby Elbaz2021-04-092-0/+7
| | | | | | | | | In case of multiple ECC errors, FW will set the DEVICE_UNUSABLE bit. On boot-up, the driver will therefore fail inserting the device. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: use strscpy instead of sprintf and strlcpyOded Gabbay2021-04-091-2/+2
| | | | | | | | Prefer the use of strscpy when copying the ASIC name into a char array, to prevent accidentally exceeding the array's length. In addition, strlcpy is frowned upon so replace it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: add debugfs to DMA from the deviceOded Gabbay2021-04-094-5/+298
| | | | | | | | | | | | | | | | | | | | | | | | | When trying to debug a program, the user often needs to dump large parts of the device's DRAM, which can reach tens of GBs. Because reading from the device's internal memory through the PCI BAR is extremely slow, the debug can take hours. Instead, we can allow the user to copy data through one of the DMA engines. This will make the operation much faster. Currently, only GAUDI is supported. In GAUDI, we need to find a PCI DMA engine that is IDLE and set the DMA as secured to be able to bypass our MMU, as we currently don't map the temporary buffer to the MMU. Example bash one-liner to dump the entire HBM to a file (~2 minutes): for (( i=0x0; i < 0x800000000; i+=0x8000000 )); do \ printf '0x%x\n' $i | sudo tee /sys/kernel/debug/habanalabs/hl0/addr ; \ echo 0x8000000 | sudo tee /sys/kernel/debug/habanalabs/hl0/dma_size ; \ sudo cat /sys/kernel/debug/habanalabs/hl0/data_dma >> hbm.txt ; done Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
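To make the loop bounds in the one-liner concrete: it walks addresses from 0x0 up to 0x800000000 (32 GiB) in steps of 0x8000000 (128 MiB), which works out to 256 DMA reads. The small helper below, with the hypothetical name `demo_chunk_count`, is only a sketch of that arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Number of iterations of: for (i = start; i < end; i += step). */
static uint64_t demo_chunk_count(uint64_t start, uint64_t end, uint64_t step)
{
	assert(step > 0 && end >= start);
	return (end - start + step - 1) / step;
}
```

For the values in the one-liner, `demo_chunk_count(0x0, 0x800000000ULL, 0x8000000ULL)` evaluates to 256.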
* habanalabs/gaudi: sync stream add protection to SOB reset flowfarah kassabri2021-04-091-4/+12
| | | | | | | | | | | | | | Since we moved the SOB reset flow to a workqueue, and it is no longer part of the fence release flow, we might reach a scenario where a new context is created while we are in the middle of resetting the SOB. In such cases the reset may fail due to the idle check. This will mess up the stream sync since the SOB value is invalid, so we protect this area with a mutex to delay context creation. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add custom timeout flag per csAlon Mizrahi2021-04-093-16/+23
| | | | | | | | | | There is a need to allow the user to send command submissions with a custom timeout, as some CS take longer than the max timeout that is used by default. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: improve utilization calculationKoby Elbaz2021-04-099-169/+40
| | | | | | | | | | | | The new approach is based on the notion that the relative current power consumption is proportional to the device's true utilization. Utilization info ranges between [0,100]%. Currently, dc_power values are hard-coded. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
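One way to read "relative current power consumption" is as the fraction of the range between the idle (dc) power and the maximum power that the device is currently drawing. The sketch below assumes exactly that linear formula and clamps the result to [0,100]; the formula, the clamping, and the name `demo_utilization` are assumptions for illustration, since the commit message does not spell out the calculation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed formula: utilization = (curr - dc) * 100 / (max - dc),
 * clamped to the [0,100] range stated in the commit message.
 */
static uint32_t demo_utilization(uint64_t curr_power, uint64_t dc_power,
				 uint64_t max_power)
{
	assert(max_power > dc_power);

	if (curr_power <= dc_power)
		return 0;
	if (curr_power >= max_power)
		return 100;

	return (uint32_t)((curr_power - dc_power) * 100 /
			  (max_power - dc_power));
}
```

Under this assumption, a device drawing a power value halfway between dc_power and max_power reports 50% utilization.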
* habanalabs: support legacy and new pll indexesOhad Sharabi2021-04-099-36/+182
| | | | | | | | | | | | In order to use a minimum of hard-coded values common to LKD and F/W, a dynamic method to work with PLLs is introduced in this patch. The formerly ASIC-specific PLL numbering is now common to all ASICs. To be backward compatible, a bit in dev status is defined; if the bit is not set, LKD will keep working with the old PLL numbering. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: move relevant datapath work outside cs lockOfir Bitton2021-04-093-35/+68
| | | | | | | | | In order to shorten the time cs lock is being held, we move any possible work outside of the cs lock. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: avoid soft lockup bug upon mapping errorfarah kassabri2021-04-091-3/+17
| | | | | | | | | | Add a little sleep between page unmappings in case mapping of large number of host pages failed, in order to avoid soft lockup bug during the rollback. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: Update async events headerOfir Bitton2021-04-093-14/+25
| | | | | | | | Update with latest version from the Firmware team. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: unsecure TPC cfg status registersOfir Bitton2021-04-091-8/+0
| | | | | | | | | Unsecure relevant registers as TPC engine need access to TPC status. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: always use single-msi modeOded Gabbay2021-04-091-2/+1
| | | | | | | | | | | | The device can get into a deadlock in case it uses indirect mode for MSI interrupts (multi-msi) and has a hard reset during an interrupt storm. To prevent that, always use direct mode, which means single-msi mode. The F/W will prevent the host from writing to the indirect MSI registers to prevent any malicious user from causing this scenario. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: reset device upon BMC requestOfir Bitton2021-04-093-1/+6
| | | | | | | | | | | In case the BMC of the devices' box wants to initiate a reset of a specific device, it must go through the driver. Once the driver receives the request, it will initiate a hard reset flow. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: debugfs access to user mapped host addressesOfir Bitton2021-04-095-41/+144
| | | | | | | | | | In order to have a better debuggability we allow debugfs access to user mmu mapped host memory. Non-user host memory access will be rejected. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: Switch to using the new API kobj_to_dev()Yang Li2021-04-091-1/+1
| | | | | | | | | | | fixed the following coccicheck: ./drivers/misc/habanalabs/common/sysfs.c:347:60-61: WARNING opportunity for kobj_to_dev() Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: update hl_boot_if.hOhad Sharabi2021-04-091-0/+11
| | | | | | | | Update to the latest version of the file as supplied by the F/W. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: skip DISABLE PCI packet to FW on heartbeatOhad Sharabi2021-04-097-44/+59
| | | | | | | | | If the reset is due to heartbeat, the device CPU is not responsive, in which case there is no point in sending a PCI disable message to it. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: replace GFP_ATOMIC with GFP_KERNELOfir Bitton2021-04-096-14/+36
| | | | | | | | | | | | | | Due to incorrect assumptions that some of the initialization and data path flows cannot sleep, most allocations are done using GFP_ATOMIC. We modify the code to use GFP_ATOMIC only when really needed, as sleepable flows should use GFP_KERNEL. In addition, add a fallback to allocate memory using GFP_KERNEL once an ATOMIC allocation fails. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: update extended async event headerOfir Bitton2021-04-091-5/+5
| | | | | | | | Update to the latest definition of the firmware Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: return current power via INFO IOCTLSagiv Ozeri2021-04-094-0/+51
| | | | | | | | | Add driver implementation for reading the current power from the device CPU F/W. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: support HW blocks vm showSagiv Ozeri2021-04-094-7/+112
| | | | | | | | | Improve "vm" debugfs node to print also the virtual addresses which are currently mapped to HW blocks in the device. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: use a single FW loading bringup flagOfir Bitton2021-04-095-11/+17
| | | | | | | | | For simplicity, use a single bringup flag indicating which FW binaries should be loaded to the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: use correct define for 32-bit max valueOded Gabbay2021-04-091-1/+1
| | | | | | | Timeout in wait for interrupt is in 32-bit variable so we need to use the correct maximum value to compare. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: wait for interrupt supportOfir Bitton2021-04-095-24/+285
| | | | | | | | | | | In order to support command submissions from user space, the driver needs to add support for user interrupt completions. The driver will allow multiple user threads to wait for an interrupt and perform a comparison with a given user address once the interrupt expires. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: enable all IRQs for user interrupt supportOfir Bitton2021-04-093-2/+74
| | | | | | | | | | In order to support user interrupts, the driver must enable all MSI-X interrupts in case the user triggers them. We differentiate between a valid user interrupt and an invalid one. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: reset device in case of sync errorOhad Sharabi2021-04-095-0/+52
| | | | | | | | | | As the F/W is the first to detect an out-of-sync event, a new event is added to notify the driver of such an event, in which case the driver performs a hard reset. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: change default CS timeout to 30 secondsOded Gabbay2021-04-091-2/+2
| | | | | | | | | Because our graph contains network operations, we need to account for delay in the network. 5 seconds timeout per CS is not enough to account for that. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: print if device is used on FD closeOded Gabbay2021-04-092-4/+6
| | | | | | | Notify the user that although they closed the FD, the device is still in use because there are live CS and/or memory mappings (mmaps). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: reset_upon_device_release is for bring-upOded Gabbay2021-04-091-2/+1
| | | | | | Move the field to correct location in structure and remove comment. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fail reset if device is not idleOded Gabbay2021-04-091-14/+12
| | | | | | | | | After any reset (soft or hard) the device (the engines/QMANs) should be idle. If they are not idle, fail the reset. If it is soft-reset, the driver will try to do hard-reset automatically. If it is hard-reset, the driver will make the device non-operational. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: reset after device is actually releasedOded Gabbay2021-04-091-16/+16
| | | | | | | | | | | | | The device is actually released only after the refcnt of the hpriv structure is 0, which means all its contexts were closed. If we reset the device while a context is still open, there are possibilities for unexpected behavior and crashes. For example, if the process has a mapping of a register block that is now currently being reset, and the process writes/reads to that block during the reset, the device can get stuck. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add reset support when user closes FDOfir Bitton2021-04-092-2/+20
| | | | | | | | | | | In order to support command submissions that are done directly from user space, the driver must perform soft reset once user closes its FD. In case the soft reset fails or device is not idle, a hard reset should be performed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: set max asid to 2farah kassabri2021-04-092-2/+2
| | | | | | | | | | Currently we support only 2 asids in all asics: asid 0 for the driver and asid 1 for the user. There is no need to set up 1024 asid configurations at the init phase. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix debugfs address translationfarah kassabri2021-03-101-12/+26
| | | | | | | | | | | | | | | When the user uses virtual addresses to access DRAM through debugfs, the driver translates the address to a physical one and uses it for the access through the PCIe BAR. In case the DRAM page size differs from the DMMU page size, adding the page offset to the actual address needs special treatment: use the DRAM page size mask to fetch the page offset from the virtual address, instead of the DMMU last hop shift. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
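The idea of taking the page offset from a page-size mask can be sketched as follows. This is a simplified illustration that assumes a power-of-two page size; `demo_translate` and its parameters are hypothetical and do not correspond to the driver's MMU code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine a translated physical page base with the in-page offset of the
 * virtual address. The offset is extracted with a mask derived from the
 * page size in use, not from a shift that belongs to a different page
 * size. Assumes page_size is a power of two.
 */
static uint64_t demo_translate(uint64_t virt_addr, uint64_t phys_page_base,
			       uint64_t page_size)
{
	uint64_t offset_mask = page_size - 1;

	assert(page_size != 0 && (page_size & offset_mask) == 0);
	return (phys_page_base & ~offset_mask) | (virt_addr & offset_mask);
}
```

If the offset were instead derived from a smaller page size, the upper offset bits inside a large page would be dropped and the access would land on the wrong address within that page.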
* habanalabs: Disable file operations after device is removedTomer Tayar2021-03-102-6/+46
| | | | | | | | | | | | | | | | | | A device can be removed from the PCI subsystem while a process holds the file descriptor open. In such a case, the driver attempts to kill the process, but as it is still possible that the process will be alive after this step, the device removal will complete, and we will end up with a process object that points to a device object which was already released. To prevent the usage of this released device object, disable subsequent file operations for this process object, and avoid the cleanup steps when the file descriptor is eventually closed. The latter is just a best effort, as a memory leak will occur. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: Call put_pid() when releasing control deviceTomer Tayar2021-03-101-0/+2
| | | | | | | | | | | | The refcount of the "hl_fpriv" structure is not used for the control device, and thus hl_hpriv_put() is not called when releasing this device. As a result, put_pid() is never called, so add it explicitly in hl_device_release_ctrl(). Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>