We are developing a platform that enables SW developers to leverage heterogeneous distributed compute environments.
QCC vs. HLS
QCC is a new platform defined and developed with very specific objectives: it should feel like a SW development platform. Companies and pundits suggest that HLS (high-level synthesis) was developed to fill this void. In reality, HLS is a higher level of abstraction for HW engineers; it is not a tool for SW developers. QCC and its supporting hardware were specifically developed to streamline acceleration development in a way that matches the needs and knowledge of SW developers. There are a number of key differences between QCC and HLS.
User Guided Partitioning
The QCC platform automatically partitions functions to different heterogeneous compute engines. The partitioner is optimized to limit the number of channels and the volume of data transferred between compute engines. The SW developer does not need to write compute-engine-specific API code to initialize and interrogate resources, start and stop functions, transfer data, or synchronize tasks. QCC automatically generates all of the infrastructure, communication, data transfer, and data referencing code needed to support partitioned distributed computing, across different processes, processors, and FPGAs. Modifications to optimize the application partition are easily accomplished by re-executing the partitioner; no source code modifications are required.

Application development is an iterative process. Requiring code changes across the host code and the accelerator to optimize the application partition is resource intensive, reduces iteration cycles per day, and is error prone. HLS imposes two development paths that must be separately developed and modified to change any partition, and the modified code no longer matches the original, requiring extensive system verification. Poor partitions can result in order-of-magnitude differences in performance. To reduce development time, enhance performance, and increase quality, automatic partitioning is a critical capability.
Memory Architecture
Memory architecture is essential to delivering acceleration: absent data or insufficient memory bandwidth will stall compute. To achieve acceleration, critical loops must be pipelined (n + c cycles rather than n * c, for n iterations with a latency of c cycles). For additional performance, pipelined loops are unrolled. In a fully pipelined loop, all operations execute in parallel, which places heavy demand on the memory subsystem; unrolling challenges it further. This memory subsystem is far more complex than the multi-level cache of a processor. The CacheQ Virtual Machine is tightly integrated with a proprietary multi-ported, arbitrated, cached memory structure: data transfer, data access, and the memory subsystem just work. SW developers writing code for a processor do not design their own cache. Beyond the hardware architecture, certain SW capabilities must also exist. SW developers want to use malloc. They want extensive pointer support. They want memory shared between host and accelerators. All of this is part of the CacheQ solution. HLS puts an insurmountable burden on the SW developer: rewriting code to remove malloc is time consuming, error prone, and delivers sub-optimal results, and creating an application-specific, fully predictable memory controller is well beyond the timeline of application development.
The CacheQ development platform includes the critical capabilities a SW developer needs to develop and deliver application acceleration: code is partitioned, loops are pipelined and unrolled, there is an optimized configurable memory subsystem, malloc just works, and there are optimization tools built around dataflow, a concept SW developers understand. The SW developer stays focused on the application and its optimization. With HLS there is too much to digest. The developer must deal with separate programs, manual partitioning, extensive code modification, multiple variables to optimize simultaneously (memory structure, initiation interval, data movement, code modifications, …), timing simulations, host-to-FPGA data transfer, and array structures, to name a few. CacheQ provides the foundation; the SW developer builds upon it.