The Technology
We are developing a platform that enables SW developers to take full advantage of heterogeneous, distributed compute environments.
qthreads vs. OpenMP
qthreads is a new multithreading development platform built to reduce development time, produce superior results, and remove guesswork:
- qthreads determines which code can be multithreaded.
- qthreads requires no insertion of special pragmas.
- qthreads performs extensive analysis to ensure there is no data movement that causes significant cache thrashing.
- qthreads reports which code cannot be multithreaded and the reason why.
- Profiling ensures the code that has been multithreaded will have the biggest impact on performance and power reduction.
- qthreads ensures single-threaded code and multithreaded code produce exactly the same results.
- All qthreads code is thread-safe.
- qthreads works on the same code that can be targeted for FPGA acceleration.
OpenMP supports none of the above.
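To make the contrast concrete, the sketch below shows what the OpenMP route looks like for a simple hot loop (the dot-product kernel is a hypothetical example, not drawn from any particular codebase). The developer must decide the loop is safe to parallelize and annotate it by hand; a missing or wrong clause compiles cleanly but silently produces wrong answers.

```c
#include <stdio.h>

/* Hypothetical hot loop: dot product over n elements. */
double dot(const double *a, const double *b, long n)
{
    double sum = 0.0;

    /* With OpenMP the developer must decide this loop is safe to
       parallelize and annotate it by hand. Omitting the reduction
       clause still compiles, but the threads race on 'sum' and the
       result is wrong -- nothing checks the annotation for correctness. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}

int main(void)
{
    double a[1000], b[1000];
    for (int i = 0; i < 1000; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("%f\n", dot(a, b, 1000));  /* expect 2000.0 */
    return 0;
}
```

Built with a compiler flag such as -fopenmp the annotated loop runs in parallel; without the annotation, or with the wrong one, the burden of detecting the problem falls entirely on the developer.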
QCC vs. HLS
QCC is a new platform developed and defined with very specific objectives: it should feel like a SW development platform, reduce development time, and support heterogeneous hardware. Multicore programming requires specialized knowledge, and the limited tools that exist are specific to processors. Companies and pundits suggest HLS was developed to fill this void for FPGAs. In reality, HLS is a higher level of abstraction for HW engineers; it is not a tool for software developers. QCC was specifically developed to streamline acceleration development across heterogeneous hardware in a fashion that meets the needs and knowledge of SW developers.
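As an illustration of that abstraction level, below is a sketch of a trivial vector-add kernel written the way a typical HLS tool expects it (the pragma names follow AMD/Xilinx Vitis HLS; other vendors use different directives, and the kernel itself is a hypothetical example). Much of the source describes hardware interfaces and pipeline behavior rather than the algorithm:

```c
/* Sketch of a vector-add kernel in the style a typical HLS tool expects.
   Note how much of the source is hardware description, not algorithm. */
void vadd(const int *in1, const int *in2, int *out, int size)
{
#pragma HLS INTERFACE m_axi     port=in1  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=in2  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=size
#pragma HLS INTERFACE s_axilite port=return

    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1   /* the developer chooses the initiation interval */
        out[i] = in1[i] + in2[i];
    }
}
```

Choosing interface bundles, initiation intervals, and memory partitioning is hardware design work, which is exactly the expertise most SW developers do not have.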
User Guided Partitioning
The QCC platform will automatically partition functions to different heterogeneous compute engines. The partitioner is optimized to limit the number of channels and the data transfer between compute engines. There is no need for the SW developer to write compute-engine-specific code that uses an API to initialize and interrogate resources, start and stop functions, transfer data, and synchronize tasks. QCC automatically generates all the infrastructure, communication, data transfer, and data-referencing code needed to support partitioned distributed computing, across different processes, processors, and FPGAs. Modifications to optimize the application partition are easily accomplished by re-executing the partitioner; no source code modifications are required. Application development is an iterative process. Requiring code changes across the host code and the accelerator to optimize the application partition is resource intensive, reduces cycles per day, and is error prone. HLS imposes two development paths that must be separately developed and modified to change any partition, and the modified code no longer matches the original, requiring extensive system verification. Poor partitions can result in order-of-magnitude differences in performance. To reduce development time, enhance performance, and increase quality, automatic partitioning is a critical capability.
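For context, the hand-written host code being described typically looks like the condensed OpenCL sketch below for a single accelerated function (the kernel name vadd, the buffer sizes, and the load_fpga_binary helper are hypothetical; error checking is omitted). Every accelerated function needs some variation of this, and every repartition means revisiting it:

```c
#include <CL/cl.h>

#define N 4096

/* load_fpga_binary() is a hypothetical helper that wraps
   clCreateProgramWithBinary()/clBuildProgram(); implementation elided. */
extern cl_program load_fpga_binary(cl_context ctx, cl_device_id dev);

void run_vadd(const int *a, const int *b, int *out)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    /* Initialize and interrogate resources. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);
    cl_program prog = load_fpga_binary(ctx, device);
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    /* Allocate device buffers and move data to the accelerator. */
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  N * sizeof(int), NULL, &err);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  N * sizeof(int), NULL, &err);
    cl_mem d_o = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &err);
    clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, N * sizeof(int), a, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, d_b, CL_TRUE, 0, N * sizeof(int), b, 0, NULL, NULL);

    /* Bind arguments, start the function, synchronize, read results back. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(k, 2, sizeof(cl_mem), &d_o);
    clEnqueueTask(q, k, 0, NULL, NULL);
    clFinish(q);
    clEnqueueReadBuffer(q, d_o, CL_TRUE, 0, N * sizeof(int), out, 0, NULL, NULL);

    /* Tear down. */
    clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_o);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}
```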
Memory
Memory architecture is essential to deliver acceleration: absence of data and insufficient memory bandwidth will stall compute. To achieve acceleration, critical loops must be pipelined, so that n iterations through a pipeline of depth c take roughly n + c cycles instead of n * c (for example, 1,000 iterations through a 10-stage pipeline take about 1,010 cycles rather than 10,000). For additional performance, pipelined loops are unrolled. In a fully pipelined loop all operations execute in parallel, which places heavy demand on the memory subsystem; unrolling challenges it further. The required memory subsystem is far more complex than a processor's multi-level cache. The CacheQ Virtual Machine is tightly integrated with a proprietary multi-ported, arbitrated, cached memory structure: data transfer, data access, and the memory subsystem just work. SW developers writing code for a processor do not design their own cache. Beyond the hardware architecture, certain SW capabilities must also exist. SW developers want to use malloc. They want extensive pointer support. They want memory shared between host and accelerators. All of this is part of the CacheQ solution. HLS puts an insurmountable burden on the SW developer: rewriting code to remove malloc is time consuming, error prone, and delivers sub-optimal results, and creating an application-specific, fully predictable memory controller is well beyond the timeline required for application development.
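To give a feel for the rewrite burden, the sketch below shows a hypothetical median() routine as a SW developer would naturally write it, and the kind of hand transformation an HLS flow typically forces: dynamic allocation is generally not supported inside HLS kernels, so the scratch buffer becomes a fixed-size array with a compile-time ceiling. The function and the MAX_N limit are illustrative assumptions, not CacheQ or vendor code.

```c
#include <stdlib.h>

/* Original host-style code: scratch copy sized at run time. */
double median(const double *x, int n)
{
    double *tmp = malloc(n * sizeof(double));      /* fine on a CPU */
    for (int i = 0; i < n; i++) tmp[i] = x[i];

    /* insertion sort on the copy */
    for (int i = 1; i < n; i++) {
        double v = tmp[i];
        int j = i - 1;
        while (j >= 0 && tmp[j] > v) { tmp[j + 1] = tmp[j]; j--; }
        tmp[j + 1] = v;
    }

    double m = tmp[n / 2];
    free(tmp);
    return m;
}

/* Typical hand rewrite for an HLS flow (hypothetical): no malloc is
   allowed in the kernel, so the buffer becomes a fixed-size local array
   that maps to on-chip memory, and the caller must guarantee n <= MAX_N. */
#define MAX_N 1024
double median_hls(const double x[MAX_N], int n)
{
    double tmp[MAX_N];
    for (int i = 0; i < n; i++) tmp[i] = x[i];

    for (int i = 1; i < n; i++) {
        double v = tmp[i];
        int j = i - 1;
        while (j >= 0 && tmp[j] > v) { tmp[j + 1] = tmp[j]; j--; }
        tmp[j + 1] = v;
    }

    return tmp[n / 2];
}
```

Every such buffer now carries a worst-case size the developer must guess, and that guess directly consumes on-chip memory whether or not it is ever filled.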
Limited Variables
The CacheQ development platform includes many of the critical capabilities a SW developer needs to develop and deliver application acceleration: code is partitioned, loops are pipelined and unrolled, there is an optimized, configurable memory subsystem, malloc just works, and there are optimization tools built around dataflow, a concept SW developers understand. The SW developer stays focused on the application and on application optimization. With HLS there is too much to digest. The developer must deal with separate programs, manual partitioning, extensive code modification, multiple simultaneous variables to optimize (memory structure, initiation interval, data movement, code modifications, …), timing simulations, host-to-FPGA data transfer, and array structures, to name a few. CacheQ provides the foundation; the SW developer builds upon it.