Up To 6x Faster Than NVIDIA H100 & 30x Faster Than Intel Xeon 8380, Available In 2H 2023

Tachyum has officially published the whitepaper of its 5nm Prodigy Universal Processor which was unveiled all the way back in 2018.

Tachyum Promises Big Numbers In 5nm Prodigy Universal Processor Whitepaper, Up To 9 Times Higher Performance Efficiency Than NVIDIA’s H100

The Tachyum Prodigy CPUs utilize a universal processor design, meaning they can execute CPU, GPU, and TPU tasks on the same chip, saving costs over competing products while also offering very high performance.

The company aims to tackle all three chip giants, AMD, Intel & NVIDIA, with its Prodigy lineup. In its presentations, Tachyum has estimated a 4x performance uplift over Intel’s Xeon CPUs, and against NVIDIA’s H100, a 3x increase on the HPC front and a 6x increase in raw performance in AI & inference workloads. The chips are also said to offer over 10x the performance of competitors’ systems at the same power. Some of the main features of the CPUs include:

  • 128 high-performance unified 64-bit cores running up to 5.7 GHz
  • 16 DDR5 memory controllers
  • 64 PCIe 5.0 lanes
  • Multiprocessor support for 4-socket and 2-socket platforms
  • Rack solutions for both air-cooled and liquid-cooled data centers
  • SPECrate 2017 Integer performance of around 4x Intel Xeon Platinum 8380 and around 3x AMD EPYC 7763
  • Double-Precision Floating-Point performance is 3x NVIDIA H100
  • AI FP8 performance is 6x NVIDIA H100

Tachyum has now released the full whitepaper of its Prodigy Universal Processor that details the CPU architecture, platform, and lineup, which will scale from the low-power T832-LP 32-core CPU at 180W TDP, all the way up to the flagship T16128-AIX, which features a total of 128 cores.

Tachyum Prodigy Universal CPU Architecture – Custom 64-bit Design

The Tachyum Prodigy makes use of an OoO (out-of-order) architecture that can decode and retire up to 8 instructions per clock and issue up to 11 instructions per clock, with an instruction queue that holds up to 48 instructions and a scheduler with 12 queues, each 15 entries deep. It comes with four ALUs, one load unit, one store unit, one load/store unit, one mask unit & two 1024-bit vector units. Each core also has an AI subsystem that includes a 4096-bit matrix unit. Each core is a single-threaded hardware design.

Coming to the cache configuration, each core packs 64 KB of I-Cache & 64 KB of D-Cache with SECDED (single-error-correct, double-error-detect) ECC. Each core also has 1 MB of L2 cache with DECTED (double-error-correct, triple-error-detect) ECC. Active cores can also pool the L2 caches of idle CPU cores to act as a shared L3 cache.

Prodigy employs an innovative coherence protocol, T-MESI (Tachyum-MESI), that is based on MESI. T-MESI adds optimizations enhancing standard MESI that improve latency and performance. In addition to on-chip cache coherence, Prodigy also supports hardware coherence between Prodigy devices that enables both 2-socket and 4-socket platforms to be fully coherent. Prodigy’s hardware coherence uses eight full duplex lanes of 112 gigabit/sec SERDES links between each set of coherent devices, providing an aggregate of 1.8 terabit/sec of bandwidth between coherent devices.
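The quoted socket-to-socket figure follows directly from the link spec; here is a quick sanity check in Python, assuming "aggregate" counts both directions of the full-duplex links:

```python
# Back-of-envelope check of Prodigy's socket-to-socket coherence bandwidth.
# Assumption: "aggregate" counts both directions of the full-duplex links.
lanes = 8                  # full-duplex SERDES lanes per coherent link
gbit_per_lane = 112        # gigabit/sec per lane, per direction
directions = 2             # full duplex: transmit + receive

aggregate_tbit = lanes * gbit_per_lane * directions / 1000
print(f"{aggregate_tbit:.3f} Tbit/s")  # ~1.8 Tbit/s, matching the whitepaper
```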

Prodigy’s TLB can hold large memory footprints for HPC, up to 128 TB. The MMU is hardware-managed for maximum performance and includes a sophisticated global purge mechanism.

Vector and Matrix Units

Prodigy’s 2×1024-bit vector subsystems are 2x the size of those in Intel’s top-end processors and 4x the size of AMD’s. Prodigy’s 4096-bit matrix unit supports 16 x 16, 8 x 8, and 4 x 4 operations. The vector and matrix subsystems support a wide range of data types, including FP64, FP32, TF32, BF16, Int8, FP8, as well as TAI, or Tachyum AI, a new data type that will be announced later this year and will deliver higher performance than FP8. Prodigy’s matrix operations support sparse data types for the highest performance, including 4:2 sparsity, which is also supported by the NVIDIA H100, as well as Tachyum’s Super-Sparsity, which enables even higher performance with an 8:3 ratio.

Sparse data types maximize performance for training and inference with a very minor reduction in accuracy. Lower precision data types and sparsity are discussed in more detail in the section “Prodigy on the Leading Edge of AI Industry Trends” below. Scatter/Gather operations provide fast, efficient loading and storage for vectors and matrices.
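To make the ratios concrete, here is a minimal illustration of N:M structured sparsity, where only M of every N consecutive weights are kept so the hardware can skip the zeroed multiplies. This is a generic sketch, not Tachyum's (or NVIDIA's) actual pruning scheme:

```python
# Illustrative sketch of N:M structured sparsity (not Tachyum's implementation):
# in each group of n consecutive weights, keep the m largest-magnitude values
# and zero the rest, so the matrix unit can skip the zeroed multiplies.
def prune_n_m(weights, n, m):
    """Keep the m largest-magnitude values in each group of n; zero the rest."""
    pruned = []
    for i in range(0, len(weights), n):
        group = weights[i:i + n]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-m:]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.3, 0.02, -0.4, 0.6]
print(prune_n_m(w, 4, 2))  # 4:2 sparsity: 2 of every 4 weights survive
print(prune_n_m(w, 8, 3))  # 8:3 "Super-Sparsity": 3 of every 8 survive
```

The tighter the ratio, the fewer multiplies the hardware performs per group, which is why 8:3 promises higher throughput than 4:2 for the same matrix size.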

Memory and I/O Subsystems

Prodigy integrates an industry-leading sixteen DDR5 memory controllers that run up to DDR5-7200, providing approximately 1 TB/sec of memory bandwidth, supporting 2 DIMMs per channel. Tachyum will be announcing a new feature later this year called “Bandwidth Amplification” that effectively doubles the memory bandwidth to a staggering 2 TB/sec. The PCIe subsystem includes 64 lanes of PCIe 5.0 with 32 PCIe controllers.
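The "approximately 1 TB/sec" figure checks out from the channel math alone (each DDR5 channel moves 8 bytes per transfer):

```python
# Rough peak-bandwidth math for 16 channels of DDR5-7200
# (64-bit data bus per channel = 8 bytes per transfer).
channels = 16
transfers_per_sec = 7200e6   # DDR5-7200: 7200 mega-transfers/sec
bytes_per_transfer = 8

peak_gb = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"{peak_gb:.1f} GB/s")  # ~921.6 GB/s, i.e. roughly 1 TB/s
```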

The PCIe subsystem includes four x16 PCIe functional blocks, and each of the x16 blocks includes 8 controllers that can bifurcate down to x2, offering maximum flexibility to support external devices ranging from high performance NICs to large NVMe storage arrays.
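A few hypothetical ways one x16 block could be carved up illustrate the flexibility; the only hard constraints described are the 16-lane budget, the 8 controllers per block, and a minimum link width of x2:

```python
# Illustrative splits of one x16 PCIe 5.0 block (not an exhaustive product
# spec): each configuration must sum to 16 lanes and use at most the
# 8 controllers in the block, with links bifurcating no narrower than x2.
example_splits = [
    [16],             # one x16 device, e.g. a high-performance NIC
    [8, 8],           # two x8 devices
    [8, 4, 4],        # mixed widths
    [4, 4, 4, 4],     # four x4 NVMe drives
    [2] * 8,          # eight x2 links, one per controller
]
for split in example_splits:
    assert sum(split) == 16 and len(split) <= 8 and min(split) >= 2
    print("x" + " + x".join(map(str, split)))
```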

Prodigy Runs x86, Arm & RISC-V Binaries Through Emulation

Prodigy supports software dynamic binary translation for other instruction set architectures (ISAs), including x86, Arm, and RISC-V. x86 is the established data center ISA, Arm is very prevalent in telco applications, and RISC-V is popular with academic institutions. The overhead for binary translation is approximately 30-40%, but Prodigy will be running at approximately twice the frequency of competitive processors, so the performance should be similar to running native. Binary translation is intended to enable fast, easy out-of-the-box evaluation and testing for customers and partners, with customers migrating to Prodigy’s native ISA for production deployments for maximum performance.
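The "similar to native" claim can be checked with a simple multiplicative model (an assumption on our part; real translated-code performance depends on the workload):

```python
# Back-of-envelope for the "similar to native" claim: a 30-40% binary
# translation overhead roughly offset by a ~2x clock-speed advantage.
# Assumes performance scales linearly with frequency (illustrative only).
clock_advantage = 2.0
for overhead in (0.30, 0.40):
    relative_perf = clock_advantage * (1 - overhead)
    print(f"{overhead:.0%} overhead -> {relative_perf:.2f}x a native competitor")
```

Under this simple model, translated code would land at 1.2x to 1.4x a competitor running natively, which is consistent with (if anything, stronger than) the whitepaper's "similar to native" framing.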

All chips are fabricated on TSMC’s 5nm (N5P) process node, a slightly optimized variant of the standard 5nm (N5) node, and can run native as well as x86, Arm, and RISC-V binaries. As for HPC and AI-specific features, the Tachyum Prodigy lineup includes:

  • 2 x 1024-bit Vector Units Per Core
  • 4096-bit Matrix Processors Per Core
  • FP64, FP32, TF32, BF16, Int8, FP8, TAI Data Types
  • Sparse Data Types Optimize Efficiency
  • Quantization Support Using Low Precision Data Types
  • Scatter/Gather for efficiently storing and loading matrices

Tachyum Prodigy Universal CPU Lineup/Platform – Scaling from 180W To 950W

All 128 cores on the flagship CPU are clocked at up to 5.7 GHz. AI customers will also get up to 16 memory channels, supporting up to 32 TB (64 DIMMs) of DDR5-7200. The processor will also rock 64 PCIe Gen 5.0 lanes and will come in a 950W TDP package.

The rest of the CPUs that Tachyum will offer are listed in the specs sheet below:

Model                Cores  Clock    Memory         PCIe      TDP   Market Segment
Prodigy T16128-AIX   128    5.7 GHz  16x DDR5-7200  Gen5 x64  950W  HPC, Big AI
Prodigy T16128-AIM   128    4.5 GHz  16x DDR5-7200  Gen5 x64  700W  HPC, Big AI
Prodigy T16128-AIE   128    4.0 GHz  16x DDR5-7200  Gen5 x64  600W  HPC, Big AI
Prodigy T16128-HT    128    4.5 GHz  16x DDR5-6400  Gen5 x64  300W  Analytics, Big Data
Prodigy T864-HS      64     5.7 GHz  8x DDR5-6400   Gen5 x32  300W  Cloud, Databases
Prodigy T864-HT      64     4.5 GHz  8x DDR5-6400   Gen5 x32  300W  Cloud, Databases
Prodigy T832-HS      32     5.7 GHz  8x DDR5-6400   Gen5 x32  300W  Scalar Workloads
Prodigy T832-LP      32     3.2 GHz  8x DDR5-4800   Gen5 x32  180W  Hosting, Storage, Edge

Now that’s just one chip, and Tachyum will allow full hardware coherence supporting 2 and 4-socket systems. That works out to up to 512 cores and roughly 3,800W of power from four Prodigy T16128-AIX tier processors.

The Prodigy Platform will come in various rack solutions, such as an air-cooled 2U server that will be able to house up to four Tachyum Prodigy chips, 64 x 16 GB DDR5 DIMMs, and 2 x 200 GbE RoCE NICs. There’s also a custom 48U rack reference design that comes in two versions, one liquid-cooled and one air-cooled. The air-cooled version supports 40 4-socket 2U servers for a total of 160 chips, while the liquid-cooled version supports 88 4-socket 1U servers for a total of 352 chips. Both racks have a modular design, and two racks can be combined into a 2-rack cabinet to optimize floor space. Each server comes with four cLGA sockets.

Tachyum Prodigy Universal CPU Lineup – Hitting NVIDIA, Intel & AMD All At Once

Tachyum also provides some preliminary performance estimates against Intel Ice Lake, NVIDIA Hopper / Grace HPC chips, and AMD Milan CPUs. The company claims up to a 4x SPECrate 2017 Integer and 30x Raw Floating Point performance (FP64) increase versus the competition. Hopper H100 from NVIDIA is the main chip that Tachyum seems to have its eyes set upon as it’s used in several comparative tests.

Some of the performance figures mentioned include:

  • 3x vs NVIDIA H100 in Double Precision Floating-Point Performance
  • 6x vs NVIDIA H100 in AI FP8 Performance
  • 9x vs NVIDIA H100 in Performance per Watt
  • 4x vs Intel Xeon Platinum 8380 in Specrate 2017 INT Performance
  • 30x vs Intel Xeon Platinum 8380 in FP64 Performance
The Prodigy T16128-AIX offers around 90 TFLOPs of FP64 compute (with sparsity). The company compares an air-cooled Prodigy rack, estimated to deliver up to 6.2 PetaFLOPs of HPC FP64 horsepower, against an NVIDIA H100 DGX POD rack, which offers 960 TFLOPs of FP64 HPC performance. The liquid-cooled Prodigy rack, which can sustain higher-end chips, should offer more than double that at 12.9 PetaFLOPs.
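Dividing the quoted rack figures by the chip counts gives the implied per-chip FP64 throughput (a back-of-envelope on our part, assuming the PetaFLOPs numbers scale linearly over chips):

```python
# Implied per-chip FP64 throughput from the quoted rack figures
# (back-of-envelope; assumes the PFLOPs numbers divide evenly across chips).
racks = {
    "Prodigy air-cooled (160 chips)":    (6.2e3, 160),   # TFLOPs, chip count
    "Prodigy liquid-cooled (352 chips)": (12.9e3, 352),
}
for name, (tflops, chips) in racks.items():
    print(f"{name}: ~{tflops / chips:.1f} TFLOPs FP64 per chip")
```

Both racks land well below the flagship's ~90 TFLOPs per chip, which suggests the rack estimates assume lower-clocked, lower-TDP SKUs rather than the 950W T16128-AIX throughout.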

Tachyum expects the first Prodigy chips to start sampling later this year, with volume production expected in the second half of 2023. The next-gen upgrade to Prodigy, known as Prodigy 2, is also listed on Tachyum’s roadmap and will offer a new 3nm architecture with even more cores, higher memory bandwidth, PCIe 6.0 + CXL support, and enhanced connectivity. Sampling on that should begin by the second half of 2024.
