Key Takeaways
1. The Nvidia GeForce RTX 5090 features the new GB202 GPU with significant hardware upgrades compared to previous models like the RTX 4090 and RTX 3090 Ti.
2. The RTX 5090 does not include the ability to switch between ECC and non-ECC memory states, a feature available in the RTX 3090 Ti and RTX 4090.
3. ECC (Error Correction Code) memory is crucial for tasks requiring high data accuracy, such as machine learning, while regular consumers may not need it.
4. GDDR7 memory specifications now include on-die ECC to handle increased memory densities and improve error correction capabilities.
5. The RTX 5090 uses high-performance GDDR7 memory, but its ECC support is uncertain; a user-facing ECC toggle would have to arrive in a future driver or VBIOS update.
Since the Ampere generation, Nvidia has replaced its top Titan card with the 90 series models aimed at both professionals and gamers.
Significant Hardware Upgrades
The Nvidia GeForce RTX 5090 features the new GB202 GPU, which shows major hardware enhancements when compared to the RTX 4090’s AD102 and RTX 3090 Ti’s GA102 GPUs. Interestingly, while the RTX 3090 Ti and RTX 4090 allowed users to change the VRAM ECC state in the driver, this function seems to be omitted in the RTX 5090.
Understanding ECC
Error Correction Code, or ECC, is a method that lets memory detect and correct its own errors. Memory errors can come from bit flips during data transfer or from corruption of stored data as memory cells lose charge and are refreshed. The correction is performed either inside the memory chip itself (on-die ECC) or by the memory controller, using check bits stored on a dedicated extra chip alongside the eight data chips (commonly called side-band ECC).
Most DDR5 consumer system memory supports ECC, but only partially. DDR5 RAM is designed to spot multi-bit errors but can only fix single-bit errors through its built-in on-die checking. Because DDR5 splits each 64-bit module into two 32-bit subchannels, full DDR5 ECC modules come in 72-bit (two subchannels of 32+4 bits, EC4) or 80-bit (two subchannels of 32+8 bits, EC8) configurations.
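The "correct one bit, detect two" behavior described above can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming(8,4) code on a 4-bit value; it is only an illustration of the principle, not the code that DDR5 or any GPU memory actually uses, and the function names are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1 bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Hamming(7,4) layout, positions 1..7: p1 p2 d0 p4 d1 d2 d3
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4, 5, 6, 7
    word = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    overall = 0
    for bit in word:
        overall ^= bit        # extra parity bit over the whole word
    return word + [overall]


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos   # XOR of the positions of all set bits
    parity = 0
    for bit in w:
        parity ^= bit         # parity over all 8 bits, 0 for a valid codeword
    if syndrome == 0 and parity == 0:
        status = "ok"
    elif parity == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        if syndrome != 0:
            w[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    nibble = w[2] | (w[4] << 1) | (w[5] << 2) | (w[6] << 3)
    return status, nibble


if __name__ == "__main__":
    codeword = encode(0b1011)
    codeword[4] ^= 1              # simulate one flipped bit
    print(decode(codeword))       # single flip is repaired
    codeword[6] ^= 1              # a second flip in the same word
    print(decode(codeword))       # now only detected, not repaired
```

Real memory ECC applies the same idea to much wider words, which is why the check-bit overhead shrinks as a share of the data.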
When is ECC Necessary?
ECC memory is rarely needed for regular consumer tasks. If this term is new to you, it’s likely you won’t require ECC memory. Nevertheless, ECC is crucial for mission-critical and machine learning tasks where data accuracy must be preserved throughout the entire process. Google faced significant issues back in 1999 when it ran without ECC memory, and memory corruption severely hampered its search engine’s performance.
All GPUs that use GDDR5 and GDDR6/6X VRAM have a system for detecting memory errors called Error Detection Code (EDC). Nvidia refers to this as Error Detection and Replay (EDR): each transfer is verified with a cyclic redundancy check (CRC), and when the check fails the memory controller requests that the data be sent again. EDR helps reduce pixel artifacts when VRAM is overclocked, although the retransmissions can have a slight negative effect on performance.
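The detect-and-replay idea can be sketched in a few lines. The example below is a simplified model, not Nvidia's implementation: the CRC-8 polynomial (0x07), the retry limit, and the helper names `make_flaky_channel` and `transfer_with_replay` are all assumptions chosen for illustration.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8 over a bytes object (initial value 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def make_flaky_channel(bad_transfers):
    """Hypothetical link that flips one bit in the first `bad_transfers` sends."""
    state = {"remaining": bad_transfers}

    def channel(data):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            corrupted = bytearray(data)
            corrupted[0] ^= 0x01      # single-bit transmission error
            return bytes(corrupted)
        return data

    return channel


def transfer_with_replay(burst, channel, max_retries=3):
    """Send a burst, re-sending whenever the received CRC does not match."""
    expected = crc8(burst)
    for attempt in range(1, max_retries + 1):
        received = channel(burst)
        if crc8(received) == expected:
            return received, attempt
    raise RuntimeError("burst still failing CRC after retries")


if __name__ == "__main__":
    burst = bytes(range(32))
    data, attempts = transfer_with_replay(burst, make_flaky_channel(2))
    print(f"delivered intact after {attempts} attempt(s)")
```

Every failed check costs an extra transfer, which is where the performance penalty on an unstable memory overclock comes from. Note that this scheme only detects errors on the link and retries; it does not correct data that is already wrong in the memory cells.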
Features of RTX 3090 Ti and RTX 4090
A lesser-known feature in the Nvidia GeForce RTX 3090 Ti and RTX 4090 desktop GPUs is the ability to switch between ECC and non-ECC memory states via the driver. Unfortunately, this option is missing in the new RTX 5090. Both the RTX 3090 Ti and RTX 4090 use a method known as “soft ECC,” which doesn’t require a separate chip for parity. Instead, activating this feature reserves part of the VRAM to hold the error-correction data.
Consequently, this reduces the total available VRAM and memory speed. For the RTX 4090, the usable VRAM drops from 24 GB to 22.5 GB, with 1.5 GB allocated for ECC functions. Activating the ECC state affects performance; for instance, with ECC on in the RTX 4090, 3DMark Speed Way scores saw a 6.4% decrease, and Cyberpunk 2077 2.21 Phantom Liberty experienced about a 5% dip in average fps. The extent of performance loss varies based on the specific task.
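As a quick check on the capacity figures above, the share of VRAM given up to soft ECC on the RTX 4090 can be worked out directly. The helper name `ecc_overhead` is just for this example.

```python
def ecc_overhead(total_gb, usable_gb):
    """Return (reserved capacity in GB, reserved share of total capacity)."""
    reserved_gb = total_gb - usable_gb
    return reserved_gb, reserved_gb / total_gb


if __name__ == "__main__":
    reserved, share = ecc_overhead(24.0, 22.5)   # RTX 4090 figures from the text
    print(f"{reserved} GB reserved, {share:.2%} of total VRAM")
```

With 24 GB total and 22.5 GB usable, the reserved 1.5 GB is 6.25% of capacity, or one sixteenth of the card's memory.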
Advances with GDDR7
With GDDR7, JEDEC has now included on-die ECC as part of the VRAM specifications, recognizing the higher chances of errors due to increased memory densities. GDDR7 employs on-die ECC with a protocol that informs the memory controller about the types of errors that occur. According to JEDEC, GDDR7 can fully correct 1-bit errors and completely detect 2-bit errors, although the detection for rare 3-bit errors drops slightly to 99.3%.
Moreover, the official specifications also include command address parity with command blocking (CAPARBLK) to enhance the reliability of the command address bus. However, it remains uncertain whether Blackwell’s memory controller utilizes this on-die ECC functionality by default.
Specifications of RTX 5090
The RTX 5090 is equipped with GDDR7 memory on a 512-bit bus, rated for an impressive 1.792 TB/s of bandwidth at a per-pin data rate of 28 Gbps, a speed at which transmission errors become more likely. Furthermore, Nvidia is promoting the RTX 5090 for AI workflows, which could benefit from ECC when processing large datasets. However, Nvidia’s architecture whitepaper only mentions support for “Enhanced Cyclic Redundancy Check (CRC) for Reliability, Availability, and Serviceability (RAS),” which does not equate to ECC.
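The quoted bandwidth follows from the bus width and the per-pin data rate. A minimal calculation, with a function name chosen only for this example:

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: pins times bits per second, over 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


if __name__ == "__main__":
    bandwidth = peak_bandwidth_gb_s(512, 28)     # RTX 5090 figures from the text
    print(f"{bandwidth:.0f} GB/s = {bandwidth / 1000:.3f} TB/s")
```

For a 512-bit bus at 28 Gbps this gives 1,792 GB/s, matching the 1.792 TB/s figure. This is a theoretical peak; retransmissions or ECC overhead would lower the effective rate.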
While it would be reasonable to assume that Nvidia would activate GDDR7’s on-die ECC capability for the anticipated Blackwell workstation GPUs, it is still unknown if the ECC state toggle will be available for the consumer RTX 5090 through a future driver or VBIOS update.