Tag: AI benchmarks

September 26, 2025

Samsung Launches TRUEBench for AI Productivity Testing

Key Takeaways

1. TRUEBench aims to evaluate AI performance based on real-world office tasks rather than just basic question-and-answer formats.
2. The benchmark includes diverse assessments like document summarization, translation in twelve languages, and multi-step instructions, with a total of 2,485 test sets.
3. Samsung’s CTO emphasized the importance of rigorous testing standards to determine AI usefulness, combining human feedback and AI evaluations.
4. TRUEBench promotes transparency by providing public access to datasets, leaderboards, and performance statistics, allowing for model comparisons.
5. The benchmark has limitations, including potential bias in rule-making, strict success criteria that may overlook valuable partial answers, and a focus on general business tasks over specialized fields.

AI benchmarks have had a hard time accurately capturing how people use these systems in real life. Many assessments still zero in on English-only question-and-answer tasks, which may look good on paper but don’t really show the range of activities that are part of everyday work. Samsung has recently introduced TRUEBench, short for Trustworthy Real-world Usage Evaluation Benchmark, which aims to assess AI performance in ways that are more closely aligned with actual office tasks.

Expanding Beyond Simple Tasks

TRUEBench goes further than basic trivia or single-prompt interactions. It evaluates models on document summarization, translation in twelve different languages, data analysis, and multi-step instructions that require the AI to keep context in mind. Samsung has put together 2,485 test sets divided into ten categories and 46 subcategories, with inputs ranging from just a few characters to more than twenty thousand characters. The aim is to replicate everything from quick commands to lengthy business reports.

Insights from Samsung’s CTO

Paul (Kyungwhoon) Cheun, who is the CTO of the DX Division at Samsung Electronics and also leads Samsung Research, remarked, “Samsung Research brings deep expertise and a competitive edge through its real-world AI experience. We expect TRUEBench to establish evaluation standards for productivity and solidify Samsung’s technological leadership.”

To pass a test, a model has to satisfy every condition, including implicit ones that a reasonable person would expect even though they are not spelled out. This strict all-or-nothing approach makes the results less forgiving, but it brings them closer to how one would judge whether an output is truly useful. Samsung developed the rules by combining human feedback with AI evaluations: human annotators created the initial conditions, an AI flagged contradictions or inconsistencies, and humans then revised the conditions before finalizing them. Once the rules were complete, evaluation could run at scale through automated AI scoring.
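The all-or-nothing rule can be illustrated with a minimal Python sketch. This is not Samsung's implementation: the Condition class, the example checks, and the sample summary are all hypothetical, and simple string tests stand in for the automated AI scoring the article describes. The sketch only contrasts strict pass/fail scoring with a partial-credit alternative.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Condition:
    """One pass/fail requirement for a test item (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]
    implicit: bool = False  # True if the prompt never states it explicitly


def all_or_nothing(output: str, conditions: List[Condition]) -> bool:
    """Strict scoring: the output passes only if every condition holds."""
    return all(c.check(output) for c in conditions)


def partial_credit(output: str, conditions: List[Condition]) -> float:
    """Lenient alternative: the fraction of conditions that hold."""
    if not conditions:
        return 0.0
    return sum(c.check(output) for c in conditions) / len(conditions)


# Hypothetical conditions for a "summarize this email" task.
conditions = [
    Condition("mentions the deadline", lambda s: "Friday" in s),
    Condition("stays under 20 words", lambda s: len(s.split()) < 20),
    Condition("uses plain ASCII text", lambda s: s.isascii(), implicit=True),
]

summary = "The report is due soon and covers third-quarter sales."

print("strict pass:", all_or_nothing(summary, conditions))
print("partial credit:", round(partial_credit(summary, conditions), 2))
```

The sample summary meets two of the three conditions, so it fails under strict scoring even though a partial-credit scheme would give it most of the marks. This is the trade-off behind the benchmark's stated limitation on partial answers.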

Transparency in Performance

Additionally, Samsung has published the dataset, leaderboards, and output statistics on Hugging Face. Users can compare up to five models side by side and see how their performance measures up. This transparency lets developers, researchers, and users examine the benchmark instead of just taking Samsung’s word for it.

However, the benchmark is not without flaws, as rule-making will always carry some level of bias. The requirement for complete success on every condition means that partial, yet still valuable, answers are marked as failures. While the language support is broader than many existing tests, performance will naturally vary, especially in languages where there isn’t enough training data. The test set also focuses more on general business tasks, so specialized fields like law, medicine, or scientific research may not be adequately represented.

Source: Samsung Newsroom

Tags: AI benchmarks, Samsung

May 30, 2025

DeepSeek Launches R1 Model with Enhanced AI and Reduced Hallucinations

Key Takeaways

1. DeepSeek-R1-0528 outperforms the original R1 across the benchmarks tested; DeepSeek’s earlier V3 and R1 models were more cost-effective and quicker to train than rival models.
2. The model shows improvements in performance but answers only 17% of questions correctly on the difficult Humanity’s Last Exam.
3. Extended training periods and fine-tuning, rather than major technological breakthroughs, likely account for the model’s better results.
4. The new R1 model has fewer occurrences of AI hallucinations, providing more accurate information.
5. Distilled eight-billion-parameter versions of the open-source R1 model can run on an Nvidia 4090 GPU with 24 GB of memory.

DeepSeek has introduced the newest iteration of its R1 large language model, named DeepSeek-R1-0528. The firm made its entrance into the AI sector with the releases of V3 and R1, both of which ranked among the top ten AI models in performance while being more cost-effective and quicker to train than rival models from companies like OpenAI and Google.

Performance Tests

The recent R1 model was evaluated on various AI benchmarks.

When compared to the initial release of R1, DeepSeek-R1-0528 shows better performance across all tests, although it only manages to answer 17% of the questions correctly on the challenging Humanity’s Last Exam. Since its main competitors also struggle on this particular test, the improvements seen in the latest DeepSeek R1 version are likely a result of extended training periods and fine-tuning rather than any major advancements in AI technology. A key highlight of the new R1 is its reduced instances of AI hallucinations, making it less prone to providing incorrect or misleading information.

Open-Source Availability

For those interested in exploring the open-source R1 model, it is possible to run distilled versions with eight billion parameters using an Nvidia 4090 GPU that has 24 GB of memory.
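A rough back-of-the-envelope check shows why an eight-billion-parameter model is plausible on a 24 GB card. The sketch below counts only the memory needed to hold the weights. The precision options are assumptions for illustration, since the article does not say which numeric format is used, and real usage is higher once activations and the attention cache are included.

```python
GPU_MEMORY_GB = 24.0  # Nvidia 4090 memory, as stated in the article
NUM_PARAMS = 8e9      # eight billion parameters, as stated in the article

# Assumed storage formats (not specified in the article).
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits(num_params: float, bytes_per_param: float,
         gpu_gb: float = GPU_MEMORY_GB) -> bool:
    """True if the weights alone fit; ignores activations and cache."""
    return weight_memory_gb(num_params, bytes_per_param) < gpu_gb


for name, size in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(NUM_PARAMS, size)
    print(f"{name}: {gb:.1f} GB of weights, fits: {fits(NUM_PARAMS, size)}")
```

At 16 bits per parameter the weights take about 16 GB, which leaves some headroom on a 24 GB card; at 32 bits they would take about 32 GB and would not fit.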

In summary, DeepSeek continues to push the boundaries of AI with its latest R1 model, making significant strides while maintaining affordability and efficiency. Users can find more about DeepSeek through its platforms, including DeepSeek news, DeepSeek Chat, and the DeepSeek R1 on GitHub.

Tags: AI benchmarks, open-source AI

February 10, 2025

Google Unveils Powerful Gemini 2.0 Pro AI Features

Google has rolled out access to its latest AI, the Gemini 2.0 Pro experimental model. The new model features a two-million-token input window, the largest of any Google AI to date, which lets it handle very large text inputs and tackle complicated prompts that depend on them. Gemini 2.0 Pro can also browse the internet, run code, and generate code for applications.

Performance Compared to Other Models

In terms of performance, Gemini 2.0 Pro surpasses Google’s previous AI models across various standardized large language model benchmarks. Nevertheless, it has not yet matched humans or the top-performing AIs in every category evaluated. For instance, on the LiveBench LLM benchmark, the experimental Gemini 2.0 Pro scores only 65.13, compared with 71.57 for DeepSeek R1 and 75.88 for OpenAI’s o3-mini in high mode.

Human Evaluation and Security Measures

Even so, when human evaluators rate AIs on their own prompts, Gemini 2.0 Pro stands out as one of the top two AIs globally today, according to the OpenLM.ai Chatbot Arena Elo ranking of its responses. Hackers may find Gemini 2.0 Pro frustrating, as self-training methods were used during its development to minimize the chances of producing unsafe responses.
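For readers unfamiliar with Elo rankings, the sketch below shows the standard Elo update applied to a single head-to-head comparison between two models. The ratings and the K-factor are made-up values, and the leaderboard's actual computation may differ from this textbook form.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> Tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if evaluators preferred A, 0.0 if they preferred B,
    and 0.5 for a tie. k (an assumed value) controls the step size.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: two equally rated models, evaluator prefers model A.
model_a, model_b = update_ratings(1500.0, 1500.0, score_a=1.0)
print(model_a, model_b)
```

Each human preference nudges the winner's rating up and the loser's down by the same amount, so a model's rank reflects how often evaluators prefer its responses over those of similarly rated opponents.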

Subscription and Availability

Gemini 2.0 Pro is accessible to all users of Google Gemini Advanced who subscribe for $19.99 monthly. It is also available for developers using Google AI Studio and Vertex AI. Users interested in having Gemini at their fingertips can download the Gemini app on their smartphones or buy a Google Pixel 9 Pro smartphone that comes with Gemini integrated (available for purchase on Amazon).

Tags: AI benchmarks, Google AI
