Authors:
(1) Goran Muric, InferLink Corporation, Los Angeles, California ([email protected]);
(2) Ben Delay, InferLink Corporation, Los Angeles, California ([email protected]);
(3) Steven Minton, InferLink Corporation, Los Angeles, California ([email protected]).
Table of Links
2 Related Work and 2.1 Prompting techniques
3.3 Verbalizing the answers and 3.4 Training a classifier
4 Data and 4.1 Clinical trials
4.2 Catalonia Independence Corpus and 4.3 Climate Detection Corpus
4.4 Medical health advice data and 4.5 The European Court of Human Rights (ECtHR) Data
7.1 Implications for Model Interpretability
7.2 Limitations and Future Work
A Questions used in ICE-T method
Abstract
In this paper, we introduce the Interpretable Cross-Examination Technique (ICE-T), a novel approach that leverages structured multi-prompt techniques with Large Language Models (LLMs) to improve classification performance over zero-shot and few-shot methods. In domains where interpretability is crucial, such as medicine and law, standard models often fall short due to their “black-box” nature. ICE-T addresses these limitations by using a series of generated prompts that allow an LLM to approach the problem from multiple directions. The responses from the LLM are then converted into numerical feature vectors and processed by a traditional classifier. This method not only maintains high interpretability but also allows smaller, less capable models to achieve or exceed the performance of larger, more advanced models under zero-shot conditions. We demonstrate the effectiveness of ICE-T across a diverse set of data sources, including medical records and legal documents, consistently surpassing the zero-shot baseline in terms of classification metrics such as F1 scores. Our results indicate that ICE-T can be used to improve both the performance and transparency of AI applications in complex decision-making environments.
1 Introduction
There are numerous prompting strategies for achieving good performance with generative Large Language Models (LLMs). Take, for instance, a binary classification problem, where a system should classify a given text into one of two classes. A typical zero-shot approach is to prompt the model with the given text and a carefully designed question that will yield an appropriate answer. There are also multiple variations on that approach, including “chain-of-thought” prompting (Wei et al., 2022c; Wang et al., 2022a; Kojima et al., 2022), “few-shot learning” (Schick and Schütze, 2022; Gu et al., 2021), “self-instruct” (Wang et al., 2022b; Yang et al., 2024) prompting and “iterative refinement” (Wu et al., 2022a; Trautmann, 2023). These tactics are used to gain a better sense of the model’s underlying reasoning or to surpass the performance achieved by the standard zero-shot method.
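To make the zero-shot baseline concrete, the sketch below shows a single-prompt binary classification call. The prompt wording and the `call_llm` wrapper are illustrative assumptions, not details taken from the paper; `call_llm` stands in for whatever LLM client is actually used.

```python
def call_llm(prompt: str) -> str:
    """Send a single prompt to an LLM and return its text response (stub)."""
    raise NotImplementedError("Plug in your preferred LLM client here.")


def zero_shot_classify(text: str) -> int:
    """Classify a document with one carefully designed zero-shot prompt."""
    prompt = (
        "Answer with a single word, 'yes' or 'no': "
        "does the following text belong to the positive class?\n\n"
        f"Text:\n{text}"
    )
    answer = call_llm(prompt).strip().lower()
    return 1 if answer.startswith("yes") else 0
```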
These approaches are typically used when highly specialized fine-tuned LLMs are not a viable option because it is of utmost importance to understand how decisions are made. This is especially true in fields like medicine, where decisions based on opaque, “black-box” models are usually not acceptable. Although zero-shot or few-shot prompting methods can potentially offer explanations for their reasoning, these explanations are often unstructured and lack quantifiability. On the other hand, while fine-tuned models may achieve superior performance, they frequently struggle to articulate the rationale behind their outputs unless explicitly trained for this purpose, a process that is labor-intensive. Additionally, the outputs of such models may lack a structured representation of the reasoning.
In cases where using “black-box” models is not practical, and where interpretability is important, users have the option to develop a structured reasoning process by asking several questions to achieve a desired output. There are three main problems that arise with this approach: 1) Non-experts have little chance to develop a good set of questions and rules that ensure optimal model performance; 2) Designing an accurate rule set becomes challenging since individual instances may not perfectly align with all desired criteria, resulting in a mix of positive and negative responses to different rules; and 3) The potential combinations of these rules can become overwhelmingly numerous, making it impractical to hard-code every possible scenario.
In this paper, we propose a method that attempts to overcome the three issues outlined above. We refer to the method as the Interpretable Cross-Examination Technique, or ICE-T for brevity. Our approach exhibits strong performance, consistently surpasses the benchmark set by a zero-shot baseline, and also offers a high level of interpretability. The core concept is that rather than using a single prompt to get a response from an LLM and making a decision based on that single output, we engage the LLM with multiple prompts covering various questions. We then combine the responses from all these prompts and use the outputs to make a decision. Compared to other methods based on multi-prompting, our approach is fundamentally different in the way decisions are made. Specifically, we take the responses from the LLM, convert them into numerical values to create a feature vector, and then input this vector into a traditional classifier to determine the final outcome. Since this process creates a low-dimensional feature vector with highly informative features, we can use relatively small classifiers to make the decision.
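The following is a minimal sketch of that pipeline: several yes/no prompts per document, answers verbalized into numeric features, and a small traditional classifier trained on the resulting vectors. The example questions are invented placeholders (the paper generates task-specific questions, listed in its appendix), the depth-3 decision tree is one arbitrary choice of small classifier, and the hypothetical `call_llm` stub from the earlier sketch is reused.

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder yes/no questions; in ICE-T these are generated per task.
QUESTIONS = [
    "Does the text mention the condition of interest? Answer 'yes' or 'no'.",
    "Does the text state that the main criterion is satisfied? Answer 'yes' or 'no'.",
    "Is there explicit evidence against the criterion? Answer 'yes' or 'no'.",
]


def verbalize(answer: str) -> float:
    """Convert an LLM answer into a numeric feature: 1.0 for 'yes', 0.0 otherwise."""
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0


def featurize(document: str) -> list[float]:
    """Cross-examine one document with every question and collect numeric answers."""
    return [verbalize(call_llm(f"{q}\n\nText:\n{document}")) for q in QUESTIONS]


def train_ice_t(documents: list[str], labels: list[int]) -> DecisionTreeClassifier:
    """Fit a small, interpretable classifier on the LLM-derived feature vectors."""
    X = [featurize(doc) for doc in documents]
    return DecisionTreeClassifier(max_depth=3).fit(X, labels)


def predict_ice_t(clf: DecisionTreeClassifier, document: str) -> int:
    """Classify a new document by featurizing it and applying the trained classifier."""
    return int(clf.predict([featurize(document)])[0])
```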
We established an experimental setup in which we tested our Interpretable Cross-Examination Technique on a simple binary classification task. We tested our approach on multiple datasets split into 17 different tasks, and we show that:

- ICE-T consistently outperforms the zero-shot baseline model on most classification metrics;
- Using a smaller model with ICE-T, we can achieve comparable or better results than using a larger and more capable model with a zero-shot approach.
Furthermore, this approach can be highly interpretable, allowing experts to clearly understand the rationale behind the decision-making process[1]. Additionally, tools commonly used for tabular machine learning can be employed to enhance the understanding of the data. While this technique is specifically evaluated for binary classification within this paper, its applicability potentially extends across a broad spectrum of scenarios.
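As one hedged illustration of that point (assuming the `train_ice_t` sketch above and a classifier that exposes `feature_importances_`, such as the decision tree used there), standard tabular tooling can show which questions drive the decision:

```python
def explain_questions(clf, questions):
    """Print how much each question contributed to the trained classifier's decisions."""
    for question, importance in sorted(
        zip(questions, clf.feature_importances_), key=lambda pair: -pair[1]
    ):
        print(f"{importance:.2f}  {question}")
```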
1.1 Motivation
The ICE-T method was initially conceived at InferLink during a commercial consulting project, where we needed to address a complex challenge in biomedical text classification. The project’s goals were to develop a model that could perform at a level comparable to human experts, provide interpretable results, and allow for the detection of potentially mislabelled data. Initially, conventional “black-box” models, such as fine-tuned BERT-based ones, underperformed, as did zero-shot and few-shot learning methods using LLMs. This led to the creation of ICE-T, which improved classification performance while gaining interpretability and allowing for the correction of labeling errors. ICE-T was initially used to classify biomedical data for a specific commercial purpose. While the specifics of this initial task and data remain confidential, we have conducted further testing on additional publicly available datasets and decided to make the method publicly accessible.
[1] Degree of interpretability may vary depending on the machine learning method selected for the final classification task. The decision on which method to employ should be guided by a consideration of the trade-offs between interpretability and performance tailored to the unique demands of each task.