Explanation of Terms
Application domain
The application domain within which the usecase (UC) was conceived. The UC may be considered transferable if it has no elements that are task-specific, participant-specific or equipment-specific.
Special equipment: Does the usecase demand the use of expensive or unusual equipment that might prevent other labs or other researchers from carrying it out, e.g.,~a robotic arm, a self-driving car? If yes, the usecase may be non-transferable.
Special skills: Would it be necessary to recruit participants with domain-specific knowledge or special skills, such as only tank drivers or only personnel with medical training? If yes, the usecase may be non-transferable.
Task-specific: Does the usecase involve behaviours by the AI system or the human that can only be carried out in this particular application domain, e.g., overtaking? If yes, the usecase may be non-transferable.
Stage/Experience
Interpersonal trust may evolve over time, often as part of an ongoing relationship. Human-AI relationships usually follow a similar trajectory. With this in mind, we see that each UC focuses on a particular stage or aspect of the trust experience: instantaneous, changing, mis-calibrated, or future/perceived.
Changing: These usecases typically involve multiple interactions between human and AI, letting researchers see whether and how trust changes over time. We distinguish three sub-categories of changing trust, any of which may be examined within a single usecase: based on the human-AI interaction history, trust may be built, broken, or repaired.
Future/Perceived: The former term is borrowed from Jones (2016) where it is used to emphasise the dependence of trust on a human's belief about a trustee's future behaviour. Interactions in these usecase scenarios might not require trust at all but, based on their experience, the human is invited to consider whether they would trust the AI system in some hypothetical future, perhaps riskier, situation.
Instantaneous: Applies to UCs where the human must make an on-the-spot decision whether or not to trust an AI system that they have not previously encountered.
Mis-calibrated: These usecases examine conditions of over- or under-trust such as may arise instantaneously or following a change in trust.
Interaction
Nature of the interaction between human and AI system: collaboration, competition, conversation, influence, observation, provocation.
Collaboration: In a collaboration, human and AI are working towards the same goal, e.g., to assemble a Lego model. This is typically the interaction type for experiments involving human-AI teams. Note that authors may report that their usecase investigates a `collaborative' task when we would classify it as influence.
Competition: In competition, human and AI are working towards different goals, e.g., in game-play. The notion of competition seems at odds with trust which, by definition, requires the AI to help the human achieve their goal. However if, for example, that goal is to be entertained by playing a game, the AI may be trusted to deliver a satisfying experience. Moreover, once the participant has played a competitive game with an AI, their value-based trust in it might be evaluated or their future trust assessed.
Conversation: Some usecases involve no more than a conversational exchange between the AI system and the human. Typically, such usecases evaluate trust involving speculative risk: having engaged in the conversational exchange, the human is asked whether or not they would trust the AI in some future situation. With the advent of large language models (LLMs), however, we may begin to see conversational interactions in which value-based trust is evaluated more directly based on the LLM's truthfulness.
Influence: We classify as influence any interaction in which the AI system expresses a preference. This includes not only (as perhaps expected) persuasion, nudging and deception, but also recommendations and the provision of advice. Influence is to be distinguished from collaboration, where the system may provide information but without expressing a preference, and therefore without applying pressure on the human to agree or comply.
Observation: Here, there is no interaction as such. The human observes the AI and, based on their observation, assesses its trustworthiness.
Provocation: Similar to observation but, in this case, the human is exposed to an AI system that deliberately provokes a trust-related response. We include this interaction type to accommodate `creative provocations'. For example, see arts-based approaches to the usecase library at https://www.stahrc.org/usecase.
Measurement
Mechanism/s by which trust is evaluated, transferable if trust could be assessed differently: behavioural, physiological, self-reported, externally assessed.
Behavioural: A behavioural trust measurement might involve the participant complying with a request, changing their mind in response to a recommendation or taking some action, such as intervening (or not) during a self-driving vehicle's parking manoeuvre.
Externally-assessed: Although in practice individual inspectors may interact with an AI system to gauge its trustworthiness, some usecases base evaluation not on any individual human's belief but on the assessment of an external entity, such as an ethics committee, governing body, or regulator, or the dictates of their rules, regulations or legislation.
Physiological: Currently, the majority of usecases encountered that use a physiological measure also rely on self-reporting, and the purpose of those studies is often to evaluate the measure itself (see below). If it can be established that trust may reliably be assessed through purely physiological means (e.g., through the participant's temperature, gaze, heart rate, etc.), the need for self-reporting will be obviated.
Self-reported: Though its value has been challenged (Miller, 2022), trust is frequently self-reported via questionnaires or, particularly if social science researchers are involved, using structured or unstructured interviews.
Pattern
Overarching dimension that identifies any recognisable pattern that the usecase conforms to, indicating its relationship with other similar usecases.
Foot-in-the-door. A trust-building paradigm, with sales-tactic connotations. The AI system aims to gain the human's trust in relation to a small thing, in order to gain their trust for a more significant thing later.
Prioritisation. A scenario that facilitates \ctype{behavioural} measurement: the human and AI system each prioritise a list independently. Having observed the AI's list, the human has the opportunity to re-prioritise, the premise being that the more changes they make, the more they trust the AI. This is the pattern used in the `Desert Survival Task', where participants must select five out of ten possible items (e.g., gun, water, flashlight, etc.) to take on an expedition.
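As a purely illustrative formalisation (ours, not prescribed by the pattern or by any particular study): let $S_0$ be the participant's initial selection, $A$ the AI's selection and $S_1$ the participant's revised selection. A simple behavioural trust proxy is then the number of the AI's choices that the participant adopts only after seeing them,
\[
\tau = \left| (S_1 \setminus S_0) \cap A \right| ,
\]
with a higher $\tau$ read as greater trust. In the Desert Survival Task, where five items are selected from ten, $\tau$ can range from 0 to 5.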
Prisoners' dilemma. The game theory classic whereby each prisoner can reduce their own sentence by testifying against their partner, but if both testify against one another, both end up worse off than if both had stayed silent. The game can be played once to evaluate instantaneous trust and/or repeatedly to evaluate changing (building, breaking and repairing) trust.
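For illustration only (the sentence lengths below are ours, not drawn from any particular study), a typical payoff structure is the following, where each cell gives the years served by (row player, column player) and lower is better:
\[
\begin{array}{l|cc}
 & \text{Stay silent} & \text{Testify} \\ \hline
\text{Stay silent} & (1,\,1) & (10,\,0) \\
\text{Testify} & (0,\,10) & (5,\,5)
\end{array}
\]
Mutual silence leaves both players better off than mutual testimony, yet each is individually tempted to testify; it is this tension that makes the repeated game a useful probe of how trust is built, broken and repaired.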
Reliability calibration. Under this pattern, the AI system makes a series of recommendations which the human does or does not accept. Meanwhile, the system's reliability is modulated directly (e.g., by giving incorrect advice or mis-classifying images) or indirectly (e.g., there is a change in simulated weather conditions when the AI system is relying on its cameras). This pattern is particularly well-suited to the evaluation of appropriate trust, which is regarded as having been achieved when the human's dependence on the AI system reliably varies in line with the modulations to its reliability.
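As a minimal analysis sketch (our illustration; the data structure, reliability levels and acceptance criterion are assumptions rather than part of any library usecase), appropriate trust under this pattern could be operationalised by checking whether the participant's acceptance of the AI's recommendations rises with the manipulated reliability level:
\begin{verbatim}
# Illustrative sketch: does the participant's reliance on the AI track
# the reliability level manipulated by the experimenter?
from statistics import mean

# Assumed data: one record per trial, giving the reliability level set for
# that block and whether the participant accepted the AI's recommendation.
trials = [
    {"reliability": 0.9, "accepted": True},
    {"reliability": 0.9, "accepted": True},
    {"reliability": 0.6, "accepted": True},
    {"reliability": 0.6, "accepted": False},
    {"reliability": 0.3, "accepted": False},
    {"reliability": 0.3, "accepted": False},
]

# Acceptance rate at each manipulated reliability level.
levels = sorted({t["reliability"] for t in trials})
acceptance = {
    level: mean(int(t["accepted"]) for t in trials if t["reliability"] == level)
    for level in levels
}

# On this crude measure, trust is appropriately calibrated if acceptance
# rises (or at least does not fall) as reliability rises.
calibrated = all(acceptance[a] <= acceptance[b]
                 for a, b in zip(levels, levels[1:]))
print(acceptance, calibrated)
\end{verbatim}
A real analysis would of course involve many more trials per reliability level and a suitable statistical test, but the principle---reliance tracking reliability---is the same.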
Suck-it-and-see. The most basic pattern for a self-reported trust usecase. The human undergoes some experience involving the AI system (e.g., as a passenger in a self-driving vehicle) then describes how they feel about it using a trust scale.
Trust game (aka investment game). In this classic design, involving financial risk and typically evaluating value-based trust, one player (human) invests an amount---whether money or other token---knowing that the amount they get back depends on the other player (AI system), the premise being that the amount invested indicates their level of trust.
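In one common parameterisation (our illustration; the notation and multiplier are not fixed by the usecases themselves), the human holds an endowment $E$ and invests $x$, which is multiplied by $k$ (often $k=3$) before the AI trustee decides how much, $r$, to return:
\[
\text{payoff}_{\text{human}} = E - x + r , \qquad \text{payoff}_{\text{AI}} = kx - r , \qquad 0 \le x \le E ,\; 0 \le r \le kx ,
\]
with the invested fraction $x/E$ serving as the behavioural measure of trust.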
Yes-no. Like prioritisation, this involves an influence interaction, but as a binary measure: the human is required to make a yes-no decision, with input from an artificial assistant.
Risk
Type of risk or vulnerability to which the human is exposed: environmental, financial, privacy, physical, psychological, speculative, task failure.
Environmental: Some usecases invite the participant to imagine real-world consequences more far-reaching than their personal wealth or well-being. The AI system with which they interact is to be imagined as controlling chemicals to be released into a water supply, for example.
Financial: The trust (or `investment') game, described above under Pattern, is the classic example of a scenario involving financial risk: one player (the human in a human-AI interaction) invests an amount of money---or other commodity of value---but the return they get on their investment depends on the other player (the AI system), the premise being that the amount they invest reflects their degree of trust.
Privacy: These scenarios involve the risk of confidential information or personal data being compromised, as might arise in a usecase evaluating how trust varies in response to changes in an online user interface, for example, or in an HRI scenario where a robot security guard demands the participant's ID.
Physical: Medical scenarios or those that involve assistance robots may involve a degree of physical risk and though, as noted above, there is no `real' risk to a passenger seated in an autonomous vehicle (AV) simulator, the danger they are expected to imagine when evaluating their trust in the AV is often physical: the consequences to their person should the vehicle crash.
Psychological: An experiment to assess whether explanations increase the trust of an autistic child, for example, involves psychological risk to the child \cite{araujo2022kaspar}. More generally, a participant might risk loss of self-esteem if they place trust in an AI that betrays them.
Speculative: As flagged, notwithstanding the definition of trust, some usecases employed for trust-related research do not, in themselves, involve any risk at all. In conversation or observation interactions, for example, although the scenario itself does not involve risk, it may enable the participant to assess whether, hypothetically, they would be prepared to trust the AI system under future---i.e., speculative---risk.
Task failure: A low-risk scenario nevertheless encountered frequently in trust-related research experiments. The participant depends on the AI system to complete a task: e.g., assembly of the Lego model in Rahman (2018). If the AI makes a mistake, is unreliable, breaks down, gives confusing or incomplete advice, or fails for any reason, the participant risks being unable to complete the task.
System type
Type of system the human is invited to trust, transferable if the usecase readily translates to a different type: embedded, robotic, robotic-AV, virtual.
Embedded: An embedded system is, essentially, invisible to the user. It has no visual representation and no distinct identity. Glikson (2020) offers the example of a search engine or a GPS map (p. 639), but other examples that arise in usecases include communication systems and Internet-of-Things devices such as healthcare monitors.
Robotic: Usecases featuring robots or any other embodied AI system---one with a physical presence---have a robotic system type. We reserve this category for robotic AI systems that are not autonomous vehicles (non-AV).
Robotic-AV: A subset of robotic systems that includes self-driving cars, unmanned aerial vehicles (UAVs), unmanned underwater vehicles (UUVs) and so on. Any usecase featuring drones or autonomous transport traditionally piloted (in situ or remotely) by a human falls into this category.
Virtual: Usecases that feature unembodied agents, such as chatbots or avatars---which may have a 2-dimensional representation---are virtual. Typically online agents of some kind, these systems, despite being unembodied, have a distinct personality (Glikson, 2020). The human in the scenario would feel that they were communicating with something or someone.
Test environment
Environment in which the scenario plays out, transferable if it could work elsewhere: in-the-wild, in-the-lab (artificial, immersive, audio-visual, in-person, online).
In-the-lab (ITL): We identify two broad test environments, in-the-lab and in-the-wild. ITL test environments are taken to include the several subtypes listed below.
In-the-wild: Some usecases are conducted in fully live environments such as shopping centres, caf\'es or hotel foyers.
ITL-Artificial: Ideally, the usecase would be conducted in-the-wild but for some reason---such as danger to the participant or to potential bystanders---a public setting is impractical. Artificial test environments include off-road setups or exhibition spaces.
ITL-Audio-Visual: Many usecases involve presentation of video. Here, the participant observes an interaction rather than participating in it. Nevertheless, a video experience can facilitate immersion, albeit to a lesser degree than when using VR.
ITL-Immersive: Because of risk or perhaps because the usecase involves a futuristic experience not possible in the real world, an immersive test environment may be preferred. Typically, simulation equipment is involved, such as a virtual reality (VR) headset or an AV simulator, and/or sophisticated software development.
ITL-In-person: Some usecases requiring neither an artificial, an audio-visual nor an immersive setup nevertheless require participants to attend the lab in person, e.g.,~to participate in a human-AI team-building exercise or if the scenario demands the use of special equipment.
ITL-Online: Experiments using these scenarios are typically conducted fully online, e.g., using platforms such as Prolific (https://www.prolific.com/) or Mechanical Turk (https://www.mturk.com/). An advantage of the online platform is the ability to present scenarios as `vignettes', that is, in purely textual format or with graphics, asking participants to imagine what might otherwise be a complex and expensive immersive experience.
Trust type
Trust can be defined as the belief that an entity will help you achieve your goal in a potentially risky situation. Commentators differentiate between two types: competency-based and value-based.
Competency-based: Sometimes described as `professional' or `cognitive'. The UC looks for reliability: whether the system seems fit for purpose. Distinguished from value-based trust.
Value-based: A UC testing for value-based or `moral' trust looks for human-like qualities, such as whether the AI seems fair or honest. Distinguished from competency-based trust.
Other Terms
Characteristic: This is a sub-category, a `type' of dimension. Under our classification system, each dimension (or `class') has a different set of possible characteristics (or `types').
Contributor: A registered user of the library, with permission to upload usecases or to comment on the usecases uploaded by others. Depending on the number of contributions made, different coloured medals may be awarded.
Dimension: Dimension is just another word for a class or category. We have classified usecases under four `core' dimensions (trust type, experience, interaction and risk) and four transferability dimensions (system type, test environment, measurement and application domain).
ITL: In-the-lab test environment
TAS: Trustworthy Autonomous Systems
Transferability: A fully transferable usecase is one that can be assessed using multiple metrics, applied to multiple system types, under multiple test environments and in multiple domains. To evaluate transferability, we ask in relation to each dimension: can the usecase operate in contexts other than that for which it was originally designed?
UC: Abbreviation for usecase, the term we use for an experimental scenario involving human/s and AI system/s, employed to evaluate some aspect of human trust or AI trustworthiness.
User rating: A subjective rating based on contributor evaluations of a UC's usability and transferability.