Constructing and validating authentic assessments: the case of a new technology‑based assessment of economic literacy

Background: Authentic situations are considered a source of learning due to their real-world relevance, which can encourage learners to acquire new knowledge. Increasing digitisation and associated resources, such as professional development opportunities for teachers, technology tools, or digital equipment for schools, enable the development and implementation of authentic assessments. The basic academic principles for acquiring economic literacy are already provided in lower secondary school. Using the example of a new authentic technology-based assessment (TBA), Economic Literacy-Assessing the Status Quo in Grade 8 (ECON 2022), this article examines the processes involved in constructing a TBA. The purpose is to develop a curricularly valid measurement instrument for surveying the current state of economic literacy in the 8th grade of a German federal state. This study explores which economic competencies students, typically between 14 and 15 years of age, possess in Grade 8, and what level of competence can therefore be expected of them at the beginning of a vocational training programme. The assessment is geared toward the curriculum of the subject of economics and is based on a domain model. This article presents the background and construction process for the development of ECON 2022 as a TBA.
Methods: To check the validity of test construction with a focus on the implementation of the authentic assessment and an analysis of difficulty-generating characteristics, the ECON 2022 test items were validated with an expert survey (N = 25). The two-stage data analysis comprised a descriptive quantifying analysis of the ratings of the difficulty-generating characteristics specificity, cognitive demand, and modelling and of the design criterion authenticity. A set of experts rated the criteria. The expert survey was then compared with a previously conducted rating by the research team. The analysis of free-text comments on individual items was carried out discursively and qualitatively by the research team. Both sources of information were used to adapt the test items to measured item difficulties from the field test. For this purpose, items of great difficulty were changed to slightly easier items. In this context, the paper focuses on two central research questions


Introduction
Economic literacy is considered a component of general education which should be specifically promoted through the introduction of the subject of economics, which was launched in the federal state of North Rhine-Westphalia (NRW) in the school year of 2020/2021 (Ministerium für Schule und Bildung des Landes Nordrhein-Westfalen [MSB] 2021). The launch of this new subject provided an occasion for developing a new knowledge test in the field of economic literacy. In this context, the project Economic Literacy-Assessing the Status Quo in Grade 8 (ECON 2022) offered an opportunity to develop just such a new test.
Instruments in the field of education have changed substantially in recent years, nationally and internationally (Loerwald and Schnell 2016). In a detailed systematic review, Welsandt and Abs (2023) analysed 26 test instruments published between 1990 and 2020 with a total of 1124 items that measure competencies in economics across all age groups. The review showed that assessments differ considerably in their content and focus, and that they usually emphasise a particular aspect of the subject rather than covering all economic areas equally. Tests are aimed mainly at assessing a person's ability to recall factual information and are designed for adults as well as young people. Remarkably, the development of authentic assessments has not been a central focus even in recent times (Welsandt and Abs 2023). However, the ever-increasing potential of the technology-based assessment (TBA) to display images, videos, and audio sequences as part of the assessment offers new opportunities for making test environments authentic (Janesick 2006; Jude and Wirth 2007; Koh 2017). The major advantage of authentic test environments lies in dynamically designed test items that relate to real situations and that are based on skills relevant to everyday life (Janesick 2006). To incorporate these aspects, the goal of this paper is to focus on the innovative development of an authentic TBA for Grade 8 students in the field of economic literacy. The work presented in this paper is part of the research project ECON 2022.
To effectively measure the individual level of economic literacy through a test, it is crucial to first establish a clear definition of what constitutes economic literacy (Loerwald and Schnell 2016). In line with Beck (1989), this study defines economic literacy as a multidimensional construct with a linguistic-argumentative or mathematical-analytical focus on the skills required to solve an economic problem. Test development was based on a domain model that has been derived from a systematic scientific and psychological analysis and tested for its curricular representativeness (Fortunati and Winther 2023a).
The assumption was that authentic situations and the simulation of familiar behaviour would lead to an increase in the students' ability to use their economic skills to solve the test items. Therefore, individual items were embedded in an authentic economic narrative. The technical test environment of ECON 2022 was implemented via the CBA Item-Builder (Kröhne 2023).
In this context, there is a lack of research findings on assessments in economic education that associate student performance in achievement tests with the implementation of a computer-based authentic assessment. In this paper, we seek to address the question of how the authenticity of a test environment is related to possible difficulty-generating principles at item level. We also seek to determine to what extent test results are affected by the authenticity of the test environment. Clarifying these relations can help to identify and minimise possible biases in the test results. It is important to determine the impact of the implementation of authentic assessments to avoid disadvantaging any participant groups. A thorough review of this relationship will allow for the development of fair and balanced testing procedures that take appropriate account of the diversity of the test population.
To ensure the validity of the ECON 2022 assessment, an expert survey was conducted in addition to an analysis of field test data (Fortunati et al. 2024). In accordance with Beck (2020) and Sangmeister et al. (2018), the expert survey evaluated items based on three difficulty-generating design principles: (1) domain specificity, (2) cognitive demand level, and (3) item modelling (Klotz et al. 2015; Winther 2010). Furthermore, both authenticity and usability of the TBA (Sangmeister et al. 2018) were surveyed.
The paper is divided into five sections. The "Introduction" section provides the introduction. The "(Authentic) Assessments of Economic Literacy" section presents the current state of the art in test instruments for measuring economic literacy, with a specific focus on authentic computer-based design, and gives an overview of the principles of authentic testing according to Janesick and Gulikers. The "Development of the ECON 2022 Assessment" section offers a detailed description of the development of the ECON 2022 assessment. The section begins with a theoretical and practical analysis of the design criteria for an authentic TBA, with a specific focus on how an authentic test environment can be implemented, and the importance of the lifeworld of the target group. Then the development of the ECON 2022 assessment is described with reference to the preceding considerations. The "Validation and Revision of the ECON 2022 Assessment" section presents the expert validation of the ECON 2022 test items. Finally, the "Discussion and Outlook" section discusses the results.

Tests of economic literacy
Everyday life is permeated by economic phenomena and problems that require economic literacy to comprehend and resolve. In the field of economic literacy, numerous test instruments have been developed in recent decades both nationally and internationally. These instruments often exhibit significant differences in terms of their conceptual understanding of economic competence, test design, and target audience. Welsandt and Abs (2023) inspected which questions and decision options existing test environments raise for the development of future tests. For this purpose, a systematic review was conducted to examine the similarities and differences between the instruments for measuring economic literacy that have emerged over the past 30 years, and the extent to which the focus of these existing test instruments has evolved or changed. The population intervention comparison outcome (PICO) model was deployed to conduct a systematic search (Sayers 2007). As a first step, the systematic review included all publications that used a measurement instrument or scale to assess basic economic literacy and reported on the original development or modification of a measurement instrument. Measurement instruments were excluded if they were not available in English or German, and if neither sample size nor Cronbach's alpha were reported. Altogether, 26 measurement instruments published between 1990 and 2020 were extracted; they included a total of 1124 items regarding economic literacy for all age groups. The analysis included survey format and technical implementation, year of publication, mode of implementation, response formats, content formats, the perspective of the economic subject dimension (Fortunati and Winther 2023a), the perspective of learning psychology (Marzano and Kendall 2007), and the perspective of authenticity (Janesick 2006). Table 1 lists the extracted measurement instruments that were included in the further analyses.
Overall, half of the test environments for measuring economic literacy included in the systematic review addressed children and young people under the age of 18. TBA was not a central feature of existing tests, and only 5 out of 13 test environments (38%) were computer based. With computer-based implementation, the focus was mostly on media support and on transferring tests from a paper-based to a computer-based format. The added value that the new setting might bring was not exploited. This was also evident in the choice of response formats, which were mostly limited to traditional single-choice, multiple-choice, or free-text answers (Welsandt and Abs 2023). Innovative answer formats, on the other hand, would include formats such as drag-and-drop items that allow test takers to move items around the screen to indicate their answer. Such formats can provide a more engaging and interactive experience for test takers while also gathering more detailed data for analysis, and this can be especially useful in tests that require spatial reasoning or sorting of items. Another innovative format would be the hotspot. In this format, test takers are presented with an image or diagram and asked to select a specific area by clicking on it. In a slider format, test takers select a value by moving a slider along a scale. This can be useful in tests that require numerical estimation or comparison. The innovative answer format of 'matching' presents two columns of items and asks the test taker to match them up. This can be useful in tests that require associations or pattern recognition. Lastly, there is also the possibility of more 'gamified' items. This format involves presenting test items in a game-like format such as quizzes, puzzles, or interactive simulations. This can be useful in engaging test takers and reducing test anxiety (Goldhammer and Kröhne 2020). Concept mapping is another innovative format. This format can be implemented in a less restricted form (create a map) or a more restricted form (skeleton map). A concept map is a node-link diagram in which each node represents a concept, and each link identifies the relationship between the two concepts it connects (Schroeder et al. 2018). Test takers have to relate (given) concepts and label or choose a label for each link between two concepts. The systematic review showed that such formats were rarely used.
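The node-link structure of a concept map can be made concrete with a small data structure. The following is a minimal Python sketch, not the ECON 2022 implementation: the class name `ConceptMap`, the function `link_score`, the scoring rule (share of reference links reproduced with the same label), and the example concepts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptMap:
    """A concept map as a node-link diagram (hypothetical helper class)."""

    # Maps a directed pair of concepts (source, target) to the link label.
    links: dict[tuple[str, str], str] = field(default_factory=dict)

    def add_link(self, source: str, target: str, label: str) -> None:
        self.links[(source, target)] = label

    def concepts(self) -> set[str]:
        """All concepts (nodes) that appear in at least one link."""
        return {concept for pair in self.links for concept in pair}


def link_score(response: ConceptMap, reference: ConceptMap) -> float:
    """Share of reference links that the response reproduces with the same label.

    This scoring rule is an assumption for illustration only.
    """
    if not reference.links:
        return 0.0
    hits = sum(
        1
        for pair, label in reference.links.items()
        if response.links.get(pair) == label
    )
    return hits / len(reference.links)


# Restricted ("skeleton map") variant: concepts and links are given,
# and the test taker only chooses a label for each link.
reference = ConceptMap()
reference.add_link("price", "demand", "influences")
reference.add_link("income", "consumption", "enables")
reference.add_link("saving", "interest", "earns")

response = ConceptMap()
response.add_link("price", "demand", "influences")
response.add_link("income", "consumption", "enables")
response.add_link("saving", "interest", "reduces")  # label differs from reference

score = link_score(response, reference)  # two of the three links match
```

In the less restricted variant (create a map), test takers would also add the links themselves, so a scoring rule would additionally have to decide how to treat links that do not occur in the reference map.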
The systematic review highlighted that measurement instruments for economic literacy prioritised mapping one domain at a time rather than mapping all domains together. Despite the high relevance attributed to implementing lifeworld references in measurement instruments, the level of incorporation of authenticity in test formats seems inadequate at this time. Measurement instruments that are fully integrated into authentic settings remain the exception. For example, although all of the five computer-based test environments had items with a lifeworld reference, only two of the test environments (15%) were embedded in an authentic setting (Welsandt and Abs 2023).
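The reported percentages can be checked with a few lines of arithmetic. The sketch below assumes that both shares use the 13 test environments for children and young people as their base (half of the 26 reviewed instruments); the variable and function names are illustrative.

```python
total_instruments = 26                         # instruments in the systematic review
youth_environments = total_instruments // 2    # half address under-18s: 13
computer_based = 5                             # computer-based test environments
authentic_setting = 2                          # embedded in an authentic setting


def share(part: int, whole: int) -> int:
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


share_computer_based = share(computer_based, youth_environments)          # 38
share_authentic = share(authentic_setting, youth_environments)            # 15
share_authentic_of_computer_based = share(authentic_setting, computer_based)  # 40
```

Under this assumption, the 15% refers to all 13 environments; relative to the five computer-based environments alone, the share would be 40%.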

Principles of authentic assessments according to Janesick and Gulikers
Authentic assessments consist of dynamic, real-life test items that are oriented towards abilities which are relevant to everyday life. Authentic problems are the origin of learning processes because of their strong connection with the lifeworld and because of their relevance, both of which motivate learners to gain new knowledge. Assessments should embed problems in authentic situations. The principle of the lifeworld reference increases the practical applicability for learners. Since learning tasks in the school context are always designed to be close to the real world at least in principle, it makes sense to implement this real world proximity in authentic assessments as well (Winther et al. 2022). Moreover, the didactic requirement of a test situation should ensure that the authentically conveyed learning content is queried in authentic test items (Klotz 2015). Authentic assessments require students to use their judgement to solve innovative items. Items require a specific set of student competencies in order to be solved. In authentic assessments, real-life situations are ideally simulated (Janesick 2006; Koh 2017).
Gulikers et al. (2004) developed five dimensions to evaluate the level of authenticity in an assessment. At the task level, the degree of complexity should correspond to the level of responsibility of the natural work environment. This includes integrating knowledge, skills, and attitudes, as well as the complexity and relevance of the task for the learners. The physical environment simulated in the assessment should resemble the actual workplace environment. Computer-based implementation can help to increase authenticity. The assessment should reflect social relationships and processes in authentic professional settings. Furthermore, performance should be the primary basis of assessment and mirror the competencies that students would exhibit in real-life situations. Students should have multiple opportunities to demonstrate these attributes and capabilities through various tasks. The assessment criteria should align with those applied in real workplace settings. The assessment's criteria and standards should be explicitly stated in order to ensure that students understand how their performance in a series of assessment tasks will be evaluated (Ersozlu et al. 2021).
Janesick (2006) has established six principles for authentic assessments. One, authentic assessments require students to demonstrate quality in performance or production, emphasising the significance of students' ability to apply knowledge effectively. Two, authentic assessments establish a strong connection between assessment tasks and the students' real-life experiences, ensuring relevance and practicality. Three, authentic assessments are characterised by their complex and multi-layered nature, requiring students to engage in diverse and interconnected tasks that mirror the complexity of real-life situations. Four, authentic assessments involve an ongoing process with multiple tasks. Five, authentic assessments seek to evaluate higher-order skills such as critical thinking, problem solving, and the application of knowledge in novel and meaningful ways. And six, complex feedback which is provided regularly plays a crucial role in authentic assessments because it allows students to self-adjust and improve their performance over time. By incorporating these principles, authentic assessments comprehensively analyse students' abilities and understanding beyond mere factual recall.
The development of ECON 2022 was based on principles 1-5. Various kinds of feedback can also be implemented as part of traditional individual assessment. Therefore, feedback is not treated as a necessary component of authentic assessment in the following analysis.
Authenticity has its origin in situated learning. Learning processes should be designed in such a way that the requirements they represent can be found in the real world, from which Winther (2010, p. 206) derived the requirement that test formats should also be authentic. Authentic assessments are directed at skills that are necessary for the lifeworld. These skills include the ability to solve problems, work independently, stay motivated, and regulate oneself while being aware of one's thought processes. Authentic assessments allow students to gain practical experience in using these specific skills and abilities, which are highly valued in the workforce (Villarroel et al. 2018, p. 2). Authenticity must be staged, which means the challenging situations must be modelled.
Designing and conducting authentic assessments also involves some challenges. For instance, implementation can lead to a considerable amount of extra work in test creation. Implementation requires time and financial resources and the acquisition of additional knowledge (Aziz et al. 2020, p. 763; Tanner 2001, p. 28). Depending on the design, authentic assessments produce an increased density of information and an increased processing effort because of their contextualisation in the target group's lifeworld. Moreover, the level of language competence required of students is often higher in authentic test environments. For example, in authentic assessments students are often asked to explain how they solved mathematical items. Although this provides important insights into the students' understanding of mathematics, it also requires excellent language skills (Tanner 2001, p. 28). It is therefore conceivable that authentic assessments in and of themselves can have their own difficulty-generating effect.
Against this background, this article focuses on the design of a technology-based test environment to provide an authentic assessment.

Quality criteria in assessment construction
In this paper, the term 'assessment' essentially denotes an instrument developed for collecting data about students' competences (Pellegrino et al. 2001). If assessment is understood as a process, it can include three steps: operationalising a valid construct, the actual testing, and interpreting test results (Klotz 2015, p. 68). Assessments vary with regard to multiple aspects, such as the mode of presentation, standardisation in stimulus materials, the response format, and the extent to which test materials are close to the test takers' lifeworld. Nonetheless, in all instances, tests need a standardised procedure for evaluating and scoring test takers' responses (AERA 2014, p. 2). The concept of reliability plays an essential role in the interpretation of test results. In this context, reliability refers to the consistency of results when a test procedure is administered several times, irrespective of the method or assessment (AERA 2014, p. 33). Validity is a further quality criterion, which is considered the most important aspect in developing and evaluating tests. According to the Standards for Educational and Psychological Testing (AERA 2014), validity is the question of the plausibility of the interpretation of test results; therefore, the main focus of this paper is interpreting the theoretical construct based on the test results (AERA 2014, p. 11). Furthermore, Mislevy and Riconscente (2005) identified two fundamental components of test construction that also had relevance for the creation of the ECON 2022 assessment: selecting test items with a clear reference to the aim of the assessment; and including reliability considerations. With tests in a school context, the aim of the assessment is therefore necessarily determined by curricula for general or vocational education. In addition, assessments should be oriented towards authentic, domain-typical learning and work processes.

Difficulty-generating design criteria
Concerning item construction, the three difficulty-generating criteria (specificity, cognition, and modelling) in the area of vocational assessments were followed to establish a connection between the requirements of occupation-specific action situations and cross-domain action situations of economic literacy and numeracy (Winther 2010). For the purposes of this study, a fourth design criterion was added, namely authenticity. Figure 1 below shows the decision trees of specificity, cognition, modelling and authenticity with the difficulty-generating criteria that differentiate between three difficulty levels.
Specificity is one of four criteria used for describing economic competencies, especially the difficulty of tests. In this paper, the understanding of specificity is built on the domain model (Fortunati and Winther 2023a), which, in turn, is based on subject-content theories. In line with Gelman and Greeno (1989), a distinction is made between domain-specific and domain-general content. The more specific the items are, the greater the requirement for comprehensive knowledge of economic concepts from multiple subdomains to solve the item. Conversely, domain-general items rely on generic knowledge and skill structures, which are prerequisites for tackling problem situations that bridge multiple domains. The transfer of general to domain-specific competencies can depend on the contextualisation. The findings of Hering et al. (2020, 2021), for example, show that the transfer of general mathematical competencies into the context of commercial vocational training seems to depend in particular on the contextualisation of the tasks. Regarding the probability of solving an item, levels 1 and 2 differentiate whether the teaching of economic knowledge was necessary or not. The items can be solved using general, economically relevant knowledge; they can also be solved at least partially without specific knowledge. Level 3, however, requires a combined knowledge of several economic subareas.
The second difficulty-generating characteristic is cognitive demand. The more cognitive resources are required to process a test item, the more complex the cognitive processes it engages. The taxonomies by Bloom et al. (1956) and Marzano and Kendall (2007) provided theoretical considerations regarding the level of cognitive demand of specific items and thus provided the basis for the decision tree on cognition. An item that can be assigned to level 1 can be solved solely by remembering and naming information; at level 1, knowledge only needs to be reproduced. A level 2 item requires information to be actively used, for example, by applying (calculation) rules or algorithms, or by making a decision. At level 3, data and results must be further interpreted and evaluated based on existing knowledge.
The modelling criterion represents the third difficulty-generating characteristic; it is based on cognitive load theory. It addresses the complexity of the presentation and perception of the item independent of the content difficulty of the item (Sweller et al. 1998). In other words, modelling seeks to measure artificial difficulties that occur independently of cognitive or content difficulty. For example, modelling features such as colour or presentation by means of audio, video, or continuous text could unintentionally influence the level of item difficulty. The decision tree for modelling focuses on the type and number of stimuli that might distract from a correct solution. If the approach is immediately obvious, it can be classified as level 1. If a distractor or audiovisual material is added that makes the item more difficult to solve, the item should be assigned to level 2. If several types of distractors and audiovisual material are used that could be misleading or distracting, the item can be assigned to level 3. By analysing item modelling, it is possible to monitor artificially created difficulties that are detached from the cognitive and content-related difficulty (Klotz 2015).
The fourth design criterion is authenticity. Authenticity must be created, i.e. the challenging situation for measuring the economic competencies of the students must be modelled. It is assumed that situations that are familiar to young people from their everyday life make an item more accessible. If an item presents an action situation that is familiar to young persons from their everyday life, it can be classified as a level 1 item. If it is a situation that is accessible to young people at least in theory, it can be assigned to level 2. If the action situation cannot be expected to be accessible to young people, the item is assigned to level 3.
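The level assignments described for the four criteria can be read as simple decision rules. The following Python sketch illustrates that reading only; it does not reproduce the decision trees in Figure 1. All function and parameter names are hypothetical, and two mappings are assumptions: that specificity level 2 means taught economic knowledge is necessary, and that the modelling levels can be derived from a count of hindering stimulus types.

```python
def specificity_level(needs_taught_knowledge: bool, n_subareas: int) -> int:
    """Level 3: combined knowledge of several economic subareas.
    Levels 1 and 2: whether taught economic knowledge is needed (assumed order)."""
    if n_subareas > 1:
        return 3
    return 2 if needs_taught_knowledge else 1


def cognition_level(uses_information: bool, evaluates_results: bool) -> int:
    """Level 1: reproduce knowledge; level 2: actively use information;
    level 3: interpret and evaluate data and results."""
    if evaluates_results:
        return 3
    return 2 if uses_information else 1


def modelling_level(n_hindering_stimulus_types: int) -> int:
    """Level 1: approach immediately obvious; level 2: one distractor or
    audiovisual element; level 3: several types (count-based rule is assumed)."""
    if n_hindering_stimulus_types <= 0:
        return 1
    return 2 if n_hindering_stimulus_types == 1 else 3


def authenticity_level(familiar_from_everyday_life: bool,
                       accessible_in_theory: bool) -> int:
    """Level 1: familiar situation; level 2: accessible at least in theory;
    level 3: not expected to be accessible to young people."""
    if familiar_from_everyday_life:
        return 1
    return 2 if accessible_in_theory else 3


# Hypothetical item: an everyday purchase that requires applying a calculation
# rule, presented with one distractor and needing no taught economic knowledge.
rating = {
    "specificity": specificity_level(needs_taught_knowledge=False, n_subareas=1),
    "cognition": cognition_level(uses_information=True, evaluates_results=False),
    "modelling": modelling_level(n_hindering_stimulus_types=1),
    "authenticity": authenticity_level(familiar_from_everyday_life=True,
                                       accessible_in_theory=True),
}
```

Expressing each criterion as a separate function mirrors the idea that the four criteria are rated independently of one another, so that an item can, for example, be cognitively demanding while still being highly familiar.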

Target group: vocational education and training
In vocational and economic education, learning an occupation and the associated acquisition of competencies perform an important function for an individual's social integration (Beck et al. 1976). For vocational learning processes, school-based economic literacy education supports the development of area-specific competencies among trainees. The lifeworld environment shapes the competence acquisition process (Lempert 2009).
Recent school and curriculum reforms in the German federal state of NRW have tried to strengthen the vocational preparation of students. For example, from the 2014/2015 school year onwards, the initiative 'Kein Abschluss ohne Anschluss' [Guaranteeing next steps for school leavers: No school leaving certificate without subsequent opportunities for employment or qualification] mandated that all students in Grade 8 complete internships to explore occupational fields, which should be prepared and followed up at school (Ministerium für Arbeit, Gesundheit und Soziales des Landes Nordrhein-Westfalen 2020). Moreover, the introduction of the subject of economics focused the content of social science lessons much more strongly on vocational preparation. The ECON 2022 project took these initiatives as a starting point and targeted the additional part of the curriculum, which is specifically designed for vocational preparation and opening up connections with commercial training programmes.
Vocational education research has shown a clear predictive influence of domain-related economic competencies on the development of vocational competencies in commercial administrative professions (Achtenhagen and Winther 2008). Economic competence refers to the ability to successfully navigate situations that have economic implications, such as those related to the personal-financial, professional-entrepreneurial, and socioeconomic areas of life (Fortunati et al. 2024). This requires knowledge, skills, and abilities to understand and analyse economic problems in a specific context, develop solutions, make informed decisions, and reflect on actions taken. Previous studies on economic literacy focused on upper secondary school students and took a predominantly economic perspective in terms of content (Ackermann 2019). There is little empirical evidence regarding the development and structuring of economic literacy education at lower secondary schools as an important recruiting arena for commercial vocational training (Seeber et al. 2014). Accordingly, the assessment developed in the context of the ECON 2022 project has been specifically designed to be conducted in the preliminary stages of vocational training. The assessment helps to determine what business-related competencies students already have and what knowledge and skills can therefore be expected of them at the beginning of an apprenticeship.

Domain modelling and item construction
In constructing an assessment, ideas regarding a theoretical model and its output must be transferred into an appropriate assessment instrument. In addition to developing and compiling the items and the required materials, the measurement model must be reviewed, the scoring procedure prepared, and implementation techniques tested (Winther 2010). The curriculum-instruction-assessment triad (Pellegrino 2012) stipulates that test design should focus not only on valid test and item construction, but should also make continuous reference to the goals and content defined by the curriculum. The assessment must be tailored to the content of the school curriculum, which in turn is geared towards the learning fields. Accordingly, assessments must not only be coherent in themselves, but must also be meaningfully anchored within the entire education system, i.e. aligned with the curriculum (Klotz 2015, p. 48). According to Achtenhagen and Winther (2009), subject-didactic modelling of economic competencies is of great importance in constructing assessments, particularly as such modelling also addresses the subjects' process knowledge. To implement subject-didactic modelling, items should logically relate to each other in chronological order rather than depicting isolated partial aspects (Klotz 2015, p. 68). To construct a lifeworld reference for the target group, the knowledge, skills, and abilities of the target group must be recorded in authentically modelled situations.
Economic literacy is defined by Beck (1989) as a three-dimensional concept: (1) economic knowledge and cognition, (2) attitude towards economics, and (3) economics-related moral reflectiveness. Economic literacy is a prerequisite for achieving economic autonomy and participating in an evolving society. Individuals should be able to take part in society by developing their knowledge, skills, and abilities; they should understand and assess economic contexts which are located in the personal-financial, professional-entrepreneurial, and socioeconomic spheres of an individual's life, and make decisions (Beck 1989, p. 581). Knowledge in this context includes understanding fundamental elements of the economic world and learning about risks that can threaten economic well-being. Skills encompass generic cognitive processes within an economic context, such as information retrieval, comparison, extrapolation, and evaluation. They also entail fundamental mathematical and language abilities (OECD 2019). Skills can be understood as automatable action sequences that are performed routinely. Abilities comprise comprehensive mental tools with which a person can cope with challenges in particular situations. In such situations, a routine cannot simply be applied but must be constructed for the situation. Competencies can be defined as complex combinations of abilities and skills which are the cognitive prerequisites for coping with specific lifeworld situations (Klieme et al. 2008; Hartig and Rauch 2008). Economic literacy encompasses all of these areas; although fundamental, pure knowledge or the execution of a routine is not sufficient on its own. For example, in an arithmetic task, knowledge is needed, and routines can be performed.
Following Ackermann (2019), Fortunati and Winther (2023a) developed a domain model of economic literacy divided into three domains of life in which individuals are confronted with economic situations. The personal-financial domain encompasses everyday economic conditions from the consumers' perspective and addresses the responsible management of personal finances. The professional-entrepreneurial domain contains economic challenges that individuals face in the workplace, which can be further categorised into general, occupational, and cross-occupational situations. The socioeconomic domain focuses on economic issues of high abstraction and complexity, often interconnected with political contexts. Furthermore, this domain concerns all citizens of a country (Fortunati et al. 2024). Sustainability is a cross-cutting dimension (Birindiba Batista et al. 2022) and is becoming increasingly important at both individual and corporate levels (Corsten and Roth 2012) from social and educational perspectives. It is being discussed in vocational (Haan et al. 2021; Rebmann and Schlömer 2020) and socioeconomic education (Schank and Lorch 2018) based on a holistic perspective. Education for sustainable development is a fixed curricular component and serves as a cross-sectional dimension with regard to economic content dimensions (KMK 2016). The ECON 2022 assessment focuses on sustainability as an overarching political concept, its implementation in the economic system, and its significance at the level of individual behaviour. Individual attitudes, which are not part of the test, are assessed in the questionnaire.

Technology-based assessments for measuring economic literacy and technical implementation in ECON 2022
The first TBAs were developed as early as the 1980s. Since then, the media used and the preparation of the content have changed. In addition to selecting the medium for delivery, the very design of the test environment and test items plays a much more important role (Steger 2019). For economic literacy, implementing a TBA is suitable for simulating an authentic lifeworld and realistic situations in which economic literacy applies (Winther and Achtenhagen 2009). A TBA is complemented by technology-based test construction and offers innovative possibilities for measuring knowledge, skills, and abilities. One aim of using TBAs for economic literacy can be to measure economic citizenship competencies, i.e. individuals' ability to understand and assess economic contexts and to form their own opinions based on their knowledge. The term TBA is a generic term for computer- and smartphone-based assessments (Steger 2019). Digital technologies such as laptops, tablets, and smartphones have become indispensable tools for competence measurement. They make it possible to collect data that goes beyond the answers to the items. Innovative use of digital technologies goes beyond simply digitising paper questionnaires in computer-based test formats and involves integrating multimedia elements or interactive tools. However, it must also be pointed out that so far, only a minority of existing tests have used digital technologies (Welsandt and Abs 2023). Therefore, a digital implementation of a test instrument is already innovative in and of itself. In TBAs, more innovative answer formats can be used than in paper-based tests, and multimedia elements can be incorporated (Goldhammer et al. 2020). Furthermore, technology-based implementation enables the collection and analysis of processing data, i.e. data that allows conclusions to be drawn about item processing in addition to evaluating results data. Computer-based testing produces log data in log files (Goldhammer et al. 2020; Kögler et al. 2020). Analysing log data seems to be a suitable procedure for inspecting the test takers' effort in processing the items. Furthermore, from a didactic perspective, TBAs open up relevant possibilities for the analysis of cognitive processing.
During the process of developing the assessment, work was carried out in parallel on the technical implementation and the construction of the content. Functions and answer formats for the technology-based test environment were defined. This made it possible to integrate additional help tools such as a virtual notepad, a calculator, and a help button to explain functions in the assessment. In addition to classic answer formats such as single-choice items, multiple-choice items, and free-text fields, more innovative formats were incorporated, including video sequences, sliders, drag-and-drop items that work by moving, combining, and placing different elements, or items in which incorrect answers must be crossed out. In the preliminary design stages, a suitable program for implementation was investigated. An analysis was carried out to compare the programs H5P (H5P, 2022) and CBA ItemBuilder (Kröhne 2023). In light of considerations around data protection, the availability of process data, and the promise of technical support from the Leibniz Institute for Research and Information in Education, the ECON 2022 authentic TBA was implemented using the CBA ItemBuilder.
ItemBuilder is an authoring tool for creating dynamic and interactive items for technology-based tests. The software was developed by the Centre for Technology-Based Assessment at the Leibniz Institute for Research and Information in Education. ItemBuilder allows editing of test items in a user interface and enables innovative item formats. Automatic scoring can be implemented using predefined rules. The delivery of final test environments from ItemBuilder can take place as a software package on a personal computer or USB stick, as a virtual machine, or online (Kröhne 2023). For the ECON 2022 project, the assessment was delivered at schools using a USB stick. Automatic scoring took place after the assessment had been completed.
ItemBuilder was chosen because the availability of processing data would enable conclusions to be drawn about item processing. During processing, all user inputs and a time stamp are stored. One aim of the ECON 2022 assessment was to reconstruct test behaviour and the interaction between the test taker and the assessment. This is possible because computer-based test environments offer new possibilities for capturing and describing problem-solving processes due to the extensive data gained from the users' interactions with the test environment (Rausch et al. 2017, p. 569). Analyses of log data can make individual solution strategies visible (Rausch et al. 2017, p. 569). Log data is event based. Events are always linked to a test person and can refer to the content of the test or the test level (Kroehne and Goldhammer 2018, p. 533). User events such as the use of buttons, links, menu items, text input, or scrolling are made visible (Goldhammer et al. 2021). As the test person determines which interactions to carry out, it is possible to draw conclusions about their problem-solving strategies. Design and usability play an important role in log data analysis and influence the options for interpreting the log data (Kögler et al. 2020).
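The event-based structure of log data can be illustrated with a minimal sketch that derives time on item and the number of interactions per test person from time-stamped events. The field names, event types, and sample values below are hypothetical and chosen for illustration only; they do not reproduce the actual CBA ItemBuilder log format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    # Hypothetical fields; not the actual ItemBuilder log schema.
    person_id: str
    item_id: str
    event_type: str   # e.g. "item_enter", "video_replay", "option_select"
    timestamp: float  # seconds since the start of the test


def summarise_log(events):
    """Return {(person, item): {"time_on_item": s, "n_interactions": n}}."""
    grouped = defaultdict(list)
    for ev in events:
        grouped[(ev.person_id, ev.item_id)].append(ev)

    summary = {}
    for key, evs in grouped.items():
        evs = sorted(evs, key=lambda e: e.timestamp)
        # Navigation events mark the item boundaries; all other events
        # are counted as interactions with the item.
        interactions = [e for e in evs
                        if e.event_type not in ("item_enter", "item_leave")]
        summary[key] = {
            "time_on_item": evs[-1].timestamp - evs[0].timestamp,
            "n_interactions": len(interactions),
        }
    return summary


# Invented sample events for two test persons working on one item.
events = [
    LogEvent("P01", "U2_I1", "item_enter", 0.0),
    LogEvent("P01", "U2_I1", "video_replay", 12.0),
    LogEvent("P01", "U2_I1", "option_select", 30.0),
    LogEvent("P01", "U2_I1", "item_leave", 42.5),
    LogEvent("P02", "U2_I1", "item_enter", 0.0),
    LogEvent("P02", "U2_I1", "option_select", 8.0),
    LogEvent("P02", "U2_I1", "item_leave", 9.0),
]

for key, stats in summarise_log(events).items():
    print(key, stats)
```

Indicators of this kind, such as a very short time on item with a single interaction, are only a starting point; how they may be interpreted depends on the design and usability of the test environment.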

Assessment insight: incorporating authenticity regarding young people's economic opportunities for action
The initial focus of the ECON 2022 test development was the construction of authentic problem situations drawn from the lifeworld of the target group. The authentic test situation was based on didactics of economics to ensure curricular validity (Fortunati and Winther 2023a). Based on the domain model developed for the project (Fortunati and Winther 2023a; Mislevy and Riconscente 2005), a framework for item development was created with the help of extensive curricular analyses which linked the content concepts to cognitive processes. Items were based on the content of the current curriculum for economics in NRW. Individual items refer to certain aspects of knowledge and capture ways of processing. The economic competencies measured had to be found in the lifeworld and have clear relevance for managing everyday life. Display formats that would be authentic from the perspective of the lifeworld of 14-year-old students were researched in two ways. First, test instruments for measuring economic literacy were systematically researched and analysed; and second, relevant teaching and learning materials were researched to gain familiarity with the common display formats. In addition, the display formats had to be anchored in the curriculum. When designing the items, care was taken to ensure that an immersive experience was possible. The term 'immersion' can be understood as the act of being completely absorbed into a virtual environment. Implementation using ItemBuilder enabled not only a realistic visualisation but also the inclusion of audio, video, and interactive elements to enhance authenticity. The possibility of experiencing digital content authentically in turn leads to high immersion (Wirth et al. 2007).
The modality of the ECON 2022 assessment included information intake, which took place visually and aurally. Depicting a situation close to the lifeworld was intended to enable the test persons to put themselves in the situation and identify with it. The action situations and work activities that were determined to be relevant for the target group were depicted in an authentic test environment. The didactic items involved a realistic representation of the test design tailored to the target group and a realistic representation of actual problems at item level. The digitalised design of the test enabled dynamic, innovative, and interactive item development. It was possible to embed multimedia content such as video and audio sequences (Jude and Wirth 2007, p. 49).
The ECON 2022 assessment was developed drawing on the characteristics of an authentic assessment as described by Gulikers et al. (2004) and Janesick (2006) (see "Principles of Authentic Assessments according to Janesick and Gulikers" section). The narrative of the ECON 2022 assessment depicts a concrete economic situation, namely a visit to a supermarket. Two protagonists, Kim (female) and Juri (male), are introduced to the target group as two school friends who are both 14 years old. Setting the protagonists' age at the age of the target group was designed to enable test participants to identify with the situation. In the test scenario, Kim and Juri are going grocery shopping in the supermarket. The framework situation of 'going shopping in the supermarket' is repeatedly interrupted, for example, by social media messages, associations with their schoolwork, or calls from classmates. These interruptions constitute eight individual units. A unit represents a coherent section of the test content and can consist of several items (Leutner et al. 2008). In the units, different economic problems are addressed in specific items (see Table 2). An item is the smallest element of analysis within the test (Leutner et al. 2008). The content of the domain model was implemented with specific items in each situation. Each unit contains two to six items. The 36 items represent the domain-related content and cognitive requirements in a balanced way.
Table 2 shows the eight economic situations (units), each representing distinct content emphases for the target audience. Each of the situations should be considered from multiple perspectives. In each unit, students are presented with multiple items that examine these situations from various perspectives. Alongside the item content, Table 2 also indicates the question type. Economic situations can be modelled using a linguistic-argumentative or mathematical-analytical approach. Economic literacy and numeracy are considered as domain-specific areas of economic knowledge that represent basic skills for (economic) vocational action (Winther 2010). The curricular representation of the domain model was examined by analysing 31 curricula for economic education at lower secondary school drawn from 10 different federal states and school types in Germany.
All situations of the items tie in with a video-based introduction to the framework situation and its progression between items. Sequencing makes it possible to structure complex issues individually (Bley et al. 2015). For example, in the ECON 2022 assessment, video sequences can be viewed repeatedly. Figure 2 shows a screenshot of the video, which contains spoken texts, sound, and subtitles (Jude and Wirth 2007; Finken et al. 2017). The test person emulates the typical stations that a visit to the supermarket entails.
In addition to developing an authentic framework situation, it was also important to develop stimulating items and to choose realistic item formats. With regard to signalling, i.e. directing the focus to relevant elements, such elements were specifically highlighted. Unimportant details were omitted to avoid redundancy (Bley et al. 2015). It was crucial not only that the entire test situation could be found in the students' lifeworld, but also that individual items were authentic. Items were distinguished according to whether they represented an action situation that was directly drawn from the young people's everyday life, an action situation that would be accessible from the young people's perspective even if it was not directly drawn from their everyday life, or an action situation that was altogether unfamiliar or alien to the young people. In addition to visualising the structure of the assessment, Fig. 3 shows an example of the design of Item 1 from Unit 2. First, there is a video-based introduction to the item battery that constitutes Unit 2. The subjects are guided to the supermarket, where they receive a message on the imaginary social media platform Picturegram. An influencer, who is introduced as a favourite influencer, is promoting a smartwatch. The item was designed to explore young people's understanding of how online advertising strategies can influence purchasing decisions. Correct answer options are arguments that mention being directly addressed, belonging to a community, or pressure to act quickly because the offer is due to expire shortly. In curricular terms, the item can be assigned to content area 1, 'Economic activity in the market economy' (MSB 2019, p. 20). The aim of this area is to develop the judgement competence of being able to assess the influence of advertising and social media on one's own consumer behaviour. In addition, the item can also be assigned to content area 8, 'Acting as consumers' (MSB 2019, p. 16); this content area deals with purchasing decisions in the digitalised world, whereby one focus is the influence of advertising on purchasing decisions. The item can be assigned to the personal-financial area in the domain model. Access is linguistic-argumentative. Scanning the QR code shown in Fig. 3 will give the reader access to the exemplary test environment.
In terms of difficulty-generating criteria, this item can be assigned to level 2 for the principles of specificity, cognition, and modelling. The criterion of specificity assesses the expertise that the item requires. The probability of solving this item is expected to be higher if students have attended lessons on economics-related subjects up to Grade 8. The criterion of cognition assesses the level of comprehension that the students have to demonstrate to answer the question correctly. To solve the item, individual solution steps must be applied. It is not possible to solve the item simply by reproducing pure factual knowledge. Modelling is also assigned to level 2 since the item contains audiovisual material that could distract from solving the item. For the criterion of authenticity, the item presented in Fig. 3 corresponds to level 1 and thus represents the students' lifeworld.
The targeted design of tasks can avoid cognitive overload (Sweller et al. 1998). In this context, continuity is based on the maximum processing capacity of humans and can be optimised by presenting task formats and contents in a systematic, clear, and well-structured manner (Bley et al. 2015). The ECON 2022 assessment is designed in a consistent, structured style and offers helpful tools throughout that can be easily opened.
The assumption was that authentic situations and the simulation of familiar behaviours would enable students to use their economic competencies more effectively in a TBA than in a paper-and-pencil questionnaire. Paper-and-pencil surveys require comprehensive descriptions to establish a connection with everyday reality. Too much reading can require excessive concentration and can promote cognitive overload and even failure at the items (Bley et al. 2015, p. 4). All materials were adapted to reduce the linguistic complexity and adjust them to the language competences of the target group. The materials were based on the lifeworld of the target group not only linguistically but also aesthetically (Bley et al. 2015). Establishing the lifeworld reference via videos instead of texts can lead to a reduction in the cognitive load of the subjects (Bley et al. 2015). Subtitles were used in the ECON 2022 assessment to engage equally those students with a high level of reading proficiency and those with reading difficulties. Students with reading difficulties could benefit because the audio-visual load was comparatively lower than the reading load for the same amount of information.
Regarding the psychometric quality of the test instrument, Fortunati and Winther (2023b) found sufficient empirical evidence of measurement accuracy for the construct of economic literacy. The following statements are based on field test data from the ECON 2022 assessment: the assessment can be evaluated as empirically reliable, valid, and fair for Grade 8 students. Adams and Khoo (1996) suggest a range of values between 0.75 and 1.33 for item fit (wMNSQ), while large-scale assessments like PISA consider stricter values between 0.85 and 1.15 as appropriate (OECD 2020). All items except FT22 meet the strict PISA value. The t values exhibit a dispersion ranging from -5.20 to 5.4. Furthermore, no significant gender or language differences were observed at test level in the differential item functioning (DIF) analyses. A significant DIF effect was found for both migration background and socioeconomic background. The DIF effect was 0.282 for migration background and 0.429 for socioeconomic background, both of which can be considered low according to the classification by Paek and Wilson (2011).
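The two item fit ranges mentioned above can be applied as a simple screening rule. The following sketch classifies a weighted mean square (wMNSQ) value against the stricter range of 0.85 to 1.15 and the wider range of 0.75 to 1.33; the function name, the labels, and the sample values are illustrative assumptions and are not taken from the field test data.

```python
STRICT_RANGE = (0.85, 1.15)   # stricter range used in large-scale assessments
LENIENT_RANGE = (0.75, 1.33)  # wider range suggested by Adams and Khoo (1996)


def classify_item_fit(wmnsq):
    """Classify a wMNSQ value against the two ranges (bounds inclusive)."""
    if STRICT_RANGE[0] <= wmnsq <= STRICT_RANGE[1]:
        return "within strict range"
    if LENIENT_RANGE[0] <= wmnsq <= LENIENT_RANGE[1]:
        return "within lenient range only"
    return "outside both ranges"


# Invented wMNSQ values for three hypothetical items.
sample_fit = {"item_a": 1.02, "item_b": 1.21, "item_c": 1.40}
for item, value in sample_fit.items():
    print(item, value, classify_item_fit(value))
```

Treating the bounds as inclusive is an assumption of this sketch; the cited sources should be consulted for the exact handling of boundary values.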

Expert validation of the test
Content validation is an important part of test development, but it is often neglected (Ollesch et al. 2018, p. 129). The quality of a test depends to a large extent on the fulfilment of quality criteria. Specifically, the alignment of the theoretical construct with the actual test, in terms of validity, is important (Loerwald and Schnell 2016). To check the validity of the ECON 2022 test construction with a focus on both the authenticity of the assessment and item difficulties, the developed test items were validated using the framework of an expert rating. To select suitable candidates for the expert validation, researchers with research interests in test construction, economics, didactics, psychology, and competence development were invited to take the survey.
The resulting sample included a total of N = 25 experts with expertise in test development (n = 10), economics, i.e. economics or business education or business psychology (n = 11), and schools and teaching (n = 12). Individual experts could be assigned to two groups if they had attributed expertise in two areas to themselves, which is why the group sizes add up to more than 25. The 25 experts thus represented expertise in the three fields of action. The validation study was based on the design criteria of specificity, cognition, modelling, and authenticity. Experts assessed the items using the associated decision trees.
The use of expert surveys allows a model to be validated after development and implementation and thus makes it possible to check the decisions made during test development (Offergeld 2011, p. 197). The validation study for the ECON 2022 project collected ratings based on the four design criteria (Beck 2020) and the usability of the assessment (Sangmeister et al. 2018). The expert ratings served as external verification of the authenticity of the ECON 2022 assessment and individual test items. Data analysis of the expert survey took place in two steps: first, the experts' rating of the four design criteria was analysed descriptively and quantitatively; and second, the expert rating was compared with a previously conducted self-rating. The analysis of free-text comments on individual items was carried out discursively and qualitatively by the test development team. Using both sources of information, the test development team reviewed the test items and fine-tuned their alignment with the domain model. Table 3 lists the authenticity values arising from the expert rating.
Table 3 shows that the items were mostly assessed as authentic by the experts with the exception of items 8_1 and 8_2. Authenticity is not an all-or-nothing decision but a staged assessment in which we distinguish between basal and overall authenticity levels in a fluid transition. Therefore, we conducted a relative comparison of the items. Cut-off values were formed from the experts' ratings. According to the experts, only two items did not represent an action situation accessible to the students and were rated with a rather low level of authenticity. The values here were above 2.30. A further 16 items were given a rating of medium authenticity. The values here ranged from 1.51 to 2.30. The items therefore represent a situation that students could think of as potentially accessible for them in the future. Next, 17 items were rated as very authentic, i.e. as an action situation that reflected everyday life. Here the values were below 1.5. The two items which were rated as non-authentic by the experts required review and adaptation. However, since these were the last two items of the assessment and thus formed the conclusion of the test, it seemed justifiable to loosen their reference to the lifeworld even further to generate items that were more strongly geared towards reflection on economic systems. Selecting and adapting the items based on the expert ratings increased content validity.
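The cut-off values described above can be expressed as a small classification rule that maps a mean expert rating to one of the three authenticity levels. The sketch below is illustrative only: the function name is hypothetical, and the assignment of a value of exactly 1.50 to level 1 is an assumption, since the reported ranges (below 1.5; 1.51 to 2.30; above 2.30) leave this boundary open.

```python
def authenticity_level(mean_rating):
    """Map a mean expert rating to an authenticity level.

    Level 1: very authentic (everyday action situation)
    Level 2: medium authenticity (accessible action situation)
    Level 3: low authenticity (unfamiliar action situation)
    """
    if mean_rating <= 1.50:   # assumption: exactly 1.50 counts as level 1
        return 1
    if mean_rating <= 2.30:
        return 2
    return 3


# A mean rating of 1.76 falls into level 2, a mean rating of 1.42 into level 1.
for value in (1.42, 1.76, 2.45):
    print(value, "-> level", authenticity_level(value))
```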
Further, Table 3 reveals a difference between the expert ratings and the self-ratings in 18 of 34 items. It is striking that 16 of the 18 items that showed a deviation were rated as more authentic by the experts, and only two items were rated as less authentic in comparison to the self-rating. An example of an item that was rated as less authentic by the experts was item 1 from Unit 3. In this item, authenticity was generated by setting the protagonists, Kim and Juri, a homework item that involved designing a poster defining the term sustainability. In the self-rating, the item was assigned to level 1, indicating the item corresponded to an everyday action situation for the students. Overall, however, the expert rating for this item had a total value of M = 1.76, which would prompt the item to be assigned to level 2. The experts with economics expertise (n = 11) rated the item with M = 1.55. It can be assumed that these experts are more familiar with the students' lifeworld than the test developers (n = 10), who rated the item with M = 2.2. The experts with expertise in school and teaching (n = 12) rated the item with M = 1.42. This rating indicated that such items occur in the children's everyday school life. A limitation here was the fact that none of the experts had expertise focused solely on the young people's lifeworld. The development team interpreted the rating results as a call for a revised definition of the item. In the item revision process, the team reviewed the item and considered scenarios of how a poster could be designed differently.
Item 5 from Unit 6 deals with currency conversion. A price comparison of headphones and the use of a currency calculator were modelled as an authentic situation. Here it is noteworthy that although the expert ratings can be assigned to level 2 overall, indicating that the action situation should be accessible to young people even though it may not be an everyday situation, the experts with expertise in school and teaching represented this opinion most strongly. The self-rating of this item generated an attribution to level 1 because it was assumed that with the rise of the internet and e-commerce and given the permanent use of social media and mobile devices, it is now easier than ever for people to shop online from retailers all over the world. This means that 14-year-olds might be interested in purchasing items from international retailers trading in a currency other than the students' homeland currency, and a currency calculator can help them understand how much items would cost in their local currency. Table 3 also offers a comparison of ratings by experts' area of expertise. A closer look at the rating differences in Table 3 reveals a high level of agreement between the expert groups. The maximum difference between the three expert groups was above 0.5 for only four items. The greatest difference was between experts with expertise in school and teaching and experts in test development.

Authenticity as a difficulty-generating characteristic
The realisation of an authentic test situation requires the implementation of multimedia content. Identifying with the setting should not generate difficulty for the test taker. This section examines whether authenticity is a difficulty-generating feature like specificity, cognition, and modelling, or whether it is purely a design criterion.
Table 4 shows the correlations between the expert ratings of the difficulty-generating characteristics and the expert rating of authenticity at item level (N = 34). The results reveal that there was a medium-strong correlation according to Spearman's rho between authenticity and specificity (0.477**), cognition (0.445*), and modelling (0.371*), which was significant in all three cases. The low significance level can be explained by the limited number of only 25 expert ratings. Expert ratings were related to the difficulty of the individual test items. To analyse the data from the field test, a polytomous 1PL-IRT model, the multidimensional random coefficients multinomial logit model (Adams et al. 1997), was selected and scaled using ACER ConQuest (Adams et al. 2018). According to theory, the three characteristics of specificity, cognition, and modelling should have shown a positive correlation with item difficulty; however, only cognition (-0.364*) showed a significant correlation with the measured item difficulty. As expected, authenticity (-0.269) showed no significant correlation with actual item difficulty. To exclude the possibility that the result was only an artefact based on rater bias, a second analysis was carried out with z-standardised expert assessments. Here, the mean value of all expert assessments in one criterion was set to 0, and the individual expert assessments were then included in standard deviation proportions. In this second analysis, authenticity was also independent of the measured difficulty (-0.41) and correlated with specificity (0.430*), cognition (0.403*), and modelling (0.342*).
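The two analysis steps described above, z-standardisation of expert assessments and rank correlation with item difficulty, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure or software used in the study. All ratings and difficulty values are invented, and standardising each expert's ratings separately is an assumption of this sketch, since the exact grouping used for standardisation is not specified.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1 to n; tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def z_standardise(ratings):
    """Centre ratings at 0 and express them in standard deviation units.

    Fails with a division by zero if all ratings are identical.
    """
    m, s = mean(ratings), pstdev(ratings)
    return [(r - m) / s for r in ratings]


# Invented authenticity ratings of three experts for five items.
ratings_by_expert = {
    "E1": [1, 2, 2, 3, 1],
    "E2": [2, 3, 3, 3, 1],
    "E3": [1, 1, 2, 3, 2],
}
item_difficulty = [-0.8, 0.1, 0.4, 1.2, -0.3]  # invented values in logits

z_scores = {e: z_standardise(r) for e, r in ratings_by_expert.items()}
item_means = [mean(z_scores[e][i] for e in z_scores)
              for i in range(len(item_difficulty))]
print(round(spearman_rho(item_means, item_difficulty), 3))
```

Significance tests are omitted here; in practice, a statistics package would be used to obtain p-values alongside the coefficients.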
To analyse the empirical correlation of authenticity with the measured item difficulty for the constructed test items, dummy variables were formed. The previously explained three levels of authenticity were compressed into two levels: lifeworld relevance and no lifeworld relevance. Levels 1 and 2 were combined and recorded as 1; level 3 was recorded as 2. Three dummy variables were formed, based on the mean values of the experts' ratings, the self-ratings, and the mean value of both, and their correlations with item difficulty were calculated. The results also showed empirically that authenticity was independent of item difficulty; the finding was that authenticity was not perceived as a difficulty-generating feature (see Table 5).
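The recoding of the three authenticity levels into two categories can be sketched as follows. The function name and the sample levels are hypothetical and serve only to illustrate the coding rule described above.

```python
def dichotomise_authenticity(level):
    """Recode authenticity levels: 1 and 2 -> 1 (lifeworld relevance),
    3 -> 2 (no lifeworld relevance)."""
    if level in (1, 2):
        return 1
    if level == 3:
        return 2
    raise ValueError(f"unknown authenticity level: {level!r}")


# Invented levels for five hypothetical items.
levels = [1, 2, 3, 1, 3]
dummy = [dichotomise_authenticity(lv) for lv in levels]
print(dummy)
```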

Discussion and outlook
This article examined which processes are involved in creating a TBA as an authentic and valid assessment for measuring economic literacy among students in the 8th grade in the federal state of NRW, Germany. The aim of the ECON 2022 assessment was to show which competencies the students already had at this stage and what competencies could therefore be expected from them at the beginning of their training in a vocational area. The added value that the use of technology can have for innovative and interactive item development was highlighted.
Relevance is evident from the fact that TBA has repeatedly been used in accordance with the possibilities it offers when recording economic literacy. New possibilities for measuring knowledge, skills, and abilities, for example through the use of innovative response formats, have not yet been sufficiently exploited. In constructing the ECON 2022 assessment, care was taken to ensure that the test environment had curricular validity in that it was aligned with the curriculum of the subject of economics in the state of NRW (Fortunati and Winther 2023a). Test design was based on the basic parameters of the authentic assessment and on the domain model, which is based on a model of evidence-centred design. An analysis of the ECON 2022 assessment illustrated that an orientation towards the difficulty-generating criteria of specificity, cognition, and modelling, combined with authenticity and usability in the construction of a TBA, leads to a valid assessment. For economic literacy, TBAs provide the opportunity to realistically construct typical work and thought processes. In constructing new test instruments, the focus lies on implementing an environment that is as authentic as possible and thus significant for personal learning and the living environment, which can have emotional and motivational effects through the use of media.
The study shows that authentic simulations can depict processes and actions that illustrate the everyday life of students and thus provide an accurate insight into students' knowledge and skills at the beginning of vocational training in economics. Using the simulated scenario of grocery shopping, with the protagonists Kim and Juri guiding students through the assessment, the action- and comprehension-based ability structures of the students who completed the test could be recorded. Moreover, the simulation established a reference to real-life action processes in the field of economics.
One limitation of the analysis of the ECON 2022 test design is that the target group of students was not included in the rating of authenticity. A further limitation is that no immersion was incorporated in the assessment, i.e. the participants themselves cannot completely merge with the test situation. Immersion would be conceivable in a virtual reality environment. Moreover, in an augmented reality environment, it would also be possible to ask questions in a virtual supermarket that has been specifically prepared for this purpose. That said, the aim of the ECON 2022 test design was not to reproduce reality through an immersive experience, but to construct an assessment that was aligned with the experienced lifeworld of the test persons and to model items that fit the theoretical construct. In addition, the aim was to map the current state of economic literacy in a large-scale assessment. In this context, the use of virtual or augmented reality scenarios was deemed too costly for the purpose.
The expert survey showed no significant correlation between authenticity and the item difficulties measured in the field test. Authenticity was objectively not a difficulty-generating characteristic. This result should be interpreted positively as authenticity was not supposed to have a difficulty-generating effect. This finding, in turn, offers proof of the quality of the implementation of the authentic test environment. After the results of the expert validation have been incorporated into the final adaptation of the test items for the ECON 2022 assessment, the main study will seek to measure the actual status before vocational training so that teachers can build on the commercial competencies that students already have and know what can be expected of them at the beginning of vocational training.
The ECON 2022 assessment can claim to have implemented a precise understanding of competence in economic literacy in authentic situations, and to have mapped various facets of economic competence.
This study aimed to develop and validate a technology-based authentic assessment that can be used as a theoretical basis for measuring economic competencies prior to entry into vocational training occupations. For vocational education and training, economic literacy acts as a condition for the development of area-specific competencies for trainees. In NRW, the vocational preparation of students is becoming more central and has also been a curricular component since 2020/2021 in the form of the subject of economics. The study focuses on the new curriculum component, which should better prepare students for the profession and provide connections with commercial apprenticeships.
The newly developed authentic TBA, ECON 2022, makes it possible to assess economic literacy in schools in a way that maintains curricular validity, and to analyse what can be expected of students when they enter a vocational training occupation.

Fig. 2 Introduction to the authentic test environment

Table 1
[…]s of economic literacy between 1990 and 2020 (based on Welsandt and Abs 2023)

Table 3
Authenticity of the ECON 2022 assessment: expert ratings and self-ratings

Table 5
Correlations with dummy control variables