Measurement of vocational competences: an analysis of the structure and reliability of current assessment practices in economic domains
© Winther and Klotz; licensee Springer. 2013
Received: 12 June 2013
Accepted: 14 June 2013
Published: 16 July 2013
Both fostering and measuring action competence remain central targets of vocational education and training research; adequate measurement approaches clearly are prerequisites for international, large-scale assessments. For the German Chamber of Commerce and Industry, competence assessments of industrial managers rely mainly on final examinations that attempt to measure not just knowledge but also action competence. To evaluate this test instrument, this article considers two questions: (1) Can the test assess action competence with validity, and (2) how reliable are the corresponding assessment results?
The study relied on statistical procedures (e.g., IRT scaling), applied empirically to a sample of 1,768 final examinations.
According to the results, the current examination appears neither adequate nor accurate as an instrument for capturing action competence.
We conclude that several steps must be taken to improve the economic assessment.
Prospects and demand for adequate competence assessments
Explicit or implicit measures of vocational competence are relevant to many facets of vocational education and training (VET) and thus constitute an ever-growing research field. They pertain to national educational factors, such as relevant information and instruments for managing the quality of the vocational educational systems and developing adequate support programs, but increasingly, they also appear in international policy agendas. That is, international comparisons and acknowledgement of qualifications, as well as the encouragement of lifelong, informal learning, require adequate measurement concepts and innovative evaluation methods. To meet these multiple expectations, two major conditions must be fulfilled a priori (Klotz & Winther, 2012).
First, we require empirically confirmable competence models that encompass conceptual operationalizations of competences but also reveal a well-postulated theoretical structure that captures their empirical structure. From a scientific perspective, researchers seek empirical results related to the “true” structure of professional competences. From a political point of view, knowledge about the structure and comparability of competences is required to achieve large-scale assessments of VET, such as across Europe. In this context, compulsory education likely refers to a common curriculum of basic competences, such as literacy or numeracy, but the structure of competences within VET is more varied in content and therefore tends to be more complex. Thus VET content is heterogeneous not only between countries but also across different professions within nations (Baethge, Arends, & Winther, 2009) and even in specific workplaces (Billett, 2006). This abundant variation creates an ongoing dilemma for constructing generally valid competence tests. Uncertainty about the structure of competences also undermines international comparisons and the development of binding international agreements for consistent competence standards. The scarce empirical research into the appropriate structure or model of competence partly suggests a content-based classification, such that item content exerts a characteristic influence on its difficulty. Other studies assume dimensionality based on different cognitive processing heuristics, which may determine response behaviors (Nickolaus, 2011; Nickolaus, Gschwendter, & Abele, 2009; Nickolaus, Gschwendter, & Geißel, 2008; Rosendahl & Straka, 2011; Seeber, 2008; Winther & Achtenhagen, 2009b, 2010).
Second, another necessary condition pertains to the reliability of the test results, that is, the certainty with which we can classify students according to a chosen test instrument. Neglecting this condition poses serious risks, because people can easily be misclassified based on their test results, and such classification errors can have severe consequences for their future professional advancement.
With this study, we seek to evaluate both necessary conditions with respect to current testing efforts based on final examinations. Specifically, we describe how the German VET system currently operationalizes and measures competences in the economic domain. Empirical results obtained from a sample of 1,768 final examinations of industrial managers (see note a) reveal the extent to which German assessment instruments are qualified, in terms of their validity and reliability, to measure and classify students’ economic action competence. This study, in accordance with a broader research program, seeks to develop and test a theoretical competence model and thereby improve current assessment practices. Its results thus offer guidelines for further development of the test instrument, as we discuss before concluding this article.
Conceptualization of final examinations
Final examination by the GCCI:
- Economics and social studies (10%)
- Expert discussion (20%)
- Commercial management & control (20%)
- Business processes (40%)
The content-related structure model for the economic domain reflects the previous curriculum of commercial schools, which was officially abolished in 1996 and replaced by cross-disciplinary learning fields intended to foster greater action competence.
Tests of validity determine whether and to what extent a measurement actually measures the intended construct. This criterion comprises two facets. First, it describes the operationalization of a theoretical concept, together with its potential subdimensions and observable indicators, to determine whether the focal approach offers a good measurement notion in relation to the latent trait. It therefore entails the translation of the latent trait into contents, and then the contents into reasonable measurement items, and in this sense, it refers to content validity. But even if an abstract concept is carefully operationalized, including all theoretical aspects and a reasonable item design, it remains possible that the theoretical concept simply does not exist in the real world, or at least not in the way assumed by the researcher. Second, to address the potential gap between theory and observed reality, validity assessments entail construct validity to determine whether the postulated process and content structures arise from empirical test results.
Examination of content validity
Table: Practical and curriculum relevance of examination contents (columns: practical learning, out of 25 months; school hours, out of 600; rows include Marketing & Distribution and Goods & Services).
Regarding construct validity, neither procedural nor content-based structures are clearly identifiable, perhaps due to the strong correction bias in the data (Winther, 2011). These results prompted a central re-correction of the examinations, such that the test results were compared, independent of the analyst, to gain unbiased data for further analyses of construct validity and reliability.
Examination of construct validity
Competence measures often feature test instruments that contain polytomous, ordered item responses, which can be analyzed with models such as the rating scale (Andrich, 1978), partial credit (Masters, 1982), and graded response (Samejima, 1969) models. Because competence, as measured by final examinations, seemingly constitutes a multidimensional concept, the confirmation of its structure requires a multidimensional modeling approach. If competence tests contain items with various scales, as is likely in complex modeling situations, the partial credit model appears most appropriate. An alternative is to use a 1PL model that still allows for varied scaling, namely, by fixing the discrimination parameter of the 2PL graded response model to 1 and thereby obtaining the related 1PL model. The choice between these two models is somewhat arbitrary; both produce nearly identical results, albeit with slightly different parameterizations. Furthermore, the latter approach is easy to program using Mplus software, so this study adopts it to identify and evaluate whether the postulated theoretical structures appear in the final examination data. Accordingly, we allow for items with different numbers of response categories, as well as varying distances across response categories (e.g., Gibbons et al., 2007).
Examination of test reliability
The term “reliability” describes the replicability and thus the accuracy with which each item measures its intended trait. To assess a student’s expertise, a measure must have a strong probability of correctly classifying each student as possessing a certain competence value. For this analysis, we again applied an IRT standard. An important characteristic of IRT models is that they describe reliability, in terms of measurement precision, as a continuous function that is conditional on the values of the measured construct. It is therefore possible to model the test’s reliability for each individual value of competence for every test taker. The crucial appraisal criterion for a test’s reliability is measurement error, which arises because any measurement concept can include only a limited sample of the many possible items that constitute the measurement domain. The testing conditions also may vary, because factors other than student knowledge affect response behaviors, including both student-specific factors, such as mood, health, or individual differences in exposure to the tested content, and situational factors, such as distractions during the test, room temperature, and so forth (Kiplinger, 2008).
If the amount of information at a given ability level is large, a test taker whose true ability is at that level can be identified with reasonable precision.
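The link between information and precision can be written compactly. The notation below is an assumption introduced for illustration: $\theta$ denotes the latent competence, $I_i(\theta)$ the information of item $i$, and $I(\theta)$ the test information. The first relation is the standard asymptotic result for the standard error of the ability estimate; the second is one common convention for expressing conditional reliability when the latent variance is fixed to 1, and it is not necessarily the convention used in this study.

```latex
\[
I(\theta) = \sum_{i} I_i(\theta),
\qquad
\mathrm{SE}(\hat{\theta} \mid \theta) \approx \frac{1}{\sqrt{I(\theta)}},
\qquad
\rho(\theta) \approx 1 - \frac{1}{I(\theta)} .
\]
```

Under this convention, large information values correspond to a small standard error and a reliability close to 1, whereas information values tending to 0 imply that the ability estimate is essentially uninformative.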
Results and discussion
Results for the test’s validity
Table: Global fit indices for the procedural model (M1) and the content-related structure model (M2): weighted root mean square residual, comparative fit index, and Tucker-Lewis (1973) index.
Thus, the content-related structure model supports the validity of 21 of the 35 items with regard to their effectiveness for measuring differences in the abilities of test takers. However, the concept measured is not actually action competence, as intended, but rather content-related, technical knowledge, in an expertise-related sense. If we also consider the content of items not represented in this structure, we note that these abilities are characterized by their relatively transferable, contextualized nature and often involve calculations.
Results for the test’s reliability
The information function for the test reaches its maximum for persons with an approximately average competence level. That is, near this area, it is possible to estimate test takers’ true level of expertise very precisely (reliability = .88). Farther from this maximum, though, the test’s estimation precision decreases rapidly. Students with relatively high ability, who are located in the positive space, reveal a lower but still sufficient information value. In contrast, students with below-average expertise are estimated with an information value tending to 0. Because the test information reflects the sum of individual item information at a given ability level, the amount of information also is defined at the item level. The test provides many measurement items related to an average ability level, along with some items to measure high ability levels, but it features few easy items designed to measure low levels of expertise. Therefore, the GCCI final examinations cannot effectively differentiate test takers with low versus very low ability.
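The pattern described here, with high information around the average and little information at low ability, follows directly from summing item information. The sketch below illustrates this with dichotomous Rasch items for simplicity; the item difficulties are invented to mimic a test with many average items, some hard items, and very few easy ones, and they are not the parameters estimated from the GCCI examination.

```python
import math


def rasch_item_information(theta, difficulty):
    """Fisher information of a dichotomous Rasch item: P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return p * (1.0 - p)


def total_information(theta, difficulties):
    """Test information is the sum of the item information values."""
    return sum(rasch_item_information(theta, b) for b in difficulties)


def standard_error(information):
    """Asymptotic standard error of the ability estimate."""
    return 1.0 / math.sqrt(information)


# Hypothetical item pool: many average items, some hard items, one easy item.
average_items = [-0.75 + 0.1 * k for k in range(16)]
hard_items = [1.5, 1.8, 2.0, 2.2, 2.5]
easy_items = [-1.5]
difficulties = easy_items + average_items + hard_items

for theta in (-3.0, 0.0, 2.0):
    info = total_information(theta, difficulties)
    print(f"theta={theta:+.1f}  information={info:.2f}  SE={standard_error(info):.2f}")
```

In this assumed pool, adding easy items is the only way to raise the information, and thus lower the standard error, in the low-ability range.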
However, this gap does not necessarily cause problems. Some tests are constructed explicitly to differentiate students precisely at a specific, crucial point. That is, we need to consider the specific purpose of any particular test instrument to assess its reliability. The primary purpose of the final examinations is to regulate access to the industrial management profession, such that test takers are separated simply into those who pass the test, and thus receive certification to enter the professional community, and those who do not. Annually, approximately 95% of test takers pass (see note f), so the most important separation point must fall far below an average competence level. Yet the amount of test information available in this range tends toward zero, so students have been classified almost blindly into the crucial “passed” or “failed” categories. This lack of reliability in final examinations not only infringes on statistical test standards but also has severe implications for the professional development and life of a vast number of students.
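A rough way to see why low information at the separation point matters is to approximate the ability estimate as normally distributed around the true ability with standard error 1/sqrt(information). The sketch below rests on several labeled assumptions: the normal approximation, a standard normal ability distribution (so that a 95% pass rate puts the cut score near -1.645), and invented information values. It is an illustration, not an analysis of the examination data.

```python
import math


def misclassification_probability(theta, cut_score, information):
    """Probability that the ability estimate falls on the wrong side of the cut.

    Assumes the estimate is normally distributed around the true ability
    with standard error 1 / sqrt(information).
    """
    se = 1.0 / math.sqrt(information)
    z = abs(theta - cut_score) / se
    # Upper-tail probability of the standard normal distribution at z.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


# Assumed cut score: 5th percentile of a standard normal ability distribution.
CUT_SCORE = -1.645

# A hypothetical test taker slightly below the cut score.
theta = -2.0
for information in (0.5, 2.0, 8.0):
    p = misclassification_probability(theta, CUT_SCORE, information)
    print(f"information={information:.1f}  P(misclassified)={p:.3f}")
```

Under these assumptions, the misclassification probability for a test taker near the cut score approaches 0.5 as the information tends to zero, which is what an almost blind classification amounts to.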
The evaluation of the validity of action competence provided by this article reveals that the assessment entails not the intended, process-oriented structure but rather a fractured, subject-specific, content structure. This content-related structure model reflects a previous, officially abolished teaching structure and curriculum, which makes it quite surprising that this conceptualization still dominates the test. The instrument may be partially valid for assessing subject-specific content—that is, the expertise of a student in several subjects—but it cannot capture true action competence.
Furthermore the items do not demonstrate reliability in their ability to depict the expertise of a student in several subjects. The empirical results pertaining to the structure of vocational competence are coherent with studies in other vocational areas that similarly suggest the high relevance of subject-related domains in the structuring of professional competence measures and their frequent influence on item difficulty (e.g., Nickolaus, Gschwendter, & Abele, 2009; Seeber, 2008). However, for measuring competence acquired in VET, this approach seems insufficient. If action competence is not to devolve into simply a buzzword, the concept must be salient and manifest in final examinations. In particular, newly developed and implemented assessment practices must capture students’ skills in thinking and reasoning effectively and solving complex problems autonomously, on the basis of constructivist theory (Gijbels et al., 2006; Pellegrino et al., 2001).
Finally, with regard to the accuracy with which the final examination distinguishes and classifies students, we find that it does not provide enough items to measure below-average competence levels accurately. The poor reliability limits true classifications of learning outcomes, because students who have been classified as failures, and who are therefore denied certain positions within the professional community, easily could be misclassified. The informative value and explanatory power of the GCCI test instrument thus are low.
- Designing more items pertaining to the “acquisition” and “goods and services” content areas.
- Offering adequately authentic and complex test situations, such that the process-oriented, situated item setting aims to model real-life, authentic situations (Shavelson, 2008).
- Forming a vertical competence structure based on cognitive dimensions and developing situations with varying complexity, to test different action competence qualities and increase the interpretability of the IRT test scores (i.e., criterion-based assessment).
- Designing more easy items, to achieve greater reliability at the most crucial separation point of the test.
- Adopting a competence model that better depicts the development of competence throughout the learning process, moving from general competences (domain-related) to more specific competence components (domain-specific) (Winther, 2010a; Winther & Achtenhagen, 2008), focused on work requirements in specific occupations to stimulate company operations across departments and their specific economic features (Winther & Achtenhagen, 2009a).
By incorporating such aspects into the final examination, the GCCI could make its assessment instrument more valid and move it beyond the current focus on component skills and discrete bits of knowledge, to encompass the more complex aspects of student achievement (Pellegrino et al., 2001). Furthermore, such a test structure might offer more information about the level of competence students actually acquire and concrete starting points for developing support measures to improve their learning process. Results from initial tests of this novel examination approach will be ready in late 2013.
a. The data were acquired from six headquarters of the German Chamber of Commerce and Industry: Luneburg, Hanover, Frankfurt on the Main, Munich, Saarland, and Nuremberg.
b. Steyer and Eid (2001) note that a missing correlation between different error terms, as assumed in classical test theory, implies unidimensionality. In a probabilistic approach, this assumption disappears though, so (multi)dimensionality is explicitly confirmable. In empirical terms, the identified competence dimensions are sufficiently independent in their correlative cohesion (e.g., Hartig & Klieme, 2006).
c. Item discrimination refers to an item’s ability to differentiate among people at different levels along a particular trait continuum (Birnbaum, 1968).
d. Guessing effects describe a respondent’s probability of getting a question correct simply by chance.
e. For the exact characteristics of each model, see Embretson and Reise (2000).
f. Acquired from statistics for Munich and Upper Bavaria.
This article arose from the subproject “Competence-oriented assessments in VET and professional development” (Wi 3597/1-1 and Wi 3597/1-2), within the framework of the priority programme “Competence Models for Assessing Individual Learning Outcomes and Evaluating Educational Processes” (SPP 1293) of the German Research Foundation (DFG).
- Andrich D: A rating formulation for ordered response categories. Psychometrika 1978, 43: 561–574. 10.1007/BF02293814
- Baethge M, Arends L, Winther E: International large-scale assessment on vocational and occupational education and training. In VET boost: Towards a theory of professional competences. Essays in honor of Frank Achtenhagen. Edited by: Oser F, Renold U, John EG, Winther E, Weber S. Rotterdam: Sense Publishers; 2009.
- Baker F: The basics of item response theory. College Park, MD: University of Maryland; 2001.
- Berufsbildungsgesetz (BGBI): vom 23. März 2005, in Kraft getreten am 1. 2005.
- Billett S: Work, change and workers. Dordrecht: Springer; 2006.
- Birnbaum A: Some latent trait models and their use in inferring an examinee’s ability. In Statistical theories of mental test scores. Edited by: Lord FM, Novick MR. Boston, MA: Addison-Wesley; 1968.
- Cattell RB: The scree test for the number of factors. Multivariate Behavioral Research 1966, 1: 245–276. 10.1207/s15327906mbr0102_10
- Embretson SE, Reise SP: Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates Publishers; 2000.
- Fischer GH: Einführung in die Theorie psychologischer Tests. Bern: Huber; 1974.
- German Chamber of Commerce and Industry (GCCI) Aufgabenstelle für kaufmännische Abschluss- und Zwischenprüfungen (AKA) (Hrsg.): Prüfungskatalog für die IHK-Abschlussprüfungen. 3. Auflage. Nürnberg; 2009.
- Gibbons R, Bock D, Hedeker D, Weiss DJ, Segawa E, Bhaumik DK, Kupfer DJ, Frank E, Grochocinski VJ, Stover A: Full-information item bifactor analysis of graded response data. Applied Psychological Measurement 2007, 31: 4. 10.1177/0146621606289485
- Gijbels D, Van De Watering G, Dochy F, Van Den Bossche P: New learning environments and constructivism: The students' perspective. Instructional Science 2006, 34: 213–226. 10.1007/s11251-005-3347-z
- Haasler B: Anregungen zur Prüfungspraxis in der deutschen dualen Berufsausbildung aus der Perspektive der gewerblich-technischen Berufsausbildungsforschung. In Praxisbegleitende Prüfungen und Beurteilungen in der Beruflichen Bildung in Europa. Edited by: Tutschner R, Grollmann P, Luomi-Messerer K, Stenström M-L. Bd. 18 Bildung und Arbeitswelt, Wien, Berlin; 2007:193–220.
- Hacker W: Arbeitspsychologie. Psychische Regulation von Arbeitstätigkeiten. Bern: Huber; 1986.
- Hartig J, Höhler J: Representation of competences in multidimensional IRT models with within-item and between-item multidimensionality. Zeitschrift für Psychologie 2008, 216(2): 89–101. 10.1027/0044-3409.216.2.89
- Hartig J, Klieme E: Kompetenz und Kompetenzdiagnostik. In Leistung und Leistungsdiagnostik. Edited by: Schweizer K. Berlin: Springer; 2006.
- Horn JL: A rationale and test for the number of factors in factor analysis. Psychometrika 1965, 30: 179–185. 10.1007/BF02289447
- Kiplinger L: Reliability of large scale assessment and accountability systems. In The future of test-based educational accountability. Edited by: Ryan KE, Shepard LA. New York: Routledge; 2008.
- Klotz VK, Winther E: Kompetenzmessung in der kaufmännischen Berufsausbildung: Zwischen Prozessorientierung und Fachbezug. Eine Analyse der aktuellen Prüfungspraxis. bwp@ Berufs- und Wirtschaftspädagogik – online 2012, 22.
- Kuhl J: A theory of action and state orientation. In Volition and personality: Action vs. state orientation. Edited by: Kuhl J, Beckmann J. Seattle: Hogrefe & Huber; 1994.
- Kuhl J: Action and state orientation: Psychometric properties of the action control scales. In Volition and personality: Action vs. state orientation. Edited by: Kuhl J, Beckmann J. Göttingen: Hogrefe; 1994.
- Masters GN: A Rasch model for partial credit scoring. Psychometrika 1982, 47: 149–174. 10.1007/BF02296272
- Nickolaus R: Die Erfassung fachlicher Kompetenz und ihrer Entwicklungen in der beruflichen Bildung - Forschungsstand und Perspektiven. Stationen empirischer Bildungsforschung: Traditionslinien und Perspektiven 2011, 331–351.
- Nickolaus R, Gschwendter T, Geißel B: Modellierung und Entwicklung beruflicher Fachkompetenz in der gewerblich-technischen Erstausbildung. Zeitschrift für Berufs- und Wirtschaftspädagogik 2008, 104: 48–73.
- Nickolaus R, Gschwendter T, Abele S: Die Validität von Simulationsaufgaben am Beispiel der Diagnosekompetenz von Kfz-Mechatronikern. Vorstudie zur Validität von Simulationsaufgaben im Rahmen eines VET-LSA. Stuttgart: Abschlussbericht für das Bundesministerium für Bildung und Forschung; 2009.
- Pellegrino JW, Chudowsky N, Glaser R: Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press; 2001.
- Ramsay JO: TestGraf. A program for the graphical analysis of multiple choice test and questionnaire data [Manual and Software]. Montreal: Author; 1995.
- Reeve BB, Fayers P: Applying item response theory modelling for evaluating questionnaire item and scale properties. In Assessing quality of life in clinical trials: methods of practice. 2nd edition. Edited by: Fayers P, Hays RD. New York: Oxford University Press; 2005.
- Rosendahl J, Straka GA: Kompetenzmodellierungen zur wirtschaftlichen Fachkompetenz angehender Bankkaufleute. Zeitschrift für Berufs- und Wirtschaftspädagogik 2011, 107(2): 190–217.
- Rost J: Lehrbuch Testtheorie und Testkonstruktion. Bern: Huber; 2004.
- Samejima F: Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement 1969, 34(4): 100–114.
- Schmidt JU: Prüfungen auf dem Prüfstand – Betriebe beurteilen die Aussagekraft von Prüfungen. Berufsbildung in Wissenschaft und Praxis 2000, 29(5): 27–31.
- Seeber S: Ansätze zur Modellierung beruflicher Fachkompetenz in kaufmännischen Ausbildungsberufen. Zeitschrift für Berufs- und Wirtschaftspädagogik 2008, 104: 74–97.
- Shavelson RJ: Reflections on quantitative reasoning: an assessment perspective. In Calculation vs. context: Quantitative literacy and its implications for teacher education. Edited by: Madison BL, Steen LA. Washington, DC: Mathematical Association of America; 2008.
- Steyer R, Eid M: Messen und Testen. Berlin: Springer; 2001.
- Tucker LR, Lewis C: The reliability coefficient for maximum likelihood factor analysis. Psychometrika 1973, 38: 1–10. 10.1007/BF02291170
- Volpert W: Handlungsstrukturanalyse als Beitrag zur Qualifikationsforschung. Köln: Pahl-Rugenstein; 1983.
- Weiss DJ, Davison ML: Test theory and methods. Annu Rev Psychol 1981, 32(1): 629–658. 10.1146/annurev.ps.32.020181.003213
- Winther E: Kompetenzmessung in der beruflichen Bildung. Bielefeld: Bertelsmann; 2010.
- Winther E: Kompetenzen messen – Zur Notwendigkeit methodologischer und quantitativer Standards im Rahmen beruflicher Kompetenz. Zeitschrift für Berufs- und Wirtschaftspädagogik 2010, 106(3): 128–137.
- Winther E: Kompetenzorientierte Assessments in der beruflichen Bildung – Am Beispiel der Ausbildung von Industriekaufleuten. Zeitschrift für Berufs- und Wirtschaftspädagogik 2011, 107(1): 33–54.
- Winther E, Achtenhagen F: Kompetenzstrukturmodell für die kaufmännische Bildung. Adaptierbare Forschungslinien und theoretische Ausgestaltung. Zeitschrift für Berufs- und Wirtschaftspädagogik 2008, 104(4): 511–538.
- Winther E, Achtenhagen F: Measurement of vocational competences – a contribution to an international large-scale assessment on vocational education and training. Empirical Research in Vocational Education and Training 2009, 1: 88–106.
- Winther E, Achtenhagen F: Skalen und Stufen kaufmännischer Kompetenz. Zeitschrift für Berufs- und Wirtschaftspädagogik 2009, 105(4): 521–556.
- Winther E, Achtenhagen F: Berufsfachliche Kompetenz: Messinstrumente und empirische Befunde zur Mehrdimensionalität beruflicher Handlungskompetenz. Berufsbildung in Wissenschaft und Praxis 2010, 1: 18–21.
- Wright BD, Stone MH: Best test design. Chicago: MESA Press; 1979.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.