Open Access

Measurement of vocational competences: an analysis of the structure and reliability of current assessment practices in economic domains

Empirical Research in Vocational Education and Training20135:2

DOI: 10.1186/1877-6345-5-2

Received: 12 June 2013

Accepted: 14 June 2013

Published: 16 July 2013



Both fostering and measuring action competence remain central targets of vocational education and training research; adequate measurement approaches clearly are prerequisites for international, large-scale assessments. For the German Chamber of Commerce and Industry, competence assessments of industrial managers rely mainly on final examinations that attempt to measure not just knowledge but also action competence. To evaluate this test instrument, this article considers two questions: (1) Can the test assess action competence with validity, and (2) how reliable are the corresponding assessment results?


The study relied on statistical procedures (e.g., IRT scaling), applied empirically to a sample of 1,768 final examinations.


As a result the current examination appears neither adequate nor accurate as an instrument to capture action competence.


We conclude that several improving steps have to be undertaken to improve the economic assessment.


Vocational competences Action competence Item response theory (IRT) scaling Competence structure Test reliability


Prospects and demand for adequate competence assessments

Explicit or implicit measures of vocational competence are relevant to many facets of vocational education and training (VET) and thus constitute an ever-growing research field. They pertain to national educational factors, such as relevant information and instruments for managing the quality of the vocational educational systems and developing adequate support programs, but increasingly, they also appear in international policy agendas. That is, international comparisons and acknowledgement of qualifications, as well as the encouragement of lifelong, informal learning, require adequate measurement concepts and innovative evaluation methods. To meet these multiple expectations, two major conditions must be fulfilled a priori (Klotz & Winther, 2012).

First, we require empirically confirmable competence models that encompass conceptual operationalizations of competences but also reveal a well-postulated theoretical structure that captures their empirical structure. From a scientific perspective, researchers seek empirical results related to the “true” structure of professional competences. From a political point of view, knowledge about the structure and comparability of competences is required to achieve large-scale assessments of VET, such as across Europe. In this context, compulsory education likely refers to a common curriculum of basic competences, such as literacy or numeracy, but the structure of competences within VET is more varied in content and therefore tends to be more complex. Thus VET content is heterogeneous not only between countries but also across different professions within nations (Baethge, Arends, & Winther, 2009) and even in specific workplaces (Billett, 2006). This abundant variation creates an ongoing dilemma for constructing generally valid competence tests. Uncertainty about the structure of competences also undermines international comparisons and the development of binding international agreements for consistent competence standards. Some scarce empirical research into the appropriate structure or model of competence suggests a content-based classification, such that item content exerts a characteristic influence on its difficulty. Other studies assume dimensionality based on different cognitive processing heuristics, which may determine response behaviors (Nickolaus, 2011; Nickolaus, Gschwendter, & Abele 2009;Nickolaus, Gschwendter, & Geißel 2008;Rosendahl & Straka, 2011; Seeber, 2008; Winther & Achtenhagen, 2009b2010).

Second, another necessary condition pertains to the reliability of the test results, that is, the certainty with which we can classify students according to a chosen test instrument. Neglecting this conditions poses serious risks, because people easily can be misclassified based on their test results, and such classification errors can have severe consequences for their future professional advancement.

With this study, we seek to evaluate both necessary conditions with respect to current testing efforts based on final examinations. Specifically, we describe how the German VET system currently operationalizes and measures competences in the economic domain. Empirical results obtained from a sample of 1,768 final examinations of industrial managersa reveal the extent to which German assessment instruments are qualified, in terms of their validity and reliability, to measure and classify students’ economic action competence. This study, in accordance with a broader research program, seeks to develop and test a theoretical competence model and thereby improve current assessment practices. Its results thus offer guidelines for further development of the test instrument, as we discuss before concluding this article.

Conceptualization of final examinations

Action competence offers a constitutive element of the German vocational system and a significant topic of scientific and political discourse since the early 1980s, particularly in relation to the didactic implications of action regulation theory (Hacker, 1986; Kuhl, 1994a1994b; Volpert, 1983). In the mid-1990s, the Standing Conference of the Ministers of Education and Cultural Affairs (Kultusministerkonferenz) legally adopted the concept of action competence as a central target. Specifically and by law, students must be instructed in a way that enables them to plan, execute, and monitor an entire action process in a working environment. This concept appears largely heuristic but still must form the foundation for any test construction (BGBI, 2005 §5). In practice, these assessments come from the German Chamber of Commerce and Industry (GCCI) and comprise both oral and written components. The oral part consists of a presentation and then a related expert discussion; it accounts for 30% of the assessment. The written examination comprises practical tasks pertaining to economics and social studies, as well as commercial management and control, together with situational tasks that take the form of case studies related to business processes. This last business processes section represents the most important assessment area, in terms of processing time (180 minutes) and weighting (40% of the final grade) (see Table 1). Therefore, this study focuses on this assessment component.
Table 1

Final Examination by the GCCI

Oral examination

Written examination

Presentation (10%)

Economics and social studies (10%)

Expert discussion (20%)

Commercial management & control (20%)


Business processes (40%)

Recent commentary suggests that these test practices fail to give students sufficient room or potential to apply their knowledge to solve complex problems in a process-oriented working context (e.g., Haasler, 2007; Schmidt, 2000; Winther, 2010b) According to the GCCI (2009), the design of the business processes test component is intended to require test takers to model processes, undertake complex tasks, analyze business processes, and solve problems in an outcome- and customer-oriented way. To implement these goals, the test designers operationalized action competence as the three mutually exclusive process dimensions in Figure 1: planning, executing, and monitoring (GCCI 2009). Thus again, the business processes section seems particularly suitable for our empirical analysis of the structure of action competence.
Figure 1

Procedural structure model.

If these process dimensions actually characterize a test situation, their solutions should require different sets of cognitive abilities of the test taker. In addition to this primary test conception, each item might be categorized according to four content domains: marketing and distribution, acquisition, human resource management (HRM), and goods and services. Such an alternative content-related model of competence measurement, as in Figure 2, appears in some other vocational assessments (Nickolaus, 2011; Nickolaus, Gschwendter, & Geissel 2008; Rosendahl & Straka, 2011; Seeber, 2008).
Figure 2

Content-related structure model.

The content-related structure model for the economic domain reflects the previous curriculum of commercial schools, which were officially abolished in 1996, replaced by cross-disciplinary learning fields that sought to foster greater action competence.



Tests of validity determine if and to what extent a measurement actually measures the intended construct. This criterion comprises two facets. First, it describes the operationalization of a theoretical concept, together with its potential subdimensions and observable indicators, to determine if the focal approach offers a good measurement notion in relation to the latent trait. It therefore entails the translation of the latent trait into contents, and then the contents into reasonable measurement items, and in this sense, if refers to content validity. But even if an abstract concept is carefully operationalized, including all theoretical aspects and a reasonable item design, it remains possible that the theoretical concept simply does not exist in the real world—or at least not in the way assumed by the researcher. Second, to address the potential gap between theory and observed reality, validity assessments entail construct validity to determine if the postulated process and content structures arise from empirical test results.

Examination of content validity

Winther (2011) has analyzed the focal final examinations with regard to their objectivity and content validity. The results indicate systematic biases, due to nonuniform scoring during the correction process (see Table 2). With regard to content validity, Winther (2011) notes that a predominant part of the curriculum is dedicated to the goods and services domain (47% of the curriculum, about one-third of practical training), yet the proportion of content related to that topic in the test is rather small (21%). Thus, the test does not achieve representative validity. In particular, tasks related to modeling the processes of value creation and quantifiable production management are underrepresented, whereas the marketing and distribution content area appears overrepresented (38% of the final), in comparison with both its percentage of the curriculum (26%) and its practical relevance (25%).
Table 2

Practical and curriculum relevance of examination contents


Score (/100)

Practical learning (/25 months)

School hours (/600)

Curriculum weight

Marketing & Distribution


5-7 months

160 h




5-7 months

80 h




2-6 months

80 h


Goods & Services


6-10 months

280 h


Regarding construct validity, neither procedural nor content-based structures are clearly identifiable, perhaps due to the strong correction bias in the data (Winther, 2011). These results prompted a central re-correction of the examinations, such that the test results were compared, independent of the analyst, to gain unbiased data for further analyses of construct validity and reliability.

Examination of construct validity

Construct validity exists if the postulated process and content structures are actually reflected in empirical test results. To analyze theoretical structure models, most research relies on factor analytical approaches, though increasingly, multidimensional item response theory (IRT) models have grown in popularity (Hartig & Höhler, 2008). In accordance with this theory, a set of mathematical models describe, in probabilistic terms, the relationship between a person’s response to an item and the level of a latent trait (e.g., Reeve & Fayers, 2005). Traditional approaches to measurement scales rely on averages or a simple summation of the test scores; IRT models instead reflect the assumption that the probability of solving an item depends on the test taker’s latent trait or ability (i.e., θi = person parameter), combined with the item difficulty (i.e., δi = item parameter). These two parameters relate negatively (θi – δi), because the probability of solving an item increases with the person’s ability but decreases with greater item difficulty (Wright & Stone, 1979). This basic assumption can be formalized as a nonlinear function, namely, the item response function:
p X vi = x = exp θ v δ i 1 + exp θ v δ i x = 0 , 1 .
It also can be depicted in an item characteristic curve, as in Figure 3.
Figure 3

Item characteristic curves for a dichotomous Rasch model with three items of varying difficulty (See Rost, 2004 , p. 125; Winther, 2010a , p. 41).

For the analysis of the final examinations, we used IRT models because their traits and characteristics render them particularly suitable for this research goalb. However, a basic assumption underlying the application of parametric IRT models is that the model is appropriate for the data, which in turn demands the choice of the right model and an evaluation of model fit. The first consideration for choosing the right model is determining the number of item response categories. Only some structure models can model items with more than two response options, commonly referred to as polytomous items. In addition, the modeler must decide if another parameter, in addition to the item and person parameters (1PL model), can add to the level of item discriminationc (2PL model) or even if yet another parameter that reflects guessing effectse should appear in the model (3PL model) (Weiss & Davison, 1981). Although brevity considerations prevent us from describing all these models, we propose the specification scheme in Figure 4 to help render the decision process transparent and facilitate the search for an appropriate IRT model that can analyze the structure of competences in related research fields.
Figure 4

IRT specification scheme f .

Competence measures often feature test instruments that contain polytomous, ordered item responses, such as the rating scale (Andrich, 1978), partial credit (Masters, 1982), and graded response (Samejima, 1969) models. Because competence, as measured by final examinations, seemingly constitutes a multidimensional concept, the confirmation of its structure requires a multidimensional modeling approach. If competence tests contain items with various scales, as is likely in complex modeling situations, the partial credit model appears most appropriate. An advanced alternative also could take advantage of a 1PL model but still allow for varied scaling, that is, by fixing the discrimination parameter of the 2PL graded response model to equal 1 and thereby obtain the related 1PL model. The choice between these two models is somewhat arbitrary; both produce nearly identical results, albeit with slightly different parameterizations. Furthermore, this approach is easy to program using Mplus software, so this study adopts it to identify and evaluate whether the postulated theoretical structures appear in the final examination data. Accordingly, we allow for items with different numbers of response categories, as well as varying distances across response categories (e.g., Gibbons et al., 2007).

Examination of test reliability

The term “reliability” describes the replicability and thus the accuracy with which each item measures its intended trait. To assess a student’s expertise, a measure must have a strong probability of correctly classifying each student as possessing a certain competence value. For this analysis, we again applied an IRT standard. An important characteristic of IRT models is that they describe reliability, in terms of measurement precision, as a continuous function that is conditional on the values of the measured construct. It is therefore possible to model the test’s reliability for each individual value of competence for every test taker. The crucial appraisal criterion for a test’s reliability is measurement error, which arises because any measurement concept can include only a limited sample of the many possible items that constitute the measurement domain. The testing conditions also may vary, because factors other than student knowledge affect response behaviors, including both student-specific factors, such as mood, health, or individual differences in exposure to the tested content, and situational factors, such as distractions during the test, room temperature, and so forth (Kiplinger, 2008).

According to Fischer (1974), item precision can be depicted by item information curves (or functions), which indicate the range over the measurement construct in which the item discriminates best among individuals. The inverse of the squared standard measurement error is equivalent to item information with respect to the latent trait (in our case, expertise). Thus,
I i = 1 σ i 2
The higher the estimation variance, the less test information is available, and the lower the test’s reliability (Ramsay, 1995):
Rel θ = 1 1 + 1 I θ

If the information is expansive, it is possible to identify a test taker whose true ability is at that level with reasonable precision.

Results and discussion

Results for the Test’s validity

Testing both structures (i.e., process-oriented and content-related) within a single, integrated, 12-factor structure model was too unwieldy for the focal database, with only 35 items to distribute across dimensions. Numerically, the question of which theoretical model fits the real database best can be answered most effectively by so-called fit indices. In the test to confirm the processual structure model (M1 from Figure 1), we obtained poor values; this test concept does not appear valid for capturing competence. In contrast, the empirical evidence obtained for a school subject–oriented, content-related measurement approach (M2) suggested good fit with the content structure for most items, as the comparison in Table 3 reveals.
Table 3

Global fit indices for the procedural model (M1) and content-related structure model (M2)

Fit indices

Cut-off criterion




≥ 0.05



Weighted Root Mean Square Residual

≤ 0.05




≤ 0.9



Confirmatory fit index

≥ 0.95



Tucker-Lewis (1973) index

≥ 0.95



To derive the content-related structure model, we used exploratory factor analysis. Specifically, to determine the number of factors, we combined a graphical scree test (Cattell, 1966) with a parallel analysis (Horn, 1965), using the MonteCarlo PA software, which offers a more objective approach for extracting factors. A five-factor solution emerged. We rotated the factor solution using oblique rotation method promax in SPSS, which is well suited to an analysis that allows for some correlation of factors (as can be assumed for the competence dimensions) and for very large data sets (as is the case for the final examinations). During this analysis, the data freely generated the postulated contend-related structure model, together with the predicted parameters of the model. In the only empirical difference, the contents of the academic subjects marketing and distribution split empirically into two domains (marketing and distribution), as we show in Figure 5.
Figure 5

Empirically generated content-related structure model.

Thus, the content-related structure model supports the validity of 21 of the 35 items with regard to their effectiveness for measuring differences in the abilities of test takers. However, the concept measured is not actually action competence, as intended, but rather content-related, technical knowledge, in an expertise-related sense. If we also consider the content of items not represented in this structure, we note that these abilities are characterized by their relatively transferable, contextualized nature and often involve calculations.

Results for the Test’s reliability

Using IRT-standard, the amount of information can be computed for each ability level on a test’s ability scale (Baker, 2001). We show the results for the final examinations data in Figure 6.
Figure 6

Test reliability for the GCCI final examinations.

The information function for the test reaches its maximum for persons with an approximately average competence level. That is, near this area, it is possible to estimate, very precisely, test takers’ true level of expertise (reliability = .88). Farther from this maximum though, the test’s estimation precision decreases rapidly. Students with relatively high ability, who are located in the positive space, reveal a lower but still sufficient information value. In contrast, students with below-average expertise get estimated with an information value tending to 0. Because the test information reflects the sum of individual item information at a given ability level, the amount of information also is defined at the item level. The test provides many measurement items related to an average ability level, along with some items to measure high ability levels, but it features few easy items designed to measure low levels of expertise. Therefore, the GCCI final examinations cannot effectively differentiate test takers with low versus very low ability.

However, this gap does not necessarily cause problems. Some tests are constructed explicitly to differentiate students precisely at a specific, crucial point. That is, we need to consider the specific purpose of any particular test instrument to assess its reliability. The primary purpose of the final examinations is to regulate access to the industrial management profession, such that test takers are separated simply into those who pass the test, and thus receive certification to enter the professional community, and those who do not. Annually, approximately 95% of test takers pass,f so the most important separation point must fall far below an average competence level. Yet the amount of test information available in this range tends toward zero, so students have been quasi–blindly classified into the crucial “passed” or “failed” categories. This lack of reliability in final examinations not only infringes on statistical test standards but also has severe implications for the professional development and life of a vast number of students.


The evaluation of the validity of action competence provided by this article reveals that the assessment entails not the intended, process-oriented structure but rather a fractured, subject-specific, content structure. This content-related structure model reflects a previous, officially abolished teaching structure and curriculum, which makes it quite surprising that this conceptualization still dominates the test. The instrument may be partially valid for assessing subject-specific content—that is, the expertise of a student in several subjects—but it cannot capture true action competence.

Furthermore the items do not demonstrate reliability in their ability to depict the expertise of a student in several subjects. The empirical results pertaining to the structure of vocational competence are coherent with studies in other vocational areas that similarly suggest the high relevance of subject-related domains in the structuring of professional competence measures and their frequent influence on item difficulty (e.g., Nickolaus, Gschwendter, & Abele, 2009; Seeber, 2008). However, for measuring competence acquired in VET, this approach seems insufficient. If action competence is not to devolve into simply a buzzword, the concept must be salient and manifest in final examinations. In particular, newly developed and implemented assessment practices must capture students’ skills in thinking and reasoning effectively and solving complex problems autonomously, on the basis of constructivist theory (Gijbels et al., 2006; Pellegrino et al., 2001).

Finally, with regard to the accuracy with which the final examination distinguishes and classifies students, we find that it does not provide enough items to measure under average competence levels accurately. The poor reliability limits true classifications of learning outcomes, because students who have been classified as failures, and who are therefore denied certain positions within the professional community, easily could be misclassified. The informative value and explanatory power for the GCCI test instrument thus are low.

Because the current examination appears neither adequate nor accurate as an instrument to capture action competence, we propose improving the foundational conceptualization of the test by
  1. 1.

    Designing more items pertaining to the “acquisition” and “goods and services” content areas.

  2. 2.

    Offering adequately authentic and complex test situations, such that the process-oriented, situated item setting aims to model real-life, authentic situations (Shavelson, 2008).

  3. 3.

    Forming a vertical competence structure based on cognitive dimensions and developing situations with varying complexity, to test different action competence qualities and increase the interpretability of the IRT test scores (i.e., criterion-based assessment).

  4. 4.

    Designing more easy items, to achieve greater reliability at the most crucial separation point of the test.

  5. 5.

    Adopting a competence model that better depicts the development of competence throughout the learning process, moving from general competences (domain-related) to more specific competence components (domain-specific) (Winther, 2010a; Winther & Achtenhagen, 2008), focused on work requirements in specific occupations to stimulate company operations across departments and their specific economic features (Winther 1 & Achtenhagen, 2009a).


By incorporating such aspects into the final examination, the GCCI could make its assessment instrument more valid and move it beyond the current focus on component skills and discrete bits of knowledge, to encompass the more complex aspects of student achievement (Pellegrino et al., 2001). Furthermore, such a test structure might offer more information about the level of competence students actually acquire and concrete starting points for developing support measures to improve their learning process. Results from initial tests of this novel examination approach will be ready in late 2013.


aThe data were acquired from six headquarters of the German Chamber of Commerce and Industry: Luneburg, Hanover, Frankfurt on the Main, Munich, Saarland, and Nuremberg.

bSteyer and Eid (2001) note that a missing correlation between different error terms—as assumed in classical test theory—implies unidimensionality. In a probabilistic approach, this assumption disappears though, so (multi)dimensionality is explicitly confirmable. In empirical terms, the identified competence dimensions are sufficiently independent in their correlative cohesion (e.g., Hartig & Klieme, 2006).

cItem discrimination refers to an item’s ability to differentiate among people at different levels along a particular trait continuum (Birnbaum 1968).

dGuessing effects describe a respondent’s probability of getting a question correct, simply by chance.

eFor the exact characteristics of each model, see Embretson and Reise (2000).

fAcquired from statistics for Munich and Upper Bavaria.



This article arose from the subproject “Competence-oriented assessments in VET and professional development” (Wi 3597/1-1 and Wi 3597/1-2), within the framework of the priority programme “Competence Models for Assessing Individual Learning Outcomes and Evaluating Educational Processes” (SPP 1293) of the German Research Foundation (DFG).

Authors’ Affiliations

Chair of Business and Human Resource Education, University of Paderborn


  1. Andrich D: A rating formulation for ordered response categories. Psychometrika 1978, 43: 561–574. 10.1007/BF02293814View ArticleGoogle Scholar
  2. Baethge M, Arends L, Winther E: International large-scale assessment on vocational and occupational education and training. In VET boost: Towards a theory of professional competences. Essays in honor of Frank Achtenhagen. Edited by: Oser F, Renold U, John EG, Winther E, Weber S. Rotterdam: Sense Publishers; 2009.Google Scholar
  3. Baker F: The basics of item response theory. College Park, MD: University of Maryland; 2001.Google Scholar
  4. Berufsbildungsgesetz (BGBI): vom 23. März 2005, in Kraft getreten am 1. 2005.Google Scholar
  5. Billett S: Work, change and workers. Dordrecht: Springer; 2006.View ArticleGoogle Scholar
  6. Birnbaum A: Some latent trait models and their use in inferring an examinee’s ability. In Statistical theories of mental test scores. Edited by: Lord FM, Novick MR. Boston, MA: Addison-Wesley; 1968.Google Scholar
  7. Cattell RB: The scree test for the number of factors. Multivariate Behavioral Research 1966, 1: 245–276. 10.1207/s15327906mbr0102_10View ArticleGoogle Scholar
  8. Embretson SE, Reise SP: Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates Publishers; 2000.Google Scholar
  9. Fischer GH: Einführung in die Theorie psychologischer Tests. Bern: Huber; 1974.Google Scholar
  10. German Chamber of Commerce and Industry (GCCI) Aufgabenstelle für käufmännische Abschluss und Zwischenprüfungen (AKA) (Hrsg.): Prüfungskatalog für die IHK-Abschlussprüfungen. Vol. 3. Auflage: Nürnberg; 2009.Google Scholar
  11. Gibbons R, Bock D, Hedeker D, Weiss DJ, Segawa E, Bhaumik DK, Kupfer DJ, Frank E, Grochocinski VJ, Stover A: Full-information item bifactor analysis of graded response data. Applied Psychological Measurement 2007, 31: 4. 10.1177/0146621606289485View ArticleGoogle Scholar
  12. Gijbels D, Van De Watering G, Dochy F, Van Den Bossche P: New learning environments and constructivism: The students' perspective. Instructional Science 2006, 34: 213–226. 10.1007/s11251-005-3347-zView ArticleGoogle Scholar
  13. Haasler B: Anregungen zur Prüfungspraxis in der deutschen dualen Berufsausbildung aus der Perspektive der gewerblich-technischen Berufsausbildungsforschung. In Praxisbegleitende Prüfungen und Beurteilungen in der Beruflichen Bildung in Europa. Edited by: R Tutschner (Hrsg.), Grollmann P, Luomi-Messerer K, Stenström M-L. Bd. 18 Bildung und Arbeitswelt, Wien, Berlin; 2007:193–220.Google Scholar
  14. Hacker W: Arbeitspsychologie. Psychische Regulation von Arbeitstätigkeiten. Bern: Huber; 1986.Google Scholar
  15. Hartig J, Höhler J: Representation of competences I multidimensional IRT. Models with within-item and between-item multidimensionality. Zeitschrift für Psychologie 2008,216(2):89–101. 10.1027/0044-3409.216.2.89View ArticleGoogle Scholar
  16. Hartig J, Klieme E: Kompetenz und Kompetenzdiagnostik. In Leistung und Leistungsdiagnostik. Edited by: Schweizer K. Berlin: Springer; 2006.Google Scholar
  17. Horn JL: A rationale and test for the number of factors in factor analysis. Psychometrika 1965, 30: 179–185. 10.1007/BF02289447View ArticleGoogle Scholar
  18. Kiplinger L: Reliability of large scale assessment and accountability systems. In The future of test-based educational accountability. Edited by: Ryan KE, Shepard LA. New York: Routledge; 2008.Google Scholar
  19. Klotz VK, Winther E: Kompetenzmessung in der kaufmännischen Berufsausbildung: Zwischen Prozessorientierung und Fachbezug. Eine Analyse der aktuellen Prüfungspraxis. 2012. In: bwp@ Berufs - und Wirtschaftspädagogik – online, 22 In: bwp@ Berufs - und Wirtschaftspädagogik – online, 22Google Scholar
  20. Kuhl J: A theory of action and state orientation. In J. Kuhl, & J. Beckmann (Eds.) Volition and personality: Action vs. state orientation. Seattle: Hogrefe & Huber; 1994.Google Scholar
  21. Kuhl J: Action and state orientation: Psychometric properties of the action control scales. In J. Kuhl, & J. Beckmann (Eds.) Volition and personality: Action vs. state orientation. Göttingen: Hogrefe; 1994.Google Scholar
  22. Masters GN: A Rasch model for partial credit scoring. Psychometrica 1982, 47: 149–174. 10.1007/BF02296272View ArticleGoogle Scholar
  23. Nickolaus R: Die Erfassung fachlicher Kompetenz und ihrer Entwicklungen in der beruflichen Bildung - Forschungsstand und Perspektiven. Stationen empirischer Bildungsforschung: Traditionslinien und Perspektiven 2011, 331–351.View ArticleGoogle Scholar
  24. Nickolaus A, Gschwendter T, Geißel B: Modellierung und Entwicklung beruflicher Fachkompetenz in der gewerblich-technischen Erstausbildung. Zeitschrift für Berufs-und Wirtschaftspädagogik 2008, 104: 48–73.Google Scholar
  25. Nickolaus R, Gschwendter T, Abele S: Die Validität von Simulationsaufgaben am Beispiel der Diagnosekompetenz von Kfz-Mechatronikern. Vorstudie zur Validität von Simulationsaufgaben im Rahmen eines VET-LSA. Stuttgart: Abschlussbericht für das Bundesministerium für Bildung und Forschung; 2009.Google Scholar
  26. Pellegrino JW, Chudowsky N, Glaser R: Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press; 2001.Google Scholar
  27. Ramsay JO: TestGraf. A program for the graphical analysis of multiple choice test and questionnaire data [Manual and Software]. Montreal: Author; 1995.Google Scholar
  28. Reeve BB, Fayers P: Applying item response theory modelling for evaluating questionnaire item and scale properties. In Assessing quality of life in clinical trials: methods of practice. 2nd edition. Edited by: Fayers P, Hays RD. New York: Oxford University Press; 2005.Google Scholar
  29. Rosendahl J, Straka GA: Kompetenzmodellierungen zur wirtschaftlichen Fachkompetenz angehender Bankkaufleute. Zeitschrift für Berufs- und Wirtschaftspädagogik 2011,107(2):190–217.Google Scholar
  30. Rost J: Lehrbuch Testtheorie und Testkonstruktion. Bern: Huber; 2004.Google Scholar
  31. Samejima F: Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement 1969,34(4):100–114.Google Scholar
  32. Schmidt JU: Prüfungen auf dem Prüfstand – Betriebe beurteilen die Aussagekraft von Prüfungen. Berufsbildung in Wissenschaft und Praxis 2000,29(5):27–31.Google Scholar
  33. Seeber S: Ansätze zur Modellierung beruflicher Fachkompetenz in kaufmännischen Ausbildungsberufen. Zeitschrift für Berufs- und Wirtschaftspädagogik 2008, 104: 74–97.Google Scholar
  34. Shavelson RJ: Reflections on quantitative reasoning: an assessment perspective. In B. L., Madison, & L. A., Steen. (Eds.) Calculation vs. context: Quantitative literacy and its implications for teacher education. Washington, DC: Mathematical Association of America; 2008.Google Scholar
  35. Steyer R, Eid M: Messen und Testen. Berlin: Springer; 2001.View ArticleGoogle Scholar
  36. Tucker LR, Lewis C: The reliability coefficient for maximum likelihood factor analysis. Psychometrica 1973, 38: 1–10. 10.1007/BF02291170View ArticleGoogle Scholar
  37. Viola Katharina K, Esther W: Kompetenzmessung in der kaufmännischen Berufsausbildung: Zwischen Prozessorientierung und Fachbezug. Eine Analyse der aktuellen Prüfungspraxis. 2012. bwp@ Berufs- und Wirtschaftspädagogik - online, 22 bwp@ Berufs- und Wirtschaftspädagogik - online, 22Google Scholar
  38. Volpert W: Handlungsstrukturanalyse als Beitrag zur Qualifikationsforschung. Köln: Pahl-Rugenstein; 1983.Google Scholar
  39. Weiss DJ, Davison ML: Test theory and methods. Annu Rev Psychol 1981,32(1):629–658. 10.1146/ ArticleGoogle Scholar
  40. Winther E: Kompetenzmessung in der beruflichen Bildung. Bielefeld: Bertelsmann; 2010.Google Scholar
  41. Winther E: Kompetenzen messen – Zur Notwendigkeit methodologischer und quantitativer Standards im Rahmen beruflicher Kompetenz. Zeitschrift für Berufs und Wirtschaftspädagogik 2010,106(3):128–137.Google Scholar
  42. Winther E: Kompetenzorientierte Assessments in der beruflichen Bildung – Am Beispiel der Ausbildung von Industriekaufleuten. Zeitschrift für Berufs- und Wirtschaftspädagogik 2011,107(1):33–54.Google Scholar
  43. Winther E, Achtenhagen F: Kompetenzstrukturmodell für die kaufmännische Bildung. Adaptierbare Forschungslinien und theoretische Ausgestaltung. Zeitschrift für Berufs- und Wirtschaftspädagogik 2008,104(4):511–538.Google Scholar
  44. Winther E, Achtenhagen F: Measurement of vocational competences—a contribution to an international large-scale assessment on vocational education and training. Empirical Research in Vocational Education and Training 2009, 1: 88–106.Google Scholar
  45. Winther E, Achtenhagen F: Skalen und Stufen kaufmännischer Kompetenz. Zeitschrift für Berufs- und Wirtschaftspädagogik 2009,105(4):521–556.Google Scholar
  46. Winther E, Achtenhagen F: Berufsfachliche Kompetenz: Messinstrumente und empirische Befunde zur Mehrdimensionalität beruflicher Handlungskompetenz. Berufsbildung in Wissenschaft und Praxis 2010, 1: 18–21.Google Scholar
  47. Wright BD, Stone MH: Best test design. Chicago: MESA Press; 1979.Google Scholar


© Winther and Klotz; licensee Springer. 2013

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.