Presenting the Meta-Performance Test, a Metacognitive Battery based on Performance

The self-report and think-aloud approaches are the two dominant methodologies for measuring metacognition. This is problematic, since they generate respondent and confirmation biases, respectively. The Meta-Performance Test is an innovative battery that evaluates metacognition based on the respondent's performance, mitigating the aforementioned biases. The Meta-Performance Test consists of two tests: the Meta-text, which evaluates metacognition in the domain of reading comprehension, and the Meta-number, which does so in the domain of arithmetic expression solving. The main focus of this article is to present the development of the battery in terms of its conceptual basis, development strategies, and structure. Evidence of its content validity is also presented, through the evaluation of three experts in metacognition, two experts in the Spanish language, two experts in mathematics, and five students representing the target population. The results of the judges' evaluations attested to the content validity of the Meta-Performance Test, and the target population found the battery understandable and adequate to take. Contributions and future research perspectives of the Meta-Performance Test in the field of metacognition are discussed.


Introduction
New work-related and educational paradigms have required students to develop the metacognitive abilities of awareness, monitoring, and regulation of their own thinking and learning, which enable them to deal with the current demands for constant learning required by the knowledge and information society (Cardoso et al., 2019; Gomes, 2007a, 2007b; Gomes & Borges, 2009a; Gomes et al., 2014b; Pereira et al., 2019). Metacognitive abilities are relevant predictors of academic performance, as they involve active processes of interaction of the subject with the objects of knowledge (Abdelrahman, 2020; Cai et al., 2019; Cromley & Kunze, 2020). The subject's active interaction, moreover, is a process that articulates and integrates a series of predictors of student performance, as is the case of learning approaches (Gomes, 2011a, 2013, 2020; Gomes & Golino, 2012b; Gomes et al., 2011, 2020b, 2021; Rodrigues & Gomes, 2020), students' beliefs about the learning process (Alves et al., 2012; Gomes & Borges, 2008a), self-referential cognitions (Costa et al., 2017), and motivation for learning (Gomes & Gjikuria, 2018).
Together with intelligence (Gomes, 2005, 2010a, 2010b, 2011b, 2012; Gomes & Borges, 2007, 2008b, 2009b, 2009c; Gomes & Golino, 2012a; Martins et al., 2018; Muniz et al., 2016) and socioeconomic variables (Gomes et al., 2020c, 2020d; Gomes & Jelihovschi, 2019), metacognition also has a prominent place in the prediction of academic performance (Gomes et al., 2014a; Pazeto et al., 2019, 2020; Pires & Gomes, 2018), because it demands intense self-regulatory activity from the subject. Research has shown that metacognition is linked to regions of the frontal lobe of the brain and to learning control and management functions (Dinsmore et al., 2008; Morales et al., 2018; Norman et al., 2019). Regarding the measurement of metacognition, the literature identifies two general methods to evaluate the construct. These methods are classified based on the temporal relationship between the measure used and the performance of the task: (1) offline methods and (2) online methods. The first includes evaluations that occur before or after the performance of the cognitive task; the second refers to evaluations that occur simultaneously with the performance of the cognitive task (Akturk & Sahin, 2011; Ohtani & Hisasaka, 2018). The systematic review of Gascoine et al. (2017) and the meta-analysis of Ohtani and Hisasaka (2018) identify self-report instruments and think-aloud protocols as the major measures used in offline and online methods, respectively. The virtually complete dominance of self-report questionnaires and think-aloud protocols in measuring metacognition is detrimental to the field, as these methods produce considerable bias. Although questionnaires have the advantage of enabling quick and accessible data collection, they demand that respondents have a good perception of their own internal processes.
Nevertheless, the literature indicates that this is not usually the case, with consistent evidence that many people do not have an accurate self-assessment of their own cognitive processes (Abernethy, 2015). In addition to demanding accuracy from the respondent, self-report questionnaires are related to many other biases or problems, such as acquiescence bias and social desirability (Craig et al., 2020; Wetzel et al., 2016).
In turn, think-aloud procedures also introduce various relevant types of bias. Data collection is done through tasks in which respondents must achieve some goal and report aloud what they are doing while they perform the task. Both the respondents' performance and their speech are recorded and subsequently evaluated by judges so that the metacognitive processes or abilities can be measured (Greene et al., 2018; Wolcott & Lobczowski, 2021). Given this intensive evaluation process, studies using this measure involve small samples (e.g., Van der Stel & Veenman, 2008; Veenman & Van Cleef, 2018). Moreover, Priede and Farrall (2010) point out that the need to speak out loud and the possible interference of the evaluator in the respondent's speech, whether by asking the person to continue "thinking out loud" or by ensuring that the respondent says something that can be submitted to analysis, among other elements, produce bias. In addition to this type of evaluator bias, the think-aloud method tends to generate confirmation bias, as metacognitive processes are identified and measured by judges. Das-Smaal (1990, p. 349) warns that "... real-world features, objects, and events can be categorized in countless different ways. Moreover, our perception is highly selective and therefore, already readily biased." The use of tests that evaluate metacognition based on the respondent's performance may be a viable alternative to mitigate the aforementioned biases. Evaluation using tests that focus on the respondent's own performance does not require judges to evaluate the respondents' performance, as occurs in think-aloud protocols, nor does it require respondents to report on their own abilities, as in self-report questionnaires.
In addition, evaluation using performance tests adds predictive power and generality to the results, since tests of this nature allow application to larger samples, without the series of biases found in questionnaires, which decreases measurement error and increases the statistical power of the analyses (Castillo, 2018).
There are few tests that evaluate metacognition based on the respondent's performance. To our knowledge, there are only three tests with this characteristic: the Metacognitive Skills and Knowledge Assessment (MSA; Desoete et al., 2001), the Metacognitive Knowledge Test (MKT; Neuenhaus et al., 2011), and the Metacognitive Monitoring Test (also called Reading Monitoring Test or Read Monitoring Test, MMT; Gomes et al., 2014a). Both the MSA and the MKT target primary school students. The MSA aims to evaluate abilities in the domain of metacognitive knowledge (declarative, procedural, and conditional) and abilities in the domain of regulation of cognition: prediction, planning, monitoring, and judgment. In turn, the MKT seeks to evaluate conditional and relational metacognitive knowledge in the domains of reading and mathematics. We found no studies presenting analyses of the structural validity of these tests by means of factor analysis of items. On the other hand, the MMT shows evidence of convergent, divergent, structural, predictive, and incremental validity for elementary, high school, and higher education students (Castillo, 2018; Gomes et al., 2014a). Despite the MMT's validity evidence, the test has three limitations: (a) it evaluates a single metacognitive ability, monitoring, and only in the context of reading texts; (b) it has an acceptable but low reliability, with Cronbach's alpha between 0.63 and 0.73; (c) its score derives from the respondent's performance in identifying certain errors in a text and correctly justifying each identification. The test has a space for respondents to write their justifications. As the score depends on the correct justification for each error identification, the evaluator needs to carefully read each justification, which makes score generation relatively slow; moreover, some evaluators may interpret a given justification as correct while others interpret the same justification as wrong, generating measurement bias.
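For reference, the Cronbach's alpha statistic cited in limitation (b) can be computed directly from an item-response matrix. The sketch below uses hypothetical dichotomous responses invented for illustration, not MMT data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # sample variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 0/1 responses: 6 respondents x 4 items
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])
print(round(cronbach_alpha(scores), 2))  # → 0.74
```

With real test data, a value in the 0.63–0.73 range reported for the MMT would be computed in exactly this way from the full item matrix.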
In 2019, M. A. Castillo and C. M. A. Gomes, researchers from the Cognitive Architecture Mapping Laboratory (Laboratório de Investigação da Arquitetura Cognitiva, LAICO), created the Meta-Performance Test in both Spanish and Brazilian Portuguese. The authors' goal for this battery was to produce a set of tests capable of overcoming certain limitations of the MSA, MKT, and MMT, so that it can become, in the future, a useful tool for the evaluation of metacognition. Among the improvements proposed in the Meta-Performance Test, it is worth noting that: (1) it generates its scores based on data derived from the respondent's performance, without requiring that respondents justify their answers, as the MMT does.
(2) The battery aims to assess three specific abilities, namely planning, monitoring, and judgment, using a reading comprehension test and an arithmetic expression solving test. This makes it possible to empirically analyze whether these three abilities occur both within the specific domains of reading comprehension and arithmetic expression solving and as broad, domain-general abilities. For example, the battery allows investigating the empirical presence of a reading comprehension monitoring ability, an arithmetic expression solving monitoring ability, and a monitoring ability independent of domain. This overcomes a limitation of the MMT, which evaluates reading comprehension monitoring but does not permit estimating monitoring independently of domain.
(3) Since its target population is undergraduate and graduate students, one of the goals of the Meta-Performance Test is to be a tool for the diagnosis of processes involved in the self-regulation of adult learning. This assessment tool seeks to facilitate the design of feasible and concrete interventions that focus on training these abilities and boosting the adult audience's ability to think and learn. The literature provides evidence that metacognitive ability training has a positive impact on students' academic performance (e.g., Alias & Sulaiman, 2017; Blummer & Kenton, 2014; Perry et al., 2018).
(4) Since the mistakes made by the respondents can be identified in the multiple-choice options, the Meta-Performance Test becomes a useful tool for a more accurate evaluation of the process that leads the respondent to answer incorrectly. This is a major advance that is in line with areas of knowledge focused on the evaluation of process, such as music therapy (André et al., 2016, 2017, 2020a, 2020b; Rosário et al., 2019; Sampaio et al., 2015).
Considering the arguments presented, the goal of this article is to present the rationale and development of the Meta-Performance Test in detail, showing its conceptual basis and the strategies that supported its development. The article also presents the first evidence of its content validity.

Conceptual Basis
The Meta-Performance Test takes the consensus built by the field of metacognition as a reference for the definition of its components. There is a relative consensus that metacognition consists basically of two major domains: (1) metacognitive knowledge and (2) control or regulation of cognition (e.g., Baker et al., 2020;Dent & Koenka, 2016;Ohtani & Hisasaka, 2018). The metacognitive knowledge about cognition (also called knowledge about cognition) refers to the individual's knowledge or beliefs about the variables that interact and can affect the course and results of cognitive tasks (Flavell, 1979). It also refers to the individual's knowledge about himself, about his abilities, limitations, difficulties, weaknesses, potentials, internal processes. In other words, the metacognitive knowledge involves the individual's knowledge of his inner world and everything that affects his performance in the external world (Rhodes, 2019).
In turn, the regulation of cognition is the set of abilities that allow the individual to control or regulate his own cognitive activities (Veenman et al., 2006). Different names have been used in the literature for the domain of regulation of cognition (e.g., metacognitive skills, metacognitive skillfulness, or metacognitive learning strategies), which hinders greater conceptual integration among researchers. However, despite differences in nomenclature, the conceptual definitions of the construct are quite similar, if not identical (e.g., Peña-Ayala, 2015; Veenman et al., 2004).
Within the educational context, the domain of regulation of cognition has proven to be significant in terms of academic prediction, with correlations ranging from 0.31 to 0.54 (Ohtani & Hisasaka, 2018). Veenman et al. (2006) linked the planning, monitoring, and judgment abilities to the domain of regulation of cognition. These abilities belong to this domain because they reflect online, "in the moment" processes that include awareness and regulation of one's own cognitive operations occurring before, during, or after completion of a task. The Meta-Performance Test aims to evaluate these three specific metacognitive abilities.

Planning
The literature provides several definitions of planning. For instance, a classical definition proposed by Owen (1997) states that planning refers to the ability to organize cognition and behavior in time and space, being necessary in situations where an objective must be achieved through a series of intermediate steps, each of which does not necessarily lead directly and individually to the objective. In turn, a contemporary definition is proposed by Oliveira and Nascimento (2014), who describe the construct as the capacity to intentionally define and structure actions and resources, aiming to efficiently achieve an objective. Additionally, they created a self-report instrument that aims to measure the different components of the construct. Li et al. (2015) point out that, compared to other metacognitive abilities, planning is less studied empirically due to the difficulty of separating it from task performance and other components. Concerning the evaluation of the construct based on performance, the literature indicates the existence of some tests under the tower tasks paradigm, namely the Tower of Hanoi (TOH) and the Tower of London (TOL; Georgiou et al., 2017). However, these tests have some limitations. Both the TOH and the TOL are used mostly in clinical contexts where the objective is to evaluate the occurrence of cognitive impairments in a given patient (e.g., Rodrigues et al., 2019). Because of this, the tests tend to produce a ceiling effect: people with typical development tend to obtain a maximum score, which makes it impossible to properly distinguish those with a higher level of planning (e.g., Kofsky et al., 2014; Oliveira & Nascimento, 2014).
In addition, these tests do not aim to measure only the planning ability but rather a mixed set of abilities, so the score they produce is a measure of planning confounded with other cognitive abilities, such as problem solving, working memory, or inhibition (Sullivan et al., 2009).
The Meta-Performance Test treats the planning construct as a very specific cognition self-regulation process, i.e., the ability to formulate and/or select a sequence of steps or strategies to solve a task. In this sense, the definition of planning adopted is restricted: the perspective that problem definition, among other processes, is part of planning does not fit this definition. From the perspective of the Meta-Performance Test, it is possible to plan with a certain degree of clarity and accuracy and, at the same time, define the problem of a task inadequately; these processes are therefore related but distinct abilities.

Monitoring
Monitoring is defined and described in different ways; for example, Rhodes (2019) defines it as the ability to observe, reflect and experience the progress of cognitive processes. In turn, in Nelson and Narens (1996), monitoring is a regulatory element responsible for the flow of information from the object level (cognition) to the target level (metacognition), thus reflecting a cognitive activity self-regulation process. Rhodes' (2019) and Nelson and Narens' (1996) definitions see monitoring as a general metacognitive ability, similar to the definition of the broad domain of regulation of cognition.
Despite the definitions of monitoring as a general metacognitive domain, the Meta-Performance Test approaches the construct as a specific metacognitive ability from the perspective of the error detection paradigm, i.e., as the process that allows the individual to identify flaws or errors at the time of solving a cognitive task (Markman, 1977). The error detection paradigm was originally used in the reading comprehension monitoring studies conducted by Markman (1977, 1979). As indicated by Baker (2016), in this paradigm errors or problems are introduced in the texts and various indices are used to determine whether readers notice the problems and try to solve them. The Meta-Performance Test incorporates an error detection paradigm similar to that used in the MMT (Gomes et al., 2014a) and in the methodology proposed in Pires and Gomes (2018).

Judgment
This ability refers to the probabilistic evaluation of one's own performance in a task (Mihalca et al., 2017). Schraw (2009) proposes a classification of metacognitive judgment according to the time when the evaluation of a task is made. The author calls prospective judgment, or prediction, the judgment made before the performance of a task. Some typical measures of this type of judgment include Ease of Learning (EOL), Feeling of Knowing (FOK), and Judgments of Learning (JOL) (e.g., Jemstedt et al., 2017; Taub et al., 2021). Concurrent judgment, in turn, refers to the evaluation that takes place during the process of solving the test items. This judgment occurs during the test, i.e., right after the respondent finishes answering each item, judging whether the item was answered correctly or incorrectly. Finally, retrospective judgment concerns the assessment of performance after completion of the test. This type of judgment has a more general character, involving the evaluation of performance on the entire test rather than item by item, as happens in concurrent judgment. In addition to being classified by the time at which it occurs, judgment is classified as continuous or dichotomous. Continuous judgment involves the use of a confidence scale, usually from 0 to 100, while dichotomous judgment uses the value 1 for confidence that the answer is correct and 0 for confidence that it is wrong (Schraw, 2008).
In the Meta-Performance Test, the concurrent judgment paradigm is used: respondents evaluate their own performance while taking the tests, analyzing whether or not they were successful after completing each item. As judgment is the individuals' own assessment of their performance, in a judgment test the respondents' performance consists precisely of the accuracy of that assessment. In this sense, the degree of accuracy of their perception of whether they got a certain item right or wrong determines the respondents' performance in the judgment test. Thus, the judgment items of the Meta-Performance Test are also based on performance, as are the planning and monitoring items.

Strategies to Develop the Battery
The Meta-Performance Test consists of two tests: the Meta-text, which evaluates metacognition in the domain of reading comprehension, and the Meta-number, which does so in the domain of arithmetic expression solving. Both tests seek to assess the planning, monitoring, and judgment metacognitive abilities of the regulation of cognition domain. As pointed out before, the presence of two tests with different cognitive domains makes it possible to evaluate, in terms of latent variables, the aforementioned metacognitive abilities both at the specific domain level and independently of domain. The development of the tests took as reference the criteria of the American Educational Research Association et al. (2014), through the Standards for Educational and Psychological Testing, for the development, validation, application, and interpretation of tests.

General Development Strategies
As the identification of errors is not a common task in everyday or academic life, the instructions of the Meta-text and Meta-number tests state the objectives involved in the clearest and most understandable way, always presenting at least one example of an item along with a hypothetical respondent's response to that item. With the purpose of properly distinguishing people with low, medium, and high levels of the target abilities, both tests have sets of items that, in theory, pose different levels of difficulty to the respondent. The items are presented in order of difficulty, i.e., easier items come first and more difficult items later. Each test item has three commands, and each of these commands aims to measure one of the three target metacognitive abilities of the battery. The first command (A) aims to measure planning, while the second command (B) seeks to measure the respondents' judgment regarding their planning. The third command (C) aims to measure monitoring.
Markman's error detection paradigm (Markman, 1977, 1979) was used to evaluate monitoring. Baker (2016) reports that one of the issues associated with the error detection paradigm is the difficulty participants have in identifying errors, because they assume that the content presented is correct and free of inconsistencies; furthermore, respondents may assume that if the content has inconsistencies, they will be explained later in the text. In the Meta-Performance Test, the command that aims to evaluate monitoring was specially designed to avoid some problems verified in tasks that use the error detection paradigm. In addition, both the Meta-text and the Meta-number instructions emphasize to the respondent that the texts or arithmetic expressions may contain errors, and that the task of Command C involves identifying such errors, if they occur in the item.

Specific Development Strategies
The strategies to measure planning (Command A) and monitoring (Command C) were customized according to the content of the domain. Below, the strategies for developing the tasks structure (i.e., items) and the presentation of these two commands in each test will be described.
Meta-Text. This test has 18 items. Each item consists of three fundamental elements: (1) a statement that describes the objective of a hypothetical author; (2) a set of phrases available to write a text according to the objective requested; (3) a text written by the hypothetical author using the phrases available (see Figure 1). The content of the objective, the phrases, and the text of each item was carefully prepared, covering a wide range of topics (e.g., society, art, nature, health, technology, and others). In addition, the texts contain words that are part of the respondents' vocabulary and current knowledge, so as to avoid poor performance due to lack of vocabulary or prior knowledge. To measure planning, Meta-text Command A prompts the respondent to make a plan to create a text that allows the author's goal to be achieved. The respondent should write, in the space provided, the numbers of the phrases that the author must have selected in order to correctly achieve his goal. Respondents who properly plan their reading comprehension process are expected to select all and only the phrases necessary for the author to achieve his goal.
In turn, for the monitoring evaluation, Command C asks the respondent to identify whether there are errors in the text written by the author. Errors can be phrases that should not have been included in the text because they do not contribute to achieving the author's goal, or phrases that should have been chosen, because they are directly related to the author's goal, but were not. In most items, the text written by the author contains errors. The exact number of items with errors will not be disclosed, so as not to expose the test answer key. For the answer to be considered correct, the respondent must have identified all the errors in the text written by the author. It is expected that respondents who properly monitor their reading process can find the existing errors and consequently get more monitoring commands right.
In theory, the Meta-text contains a balanced number of easy, medium, and difficult items. The strategies used to vary the difficulty of the items were designed according to the level of abstraction of the texts, objectives, and phrases available. One item was considered more abstract than another if it required drawing more logical conclusions or identifying more implicit elements necessary for the correct answer to the item.

Figure 2. Sample item structure of the Meta-number.
To measure planning, Command A requires that the respondent develop a plan, numbering and presenting, in the space provided, the steps necessary for the correct solution of the arithmetic expression. To develop the plan, the instructions provide a set of rules related to the structure and sequence of steps. The rules require that the respondent produce a single operation at each step, correctly integrate the result of each operation with the rest of the arithmetic expression, and respect the order and hierarchy of the operators. The rules were carefully created to ensure that there is only one sequence of steps that solves the arithmetic expression correctly. The respondent must strictly reproduce these steps in order to get the command right; therefore, respondents who properly plan their arithmetic expression solving process are expected to produce more steps correctly and get more planning commands right.
To evaluate monitoring in the Meta-number, Command C prompts the respondent to evaluate whether the steps provided in the item itself are correct. If they are not, the respondent must identify all the specific steps that generated an error. Most of the items prepared contain errors in the steps presented. The errors involve breaking the rules, such as: (1) disrespecting the hierarchy of operators; (2) performing an incorrect integration step; (3) not following the right order of calculation; (4) performing incorrect calculations; (5) omitting calculations; (6) writing an anomalous number.
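The single-operation stepping rules described above can be illustrated in code. The sketch below is a hypothetical implementation, not the authors' material: it reduces an arithmetic expression one operation per step, respecting operator hierarchy and left-to-right order, so that exactly one correct sequence of steps exists for a given expression:

```python
import re

PRECEDENCE = [("*", "/"), ("+", "-")]  # higher-precedence operators first

def solve_in_steps(expr):
    """Reduce an arithmetic expression one operation per step,
    respecting operator hierarchy and left-to-right order.
    Returns the list of intermediate expressions (the 'plan').
    Assumes non-negative integers and divisions that come out even."""
    steps = []
    tokens = re.findall(r"\d+|[+\-*/]", expr)
    while len(tokens) > 1:
        # find the leftmost operator of the highest precedence present
        for ops in PRECEDENCE:
            idx = next((i for i, t in enumerate(tokens) if t in ops), None)
            if idx is not None:
                break
        a, op, b = int(tokens[idx - 1]), tokens[idx], int(tokens[idx + 1])
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a // b}[op]
        # integrate the single operation's result back into the expression
        tokens[idx - 1:idx + 2] = [str(result)]
        steps.append(" ".join(tokens))
    return steps

print(solve_in_steps("2 + 3 * 4 - 6 / 2"))
# → ['2 + 12 - 6 / 2', '2 + 12 - 3', '14 - 3', '11']
```

A respondent's written plan would be scored against such a canonical sequence: any deviation (wrong operator order, an incorrect integration, a skipped step) breaks the rules the test describes.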
The test is very careful in providing the instructions for the tasks, showing examples of how to correctly carry out the commands, so as to prevent respondents from failing items due to a lack of understanding of the demands of the commands. The strategies to vary the difficulty of the items were designed according to the number of digits and operators that compose each arithmetic expression. In the case of Command A, the difficulty varies depending on the number of steps that lead to the correct answer to the arithmetic expression. In turn, in Command C, the difficulty varies according to the type of error and the number of erroneous steps.

Participants
The analysis of the Meta-Performance Test content validity involved four groups of participants. The first consists of three experts in the metacognition construct, while the second consists of two experts in Spanish language; the third consists of two experts in mathematics, and the fourth group consists of five undergraduate and graduate students (master's degree), representing the target population of the test.
The group of experts in the construct consisted of two psychologists, one with a master's degree in school psychology and the other with a master's degree in developmental psychology, and a pedagogue specializing in psychopedagogy; these experts ranged from 29 to 39 years of age. The group of experts in the Spanish language included an expert with a master's degree in language studies, aged 28, and a pedagogue specializing in languages and literature, aged 30. The group of experts in mathematics consisted of two male mathematics university professors, aged 30 and 41. The participants from the target population were all adults (3 female and 2 male): two were undergraduate students of psychology, two were pursuing a master's degree in psychology, and one was pursuing a master's degree in neuroscience; they were aged between 19 and 42.

Instrument
Meta-Performance Test. Developed by Castillo and Gomes with the purpose of measuring the planning, monitoring, and judgment metacognitive abilities at the higher education level.
The test consists of the Meta-text test and the Meta-number test, respectively belonging to the domains of reading comprehension and arithmetic expressions solving. Each item of the two tests consists of three commands (A, B, and C). The items were designed in order to have a balanced level of difficulty (easy, medium and difficult). In addition, the tests are designed so that each one takes no longer than 60 minutes. Figure 1 and Figure 2 show item samples for the Meta-text and the Meta-number, respectively. Readers interested in knowing the complete test can request it from the corresponding author.
Considering that the test seeks to evaluate metacognition based on the respondents' performance, the answers to Commands A (planning) and C (monitoring) are scored as follows: respondents receive a score of one (1) if they answer the command correctly or a score of zero (0) if they answer incorrectly. In turn, Command B measures the respondents' judgment of their own performance on Command A. If they think they got Command A of a particular item right, their judgment score on that item is 1; otherwise, it is 0. To calculate judgment accuracy, we used as a parameter the tetrachoric correlation between the respondents' raw scores on Command A and their raw scores on Command B. A high positive correlation indicates high judgment accuracy.
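As an illustration of this scoring logic, the sketch below estimates judgment accuracy from hypothetical Command A and Command B score vectors. Note that it uses Pearson's classic cosine-pi approximation to the tetrachoric correlation rather than the full maximum-likelihood estimation a real analysis would use, and the data are invented for illustration:

```python
import math

def tetrachoric_approx(x, y):
    """Rough tetrachoric correlation via Pearson's cosine-pi
    approximation, from two dichotomous (0/1) score vectors.
    Illustrative only; production work should use ML estimation."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both 1
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both 0
    if b * c == 0:                 # degenerate table: no discordant pairs
        return 1.0 if a * d > 0 else 0.0
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))

# Hypothetical data: Command A accuracy vs. Command B self-judgment
command_a = [1, 1, 0, 1, 0, 0, 1, 1]  # 1 = answered Command A correctly
command_b = [1, 1, 0, 1, 1, 0, 1, 0]  # 1 = judged own answer as correct
print(round(tetrachoric_approx(command_a, command_b), 2))  # → 0.68
```

Here the positive correlation indicates that the hypothetical respondent's self-judgments largely track actual performance, which is exactly what a high judgment-accuracy score means in the battery.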

Procedures for Data Collection and Analysis
All participants received information about the procedures and objectives of the study beforehand, and their participation was conditional on signing a Free and Informed Consent Form (FICF). Table 1 presents the content validity stages and schematically shows the tasks each sample (the study participants) was asked to perform, as well as the data collection and analysis strategies.

Step 2. Experts in the content: Spanish and mathematics

Experts in Spanish:
1. Analyze the writing of the Meta-text, as well as the appropriateness and clarity of its instructions.
2. Logically analyze the arguments of the texts and items, and assess the test answer key.

Experts in mathematics:
1. Analyze the appropriateness and clarity of the instructions.
2. Analyze the test answer key.

Data collection and analysis:
- Send the Meta-text and Meta-number assessment protocols to the experts via email.
- Interview the experts via Skype.
- Take notes on the experts' comments during the interviews.
- Check the problems pointed out by the experts.

Step 3. Target population
- Semantic analysis of the tests' instructions (appropriateness and clarity).
- Answer the tests: Meta-text and Meta-number.
- Send the tests via email.
- Check understanding of the tests.
- Check the answer key.
- Record the tests' completion times.
- Preliminary analysis of the items' difficulty level.
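The preliminary difficulty analysis in Step 3 can be sketched as a simple classification of each item by the proportion of respondents who answered it correctly. The 0.7/0.3 cut-offs below are illustrative assumptions, not thresholds reported by the authors:

```python
def difficulty_level(p_correct, easy_cutoff=0.7, hard_cutoff=0.3):
    """Classify an item by the proportion of correct responses.

    Cut-offs are illustrative assumptions: items answered correctly by
    at least 70% of respondents are 'easy', by at most 30% 'difficult',
    and everything in between 'medium'.
    """
    if p_correct >= easy_cutoff:
        return "easy"
    if p_correct <= hard_cutoff:
        return "difficult"
    return "medium"

# Hypothetical proportions correct per item (e.g., from the five
# target-population respondents); item numbers are made up.
proportions = {1: 0.8, 2: 0.6, 3: 0.2}
levels = {item: difficulty_level(p) for item, p in proportions.items()}
```

With only five respondents, as in this stage, such proportions are of course coarse; the authors present them as a preliminary check rather than a formal item analysis.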

Results
Considering that the content validity analysis involved three distinct stages, the results are presented in the same sequence. Table 2 summarizes the general evaluations of the experts in the construct and in the content. The three experts in the construct found the Meta-text and Meta-number instructions adequate and clear, and judged that the items and commands of these tests allow, in theory, the evaluation of planning, monitoring, and judgment abilities.
The assessment of the two experts in Spanish language agreed with the test answer key, except for the answers to two items (items 4 and 7). During the interview, the experts explained why they disagreed with the answer key for these items. After analyzing their explanations, both items were reformulated following the experts' suggestions. Beyond the answer key, the experts pointed out a few grammatical errors in the instrument, which were reviewed and corrected. In turn, the two experts in mathematics found the test instructions appropriate and clear. They also agreed with the answer key, except for the answer to one item (item 14). After analyzing their arguments, the item was reformulated according to the experts' suggestions.

General remarks

Experts in the construct
- Expert 1: "The test is well structured and makes it possible to evaluate metacognitive components. The instructions are clear and well exemplified, and the items vary in difficulty level."
- Expert 2: "The instructions of each of the commands are directly articulated with the evaluation of planning, monitoring, and judgment abilities. Therefore, the requested tasks allow the activation of metacognitive processes. I agree with the justification for the answer key for each item."
- Expert 3: "The items allow a novel assessment of metacognitive skills in higher education. The rationales for the responses to each item are duly justified."

Experts in Spanish language
Meta-text
- Expert 1: "The writing and logical arguments of most items are presented correctly. However, it is necessary to review the answer key of items 4 and 7. The logic of the arguments presented in the answer key of these items may create difficulties for respondents."
- Expert 2: "In general, the items are adequately written and the arguments for each answer are properly supported. Each item shows thematic diversity, which can help keep respondents' interest during the test."

Experts in mathematics

Meta-number
- Expert 1: "The test presents a novel and interesting strategy to evaluate the metacognitive process of solving arithmetic expressions. The steps and rules indicated are correctly and clearly defined. Nevertheless, the answer key for item 14 should be revised, since the steps presented to evaluate planning (command A) might be incorrect (steps 3 and 4)."
- Expert 2: "The test requires prior knowledge of mathematics commensurate with the level of higher education. This minimum knowledge is consistent with the target group of the test. I agree with the test answer key, except for item 14."

The evaluation of the Meta-text and Meta-number by the target population was favorable. The semantic analysis indicated that the test instructions are understandable and that the tests are feasible. Table 3 displays a preliminary analysis of the items' difficulty level; both the Meta-text and the Meta-number contain items considered easy, medium, and difficult. The mistakes made by the participants were caused specifically by difficulties inherent to the items themselves. Regarding completion time, respondents took an average of 46 minutes to complete the Meta-text and 71 minutes to complete the Meta-number. In view of the time required to answer the Meta-number items, the authors consider it necessary to reduce the number of items in this test to 14.
This is supported by the fact that four of the five members of the target population reported that they were still answering item 15 at minute 60 of the Meta-number test.

Table 3. Preliminary difficulty level (easy, medium, difficult) of the Meta-text and Meta-number items.

Discussion
This study aimed to present the rationale and development strategies of the Meta-Performance Test. In addition, evidence of this test's content validity was presented. As a result of this paper, some implications to the field of metacognition studies can be pointed out.
The first concerns the measurement issue in the area of metacognition. Self-report instruments and think-aloud protocols, despite being the most widely used measures of metacognitive abilities, generate considerable respondent and confirmation biases, respectively. These biases have been documented in previous research (e.g., Abernethy, 2015; Craig et al., 2020; Priede & Farrall, 2010). The use of performance tests is an alternative for dealing with these biases; however, the scarcity of this type of measure is an important limitation of the field. Consequently, this article detailed the Meta-Performance Test, which evaluates metacognitive abilities based on the respondent's performance. The objective of this article is in line with the review of Ohtani and Hisasaka (2018), who emphasize the need to develop innovative measures that assess specific metacognitive abilities, mainly within the domain of the regulation of cognition.
A second implication involves the presentation of initial evidence of the test's content validity, obtained through the analysis and scrutiny of experts in the construct, experts in the content, and representatives of the target population. In other words, there is evidence reinforcing the theoretical assumptions that the items are markers of the target abilities and that the tests are feasible. According to the American Educational Research Association et al. (2014), evidence based on test content is one of the various sources that contribute to the accumulated evidence of validity. In the development of new instruments, this evidence acquires greater importance, since it makes it possible to examine the relationship between the content of a test and the construct to be measured.
Finally, the Meta-Performance Test comprises two tests, each in a different cognitive domain, allowing the measurement both of metacognitive abilities specific to each domain and of these abilities independently of domain. The evidence provided by Rouault et al. (2018) indicates that metacognition not only operates as a domain-general resource applied across cognitive tasks, but also operates differently depending on the nature of these tasks.

Conclusion
Metacognition includes a set of abilities with important applications in the educational and professional contexts demanded by today's society. Developing objective evaluation measures that capture the nature of the construct is highly relevant for conducting diagnostic and intervention processes. The almost exclusive predominance of self-report and think-aloud measurement procedures is a limitation that needs to be overcome, and the Meta-Performance Test is a promising alternative.

Recommendations
This study presents only initial evidence of the Meta-Performance Test's content validity, and new investigations are necessary for this test to accumulate robust evidence of validity and become available for general use. This study is only the initial part of a broad set of necessary validity studies. Future research should seek evidence of the Meta-Performance Test's structural and external validity, generating more robust evidence on the instrument's construct validity. If this evidence is favorable, the test may become an assessment instrument available to professionals who need a measure of metacognition.

Limitations
Although the Meta-Performance Test is intended to measure a set of metacognitive abilities within the domain of regulation of cognition (such as planning, monitoring, and judgment), there are other abilities that some authors identify and which are not part of the test (e.g., prediction, orientation, adjusting, debugging). Moreover, considering that university students are the test's target population, major adaptations to item content would be needed if the test were to be validated at other educational levels. These adaptations are especially relevant for primary and lower secondary education. Finally, this article focuses only on the presentation and content validity of the test; empirical evidence is therefore needed about the structure of the test in different samples and cultures.