Abstract
We study the question to what extent the task of predicting the quality of student essays can be supported by computing “flows” of semantic types of argumentative units. Specifically, we use tagsets for claim and premise types that were recently applied to the Argument Annotated Essays corpus (AAE; Stab/Gurevych 2017) by Schaefer et al. (2023). We train argument component and semantic type classification models on AAE and then use them to label the essays in two corpora that have numeric essay ratings, viz. FEEDBACK/PERSUADE and ICLE. We train linear classification models on flow features and find that flows of our semantic types are a better predictor for essay quality (in a simplified good/bad dichotomy) than flows of coarse argument components (major claim, claim, premise). Finally, we calculate feature impact and perform a qualitative inspection, which reveals some tendencies in pattern occurrence across the two essay classes.
1 Introduction
Over the last decade, the field of Argument Mining (AM) has grown into a fruitful area of study that comprises a set of challenging sub-tasks [16, 32]. In our work, we make use of the automatic identification and extraction of argument components, i.e., claims [8, 25] and premises [23]. This has been studied for different text domains including news editorials [2], Wikipedia articles [23], social media data like tweets [27], and student essays [31]; the latter are the domain that we address here.
One application of analyzing argumentation in student essays is in contributing to assessing the quality of an essay. To this end, a variety of argument-related features have been studied and found to be useful in the past (see Sect. 2). In this paper, we add features of “flows” (sequences of occurrence in the text) of the types of claims and premises. We compare the impact of coarse types (major claim, claim, premise) to fine-grained semantic types of those components (e.g., fact, value and policy claims; see Sect. 3.1). We achieve this by utilizing the Argument Annotated Essays (AAE) corpus [31] for training ADU identification and semantic type classification models. These models are used to automatically label our two target essay corpora Feedback [7] and ICLE [11], which have previously been annotated with essay quality ratings, with ADUs and their types. We then extract semantic type flows and use them as features in linear classification models for essay quality prediction.
Our two contributions are (i) the finding that for some dimensions of essay quality, flows of fine-grained features are more powerful predictors than flows of the coarse features; and (ii) a qualitative analysis that leads to some observations on correlations between flow patterns and essay quality.
The next section provides an overview of related work, and Sect. 3 introduces the three corpora we are working with, and the features we use for semantic types. In Sect. 4, we describe our experiments, which involve some “within-domain transfer” in that we train on an essay corpus annotated for the component features but that does not have quality scores [31] and then run those models on two corpora that offer scores but no (compatible) type annotation [7, 11]. We discuss the findings in Sect. 5 and conclude in Sect. 6.
2 Related Work
Argument Mining in Essays. The AAE corpus, consisting of 402 essays with claims, premises and relations among them [31], is a widely-used resource for developing AM techniques. We mention a few, viz. component detection [30], semantic type annotation and identification [4, 26], essay quality assessment [4, 33], and end-to-end AM [21, 24]. It was also applied in research on unsupervised AM [22], the analysis of argumentation strategies [26], and multi-scale AM [34]. The latter utilizes the text units essay, paragraph, and word for major claim, claim and premise identification, respectively. Another essay corpus that received attention in AM is ICLE [12]. For example, [5] used its rich annotations to compare aspects of argumentation strategies across different cultural groups among English learners.
Argument Component Types. Specific types of argument components have been used to label claims and premises in a variety of text genres. In Wikipedia [23], editorials [2], and persuasive essays [4, 26] premises have been annotated as, e.g., study/statistics, expert/testimony, anecdote or common knowledge/common ground. Other annotated premise types include study, factual, opinion, and reasoning in idebate.org data [15]. For claims, fact, value and policy have been annotated in persuasive essays [4, 26], in addition to logos, pathos, and ethos [4], i.e. Aristotle’s modes of persuasion [14]. Claims in Amazon reviews have been labeled with the types fact, testimony, policy, and value [6].
Social media text has been a popular target, too. Annotated types include evidence types typical for social media, e.g. news media accounts, blog posts, or pictures [1], factual vs opinionated [9], and more recently un/verifiability, reason and external/internal evidence [27]. Furthermore, discussions collected from the subreddit Change My View were annotated for the claim types interpretation, evaluation-rational, evaluation-emotional, and agreement/disagreement, while premises were labeled with logos, pathos, and ethos [13].
In our work, we apply the set of claim and premise types that we described in our recent work on argument strategy analysis [26]. It was derived and extended from previous studies [2, 4].
Argument Analysis for Essay Scoring. In early work, [18] found correlations between distributions of argument component types and holistic essay scores. In contrast, [29] evaluated the contents of the arguments in relation to the argument scheme present in the essay prompt. Building on their data, [3] turned to structure and found a moderate positive correlation between holistic essay scores and distributions of argument components and relations. Similarly, [10] showed that scoring TOEFL essays benefits from features like the number of claims and premises, the number of supported claims, and aspects of tree topology. [20] worked with a broad set of linguistic features and distributions of argument components to predict scores in the ICLE corpus. Closely related to our work is the study by [33] who proposed to use linear “flows” of (coarse) premise and claim units for essay scoring and examined their contribution. We extend this by attending to the more fine-grained features of units.
3 Data
3.1 Argument-Annotated Essays Corpus
We use the AAE corpus [31] as a starting point. The corpus contains 402 student essays annotated for the argumentative discourse units (ADUs) major claim, claim, and premise and their relations support and attack. Major claims and claims are linked via stance annotations. Importantly, component types can be derived from the argumentation structure: claims always relate to the essay’s major claim, while premises support or attack claims (or other premises). Also, while claims and premises can occur in all essay paragraphs, major claims are supposed to be restricted to the first and last paragraphs.
In previous work [26], we annotated the AAE corpus for semantic claim and premise types that can be used for the extraction of argumentative flow patterns. We provided evidence that these flow patterns are suitable for the analysis of argumentation strategy in essays. Here, we will briefly describe the semantic types. For more detailed definitions and examples, we refer the reader to [26]. The following claim types were annotated: policy, value, and fact (see Table 2 below for proportions). Policy refers to claims arguing in favor of some action being taken or not being taken. Value claims evaluate a target, e.g. they may argue towards it being good/bad or important/unimportant. Fact claims (see Note 1), on the other hand, state that some target is true or false. In addition to the claim types, we annotated the following premise types: testimony, statistics, hypothetical-instance, real-example, and common-ground. Testimony gives evidence by referring to some expert. Statistics uses the results of quantitative research, among others, as evidence. Hypothetical-instance and real-example are both example categories. The former refers to situations created by the author, i.e. hypothetical situations, while the latter describes actual historical events or a specific statement about the world. Finally, common-ground includes common knowledge, self-evident facts, or similar.
In this work, we use the AAE corpus for training ADU identification and semantic type classification models, which are then used to automatically label the Feedback and ICLE corpora with ADUs and their types. Note that we do not use the original relation and stance annotations.
3.2 Feedback Corpus
The Feedback corpus (n = 3,405) is a subset of the PERSUADE corpus [7], which consists of 25,996 essays written by students from grades 6 through 12. In total, 15 prompts were used to elicit the essays. The corpus has been annotated for different ADU types: lead, position, claim, counterclaim, rebuttal, evidence, concluding statement. The corpus was additionally annotated for different quality dimensions, such as cohesion.
Comparing the argumentative components of the PERSUADE corpus with those of the AAE corpus reveals an apparent overlap in categories. Both corpora are annotated for claim and premise/evidence. Position and major claim are defined similarly. However, recall that the ADU types in the AAE corpus are derived from the overall argumentation structure (via the relations between components), while in the PERSUADE corpus, ADUs are defined semantically.
Semantic type classification builds on top of previously classified ADU types. A direct mapping of the ADU types from PERSUADE to AAE would allow us to learn ADU classification on a much larger corpus, with more confidence in the predictions for out-of-domain data. To test whether the annotations of the AAE corpus are compatible with those of the PERSUADE corpus, we compare the predictions of our ADU classifier (trained on the AAE data) for the PERSUADE corpus with the original component labels. Mapping the output of our model to the annotations reveals mixed results (see Fig. 1). While evidence and premise overlap to a good extent, differences in claim conceptualization appear problematic. Both claim and counterclaim are mapped by similar proportions to claim and premise by our model. Rebuttal, which is defined as “a claim that refutes a counterclaim” [7], is mostly classified as premise, while concluding statement corresponds to the whole variety of AAE components. Thus, conceptualizations of argument components are on the whole different in the two corpora, and we therefore decided not to use the component annotations of the Feedback corpus and to work with our predicted labels instead.
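The label comparison described above can be sketched as a row-normalized mapping table: for each original PERSUADE label, the share of units that the AAE-trained model assigns to each coarse type. The function and the toy labels below are illustrative, not the paper’s actual evaluation code or figures:

```python
from collections import Counter

def label_mapping_proportions(gold_labels, predicted_labels):
    """For each gold (PERSUADE) label, compute the proportion of units
    that the model assigns to each predicted (AAE) coarse type."""
    counts = {}
    for gold, pred in zip(gold_labels, predicted_labels):
        counts.setdefault(gold, Counter())[pred] += 1
    return {
        gold: {pred: n / sum(c.values()) for pred, n in c.items()}
        for gold, c in counts.items()
    }

# Toy illustration with made-up labels (not corpus statistics):
gold = ["evidence", "evidence", "claim", "claim", "rebuttal"]
pred = ["premise", "premise", "claim", "premise", "premise"]
props = label_mapping_proportions(gold, pred)
# e.g. props["evidence"]["premise"] == 1.0, while "claim" splits 50/50
```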
For our quality prediction experiments, we use the dimensions cohesion and conventions. A text with high cohesion is defined as containing a variety of effective linguistic features such as reference and connectives to link ideas across sentences and paragraphs. Conventions is defined as the use of common rules, including spelling, capitalization, and punctuation.
3.3 International Corpus of Learner English
Our second target corpus is derived from the ICLE corpus [11], which contains more than 6,000 student essays, of which 91% are argumentative. While no argument component annotations are available, the corpus has been annotated for different scoring dimensions. In this work, we utilize the subset of the corpus that has been annotated for organization [19] and argument strength [20] (n = 896). Previously, a high organization score was defined as providing a position with respect to an introduced topic and supporting that position [28]. As this definition roughly describes the core aspects of argumentation, we assume this scoring dimension to be a good candidate for our study. On the other hand, an essay with high argument strength “presents a strong argument for its thesis and would convince most readers” [20]. Argument strength is thus tied to persuasiveness, again one of the core aspects of successful argumentation.
4 Experiments
Our experiments consist of two steps: Labeling the two target corpora with ADUs and their semantic types (Sect. 4.1), and testing the contribution of type change flows for the task of essay score prediction (Sect. 4.2). In Sect. 4.3, we undertake a qualitative inspection of flows associated with essays of different quality.
4.1 ADU and Semantic Type Classification
We first classify the coarse type of the argumentative components as major claim, claim, and premise. Afterward, we classify the fine-grained semantic types conditioned on the previously identified coarse type. For the semantic type classification, however, we do not distinguish between major claims and claims but regard both as claims. As both classification tasks, ADU and semantic type, have been studied previously [26, 31, 33], we do not conduct extensive comparative experiments here but report the performance of our ensembles to allow a better estimate of the quality of the projected labels.
For each of the two steps, we train an ensemble of three models. We use 10% of the AAE corpus for development and the remaining data for training. Per run, the data is split randomly (with the random number seed set to 1, 2, or 3).
As a classifier, we use a pre-trained language model, roberta-base [17], for both the coarse and the fine-grained step. Following previous work by [33], we identify ADUs solely on the sentence level, disregarding smaller units. Our input to the model is the target sentence plus one additional sentence on the left and the right, to provide context. The context is separated from the target sentence by the model’s special tokens. We found that adding this context improves results compared to processing single sentences. Also, it works better than giving the model more context information (additional sentences or structural information such as paragraph breaks).
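The input construction can be sketched roughly as follows. The separator string and the symmetric one-sentence window are assumptions based on the description above (RoBERTa’s `</s>` separator), not the authors’ exact preprocessing:

```python
SEP = "</s>"  # RoBERTa's separator token (assumed; model-specific)

def build_input(sentences, i, window=1):
    """Target sentence plus up to `window` neighboring sentences on each
    side, separated from the target by the model's special tokens."""
    left = " ".join(sentences[max(0, i - window):i])
    right = " ".join(sentences[i + 1:i + 1 + window])
    return f"{left} {SEP} {sentences[i]} {SEP} {right}".strip()

sents = ["First.", "Target sentence.", "Last."]
build_input(sents, 1)  # "First. </s> Target sentence. </s> Last."
```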
The ensembles are evaluated on the full AAE corpus. The final classification result is an averaged softmax, from which the label with the maximum probability is chosen. See Table 1 for the results on the annotated corpus. We further assessed our approach manually on a smaller sample: we sampled 15 instances per semantic type and obtained satisfactory macro results (Claims: 95.55 F1, Premises: 91.64 F1). However, during our review, we noticed some problems with the underlying processed data, e.g. grammatical inconsistencies within sentences and resulting difficulties in understanding the author’s intentions, which are unfortunately beyond our project’s scope.
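The ensemble decision described above amounts to averaging the members’ softmax distributions and taking the argmax. A minimal sketch; the logits below are made up for illustration:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model, labels):
    """Average the softmax outputs of the ensemble members and pick
    the label with the maximum averaged probability."""
    probs = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return labels[int(np.argmax(probs))]

labels = ["major_claim", "claim", "premise"]
logits = [np.array([0.1, 2.0, 0.3]),   # member 1 favors claim
          np.array([0.2, 1.5, 1.4]),   # member 2: claim vs premise close
          np.array([0.0, 0.9, 1.0])]   # member 3 slightly favors premise
ensemble_predict(logits, labels)  # → "claim"
```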
We then use the trained classification models to predict argument components and semantic types in our two target corpora, Feedback and ICLE. Table 2 shows the distribution of semantic types both for the manually annotated AAE corpus and for the automatic predictions in the Feedback and ICLE corpora. While some types are equally distributed, e.g. policy and statistics, there are notable differences in others. For instance, fact claims occur more frequently in the AAE essays, while our models labeled claims in Feedback and ICLE more often as value. For premises, Feedback contains substantially more hypothetical-instances, while the majority class in ICLE is common-ground.
4.2 Predicting Essay Quality with Flows of Semantic Types
In this section, we investigate whether essay quality prediction can be improved by using flows of our fine-grained semantic types, in comparison to flows of coarse ADU types, as they had been used by [33]. By “flow”, we mean the linear sequence of type labels that occur in a text unit (paragraph or full text). Importantly, we work with change flows, which result from collapsing sequences of identical types into a single label. This way, we ignore the information on the “length” of a stretch with the same type and focus only on the changes from one type to another.
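Collapsing a type sequence into a change flow takes only a few lines. This sketch uses the premise-type abbreviations that appear later in the text (PCG for common-ground, PHI for hypothetical-instance, etc.):

```python
from itertools import groupby

def change_flow(type_sequence):
    """Collapse runs of identical type labels into a single label,
    keeping only the changes from one type to another."""
    return [label for label, _ in groupby(type_sequence)]

change_flow(["PCG", "PCG", "PHI", "PHI", "PCG"])
# → ["PCG", "PHI", "PCG"]
```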
To simplify the prediction problem, we group all essays into two classes, good and bad. We normalize all quality scores to the range [0, 1] and then label essays with a score above 0.7 as good and all others as bad.
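A minimal sketch of this binarization step, assuming min-max normalization over the observed score range (the paper does not specify the normalization method):

```python
def quality_class(scores, threshold=0.7):
    """Min-max normalize raw quality scores to [0, 1] and binarize:
    scores above the threshold are 'good', all others 'bad'."""
    lo, hi = min(scores), max(scores)
    return ["good" if (s - lo) / (hi - lo) > threshold else "bad"
            for s in scores]

quality_class([1.0, 2.5, 4.0])  # → ["bad", "bad", "good"]
```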
Given the annotations of coarse ADU types and semantic types in the two target corpora, we extract change flow features, both on the global essay level and on that of paragraphs, and for ADU and semantic types, respectively. In Table 3, we show the most common change flows of semantic types in the corpora, divided into first paragraph, body, and last paragraph.
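One plausible feature encoding for the extracted flows is a simple bag of change-flow patterns; the paper does not detail its exact encoding, so this is an assumption:

```python
from collections import Counter

def flow_features(flows):
    """Bag-of-flows encoding: count each change-flow pattern
    (labels joined into one string) observed in an essay or paragraph."""
    return Counter("-".join(flow) for flow in flows)

feats = flow_features([["PCG", "PHI"], ["PCG", "PHI"], ["PST", "PCG"]])
# → Counter({"PCG-PHI": 2, "PST-PCG": 1})
```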
For predicting the quality class, we train linear models on all extracted change flow features; in particular, we choose stochastic gradient descent (SGD) classifiers. We set the maximum number of iterations to 1500, use balanced class weights, and use grid search cross-validation to decide on the remaining parameters.
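In scikit-learn terms, this setup might look as follows. The hyperparameter grid is illustrative, as the actually searched parameters are not listed in the paper:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the paper does not list the searched parameters.
param_grid = {"alpha": [1e-5, 1e-4, 1e-3],
              "loss": ["hinge", "modified_huber"]}

clf = GridSearchCV(
    SGDClassifier(max_iter=1500, class_weight="balanced", random_state=1),
    param_grid,
    scoring="f1_macro",  # matches the macro scores reported below
    cv=5,
)
# clf.fit(X, y)  # X: change-flow feature matrix, y: good/bad labels
```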
We run a comparison on 10-fold cross-validation with optimal parameters. Table 4 shows our averaged macro scores (precision, recall, and F1) summarized as mean and standard deviation over 10 runs. We present results for all four essay scoring dimensions cohesion, conventions, organization, and argument strength. Baseline refers to a stratified classifier, which performs classification based on the observed frequency and outperforms a simple majority voting baseline.
Both ADU and semantic type models outperform the baselines. We achieve higher F1 scores for the dimensions conventions and organization with models trained on semantic type change flows instead of coarse ADU type change flows (conventions: 0.559 vs 0.528; organization: 0.603 vs 0.580). For cohesion and argument strength, the two types of flows obtain similar results.
4.3 Analysis of Feature Impact
We use the trained linear models to extract semantic change flow features that are prevalent in good vs bad essays and are thus good predictors for the respective class. We normalize the coefficients to center them around zero; thus, positive coefficients of features in the linear model are associated with better essay scores, while negative coefficients are associated with worse scores. We investigated the most important change flows in good vs bad essays both on the full essay level and on the paragraph level. We only present the results for the body paragraphs (see Tables 5 and 6), as this is presumably where the main argumentation unfolds.
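The feature-impact extraction can be sketched as follows; centering the coefficients and taking the top-k features per class is a plausible reading of the description above, not the authors’ exact code:

```python
import numpy as np

def feature_impact(coefficients, feature_names, k=3):
    """Center the linear model's coefficients around zero and return
    the k strongest positive (good-essay) and negative (bad-essay)
    change-flow features."""
    coefs = np.asarray(coefficients) - np.mean(coefficients)
    order = np.argsort(coefs)                 # ascending
    good = [feature_names[i] for i in order[::-1][:k]]
    bad = [feature_names[i] for i in order[:k]]
    return good, bad

names = ["PCG-PHI-PCG", "PRE", "PST-PCG", "PHI"]
feature_impact([0.9, -0.7, 0.1, -0.3], names, k=1)
# → (["PCG-PHI-PCG"], ["PRE"])
```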
With respect to claim-premise change flows in paragraphs, bad essays are most notably characterized by a lack of claims, i.e., paragraphs that consist only of premises. This is especially the case for the quality dimensions cohesion, organization, and argument strength. Furthermore, paragraphs of good essays appear to show more type variety. This is observable for all quality dimensions, but most clearly for the ICLE corpus, i.e. organization and argument strength.
More patterns emerge in the premise change flows. For instance, both Feedback dimensions (cohesion and conventions) show the same most dominant flows in good essays, i.e. PCG-PHI-PCG-PHI and PST-PCG. Also, paragraphs in essays with a high conventions score tend to begin with common-ground, while flows exhibit fewer changes than in bad essays. Recall that this does not necessarily imply a less complex argumentation structure, as change flows collapse sequences of identical semantic types.
ICLE essays with a high organization score show complex premise change flows, which often include several common-ground units framing hypothetical-instance, real-example, or combinations of those. Bad essays, on the other hand, are characterized by example types that are more rarely used in combination with common-ground. Similar observations can be made for the argument strength dimension. As with the Feedback corpus, both ICLE dimensions have identical dominant flows, i.e. PCG-PRE-PCG and PCG-PHI-PCG.
5 Discussion
Transfer across corpora is a complex task. Even corpora that belong to the same general domain of texts, e.g. persuasive essays, may exhibit notable differences in argumentation structure and strategies. This is reflected in the distribution of semantic types across our essay corpora. For instance, the AAE corpus contains a substantially larger proportion of fact claims compared to both Feedback and ICLE. The Feedback corpus shows an especially large proportion of hypothetical-instance, while premises in the ICLE corpus have been predominantly labeled with common-ground. These differences in semantic types have an impact on the observable change flows, and thus on argumentation strategies.
To begin with, bad essays with respect to cohesion, organization and argument strength tend to contain paragraphs without a claim more often than good essays. This is intuitively plausible, as a full argument typically consists of a claim and at least one premise. However, important change flows for the prediction of bad essays with respect to the conventions dimension still contain claims. This may be due to the quality dimension at hand, as conventions is less clearly linked to argumentation quality than the other dimensions.
Second, the suitability of premise change flow complexity as a predictor for essay quality depends on the corpus and quality dimension. While ICLE essays with high organization and argument strength scores tend to show more variety in premise change flow patterns, Feedback essays with high conventions scores show less variety.
Third, good essays with respect to conventions, organization, and argument strength show change flows that begin with common-ground or use it as a framing type, typically in combination with an example type. This is in line with the argumentation strategy found in the AAE corpus of beginning (and ending) an argument with a general observation while inserting more concrete premises, e.g. examples, in between [26]. Overall, we can summarize that semantic change flows can be indicative of argument strategies applied to produce a persuasive essay of high quality.
6 Conclusion
In this work, we studied the question to what extent argument arrangement in the sense of change flows of semantic types can support the prediction of student essay quality.
To this end, we trained models for ADU and semantic type classification on the AAE corpus, which has been annotated accordingly in previous work [26, 31]. We used these models to label essays in two target corpora: Feedback and ICLE. We extracted change flows of ADUs and semantic types and used them for essay quality prediction. Importantly, we showed that some dimensions of essay quality, i.e. conventions and organization, can be predicted better by using flows of semantic types rather than by coarse ADU types. This result expands on the earlier work of [33]. Finally, we identify change flow features that are important predictors for good vs bad essays.
We find that 1) the distribution of semantic types depends on the corpus at hand and 2) bad essays tend to lack claims, i.e. contain incomplete arguments. Further, we observe that 3) the mere complexity of change flows is not a sufficient predictor for quality and 4) certain change flows of semantic types indicate the use of argumentation strategies.
In the future, we are interested in investigating more thoroughly the relationship between argumentation strategies and essay quality. Here, we considered this topic only briefly in Sect. 5. Also, we plan to extend our analysis to other out-of-domain corpora (e.g., news editorials and the subreddit Change My View).
Limitations
Due to the very small number of annotated essays (402 instances), we can only estimate to a limited extent how well the projection of annotations by our neural models onto other corpora works. The questions of how well these models perform on out-of-domain data and how well the semantic type scheme applies to other domains deserve greater attention in future work.
For our study, we decided to follow previous research that simplifies the argument component classification to the sentence level. Although this is considered legitimate for the AAE corpus due to the consistently strict essay structure, in general, this is a simplification that leads to inexactness in the extracted components.
Our work is the first attempt to use abstract semantic patterns to measure the quality of student writing. However, due to the relatively small gains in performance, we assume that the selected quality dimensions may not ideally capture the meaning of our semantic types.
Notes
1. Note that in this work fact does not refer to actual factual statements. Rather, it includes claims that the author presents as factual. Determining the actual truth or falsity of a statement, i.e. fact-checking, is beyond the scope of this paper.
References
Addawood, A., Bashir, M.: “What is your evidence?” A study of controversial topics on social media. In: Proceedings of the Third Workshop on Argument Mining, Berlin, Germany, pp. 1–11. ACL (2016)
Al-Khatib, K., Wachsmuth, H., Kiesel, J., Hagen, M., Stein, B.: A news editorial corpus for mining argumentation strategies. In: Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3433–3443. The COLING 2016 Organizing Committee (2016)
Beigman Klebanov, B., Stab, C., Burstein, J., Song, Y., Gyawali, B., Gurevych, I.: Argumentation: content, structure, and relationship with essay quality. In: Proceedings of the Third Workshop on Argument Mining (ArgMining2016), Berlin, Germany, pp. 70–75. ACL (2016)
Carlile, W., Gurrapadi, N., Ke, Z., Ng, V.: Give me more feedback: annotating argument persuasiveness and related attributes in student essays. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 621–631. ACL (2018)
Chen, W.F., Chen, M.H., Mudgal, G., Wachsmuth, H.: Analyzing culture-specific argument structures in learner essays. In: Proceedings of the 9th Workshop on Argument Mining, pp. 51–61. International Conference on Computational Linguistics, Online and in Gyeongju, Republic of Korea (2022)
Chen, Z., Verdi do Amarante, D., Donaldson, J., Jo, Y., Park, J.: Argument mining for review helpfulness prediction. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, pp. 8914–8922. ACL (2022)
Crossley, S.A., Baffour, P., Tian, Y., Picou, A., Benner, M., Boser, U.: The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (persuade) corpus 1.0. Assessing Writing 54, 100667 (2022)
Daxenberger, J., Eger, S., Habernal, I., Stab, C., Gurevych, I.: What is the essence of a claim? Cross-domain claim identification. In: Palmer, M., Hwa, R., Riedel, S. (eds.) Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2055–2066. ACL (2017)
Dusmanu, M., Cabrio, E., Villata, S.: Argument mining on Twitter: arguments, facts and sources. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2317–2322. ACL (2017)
Ghosh, D., Khanam, A., Han, Y., Muresan, S.: Coarse-grained argumentation features for scoring persuasive essays. In: Erk, K., Smith, N.A. (eds.) Proceedings of the 54th Annual Meeting of the ACL (Volume 2: Short Papers), Berlin, Germany, pp. 549–554 (2016). https://doi.org/10.18653/v1/P16-2089
Granger, S., Dagneaux, E., Meunier, F., Paquot, M., et al.: International corpus of learner English, vol. 2. Presses universitaires de Louvain Louvain-la-Neuve (2009)
Granger, S., Dupont, M., Meunier, F., Naets, H., Paquot, M.: International Corpus of Learner English. Version 3. Presses universitaires de Louvain (2020)
Hidey, C., Musi, E., Hwang, A., Muresan, S., McKeown, K.: Analyzing the semantic types of claims and premises in an online persuasive forum. In: Proceedings of the 4th Workshop on Argument Mining, Copenhagen, Denmark, pp. 11–21. ACL (2017)
Higgins, C., Walker, R.: Ethos, logos, pathos: strategies of persuasion in social/environmental reports. Account. Forum 36(3), 194–208 (2012). Analyzing the Quality, Meaning and Accountability of Organizational Communication
Hua, X., Wang, L.: Understanding and detecting supporting arguments of diverse types. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 203–208. ACL (2017)
Lawrence, J., Reed, C.: Argument mining: a survey. Comput. Linguist. 45(4), 765–818 (2020)
Liu, Y., et al.: Roberta: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)
Ong, N., Litman, D., Brusilovsky, A.: Ontology-based argument mining and automatic essay scoring. In: Proceedings of the First Workshop on Argumentation Mining, Baltimore, Maryland, pp. 24–28. ACL (2014)
Persing, I., Davis, A., Ng, V.: Modeling organization in student essays. In: Li, H., Màrquez, L. (eds.) Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, pp. 229–239. ACL (2010)
Persing, I., Ng, V.: Modeling argument strength in student essays. In: Zong, C., Strube, M. (eds.) Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 543–552. ACL (2015). https://aclanthology.org/P15-1053
Persing, I., Ng, V.: End-to-end argumentation mining in student essays. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1384–1394. ACL (2016)
Persing, I., Ng, V.: Unsupervised argumentation mining in student essays. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, pp. 6795–6803. ELRA (2020)
Rinott, R., Dankin, L., Alzate Perez, C., Khapra, M.M., Aharoni, E., Slonim, N.: Show me your evidence - an automatic method for context dependent evidence detection. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 440–450. ACL (2015)
Sazid, M.T., Mercer, R.E.: A unified representation and a decoupled deep learning architecture for argumentation mining of students’ persuasive essays. In: Proceedings of the 9th Workshop on Argument Mining, pp. 74–83. International Conference on Computational Linguistics, Online and in Gyeongju, Republic of Korea (2022)
Schaefer, R., Knaebel, R., Stede, M.: On selecting training corpora for cross-domain claim detection. In: Lapesa, G., Schneider, J., Jo, Y., Saha, S. (eds.) Proceedings of the 9th Workshop on Argument Mining, pp. 181–186. International Conference on Computational Linguistics, Online and in Gyeongju, Republic of Korea (2022)
Schaefer, R., Knaebel, R., Stede, M.: Towards fine-grained argumentation strategy analysis in persuasive essays. In: Alshomary, M., Chen, C.C., Muresan, S., Park, J., Romberg, J. (eds.) Proceedings of the 10th Workshop on Argument Mining, Singapore, pp. 76–88. ACL (2023)
Schaefer, R., Stede, M.: GerCCT: an annotated corpus for mining arguments in German tweets on climate change. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, pp. 6121–6130. ELRA (2022)
Silva, T.: Toward an understanding of the distinct nature of L2 writing: the ESL research and its implications. TESOL Q. 27(4), 657–677 (1993)
Song, Y., Heilman, M., Beigman Klebanov, B., Deane, P.: Applying argumentation schemes for essay scoring. In: Proceedings of the First Workshop on Argumentation Mining, Baltimore, Maryland, pp. 69–78 (2014). https://doi.org/10.3115/v1/W14-2110
Stab, C., Gurevych, I.: Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, pp. 46–56. ACL (2014)
Stab, C., Gurevych, I.: Parsing argumentation structures in persuasive essays. Comput. Linguist. 43(3), 619–659 (2017)
Stede, M., Schneider, J.: Argumentation Mining. Synthesis Lectures in Human Language Technology, vol. 40. Morgan & Claypool (2018)
Wachsmuth, H., Al-Khatib, K., Stein, B.: Using argument mining to assess the argumentation quality of essays. In: Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1680–1691. The COLING 2016 Organizing Committee, Osaka, Japan (2016)
Wang, H., Huang, Z., Dou, Y., Hong, Y.: Argumentation mining on essays at multi scales. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 5480–5493. International Committee on Computational Linguistics, Barcelona, Spain (Online) (2020)
Acknowledgement
This research has been supported by the German Research Foundation (DFG) with grant number 455911521, project “LARGA” in SPP “RATIO”.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
Cite this paper
Knaebel, R., Schaefer, R., Stede, M. (2024). The Impact of Argument Arrangement on Essay Scoring. In: Cimiano, P., Frank, A., Kohlhase, M., Stein, B. (eds.) Robust Argumentation Machines. RATIO 2024. Lecture Notes in Computer Science, vol. 14638. Springer, Cham. https://doi.org/10.1007/978-3-031-63536-6_9
Print ISBN: 978-3-031-63535-9. Online ISBN: 978-3-031-63536-6.