Harnessing GenAI and LLMs for an automated evaluation tool to aid teachers

By Yong Shu Chiang

Learning aids that automate repetitive administrative and evaluation work can ease teacher workload and enable teachers to focus on higher-value tasks. GovInsider hears how from Bill Cai, an Applied Scientist at the AWS Generative AI Innovation Center.

Trying to automate the evaluation of open-ended language tasks, such as sentence construction to illustrate the meaning of a word, required a machine-learning approach. Image: Canva

Technologies such as generative artificial intelligence (GenAI) and large language models (LLMs) can help improve access to education, enhance teacher productivity and transform student experiences. 


According to studies, education technology can have a big impact on student outcomes, said Bill Cai, an Applied Scientist at the AWS Generative AI Innovation Center. 


A 2020 McKinsey report found that using technologies such as data projectors and Internet-connected computers in the classroom can provide a boost equivalent to one year of learning in improving PISA (Programme for International Student Assessment) scores. 


The Ministry of Education (MOE) in Singapore has also found, in a 2021 survey, that teachers on average worked about 53 hours per week during term time.  


“It’s not just about [reducing] the hours,” said Cai, who was speaking at Public Sector Day Singapore in October, but “how we can improve the balance towards teacher-student interaction, and [automate] repetitive administrative and evaluation work.” 


GovInsider hears from Cai about how GenAI enables the creation of learning aids that can ease teacher workload and enable them to focus on more high-value classroom activities. 

AI can save nearly half the time for grading and evaluation 

At the AWS Generative AI Innovation Center, Cai and his team explore how GenAI can be used responsibly to solve customers' problem statements and implement proofs-of-concept. Image: Public Sector Day Singapore

During his presentation, Cai shared how customers such as MOE could adopt a serverless architecture to harness the power of GenAI. 


“By adopting AI well, one study estimates that teachers can save nearly half of their current time – about 46 per cent – spent on grading and evaluation work on student performance, allowing our educators to spend more time on higher-value activities,” Cai said. 


The AWS Generative AI Innovation Center is a global programme that pairs organisations with applied scientists, business consultants, machine learning strategists, engineers and architects with deep experience employing GenAI to solve diverse business problems. 


“We work directly with our customers and implement many of the proofs-of-concept ourselves [after] exploring what is possible, how GenAI can be used responsibly, which models to use and how to select those models,” said Cai. 


In the case of education, there are some assessment tasks today that can be very easily automated, he added. One example is the marking of multiple-choice quizzes that test students’ vocabulary; automating this would only require a programme to check students’ submissions against an answer key. 
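A minimal sketch of that kind of answer-key check might look like the following; the question IDs and answer choices here are invented placeholders, not an actual MOE quiz.

```python
# Sketch of marking a multiple-choice quiz against an answer key.
# Question IDs and correct choices are hypothetical examples.
ANSWER_KEY = {"q1": "B", "q2": "D", "q3": "A"}

def mark_quiz(submission: dict) -> dict:
    """Compare a student's choices against the key and tally a score."""
    results = {qid: submission.get(qid) == correct
               for qid, correct in ANSWER_KEY.items()}
    return {"per_question": results, "score": sum(results.values())}

print(mark_quiz({"q1": "B", "q2": "C", "q3": "A"}))  # scores 2 out of 3
```

Because the check is a straight lookup, no machine learning is needed for this class of task.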


However, for other classroom exercises, such as open-ended language tasks, automatically evaluating a student's answer is not as straightforward, Cai added.

Open-ended tasks as a machine-learning problem 


An open-ended language task, such as one that calls for students to illustrate the meaning of a particular word, can have more than one correct answer and can be completed in multiple ways. 


For instance, if a class is tasked to construct a sentence to demonstrate the meaning of the word “glad”, how would you automate the evaluation of whether the sentence is a good one? 


Cai and his team began working on this issue for MOE by framing the evaluation of open-ended language tasks as a machine-learning problem.


A machine-learning problem typically starts with a data set, a set of machine-learning models and methods, as well as benchmarks to compare results against. 


“In this case, we have data sets of student answers, some are correct, some are errors, and we use a machine-learning model to evaluate the student responses and pick the model that has the best results that we are looking for. Once these results are satisfactory for teachers and students, you can then roll [the model] out for use.” 


However, this approach was fraught with challenges, according to Cai. 


Firstly, data sets of student answers were limited and were often not in the format required; to collect actual student data would take significant amounts of time and resources. 


Secondly, a traditional machine-learning approach required model training and fine-tuning, a process that is both data- and time-intensive.

GenAI and LLMs surmount resource issues 


Cai and his team were able to overcome challenges in developing automation for open-ended language tasks by turning to GenAI and the use of LLMs. 


Such models can augment data sets of student answers by generating simulated responses at a specific vocabulary level. 


Using pre-trained LLMs and a prompting technique known as few-shot in-context learning enables MOE to scale up testing to thousands of sample answers, and thus raise confidence in the models and methods.


What this means is that the LLMs can simulate student answers that are customisable for any target word – such as “glad” – and produce responses that are realistic for a given level of vocabulary.  
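A few-shot in-context prompt for this simulation step might be assembled roughly as below. The exemplar word-and-sentence pairs and the wording are invented for illustration; MOE's actual prompts are not public.

```python
# Sketch of building a few-shot in-context prompt that asks a pre-trained
# LLM to simulate a student answer for a target word.
# Exemplars and phrasing are hypothetical, not MOE's actual prompts.
FEW_SHOT_EXAMPLES = [
    ("brave", "The brave firefighter ran into the burning house."),
    ("tired", "After the long race, I was so tired I fell asleep."),
]

def build_prompt(target_word: str, level: str = "Primary 3") -> str:
    """Assemble the instruction, worked examples, and the open slot."""
    lines = [f"Write one sentence a {level} student might write "
             f"to show the meaning of a word."]
    for word, sentence in FEW_SHOT_EXAMPLES:
        lines.append(f"Word: {word}\nSentence: {sentence}")
    lines.append(f"Word: {target_word}\nSentence:")
    return "\n\n".join(lines)

prompt = build_prompt("glad")
# The prompt text would then be sent to a pre-trained LLM for completion.
```

The examples anchor the model to the desired format and vocabulary level, so no fine-tuning is needed to steer its output.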


This, in turn, enables the evaluation model to better assess responses as correct or incorrect, score them and also explain these assessments. 


“What we saw was that the LLM was able to adapt very well without fine-tuning to this complex, open-ended task of generating [the simulated responses],” said Cai, who noted that the machine-learning model was able to generate and evaluate 2,600 simulated responses in a day, a task that would have taken 27 teacher-days. 


“This has created a very large, validated data set to validate their [evaluation] methods,” he added. 


The result, as demonstrated by Cai and his team at Public Sector Day Singapore, is the prototype for a web-based assessment tool that can score and provide holistic feedback about vocabulary usage and grammar. 


In time to come, this could be a tool that is both useful and reliable for learners, and a time-saver and productivity booster for educators in Singapore. 


Also read: In tomorrow's classrooms, smarter learning is fuelled by change management and innovative solutions