Practical Changes Could Reduce AI Resource Use Up to 90%

Small changes in how AI operates can vastly reduce its resource consumption — making it more ‘like using a hammer to drive a nail, rather than a sledgehammer.’

The use of generative AI has expanded rapidly in recent years — with Large Language Models (LLMs) by companies including OpenAI, Meta and Google becoming household names. OpenAI’s ChatGPT service alone receives around one billion queries each day.

As each generation of LLMs has become more sophisticated than the last, the technology’s skyrocketing popularity has created vast and growing demand for resources such as electricity and water, which are needed to run AI data centers. In a recent Capgemini Research Institute report, almost half of executives admitted their use of generative AI has jeopardized their sustainability objectives by increasing their company’s greenhouse gas (GHG) emissions.

According to research from University College London (UCL) published in a new UNESCO report, existing solutions could significantly reduce AI’s energy and resource demand if adopted more widely.

For Smarter, Smaller, Stronger: Resource-Efficient AI and the Future of Digital Transformation, researchers from UCL Computer Science conducted a series of experiments on Meta’s LLaMA 3.1 8B model to assess how changes to the way AI models are configured and used affect both the energy they need and their performance. This model was chosen as it is open source and fully modifiable, enabling the researchers to compare the unoptimized version against a range of optimization techniques (which is not possible with closed models such as GPT-4).

They found that by rounding down numbers used in the models’ internal calculations, shortening user instructions and AI responses, and using smaller AI models specialized to perform certain tasks, a combined energy reduction of 90 percent could be achieved compared to using a large all-purpose AI model.

“Our research shows that there are relatively simple steps we can take to drastically reduce the energy and resource demands of generative AI, without sacrificing accuracy and without inventing entirely new solutions,” said Professor Ivana Drobnjak, an author of the report and a member of the UNESCO Chair in AI at UCL. “Though some AI platforms are already exploring and implementing solutions such as the ones we propose, there are many others besides the three that we looked at. Wholesale adoption of energy-saving measures as standard would have the greatest impact.”

Rounding down to save energy

In the first experiment, the researchers assessed LLaMA 3.1 8B’s accuracy when performing common tasks (summarizing texts, translating languages and answering general knowledge questions), alongside its energy usage, under different conditions.

In a process called tokenization, LLMs convert the words from the user’s prompt into numbers (tokens) — which are used to perform the calculations involved in the task — before converting the numbers back into words to provide a response.
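The round trip described above can be illustrated with a toy example in Python. This is only a sketch: it assigns one number per whole word, whereas real LLM tokenizers split text into subword units using a learned vocabulary. The helper names `build_vocab`, `encode` and `decode` are hypothetical and chosen for this illustration.

```python
def build_vocab(text):
    """Assign an integer ID to each distinct word, in order of first appearance."""
    vocab = {}
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert words into their integer IDs (the 'tokens')."""
    return [vocab[word] for word in text.split()]


def decode(token_ids, vocab):
    """Convert integer IDs back into words."""
    inverse = {token_id: word for word, token_id in vocab.items()}
    return " ".join(inverse[token_id] for token_id in token_ids)


prompt = "summarize the report and translate the summary"
vocab = build_vocab(prompt)
token_ids = encode(prompt, vocab)

print(token_ids)                 # [0, 1, 2, 3, 4, 1, 5]
print(decode(token_ids, vocab))  # summarize the report and translate the summary
```

The repeated word "the" maps to the same number both times, which is the property that lets a model work with numbers instead of text.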

Applying a method called quantization (using fewer decimal places to round down the numbers used in calculations) reduced the model’s energy usage by up to 44 percent while maintaining at least 97 percent accuracy compared to the baseline[1]. This is because the calculations become simpler, in much the same way as most people could calculate two plus two much more quickly than 2.34 plus 2.17.
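To give a feel for the rounding involved, here is a minimal Python sketch of one simple scheme, symmetric 8-bit quantization with a single shared scale factor. It is not one of the methods evaluated in the report, which are more sophisticated, and the weight values are made up for illustration.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


# Made-up example weights, not taken from any real model.
weights = [0.8123, -0.4417, 0.0951, -1.2704, 0.3366]

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)
print(f"largest rounding error: {max_error:.4f}")
```

Each weight is now stored as a small integer, and the rounding error is bounded by half the scale factor. That trade of a little precision for cheaper arithmetic is the general idea behind the energy savings reported above.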

The team also compared LLaMA 3.1 8B to smaller AI models built to specialize in each of the three tasks. Small models used 15 times less energy for summarization, 35 times less for translation and 50 times less for question answering.
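For comparison with the percentage figures used elsewhere in the article, these factors can be converted into percentage reductions. The sketch below assumes that "N times less" means the small model uses 1/N of the energy of the larger one; the function name `reduction_percent` is hypothetical.

```python
def reduction_percent(factor):
    """Percentage saved when energy use falls to 1/factor of the original."""
    return (1 - 1 / factor) * 100


# Factors reported in the article for the small, specialized models.
factors = {"summarization": 15, "translation": 35, "question answering": 50}

for task, factor in factors.items():
    print(f"{task}: {reduction_percent(factor):.1f}% less energy")
# summarization: 93.3% less energy
# translation: 97.1% less energy
# question answering: 98.0% less energy
```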

Accuracy was comparable to the larger model, with small models performing 4 percent more accurately for summarization, 2 percent for translation and 3 percent for question answering.

Shortening questions and responses

In the second experiment, the researchers assessed the impact on energy usage of changing the length of the user’s prompt (instructions) and the model’s response (answer).

They calculated energy consumption for 1,000 scenarios, varying the length of the user prompt and the model’s response from approximately 400 English words down to 100 English words[2].

The longest combination (400-word prompt and 400-word response) used 1.03 kilowatt hours (kWh) of electricity, enough to power a 100-watt lightbulb for 10 hours or a fridge-freezer for 26 hours.
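The lightbulb comparison follows from dividing energy by power, as this short Python check shows. The fridge-freezer's average power draw is not stated in the article, so the value computed here is only what the 26-hour figure implies.

```python
ENERGY_KWH = 1.03      # longest prompt/response combination, from the article
BULB_WATTS = 100       # lightbulb rating, from the article
FRIDGE_HOURS = 26      # fridge-freezer running time, from the article

# hours = energy (Wh) / power (W)
bulb_hours = ENERGY_KWH * 1000 / BULB_WATTS

# Implied average draw of the fridge-freezer (an inference, not a stated figure).
implied_fridge_watts = ENERGY_KWH * 1000 / FRIDGE_HOURS

print(f"lightbulb: {bulb_hours:.1f} hours")                       # lightbulb: 10.3 hours
print(f"implied fridge draw: {implied_fridge_watts:.1f} watts")   # implied fridge draw: 39.6 watts
```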

Halving the user prompt length to 200 words reduced the energy expenditure by 5 percent, while halving the model response length to 200 words reduced energy consumption by 54 percent.
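In absolute terms, and assuming both percentages are measured against the 1.03 kWh used by the longest combination, the savings work out roughly as follows. The function name `after_reduction` is hypothetical.

```python
BASELINE_KWH = 1.03  # 400-word prompt and 400-word response, from the article


def after_reduction(baseline_kwh, reduction_percent):
    """Energy remaining after a percentage reduction."""
    return baseline_kwh * (1 - reduction_percent / 100)


half_prompt = after_reduction(BASELINE_KWH, 5)     # prompt halved to 200 words
half_response = after_reduction(BASELINE_KWH, 54)  # response halved to 200 words

print(f"halved prompt:   {half_prompt:.2f} kWh")    # halved prompt:   0.98 kWh
print(f"halved response: {half_response:.2f} kWh")  # halved response: 0.47 kWh
```

The gap between the two results shows that, in this experiment, the length of the response mattered far more than the length of the prompt.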

Assessing real-world impact

To assess the global impact of the optimizations tested, the authors asked LLaMA 3.1 8B to provide an answer to a specific question[3]. They then calculated the energy required for it to do so and multiplied this by the estimated daily number of requests for this sort of task by users of ChatGPT[4].

They estimated that using quantization, combined with cutting down user prompt and AI response length from 300 to 150 words, could reduce energy consumption by 75 percent.

In a single day, this saving would be equivalent to the amount of electricity needed to power 30,000 average UK households (assuming 7.4 kilowatt hours per house per day). Importantly, this saving would be achieved without the model losing the ability to address more complex general tasks.

For repetitive tasks such as translation and summarization, the biggest savings were achieved by using small, specialized models and a reduced prompt/response length, which reduced energy usage by over 90 percent (enough to power 34,000 UK households for a day).
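The household comparisons can be converted back into energy using the article's figure of 7.4 kilowatt hours per household per day. This Python sketch does only that conversion; the function name `households_to_mwh` is hypothetical.

```python
KWH_PER_HOUSEHOLD_PER_DAY = 7.4  # average UK household, as assumed in the article


def households_to_mwh(households):
    """Daily energy, in megawatt hours, used by the given number of households."""
    return households * KWH_PER_HOUSEHOLD_PER_DAY / 1000


general_saving = households_to_mwh(30_000)      # quantization + shorter text
specialized_saving = households_to_mwh(34_000)  # small models + shorter text

print(f"{general_saving:.1f} MWh per day")      # 222.0 MWh per day
print(f"{specialized_saving:.1f} MWh per day")  # 251.6 MWh per day
```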

As Hristijan Bosilkovski, an author of the report and a UCL MSc graduate in Data Science and Machine Learning, explained: “There will be times when it makes sense to use a large, all-purpose AI model — such as for complex tasks or research and development. But the biggest gains in energy efficiency can be achieved by switching from large models to smaller, specialized models in certain tasks such as translation or knowledge retrieval. It’s a bit like using a hammer to drive a nail, rather than a sledgehammer.”

A smarter future for AI

The authors of the report say that as competition in generative AI models increases, it will become more important for companies to streamline models and to use smaller models better suited to certain tasks.

“Generative AI’s annual energy footprint is already equivalent to that of a low-income country, and it is growing exponentially,” said Tawfik Jelassi, Assistant Director-General for Communication and Information at UNESCO. “To make AI more sustainable, we need a paradigm shift in how we use it, and we must educate consumers about what they can do to reduce their environmental impact.”

Professor Drobnjak added: “When we talk about the future of resource-efficient AI, I often use two metaphors. One is a collection of brains — lots of separate specialist models that pass messages back and forth — which can save energy but feel fragmented. The other metaphor, and the future that I’m most excited about, looks more like a single brain with distinct regions — which is tightly connected, sharing one memory, yet able to switch on only the circuits it needs. It’s like bringing the efficiency of a finely tuned cortex to generative AI: smarter, leaner and far less resource hungry.”


1 Three quantization methods were tested, reducing energy consumption by 22 percent (BNBQ), 35 percent (GPTQ), and 44 percent (AWQ). Please see the report for technical details.

2 In English, 100 words is approximately 128 tokens, but the number varies by language.
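Using that ratio, the word counts mentioned in the article correspond to roughly the following token counts. This is a sketch for English text only, and the function name `words_to_tokens` is hypothetical.

```python
TOKENS_PER_100_WORDS = 128  # approximate ratio for English, from the note above


def words_to_tokens(words):
    """Approximate token count for a given number of English words."""
    return round(words * TOKENS_PER_100_WORDS / 100)


for words in (100, 150, 200, 300, 400):
    print(f"{words} words is about {words_to_tokens(words)} tokens")
# 100 words is about 128 tokens
# 150 words is about 192 tokens
# 200 words is about 256 tokens
# 300 words is about 384 tokens
# 400 words is about 512 tokens
```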

3 The question was: “Explain the concept of reinforcement learning, emphasizing its core principles, components (like agents, environments, and rewards), and typical applications. Keep the explanation accessible to someone with basic knowledge of artificial intelligence.”

4 The team used global usage statistics for ChatGPT, which had around one billion daily requests. Assuming that 35 percent of these were concept explanations, the total number of requests of this type was estimated to be 350 million.
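The estimate in this note is a single multiplication, shown here in Python with integer arithmetic to avoid rounding. The constant names are hypothetical.

```python
DAILY_REQUESTS = 1_000_000_000          # approximate ChatGPT daily requests, from the note
CONCEPT_EXPLANATION_SHARE_PERCENT = 35  # share assumed by the report's authors

concept_requests = DAILY_REQUESTS * CONCEPT_EXPLANATION_SHARE_PERCENT // 100

print(f"{concept_requests:,} requests per day")  # 350,000,000 requests per day
```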