Artificial intelligence and machine learning in the public sector - for beginners and advanced data users
You have probably come across the term artificial intelligence more than once since the release of ChatGPT. In our series, we offer beginners and advanced data users an insight into AI, its sub-area of machine learning, and how these systems can support data management.
Polyteia is well aware of the complexity that the world of data holds and how challenging it can be to get started and to continuously develop your own data skills. With our new series, we offer insights on two levels: for beginners and for experienced data analysts. For beginners, the aim is to offer a practical entry point into the world of data. Advanced data analysts can deepen their existing knowledge. The overall goal is to strengthen and expand data skills in the public sector.
Artificial intelligence (AI) and machine learning can significantly simplify working with data. These technologies can recognize patterns in data and make predictions or decisions. The first section explains artificial intelligence for beginners and how it can support data processing. If you already have some knowledge of AI, you can skip directly to the second section, where you will gain a more detailed insight into a sub-area of artificial intelligence - machine learning.
For beginners: Data management with the help of artificial intelligence
You have probably come across the term artificial intelligence (AI) more than once recently, especially since the launch of ChatGPT. AI makes it possible for machines and systems to independently perform tasks that are normally undertaken by a human. These tasks range from understanding language and recognizing patterns to making decisions and solving problems. AI can provide support in the work environment, particularly for time-consuming and repetitive tasks. Artificial intelligence can also be helpful in the area of data management. For example, AI can automatically search and analyze data records to establish semantic relationships between the data or identify trends and patterns. AI can also increase data quality by correcting errors and inconsistencies or making them visible, for instance by adding missing values or removing duplicates in large data sets, thereby speeding up data analysis.
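The cleaning steps just mentioned (removing duplicates, filling in missing values) can be sketched in a few lines of standard Python with pandas; AI-assisted tools automate the same kinds of operations at scale. This is a minimal illustration, and the column names and values are invented for the example.

```python
import pandas as pd

# A small, invented example dataset with one duplicate row
# and one missing population value
df = pd.DataFrame({
    "district": ["Mitte", "Nord", "Nord", "Sued"],
    "population": [12400, 8300, 8300, None],
})

# Remove exact duplicate records
df = df.drop_duplicates()

# Fill the missing value, here simply with the column mean
# (an AI-assisted tool might instead predict it from related records)
df["population"] = df["population"].fillna(df["population"].mean())

print(df)
```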
The use of AI should nevertheless be viewed critically, since artificial intelligence is only as good as the quality of its data. AI systems such as ChatGPT obtain their information from texts on the internet, such as articles, social media, books or Wikipedia. As such content is often not subject to any checks, artificial intelligence can be fed incorrect information and thus spread misinformation. In the phenomenon known as “hallucination”, an AI system generates plausible but false or misleading information. This can be particularly problematic in areas such as news and education. Training on incorrect or incomplete data can also lead to a distorted or one-sided view of the world; as a result, artificial intelligence can exhibit racist or misogynistic tendencies, among other things. Human oversight is therefore essential.
To be able to use AI for specific, individual purposes, the algorithm behind the artificial intelligence must first be fed with data and information. This data must be supplied to the system in advance; the AI then requires little or no human intervention. AI is already being used in a wide variety of areas in public administration, including as chatbots and virtual assistants that answer citizens' queries and provide information on public authority websites. AI is also used in urban planning and development to predict population growth, traffic density and infrastructure requirements.
Although AI requires little to no human interaction in operation, AI-based decisions still call for a human in a supervisory role who can intervene if necessary. Especially for safety-critical decisions, human control and supervision is required during the process. The European Union recently decided that AI systems must be monitored by humans rather than by other technologies. Artificial intelligence also often works with sensitive data that must be protected from unauthorized access, manipulation and theft through security measures such as encryption and access control. Other AI applications are prohibited outright because they violate EU values: for example, no evaluation of citizens' social behavior (“social scoring”) and no emotion recognition in the workplace is permitted. The European Union wants to make the use of AI transparent, traceable, non-discriminatory and environmentally friendly.
For data analysts: Machine Learning - a sub-area of artificial intelligence
Artificial intelligence is an interdisciplinary field of research that aims to develop machines or software with human-like intelligence. This includes a wide range of techniques and applications that aim to automate tasks that traditionally require human expertise. A distinction must be made here between weak AI and strong AI. Weak AI specializes in performing specific tasks. This includes voice assistants such as Siri or Alexa, which use speech recognition and processing, or recommendation systems based on the analysis of user behavior. Strong AI, on the other hand, refers to systems that can perform a wide range of intellectual tasks at the level of a human being or even beyond it. This form of AI remains a research goal for now.
One subfield of AI is machine learning. It deals with the development and application of algorithms and techniques that enable computers to learn from data and recognize patterns in order to improve continuously. Instead of being explicitly programmed for a specific task, machine learning uses data to create models that can make predictions or decisions. As a result, machine learning applications improve with use and become more accurate the more data is available. Machine learning encompasses several learning models that apply different algorithmic techniques: supervised, unsupervised, semi-supervised and reinforcement learning. There are also neural networks and deep learning, which are inspired by the structure of the human brain and form a subcategory of machine learning. Deep learning uses deep neural networks with many layers to recognize complex patterns in large amounts of data, with applications in image and speech recognition, natural language processing and more.
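To give a rough sense of what “a neural network with many layers” means in practice, here is a minimal sketch using scikit-learn. The dataset (scikit-learn's built-in handwritten digits) and the layer sizes are arbitrary choices for demonstration; real deep-learning systems use far larger networks and training sets.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy image-recognition task: 8x8 pixel images of handwritten digits
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small neural network with two hidden layers; "deep" is a matter
# of degree, and these sizes are chosen only for illustration
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```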
In supervised learning, a model is trained on a labeled data set, i.e. one consisting of input-output pairs, and aims to map each input to the correct output. Because the system is told what the correct answer is, this approach requires large labeled data sets, which are expensive and time-consuming to create. Supervised learning is used for classification tasks such as spam detection in emails or regression tasks such as predicting real estate prices.

In unsupervised learning, the model is trained on unlabeled input data without an answer key. It attempts to gain insight by recognizing patterns and correlations, for example clustering for customer segmentation or dimensionality reduction with principal component analysis.

Semi-supervised learning, on the other hand, is a mixture of the two: the model is trained with a small amount of labeled data and a large amount of unlabeled data. The labeled data serves as a starting point for the system and can improve learning speed and accuracy.

The fourth learning model - reinforcement learning - learns through interaction with its environment. The machine does not receive an answer key, but only a set of possible actions, rules and end states. During training, rewards and penalties (in principle just numbers) are given based on the system's actions, and the system learns from them. The aim is to develop a strategy that maximizes the cumulative reward. Reinforcement learning is often used in areas such as robotics and game AI. The first two paradigms are sketched in code below.
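To make the first two paradigms concrete, here is a minimal sketch with scikit-learn: a supervised classifier trained on labeled examples, and unsupervised clustering that groups unlabeled data. The features, labels and parameters are invented purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# --- Supervised learning: labeled input/output pairs -----------------
# Invented features (e.g. number of links and exclamation marks in an
# email), with labels: 1 = spam, 0 = not spam
X_labeled = np.array([[8, 5], [7, 6], [1, 0], [0, 1], [9, 7], [2, 1]])
y_labeled = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X_labeled, y_labeled)
print("Spam?", clf.predict([[6, 4]]))  # prediction for a new email

# --- Unsupervised learning: unlabeled data, no answer key ------------
# Invented customer data (e.g. purchases per year, average basket value)
X_unlabeled = np.array([[2, 10], [3, 12], [25, 80], [30, 90], [1, 9], [28, 85]])

# KMeans groups similar customers into clusters (customer segmentation)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unlabeled)
print("Customer segments:", segments)
```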
Currently the best-known AI application is ChatGPT, which is based on a large language model (LLM). LLMs are a type of generative AI developed specifically for generating text-based content. ChatGPT was fed with information in several phases: the underlying language model was first trained on next-word prediction, a supervised objective in which the correct answers come from the training text itself. Afterwards, the generation of answers by ChatGPT was optimized with reinforcement learning. As a result, LLMs have been criticized as “stochastic parrots” that merely predict the most likely next word.
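The “predict the most likely next word” objective can be made tangible with a deliberately tiny toy model: a bigram counter that, for each word, remembers which word most often follows it. Real LLMs use neural networks trained on vast corpora, but the prediction task is the same in spirit; the training sentence below is invented.

```python
from collections import Counter, defaultdict

# Tiny invented "training corpus"
text = "the city collects data the city analyzes data the city publishes reports"
words = text.split()

# Count, for each word, how often each following word occurs (a bigram model)
followers = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    followers[current][nxt] += 1

def predict_next(word):
    """Return the most frequent next word seen in training."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))   # -> "city"
print(predict_next("city"))  # -> "collects" (ties resolved by first occurrence)
```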
“Not-so-fun fact”: LLMs such as ChatGPT can be biased due to their data sources and can therefore exhibit racist and misogynistic tendencies. The data sources for large language models are diverse and include a wide range of texts from the internet, such as websites, forums, literature and much more. If these texts contain prejudices, the result is data bias: the model learns and reproduces them. When certain groups are underrepresented in the data, this can additionally lead to representation bias.
Develop your data skills further
We have already provided insights into other data-related topics in our series.
If you would like to further develop your data skills, we would like to invite you to take a look at our free Data Academy learning platform. It supports administrative staff at all levels of knowledge in developing and expanding their own data skills. The Data Academy's interactive online courses cover a wide range of topics, including data visualization, data transformation, data platforms, data governance, artificial intelligence and much more. You can register quickly and easily for Polyteia's free Data Academy learning platform via this link.