In 2019, Rich Sutton, one of the fathers of reinforcement learning, wrote an essay titled “The Bitter Lesson”, in which he argued that the most effective way to advance artificial intelligence (AI) is to leverage computation rather than human knowledge. He pointed out several examples where AI researchers tried to build complex domain-specific knowledge into their systems, only to be surpassed by simpler methods that relied on massive amounts of computation. For instance, in computer chess and computer Go, brute-force search and self-play learning outperformed human-designed heuristics and rules. Sutton’s main message was that AI systems should not try to mimic human thinking, which is complex and hard to formalize, but rather exploit the power of computation to find general solutions that work across domains.
In addition to leveraging the increasing availability of computational resources, the most successful deep learning models demonstrate the importance of vast amounts of high-quality data. We are witnessing the rise of large language models (LLMs), such as GPT-4, which are trained on huge amounts of text data and can generate natural language for various tasks, such as summarization, translation, question answering, and more. These models are impressive examples of Sutton’s bitter lesson: they do not rely on any linguistic knowledge or task-specific rules, but simply learn from data using deep neural networks and massive computation.
While data and computation are crucial factors, the addition of human feedback may be necessary to optimize learning machines. Continuing with the same example, LLMs are far from perfect. They can generate outputs that are inaccurate, untruthful, harmful, biased, or simply not helpful to the user. In other words, these models are often not aligned with human values and intentions. How can we make them more aligned? The most promising direction is to use reinforcement learning with human feedback (RLHF), a technique for fine-tuning LLMs based on feedback gathered from human reviewers or users. Instead of training LLMs merely to predict the next word, they are trained with a human-in-the-loop to better understand instructions and generate helpful responses. RLHF has been shown to improve the performance and alignment of LLMs on a wide range of tasks. For example, OpenAI’s InstructGPT is a model that was fine-tuned with RLHF on a set of labeler-written prompts and prompts submitted through the OpenAI API. In human evaluations, outputs from InstructGPT were preferred to outputs from GPT-3, despite InstructGPT having 100 times fewer parameters. Moreover, InstructGPT showed improvements in truthfulness and reductions in toxic output generation. Likewise, the current ChatGPT is a version of GPT-3.5 or GPT-4 enhanced with RLHF, which improves the user experience and produces results that are more helpful and better aligned with the user’s intent.
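At the core of RLHF is a reward model trained on human preference comparisons, which is then used to fine-tune the language model with reinforcement learning. The snippet below is a minimal, illustrative sketch of only the reward-modelling step, assuming pairwise comparisons (a preferred and a rejected response) already encoded as fixed-size feature vectors; the `RewardModel` class, the dimensions, and the random data are hypothetical stand-ins for a real LLM backbone and human-labelled comparisons.

```python
# Illustrative sketch of RLHF reward modelling from pairwise human preferences.
# Feature vectors stand in for encoded (prompt, response) pairs.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (prompt, response) feature vector to a scalar reward."""
    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style loss: the human-preferred response should score higher.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training loop on random data, standing in for human-labelled comparisons.
torch.manual_seed(0)
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    chosen = torch.randn(32, 64)    # features of human-preferred responses
    rejected = torch.randn(32, 64)  # features of less-preferred responses
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full RLHF pipeline, a reward model like this would then drive a policy-optimization stage (for example PPO, as in InstructGPT) that fine-tunes the language model to maximize the learned reward while staying close to its original behavior.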
What could RLHF do for the air traffic management (ATM) industry? ATM is a complex and safety-critical domain that requires high levels of coordination, communication, and decision making among various actors, such as controllers, pilots, dispatchers, airlines, and regulators. ATM also faces many challenges and constraints, such as increasing traffic demand, environmental impact, human factors, and security threats. AI has the potential to help address some of these challenges and improve the efficiency and safety of ATM operations. However, AI also poses new risks and uncertainties that need to be carefully managed and mitigated.
As mentioned before, success in AI comes from leveraging computation and data. Taking the case of conflict resolution in ATM as an example, a solution is likely to rely on vast amounts of data and computation to generate suggestions that avoid collisions or separation violations. However, it is crucial to ensure that these suggestions align with the goals and preferences of the controllers and pilots, as well as comply with the rules and regulations of the airspace. To achieve this alignment, RLHF could be used to incorporate human feedback into the learning process and fine-tune the algorithm’s parameters. For example, the RLHF system could present different potential solutions to human experts, such as controllers, and ask them to evaluate the effectiveness and acceptability of each one. The system could then learn from the feedback and adjust its parameters to generate more helpful and acceptable suggestions in the future. This feedback loop could also help the system adapt to changing situations and preferences, such as different weather conditions, traffic patterns, or airspace configurations. Moreover, the use of RLHF could also help foster trust and transparency between humans and AI systems. By involving human experts in the learning process and making the system’s behavior and decisions more understandable and interpretable, the system could build credibility and confidence among its users. This, in turn, could facilitate the transfer of knowledge and skills from humans to AI systems, and enable a smoother transition towards greater levels of automation in ATM operations.
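To make the feedback loop described above concrete, the toy sketch below shows one hypothetical way it could work: a stand-in solver proposes candidate resolutions described by a few hand-picked features, a simulated controller picks the option they prefer, and a simple linear preference model is updated from those pairwise comparisons so that future rankings move toward what the controller would accept. The features, the candidate generator, and the simulated controller are all assumptions for illustration, not an operational design.

```python
# Hypothetical sketch of a human-in-the-loop preference cycle for conflict resolution.
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 4  # e.g. heading change, altitude change, added delay, closest approach

def candidate_resolutions(n: int = 5) -> np.ndarray:
    """Stand-in for a solver that proposes n separation-compliant manoeuvres."""
    return rng.normal(size=(n, N_FEATURES))

def simulated_controller_choice(cands: np.ndarray) -> int:
    """Placeholder for a real controller: here, prefers small, low-delay manoeuvres."""
    hidden_preference = np.array([-1.0, -1.0, -0.5, 1.0])
    return int(np.argmax(cands @ hidden_preference))

weights = np.zeros(N_FEATURES)  # learned linear preference model
lr = 0.1
agreement = 0

for episode in range(500):
    cands = candidate_resolutions()
    system_top = int(np.argmax(cands @ weights))   # system's current top suggestion
    chosen = simulated_controller_choice(cands)    # human feedback on the options
    agreement += int(system_top == chosen)
    # Pairwise logistic (Bradley-Terry) update: push the chosen option above the rest.
    for other in range(len(cands)):
        if other == chosen:
            continue
        diff = cands[chosen] - cands[other]
        p = 1.0 / (1.0 + np.exp(-(weights @ diff)))
        weights += lr * (1.0 - p) * diff

print("learned preference weights:", np.round(weights, 2))
print("agreement with controller during training:", agreement / 500)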
In summary, Rich Sutton’s bitter lesson anticipated the significance of harnessing computation and data in AI advancement. Furthermore, reinforcement learning with human feedback (RLHF) offers a promising approach for refining algorithms in specific applications, such as air traffic control conflict resolution. Human input can help ensure that algorithm-generated suggestions are in line with the objectives and preferences of controllers and airlines while adhering to airspace regulations. Ultimately, the integration of AI solutions will lead to a safer, more efficient, and scalable air traffic management industry, better prepared to meet the evolving needs of the future.