Is AI, still a novelty itself, already entering an infinite regress (AI to explain AI)? Just wondering! 😊
"Explaining the behavior of trained neural networks remains a compelling puzzle, especially as these models grow in size and sophistication. Like other scientific challenges throughout history, reverse-engineering how artificial intelligence systems work requires a substantial amount of experimentation: making hypotheses, intervening on behavior, and even dissecting large networks to examine individual neurons. To date, most successful experiments have involved large amounts of human oversight. Explaining every computation inside models the size of GPT-4 and larger will almost certainly require more automation — perhaps even using AI models themselves. ...
Their method uses agents built from pretrained language models to produce intuitive explanations of computations inside trained networks. ...
As we enter a regime where the models doing the explaining are black boxes themselves, external evaluations of interpretability methods are becoming increasingly vital. ..."
From the abstract:
"Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate methods that use pretrained language models (LMs) to produce code-based and natural language descriptions of function behavior. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built with an off-the-shelf LM augmented with black-box access to functions, can sometimes infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, FIND also reveals that LM-based descriptions capture global function behavior while missing local details. 
These results suggest that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models."
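To make the "acting as a scientist" loop from the abstract a bit more concrete, here is a minimal toy sketch in Python. It is not the FIND benchmark or the authors' agent: there is no language model involved, and `hidden_function`, `HYPOTHESES`, and `interpret` are hypothetical names invented for illustration. A fixed list of candidate descriptions stands in for the hypotheses an LM would propose, and the loop simply probes a black-box function, scores each candidate against the observations, and keeps the best one.

```python
from typing import Callable, Dict, List, Tuple

def hidden_function(x: float) -> float:
    """Hypothetical black-box target: linear overall, corrupted on [2, 3]."""
    if 2.0 <= x <= 3.0:
        return 0.0  # local "corruption" that coarse probing can miss
    return 3.0 * x + 1.0

# Candidate descriptions (assumption: a stand-in for LM-proposed hypotheses).
HYPOTHESES: Dict[str, Callable[[float], float]] = {
    "f(x) = 3x + 1": lambda x: 3.0 * x + 1.0,
    "f(x) = x^2": lambda x: x * x,
    "f(x) = 0": lambda x: 0.0,
}

def run_experiment(f: Callable[[float], float],
                   inputs: List[float]) -> List[Tuple[float, float]]:
    """Query the black box and record (input, output) pairs."""
    return [(x, f(x)) for x in inputs]

def score(hypothesis: Callable[[float], float],
          observations: List[Tuple[float, float]]) -> float:
    """Mean absolute error of a hypothesis on the observations so far."""
    return sum(abs(hypothesis(x) - y) for x, y in observations) / len(observations)

def interpret(f: Callable[[float], float],
              rounds: List[List[float]]) -> Tuple[str, List[Tuple[float, float]]]:
    """Run each round of experiments, then re-pick the best-fitting description."""
    observations: List[Tuple[float, float]] = []
    best = ""
    for inputs in rounds:
        observations.extend(run_experiment(f, inputs))
        best = min(HYPOTHESES, key=lambda name: score(HYPOTHESES[name], observations))
    return best, observations

# Two rounds of widely spaced probes; none of them lands in [2, 3].
coarse_rounds = [[-10.0, 0.0, 10.0], [-5.0, 5.0]]
description, observations = interpret(hidden_function, coarse_rounds)
print("Chosen description:", description)
print("Prediction at 2.5:", HYPOTHESES[description](2.5),
      "| actual:", hidden_function(2.5))
```

With these widely spaced probes the chosen description matches the function's overall linear trend but disagrees with it inside the corrupted interval, a small-scale analogue of the abstract's remark that descriptions can capture global behavior while missing local details.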
FIND dataset and the Automated Interpretability Agent