The Road to General Artificial Intelligence: Which Problem Will Be Solved First, General Computer Vision or General Natural Language Processing?

Introduction to General Artificial Intelligence

The quest for general artificial intelligence, more commonly abbreviated as AGI (artificial general intelligence), has intrigued scientists and researchers for decades. AGI refers to artificial intelligence that can carry out any intellectual task a human being can. Two critical domains in this quest are computer vision and natural language processing (NLP). As of now, both are far from the level of intelligence that most AGI enthusiasts hope to see. In this exploration, we ask which problem might be solved first, or whether the two are so intertwined that they will be solved simultaneously.

Human-Centric Aspects of Computer Vision and NLP

Computer vision and NLP are deeply rooted in human-centric constructs. A sentient being might have a “vision” that perceives electromagnetic waves beyond the visible spectrum, and it might represent the world with a symbolic system unlike any human language. In principle, therefore, neither computer vision nor NLP as currently conceived is a prerequisite for general artificial intelligence.

However, if we were to create intelligence the human way, i.e., by mimicking human visual and linguistic abilities, both computer vision and NLP would need to advance significantly. A person who has never had sight may find it hard to fully grasp descriptions of visually rich objects, and a person who can see but cannot read or write will struggle to follow written human communication. This interplay between sensory input and cognitive processing highlights the importance of both domains.

Interdependence of Computer Vision and NLP

Computer vision and NLP are not standalone technologies but parts of a broader system that models and understands the world. Both require modeling objects, entities, actions, and concepts. Computer vision works from the static and dynamic visual aspects of the world, while NLP works from textual and symbolic representations. The two domains are interconnected: progress in understanding one enhances the other.

For instance, computer vision applied to static scenes focuses on objects, but in dynamic scenes it must also model time, actions, and causality. Language describes that same world of objects, actions, and causes, and visual context can change what a piece of text means. Both domains therefore share processing stages that are fundamental to their capabilities.

Abstraction and Complexity

One key difference between NLP and computer vision is abstraction. Language can express pure abstraction directly, whereas computer vision starts from concrete visual cues. This difference might suggest that computer vision will be solved first, since it deals with more concrete representations. However, understanding how a physical scene changes over time may itself require a grasp of causality and abstraction. The notion that vision will be solved first is therefore not straightforward.

Both domains are likely to advance over the same time frame, with each influencing the other. Effective approaches to computer vision and NLP will therefore probably emerge at around the same time. This interplay between sensory perception and cognitive understanding underscores how closely the two technologies depend on each other in the pursuit of general artificial intelligence.

Conclusion

The journey towards general artificial intelligence involves intricate challenges in both computer vision and natural language processing. While these fields are distinct, they are deeply intertwined. Both will need to evolve extensively to achieve AGI. As researchers continue to push the boundaries in both domains, the solutions to these problems will likely converge in an integrated approach. The path forward remains exciting and full of possibilities for the future of AI.