Integrative Artificial Intelligence as a Key Ingredient of Systems Biology
Ben Goertzel*, Cassio Pennachin,
Lucio de Souza Coelho, Izabela Lyon Freire,
Leonardo Kenji Shikida, Rafael Silva
Systems biology – the understanding of biological systems as complex, self-organizing, autopoietic systems – has many different dimensions, and is best addressed, we feel, by a combination of different research approaches. Much research currently conducted under the rubric of systems biology involves simulation modeling, and this work is very important; however, it must be complemented by an equal focus on systems-biology oriented data analysis and natural language text understanding. Furthermore, we argue that, in order to analyze biological data or texts in a way that truly acknowledges the systemic nature of cells, tissues and organisms, it is necessary to move beyond standard statistical and machine learning approaches, and take more of an “integrative artificial intelligence” approach, in which quantitative data analysis, wide-ranging data integration and automated reasoning are used together to provide an integrative understanding of biological systems. And – although we will not pursue this topic in depth here – the knowledge generated in this way may be used to guide systems-biology simulation modeling in cases where biological data is too incomplete to allow detailed simulation equations to be created.
This is the approach we have taken with our recent work using the Biomind AI Engine, an integrative bioinformatics/AI software system based on the Novamente AI Engine framework. In this paper we review the key properties of the Biomind AI Engine, and then describe one of the current applications of this software system: to the analysis of gene expression microarray data in a manner assisted and enhanced by knowledge from integrated biological databases.
1.2. The Biomind AI Engine
1.2.1. An Integrative Bioinformatics Software Framework
The Biomind AI Engine is a novel AI software system aimed at bioinformatics applications, including data analysis, database integration, text mining and simulation modeling. Unlike any other biology-oriented software system, it is a general-purpose framework for biological informatics and computational biology, applicable to the full variety of problems in this category, including problems at the interface between biology and other domains such as chemistry, medicine and linguistics. The overall goal is to analyze, manage, predict and understand diverse data regarding the multiple levels of biological systems, in a unified way.
The Biomind AI Engine integrates aspects of many prior bioinformatic approaches, including clustering, categorization, time series analysis, pattern recognition, probabilistic modeling, sequence analysis and nonlinear-dynamical simulation. However, there are also qualitative differences from what has come before. In the area of data analysis, other bioinformatic methods tend to search for data patterns adhering to narrowly restrictive forms, whereas the Biomind AI Engine is much more flexible and adaptive in its search of ``pattern space.'' Furthermore, it has the capability to integrate data of diverse types from diverse databases, and reason on this data, forming its own ``BiomindDB'' of knowledge inferred from, but not directly contained in, existing databases. This knowledge may then be used in many ways – for example, to guide the system's analyses of experimental data, leading it to new insights that would never be obtained from looking at the data in a more isolated way. And the Biomind AI Engine learns from experience, so that each dataset it analyzes gives it general knowledge that it may use in analyzing other datasets.
The first major applications of the Biomind AI Engine have been to gene expression data analysis, as embodied in the Biomind Analyzer software product; and to the integration of biological databases, as embodied in the BiomindDB integrative database (used in the Biomind Analyzer, but also valuable on its own). Two additional major applications are under development, in the areas of biological text mining and pathway inference/extension. Further minor applications have also been carried out, including an application to the development of disease diagnostics based on combinations of uncommon mutations in the mitochondrial genome.
In practical terms, the Biomind AI Engine is a Linux-based C++ program, with a few Java components handling peripheral tasks. It is intended to be run as a ``server,'' meaning that it requires a dedicated high-end PC with at least 1GB of memory and ideally more than one processor. Making use of distributed processing technology, it can also run on a number of servers networked together. The system is still under development and will be for some time, but many important structures and dynamics are in place, enabling practical applications as described above.
1.2.2. An Integrative Approach to Artificial General Intelligence
The Biomind AI Engine is built on top of a general-purpose AI architecture called Novamente [Goertzel et al, 2003; Goertzel and Pennachin, in preparation], which has been used for other applications in the areas of numerical data analysis, text mining and natural language processing. It adds a variety of specific biology-focused techniques to Novamente’s general AI capabilities.
The Novamente AI framework is founded on three general principles:
In contrast to these principles, most contemporary biological software tends to be extremely narrowly focused. For example, in the data analysis domain, each algorithm carries out one particular type of analysis, often in a manner narrowly customized to one particular type of data. These different algorithms are often combined together into toolkits, but this generally means little more than the provision of a common user interface and common data preprocessing methods. The tight integration methodology is something quite different; it involves considering different analytical and inferential algorithms as part of the same overall, synergetic analytical process. Diverse analytical processes may act together on the same data store, achieving a level of intelligence not possible by any of the processes on their own, or by a loosely integrated confederation of processes.
The flexibility of a tightly integrated setup enables the same synergetic structures and processes to handle a wide variety of different problems. For instance, the same probabilistic inference technique used to generalize patterns recognized in gene expression data, may be used for combining relationships derived from different databases, or to assist in probabilistic sequence matching, or to carry out key aspects of natural language parsing.
1.2.3. Knowledge, Reasoning and Learning
Artificial intelligence involves three major aspects: knowledge representation, reasoning, and learning [Russell and Norvig, 2002].
Knowledge representation is handled in the Biomind AI Engine via a unique scheme that integrates symbolic and subsymbolic techniques. It combines aspects of attractor neural networks and semantic networks, and other aspects as well. There are two primary levels of representation involved:
Nodes and Links come in a variety of types, representing different types of data and different types of relationships; e.g. InheritanceLink, SimilarityLink, AssociativeLink and PredicateEvaluationLink; ConceptNode, GeneNode, PredicateNode.
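As a rough illustration of how typed Nodes and Links might be held in a single store, consider the following minimal sketch. It is not the Novamente or Biomind implementation: the class names `Node`, `Link` and `AtomStore`, the `strength` field, and the gene name `gene_A` are all hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_type: str  # e.g. "GeneNode", "ConceptNode", "PredicateNode"
    name: str


@dataclass(frozen=True)
class Link:
    link_type: str  # e.g. "InheritanceLink", "SimilarityLink"
    source: Node
    target: Node
    strength: float = 1.0  # assumed probability-like weight in [0, 1]


class AtomStore:
    """Toy in-memory store of typed nodes and links (illustrative only)."""

    def __init__(self):
        self.nodes = set()
        self.links = []

    def add_link(self, link):
        # Adding a link also registers both of its endpoints.
        self.nodes.add(link.source)
        self.nodes.add(link.target)
        self.links.append(link)

    def outgoing(self, node, link_type=None):
        # All links leaving `node`, optionally filtered by link type.
        return [
            link for link in self.links
            if link.source == node
            and (link_type is None or link.link_type == link_type)
        ]


store = AtomStore()
gene = Node("GeneNode", "gene_A")            # hypothetical gene
category = Node("ConceptNode", "DNA binding")
store.add_link(Link("InheritanceLink", gene, category, strength=0.9))
```

Keeping the type as data on each Node and Link, rather than in separate containers per type, is one way to let different analytical processes query the same store.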
A unique form of probabilistic inference called probabilistic term logic is utilized for reasoning and some kinds of learning. More speculative learning is carried out using evolutionary techniques, including genetic algorithms [Goldberg, 1989], genetic programming [Koza, 1992] and a variant of the Bayesian Optimization Algorithm [Pelikan, 2002]. Some more conventional machine learning algorithms, such as hierarchical and partitioning-based clustering and support vector machines, are also incorporated.
1.2.4. Cognitive Processes in the Biomind AI Engine
The crux of the Biomind AI Engine's intelligence lies in a collection of software objects called MindAgents and Tasks, which dynamically update the Atoms in the system’s knowledge base on an ongoing basis. MindAgents are continual in their operation -- regardless of what inputs are coming into the system or what demands are placed upon it, these MindAgents keep on working, analyzing the information in the system and creating new information based on it. MindAgents or user queries may create Tasks, such as data analysis tasks, which are more short-lived.
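The distinction between continually running MindAgents and short-lived Tasks can be sketched as a simple cycle loop. This is only a schematic under our own assumptions, not the engine's actual scheduler: `MindAgent`, `SpawningAgent`, `run_cycles` and the string "atoms" are hypothetical stand-ins.

```python
from collections import deque


class MindAgent:
    """Long-lived process: invoked on every cycle, whatever the input."""

    def run(self, atoms, task_queue):
        raise NotImplementedError


class SpawningAgent(MindAgent):
    """Toy agent that counts its cycles and spawns one Task on the first."""

    def __init__(self):
        self.cycles = 0

    def run(self, atoms, task_queue):
        self.cycles += 1
        if self.cycles == 1:
            # A Task is modeled here as a one-shot callable on the atom list.
            task_queue.append(lambda store: store.append("derived-atom"))


def run_cycles(agents, atoms, n_cycles):
    task_queue = deque()
    for _ in range(n_cycles):
        # Every agent runs on every cycle.
        for agent in agents:
            agent.run(atoms, task_queue)
        # Tasks are executed once and then discarded.
        while task_queue:
            task = task_queue.popleft()
            task(atoms)
    return atoms


agent = SpawningAgent()
atoms = run_cycles([agent], ["seed-atom"], n_cycles=3)
```

In this sketch the agent runs three times while the Task it created runs exactly once, which mirrors the continual versus short-lived distinction described above.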
Generally speaking, the MindAgents and Tasks serve four overlapping functions. They recognize patterns in empirical datasets; they combine patterns found in empirical datasets to form new, more generalized patterns; they create new knowledge by combining knowledge derived from existing databases, and combining knowledge in databases with inferred empirical data patterns; and finally, they adapt the system's learning processes in a context-dependent way, based on what has been found to work best in the past. However, the MindAgents and Tasks cannot be neatly divided up according to function; many of them serve multiple purposes. For instance, probabilistic inference plays a role in all four of the functions; and evolutionary optimization plays a role in several.
The primary MindAgents we're currently working with are:
The various MindAgents cooperate together in various ways. For example, inferring that two genes with similar sequences and similar expression values may belong to similar categories is a job for first-order inference and logical link-mining. Recognizing a complex pattern of expression among a set of genes in a collection of gene expression experiments, is a job for genetic pattern mining or genetic network inference. Combining complex patterns of expression to form new ones is a job for higher-order inference. Clustering creates new nodes, representing clusters, which may be used as terms in patterns recognized and manipulated by other MindAgents.
1.3. Application to Microarray Data Analysis
1.3.1. Overview of Microarray Technology
Microarray technology [Speed, 2003] is a powerful part of the ongoing biotech revolution. Gene arrays, with their ability to observe which genes are expressed in a cell at a given point in time, provide an unparalleled window into cellular activity.
The practice of microarray genomics and proteomics, however, is fraught with difficulties. Array technology is complex and in many ways error-prone, and results can be subtly dependent on experimental conditions. Many researchers feel that, although microarray datasets overall are valuable and informative, single data points derived from microarrays are not reliable. This means that bioinformatic data analysis methods, to be successfully applied to microarray data, must be extremely sophisticated, both in their robustness with respect to noise and in their ability to glean subtle data patterns. For maximal analytical intelligence, it is necessary to use algorithms that go beyond the data itself, using information derived from diverse biological databases to guide their study of the data and to compensate for the data’s flaws. Carrying out microarray data analysis within the integrative Biomind AI Engine software framework provides this. In the Biomind Analyzer software, analytics are done using not only the user's experimental data, but also diverse biological background knowledge contained in the BiomindDB.
1.3.2 Feature Vector Enhancement
In the Biomind Analyzer, a mathematical trick, feature vector enhancement, is used to deploy the BiomindDB knowledge in the microarray data analysis process.
In typical microarray data analysis procedures, if one is studying gene expression profiles associated with individual organisms or tissues, one represents each organism or tissue by a numerical vector, the entries of which represent the measured expression values of genes in that organism or tissue. These numerical “feature vectors” are then studied using appropriate mathematical methods.
The feature vector enhancement process consists of taking these standard numerical vectors and augmenting them with additional entries. These entries represent either the values of various clinical-data features for each organism or tissue or, yet more interestingly, the “inferred expression values” of gene or protein categories for the organism or tissue.
For example, consider the biological process of “DNA binding.” There are many genes known to be involved with DNA binding, and others that are inferred by the Biomind AI Engine to probably be involved with DNA binding (with varying degrees of probability). Based on all these genes, one can compute an estimated “weighted average expression value” for all the genes involved with DNA binding. In general this may include genes known to be involved with DNA binding, and also genes that the Biomind AI Engine has inferred to possibly be involved. This is the inferred expression value for the feature of “DNA binding.” Thousands of inferred expression values of this nature may be calculated for each organism or tissue, thus creating thousands of extra entries for each feature vector under analysis. Many of the extra features correspond to Gene Ontology categories or protein families or superfamilies, but some may correspond to novel categories learned by the Biomind AI Engine.
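The computation described above can be sketched in a few lines. The text specifies only a “weighted average expression value”; weighting each gene by its inferred probability of category membership is our assumption for this illustration, and the function names, gene names and numbers below are hypothetical.

```python
def inferred_expression(expression, membership):
    """Weighted average expression over the genes of one category.

    expression: dict mapping gene name to measured expression value
    membership: dict mapping gene name to membership probability
                (1.0 for known members, lower for inferred ones)
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for gene, prob in membership.items():
        if gene in expression:  # skip genes not measured on this array
            weighted_sum += prob * expression[gene]
            total_weight += prob
    return weighted_sum / total_weight if total_weight > 0 else 0.0


def enhance(expression, gene_order, categories):
    """Append one inferred expression value per category to the base vector."""
    base = [expression[g] for g in gene_order]
    extra = [
        inferred_expression(expression, categories[name])
        for name in sorted(categories)  # fixed order keeps vectors aligned
    ]
    return base + extra


# Hypothetical profile: gene_A is a known member, gene_B an inferred one.
expression = {"gene_A": 2.0, "gene_B": 4.0, "gene_C": 1.0}
categories = {"DNA binding": {"gene_A": 1.0, "gene_B": 0.5}}
enhanced = enhance(expression, ["gene_A", "gene_B", "gene_C"], categories)
```

Here the enhanced vector holds the three measured values followed by one extra entry, (1.0 × 2.0 + 0.5 × 4.0) / 1.5, for the “DNA binding” category.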
1.3.3 Supervised Categorization Using Enhanced Feature Vectors
One of the tasks that can be carried out using these enhanced feature vectors is supervised categorization. In a microarray data analysis context, this refers to taking two sets of microarray data – for example, a set from cancerous liver cells, and a set from a control population of healthy liver cells – and having software learn rules that distinguish the one set from the other.
Conventional statistical and machine learning methods are good at this when the rule is simple, for instance when the presence of a single gene is enough to predict the presence of cancer with reasonable accuracy. When the rules involved are subtler, however, the performance of contemporary techniques is significantly more erratic. The most effective techniques, such as support vector machines [Hastie et al, 2001], will commonly find complex patterns involving very large numbers of genes, which are difficult to comprehend or intuitively evaluate.
The Biomind Analyzer may sometimes find these same complex patterns involving expression levels of large numbers of genes. However, powered by the BiomindDB, it can also find classificatory rules involving more general information, such as ``genes involved in pathways associated with fermentation are particularly active in the profiles of the cancer patients.'' This kind of classificatory rule may potentially be much more informative, but it can only be provided by an analytic system that is doing more than just data analysis -- that is applying integrated information about multiple levels of biological systems to each of its data-analytic judgments.
Figure 1. Example Background-Knowledge-Based Classification Rule Learned Using Genetic Programming Classification in the Biomind Analyzer
Figure 1 shows an example of a classification rule that the Biomind Analyzer found, which predicts, based on a person’s gene expression profile (as revealed by microarray analysis of their blood cells), whether that person possesses a certain disease or not. This rule is based on a study that will be published separately in 2004 (in a collaboration between Biomind scientists and the biologists who gathered the microarray data and designed the experiments). The rule involves Gene Ontology categories, biological concepts loaded into the BiomindDB from the Gene Ontology database [Gene Ontology Consortium, 2000]. It was learned via the genetic programming algorithm as implemented in the Biomind AI Engine, and is depicted in the form of an “evaluation tree” as is commonly seen in the genetic programming literature. The “leaf nodes” of the tree indicate either numbers or GO categories, and when applying the tree to a given gene expression profile, the inferred expression value of the GO category at each leaf node is evaluated for that profile. The arithmetic operations indicated in the internal nodes are executed, ultimately resulting in a single number, which, if it is above a certain threshold, indicates that the profile belongs to the category being learned.
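The evaluation of such a tree can be sketched as a short recursive function. The tree encoding (nested tuples), the operator set, the category names and the threshold below are hypothetical choices made for illustration; they do not reproduce the rule in Figure 1 or the engine's internal representation.

```python
import operator

# Arithmetic operations allowed at internal nodes (assumed set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, features):
    """Evaluate an expression tree against one enhanced feature vector.

    A tree is a number (constant leaf), a string (category leaf, looked up
    in `features`), or a tuple (op, left, right) for an internal node.
    """
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        return features[tree]  # inferred expression value of the category
    op, left, right = tree
    return OPS[op](evaluate(left, features), evaluate(right, features))


def classify(tree, features, threshold):
    """A profile belongs to the learned category if the value exceeds the threshold."""
    return evaluate(tree, features) > threshold


# Hypothetical rule: 2 * category_1 - category_2
rule = ("-", ("*", 2.0, "category_1"), "category_2")
profile = {"category_1": 1.5, "category_2": 0.5}
value = evaluate(rule, profile)
in_category = classify(rule, profile, threshold=1.0)
```

For this made-up profile the tree evaluates to 2 × 1.5 − 0.5 = 2.5, which is above the threshold of 1.0.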
Systematic application of the background-knowledge-enhanced supervised classification process, as embodied in the Biomind Analyzer, has been carried out during late 2003 and early 2004. Results on standard datasets, as well as on original microarray data collected by our collaborators, will be published in a series of papers during the next year.
At the time of writing, the Biomind AI Engine is only partially complete, but we are already using it as the back end for the Biomind Analyzer product for analyzing gene expression data, and for our in-development text-mining product, as well as for other bioinformatics data analysis projects. As more of the engine is implemented and tested, the functionality will be expanded to deal with more types of data and more types of queries, ultimately resulting in a comprehensive artificial cognitive agent specifically oriented toward systems biology.
While much work remains to be done, we believe the Biomind AI Engine approach to systems-biology-based bioinformatics is extremely promising -- and this promise is partially fulfilled already by our current work. We expect each coming year of development and experimentation to yield more and more exciting and powerful results.
References
Ben-Dor, Amir, R. Shamir, Z. Yakhini, 1999. Clustering Gene Expression Patterns. Journal of Computational Biology, vol. 6, pp. 281-297.
Gene Ontology Consortium, 2000. Gene Ontology: Tool for the Unification of Biology. Nature Genetics. 25(1):25-9.
Goertzel, Ben and Cassio Pennachin, in preparation. Novamente: An Integrative Design for Artificial General Intelligence
Goertzel, Ben, Cassio Pennachin, Andre Senna, Thiago Maia and Guilherme Lamacie, 2003. Novamente: An Integrative Design for Artificial General Intelligence, Proceedings of IJCAI-03 Workshop on Agents and Cognitive Modeling.
Goldberg, David, 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
Hastie, T., R. Tibshirani, J.H. Friedman, 2001. The Elements of Statistical Learning. Springer-Verlag.
Koza, John, 1992. Genetic Programming. MIT Press (Cambridge MA).
Pelikan, Martin, 2002. Bayesian Optimization Algorithm: From Single-Level to Hierarchy. PhD Thesis in Department of Computer Science, UIUC.
Russell, Stuart and P. Norvig, 2002. Artificial Intelligence: A Modern Approach. Prentice-Hall.
Speed, T.P., 2003. Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall.