Information theory is the mathematical study of the quantification, storage, and communication of information. Its concepts are used across a wide variety of scientific areas, and even very distant fields sometimes independently develop methodologies built on the same underlying information-theoretic framework.
In one such framework, pointwise mutual information (PMI) is used to profile and compare objects based on the interrelations between their features. PMI is commonly used in linguistics to identify unusual word combinations, with the aim of estimating text complexity. In the medical sciences, mutual information has been applied to profile relations between stressors, health conditions, genes, and other factors in order to build comorbidity charts useful for disease study and preventive medicine. In cheminformatics, however, the use of PMI has not been reported so far, even for a task as basic as comparing compound sets based on the interrelations between their structural features.
In the research presented here, PMI is used to characterize several publicly available chemical databases in terms of the association strength between individual compound substructures. Z-standardized relative feature tightness (ZRFT), a PMI-derived measure that quantifies how well a given compound fits, in terms of its substructures, into a particular chemical database, is applied to the analysis of compound synthetic accessibility, as well as to the classification of compounds as easy or hard to synthesize. “The results presented in the current work indicate that structural feature co-occurrence, quantified by PMI or ZRFT, contains a significant amount of information relevant to physico-chemical properties of organic compounds,” says principal researcher Professor Svozil.
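To make the idea of association strength between substructures more concrete, here is a minimal Python sketch of PMI computed over feature co-occurrence. It uses the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))), with probabilities estimated as the fraction of compounds containing a feature or a feature pair. The function name `pmi_table`, the base-2 logarithm, and the toy feature labels are assumptions made for illustration only. The sketch does not reproduce the fingerprints or the exact procedure used in the article, and it does not attempt to compute ZRFT.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(feature_sets):
    """Return {(a, b): PMI} for every feature pair that co-occurs at least once.

    feature_sets: one collection of feature labels per compound.
    Pair keys are tuples in alphabetical order.
    """
    n = len(feature_sets)
    single = Counter()  # number of compounds containing each feature
    pair = Counter()    # number of compounds containing both features of a pair
    for features in feature_sets:
        unique = sorted(set(features))
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Hypothetical toy "database": each compound is a set of substructure labels.
compounds = [
    {"benzene", "amide"},
    {"benzene", "amide"},
    {"benzene", "nitro", "sulfone"},
    {"amide"},
]

table = pmi_table(compounds)
for features, value in sorted(table.items()):
    print(features, round(value, 3))
```

In this toy set, the rare pair ("nitro", "sulfone") always occurs together and receives a PMI of 2.0, while the common pair ("amide", "benzene") co-occurs slightly less often than chance would predict and receives a small negative PMI. Pairs that never co-occur are left out of the table, because their PMI would be negative infinity.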
Text is based on the research article: Čmelo, I., Voršilák, M., & Svozil, D. (2021). Profiling and analysis of chemical compounds using pointwise mutual information. Journal of Cheminformatics, 13(1). https://doi.org/10.1186/s13321-020-00483-y