How can high-quality scientific data resources be built for AI4S? What challenges arise in the construction process, and what are the key technologies? How can multi-modal data be managed and analyzed in a unified way? How can an open, shared ecosystem for systematic scientific data be built?
At "Building AI4S Infrastructure - AI4S Database and Knowledge Base", an academic summit of the 2023 Science Intelligence Summit held on August 11, a number of experts and scholars discussed infrastructure construction for AI for Science (AI4S), focusing on the construction of databases and knowledge bases in the AI4S era.
In scientific research, data and literature form the knowledge system and treasure house that support research and decision-making across disciplines. Reducing the time researchers spend searching for and processing scientific data and literature is an opportunity to improve research efficiency. The introduction of AI, together with improvements in model and algorithm capabilities, has brought innovation to the processing of massive scientific data and to the automatic sorting and classification of literature, greatly improving the efficiency of scientific research.
Hu Zhengyin, research librarian at the Chengdu Library and Information Center, Chinese Academy of Sciences, believes that scientific and technological literature contains a large amount of credible, professional and standardized domain knowledge, as well as scientific data such as experimental parameters, formulas and charts, all of which can provide high-quality data support for AI4S applications. Freeing researchers from heavy literature reading can also accelerate knowledge acquisition and scientific innovation.
Zhou Guomin, director of the Farmland Irrigation Research Institute of the Chinese Academy of Agricultural Sciences and director of the National Agricultural Science Data Center, said that data is one of the cores of AI4S. As one of the most basic outputs of scientific research activities, scientific data has become a key force in improving research efficiency. An ecosystem that moves from "self-use" to "self-proof" to "use by others" is gradually forming, and scientific data now runs through the whole process of scientific research activities.
Scientific research requires rigor and must rely on an accurate knowledge system. In addition to scientific literature, scientific data is also a focus of research. Building literature and scientific data into knowledge bases and databases, and using large language models and other technologies to improve research efficiency in different fields, has become a key research direction of AI4S. Zhou Yuanchun, deputy director and researcher at the Computer Network Information Center of the Chinese Academy of Sciences, believes that at the starting point of a new scientific revolution, as the research mode shifts from "small workshop" to "large platform", it is necessary to focus on solving common problems in order to build a better AI4S innovation base and promote the rapid development of AI4S. Building large models for scientific research can improve the quality and efficiency of work in different research fields and give researchers more time and energy for key problems and innovative thinking in their own fields.
One thing that can't be ignored when discussing the effects of machine learning is the importance of high-quality data. Scientific data include observation data, experimental data, recorded data, survey data, simulation data and so on. Each type of data has its unique application fields and acquisition methods, and the comprehensive use of these data is of great significance for scientific research.
Using machine learning to extract associated data from scientific literature and to establish multi-modal databases for comprehensive use is a growing trend. Su Yanjing, a professor at the University of Science and Technology Beijing, said that the materials field needs transparent scientific data, a convenient database architecture, and accurate retrieval in order to meet its research needs.
Li Xin, a researcher at the Institute of Zoology, Chinese Academy of Sciences, said that generative models give AI4S more opportunities to make full use of massive scientific data, enabling larger models to produce greater effects. This may not only overturn the basic research paradigm in the life sciences, but also promote industrial transformation and accelerate development.
Wang Han, a senior engineer at Zhijiang Laboratory, said that a fast radio burst (FRB) lasts only a few milliseconds yet releases energy equivalent to what the sun emits in a whole day. Many radio telescopes around the world can observe this phenomenon, but the model parameters of the observing equipment differ, which can bias the final results. For research purposes, standard databases and specifications are therefore very helpful for exploring the source and mechanism of fast radio bursts.
From the perspective of openness and sharing, Du Yi, a researcher at the Computer Network Information Center of the Chinese Academy of Sciences, said that scientific data has gained importance both at home and abroad in recent years. At the end of last year, the Central Committee of the Communist Party of China and the State Council issued the Opinions on Building a Data Basic System to Better Play the Role of Data Elements, which put forward the working principles of adhering to sharing and common use, strengthening high-quality supply, improving the governance system, and deepening opening-up and cooperation. These principles are of guiding significance for the development of scientific data.
In 2019, the Ministry of Science and Technology and the Ministry of Finance optimized and adjusted the original national platforms. The aim was to implement the requirements of the Measures for the Management of Scientific Data and the Measures for the Management of the National Science and Technology Resource Sharing Service Platform, standardize the management of the national science and technology resource sharing service platform, improve the science and technology resource sharing service system, and promote the open sharing of science and technology resources with society. Following departmental recommendations and expert consultation, a total of 20 national science data centers, including the National Space Science Data Center, were established.
What role does AI4S play in building and developing these national science data centers? How can the technical advantages of AI4S be brought into full play to help build high-quality national science data centers?
On the construction and development of the National Space Science Data Center, Zou Ziming, director of the center and researcher at the National Space Science Center of the Chinese Academy of Sciences, said that AI empowering space science places higher requirements on high-quality data, and that developing AI-ready scientific data requires effort in data processing, information mining, knowledge discovery and prediction applications. Aiming at AI for Space Science, the National Space Science Data Center plans to further build a platform-type, service-oriented, open and research-oriented scientific data center. It will rely on the open research paradigm of the three "E environments": "STAR-E" for solar-Earth space weather, "PSAR-E" for planetary science, and "Heard-E" for high-energy astronomy. The goal is to support human-machine collaborative research driven by scientific data, the evolution of autonomous learning, the emergence of complex systems, and global tracking and prediction processes.
Speaking of the construction and development of the National Microbiology Data Center, Ma Juncai, director of the center and a researcher at the Institute of Microbiology of the Chinese Academy of Sciences, said that AI4S is not a single-dimension island of data, but a matter of domain data fusion and integration. It therefore requires the effective integration of microbial resources, literature, patents, functions, omics and other data. This lays a good foundation for AI4S work, so that data covering the whole life cycle of microorganisms can deliver its real value.
On the construction and development of the National Cryosphere Data Center, Zhang Yaonan, director of the National Cryosphere Data Center and a researcher at the Northwest Institute of Eco-Environment and Resources of the Chinese Academy of Sciences, [...] investigation, numerical simulation, test analysis, remote sensing inversion, statistical analysis and other categories of data. To support AI applications more effectively, a dataset storage environment is being built at the same time, and a support system is established through the management of raw data, AI sample data, and AI datasets. Beyond dataset management and reorganization, it is also necessary to establish an application platform of "AI data set + AI algorithm + intelligent computing" to carry out this work.
In recent years, as the importance of scientific data has been widely recognized, there has been a marked increase in the quantity and quality of data. The development of AI4S requires the cooperation of data, algorithms, computing resources and other aspects. In the face of data-related challenges in the future, it is still necessary to pay attention to the quality and utilization of data.
From August 10 to 11, the 2023 Science Intelligence Summit was held in Beijing. Part of the Zhongguancun Forum series of activities, the summit was hosted by the Beijing Institute of Science Intelligence and aimed to build a co-creation platform for scientific research breakthroughs, technology cultivation and talent exchange in the field of AI for Science. The summit comprised a main forum and 10 thematic academic summits, covering topics such as model algorithms, databases, energy materials, and computing engines. Participating academicians, experts and business representatives shared advanced ideas and cutting-edge insights, presented research results and innovative technologies, and looked ahead to the future development of AI for Science.
Reprinted from official account [Keyuan Data]