Title: New methods for metadata extraction from scientific literature
Title variant: Nowe metody wydobywania metadanych z literatury naukowej
Authors: Dominika Beata Tkaczyk
Partner: Instytut Badań Systemowych PAN w Warszawie
Description: In the scientific world, ideas are spread and new discoveries announced mainly by publishing and reading scientific literature. Over the past few decades we have witnessed a digital revolution, which moved scholarly communication to electronic media and substantially increased its volume. Keeping track of the latest scientific achievements is now a major challenge for researchers.
Scientific information overload is a severe problem that slows down scholarly communication and knowledge propagation across academia.
Modern research infrastructures facilitate the study of scientific literature by providing intelligent search tools, proposing similar and related documents, building and visualizing interactive citation and author networks, assessing the quality and impact of articles using citation-based statistics, and so on. To provide such high-quality services, a system needs access not only to the text content of stored documents but also to their machine-readable metadata. Since good-quality metadata is not always available in practice, there is a strong demand for a reliable automatic method of extracting machine-readable metadata directly from source documents.
Our research addresses these problems by proposing an automatic, accurate and flexible algorithm for extracting a wide range of metadata directly from scientific articles in born-digital form. The extracted information includes basic document metadata, the structured full text and the bibliography section.
Designed as a universal solution, the proposed algorithm is able to handle a wide variety of publication layouts with high precision and is thus well suited for analyzing heterogeneous document collections. This was achieved by employing supervised and unsupervised machine-learning algorithms trained on large, diverse datasets. The evaluation we conducted showed good performance of the proposed metadata extraction algorithm. A comparison with similar solutions also showed that our algorithm performs better than the competing approaches for most metadata types.
The proposed method is a reliable and accurate solution to the problem of extracting metadata from documents.
It allows modern research infrastructures to provide intelligent tools and services that help readers cope with the growing volume of scientific literature, which in turn facilitates communication among scientists and improves both knowledge propagation and the quality of research in the scientific world.
Keywords: "eksploracja danych"@pl, "analiza dokumentów"@pl, "wydobywanie metadanych"@pl, "uczenie maszynowe"@pl, "Machine Learning"@en
Resource type: thesis
Scientific discipline: technical sciences / computer science (2011)
Target audience: researchers, students, entrepreneurs
Harmful content: No
Supervisor: Marek Antoni Niezgódka (10195)
Resource language: Polish
Date of creation: 2015
Location: Warsaw
Place of creation: Warsaw
Number of pages: 180
Rights/license: CC BY-SA 4.0
Depositor: Anna Wasilewska
Date made available: 15-10-2018
Resource link (portal): https://zasobynauki.pl/zasoby/new-methods-for-metadata-extraction-from-scientific-literature,21567/
Resource link (repository): https://id.e-science.pl/records/21567