Document: Components of an Automatic Single Document Summarization System in the News Domain

Title:

Components of an Automatic Single Document Summarization System in the News Domain

Alternative title:

Komponenten eines Automatischen Einzeldokument-Zusammenfassungssystems im News-Bereich

Bookmark URL:

https://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=42475

URN (NBN):

urn:nbn:de:hbz:061-20170807-081437-4

Collection:

Dissertations

Language:

English

Document type:

Academic theses » Dissertation

Media type:

Text

Author:

Modaresi Chahardehi, Pashutan [Author]

Files:

Adobe PDF, 1.12 MB in one file
Files dated 30.05.2017 / modified 30.05.2017

Contributors:

Prof. Dr. Conrad, Stefan [Reviewer]
Prof. Dr. Kollmann, Markus [Reviewer]

Dewey Decimal Classification:

000 Computer science, information science, general works » 004 Data processing; computer science

Descriptions:

Abstract

The increasing amount of information available on the Internet has necessitated the creation of tools and algorithms to manage and summarize it automatically. This problem is even more pressing when dealing with unstructured data such as textual documents. An enormous amount of textual information is created daily by sources such as blogs, Twitter, Facebook and online news. To manage this volume of information, automatic approaches are required to summarize it in a compact form. For this reason, automatic text summarization has attracted considerable attention from researchers in the fields of natural language processing and artificial intelligence. In the news domain specifically, a large number of companies and organizations are interested in using sophisticated algorithms to summarize news articles automatically. Although automatic text summarization has been investigated for almost sixty years, there is still enormous potential to improve existing approaches. In general, existing summarization approaches are categorized as single-document (a single document has to be summarized) or multi-document (multiple documents have to be summarized). In this work, we focus on the problem of automatic single-document text summarization in the news domain and investigate several components and elements that we claim to be critical in the design and development of summarization algorithms.

In our work, we follow a bottom-up approach. We start with an attempt to formally define the problem of automatic text summarization and propose a definition that we use as a guideline throughout the work. Next, we propose approaches to automatically collect training data for summarization algorithms that incorporate machine learning. As a critical component in many summarization systems, we propose a machine learning approach to summarize a textual document in the form of keywords and keyphrases. Moreover, we propose components to automatically detect and remove redundant sentences in a textual document and to order the remaining ones so that the resulting summary text is linguistically coherent. We also propose automatic and manual approaches to evaluate the quality of the created summaries. Specifically, we present the results of an extensive study in the domain of media monitoring and media response analysis and show the substantial financial benefits of incorporating automatic summarization systems.

Another contribution of this work is the attempt to establish a connection between the fields of automatic text summarization and digital text forensics, in which techniques from digital text forensics such as author verification, author profiling and plagiarism alignment detection are used to improve the quality of the summaries and to ensure that the documents and their automatically created summaries follow the same writing style.

License:

Protected by copyright

Department / Institution:

Faculty of Mathematics and Natural Sciences » Department of Computer Science » Databases and Information Systems

Document created on:

07.08.2017

Files modified on:

07.08.2017

Doctoral application submitted on:

01.03.2017

Date of doctorate:

11.04.2017

Heinrich-Heine-Universität Düsseldorf
