DSpace/Manakin Repository

Probabilistic topic modeling with latent Dirichlet allocation on Apache Spark

dc.contributor Cerquides Bueno, Jesús
dc.creator Cortés Hinojosa, Carlos Omar
dc.date 2016-10-19T07:31:35Z
dc.date 2016-06-05
dc.date.accessioned 2024-12-16T10:23:14Z
dc.date.available 2024-12-16T10:23:14Z
dc.identifier http://hdl.handle.net/2445/102724
dc.identifier.uri http://fima-docencia.ub.edu:8080/xmlui/handle/123456789/15657
dc.description Bachelor's thesis in Computer Engineering, Faculty of Mathematics, Universitat de Barcelona, Year: 2016, Advisor: Jesús Cerquides Bueno
dc.description In a world in which we have access to vast amounts of data, it is important to develop new tools that allow us to navigate it. Probabilistic topic models are statistical methods that analyse text corpora and discover the themes that best explain their documents. In this work, we introduce probabilistic topic models, with special focus on one of the most common, Latent Dirichlet Allocation (LDA). To learn an LDA model from data, we present two variational inference algorithms, one for batch learning and one for online learning. Both algorithms are implemented on Apache Spark, a popular Big Data computing framework. We introduce this framework and study the scalability of the algorithms and the coherence of the resulting topics on two news data sets, from the New York Times and BBC News. The results point to the need to tune Apache Spark in order to boost its performance, and to the good quality of the resulting topics on the BBC News data set.
dc.format 59 p.
dc.format application/pdf
dc.language eng
dc.rights thesis report: cc-by-nc-sa (c) Carlos Omar Cortés Hinojosa, 2016
dc.rights code: GPL (c) Carlos Omar Cortés Hinojosa, 2016
dc.rights http://creativecommons.org/licenses/by-sa/3.0/es
dc.rights http://www.gnu.org/licenses/gpl-3.0.ca.html
dc.rights info:eu-repo/semantics/openAccess
dc.source Bachelor's Theses (TFG) - Computer Engineering
dc.subject Statistical methods
dc.subject Natural language processing (Computer science)
dc.subject Computer software
dc.subject Bachelor's theses
dc.subject Computer algorithms
dc.subject Big data
dc.title Probabilistic topic modeling with latent Dirichlet allocation on Apache Spark
dc.type info:eu-repo/semantics/bachelorThesis


Files in this item

No files are associated with this item.
