Topic Modeling in Python Using Latent Dirichlet Allocation (LDA)

Topic modeling is a method that identifies hidden topics in large volumes of text. Using Natural Language Processing (NLP) techniques, this approach enables the analysis of trends and the identification of the main themes discussed in the data.

Latent Dirichlet Allocation (LDA) is one of the most widely used algorithms for this purpose. It is implemented in the Gensim package for Python and is often applied to text categorization.

This process is essential for companies looking to monetize data and improve services by analyzing customer reviews, feedback, news, and social media.

The objective of this study is to develop an automated algorithm to identify the most discussed topics. To achieve this, it is essential to ensure data quality during preprocessing, as well as to define the best strategy for finding the optimal number of topics.

The analysis was performed with data extracted from the “Manchetes Brasil” repository by Paula Dornhofer Paro Costa (2017), available on GitHub. The dataset contains 500 headlines from Brazilian newspapers, published between December 2016 and August 2017, covering outlets such as Valor Econômico, O Globo, Folha de S. Paulo, and O Estado de S. Paulo.

The dataset is available at https://github.com/pdpcosta/manchetesBrasildatabase

1. LIBRARY IMPORTS

To begin, it was necessary to import some important libraries, including pandas, numpy, matplotlib, nltk, re, and gensim:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import unicodedata

# Importing the Natural Language Toolkit - NLTK library for natural language processing.
import nltk
nltk.download('wordnet')
nltk.download('punkt')

#Importing stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
language = 'portuguese'
stopwords = stopwords.words(language)
stopwords = list(set(stopwords))

#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

#Plotting
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from wordcloud import WordCloud, STOPWORDS
%matplotlib inline

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

2. DATA COLLECTION

The dataset used, as previously mentioned, is the Brazilian headlines database.

The CSV file was imported into Google Colab using the code below. It contains columns for day, month, year, newspaper, and headline.

To visualize the data, the head() method was applied, which returns the first 5 records from our dataset:

caminho = '/content/manchetesBrasildatabase.csv'
dataframe = pd.read_csv(caminho, quotechar="'", header=None, names=["Day", "Month", "Year", "Company", "Headline"])
dataframe.head()

For this article, topic modeling was performed only for the newspaper Folha de S. Paulo, so the .loc indexer was used with a boolean condition to select its rows and assign them to a new variable, as shown below:

dataframe_folha = dataframe.loc[dataframe['Company'] == 'Folha']
dataframe_folha

As a result, the dataframe ended up with 127 rows and 5 columns.
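To make the effect of quotechar and of the row filter concrete without needing the dataset file, here is a minimal sketch that uses Python's standard csv module as a stand-in for the pandas calls above. The three sample rows are made up for illustration and are not taken from the repository. The sketch assumes that headlines are wrapped in single quotes, so that commas inside a headline do not split the field.

```python
import csv
import io

# Made-up sample in the assumed layout: day, month, year, newspaper, headline.
# Headlines are wrapped in single quotes so inner commas stay in one field.
sample = (
    "1,12,2016,Folha,'Headline one, with a comma'\n"
    "1,12,2016,Globo,'Headline two'\n"
    "2,12,2016,Folha,'Headline three'\n"
)

columns = ["Day", "Month", "Year", "Company", "Headline"]

# quotechar="'" tells the parser that single quotes delimit a field.
reader = csv.reader(io.StringIO(sample), quotechar="'")
rows = [dict(zip(columns, fields)) for fields in reader]

# Keep only the rows of one newspaper, like the boolean condition given to .loc.
folha_rows = [row for row in rows if row["Company"] == "Folha"]

for row in folha_rows:
    print(row["Headline"])
```

With the default double-quote character, the first sample row would be split into six fields instead of five, which is presumably why the original code sets quotechar explicitly.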

3. DATA CLEANING

As the Headline column shows, the texts contain punctuation, accented characters, uppercase letters, and stopwords. Before the LDA model is applied, the text should be normalized to remove this noise.

Additionally, to be consumed by LDA, each sentence needs to be broken into words through tokenization.

Therefore, the following process was carried out:

a) Converting the column to a list, removing new lines and single quotes:

# Converting to list
data = dataframe_folha.Headline.values.tolist()

# Collapsing new lines and repeated whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Removing single quotes
data = [re.sub(r"'", "", sent) for sent in data]

b) Replacing uppercase letters with lowercase letters:

# Function that converts every word to lowercase
def to_lowercase(words):
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

c) Removing NON-ASCII characters:

# Function that removes non-ASCII characters
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words
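To see what the NFKD normalization followed by the ASCII round trip does to Portuguese words, here is a small sketch on a single word at a time. The helper name strip_accents and the sample words are assumptions chosen for illustration; only the unicodedata calls come from the function above.

```python
import unicodedata

def strip_accents(word):
    """Return word with accents removed and other non-ASCII characters dropped."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", word)
    # Encoding to ASCII with errors="ignore" discards the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")

examples = ["São", "eleição", "não", "economia"]
print([strip_accents(w) for w in examples])
# ['Sao', 'eleicao', 'nao', 'economia']
```

Note that a character with no ASCII base letter is dropped entirely rather than replaced: strip_accents("€5") returns "5".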

d) Removing stop words:

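Stop words are very frequent words that carry little meaning on their own and can be considered irrelevant for a set of documents. Examples in Portuguese: e, os, de, para, com, sem, foi (and, the, of, for, with, without, was).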

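def remove_stopwords(texts):
    # Keep only the tokens that are not in the stopwords list
    return [[word for word in simple_preprocess(str(doc)) if word not in stopwords] for doc in texts]

# Removing stop words (data_words is created by the tokenization in step e)
data_words_nostops = remove_stopwords(data_words)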

NLTK's Portuguese stopword list alone is not sufficient, because it is not very comprehensive. Therefore, the append() method was used to add some words identified during the analysis to the stopword list. The extended list must be defined before remove_stopwords is called for the additions to take effect.

# Adding new Portuguese stopwords
stopwords = nltk.corpus.stopwords.words('portuguese')
extra_stopwords = [
    'ja', 'viu', 'vai', 'ne', 'ai', 'ta', 'gente', 'nao', 'aqui',
    'tambem', 'vc', 'voce', 'entao', 'ate', 'agora', 'ser', 'sempre',
    'ter', 'so', 'porque', 'sobre', 'ainda', 'la', 'tudo', 'ninguem', 'de',
]
for word in extra_stopwords:
    stopwords.append(word)
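A likely reason why entries such as 'nao', 'voce' and 'tambem' are added without accents is that the tokens have already had their accents stripped, so an accented entry in a stop list no longer matches them. The sketch below illustrates this with a tiny hand-written stop list and a made-up token list; the helper name drop_stopwords is hypothetical, and nothing here is taken from NLTK or from the dataset.

```python
# Small hand-written sample; a real Portuguese stop list is much longer.
stop_list = {"não", "de", "para", "com"}

# Made-up tokens as they look after accent removal ("não" has become "nao").
tokens = ["governo", "nao", "aprova", "reforma", "de", "previdencia"]

def drop_stopwords(tokens, stop_list):
    """Keep only the tokens that are not in stop_list."""
    return [token for token in tokens if token not in stop_list]

before = drop_stopwords(tokens, stop_list)
print(before)
# ['governo', 'nao', 'aprova', 'reforma', 'previdencia']  ("nao" slips through)

# Adding the accent-free form closes the gap.
stop_list.add("nao")
after = drop_stopwords(tokens, stop_list)
print(after)
# ['governo', 'aprova', 'reforma', 'previdencia']
```

Using a set for the stop list also makes each membership check constant time, which matters more as the corpus grows than it does for 127 headlines.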

e) Removing punctuation and tokenizing with Gensim's simple_preprocess (this step creates data_words, so it must run before the stop word removal in step d):

# Removing punctuation and tokenizing (required to apply the LDA model)
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; punctuation is discarded by the tokenizer itself
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
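For readers who want to see roughly what this tokenization step produces without installing Gensim, the sketch below is a standard-library approximation. It is not Gensim's implementation: the function name rough_preprocess is hypothetical and the headline is made up. It only mimics the documented behaviour of simple_preprocess with deacc=True, namely lowercasing, accent removal, and discarding tokens shorter than 2 or longer than 15 characters.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, strip accents, keep alphabetic tokens."""
    # Strip accents so that "eleição" becomes "eleicao".
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep runs of letters only: punctuation and digits act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop very short and very long tokens.
    return [token for token in tokens if min_len <= len(token) <= max_len]

# Made-up headline, used only to show the shape of the output.
headline = "Governo não aprova a reforma da Previdência em 2017!"
print(rough_preprocess(headline))
# ['governo', 'nao', 'aprova', 'reforma', 'da', 'previdencia', 'em']
```

The one-letter word "a", the year and the exclamation mark disappear, while short function words such as "da" and "em" survive. That is why a separate stop word removal step is still needed after tokenization.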