Do you really know how to make word cloud images with wordcloud?

Preface

When it comes to text analysis, nobody can get around word clouds, and making word clouds in Python means using wordcloud. But what I want to ask is: can you really use it? You may have followed an online tutorial and produced a nice word cloud, but I think this article will help you understand the principles behind wordcloud.

A quick first try

First, install this third-party library with pip. Then let's briefly look at the difference between English and Chinese word clouds.
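Installation is a single pip command (jieba is included here too, since we will need it later for Chinese segmentation):

```shell
pip install wordcloud jieba
```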

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
wc.generate(text)

plt.imshow(wc)

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = '我叫罗攀，他叫张三，我叫罗攀'  # "My name is Luo Pan, his name is Zhang San, and my name is Luo Pan"

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')  # set a Chinese font (macOS path; adjust for your system)
wc.generate(text)

plt.imshow(wc)

You will find that the Chinese word cloud is not what we want, because wordcloud cannot segment Chinese text into words. The source-code analysis below should make clear why.

WordCloud source code analysis

We mainly look at the WordCloud class. I won't paste the entire source here; instead we'll walk through the overall flow of building a word cloud.

class WordCloud(object):
    
    def __init__(self):
        '''This is mainly to initialize some parameters
        '''
        pass

    def fit_words(self, frequencies):
        return self.generate_from_frequencies(frequencies)

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        '''The word frequency is normalized to create a drawing object 
        '''
        pass

    def process_text(self, text):
        """ Segments the text and preprocesses it
        """

        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
                 else 0)
        pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
        regexp = self.regexp if self.regexp is not None else pattern

        words = re.findall(regexp, text, flags)
        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # remove short words
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]

        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)

        return word_counts

    def generate_from_text(self, text):
        words = self.process_text(text)
        self.generate_from_frequencies(words)
        return self

    def generate(self, text):
        return self.generate_from_text(text)

When we use the generate method, the calling sequence is:

generate_from_text
process_text                  # text preprocessing
generate_from_frequencies     # word-frequency normalization, create the drawing object

Note: whichever entry point you use when making a word cloud, generate or generate_from_text, the flow ends up in generate_from_text, which calls process_text and then generate_from_frequencies.
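The dispatch above can be mimicked with a toy class (a sketch for illustration only, not the real library; whitespace splitting stands in for the real tokenizer):

```python
class MiniWordCloud:
    """Toy model of WordCloud's generate -> generate_from_text ->
    process_text / generate_from_frequencies call chain."""

    def process_text(self, text):
        # stand-in tokenizer: split on whitespace and count tokens
        words = text.split()
        return {w: words.count(w) for w in set(words)}

    def generate_from_frequencies(self, frequencies):
        # normalize counts by the largest one, as the real code does
        max_freq = max(frequencies.values())
        self.words_ = {w: n / max_freq for w, n in frequencies.items()}
        return self

    def generate_from_text(self, text):
        return self.generate_from_frequencies(self.process_text(text))

    def generate(self, text):
        # generate is just a thin wrapper around generate_from_text
        return self.generate_from_text(text)

wc = MiniWordCloud().generate("my is luopan he is zhangshan")
print(wc.words_["is"])  # 1.0 -- "is" occurs most often
```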

So the two functions that matter most here are process_text and generate_from_frequencies. Let's look at each in turn.

process_text function

The process_text function segments the text and cleans it up, finally returning a dictionary that maps each token to its count. Let's try it:

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'luopan': 1, 'zhangshan': 1}

text = '我叫罗攀，他叫张三，我叫罗攀'  # "My name is Luo Pan, his name is Zhang San, and my name is Luo Pan"

wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'我叫罗攀': 2, '他叫张三': 1}

This shows that process_text cannot properly segment Chinese. Setting aside how it cleans the tokens, let's focus on how it splits the text.

def process_text(self, text):
    """ Segments the text and preprocesses it
    """

    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
             else 0)
    pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
    regexp = self.regexp if self.regexp is not None else pattern

    words = re.findall(regexp, text, flags)

The key is the regular expression used for tokenization: \w[\w']+. Anyone familiar with regular expressions will recognize that \w[\w']+ matches runs of two or more word characters, i.e. letters, digits, and underscores; in Python regular expressions, \w also matches Chinese characters.

As a result, Chinese text is not split into words at all: it is only cut at punctuation and whitespace, which does not match how Chinese segmentation works. English text, by contrast, already separates words with spaces, so English tokens come out cleanly.
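You can check this directly with the standard library; a quick sketch using the default pattern shown above:

```python
import re

pattern = r"\w[\w']+"  # wordcloud's default token pattern

# English is split at spaces and punctuation:
print(re.findall(pattern, "my is luopan. he is zhangshan"))
# ['my', 'is', 'luopan', 'he', 'is', 'zhangshan']

# Chinese is cut only at the comma, whole clauses stay together:
print(re.findall(pattern, "我叫罗攀，他叫张三"))
# ['我叫罗攀', '他叫张三']
```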

In summary, wordcloud by itself is a word-cloud tool for English text. To build a word cloud from Chinese text, you must segment the Chinese first.

generate_from_frequencies function

Finally, a brief look at this function, which normalizes the word frequencies and creates the drawing object.

Most of its code handles drawing, which is not today's focus; we only need to understand what data a word cloud needs. Below is the word-frequency normalization part, which should be easy to follow.

from operator import itemgetter

def generate_from_frequencies(frequencies):
    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))

    max_frequency = float(frequencies[0][1])

    frequencies = [(word, freq/max_frequency)
                   for word, freq in frequencies]
    return frequencies

test = generate_from_frequencies({'我叫罗攀': 2, '他叫张三': 1})
test

# [('我叫罗攀', 1.0), ('他叫张三', 0.5)]

The correct way to make a word cloud from Chinese text

We first segment the text with jieba, then join the tokens with spaces, so that process_text can return a dictionary with the correct token counts.

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import jieba

text = '我叫罗攀，他叫张三，我叫罗攀'  # "My name is Luo Pan, his name is Zhang San, and my name is Luo Pan"
cut_word = " ".join(jieba.cut(text))  # join with spaces so process_text can split the tokens

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate(cut_word)

plt.imshow(wc)

Of course, if you already have a dictionary of word counts, you don't need the generate function at all; just call generate_from_frequencies directly.

text = {
    '罗攀': 2,  # Luo Pan
    '张三': 1   # Zhang San
}

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate_from_frequencies(text)

plt.imshow(wc)

Summary

(1) As the analysis of the process_text function shows, wordcloud itself is a third-party library for making word clouds from English text.

(2) To make a Chinese word cloud, first split the Chinese text with a segmentation library such as jieba.

Finally, the Chinese word cloud above is still not the ideal result: for example, we may want to hide words that shouldn't be displayed, and make the cloud look nicer. I'll cover that in the next issue~

Reference: https://cloud.tencent.com/developer/article/1784989 (Do you really know how to make word cloud images with wordcloud? - Cloud+ Community - Tencent Cloud)