TDM 10200: Project 14 — Spring 2023

Motivation: As the last project of the semester, let's take a moment to do something for those who have a more creative, artsy side. Data is fun, and data visualizations are a great way to express our creativity. In the last project we made a word cloud. In this project we will make another word cloud, but we will change the way it is displayed.

Scope: python, nltk, wordcloud, matplotlib.pyplot, numpy, PIL

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

When launching Jupyter Notebook on Anvil, you will need to use 4 cores.

The following questions will use the following dataset(s):

/anvil/projects/tdm/data/icecream

/anvil/projects/tdm/data/icecream/icecream.png

/anvil/projects/tdm/data/icecream/bj/products.csv

/anvil/projects/tdm/data/icecream/combined/reviews.csv

ONE

We are going to look at Ben and Jerry’s product information. There are two CSV files:

combined/reviews.csv
bj/products.csv

Take a look at the head of both of the dataframes, to get familiar with the data in these two files. Then consider:

a. What are the column names for combined/reviews.csv? What are the column names for bj/products.csv?
b. What column do they have in common?
c. Merge these two dataframes on the common column (do not merge on a column that has NaN), and save the result of the merge as a new dataframe.
d. Find a second way to merge the dataframes.

Helpful Hint

pd.merge(df1, df2, on='column')
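
As an illustration of the hint, here is a minimal sketch of three ways to combine two dataframes on a shared column. The toy dataframes and the column names `key`, `stars`, and `name` are assumptions made up for this example; they are not taken from the project files, so substitute whatever common column you find.

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical columns, not the real data)
reviews = pd.DataFrame({"key": ["a", "a", "b"], "stars": [5, 3, 4]})
products = pd.DataFrame({"key": ["a", "b"], "name": ["Vanilla", "Mint"]})

# Way 1: the top-level function from the hint
merged_fn = pd.merge(reviews, products, on="key")

# Way 2: the equivalent DataFrame method
merged_method = reviews.merge(products, on="key", how="inner")

# Way 3: join against the other dataframe's index
merged_join = reviews.join(products.set_index("key"), on="key")

print(merged_fn)
```

All three results have one row per review, with the matching product columns attached. Note that `join` performs a left join by default, while `merge` performs an inner join, so the results can differ when some keys have no match.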

Items to submit

Code used to solve this problem.
Output from running the code.
Answers to questions a, b, c, d

TWO

Now that we have merged the two dataframes, let’s take a look at the columns again.

a. What do you notice about the column ingredients that was originally in the products dataframe for the bj data? Why do you think this happened?
b. What happens if you try to merge the dataframes on the column ingredients, and why?
c. What should we do instead, if we want to merge on ingredients?
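
When exploring these questions, it may help to see how pandas treats a column name that exists in both inputs but is not used as the merge key. The sketch below uses made-up toy data with hypothetical columns `key` and `notes`; it only demonstrates the general pandas behavior and is not the answer for the project data.

```python
import pandas as pd

# Both toy dataframes carry a 'notes' column with different meanings
left = pd.DataFrame({"key": ["a", "b"], "notes": [1.0, 2.0]})
right = pd.DataFrame({"key": ["a", "b"], "notes": ["milk, sugar", "cream, cocoa"]})

# By default pandas keeps both columns and appends _x and _y
default = pd.merge(left, right, on="key")
print(list(default.columns))  # ['key', 'notes_x', 'notes_y']

# The suffixes argument gives the overlapping columns clearer names
labeled = pd.merge(left, right, on="key", suffixes=("_review", "_product"))
print(list(labeled.columns))  # ['key', 'notes_review', 'notes_product']
```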

Items to submit

Code used to solve this problem.
Output from running the code.
Answers to questions a, b, c

THREE

Let’s create a word cloud with the ingredients in all of the Ben and Jerry’s ice cream. Remove all the stop words. Afterwards, we want to focus on the words that appear most frequently. Go ahead and play with the parameters of the WordCloud function. We want you to get comfortable with WordCloud.

Insider Information

max_font_size: The maximum font size for the largest word. If None, the image height is used.
max_words: The maximum number of words to display. The default is 200.
background_color: The background color of the word cloud image. The default is black.
colormap: A Matplotlib colormap from which the color of each word is drawn. Matplotlib colormaps provide awesome colors.
width/height: The dimensions of the canvas in pixels, for example a width of 3000 and a height of 2000.
random_state: A seed for the random number generator, set as an int value, so that the same layout and colors are reproduced on every run.

Helpful Hint

import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np
from wordcloud import STOPWORDS

import nltk
from nltk.probability import FreqDist

Items to submit

Code used to solve this problem.
Output from running the code.
Answer to the question

FOUR

Now for the best part, let’s create a custom shape. The WordCloud function has an argument called mask that lets it take a suitable image and use it as the outline of the word cloud we created. In the dataset there is an image named icecream.png. This image meets the requirement of having a background that is completely white (the color code is #ffffff). Go ahead and create the word cloud! What do you see?

Helpful Hint

Figure 1. colorful word cloud in the shape of an ice-cream cone

Insider Information

mask: Specifies the shape of the word cloud image. By default, the shape is a rectangle.
contour_width: The width of the outline drawn around the mask shape. The default is 0, which draws no outline.
contour_color: The color of the outline drawn around the mask shape. The default is black.

Helpful Hint

# Load the image mask
icecream_mask = np.array(Image.open('path'))

# Extract the text to use for the word cloud
text = " ".join(str(each) for each in df.columnname)

# Create a WordCloud object with the mask
wordcloud = WordCloud(max_words=200, colormap='Set1', background_color="white", mask=icecream_mask).generate(text)

# Display the word cloud, which takes the shape of the mask
fig, ax = plt.subplots(figsize=(8, 6))
ax.imshow(wordcloud, interpolation="bilinear")
ax.axis('off')

plt.show()

Items to submit

Code used to solve this problem.
Output from running the code.
Answer to the questions

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.