Table of contents
- About the Dataset:
- Dealing with Null values in the Country column:
- Dealing with Null values in the Rating Column:
- Dropping Columns:
- Replacing Values in the Rating column:
- Removing Punctuation from Columns:
- Creating separate data frames for movies and TV shows:
- Creating the Time column in the Movies data frame:
- Creating the Seasons column in the TV shows data frame:
- Analysing the Type of Content available on Amazon Prime:
- Analyzing the Release Year of Content:
- Analysing the Genres on Amazon Prime:
- Analysing the Relationship between the Duration and Release Year of Movies:
- Analysing the Relationship between Seasons and Release Year of TV Shows:
- Importing WordCloud and setting Stopwords:
- Analysing the Description of Movies:
- Analysing the description of TV Shows:
- Analysing the titles of Movies:
- Analyzing the titles of TV Shows:
- Analysing the Age Rating of Movies:
- Analysing the Age Rating of TV Shows:
All of us are guilty of binging on movies and TV shows endlessly. One such platform that provides this type of entertainment is Amazon Prime. It offers content in a wide variety of genres, with close to 10000 movies and TV shows as of mid-2021.
In this blog post, I have analyzed the content available on Amazon Prime based on the type, cast, director and many more such intriguing parameters using Matplotlib, Seaborn and WordCloud in Python.
About the Dataset:
I have used the following dataset to carry out the analysis:
https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows
The dataset has 9668 rows and the following columns:
- show_id: identifier of the title
- type: type of show (either Movie or TV Show)
- title: title of the movie/TV show
- director
- cast
- country: country of production
- date_added: date when the movie/TV show was added
- release_year: release year of the movie/TV show
- rating: age rating of the movie/TV show
- duration: duration of the title (in seasons for TV shows and in minutes for movies)
- listed_in: the genre of the movie or TV show
- description: description of the movie or TV show
I have used Google Colab to carry out this analysis.
Importing the required libraries:
import pandas as pd
import io
import matplotlib.pyplot as plt
import seaborn as sns
Importing CSV file in Google Colab:
I have downloaded the dataset on my local computer. The following code snippet needs to be executed to upload the dataset:
from google.colab import files
uploaded=files.upload()
After choosing the file, the next snippet needs to be executed:
df = pd.read_csv(io.BytesIO(uploaded['amazon_prime_titles.csv']))
To check if data is successfully imported or not, the following line is executed:
df.head()
The output will be:
Checking for Null Values:
df.isnull().sum()
The output is:
The director, cast, country, date_added and rating columns contain null values which have to be dealt with accordingly to ensure that there are no obstacles in visualization.
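The raw null counts are easier to judge next to the share of rows they represent. The snippet below is a minimal sketch on a made-up toy data frame (the `sample` frame and its values are my own assumptions, not rows from the dataset); the same two lines can be pointed at `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", None, "United States", "United States"],
})

# Number of nulls per column, and the same figure as a percentage of all rows.
null_counts = sample.isnull().sum()
null_share = (sample.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary)
```

A column where a large share of rows is missing deserves more caution when its nulls are filled with a single value.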
Dealing with Null values in Director column:
The following snippet is executed to check the count of occurrences of various directors in the dataset:
df['director'].value_counts()
The output is:
"Mark Knight" appears more often than any other director, so the null values in the director column are replaced with "Mark Knight" in the following manner:
df['director']=df['director'].fillna('Mark Knight')
To check for the persistence of null values in the director column, the next snippet is run:
df.isnull().sum()
The output is:
Dealing with Null values in the Cast column:
The following snippet is executed to check the count of occurrences of various casts in the dataset:
df['cast'].value_counts()
The output is:
"Maggie Binkley" appears more often than any other value in the cast column, so all the null values in the cast column are replaced with "Maggie Binkley" in the following manner:
df['cast']=df['cast'].fillna("Maggie Binkley")
To check if the null values are still present in the cast column, the following code is run:
df.isnull().sum()
The output is:
No null values are present in the cast column now.
Dealing with Null values in the Country column:
The following code is run to check for the country occurring the most number of times in the dataset:
df['country'].value_counts()
The output is:
"United States" occurs more often than any other country, so the null values in the country column are replaced with this value in the manner given below:
df['country']=df['country'].fillna('United States')
To check if null values have been successfully dealt with in the country column, the next snippet is executed:
df.isnull().sum()
The output is:
Null values have been successfully dealt with in the Country column.
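The same pattern (look up the most frequent value, then fill the nulls with it) repeats for every column, so it can be wrapped in a small helper. This is only a sketch: `fill_with_mode` is a hypothetical name of my own, and the toy data frame is made up for illustration.

```python
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy in which nulls in each listed column are replaced
    by that column's most frequent value."""
    filled = frame.copy()
    for col in columns:
        modes = filled[col].mode(dropna=True)
        if not modes.empty:  # skip columns that contain only nulls
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

# Toy data, made up for illustration.
sample = pd.DataFrame({
    "country": ["United States", None, "India", "United States"],
    "rating": ["13+", "13+", None, "R"],
})

filled = fill_with_mode(sample, ["country", "rating"])
print(filled)
```

Filling with the most frequent value inflates that value's count in later charts. If that is a concern, a neutral placeholder such as "Unknown" may be a safer choice for columns like director and cast.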
Dealing with Null values in the Rating Column:
The following code is run to check the occurrences of each value in the Rating column:
df['rating'].value_counts()
The output is:
The "13+" rating occurs more often than any other, so the null values in the rating column are replaced with this value by running the following code:
df['rating']=df['rating'].fillna("13+")
To check if changes are reflected in the dataset or not, the following code is executed:
df.isnull().sum()
The result is:
Dropping Columns:
The show_id and date_added columns do not play a significant role in the analysis. Hence, these columns are dropped in the following manner:
df=df.drop(['date_added'],axis=1)
df=df.drop(['show_id'],axis=1)
Replacing Values in the Rating column:
When the following line of code is executed, it can be observed that there are many values such as "13+" and "PG-13" that are similar to each other:
df['rating'].value_counts()
The result is:
The following function is written to group similar ratings:
def rating_cleaner(text):
    if(text=="13+" or text=="PG-13"):
        return ("above13")
    if(text=="16+" or text=="16" or text=="AGES_16_"):
        return("above16")
    if(text=="18+" or text=="AGES_18_" or text=="R" or text=="NC-17" or text=="TV-MA"):
        return("above18")
    if(text=="7+" or text=="TV-Y7"):
        return("above7")
    if(text=="ALL_AGES" or text=="NOT_RATE" or text=="UNRATED" or text=="TV-G" or text=="G" or text=="TV-NR"):
        return("forall")
    if(text=="TV-PG" or text=="PG"):
        return("parentalguidance")
    # Any rating not listed above falls through and returns None
In order to apply the above function to the rating column the next snippet is run:
df['rating']=df['rating'].apply(rating_cleaner)
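The chain of if statements in rating_cleaner can also be written as a dictionary lookup. The sketch below uses the same grouping; `RATING_GROUPS` and the sample Series are names and values I made up, and "TV-14" is a hypothetical label outside the grouping, included only to show how unmapped ratings surface as NaN.

```python
import pandas as pd

# Same grouping as rating_cleaner, expressed as a lookup table.
RATING_GROUPS = {
    "13+": "above13", "PG-13": "above13",
    "16+": "above16", "16": "above16", "AGES_16_": "above16",
    "18+": "above18", "AGES_18_": "above18", "R": "above18",
    "NC-17": "above18", "TV-MA": "above18",
    "7+": "above7", "TV-Y7": "above7",
    "ALL_AGES": "forall", "NOT_RATE": "forall", "UNRATED": "forall",
    "TV-G": "forall", "G": "forall", "TV-NR": "forall",
    "TV-PG": "parentalguidance", "PG": "parentalguidance",
}

# Toy ratings; "TV-14" is a hypothetical label missing from the table.
ratings = pd.Series(["13+", "PG-13", "TV-MA", "G", "TV-14"])

grouped = ratings.map(RATING_GROUPS)          # unmapped labels become NaN
unmapped = ratings[grouped.isna()].unique()   # labels that still need a group
print(grouped.tolist())
print(unmapped)
```

Listing the unmapped labels makes it easy to spot ratings that would otherwise drop out of the later charts.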
To check if these changes have been reflected in the dataset or not, the following snippet is executed:
df['rating'].unique()
The output is:
Removing Punctuation from Columns:
To simplify the visualization of text, the following function is written to replace punctuation with spaces:
def remove_punctuation(text):
    # Raw string so that the backslash is kept as a literal character
    punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for ele in text:
        if(ele in punc):
            text=text.replace(ele," ")
    return(text)
The function is now applied to the columns in the following manner:
df['cast']=df['cast'].apply(remove_punctuation)
df['title']=df['title'].apply(remove_punctuation)
df['listed_in']=df["listed_in"].apply(remove_punctuation)
df['description']=df['description'].apply(remove_punctuation)
df['country']=df['country'].apply(remove_punctuation)
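As an alternative to the character-by-character loop, Python's built-in str.translate can replace all punctuation in a single pass. This is a sketch under the same punctuation set; `PUNCTUATION`, `TABLE` and `remove_punctuation_fast` are hypothetical names of my own.

```python
# Same punctuation set as in remove_punctuation.
PUNCTUATION = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# Translation table that maps every punctuation character to a space.
TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})

def remove_punctuation_fast(text):
    """Replace each punctuation character in text with a single space."""
    return str(text).translate(TABLE)

print(remove_punctuation_fast("Spider-Man: Homecoming"))
print(remove_punctuation_fast("Drama, Suspense"))
```

Note that a comma followed by a space turns into two spaces, which is harmless for the word clouds but matters if the strings are compared exactly.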
Creating separate data frames for movies and TV shows:
The following snippet is executed to create a data frame that consists only of the movies available on Amazon Prime:
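# .copy() creates an independent data frame, so new columns can be added later without a SettingWithCopyWarning
df_movies=df[df['type']=="Movie"].copy()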
The next snippet creates a data frame consisting of only TV Shows listed on Amazon Prime:
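# .copy() creates an independent data frame, so new columns can be added later without a SettingWithCopyWarning
df_tvshows=df[df['type']=="TV Show"].copy()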
Creating the Time column in the Movies data frame:
The following snippet is executed to create the Time column in df_movies from the Duration column:
df_movies[['time','min']]=df_movies['duration'].str.split(" ",expand=True)
df_movies now looks as follows:
In order to use the Time column in the visualizations later on, its type is changed to integer in the following manner:
df_movies=df_movies.astype({'time':'int'})
The "min" and "duration" columns are now dropped from df_movies since they aren't useful in further analysis:
df_movies=df_movies.drop(['duration','min'],axis=1)
Creating the Seasons column in the TV shows data frame:
In order to extract the number of seasons, the duration column is split into "seasons" and "remaining" columns in the manner given below:
df_tvshows[["seasons","remaining"]]=df_tvshows["duration"].str.split(" ",expand=True)
df_tvshows now looks as follows:
In order to use the seasons column later on, the following snippet is executed to change the type to integer:
df_tvshows=df_tvshows.astype({'seasons':'int'})
The "remaining" and "duration" columns are dropped since they play no role in the further analysis:
df_tvshows=df_tvshows.drop(['remaining','duration'],axis=1)
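The split, convert and drop steps used for both data frames can be condensed with a regular expression that pulls the leading number out of the duration text. The snippet below is a sketch on a toy data frame whose values are made up; it assumes every duration starts with digits, as in "113 min" or "2 Seasons".

```python
import pandas as pd

# Toy rows, made up for illustration.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["113 min", "2 Seasons", "90 min"],
})

# .copy() gives independent frames, so adding columns raises no warning.
movies = sample[sample["type"] == "Movie"].copy()
shows = sample[sample["type"] == "TV Show"].copy()

# Extract the first run of digits and convert it to an integer.
movies["time"] = movies["duration"].str.extract(r"(\d+)", expand=False).astype(int)
shows["seasons"] = shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)

print(movies[["duration", "time"]])
print(shows[["duration", "seasons"]])
```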
Analysing the Type of Content available on Amazon Prime:
To find the count of values taken by the "type", the following snippet is run:
df['type'].value_counts()
The output is:
In order to visualize this information, the next block of code is executed:
palette_color=sns.color_palette("Pastel1")
keys=["Movie","TV Show"]
data=[7814,1854]
plt.pie(data,labels=keys,colors=palette_color)
plt.title("Type of Content available on Amazon Prime Video")
The result is:
Movies make up roughly 81% of the content available on Amazon Prime (7814 of 9668 titles).
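The counts in the pie chart were typed in by hand from the value_counts() output. A possible variation, sketched below on made-up toy data, feeds the counts into the chart directly so the figure stays in sync with the data frame; the autopct argument prints the percentage on each slice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data, made up for illustration; with the real data this would be df["type"].
sample = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

# value_counts() returns the labels (index) and counts (values) together.
counts = sample["type"].value_counts()

fig, ax = plt.subplots()
ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%")
ax.set_title("Type of Content (toy data)")
```

The same approach should work for the genre and age rating charts, for example with value_counts().head(10).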
Analyzing the Release Year of Content:
sns.histplot(x='release_year',hue='type',data=df,palette="RdYlGn")
plt.title("Histogram of Releases of TV Shows/Movies Available on Amazon Prime")
The output is:
Movies and TV shows released in 2020-2021 constitute a major part of the content available on Amazon Prime.
Analysing the Genres on Amazon Prime:
To find the count of values taken by the "listed_in" column of "df", the following snippet is executed:
df['listed_in'].value_counts().head(5)
The output is:
To visualize this information, the next set of statements is executed:
palette_color=sns.color_palette("Set3")
keys=['Drama','Comedy','Drama+Suspense','Comedy+Drama','Animation+Kids']
data=[986,536,399,377,356]
plt.pie(data,labels=keys,colors=palette_color)
plt.title("Most Prominent Genres on Amazon Prime ")
The output is:
Among the TV shows and movies available on Amazon Prime, drama is the most common genre, followed by comedy.
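The counts above treat each combination such as "Drama+Suspense" as its own genre. To count every individual genre instead, the genre list can be split and exploded, as sketched below. The toy rows are made up, and the sketch assumes the genres are still separated by commas, which means it would have to run before the punctuation is removed from listed_in.

```python
import pandas as pd

# Toy rows, made up for illustration, with comma-separated genres.
sample = pd.DataFrame({
    "listed_in": ["Drama, Suspense", "Comedy, Drama", "Drama", "Comedy"],
})

genre_counts = (
    sample["listed_in"]
    .str.split(",")   # "Drama, Suspense" -> ["Drama", " Suspense"]
    .explode()        # one row per genre
    .str.strip()      # drop the space left after each comma
    .value_counts()
)
print(genre_counts)
```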
If the genres of movies available on Amazon Prime have to be analyzed separately, then the following steps can be performed:
To find the count of values taken by the "listed_in" column of "df_movies", the following snippet is executed:
df_movies['listed_in'].value_counts().head(10)
The output is:
To visualize this information, the next code snippet is executed:
palette_color=sns.color_palette("Accent")
keys=['Drama','Comedy','Drama+Suspense','Comedy+Drama','Documentary','Action+Drama','Horror','Kids','Action','Arts+Entertainment+Culture']
data=[870,442,349,338,300,277,253,226,216,215]
plt.pie(data,labels=keys,colors=palette_color)
plt.title("Most Prominent Genres in Movies")
The result is:
Among the movies available on Amazon Prime, drama and comedy are again the dominant genres.
If the genres of TV Shows available on Amazon Prime have to be analyzed separately, then the following steps can be performed:
To find the count of values taken by the "listed_in" column of "df_tvshows", the following snippet is executed:
df_tvshows['listed_in'].value_counts().head(10)
The result is:
To visualize this information, the following code is executed:
palette_color=sns.color_palette("PuOr")
keys=["TV shows","Animation+Kids","Drama","Documentary+Special Interest","Kids","Comedy","Documentary","Drama+Suspense","Special Interest","Comedy+Drama"]
data=[263,176,116,114,108,94,50,50,40,39]
plt.pie(data,labels=keys,colors=palette_color)
plt.title("Most Prominent Genres in TV Shows")
The output is:
The generic "TV shows" label, followed by Animation+Kids, is the most common genre tag for this type of content on Amazon Prime.
Analysing the Relationship between the Duration and Release Year of Movies:
sns.lineplot(y='time',x='release_year',data=df_movies)
The output is:
The average duration of movies listed on Amazon Prime has stabilized in recent years, ranging from roughly 90 to 110 minutes.
Analysing the Relationship between Seasons and Release Year of TV Shows:
sns.lineplot(y='seasons',x='release_year',data=df_tvshows)
The result is:
TV shows on Amazon Prime that were released between 1960 and 1980 have the highest number of seasons.
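When several rows share the same x value, sns.lineplot draws their mean by default, so each point on these line plots is a per-year average. The sketch below reproduces that aggregation with a groupby on made-up toy data and adds the number of titles behind each average, which helps in judging how reliable a peak is.

```python
import pandas as pd

# Toy rows, made up for illustration: one old show, several recent ones.
sample = pd.DataFrame({
    "release_year": [1975, 2020, 2020, 2020, 2020],
    "seasons": [12, 1, 2, 1, 4],
})

# Mean seasons per release year, plus how many shows the mean is based on.
yearly = sample.groupby("release_year")["seasons"].agg(["mean", "count"])
print(yearly)
```

A high average that rests on only one or two titles, as for 1975 in the toy data, may say more about those particular shows than about the period as a whole.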
Importing WordCloud and setting Stopwords:
In order to gain a deeper insight into the columns of description, cast, director, genre and Movie/TV show titles, the wordcloud library is used and the stopwords are set in the following manner:
from wordcloud import WordCloud,STOPWORDS
stopwords=set(STOPWORDS)
Analysing the Description of Movies:
comment_words=" "
for val in df_movies.description:
    val=str(val)
    tokens=str.split(val)
    for i in range(0,len(tokens)):
        tokens[i]=tokens[i].lower()
    comment_words=comment_words+" ".join(tokens)+" "
wordcloud=WordCloud(stopwords=stopwords,width=1200,height=800,background_color="black",colormap="Set3",collocations=False).generate(comment_words)
plt.imshow(wordcloud)
plt.axis("off")
The result is:
Love, life, young, one and find are the most used words for describing movies available on Amazon Prime.
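A word cloud shows relative frequency only through font size. To back up a statement about the most used words with actual numbers, the words can be counted directly. The sketch below uses only the standard library; `top_words` is a hypothetical helper of my own, and the toy descriptions and the tiny stopword set are made up (in the notebook, the stopwords set built from wordcloud's STOPWORDS could be passed instead).

```python
from collections import Counter

def top_words(texts, stopwords, n=5):
    """Return the n most frequent lowercase words across texts, skipping stopwords."""
    counter = Counter()
    for text in texts:
        for token in str(text).lower().split():
            if token not in stopwords:
                counter[token] += 1
    return counter.most_common(n)

# Toy descriptions and stopwords, made up for illustration.
descriptions = [
    "A young woman finds love",
    "Love and life in a small town",
    "One life one chance",
]
toy_stopwords = {"a", "and", "in"}

print(top_words(descriptions, toy_stopwords, n=3))
```

With the real data, a call such as top_words(df_movies.description, stopwords, n=10) would list the counts behind the largest words in the cloud.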
Analysing the description of TV Shows:
comment_words=" "
for val in df_tvshows.description:
    val=str(val)
    tokens=str.split(val)
    for i in range(0,len(tokens)):
        tokens[i]=tokens[i].lower()
    comment_words=comment_words+" ".join(tokens)+" "
wordcloud=WordCloud(stopwords=stopwords,background_color="black",colormap="Pastel2",width=1500,height=1000,collocations=False).generate(comment_words)
plt.imshow(wordcloud)
plt.axis("off")
The result is:
Life, series, world, and new are the most used words in the description of TV shows available on Amazon Prime.
Analysing the titles of Movies:
comment_words=" "
for val in df_movies.title:
    val=str(val)
    tokens=str.split(val)
    for i in range(0,len(tokens)):
        tokens[i]=tokens[i].lower()
    comment_words=comment_words+" ".join(tokens)+" "
wordcloud=WordCloud(stopwords=stopwords,background_color="black",colormap="Pastel2",width=1000,height=800,collocations=False).generate(comment_words)
plt.imshow(wordcloud)
plt.axis("off")
The output is:
Kid, love and little are the most used words in the movie titles available on Amazon Prime.
Analyzing the titles of TV Shows:
comment_words=" "
for val in df_tvshows.title:
    val=str(val)
    tokens=str.split(val)
    for i in range(0,len(tokens)):
        tokens[i]=tokens[i].lower()
    comment_words=comment_words+" ".join(tokens)+" "
wordcloud=WordCloud(stopwords=stopwords,background_color="black",colormap="Set2",width=1000,height=800,collocations=False).generate(comment_words)
plt.imshow(wordcloud)
plt.axis("off")
The output is:
Serie, love, world and pink are the most commonly used words in the titles of TV shows available on Amazon Prime.
Analysing the Age Rating of Movies:
In order to find the count of values taken by the rating column of "df_movies", the following snippet is executed:
df_movies['rating'].value_counts()
The output is:
To visualize this information, the next code cell is executed:
keys=['above 13','above 18','above 16','above 7','parental guidance','for all']
data=[2573,2113,1275,288,253,130]
explode=6*[0.05]
palette_color=sns.color_palette("mako")
plt.pie(data,labels=keys,explode=explode,colors=palette_color)
plt.title("Trend of Age Rating in Movies")
The output is:
The majority of the movies available on Amazon Prime are listed in either the above 13 or the above 18 category.
Analysing the Age Rating of TV Shows:
In order to find the count of values of the rating column present in "df_tvshows", the next code snippet is executed:
df_tvshows.rating.value_counts()
The output is:
In order to visualize this information, the next set of code is executed:
keys=['above 16','above 13','above 18','for all','parental guidance','above 7']
data=[275,274,223,186,169,136]
explode=6*[0.05]
palette_color=sns.color_palette("RdPu")
plt.pie(data,labels=keys,explode=explode,colors=palette_color)
plt.title("Trend of Age Rating in TV Shows")
The result is:
The age ratings of the TV shows available on Amazon Prime are spread fairly evenly, with each category holding between 136 and 275 titles.
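To put numbers on how even the distribution is, the counts used in the chart can be converted to percentages. The snippet below is a small sketch that reuses the keys and data lists from the TV shows chart; `total` and `shares` are names I introduced.

```python
# Counts taken from the TV shows age rating chart above.
keys = ['above 16', 'above 13', 'above 18', 'for all', 'parental guidance', 'above 7']
data = [275, 274, 223, 186, 169, 136]

total = sum(data)

# Percentage share of each age rating category, rounded to one decimal.
shares = {key: round(100 * value / total, 1) for key, value in zip(keys, data)}

for key, share in shares.items():
    print(f"{key}: {share}%")
print("total titles with a grouped rating:", total)
```

The total here is 1263, which is lower than the 1854 TV shows counted earlier. This suggests that some ratings were not covered by rating_cleaner, were mapped to None, and were therefore left out by value_counts().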