<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ryan's Ramblings]]></title><description><![CDATA[Ryan's Ramblings]]></description><link>https://anastasiadis.us</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 07:41:15 GMT</lastBuildDate><atom:link href="https://anastasiadis.us/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Cooking Up a Decision Tree: Indian Food-set]]></title><description><![CDATA[I love food
Probably a bit too much. Honestly, it's the main reason I travel. When I was scrolling through Kaggle and found a list of  recipes for Indian cuisine, I knew I had to do something with them. Enter another topic I've recently covered in my...]]></description><link>https://anastasiadis.us/cooking-up-a-decision-tree-indian-food-set</link><guid isPermaLink="true">https://anastasiadis.us/cooking-up-a-decision-tree-indian-food-set</guid><category><![CDATA[Python]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[creativity]]></category><dc:creator><![CDATA[Ryan Anastasiadis]]></dc:creator><pubDate>Sat, 31 Oct 2020 02:11:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1604110195631/JXjxDtxIC.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="i-love-food">I love food</h3>
<p>Probably a bit too much. Honestly, it's the main reason I travel. When I was scrolling through Kaggle and found a list of <a target="_blank" href="https://www.kaggle.com/nehaprabhavalkar/indian-food-101">recipes for Indian cuisine</a>, I knew I had to do something with them. Enter another topic I've recently covered in my courses: data modeling. Now, quick disclaimer: I love cooking, but Indian food is not my forte, so this is diving into the proverbial deep end for me, subject-area-wise! Jumping in, let's see what the data looks like by loading it into a DataFrame:</p>
<pre><code><span class="hljs-keyword">import</span> csv
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
file = <span class="hljs-string">r'\indian_food.csv'</span>
directory = <span class="hljs-string">r'C:\PATH\I like food'</span>
df = pd.read_csv(directory+file,encoding=<span class="hljs-string">'latin-1'</span>)

df.head()
</code></pre><p><code>out:</code> 
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603839624913/B5g8ctLiv.png" alt="image.png" /></p>
<h3 id="looking-at-the-data">Looking at the Data</h3>
<p>Ignoring my subject matter ignorance, I thought it could be fun to see if I could put together a decision tree to predict which area of India each recipe came from using prep methods and/or ingredients! After all, spices and food ingredients can tell a lot about a culture. And...looking at the data, we've definitely got those criteria as well as a region name! Let's see if we've got enough instances of each state to actually perform this analysis. Using the Seaborn displot, we can get a good idea of the different 'buckets' in each variable:</p>
<pre><code><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, <span class="hljs-number">7</span>):
    sns.displot(data=df,x=df.<span class="hljs-keyword">columns</span>[i] ).<span class="hljs-keyword">set</span>(xticklabels=[])
</code></pre><p>This will return a chart for each variable in the data, but let's look at state specifically.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603841241638/8pzj13kD6.png" alt="image.png" /></p>
<p>It looks like we've got a few states that only have a few entries. We'll deal with those later. For now, there do look to be some promising buckets of data! However, one issue jumps out: our data is almost completely categorical. Since I was planning on using the scikit-learn decision tree function, this is a problem; most scikit-learn estimators expect numeric features, so categorical data has to be encoded before we can use it. </p>
<p>The easiest way to break categorical data into something we can utilize in model building is via 'One-Hot encoding' (<a target="_blank" href="https://www.kaggle.com/getting-started/27270">read more here</a>!). Basically, One-Hot encoding is used when a categorical field doesn't have any inherent order to its categories. This makes the most sense when dealing with locations, objects, or say...ingredients! Let's go ahead and use <code>pd.get_dummies</code>:</p>
<pre><code><span class="hljs-selector-tag">pd</span><span class="hljs-selector-class">.get_dummies</span>(df[<span class="hljs-string">'ingredients'</span>], prefix=<span class="hljs-string">'ingredients'</span>)
</code></pre><p><code>out</code> 
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603845194599/xqxTBq7Pt.png" alt="image.png" /></p>
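<p>To see why we got one column per whole recipe, here's a toy sketch (made-up values, not the real dataset) of how <code>pd.get_dummies</code> treats each distinct cell value as a single category:</p>

```python
import pandas as pd

# Toy version of what happened: get_dummies treats each distinct cell
# value as one category, so a whole comma-separated recipe string
# becomes a single dummy column.
toy = pd.Series(['rice, ghee', 'ghee, jaggery', 'rice, ghee'])
encoded = pd.get_dummies(toy, prefix='ingredients')

# One column per *string*, not per ingredient.
print(list(encoded.columns))  # ['ingredients_ghee, jaggery', 'ingredients_rice, ghee']
```

So before encoding, we need the individual ingredients broken out of each string.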
<p>...Well, it looks like we've only broken out each exact recipe into a variable and not the actual items in each recipe. Let's rethink our approach here, starting with the ingredients column. If we call <code>df['ingredients']</code>, we can see that each entry is currently a single comma-separated string. Let's work some magic to break out the individual items:</p>
<pre><code>df[<span class="hljs-string">'ingredients'</span>] = <span class="hljs-keyword">list</span>(df[<span class="hljs-string">'ingredients'</span>].str.split(<span class="hljs-string">','</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603845839037/77_QmWkq6.png" alt="image.png" /></p>
<p>Now, we have our ingredient list in the proper format...but we still can't use pd.get_dummies. Initially, I thought about using <code>df.iterrows()</code> in combination with a set of all of the different ingredients that was created like so:</p>
<pre><code>##I didn't end up going this route
tot_Ingredients = []
for _, row in df.iterrows():
    for item in row['ingredients'].split(','):
        tot_Ingredients.append(item.strip())
</code></pre><p>However, after some google-fu, I found a much more elegant method using scikit-learn and pandas' built-in functions:</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=<span class="hljs-keyword">True</span>)

df = df.<span class="hljs-keyword">join</span>(
            pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(df.pop(<span class="hljs-string">'ingredients'</span>)),
                <span class="hljs-keyword">index</span>=df.<span class="hljs-keyword">index</span>,
                <span class="hljs-keyword">columns</span>=mlb.classes_))
</code></pre><p><code>df.head()</code> now gets us:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603846858102/mzr-hTWFe.png" alt="image.png" /></p>
<p>Well, that's an awesome time saver, but what the heck does the copy-and-pasted code do? Googling something isn't really worthwhile if you're not going to bother learning it after all is said and done. </p>
<p>Basically what this neat little code stack is doing is this:</p>
<ul>
<li><p>We're initializing mlb as a  <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer from scikit-learn</a> that breaks an 'iterable' (our ingredients list) into a set of labeled bins. (we use the sparse arg to save some memory if we have a larger df). </p>
</li>
<li><p>We then create a dataframe from a sparse matrix that is then joined back to the original df via join and pop. </p>
</li>
</ul>
<p>This has the end result of binning each ingredient into its own column and adding it to the base df. </p>
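<p>Here's a tiny self-contained illustration of what the binarizer produces (made-up ingredient lists, and without the sparse output for readability):</p>

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up ingredient lists, just to show the shape of the output.
recipes = [['rice', 'ghee'], ['ghee', 'jaggery']]

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(recipes)

# One column per distinct ingredient, sorted alphabetically...
print(list(mlb.classes_))  # ['ghee', 'jaggery', 'rice']
# ...and one 0/1 row per recipe.
print(matrix.tolist())     # [[1, 0, 1], [1, 1, 0]]
```

Joining that matrix back onto the df (as in the snippet above) is what gives each ingredient its own column.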
<h3 id="moving-on-from-data-cleaning">Moving on from Data Cleaning</h3>
<p>Now that the data is in a useable format (more on this later), we can begin training a model on it. Let's remind ourselves how many different states we're working with:</p>
<pre><code><span class="hljs-attribute">data1</span> = df
data1.groupby(<span class="hljs-string">'state'</span>).size()
</code></pre><p>We've got a number of states there that have very few recipes listed...let's drop any state with, say, five or fewer recipes:</p>
<pre><code>data1 = data1.groupby(<span class="hljs-string">'state'</span>).filter(<span class="hljs-keyword">lambda</span> x : len(x)&gt;<span class="hljs-number">5</span>)
cn = set(data1[<span class="hljs-string">'state'</span>]) <span class="hljs-comment">##Putting the state names in a set for later graphs</span>
</code></pre><p>Let's import the required module magic and create a train and test split:</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> pandas.plotting <span class="hljs-keyword">import</span> parallel_coordinates
<span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeClassifier, plot_tree
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> metrics
<span class="hljs-keyword">from</span> sklearn.naive_bayes <span class="hljs-keyword">import</span> GaussianNB
<span class="hljs-keyword">from</span> sklearn.discriminant_analysis <span class="hljs-keyword">import</span> LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
<span class="hljs-keyword">from</span> sklearn.neighbors <span class="hljs-keyword">import</span> KNeighborsClassifier
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression
data1 = data1.loc[data1[<span class="hljs-string">'state'</span>] != str(<span class="hljs-number">-1</span>)] <span class="hljs-comment">#Let's remove that -1 state value</span>
train, test = train_test_split(data1, test_size = <span class="hljs-number">0.6</span>, stratify = data1[<span class="hljs-string">'state'</span>], random_state = <span class="hljs-number">42</span>)
</code></pre><p>By calling train_test_split on our data and providing a test size, we've broken our data into a training set and a test (or validation) set so we can gauge the accuracy of our predictions on data our model hasn't already seen. We stratify on the state column to ensure each state [the class label] is represented proportionally in both sets.</p>
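<p>As a quick illustration of what stratifying buys us, here's a toy split (made-up labels, not the recipe data):</p>

```python
from sklearn.model_selection import train_test_split

# Toy labels: 8 of class 'a', 4 of class 'b'.
labels = ['a'] * 8 + ['b'] * 4
data = list(range(12))

# Stratifying keeps the 2:1 class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    data, labels, test_size=0.5, stratify=labels, random_state=42)

print(sorted(y_tr))  # ['a', 'a', 'a', 'a', 'b', 'b']
```

Without <code>stratify</code>, a rare state could easily end up entirely in one half of the split.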
<p>Let's remove the variables other than the ingredients list - there's some good stuff there, but I don't think that something like cook time will be useful. We're removing state as well, but only because it's our prediction target. </p>
<pre><code>features = list(data1.columns)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'state'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'region'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'name'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'course'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'flavor_profile'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'diet'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'prep_time'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'cook_time'</span>)
</code></pre><p>Now that we have our list of features, we can split out the features and labels and run the following:</p>
<pre><code>X_train = train[features] <span class="hljs-comment">#Training features</span>
y_train = train.state <span class="hljs-comment">#Training labels</span>
X_test = test[features] <span class="hljs-comment">#Testing features</span>
y_test = test.state <span class="hljs-comment">#Testing labels</span>
mod_dt = DecisionTreeClassifier(max_depth = 9, random_state = 1) <span class="hljs-comment">#A decision tree that we've told to go </span>
<span class="hljs-comment">#9 levels deep at max</span>
mod_dt.fit(X_train,y_train) <span class="hljs-comment">#training the model! This is the exciting part!</span>
prediction=mod_dt.predict(X_test) <span class="hljs-comment">#applying the model to the testing data</span>
print('The accuracy of the Decision Tree is','{:.3f}'.format(metrics.accuracy_score(prediction,y_test))) <span class="hljs-comment">#Let's see how accurate it is!</span>
</code></pre><p>We get an accuracy rating of ~34%...which, while not great, is better than I initially expected from this exercise. Let's take a look at the confusion matrix to better understand what's going on under the hood:</p>
<pre><code>cn = sorted(set(data1[<span class="hljs-string">'state'</span>])) <span class="hljs-comment">#sorted so the labels line up with mod_dt.classes_</span>
<span class="hljs-attr">disp</span> = metrics.plot_confusion_matrix(mod_dt, X_test, y_test,
                                 <span class="hljs-attr">display_labels</span>=cn,
                                 <span class="hljs-attr">cmap</span>=plt.cm.Blues,
                                 <span class="hljs-attr">normalize</span>=None)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603851254148/L1gCGOcav.png" alt="image.png" /></p>
<p>Ignoring the clutter on the x axis, we can see that we're misclassifying a lot of categories into the West Bengal state. Maybe there's some ingredient we're relying on that we can pull out in the actual decision tree!</p>
<pre><code><span class="hljs-function"><span class="hljs-keyword">fn</span> = <span class="hljs-title">list</span>(<span class="hljs-params">features</span>)

<span class="hljs-title">fig</span> = <span class="hljs-title">plt</span>.<span class="hljs-title">figure</span>(<span class="hljs-params">figsize=(<span class="hljs-params"><span class="hljs-number">25</span>,<span class="hljs-number">20</span></span>)</span>)
<span class="hljs-title">plot_tree</span>(<span class="hljs-params">mod_dt, feature_names = fn, class_names = cn, filled = <span class="hljs-literal">True</span></span>)
<span class="hljs-title">plt</span>.<span class="hljs-title">savefig</span>(<span class="hljs-params"><span class="hljs-string">'foo.pdf'</span></span>)</span>
</code></pre><p>This gets us a very large image, which is why we've also saved it to 'foo.pdf'. Cracking that open, we can see that at least on the first few splits there aren't any oddities. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603851758381/2Ba1b7vQc.png" alt="image.png" /></p>
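<p>Another way to hunt for an over-weighted ingredient, instead of eyeballing the tree, is to rank the tree's feature importances. A small self-contained sketch on made-up data (with the real model, you'd pass <code>mod_dt</code> and <code>features</code> instead):</p>

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real X_train/y_train: two binary 'ingredient'
# columns and a label that depends only on the first one.
X = pd.DataFrame({'fish': [1, 1, 0, 0], 'rice': [1, 0, 1, 0]})
y = ['East', 'East', 'West', 'West']

tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# feature_importances_ shows which columns the splits actually rely on.
ranked = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked.head())  # 'fish' carries all the importance here
```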
<p>Let's rethink this a bit; maybe I was overzealous in breaking it down by state...let's try again with region!</p>
<pre><code>data1 =data1.loc[data1[<span class="hljs-string">'name'</span>] != <span class="hljs-string">'Panjeeri'</span> ] #cutting <span class="hljs-keyword">out</span> an odd <span class="hljs-keyword">null</span> <span class="hljs-keyword">value</span>
data1[<span class="hljs-string">'region'</span>].<span class="hljs-keyword">isnull</span>().<span class="hljs-keyword">values</span>.<span class="hljs-keyword">any</span>() ##Checks <span class="hljs-keyword">for</span> other nulls


##region
train, test = train_test_split(data1, test_size = <span class="hljs-number">0.4</span>, stratify = data1[<span class="hljs-string">'region'</span>], random_state = <span class="hljs-number">42</span>)
X_train = train[features]
y_train = train.region
X_test = test[features]
y_test = test.region
mod_dt = DecisionTreeClassifier(max_depth =<span class="hljs-number">5</span>, random_state = <span class="hljs-number">1</span>)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)
print(<span class="hljs-string">'The accuracy of the Decision Tree is'</span>,<span class="hljs-string">'{:.3f}'</span>.format(metrics.accuracy_score(prediction,y_test)))
</code></pre><p>Well 50% ain't too bad for a half-joking dataset! Let's look at the decision tree and confusion matrix:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603922888544/6pmwOar-F.png" alt="image.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603922907651/lwnAlRSXE.png" alt="image.png" /></p>
<p>It doesn't look too bad! Seems like we're still misclassifying a large amount into the 'East' region, which is where our previous misclassifications landed as well. </p>
<h3 id="what-we-can-do-to-improve-the-model">What We Can do to Improve the Model</h3>
<p>Overall, this model came out better than anticipated using just the ingredients lists of the recipes, but it could be improved. Here are a few ways we could improve it:</p>
<ul>
<li>There are some instances of a single ingredient having multiple entries...Garam Masala vs Garam Masala Powder. This has the effect of 'lessening' the impact of the actual single ingredient and causing it to split more than once. We could clean the data a bit more to improve it.</li>
<li>We could bring in the flavor profile and other variables we did not encode.</li>
<li>Clustering! We could have clustered similar recipes into another variable and used that in our model to see if we could extrapolate on the ingredients for a better trend!</li>
<li>Bringing in more fields. With such an odd dataset it may be difficult, but something like 'dinner/lunch/breakfast' and amounts of each ingredient might have helped here.</li>
</ul>
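<p>For that first point, even a crude normalization pass would merge some of the duplicate ingredient names. A hypothetical sketch (a real cleanup would want a curated synonym list):</p>

```python
# Hypothetical sketch: collapse trivially different ingredient names.
def normalize(ingredient):
    name = ingredient.strip().lower()
    # Treat 'x powder' as the same ingredient as 'x'.
    if name.endswith(' powder'):
        name = name[:-len(' powder')]
    return name

raw = ['Garam Masala', ' garam masala powder', 'Ghee']
print(sorted({normalize(i) for i in raw}))  # ['garam masala', 'ghee']
```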
<p>Overall this was a good dataset to get used to One-Hot encoding and data cleaning. Thanks for reading!</p>
]]></content:encoded></item><item><title><![CDATA[Tooling Around With Text Sentiment: Trump Town Halls]]></title><description><![CDATA[Hello World! This is my first article here

Recently I finished up my masters classes in both Python and Data Mining.
And while I enjoy my newly reclaimed time on Mondays, Thursdays and god-knows what days I did homework - I wanted to keep my skills ...]]></description><link>https://anastasiadis.us/tooling-around-with-text-sentiment-trump-town-halls</link><guid isPermaLink="true">https://anastasiadis.us/tooling-around-with-text-sentiment-trump-town-halls</guid><category><![CDATA[Python]]></category><category><![CDATA[#data visualisation]]></category><dc:creator><![CDATA[Ryan Anastasiadis]]></dc:creator><pubDate>Thu, 22 Oct 2020 19:07:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1603392885733/qRMuPJ1o8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Hello World! This is my first article here</p>
</blockquote>
<h3 id="recently-i-finished-up-my-masters-classes-in-both-python-and-data-mining">Recently I finished up my master's classes in both Python and Data Mining.</h3>
<p>And while I enjoy my newly reclaimed time on Mondays, Thursdays and god-knows what days I did homework - I wanted to keep my skills fresh and start tinkering. </p>
<p>In class we had worked with basic sentiment analysis from the RNC/DNC as well as some of the townhalls to plot out scatterplots of sentiment along with polarity:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603385637540/q6WqT1L86.png" alt="10/15 Dueling Townhalls" /></p>
<p>And while I had completed the assignment as specified, it got me thinking about the differences in speech patterns and how varying modules derive meaning from them. I mean, after all, there's a certain amount of context encoded in normal speech and text that humans innately understand. Anyone who has worked with freeform text is familiar with this problem; it's the basis of the entire NLP field! </p>
<p>So with this in mind, and returning to the scatterplot I had turned in for class, I noticed that this visual, while interesting, did a bad job of a few things. Namely, it does a bad job of calling out trends! Every dot on the chart is a sentence plotted at a coordinate of polarity (good/bad context) and subjectivity (fact/hearsay context). What happens, however, when there are multiple statements in the same position on the graph? We lose those points in favor of whichever one was plotted over them. There's a level of context missing here, one that becomes even more complex to capture when you consider the problems inherent in understanding different forms of speech.</p>
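<p>We can actually quantify that loss by counting how many sentences share the exact same (polarity, subjectivity) coordinate. A sketch on made-up scores:</p>

```python
from collections import Counter

# Made-up (polarity, subjectivity) pairs standing in for real sentences.
points = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.5, 0.6), (0.5, 0.6)]

counts = Counter(points)
# A scatter plot draws each coordinate once, hiding the duplicates.
hidden = sum(n - 1 for n in counts.values())
print(hidden)  # 3 of the 5 points would be invisible
```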
<p>So the scatter plot is a neat visual for a talking point but how can we make it more meaningful? </p>
<h3 id="problems-to-improve-upon">Problems to Improve Upon:</h3>
<h4 id="1-townhalls-are-complicated">1. Townhalls are complicated</h4>
<p>After all, they're really a free form method of question and answer for candidates that could have anything from political rants to niceties shared with the base. Varying individual performance of either metric here could be misleading!  </p>
<h4 id="2-were-dropping-out-trends-in-our-chart">2. We're dropping out trends in our chart</h4>
<p>Assuming someone has a distinct way of speaking - over time the law of large numbers should kick in right? Does length of phrase and frequency muddy this?</p>
<h4 id="3-data-sanitization-and-text-interpretation">3. Data sanitization &amp; text interpretation</h4>
 <p>Data cleaning is <a target="_blank" href="https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt">everyone's favorite topic</a> and it becomes a little harder when we're looking at text! Does 'Alright' become a sentence on its own? Does it have a polarity? Does a module understand text enough to accurately assess, say, a statement involving the social inequalities of the militarization of a police force? (Probably not.) But these are all things we have to consider when looking at text. Where do data cleaning/formatting problems step out and NLP complexities step in?</p>
<h3 id="in-the-spirit-of-tinkering-i-didnt-let-that-stop-me">In the Spirit of Tinkering, I Didn't Let That Stop Me</h3>
<p>I'm not about to try and address an entire field of text parsing/understanding but that doesn't mean I can't improve my chart and maybe learn something!  </p>
<p>With this in mind, I went and pulled down transcripts of all of Trump's town halls...I had initially intended to pull these from <a target="_blank" href="https://www.rev.com/blog/">Rev.com</a> (a seriously great site) myself, but I found <a target="_blank" href="https://www.kaggle.com/christianlillelund/donald-trumps-rallies">some kind soul on Kaggle</a> (thank you!!) had already compiled and cleaned them up.</p>
<p>With the files in place, I opened some up to see the format we were working with:</p>
<blockquote>
<p>And you see what's happening, right? It's being rigged against … It's sad. It's being rigged against Crazy Bernie. Crazy Bernie is going to go crazy. Crazy. I think Crazy Bernie is going to be more crazy when they see what they're doing. I called it a long time ago. [... ]The Democrat Party has gone crazy. Whether it's Bernie Sanders plan to eliminate private healthcare, Elizabeth Pocahontas's plan … By the way, she's history. She's history.</p>
</blockquote>
<p>Looks like the data has already had the html, speaker tags, time, and speakers other than Trump cleaned out. This is a great starting dataset! Let's get to work loading the actual text into some dataframes.  But first, imports!  </p>
<pre><code><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> requests               
from wordcloud <span class="hljs-keyword">import</span> WordCloud      
from textblob <span class="hljs-keyword">import</span> TextBlob       
from pathlib <span class="hljs-keyword">import</span> Path   
<span class="hljs-keyword">import</span> pandas as pd  
<span class="hljs-keyword">import</span> seaborn as sns  
<span class="hljs-keyword">import</span> matplotlib.pyplot as plt  
<span class="hljs-keyword">import</span> matplotlib  
from plotly <span class="hljs-keyword">import</span> express as px  
from plotly.subplots <span class="hljs-keyword">import</span> make_subplots  
<span class="hljs-keyword">import</span> plotly.graph_objects as go  
<span class="hljs-keyword">import</span> nltk  
<span class="hljs-keyword">import</span> textatistic
</code></pre><p>(Note that I just ripped out my imports from my class assignment so some of these are unused) </p>
<p>Okay great, so let's point to the files:</p>
<pre><code>directory = <span class="hljs-string">r'C:\Users\PATH'</span>
<span class="hljs-keyword">for</span> filename <span class="hljs-keyword">in</span> os.listdir(directory):
    print(filename)
</code></pre><p><code>out: BattleCreekDec19_2019.txt</code>
Looks like we have a locale as well as a date in these filenames - that might be useful, so let's pull that bit out and save it for later. It could be interesting to see if Trump's speech trends vary over time!</p>
<pre><code>directory = r'C:\Users\PATH'
capMons = ['J','F','M','A','S','O','N','D']

townDict = {}
polarity = []
subject = []
txt = []
length = []
event = []
for filename in os.listdir(directory):
    #holder for if we've got a double digit date
    doubleMonth = <span class="hljs-number">0</span>
    <span class="hljs-comment">#searching for the underscore to find the date loc</span>
    endLoc = filename.find(<span class="hljs-string">'_'</span>)
    <span class="hljs-comment">#search for the period to drop the file extension</span>
    perdLoc = filename.find(<span class="hljs-string">'.'</span>)
    <span class="hljs-keyword">for</span> char in filename[endLoc<span class="hljs-number">-4</span>:perdLoc]:
        <span class="hljs-comment">#checking if we've got a two digit date via string comprehension</span>
        <span class="hljs-keyword">if</span> char in capMons:
            doubleMonth += <span class="hljs-number">1</span>
        <span class="hljs-keyword">else</span>:
            <span class="hljs-keyword">continue</span>
    <span class="hljs-keyword">if</span> doubleMonth == <span class="hljs-number">1</span>:
        <span class="hljs-comment">#changing how far back we go from the date split char if we have double digit month</span>
        fullMo = filename[endLoc<span class="hljs-number">-4</span>:perdLoc].split(<span class="hljs-string">'_'</span>)
    <span class="hljs-keyword">else</span>:
        fullMo = filename[endLoc<span class="hljs-number">-5</span>:perdLoc].split(<span class="hljs-string">'_'</span>)   
    date = fullMo[<span class="hljs-number">0</span>][:<span class="hljs-number">3</span>]+<span class="hljs-string">'-'</span>+fullMo[<span class="hljs-number">0</span>][<span class="hljs-number">3</span>:]+<span class="hljs-string">'-'</span>+fullMo[<span class="hljs-number">1</span>] 
    townDict[filename] = date
</code></pre><p><code>townDict</code> returns <code>BattleCreekDec19_2019.txt': 'Dec-19-2019'</code></p>
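<p>As an aside, an alternative to counting capital letters (a sketch, assuming every filename follows the Venue + abbreviated month + day + underscore + year pattern) is to extract the date with a regex and let <code>datetime.strptime</code> validate it:</p>

```python
import re
from datetime import datetime

def parse_rally_date(filename):
    # Grab 'Dec19_2019' (or 'Dec9_2019') from e.g. 'BattleCreekDec19_2019.txt'.
    match = re.search(r'([A-Z][a-z]{2})(\d{1,2})_(\d{4})', filename)
    if match is None:
        return None
    month, day, year = match.groups()
    return datetime.strptime(f'{month}-{day}-{year}', '%b-%d-%Y')

print(parse_rally_date('BattleCreekDec19_2019.txt'))  # 2019-12-19 00:00:00
```

This handles single- and double-digit days without the capital-letter bookkeeping.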
<p>So now that we've got that bit down, let's move on to the actual content of the files! Putting this under the for loop that we've already go going gets us the data into a dataframe with the polarity and sentiment!</p>
<pre><code> fullPath = str(directory+'\\'+filename)
    <span class="hljs-keyword">with</span> <span class="hljs-keyword">open</span>(fullPath,<span class="hljs-string">'r'</span>, <span class="hljs-keyword">encoding</span>=<span class="hljs-string">'utf8'</span> ) <span class="hljs-keyword">as</span> <span class="hljs-keyword">file</span>:
        <span class="hljs-comment">#let's read the file!</span>
        townhall = file.readlines()
    <span class="hljs-comment">#data is coming through as a list, let's fix that</span>
    townhall = <span class="hljs-keyword">str</span>(townhall)
    <span class="hljs-comment">#there are some break chars coming through, let's get rid of em</span>
    townhall = townhall.replace(<span class="hljs-string">"\\"</span>, <span class="hljs-string">""</span>)
    <span class="hljs-comment">#putting it in a blob for the sentiment analysis</span>
    THblob = TextBlob(townhall)
    <span class="hljs-comment">#Print overall sentiment and analysis for the entire file:</span>
    print(f<span class="hljs-string">'DT Analysis{THblob.sentiment}'</span>)            
    <span class="hljs-comment"># Save sentiment data to dataframe</span>
    pd.set_option(<span class="hljs-string">'max_colwidth'</span>, <span class="hljs-number">400</span>)
    <span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> THblob.sentences: <span class="hljs-comment">##getting the Trump text sentiment and putting it in one big data frame</span>
        polarity.append(sentence.sentiment.polarity)  
        subject.append(sentence.sentiment.subjectivity)
        txt.append(<span class="hljs-keyword">str</span>(sentence))
<span class="hljs-comment">## Getting the words in a sentence to understand length!</span>
        length.append(textatistic.word_count(<span class="hljs-keyword">str</span>(sentence)))
df_TH = pd.DataFrame(polarity,<span class="hljs-keyword">columns</span>=[<span class="hljs-string">'polarity'</span>])
df_TH [<span class="hljs-string">'subjectivity'</span>] = subject
df_TH [<span class="hljs-string">'text'</span>] = txt
df_TH [<span class="hljs-string">'len'</span>] = <span class="hljs-keyword">length</span>
</code></pre><p>Let's check out the data we've got to make sure it's looking good!
<code>df_TH.head()</code></p>
<pre><code>    polarity    subjectivity    text    len
0    0.0    0.00    ['Thank you.    2
1    0.0    0.00    Thank you.    2
2    0.0    0.00    Thank you to Vice President Pence.    6
3    0.7    0.60    He's a good guy.    4
4    0.8    0.75    We've done a great job together.    6
</code></pre><p>Looks like we've got some errant chars, but only at the overall start and end of the file...let's not worry about that for now! Taking a quick look at the data the same way as my earlier assignment:</p>
<pre><code>fig = px.scatter(df_TH,
                 x='polarity',
                 y='subjectivity',
                 hover_data=['text'],
                 color='len')
fig.show()
</code></pre><p>This gets us...confirmation that there's a lot of data here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603389995478/884btqhtk.png" alt="a complex image" />
Just from a visual perspective, it looks like the middle-right area has a lot of overlapping points...but it's hard to tell. Maybe a heatmap would be a better way to compare all of these speeches!</p>
<p>First, let's take a look at the distribution, since we're trying to understand how frequently phrases land at certain coordinates:<br /><code>sns.displot(df_TH, x="polarity")</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603390418908/5O24bdhme.png" alt="image.png" /></p>
<p>That's a lot of zeroes for polarity, and subjectivity doesn't look much better:<br /><code>sns.displot(df_TH, x="subjectivity")</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603390500352/GLSsvQEgy.png" alt="image.png" /></p>
<p>I played around with sentence length vs. polarity and found a good number of zeroes for both measures, even for sentences 25 words long(!). At this point I think it's safe to say this module might be a little less than ready to handle these speeches, but let's move onward since there are still some positives. First, handling the zero values:</p>
<pre><code>df_Filt = df_TH.loc[df_TH['subjectivity'] != 0]
df_Filt = df_Filt.loc[df_Filt['polarity'] != 0]
</code></pre><p>There's also a quirk of the seaborn heatmap - our data is in the wrong format, since it doesn't intelligently handle multiple entries of the same value. We're going to have to bin values, but at .01 precision there would be 200 possible values...let's avoid that and go with .05. A quick helper function will make this easier!</p>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">round_to</span>(<span class="hljs-params">n, precision</span>):</span>
    correction = <span class="hljs-number">0.5</span> <span class="hljs-keyword">if</span> n &gt;= <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-number">-0.5</span>
    <span class="hljs-keyword">return</span> int( n/precision+correction ) * precision

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">round_to_05</span>(<span class="hljs-params">n</span>):</span>
    <span class="hljs-keyword">return</span> round_to(n, <span class="hljs-number">0.05</span>)

df_ThHeat = pd.DataFrame([df_Filt[<span class="hljs-string">'polarity'</span>],df_Filt[<span class="hljs-string">'subjectivity'</span>]])
df_ThHeat = df_ThHeat.transpose()
df_ThHeat[<span class="hljs-string">'polarity'</span>] = df_ThHeat[<span class="hljs-string">'polarity'</span>].apply(round_to_05)
df_ThHeat[<span class="hljs-string">'subjectivity'</span>] = df_ThHeat[<span class="hljs-string">'subjectivity'</span>].apply(round_to_05)
</code></pre><p>So I snuck in some extra code there, but now we've got a DataFrame of rounded values; if we bin matching numbers, we'll be good to go!</p>
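<p>As a quick sanity check, the rounding helper snaps values to the nearest 0.05 (a self-contained sketch repeating the functions from above):</p>

```python
def round_to(n, precision):
    # Round n to the nearest multiple of precision, halves away from zero
    correction = 0.5 if n >= 0 else -0.5
    return int(n / precision + correction) * precision

def round_to_05(n):
    return round_to(n, 0.05)

print(round_to_05(0.37))   # snaps down to ~0.35 (modulo float noise)
print(round_to_05(-0.12))  # snaps to ~-0.10
```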
<pre><code>mytable = df_ThHeat.groupby(['polarity', 'subjectivity']).size().reset_index().rename(columns={0: 'count'})
mytable.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603391484138/tQ5FAA2TK.png" alt="image.png" /></p>
<p>But we also need to get the data into the right shape - let's build a pivot table with the right dimensions:</p>
<pre><code>df = pd.pivot_table(data=mytable, index='subjectivity',
                    values='count', columns='polarity')
df = df.fillna(0)
</code></pre><p>Cool, so let's start working with the heatmap!</p>
<pre><code># Let's set the size of the plot to be bigger so we can see better
plt.gcf().set_size_inches(15, 8)
# White in the background looks a little bland; let's set a gray for better contrast
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
# Let's make the text a bit bigger/better to look at as an image
sns.set_context("poster")
# Actual graph stuff: setting vmin below zero so we get a good color,
# setting a line width to break the cells out a bit
ax = sns.heatmap(df, cbar=True, cmap='rocket_r', linewidths=.5,
                 vmin=-50,  # vmax=500
                 # setting robust to True to get a better variety of colors -
                 # try setting vmax and see what you get!
                 # center=mytable['count'].mean(),
                 robust=True)
# Formatting the labels to look better!
ax.set_xticklabels(['{:.2f}'.format(float(t.get_text())) for t in ax.get_xticklabels()])
ax.set_yticklabels(['{:.2f}'.format(float(t.get_text())) for t in ax.get_yticklabels()])
ax.invert_yaxis()
# Title to finish it up
plt.title('Trump Townhall Speaking Trends')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603391924845/XcFpqBtjI.png" alt="ta-da!" /></p>
<p>Now that we've overlaid all of the townhalls, we can see that Trump tends to stay in the middle-subjectivity, slightly positive range, but he does have a large number of statements in the very subjective and very negative range. In fact, if you bump up the vmax on the graph so that only the most extreme data stands out, you see that he's got a <strong>lot</strong> of statements in that range.</p>
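<p>If you want to try that vmax tweak, here's a minimal sketch on a tiny stand-in grid (the <code>toy_grid</code> values are made up - the real grid is the df built above):</p>

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import pandas as pd
import seaborn as sns

# Made-up binned counts standing in for the real pivot table
toy_grid = pd.DataFrame([[5, 40, 600], [0, 15, 80], [2, 300, 10]])

# Raising vmax means only the very densest cells reach the top of the
# colormap, so moderately dense cells stop washing out the extremes
ax = sns.heatmap(toy_grid, cbar=True, cmap='rocket_r', linewidths=.5,
                 vmin=-50, vmax=500)
```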
<h3 id="wrapping-up-and-closing-thoughts">Wrapping Up &amp; Closing Thoughts</h3>
<p>Trump's got a very <em>unique</em> circular speaking style that tends to be self-referential. I wonder how much of this is due to that style and how much is a failure of the module I've used. I should be able to recreate this analysis with a different module.</p>
<p>Other random thoughts:  </p>
<ul>
<li><p>I could use length to weight the instances of polarity and subjectivity to try and capture the sentiment of more complex thoughts - this might tease out better trends</p>
</li>
<li><p>I'm going to run this for Biden as well to see if there are different trends</p>
</li>
</ul>
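<p>That length-weighting idea might look something like this sketch (the toy rows are made up, standing in for the real df_TH columns):</p>

```python
import pandas as pd

# Made-up rows mimicking the polarity/subjectivity/len columns built earlier
toy = pd.DataFrame({
    'polarity':     [0.0, 0.7, 0.8],
    'subjectivity': [0.0, 0.6, 0.75],
    'len':          [2, 4, 6],
})

# Weight each sentence's scores by its word count, so longer (presumably
# more complex) thoughts contribute more than throwaway two-word replies
weighted = (toy[['polarity', 'subjectivity']]
            .mul(toy['len'], axis=0)
            .sum() / toy['len'].sum())
print(weighted)  # length-weighted mean polarity and subjectivity
```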
<ul>
<li>I could create a 'difference' heatmap showing differences in trends via matrix subtraction between Biden and Trump's heatmaps to visualize if they tend to speak in different quadrants</li>
</ul>
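<p>And that difference-heatmap idea could be sketched like this (the two grids here are made up, standing in for the per-speaker pivot tables the pipeline above would produce):</p>

```python
import pandas as pd

bins = [0.0, 0.05, 0.10]  # stand-in polarity/subjectivity bin values

# Made-up binned counts for each speaker; the real grids would come from
# running the pivot-table pipeline above on each candidate's townhalls
grid_trump = pd.DataFrame([[5, 2, 0], [1, 3, 4], [0, 0, 2]], index=bins, columns=bins)
grid_biden = pd.DataFrame([[1, 0, 2], [2, 3, 1], [4, 0, 0]], index=bins, columns=bins)

# Subtraction aligns on the shared bins: positive cells lean Trump,
# negative cells lean Biden
diff = grid_trump.sub(grid_biden, fill_value=0)
# sns.heatmap(diff, center=0, cmap='coolwarm') would then show the contrast
```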
<p>I learned a lot with this one and had some fun tinkering!</p>
]]></content:encoded></item></channel></rss>