Cooking Up a Decision Tree: Indian Food-set

I'm an analyst/data science nerd who thinks learning is fun. This is for me to document and walk through things I tinker with, since I learn best by teaching! Professionally, I'm a Supervisor of Financial Customer Reporting at a Fortune 500 top 50 company. I have a B.S. degree in MIS and Accounting, and I am currently enrolled in a Business Intelligence & Analytics Master's with a concentration in Data Science.
Previously I've worked on Voice of the Customer, Customer Satisfaction, Call Center Technology & Analysis, and Performance & Incentive programs and systems.
Also, I like bouldering, biking, video games, and my dog.
I love food
Probably a bit too much. Honestly, it's the main reason I travel. When I was scrolling through Kaggle and found a list of recipes for Indian cuisine, I knew I had to do something with them. Enter another topic I've recently covered in my courses: data modeling. Now, quick disclaimer: I love cooking, but Indian food is not my forte, so this is diving into the proverbial deep end for me subject-area-wise! Jumping in, let's see what the data looks like by loading it into a DataFrame:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Placeholder location of the Kaggle CSV; point this at your own copy
directory = r'C:\PATH\I like food'
file = 'indian_food.csv'
df = pd.read_csv(os.path.join(directory, file), encoding='latin-1')
df.head()
out:

Looking at the Data
Ignoring my subject matter ignorance, I thought it could be fun to see if I could put together a decision tree to predict what area of India each recipe came from using prep methods and/or ingredients! After all, spices and food ingredients can tell a lot about a culture. And...looking at the data, we've definitely got those criteria as well as a region name! Let's see if we've got enough instances of each state to actually perform this analysis. Using the Seaborn displot, we can get a good idea of the different 'buckets' in each variable!
for i in range(0, 7):
    sns.displot(data=df, x=df.columns[i]).set(xticklabels=[])
This returns a chart for each of the first seven columns in the data, but let's look at state specifically.

It looks like we've got a few states that only have a few entries. We'll deal with those later. For now, there do look to be some promising buckets of data! However, there is one issue that jumps out: our data is almost completely categorical! Since I was planning on using the scikit-learn decision tree function, this is a problem, because it expects numeric features. In fact, for most model-building purposes, categorical text like this has to be converted to numbers before a model can use it as input.
The easiest way to break categorical data into something we can utilize in model building is via 'One-Hot encoding' (read more here!). Basically, One-Hot encoding is used when a categorical field doesn't have any particular rhyme or reason to the categories in it. This makes the most sense when dealing with locations, objects, or say...ingredients! Let's go ahead and use pd.get_dummies:
pd.get_dummies(df['ingredients'], prefix='ingredients')
out

...Well, it looks like we've only broken out each exact recipe into a variable and not the actual items in each recipe. Let's rethink our approach here, starting with the ingredients column. If we call it, we can see that it's currently a string:
Calling df['ingredients'] returns a Series of comma-separated strings. Let's work some magic to break out the individual items:
df['ingredients'] = df['ingredients'].str.split(',')
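One thing worth checking after a split like this is whitespace. Here's a minimal sketch on made-up data, assuming the CSV separates ingredients with a comma followed by a space (I haven't verified that for every row). The toy frame and its values are invented for illustration, and the lower-casing step is an optional extra rather than something the original code does.

```python
import pandas as pd

# Toy stand-in for the 'ingredients' column (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["dish_a", "dish_b"],
    "ingredients": ["Rice, sugar, Ghee", "sugar, milk"],
})

# A plain split on ',' keeps the space that follows each comma
raw = toy["ingredients"].str.split(",")

# Splitting, then stripping and lower-casing each item, leaves one spelling per ingredient
clean = toy["ingredients"].apply(
    lambda s: [item.strip().lower() for item in s.split(",")]
)

print(raw[0])    # note the leading spaces on the second and third items
print(clean[0])  # same items, trimmed and lower-cased
```

Without the strip, ' sugar' and 'sugar' would count as two different ingredients once they become columns.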

Now, we have our ingredient list in the proper format...but we still can't use pd.get_dummies. Initially, I thought about using df.iterrows() in combination with a set of all of the different ingredients that was created like so:
## I didn't end up going this route
tot_Ingredients = set()
for _, row in df.iterrows():
    for item in row['ingredients']:  # already a list after the split above
        tot_Ingredients.add(item)
However, after some Google-fu, I found a much more elegant method using scikit-learn and pandas built-in functions:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('ingredients')),
        index=df.index,
        columns=mlb.classes_))
df.head() now gets us:

Well, that's an awesome time saver, but what the heck does the copied-and-pasted code do? Googling something isn't really worthwhile if you're not going to bother learning it after all is said and done.
Basically what this neat little code stack is doing is this:
- We initialize mlb as a MultiLabelBinarizer from scikit-learn, which turns an iterable of label collections (our ingredient lists) into a 0/1 indicator column for each distinct label. We pass sparse_output=True to save memory in case the df is large.
- We then build a DataFrame from the resulting sparse matrix and join it back onto the original df. df.pop('ingredients') hands the column to the binarizer and removes it from df in the same step.
- The end result is that every ingredient gets its own column, added to the base df.
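To see those steps on something small enough to eyeball, here's a sketch using a three-row toy frame. The dish names and ingredients are invented, and I've used the default dense output instead of sparse_output=True, since memory isn't a concern at this size.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame: each row already holds a list of ingredients (values are invented)
toy = pd.DataFrame({
    "name": ["dish_a", "dish_b", "dish_c"],
    "ingredients": [["rice", "sugar"], ["sugar", "milk"], ["rice"]],
})

mlb = MultiLabelBinarizer()  # dense output is fine for a tiny frame

# pop() removes the list column from toy and passes it to the binarizer
encoded = pd.DataFrame(
    mlb.fit_transform(toy.pop("ingredients")),
    index=toy.index,
    columns=mlb.classes_,  # one column per distinct ingredient, sorted
)

toy = toy.join(encoded)
print(toy)
```

Each row ends up with a 1 in the columns for the ingredients it contains and a 0 everywhere else.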
Moving on from Data Cleaning
Now that the data is in a useable format (more on this later), we can begin training a model on it. Let's remind ourselves how many different states we're working with:
data1 = df
data1.groupby('state').size()
We've got a number of states there that have very few recipes listed...let's drop any state with 5 or fewer recipes via:
data1 = data1.groupby('state').filter(lambda x: len(x) > 5)
cn = sorted(set(data1['state']))  ## Putting the state names in a sorted list for later graphs
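The cutoff in that filter is easy to misread, so here's a quick sketch on invented data showing which groups survive a len(x) > 5 test. The state names 'A', 'B', and 'C' are placeholders.

```python
import pandas as pd

# Toy data: state 'A' has 6 rows, 'B' has exactly 5, 'C' has 2
toy = pd.DataFrame({"state": ["A"] * 6 + ["B"] * 5 + ["C"] * 2})

# Keep only the groups with MORE than 5 rows
kept = toy.groupby("state").filter(lambda g: len(g) > 5)

print(kept["state"].value_counts())
```

A state with exactly 5 recipes gets dropped here. Swapping in >= 5 would keep it.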
Let's import the required module magic and create a train and test split:
from sklearn.model_selection import train_test_split
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
data1 = data1.loc[data1['state'] != str(-1)] #Let's remove that -1 state value
train, test = train_test_split(data1, test_size = 0.6, stratify = data1['state'], random_state = 42)
By calling train_test_split on our data and providing a test size, we've broken our data into a training set and a test (or validation) set, so we can gauge the accuracy of our predictions on data the model hasn't already seen. We stratify on state (the label we're predicting) to ensure each state shows up in both sets in roughly the same proportion.
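Here's a small sketch of what stratifying buys us, using a made-up frame with three placeholder states. The split settings mirror the ones above, but the data and column names are purely illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 'C' is twice as common as 'A' or 'B' (values are invented)
toy = pd.DataFrame({
    "state": ["A"] * 10 + ["B"] * 10 + ["C"] * 20,
    "x": range(40),
})

# test_size=0.6 sends 60% of the rows to the test set and leaves 40% for training
train, test = train_test_split(
    toy, test_size=0.6, stratify=toy["state"], random_state=42
)

# With stratify, both halves keep the same class mix as the full frame
print(train["state"].value_counts(normalize=True))
print(test["state"].value_counts(normalize=True))
```

Note that a 0.6 test size leaves the model less than half of the data to learn from. A smaller test share is the more common choice when data is scarce, though that's a judgment call.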
Let's drop every variable other than the ingredient columns. There's some good stuff there, but I don't necessarily think that something like cook time will be useful. We're removing state as well, but that's because it's the label we're predicting and we'll pull it out separately.
# Everything that isn't a one-hot ingredient column gets excluded from the features
non_features = ['state', 'region', 'name', 'course', 'flavor_profile', 'diet', 'prep_time', 'cook_time']
features = [col for col in data1.columns if col not in non_features]
Now that we have our labels and a list of features, we can run the following:
X_train = train[features]  # training features
y_train = train.state      # training labels
X_test = test[features]    # testing features
y_test = test.state        # testing labels
# A decision tree that we've told to go 9 levels deep at most
mod_dt = DecisionTreeClassifier(max_depth=9, random_state=1)
mod_dt.fit(X_train, y_train)  # training the model! This is the exciting part!
prediction = mod_dt.predict(X_test)  # applying the model to the testing data
# Let's see how accurate it is! (accuracy_score takes the true labels first)
print('The accuracy of the Decision Tree is', '{:.3f}'.format(metrics.accuracy_score(y_test, prediction)))
We get an accuracy of ~34%...which, while not great, is better than I initially expected from this exercise. Let's take a look at the confusion matrix to better understand what's going on under the hood:
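# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
cn = list(mod_dt.classes_)  # label names in the same order the model uses
disp = metrics.ConfusionMatrixDisplay.from_estimator(mod_dt, X_test, y_test,
                                                     display_labels=cn,
                                                     cmap=plt.cm.Blues,
                                                     normalize=None)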
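If the plot is hard to read, the same information is available as a plain array. This is a minimal sketch on invented data (the features and labels are placeholders), mainly to show that the rows and columns follow the order of the model's classes_ attribute, which is sorted, and that column sums tell you how often each class was predicted.

```python
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Toy data: two binary features and three labels (all invented)
X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]]
y = ["West Bengal", "Assam", "Punjab", "Assam", "Assam", "West Bengal"]

clf = DecisionTreeClassifier(random_state=1).fit(X, y)

# classes_ holds the labels in the order the model uses internally
labels = list(clf.classes_)

# Rows are true labels, columns are predicted labels, both in 'labels' order
cm = confusion_matrix(y, clf.predict(X), labels=labels)

# Column sums: how many times each class was predicted
predicted_counts = cm.sum(axis=0)

print(labels)
print(cm)
print(predicted_counts)
```

A column with a sum far larger than its diagonal entry is a class the model is over-predicting.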

Ignoring the clutter on the x axis, we can see that we're misclassifying a lot of categories into the West Bengal state. Maybe there's some ingredient we're relying on that we can pull out in the actual decision tree!
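fn = list(features)
fig = plt.figure(figsize=(25, 20))
# class_names must follow the model's own label order, so take them from mod_dt.classes_
plot_tree(mod_dt, feature_names=fn, class_names=list(mod_dt.classes_), filled=True)
plt.savefig('foo.pdf')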
This produces a very large image, which is why we've saved it to a 'foo.pdf' file. Cracking that open, we can see that, at least on the first few splits, there aren't any oddities.
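When the rendered tree is too big to read comfortably, a text dump and a ranked list of feature importances can be easier to scan. The sketch below uses a tiny invented dataset (the ingredient columns and region labels are placeholders, not taken from the real data) together with scikit-learn's export_text and the fitted tree's feature_importances_ attribute.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy one-hot ingredient columns (invented); 'sugar' carries no signal here
X = pd.DataFrame({
    "mustard oil": [1, 1, 0, 0, 0, 0],
    "coconut":     [0, 0, 1, 1, 0, 0],
    "sugar":       [1, 0, 1, 0, 1, 0],
})
y = ["East", "East", "South", "South", "North", "North"]

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Plain-text view of the splits, one line per node
print(export_text(clf, feature_names=list(X.columns)))

# Which ingredients the tree actually leaned on, largest first
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real model, the top few entries of a list like this would be the first place I'd look for an ingredient that's pulling too many recipes toward one state.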

Let's rethink this a bit. Maybe I was overzealous in breaking it down by state...let's try again with region!
data1 = data1.loc[data1['name'] != 'Panjeeri']  # cutting out an odd null value
data1['region'].isnull().values.any()  # checks for other nulls
# Same workflow as before, but with region as the label
train, test = train_test_split(data1, test_size=0.4, stratify=data1['region'], random_state=42)
X_train = train[features]
y_train = train.region
X_test = test[features]
y_test = test.region
mod_dt = DecisionTreeClassifier(max_depth=5, random_state=1)
mod_dt.fit(X_train, y_train)
prediction = mod_dt.predict(X_test)
print('The accuracy of the Decision Tree is', '{:.3f}'.format(metrics.accuracy_score(y_test, prediction)))
Well 50% ain't too bad for a half-joking dataset! Let's look at the decision tree and confusion matrix:


It doesn't look too bad! It seems like we're misclassifying a large number of recipes into the 'East' region, which is where our previous misclassifications landed as well.
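One way to put an accuracy number in context is to compare it against a model that always guesses the most common class. This wasn't part of my original run, so treat it as a rough sketch: the labels below are invented, and DummyClassifier is scikit-learn's built-in baseline estimator.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (invented): 'West' is the most common class in the training set
y_train = ["West"] * 5 + ["South"] * 3 + ["North"] * 2
y_test = ["West"] * 4 + ["South"] * 3 + ["North"] * 3

# The baseline ignores the features, so a constant placeholder column is enough
X_train = [[0]] * len(y_train)
X_test = [[0]] * len(y_test)

# Always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

If the decision tree's accuracy sits well above the baseline on the real data, the ingredients are carrying genuine signal. If the two are close, the tree may mostly be guessing the biggest region.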
What We Can Do to Improve the Model
Overall, this model came out better than anticipated by just using the ingredients list of the recipes, but it could be improved. Let's list a few of the methods we can use here to improve:
- There are some instances of a single ingredient having multiple entries, e.g. garam masala vs. garam masala powder. This dilutes the impact of what is really one ingredient, since the tree has to split on each spelling separately. We could clean the data a bit more to fix this.
- Potentially, we could bring in the flavor profile and other variables we did not encode to improve the model.
- Clustering! We could have clustered similar recipes into another variable and utilized that in our model to see if we could extrapolate on the ingredients for a better trend!
- Bringing in more fields. With such an odd dataset it may be difficult, but something like 'dinner/lunch/breakfast' and the amount of each ingredient might have helped here.
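For the first idea in that list, a small lookup table is probably enough to get started. This is a hypothetical sketch: the ALIASES map and the normalize helper are names I've made up for illustration, and the single alias shown is just the duplicate mentioned above. A real pass would need the full list of duplicates from the data.

```python
# Hypothetical alias map: variant spelling -> canonical ingredient name
ALIASES = {"garam masala powder": "garam masala"}


def normalize(ingredients):
    """Trim, lower-case, and de-duplicate a list of ingredient names."""
    cleaned = []
    for item in ingredients:
        key = item.strip().lower()
        # Fall back to the cleaned name when no alias is defined
        cleaned.append(ALIASES.get(key, key))
    # A set removes repeats; sorting keeps the output order stable
    return sorted(set(cleaned))


result = normalize(["Garam masala powder", " garam masala", "Rice"])
print(result)
```

Running this over each recipe's ingredient list before the one-hot step would collapse the variants into a single column.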
Overall this was a good dataset to get used to One-Hot encoding and data cleaning. Thanks for reading!
