<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ryan's Ramblings]]></title><description><![CDATA[Ryan's Ramblings]]></description><link>https://anastasiadis.us</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 07:41:15 GMT</lastBuildDate><atom:link href="https://anastasiadis.us/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Cooking Up a Decision Tree: Indian Food-set]]></title><description><![CDATA[I love food
Probably a bit too much. Honestly, it's the main reason I travel. When I was scrolling through Kaggle and found a list of  recipes for Indian cuisine, I knew I had to do something with them. Enter another topic I've recently covered in my...]]></description><link>https://anastasiadis.us/cooking-up-a-decision-tree-indian-food-set</link><guid isPermaLink="true">https://anastasiadis.us/cooking-up-a-decision-tree-indian-food-set</guid><category><![CDATA[Python]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[creativity]]></category><dc:creator><![CDATA[Ryan Anastasiadis]]></dc:creator><pubDate>Sat, 31 Oct 2020 02:11:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1604110195631/JXjxDtxIC.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="i-love-food">I love food</h3>
<p>Probably a bit too much. Honestly, it's the main reason I travel. When I was scrolling through Kaggle and found a list of <a target="_blank" href="https://www.kaggle.com/nehaprabhavalkar/indian-food-101">recipes for Indian cuisine</a>, I knew I had to do something with them. Enter another topic I've recently covered in my courses: data modeling. Now, quick disclaimer: I love cooking, but Indian food is not my forte, so this is diving into the proverbial deep end for me, subject-area-wise! Jumping in, let's see what the data looks like by loading it into a DataFrame:</p>
<pre><code><span class="hljs-keyword">import</span> csv
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
file = <span class="hljs-string">r'\indian_food.csv'</span>
directory = <span class="hljs-string">r'C:\PATH\I like food'</span>
df = pd.read_csv(directory+file,encoding=<span class="hljs-string">'latin-1'</span>)

df.head()
</code></pre><p><code>out:</code> 
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603839624913/B5g8ctLiv.png" alt="image.png" /></p>
<h3 id="looking-at-the-data">Looking at the Data</h3>
<p>Ignoring my subject matter ignorance, I thought it could be fun to see if I could put together a decision tree to predict which area of India each recipe came from using prep methods and/or ingredients! After all, spices and food ingredients can tell a lot about a culture. And...looking at the data, we've definitely got those criteria as well as a region name! Let's see if we've got enough instances of each state to actually perform this analysis. Using the Seaborn displot, we can get a good idea of the different 'buckets' in each variable:</p>
<pre><code><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, <span class="hljs-number">7</span>):
    sns.displot(data=df,x=df.<span class="hljs-keyword">columns</span>[i] ).<span class="hljs-keyword">set</span>(xticklabels=[])
</code></pre><p>This will return a chart for each variable in the data, but let's look at state specifically.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603841241638/8pzj13kD6.png" alt="image.png" /></p>
<p>It looks like we've got a few states that only have a few entries. We'll deal with those later. For now, there do look to be some promising buckets of data! However, one issue jumps out: our data is almost completely categorical. Since I was planning on using the scikit-learn decision tree function, this is a problem; most scikit-learn estimators expect numeric features, so categorical data has to be encoded before we can use it. </p>
<p>The easiest way to break categorical data into something we can utilize in model building is via 'One-Hot encoding' (<a target="_blank" href="https://www.kaggle.com/getting-started/27270">read more here</a>!). Basically, One-Hot encoding is used when a categorical field doesn't have any inherent order to its categories. This makes the most sense when dealing with locations, objects, or say...ingredients! Let's go ahead and use <code>pd.get_dummies</code>:</p>
<pre><code><span class="hljs-selector-tag">pd</span><span class="hljs-selector-class">.get_dummies</span>(df[<span class="hljs-string">'ingredients'</span>], prefix=<span class="hljs-string">'ingredients'</span>)
</code></pre><p><code>out</code> 
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603845194599/xqxTBq7Pt.png" alt="image.png" /></p>
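<p>To see why we got one column per whole recipe, here's a toy sketch (made-up values, not the real dataset) of how <code>pd.get_dummies</code> treats each distinct cell value as a single category:</p>

```python
import pandas as pd

# Toy version of what happened: get_dummies treats each distinct cell
# value as one category, so a whole comma-separated recipe string
# becomes a single dummy column.
toy = pd.Series(['rice, ghee', 'ghee, jaggery', 'rice, ghee'])
encoded = pd.get_dummies(toy, prefix='ingredients')

# One column per *string*, not per ingredient.
print(list(encoded.columns))  # ['ingredients_ghee, jaggery', 'ingredients_rice, ghee']
```

So before encoding, we need the individual ingredients broken out of each string.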
<p>...Well, it looks like we've only broken out each exact recipe into a variable and not the actual items in each recipe. Let's rethink our approach here, starting with the ingredients column. If we call <code>df['ingredients']</code>, we can see that each entry is currently a single comma-separated string. Let's work some magic to break out the individual items:</p>
<pre><code>df[<span class="hljs-string">'ingredients'</span>] = <span class="hljs-keyword">list</span>(df[<span class="hljs-string">'ingredients'</span>].str.split(<span class="hljs-string">','</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603845839037/77_QmWkq6.png" alt="image.png" /></p>
<p>Now, we have our ingredient list in the proper format...but we still can't use pd.get_dummies. Initially, I thought about using <code>df.iterrows()</code> in combination with a set of all of the different ingredients that was created like so:</p>
<pre><code>##I didn't end up going this route
tot_Ingredients = []
for _, row in df.iterrows():
    for item in row['ingredients'].split(','):
        tot_Ingredients.append(item.strip())
</code></pre><p>However, after some google-fu, I found a much more elegant method using scikit-learn and pandas' built-in functions:</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=<span class="hljs-keyword">True</span>)

df = df.<span class="hljs-keyword">join</span>(
            pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(df.pop(<span class="hljs-string">'ingredients'</span>)),
                <span class="hljs-keyword">index</span>=df.<span class="hljs-keyword">index</span>,
                <span class="hljs-keyword">columns</span>=mlb.classes_))
</code></pre><p><code>df.head()</code> now gets us:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603846858102/mzr-hTWFe.png" alt="image.png" /></p>
<p>Well, that's an awesome time saver, but what the heck does the copy-and-pasted code do? Googling something isn't really worthwhile if you're not going to bother learning it after all is said and done. </p>
<p>Basically what this neat little code stack is doing is this:</p>
<ul>
<li><p>We're initializing mlb as a  <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer from scikit-learn</a> that breaks an 'iterable' (our ingredients list) into a set of labeled bins. (we use the sparse arg to save some memory if we have a larger df). </p>
</li>
<li><p>We then create a dataframe from a sparse matrix that is then joined back to the original df via join and pop. </p>
</li>
</ul>
<p>This has the end result of binning each ingredient into its own column and adding it to the base df. </p>
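<p>Here's a tiny self-contained illustration of what the binarizer produces (made-up ingredient lists, and without the sparse output for readability):</p>

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up ingredient lists, just to show the shape of the output.
recipes = [['rice', 'ghee'], ['ghee', 'jaggery']]

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(recipes)

# One column per distinct ingredient, sorted alphabetically...
print(list(mlb.classes_))  # ['ghee', 'jaggery', 'rice']
# ...and one 0/1 row per recipe.
print(matrix.tolist())     # [[1, 0, 1], [1, 1, 0]]
```

Joining that matrix back onto the df (as in the snippet above) is what gives each ingredient its own column.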
<h3 id="moving-on-from-data-cleaning">Moving on from Data Cleaning</h3>
<p>Now that the data is in a useable format (more on this later), we can begin training a model on it. Let's remind ourselves how many different states we're working with:</p>
<pre><code><span class="hljs-attribute">data1</span> = df
data1.groupby(<span class="hljs-string">'state'</span>).size()
</code></pre><p>We've got a number of states there that have very few recipes listed...let's drop any state with, say, five or fewer recipes:</p>
<pre><code>data1 = data1.groupby(<span class="hljs-string">'state'</span>).filter(<span class="hljs-keyword">lambda</span> x : len(x)&gt;<span class="hljs-number">5</span>)
cn = set(data1[<span class="hljs-string">'state'</span>]) <span class="hljs-comment">##Putting the state names in a set for later graphs</span>
</code></pre><p>Let's import the required module magic and create a train and test split:</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> pandas.plotting <span class="hljs-keyword">import</span> parallel_coordinates
<span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeClassifier, plot_tree
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> metrics
<span class="hljs-keyword">from</span> sklearn.naive_bayes <span class="hljs-keyword">import</span> GaussianNB
<span class="hljs-keyword">from</span> sklearn.discriminant_analysis <span class="hljs-keyword">import</span> LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
<span class="hljs-keyword">from</span> sklearn.neighbors <span class="hljs-keyword">import</span> KNeighborsClassifier
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression
data1 = data1.loc[data1[<span class="hljs-string">'state'</span>] != str(<span class="hljs-number">-1</span>)] <span class="hljs-comment">#Let's remove that -1 state value</span>
train, test = train_test_split(data1, test_size = <span class="hljs-number">0.6</span>, stratify = data1[<span class="hljs-string">'state'</span>], random_state = <span class="hljs-number">42</span>)
</code></pre><p>By calling train_test_split on our data and providing a test size, we've broken our data into a training set and a test (or validation) set so we can gauge the accuracy of our predictions on data our model hasn't already seen. We stratify on the state column to ensure each state [the class label] is represented proportionally in both sets.</p>
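<p>As a quick illustration of what stratifying buys us, here's a toy split (made-up labels, not the recipe data):</p>

```python
from sklearn.model_selection import train_test_split

# Toy labels: 8 of class 'a', 4 of class 'b'.
labels = ['a'] * 8 + ['b'] * 4
data = list(range(12))

# Stratifying keeps the 2:1 class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    data, labels, test_size=0.5, stratify=labels, random_state=42)

print(sorted(y_tr))  # ['a', 'a', 'a', 'a', 'b', 'b']
```

Without <code>stratify</code>, a rare state could easily end up entirely in one half of the split.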
<p>Let's remove the variables other than the ingredients list - there's some good stuff there, but I don't think that something like cook time will be useful. We're removing state as well, but only because it's our prediction target. </p>
<pre><code>features = list(data1.columns)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'state'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'region'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'name'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'course'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'flavor_profile'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'diet'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'prep_time'</span>)
features.<span class="hljs-keyword">remove</span>(<span class="hljs-string">'cook_time'</span>)
</code></pre><p>Now that we have our list of features, we can split out the features and labels and run the following:</p>
<pre><code>X_train = train[features] <span class="hljs-comment">#Training features</span>
y_train = train.state <span class="hljs-comment">#Training labels</span>
X_test = test[features] <span class="hljs-comment">#Testing features</span>
y_test = test.state <span class="hljs-comment">#Testing labels</span>
mod_dt = DecisionTreeClassifier(max_depth = 9, random_state = 1) <span class="hljs-comment">#A decision tree that we've told to go </span>
<span class="hljs-comment">#9 levels deep at max</span>
mod_dt.fit(X_train,y_train) <span class="hljs-comment">#training the model! This is the exciting part!</span>
prediction=mod_dt.predict(X_test) <span class="hljs-comment">#applying the model to the testing data</span>
print('The accuracy of the Decision Tree is','{:.3f}'.format(metrics.accuracy_score(prediction,y_test))) <span class="hljs-comment">#Let's see how accurate it is!</span>
</code></pre><p>We get an accuracy rating of ~34%...which, while not great, is better than I initially expected from this exercise. Let's take a look at the confusion matrix to better understand what's going on under the hood:</p>
<pre><code>cn = sorted(set(data1[<span class="hljs-string">'state'</span>])) <span class="hljs-comment">#sorted so the labels line up with mod_dt.classes_</span>
<span class="hljs-attr">disp</span> = metrics.plot_confusion_matrix(mod_dt, X_test, y_test,
                                 <span class="hljs-attr">display_labels</span>=cn,
                                 <span class="hljs-attr">cmap</span>=plt.cm.Blues,
                                 <span class="hljs-attr">normalize</span>=None)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603851254148/L1gCGOcav.png" alt="image.png" /></p>
<p>Ignoring the clutter on the x axis, we can see that we're misclassifying a lot of categories into the West Bengal state. Maybe there's some ingredient we're relying on that we can pull out in the actual decision tree!</p>
<pre><code><span class="hljs-function"><span class="hljs-keyword">fn</span> = <span class="hljs-title">list</span>(<span class="hljs-params">features</span>)

<span class="hljs-title">fig</span> = <span class="hljs-title">plt</span>.<span class="hljs-title">figure</span>(<span class="hljs-params">figsize=(<span class="hljs-params"><span class="hljs-number">25</span>,<span class="hljs-number">20</span></span>)</span>)
<span class="hljs-title">plot_tree</span>(<span class="hljs-params">mod_dt, feature_names = fn, class_names = cn, filled = <span class="hljs-literal">True</span></span>)
<span class="hljs-title">plt</span>.<span class="hljs-title">savefig</span>(<span class="hljs-params"><span class="hljs-string">'foo.pdf'</span></span>)</span>
</code></pre><p>This gets us a very large image, which is why we've also saved it to 'foo.pdf'. Cracking that open, we can see that at least on the first few splits there aren't any oddities. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603851758381/2Ba1b7vQc.png" alt="image.png" /></p>
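<p>Another way to hunt for an over-weighted ingredient, instead of eyeballing the tree, is to rank the tree's feature importances. A small self-contained sketch on made-up data (with the real model, you'd pass <code>mod_dt</code> and <code>features</code> instead):</p>

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real X_train/y_train: two binary 'ingredient'
# columns and a label that depends only on the first one.
X = pd.DataFrame({'fish': [1, 1, 0, 0], 'rice': [1, 0, 1, 0]})
y = ['East', 'East', 'West', 'West']

tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# feature_importances_ shows which columns the splits actually rely on.
ranked = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked.head())  # 'fish' carries all the importance here
```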
<p>Let's rethink this a bit; maybe I was overzealous in breaking it down by state...let's try again with region!</p>
<pre><code>data1 =data1.loc[data1[<span class="hljs-string">'name'</span>] != <span class="hljs-string">'Panjeeri'</span> ] #cutting <span class="hljs-keyword">out</span> an odd <span class="hljs-keyword">null</span> <span class="hljs-keyword">value</span>
data1[<span class="hljs-string">'region'</span>].<span class="hljs-keyword">isnull</span>().<span class="hljs-keyword">values</span>.<span class="hljs-keyword">any</span>() ##Checks <span class="hljs-keyword">for</span> other nulls


##region
train, test = train_test_split(data1, test_size = <span class="hljs-number">0.4</span>, stratify = data1[<span class="hljs-string">'region'</span>], random_state = <span class="hljs-number">42</span>)
X_train = train[features]
y_train = train.region
X_test = test[features]
y_test = test.region
mod_dt = DecisionTreeClassifier(max_depth =<span class="hljs-number">5</span>, random_state = <span class="hljs-number">1</span>)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)
print(<span class="hljs-string">'The accuracy of the Decision Tree is'</span>,<span class="hljs-string">'{:.3f}'</span>.format(metrics.accuracy_score(prediction,y_test)))
</code></pre><p>Well 50% ain't too bad for a half-joking dataset! Let's look at the decision tree and confusion matrix:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603922888544/6pmwOar-F.png" alt="image.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603922907651/lwnAlRSXE.png" alt="image.png" /></p>
<p>It doesn't look too bad! Seems like we're still misclassifying a large amount into the 'East' region, which is where our previous misclassifications landed as well. </p>
<h3 id="what-we-can-do-to-improve-the-model">What We Can do to Improve the Model</h3>
<p>Overall, this model came out better than anticipated using just the ingredients lists of the recipes, but it could be improved. Here are a few ways we could improve it:</p>
<ul>
<li>There are some instances of a single ingredient having multiple entries...Garam Masala vs Garam Masala Powder. This has the effect of 'lessening' the impact of the actual single ingredient and causing it to split more than once. We could clean the data a bit more to improve it.</li>
<li>We could bring in the flavor profile and other variables we did not encode.</li>
<li>Clustering! We could have clustered similar recipes into another variable and used that in our model to see if we could extrapolate on the ingredients for a better trend!</li>
<li>Bringing in more fields. With such an odd dataset it may be difficult, but something like 'dinner/lunch/breakfast' and amounts of each ingredient might have helped here.</li>
</ul>
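<p>For that first point, even a crude normalization pass would merge some of the duplicate ingredient names. A hypothetical sketch (a real cleanup would want a curated synonym list):</p>

```python
# Hypothetical sketch: collapse trivially different ingredient names.
def normalize(ingredient):
    name = ingredient.strip().lower()
    # Treat 'x powder' as the same ingredient as 'x'.
    if name.endswith(' powder'):
        name = name[:-len(' powder')]
    return name

raw = ['Garam Masala', ' garam masala powder', 'Ghee']
print(sorted({normalize(i) for i in raw}))  # ['garam masala', 'ghee']
```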
<p>Overall this was a good dataset to get used to One-Hot encoding and data cleaning. Thanks for reading!</p>
]]></content:encoded></item><item><title><![CDATA[Tooling Around With Text Sentiment: Trump Town Halls]]></title><description><![CDATA[Hello World! This is my first article here

Recently I finished up my masters classes in both Python and Data Mining.
And while I enjoy my newly reclaimed time on Mondays, Thursdays and god-knows what days I did homework - I wanted to keep my skills ...]]></description><link>https://anastasiadis.us/tooling-around-with-text-sentiment-trump-town-halls</link><guid isPermaLink="true">https://anastasiadis.us/tooling-around-with-text-sentiment-trump-town-halls</guid><category><![CDATA[Python]]></category><category><![CDATA[#data visualisation]]></category><dc:creator><![CDATA[Ryan Anastasiadis]]></dc:creator><pubDate>Thu, 22 Oct 2020 19:07:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1603392885733/qRMuPJ1o8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Hello World! This is my first article here</p>
</blockquote>
<h3 id="recently-i-finished-up-my-masters-classes-in-both-python-and-data-mining">Recently I finished up my master's classes in both Python and Data Mining.</h3>
<p>And while I enjoy my newly reclaimed time on Mondays, Thursdays and god-knows what days I did homework - I wanted to keep my skills fresh and start tinkering. </p>
<p>In class we had worked with basic sentiment analysis from the RNC/DNC as well as some of the townhalls to plot out scatterplots of sentiment along with polarity:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603385637540/q6WqT1L86.png" alt="10/15 Dueling Townhalls" /></p>
<p>And while I had completed the assignment as specified, it got me thinking about the differences in speech patterns and how varying modules derive meaning from them. I mean, after all, there's a certain amount of context encoded in normal speech and text that humans innately understand. Anyone who has worked with freeform text is familiar with this problem; it's the basis of the entire NLP field! </p>
<p>So with this in mind, and returning to the scatterplot I had turned in for class, I noticed that this visual, while interesting, did a bad job of a few things. Namely, it does a bad job of calling out trends! Every dot on the chart is a sentence plotted at a coordinate of polarity (good/bad context) and subjectivity (fact/hearsay context). What happens, however, when there are multiple statements in the same position on the graph? We lose those points in favor of whichever one was plotted over them. There's a level of context missing here, one that becomes even more complex to capture when you consider the problems inherent in understanding different forms of speech.</p>
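<p>We can actually quantify that loss by counting how many sentences share the exact same (polarity, subjectivity) coordinate. A sketch on made-up scores:</p>

```python
from collections import Counter

# Made-up (polarity, subjectivity) pairs standing in for real sentences.
points = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.5, 0.6), (0.5, 0.6)]

counts = Counter(points)
# A scatter plot draws each coordinate once, hiding the duplicates.
hidden = sum(n - 1 for n in counts.values())
print(hidden)  # 3 of the 5 points would be invisible
```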
<p>So the scatter plot is a neat visual for a talking point but how can we make it more meaningful? </p>
<h3 id="problems-to-improve-upon">Problems to Improve Upon:</h3>
<h4 id="1-townhalls-are-complicated">1. Townhalls are complicated</h4>
<p>After all, they're really a free form method of question and answer for candidates that could have anything from political rants to niceties shared with the base. Varying individual performance of either metric here could be misleading!  </p>
<h4 id="2-were-dropping-out-trends-in-our-chart">2. We're dropping out trends in our chart</h4>
<p>Assuming someone has a distinct way of speaking - over time the law of large numbers should kick in right? Does length of phrase and frequency muddy this?</p>
<h4 id="3-data-sanitization-and-text-interpretation">3. Data sanitization &amp; text interpretation</h4>
 <p>Data cleaning is <a target="_blank" href="https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt">everyone's favorite topic</a> and it becomes a little harder when we're looking at text! Does 'Alright' become a sentence on its own? Does it have a polarity? Does a module understand text enough to accurately assess, say, a statement involving the social inequalities of the militarization of a police force? (Probably not.) But these are all things we have to consider when looking at text. Where do data cleaning/formatting problems step out and NLP complexities step in?</p>
<h3 id="in-the-spirit-of-tinkering-i-didnt-let-that-stop-me">In the Spirit of Tinkering, I Didn't Let That Stop Me</h3>
<p>I'm not about to try and address an entire field of text parsing/understanding but that doesn't mean I can't improve my chart and maybe learn something!  </p>
<p>With this in mind, I went and pulled down transcripts of all of Trump's town halls...I had initially intended to pull these from <a target="_blank" href="https://www.rev.com/blog/">Rev.com</a> (a seriously great site) myself, but I found <a target="_blank" href="https://www.kaggle.com/christianlillelund/donald-trumps-rallies">some kind soul on Kaggle</a> (thank you!!) had already compiled and cleaned them up.</p>
<p>With the files in place, I opened some up to see the format we were working with:</p>
<blockquote>
<p>And you see what's happening, right? It's being rigged against … It's sad. It's being rigged against Crazy Bernie. Crazy Bernie is going to go crazy. Crazy. I think Crazy Bernie is going to be more crazy when they see what they're doing. I called it a long time ago. [... ]The Democrat Party has gone crazy. Whether it's Bernie Sanders plan to eliminate private healthcare, Elizabeth Pocahontas's plan … By the way, she's history. She's history.</p>
</blockquote>
<p>Looks like the data has already had the html, speaker tags, time, and speakers other than Trump cleaned out. This is a great starting dataset! Let's get to work loading the actual text into some dataframes.  But first, imports!  </p>
<pre><code><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> requests               
from wordcloud <span class="hljs-keyword">import</span> WordCloud      
from textblob <span class="hljs-keyword">import</span> TextBlob       
from pathlib <span class="hljs-keyword">import</span> Path   
<span class="hljs-keyword">import</span> pandas as pd  
<span class="hljs-keyword">import</span> seaborn as sns  
<span class="hljs-keyword">import</span> matplotlib.pyplot as plt  
<span class="hljs-keyword">import</span> matplotlib  
from plotly <span class="hljs-keyword">import</span> express as px  
from plotly.subplots <span class="hljs-keyword">import</span> make_subplots  
<span class="hljs-keyword">import</span> plotly.graph_objects as go  
<span class="hljs-keyword">import</span> nltk  
<span class="hljs-keyword">import</span> textatistic
</code></pre><p>(Note that I just ripped out my imports from my class assignment so some of these are unused) </p>
<p>Okay great, so let's point to the files:</p>
<pre><code>directory = <span class="hljs-string">r'C:\Users\PATH'</span>
<span class="hljs-keyword">for</span> filename <span class="hljs-keyword">in</span> os.listdir(directory):
    print(filename)
</code></pre><p><code>out: BattleCreekDec19_2019.txt</code>
Looks like we have a locale as well as a date in these filenames - that might be useful, so let's pull that bit out and save it for later. It could be interesting to see if Trump's speech trends vary over time!</p>
<pre><code>directory = r'C:\Users\PATH'
capMons = ['J','F','M','A','S','O','N','D']

townDict = {}
polarity = []
subject = []
txt = []
length = []
event = []
for filename in os.listdir(directory):
    #holder for if we've got a double digit date
    doubleMonth = <span class="hljs-number">0</span>
    <span class="hljs-comment">#searching for the underscore to find the date loc</span>
    endLoc = filename.find(<span class="hljs-string">'_'</span>)
    <span class="hljs-comment">#search for the period to drop the file extension</span>
    perdLoc = filename.find(<span class="hljs-string">'.'</span>)
    <span class="hljs-keyword">for</span> char in filename[endLoc<span class="hljs-number">-4</span>:perdLoc]:
        <span class="hljs-comment">#checking if we've got a two digit date via string comprehension</span>
        <span class="hljs-keyword">if</span> char in capMons:
            doubleMonth += <span class="hljs-number">1</span>
        <span class="hljs-keyword">else</span>:
            <span class="hljs-keyword">continue</span>
    <span class="hljs-keyword">if</span> doubleMonth == <span class="hljs-number">1</span>:
        <span class="hljs-comment">#changing how far back we go from the date split char if we have double digit month</span>
        fullMo = filename[endLoc<span class="hljs-number">-4</span>:perdLoc].split(<span class="hljs-string">'_'</span>)
    <span class="hljs-keyword">else</span>:
        fullMo = filename[endLoc<span class="hljs-number">-5</span>:perdLoc].split(<span class="hljs-string">'_'</span>)   
    date = fullMo[<span class="hljs-number">0</span>][:<span class="hljs-number">3</span>]+<span class="hljs-string">'-'</span>+fullMo[<span class="hljs-number">0</span>][<span class="hljs-number">3</span>:]+<span class="hljs-string">'-'</span>+fullMo[<span class="hljs-number">1</span>] 
    townDict[filename] = date
</code></pre><p><code>townDict</code> returns <code>BattleCreekDec19_2019.txt': 'Dec-19-2019'</code></p>
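<p>As an aside, an alternative to counting capital letters (a sketch, assuming every filename follows the Venue + abbreviated month + day + underscore + year pattern) is to extract the date with a regex and let <code>datetime.strptime</code> validate it:</p>

```python
import re
from datetime import datetime

def parse_rally_date(filename):
    # Grab 'Dec19_2019' (or 'Dec9_2019') from e.g. 'BattleCreekDec19_2019.txt'.
    match = re.search(r'([A-Z][a-z]{2})(\d{1,2})_(\d{4})', filename)
    if match is None:
        return None
    month, day, year = match.groups()
    return datetime.strptime(f'{month}-{day}-{year}', '%b-%d-%Y')

print(parse_rally_date('BattleCreekDec19_2019.txt'))  # 2019-12-19 00:00:00
```

This handles single- and double-digit days without the capital-letter bookkeeping.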
<p>So now that we've got that bit down, let's move on to the actual content of the files! Putting this under the for loop that we've already go going gets us the data into a dataframe with the polarity and sentiment!</p>
<pre><code> fullPath = str(directory+'\\'+filename)
    <span class="hljs-keyword">with</span> <span class="hljs-keyword">open</span>(fullPath,<span class="hljs-string">'r'</span>, <span class="hljs-keyword">encoding</span>=<span class="hljs-string">'utf8'</span> ) <span class="hljs-keyword">as</span> <span class="hljs-keyword">file</span>:
        <span class="hljs-comment">#let's read the file!</span>
        townhall = file.readlines()
    <span class="hljs-comment">#data is coming through as a list, let's fix that</span>
    townhall = <span class="hljs-keyword">str</span>(townhall)
    <span class="hljs-comment">#there are some break chars coming through, let's get rid of em</span>
    townhall = townhall.replace(<span class="hljs-string">"\\"</span>, <span class="hljs-string">""</span>)
    <span class="hljs-comment">#putting it in a blob for the sentiment analysis</span>
    THblob = TextBlob(townhall)
    <span class="hljs-comment">#Print overall sentiment and analysis for the entire file:</span>
    print(f<span class="hljs-string">'DT Analysis{THblob.sentiment}'</span>)            
    <span class="hljs-comment"># Save sentiment data to dataframe</span>
    pd.set_option(<span class="hljs-string">'max_colwidth'</span>, <span class="hljs-number">400</span>)
    <span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> THblob.sentences: <span class="hljs-comment">##getting the Trump text sentiment and putting it in one big data frame</span>
        polarity.append(sentence.sentiment.polarity)  
        subject.append(sentence.sentiment.subjectivity)
        txt.append(<span class="hljs-keyword">str</span>(sentence))
<span class="hljs-comment">## Getting the words in a sentence to understand length!</span>
        length.append(textatistic.word_count(<span class="hljs-keyword">str</span>(sentence)))
df_TH = pd.DataFrame(polarity,<span class="hljs-keyword">columns</span>=[<span class="hljs-string">'polarity'</span>])
df_TH [<span class="hljs-string">'subjectivity'</span>] = subject
df_TH [<span class="hljs-string">'text'</span>] = txt
df_TH [<span class="hljs-string">'len'</span>] = <span class="hljs-keyword">length</span>
</code></pre><p>Let's check out the data we've got to make sure it's looking good!
<code>df_TH.head()</code></p>
<pre><code>    polarity    subjectivity    text    len
0    0.0    0.00    ['Thank you.    2
1    0.0    0.00    Thank you.    2
2    0.0    0.00    Thank you to Vice President Pence.    6
3    0.7    0.60    He's a good guy.    4
4    0.8    0.75    We've done a great job together.    6
</code></pre><p>Looks like we've got some errant chars, but only at the overall start and end of the file...let's not worry about that for now! Taking a quick look at the data the same way as my earlier assignment:</p>
<pre><code>fig = px.scatter(df_TH,
                 x='polarity',
                 y='subjectivity',
                 hover_data=['text'],
                 color='len')
fig.show()
</code></pre><p>This gets us...confirmation that there's a lot of data here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603389995478/884btqhtk.png" alt="a complex image" />
Just from a visual perspective, it looks like the middle-right area has a lot of overlapping points...but it's hard to tell. Maybe a heatmap would be a better way to compare all of these speeches!</p>
<p>First, let's take a look at the distribution, since we're trying to understand how frequently phrases land at certain coordinates:<br /><code>sns.displot(df_TH, x="polarity")</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603390418908/5O24bdhme.png" alt="image.png" /></p>
<p>That's a lot of zeroes for polarity, and subjectivity doesn't look much better:<br /><code>sns.displot(df_TH, x="subjectivity")</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603390500352/GLSsvQEgy.png" alt="image.png" /></p>
<p>I played around with sentence length vs. polarity and found a good number of zeroes for both measures, even for sentences 25 words long(!). At this point I think it's safe to say this module might be a little less than ready to handle these speeches, but let's move onward since there are still some positives. First, handling the zero values:</p>
<pre><code>df_Filt = df_TH.loc[df_TH['subjectivity'] != 0]
df_Filt = df_Filt.loc[df_Filt['polarity'] != 0]
</code></pre><p>There's also a quirk of the seaborn heatmap - our data is in the wrong format, since it doesn't intelligently handle multiple entries of the same value. We're going to have to bin values, but at .01 precision there would be 200 possible values...let's avoid that and go with .05. A quick helper function will make this easier!</p>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">round_to</span>(<span class="hljs-params">n, precision</span>):</span>
    correction = <span class="hljs-number">0.5</span> <span class="hljs-keyword">if</span> n &gt;= <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-number">-0.5</span>
    <span class="hljs-keyword">return</span> int( n/precision+correction ) * precision

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">round_to_05</span>(<span class="hljs-params">n</span>):</span>
    <span class="hljs-keyword">return</span> round_to(n, <span class="hljs-number">0.05</span>)

df_ThHeat = pd.DataFrame([df_Filt[<span class="hljs-string">'polarity'</span>],df_Filt[<span class="hljs-string">'subjectivity'</span>]])
df_ThHeat = df_ThHeat.transpose()
df_ThHeat[<span class="hljs-string">'polarity'</span>] = df_ThHeat[<span class="hljs-string">'polarity'</span>].apply(round_to_05)
df_ThHeat[<span class="hljs-string">'subjectivity'</span>] = df_ThHeat[<span class="hljs-string">'subjectivity'</span>].apply(round_to_05)
</code></pre><p>So I snuck in some extra code there, but now we've got a DataFrame of rounded values; if we bin matching numbers, we'll be good to go!</p>
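<p>As a quick sanity check, the rounding helper snaps values to the nearest 0.05 (a self-contained sketch repeating the functions from above):</p>

```python
def round_to(n, precision):
    # Round n to the nearest multiple of precision, halves away from zero
    correction = 0.5 if n >= 0 else -0.5
    return int(n / precision + correction) * precision

def round_to_05(n):
    return round_to(n, 0.05)

print(round_to_05(0.37))   # snaps down to ~0.35 (modulo float noise)
print(round_to_05(-0.12))  # snaps to ~-0.10
```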
<pre><code>mytable = df_ThHeat.groupby(['polarity', 'subjectivity']).size().reset_index().rename(columns={0: 'count'})
mytable.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603391484138/tQ5FAA2TK.png" alt="image.png" /></p>
<p>But we also need to get the data into the right shape - let's build a pivot table with the right dimensions:</p>
<pre><code>df = pd.pivot_table(data=mytable, index='subjectivity',
                    values='count', columns='polarity')
df = df.fillna(0)
</code></pre><p>Cool, so let's start working with the heatmap!</p>
<pre><code># Let's set the size of the plot to be bigger so we can see better
plt.gcf().set_size_inches(15, 8)
# White in the background looks a little bland; let's set a gray for better contrast
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
# Let's make the text a bit bigger/better to look at as an image
sns.set_context("poster")
# Actual graph stuff: setting vmin below zero so we get a good color,
# setting a line width to break the cells out a bit
ax = sns.heatmap(df, cbar=True, cmap='rocket_r', linewidths=.5,
                 vmin=-50,  # vmax=500
                 # setting robust to True to get a better variety of colors -
                 # try setting vmax and see what you get!
                 # center=mytable['count'].mean(),
                 robust=True)
# Formatting the labels to look better!
ax.set_xticklabels(['{:.2f}'.format(float(t.get_text())) for t in ax.get_xticklabels()])
ax.set_yticklabels(['{:.2f}'.format(float(t.get_text())) for t in ax.get_yticklabels()])
ax.invert_yaxis()
# Title to finish it up
plt.title('Trump Townhall Speaking Trends')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1603391924845/XcFpqBtjI.png" alt="ta-da!" /></p>
<p>Now that we've overlaid all of the townhalls, we can see that Trump tends to stay in the middle-subjectivity, slightly positive range, but he does have a large number of statements in the very subjective and very negative range. In fact, if you bump up the vmax on the graph so that only the most extreme data stands out, you see that he's got a <strong>lot</strong> of statements in that range.</p>
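<p>If you want to try that vmax tweak, here's a minimal sketch on a tiny stand-in grid (the <code>toy_grid</code> values are made up - the real grid is the df built above):</p>

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import pandas as pd
import seaborn as sns

# Made-up binned counts standing in for the real pivot table
toy_grid = pd.DataFrame([[5, 40, 600], [0, 15, 80], [2, 300, 10]])

# Raising vmax means only the very densest cells reach the top of the
# colormap, so moderately dense cells stop washing out the extremes
ax = sns.heatmap(toy_grid, cbar=True, cmap='rocket_r', linewidths=.5,
                 vmin=-50, vmax=500)
```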
<h3 id="wrapping-up-and-closing-thoughts">Wrapping Up &amp; Closing Thoughts</h3>
<p>Trump's got a very <em>unique</em> circular speaking style that tends to be self-referential. I wonder how much of this is due to that style and how much is a failure of the module I've used. I should be able to recreate this analysis with a different module.</p>
<p>Other random thoughts:  </p>
<ul>
<li><p>I could use length to weight the instances of polarity and subjectivity to try and capture the sentiment of more complex thoughts - this might tease out better trends</p>
</li>
<li><p>I'm going to run this for Biden as well to see if there are different trends</p>
</li>
</ul>
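<p>That length-weighting idea might look something like this sketch (the toy rows are made up, standing in for the real df_TH columns):</p>

```python
import pandas as pd

# Made-up rows mimicking the polarity/subjectivity/len columns built earlier
toy = pd.DataFrame({
    'polarity':     [0.0, 0.7, 0.8],
    'subjectivity': [0.0, 0.6, 0.75],
    'len':          [2, 4, 6],
})

# Weight each sentence's scores by its word count, so longer (presumably
# more complex) thoughts contribute more than throwaway two-word replies
weighted = (toy[['polarity', 'subjectivity']]
            .mul(toy['len'], axis=0)
            .sum() / toy['len'].sum())
print(weighted)  # length-weighted mean polarity and subjectivity
```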
<ul>
<li>I could create a 'difference' heatmap showing differences in trends via matrix subtraction between Biden and Trump's heatmaps to visualize if they tend to speak in different quadrants</li>
</ul>
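<p>And that difference-heatmap idea could be sketched like this (the two grids here are made up, standing in for the per-speaker pivot tables the pipeline above would produce):</p>

```python
import pandas as pd

bins = [0.0, 0.05, 0.10]  # stand-in polarity/subjectivity bin values

# Made-up binned counts for each speaker; the real grids would come from
# running the pivot-table pipeline above on each candidate's townhalls
grid_trump = pd.DataFrame([[5, 2, 0], [1, 3, 4], [0, 0, 2]], index=bins, columns=bins)
grid_biden = pd.DataFrame([[1, 0, 2], [2, 3, 1], [4, 0, 0]], index=bins, columns=bins)

# Subtraction aligns on the shared bins: positive cells lean Trump,
# negative cells lean Biden
diff = grid_trump.sub(grid_biden, fill_value=0)
# sns.heatmap(diff, center=0, cmap='coolwarm') would then show the contrast
```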
<p>I learned a lot with this one and had some fun tinkering!</p>
]]></content:encoded></item></channel></rss>