In [481]:
# # if you do not have the folder to begin with:
#from google.colab import drive
#drive.mount('/content/drive')#,force_remount=True)
#%cd content/drive/MyDrive/MadBeignet.github.io.git
# !git clone git@github.com:MadBeignet/MadBeignet.github.io.git
# !git clone https://github.com/MadBeignet/MadBeignet.github.io
In [482]:
#!ls
In [483]:
#%cd../../../
In [484]:
# # first, mount your google drive, change to the course folder, pull latest changes, and change to the lab folder.
#from google.colab import drive
#drive.mount('/content/drive',force_remount=True)
# %cd content/MadBeignet.github.io
# !git pull
%cd Data
[Errno 2] No such file or directory: 'Data'
/Users/maddiewisinski/Documents/GitHub/MadBeignet.github.io/Data
In [485]:
!ls
PIRUS_May2020      Protests           archive_folder
Population         Voter_Turnouts     fancy_website.html
In [486]:
#%cd "drive/MyDrive/MadBeignet.github.io/Data"
In [487]:
# imports
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import numpy as np
from matplotlib.pyplot import figure
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from numpy.core.getlimits import inf
pd.set_option('mode.chained_assignment',None)

Team: Merrilee Montgomery and Maddie Wisinski

Website Link: https://madbeignet.github.io/

Project Goals

The team will be looking at the relationship between political participation and political resistance in the United States from 2000-2021 by state.

To measure political participation, the team will use voter turnout statistics by state from collected by the Election Project. The election project website derives all its data from individual state websites.

This project will distinguish between violent and nonviolent political resistance. To measure nonviolent political resistance, this group will use protest frequency and size from Count Love, a group from MIT that began tracking protests amidst the 2017 Women's March. To study violent political resistance, this project will use Profiles of Individual Radicalization in the US (PIRUS) from University of Maryland National Consortium for the Study of Terrorism and Responses to Terrorism (START). The PIRUS Dataset contains informaiton about individuals who's radicalization became apparent through their plotting to engage in violent activity.

Election Project: https://www.electproject.org/home

Count Love: https://countlove.org/faq.html

PIRUS: https://www.start.umd.edu/data-tools/profiles-individual-radicalization-united-states-pirus

Voter Turnout: 2000-2022

Cleaning the Data

The Election Project collects voter turnout data for the general election that occur every two comes in separate CSVs by year. Here we want to read all by-year files into a single DataFrame. To do so, we must account for the following:

  1. Years 2000-2010 are in a uniform format, but missing state abbreviation.
  2. Years 2012-2020 have an extra column of state abreviation that can be used to create a index value consisting of Year and State Abbreviation.
  3. Years 2016-2020 have notes at the end of each csv that must be deleted.

Step 1: Concatenate years 2000-2010

In [488]:
csv_final = pd.read_csv("./Voter_Turnouts/2000 November General Election - Turnout Rates.csv",
                        header = None,
                        skiprows = 2)#first two rows is header in CSV
csv_final['Year']=2000

l = []#we will use this to make sure all files loaded
for a in range (2002,2012,2):
  csv_temp = pd.read_csv("./Voter_Turnouts/"+str(a)+" November General Election - Turnout Rates.csv",
                         header = None,
                         skiprows = 2)#first two rows are headers in csv
  csv_temp['Year']=a#As year is incremented,value changes
  l.append(1)
  csv_final = pd.concat([csv_final,csv_temp],ignore_index = True)
final_df = pd.DataFrame(csv_final)
print(len(l) == 5)#Test to make sure all files were uploaded, returns true if successful, false else
final_df.columns = ['Region', 'VEP Total Ballots Counted', 'VEP Highest Office', 'VAP Highest Office', 'Total Ballots Counted', 'Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'Year']
#Rename all columns

final_df
True
Out[488]:
Region VEP Total Ballots Counted VEP Highest Office VAP Highest Office Total Ballots Counted Highest Office Voting-Eligible Population (VEP) Voting-Age Population (VAP) % Non-citizen Prison Probation Parole Total Ineligible Felon Overseas Eligible Year
0 United States 55.3% 54.2% 50.0% 107,390,107 105,375,486 194,331,436 210,623,408 7.7% 1,377,013 2,339,388 536,039 3,082,746 2,937,000 2000
1 Alabama NaN 51.6% 50.1% NaN 1,672,551 3,241,682 3,334,576 1.5% 26,225 40,178 5,484 51,798 NaN 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
310 Wisconsin 52.4% 52.0% 49.7% 2,185,021 2,171,331 4,172,130 4,365,214 3.2% 22,724 22,602 19,572 55,112 NaN 2010
311 Wyoming 46.0% 45.5% 43.8% 190,822 188,463 414,536 430,673 2.4% 2,059 3,231 682 5,684 NaN 2010

312 rows × 15 columns

Step 2: Drop State Abbreviations and Excess Rows

To concatenate Voter Turnout from years 2012-2022, we have to remove the abbreviation column and any excess rows (which are usually methodology notes.)

In [489]:
l = []#we will use this to make sure all files loaded
for a in range (2012,2016,2):
  csv_temp = pd.read_csv("./Voter_Turnouts/"+str(a)+" November General Election - Turnout Rates.csv",
                         header = None,
                         skiprows = 2,
                         names = ['Region', 'VEP Total Ballots Counted', 'VEP Highest Office', 'VAP Highest Office', 'Total Ballots Counted', 'Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'])#first two rows are headers in csv
  csv_temp['Year']=a#As year is incremented,value changes
  csv_temp = csv_temp.iloc[:52]
  csv_temp.drop('State Abv',inplace=True,axis=1)
  csv_final = pd.concat([csv_final,csv_temp],ignore_index=True)#not sure whats going wrong now, I'll ask Dr. Culotta

csv_final
Out[489]:
Region VEP Total Ballots Counted VEP Highest Office VAP Highest Office Total Ballots Counted Highest Office Voting-Eligible Population (VEP) Voting-Age Population (VAP) % Non-citizen Prison Probation Parole Total Ineligible Felon Overseas Eligible Year
0 United States 55.3% 54.2% 50.0% 107,390,107 105,375,486 194,331,436 210,623,408 7.7% 1,377,013 2,339,388 536,039 3,082,746 2,937,000 2000
1 Alabama NaN 51.6% 50.1% NaN 1,672,551 3,241,682 3,334,576 1.5% 26,225 40,178 5,484 51,798 NaN 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
414 Wisconsin 56.9% 56.6% 53.9% 2,422,248 2,410,314 4,260,427 4,454,970 3.1% 22,097 46,212 20,010 67,986 NaN 2014
415 Wyoming 39.7% 39.0% 37.3% 171,153 168,390 431,434 445,626 2.7% 2,330 5,196 715 5,955 NaN 2014

416 rows × 15 columns

After 2014, Voter Turnout Data Column names and values vary more. As a result, we must clean each individual dataset to concatenate

2018

In [490]:
temp_18 = pd.read_csv("./Voter_Turnouts/2018 November General Election - Turnout Rates.csv",
                      names =['Region', 'Estimated or Actual 2018 Total Ballots Counted VEP Turnout Rate', '2018 Vote for Highest Office VEP Turnout Rate', 'Status', 'Source', 'Estimated or Actual 2018 Total Ballots Counted', '2018 Vote for Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'],
                      skiprows=2,
                      header = None)
temp_18.drop('Source',inplace=True,axis=1)
temp_18.drop('Status',inplace=True,axis=1)
temp_18.drop('State Abv',inplace=True,axis=1)
csv_final.drop('VAP Highest Office',inplace=True,axis=1)
csv_final.columns
temp_18.columns = ['Region', 'VEP Total Ballots Counted', 'VEP Highest Office',
       'Total Ballots Counted', 'Highest Office',
       'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)',
       '% Non-citizen', 'Prison', 'Probation', 'Parole',
       'Total Ineligible Felon', 'Overseas Eligible']
temp_18['Year']=2018
temp_18 = temp_18.iloc[:52]
csv_final = pd.concat([csv_final,temp_18],ignore_index = True)
csv_final
Out[490]:
Region VEP Total Ballots Counted VEP Highest Office Total Ballots Counted Highest Office Voting-Eligible Population (VEP) Voting-Age Population (VAP) % Non-citizen Prison Probation Parole Total Ineligible Felon Overseas Eligible Year
0 United States 55.3% 54.2% 107,390,107 105,375,486 194,331,436 210,623,408 7.7% 1,377,013 2,339,388 536,039 3,082,746 2,937,000 2000
1 Alabama NaN 51.6% NaN 1,672,551 3,241,682 3,334,576 1.5% 26,225 40,178 5,484 51,798 NaN 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
466 Wisconsin 61.4% 61.4% 2,675,000 2,673,308 4,354,527 4,563,564 3.1% 22,889 44,489 20,401 68,649 NaN 2018
467 Wyoming 47.9% 47.4% 205,275 203,420 428,898 445,747 2.5% 2,323 4,666 842 5,825 NaN 2018

468 rows × 14 columns

In [491]:
l = 'Region,Source,Status,Total Ballots Counted (Estimate),Vote for Highest Office (President),VEP Turnout Rate (Total Ballots Counted),VEP Turnout Rate (Highest Office),Voting-Eligible Population (VEP),Voting-Age Population (VAP),% Non-citizen,Prison,Probation,Parole,Total Ineligible Felon,Overseas Eligible,State Abv'
lis = l.split(',')
print(lis)
['Region', 'Source', 'Status', 'Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 'VEP Turnout Rate (Total Ballots Counted)', 'VEP Turnout Rate (Highest Office)', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv']

2016

In [492]:
temp_16 = pd.read_csv("./Voter_Turnouts/2016 November General Election - Turnout Rates.csv",
                      names =['Region', 'State Results Website', 'Status', 'VEP Total Ballots Counted', 'VEP Highest Office', 'VAP Highest Office', 'Total Ballots Counted (Estimate)', 'Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'],
                      skiprows=2,
                      header = None)
temp_16.drop('Status',inplace=True,axis=1)
temp_16.drop('State Results Website',inplace=True,axis=1)
temp_16.drop('State Abv',inplace=True,axis=1)
temp_16.drop('VAP Highest Office',inplace=True,axis=1)
temp_16['Year']=2016
temp_16.columns = csv_final.columns
temp_16 = temp_16.iloc[:52]
csv_final = pd.concat([csv_final,temp_16],ignore_index = True)
csv_final
Out[492]:
Region VEP Total Ballots Counted VEP Highest Office Total Ballots Counted Highest Office Voting-Eligible Population (VEP) Voting-Age Population (VAP) % Non-citizen Prison Probation Parole Total Ineligible Felon Overseas Eligible Year
0 United States 55.3% 54.2% 107,390,107 105,375,486 194,331,436 210,623,408 7.7% 1,377,013 2,339,388 536,039 3,082,746 2,937,000 2000
1 Alabama NaN 51.6% NaN 1,672,551 3,241,682 3,334,576 1.5% 26,225 40,178 5,484 51,798 NaN 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
518 Wisconsin NaN 69.5% NaN 2,976,150 4,285,071 4,495,783 3.2% 22,889 44,489 20,401 68,649 NaN 2016
519 Wyoming 60.2% 59.5% 258,788 255,849 429,682 446,396 2.4% 2,323 4,666 842 5,825 NaN 2016

520 rows × 14 columns

2020

In [493]:
temp_20 = pd.read_csv("./Voter_Turnouts/2020 November General Election - Turnout Rates.csv",
                      names =['Region', 'Source', 'Status', 'Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 'VEP Turnout Rate (Total Ballots Counted)', 'VEP Turnout Rate (Highest Office)', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'],
                      skiprows=2,
                      header = None)
temp_20.drop('Source',inplace=True,axis=1)
temp_20.drop('Status',inplace=True,axis=1)
temp_20.drop('State Abv',inplace=True,axis=1)
temp_20 = temp_20[['Region','VEP Turnout Rate (Total Ballots Counted)','VEP Turnout Rate (Highest Office)','Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible']]
temp_20['Year'] = 2020
temp_20.columns = csv_final.columns
temp_20 = temp_20.iloc[:52]
csv_final = pd.concat([csv_final,temp_20],ignore_index = True)
csv_final
Out[493]:
Region VEP Total Ballots Counted VEP Highest Office Total Ballots Counted Highest Office Voting-Eligible Population (VEP) Voting-Age Population (VAP) % Non-citizen Prison Probation Parole Total Ineligible Felon Overseas Eligible Year
0 United States 55.3% 54.2% 107,390,107 105,375,486 194,331,436 210,623,408 7.7% 1,377,013 2,339,388 536,039 3,082,746 2,937,000 2000
1 Alabama NaN 51.6% NaN 1,672,551 3,241,682 3,334,576 1.5% 26,225 40,178 5,484 51,798 NaN 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
570 Wisconsin 75.8% 75.5% 3,310,000 3,298,041 4,368,530 4,586,746 3.2% 23,574 42,909 21,015 71,193 NaN 2020
571 Wyoming 64.6% 64.2% 278,503 276,765 431,364 447,915 2.2% 2,488 5,383 934 6,759 NaN 2020

572 rows × 14 columns

In [ ]:
 
In [494]:
states_cleaned = []
for e in csv_final.Region:
    e = str(e).replace('*','')
    states_cleaned.append(e)
csv_final.Region = states_cleaned

pd.unique(csv_final.Region)
Out[494]:
array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

Radicalized Individuals in the United States

This data set is collected on the individual level. Because we are examining trends on the state level, we will save this Data grouped to the individuals' origin states.

1. Loading the PIRUS Data

The PIRUS data measures 145 categorical and quantitative variables that do not load nicely into COLAB. We have taken the first header line from the CSV an split it into a list that can be passed as column names for the CSV.

In [495]:
a = "Subject_ID,Loc_Plot_State1,Loc_Plot_City1,Loc_Plot_State2,Loc_Plot_City2,Date_Exposure,Plot_Target1,Plot_Target2,Plot_Target3,Attack_Preparation,Op_Security,Changing_Target,Anticp_Fatals_Targ,Internet_Use_Plot,Extent_Plot,Violent,Criminal_Severity,Criminal_Charges,Indict_Arrest,Current_Status,Group_Membership,Terrorist_Group_Name1,Terrorist_Group_Name2,Terrorist_Group_Name3,Actively_Recruited,Recruiter1,Recruiter2,Recruiter3,Actively_Connect,Group_Competition,Role_Group,Length_Group,Clique,Clique_Radicalize,Clique_Connect,Internet_Radicalization,Media_Radicalization,Social_Media,Social_Media_Frequency,Social_Media_Platform1,Social_Media_Platform2,Social_Media_Platform3,Social_Media_Platform4,Social_Media_Platform5,Social_Media_Activities1,Social_Media_Activities2,Social_Media_Activities3,Social_Media_Activities4,Social_Media_Activities5,Social_Media_Activities6,Social_Media_Activities7,Radicalization_Islamist,Radicalization_Far_Right,Radicalization_Far_Left,Radicalization_Single_Issue,Ideological_Sub_Category1,Ideological_Sub_Category2,Ideological_Sub_Category3,Loc_Habitation_State1,Loc_Habitation_City1,Loc_Habitation_State2,Loc_Habitation_City2,Itinerant,External_Rad,Rad_duration,Radical_Behaviors,Radical_Beliefs,US_Govt_Leader,Foreign_Govt_Leader,Event_Influence1,Event_Influence2,Event_Influence3,Event_Influence4,Beliefs_Trajectory,Behaviors_Trajectory,Radicalization_Sequence,Radicalization_Place,Prison_Radicalize,Broad_Ethnicity,Age,Marital_Status,Children,Age_Child,Gender,Religious_Background,Convert,Convert_Date,Reawakening,Reawakening_Date,Citizenship,Residency_Status,Nativity,Time_US_Months,Immigrant_Generation,Immigrant_Source,Language_English,Diaspora_Ties,Education,Student,Education_Change,Employment_Status,Change_Performance,Work_History,Military,Foreign_Military,Social_Stratum_Childhood,Social_Stratum_Adulthood,Aspirations,Abuse_Child,Abuse_Adult,Abuse_type1,Abuse_Type2,Abuse_Type3,Psychological,Alcohol_Drug,Absent_Parent,Overseas_Family,Close_Family,Family_Religiosity,Family_Ideology,Family_Ideological_Level,Prison_Family_Friend,Crime_Family_Friend,Radical_Friend,Radical_Family,Radical_Signif_Other,Relationship_Troubles,Platonic_Troubles,Unstructured_Time,Friendship_Source1,Friendship_Source2,Friendship_Source3,Kicked_Out,Previous_Criminal_Activity,Previous_Criminal_Activity_Type1,Previous_Criminal_Activity_Type2,Previous_Criminal_Activity_Type3,Previous_Criminal_Activity_Age,Gang,Gang_Age_Joined,Trauma,Other_Ideologies,Angry_US,Group_Grievance,Standing"
def listify(mis_string):
  return mis_string.split(",")
pirus_headlist = listify(a)
print(pirus_headlist)
['Subject_ID', 'Loc_Plot_State1', 'Loc_Plot_City1', 'Loc_Plot_State2', 'Loc_Plot_City2', 'Date_Exposure', 'Plot_Target1', 'Plot_Target2', 'Plot_Target3', 'Attack_Preparation', 'Op_Security', 'Changing_Target', 'Anticp_Fatals_Targ', 'Internet_Use_Plot', 'Extent_Plot', 'Violent', 'Criminal_Severity', 'Criminal_Charges', 'Indict_Arrest', 'Current_Status', 'Group_Membership', 'Terrorist_Group_Name1', 'Terrorist_Group_Name2', 'Terrorist_Group_Name3', 'Actively_Recruited', 'Recruiter1', 'Recruiter2', 'Recruiter3', 'Actively_Connect', 'Group_Competition', 'Role_Group', 'Length_Group', 'Clique', 'Clique_Radicalize', 'Clique_Connect', 'Internet_Radicalization', 'Media_Radicalization', 'Social_Media', 'Social_Media_Frequency', 'Social_Media_Platform1', 'Social_Media_Platform2', 'Social_Media_Platform3', 'Social_Media_Platform4', 'Social_Media_Platform5', 'Social_Media_Activities1', 'Social_Media_Activities2', 'Social_Media_Activities3', 'Social_Media_Activities4', 'Social_Media_Activities5', 'Social_Media_Activities6', 'Social_Media_Activities7', 'Radicalization_Islamist', 'Radicalization_Far_Right', 'Radicalization_Far_Left', 'Radicalization_Single_Issue', 'Ideological_Sub_Category1', 'Ideological_Sub_Category2', 'Ideological_Sub_Category3', 'Loc_Habitation_State1', 'Loc_Habitation_City1', 'Loc_Habitation_State2', 'Loc_Habitation_City2', 'Itinerant', 'External_Rad', 'Rad_duration', 'Radical_Behaviors', 'Radical_Beliefs', 'US_Govt_Leader', 'Foreign_Govt_Leader', 'Event_Influence1', 'Event_Influence2', 'Event_Influence3', 'Event_Influence4', 'Beliefs_Trajectory', 'Behaviors_Trajectory', 'Radicalization_Sequence', 'Radicalization_Place', 'Prison_Radicalize', 'Broad_Ethnicity', 'Age', 'Marital_Status', 'Children', 'Age_Child', 'Gender', 'Religious_Background', 'Convert', 'Convert_Date', 'Reawakening', 'Reawakening_Date', 'Citizenship', 'Residency_Status', 'Nativity', 'Time_US_Months', 'Immigrant_Generation', 'Immigrant_Source', 'Language_English', 'Diaspora_Ties', 'Education', 'Student', 'Education_Change', 'Employment_Status', 'Change_Performance', 'Work_History', 'Military', 'Foreign_Military', 'Social_Stratum_Childhood', 'Social_Stratum_Adulthood', 'Aspirations', 'Abuse_Child', 'Abuse_Adult', 'Abuse_type1', 'Abuse_Type2', 'Abuse_Type3', 'Psychological', 'Alcohol_Drug', 'Absent_Parent', 'Overseas_Family', 'Close_Family', 'Family_Religiosity', 'Family_Ideology', 'Family_Ideological_Level', 'Prison_Family_Friend', 'Crime_Family_Friend', 'Radical_Friend', 'Radical_Family', 'Radical_Signif_Other', 'Relationship_Troubles', 'Platonic_Troubles', 'Unstructured_Time', 'Friendship_Source1', 'Friendship_Source2', 'Friendship_Source3', 'Kicked_Out', 'Previous_Criminal_Activity', 'Previous_Criminal_Activity_Type1', 'Previous_Criminal_Activity_Type2', 'Previous_Criminal_Activity_Type3', 'Previous_Criminal_Activity_Age', 'Gang', 'Gang_Age_Joined', 'Trauma', 'Other_Ideologies', 'Angry_US', 'Group_Grievance', 'Standing']
In [496]:
pirus_temp = pd.read_csv("./PIRUS_May2020/PIRUS_Public_May2020.csv",
                         header=1,
                         names = pirus_headlist)
pirus_temp
Out[496]:
Subject_ID Loc_Plot_State1 Loc_Plot_City1 Loc_Plot_State2 Loc_Plot_City2 Date_Exposure Plot_Target1 Plot_Target2 Plot_Target3 Attack_Preparation ... Previous_Criminal_Activity_Type2 Previous_Criminal_Activity_Type3 Previous_Criminal_Activity_Age Gang Gang_Age_Joined Trauma Other_Ideologies Angry_US Group_Grievance Standing
0 4857 New York -99 NaN NaN 1/1/49 -88 NaN NaN -88 ... NaN NaN -99 0 -88 3 0 1 -99 -99
1 5803 Alabama Birmingham NaN NaN 1/1/49 -88 NaN NaN -88 ... NaN NaN -88 0 -88 -99 0 -99 2 -99
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2223 1374 California Los Angeles NaN NaN 11/26/18 14 NaN NaN 1 ... NaN NaN -99 0 -88 -99 0 -99 -99 -99
2224 8295 Ohio Toledo NaN NaN 12/10/18 3 14.0 15.0 2 ... NaN NaN 2 0 -88 -99 0 1 3 -99

2225 rows × 145 columns

2. Filtering and Grouping By States</h2>

We will only examine radicalized individual since 2000 due to the fact that voter data comes from the 2000-2020 years, and protest data comes from the 2017-2021 years. (This data starts in 1948 and goes through 2018.)

We will also use value_counts to determine the different states that from which the radicalized individual originate from. We are assuming that this is also the state in which the individual is mostly likely to engage in popular protest and vote.</p>

In [497]:
#Date_Exposure is not comparable because it is of dtype string.
#Create column 'Year' of int values that represents the last 2 digit of the year.
l = []
for val in pirus_temp['Date_Exposure']:
  a = val.split('/')
  b = a[-1:]
  l.append(int(b[0]))
pirus_temp['Year'] = l
#Any row with 'Year' under 22 occured in the 2000's and is in the scope of this study
pirus_temp = pirus_temp[pirus_temp['Year'] <= 22]
#Group by state
pirus_states_since_2000 = pirus_temp.value_counts('Loc_Habitation_State1')
pirus_states_since_2000.plot(kind='bar', figsize=(20, 10))
Out[497]:
<AxesSubplot:xlabel='Loc_Habitation_State1'>

Protests in the United States

This data set is collected on the event level. Because we are examining trends on the state level, we will save this Data grouped to the protest event location. It is worth noting that popular protest often spreads. This data is harvested by webcrawling for news articles and similar media referencing the protest to a location. Therefore, protests that happened in wave, such as those in response to George Floyd's murder, will appearch multiple times. However, we will still count these as separate events, even if such events are comorbid.

Because this data set only covers 4 years, we do not have to filter it. We will only group it by state.

In [498]:
protests_temp = pd.read_csv("./Protests/data.csv",
                         header = 1,
                         names = ['Date','Location','Attendees',
                         'Event (legacy; see tags)','Tags',
                         'Curated','Source','Total_Articles'])
protests_temp.head(20)
Out[498]:
Date Location Attendees Event (legacy; see tags) Tags Curated Source Total_Articles
0 2017-01-16 Johnson City, TN 300.0 Civil Rights Civil Rights; For racial justice; Martin Luthe... Yes http://www.johnsoncitypress.com/Local/2017/01/... 4
1 2017-01-16 Indianapolis, IN 20.0 Environment Environment; For wilderness preservation Yes http://wishtv.com/2017/01/16/nature-groups-pro... 1
... ... ... ... ... ... ... ... ...
18 2017-01-20 Richmond, VA 2000.0 Executive (Inauguration March) Executive; Against 45th president Yes http://richmondfreepress.com/news/2017/jan/20/... 2
19 2017-01-20 Madison, WI 100.0 Executive Executive; Against 45th president Yes http://www.channel3000.com/news/politics/peace... 1

20 rows × 8 columns

In [499]:
#Must create state attribute to groupby state, similar to extracting year, but first create dictionary matching statges to abbreviation.
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","District of Columbia","Delaware","Florida","Georgia","Guam","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Puerto Rico","Rhode Island","South Carolina","South Dakota","Tennessee","Texas","United States","Utah", "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]
abbrev = ["AL","AK","AZ","AR","CA","CO","CT","DC","DE","FL","GA","GU","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","PR","RI","SC","SD","TN","TX","US","UT","VT","VA","WA","WV","WI","WY"]
states_dict = {}
i = 0
for name in abbrev:
  states_dict[name] = states[i]
  i += 1
print(states_dict)
#create list of state names
protests_temp
{'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California', 'CO': 'Colorado', 'CT': 'Connecticut', 'DC': 'District of Columbia', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia', 'GU': 'Guam', 'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa', 'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland', 'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana', 'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico', 'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'PR': 'Puerto Rico', 'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'US': 'United States', 'UT': 'Utah', 'VT': 'Vermont', 'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming'}
Out[499]:
Date Location Attendees Event (legacy; see tags) Tags Curated Source Total_Articles
0 2017-01-16 Johnson City, TN 300.0 Civil Rights Civil Rights; For racial justice; Martin Luthe... Yes http://www.johnsoncitypress.com/Local/2017/01/... 4
1 2017-01-16 Indianapolis, IN 20.0 Environment Environment; For wilderness preservation Yes http://wishtv.com/2017/01/16/nature-groups-pro... 1
... ... ... ... ... ... ... ... ...
38094 2021-01-31 Salt Lake City, UT NaN Other Other; Against deregulation; Business No https://www.abc4.com/news/local-news/crowds-ga... 1
38095 2021-01-31 San Francisco, CA 100.0 Other Other; Against hazardous conditions; Prisons; ... Yes https://www.mercurynews.com/2021/01/31/activis... 1

38096 rows × 8 columns

In [500]:
#Create a list that can be added as a column to the DataFrame, representing the locatiion the protest took place in.
l = []
for val in protests_temp['Location']:
  m = val.split(',')
  if len(m) >= 2:
    n = m[-1][-2:]
    state = states_dict[n.upper()]
    l.append(state)
  else: #accounting for abnormal cases. implementation based on printing individual cases
    if m == 'La Porte County Courthouse in La Porte':
      l.append('Indiana')
    if m == 'Space':
      l.append('New York')
    if n == 'WA':
      l.append('Washington')
    if n == 'DE':
      l.append('Delaware')
protests_temp['State'] = l
#protests_temp
In [501]:
protests_by_state = protests_temp.value_counts('State')
protests_by_state.plot.barh(figsize=(10,15))
Out[501]:
<AxesSubplot:ylabel='State'>

From this bar chart, we can see that California has the highest number of protests. California also had the highest number of radicalized individuals.

In [502]:
protests_by_state.mean()
Out[502]:
718.7924528301887

We can also see that the mean number of protests by state is 718.

Population Data

Source: Census Bureau. Notes: The years 2020 and 2021 were in different files, so had to join them.

In [503]:
pop20_21 = pd.read_csv('./Population/2020-2021 Census Bureau Population.csv')
#Rename columns due to header reading error
pop20_21.rename(columns={'Population Estimate\n (as of July 1)':'2020','Unnamed: 3':'2021'},inplace=True)
#Drop the first 6 rows becausethey are aggregates
pop20_21 = pop20_21.iloc[6:]
list1 = []
for i in pop20_21['Geographic Area']:
  i = i[1:]
  list1.append(i)
pop20_21['Geographic Area'] = list1
pop20_21.head()
Out[503]:
Geographic Area April 1, 2020 Estimates Base 2020 2021
6 Alabama 5024279.0 5024803.0 5039877.0
7 Alaska 733391.0 732441.0 732673.0
8 Arizona 7151502.0 7177986.0 7276316.0
9 Arkansas 3011524.0 3012232.0 3025891.0
10 California 39538223.0 39499738.0 39237836.0
In [504]:
pop10_19 = pd.read_csv('./Population/nst-est2019-01.csv')
#Rename columns due to header reading error
pop10_19.rename(columns={'Population Estimate (as of July 1)':'2010','Unnamed: 2':'Estimates Base','Unnamed: 4':'2011','Unnamed: 5':'2012','Unnamed: 5':'2012','Unnamed: 6':'2013','Unnamed: 7':'2014','Unnamed: 8':'2015','Unnamed: 9':'2016','Unnamed: 10':'2017','Unnamed: 11':'2018','Unnamed: 12': '2019'},inplace=True)
#Drop the first 6 rows becausethey are aggregates
pop10_19 = pop10_19.iloc[6:]
list1 = []
for i in pop10_19['Geographic Area']:
  i = i[1:]
  list1.append(i)
pop10_19['Geographic Area'] = list1
pop10_19.head()
Out[504]:
Geographic Area April 1, 2010 Estimates Base 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
6 Alabama 4779736.00 4780125.00 4785437.0 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0
7 Alaska 710231.00 710249.00 713910.0 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0
8 Arizona 6392017.00 6392288.00 6407172.0 6472643.0 6554978.0 6632764.0 6730413.0 6829676.0 6941072.0 7044008.0 7158024.0 7278717.0
9 Arkansas 2915918.00 2916031.00 2921964.0 2940667.0 2952164.0 2959400.0 2967392.0 2978048.0 2989918.0 3001345.0 3009733.0 3017804.0
10 California 37253956.00 37254519.00 37319502.0 37638369.0 37948800.0 38260787.0 38596972.0 38918045.0 39167117.0 39358497.0 39461588.0 39512223.0
In [505]:
total_population = pop10_19
total_population['2020'] = pop20_21['2020']
total_population['2021'] = pop20_21['2021']
total_population.drop(['April 1, 2010','Estimates Base'],inplace=True,axis=1)
In [506]:
total_population.columns
Out[506]:
Index(['Geographic Area', '2010', '2011', '2012', '2013', '2014', '2015',
       '2016', '2017', '2018', '2019', '2020', '2021'],
      dtype='object')
In [507]:
def df_creation(row):
  ret_val = pd.DataFrame()
  ret_val['Population'] = list(row)[1:]
  ret_val['Year'] = total_population.columns[1:]
  return ret_val

S_pop = {}
for index, row in total_population.iterrows():
  S_pop[list(row)[0]] = df_creation(row)
In [508]:
ax=S_pop['Alabama'].plot(x='Year',y='Population')
In [509]:
total_population
Out[509]:
Geographic Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
6 Alabama 4785437.0 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0 5024803.0 5039877.0
7 Alaska 713910.0 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0 732441.0 732673.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
55 Wisconsin 5690475.0 5705288.0 5719960.0 5736754.0 5751525.0 5760940.0 5772628.0 5790186.0 5807406.0 5822434.0 5892323.0 5895908.0
56 Wyoming 564487.0 567299.0 576305.0 582122.0 582531.0 585613.0 584215.0 578931.0 577601.0 578759.0 577267.0 578803.0

51 rows × 13 columns

In [510]:
per_population_growth = total_population.copy()
years = [2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021]
for i in range(len(years)):
    per_population_growth[str(years[len(years) - i - 1])] = total_population[str(years[len(years)- i- 1])]/total_population['2010']-1

per_population_growth.drop(['2010'],inplace=True,axis=1)
per_population_growth.set_index('Geographic Area').transpose().plot(figsize=(10,15))
plt.legend()
#plt.yscale("log")
plt.xlabel("Years")
plt.ylabel("Population")
plt.title("Population Growth Over Time For Each State")
plt.grid(linestyle=':')

handles, labels = plt.gca().get_legend_handles_labels()
order = per_population_growth['2021'].sort_values(ascending=False).keys()
order = order-6

plt.legend([handles[idx] for idx in order],[labels[idx] for idx in order],bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
Out[510]:
<matplotlib.legend.Legend at 0x2a0647a60>

Percent Population Growth by State

Above is the percentage of population growth of each state based on its initial population in 2010. Each state starts on the value 1 for the year 2010, so 2010 was not included. The legend is sorted by the max value at the end, so it's easier to compare each state's line, and also see which state proportionally grew the most over the decade. This is important because the population growth of a state will affect the number of protests and the number of radicalized individuals, and therefore the number of protests per radicalized individual.

In [511]:
pd.pivot_table(total_population, index='Geographic Area').head()
Out[511]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
Geographic Area
Alabama 4785437.0 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0 5024803.0 5039877.0
Alaska 713910.0 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0 732441.0 732673.0
Arizona 6407172.0 6472643.0 6554978.0 6632764.0 6730413.0 6829676.0 6941072.0 7044008.0 7158024.0 7278717.0 7177986.0 7276316.0
Arkansas 2921964.0 2940667.0 2952164.0 2959400.0 2967392.0 2978048.0 2989918.0 3001345.0 3009733.0 3017804.0 3012232.0 3025891.0
California 37319502.0 37638369.0 37948800.0 38260787.0 38596972.0 38918045.0 39167117.0 39358497.0 39461588.0 39512223.0 39499738.0 39237836.0

Merging Data

Both protest and radicalization measure resistance to social or governmental structures. Therefore, it makes sense to join aspects of the data into a very simple table to compare radicalization and protest activity. We will not merge this data on the 'State' attribute, because both datasets have such a large number of variables that the table produced would be unweildy.

In [512]:
resistance_data = pd.DataFrame()
resistance_data['Radicalized_num'] = pirus_states_since_2000
resistance_data['Protest_num'] = protests_by_state
resistance_data.head()
Out[512]:
Radicalized_num Protest_num
Loc_Habitation_State1
California 137 4439.0
New York 108 2688.0
Texas 84 1649.0
Florida 83 1823.0
Minnesota 70 747.0

We can look at the relationship now between the number of protests in a state and the number of radicalized individuals in a state. Unsurprisingly, there is a visually obvious correlation.

In [513]:
resistance_data.rename({'Loc_Habitation_State1':'State'},inplace=True)
resistance_data.plot(kind='scatter',
                     y='Radicalized_num',
                     x='Protest_num',
                     ylabel = "Number of People Radicalized",
                     xlabel = "Number of Protests",
                     figsize=(10,8),
                     alpha=0.4,
                     color='purple',
                     s=30)
Out[513]:
<AxesSubplot:xlabel='Number of Protests', ylabel='Number of People Radicalized'>

We can compute the correlation between these two variables as follows:

In [514]:
resistance_data['Protest_num'].corr(resistance_data['Radicalized_num'])
Out[514]:
0.8996117793328046

This is a significant, but unsurprising correlation. We can represent the population size of the state through the dot size.

In [515]:
resistance_data_merged = resistance_data.reset_index().rename(columns={'Loc_Habitation_State1':'State'}).merge(total_population.rename(columns={'Geographic Area':'State',"2021":"Population"})[["State","Population"]],on='State', how="right").set_index("State")
resistance_data_merged.head()
Out[515]:
Radicalized_num Protest_num Population
State
Alabama 24.0 281.0 5039877.0
Alaska 8.0 252.0 732673.0
Arizona 36.0 563.0 7276316.0
Arkansas 7.0 174.0 3025891.0
California 137.0 4439.0 39237836.0
In [516]:
resistance_data_merged["Population"]
Out[516]:
State
Alabama      5039877.0
Alaska        732673.0
               ...    
Wisconsin    5895908.0
Wyoming       578803.0
Name: Population, Length: 51, dtype: float64
In [517]:
resistance_data_merged.plot(kind='scatter',
                     y='Radicalized_num',
                     x='Protest_num',
                     ylabel = "Number of People Radicalized",
                     xlabel = "Number of Protests",
                     title="Number of People Radicalized vs Number of Protests",
                     figsize=(10,8),
                     alpha=0.4,
                     color='purple',
                     #s = resistance_data_merged['Population']
                     s=resistance_data_merged["Population"].apply({lambda x: x/1e4}),
                     )
plt.xscale("log")
plt.yscale("log")
x_vals = list(resistance_data_merged.reset_index()["Protest_num"])
y_vals = list(resistance_data_merged.reset_index()["Radicalized_num"])
states = list(resistance_data_merged.reset_index()["State"])
x_vals.pop(41)
y_vals.pop(41)
states.pop(41) # South Dakota has nan values
for i in range(len(x_vals)):
    plt.text(x_vals[i], y_vals[i], states[i], fontsize=8)

Protests and Radicalized Individuals Based On State

One flaw with the graph above using the population value was that it essentially meant nothing. When comparing large sets of data belonging to different areas, it's important to normalize them by some factor, so that the data is proportional. After normalizing the data, a very interesting but telling picture shows. Therefore, it is unsurprising to see the largest states have both the most radicalized individuals as well as protests, so a better perspective would be to normalize each of those values based on population. Below will give a better view on each state's participation in politics

In [518]:
resistance_data_normalized = resistance_data_merged.copy()
resistance_data_normalized["Protest_num"] = resistance_data_normalized["Protest_num"]/resistance_data_normalized["Population"]
resistance_data_normalized["Radicalized_num"] = resistance_data_normalized["Radicalized_num"]/resistance_data_normalized["Population"]
resistance_data_normalized.plot(kind='scatter',
                     y='Radicalized_num',
                     x='Protest_num',
                     ylabel = "Number of People Radicalized (Normalized by Population)",
                     xlabel = "Number of Protests (Normalized by Population)",
                     title="Number of People Radicalized vs Number of Protests (Normalized by Population)",
                     figsize=(10,8),
                     alpha=0.4,
                     color='purple',
                     #s = resistance_data_merged['Population']
                     s=resistance_data_normalized["Population"].apply({lambda x: x/1e4}))
plt.xscale("log")
plt.yscale("log")
x_vals = list(resistance_data_normalized.reset_index()["Protest_num"])
y_vals = list(resistance_data_normalized.reset_index()["Radicalized_num"])
x_vals.pop(41)
y_vals.pop(41)
for i in range(len(x_vals)):
    plt.text(x=x_vals[i], y=y_vals[i], s=states[i], fontsize=7)

Normalized Protests and Radicalized Individuals

DC is in the top right, making it the most participatory "state" in the United States. This makes sense because it's home to the White House, and many protests likely occur here by others outside of DC. The number of protests in proportion to its population as just a city make it key for political involvement. This conclusion can mostly be drawn from the fact that in the graph using just the population values, DC is pretty central in the graph, meaning that it can stand on its own without being normalized.

Models

Model 1: Preface

The Protest data Attendance column is missing values for many events. We will build a model to predict what the attendance would have been based on the issues the protest addressed, the state protest took place in, and the proportion of the radicalized individuals from that state.

First, let's look at what issues people protest about most often.

In [521]:
protests_iss =protests_temp[['Date','Location','Event (legacy; see tags)', 'Attendees','State','Tags']]
protests_iss_known = protests_iss
In [523]:
protests_iss_known.rename(columns={'Event (legacy; see tags)':'Event'},inplace=True)

Tagging System

The person who gathered this protest data did not report the political/social topics of the protest consistently. We will create a new tagging system using regular expressions to search the current tagging system for the most common issues present. This tagging system will build a set of issues that can be searched.

In [524]:
def categorizer(word):
    p_list = [r"\s*([Rr]acial)",
    r'\s*(45)', r"\s*([Gg]un\s[Rr]ights)",r"\s*([Gg]un\s[Cc]ontrol)",
     r"\s*([Oo]ther)", r"\s*([Ee]nvironment)", 
     r"\s*([Ee]ducation)",r'\s*([Hh]ealthcare)',
     r"\s*([Ii]mmigration)",r"\s*([Ee]xecutive)", 
     r"\s*([Ii]nternational\s[Rr]elations)",
     r"\s*([Ll]egislative)",r"\s*([Cc]ivil\s[Rr]ights)"]
    tag_dict = {r"\s*([Rr]acial)":"Racial",
    r'\s*(45)':"45th President", r"\s*([Gg]un\s[Rr]ights)":"Gun Rights",r"\s*([Gg]un\s[Cc]ontrol)":"Gun Control",
     r"\s*([Oo]ther)":'Other', r"\s*([Ee]nvironment)":'Environment', 
     r"\s*([Ee]ducation)":"Education",r'\s*([Hh]ealthcare)':'Healthcare',
     r"\s*([Ii]mmigration)":'Immigration',r"\s*([Ee]xecutive)":'Executive', 
     r"\s*([Ii]nternational\s[Rr]elations)":'International Relations',
     r"\s*([Ll]egislative)":'Legislative',r"\s*([Cc]ivil\s[Rr]ights)":'Civil Rights','[]':'Other'}
    ret_list = set([])
    for w in word.split(';'):
        #print(w)
        for pattern in p_list:
            m = re.search(pattern,w)
            if m != None:
                b = tag_dict[pattern]
                if b not in ret_list:
                    ret_list.add(b)
    return ret_list
In [525]:
events = []
for a in protests_iss_known["Tags"]:
    events.append(categorizer(a))
#eventsssss = protests_iss_known["Tags"].apply({lambda x: categorizer(x)}) # "Racial Injustice" if "Racial Injustice" in x else "Gun Rights" if "Guns" in x else "Other" if "Other" in x else "Environment" if "Environment" in x else "Education" if "Education" in x else "Immigration" if "Immigration" in x else x
protests_iss_known["Event"] = events
Common_Events = protests_iss_known["Event"].value_counts().head(12).keys()
#print(len(protests_iss_known["Event"].value_counts()))
#protests_iss_known["Event"] = protests_iss_known["Event"].apply({lambda x: x if x in Common_Events else "Other"})
In [526]:
def overlapping_value_count(df,return_dict):
    s = df['Event']
    for entry in s:
        l = list(entry)
        for e in l:
            if e in return_dict.keys():
                return_dict[e] += 1
            else:
                return_dict[e] = 1
    ret_val = pd.DataFrame(list(return_dict.items()),index=range(0,len(return_dict.keys())))
    ret_val.columns = ['Tag','Count']
    ret_val.set_index('Tag',inplace=True)
    return ret_val
tag_counts = overlapping_value_count(protests_iss_known,{})
tag_counts.sort_values("Count", ascending=False).plot(y='Count',kind='pie',figsize=(10,10),fontsize=10,legend=True,title='Protest Topics',colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
Out[526]:
<matplotlib.legend.Legend at 0x2ab003fd0>

Notice that this graph does not show the distribution of topics across all events, but across all tags, as each event may have multiple tags in its list.

Just from looking at this chart, it looks like civil rights, racial justice, guns, and immigration are major issues that people protest about..

Let's also look at the relationship between the time of year that the protests occur and the number of attendees. We will have to drop rows that do not have attendees listed, and convert the data column to a datetime object.

In [528]:
protests_iss_known.Date = pd.to_datetime(protests_iss_known.Date)
protests_real_test = protests_iss.query('Attendees != Attendees')
protests_iss_attendees_known = protests_iss.dropna(subset=['Attendees'])
In [529]:
protests_iss_attendees_known.Date.value_counts().plot(figsize=(15,10))
Out[529]:
<AxesSubplot:>
In [530]:
protests_real_test
Out[530]:
Date Location Event Attendees State Tags
2 2017-01-16 Cincinnati, OH {Racial, Civil Rights} NaN Ohio Civil Rights; For racial justice; Martin Luthe...
4 2017-01-19 Washington, DC {45th President, Executive} NaN District of Columbia Executive; Against 45th president
... ... ... ... ... ... ...
38092 2021-01-31 Topeka, KS {Civil Rights} NaN Kansas Civil Rights; For abortion rights
38094 2021-01-31 Salt Lake City, UT {Other} NaN Utah Other; Against deregulation; Business

15061 rows × 6 columns

Model 1: Building the Model</h1>

We previously saved the protests with unknown attendees to the DataFrame protests_real_test. Let's revisit that data.

In [531]:
tag_unknown = overlapping_value_count(protests_real_test,{})
tag_unknown.sort_values("Count", ascending=False).plot(y='Count',kind='pie',figsize=(8,8),colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
Out[531]:
<matplotlib.legend.Legend at 0x2a922d8e0>

We can build a K-nearest neighbor predictor of the number of Attendees at a protest based on the issues the protest addressed, the state that the protest took place in, and the proportion of the radicalized individuals from that state

In [532]:
protests_iss_known
Out[532]:
Date Location Event Attendees State Tags
0 2017-01-16 Johnson City, TN {Racial, Civil Rights} 300.0 Tennessee Civil Rights; For racial justice; Martin Luthe...
1 2017-01-16 Indianapolis, IN {Environment} 20.0 Indiana Environment; For wilderness preservation
... ... ... ... ... ... ...
38094 2021-01-31 Salt Lake City, UT {Other} NaN Utah Other; Against deregulation; Business
38095 2021-01-31 San Francisco, CA {Other} 100.0 California Other; Against hazardous conditions; Prisons; ...

38096 rows × 6 columns

In [533]:
pirus_temp.Date_Exposure = pd.to_datetime(pirus_temp.Date_Exposure)
type(pirus_temp.iloc[0].Date_Exposure)
Out[533]:
pandas._libs.tslibs.timestamps.Timestamp

Here we create some tools to calculate the voter turnout and number of radicalized individuals. We will use these tools to add this information to out protests data frame for the moment in time that the protest occurs.

In [534]:
#How to select out certain Protest issues, when Event attribute is saved to a list:
def issue_search(issue):
    return protests_iss_known[protests_iss_known['Event']&{issue}]
def state_date(row):
    return (row.Date,row.State)
def state_year(row):
    return (row.Date.year,row.State)
def vote_pcnt(tuple):
    #print('tuple',tuple)
    year = tuple[0]
    state = tuple[1]
    if year%2 != 0:
        year -= 1
        #print(year)
    if state not in pd.unique(csv_final.Region):
        return (np.nan)
    line = str(csv_final[(csv_final.Region == state)&(csv_final.Year == year)]['VEP Highest Office'])
    #print(line)
    pcnt = re.search(r'(....%)',line)
    #print(pcnt.groups())
    return float(pcnt[0][:-1])
def get_rads_by_population(tuple):
    date = tuple[0]
    state = tuple[1]
    if state not in pd.unique(total_population['Geographic Area']):
        return ('NaN')
    radicals = pd.DataFrame(pirus_temp[(pirus_temp.Date_Exposure < pd.to_datetime(date))&(pirus_temp.Loc_Plot_State1 == state)&(pirus_temp.Date_Exposure > pd.to_datetime('2000-01-01 00:00:00'))]).size
    population = total_population[total_population['Geographic Area'] == state][str(date.year)]
    return list(population/radicals)[0]
def to_raw(string):
    return fr"{string}"

We will get assign the voter turnout of the most recent election to the protest, and the number of radicalized individuals from 2000 to the date of the protest as the number of radicalized individuals.

In [535]:
votes = ['NaN']*38096
rads = [0]*38096
for e in range(0,38096):
    pcnt = vote_pcnt(state_year(protests_iss_known.iloc[e]))
    votes[e] = pcnt
    var1 = state_date(protests_iss_known.iloc[e])
    rad = get_rads_by_population(var1)
    rads[e]=rad
protests_iss_known['State_voters'] = votes
protests_iss_known['Radicals'] = rads

all_tags= ['Racial','45th President', 'Gun Rights', 'Gun Control', 'Other', 'Environment', 'Education', 'Healthcare', 'Immigration', 'Executive', 'International Relations', 'Legislative', 'Civil Rights', 'Other']
for t in all_tags:
    protests_iss_known[t] = [0]*38096
    for e in list(issue_search(t).index):
        e = int(e)
        protests_iss_known.loc[e,t]=1
In [536]:
protests_iss_known.loc[protests_iss_known.Radicals == inf,'Radicals'] = 0
protests_iss_known.loc[protests_iss_known.Radicals == pd.NA,'Radicals'] = np.nan
In [537]:
protests_real_test = protests_iss_known[protests_iss_known.Attendees.isnull()]
protests_real_test = protests_real_test.dropna(subset =['Radicals'])
protests_real_test = protests_real_test.dropna(subset =['State_voters'])
protests_iss_attendees_known = protests_iss_known.dropna(subset=['Attendees'])
protests_iss_attendees_known = protests_iss_attendees_known.dropna(subset=['State_voters'])
protests_iss_attendees_known = protests_iss_attendees_known.dropna(subset=['Radicals'])
protests_iss_attendees_known = protests_iss_attendees_known.dropna(subset=['Radicals'])
In [538]:
protests_iss_attendees_known.Radicals = protests_iss_attendees_known.Radicals.astype('int')
protests_iss_attendees_known.Attendees = protests_iss_attendees_known.Attendees.astype('int')

While we ultimately want to predict on the protest data into both data frames--with and without known protest attendance--we must first scale the attendance data to be able to predict upon it more accurately. As we see below, the data is heavily left skewed.

In [539]:
percentile = []
threshold = []
for num in pd.unique(protests_iss_attendees_known.Attendees):
  percentile.append(protests_iss_attendees_known[protests_iss_attendees_known.Attendees < num].size/protests_iss_attendees_known.size)
  threshold.append(num)
eval = pd.DataFrame()
eval['Distribution'] = percentile
eval['Threshold'] = threshold
eval.plot(kind='scatter',x='Threshold',y='Distribution',figsize = (15,8),alpha=0.4,title='Attendance Percentile Values')
Out[539]:
<AxesSubplot:title={'center':'Attendance Percentile Values'}, xlabel='Threshold', ylabel='Distribution'>

Looking at protest attendance over time, we can see that there are clear outliers.

In [540]:
protests_iss_attendees_known.Date = pd.to_datetime(protests_iss_attendees_known.Date)
In [541]:
protests_iss_attendees_known.plot(kind='scatter',x='Date',y='Attendees',figsize=(15,8),s=8,c='red',alpha=0.7, title='Protest Attendance by Event 2017-2021')
Out[541]:
<AxesSubplot:title={'center':'Protest Attendance by Event 2017-2021'}, xlabel='Date', ylabel='Attendees'>

We will apply a non-linear scaling.

In [542]:
protests_attendees_known_filtered = protests_iss_attendees_known.loc[protests_iss_attendees_known.Attendees<2500]
In [543]:
protests_attendees_known_filtered.plot(kind='scatter',x='Date',y='Attendees',figsize=(15,8),s=8,c='red',alpha=0.7, title='Protest Attendance by Event 2017-2021')
Out[543]:
<AxesSubplot:title={'center':'Protest Attendance by Event 2017-2021'}, xlabel='Date', ylabel='Attendees'>
In [544]:
from sklearn.metrics import accuracy_score
protests_attendees_known_filtered.Date = protests_iss_attendees_known.Date.astype('int')
feats = ['Date','State','State_voters', 'Racial', '45th President', 'Gun Rights', 'Gun Control',
       'Other', 'Environment', 'Education', 'Healthcare', 'Immigration',
       'Executive', 'International Relations', 'Legislative', 'Civil Rights',
       'Radicals']
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
X_dict = protests_attendees_known_filtered[feats].to_dict(orient="records")
X_train = vec.fit_transform(X_dict)
y = protests_attendees_known_filtered["Attendees"]

# specify the pipeline
kays = []
accuracy = []
cvs = []

for num in range(1,75,2):
  model = KNeighborsRegressor(n_neighbors=num)
  pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
  scores = cross_val_score(pipeline, X_dict, y, 
                         cv=10, scoring='neg_root_mean_squared_error')
  accuracy.append(scores.mean())
  kays.append(num)

for_plot = pd.DataFrame()
for_plot['K-value'] = kays
for_plot['RMSE'] = accuracy
In [545]:
for_plot['F1'] = accuracy
for_plot.rename(columns={'F1':'RMSE'},inplace=True)
for_plot['Class'] = 'Train'
In [546]:
plt1 = for_plot.plot(x='K-value',y='RMSE',figsize=(15,8),ylabel='Negative Root of Mean Squared Error',title="Negative Root of Mean Squared Error for Neighbors Across 10 Folds")
"""plt2 = for_plot[for_plot.Division == 2].plot(x='K-value',y='RMSE',ax=plt1)
plt3 = for_plot[for_plot.Division == 3].plot(x='K-value',y='RMSE',ax=plt1)
plt4 = for_plot[for_plot.Division == 4].plot(x='K-value',y='RMSE',ax=plt1)
plt5 = for_plot[for_plot.Division == 5].plot(x='K-value',y='RMSE',ax=plt1)
plt6 = for_plot[for_plot.Division == 6].plot(x='K-value',y='RMSE',ax=plt1)
plt7 = for_plot[for_plot.Division == 7].plot(x='K-value',y='RMSE',ax=plt1)
plt9 = for_plot[for_plot.Division == 9].plot(x='K-value',y='RMSE',ax=plt1)
plt8 = for_plot[for_plot.Division == 8].plot(x='K-value',y='RMSE',ax=plt1)
plt10 = for_plot[for_plot.Division == 10].plot(x='K-value',y='RMSE',ax=plt1)"""
Out[546]:
"plt2 = for_plot[for_plot.Division == 2].plot(x='K-value',y='RMSE',ax=plt1)\nplt3 = for_plot[for_plot.Division == 3].plot(x='K-value',y='RMSE',ax=plt1)\nplt4 = for_plot[for_plot.Division == 4].plot(x='K-value',y='RMSE',ax=plt1)\nplt5 = for_plot[for_plot.Division == 5].plot(x='K-value',y='RMSE',ax=plt1)\nplt6 = for_plot[for_plot.Division == 6].plot(x='K-value',y='RMSE',ax=plt1)\nplt7 = for_plot[for_plot.Division == 7].plot(x='K-value',y='RMSE',ax=plt1)\nplt9 = for_plot[for_plot.Division == 9].plot(x='K-value',y='RMSE',ax=plt1)\nplt8 = for_plot[for_plot.Division == 8].plot(x='K-value',y='RMSE',ax=plt1)\nplt10 = for_plot[for_plot.Division == 10].plot(x='K-value',y='RMSE',ax=plt1)"

In this graph, we can see that the line tends to flatten out at around 55, so k = 55, would be the most ideal number of neighbors to use.

In [547]:
protests_iss_attendees_known[protests_iss_attendees_known.Attendees < 10000].plot(kind='scatter',x='Date',y='Attendees',figsize=(20,10),s=8,c='red',alpha=0.2)
Out[547]:
<AxesSubplot:xlabel='Date', ylabel='Attendees'>

We see clear spikes in protest participation and protest size, maybe even documenting important dates in the world of politics.

In [548]:
pirus_temp.Date_Exposure = pd.to_datetime(pirus_temp.Date_Exposure)
In [549]:
pirus_temp
Out[549]:
Subject_ID Loc_Plot_State1 Loc_Plot_City1 Loc_Plot_State2 Loc_Plot_City2 Date_Exposure Plot_Target1 Plot_Target2 Plot_Target3 Attack_Preparation ... Previous_Criminal_Activity_Type3 Previous_Criminal_Activity_Age Gang Gang_Age_Joined Trauma Other_Ideologies Angry_US Group_Grievance Standing Year
882 3005 -99 -99 NaN NaN 2000-01-01 -88 NaN NaN -88 ... NaN -88 0 -88 -99 0 1 -99 -99 0
883 3655 Montana -99 NaN NaN 2000-01-01 -88 NaN NaN -88 ... NaN -99 0 -88 -99 0 -99 -99 -99 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2223 1374 California Los Angeles NaN NaN 2018-11-26 14 NaN NaN 1 ... NaN -99 0 -88 -99 0 -99 -99 -99 18
2224 8295 Ohio Toledo NaN NaN 2018-12-10 3 14.0 15.0 2 ... NaN 2 0 -88 -99 0 1 3 -99 18

1327 rows × 146 columns

In [550]:
pirus_temp.set_index('Date_Exposure')
rad_counts = pirus_temp.sort_index().value_counts('Date_Exposure',sort=False)
rad_counts.plot(x='Date_Exposure',figsize=(20,10))

#rad_counts.plot(x='Date_Exposure', y = '0')
Out[550]:
<AxesSubplot:xlabel='Date_Exposure'>

Model 2

Radicalization and Protests Over Time: We will look at the correlation between radicalized individuals and protests over time. Perhaps there are relationships between radicalization on certain issues and more protests on certain issues. For example, we know that internet searches for "Straight pride" peak each year during June, which is Pride Month for LGBTQ+ folks. (https://trends.google.com/trends/explore?date=all&geo=US&q=straight%20pride) Perhaps more discussion around an issue in the form of protests causes more radicalization on the opposing side. We will use time data and issue categories for both radicalized individuals from the PIRUS data and protest events.

As an exploratory exercise, let's plot both the PIRUS and the protests data over time to see the spikes in activity.

Now we can plot protests and radicalization on the same axis, though our protest data only starts at 2017. We will have to filter the radicalization data.

In [551]:
pirus_temp.Year = pd.to_numeric(pirus_temp.Year)
since_17 = pirus_temp.loc[pirus_temp.Year>=17]
since_17.set_index('Date_Exposure')
Out[551]:
Subject_ID Loc_Plot_State1 Loc_Plot_City1 Loc_Plot_State2 Loc_Plot_City2 Plot_Target1 Plot_Target2 Plot_Target3 Attack_Preparation Op_Security ... Previous_Criminal_Activity_Type3 Previous_Criminal_Activity_Age Gang Gang_Age_Joined Trauma Other_Ideologies Angry_US Group_Grievance Standing Year
Date_Exposure
2017-01-01 6610 California Los Molinos Oregon NaN -88 NaN NaN -88 -88 ... NaN -88 0 -88 0 0 0 0 0 17
2017-01-01 6734 Minnesota Minneapolis NaN NaN -88 NaN NaN -88 -88 ... NaN -99 0 -88 -99 0 1 2 0 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-11-26 1374 California Los Angeles NaN NaN 14 NaN NaN 1 -99 ... NaN -99 0 -88 -99 0 -99 -99 -99 18
2018-12-10 8295 Ohio Toledo NaN NaN 3 14.0 15.0 2 2 ... NaN 2 0 -88 -99 0 1 3 -99 18

226 rows × 145 columns

In [552]:
rad_counts2 = since_17.sort_index().value_counts('Date_Exposure',sort=False)
In [553]:
rad_counts2 = rad_counts2.reset_index().rename(columns={"Date_Exposure":"Date",0:"freq"})
In [554]:
merged_data = protests_iss_attendees_known[["Date","Attendees"]].merge(rad_counts2[["Date","freq"]], on='Date', how='inner')
merged_data
Out[554]:
Date Attendees freq
0 2017-01-16 300 9
1 2017-01-16 20 9
... ... ... ...
2947 2018-12-10 300 1
2948 2018-12-10 20 1

2949 rows × 3 columns

Model 2

Radicalization and Protests Over Time: We will look at the correlation between radicalized individuals and protests over time. Perhaps there are relationships between radicalization on certain issues and more protests on certain issues. For example, we know that internet searches for "Straight pride" peak each year during June, which is Pride Month for LGBTQ+ folks. (https://trends.google.com/trends/explore?date=all&geo=US&q=straight%20pride) Perhaps more discussion around an issue in the form of protests causes more radicalization on the opposing side. We will use time data and issue categories for both radicalized individuals from the PIRUS data and protest events.

As an exploratory exercise, let's plot both the PIRUS and the protests data over time to see the spikes in activity.

In [555]:
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
fig.set_size_inches(18.5, 10.5)

ax1.scatter(merged_data["Date"], merged_data["Attendees"], c='blue', s=50, alpha=0.4)
ax2.scatter(merged_data["Date"], merged_data["freq"], c='red', s=50, alpha=0.15)
plt.title("Temporary Title") # title this
ax1.set_xlabel("Date")
ax1.set_ylabel("Attendees")
ax2.set_ylabel("Frequency")

plt.show()

This chart shows the relationship between protest attendance/number and the number of individuals radicalized. We will have to code protests by issue and radicalized individuals by issue to get a better idea of the relationships between radicalization and protests.

Missing Attendance Data

In [556]:
fin = pd.DataFrame(protests_iss_attendees_known.value_counts('State'),columns=['Known'])
In [557]:
fin2 = pd.DataFrame(protests_real_test.value_counts('State'),columns=['Unknown']).join(fin)
#fin['Unknown'] = fin2['Unknown']
In [558]:
fin2.plot(kind = 'scatter',figsize=(15,10),x='Known',y='Unknown')
states = list(fin2.reset_index()["State"])
x_vals = list(fin2.reset_index()["Known"])
y_vals = list(fin2.reset_index()["Unknown"])
'''for i in range(len(x_vals)):
    plt.text(x_vals[i], y_vals[i], states[i], fontsize=8)'''
Out[558]:
'for i in range(len(x_vals)):\n    plt.text(x_vals[i], y_vals[i], states[i], fontsize=8)'

This graph shows that the number of protests with known attendees for in each state is correlated with the number of protests with unknown attendees for each state. So, missing variables are not weighted for one state/

In [559]:
"""def issue_search_df(issue,df):
    return df[df['Event']&{issue}]"""
def autopct(pct):
    return ('%.2f' % pct)
tag_counts_real = overlapping_value_count(protests_real_test,{})
tag_counts_real.sort_values("Count",ascending=False).plot(y='Count',kind='pie',figsize=(10,10),fontsize=10,legend=True,autopct=autopct,title='Protest Topics',colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
Out[559]:
<matplotlib.legend.Legend at 0x17f573970>
In [560]:
tag_counts_known = overlapping_value_count(protests_iss_attendees_known,{})
tag_counts_known.sort_values("Count",ascending=False).plot(y='Count',kind='pie',figsize=(10,10),autopct=autopct,fontsize=10,legend=True,title='Protest Topics',colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
Out[560]:
<matplotlib.legend.Legend at 0x17f4f9f70>

The proportion of protests for each topic follow roughly the same distribution as those for which the the attendance is known, with less that 1% difference for all topics except immigration and gun control (barely).

In [561]:
protests_iss_attendees_known.head()
Out[561]:
Date Location Event Attendees State Tags State_voters Radicals Racial 45th President ... Gun Control Other Environment Education Healthcare Immigration Executive International Relations Legislative Civil Rights
0 2017-01-16 Johnson City, TN {Racial, Civil Rights} 300 Tennessee Civil Rights; For racial justice; Martin Luthe... 51.1 1997 1 0 ... 0 0 0 0 0 0 0 0 0 1
1 2017-01-16 Indianapolis, IN {Environment} 20 Indiana Environment; For wilderness preservation 56.4 2533 0 0 ... 0 0 1 0 0 0 0 0 0 0
3 2017-01-18 Hartford, CT {Healthcare} 300 Connecticut Healthcare; For Planned Parenthood 63.7 4079 0 0 ... 0 0 0 0 1 0 0 0 0 0
7 2017-01-20 Westlake Park, Seattle, WA {45th President, Executive} 100 Washington Executive; Against 45th president 64.7 1303 0 1 ... 0 0 0 0 0 0 1 0 0 0
8 2017-01-20 Columbus, OH {Civil Rights} 2450 Ohio Civil Rights; For women's rights; Women's March 62.9 2101 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 21 columns

In [562]:
protests_iss_attendees_known.set_index('Date')
known_freq = protests_iss_attendees_known.sort_index().value_counts('Date',sort=False)#,columns=['Freq'],index=None)
axis = known_freq.plot(x='Date',figsize=(20,10))

protests_real_test.set_index('Date')
unknown_freq = protests_real_test.sort_index().value_counts('Date',sort=False)#,columns=['Freq'],index=None)
unknown_freq.plot(x='Date',figsize=(20,10),ax=axis)

#known_freq#.plot(x='Date',y="Freq", c='blue', alpha=0.4)


"""pirus_temp.set_index('Date_Exposure')
rad_counts = pirus_temp.sort_index().value_counts('Date_Exposure',sort=False)
rad_counts.plot(x='Date_Exposure',figsize=(20,10))"""
Out[562]:
"pirus_temp.set_index('Date_Exposure')\nrad_counts = pirus_temp.sort_index().value_counts('Date_Exposure',sort=False)\nrad_counts.plot(x='Date_Exposure',figsize=(20,10))"

Protests with recorded attendance spiked at roughly the same time of year as those without.

In [563]:
x =protests_iss_attendees_known.value_counts('State').plot(kind='bar',y='State',color='orange',figsize=(15,8))
protests_real_test.value_counts('State').plot(kind='bar',y='State',ax=x)
Out[563]:
<AxesSubplot:xlabel='State'>
In [564]:
known = 23009
unknown = 15046
fin2['Unknown_pcnt'] = fin2['Unknown']/unknown
fin2['Known_pcnt'] = fin2['Known']/known
fin2.plot(kind='scatter',x='Unknown_pcnt',y='Known_pcnt',figsize=(15,8))
Out[564]:
<AxesSubplot:xlabel='Unknown_pcnt', ylabel='Known_pcnt'>

Redundant of the first graph, to a certain degree

In [565]:
fin2['Known'].corr(fin2['Unknown'])
Out[565]:
0.9799869422463429

So the correlation between the known and unknown number of protests per state is almost 0.98, which is not bad. Additionally, this shows that the data is MAR, so the data is usable and comparable. If it were not MAR, we would have to consider biases and their impact on our data.

In [566]:
protests_attendees_known_filtered.Date = protests_iss_attendees_known.Date.astype('int')
protests_real_test.Date = protests_real_test.Date.astype('int')
pd.options.display.max_rows = 5

feats = ['Date','State','State_voters', 'Racial', '45th President', 'Gun Rights', 'Gun Control',
       'Other', 'Environment', 'Education', 'Healthcare', 'Immigration',
       'Executive', 'International Relations', 'Legislative', 'Civil Rights',
       'Radicals']
X_train_dict = protests_attendees_known_filtered[feats].to_dict(orient="records")
y_train = protests_attendees_known_filtered["Attendees"]

x_new = protests_real_test
X_new_dict=x_new[feats].to_dict(orient="records")

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(X_new_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=55)
model.fit(X_train_sc, y_train)
protests_real_test['predicted_attendeance'] = model.predict(X_new_sc)
In [567]:
protests_attendees_known_filtered.Date = pd.to_datetime(protests_attendees_known_filtered.Date)
protests_real_test.Date = pd.to_datetime(protests_real_test.Date)
fin_axis = protests_attendees_known_filtered.plot(kind="scatter",x="Date",y="Attendees",c='green',alpha=0.4,figsize=(18,10))
protests_real_test.plot(kind="scatter",x="Date",y="predicted_attendeance",ax=fin_axis,alpha=0.4)
Out[567]:
<AxesSubplot:xlabel='Date', ylabel='predicted_attendeance'>

Conclusion

Comparing the radicalized individuals and protest frequency within states leads to eye opening information. There is clearly and visually a correlation between radicalized individuals and protest. An even more interesting piece of information was seeing DC at the forefront of all the political business, which makes perfect sense when taking into account that it's the central place of our government Additionally, the different kinds of protests are clearly important because it's critical that when we are trying to understand the political climate of a state, we need to know what the people are protesting about. The most common issues are civil rights, racial justice, and immigration. These are heated topics, and it's important to know that people are protesting about them. Although there was missing data in the recordance of attendees for protests, it was clear that these datapoints were MAR as they shared similar distributions with the known data, even having a strong correlation with the data that wasn't missing. There were heavy time limitations with this project, and ideally, we would've loved to compare policing stats, such as total budgets in each state as well as protest frequency and attendance. We would've also loved to dive deep into the details of each individual protest and try to properly map them maybe on a graph with lines connecting them to see if there were any patterns. Overall, there are so many more questions to ask and to answer with the path of this project, but for now, we can visualize the political participation graphically.