# If you do not have the folder to begin with:
#from google.colab import drive
#drive.mount('/content/drive')#,force_remount=True)
#%cd /content/drive/MyDrive/MadBeignet.github.io.git
# !git clone git@github.com:MadBeignet/MadBeignet.github.io.git
# !git clone https://github.com/MadBeignet/MadBeignet.github.io
#!ls
#%cd ../../../
# First, mount your Google Drive, change to the course folder, pull the latest changes, and change to the lab folder.
#from google.colab import drive
#drive.mount('/content/drive',force_remount=True)
# %cd /content/MadBeignet.github.io
# !git pull
%cd Data
!ls
PIRUS_May2020 Protests archive_folder Population Voter_Turnouts fancy_website.html
#%cd "drive/MyDrive/MadBeignet.github.io/Data"
# imports
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import numpy as np
from matplotlib.pyplot import figure
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from numpy import inf
pd.set_option('mode.chained_assignment',None)
Team: Merrilee Montgomery and Maddie Wisinski
Website Link: https://madbeignet.github.io/
The team will be looking at the relationship between political participation and political resistance in the United States from 2000 to 2021, by state.
To measure political participation, the team will use voter turnout statistics by state collected by the Election Project. The Election Project website derives all its data from individual state websites.
This project will distinguish between violent and nonviolent political resistance. To measure nonviolent political resistance, this group will use protest frequency and size from Count Love, a group from MIT that began tracking protests amidst the 2017 Women's March. To study violent political resistance, this project will use Profiles of Individual Radicalization in the United States (PIRUS) from the University of Maryland's National Consortium for the Study of Terrorism and Responses to Terrorism (START). The PIRUS dataset contains information about individuals whose radicalization became apparent through their plotting to engage in violent activity.
Election Project: https://www.electproject.org/home
Count Love: https://countlove.org/faq.html
PIRUS: https://www.start.umd.edu/data-tools/profiles-individual-radicalization-united-states-pirus
The Election Project collects voter turnout data for the general elections that occur every two years and publishes it in separate CSVs by year. Here we want to read all by-year files into a single DataFrame. To do so, we must skip the two header rows at the top of each file and tag every row with its election year.
csv_final = pd.read_csv("./Voter_Turnouts/2000 November General Election - Turnout Rates.csv",
                        header=None,
                        skiprows=2)  # first two rows are headers in the CSV
csv_final['Year'] = 2000
l = []  # we will use this to make sure all files loaded
for a in range(2002, 2012, 2):
    csv_temp = pd.read_csv("./Voter_Turnouts/" + str(a) + " November General Election - Turnout Rates.csv",
                           header=None,
                           skiprows=2)  # first two rows are headers in the CSV
    csv_temp['Year'] = a  # tag each file's rows with its election year
    l.append(1)
    csv_final = pd.concat([csv_final, csv_temp], ignore_index=True)
final_df = pd.DataFrame(csv_final)
print(len(l) == 5)  # True if all five files (2002-2010) loaded, False otherwise
final_df.columns = ['Region', 'VEP Total Ballots Counted', 'VEP Highest Office', 'VAP Highest Office', 'Total Ballots Counted', 'Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'Year']
#Rename all columns
final_df
True
Region | VEP Total Ballots Counted | VEP Highest Office | VAP Highest Office | Total Ballots Counted | Highest Office | Voting-Eligible Population (VEP) | Voting-Age Population (VAP) | % Non-citizen | Prison | Probation | Parole | Total Ineligible Felon | Overseas Eligible | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 55.3% | 54.2% | 50.0% | 107,390,107 | 105,375,486 | 194,331,436 | 210,623,408 | 7.7% | 1,377,013 | 2,339,388 | 536,039 | 3,082,746 | 2,937,000 | 2000 |
1 | Alabama | NaN | 51.6% | 50.1% | NaN | 1,672,551 | 3,241,682 | 3,334,576 | 1.5% | 26,225 | 40,178 | 5,484 | 51,798 | NaN | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
310 | Wisconsin | 52.4% | 52.0% | 49.7% | 2,185,021 | 2,171,331 | 4,172,130 | 4,365,214 | 3.2% | 22,724 | 22,602 | 19,572 | 55,112 | NaN | 2010 |
311 | Wyoming | 46.0% | 45.5% | 43.8% | 190,822 | 188,463 | 414,536 | 430,673 | 2.4% | 2,059 | 3,231 | 682 | 5,684 | NaN | 2010 |
312 rows × 15 columns
To concatenate the voter turnout data for 2012 and 2014, we have to remove the abbreviation column and any excess rows (which are usually methodology notes).
l = []  # we will use this to make sure all files loaded
for a in range(2012, 2016, 2):
    csv_temp = pd.read_csv("./Voter_Turnouts/" + str(a) + " November General Election - Turnout Rates.csv",
                           header=None,
                           skiprows=2,
                           names=['Region', 'VEP Total Ballots Counted', 'VEP Highest Office', 'VAP Highest Office', 'Total Ballots Counted', 'Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'])  # first two rows are headers in the CSV
    csv_temp['Year'] = a  # tag each file's rows with its election year
    csv_temp = csv_temp.iloc[:52]  # keep only the US total, the 50 states, and DC
    csv_temp.drop('State Abv', inplace=True, axis=1)
    csv_final = pd.concat([csv_final, csv_temp], ignore_index=True)
csv_final
Region | VEP Total Ballots Counted | VEP Highest Office | VAP Highest Office | Total Ballots Counted | Highest Office | Voting-Eligible Population (VEP) | Voting-Age Population (VAP) | % Non-citizen | Prison | Probation | Parole | Total Ineligible Felon | Overseas Eligible | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 55.3% | 54.2% | 50.0% | 107,390,107 | 105,375,486 | 194,331,436 | 210,623,408 | 7.7% | 1,377,013 | 2,339,388 | 536,039 | 3,082,746 | 2,937,000 | 2000 |
1 | Alabama | NaN | 51.6% | 50.1% | NaN | 1,672,551 | 3,241,682 | 3,334,576 | 1.5% | 26,225 | 40,178 | 5,484 | 51,798 | NaN | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
414 | Wisconsin | 56.9% | 56.6% | 53.9% | 2,422,248 | 2,410,314 | 4,260,427 | 4,454,970 | 3.1% | 22,097 | 46,212 | 20,010 | 67,986 | NaN | 2014 |
415 | Wyoming | 39.7% | 39.0% | 37.3% | 171,153 | 168,390 | 431,434 | 445,626 | 2.7% | 2,330 | 5,196 | 715 | 5,955 | NaN | 2014 |
416 rows × 15 columns
After 2014, the voter turnout column names and values vary more, so we must clean each year's dataset individually before concatenating.
temp_18 = pd.read_csv("./Voter_Turnouts/2018 November General Election - Turnout Rates.csv",
names =['Region', 'Estimated or Actual 2018 Total Ballots Counted VEP Turnout Rate', '2018 Vote for Highest Office VEP Turnout Rate', 'Status', 'Source', 'Estimated or Actual 2018 Total Ballots Counted', '2018 Vote for Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'],
skiprows=2,
header = None)
temp_18.drop(columns=['Source', 'Status', 'State Abv'], inplace=True)
csv_final.drop('VAP Highest Office', inplace=True, axis=1)  # the 2018 file has no VAP turnout column
csv_final.columns
temp_18.columns = ['Region', 'VEP Total Ballots Counted', 'VEP Highest Office',
'Total Ballots Counted', 'Highest Office',
'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)',
'% Non-citizen', 'Prison', 'Probation', 'Parole',
'Total Ineligible Felon', 'Overseas Eligible']
temp_18['Year']=2018
temp_18 = temp_18.iloc[:52]
csv_final = pd.concat([csv_final,temp_18],ignore_index = True)
csv_final
Region | VEP Total Ballots Counted | VEP Highest Office | Total Ballots Counted | Highest Office | Voting-Eligible Population (VEP) | Voting-Age Population (VAP) | % Non-citizen | Prison | Probation | Parole | Total Ineligible Felon | Overseas Eligible | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 55.3% | 54.2% | 107,390,107 | 105,375,486 | 194,331,436 | 210,623,408 | 7.7% | 1,377,013 | 2,339,388 | 536,039 | 3,082,746 | 2,937,000 | 2000 |
1 | Alabama | NaN | 51.6% | NaN | 1,672,551 | 3,241,682 | 3,334,576 | 1.5% | 26,225 | 40,178 | 5,484 | 51,798 | NaN | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
466 | Wisconsin | 61.4% | 61.4% | 2,675,000 | 2,673,308 | 4,354,527 | 4,563,564 | 3.1% | 22,889 | 44,489 | 20,401 | 68,649 | NaN | 2018 |
467 | Wyoming | 47.9% | 47.4% | 205,275 | 203,420 | 428,898 | 445,747 | 2.5% | 2,323 | 4,666 | 842 | 5,825 | NaN | 2018 |
468 rows × 14 columns
#The 2020-era files use this header line; split it into a list of column names for inspection
l = 'Region,Source,Status,Total Ballots Counted (Estimate),Vote for Highest Office (President),VEP Turnout Rate (Total Ballots Counted),VEP Turnout Rate (Highest Office),Voting-Eligible Population (VEP),Voting-Age Population (VAP),% Non-citizen,Prison,Probation,Parole,Total Ineligible Felon,Overseas Eligible,State Abv'
lis = l.split(',')
print(lis)
['Region', 'Source', 'Status', 'Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 'VEP Turnout Rate (Total Ballots Counted)', 'VEP Turnout Rate (Highest Office)', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv']
temp_16 = pd.read_csv("./Voter_Turnouts/2016 November General Election - Turnout Rates.csv",
names =['Region', 'State Results Website', 'Status', 'VEP Total Ballots Counted', 'VEP Highest Office', 'VAP Highest Office', 'Total Ballots Counted (Estimate)', 'Highest Office', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'],
skiprows=2,
header = None)
temp_16.drop(columns=['Status', 'State Results Website', 'State Abv', 'VAP Highest Office'], inplace=True)
temp_16['Year']=2016
temp_16.columns = csv_final.columns
temp_16 = temp_16.iloc[:52]
csv_final = pd.concat([csv_final,temp_16],ignore_index = True)
csv_final
Region | VEP Total Ballots Counted | VEP Highest Office | Total Ballots Counted | Highest Office | Voting-Eligible Population (VEP) | Voting-Age Population (VAP) | % Non-citizen | Prison | Probation | Parole | Total Ineligible Felon | Overseas Eligible | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 55.3% | 54.2% | 107,390,107 | 105,375,486 | 194,331,436 | 210,623,408 | 7.7% | 1,377,013 | 2,339,388 | 536,039 | 3,082,746 | 2,937,000 | 2000 |
1 | Alabama | NaN | 51.6% | NaN | 1,672,551 | 3,241,682 | 3,334,576 | 1.5% | 26,225 | 40,178 | 5,484 | 51,798 | NaN | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
518 | Wisconsin | NaN | 69.5% | NaN | 2,976,150 | 4,285,071 | 4,495,783 | 3.2% | 22,889 | 44,489 | 20,401 | 68,649 | NaN | 2016 |
519 | Wyoming | 60.2% | 59.5% | 258,788 | 255,849 | 429,682 | 446,396 | 2.4% | 2,323 | 4,666 | 842 | 5,825 | NaN | 2016 |
520 rows × 14 columns
temp_20 = pd.read_csv("./Voter_Turnouts/2020 November General Election - Turnout Rates.csv",
names =['Region', 'Source', 'Status', 'Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 'VEP Turnout Rate (Total Ballots Counted)', 'VEP Turnout Rate (Highest Office)', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible', 'State Abv'],
skiprows=2,
header = None)
temp_20.drop(columns=['Source', 'Status', 'State Abv'], inplace=True)
temp_20 = temp_20[['Region','VEP Turnout Rate (Total Ballots Counted)','VEP Turnout Rate (Highest Office)','Total Ballots Counted (Estimate)', 'Vote for Highest Office (President)', 'Voting-Eligible Population (VEP)', 'Voting-Age Population (VAP)', '% Non-citizen', 'Prison', 'Probation', 'Parole', 'Total Ineligible Felon', 'Overseas Eligible']]
temp_20['Year'] = 2020
temp_20.columns = csv_final.columns
temp_20 = temp_20.iloc[:52]
csv_final = pd.concat([csv_final,temp_20],ignore_index = True)
csv_final
Region | VEP Total Ballots Counted | VEP Highest Office | Total Ballots Counted | Highest Office | Voting-Eligible Population (VEP) | Voting-Age Population (VAP) | % Non-citizen | Prison | Probation | Parole | Total Ineligible Felon | Overseas Eligible | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 55.3% | 54.2% | 107,390,107 | 105,375,486 | 194,331,436 | 210,623,408 | 7.7% | 1,377,013 | 2,339,388 | 536,039 | 3,082,746 | 2,937,000 | 2000 |
1 | Alabama | NaN | 51.6% | NaN | 1,672,551 | 3,241,682 | 3,334,576 | 1.5% | 26,225 | 40,178 | 5,484 | 51,798 | NaN | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
570 | Wisconsin | 75.8% | 75.5% | 3,310,000 | 3,298,041 | 4,368,530 | 4,586,746 | 3.2% | 23,574 | 42,909 | 21,015 | 71,193 | NaN | 2020 |
571 | Wyoming | 64.6% | 64.2% | 278,503 | 276,765 | 431,364 | 447,915 | 2.2% | 2,488 | 5,383 | 934 | 6,759 | NaN | 2020 |
572 rows × 14 columns
#Strip footnote asterisks from the state names
csv_final.Region = csv_final.Region.astype(str).str.replace('*', '', regex=False)
pd.unique(csv_final.Region)
array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)
This dataset is collected at the individual level. Because we are examining trends at the state level, we will group this data by the individuals' home states.
The PIRUS data measures 145 categorical and quantitative variables that do not load nicely into Colab. We have taken the first header line from the CSV and split it into a list that can be passed as column names for the CSV.
a = "Subject_ID,Loc_Plot_State1,Loc_Plot_City1,Loc_Plot_State2,Loc_Plot_City2,Date_Exposure,Plot_Target1,Plot_Target2,Plot_Target3,Attack_Preparation,Op_Security,Changing_Target,Anticp_Fatals_Targ,Internet_Use_Plot,Extent_Plot,Violent,Criminal_Severity,Criminal_Charges,Indict_Arrest,Current_Status,Group_Membership,Terrorist_Group_Name1,Terrorist_Group_Name2,Terrorist_Group_Name3,Actively_Recruited,Recruiter1,Recruiter2,Recruiter3,Actively_Connect,Group_Competition,Role_Group,Length_Group,Clique,Clique_Radicalize,Clique_Connect,Internet_Radicalization,Media_Radicalization,Social_Media,Social_Media_Frequency,Social_Media_Platform1,Social_Media_Platform2,Social_Media_Platform3,Social_Media_Platform4,Social_Media_Platform5,Social_Media_Activities1,Social_Media_Activities2,Social_Media_Activities3,Social_Media_Activities4,Social_Media_Activities5,Social_Media_Activities6,Social_Media_Activities7,Radicalization_Islamist,Radicalization_Far_Right,Radicalization_Far_Left,Radicalization_Single_Issue,Ideological_Sub_Category1,Ideological_Sub_Category2,Ideological_Sub_Category3,Loc_Habitation_State1,Loc_Habitation_City1,Loc_Habitation_State2,Loc_Habitation_City2,Itinerant,External_Rad,Rad_duration,Radical_Behaviors,Radical_Beliefs,US_Govt_Leader,Foreign_Govt_Leader,Event_Influence1,Event_Influence2,Event_Influence3,Event_Influence4,Beliefs_Trajectory,Behaviors_Trajectory,Radicalization_Sequence,Radicalization_Place,Prison_Radicalize,Broad_Ethnicity,Age,Marital_Status,Children,Age_Child,Gender,Religious_Background,Convert,Convert_Date,Reawakening,Reawakening_Date,Citizenship,Residency_Status,Nativity,Time_US_Months,Immigrant_Generation,Immigrant_Source,Language_English,Diaspora_Ties,Education,Student,Education_Change,Employment_Status,Change_Performance,Work_History,Military,Foreign_Military,Social_Stratum_Childhood,Social_Stratum_Adulthood,Aspirations,Abuse_Child,Abuse_Adult,Abuse_type1,Abuse_Type2,Abuse_Type3,Psychological,Alcohol_Drug,Absent_Parent,Overseas_Family,Close_Family,Family_Religiosity,Family_Ideology,Family_Ideological_Level,Prison_Family_Friend,Crime_Family_Friend,Radical_Friend,Radical_Family,Radical_Signif_Other,Relationship_Troubles,Platonic_Troubles,Unstructured_Time,Friendship_Source1,Friendship_Source2,Friendship_Source3,Kicked_Out,Previous_Criminal_Activity,Previous_Criminal_Activity_Type1,Previous_Criminal_Activity_Type2,Previous_Criminal_Activity_Type3,Previous_Criminal_Activity_Age,Gang,Gang_Age_Joined,Trauma,Other_Ideologies,Angry_US,Group_Grievance,Standing"
def listify(mis_string):
    return mis_string.split(",")
pirus_headlist = listify(a)
print(pirus_headlist)
['Subject_ID', 'Loc_Plot_State1', 'Loc_Plot_City1', 'Loc_Plot_State2', 'Loc_Plot_City2', 'Date_Exposure', 'Plot_Target1', 'Plot_Target2', 'Plot_Target3', 'Attack_Preparation', 'Op_Security', 'Changing_Target', 'Anticp_Fatals_Targ', 'Internet_Use_Plot', 'Extent_Plot', 'Violent', 'Criminal_Severity', 'Criminal_Charges', 'Indict_Arrest', 'Current_Status', 'Group_Membership', 'Terrorist_Group_Name1', 'Terrorist_Group_Name2', 'Terrorist_Group_Name3', 'Actively_Recruited', 'Recruiter1', 'Recruiter2', 'Recruiter3', 'Actively_Connect', 'Group_Competition', 'Role_Group', 'Length_Group', 'Clique', 'Clique_Radicalize', 'Clique_Connect', 'Internet_Radicalization', 'Media_Radicalization', 'Social_Media', 'Social_Media_Frequency', 'Social_Media_Platform1', 'Social_Media_Platform2', 'Social_Media_Platform3', 'Social_Media_Platform4', 'Social_Media_Platform5', 'Social_Media_Activities1', 'Social_Media_Activities2', 'Social_Media_Activities3', 'Social_Media_Activities4', 'Social_Media_Activities5', 'Social_Media_Activities6', 'Social_Media_Activities7', 'Radicalization_Islamist', 'Radicalization_Far_Right', 'Radicalization_Far_Left', 'Radicalization_Single_Issue', 'Ideological_Sub_Category1', 'Ideological_Sub_Category2', 'Ideological_Sub_Category3', 'Loc_Habitation_State1', 'Loc_Habitation_City1', 'Loc_Habitation_State2', 'Loc_Habitation_City2', 'Itinerant', 'External_Rad', 'Rad_duration', 'Radical_Behaviors', 'Radical_Beliefs', 'US_Govt_Leader', 'Foreign_Govt_Leader', 'Event_Influence1', 'Event_Influence2', 'Event_Influence3', 'Event_Influence4', 'Beliefs_Trajectory', 'Behaviors_Trajectory', 'Radicalization_Sequence', 'Radicalization_Place', 'Prison_Radicalize', 'Broad_Ethnicity', 'Age', 'Marital_Status', 'Children', 'Age_Child', 'Gender', 'Religious_Background', 'Convert', 'Convert_Date', 'Reawakening', 'Reawakening_Date', 'Citizenship', 'Residency_Status', 'Nativity', 'Time_US_Months', 'Immigrant_Generation', 'Immigrant_Source', 'Language_English', 'Diaspora_Ties', 'Education', 'Student', 'Education_Change', 'Employment_Status', 'Change_Performance', 'Work_History', 'Military', 'Foreign_Military', 'Social_Stratum_Childhood', 'Social_Stratum_Adulthood', 'Aspirations', 'Abuse_Child', 'Abuse_Adult', 'Abuse_type1', 'Abuse_Type2', 'Abuse_Type3', 'Psychological', 'Alcohol_Drug', 'Absent_Parent', 'Overseas_Family', 'Close_Family', 'Family_Religiosity', 'Family_Ideology', 'Family_Ideological_Level', 'Prison_Family_Friend', 'Crime_Family_Friend', 'Radical_Friend', 'Radical_Family', 'Radical_Signif_Other', 'Relationship_Troubles', 'Platonic_Troubles', 'Unstructured_Time', 'Friendship_Source1', 'Friendship_Source2', 'Friendship_Source3', 'Kicked_Out', 'Previous_Criminal_Activity', 'Previous_Criminal_Activity_Type1', 'Previous_Criminal_Activity_Type2', 'Previous_Criminal_Activity_Type3', 'Previous_Criminal_Activity_Age', 'Gang', 'Gang_Age_Joined', 'Trauma', 'Other_Ideologies', 'Angry_US', 'Group_Grievance', 'Standing']
pirus_temp = pd.read_csv("./PIRUS_May2020/PIRUS_Public_May2020.csv",
header=1,
names = pirus_headlist)
pirus_temp
Subject_ID | Loc_Plot_State1 | Loc_Plot_City1 | Loc_Plot_State2 | Loc_Plot_City2 | Date_Exposure | Plot_Target1 | Plot_Target2 | Plot_Target3 | Attack_Preparation | ... | Previous_Criminal_Activity_Type2 | Previous_Criminal_Activity_Type3 | Previous_Criminal_Activity_Age | Gang | Gang_Age_Joined | Trauma | Other_Ideologies | Angry_US | Group_Grievance | Standing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4857 | New York | -99 | NaN | NaN | 1/1/49 | -88 | NaN | NaN | -88 | ... | NaN | NaN | -99 | 0 | -88 | 3 | 0 | 1 | -99 | -99 |
1 | 5803 | Alabama | Birmingham | NaN | NaN | 1/1/49 | -88 | NaN | NaN | -88 | ... | NaN | NaN | -88 | 0 | -88 | -99 | 0 | -99 | 2 | -99 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2223 | 1374 | California | Los Angeles | NaN | NaN | 11/26/18 | 14 | NaN | NaN | 1 | ... | NaN | NaN | -99 | 0 | -88 | -99 | 0 | -99 | -99 | -99 |
2224 | 8295 | Ohio | Toledo | NaN | NaN | 12/10/18 | 3 | 14.0 | 15.0 | 2 | ... | NaN | NaN | 2 | 0 | -88 | -99 | 0 | 1 | 3 | -99 |
2225 rows × 145 columns
We will only examine individuals radicalized since 2000, because the voter data covers 2000-2020 and the protest data covers 2017-2021. (The PIRUS data starts in 1948 and goes through 2018.)
We will also use value_counts to determine the states from which the radicalized individuals originate. We are assuming that this is also the state in which each individual is most likely to engage in popular protest and vote.
#Date_Exposure is not comparable because it is of dtype string.
#Create column 'Year' of int values representing the last two digits of the year.
l = []
for val in pirus_temp['Date_Exposure']:
    a = val.split('/')
    l.append(int(a[-1]))
pirus_temp['Year'] = l
#Any row with 'Year' at most 22 occurred in the 2000s and is in the scope of this study
pirus_temp = pirus_temp[pirus_temp['Year'] <= 22]
#Count radicalized individuals by home state
pirus_states_since_2000 = pirus_temp.value_counts('Loc_Habitation_State1')
pirus_states_since_2000.plot(kind='bar', figsize=(20, 10))
<AxesSubplot:xlabel='Loc_Habitation_State1'>
This dataset is collected at the event level. Because we are examining trends at the state level, we will group this data by protest location. It is worth noting that popular protest often spreads: this data is harvested by webcrawling for news articles and similar media that tie a protest to a location, so protests that happened in waves, such as those in response to George Floyd's murder, will appear multiple times. We will still count these as separate events, even when they are related.
Because this dataset only covers four years, we do not have to filter it; we will only group it by state.
protests_temp = pd.read_csv("./Protests/data.csv",
header = 1,
names = ['Date','Location','Attendees',
'Event (legacy; see tags)','Tags',
'Curated','Source','Total_Articles'])
protests_temp.head(20)
Date | Location | Attendees | Event (legacy; see tags) | Tags | Curated | Source | Total_Articles | |
---|---|---|---|---|---|---|---|---|
0 | 2017-01-16 | Johnson City, TN | 300.0 | Civil Rights | Civil Rights; For racial justice; Martin Luthe... | Yes | http://www.johnsoncitypress.com/Local/2017/01/... | 4 |
1 | 2017-01-16 | Indianapolis, IN | 20.0 | Environment | Environment; For wilderness preservation | Yes | http://wishtv.com/2017/01/16/nature-groups-pro... | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
18 | 2017-01-20 | Richmond, VA | 2000.0 | Executive (Inauguration March) | Executive; Against 45th president | Yes | http://richmondfreepress.com/news/2017/jan/20/... | 2 |
19 | 2017-01-20 | Madison, WI | 100.0 | Executive | Executive; Against 45th president | Yes | http://www.channel3000.com/news/politics/peace... | 1 |
20 rows × 8 columns
#Must create a state attribute to group by state, similar to extracting the year, but first create a dictionary matching abbreviations to state names.
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","District of Columbia","Delaware","Florida","Georgia","Guam","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Puerto Rico","Rhode Island","South Carolina","South Dakota","Tennessee","Texas","United States","Utah", "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]
abbrev = ["AL","AK","AZ","AR","CA","CO","CT","DC","DE","FL","GA","GU","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","PR","RI","SC","SD","TN","TX","US","UT","VT","VA","WA","WV","WI","WY"]
states_dict = dict(zip(abbrev, states))  # map each abbreviation to its full state name
print(states_dict)
protests_temp
{'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California', 'CO': 'Colorado', 'CT': 'Connecticut', 'DC': 'District of Columbia', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia', 'GU': 'Guam', 'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa', 'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland', 'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana', 'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico', 'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'PR': 'Puerto Rico', 'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'US': 'United States', 'UT': 'Utah', 'VT': 'Vermont', 'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming'}
Date | Location | Attendees | Event (legacy; see tags) | Tags | Curated | Source | Total_Articles | |
---|---|---|---|---|---|---|---|---|
0 | 2017-01-16 | Johnson City, TN | 300.0 | Civil Rights | Civil Rights; For racial justice; Martin Luthe... | Yes | http://www.johnsoncitypress.com/Local/2017/01/... | 4 |
1 | 2017-01-16 | Indianapolis, IN | 20.0 | Environment | Environment; For wilderness preservation | Yes | http://wishtv.com/2017/01/16/nature-groups-pro... | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
38094 | 2021-01-31 | Salt Lake City, UT | NaN | Other | Other; Against deregulation; Business | No | https://www.abc4.com/news/local-news/crowds-ga... | 1 |
38095 | 2021-01-31 | San Francisco, CA | 100.0 | Other | Other; Against hazardous conditions; Prisons; ... | Yes | https://www.mercurynews.com/2021/01/31/activis... | 1 |
38096 rows × 8 columns
#Create a list that can be added as a column to the DataFrame, representing the state the protest took place in.
l = []
for val in protests_temp['Location']:
    m = val.split(',')
    if len(m) >= 2:
        n = m[-1][-2:]  # last two characters of the final piece are the state abbreviation
        l.append(states_dict[n.upper()])
    else:  # abnormal cases with no comma, identified by printing the individual rows
        if val == 'La Porte County Courthouse in La Porte':
            l.append('Indiana')
        elif val == 'Space':
            l.append('New York')
        elif val.endswith('WA'):
            l.append('Washington')
        elif val.endswith('DE'):
            l.append('Delaware')
protests_temp['State'] = l
#protests_temp
protests_by_state = protests_temp.value_counts('State')
protests_by_state.plot.barh(figsize=(10,15))
<AxesSubplot:ylabel='State'>
From this bar chart, we can see that California has the highest number of protests. California also had the highest number of radicalized individuals.
protests_by_state.mean()
718.7924528301887
We can also see that the mean number of protests per state is about 719.
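Since a few populous states pull the mean up, the median gives a complementary view; a quick check using the series above:
print(protests_by_state.median())  # less sensitive to the populous states than the mean
print(protests_by_state.idxmax())  # confirms which state has the most protests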
pop20_21 = pd.read_csv('./Population/2020-2021 Census Bureau Population.csv')
#Rename columns due to header reading error
pop20_21.rename(columns={'Population Estimate\n (as of July 1)':'2020','Unnamed: 3':'2021'},inplace=True)
#Drop the first 6 rows because they are aggregates
pop20_21 = pop20_21.iloc[6:]
#Strip the stray leading character from each state name
pop20_21['Geographic Area'] = pop20_21['Geographic Area'].str[1:]
pop20_21.head()
Geographic Area | April 1, 2020 Estimates Base | 2020 | 2021 | |
---|---|---|---|---|
6 | Alabama | 5024279.0 | 5024803.0 | 5039877.0 |
7 | Alaska | 733391.0 | 732441.0 | 732673.0 |
8 | Arizona | 7151502.0 | 7177986.0 | 7276316.0 |
9 | Arkansas | 3011524.0 | 3012232.0 | 3025891.0 |
10 | California | 39538223.0 | 39499738.0 | 39237836.0 |
pop10_19 = pd.read_csv('./Population/nst-est2019-01.csv')
#Rename columns due to header reading error
pop10_19.rename(columns={'Population Estimate (as of July 1)':'2010','Unnamed: 2':'Estimates Base','Unnamed: 4':'2011','Unnamed: 5':'2012','Unnamed: 6':'2013','Unnamed: 7':'2014','Unnamed: 8':'2015','Unnamed: 9':'2016','Unnamed: 10':'2017','Unnamed: 11':'2018','Unnamed: 12':'2019'},inplace=True)
#Drop the first 6 rows because they are aggregates
pop10_19 = pop10_19.iloc[6:]
#Strip the stray leading character from each state name
pop10_19['Geographic Area'] = pop10_19['Geographic Area'].str[1:]
pop10_19.head()
Geographic Area | April 1, 2010 | Estimates Base | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | Alabama | 4779736.00 | 4780125.00 | 4785437.0 | 4799069.0 | 4815588.0 | 4830081.0 | 4841799.0 | 4852347.0 | 4863525.0 | 4874486.0 | 4887681.0 | 4903185.0 |
7 | Alaska | 710231.00 | 710249.00 | 713910.0 | 722128.0 | 730443.0 | 737068.0 | 736283.0 | 737498.0 | 741456.0 | 739700.0 | 735139.0 | 731545.0 |
8 | Arizona | 6392017.00 | 6392288.00 | 6407172.0 | 6472643.0 | 6554978.0 | 6632764.0 | 6730413.0 | 6829676.0 | 6941072.0 | 7044008.0 | 7158024.0 | 7278717.0 |
9 | Arkansas | 2915918.00 | 2916031.00 | 2921964.0 | 2940667.0 | 2952164.0 | 2959400.0 | 2967392.0 | 2978048.0 | 2989918.0 | 3001345.0 | 3009733.0 | 3017804.0 |
10 | California | 37253956.00 | 37254519.00 | 37319502.0 | 37638369.0 | 37948800.0 | 38260787.0 | 38596972.0 | 38918045.0 | 39167117.0 | 39358497.0 | 39461588.0 | 39512223.0 |
total_population = pop10_19
total_population['2020'] = pop20_21['2020']  # indices align because both frames kept rows 6 and up
total_population['2021'] = pop20_21['2021']
total_population.drop(['April 1, 2010','Estimates Base'],inplace=True,axis=1)
total_population.columns
Index(['Geographic Area', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'], dtype='object')
def df_creation(row):
    #Build a per-state DataFrame of (Year, Population) pairs from one row of total_population
    ret_val = pd.DataFrame()
    ret_val['Population'] = list(row)[1:]
    ret_val['Year'] = total_population.columns[1:]
    return ret_val
S_pop = {}
for index, row in total_population.iterrows():
    S_pop[list(row)[0]] = df_creation(row)
ax=S_pop['Alabama'].plot(x='Year',y='Population')
total_population
Geographic Area | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | Alabama | 4785437.0 | 4799069.0 | 4815588.0 | 4830081.0 | 4841799.0 | 4852347.0 | 4863525.0 | 4874486.0 | 4887681.0 | 4903185.0 | 5024803.0 | 5039877.0 |
7 | Alaska | 713910.0 | 722128.0 | 730443.0 | 737068.0 | 736283.0 | 737498.0 | 741456.0 | 739700.0 | 735139.0 | 731545.0 | 732441.0 | 732673.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
55 | Wisconsin | 5690475.0 | 5705288.0 | 5719960.0 | 5736754.0 | 5751525.0 | 5760940.0 | 5772628.0 | 5790186.0 | 5807406.0 | 5822434.0 | 5892323.0 | 5895908.0 |
56 | Wyoming | 564487.0 | 567299.0 | 576305.0 | 582122.0 | 582531.0 | 585613.0 | 584215.0 | 578931.0 | 577601.0 | 578759.0 | 577267.0 | 578803.0 |
51 rows × 13 columns
per_population_growth = total_population.copy()
years = [2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021]
#Convert each year's population to proportional growth relative to the 2010 baseline
for yr in [str(y) for y in years]:
    per_population_growth[yr] = total_population[yr]/total_population['2010'] - 1
per_population_growth.drop(['2010'],inplace=True,axis=1)
per_population_growth.set_index('Geographic Area').transpose().plot(figsize=(10,15))
plt.legend()
#plt.yscale("log")
plt.xlabel("Years")
plt.ylabel("Population Growth (relative to 2010)")
plt.title("Population Growth Over Time For Each State")
plt.grid(linestyle=':')
#Sort the legend by each state's final (2021) growth value
handles, labels = plt.gca().get_legend_handles_labels()
order = per_population_growth['2021'].sort_values(ascending=False).keys()
order = order - 6  # row labels start at 6, so shift them to 0-based handle positions
plt.legend([handles[idx] for idx in order],[labels[idx] for idx in order],bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
<matplotlib.legend.Legend at 0x2a0647a60>
Percent Population Growth by State
Above is each state's population growth as a proportion of its 2010 population. By construction every state's 2010 value is zero, so 2010 was dropped from the plot. The legend is sorted by each line's final value, making it easier to compare the states' lines and to see which state proportionally grew the most over the decade. This is important because a state's population growth will affect its number of protests and radicalized individuals, and therefore the number of protests per radicalized individual.
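To read the chart's ranking directly, we can sort the final-year growth values; a small check on the frame built above:
#Top five states by proportional growth from 2010 to 2021
per_population_growth.set_index('Geographic Area')['2021'].sort_values(ascending=False).head()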
pd.pivot_table(total_population, index='Geographic Area').head()
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ||||||||||||
Alabama | 4785437.0 | 4799069.0 | 4815588.0 | 4830081.0 | 4841799.0 | 4852347.0 | 4863525.0 | 4874486.0 | 4887681.0 | 4903185.0 | 5024803.0 | 5039877.0 |
Alaska | 713910.0 | 722128.0 | 730443.0 | 737068.0 | 736283.0 | 737498.0 | 741456.0 | 739700.0 | 735139.0 | 731545.0 | 732441.0 | 732673.0 |
Arizona | 6407172.0 | 6472643.0 | 6554978.0 | 6632764.0 | 6730413.0 | 6829676.0 | 6941072.0 | 7044008.0 | 7158024.0 | 7278717.0 | 7177986.0 | 7276316.0 |
Arkansas | 2921964.0 | 2940667.0 | 2952164.0 | 2959400.0 | 2967392.0 | 2978048.0 | 2989918.0 | 3001345.0 | 3009733.0 | 3017804.0 | 3012232.0 | 3025891.0 |
California | 37319502.0 | 37638369.0 | 37948800.0 | 38260787.0 | 38596972.0 | 38918045.0 | 39167117.0 | 39358497.0 | 39461588.0 | 39512223.0 | 39499738.0 | 39237836.0 |
Both protest and radicalization measure resistance to social or governmental structures, so it makes sense to join aspects of the data into a simple table comparing radicalization and protest activity. We will not merge the full datasets on the 'State' attribute, because both have so many variables that the resulting table would be unwieldy.
resistance_data = pd.DataFrame()
resistance_data['Radicalized_num'] = pirus_states_since_2000
resistance_data['Protest_num'] = protests_by_state
resistance_data.head()
Radicalized_num | Protest_num | |
---|---|---|
Loc_Habitation_State1 | ||
California | 137 | 4439.0 |
New York | 108 | 2688.0 |
Texas | 84 | 1649.0 |
Florida | 83 | 1823.0 |
Minnesota | 70 | 747.0 |
We can look at the relationship now between the number of protests in a state and the number of radicalized individuals in a state. Unsurprisingly, there is a visually obvious correlation.
resistance_data.index.name = 'State'  # rename the index from 'Loc_Habitation_State1'
resistance_data.plot(kind='scatter',
y='Radicalized_num',
x='Protest_num',
ylabel = "Number of People Radicalized",
xlabel = "Number of Protests",
figsize=(10,8),
alpha=0.4,
color='purple',
s=30)
<AxesSubplot:xlabel='Number of Protests', ylabel='Number of People Radicalized'>
We can compute the correlation between these two variables as follows:
resistance_data['Protest_num'].corr(resistance_data['Radicalized_num'])
0.8996117793328046
This is a strong, though unsurprising, correlation. We can represent each state's population size through the dot size.
resistance_data_merged = (resistance_data.reset_index()
                          .rename(columns={'Loc_Habitation_State1':'State'})
                          .merge(total_population.rename(columns={'Geographic Area':'State','2021':'Population'})[['State','Population']],
                                 on='State', how='right')
                          .set_index('State'))
resistance_data_merged.head()
Radicalized_num | Protest_num | Population | |
---|---|---|---|
State | |||
Alabama | 24.0 | 281.0 | 5039877.0 |
Alaska | 8.0 | 252.0 | 732673.0 |
Arizona | 36.0 | 563.0 | 7276316.0 |
Arkansas | 7.0 | 174.0 | 3025891.0 |
California | 137.0 | 4439.0 | 39237836.0 |
resistance_data_merged["Population"]
State Alabama 5039877.0 Alaska 732673.0 ... Wisconsin 5895908.0 Wyoming 578803.0 Name: Population, Length: 51, dtype: float64
resistance_data_merged.plot(kind='scatter',
                            y='Radicalized_num',
                            x='Protest_num',
                            ylabel="Number of People Radicalized",
                            xlabel="Number of Protests",
                            title="Number of People Radicalized vs Number of Protests",
                            figsize=(10,8),
                            alpha=0.4,
                            color='purple',
                            s=resistance_data_merged["Population"]/1e4)  # dot size scales with population
plt.xscale("log")
plt.yscale("log")
x_vals = list(resistance_data_merged.reset_index()["Protest_num"])
y_vals = list(resistance_data_merged.reset_index()["Radicalized_num"])
states = list(resistance_data_merged.reset_index()["State"])
x_vals.pop(41)
y_vals.pop(41)
states.pop(41)  # South Dakota has NaN values
for i in range(len(x_vals)):
    plt.text(x_vals[i], y_vals[i], states[i], fontsize=8)
A flaw with the graph above is that the raw counts mostly track population: the largest states unsurprisingly have both the most radicalized individuals and the most protests. When comparing states of very different sizes, it's important to normalize by some factor so the values are proportional. Normalizing each count by population below gives a better view of each state's participation in politics.
resistance_data_normalized = resistance_data_merged.copy()
resistance_data_normalized["Protest_num"] = resistance_data_normalized["Protest_num"]/resistance_data_normalized["Population"]
resistance_data_normalized["Radicalized_num"] = resistance_data_normalized["Radicalized_num"]/resistance_data_normalized["Population"]
resistance_data_normalized.plot(kind='scatter',
                                y='Radicalized_num',
                                x='Protest_num',
                                ylabel="Number of People Radicalized (Normalized by Population)",
                                xlabel="Number of Protests (Normalized by Population)",
                                title="Number of People Radicalized vs Number of Protests (Normalized by Population)",
                                figsize=(10,8),
                                alpha=0.4,
                                color='purple',
                                s=resistance_data_normalized["Population"]/1e4)  # dot size still scales with population
plt.xscale("log")
plt.yscale("log")
x_vals = list(resistance_data_normalized.reset_index()["Protest_num"])
y_vals = list(resistance_data_normalized.reset_index()["Radicalized_num"])
x_vals.pop(41)
y_vals.pop(41)  # South Dakota has NaN values
for i in range(len(x_vals)):
    plt.text(x=x_vals[i], y=y_vals[i], s=states[i], fontsize=7)
DC is in the top right, making it the most participatory "state" in the United States. This makes sense: it is home to the White House, and many of the protests there are likely staged by people from outside DC, so its number of protests is large in proportion to its city-sized population. Notably, DC sits near the middle of the un-normalized graph as well, meaning it stands out with or without normalization.
The protest data's Attendees column is missing values for many events. We will build a model to predict what the attendance would have been, based on the issues the protest addressed, the state the protest took place in, and the proportion of radicalized individuals from that state.
First, let's look at what issues people protest about most often.
protests_iss = protests_temp[['Date','Location','Event (legacy; see tags)','Attendees','State','Tags']]
protests_iss_known = protests_iss  # an alias of protests_iss, not a copy
protests_iss_known.rename(columns={'Event (legacy; see tags)':'Event'},inplace=True)
The person who gathered this protest data did not report the political/social topics of the protest consistently. We will create a new tagging system using regular expressions to search the current tagging system for the most common issues present. This tagging system will build a set of issues that can be searched.
def categorizer(word):
    p_list = [r"\s*([Rr]acial)",
              r'\s*(45)', r"\s*([Gg]un\s[Rr]ights)", r"\s*([Gg]un\s[Cc]ontrol)",
              r"\s*([Oo]ther)", r"\s*([Ee]nvironment)",
              r"\s*([Ee]ducation)", r'\s*([Hh]ealthcare)',
              r"\s*([Ii]mmigration)", r"\s*([Ee]xecutive)",
              r"\s*([Ii]nternational\s[Rr]elations)",
              r"\s*([Ll]egislative)", r"\s*([Cc]ivil\s[Rr]ights)"]
    tag_dict = {r"\s*([Rr]acial)": "Racial",
                r'\s*(45)': "45th President", r"\s*([Gg]un\s[Rr]ights)": "Gun Rights", r"\s*([Gg]un\s[Cc]ontrol)": "Gun Control",
                r"\s*([Oo]ther)": 'Other', r"\s*([Ee]nvironment)": 'Environment',
                r"\s*([Ee]ducation)": "Education", r'\s*([Hh]ealthcare)': 'Healthcare',
                r"\s*([Ii]mmigration)": 'Immigration', r"\s*([Ee]xecutive)": 'Executive',
                r"\s*([Ii]nternational\s[Rr]elations)": 'International Relations',
                r"\s*([Ll]egislative)": 'Legislative', r"\s*([Cc]ivil\s[Rr]ights)": 'Civil Rights'}
    #Collect the set of standardized tags matched anywhere in the event's tag string
    ret_list = set()
    for w in word.split(';'):
        for pattern in p_list:
            m = re.search(pattern, w)
            if m is not None:
                ret_list.add(tag_dict[pattern])
    return ret_list
events = []
for a in protests_iss_known["Tags"]:
    events.append(categorizer(a))
protests_iss_known["Event"] = events
Common_Events = protests_iss_known["Event"].value_counts().head(12).keys()
def overlapping_value_count(df, return_dict):
    #Count every tag across all events; an event with several tags increments several counts
    for entry in df['Event']:
        for e in list(entry):
            if e in return_dict.keys():
                return_dict[e] += 1
            else:
                return_dict[e] = 1
    ret_val = pd.DataFrame(list(return_dict.items()), index=range(0, len(return_dict.keys())))
    ret_val.columns = ['Tag', 'Count']
    ret_val.set_index('Tag', inplace=True)
    return ret_val
tag_counts = overlapping_value_count(protests_iss_known,{})
tag_counts.sort_values("Count", ascending=False).plot(y='Count',kind='pie',figsize=(10,10),fontsize=10,legend=True,title='Protest Topics',colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
<matplotlib.legend.Legend at 0x2ab003fd0>
Notice that this chart does not show the distribution of topics across all events but across all tags, since each event may carry multiple tags.
Just from looking at this chart, civil rights, racial justice, guns, and immigration appear to be the issues people protest about most.
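Because each event can carry several tags, tag counts and event counts differ. As a quick sketch, we can count events (rather than tags) that mention a given issue, using the Event sets built above:
#Number of distinct events whose tag set includes 'Civil Rights'
civil_rights_events = protests_iss_known['Event'].apply(lambda s: 'Civil Rights' in s).sum()
print(civil_rights_events)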
Let's also look at the relationship between the time of year that the protests occur and the number of attendees. We will have to drop rows that do not have attendees listed, and convert the Date column to a datetime object.
protests_iss_known.Date = pd.to_datetime(protests_iss_known.Date)
protests_real_test = protests_iss.query('Attendees != Attendees')  # NaN != NaN, so this selects rows with missing attendance
protests_iss_attendees_known = protests_iss.dropna(subset=['Attendees'])
protests_iss_attendees_known.Date.value_counts().plot(figsize=(15,10))
<AxesSubplot:>
protests_real_test
Date | Location | Event | Attendees | State | Tags | |
---|---|---|---|---|---|---|
2 | 2017-01-16 | Cincinnati, OH | {Racial, Civil Rights} | NaN | Ohio | Civil Rights; For racial justice; Martin Luthe... |
4 | 2017-01-19 | Washington, DC | {45th President, Executive} | NaN | District of Columbia | Executive; Against 45th president |
... | ... | ... | ... | ... | ... | ... |
38092 | 2021-01-31 | Topeka, KS | {Civil Rights} | NaN | Kansas | Civil Rights; For abortion rights |
38094 | 2021-01-31 | Salt Lake City, UT | {Other} | NaN | Utah | Other; Against deregulation; Business |
15061 rows × 6 columns
We previously saved the protests with unknown attendees to the DataFrame protests_real_test. Let's revisit that data.
tag_unknown = overlapping_value_count(protests_real_test,{})
tag_unknown.sort_values("Count", ascending=False).plot(y='Count',kind='pie',figsize=(8,8),colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
<matplotlib.legend.Legend at 0x2a922d8e0>
We can build a K-nearest-neighbors predictor of the number of attendees at a protest based on the issues the protest addressed, the state the protest took place in, and the proportion of radicalized individuals from that state.
protests_iss_known
Date | Location | Event | Attendees | State | Tags | |
---|---|---|---|---|---|---|
0 | 2017-01-16 | Johnson City, TN | {Racial, Civil Rights} | 300.0 | Tennessee | Civil Rights; For racial justice; Martin Luthe... |
1 | 2017-01-16 | Indianapolis, IN | {Environment} | 20.0 | Indiana | Environment; For wilderness preservation |
... | ... | ... | ... | ... | ... | ... |
38094 | 2021-01-31 | Salt Lake City, UT | {Other} | NaN | Utah | Other; Against deregulation; Business |
38095 | 2021-01-31 | San Francisco, CA | {Other} | 100.0 | California | Other; Against hazardous conditions; Prisons; ... |
38096 rows × 6 columns
pirus_temp.Date_Exposure = pd.to_datetime(pirus_temp.Date_Exposure)
type(pirus_temp.iloc[0].Date_Exposure)
pandas._libs.tslibs.timestamps.Timestamp
Here we create some tools to calculate the voter turnout and the number of radicalized individuals. We will use these tools to add this information to our protests DataFrame for the moment in time at which each protest occurs.
#How to select certain protest issues when the Event attribute holds a set of tags:
def issue_search(issue):
    return protests_iss_known[protests_iss_known['Event'].apply(lambda s: issue in s)]

def state_date(row):
    return (row.Date, row.State)

def state_year(row):
    return (row.Date.year, row.State)

def vote_pcnt(year_state):
    year, state = year_state
    if year % 2 != 0:
        year -= 1  # map odd years to the most recent general election
    if state not in pd.unique(csv_final.Region):
        return np.nan
    line = str(csv_final[(csv_final.Region == state) & (csv_final.Year == year)]['VEP Highest Office'])
    pcnt = re.search(r'(....%)', line)
    if pcnt is None:
        return np.nan
    return float(pcnt[0][:-1])

def get_rads_by_population(date_state):
    #Returns residents per radicalized individual for the protest's state and date
    date, state = date_state
    if state not in pd.unique(total_population['Geographic Area']):
        return np.nan
    radicals = len(pirus_temp[(pirus_temp.Date_Exposure < pd.to_datetime(date)) & (pirus_temp.Loc_Plot_State1 == state) & (pirus_temp.Date_Exposure > pd.to_datetime('2000-01-01 00:00:00'))])
    population = total_population[total_population['Geographic Area'] == state][str(date.year)]
    return list(population / radicals)[0]

def to_raw(string):
    return fr"{string}"
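As a quick illustration of the helpers above (a hypothetical call, not a cell from the original run), vote_pcnt maps an odd year back to the preceding general election:
#2017 is odd, so this returns Tennessee's 2016 VEP turnout for the highest office
vote_pcnt((2017, 'Tennessee'))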
We will assign each protest the voter turnout of the most recent election, and a measure of radicalization in its state (residents per radicalized individual from 2000 up to the date of the protest, as computed above).
n = len(protests_iss_known)  # 38096 protests
votes = [np.nan]*n
rads = [0]*n
for e in range(n):
    votes[e] = vote_pcnt(state_year(protests_iss_known.iloc[e]))
    rads[e] = get_rads_by_population(state_date(protests_iss_known.iloc[e]))
protests_iss_known['State_voters'] = votes
protests_iss_known['Radicals'] = rads
all_tags = ['Racial', '45th President', 'Gun Rights', 'Gun Control', 'Other', 'Environment', 'Education', 'Healthcare', 'Immigration', 'Executive', 'International Relations', 'Legislative', 'Civil Rights']
#One indicator column per tag
for t in all_tags:
    protests_iss_known[t] = [0]*n
    for e in list(issue_search(t).index):
        protests_iss_known.loc[int(e), t] = 1
protests_iss_known.loc[protests_iss_known.Radicals == inf, 'Radicals'] = 0  # inf means no radicalized individuals yet in that state
protests_iss_known['Radicals'] = pd.to_numeric(protests_iss_known['Radicals'], errors='coerce')  # coerce any remaining non-numeric values to NaN
protests_real_test = protests_iss_known[protests_iss_known.Attendees.isnull()]
protests_real_test = protests_real_test.dropna(subset=['Radicals'])
protests_real_test = protests_real_test.dropna(subset=['State_voters'])
protests_iss_attendees_known = protests_iss_known.dropna(subset=['Attendees'])
protests_iss_attendees_known = protests_iss_attendees_known.dropna(subset=['State_voters'])
protests_iss_attendees_known = protests_iss_attendees_known.dropna(subset=['Radicals'])
protests_iss_attendees_known.Radicals = protests_iss_attendees_known.Radicals.astype('int')
protests_iss_attendees_known.Attendees = protests_iss_attendees_known.Attendees.astype('int')
While we ultimately want to predict attendance for both DataFrames, with and without known attendance, we must first examine the scale of the attendance values to be able to predict on them more accurately. As we see below, the data is heavily right-skewed: most protests are small, with a long tail of very large ones.
percentile = []
threshold = []
for num in pd.unique(protests_iss_attendees_known.Attendees):
    percentile.append(protests_iss_attendees_known[protests_iss_attendees_known.Attendees < num].size / protests_iss_attendees_known.size)
    threshold.append(num)
eval_df = pd.DataFrame()  # avoid shadowing the built-in eval
eval_df['Distribution'] = percentile
eval_df['Threshold'] = threshold
eval_df.plot(kind='scatter', x='Threshold', y='Distribution', figsize=(15,8), alpha=0.4, title='Attendance Percentile Values')
<AxesSubplot:title={'center':'Attendance Percentile Values'}, xlabel='Threshold', ylabel='Distribution'>
Looking at protest attendance over time, we can see that there are clear outliers.
protests_iss_attendees_known.Date = pd.to_datetime(protests_iss_attendees_known.Date)
protests_iss_attendees_known.plot(kind='scatter',x='Date',y='Attendees',figsize=(15,8),s=8,c='red',alpha=0.7, title='Protest Attendance by Event 2017-2021')
<AxesSubplot:title={'center':'Protest Attendance by Event 2017-2021'}, xlabel='Date', ylabel='Attendees'>
We will handle these outliers by filtering rather than by a non-linear scaling: protests with 2,500 or more attendees are dropped.
protests_attendees_known_filtered = protests_iss_attendees_known.loc[protests_iss_attendees_known.Attendees<2500]
protests_attendees_known_filtered.plot(kind='scatter',x='Date',y='Attendees',figsize=(15,8),s=8,c='red',alpha=0.7, title='Protest Attendance by Event 2017-2021')
<AxesSubplot:title={'center':'Protest Attendance by Event 2017-2021'}, xlabel='Date', ylabel='Attendees'>
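If we instead wanted to keep the large events, a non-linear scaling of the target such as log1p would compress the tail; a sketch (not used downstream):
#log1p compresses the long right tail while keeping every event
attendance_log = np.log1p(protests_iss_attendees_known['Attendees'])
attendance_log.plot(kind='hist', bins=50, figsize=(10,5), title='Distribution of log(1 + Attendees)')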
#Convert dates to integer timestamps so they can be used as a numeric feature
protests_attendees_known_filtered.Date = protests_iss_attendees_known.Date.astype('int')
feats = ['Date','State','State_voters', 'Racial', '45th President', 'Gun Rights', 'Gun Control',
'Other', 'Environment', 'Education', 'Healthcare', 'Immigration',
'Executive', 'International Relations', 'Legislative', 'Civil Rights',
'Radicals']
vec = DictVectorizer(sparse=False)  # one-hot encodes the string 'State' feature, passes numerics through
scaler = StandardScaler()
X_dict = protests_attendees_known_filtered[feats].to_dict(orient="records")
y = protests_attendees_known_filtered["Attendees"]
#Evaluate the pipeline for odd values of k using 10-fold cross-validation
kays = []
accuracy = []
for num in range(1, 75, 2):
    model = KNeighborsRegressor(n_neighbors=num)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    scores = cross_val_score(pipeline, X_dict, y,
                             cv=10, scoring='neg_root_mean_squared_error')
    accuracy.append(scores.mean())
    kays.append(num)
for_plot = pd.DataFrame()
for_plot['K-value'] = kays
for_plot['RMSE'] = accuracy
plt1 = for_plot.plot(x='K-value', y='RMSE', figsize=(15,8), ylabel='Negative Root Mean Squared Error', title="Negative Root Mean Squared Error for k Neighbors Across 10 Folds")
"""plt2 = for_plot[for_plot.Division == 2].plot(x='K-value',y='RMSE',ax=plt1)
plt3 = for_plot[for_plot.Division == 3].plot(x='K-value',y='RMSE',ax=plt1)
plt4 = for_plot[for_plot.Division == 4].plot(x='K-value',y='RMSE',ax=plt1)
plt5 = for_plot[for_plot.Division == 5].plot(x='K-value',y='RMSE',ax=plt1)
plt6 = for_plot[for_plot.Division == 6].plot(x='K-value',y='RMSE',ax=plt1)
plt7 = for_plot[for_plot.Division == 7].plot(x='K-value',y='RMSE',ax=plt1)
plt9 = for_plot[for_plot.Division == 9].plot(x='K-value',y='RMSE',ax=plt1)
plt8 = for_plot[for_plot.Division == 8].plot(x='K-value',y='RMSE',ax=plt1)
plt10 = for_plot[for_plot.Division == 10].plot(x='K-value',y='RMSE',ax=plt1)"""
"plt2 = for_plot[for_plot.Division == 2].plot(x='K-value',y='RMSE',ax=plt1)\nplt3 = for_plot[for_plot.Division == 3].plot(x='K-value',y='RMSE',ax=plt1)\nplt4 = for_plot[for_plot.Division == 4].plot(x='K-value',y='RMSE',ax=plt1)\nplt5 = for_plot[for_plot.Division == 5].plot(x='K-value',y='RMSE',ax=plt1)\nplt6 = for_plot[for_plot.Division == 6].plot(x='K-value',y='RMSE',ax=plt1)\nplt7 = for_plot[for_plot.Division == 7].plot(x='K-value',y='RMSE',ax=plt1)\nplt9 = for_plot[for_plot.Division == 9].plot(x='K-value',y='RMSE',ax=plt1)\nplt8 = for_plot[for_plot.Division == 8].plot(x='K-value',y='RMSE',ax=plt1)\nplt10 = for_plot[for_plot.Division == 10].plot(x='K-value',y='RMSE',ax=plt1)"
In this graph, we can see that the curve flattens out at around k = 55, so k = 55 would be a reasonable number of neighbors to use.
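With k chosen, a sketch of what the final prediction step could look like: fit the same pipeline with k = 55 on all protests with known attendance, then predict attendance for the rows in protests_real_test. (This step is our illustration; it was not a cell in the original run.)
#Fit the chosen model on all protests with known attendance
final_model = Pipeline([("vectorizer", DictVectorizer(sparse=False)),
                        ("scaler", StandardScaler()),
                        ("fit", KNeighborsRegressor(n_neighbors=55))])
final_model.fit(X_dict, y)
#The test rows need the same numeric Date encoding as the training data
test_feats = protests_real_test[feats].copy()
test_feats['Date'] = pd.to_datetime(test_feats['Date']).astype('int')
predicted_attendees = final_model.predict(test_feats.to_dict(orient="records"))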
protests_iss_attendees_known[protests_iss_attendees_known.Attendees < 10000].plot(kind='scatter',x='Date',y='Attendees',figsize=(20,10),s=8,c='red',alpha=0.2)
<AxesSubplot:xlabel='Date', ylabel='Attendees'>
We see clear spikes in protest participation and protest size, likely marking important dates in the world of politics.
pirus_temp.Date_Exposure = pd.to_datetime(pirus_temp.Date_Exposure)
pirus_temp
Subject_ID | Loc_Plot_State1 | Loc_Plot_City1 | Loc_Plot_State2 | Loc_Plot_City2 | Date_Exposure | Plot_Target1 | Plot_Target2 | Plot_Target3 | Attack_Preparation | ... | Previous_Criminal_Activity_Type3 | Previous_Criminal_Activity_Age | Gang | Gang_Age_Joined | Trauma | Other_Ideologies | Angry_US | Group_Grievance | Standing | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
882 | 3005 | -99 | -99 | NaN | NaN | 2000-01-01 | -88 | NaN | NaN | -88 | ... | NaN | -88 | 0 | -88 | -99 | 0 | 1 | -99 | -99 | 0 |
883 | 3655 | Montana | -99 | NaN | NaN | 2000-01-01 | -88 | NaN | NaN | -88 | ... | NaN | -99 | 0 | -88 | -99 | 0 | -99 | -99 | -99 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2223 | 1374 | California | Los Angeles | NaN | NaN | 2018-11-26 | 14 | NaN | NaN | 1 | ... | NaN | -99 | 0 | -88 | -99 | 0 | -99 | -99 | -99 | 18 |
2224 | 8295 | Ohio | Toledo | NaN | NaN | 2018-12-10 | 3 | 14.0 | 15.0 | 2 | ... | NaN | 2 | 0 | -88 | -99 | 0 | 1 | 3 | -99 | 18 |
1327 rows × 146 columns
#value_counts plus sort_index gives one chronological count per exposure date
rad_counts = pirus_temp.value_counts('Date_Exposure', sort=False).sort_index()
rad_counts.plot(figsize=(20,10))
<AxesSubplot:xlabel='Date_Exposure'>
Radicalization and Protests Over Time: We will look at the correlation between radicalized individuals and protests over time. Perhaps there are relationships between radicalization on certain issues and more protests on certain issues. For example, we know that internet searches for "Straight pride" peak each year during June, which is Pride Month for LGBTQ+ folks. (https://trends.google.com/trends/explore?date=all&geo=US&q=straight%20pride) Perhaps more discussion around an issue in the form of protests causes more radicalization on the opposing side. We will use time data and issue categories for both radicalized individuals from the PIRUS data and protest events.
As an exploratory exercise, let's plot both the PIRUS and the protests data over time to see the spikes in activity.
Now we can plot protests and radicalization on the same axis, though our protest data only starts at 2017. We will have to filter the radicalization data.
# Year is stored as a two-digit value, so 17 corresponds to 2017
pirus_temp.Year = pd.to_numeric(pirus_temp.Year)
since_17 = pirus_temp.loc[pirus_temp.Year >= 17]
since_17.set_index('Date_Exposure')
Subject_ID | Loc_Plot_State1 | Loc_Plot_City1 | Loc_Plot_State2 | Loc_Plot_City2 | Plot_Target1 | Plot_Target2 | Plot_Target3 | Attack_Preparation | Op_Security | ... | Previous_Criminal_Activity_Type3 | Previous_Criminal_Activity_Age | Gang | Gang_Age_Joined | Trauma | Other_Ideologies | Angry_US | Group_Grievance | Standing | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date_Exposure | |||||||||||||||||||||
2017-01-01 | 6610 | California | Los Molinos | Oregon | NaN | -88 | NaN | NaN | -88 | -88 | ... | NaN | -88 | 0 | -88 | 0 | 0 | 0 | 0 | 0 | 17 |
2017-01-01 | 6734 | Minnesota | Minneapolis | NaN | NaN | -88 | NaN | NaN | -88 | -88 | ... | NaN | -99 | 0 | -88 | -99 | 0 | 1 | 2 | 0 | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-11-26 | 1374 | California | Los Angeles | NaN | NaN | 14 | NaN | NaN | 1 | -99 | ... | NaN | -99 | 0 | -88 | -99 | 0 | -99 | -99 | -99 | 18 |
2018-12-10 | 8295 | Ohio | Toledo | NaN | NaN | 3 | 14.0 | 15.0 | 2 | 2 | ... | NaN | 2 | 0 | -88 | -99 | 0 | 1 | 3 | -99 | 18 |
226 rows × 145 columns
# Count radicalized individuals per exposure date since 2017, in date order
rad_counts2 = since_17.value_counts('Date_Exposure', sort=False).sort_index()
# Flatten the counts into columns so they can be merged with the protest data
rad_counts2 = rad_counts2.reset_index().rename(columns={"Date_Exposure":"Date",0:"freq"})
# Inner join keeps only the dates on which both a protest and a radicalization event occurred
merged_data = protests_iss_attendees_known[["Date","Attendees"]].merge(rad_counts2[["Date","freq"]], on='Date', how='inner')
merged_data
Date | Attendees | freq | |
---|---|---|---|
0 | 2017-01-16 | 300 | 9 |
1 | 2017-01-16 | 20 | 9 |
... | ... | ... | ... |
2947 | 2018-12-10 | 300 | 1 |
2948 | 2018-12-10 | 20 | 1 |
2949 rows × 3 columns
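Note that the inner merge repeats each day's radicalization count once per protest on that date (hence rows 0 and 1 above share freq = 9). A sketch, using the same frames, that first aggregates attendance per day so each date appears only once:

# daily is not part of the original analysis; it sums attendance per date before merging
daily = (protests_iss_attendees_known.groupby('Date', as_index=False)['Attendees'].sum()
         .merge(rad_counts2[['Date', 'freq']], on='Date', how='inner'))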
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis so attendance and radicalization counts share the date axis
fig.set_size_inches(18.5, 10.5)
ax1.scatter(merged_data["Date"], merged_data["Attendees"], c='blue', s=50, alpha=0.4)
ax2.scatter(merged_data["Date"], merged_data["freq"], c='red', s=50, alpha=0.15)
plt.title("Protest Attendance (blue) and Radicalization Frequency (red) Over Time")
ax1.set_xlabel("Date")
ax1.set_ylabel("Attendees")
ax2.set_ylabel("Frequency")
plt.show()
This chart shows the relationship between protest attendance and the number of individuals radicalized over time. To get a better idea of the relationship, we will have to code protests by issue and radicalized individuals by issue.
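One possible approach is sketched below with a hypothetical helper (issue_flags is not part of this notebook, and the issue list shown is an assumption): deriving 0/1 issue columns from the semicolon-separated Tags field.

def issue_flags(df, issues):
    # Hypothetical helper: add a 0/1 column per issue based on whether the
    # issue name appears anywhere in the protest's Tags string
    out = df.copy()
    for issue in issues:
        out[issue] = df['Tags'].str.contains(issue, case=False, na=False).astype(int)
    return out

coded = issue_flags(protests_iss_attendees_known,
                    ['Civil Rights', 'Environment', 'Healthcare', 'Immigration'])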
# Protests per state: counts with known attendance joined to counts with unknown attendance
fin = pd.DataFrame(protests_iss_attendees_known.value_counts('State'),columns=['Known'])
fin2 = pd.DataFrame(protests_real_test.value_counts('State'),columns=['Unknown']).join(fin)
#fin['Unknown'] = fin2['Unknown']
fin2.plot(kind = 'scatter',figsize=(15,10),x='Known',y='Unknown')
states = list(fin2.reset_index()["State"])
x_vals = list(fin2.reset_index()["Known"])
y_vals = list(fin2.reset_index()["Unknown"])
'''for i in range(len(x_vals)):
plt.text(x_vals[i], y_vals[i], states[i], fontsize=8)'''
This graph shows that the number of protests with known attendance in each state is correlated with the number of protests with unknown attendance in each state, so the missing attendance values are not concentrated in any one state.
"""def issue_search_df(issue,df):
return df[df['Event']&{issue}]"""
def autopct(pct):
    # Format pie-chart percentages to two decimal places
    return ('%.2f' % pct)
tag_counts_real = overlapping_value_count(protests_real_test,{})
tag_counts_real.sort_values("Count",ascending=False).plot(y='Count',kind='pie',figsize=(10,10),fontsize=10,legend=True,autopct=autopct,title='Protest Topics',colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
tag_counts_known = overlapping_value_count(protests_iss_attendees_known,{})
tag_counts_known.sort_values("Count",ascending=False).plot(y='Count',kind='pie',figsize=(10,10),autopct=autopct,fontsize=10,legend=True,title='Protest Topics',colors=sns.color_palette('tab20'))
plt.legend(bbox_to_anchor=(1., 1.0), fancybox=True, shadow=True, ncol=1)
The proportion of protests on each topic follows roughly the same distribution as in the set where attendance is known, with less than a 1% difference for all topics except immigration and (barely) gun control.
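To quantify that claim, here is a minimal sketch (assuming tag_counts_known and tag_counts_real share the same topic index and Count column used above) comparing per-topic shares directly:

share_known = tag_counts_known['Count'] / tag_counts_known['Count'].sum()
share_unknown = tag_counts_real['Count'] / tag_counts_real['Count'].sum()
# Absolute difference in topic share between the two pie charts, largest first
(share_known - share_unknown).abs().sort_values(ascending=False)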
protests_iss_attendees_known.head()
Date | Location | Event | Attendees | State | Tags | State_voters | Radicals | Racial | 45th President | ... | Gun Control | Other | Environment | Education | Healthcare | Immigration | Executive | International Relations | Legislative | Civil Rights | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01-16 | Johnson City, TN | {Racial, Civil Rights} | 300 | Tennessee | Civil Rights; For racial justice; Martin Luthe... | 51.1 | 1997 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2017-01-16 | Indianapolis, IN | {Environment} | 20 | Indiana | Environment; For wilderness preservation | 56.4 | 2533 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2017-01-18 | Hartford, CT | {Healthcare} | 300 | Connecticut | Healthcare; For Planned Parenthood | 63.7 | 4079 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
7 | 2017-01-20 | Westlake Park, Seattle, WA | {45th President, Executive} | 100 | Washington | Executive; Against 45th president | 64.7 | 1303 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
8 | 2017-01-20 | Columbus, OH | {Civil Rights} | 2450 | Ohio | Civil Rights; For women's rights; Women's March | 62.9 | 2101 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 21 columns
# Daily counts of protests with known attendance, plotted in date order
known_freq = protests_iss_attendees_known.value_counts('Date', sort=False).sort_index()
axis = known_freq.plot(figsize=(20,10))
# Overlay daily counts of protests with unknown attendance on the same axes
unknown_freq = protests_real_test.value_counts('Date', sort=False).sort_index()
unknown_freq.plot(figsize=(20,10), ax=axis)
"""pirus_temp.set_index('Date_Exposure')
rad_counts = pirus_temp.sort_index().value_counts('Date_Exposure',sort=False)
rad_counts.plot(x='Date_Exposure',figsize=(20,10))"""
"pirus_temp.set_index('Date_Exposure')\nrad_counts = pirus_temp.sort_index().value_counts('Date_Exposure',sort=False)\nrad_counts.plot(x='Date_Exposure',figsize=(20,10))"
Protests with recorded attendance spiked at roughly the same time of year as those without.
# Bar chart of protest counts per state: known attendance (orange) overlaid with unknown attendance
x = protests_iss_attendees_known.value_counts('State').plot(kind='bar', color='orange', figsize=(15,8))
protests_real_test.value_counts('State').plot(kind='bar', ax=x)
[Figure: bar chart of protest counts by State, known vs. unknown attendance]
known = 23009    # total number of protests with recorded attendance
unknown = 15046  # total number of protests with missing attendance
# Per-state share of each group's total
fin2['Unknown_pcnt'] = fin2['Unknown']/unknown
fin2['Known_pcnt'] = fin2['Known']/known
fin2.plot(kind='scatter',x='Unknown_pcnt',y='Known_pcnt',figsize=(15,8))
[Figure: scatter plot of Known_pcnt vs. Unknown_pcnt by state]
This largely restates the earlier known-vs-unknown scatter plot, but on a proportional scale.
fin2['Known'].corr(fin2['Unknown'])
0.9799869422463429
The correlation between the known and unknown protest counts per state is almost 0.98, which is quite strong. This supports treating the missing attendance values as MAR: protests with unknown attendance are distributed across states much like those with known attendance, so the known data is usable and comparable. If the data were not MAR, we would have to consider how the missingness biases our results.
# Convert dates to integers (nanoseconds since the epoch) so KNN can treat them as a numeric feature
protests_attendees_known_filtered.Date = protests_attendees_known_filtered.Date.astype('int')
protests_real_test.Date = protests_real_test.Date.astype('int')
pd.options.display.max_rows = 5
feats = ['Date','State','State_voters', 'Racial', '45th President', 'Gun Rights', 'Gun Control',
'Other', 'Environment', 'Education', 'Healthcare', 'Immigration',
'Executive', 'International Relations', 'Legislative', 'Civil Rights',
'Radicals']
X_train_dict = protests_attendees_known_filtered[feats].to_dict(orient="records")
y_train = protests_attendees_known_filtered["Attendees"]
x_new = protests_real_test
X_new_dict=x_new[feats].to_dict(orient="records")
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(X_new_dict)
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)
# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=55)
model.fit(X_train_sc, y_train)
protests_real_test['predicted_attendance'] = model.predict(X_new_sc)
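As an aside, the same vectorize, scale, and predict steps could be wrapped in the already-imported Pipeline; a minimal sketch (not how the original analysis was run) follows. This keeps the fit/transform bookkeeping in one object.

pipeline = Pipeline([
    ('vec', DictVectorizer(sparse=False)),      # dummy-encode the feature dicts
    ('scale', StandardScaler()),                # standardize all features
    ('knn', KNeighborsRegressor(n_neighbors=55)),
])
pipeline.fit(X_train_dict, y_train)
predictions = pipeline.predict(X_new_dict)

This form also makes cross-validation a single call, e.g. cross_val_score(pipeline, X_train_dict, y_train, cv=5).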
# Convert the integer dates back to datetimes for plotting
protests_attendees_known_filtered.Date = pd.to_datetime(protests_attendees_known_filtered.Date)
protests_real_test.Date = pd.to_datetime(protests_real_test.Date)
fin_axis = protests_attendees_known_filtered.plot(kind="scatter",x="Date",y="Attendees",c='green',alpha=0.4,figsize=(18,10))
protests_real_test.plot(kind="scatter",x="Date",y="predicted_attendance",ax=fin_axis,alpha=0.4)
[Figure: scatter plot of known Attendees (green) and predicted_attendance over Date]
Comparing radicalized individuals and protest frequency within states leads to eye-opening information. There is a clear, visible correlation between the number of radicalized individuals and the number of protests. An even more interesting finding was seeing DC at the forefront of all this political activity, which makes perfect sense given that it is the seat of our federal government. The different kinds of protests also matter: to understand the political climate of a state, we need to know what its people are protesting about. The most common issues are civil rights, racial justice, and immigration, all heated topics, and it is important to know that people are protesting about them.

Although attendance went unrecorded for many protests, those datapoints appeared to be MAR: they followed similar distributions to the known data and were strongly correlated with the non-missing data. This project faced heavy time limitations. Ideally, we would have liked to compare policing statistics, such as total budgets in each state, with protest frequency and attendance. We also would have liked to dive into the details of each individual protest and map them, perhaps on a graph with lines connecting them, to look for patterns. Overall, there are many more questions to ask and answer along this project's path, but for now, we can visualize political participation graphically.