Background
This project engages with the findings of the 2024 Nature publication: 'Post-January 6th deplatforming reduced the reach of misinformation on Twitter' by McCabe et al. The paper investigates Twitter's decision to deplatform 70,000 misinformation-spreading accounts after the January 6th Capitol riot. Using panel data from over 500,000 Twitter users, the study applied Difference-in-Differences (DID) and Sharp Regression Discontinuity (SRD) to assess the causal impact of deplatforming on misinformation spread.
Part 1: Data Simulation and Regression
Data Simulation & Regression
Instructions:
- Take a copy of this notebook and answer the questions in Sections 2, 3, and 4. Add as many code and markdown cells as needed within those sections.
- Answer the External Resources question in Section 5.
Section 1: Data Simulation Engine
The code in this section provides the tools needed to simulate data from a given set of simulation_parameters.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# example set of simulation parameters
default_simulation_parameters = {
    "num_employees_per_company": 50,
    "satisfaction_variance": 1,
    "data_collection_start_date": pd.to_datetime("2022-01-01"),
    "data_collection_end_date": pd.to_datetime("2023-12-31"),
    "remote_work_onset_date": pd.to_datetime("2023-01-01"),
    "remote_work_treatment_effect": 2,
    "satisfaction_levels": {"Rubicon": 5.2, "Giggle": 6.1}
}

def simulate_data(simulation_parameters):
    data = []
    # create the date range for data collection
    dates = pd.date_range(
        start=simulation_parameters["data_collection_start_date"],
        end=simulation_parameters["data_collection_end_date"],
        freq='W'  # weekly intervals between start and end
    )
    # set number of employees per company
    num_employees = simulation_parameters["num_employees_per_company"]
    for company in ['Rubicon', 'Giggle']:
        # company satisfaction parameters
        satisfaction_mean = simulation_parameters["satisfaction_levels"][company]
        satisfaction_variance = simulation_parameters["satisfaction_variance"]
        # simulate data collection
        for date in dates:
            # days elapsed since the start of data collection
            time_at_company = (date - simulation_parameters["data_collection_start_date"]).days
            for i in range(num_employees):
                # np.random.randint excludes the upper bound, so ages range from 18 to 59
                employee_age = np.random.randint(18, 60)
                # note: np.random.normal takes a standard deviation as its second argument;
                # with the default value of 1 the variance and standard deviation coincide
                satisfaction = np.random.normal(
                    satisfaction_mean,
                    satisfaction_variance
                )
                # treatment effect: only Giggle is treated, and only after the onset date
                if date > simulation_parameters["remote_work_onset_date"]:
                    if company == "Giggle":
                        satisfaction += simulation_parameters["remote_work_treatment_effect"]
                # save the data in a useful format
                datapoint = {
                    "date": date,
                    "company": company,
                    "satisfaction": satisfaction,
                    "employee_id": f"{company[0]}{i}",  # fake employee id: company initial plus the loop index
                    "time_at_company": time_at_company,
                    "employee_age": employee_age
                }
                data.append(datapoint)
    return pd.DataFrame(data)
df_default = simulate_data(default_simulation_parameters)
df_default
| | date | company | satisfaction | employee_id | time_at_company | employee_age |
|---|---|---|---|---|---|---|
| 0 | 2022-01-02 | Rubicon | 4.999562 | R0 | 1 | 33 |
| 1 | 2022-01-02 | Rubicon | 5.209023 | R1 | 1 | 38 |
| 2 | 2022-01-02 | Rubicon | 5.060302 | R2 | 1 | 57 |
| 3 | 2022-01-02 | Rubicon | 4.626068 | R3 | 1 | 57 |
| 4 | 2022-01-02 | Rubicon | 3.793625 | R4 | 1 | 48 |
| ... | ... | ... | ... | ... | ... | ... |
| 10495 | 2023-12-31 | Giggle | 7.935362 | G45 | 729 | 46 |
| 10496 | 2023-12-31 | Giggle | 7.429865 | G46 | 729 | 36 |
| 10497 | 2023-12-31 | Giggle | 6.627971 | G47 | 729 | 41 |
| 10498 | 2023-12-31 | Giggle | 9.220513 | G48 | 729 | 21 |
| 10499 | 2023-12-31 | Giggle | 7.028789 | G49 | 729 | 39 |
10500 rows × 6 columns
df_default.sample(n=10)
| | date | company | satisfaction | employee_id | time_at_company | employee_age |
|---|---|---|---|---|---|---|
| 3662 | 2023-05-28 | Rubicon | 6.020167 | R12 | 512 | 39 |
| 135 | 2022-01-16 | Rubicon | 6.246700 | R35 | 15 | 24 |
| 9784 | 2023-09-24 | Giggle | 9.478592 | G34 | 631 | 33 |
| 1267 | 2022-06-26 | Rubicon | 6.045406 | R17 | 176 | 59 |
| 4399 | 2023-09-03 | Rubicon | 5.616598 | R49 | 610 | 18 |
| 6787 | 2022-07-31 | Giggle | 4.576153 | G37 | 211 | 20 |
| 4764 | 2023-10-29 | Rubicon | 5.667133 | R14 | 666 | 26 |
| 3463 | 2023-04-30 | Rubicon | 6.442557 | R13 | 484 | 27 |
| 878 | 2022-05-01 | Rubicon | 6.013686 | R28 | 120 | 43 |
| 1403 | 2022-07-17 | Rubicon | 5.911192 | R3 | 197 | 58 |
Section 2 Alternative Simulation Parameters
The goal in this section is to demonstrate understanding of simulating data using different sets of parameters. The section will start with one example, followed by two questions.
2.1 Example
Example: simulate a dataset in which the onset of the remote work treatment takes place in March
# take a copy of the default simulation parameters dictionary
march_onset_parameters = default_simulation_parameters.copy()
# overwrite the remote work onset date in our new copy of the parameters
march_onset_parameters["remote_work_onset_date"] = pd.to_datetime("2023-03-01")
# look at the updated parameters
march_onset_parameters
{'num_employees_per_company': 50,
'satisfaction_variance': 1,
'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
'remote_work_onset_date': Timestamp('2023-03-01 00:00:00'),
'remote_work_treatment_effect': 2,
'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}
# simulate data with new parameters
march_onset_df = simulate_data(march_onset_parameters)
# look at the data
march_onset_df.sample(5)
| | date | company | satisfaction | employee_id | time_at_company | employee_age |
|---|---|---|---|---|---|---|
| 1955 | 2022-10-02 | Rubicon | 5.687980 | R5 | 274 | 19 |
| 8483 | 2023-03-26 | Giggle | 7.570722 | G33 | 449 | 22 |
| 5556 | 2022-02-13 | Giggle | 7.174432 | G6 | 43 | 43 |
| 856 | 2022-05-01 | Rubicon | 3.808717 | R6 | 120 | 59 |
| 560 | 2022-03-20 | Rubicon | 7.371126 | R10 | 78 | 47 |
# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=march_onset_df, x="date", y="satisfaction", hue="company")
<Axes: xlabel='date', ylabel='satisfaction'>
2.2 Question 1
Simulate a dataset in which remote work was associated with a decrease in employee satisfaction.
# take a copy of the default simulation parameters dictionary
remote_decrease_satisfation_parameters = default_simulation_parameters.copy()
# overwrite remote_work_treatment_effect with a negative value (so that remote work is associated with a decrease in employee satisfaction)
remote_decrease_satisfation_parameters["remote_work_treatment_effect"] = -2
# look at the updated parameters
remote_decrease_satisfation_parameters
{'num_employees_per_company': 50,
'satisfaction_variance': 1,
'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
'remote_work_onset_date': Timestamp('2023-01-01 00:00:00'),
'remote_work_treatment_effect': -2,
'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}
# simulate data with new parameters
remote_decrease_satisfation_df=simulate_data(remote_decrease_satisfation_parameters)
# look at the data
remote_decrease_satisfation_df.sample(5)
| | date | company | satisfaction | employee_id | time_at_company | employee_age |
|---|---|---|---|---|---|---|
| 9453 | 2023-08-13 | Giggle | 4.924890 | G3 | 589 | 28 |
| 8128 | 2023-02-05 | Giggle | 2.895555 | G28 | 400 | 24 |
| 1367 | 2022-07-10 | Rubicon | 6.439668 | R17 | 190 | 36 |
| 4231 | 2023-08-13 | Rubicon | 5.833665 | R31 | 589 | 29 |
| 6215 | 2022-05-15 | Giggle | 6.125773 | G15 | 134 | 54 |
# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=remote_decrease_satisfation_df, x="date", y="satisfaction", hue="company")
<Axes: xlabel='date', ylabel='satisfaction'>
# Additional Note (see below)
The manipulation of the dataset has worked. The simulated dataset was configured so that remote work is associated with a decrease in employee satisfaction, namely a negative remote_work_treatment_effect (-2), and the graph above shows the corresponding drop. Giggle's satisfaction score falls on the treatment date and stays lower for the rest of the data collection period. The contrast with the Example section, where the treatment effect was positive, is clear.
2.3 Question 2
Simulate a dataset in which remote work began much sooner after the data collection start date. Illustrate clearly that the manipulation to the dataset has worked.
# take a copy of the default simulation parameters dictionary
much_sooner_onset_parameters = default_simulation_parameters.copy()
# The default data collection start date is 2022-01-01, and by default remote work starts on 2023-01-01.
# To meet the requirement of the prompt, remote work should begin much earlier than 2023-01-01 but still after 2022-01-01.
# Let's try 2022-03-01.
# overwrite the remote work onset date in our new copy of the parameters
much_sooner_onset_parameters["remote_work_onset_date"] = pd.to_datetime("2022-03-01")
# look at the updated parameters
much_sooner_onset_parameters
{'num_employees_per_company': 50,
'satisfaction_variance': 1,
'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
'remote_work_onset_date': Timestamp('2022-03-01 00:00:00'),
'remote_work_treatment_effect': 2,
'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}
# simulate data with new parameters
much_sooner_onset_df=simulate_data(much_sooner_onset_parameters)
# look at the data
much_sooner_onset_df.sample(5)
| | date | company | satisfaction | employee_id | time_at_company | employee_age |
|---|---|---|---|---|---|---|
| 2228 | 2022-11-06 | Rubicon | 5.923113 | R28 | 309 | 21 |
| 8084 | 2023-01-29 | Giggle | 7.498240 | G34 | 393 | 57 |
| 3475 | 2023-04-30 | Rubicon | 5.194315 | R25 | 484 | 53 |
| 5907 | 2022-04-03 | Giggle | 8.696635 | G7 | 92 | 53 |
| 6644 | 2022-07-10 | Giggle | 8.021419 | G44 | 190 | 43 |
# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=much_sooner_onset_df, x="date", y="satisfaction", hue="company")
<Axes: xlabel='date', ylabel='satisfaction'>
# Additional Note (see below)
The manipulation of the dataset has worked. The dataset was re-simulated with remote work beginning much sooner after the data collection start date ('2022-01-01'); in this case, '2022-03-01' was chosen. Remote work now begins only two months after the data collection start date, compared with a year under the default parameters and fourteen months in the Section 2.1 example. This is also evident in the visualization above, where the turning point of the trend shifts towards the left-hand side.
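Besides the plot, the shift can be checked numerically by comparing mean satisfaction before and after the onset date for each company. The block below is a minimal sketch rather than the notebook's own code: `simulate` and `group_means` are hypothetical helpers, and `simulate` is a simplified standard-library stand-in for `simulate_data` that assumes the default baselines (5.2 and 6.1), a noise standard deviation of 1, and weekly Sunday dates.

```python
import random
from datetime import date, timedelta
from statistics import mean


def simulate(onset, effect=2.0, n_employees=50, seed=0):
    """Simplified stand-in for simulate_data: returns (date, company, satisfaction) rows."""
    rng = random.Random(seed)
    baselines = {"Rubicon": 5.2, "Giggle": 6.1}
    rows = []
    day = date(2022, 1, 2)  # first Sunday in the data collection window
    while day <= date(2023, 12, 31):
        for company, base in baselines.items():
            for _ in range(n_employees):
                score = rng.gauss(base, 1.0)
                # only Giggle is treated, and only strictly after the onset date
                if company == "Giggle" and day > onset:
                    score += effect
                rows.append((day, company, score))
        day += timedelta(weeks=1)
    return rows


def group_means(rows, onset):
    """Mean satisfaction for each (company, period) cell."""
    means = {}
    for company in ("Rubicon", "Giggle"):
        for label, is_post in (("pre", False), ("post", True)):
            values = [s for d, c, s in rows if c == company and (d > onset) == is_post]
            means[(company, label)] = mean(values)
    return means


onset = date(2022, 3, 1)
rows = simulate(onset)
means = group_means(rows, onset)
for cell, value in means.items():
    print(cell, round(value, 2))
```

If the manipulation has worked, Giggle's post-onset mean should sit roughly 2 points above its pre-onset mean, while Rubicon's two means should be nearly equal.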
Section 3 Regression analyses
3.1 Question 3
Using statsmodels, perform a regression analysis on a dataset simulated with the default parameters. The regression analysis should examine whether an employee's age predicts their satisfaction levels. Explain how the results of the analysis support the conclusion.
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Establish the formula
regression_formula = 'satisfaction ~ employee_age'
model = smf.ols(regression_formula, data=df_default)
# Fit the regression model
results = model.fit()
# Print the summary table
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.000
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.468
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.226
Time: 22:22:57 Log-Likelihood: -19579.
No. Observations: 10500 AIC: 3.916e+04
Df Residuals: 10498 BIC: 3.918e+04
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 6.2168 0.050 123.443 0.000 6.118 6.315
employee_age -0.0015 0.001 -1.211 0.226 -0.004 0.001
==============================================================================
Omnibus: 333.018 Durbin-Watson: 0.818
Prob(Omnibus): 0.000 Jarque-Bera (JB): 331.859
Skew: 0.403 Prob(JB): 8.66e-73
Kurtosis: 2.669 Cond. No. 133.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Additional Note (See below)
An employee's age does not predict their satisfaction level. As shown in the table above, the coefficient on employee_age is very close to zero (-0.0015), and even this small effect is not statistically significant at conventional levels (p = 0.226, above both 0.05 and 0.1). In addition, the 95% confidence interval (-0.004 to 0.001) includes zero, which again indicates that age does not predict satisfaction here. This is the expected outcome: in the data-generating process, each employee's age is drawn at random, independently of satisfaction, so any association between the two in a sample arises only by chance. Last but not least, the R-squared is 0.000, meaning the model explains essentially none of the variance in satisfaction.
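The independence argument can be illustrated without any regression library. The following is a small sketch under stated assumptions, not the notebook's method: it draws ages and satisfaction scores independently (the pooled mean of 5.65 and the standard deviation of 1 are simplifications of the simulation) and computes their Pearson correlation with a hypothetical `pearson` helper.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 10_500  # same number of rows as the default simulated dataset

# ages 18..59, matching np.random.randint(18, 60), whose upper bound is exclusive
ages = [rng.randint(18, 59) for _ in range(n)]
# satisfaction drawn independently of age (assumed mean 5.65, standard deviation 1)
scores = [rng.gauss(5.65, 1.0) for _ in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


r = pearson(ages, scores)
print("correlation:", round(r, 4), "R-squared:", round(r ** 2, 6))
```

In a simple regression with one predictor, R-squared equals the squared correlation, so a correlation near zero goes hand in hand with an R-squared near zero.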
3.2 Question 4
- Simulate a dataset using the default parameters.
- Using statsmodels, perform a regression analysis with one dependent variable (satisfaction) and one predictor (time_at_company) to examine whether employees who have been working longer are happier.
What does the regression result show? Is the result accurate? Explain how the results of analysis support the conclusion and any limitations of the analysis.
# Simulate a dataset again using the default parameters
df_default_new = simulate_data(default_simulation_parameters)
# The necessary libraries were imported in Question 3, so we don't have to import them again
# import statsmodels.api as sm
# import statsmodels.formula.api as smf
# Establish the formula
regression_formula_new = 'satisfaction ~ time_at_company'
model_new = smf.ols(regression_formula_new, data=df_default_new)
# Fit the new regression model
results_new = model_new.fit()
print(results_new.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.078
Model: OLS Adj. R-squared: 0.078
Method: Least Squares F-statistic: 892.7
Date: Thu, 03 Oct 2024 Prob (F-statistic): 2.46e-188
Time: 22:34:42 Log-Likelihood: -19022.
No. Observations: 10500 AIC: 3.805e+04
Df Residuals: 10498 BIC: 3.806e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 5.3986 0.029 187.705 0.000 5.342 5.455
time_at_company 0.0020 6.81e-05 29.879 0.000 0.002 0.002
==============================================================================
Omnibus: 110.100 Durbin-Watson: 0.910
Prob(Omnibus): 0.000 Jarque-Bera (JB): 78.608
Skew: 0.102 Prob(JB): 8.52e-18
Kurtosis: 2.629 Cond. No. 840.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# The coefficient above is very small, which may result from the unit used (days)
# Let's try expressing time_at_company in years instead
df_default_new["time_at_company_y"]= df_default_new["time_at_company"]/365
# Establish the formula
regression_formula_new = 'satisfaction ~ time_at_company_y'
model_new = smf.ols(regression_formula_new, data=df_default_new)
# Fit the new regression model
results_new = model_new.fit()
print(results_new.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.078
Model: OLS Adj. R-squared: 0.078
Method: Least Squares F-statistic: 892.7
Date: Thu, 03 Oct 2024 Prob (F-statistic): 2.46e-188
Time: 22:36:58 Log-Likelihood: -19022.
No. Observations: 10500 AIC: 3.805e+04
Df Residuals: 10498 BIC: 3.806e+04
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 5.3986 0.029 187.705 0.000 5.342 5.455
time_at_company_y 0.7429 0.025 29.879 0.000 0.694 0.792
==============================================================================
Omnibus: 110.100 Durbin-Watson: 0.910
Prob(Omnibus): 0.000 Jarque-Bera (JB): 78.608
Skew: 0.102 Prob(JB): 8.52e-18
Kurtosis: 2.629 Cond. No. 3.76
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Additional Note (See below)
At first glance, the model seems to support the conclusion that "employees who have been working longer are happier". This confused me at the beginning, given that the satisfaction scores were generated randomly with no dependence on time. On second thought, I realized that observations with a longer time_at_company are also the ones collected after the treatment we pre-set, which raises satisfaction in the treatment group (Giggle). When all the data are pooled for analysis, average satisfaction is therefore "inflated" after the onset date. The model cannot separate this treatment effect from the effect of time, so the positive coefficient is misleading; the DID model discussed below handles this much better. In addition, the R-squared is small (0.078), meaning the model explains little of the variance in satisfaction.
# To test whether my reasoning is correct, we can try a new dataset where the treatment effect is zero.
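The size of this inflation can be worked out without any random noise. The sketch below is a back-of-the-envelope check that is not part of the original analysis: it assumes the default parameters (105 weekly dates, with the first 53 at or before the onset date and equal-sized companies), replaces each week's observations with their expected pooled mean, and computes the ordinary least squares slope by hand. The variable names are illustrative.

```python
from statistics import mean

weeks = range(105)                   # weekly dates from 2022-01-02 to 2023-12-31
t = [1 + 7 * k for k in weeks]       # time_at_company in days (1, 8, ..., 729)

# Expected pooled satisfaction: (5.2 + 6.1) / 2 = 5.65 before the onset,
# plus 2 * 0.5 = 1 afterwards because only half of the sample (Giggle) is treated.
y = [5.65 if k <= 52 else 6.65 for k in weeks]

t_bar, y_bar = mean(t), mean(y)
slope = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, y)) / sum((a - t_bar) ** 2 for a in t)
intercept = y_bar - slope * t_bar

print("slope per day:", round(slope, 5))
print("slope per year:", round(slope * 365, 3))
print("intercept:", round(intercept, 3))
```

Under these assumptions, a step in the pooled mean is enough to produce a positive slope of about 0.002 per day, in line with the regression output above, even though no individual's satisfaction depends on time.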
satisfation_parameters_no_effect = default_simulation_parameters.copy()
satisfation_parameters_no_effect["remote_work_treatment_effect"] = 0
df_no_effect = simulate_data(satisfation_parameters_no_effect)
# Establish the formula
regression_formula_no_effect = 'satisfaction ~ time_at_company'
model_no_effect = smf.ols(regression_formula_no_effect, data=df_no_effect)
results_model_no_effect = model_no_effect.fit()
print(results_model_no_effect.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.000
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 3.354
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.0671
Time: 22:43:00 Log-Likelihood: -15855.
No. Observations: 10500 AIC: 3.171e+04
Df Residuals: 10498 BIC: 3.173e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 5.6232 0.021 264.350 0.000 5.582 5.665
time_at_company 9.228e-05 5.04e-05 1.831 0.067 -6.49e-06 0.000
==============================================================================
Omnibus: 2.999 Durbin-Watson: 1.650
Prob(Omnibus): 0.223 Jarque-Bera (JB): 2.966
Skew: 0.025 Prob(JB): 0.227
Kurtosis: 2.935 Cond. No. 840.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# The coefficient is now close to zero and not significant at the 0.05 level (p = 0.067), which is consistent with my reasoning.
# Something beyond our simulation
Theoretically, several outcomes are possible. Time at the company might not be associated with higher or lower satisfaction scores at all. A positive correlation is possible if those who stay longer do so mainly because of loyalty and high satisfaction with the company. It is also possible that those who stay longer become more and more unhappy but, for some reason, remain at the company and do not quit, in which case a negative correlation would emerge.
Section 4 Difference in Differences analyses
4.1 Question 5
- Simulate a dataset in which remote work has a strong negative effect on employee satisfaction.
- Using statsmodels, perform a Difference in Differences analysis to examine the effect of remote work on satisfaction.
What does the regression result show? Is the result accurate? Explain how the results of analysis support the conclusion.
# Simulate a new dataset in which remote work has a strong negative effect on employee satisfaction
## take a copy of the default simulation parameters dictionary
stong_neagtive_parameters = default_simulation_parameters.copy()
## Check the baselines to ensure a reasonable negative value is chosen.
## Given baselines of 5.2 and 6.1, a value of -4 could be a good choice.
stong_neagtive_parameters["remote_work_treatment_effect"]= -4
# look at the updated parameters
stong_neagtive_parameters
{'num_employees_per_company': 50,
'satisfaction_variance': 1,
'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
'remote_work_onset_date': Timestamp('2023-01-01 00:00:00'),
'remote_work_treatment_effect': -4,
'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}
stong_neagtive_df=simulate_data(stong_neagtive_parameters)
stong_neagtive_df
| | date | company | satisfaction | employee_id | time_at_company | employee_age |
|---|---|---|---|---|---|---|
| 0 | 2022-01-02 | Rubicon | 5.053856 | R0 | 1 | 52 |
| 1 | 2022-01-02 | Rubicon | 5.885893 | R1 | 1 | 21 |
| 2 | 2022-01-02 | Rubicon | 4.258374 | R2 | 1 | 21 |
| 3 | 2022-01-02 | Rubicon | 4.027964 | R3 | 1 | 18 |
| 4 | 2022-01-02 | Rubicon | 6.009310 | R4 | 1 | 23 |
| ... | ... | ... | ... | ... | ... | ... |
| 10495 | 2023-12-31 | Giggle | 0.905752 | G45 | 729 | 57 |
| 10496 | 2023-12-31 | Giggle | 3.229823 | G46 | 729 | 59 |
| 10497 | 2023-12-31 | Giggle | 2.652730 | G47 | 729 | 59 |
| 10498 | 2023-12-31 | Giggle | 0.722620 | G48 | 729 | 31 |
| 10499 | 2023-12-31 | Giggle | 2.754140 | G49 | 729 | 52 |
10500 rows × 6 columns
# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=stong_neagtive_df, x="date", y="satisfaction", hue="company")
<Axes: xlabel='date', ylabel='satisfaction'>
# By default, remote_work_onset_date= 2023-01-01 00:00:00
remote_work_onset_date= pd.to_datetime("2023-01-01")
# Create dummy variables
stong_neagtive_df['post_treatment'] = (stong_neagtive_df['date'] > remote_work_onset_date).astype(int)
stong_neagtive_df['treatment_group'] = (stong_neagtive_df['company'] == 'Giggle').astype(int)
stong_neagtive_df.sample(10)
| | date | company | satisfaction | employee_id | time_at_company | employee_age | post_treatment | treatment_group |
|---|---|---|---|---|---|---|---|---|
| 10496 | 2023-12-31 | Giggle | 3.229823 | G46 | 729 | 59 | 1 | 1 |
| 144 | 2022-01-16 | Rubicon | 5.283587 | R44 | 15 | 18 | 0 | 0 |
| 7967 | 2023-01-15 | Giggle | 1.672606 | G17 | 379 | 39 | 1 | 1 |
| 2220 | 2022-11-06 | Rubicon | 3.463370 | R20 | 309 | 19 | 0 | 0 |
| 7793 | 2022-12-18 | Giggle | 5.571966 | G43 | 351 | 53 | 0 | 1 |
| 3882 | 2023-06-25 | Rubicon | 5.449915 | R32 | 540 | 34 | 1 | 0 |
| 4667 | 2023-10-15 | Rubicon | 4.354222 | R17 | 652 | 23 | 1 | 0 |
| 562 | 2022-03-20 | Rubicon | 4.910582 | R12 | 78 | 20 | 0 | 0 |
| 558 | 2022-03-20 | Rubicon | 3.581364 | R8 | 78 | 49 | 0 | 0 |
| 6192 | 2022-05-08 | Giggle | 6.436148 | G42 | 127 | 37 | 0 | 1 |
did_formula = 'satisfaction ~ post_treatment + treatment_group + post_treatment*treatment_group'
did_model = smf.ols(did_formula, data=stong_neagtive_df)
did_results = did_model.fit()
print(did_results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.700
Model: OLS Adj. R-squared: 0.700
Method: Least Squares F-statistic: 8153.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 22:54:22 Log-Likelihood: -14841.
No. Observations: 10500 AIC: 2.969e+04
Df Residuals: 10496 BIC: 2.972e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1806 0.019 268.107 0.000 5.143 5.218
post_treatment 0.0171 0.027 0.623 0.533 -0.037 0.071
treatment_group 0.9328 0.027 34.136 0.000 0.879 0.986
post_treatment:treatment_group -4.0375 0.039 -103.976 0.000 -4.114 -3.961
==============================================================================
Omnibus: 0.687 Durbin-Watson: 1.987
Prob(Omnibus): 0.709 Jarque-Bera (JB): 0.661
Skew: 0.017 Prob(JB): 0.718
Kurtosis: 3.017 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## examine the effect of remote work on satisfaction
The result shows that the intercept, treatment_group, and the interaction term post_treatment:treatment_group are statistically significant. The intercept (5.18) is the average satisfaction of the control group (Rubicon) before the treatment date, which nearly matches Rubicon's simulated baseline of 5.2. The intercept plus the treatment_group coefficient (5.18 + 0.93 = 6.11) gives the treatment group's (Giggle's) satisfaction before the treatment date, matching its baseline of 6.1. The post_treatment indicator is not statistically significant (p = 0.533), as expected, because nothing changes for Rubicon after the onset date.
The effect of remote work on satisfaction is given by the interaction term, -4.04, which is very close to -4, the value set at the beginning of the data simulation, and its 95% confidence interval (-4.114 to -3.961) contains the true value. The DID model fits well, with an R-squared of 0.7.
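The way each coefficient maps onto a group mean can be written out explicitly. The notation below is my own shorthand rather than anything produced by statsmodels: post stands for post_treatment, treat for treatment_group, and the betas are the four coefficients in the table above.

```latex
\begin{aligned}
\text{satisfaction}_i &= \beta_0 + \beta_1\,\text{post}_i + \beta_2\,\text{treat}_i
  + \beta_3\,(\text{post}_i \times \text{treat}_i) + \varepsilon_i \\[4pt]
E[\text{Rubicon, pre}]  &= \beta_0 \\
E[\text{Rubicon, post}] &= \beta_0 + \beta_1 \\
E[\text{Giggle, pre}]   &= \beta_0 + \beta_2 \\
E[\text{Giggle, post}]  &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \\[4pt]
\beta_3 &= \bigl(E[\text{Giggle, post}] - E[\text{Giggle, pre}]\bigr)
         - \bigl(E[\text{Rubicon, post}] - E[\text{Rubicon, pre}]\bigr)
\end{aligned}
```

Plugging in the estimates gives a Giggle post-treatment mean of 5.1806 + 0.0171 + 0.9328 - 4.0375 = 2.0930, close to the 6.1 - 4 = 2.1 implied by the simulation parameters.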
4.2 Question 6
Use data simulation and DiD analyses to examine whether the DiD analysis is robust to variations in the dataset, such as the size of the dataset, the strength of the treatment effect, and the difference in baseline levels of satisfaction between the two companies.
You do not need to analyze all of these factors, but your analyses should examine how at least one variable has an impact on the capacity of the DiD analysis to accurately detect the true effect of the remote work intervention.
# The necessary libraries were imported in Question 3, so we don't have to import them again
# import statsmodels.api as sm
# import statsmodels.formula.api as smf
# Manipulate the size of the dataset manually (one setting at a time)
# Let's try the default setting first
df_default_demo = df_default.copy()
remote_work_onset_date= pd.to_datetime("2023-01-01")
# And the corresponding default DID model
df_default_demo['post_treatment'] = (df_default_demo['date'] > remote_work_onset_date).astype(int)
df_default_demo['treatment_group'] = (df_default_demo['company'] == 'Giggle').astype(int)
df_default_demo.sample(10)
| | date | company | satisfaction | employee_id | time_at_company | employee_age | post_treatment | treatment_group |
|---|---|---|---|---|---|---|---|---|
| 7851 | 2023-01-01 | Giggle | 6.133468 | G1 | 365 | 52 | 0 | 1 |
| 8354 | 2023-03-12 | Giggle | 7.644481 | G4 | 435 | 50 | 1 | 1 |
| 4115 | 2023-07-30 | Rubicon | 4.318494 | R15 | 575 | 33 | 1 | 0 |
| 2417 | 2022-12-04 | Rubicon | 6.191785 | R17 | 337 | 33 | 0 | 0 |
| 805 | 2022-04-24 | Rubicon | 4.913599 | R5 | 113 | 32 | 0 | 0 |
| 6352 | 2022-06-05 | Giggle | 6.516512 | G2 | 155 | 20 | 0 | 1 |
| 2045 | 2022-10-09 | Rubicon | 4.601151 | R45 | 281 | 38 | 0 | 0 |
| 7176 | 2022-09-25 | Giggle | 7.238588 | G26 | 267 | 56 | 0 | 1 |
| 6417 | 2022-06-12 | Giggle | 5.891866 | G17 | 162 | 31 | 0 | 1 |
| 4047 | 2023-07-16 | Rubicon | 3.521234 | R47 | 561 | 42 | 1 | 0 |
did_formula = 'satisfaction ~ post_treatment + treatment_group + post_treatment*treatment_group'
did_model_demo = smf.ols(did_formula, data=df_default_demo)
did_results_demo = did_model_demo.fit()
print(did_results_demo.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.588
Model: OLS Adj. R-squared: 0.588
Method: Least Squares F-statistic: 4997.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:05:22 Log-Likelihood: -14922.
No. Observations: 10500 AIC: 2.985e+04
Df Residuals: 10496 BIC: 2.988e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.2461 0.019 269.408 0.000 5.208 5.284
post_treatment -0.0729 0.028 -2.634 0.008 -0.127 -0.019
treatment_group 0.8374 0.028 30.407 0.000 0.783 0.891
post_treatment:treatment_group 2.1401 0.039 54.690 0.000 2.063 2.217
==============================================================================
Omnibus: 1.764 Durbin-Watson: 1.986
Prob(Omnibus): 0.414 Jarque-Bera (JB): 1.764
Skew: 0.032 Prob(JB): 0.414
Kurtosis: 2.998 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Begin to manipulate the size of the datasets
# Expand the size of the dataset (L here means Larger)
num_employees_L_parameters=default_simulation_parameters.copy()
num_employees_L_parameters["num_employees_per_company"] = 200
num_employees_L_df= simulate_data(num_employees_L_parameters)
# Similarly, prepare dummy variables for later DID analysis
remote_work_onset_date= pd.to_datetime("2023-01-01")
num_employees_L_df['post_treatment'] = (num_employees_L_df['date'] > remote_work_onset_date).astype(int)
num_employees_L_df['treatment_group'] = (num_employees_L_df['company'] == 'Giggle').astype(int)
num_employees_L_df.sample(10)
| | date | company | satisfaction | employee_id | time_at_company | employee_age | post_treatment | treatment_group |
|---|---|---|---|---|---|---|---|---|
| 29554 | 2022-10-23 | Giggle | 4.519462 | G154 | 295 | 28 | 0 | 1 |
| 40760 | 2023-11-19 | Giggle | 7.266917 | G160 | 687 | 47 | 1 | 1 |
| 5144 | 2022-06-26 | Rubicon | 6.220201 | R144 | 176 | 58 | 0 | 0 |
| 32474 | 2023-02-05 | Giggle | 7.041340 | G74 | 400 | 41 | 1 | 1 |
| 31455 | 2023-01-01 | Giggle | 6.718191 | G55 | 365 | 59 | 0 | 1 |
| 18862 | 2023-10-22 | Rubicon | 7.467705 | R62 | 659 | 35 | 1 | 0 |
| 5827 | 2022-07-24 | Rubicon | 3.914091 | R27 | 204 | 47 | 0 | 0 |
| 18861 | 2023-10-22 | Rubicon | 6.634394 | R61 | 659 | 58 | 1 | 0 |
| 37760 | 2023-08-06 | Giggle | 9.113765 | G160 | 582 | 26 | 1 | 1 |
| 34626 | 2023-04-23 | Giggle | 6.766272 | G26 | 477 | 20 | 1 | 1 |
# Perform the DID analyses on larger dataset
did_model_L = smf.ols(did_formula, data=num_employees_L_df)
did_results_L = did_model_L.fit()
print(did_results_L.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.581
Model: OLS Adj. R-squared: 0.581
Method: Least Squares F-statistic: 1.944e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:08:45 Log-Likelihood: -59727.
No. Observations: 42000 AIC: 1.195e+05
Df Residuals: 41996 BIC: 1.195e+05
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1933 0.010 532.991 0.000 5.174 5.212
post_treatment 0.0041 0.014 0.298 0.766 -0.023 0.031
treatment_group 0.9036 0.014 65.576 0.000 0.877 0.931
post_treatment:treatment_group 1.9986 0.020 102.071 0.000 1.960 2.037
==============================================================================
Omnibus: 0.301 Durbin-Watson: 1.976
Prob(Omnibus): 0.860 Jarque-Bera (JB): 0.287
Skew: -0.004 Prob(JB): 0.866
Kurtosis: 3.010 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# let's try smaller dataset size
# Begin to manipulate the size of the datasets
# Decrease the size of the dataset (S here means Small)
num_employees_S_parameters=default_simulation_parameters.copy()
num_employees_S_parameters["num_employees_per_company"] = 30
num_employees_S_df= simulate_data(num_employees_S_parameters)
remote_work_onset_date= pd.to_datetime("2023-01-01")
num_employees_S_df['post_treatment'] = (num_employees_S_df['date'] > remote_work_onset_date).astype(int)
num_employees_S_df['treatment_group'] = (num_employees_S_df['company'] == 'Giggle').astype(int)
num_employees_S_df.sample(10)
| | date | company | satisfaction | employee_id | time_at_company | employee_age | post_treatment | treatment_group |
|---|---|---|---|---|---|---|---|---|
| 4022 | 2022-07-24 | Giggle | 5.843885 | G2 | 204 | 43 | 0 | 1 |
| 2915 | 2023-11-12 | Rubicon | 5.035775 | R5 | 680 | 49 | 1 | 0 |
| 4923 | 2023-02-19 | Giggle | 8.033658 | G3 | 414 | 27 | 1 | 1 |
| 5868 | 2023-09-24 | Giggle | 7.737370 | G18 | 631 | 44 | 1 | 1 |
| 5124 | 2023-04-02 | Giggle | 9.773210 | G24 | 456 | 20 | 1 | 1 |
| 3756 | 2022-05-22 | Giggle | 5.656828 | G6 | 141 | 43 | 0 | 1 |
| 2794 | 2023-10-15 | Rubicon | 3.910319 | R4 | 652 | 31 | 1 | 0 |
| 1524 | 2022-12-18 | Rubicon | 4.697491 | R24 | 351 | 30 | 0 | 0 |
| 5769 | 2023-09-03 | Giggle | 6.691046 | G9 | 610 | 52 | 1 | 1 |
| 4898 | 2023-02-12 | Giggle | 8.469248 | G8 | 407 | 57 | 1 | 1 |
did_model_S = smf.ols(did_formula, data=num_employees_S_df)
did_results_S = did_model_S.fit()
print(did_results_S.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.582
Model: OLS Adj. R-squared: 0.582
Method: Least Squares F-statistic: 2926.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:20:47 Log-Likelihood: -8868.3
No. Observations: 6300 AIC: 1.774e+04
Df Residuals: 6296 BIC: 1.777e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.2103 0.025 210.049 0.000 5.162 5.259
post_treatment 0.0012 0.035 0.033 0.974 -0.068 0.070
treatment_group 0.8800 0.035 25.085 0.000 0.811 0.949
post_treatment:treatment_group 1.9875 0.050 39.871 0.000 1.890 2.085
==============================================================================
Omnibus: 3.639 Durbin-Watson: 1.976
Prob(Omnibus): 0.162 Jarque-Bera (JB): 3.675
Skew: 0.054 Prob(JB): 0.159
Kurtosis: 2.952 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Let's try an even smaller dataset
# Decrease the size of the dataset further (s here means smaller)
num_employees_s_parameters=default_simulation_parameters.copy()
num_employees_s_parameters["num_employees_per_company"] = 10
num_employees_s_df= simulate_data(num_employees_s_parameters)
remote_work_onset_date= pd.to_datetime("2023-01-01")
num_employees_s_df['post_treatment'] = (num_employees_s_df['date'] > remote_work_onset_date).astype(int)
num_employees_s_df['treatment_group'] = (num_employees_s_df['company'] == 'Giggle').astype(int)
num_employees_s_df.sample(10)
| | date | company | satisfaction | employee_id | time_at_company | employee_age | post_treatment | treatment_group |
|---|---|---|---|---|---|---|---|---|
| 205 | 2022-05-22 | Rubicon | 6.237281 | R5 | 141 | 53 | 0 | 0 |
| 1435 | 2022-09-25 | Giggle | 7.342051 | G5 | 267 | 50 | 0 | 1 |
| 786 | 2023-07-02 | Rubicon | 5.997104 | R6 | 547 | 19 | 1 | 0 |
| 1513 | 2022-11-20 | Giggle | 6.355691 | G3 | 323 | 43 | 0 | 1 |
| 1263 | 2022-05-29 | Giggle | 5.765894 | G3 | 148 | 53 | 0 | 1 |
| 1675 | 2023-03-12 | Giggle | 7.855803 | G5 | 435 | 57 | 1 | 1 |
| 147 | 2022-04-10 | Rubicon | 7.331843 | R7 | 99 | 52 | 0 | 0 |
| 1700 | 2023-04-02 | Giggle | 6.952891 | G0 | 456 | 21 | 1 | 1 |
| 14 | 2022-01-09 | Rubicon | 6.464313 | R4 | 8 | 42 | 0 | 0 |
| 431 | 2022-10-30 | Rubicon | 3.757645 | R1 | 302 | 30 | 0 | 0 |
did_model_s = smf.ols(did_formula, data=num_employees_s_df)
did_results_s = did_model_s.fit()
print(did_results_s.summary())
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.578
Model: OLS Adj. R-squared: 0.577
Method: Least Squares F-statistic: 956.7
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:22:15 Log-Likelihood: -2986.1
No. Observations: 2100 AIC: 5980.
Df Residuals: 2096 BIC: 6003.
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.2405 0.044 120.167 0.000 5.155 5.326
post_treatment -0.0528 0.062 -0.852 0.394 -0.174 0.069
treatment_group 0.9297 0.062 15.074 0.000 0.809 1.051
post_treatment:treatment_group 1.9809 0.088 22.603 0.000 1.809 2.153
==============================================================================
Omnibus: 0.453 Durbin-Watson: 2.014
Prob(Omnibus): 0.797 Jarque-Bera (JB): 0.394
Skew: -0.028 Prob(JB): 0.821
Kurtosis: 3.036 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Additional note (see below)
The DiD analysis is relatively robust to changes in the size of the dataset, at least within the range of parameters tested here. In all three summary tables the interaction term is statistically significant, with a coefficient of about 2 (almost equal to the true treatment effect). The point estimates shift only slightly as the sample size changes (1.9986, 1.9875, and 1.9809), although the standard error of the interaction term grows from 0.020 to 0.088 as the sample shrinks. The R-squared values are also stable at around 0.58. This preliminary conclusion may not hold in general, so the cells below vary the other parameters in turn.
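One way to read the three tables side by side is through the standard error of the interaction term. The sketch below is not part of the original analysis. It assumes four roughly equal-sized group-by-period cells and a noise standard deviation of 1 (the `satisfaction_variance` default), in which case the standard error of a 2x2 DiD interaction is approximately 4/sqrt(N). The helper name `approx_interaction_se` is hypothetical, and the reported values are copied from the summaries above.

```python
import math

# (number of observations, reported std err of post_treatment:treatment_group)
reported = [(42000, 0.020), (6300, 0.050), (2100, 0.088)]


def approx_interaction_se(n_obs, sigma=1.0):
    """Approximate SE of a 2x2 DiD interaction, assuming four equal-sized cells.

    The interaction is a difference of four independent cell means, so its
    variance is sigma^2 * (1/n + 1/n + 1/n + 1/n) with n = n_obs / 4.
    """
    cell_size = n_obs / 4
    return sigma * math.sqrt(4 / cell_size)


for n_obs, se in reported:
    approx = approx_interaction_se(n_obs)
    print(f"N={n_obs:>6}  reported SE={se:.3f}  approximate SE={approx:.3f}")
```

Under these assumptions the precision of the estimate, rather than its location, is what changes with sample size: the coefficient stays near 2 while its uncertainty shrinks roughly with the square root of N.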
# To make the repeated process easier to run, combine the steps into a single function
import statsmodels.api as sm
import statsmodels.formula.api as smf
def variation(changed_variable, value):
    # Copy the default parameters and override the one being varied
    new_parameters = default_simulation_parameters.copy()
    new_parameters[changed_variable] = value
    df_variation = simulate_data(new_parameters)
    # Read the onset date from the parameters (2023-01-01 by default)
    remote_work_onset_date = new_parameters["remote_work_onset_date"]
    df_variation['post_treatment'] = (df_variation['date'] > remote_work_onset_date).astype(int)
    df_variation['treatment_group'] = (df_variation['company'] == 'Giggle').astype(int)
    # In a formula, a*b already expands to a + b + a:b, so the repeated main effects are harmless
    did_formula = 'satisfaction ~ post_treatment + treatment_group + post_treatment*treatment_group'
    did_model_variation = smf.ols(did_formula, data=df_variation)
    did_results_variation = did_model_variation.fit()
    # Print the summary table; the function itself returns None
    print(did_results_variation.summary())
# Other attempts: change the strength of the treatment effect,
# and the difference in baseline levels of satisfaction between the two companies
# Try positive treatment effects of different strengths
variation("remote_work_treatment_effect",3)
variation("remote_work_treatment_effect",5)
variation("remote_work_treatment_effect",10)
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.717
Model: OLS Adj. R-squared: 0.716
Method: Least Squares F-statistic: 8845.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:26:31 Log-Likelihood: -14938.
No. Observations: 10500 AIC: 2.988e+04
Df Residuals: 10496 BIC: 2.991e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1810 0.020 265.658 0.000 5.143 5.219
post_treatment 0.0238 0.028 0.859 0.390 -0.031 0.078
treatment_group 0.9081 0.028 32.924 0.000 0.854 0.962
post_treatment:treatment_group 2.9789 0.039 76.007 0.000 2.902 3.056
==============================================================================
Omnibus: 11.114 Durbin-Watson: 2.013
Prob(Omnibus): 0.004 Jarque-Bera (JB): 11.107
Skew: 0.079 Prob(JB): 0.00387
Kurtosis: 3.018 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.856
Model: OLS Adj. R-squared: 0.856
Method: Least Squares F-statistic: 2.080e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:26:32 Log-Likelihood: -14885.
No. Observations: 10500 AIC: 2.978e+04
Df Residuals: 10496 BIC: 2.981e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1824 0.019 267.089 0.000 5.144 5.220
post_treatment 0.0184 0.028 0.667 0.505 -0.036 0.072
treatment_group 0.9619 0.027 35.053 0.000 0.908 1.016
post_treatment:treatment_group 4.9165 0.039 126.087 0.000 4.840 4.993
==============================================================================
Omnibus: 1.494 Durbin-Watson: 2.018
Prob(Omnibus): 0.474 Jarque-Bera (JB): 1.493
Skew: -0.007 Prob(JB): 0.474
Kurtosis: 2.943 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.954
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 7.283e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:26:32 Log-Likelihood: -14924.
No. Observations: 10500 AIC: 2.986e+04
Df Residuals: 10496 BIC: 2.988e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.2105 0.019 267.540 0.000 5.172 5.249
post_treatment -0.0034 0.028 -0.123 0.902 -0.058 0.051
treatment_group 0.9057 0.028 32.883 0.000 0.852 0.960
post_treatment:treatment_group 9.9607 0.039 254.499 0.000 9.884 10.037
==============================================================================
Omnibus: 2.294 Durbin-Watson: 2.028
Prob(Omnibus): 0.318 Jarque-Bera (JB): 2.317
Skew: -0.014 Prob(JB): 0.314
Kurtosis: 3.067 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Additional note (see below)
Again, when I change the strength of the positive treatment effect, the interaction term still captures it, with a coefficient almost equal to the true value (2.98, 4.92, and 9.96 for effects of 3, 5, and 10). It is also worth noting that the R-squared increases with the magnitude of the treatment effect (0.717, 0.856, and 0.954), at least in the runs observed so far.
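The link between the treatment effect and R-squared can be checked with a short calculation. The sketch below is an illustration rather than part of the original notebook. It assumes equal numbers of employees in each company, no time trend, noise variance 1, and 53 pre-treatment versus 52 post-treatment weeks (which is what weekly dates from 2022-01-01 to 2023-12-31 and the `date > 2023-01-01` rule imply). The function name `expected_r_squared` is hypothetical.

```python
def expected_r_squared(control_mean, treated_mean, effect,
                       noise_variance=1.0, n_pre=53, n_post=52):
    """Population R-squared of a saturated 2x2 DiD model.

    R-squared = between-cell variance / (between-cell variance + noise variance),
    where the four cells are control/treated crossed with pre/post.
    """
    total = 2 * (n_pre + n_post)
    # (cell mean, share of observations) for the four group-by-period cells
    cells = [
        (control_mean, n_pre / total),
        (control_mean, n_post / total),
        (treated_mean, n_pre / total),
        (treated_mean + effect, n_post / total),
    ]
    grand_mean = sum(mean * weight for mean, weight in cells)
    explained = sum(weight * (mean - grand_mean) ** 2 for mean, weight in cells)
    return explained / (explained + noise_variance)


for effect in (2, 3, 5, 10, -3, -5, -10):
    print(effect, round(expected_r_squared(5.2, 6.1, effect), 3))
```

Under these assumptions the calculation lands close to the R-squared values in the printed summaries. It also suggests that what drives R-squared is how far apart the four cell means are, so a negative effect that partly offsets the baseline gap yields a lower R-squared than a positive effect of the same size.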
# Try negative treatment effects of different strengths
variation("remote_work_treatment_effect",-3)
variation("remote_work_treatment_effect",-5)
variation("remote_work_treatment_effect",-10)
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.552
Model: OLS Adj. R-squared: 0.551
Method: Least Squares F-statistic: 4304.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:29:35 Log-Likelihood: -14924.
No. Observations: 10500 AIC: 2.986e+04
Df Residuals: 10496 BIC: 2.988e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1595 0.019 264.924 0.000 5.121 5.198
post_treatment 0.0821 0.028 2.966 0.003 0.028 0.136
treatment_group 0.9561 0.028 34.713 0.000 0.902 1.010
post_treatment:treatment_group -3.1143 0.039 -79.573 0.000 -3.191 -3.038
==============================================================================
Omnibus: 1.638 Durbin-Watson: 1.997
Prob(Omnibus): 0.441 Jarque-Bera (JB): 1.607
Skew: -0.020 Prob(JB): 0.448
Kurtosis: 3.046 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.789
Model: OLS Adj. R-squared: 0.789
Method: Least Squares F-statistic: 1.309e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:29:35 Log-Likelihood: -14969.
No. Observations: 10500 AIC: 2.995e+04
Df Residuals: 10496 BIC: 2.998e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.2102 0.020 266.372 0.000 5.172 5.249
post_treatment -0.0088 0.028 -0.316 0.752 -0.063 0.046
treatment_group 0.8888 0.028 32.131 0.000 0.835 0.943
post_treatment:treatment_group -5.0155 0.039 -127.596 0.000 -5.093 -4.938
==============================================================================
Omnibus: 2.947 Durbin-Watson: 2.006
Prob(Omnibus): 0.229 Jarque-Bera (JB): 2.995
Skew: 0.019 Prob(JB): 0.224
Kurtosis: 3.073 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.943
Model: OLS Adj. R-squared: 0.943
Method: Least Squares F-statistic: 5.739e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:29:35 Log-Likelihood: -14941.
No. Observations: 10500 AIC: 2.989e+04
Df Residuals: 10496 BIC: 2.992e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1695 0.020 264.991 0.000 5.131 5.208
post_treatment 0.0354 0.028 1.278 0.201 -0.019 0.090
treatment_group 0.9236 0.028 33.477 0.000 0.870 0.978
post_treatment:treatment_group -10.0195 0.039 -255.573 0.000 -10.096 -9.943
==============================================================================
Omnibus: 4.377 Durbin-Watson: 1.990
Prob(Omnibus): 0.112 Jarque-Bera (JB): 4.181
Skew: -0.021 Prob(JB): 0.124
Kurtosis: 2.912 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# The above conclusion also holds for negative effects (again, based on these observations)
# Lastly, let's vary the difference in baseline levels of satisfaction between the two companies.
# Start by widening the gap between the baseline levels
variation("satisfaction_levels",{'Rubicon': 5.2, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 2, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 1, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 0.2, 'Giggle': 6.1})
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.579
Model: OLS Adj. R-squared: 0.579
Method: Least Squares F-statistic: 4817.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:32:30 Log-Likelihood: -14962.
No. Observations: 10500 AIC: 2.993e+04
Df Residuals: 10496 BIC: 2.996e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.1787 0.020 264.933 0.000 5.140 5.217
post_treatment 0.0159 0.028 0.571 0.568 -0.039 0.070
treatment_group 0.9536 0.028 34.496 0.000 0.899 1.008
post_treatment:treatment_group 1.9393 0.039 49.369 0.000 1.862 2.016
==============================================================================
Omnibus: 2.305 Durbin-Watson: 2.018
Prob(Omnibus): 0.316 Jarque-Bera (JB): 2.288
Skew: -0.019 Prob(JB): 0.319
Kurtosis: 2.939 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.875
Model: OLS Adj. R-squared: 0.875
Method: Least Squares F-statistic: 2.449e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:32:30 Log-Likelihood: -14851.
No. Observations: 10500 AIC: 2.971e+04
Df Residuals: 10496 BIC: 2.974e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 2.0045 0.019 103.642 0.000 1.967 2.042
post_treatment -0.0024 0.027 -0.086 0.932 -0.056 0.052
treatment_group 4.0719 0.027 148.868 0.000 4.018 4.126
post_treatment:treatment_group 2.0179 0.039 51.918 0.000 1.942 2.094
==============================================================================
Omnibus: 0.839 Durbin-Watson: 2.005
Prob(Omnibus): 0.658 Jarque-Bera (JB): 0.804
Skew: -0.016 Prob(JB): 0.669
Kurtosis: 3.028 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.907
Model: OLS Adj. R-squared: 0.907
Method: Least Squares F-statistic: 3.423e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:32:31 Log-Likelihood: -14951.
No. Observations: 10500 AIC: 2.991e+04
Df Residuals: 10496 BIC: 2.994e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 0.9902 0.020 50.715 0.000 0.952 1.029
post_treatment -0.0009 0.028 -0.033 0.974 -0.055 0.053
treatment_group 5.1426 0.028 186.232 0.000 5.088 5.197
post_treatment:treatment_group 1.9895 0.039 50.703 0.000 1.913 2.066
==============================================================================
Omnibus: 1.896 Durbin-Watson: 2.043
Prob(Omnibus): 0.387 Jarque-Bera (JB): 1.886
Skew: 0.016 Prob(JB): 0.390
Kurtosis: 3.057 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.926
Model: OLS Adj. R-squared: 0.926
Method: Least Squares F-statistic: 4.401e+04
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:32:31 Log-Likelihood: -14834.
No. Observations: 10500 AIC: 2.968e+04
Df Residuals: 10496 BIC: 2.970e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 0.2016 0.019 10.438 0.000 0.164 0.239
post_treatment -0.0193 0.027 -0.705 0.481 -0.073 0.034
treatment_group 5.9032 0.027 216.174 0.000 5.850 5.957
post_treatment:treatment_group 2.0247 0.039 52.177 0.000 1.949 2.101
==============================================================================
Omnibus: 1.619 Durbin-Watson: 2.001
Prob(Omnibus): 0.445 Jarque-Bera (JB): 1.624
Skew: 0.013 Prob(JB): 0.444
Kurtosis: 2.945 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Additional note (see below)
Based on these observations, the interaction term always captures the treatment effect with a relatively accurate value (between 1.94 and 2.02). The R-squared appears to increase as the baseline gap widens (from 0.579 to 0.926), while the treatment_group coefficient tracks the gap itself.
# Now narrow the distance between the two baselines
variation("satisfaction_levels",{'Rubicon': 5.9, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 5.95, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 6, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 6.05, 'Giggle': 6.1})
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.462
Model: OLS Adj. R-squared: 0.462
Method: Least Squares F-statistic: 3010.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:36:23 Log-Likelihood: -14896.
No. Observations: 10500 AIC: 2.980e+04
Df Residuals: 10496 BIC: 2.983e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.9052 0.019 304.015 0.000 5.867 5.943
post_treatment -0.0196 0.028 -0.709 0.478 -0.074 0.035
treatment_group 0.2114 0.027 7.696 0.000 0.158 0.265
post_treatment:treatment_group 2.0107 0.039 51.510 0.000 1.934 2.087
==============================================================================
Omnibus: 0.440 Durbin-Watson: 1.977
Prob(Omnibus): 0.802 Jarque-Bera (JB): 0.467
Skew: -0.012 Prob(JB): 0.792
Kurtosis: 2.978 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.453
Model: OLS Adj. R-squared: 0.452
Method: Least Squares F-statistic: 2893.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:36:24 Log-Likelihood: -14828.
No. Observations: 10500 AIC: 2.966e+04
Df Residuals: 10496 BIC: 2.969e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.9603 0.019 308.849 0.000 5.922 5.998
post_treatment -0.0164 0.027 -0.599 0.549 -0.070 0.037
treatment_group 0.1564 0.027 5.731 0.000 0.103 0.210
post_treatment:treatment_group 1.9938 0.039 51.410 0.000 1.918 2.070
==============================================================================
Omnibus: 0.580 Durbin-Watson: 1.996
Prob(Omnibus): 0.748 Jarque-Bera (JB): 0.608
Skew: -0.014 Prob(JB): 0.738
Kurtosis: 2.975 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.441
Model: OLS Adj. R-squared: 0.441
Method: Least Squares F-statistic: 2764.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:36:24 Log-Likelihood: -14909.
No. Observations: 10500 AIC: 2.983e+04
Df Residuals: 10496 BIC: 2.986e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 5.9993 0.019 308.475 0.000 5.961 6.037
post_treatment -0.0312 0.028 -1.128 0.259 -0.085 0.023
treatment_group 0.1337 0.028 4.863 0.000 0.080 0.188
post_treatment:treatment_group 1.9884 0.039 50.876 0.000 1.912 2.065
==============================================================================
Omnibus: 0.266 Durbin-Watson: 1.996
Prob(Omnibus): 0.876 Jarque-Bera (JB): 0.294
Skew: 0.005 Prob(JB): 0.863
Kurtosis: 2.976 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: satisfaction R-squared: 0.426
Model: OLS Adj. R-squared: 0.426
Method: Least Squares F-statistic: 2600.
Date: Thu, 03 Oct 2024 Prob (F-statistic): 0.00
Time: 23:36:24 Log-Likelihood: -14988.
No. Observations: 10500 AIC: 2.998e+04
Df Residuals: 10496 BIC: 3.001e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 6.0456 0.020 308.525 0.000 6.007 6.084
post_treatment 0.0418 0.028 1.499 0.134 -0.013 0.096
treatment_group 0.0635 0.028 2.290 0.022 0.009 0.118
post_treatment:treatment_group 1.9434 0.039 49.351 0.000 1.866 2.021
==============================================================================
Omnibus: 1.521 Durbin-Watson: 1.975
Prob(Omnibus): 0.467 Jarque-Bera (JB): 1.534
Skew: -0.015 Prob(JB): 0.465
Kurtosis: 2.948 Cond. No. 6.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Additional note (see below)
Based on these observations, when the gap narrows, the interaction term still captures the treatment effect (between 1.94 and 2.01). The R-squared appears to shrink as the gap decreases (from 0.462 to 0.426).
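Why the baseline gap leaves the interaction term untouched can be seen from the four cell means: in a saturated 2x2 model the interaction coefficient equals (treated post minus treated pre) minus (control post minus control pre), so any constant difference between the companies cancels. The sketch below illustrates this with a hypothetical, simplified simulation (independent Gaussian noise, no time trend); it is not the notebook's `simulate_data`, and the function names are made up for the example.

```python
import random
from statistics import mean


def simulate_cell(cell_mean, size, rng, sd=1.0):
    """Draw `size` noisy satisfaction scores around `cell_mean`."""
    return [rng.gauss(cell_mean, sd) for _ in range(size)]


def did_estimate(control_baseline, treated_baseline, effect, size=2000, seed=0):
    """Difference-in-differences computed directly from four cell means."""
    rng = random.Random(seed)
    control_pre = mean(simulate_cell(control_baseline, size, rng))
    control_post = mean(simulate_cell(control_baseline, size, rng))
    treated_pre = mean(simulate_cell(treated_baseline, size, rng))
    treated_post = mean(simulate_cell(treated_baseline + effect, size, rng))
    return (treated_post - treated_pre) - (control_post - control_pre)


# Vary the control (Rubicon) baseline while keeping the treated baseline at 6.1
estimates = {
    rubicon: did_estimate(rubicon, 6.1, effect=2, seed=i)
    for i, rubicon in enumerate((0.2, 2, 5.2, 6.05))
}
for rubicon, estimate in estimates.items():
    print(f"Rubicon baseline {rubicon}: DiD estimate {estimate:.3f}")
```

In this setup each estimate should sit near the true effect of 2 regardless of the baseline gap, which matches the pattern in the regression tables above.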
Conclusion to Q6¶
In the section above, we varied several features of the dataset: its size, the strength of the treatment effect, and the difference in baseline satisfaction between the two companies. The results show that the DiD analysis consistently captures the treatment effect: in every run the interaction term is statistically significant and almost identical to the true effect. The estimated effects do change slightly from run to run, but there are no large discrepancies. In addition, the coefficient on post_treatment alone varies across runs; it is occasionally statistically significant but usually is not. The R-squared also changes with the adjustments. Lastly, the other model outputs (the remaining coefficients and the fit statistics) are sensitive to changes in the parameters; more details can be found in each analysis section above.
Part 2: Twitter Data Analysis
Twitter Misinformation Analysis Notebook (click to expand)
Section 1: Twitter Dataset¶
The dataset that accompanies this paper has been compiled and included below as a Pandas dataframe (assigned to the variable mccabe_data). Please base your main analyses on this shared dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
import statsmodels.api as sm
import statsmodels.formula.api as smf
mccabe_data = pd.read_csv('/home/jovyan/compss-214a/mccabe-public-data.csv')
You are welcome to rename the dataset or work with different subsets of this data or with additional datasets if necessary, but this shared dataset should be the primary source for your analyses, so that we are all working with the same underlying source of information.
Section 2 Exploring the structure of the dataset¶
Describe the key variables you are interested in. Feel free to include data summaries and/or visualizations that illustrate how the dataset is structured, such as the different groups of users you are interested in and the different measures of whether posts are classified as misinformation, etc.
Section 2-1 Overall data structure¶
# Load the dataset, rename it as df and make a copy
df = mccabe_data.copy()
df.shape
(32968, 29)
df.head()
| | date | fake_merged | fake_merged_initiation | fake_merged_rt | fake_grinberg_initiation | fake_grinberg_rt | fake_grinberg_rb_initiation | fake_grinberg_rb_rt | fake_newsguard_initiation | fake_newsguard_rt | ... | not_fake_shopping | not_fake_shopping_initiation | not_fake_shopping_rt | not_fake_sports | not_fake_sports_initiation | not_fake_sports_rt | n | stat | nusers | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-11-30 | 875.0 | 199.0 | 676.0 | 74.0 | 207.0 | 42.0 | 138.0 | 188.0 | 653.0 | ... | 196.0 | 61.0 | 135.0 | 16.0 | 7.0 | 9.0 | 12387.0 | total | 4390 | fns |
| 1 | 2019-12-01 | 3382.0 | 825.0 | 2557.0 | 257.0 | 941.0 | 120.0 | 546.0 | 760.0 | 2293.0 | ... | 608.0 | 207.0 | 401.0 | 99.0 | 33.0 | 66.0 | 54897.0 | total | 11629 | fns |
| 2 | 2019-12-02 | 3644.0 | 992.0 | 2652.0 | 280.0 | 780.0 | 141.0 | 479.0 | 926.0 | 2455.0 | ... | 684.0 | 289.0 | 395.0 | 82.0 | 37.0 | 45.0 | 68505.0 | total | 13132 | fns |
| 3 | 2019-12-03 | 4184.0 | 1110.0 | 3074.0 | 339.0 | 921.0 | 185.0 | 562.0 | 1052.0 | 2890.0 | ... | 782.0 | 236.0 | 546.0 | 92.0 | 41.0 | 51.0 | 74502.0 | total | 13997 | fns |
| 4 | 2019-12-04 | 4436.0 | 1100.0 | 3336.0 | 307.0 | 1171.0 | 135.0 | 540.0 | 1038.0 | 3146.0 | ... | 540.0 | 261.0 | 279.0 | 124.0 | 53.0 | 71.0 | 71762.0 | total | 13967 | fns |
5 rows × 29 columns
# See the complete list of column names
df.columns
Index(['date', 'fake_merged', 'fake_merged_initiation', 'fake_merged_rt',
'fake_grinberg_initiation', 'fake_grinberg_rt',
'fake_grinberg_rb_initiation', 'fake_grinberg_rb_rt',
'fake_newsguard_initiation', 'fake_newsguard_rt', 'not_fake',
'not_fake_initiation', 'not_fake_rt', 'not_fake_conservative',
'not_fake_conservative_initiation', 'not_fake_conservative_rt',
'not_fake_liberal', 'not_fake_liberal_initiation',
'not_fake_liberal_rt', 'not_fake_shopping',
'not_fake_shopping_initiation', 'not_fake_shopping_rt',
'not_fake_sports', 'not_fake_sports_initiation', 'not_fake_sports_rt',
'n', 'stat', 'nusers', 'group'],
dtype='object')
# Some columns are omitted from the display above, so let's look at one complete row
df.iloc[0]
date                                2019-11-30
fake_merged                         875.0
fake_merged_initiation              199.0
fake_merged_rt                      676.0
fake_grinberg_initiation            74.0
fake_grinberg_rt                    207.0
fake_grinberg_rb_initiation         42.0
fake_grinberg_rb_rt                 138.0
fake_newsguard_initiation           188.0
fake_newsguard_rt                   653.0
not_fake                            11512.0
not_fake_initiation                 4357.0
not_fake_rt                         7155.0
not_fake_conservative               529.0
not_fake_conservative_initiation    156.0
not_fake_conservative_rt            373.0
not_fake_liberal                    1030.0
not_fake_liberal_initiation         247.0
not_fake_liberal_rt                 783.0
not_fake_shopping                   196.0
not_fake_shopping_initiation        61.0
not_fake_shopping_rt                135.0
not_fake_sports                     16.0
not_fake_sports_initiation          7.0
not_fake_sports_rt                  9.0
n                                   12387.0
stat                                total
nusers                              4390
group                               fns
Name: 0, dtype: object
# See which values the non-numerical columns contain
df["stat"].unique(), df["group"].unique()
(array(['total', 'avg'], dtype=object),
array(['fns', 'suspended', 'ha', 'ma', 'la', 'qanon', 'av', 'ss1', 'ss5',
'A', 'B', 'D', 'F', 'all', 'nfns', 'nfns_ha', 'nfns_ma', 'nfns_la',
'A_ha', 'B_ha', 'D_ha', 'F_ha', 'A_ma', 'B_ma', 'D_ma', 'F_ma',
'A_la', 'B_la', 'D_la', 'F_la'], dtype=object))
# Some simple observations (see the text below)
Based on simple calculations and the codebook, fake_merged is the sum of fake_merged_initiation and fake_merged_rt (in row 0, 199 + 676 = 875). The same holds for not_fake, not_fake_conservative, not_fake_liberal, not_fake_shopping, and not_fake_sports. In addition, n is the sum of fake_merged and not_fake (875 + 11512 = 12387), whereas nusers (4390 in row 0) is a separate count of users and not such a sum. It is also worth noting that stat takes two distinct values, total and avg, so the two must be handled separately in the later analyses.
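The additive relationships among these columns can be checked directly. The following is a minimal sketch that uses only the values of row 0 printed above; the helper name check_identities is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Values copied from row 0 shown above (fns group, stat == "total", 2019-11-30)
row0 = pd.Series({
    "fake_merged": 875.0, "fake_merged_initiation": 199.0, "fake_merged_rt": 676.0,
    "not_fake": 11512.0, "not_fake_initiation": 4357.0, "not_fake_rt": 7155.0,
    "n": 12387.0, "nusers": 4390.0,
})

def check_identities(row, tol=1e-9):
    """Hypothetical helper: test the additive identities for a single row."""
    return {
        "fake_merged = initiation + rt":
            bool(abs(row["fake_merged_initiation"] + row["fake_merged_rt"] - row["fake_merged"]) < tol),
        "not_fake = initiation + rt":
            bool(abs(row["not_fake_initiation"] + row["not_fake_rt"] - row["not_fake"]) < tol),
        "n = fake_merged + not_fake":
            bool(abs(row["fake_merged"] + row["not_fake"] - row["n"]) < tol),
    }

checks = check_identities(row0)
for name, ok in checks.items():
    print(f"{name}: {ok}")

# nusers is not the sum of the two tweet counts
nusers_is_sum = bool(row0["nusers"] == row0["fake_merged"] + row0["not_fake"])
print("nusers = fake_merged + not_fake:", nusers_is_sum)
```

On the full dataset, the same comparison could be run column-wise on the rows with stat == "total"; the avg rows may need a small tolerance because they are not integers.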
# To see what avg and total mean, look at one arbitrarily selected date
df_selected = df[df['date'] == '2019-11-30']
# df_selected
# To keep the PDF file clean, I did not run this cell here; during the analysis, I did look at the full output
# The data structure becomes clear now (see the notes below)
It seems that each date is supposed to have 60 rows (the 30 categories in "group" multiplied by the two kinds of stat: total and avg). But since the total number of rows is not evenly divisible by 60, I write a function below to find any inconsistencies and figure out the reasons for them. Before that, let's check the distribution of dates first.
df['date'] = pd.to_datetime(df['date'])
# Data summary and descriptive statistics
df_total=df[df["stat"]=="total"]
summary_stats = df_total.describe(include='all').style.format(precision=2)
summary_stats
| date | fake_merged | fake_merged_initiation | fake_merged_rt | fake_grinberg_initiation | fake_grinberg_rt | fake_grinberg_rb_initiation | fake_grinberg_rb_rt | fake_newsguard_initiation | fake_newsguard_rt | not_fake | not_fake_initiation | not_fake_rt | not_fake_conservative | not_fake_conservative_initiation | not_fake_conservative_rt | not_fake_liberal | not_fake_liberal_initiation | not_fake_liberal_rt | not_fake_shopping | not_fake_shopping_initiation | not_fake_shopping_rt | not_fake_sports | not_fake_sports_initiation | not_fake_sports_rt | n | stat | nusers | group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 16484 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484.00 | 16484 | 16484.00 | 16484 |
| unique | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | nan | 30 |
| top | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | total | nan | fns |
| freq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 16484 | nan | 550 |
| mean | 2020-08-29 19:27:21.446251008 | 1816.38 | 379.00 | 1437.38 | 116.53 | 484.67 | 67.94 | 302.52 | 360.77 | 1351.67 | 34438.99 | 14423.19 | 20015.80 | 1552.06 | 554.76 | 997.29 | 2109.63 | 507.11 | 1602.52 | 710.08 | 333.27 | 376.80 | 53.93 | 19.02 | 34.91 | 36255.37 | nan | 9600.57 | nan |
| min | 2019-11-30 00:00:00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | nan | 1.00 | nan |
| 25% | 2020-04-15 00:00:00 | 116.00 | 33.00 | 77.00 | 7.00 | 18.00 | 2.00 | 6.00 | 31.00 | 71.00 | 4235.00 | 1246.00 | 2432.00 | 230.00 | 71.00 | 136.00 | 155.00 | 34.00 | 119.00 | 18.00 | 6.00 | 9.00 | 4.00 | 1.00 | 2.00 | 4686.25 | nan | 519.75 | nan |
| 50% | 2020-08-29 00:00:00 | 636.00 | 146.50 | 458.00 | 50.00 | 134.00 | 32.00 | 68.00 | 142.00 | 429.00 | 13385.50 | 5334.00 | 8437.50 | 766.00 | 215.00 | 530.00 | 738.00 | 204.50 | 517.00 | 127.00 | 59.00 | 63.00 | 16.00 | 4.00 | 11.00 | 14769.50 | nan | 1882.50 | nan |
| 75% | 2021-01-14 00:00:00 | 2800.50 | 598.00 | 2266.25 | 176.00 | 761.00 | 97.00 | 473.25 | 569.00 | 2123.00 | 44738.00 | 12778.75 | 28380.25 | 2312.25 | 726.25 | 1416.00 | 3089.00 | 671.00 | 2345.25 | 1107.00 | 288.00 | 640.25 | 56.00 | 20.00 | 36.00 | 48138.00 | nan | 6164.25 | nan |
| max | 2021-05-31 00:00:00 | 19143.00 | 3124.00 | 16145.00 | 1034.00 | 5142.00 | 706.00 | 4186.00 | 3033.00 | 15829.00 | 355049.00 | 131886.00 | 223163.00 | 15075.00 | 5649.00 | 11149.00 | 27622.00 | 5691.00 | 23579.00 | 6368.00 | 3544.00 | 3134.00 | 1026.00 | 372.00 | 718.00 | 363619.00 | nan | 97893.00 | nan |
| std | nan | 2407.34 | 478.02 | 1952.45 | 151.58 | 687.88 | 92.58 | 461.43 | 455.32 | 1847.60 | 48069.00 | 22993.78 | 26827.75 | 1924.80 | 766.98 | 1238.35 | 2968.71 | 725.00 | 2279.25 | 1107.51 | 599.00 | 561.65 | 92.07 | 33.48 | 60.80 | 49543.95 | nan | 17992.41 | nan |
The time range for the dataset becomes clear now, starting from 2019-11-30, and ending on 2021-05-31.
# Define the start date and end date
start_date='2019-11-30'
end_date= '2021-05-31'
# Here is a function to check the number of rows for each day within the timeframe
def check_row_count_per_date(df, start_date, end_date):
    # Create a date range from start_date to end_date
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    inconsistent_dates = {}
    for date in date_range:
        df_filtered = df[df['date'] == date.strftime('%Y-%m-%d')]
        if len(df_filtered) != 60:
            inconsistent_dates[date.strftime('%Y-%m-%d')] = len(df_filtered)
    return inconsistent_dates
inconsistent_dates = check_row_count_per_date(df, start_date, end_date)
inconsistent_dates
{'2020-06-30': 120,
'2020-07-08': 58,
'2020-07-10': 58,
'2020-08-17': 58,
'2020-10-24': 58,
'2020-10-26': 58,
'2020-10-29': 58,
'2020-10-31': 58,
'2020-11-01': 58,
'2021-01-12': 58,
'2021-01-13': 58,
'2021-01-14': 58,
'2021-01-15': 58,
'2021-01-16': 58,
'2021-01-17': 58,
'2021-01-19': 58,
'2021-01-21': 58}
# A short description is attached below
The date 2020-06-30 has 120 rows, exactly twice the expected number, so this date very likely contains duplicated rows. Sixteen other dates have fewer than 60 rows, and each of them has exactly 58. The missing rows are not the same on every date, but on any given date the two missing rows (total and avg) belong to the same subgroup. After some careful examination, I found that the missing rows always belong to a subgroup that is ineligible for certain grouping labels.
# I checked as many dates as I could manually, but I do not display them here, to keep the final PDF file clean.
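The manual inspection described here can also be automated. Below is a minimal sketch, assuming the same date, group, and stat columns as the dataset; the helper audit_date and the toy frame are hypothetical and only illustrate the two cases (duplicated rows and a missing subgroup), not the actual contents of those dates.

```python
import pandas as pd

def audit_date(df, date, expected_groups, expected_stats=("total", "avg")):
    """Hypothetical helper: count duplicated and missing (group, stat) rows for one date."""
    day = df[df["date"] == pd.Timestamp(date)]
    # Rows whose (group, stat) pair already appeared earlier on the same date
    n_duplicates = int(day.duplicated(subset=["group", "stat"]).sum())
    present = set(zip(day["group"], day["stat"]))
    expected = {(g, s) for g in expected_groups for s in expected_stats}
    missing = sorted(expected - present)
    return {"rows": len(day), "duplicates": n_duplicates, "missing": missing}

# Toy data (made up): the first date is fully duplicated, the second lacks one group
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-30"] * 8 + ["2020-07-08"] * 2),
    "group": ["fns", "fns", "suspended", "suspended"] * 2 + ["fns", "fns"],
    "stat": ["total", "avg"] * 5,
})

groups = ["fns", "suspended"]
report_dup = audit_date(toy, "2020-06-30", groups)
report_missing = audit_date(toy, "2020-07-08", groups)
print(report_dup)
print(report_missing)
```

With the real data, expected_groups would be the 30 labels returned by df["group"].unique(), and the function could be looped over the dates flagged by check_row_count_per_date.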
# Example code
# df_filtered_1 = df[df['date'] == '2020-06-30']
# df_filtered_1
# Alternatively, we could write a function here to visualize the distribution of different groups.
def plot_group_distribution(df, group_column='group'):
    group_counts = df[group_column].value_counts()
    # Plotting
    plt.figure(figsize=(10, 6))
    group_counts.plot(kind='bar')
    plt.title(f'Distribution of {group_column}')
    plt.xlabel('Groups')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
plot_group_distribution(df,group_column='group')
The graph indicates that each group has almost the same frequency: the bars are nearly equal in height, suggesting a dataset that is balanced across groups. The number of observations is therefore roughly uniform among the groups, which is beneficial for comparative analysis. However, there is a slight difference for the A_ha, A_la, and suspended groups.
Section 2-2 Interesting Variables and More on Visualization¶
df['date'] = pd.to_datetime(df['date'])
# Write a function to visualize the number of users within a specific group
def plot_group_totals(df, group_name):
    group_totals = df[(df['group'] == group_name) & (df['stat'] == "total")].copy()
    suspension_start = pd.to_datetime('2021-01-06')
    suspension_end = pd.to_datetime('2021-01-12')
    # Remove duplicates and handle missing values in 'date' or 'nusers'
    group_totals = group_totals.drop_duplicates(subset=['date'])
    group_totals = group_totals.dropna(subset=['date', 'nusers'])
    group_totals = group_totals.sort_values(by='date')
    # Create a new figure and plot the data as a line plot
    plt.figure(figsize=(16, 2))  # Specify a wide but shallow figure size
    plt.plot(group_totals['date'], group_totals['nusers'], color="#00008B")
    plt.xlabel('Date')
    plt.ylabel('Number of Users')
    plt.axvline(suspension_start, color='r', linestyle='--', label='Suspension Starts (January 6th 2021)')
    plt.axvline(suspension_end, color='g', linestyle='--', label='Suspension Ends (January 12th 2021)')
    plt.legend()
    plt.title(f'{group_name.capitalize()} Group: Total Users Over Time')
    plt.show()
plot_group_totals(df, "suspended")
plot_group_totals(df, "qanon")
plot_group_totals(df, "fns")
plot_group_totals(df, "nfns")
The graphs display the total number of users over time for four groups: suspended, qanon, fns, and nfns, from late 2019 to mid-2021. In particular, the number of users in the suspended group declines sharply and stays at zero afterward, which reflects the mechanism of deplatforming. The qanon group shows a similar trend. For the more general groups, fns and nfns, the numbers fluctuate during the period and decline slightly afterward. This pattern suggests that the suspension affected user participation in all four groups, but to different degrees.
### Apart from the above, I am also very interested in the Anti-Vaccine(av) group, and would like to explore more of this subgroup
## subset to just the Anti-Vaccine group
av = df[(df['group'] == "av")].copy()
av_totals = av[
(av["stat"] == "total")
].copy()
# Subset to just tweets during 2021
av_totals_2021 = av_totals[av_totals.date >= "2021-01-01"]
# Make a wide figure for the timeseries
plt.figure(figsize=(16, 2))
# Plot date on the x axis, and n (total number of tweets) on the y axis
plt.bar(av_totals_2021.date, av_totals_2021.n, color='lightblue', label="All Tweets")
# Overlay a count of the misinformation tweets (the fake_merged variable) in a different color
plt.bar(
av_totals_2021.date,
av_totals_2021.fake_merged,
color='magenta',
alpha=0.75, # alpha controls the opacity (alpha = 1 is solid, alpha = 0 is completely transparent)
label="Misinformation" # whatever string you put here will go into the legend
)
plt.legend()
The bar chart shows the daily volume of tweets from January through the end of May 2021, comparing all tweets (in light blue) with tweets identified as misinformation (in magenta). Initially, there is a high volume of tweets, a noticeable proportion of which is misinformation. Over time, both total tweets and misinformation tweets decrease. The proportion of misinformation remains relatively low afterwards, indicating that the deplatforming event also had some effect on these users.
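The proportion discussed here can be made explicit as fake_merged divided by n. The sketch below is only an illustration: it uses the five fns rows printed at the top of this section as stand-in data (the av values are not printed in the text), and the helper add_misinfo_share is a hypothetical name.

```python
import pandas as pd

# First five rows shown earlier (fns group, stat == "total"), used as stand-in data
sample = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-30", "2019-12-01", "2019-12-02",
                            "2019-12-03", "2019-12-04"]),
    "fake_merged": [875.0, 3382.0, 3644.0, 4184.0, 4436.0],
    "n": [12387.0, 54897.0, 68505.0, 74502.0, 71762.0],
})

def add_misinfo_share(frame, fake_col="fake_merged", total_col="n"):
    """Hypothetical helper: return a copy with the daily share of misinformation tweets."""
    out = frame.copy()
    out["misinfo_share"] = out[fake_col] / out[total_col]
    return out

shares = add_misinfo_share(sample)
print(shares[["date", "misinfo_share"]].round(4))
```

Applied to av_totals_2021, the resulting misinfo_share column could be drawn as a line to show the proportion directly instead of reading it off two overlaid bar series.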
# I would also like to see whether the list used for classification visibly affects the visualization.
# Work on an explicit copy so that the new column is not assigned to a slice of av_totals (avoids SettingWithCopyWarning)
av_totals_2021 = av_totals_2021.copy()
av_totals_2021["fake_grinberg"] = av_totals_2021["fake_grinberg_initiation"] + av_totals_2021["fake_grinberg_rt"]
# Make a wide figure for the timeseries
plt.figure(figsize=(16, 2))
# Plot date on the x axis, and n (total number of tweets) on the y axis
plt.bar(av_totals_2021.date, av_totals_2021.n, color='lightblue', label="All Tweets")
# Overlay a count of the misinformation tweets (the fake_grinberg variable computed above) in a different color
plt.bar(
av_totals_2021.date,
av_totals_2021.fake_grinberg,
color='magenta',
alpha=0.75, # alpha controls the opacity (alpha = 1 is solid, alpha = 0 is completely transparent)
label="Misinformation" # whatever string you put here will go into the legend
)
plt.legend()
The chart shows the total number of tweets and the number of misinformation tweets (Grinberg list) from January through the end of May 2021, with a clear decline in both over time. As in the first chart, there is a sharp drop in early January, followed by a gradual decrease. The proportion of misinformation is highest in early January and then steadily decreases, mirroring the trend in total tweets. Both charts show that misinformation persists throughout the period, though at a lower level. We can conclude that the general trend does not depend much on which list is used to classify misinformation.
Section 3 Replication of Main DiD Results¶
In this section, you will perform at least one Difference in Differences analysis with the goal of conceptually replicating the key DiD analysis that McCabe et al performed to support their primary conclusion.
Section 3-1 Main DiD Results: Impacts on Deplatformed-User Followers and Non-followers¶
Let's first analyze the effect of deplatforming on the sharing behavior of two groups: the followers of deplatformed users (group B) and those not following them (group F). The outcome variable chosen for comparison is fake_merged_rt, which captures one important aspect of spreading misinformation: retweeting it.
# Import the dataset again, paying special attention to the stat type; here we choose followers and non-followers, and total stat values
import statsmodels.formula.api as smf  # provides smf.ols used below
df = mccabe_data.copy()
df['date'] = pd.to_datetime(df['date'])
df = df[df["stat"] == "total"]
# Create a new dataframe containing only the followers (B) and non-followers (F)
df_mdid = df[(df['group'] == 'B') | (df['group'] == 'F')].copy()
# The date on which the deplatforming event began
suspension_start = pd.to_datetime('2021-01-06')
# To perform the DiD analysis, set up the post-treatment and treatment-group indicators
df_mdid['post_treatment'] = (df_mdid['date'] > suspension_start).astype(int)
df_mdid['treatment_group'] = (df_mdid['group'] == 'B').astype(int)
# The standard DiD formula, with fake_merged_rt as the outcome variable
formula = 'fake_merged_rt ~ post_treatment + treatment_group + post_treatment*treatment_group'
model = smf.ols(formula, data=df_mdid)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: fake_merged_rt R-squared: 0.802
Model: OLS Adj. R-squared: 0.801
Method: Least Squares F-statistic: 1476.
Date: Sat, 19 Oct 2024 Prob (F-statistic): 0.00
Time: 21:24:23 Log-Likelihood: -9067.7
No. Observations: 1100 AIC: 1.814e+04
Df Residuals: 1096 BIC: 1.816e+04
Df Model: 3
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept 489.9012 45.800 10.696 0.000 400.035 579.768
post_treatment -207.8599 89.200 -2.330 0.020 -382.882 -32.837
treatment_group 3958.7333 64.771 61.118 0.000 3831.643 4085.823
post_treatment:treatment_group -2049.0644 126.148 -16.243 0.000 -2296.584 -1801.545
==============================================================================
Omnibus: 418.613 Durbin-Watson: 0.309
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3516.415
Skew: 1.519 Prob(JB): 0.00
Kurtosis: 11.215 Cond. No. 6.44
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Interpretation of the above summary table has been attached (see below)
The model explains a large share of the variance (R-squared = 0.802, i.e., 80.2%). All predictors, including the interaction term, are statistically significant (p < 0.05). The intercept is the baseline: before the suspension, the non-followers (control group) retweeted about 490 misinformation tweets per day on average. The post_treatment coefficient shows that, after the suspension, this daily average decreases by about 208 for the control group. The treatment_group coefficient shows that, in the pre-treatment period, the followers of deplatformed users retweeted far more misinformation than the non-followers (about 3959 more per day). Lastly, the interaction term shows that, in the post-treatment period, the daily average for the treatment group decreases by about a further 2049, beyond what is captured by the main effects of post-treatment and treatment group alone. This implies that deplatforming leads to a significant reduction in misinformation retweets for followers compared to non-followers.
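To make the reading of the four coefficients concrete, the sketch below rebuilds the four group-by-period means from the coefficients in the summary table and confirms that the difference in differences equals the interaction term. The function name cell_means is hypothetical; the numbers are copied from the table above.

```python
# Coefficients copied from the OLS summary above (outcome: fake_merged_rt, stat == "total")
coef = {
    "Intercept": 489.9012,
    "post_treatment": -207.8599,
    "treatment_group": 3958.7333,
    "post_treatment:treatment_group": -2049.0644,
}

def cell_means(c):
    """Hypothetical helper: predicted mean outcome for each cell of the 2x2 DiD design."""
    pre_control = c["Intercept"]
    post_control = pre_control + c["post_treatment"]
    pre_treated = pre_control + c["treatment_group"]
    post_treated = pre_treated + c["post_treatment"] + c["post_treatment:treatment_group"]
    return {
        "pre_control": pre_control,
        "post_control": post_control,
        "pre_treated": pre_treated,
        "post_treated": post_treated,
    }

means = cell_means(coef)
# Change in the treated group minus change in the control group
did = (means["post_treated"] - means["pre_treated"]) - (means["post_control"] - means["pre_control"])

for name, value in means.items():
    print(f"{name:>13}: {value:9.1f}")
print(f"{'DiD':>13}: {did:9.1f}")
```

In these terms, the followers' daily average falls from about 4449 to about 2192, while the non-followers' falls from about 490 to about 282; the gap between the two changes is the interaction coefficient.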
Section 3-2 Given Time Window and Exact Duplication for Visualization¶
In this subsection, I replicate Fig. 4 of the paper: "Time series of misinformation retweeting for followers and not-followers".
# Here I have attached a screenshot of the DID performed by the authors
# Before the exact visualization, I align the time window and the statistical preprocessing (standardization)
# Copy the dataset and filter for 'total' stat
df = mccabe_data.copy()
df = df[df["stat"] == "total"]
df['date'] = pd.to_datetime(df['date'])
# Filter data for groups 'B_ha' and 'F_ha' within the date range
df = df[(df['group'].isin(['B_ha', 'F_ha'])) &
(df['date'] >= '2020-12-01') &
(df['date'] <= '2021-01-20')]
# Standardize 'fake_merged_rt' column separately for each group
df['fake_merged_rt_std'] = df.groupby('group')['fake_merged_rt'].transform(
lambda x: (x - x.mean()) / x.std()
)
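# Create pre- and post-treatment indicators
df['pre_treatment'] = df['date'] <= '2021-01-06'
df['post_treatment'] = df['date'] > '2021-01-12'
# Numeric time index (days since the first date in the window), so that each trend is fitted as a straight line;
# a raw datetime column may not be treated as a continuous regressor by the formula interface
df['day'] = (df['date'] - df['date'].min()).dt.days
from statsmodels.formula.api import ols  # formula interface used for the trend fits
# Fit OLS regressions for group B_ha
ols_before_B_ha = ols('fake_merged_rt_std ~ day', data=df[(df['group'] == 'B_ha') & (df['pre_treatment'])]).fit()
ols_after_B_ha = ols('fake_merged_rt_std ~ day', data=df[(df['group'] == 'B_ha') & (df['post_treatment'])]).fit()
# Fit OLS regressions for group F_ha
ols_before_F_ha = ols('fake_merged_rt_std ~ day', data=df[(df['group'] == 'F_ha') & (df['pre_treatment'])]).fit()
ols_after_F_ha = ols('fake_merged_rt_std ~ day', data=df[(df['group'] == 'F_ha') & (df['post_treatment'])]).fit()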
# Prepare pre-treatment and post-treatment dates for B_ha and F_ha
dates_pre_B_ha = df[(df['group'] == 'B_ha') & df['pre_treatment']]['date']
dates_post_B_ha = df[(df['group'] == 'B_ha') & df['post_treatment']]['date']
dates_pre_F_ha = df[(df['group'] == 'F_ha') & df['pre_treatment']]['date']
dates_post_F_ha = df[(df['group'] == 'F_ha') & df['post_treatment']]['date']
# Align fitted values with corresponding dates for B_ha and F_ha
fitted_pre_B_ha = pd.Series(ols_before_B_ha.fittedvalues, index=dates_pre_B_ha.index)
fitted_post_B_ha = pd.Series(ols_after_B_ha.fittedvalues, index=dates_post_B_ha.index)
fitted_pre_F_ha = pd.Series(ols_before_F_ha.fittedvalues, index=dates_pre_F_ha.index)
fitted_post_F_ha = pd.Series(ols_after_F_ha.fittedvalues, index=dates_post_F_ha.index)
# Plot setup
plt.figure(figsize=(12, 7))
# Plot actual data points for groups B_ha and F_ha
sns.lineplot(data=df[df['group'] == 'B_ha'], x='date', y='fake_merged_rt_std',
color='black', label='Group B_ha', marker='o', linestyle='-')
sns.lineplot(data=df[df['group'] == 'F_ha'], x='date', y='fake_merged_rt_std',
color='gray', label='Group F_ha', marker='o', linestyle='-')
# Plot fitted lines for groups B_ha and F_ha
plt.plot(dates_pre_B_ha, fitted_pre_B_ha, color='black', linestyle='-')
plt.plot(dates_post_B_ha, fitted_post_B_ha, color='black', linestyle='-')
plt.plot(dates_pre_F_ha, fitted_pre_F_ha, color='gray', linestyle='-')
plt.plot(dates_post_F_ha, fitted_post_F_ha, color='gray', linestyle='-')
# Mark the suspension window (6 to 12 January 2021) with dashed vertical lines
plt.axvline(x=pd.to_datetime('2021-01-06'), color='black', linestyle='--', label='Suspension window')
plt.axvline(x=pd.to_datetime('2021-01-12'), color='black', linestyle='--')
# Add labels, title, and legend
plt.title('Time series of misinformation retweeting for followers and not-followers')
plt.xlabel('Date')
plt.ylabel('Total misinformation retweeted (std)')
plt.legend(title='Group', loc='upper right')
plt.grid(True)
# Display the plot
plt.show()
Here I followed the figure notes in the paper: as indicated, the sample includes 51 observations (days) from 1 December 2020 to 20 January 2021. In the paper, the counterfactual identified under the parallel-path assumption is shown as a dashed line after 12 January 2021, and the fitted straight lines are ordinary least squares regressions of standardized daily total retweeted misinformation, fitted separately before 6 January 2021 and after 12 January 2021 and by group. My visualization looks the same as the one in the paper, which implies that I have successfully replicated the result after the more involved standardization. The standardized series also show roughly parallel trends before the suspension, which is an essential assumption for DiD analysis. During the suspension period, there is a clear divergence.
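The two preprocessing steps described here, standardizing within each group and fitting a straight line over time, can be sketched on made-up data. Everything below is an assumption for illustration: the toy values are random, and np.polyfit on a numeric day index stands in for the OLS fits used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two groups observed on the same 51 days as the paper's window
dates = pd.date_range("2020-12-01", "2021-01-20", freq="D")
rng = np.random.default_rng(42)
toy = pd.concat(
    [
        pd.DataFrame({"date": dates, "group": g,
                      "y": level + rng.normal(0, level * 0.05, len(dates))})
        for g, level in [("B_ha", 4000.0), ("F_ha", 400.0)]
    ],
    ignore_index=True,
)

# Standardize within each group (z-score), so both series share one scale
toy["y_std"] = toy.groupby("group")["y"].transform(lambda x: (x - x.mean()) / x.std())

# Numeric time index: days since the first date in the window
toy["day"] = (toy["date"] - toy["date"].min()).dt.days

def fit_line(frame):
    """Hypothetical helper: least-squares line y_std = intercept + slope * day."""
    slope, intercept = np.polyfit(frame["day"], frame["y_std"], deg=1)
    return slope, intercept

# Pre-period slopes by group; similar slopes are what "parallel trends" refers to
pre = toy[toy["date"] <= "2021-01-06"]
pre_slopes = {g: fit_line(sub)[0] for g, sub in pre.groupby("group")}
print(pre_slopes)
```

Comparing the pre-period slopes of the two groups gives a quick numerical companion to the visual check of parallel trends.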
Section 4 Extensions and follow up analyses¶
In this section, you will perform follow-up analyses, summaries, or visualizations that you feel help shed light on the robustness of the conclusion reached by McCabe et al. You are welcome to draw on insights you gained through data simulation, and to draw on the questions we discussed in class surrounding the key assumptions and study decisions in Notebook 1: Data Acquisition.
Section 4-1 Impacts of Changes of Key Variables on Results¶
# Since DiD is applied repeatedly in the following analysis, I wrap it in a function
def run_did_analysis(df, stat, treatment_group, control_group, outcome_var):
    # Start over from the full dataset. Note: the df argument is overwritten here,
    # so every call uses mccabe_data regardless of what is passed in
    df = mccabe_data.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df[df['stat'] == stat]
    # Create a new dataframe for the treatment and control group
    df_mdid = df[(df['group'] == treatment_group) | (df['group'] == control_group)].copy()
    # Define the date of the deplatforming event
    suspension_start = pd.to_datetime('2021-01-06')
    # Set up post-treatment and treatment group indicators
    df_mdid['post_treatment'] = (df_mdid['date'] > suspension_start).astype(int)
    df_mdid['treatment_group'] = (df_mdid['group'] == treatment_group).astype(int)
    # Define the DID formula
    formula = f'{outcome_var} ~ post_treatment + treatment_group + post_treatment*treatment_group'
    # Run the DID model
    model = smf.ols(formula, data=df_mdid)
    results = model.fit()
    return results.summary()
# I wondered whether the choice of outcome variable and stat would affect the general results; let's try them below
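For comparing many specifications, it can help to extract only the interaction coefficient. The sketch below is a simplified stand-in, not the notebook's function: it estimates the same two-by-two DiD regression with NumPy least squares on made-up data (true effect set to -3) and checks that the coefficient equals the difference in differences of the four cell means. The names did_estimate and toy are hypothetical.

```python
import numpy as np
import pandas as pd

def did_estimate(frame, outcome):
    """Hypothetical helper: interaction coefficient of the 2x2 DiD regression via least squares."""
    post = frame["post_treatment"].to_numpy(dtype=float)
    treated = frame["treatment_group"].to_numpy(dtype=float)
    # Design matrix: intercept, post, treated, post x treated
    X = np.column_stack([np.ones(len(frame)), post, treated, post * treated])
    y = frame[outcome].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[3])

# Toy data (made up): 50 observations in each of the four cells
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "post_treatment": np.tile([0, 0, 1, 1], 50),
    "treatment_group": np.tile([0, 1, 0, 1], 50),
})
toy["outcome"] = (
    10
    + 2 * toy["post_treatment"]
    + 5 * toy["treatment_group"]
    - 3 * toy["post_treatment"] * toy["treatment_group"]
    + rng.normal(0, 0.1, len(toy))
)

estimate = did_estimate(toy, "outcome")

# The same quantity from the four cell means
means = toy.groupby(["treatment_group", "post_treatment"])["outcome"].mean()
did_from_means = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(round(estimate, 3), round(float(did_from_means), 3))
```

Because the regression contains only the two indicators and their product, the interaction coefficient is exactly the difference in differences of the group means, which is why the summaries below can be compared through that single row.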
Another DiD¶
In the previous section, we examined the main DiD in the paper to see the "indirect" effects of deplatforming on the followers of deplatformed users. Here we look at the effect from another angle: how deplatforming affects misinformation sharers (fns) compared with non-misinformation sharers (nfns).
run_did_analysis(df=df, stat='total',treatment_group='fns', control_group='nfns',outcome_var='fake_merged_initiation')
| Dep. Variable: | fake_merged_initiation | R-squared: | 0.865 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.864 |
| Method: | Least Squares | F-statistic: | 2333. |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:26 | Log-Likelihood: | -7736.7 |
| No. Observations: | 1100 | AIC: | 1.548e+04 |
| Df Residuals: | 1096 | BIC: | 1.550e+04 |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 115.2568 | 13.658 | 8.439 | 0.000 | 88.459 | 142.055 |
| post_treatment | -48.3258 | 26.599 | -1.817 | 0.070 | -100.517 | 3.866 |
| treatment_group | 1480.1975 | 19.315 | 76.636 | 0.000 | 1442.299 | 1518.096 |
| post_treatment:treatment_group | -548.3699 | 37.617 | -14.578 | 0.000 | -622.180 | -474.560 |
| Omnibus: | 158.913 | Durbin-Watson: | 0.501 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1099.916 |
| Skew: | 0.449 | Prob(JB): | 1.43e-239 |
| Kurtosis: | 7.816 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Conclusion attached below
Again, the model explains a large share of the variance (R-squared = 0.865, i.e., 86.5%). The treatment_group and interaction terms are statistically significant (p < 0.05), while the post_treatment main effect is not (p = 0.070). The intercept is the baseline: before the suspension, the non-misinformation sharers (control group) initiated about 115 misinformation tweets per day on average. In the pre-treatment period, the misinformation sharers initiated far more misinformation than the non-misinformation sharers (about 1480 more per day). Lastly, in the post-treatment period, the daily average for the treatment group decreases by about 548, beyond what is captured by the main effects of post-treatment and treatment group alone. This implies that deplatforming also leads to a significant reduction in misinformation sharing.
Changing Outcome Variables and/or Stat¶
For ease of comparison, in what follows, unless specified otherwise, I use the model in Section 3-1 as the benchmark. In terms of the newly defined function, its parameters are: df=df, stat='total', treatment_group='B', control_group='F', outcome_var='fake_merged_rt'.
# Here I changed the stat to avg, taking into account the user size of different groups after deplatforming
run_did_analysis(df=df,stat='avg',treatment_group='B',control_group='F',outcome_var='fake_merged_rt')
| Dep. Variable: | fake_merged_rt | R-squared: | 0.824 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.824 |
| Method: | Least Squares | F-statistic: | 1713. |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:27 | Log-Likelihood: | 1190.3 |
| No. Observations: | 1100 | AIC: | -2373. |
| Df Residuals: | 1096 | BIC: | -2353. |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 0.0833 | 0.004 | 20.411 | 0.000 | 0.075 | 0.091 |
| post_treatment | -0.0287 | 0.008 | -3.609 | 0.000 | -0.044 | -0.013 |
| treatment_group | 0.3763 | 0.006 | 65.177 | 0.000 | 0.365 | 0.388 |
| post_treatment:treatment_group | -0.1504 | 0.011 | -13.377 | 0.000 | -0.172 | -0.128 |
| Omnibus: | 589.761 | Durbin-Watson: | 0.305 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 7582.192 |
| Skew: | 2.175 | Prob(JB): | 0.00 |
| Kurtosis: | 15.104 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Conclusion attached below
Interesting. The general findings still hold after changing the stat from "total" to "avg". The model explains a large share of the variance (R-squared = 0.824, i.e., 82.4%). All predictors, including the interaction term, are statistically significant (p < 0.05). The intercept is the baseline: before the suspension, the non-followers (control group) retweeted about 0.0833 misinformation tweets per user per day on average. In the post-treatment period (after the suspension), this per-user average decreases by about 0.0287 for the control group. In the pre-treatment period, the followers of deplatformed users retweeted considerably more misinformation per user than the non-followers (about 0.3763 more). Lastly, in the post-treatment period, the per-user average for the treatment group decreases by about a further 0.1504, beyond what is captured by the main effects of post-treatment and treatment group alone. This implies that deplatforming leads to a significant reduction in misinformation retweets for followers compared to non-followers, even when we take the fluctuating group sizes into account.
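Because the "total" and "avg" models are on different scales, their interaction coefficients cannot be compared directly. One rough way to put them side by side, offered here only as a sketch, is to express each interaction coefficient as a fraction of the treated group's pre-period mean (intercept plus treatment_group). The function relative_did is a hypothetical name; the coefficients are copied from the two summary tables.

```python
def relative_did(intercept, treatment_group, interaction):
    """Hypothetical helper: interaction coefficient relative to the treated group's pre-period mean."""
    pre_treated_mean = intercept + treatment_group
    return interaction / pre_treated_mean

# stat == "total", outcome fake_merged_rt (Section 3-1)
rel_total = relative_did(489.9012, 3958.7333, -2049.0644)
# stat == "avg", outcome fake_merged_rt (table above)
rel_avg = relative_did(0.0833, 0.3763, -0.1504)

print(f"total: {rel_total:.1%}, avg: {rel_avg:.1%}")
```

On this scale the reduction is roughly 46% of the followers' pre-period level for totals and roughly 33% per user, which is consistent with part of the drop in totals coming from a change in the number of active users.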
# How about the impact on the initiation of misinformation tweets?
run_did_analysis(df=df,stat='total',treatment_group='B',control_group='F',outcome_var='fake_merged_initiation')
| Dep. Variable: | fake_merged_initiation | R-squared: | 0.814 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.814 |
| Method: | Least Squares | F-statistic: | 1602. |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:27 | Log-Likelihood: | -7450.4 |
| No. Observations: | 1100 | AIC: | 1.491e+04 |
| Df Residuals: | 1096 | BIC: | 1.493e+04 |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 255.2840 | 10.528 | 24.249 | 0.000 | 234.627 | 275.941 |
| post_treatment | -104.6702 | 20.504 | -5.105 | 0.000 | -144.901 | -64.440 |
| treatment_group | 918.0346 | 14.888 | 61.661 | 0.000 | 888.822 | 947.247 |
| post_treatment:treatment_group | -221.5725 | 28.996 | -7.641 | 0.000 | -278.467 | -164.678 |
| Omnibus: | 147.558 | Durbin-Watson: | 0.560 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 849.482 |
| Skew: | 0.460 | Prob(JB): | 3.45e-185 |
| Kurtosis: | 7.206 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Conclusion attached below
Once again, the general findings still hold when the outcome variable is changed from retweets to initiated tweets. The model explains a large share of the variance (R-squared = 0.814, i.e., 81.4%). All predictors, including the interaction term, are statistically significant (p < 0.05). The intercept is the baseline: before the suspension, the non-followers (control group) initiated about 255.3 misinformation tweets per day on average. In the post-treatment period (after the suspension), this average decreases by about 104.7 for the control group. Lastly, in the post-treatment period, the average number of misinformation tweets initiated by the treatment group decreases by about a further 221.6. This implies that deplatforming leads to a significant reduction in misinformation tweets for followers compared to non-followers.
# How about changing the stat from the above modified version?
run_did_analysis(df=df,stat='avg',treatment_group='B',control_group='F',outcome_var='fake_merged_initiation')
| Dep. Variable: | fake_merged_initiation | R-squared: | 0.810 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.810 |
| Method: | Least Squares | F-statistic: | 1558. |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:28 | Log-Likelihood: | 2788.1 |
| No. Observations: | 1100 | AIC: | -5568. |
| Df Residuals: | 1096 | BIC: | -5548. |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0438 | 0.001 | 45.839 | 0.000 | 0.042 | 0.046 |
| post_treatment | -0.0147 | 0.002 | -7.886 | 0.000 | -0.018 | -0.011 |
| treatment_group | 0.0774 | 0.001 | 57.325 | 0.000 | 0.075 | 0.080 |
| post_treatment:treatment_group | 0.0038 | 0.003 | 1.449 | 0.148 | -0.001 | 0.009 |
| Omnibus: | 556.340 | Durbin-Watson: | 0.625 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 8434.906 |
| Skew: | 1.952 | Prob(JB): | 0.00 |
| Kurtosis: | 15.992 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Conclusion attached below
Although the model still has a relatively high R-squared (0.810), the coefficient of greatest interest, the interaction term, is no longer statistically significant (0.0038, p = 0.148). While the benchmark model did show the relationship, the impact of deplatforming on other users (regardless of their relationship with the deplatformed users) appears weaker under this specification, so we do not get the anticipated result here. This suggests that the model's findings can be sensitive to the choice of variables, although the other tests so far still show robust results.
# Changing the definition of misinformation by using different lists (summary in the end)
# How about changing the list for misinformation classification-Grinberg et al. (2019) list
run_did_analysis(df=df, stat='total',treatment_group='B',control_group='F',outcome_var='fake_grinberg_rt')
| Dep. Variable: | fake_grinberg_rt | R-squared: | 0.795 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.794 |
| Method: | Least Squares | F-statistic: | 1416. |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:29 | Log-Likelihood: | -7910.8 |
| No. Observations: | 1100 | AIC: | 1.583e+04 |
| Df Residuals: | 1096 | BIC: | 1.585e+04 |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 146.5457 | 16.000 | 9.159 | 0.000 | 115.152 | 177.939 |
| post_treatment | -65.9043 | 31.161 | -2.115 | 0.035 | -127.046 | -4.762 |
| treatment_group | 1351.9654 | 22.627 | 59.750 | 0.000 | 1307.568 | 1396.363 |
| post_treatment:treatment_group | -867.4689 | 44.068 | -19.685 | 0.000 | -953.936 | -781.001 |
| Omnibus: | 335.154 | Durbin-Watson: | 0.455 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 2177.721 |
| Skew: | 1.239 | Prob(JB): | 0.00 |
| Kurtosis: | 9.432 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# How about changing the list for misinformation classification-Grinberg et al. (2019) list, 'rb' variant (fake_grinberg_rb_rt)
run_did_analysis(df=df,stat='total',treatment_group='B',control_group='F',outcome_var='fake_grinberg_rb_rt')
| Dep. Variable: | fake_grinberg_rb_rt | R-squared: | 0.716 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.715 |
| Method: | Least Squares | F-statistic: | 919.0 |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 1.42e-298 |
| Time: | 21:24:29 | Log-Likelihood: | -7636.1 |
| No. Observations: | 1100 | AIC: | 1.528e+04 |
| Df Residuals: | 1096 | BIC: | 1.530e+04 |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 70.6099 | 12.464 | 5.665 | 0.000 | 46.153 | 95.067 |
| post_treatment | -44.3478 | 24.276 | -1.827 | 0.068 | -91.980 | 3.284 |
| treatment_group | 840.7556 | 17.627 | 47.696 | 0.000 | 806.168 | 875.343 |
| post_treatment:treatment_group | -619.5556 | 34.331 | -18.047 | 0.000 | -686.917 | -552.194 |
| Omnibus: | 556.026 | Durbin-Watson: | 0.375 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 5543.562 |
| Skew: | 2.102 | Prob(JB): | 0.00 |
| Kurtosis: | 13.162 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# How about changing the list for misinformation classification-Newsguard list
run_did_analysis(df=df,stat='total',treatment_group='B',control_group='F',outcome_var='fake_newsguard_rt')
| Dep. Variable: | fake_newsguard_rt | R-squared: | 0.789 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.788 |
| Method: | Least Squares | F-statistic: | 1365. |
| Date: | Sat, 19 Oct 2024 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:29 | Log-Likelihood: | -9043.7 |
| No. Observations: | 1100 | AIC: | 1.810e+04 |
| Df Residuals: | 1096 | BIC: | 1.812e+04 |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 455.5333 | 44.814 | 10.165 | 0.000 | 367.603 | 543.464 |
| post_treatment | -202.7471 | 87.279 | -2.323 | 0.020 | -373.999 | -31.495 |
| treatment_group | 3722.5951 | 63.376 | 58.738 | 0.000 | 3598.243 | 3846.947 |
| post_treatment:treatment_group | -1938.4640 | 123.430 | -15.705 | 0.000 | -2180.651 | -1696.277 |
| Omnibus: | 474.398 | Durbin-Watson: | 0.286 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 4450.884 |
| Skew: | 1.734 | Prob(JB): | 0.00 |
| Kurtosis: | 12.224 | Cond. No. | 6.44 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Conclusion goes below
Despite changes in how misinformation is identified, the benchmark model still produces similar results: a high R-squared (between 0.716 and 0.795) and a statistically significant interaction term in every case. The only exception among the coefficients is post_treatment under fake_grinberg_rb_rt, which is marginally non-significant (p = 0.068). The signs of the coefficients also remain the same throughout (a positive intercept, negative post_treatment, positive treatment_group, and a negative interaction term), which allows a consistent interpretation. However, the magnitude of the estimated impact differs depending on which list is used to classify misinformation.
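Because the three lists produce very different raw counts, one way to compare magnitudes is to express each interaction coefficient as a share of the treatment group's pre-treatment level (Intercept + treatment_group). The sketch below does this with the coefficients copied from the three tables above; the normalization is my own choice for illustration, not something taken from McCabe et al.

```python
# (Intercept, treatment_group, interaction) copied from the three tables above.
results = {
    "fake_grinberg_rt": (146.5457, 1351.9654, -867.4689),
    "fake_grinberg_rb_rt": (70.6099, 840.7556, -619.5556),
    "fake_newsguard_rt": (455.5333, 3722.5951, -1938.4640),
}


def relative_effect(intercept, treatment_group, interaction):
    """Interaction coefficient as a share of the treated pre-treatment mean."""
    treated_pre_mean = intercept + treatment_group
    return interaction / treated_pre_mean


relative = {name: relative_effect(*vals) for name, vals in results.items()}

for name, share in relative.items():
    print(f"{name}: {share:.1%}")
```

On this scale the three estimates are much closer to one another than the raw coefficients suggest, roughly between a 46% and a 68% reduction relative to the treated group's baseline.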
Section 4-2 Impacts of Changes of Time Window¶
# Let me re-define a function to incorporate the effects of time; this is simply a modification of the previous function
def run_did_analysis_with_time_windows(df, stat, treatment_group, control_group, outcome_var,
                                       pre_treatment_window=None, post_treatment_window=None):
    """
    Run the DiD regression on an optional time window around the deplatforming date.

    - pre_treatment_window: Number of days before the intervention (optional)
    - post_treatment_window: Number of days after the intervention (optional)
    """
    # Start over and import the dataset
    # (note: this restarts from the global mccabe_data, so the df argument is not used)
    df = mccabe_data.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df[df['stat'] == stat]
    # Create a new dataframe for the treatment and control groups
    df_mdid = df[(df['group'] == treatment_group) | (df['group'] == control_group)].copy()
    # Define the date of the deplatforming event
    suspension_start = pd.to_datetime('2021-01-06')
    # Apply time window filtering if specified
    if pre_treatment_window:
        pre_treatment_start = suspension_start - pd.Timedelta(days=pre_treatment_window)
        df_mdid = df_mdid[df_mdid['date'] >= pre_treatment_start]
    if post_treatment_window:
        post_treatment_end = suspension_start + pd.Timedelta(days=post_treatment_window)
        df_mdid = df_mdid[df_mdid['date'] <= post_treatment_end]
    # Set up post-treatment and treatment group indicators
    df_mdid['post_treatment'] = (df_mdid['date'] > suspension_start).astype(int)
    df_mdid['treatment_group'] = (df_mdid['group'] == treatment_group).astype(int)
    # Define the DID formula
    formula = f'{outcome_var} ~ post_treatment + treatment_group + post_treatment*treatment_group'
    # Run the DID model
    model = smf.ols(formula, data=df_mdid)
    results = model.fit()
    return results
# I'm also very interested in the impact of time windows on the ultimate outcomes
coefficients = []
# Loop through pre_treatment_window values from 30 to 364 (post_treatment_window is fixed at 30 days)
for window in range(30, 365):
    # Run the DID analysis for each window value
    model = run_did_analysis_with_time_windows(
        df=df,
        stat='total',
        treatment_group='B',
        control_group='F',
        outcome_var='fake_merged_rt',
        pre_treatment_window=window,
        post_treatment_window=30,
    )
    # Extract the coefficient for 'post_treatment:treatment_group'
    coeff = model.params['post_treatment:treatment_group']
    # Store the coefficient in the list
    coefficients.append(coeff)
# Visualize the coefficients
plt.figure(figsize=(10, 6))
plt.plot(range(30, 365), coefficients, label='DiD Coefficients')
plt.xlabel('Pre-Treatment Window')
plt.ylabel('Coefficient Value')
plt.title('DiD Coefficient over Different Pre-Treatment Windows')
plt.grid(True)
plt.legend()
plt.show()
# Conclusion goes below
In the previous cells, I varied the pre-treatment window from 30 to 364 days while holding the post-treatment window at 30 days. The visualization demonstrates that the length of the pre-treatment window matters for the magnitude of the estimated treatment effect, i.e., the effect of the deplatforming event. The plot starts with a coefficient value of around -1400 and initially drops sharply, reaching a minimum of around -2200. After this sharp decline, the coefficients increase steadily, eventually reaching around -1000 at the longest windows. The line fluctuates, particularly at the shorter windows, before settling into the upward trend. This reflects how the estimated treatment effect changes as the pre-treatment window extends.
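One possible reason an estimate can drift with the pre-treatment window is a pre-trend that differs between the two groups. The sketch below illustrates this on a hypothetical, noise-free daily panel (not the McCabe et al. data): the treated group rises by 0.5 per day before the event and then drops by exactly 200, while the control group stays flat. All names (`make_panel`, `did_from_means`, `did_for_pre_window`) and numbers are assumptions for illustration. The DiD is computed from the four cell means, which for this saturated 2x2 specification matches the OLS interaction coefficient.

```python
import numpy as np
import pandas as pd

EVENT = pd.Timestamp("2021-01-06")  # deplatforming date used in the notebook


def make_panel(pre_days=365, post_days=30):
    """Hypothetical noise-free daily panel (NOT the McCabe et al. data)."""
    days = np.arange(-pre_days, post_days + 1)
    post = (days > 0).astype(int)
    frames = []
    for treated in (0, 1):
        # Treated group: higher level, a rising pre-trend of 0.5 per day,
        # and a true post-event drop of 200. Control group: flat at 100.
        y = 100 + treated * (900 + 0.5 * np.minimum(days, 0)) - 200 * treated * post
        frames.append(pd.DataFrame({
            "date": EVENT + pd.to_timedelta(days, unit="D"),
            "treatment_group": treated,
            "post_treatment": post,
            "y": y,
        }))
    return pd.concat(frames, ignore_index=True)


def did_from_means(frame, outcome="y"):
    """2x2 DiD: (treated post - treated pre) - (control post - control pre)."""
    m = frame.groupby(["treatment_group", "post_treatment"])[outcome].mean()
    return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])


def did_for_pre_window(panel, window):
    """Keep only the last `window` pre-event days, then compute the DiD."""
    start = EVENT - pd.Timedelta(days=window)
    return did_from_means(panel[panel["date"] >= start])


panel = make_panel()
for window in (30, 90, 180, 360):
    print(window, did_for_pre_window(panel, window))
```

In this toy setup the true effect is -200, yet longer pre-treatment windows pull the estimate toward zero because they average in the treated group's lower earlier values. Whether something similar drives the pattern in the real data would need to be checked directly; this is only an illustration of the mechanism.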
Section 5 Conclusions and Reflections¶
Here is where you draw together insights you have gained by analyzing this dataset and reflections on the methods we have applied. You should provide a clear answer to the question:
What are your conclusions about the question posed in this assignment: Did deplatforming reduce misinformation on Twitter?
You are welcome to use the bullet points below to guide your reflections if they are helpful, and also to include any additional insights.
- Is the current dataset sufficient to offer insight into this question? What are some key limitations of the dataset, and key merits?
- Is the DiD method sufficient to support strong conclusions related to this question?
- Overall, do you think the conclusions of McCabe et al. (2024) are justified?
- More generally, do you feel that misinformation on social media is a substantial threat to discourse and society that data science can address, and how has this project influenced your view?
Based on previous analyses, I think deplatforming did reduce misinformation on Twitter.
1. I think the current dataset is sufficient to offer insight into the question posed, but there are still areas for improvement.
The current dataset contains detailed information on misinformation sharing, retweets, and user categorization. The dataset includes a lot of variations that we could explore, including classification using different lists, different activity levels across subgroups (i.e., high-, medium-, and low-active), and the number of deplatformed users followed, among others. This offers a rich dataset to investigate misinformation spread on Twitter using different variations. Additionally, the dataset covers periods before and after key political events, like the January 6th insurrection, allowing for comparative analysis using quasi-experimental designs such as DiD.
However, there are some limitations. The authors were not very careful in preprocessing the dataset, as one date, 2020-06-30, appears twice. Strangely, the values for the subgroups differ between the two copies, making it difficult to know which one is correct. This might introduce noise into an otherwise high-quality dataset. Moreover, there are some inconsistencies in the data composition. There are supposed to be 60 values for each day; nevertheless, several dates have only 58 values. On closer inspection, the missing values always appear to correspond to cases that are not eligible to be assigned to a certain group. The dataset creators could flag this issue for readers and potential users beforehand.
Besides, the dataset might suffer from confounding due to simultaneous events (like the insurrection) that can affect behavior, making it difficult to isolate the effects of deplatforming alone (McCabe et al., 2024). And while the panel is filtered using voter data, there is potential bias due to the exclusion of non-human actors (bots), which could play a significant role in misinformation dissemination. Moreover, the dataset primarily focuses on retweet and tweet counts without fully capturing other forms of engagement (e.g., likes, replies) that could be influential. Last but not least, while deplatformed users have been removed, there is still a chance that they could create new accounts or shift to other social media platforms, which could affect the overall spread of misinformation, though such a pattern is difficult to capture.
2. The DiD method is sufficient to support strong conclusions related to this question, provided that its strong assumptions hold.
Previous analyses indicate that Twitter's deplatforming effectively reduces misinformation, particularly among followers of deplatformed users. After several attempts to check the robustness of the results, the model works effectively most of the time, adding confidence to the conclusions. The DiD approach accounts for baseline differences between the control and treatment groups; for instance, followers of deplatformed users share more misinformation to begin with. It is also interesting to observe from the results that deplatforming plays a role, as the treatment group shows a statistically significant decrease from the pre-treatment to the post-treatment period relative to the control group. The deplatforming action regulates both the sources of and the exposure to misinformation. Deplatforming immediately reduces the ability of deplatformed users to share content. With less misinformation available, retweets of those posts also decrease. Additionally, users may become more cautious about sharing misinformation for fear of facing similar suspensions.
However, there are some limitations to the DiD approach. DiD methods assume that, in the absence of treatment, the average outcomes for the treated and control groups would have followed parallel trends over time. This assumption is often difficult to verify, and when it is violated, the causal estimates from DiD may be biased. In my attempt to replicate the visualization from the authors’ paper, the standardization makes sense as a way to show similar trends compared to the unadjusted values. However, the visualization alone does not provide a rigorous test of the parallel trends assumption. Moreover, when treatment effects are heterogeneous, the DiD approach typically estimates an average treatment effect, potentially hiding crucial subgroup variations.
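One common way to probe the parallel trends assumption is a placebo test: use only pre-treatment data, pretend the event happened at an earlier date, and check that the resulting DiD is close to zero. The sketch below shows the idea on two hypothetical noise-free series; the function names, the fake event day, and the data are all assumptions for illustration and do not come from the notebook or from McCabe et al.

```python
from statistics import mean


def did(rows):
    """rows: list of (treated, post, y). Returns the 2x2 difference-in-differences."""
    def cell(t, p):
        return mean(y for treated, post, y in rows if treated == t and post == p)
    return (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))


def placebo_did(series, fake_event_day):
    """series: list of (day, treated, y), with day <= 0 meaning pre-treatment.

    Only pre-treatment days are kept, and 'post' is defined relative to
    a fake event day that lies inside the pre-treatment period.
    """
    rows = [
        (treated, int(day > fake_event_day), y)
        for day, treated, y in series
        if day <= 0
    ]
    return did(rows)


days = range(-100, 1)

# Hypothetical case 1: both groups share the same slope (parallel trends).
parallel = [(d, 0, 100 + 0.5 * d) for d in days] + [(d, 1, 1000 + 0.5 * d) for d in days]

# Hypothetical case 2: the treated group rises faster (non-parallel trends).
diverging = [(d, 0, 100 + 0.5 * d) for d in days] + [(d, 1, 1000 + 1.5 * d) for d in days]

placebo_parallel = placebo_did(parallel, fake_event_day=-50)
placebo_diverging = placebo_did(diverging, fake_event_day=-50)
print("placebo DiD, parallel trends:", placebo_parallel)
print("placebo DiD, diverging trends:", placebo_diverging)
```

A placebo estimate near zero does not prove that parallel trends would have continued after the real event, but a clearly non-zero one is a warning sign that the main estimate may be biased.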
3. Overall, I think the conclusions of McCabe et al. (2024) are largely justified.
In the paper, McCabe et al. (2024) use more than one approach. Apart from DiD, they also use sharp regression discontinuity (SRD). The SRD analysis indicates a significant decline in misinformation sharing by deplatformed users, as expected. Meanwhile, the DiD analysis shows a notable spillover effect, with a reduction in retweets by users who followed deplatformed accounts. This suggests that deplatforming impacts misinformation both directly and indirectly. The authors made efforts to verify some of the assumptions, although not all of them may hold true. They used a placebo test that examines patterns in shopping and sports tweets; if the design is sound, users' behavior on these topics should not change because of the intervention. Such a result would indicate that the intervention did not affect behavior it was not intended to affect, increasing confidence in its effect on misinformation sharing.
However, the study faces limitations that temper its causal claims. The SRD design is confounded by concurrent political events, such as the insurrection itself and media coverage of election certification, which complicates efforts to isolate the specific impact of deplatforming. The authors acknowledge that interpreting these results as causal depends on strong assumptions, such as continuity and parallel trends, which may not fully hold given the extraordinary context. Furthermore, although Twitter’s intervention appears effective in reducing misinformation, the findings may not generalize to other deplatforming events, since this one was amplified by media coverage and user awareness. Overall, while McCabe et al. (2024) provide compelling evidence of Twitter’s regulatory capacity, the results might be better viewed as context-specific and contingent upon unverified assumptions.
4. From my perspective, I do feel that misinformation on social media is a substantial threat to discourse and society; however, I am pessimistic about the viewpoint that data science can address this threat fully.
It is always tricky to define misinformation. Is it something that contradicts the truth that we can verify? Or is it simply something taken out of context, making its authenticity difficult to validate? Even the authors did not do very well in this regard, as the study classified tweets as misinformation if they contained URLs from a predefined list of domains. These lists focused on domains that lack editorial norms or have low credibility scores. URLs in tweets were cross-referenced with this list, but the analysis did not evaluate the content’s truthfulness. As such, it might oversimplify the classification of misinformation. On the one hand, the domains might not necessarily represent misinformation but are classified as such (False Positive); on the other hand, the scope of misinformation could be much larger than fake news, with some content not identified as misinformation (False Negative).
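The domain-list approach described above can be sketched as follows. This is a minimal illustration under my own assumptions: the flagged domains are placeholders, and the real lists (for example those of Grinberg et al. or NewsGuard) and the authors' actual matching rules are not reproduced here.

```python
from urllib.parse import urlparse

# Hypothetical placeholder list; NOT the lists used by McCabe et al.
FLAGGED_DOMAINS = {"example-fakenews.com", "lowcred.example.org"}


def extract_domain(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_flagged(tweet_urls, flagged=FLAGGED_DOMAINS):
    """A tweet counts as misinformation if any of its URLs is on the list."""
    return any(extract_domain(url) in flagged for url in tweet_urls)


# A link to a listed domain is flagged regardless of whether the specific
# article is accurate (a potential false positive).
print(is_flagged(["https://www.example-fakenews.com/accurate-story"]))

# A false claim with no URL, or with a link to an unlisted domain,
# is never flagged (a potential false negative).
print(is_flagged([]))
print(is_flagged(["https://reliable.example.net/article"]))
```

The classifier only ever looks at the domain, so the truthfulness of the tweet's content plays no part in the label, which is exactly the source of both error types discussed above.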
Secondly, misinformation can do real harm by shaping false beliefs (Ecker et al., 2012). People often rely on intuition rather than careful reasoning when determining what is true, making them prone to biases. Repetition of a claim makes it seem more believable, a phenomenon known as the illusory truth effect. This effect can persist over time, regardless of cognitive ability and prior knowledge. Misinformation can continue to influence people’s thinking even after they receive a correction and accept it as true, which is known as the continued influence effect.
In conclusion, people should practice their critical thinking skills and make sound judgments. We are currently living in the era of Artificial Intelligence, and deepfakes make it even more challenging to differentiate truth from misinformation. People should be cautious about the information they consume and exercise the same caution when creating and spreading information. Meanwhile, social media platforms should also play a role in combating misinformation. This need not involve coercion or suppression; even a gentle reminder or downranking (with public voting) could work effectively. The main reason I am not confident that misinformation can be addressed solely by data science is the complexity of the problem combined with human creativity. We should never underestimate human creativity in communication and the ability to create and understand coded language. Social media users often employ countermeasures to circumvent detection by social media algorithms. There could be an infinite number of variations, metaphors with historical roots, and other complexities. Moreover, there are complex ethical considerations about the right to freedom of speech, which add another layer of complication. Together, these make it almost impossible for data science to fully address these issues.
5. Other thoughts
Social media holds significant power in regulating discourse through its terms of use. McCabe et al. (2024) also noted two instruments: content moderation and the enforcement of users' terms of use. In the paper, the main discussion centered on the enforcement instrument, such as deplatforming; nevertheless, an important part of the communication landscape remains unaddressed. Prior to this project, I anticipated that the direct effects of deplatforming on social media would be straightforward: targeted users are removed, and their "products" inevitably diminish. But it is also interesting to discover spillover effects in this sphere. And while deplatforming may have some indirect effects at first glance, what happens if users change their ways of expression? This could result in misinformation that continues to exist and becomes harder to detect, posing a new challenge. Consequently, our conclusions may be threatened. Lastly, data scientists alone cannot adequately address these issues. Tackling them requires broader and deeper collaboration among stakeholders, including the public and policymakers.