Background

This project engages with the findings of the 2024 Nature publication: 'Post-January 6th deplatforming reduced the reach of misinformation on Twitter' by McCabe et al. The paper investigates Twitter's decision to deplatform 70,000 misinformation-spreading accounts after the January 6th Capitol riot. Using panel data from over 500,000 Twitter users, the study applied Difference-in-Differences (DID) and Sharp Regression Discontinuity (SRD) to assess the causal impact of deplatforming on misinformation spread.

Part 1: Data Simulation and Regression

Data Simulation & Regression

Instructions:

Take a copy of this notebook and answer the questions in Sections 2, 3, and 4. Add as many code and markdown cells as needed within those sections.
Answer the External Resources question in Section 5.

Section 1: Data Simulation Engine

The code in this section provides the tools needed to simulate data from a given set of simulation_parameters.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

# example set of simulation parameters
default_simulation_parameters = {
    "num_employees_per_company": 50,
    "satisfaction_variance": 1, 
    "data_collection_start_date": pd.to_datetime("2022-01-01"),
    "data_collection_end_date": pd.to_datetime("2023-12-31"),
    "remote_work_onset_date": pd.to_datetime("2023-01-01"),
    "remote_work_treatment_effect": 2,
    "satisfaction_levels": {"Rubicon":5.2, "Giggle": 6.1}
}

In [3]:

def simulate_data(simulation_parameters):
    data = []
    
    # create the date range for data collection
    dates = pd.date_range(
        start=simulation_parameters["data_collection_start_date"], 
        end=simulation_parameters["data_collection_end_date"], 
        freq='W' # weekly intervals between start and end
    )

    # set number of employees per company
    num_employees = simulation_parameters["num_employees_per_company"]
    for company in ['Rubicon', 'Giggle']:

        # company satisfaction parameters
        satisfaction_mean = simulation_parameters["satisfaction_levels"][company]
        satisfaction_variance = simulation_parameters["satisfaction_variance"]

        # simulate data collection
        for date in dates:
            time_at_company = (date - simulation_parameters["data_collection_start_date"]).days
            for i in range(num_employees):
                employee_age = np.random.randint(18, 60)
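                # note: np.random.normal takes a standard deviation as its second
                # argument, so "satisfaction_variance" is used as an SD here
                # (the two coincide for the default value of 1)
                satisfaction = np.random.normal(
                    satisfaction_mean,
                    satisfaction_variance
                )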

                # treatment effect
                if date > simulation_parameters["remote_work_onset_date"]:
                    if company == "Giggle":
                        satisfaction += simulation_parameters["remote_work_treatment_effect"] 
                
                # save the data in a useful format
                datapoint = {
                    "date": date, 
                    "company": company,
                    "satisfaction": satisfaction,
                    "employee_id": f"{company[0]}{i}", # create a fake employee id by combining the company name initial letter and the loop index variable
                    "time_at_company": time_at_company,
                    "employee_age": employee_age
                } 
                data.append(datapoint)
    return pd.DataFrame(data)

In [4]:

df_default = simulate_data(default_simulation_parameters)

In [5]:

df_default

Out[5]:

	date	company	satisfaction	employee_id	time_at_company	employee_age
0	2022-01-02	Rubicon	4.999562	R0	1	33
1	2022-01-02	Rubicon	5.209023	R1	1	38
2	2022-01-02	Rubicon	5.060302	R2	1	57
3	2022-01-02	Rubicon	4.626068	R3	1	57
4	2022-01-02	Rubicon	3.793625	R4	1	48
...	...	...	...	...	...	...
10495	2023-12-31	Giggle	7.935362	G45	729	46
10496	2023-12-31	Giggle	7.429865	G46	729	36
10497	2023-12-31	Giggle	6.627971	G47	729	41
10498	2023-12-31	Giggle	9.220513	G48	729	21
10499	2023-12-31	Giggle	7.028789	G49	729	39

10500 rows × 6 columns

In [6]:

df_default.sample(n=10)

Out[6]:

	date	company	satisfaction	employee_id	time_at_company	employee_age
3662	2023-05-28	Rubicon	6.020167	R12	512	39
135	2022-01-16	Rubicon	6.246700	R35	15	24
9784	2023-09-24	Giggle	9.478592	G34	631	33
1267	2022-06-26	Rubicon	6.045406	R17	176	59
4399	2023-09-03	Rubicon	5.616598	R49	610	18
6787	2022-07-31	Giggle	4.576153	G37	211	20
4764	2023-10-29	Rubicon	5.667133	R14	666	26
3463	2023-04-30	Rubicon	6.442557	R13	484	27
878	2022-05-01	Rubicon	6.013686	R28	120	43
1403	2022-07-17	Rubicon	5.911192	R3	197	58
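
simulate_data draws from NumPy's global random state (np.random.normal and np.random.randint), so the rows shown above change on every run. A minimal sketch of making the draws repeatable, assuming that seeding the global generator with np.random.seed is acceptable for this notebook; the helper draw_satisfaction and the seed value 42 are illustrative and not part of the original code.

```python
import numpy as np

def draw_satisfaction(n, mean=5.2, sd=1.0):
    # Hypothetical helper: the same kind of draw simulate_data makes per employee
    return np.random.normal(mean, sd, size=n)

np.random.seed(42)          # arbitrary seed, chosen only for illustration
first = draw_satisfaction(5)

np.random.seed(42)          # re-seeding restarts the same random sequence
second = draw_satisfaction(5)

print(np.array_equal(first, second))  # True: both runs produce identical draws
```

Calling np.random.seed once immediately before simulate_data(...) should have the same effect on the full simulation, since the function relies only on the global generator.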

Section 2: Alternative Simulation Parameters

The goal in this section is to demonstrate understanding of simulating data using different sets of parameters. The section will start with one example, followed by two questions.

2.1 Example

Example: simulate a dataset in which the onset date for the remote work treatment falls in March.

In [8]:

# take a copy of the default simulation parameters dictionary
march_onset_parameters = default_simulation_parameters.copy()

# overwrite the remote work onset date in our new copy of the parameters
march_onset_parameters["remote_work_onset_date"] = pd.to_datetime("2023-03-01")

In [9]:

# look at the updated parameters 
march_onset_parameters

Out[9]:

{'num_employees_per_company': 50,
 'satisfaction_variance': 1,
 'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
 'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
 'remote_work_onset_date': Timestamp('2023-03-01 00:00:00'),
 'remote_work_treatment_effect': 2,
 'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}
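
One caveat about dict.copy(): it makes a shallow copy, so the nested satisfaction_levels dictionary is shared between the copy and the defaults. Overwriting a top-level key (as in the example above) is safe, but editing a nested value would silently change the defaults too. The sketch below illustrates the difference with a cut-down parameter dictionary; the name defaults and the value 9.9 are illustrative only.

```python
import copy

# Cut-down stand-in for default_simulation_parameters (illustrative values)
defaults = {
    "remote_work_treatment_effect": 2,
    "satisfaction_levels": {"Rubicon": 5.2, "Giggle": 6.1},
}

shallow = defaults.copy()
shallow["remote_work_treatment_effect"] = -2      # top-level key: defaults untouched
shallow["satisfaction_levels"]["Giggle"] = 9.9    # nested dict is shared: leaks
print(defaults["remote_work_treatment_effect"])   # 2
print(defaults["satisfaction_levels"]["Giggle"])  # 9.9

defaults["satisfaction_levels"]["Giggle"] = 6.1   # restore the original value

deep = copy.deepcopy(defaults)
deep["satisfaction_levels"]["Giggle"] = 9.9       # nested dict is independent
print(defaults["satisfaction_levels"]["Giggle"])  # 6.1
```

For any question that varies the baseline satisfaction levels, copy.deepcopy is therefore the safer choice.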

In [10]:

# simulate data with new parameters
march_onset_df = simulate_data(march_onset_parameters)

In [11]:

# look at the data
march_onset_df.sample(5)

Out[11]:

	date	company	satisfaction	employee_id	time_at_company	employee_age
1955	2022-10-02	Rubicon	5.687980	R5	274	19
8483	2023-03-26	Giggle	7.570722	G33	449	22
5556	2022-02-13	Giggle	7.174432	G6	43	43
856	2022-05-01	Rubicon	3.808717	R6	120	59
560	2022-03-20	Rubicon	7.371126	R10	78	47

In [12]:

# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=march_onset_df, x="date", y="satisfaction", hue="company")

Out[12]:

<Axes: xlabel='date', ylabel='satisfaction'>

2.2 Question 1

Simulate a dataset in which remote work was associated with a decrease in employee satisfaction.

In [13]:

# take a copy of the default simulation parameters dictionary
remote_decrease_satisfation_parameters = default_simulation_parameters.copy()

# overwrite remote_work_treatment_effect with a negative value (remote work is then associated with a decrease in employee satisfaction)
remote_decrease_satisfation_parameters["remote_work_treatment_effect"] = -2

In [14]:

# look at the updated parameters 
remote_decrease_satisfation_parameters

Out[14]:

{'num_employees_per_company': 50,
 'satisfaction_variance': 1,
 'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
 'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
 'remote_work_onset_date': Timestamp('2023-01-01 00:00:00'),
 'remote_work_treatment_effect': -2,
 'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}

In [15]:

# simulate data with new parameters
remote_decrease_satisfation_df=simulate_data(remote_decrease_satisfation_parameters)

In [16]:

# look at the data
remote_decrease_satisfation_df.sample(5)

Out[16]:

	date	company	satisfaction	employee_id	time_at_company	employee_age
9453	2023-08-13	Giggle	4.924890	G3	589	28
8128	2023-02-05	Giggle	2.895555	G28	400	24
1367	2022-07-10	Rubicon	6.439668	R17	190	36
4231	2023-08-13	Rubicon	5.833665	R31	589	29
6215	2022-05-15	Giggle	6.125773	G15	134	54

In [17]:

# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=remote_decrease_satisfation_df, x="date", y="satisfaction", hue="company")

Out[17]:

<Axes: xlabel='date', ylabel='satisfaction'>

In [18]:

# Additional Note (see below)

The manipulation to the dataset has worked. The dataset was simulated with a negative remote_work_treatment_effect (-2), and the graph above shows the corresponding drop: Giggle's satisfaction falls at the treatment date and stays lower for the rest of the data collection period. The contrast with the Example section, where the treatment effect was positive, is clear.
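
The visual check can be backed up with numbers by comparing each company's mean satisfaction before and after the onset date. The sketch below is self-contained and uses simulate_small, a hypothetical seeded, vectorized stand-in for simulate_data with the same design (weekly surveys, 50 employees per company, baselines 5.2 and 6.1, SD 1); it is an assumption for illustration, not the notebook's own function.

```python
import numpy as np
import pandas as pd

def simulate_small(effect, onset, n_employees=50, seed=0):
    """Hypothetical, seeded stand-in for simulate_data (same design)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2022-01-01", "2023-12-31", freq="W")
    frames = []
    for company, baseline in {"Rubicon": 5.2, "Giggle": 6.1}.items():
        frame = pd.DataFrame({"date": dates.repeat(n_employees)})
        frame["company"] = company
        frame["satisfaction"] = rng.normal(baseline, 1.0, size=len(frame))
        if company == "Giggle":  # only the treated company shifts after onset
            frame.loc[frame["date"] > pd.Timestamp(onset), "satisfaction"] += effect
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

onset = pd.Timestamp("2023-01-01")
df = simulate_small(effect=-2, onset=onset)

# Label each row as pre- or post-onset, then average per company and period
df["period"] = np.where(df["date"] > onset, "post", "pre")
means = df.groupby(["company", "period"])["satisfaction"].mean().unstack("period")
change = means["post"] - means["pre"]
print(change.round(2))
```

Under these assumptions the change for Giggle should be close to -2, while the change for Rubicon should be close to 0.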

2.3 Question 2

Simulate a dataset in which remote work began much sooner after the data collection start date. Illustrate clearly that the manipulation to the dataset has worked.

In [20]:

# take a copy of the default simulation parameters dictionary
much_sooner_onset_parameters = default_simulation_parameters.copy()

# By default, data collection starts on 2022-01-01 and remote work starts on 2023-01-01.
# To meet the requirement of the prompt, remote work should begin much earlier than 2023-01-01 but still after 2022-01-01.
# Let's try 2022-03-01.

# overwrite the remote work onset date in our new copy of the parameters
much_sooner_onset_parameters["remote_work_onset_date"] = pd.to_datetime("2022-03-01")

In [21]:

# look at the updated parameters 
much_sooner_onset_parameters

Out[21]:

{'num_employees_per_company': 50,
 'satisfaction_variance': 1,
 'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
 'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
 'remote_work_onset_date': Timestamp('2022-03-01 00:00:00'),
 'remote_work_treatment_effect': 2,
 'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}

In [22]:

# simulate data with new parameters
much_sooner_onset_df=simulate_data(much_sooner_onset_parameters)

In [23]:

# look at the data
much_sooner_onset_df.sample(5)

Out[23]:

	date	company	satisfaction	employee_id	time_at_company	employee_age
2228	2022-11-06	Rubicon	5.923113	R28	309	21
8084	2023-01-29	Giggle	7.498240	G34	393	57
3475	2023-04-30	Rubicon	5.194315	R25	484	53
5907	2022-04-03	Giggle	8.696635	G7	92	53
6644	2022-07-10	Giggle	8.021419	G44	190	43

In [24]:

# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=much_sooner_onset_df, x="date", y="satisfaction", hue="company")

Out[24]:

<Axes: xlabel='date', ylabel='satisfaction'>

In [25]:

# Additional Note (see below)

The manipulation to the dataset has worked. The remote work onset was moved much closer to the data collection start date ('2022-01-01'); here '2022-03-01' was chosen. Remote work now begins only two months after data collection starts, rather than a year later (the default) or even later (as in the Example section). This is evident in the visualization above, where the jump in Giggle's satisfaction shifts towards the left-hand side.
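
To illustrate the shift without relying on the plot alone, one option is to find the first weekly survey date on which Giggle's average satisfaction jumps above its baseline. The sketch below assumes a hypothetical seeded stand-in for simulate_data called simulate_small, and a threshold halfway between the baseline (6.1) and the shifted level (8.1); both are illustrative choices.

```python
import numpy as np
import pandas as pd

def simulate_small(effect, onset, n_employees=50, seed=0):
    """Hypothetical, seeded stand-in for simulate_data (same design)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2022-01-01", "2023-12-31", freq="W")
    frames = []
    for company, baseline in {"Rubicon": 5.2, "Giggle": 6.1}.items():
        frame = pd.DataFrame({"date": dates.repeat(n_employees)})
        frame["company"] = company
        frame["satisfaction"] = rng.normal(baseline, 1.0, size=len(frame))
        if company == "Giggle":
            frame.loc[frame["date"] > pd.Timestamp(onset), "satisfaction"] += effect
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

df = simulate_small(effect=2, onset="2022-03-01")

# Weekly mean satisfaction for the treated company
weekly = df[df["company"] == "Giggle"].groupby("date")["satisfaction"].mean()

# First survey date whose mean sits above the midpoint of 6.1 and 8.1
threshold = 6.1 + 2 / 2
first_shift = weekly[weekly > threshold].index.min()
print(first_shift.date())
```

Because surveys are weekly on Sundays, the detected date should be 2022-03-06, the first survey after the 2022-03-01 onset.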

Section 3: Regression analyses

3.1 Question 3

Using statsmodels, perform a regression analysis on a dataset simulated with the default parameters. The regression analysis should examine whether an employee's age predicts their satisfaction levels. Explain how the results of the analysis support the conclusion.

In [35]:

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [39]:

# Establish the formula
regression_formula = 'satisfaction ~ employee_age'
model = smf.ols(regression_formula, data=df_default)

In [42]:

# Fit the regression model
results = model.fit()

In [43]:

# Print the summary table
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.468
Date:                Thu, 03 Oct 2024   Prob (F-statistic):              0.226
Time:                        22:22:57   Log-Likelihood:                -19579.
No. Observations:               10500   AIC:                         3.916e+04
Df Residuals:                   10498   BIC:                         3.918e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        6.2168      0.050    123.443      0.000       6.118       6.315
employee_age    -0.0015      0.001     -1.211      0.226      -0.004       0.001
==============================================================================
Omnibus:                      333.018   Durbin-Watson:                   0.818
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              331.859
Skew:                           0.403   Prob(JB):                     8.66e-73
Kurtosis:                       2.669   Cond. No.                         133.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [44]:

# Additional Note (See below)

An employee's age does not predict their satisfaction levels. As demonstrated by the table above, the coefficient of employee_age is very small in absolute value (-0.0015, close to zero), and even this small effect is not statistically significant at conventional levels (p = 0.226, above both 0.05 and 0.1). In addition, the confidence interval (-0.004 to 0.001) includes zero, which again indicates that an employee's age does not predict their satisfaction level. I think this is the expected outcome: in the data-generating process, employee_age is drawn at random and independently of satisfaction, so no systematic relationship between the two should appear. Last but not least, the R-squared is 0.000, implying that the model has no predictive power.
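
The figures quoted from the summary table can also be read directly off the fitted results object, which avoids transcription slips. The sketch below fits the same kind of model on a small, independently generated dataset (the sample size, seed, and mean of 5.6 are assumptions for illustration) and extracts the slope, p-value, confidence interval, and R-squared using the params, pvalues, conf_int(), and rsquared attributes of a statsmodels OLS result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: age and satisfaction drawn independently of each other
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "employee_age": rng.integers(18, 60, size=2000),
    "satisfaction": rng.normal(5.6, 1.0, size=2000),
})

res = smf.ols("satisfaction ~ employee_age", data=demo).fit()

slope = res.params["employee_age"]                   # estimated coefficient
p_value = res.pvalues["employee_age"]                # two-sided p-value
ci_low, ci_high = res.conf_int().loc["employee_age"] # 95% interval by default
print(f"slope={slope:.4f}, p={p_value:.3f}, "
      f"CI=[{ci_low:.4f}, {ci_high:.4f}], R2={res.rsquared:.4f}")
```

The same attribute lookups apply to the results object fitted on df_default above.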

3.2 Question 4

Simulate a dataset using the default parameters.
Using statsmodels, perform a regression analysis with one dependent variable (satisfaction) and one predictor (time_at_company) to examine whether employees who have been working longer are happier.

What does the regression result show? Is the result accurate? Explain how the results of analysis support the conclusion and any limitations of the analysis.

In [45]:

# Simulate a dataset again using the default parameters
df_default_new = simulate_data(default_simulation_parameters)

In [46]:

# The necessary libraries were imported in Question 3, so we don't have to import them again
# import statsmodels.api as sm
# import statsmodels.formula.api as smf

In [47]:

# Establish the formula
regression_formula_new = 'satisfaction ~ time_at_company'
model_new = smf.ols(regression_formula_new, data=df_default_new)

In [48]:

# Fit the new regression model
results_new = model_new.fit()

In [49]:

print(results_new.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.078
Method:                 Least Squares   F-statistic:                     892.7
Date:                Thu, 03 Oct 2024   Prob (F-statistic):          2.46e-188
Time:                        22:34:42   Log-Likelihood:                -19022.
No. Observations:               10500   AIC:                         3.805e+04
Df Residuals:                   10498   BIC:                         3.806e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           5.3986      0.029    187.705      0.000       5.342       5.455
time_at_company     0.0020   6.81e-05     29.879      0.000       0.002       0.002
==============================================================================
Omnibus:                      110.100   Durbin-Watson:                   0.910
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               78.608
Skew:                           0.102   Prob(JB):                     8.52e-18
Kurtosis:                       2.629   Cond. No.                         840.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [50]:

# The coefficient above is very small, which may be a result of the unit used (days)

# Let's try expressing time_at_company in years instead
df_default_new["time_at_company_y"]= df_default_new["time_at_company"]/365

# Establish the formula
regression_formula_new = 'satisfaction ~ time_at_company_y'
model_new = smf.ols(regression_formula_new, data=df_default_new)
# Fit the new regression model
results_new = model_new.fit()
print(results_new.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.078
Method:                 Least Squares   F-statistic:                     892.7
Date:                Thu, 03 Oct 2024   Prob (F-statistic):          2.46e-188
Time:                        22:36:58   Log-Likelihood:                -19022.
No. Observations:               10500   AIC:                         3.805e+04
Df Residuals:                   10498   BIC:                         3.806e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             5.3986      0.029    187.705      0.000       5.342       5.455
time_at_company_y     0.7429      0.025     29.879      0.000       0.694       0.792
==============================================================================
Omnibus:                      110.100   Durbin-Watson:                   0.910
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               78.608
Skew:                           0.102   Prob(JB):                     8.52e-18
Kurtosis:                       2.629   Cond. No.                         3.76
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [51]:

# Additional Note (See below)

At first glance, the model seems to support the conclusion that "employees who have been working longer are happier". I was confused by this at the beginning, given that the data were generated randomly. However, on second thought, I figured out that the observations with longer time at the company are also those that may have received the treatment we pre-set (if they are in the treatment group, the Giggle company). As a result, overall satisfaction is "inflated" after the onset date when all the data are pooled for analysis. The model therefore fails to separate the treatment effect from the time effect (time_at_company is not random), and the DID model discussed below works much better in this respect. In addition, the R-squared is small (0.078), implying that the model explains little of the variation in satisfaction.

In [52]:

# To check whether my reasoning is correct, we can try a new dataset where the treatment effect is zero.
satisfation_parameters_no_effect = default_simulation_parameters.copy()
satisfation_parameters_no_effect["remote_work_treatment_effect"] = 0
df_no_effect = simulate_data(satisfation_parameters_no_effect)
# Establish the formula
regression_formula_no_effect = 'satisfaction ~ time_at_company'
model_no_effect = smf.ols(regression_formula_no_effect, data=df_no_effect)
results_model_no_effect = model_no_effect.fit()
print(results_model_no_effect.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     3.354
Date:                Thu, 03 Oct 2024   Prob (F-statistic):             0.0671
Time:                        22:43:00   Log-Likelihood:                -15855.
No. Observations:               10500   AIC:                         3.171e+04
Df Residuals:                   10498   BIC:                         3.173e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           5.6232      0.021    264.350      0.000       5.582       5.665
time_at_company  9.228e-05   5.04e-05      1.831      0.067   -6.49e-06       0.000
==============================================================================
Omnibus:                        2.999   Durbin-Watson:                   1.650
Prob(Omnibus):                  0.223   Jarque-Bera (JB):                2.966
Skew:                           0.025   Prob(JB):                        0.227
Kurtosis:                       2.935   Cond. No.                         840.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [53]:

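# The coefficient is no longer significant at the 0.05 level (p = 0.067) and is roughly twenty times smaller than before (9.2e-05 vs 0.0020), which is consistent with my reasoning.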
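
A complementary check keeps the treatment effect at its default value but fits the time trend within the untreated company only. If the pooled slope is driven by the treatment, the slope for Rubicon alone should be close to zero. The sketch below uses np.polyfit for the slopes and a hypothetical seeded stand-in for simulate_data called simulate_small; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def simulate_small(effect, onset, n_employees=50, seed=0):
    """Hypothetical, seeded stand-in for simulate_data (same design)."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp("2022-01-01")
    dates = pd.date_range(start, "2023-12-31", freq="W")
    frames = []
    for company, baseline in {"Rubicon": 5.2, "Giggle": 6.1}.items():
        frame = pd.DataFrame({"date": dates.repeat(n_employees)})
        frame["company"] = company
        frame["time_at_company"] = (frame["date"] - start).dt.days
        frame["satisfaction"] = rng.normal(baseline, 1.0, size=len(frame))
        if company == "Giggle":
            frame.loc[frame["date"] > pd.Timestamp(onset), "satisfaction"] += effect
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

def slope_per_day(frame):
    # Degree-1 least-squares fit; element 0 is the slope
    x = frame["time_at_company"].to_numpy()
    y = frame["satisfaction"].to_numpy()
    return np.polyfit(x, y, 1)[0]

df = simulate_small(effect=2, onset="2023-01-01")
pooled_slope = slope_per_day(df)
rubicon_slope = slope_per_day(df[df["company"] == "Rubicon"])
print(f"pooled: {pooled_slope:.5f}, Rubicon only: {rubicon_slope:.5f}")
```

Under these assumptions the pooled slope should be near 0.002 per day, as in the regression above, while the Rubicon-only slope should be an order of magnitude smaller.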

In [54]:

# Something beyond our simulation

Theoretically, several outcomes are possible. Employees who have been working longer might show neither higher nor lower satisfaction scores. A positive correlation is possible if those who stay longer do so mainly out of loyalty to, and high satisfaction with, their companies. However, it is also possible that those who stay longer become more and more unhappy with their companies but, for some reason, stay rather than quit. A negative correlation would then emerge.

Section 4: Difference in Differences analyses

4.1 Question 5

Simulate a dataset in which remote work has a strong negative effect on employee satisfaction.
Using statsmodels, perform a Difference in Differences analysis to examine the effect of remote work on satisfaction.

What does the regression result show? Is the result accurate? Explain how the results of analysis support the conclusion.

In [62]:

# Simulate a new dataset in which remote work has a strong negative effect on employee satisfaction

## take a copy of the default simulation parameters dictionary
stong_neagtive_parameters = default_simulation_parameters.copy()

## Check the baseline to ensure a reasonable negative value is chosen.
## Looking at the baseline, a value of -4 could be a good choice.
stong_neagtive_parameters["remote_work_treatment_effect"]= -4

In [63]:

# look at the updated parameters 
stong_neagtive_parameters

Out[63]:

{'num_employees_per_company': 50,
 'satisfaction_variance': 1,
 'data_collection_start_date': Timestamp('2022-01-01 00:00:00'),
 'data_collection_end_date': Timestamp('2023-12-31 00:00:00'),
 'remote_work_onset_date': Timestamp('2023-01-01 00:00:00'),
 'remote_work_treatment_effect': -4,
 'satisfaction_levels': {'Rubicon': 5.2, 'Giggle': 6.1}}

In [64]:

stong_neagtive_df=simulate_data(stong_neagtive_parameters)
stong_neagtive_df

Out[64]:

	date	company	satisfaction	employee_id	time_at_company	employee_age
0	2022-01-02	Rubicon	5.053856	R0	1	52
1	2022-01-02	Rubicon	5.885893	R1	1	21
2	2022-01-02	Rubicon	4.258374	R2	1	21
3	2022-01-02	Rubicon	4.027964	R3	1	18
4	2022-01-02	Rubicon	6.009310	R4	1	23
...	...	...	...	...	...	...
10495	2023-12-31	Giggle	0.905752	G45	729	57
10496	2023-12-31	Giggle	3.229823	G46	729	59
10497	2023-12-31	Giggle	2.652730	G47	729	59
10498	2023-12-31	Giggle	0.722620	G48	729	31
10499	2023-12-31	Giggle	2.754140	G49	729	52

10500 rows × 6 columns

In [65]:

# confirm the change
plt.figure(figsize=(12,2))
sns.lineplot(data=stong_neagtive_df, x="date", y="satisfaction", hue="company")

Out[65]:

<Axes: xlabel='date', ylabel='satisfaction'>

In [66]:

# By default, remote_work_onset_date= 2023-01-01 00:00:00
remote_work_onset_date= pd.to_datetime("2023-01-01")

In [67]:

# Create dummy variables
stong_neagtive_df['post_treatment'] = (stong_neagtive_df['date'] > remote_work_onset_date).astype(int)
stong_neagtive_df['treatment_group'] = (stong_neagtive_df['company'] == 'Giggle').astype(int)

In [68]:

stong_neagtive_df.sample(10)

Out[68]:

	date	company	satisfaction	employee_id	time_at_company	employee_age	post_treatment	treatment_group
10496	2023-12-31	Giggle	3.229823	G46	729	59	1	1
144	2022-01-16	Rubicon	5.283587	R44	15	18	0	0
7967	2023-01-15	Giggle	1.672606	G17	379	39	1	1
2220	2022-11-06	Rubicon	3.463370	R20	309	19	0	0
7793	2022-12-18	Giggle	5.571966	G43	351	53	0	1
3882	2023-06-25	Rubicon	5.449915	R32	540	34	1	0
4667	2023-10-15	Rubicon	4.354222	R17	652	23	1	0
562	2022-03-20	Rubicon	4.910582	R12	78	20	0	0
558	2022-03-20	Rubicon	3.581364	R8	78	49	0	0
6192	2022-05-08	Giggle	6.436148	G42	127	37	0	1

In [69]:

did_formula = 'satisfaction ~ post_treatment + treatment_group + post_treatment*treatment_group'

In [70]:

did_model = smf.ols(did_formula, data=stong_neagtive_df)
did_results = did_model.fit()
print(did_results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.700
Model:                            OLS   Adj. R-squared:                  0.700
Method:                 Least Squares   F-statistic:                     8153.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        22:54:22   Log-Likelihood:                -14841.
No. Observations:               10500   AIC:                         2.969e+04
Df Residuals:                   10496   BIC:                         2.972e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1806      0.019    268.107      0.000       5.143       5.218
post_treatment                     0.0171      0.027      0.623      0.533      -0.037       0.071
treatment_group                    0.9328      0.027     34.136      0.000       0.879       0.986
post_treatment:treatment_group    -4.0375      0.039   -103.976      0.000      -4.114      -3.961
==============================================================================
Omnibus:                        0.687   Durbin-Watson:                   1.987
Prob(Omnibus):                  0.709   Jarque-Bera (JB):                0.661
Skew:                           0.017   Prob(JB):                        0.718
Kurtosis:                       3.017   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [48]:

## examine the effect of remote work on satisfaction

The result shows that the intercept, treatment_group, and the interaction term post_treatment:treatment_group are statistically significant. The intercept gives the average satisfaction of the control group (Rubicon) before the treatment date (5.18, closely matching Rubicon's baseline of 5.2). The intercept plus the treatment_group coefficient gives the treatment group's (Giggle's) satisfaction before the treatment date (5.18 + 0.93 = 6.11, matching 6.1). The post_treatment indicator is not statistically significant in this case.

The effect of remote work on satisfaction is given by the interaction term, whose coefficient (-4.04) is very close to -4, exactly the value set at the beginning of the data simulation. The DID model fits the data well, with an R-squared of 0.7.
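
In this two-group, two-period setup, the interaction coefficient equals a simple difference of differences in the four cell means: (treated post - treated pre) - (control post - control pre). The sketch below computes that quantity directly with pandas, using a hypothetical seeded stand-in for simulate_data called simulate_small (an assumption for illustration) and the same dummy variable names as above.

```python
import numpy as np
import pandas as pd

def simulate_small(effect, onset, n_employees=50, seed=0):
    """Hypothetical, seeded stand-in for simulate_data (same design)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2022-01-01", "2023-12-31", freq="W")
    frames = []
    for company, baseline in {"Rubicon": 5.2, "Giggle": 6.1}.items():
        frame = pd.DataFrame({"date": dates.repeat(n_employees)})
        frame["company"] = company
        frame["satisfaction"] = rng.normal(baseline, 1.0, size=len(frame))
        if company == "Giggle":
            frame.loc[frame["date"] > pd.Timestamp(onset), "satisfaction"] += effect
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

onset = pd.Timestamp("2023-01-01")
df = simulate_small(effect=-4, onset=onset)
df["post_treatment"] = (df["date"] > onset).astype(int)
df["treatment_group"] = (df["company"] == "Giggle").astype(int)

# Rows: treatment_group (0/1); columns: post_treatment (0/1)
cell_means = (
    df.groupby(["treatment_group", "post_treatment"])["satisfaction"]
    .mean()
    .unstack("post_treatment")
)
treated_change = cell_means.loc[1, 1] - cell_means.loc[1, 0]
control_change = cell_means.loc[0, 1] - cell_means.loc[0, 0]
did_estimate = treated_change - control_change
print(round(did_estimate, 3))
```

Subtracting the control group's change is what removes any shift over time that is common to both companies, which is why the estimate should land close to the simulated effect of -4.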

4.2 Question 6

Use data simulation and DiD analyses to examine whether the DiD analysis is robust to variations in the dataset, such as the size of the dataset, the strength of the treatment effect, and the difference in baseline levels of satisfaction between the two companies.

You do not need to analyze all of these factors, but your analyses should examine how at least one variable has an impact on the capacity of the DiD analysis to accurately detect the true effect of the remote work intervention.

In [71]:

# The necessary libraries were imported in Question 3, so we don't have to import them again
# import statsmodels.api as sm
# import statsmodels.formula.api as smf

In [74]:

# Manipulate the size of the dataset manually (one setting at a time)
# Let's try the default setting first
df_default_demo = df_default.copy()
remote_work_onset_date= pd.to_datetime("2023-01-01")
# And the corresponding default DID model
df_default_demo['post_treatment'] = (df_default_demo['date'] > remote_work_onset_date).astype(int)
df_default_demo['treatment_group'] = (df_default_demo['company'] == 'Giggle').astype(int)
df_default_demo.sample(10)

Out[74]:

	date	company	satisfaction	employee_id	time_at_company	employee_age	post_treatment	treatment_group
7851	2023-01-01	Giggle	6.133468	G1	365	52	0	1
8354	2023-03-12	Giggle	7.644481	G4	435	50	1	1
4115	2023-07-30	Rubicon	4.318494	R15	575	33	1	0
2417	2022-12-04	Rubicon	6.191785	R17	337	33	0	0
805	2022-04-24	Rubicon	4.913599	R5	113	32	0	0
6352	2022-06-05	Giggle	6.516512	G2	155	20	0	1
2045	2022-10-09	Rubicon	4.601151	R45	281	38	0	0
7176	2022-09-25	Giggle	7.238588	G26	267	56	0	1
6417	2022-06-12	Giggle	5.891866	G17	162	31	0	1
4047	2023-07-16	Rubicon	3.521234	R47	561	42	1	0

In [75]:

did_formula = 'satisfaction ~ post_treatment + treatment_group + post_treatment*treatment_group'
did_model_demo = smf.ols(did_formula, data=df_default_demo)
did_results_demo = did_model_demo.fit()
print(did_results_demo.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.588
Method:                 Least Squares   F-statistic:                     4997.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:05:22   Log-Likelihood:                -14922.
No. Observations:               10500   AIC:                         2.985e+04
Df Residuals:                   10496   BIC:                         2.988e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.2461      0.019    269.408      0.000       5.208       5.284
post_treatment                    -0.0729      0.028     -2.634      0.008      -0.127      -0.019
treatment_group                    0.8374      0.028     30.407      0.000       0.783       0.891
post_treatment:treatment_group     2.1401      0.039     54.690      0.000       2.063       2.217
==============================================================================
Omnibus:                        1.764   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.414   Jarque-Bera (JB):                1.764
Skew:                           0.032   Prob(JB):                        0.414
Kurtosis:                       2.998   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
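
To make the link between the regression table and the DiD logic explicit, the four cell means can be rebuilt from the coefficients reported above and differenced by hand. This is only an arithmetic sketch: the variable names are my own, and the numbers are copied from the default-run summary table.

```python
# Coefficients copied from the default-run summary table above
intercept = 5.2461      # Rubicon (control), pre-treatment mean
post = -0.0729          # change over time in the control group
group = 0.8374          # pre-treatment gap between Giggle and Rubicon
interaction = 2.1401    # coefficient on post_treatment:treatment_group

# Rebuild the four cell means implied by the saturated 2x2 model
control_pre = intercept
control_post = intercept + post
treated_pre = intercept + group
treated_post = intercept + post + group + interaction

# Difference-in-differences computed by hand
did_by_hand = (treated_post - treated_pre) - (control_post - control_pre)
print(round(did_by_hand, 4))
```

The hand-computed value matches the interaction coefficient, which is why that coefficient is read as the DiD estimate of the treatment effect.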

In [77]:

# Now vary the size of the dataset
# Increase the number of employees per company (L stands for Larger)
num_employees_L_parameters = default_simulation_parameters.copy()
num_employees_L_parameters["num_employees_per_company"] = 200
num_employees_L_df = simulate_data(num_employees_L_parameters)

In [78]:

# Similarly, prepare dummy variables for later DID analysis
remote_work_onset_date= pd.to_datetime("2023-01-01")
num_employees_L_df['post_treatment'] = (num_employees_L_df['date'] > remote_work_onset_date).astype(int)
num_employees_L_df['treatment_group'] = (num_employees_L_df['company'] == 'Giggle').astype(int)
num_employees_L_df.sample(10)

Out[78]:

	date	company	satisfaction	employee_id	time_at_company	employee_age	post_treatment	treatment_group
29554	2022-10-23	Giggle	4.519462	G154	295	28	0	1
40760	2023-11-19	Giggle	7.266917	G160	687	47	1	1
5144	2022-06-26	Rubicon	6.220201	R144	176	58	0	0
32474	2023-02-05	Giggle	7.041340	G74	400	41	1	1
31455	2023-01-01	Giggle	6.718191	G55	365	59	0	1
18862	2023-10-22	Rubicon	7.467705	R62	659	35	1	0
5827	2022-07-24	Rubicon	3.914091	R27	204	47	0	0
18861	2023-10-22	Rubicon	6.634394	R61	659	58	1	0
37760	2023-08-06	Giggle	9.113765	G160	582	26	1	1
34626	2023-04-23	Giggle	6.766272	G26	477	20	1	1

In [79]:

# Run the DiD analysis on the larger dataset
did_model_L = smf.ols(did_formula, data=num_employees_L_df)
did_results_L = did_model_L.fit()
print(did_results_L.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.581
Model:                            OLS   Adj. R-squared:                  0.581
Method:                 Least Squares   F-statistic:                 1.944e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:08:45   Log-Likelihood:                -59727.
No. Observations:               42000   AIC:                         1.195e+05
Df Residuals:                   41996   BIC:                         1.195e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1933      0.010    532.991      0.000       5.174       5.212
post_treatment                     0.0041      0.014      0.298      0.766      -0.023       0.031
treatment_group                    0.9036      0.014     65.576      0.000       0.877       0.931
post_treatment:treatment_group     1.9986      0.020    102.071      0.000       1.960       2.037
==============================================================================
Omnibus:                        0.301   Durbin-Watson:                   1.976
Prob(Omnibus):                  0.860   Jarque-Bera (JB):                0.287
Skew:                          -0.004   Prob(JB):                        0.866
Kurtosis:                       3.010   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [85]:

# Now try a smaller dataset
# Decrease the number of employees per company (S stands for Small)
num_employees_S_parameters=default_simulation_parameters.copy()
num_employees_S_parameters["num_employees_per_company"] = 30
num_employees_S_df= simulate_data(num_employees_S_parameters)
remote_work_onset_date= pd.to_datetime("2023-01-01")
num_employees_S_df['post_treatment'] = (num_employees_S_df['date'] > remote_work_onset_date).astype(int)
num_employees_S_df['treatment_group'] = (num_employees_S_df['company'] == 'Giggle').astype(int)
num_employees_S_df.sample(10)

Out[85]:

	date	company	satisfaction	employee_id	time_at_company	employee_age	post_treatment	treatment_group
4022	2022-07-24	Giggle	5.843885	G2	204	43	0	1
2915	2023-11-12	Rubicon	5.035775	R5	680	49	1	0
4923	2023-02-19	Giggle	8.033658	G3	414	27	1	1
5868	2023-09-24	Giggle	7.737370	G18	631	44	1	1
5124	2023-04-02	Giggle	9.773210	G24	456	20	1	1
3756	2022-05-22	Giggle	5.656828	G6	141	43	0	1
2794	2023-10-15	Rubicon	3.910319	R4	652	31	1	0
1524	2022-12-18	Rubicon	4.697491	R24	351	30	0	0
5769	2023-09-03	Giggle	6.691046	G9	610	52	1	1
4898	2023-02-12	Giggle	8.469248	G8	407	57	1	1

In [86]:

did_model_S = smf.ols(did_formula, data=num_employees_S_df)
did_results_S = did_model_S.fit()
print(did_results_S.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.582
Model:                            OLS   Adj. R-squared:                  0.582
Method:                 Least Squares   F-statistic:                     2926.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:20:47   Log-Likelihood:                -8868.3
No. Observations:                6300   AIC:                         1.774e+04
Df Residuals:                    6296   BIC:                         1.777e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.2103      0.025    210.049      0.000       5.162       5.259
post_treatment                     0.0012      0.035      0.033      0.974      -0.068       0.070
treatment_group                    0.8800      0.035     25.085      0.000       0.811       0.949
post_treatment:treatment_group     1.9875      0.050     39.871      0.000       1.890       2.085
==============================================================================
Omnibus:                        3.639   Durbin-Watson:                   1.976
Prob(Omnibus):                  0.162   Jarque-Bera (JB):                3.675
Skew:                           0.054   Prob(JB):                        0.159
Kurtosis:                       2.952   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [89]:

# Try an even smaller dataset
# Decrease the number of employees per company further (s stands for smaller)
num_employees_s_parameters=default_simulation_parameters.copy()
num_employees_s_parameters["num_employees_per_company"] = 10
num_employees_s_df= simulate_data(num_employees_s_parameters)
remote_work_onset_date= pd.to_datetime("2023-01-01")
num_employees_s_df['post_treatment'] = (num_employees_s_df['date'] > remote_work_onset_date).astype(int)
num_employees_s_df['treatment_group'] = (num_employees_s_df['company'] == 'Giggle').astype(int)
num_employees_s_df.sample(10)

Out[89]:

	date	company	satisfaction	employee_id	time_at_company	employee_age	post_treatment	treatment_group
205	2022-05-22	Rubicon	6.237281	R5	141	53	0	0
1435	2022-09-25	Giggle	7.342051	G5	267	50	0	1
786	2023-07-02	Rubicon	5.997104	R6	547	19	1	0
1513	2022-11-20	Giggle	6.355691	G3	323	43	0	1
1263	2022-05-29	Giggle	5.765894	G3	148	53	0	1
1675	2023-03-12	Giggle	7.855803	G5	435	57	1	1
147	2022-04-10	Rubicon	7.331843	R7	99	52	0	0
1700	2023-04-02	Giggle	6.952891	G0	456	21	1	1
14	2022-01-09	Rubicon	6.464313	R4	8	42	0	0
431	2022-10-30	Rubicon	3.757645	R1	302	30	0	0

In [91]:

did_model_s = smf.ols(did_formula, data=num_employees_s_df)
did_results_s = did_model_s.fit()
print(did_results_s.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.578
Model:                            OLS   Adj. R-squared:                  0.577
Method:                 Least Squares   F-statistic:                     956.7
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:22:15   Log-Likelihood:                -2986.1
No. Observations:                2100   AIC:                             5980.
Df Residuals:                    2096   BIC:                             6003.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.2405      0.044    120.167      0.000       5.155       5.326
post_treatment                    -0.0528      0.062     -0.852      0.394      -0.174       0.069
treatment_group                    0.9297      0.062     15.074      0.000       0.809       1.051
post_treatment:treatment_group     1.9809      0.088     22.603      0.000       1.809       2.153
==============================================================================
Omnibus:                        0.453   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.797   Jarque-Bera (JB):                0.394
Skew:                          -0.028   Prob(JB):                        0.821
Kurtosis:                       3.036   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [82]:

# Additional note (see below)

The DiD analysis is fairly robust to changes in the size of the dataset, at least for the parameter values tested here. In all four summary tables the interaction term is statistically significant, with a coefficient between 1.98 and 2.14, close to the true treatment effect of 2. As the number of employees changes, the point estimate moves only slightly, while its standard error shrinks from 0.088 (10 employees per company) to 0.020 (200 employees per company). R-squared also stays stable at around 0.58. This is a preliminary conclusion and may not be fully accurate, so the cells below vary the other parameters as well.
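
As a rough check on the size comparison, the sketch below collects the interaction coefficients, standard errors, and confidence intervals from the four summary tables above and compares each standard error with a simple 1/sqrt(n) prediction anchored on the 50-employee run. The tuple layout and names such as `REF_SE` are my own, and the coverage column assumes `df_default` was generated with the default treatment effect of 2.

```python
import math

# (employees per company, interaction coef, std err, CI low, CI high),
# copied from the four summary tables above
runs = [
    (10, 1.9809, 0.088, 1.809, 2.153),
    (30, 1.9875, 0.050, 1.890, 2.085),
    (50, 2.1401, 0.039, 2.063, 2.217),
    (200, 1.9986, 0.020, 1.960, 2.037),
]
TRUE_EFFECT = 2.0          # default remote_work_treatment_effect
REF_N, REF_SE = 50, 0.039  # reference run used to anchor the prediction

rows = []
for n, coef, se, ci_low, ci_high in runs:
    # If observations are independent, the standard error scales as 1/sqrt(n)
    predicted_se = REF_SE * math.sqrt(REF_N / n)
    covers_true = ci_low <= TRUE_EFFECT <= ci_high
    rows.append((n, coef, se, predicted_se, covers_true))
    print(f"n={n:>3}  coef={coef:.4f}  se={se:.3f}  "
          f"predicted_se={predicted_se:.3f}  covers_true={covers_true}")
```

The reported standard errors follow the 1/sqrt(n) prediction closely. The interval from the default run lies just above 2, which a single simulated draw can produce by chance; note also that the nonrobust standard errors assume independent errors across rows.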

In [92]:

# To make the repeated process easier to run, wrap it in a single function

import statsmodels.api as sm
import statsmodels.formula.api as smf
def variation(changed_variable, value):
    # Copy the defaults and override a single simulation parameter
    new_parameters = default_simulation_parameters.copy()
    new_parameters[changed_variable] = value
    df_variation = simulate_data(new_parameters)
    # Read the onset date from the parameters so the dummy stays correct if that date is varied
    remote_work_onset_date = new_parameters["remote_work_onset_date"]
    df_variation['post_treatment'] = (df_variation['date'] > remote_work_onset_date).astype(int)
    df_variation['treatment_group'] = (df_variation['company'] == 'Giggle').astype(int)
    did_formula = 'satisfaction ~ post_treatment + treatment_group + post_treatment*treatment_group'
    did_model_variation = smf.ols(did_formula, data=df_variation)
    did_results_variation = did_model_variation.fit()
    # print() returns None, so print the summary without returning it
    print(did_results_variation.summary())
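
Because each call above reports a single simulated dataset, it can help to repeat the simulation and look at the spread of the DiD estimate. The sketch below is self-contained and does not use the notebook's `simulate_data`: `simulate_two_group_panel`, `did_estimate`, and `estimate_spread` are hypothetical helpers, and the stand-in simulator assumes i.i.d. normal noise around the four cell means, with 53 pre-treatment and 52 post-treatment weeks.

```python
import numpy as np

def simulate_two_group_panel(n_per_group, effect, n_pre=53, n_post=52,
                             baselines=(5.2, 6.1), noise_sd=1.0, seed=0):
    """Hypothetical stand-in for simulate_data: i.i.d. noise around cell means."""
    rng = np.random.default_rng(seed)
    n_periods = n_pre + n_post
    # One row per (employee, week); the first n_per_group employees are controls
    treated = np.repeat([0.0, 1.0], n_per_group * n_periods)
    post = np.tile(np.r_[np.zeros(n_pre), np.ones(n_post)], 2 * n_per_group)
    mean = np.where(treated == 1.0, baselines[1], baselines[0]) + effect * treated * post
    y = mean + rng.normal(0.0, noise_sd, size=mean.shape)
    return y, treated, post

def did_estimate(y, treated, post):
    """Return the OLS coefficient on the post * treated interaction."""
    X = np.column_stack([np.ones_like(y), post, treated, post * treated])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]

def estimate_spread(n_per_group, effect, n_runs=50):
    """Mean and standard deviation of the DiD estimate over repeated simulations."""
    estimates = [
        did_estimate(*simulate_two_group_panel(n_per_group, effect, seed=s))
        for s in range(n_runs)
    ]
    return float(np.mean(estimates)), float(np.std(estimates, ddof=1))

for n in (10, 50, 200):
    mean_est, sd_est = estimate_spread(n, effect=2.0)
    print(f"n_per_group={n:>3}  mean={mean_est:.3f}  sd={sd_est:.3f}")
```

Under these assumptions the average estimate should sit near the true effect at every size, while the spread across runs narrows as the number of employees grows.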

In [93]:

# Other attempts: change the strength of the treatment effect
# and the difference in baseline levels of satisfaction between the two companies

In [94]:

# Try positive treatment with different strengths
variation("remote_work_treatment_effect",3)
variation("remote_work_treatment_effect",5)
variation("remote_work_treatment_effect",10)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.717
Model:                            OLS   Adj. R-squared:                  0.716
Method:                 Least Squares   F-statistic:                     8845.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:26:31   Log-Likelihood:                -14938.
No. Observations:               10500   AIC:                         2.988e+04
Df Residuals:                   10496   BIC:                         2.991e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1810      0.020    265.658      0.000       5.143       5.219
post_treatment                     0.0238      0.028      0.859      0.390      -0.031       0.078
treatment_group                    0.9081      0.028     32.924      0.000       0.854       0.962
post_treatment:treatment_group     2.9789      0.039     76.007      0.000       2.902       3.056
==============================================================================
Omnibus:                       11.114   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               11.107
Skew:                           0.079   Prob(JB):                      0.00387
Kurtosis:                       3.018   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                 2.080e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:26:32   Log-Likelihood:                -14885.
No. Observations:               10500   AIC:                         2.978e+04
Df Residuals:                   10496   BIC:                         2.981e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1824      0.019    267.089      0.000       5.144       5.220
post_treatment                     0.0184      0.028      0.667      0.505      -0.036       0.072
treatment_group                    0.9619      0.027     35.053      0.000       0.908       1.016
post_treatment:treatment_group     4.9165      0.039    126.087      0.000       4.840       4.993
==============================================================================
Omnibus:                        1.494   Durbin-Watson:                   2.018
Prob(Omnibus):                  0.474   Jarque-Bera (JB):                1.493
Skew:                          -0.007   Prob(JB):                        0.474
Kurtosis:                       2.943   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.954
Model:                            OLS   Adj. R-squared:                  0.954
Method:                 Least Squares   F-statistic:                 7.283e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:26:32   Log-Likelihood:                -14924.
No. Observations:               10500   AIC:                         2.986e+04
Df Residuals:                   10496   BIC:                         2.988e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.2105      0.019    267.540      0.000       5.172       5.249
post_treatment                    -0.0034      0.028     -0.123      0.902      -0.058       0.051
treatment_group                    0.9057      0.028     32.883      0.000       0.852       0.960
post_treatment:treatment_group     9.9607      0.039    254.499      0.000       9.884      10.037
==============================================================================
Omnibus:                        2.294   Durbin-Watson:                   2.028
Prob(Omnibus):                  0.318   Jarque-Bera (JB):                2.317
Skew:                          -0.014   Prob(JB):                        0.314
Kurtosis:                       3.067   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [95]:

# Additional note

Again, when I change the strength of the positive treatment effect, the interaction term still captures it, with a coefficient almost equal to the true value (2.98, 4.92, and 9.96 for true effects of 3, 5, and 10). It is also worth noting that R-squared increases with the magnitude of the treatment effect (0.717, 0.856, and 0.954), at least for the values I have observed so far.
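
One way to see why R-squared moves with the treatment effect is to compute the share of variance explained by the four cell means. The sketch below is a back-of-the-envelope calculation, not the simulator's own logic: `expected_r_squared` is a hypothetical helper, and it assumes equal company sizes, noise variance of 1, and 52 of the 105 weekly dates coded as post-treatment.

```python
def expected_r_squared(effect, baselines=(5.2, 6.1), post_share=52 / 105, noise_var=1.0):
    """Population R-squared of the 2x2 DiD model under the stated assumptions."""
    control, treated = baselines
    pre_share = 1.0 - post_share
    # (weight, mean) for each of the four company-by-period cells
    cells = [
        (0.5 * pre_share, control),
        (0.5 * post_share, control),
        (0.5 * pre_share, treated),
        (0.5 * post_share, treated + effect),
    ]
    grand_mean = sum(w * m for w, m in cells)
    # Variance between cell means is what the model can explain
    between_var = sum(w * (m - grand_mean) ** 2 for w, m in cells)
    return between_var / (between_var + noise_var)

for effect in (2, 3, 5, 10, -3, -5, -10):
    print(f"effect={effect:>3}  expected R-squared={expected_r_squared(effect):.3f}")
```

For the effects tested in this notebook, these values land close to the R-squared reported in the summary tables: a larger effect pushes the treated post-treatment mean further from the other cells, while the noise variance stays fixed.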

In [96]:

# Try negative treatment effects with different strengths
variation("remote_work_treatment_effect",-3)
variation("remote_work_treatment_effect",-5)
variation("remote_work_treatment_effect",-10)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.552
Model:                            OLS   Adj. R-squared:                  0.551
Method:                 Least Squares   F-statistic:                     4304.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:29:35   Log-Likelihood:                -14924.
No. Observations:               10500   AIC:                         2.986e+04
Df Residuals:                   10496   BIC:                         2.988e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1595      0.019    264.924      0.000       5.121       5.198
post_treatment                     0.0821      0.028      2.966      0.003       0.028       0.136
treatment_group                    0.9561      0.028     34.713      0.000       0.902       1.010
post_treatment:treatment_group    -3.1143      0.039    -79.573      0.000      -3.191      -3.038
==============================================================================
Omnibus:                        1.638   Durbin-Watson:                   1.997
Prob(Omnibus):                  0.441   Jarque-Bera (JB):                1.607
Skew:                          -0.020   Prob(JB):                        0.448
Kurtosis:                       3.046   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.789
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                 1.309e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:29:35   Log-Likelihood:                -14969.
No. Observations:               10500   AIC:                         2.995e+04
Df Residuals:                   10496   BIC:                         2.998e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.2102      0.020    266.372      0.000       5.172       5.249
post_treatment                    -0.0088      0.028     -0.316      0.752      -0.063       0.046
treatment_group                    0.8888      0.028     32.131      0.000       0.835       0.943
post_treatment:treatment_group    -5.0155      0.039   -127.596      0.000      -5.093      -4.938
==============================================================================
Omnibus:                        2.947   Durbin-Watson:                   2.006
Prob(Omnibus):                  0.229   Jarque-Bera (JB):                2.995
Skew:                           0.019   Prob(JB):                        0.224
Kurtosis:                       3.073   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.943
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                 5.739e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:29:35   Log-Likelihood:                -14941.
No. Observations:               10500   AIC:                         2.989e+04
Df Residuals:                   10496   BIC:                         2.992e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1695      0.020    264.991      0.000       5.131       5.208
post_treatment                     0.0354      0.028      1.278      0.201      -0.019       0.090
treatment_group                    0.9236      0.028     33.477      0.000       0.870       0.978
post_treatment:treatment_group   -10.0195      0.039   -255.573      0.000     -10.096      -9.943
==============================================================================
Omnibus:                        4.377   Durbin-Watson:                   1.990
Prob(Omnibus):                  0.112   Jarque-Bera (JB):                4.181
Skew:                          -0.021   Prob(JB):                        0.124
Kurtosis:                       2.912   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [97]:

# The conclusion above also holds for negative treatment effects (again, based on these observations)
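
To put the six treatment-strength runs side by side, the sketch below tabulates the gap between each estimated interaction coefficient and the true effect that was simulated. The numbers are copied from the summary tables above; the tuple layout and variable names are my own.

```python
# (true effect, estimated interaction coefficient, std err), copied from the
# six summary tables for the positive and negative treatment runs above
runs = [
    (3, 2.9789, 0.039), (5, 4.9165, 0.039), (10, 9.9607, 0.039),
    (-3, -3.1143, 0.039), (-5, -5.0155, 0.039), (-10, -10.0195, 0.039),
]

errors = []
for true_effect, estimate, se in runs:
    error = estimate - true_effect
    errors.append(error)
    print(f"true={true_effect:>4}  estimate={estimate:>9.4f}  "
          f"error={error:+.4f}  error/se={error / se:+.2f}")
```

The absolute error stays below 0.12 in every run and does not grow with the size of the effect, although two of the six runs are off by more than two standard errors.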

In [98]:

# Lastly, vary the difference in baseline levels of satisfaction between the two companies
# by lowering Rubicon's baseline while holding Giggle's fixed
variation("satisfaction_levels",{'Rubicon': 5.2, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 2, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 1, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 0.2, 'Giggle': 6.1})

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.579
Method:                 Least Squares   F-statistic:                     4817.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:32:30   Log-Likelihood:                -14962.
No. Observations:               10500   AIC:                         2.993e+04
Df Residuals:                   10496   BIC:                         2.996e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.1787      0.020    264.933      0.000       5.140       5.217
post_treatment                     0.0159      0.028      0.571      0.568      -0.039       0.070
treatment_group                    0.9536      0.028     34.496      0.000       0.899       1.008
post_treatment:treatment_group     1.9393      0.039     49.369      0.000       1.862       2.016
==============================================================================
Omnibus:                        2.305   Durbin-Watson:                   2.018
Prob(Omnibus):                  0.316   Jarque-Bera (JB):                2.288
Skew:                          -0.019   Prob(JB):                        0.319
Kurtosis:                       2.939   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.875
Model:                            OLS   Adj. R-squared:                  0.875
Method:                 Least Squares   F-statistic:                 2.449e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:32:30   Log-Likelihood:                -14851.
No. Observations:               10500   AIC:                         2.971e+04
Df Residuals:                   10496   BIC:                         2.974e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          2.0045      0.019    103.642      0.000       1.967       2.042
post_treatment                    -0.0024      0.027     -0.086      0.932      -0.056       0.052
treatment_group                    4.0719      0.027    148.868      0.000       4.018       4.126
post_treatment:treatment_group     2.0179      0.039     51.918      0.000       1.942       2.094
==============================================================================
Omnibus:                        0.839   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.658   Jarque-Bera (JB):                0.804
Skew:                          -0.016   Prob(JB):                        0.669
Kurtosis:                       3.028   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.907
Method:                 Least Squares   F-statistic:                 3.423e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:32:31   Log-Likelihood:                -14951.
No. Observations:               10500   AIC:                         2.991e+04
Df Residuals:                   10496   BIC:                         2.994e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          0.9902      0.020     50.715      0.000       0.952       1.029
post_treatment                    -0.0009      0.028     -0.033      0.974      -0.055       0.053
treatment_group                    5.1426      0.028    186.232      0.000       5.088       5.197
post_treatment:treatment_group     1.9895      0.039     50.703      0.000       1.913       2.066
==============================================================================
Omnibus:                        1.896   Durbin-Watson:                   2.043
Prob(Omnibus):                  0.387   Jarque-Bera (JB):                1.886
Skew:                           0.016   Prob(JB):                        0.390
Kurtosis:                       3.057   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                 4.401e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:32:31   Log-Likelihood:                -14834.
No. Observations:               10500   AIC:                         2.968e+04
Df Residuals:                   10496   BIC:                         2.970e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          0.2016      0.019     10.438      0.000       0.164       0.239
post_treatment                    -0.0193      0.027     -0.705      0.481      -0.073       0.034
treatment_group                    5.9032      0.027    216.174      0.000       5.850       5.957
post_treatment:treatment_group     2.0247      0.039     52.177      0.000       1.949       2.101
==============================================================================
Omnibus:                        1.619   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                1.624
Skew:                           0.013   Prob(JB):                        0.444
Kurtosis:                       2.945   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [101]:

# Additional note (see below)

Based on these results, the interaction term (post_treatment:treatment_group) consistently recovers the treatment effect: its estimate stays close to the simulated effect of 2 in every run. R-squared also rises as the baseline gap widens (for example, from 0.875 to 0.926 in the last three runs above).
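To make the role of the interaction term concrete, here is a minimal sketch that rebuilds the four cell means (control/treated, pre/post) from the coefficients of the last regression table above and takes the difference of differences. The helper name `cell_mean` is hypothetical and exists only for this illustration.

```python
# Coefficients copied from the last OLS table above (R-squared 0.926)
coef = {
    "Intercept": 0.2016,
    "post_treatment": -0.0193,
    "treatment_group": 5.9032,
    "post_treatment:treatment_group": 2.0247,
}

def cell_mean(post, treat):
    """Fitted mean satisfaction for one cell of the 2x2 design (hypothetical helper)."""
    return (
        coef["Intercept"]
        + coef["post_treatment"] * post
        + coef["treatment_group"] * treat
        + coef["post_treatment:treatment_group"] * post * treat
    )

# Change over time in the treated group minus change over time in the control group
change_treated = cell_mean(1, 1) - cell_mean(0, 1)
change_control = cell_mean(1, 0) - cell_mean(0, 0)
did = change_treated - change_control
print(round(did, 4))
```

The baseline gap (the treatment_group coefficient) cancels out of both differences, which is why the estimate stays near 2 however far apart the two companies start.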

In [102]:

# Now narrow the distance between the two baseline levels
variation("satisfaction_levels",{'Rubicon': 5.9, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 5.95, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 6, 'Giggle': 6.1})
variation("satisfaction_levels",{'Rubicon': 6.05, 'Giggle': 6.1})

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.462
Model:                            OLS   Adj. R-squared:                  0.462
Method:                 Least Squares   F-statistic:                     3010.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:36:23   Log-Likelihood:                -14896.
No. Observations:               10500   AIC:                         2.980e+04
Df Residuals:                   10496   BIC:                         2.983e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.9052      0.019    304.015      0.000       5.867       5.943
post_treatment                    -0.0196      0.028     -0.709      0.478      -0.074       0.035
treatment_group                    0.2114      0.027      7.696      0.000       0.158       0.265
post_treatment:treatment_group     2.0107      0.039     51.510      0.000       1.934       2.087
==============================================================================
Omnibus:                        0.440   Durbin-Watson:                   1.977
Prob(Omnibus):                  0.802   Jarque-Bera (JB):                0.467
Skew:                          -0.012   Prob(JB):                        0.792
Kurtosis:                       2.978   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.453
Model:                            OLS   Adj. R-squared:                  0.452
Method:                 Least Squares   F-statistic:                     2893.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:36:24   Log-Likelihood:                -14828.
No. Observations:               10500   AIC:                         2.966e+04
Df Residuals:                   10496   BIC:                         2.969e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.9603      0.019    308.849      0.000       5.922       5.998
post_treatment                    -0.0164      0.027     -0.599      0.549      -0.070       0.037
treatment_group                    0.1564      0.027      5.731      0.000       0.103       0.210
post_treatment:treatment_group     1.9938      0.039     51.410      0.000       1.918       2.070
==============================================================================
Omnibus:                        0.580   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.748   Jarque-Bera (JB):                0.608
Skew:                          -0.014   Prob(JB):                        0.738
Kurtosis:                       2.975   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.441
Model:                            OLS   Adj. R-squared:                  0.441
Method:                 Least Squares   F-statistic:                     2764.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:36:24   Log-Likelihood:                -14909.
No. Observations:               10500   AIC:                         2.983e+04
Df Residuals:                   10496   BIC:                         2.986e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          5.9993      0.019    308.475      0.000       5.961       6.037
post_treatment                    -0.0312      0.028     -1.128      0.259      -0.085       0.023
treatment_group                    0.1337      0.028      4.863      0.000       0.080       0.188
post_treatment:treatment_group     1.9884      0.039     50.876      0.000       1.912       2.065
==============================================================================
Omnibus:                        0.266   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.876   Jarque-Bera (JB):                0.294
Skew:                           0.005   Prob(JB):                        0.863
Kurtosis:                       2.976   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           satisfaction   R-squared:                       0.426
Model:                            OLS   Adj. R-squared:                  0.426
Method:                 Least Squares   F-statistic:                     2600.
Date:                Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                        23:36:24   Log-Likelihood:                -14988.
No. Observations:               10500   AIC:                         2.998e+04
Df Residuals:                   10496   BIC:                         3.001e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          6.0456      0.020    308.525      0.000       6.007       6.084
post_treatment                     0.0418      0.028      1.499      0.134      -0.013       0.096
treatment_group                    0.0635      0.028      2.290      0.022       0.009       0.118
post_treatment:treatment_group     1.9434      0.039     49.351      0.000       1.866       2.021
==============================================================================
Omnibus:                        1.521   Durbin-Watson:                   1.975
Prob(Omnibus):                  0.467   Jarque-Bera (JB):                1.534
Skew:                          -0.015   Prob(JB):                        0.465
Kurtosis:                       2.948   Cond. No.                         6.83
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [103]:

# Additional note (see below)

When the gap narrows, the interaction term still recovers the treatment effect (estimates between 1.94 and 2.01). R-squared falls as the gap shrinks, from 0.462 to 0.426, because less of the total variance in satisfaction is explained by the baseline difference between the two companies.
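The relationship between the baseline gap and R-squared can be reproduced with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not use the notebook's `simulate_data` or `variation` functions, it assumes equal-sized pre/post and control/treated cells, and the function name `did_fit` and its parameters are hypothetical.

```python
import numpy as np

def did_fit(baseline_gap, effect=2.0, n_per_cell=2000, noise_sd=1.0, seed=0):
    """Simulate a balanced 2x2 DiD design and fit it by ordinary least squares.

    Returns (interaction coefficient, R-squared). All defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    post = np.repeat([0, 0, 1, 1], n_per_cell)
    treat = np.repeat([0, 1, 0, 1], n_per_cell)
    noise = rng.normal(0.0, noise_sd, post.size)
    y = 5.0 + baseline_gap * treat + effect * post * treat + noise

    # Design matrix: intercept, post, treat, post x treat
    X = np.column_stack([np.ones(post.size), post, treat, post * treat])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    resid = y - X @ beta
    r_squared = 1.0 - resid.var() / y.var()
    return beta[3], r_squared

results = {gap: did_fit(gap) for gap in (0.05, 1.0, 5.0)}
for gap, (interaction, r2) in results.items():
    print(f"gap={gap}: interaction={interaction:.3f}, R2={r2:.3f}")
```

In this setup the interaction estimate stays near the simulated effect for every gap, while R-squared grows with the gap because the group indicator explains a larger share of the total variance.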

Conclusion to Q6¶

In the section above, we varied several properties of the simulated dataset: its size, the strength of the treatment effect, and the difference in baseline satisfaction between the two companies. In every case the DiD analysis recovered the treatment effect: the interaction coefficient was statistically significant and almost identical to the simulated effect. The estimated effect does change slightly from run to run, but without large discrepancies. The coefficient on post_treatment alone, by contrast, varied with the parameters: it was sometimes statistically significant and sometimes not. R-squared also changed with the adjustments. Lastly, the model is sensitive to changes in the different parameters (see each analysis section for details).

Part 2: Twitter Data Analysis

Twitter Misinformation Analysis Notebook

Section 1: Twitter Dataset¶

The dataset that accompanies this paper has been compiled and included below as a Pandas dataframe (assigned to the variable mccabe_data). Please base your main analyses on this shared dataset.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:

mccabe_data = pd.read_csv('/home/jovyan/compss-214a/mccabe-public-data.csv')

You are welcome to rename the dataset or work with different subsets of this data or with additional datasets if necessary, but this shared dataset should be the primary source for your analyses, so that we are all working with the same underlying source of information.

Section 2 Exploring the structure of the dataset¶

Describe the key variables you are interested in. Feel free to include data summaries and/or visualizations that illustrate how the dataset is structured, such as the different groups of users you are interested in and the different measures of whether posts are classified as misinformation, etc.

Section 2-1 Overall data structure¶

In [3]:

# Load the dataset, rename it as df and make a copy
df = mccabe_data.copy()
df.shape

Out[3]:

(32968, 29)

In [4]:

df.head()

Out[4]:

	date	fake_merged	fake_merged_initiation	fake_merged_rt	fake_grinberg_initiation	fake_grinberg_rt	fake_grinberg_rb_initiation	fake_grinberg_rb_rt	fake_newsguard_initiation	fake_newsguard_rt	...	not_fake_shopping	not_fake_shopping_initiation	not_fake_shopping_rt	not_fake_sports	not_fake_sports_initiation	not_fake_sports_rt	n	stat	nusers	group
0	2019-11-30	875.0	199.0	676.0	74.0	207.0	42.0	138.0	188.0	653.0	...	196.0	61.0	135.0	16.0	7.0	9.0	12387.0	total	4390	fns
1	2019-12-01	3382.0	825.0	2557.0	257.0	941.0	120.0	546.0	760.0	2293.0	...	608.0	207.0	401.0	99.0	33.0	66.0	54897.0	total	11629	fns
2	2019-12-02	3644.0	992.0	2652.0	280.0	780.0	141.0	479.0	926.0	2455.0	...	684.0	289.0	395.0	82.0	37.0	45.0	68505.0	total	13132	fns
3	2019-12-03	4184.0	1110.0	3074.0	339.0	921.0	185.0	562.0	1052.0	2890.0	...	782.0	236.0	546.0	92.0	41.0	51.0	74502.0	total	13997	fns
4	2019-12-04	4436.0	1100.0	3336.0	307.0	1171.0	135.0	540.0	1038.0	3146.0	...	540.0	261.0	279.0	124.0	53.0	71.0	71762.0	total	13967	fns

5 rows × 29 columns

In [5]:

# See the complete list of column names
df.columns

Out[5]:

Index(['date', 'fake_merged', 'fake_merged_initiation', 'fake_merged_rt',
       'fake_grinberg_initiation', 'fake_grinberg_rt',
       'fake_grinberg_rb_initiation', 'fake_grinberg_rb_rt',
       'fake_newsguard_initiation', 'fake_newsguard_rt', 'not_fake',
       'not_fake_initiation', 'not_fake_rt', 'not_fake_conservative',
       'not_fake_conservative_initiation', 'not_fake_conservative_rt',
       'not_fake_liberal', 'not_fake_liberal_initiation',
       'not_fake_liberal_rt', 'not_fake_shopping',
       'not_fake_shopping_initiation', 'not_fake_shopping_rt',
       'not_fake_sports', 'not_fake_sports_initiation', 'not_fake_sports_rt',
       'n', 'stat', 'nusers', 'group'],
      dtype='object')

In [6]:

# Some columns are omitted in the display above, so let's look at one complete row
df.iloc[0]

Out[6]:

date                                2019-11-30
fake_merged                              875.0
fake_merged_initiation                   199.0
fake_merged_rt                           676.0
fake_grinberg_initiation                  74.0
fake_grinberg_rt                         207.0
fake_grinberg_rb_initiation               42.0
fake_grinberg_rb_rt                      138.0
fake_newsguard_initiation                188.0
fake_newsguard_rt                        653.0
not_fake                               11512.0
not_fake_initiation                     4357.0
not_fake_rt                             7155.0
not_fake_conservative                    529.0
not_fake_conservative_initiation         156.0
not_fake_conservative_rt                 373.0
not_fake_liberal                        1030.0
not_fake_liberal_initiation              247.0
not_fake_liberal_rt                      783.0
not_fake_shopping                        196.0
not_fake_shopping_initiation              61.0
not_fake_shopping_rt                     135.0
not_fake_sports                           16.0
not_fake_sports_initiation                 7.0
not_fake_sports_rt                         9.0
n                                      12387.0
stat                                     total
nusers                                    4390
group                                      fns
Name: 0, dtype: object

In [7]:

# See which values the non-numerical columns contain
df["stat"].unique(), df["group"].unique()

Out[7]:

(array(['total', 'avg'], dtype=object),
 array(['fns', 'suspended', 'ha', 'ma', 'la', 'qanon', 'av', 'ss1', 'ss5',
        'A', 'B', 'D', 'F', 'all', 'nfns', 'nfns_ha', 'nfns_ma', 'nfns_la',
        'A_ha', 'B_ha', 'D_ha', 'F_ha', 'A_ma', 'B_ma', 'D_ma', 'F_ma',
        'A_la', 'B_la', 'D_la', 'F_la'], dtype=object))

In [8]:

# Some simple observations (see the text below)

Based on simple calculation and reference to the codebook, fake_merged is simply the sum of fake_merged_initiation and fake_merged_rt. The same principle applies to not_fake, not_fake_conservative, not_fake_liberal, not_fake_shopping, and not_fake_sports. In addition, n is the sum of fake_merged and not_fake (in the first row, 875 + 11512 = 12387), while nusers is a separate count of users. It is also worth noting that stat takes two distinct values, total and avg, so we should handle it carefully in the later analyses.
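The additive structure of the count columns can be checked programmatically. The sketch below uses only the first row of the dataset, copied from the df.iloc[0] output above; the helper name `check_additive_structure` is hypothetical, and on the full dataframe the same function could be called with `df` in place of `row0`.

```python
import pandas as pd

# First row of the dataset, copied from the df.iloc[0] output above
row0 = pd.DataFrame([{
    "fake_merged": 875.0, "fake_merged_initiation": 199.0, "fake_merged_rt": 676.0,
    "not_fake": 11512.0, "not_fake_initiation": 4357.0, "not_fake_rt": 7155.0,
    "not_fake_conservative": 529.0, "not_fake_conservative_initiation": 156.0,
    "not_fake_conservative_rt": 373.0,
    "not_fake_liberal": 1030.0, "not_fake_liberal_initiation": 247.0,
    "not_fake_liberal_rt": 783.0,
    "not_fake_shopping": 196.0, "not_fake_shopping_initiation": 61.0,
    "not_fake_shopping_rt": 135.0,
    "not_fake_sports": 16.0, "not_fake_sports_initiation": 7.0,
    "not_fake_sports_rt": 9.0,
    "n": 12387.0, "nusers": 4390,
}])

PREFIXES = [
    "fake_merged", "not_fake", "not_fake_conservative",
    "not_fake_liberal", "not_fake_shopping", "not_fake_sports",
]

def check_additive_structure(frame):
    """Return {column: True/False} for each 'total = initiation + rt' identity,
    plus a check that n = fake_merged + not_fake (hypothetical helper)."""
    checks = {}
    for prefix in PREFIXES:
        parts = frame[f"{prefix}_initiation"] + frame[f"{prefix}_rt"]
        checks[prefix] = bool((frame[prefix] == parts).all())
    checks["n"] = bool((frame["n"] == frame["fake_merged"] + frame["not_fake"]).all())
    return checks

result = check_additive_structure(row0)
print(result)
```

For rows with stat equal to avg the values are presumably fractional, so an exact equality test may be too strict there; a tolerance-based comparison such as `numpy.isclose` would be the safer choice.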

In [9]:

# To see what avg and total mean, look at one arbitrarily chosen date
df_selected = df[df['date'] == '2019-11-30']
# df_selected
# To keep the PDF clean, the output of this cell is not displayed here; during the analysis I did inspect the full table

In [10]:

# The data structure becomes clear now (see the notes below)

It seems that each date is supposed to have 60 rows (the 30 categories in "group" multiplied by the two kinds of stat: total and avg). But because the total number of rows (32,968) is not evenly divisible by 60, I write a function below to check for inconsistencies and work out the reasons for them. Before that, let's check the distribution of dates first.

In [11]:

df['date'] = pd.to_datetime(df['date'])

In [12]:

# Data summary and descriptive statistics
df_total=df[df["stat"]=="total"]
summary_stats = df_total.describe(include='all').style.format(precision=2)
summary_stats

Out[12]:

	date	fake_merged	fake_merged_initiation	fake_merged_rt	fake_grinberg_initiation	fake_grinberg_rt	fake_grinberg_rb_initiation	fake_grinberg_rb_rt	fake_newsguard_initiation	fake_newsguard_rt	not_fake	not_fake_initiation	not_fake_rt	not_fake_conservative	not_fake_conservative_initiation	not_fake_conservative_rt	not_fake_liberal	not_fake_liberal_initiation	not_fake_liberal_rt	not_fake_shopping	not_fake_shopping_initiation	not_fake_shopping_rt	not_fake_sports	not_fake_sports_initiation	not_fake_sports_rt	n	stat	nusers	group
count	16484	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484.00	16484	16484.00	16484
unique	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1	nan	30
top	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	total	nan	fns
freq	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	16484	nan	550
mean	2020-08-29 19:27:21.446251008	1816.38	379.00	1437.38	116.53	484.67	67.94	302.52	360.77	1351.67	34438.99	14423.19	20015.80	1552.06	554.76	997.29	2109.63	507.11	1602.52	710.08	333.27	376.80	53.93	19.02	34.91	36255.37	nan	9600.57	nan
min	2019-11-30 00:00:00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	nan	1.00	nan
25%	2020-04-15 00:00:00	116.00	33.00	77.00	7.00	18.00	2.00	6.00	31.00	71.00	4235.00	1246.00	2432.00	230.00	71.00	136.00	155.00	34.00	119.00	18.00	6.00	9.00	4.00	1.00	2.00	4686.25	nan	519.75	nan
50%	2020-08-29 00:00:00	636.00	146.50	458.00	50.00	134.00	32.00	68.00	142.00	429.00	13385.50	5334.00	8437.50	766.00	215.00	530.00	738.00	204.50	517.00	127.00	59.00	63.00	16.00	4.00	11.00	14769.50	nan	1882.50	nan
75%	2021-01-14 00:00:00	2800.50	598.00	2266.25	176.00	761.00	97.00	473.25	569.00	2123.00	44738.00	12778.75	28380.25	2312.25	726.25	1416.00	3089.00	671.00	2345.25	1107.00	288.00	640.25	56.00	20.00	36.00	48138.00	nan	6164.25	nan
max	2021-05-31 00:00:00	19143.00	3124.00	16145.00	1034.00	5142.00	706.00	4186.00	3033.00	15829.00	355049.00	131886.00	223163.00	15075.00	5649.00	11149.00	27622.00	5691.00	23579.00	6368.00	3544.00	3134.00	1026.00	372.00	718.00	363619.00	nan	97893.00	nan
std	nan	2407.34	478.02	1952.45	151.58	687.88	92.58	461.43	455.32	1847.60	48069.00	22993.78	26827.75	1924.80	766.98	1238.35	2968.71	725.00	2279.25	1107.51	599.00	561.65	92.07	33.48	60.80	49543.95	nan	17992.41	nan

The time range for the dataset becomes clear now, starting from 2019-11-30, and ending on 2021-05-31.

In [13]:

# Define the start date and end date
start_date='2019-11-30'
end_date= '2021-05-31'

In [14]:

# Here is a function to check the number of rows for each day within the timeframe
def check_row_count_per_date(df, start_date, end_date):
    # Create a date range from start_date to end_date
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    inconsistent_dates = {}
    for date in date_range:
        df_filtered = df[df['date'] == date.strftime('%Y-%m-%d')]
        if len(df_filtered) != 60:
            inconsistent_dates[date.strftime('%Y-%m-%d')] = len(df_filtered)
    return inconsistent_dates

In [15]:

inconsistent_dates = check_row_count_per_date(df, start_date, end_date)
inconsistent_dates

Out[15]:

{'2020-06-30': 120,
 '2020-07-08': 58,
 '2020-07-10': 58,
 '2020-08-17': 58,
 '2020-10-24': 58,
 '2020-10-26': 58,
 '2020-10-29': 58,
 '2020-10-31': 58,
 '2020-11-01': 58,
 '2021-01-12': 58,
 '2021-01-13': 58,
 '2021-01-14': 58,
 '2021-01-15': 58,
 '2021-01-16': 58,
 '2021-01-17': 58,
 '2021-01-19': 58,
 '2021-01-21': 58}

In [16]:

# A short description is given below

The date "2020-06-30" has 120 rows, exactly twice the expected number, so it is very likely that this date contains duplicated rows. Several other dates have fewer than 60 rows, and in every such case the count is exactly 58. The missing rows are not the same on every date, but what they have in common is that, on a given date, they always belong to a single subgroup category. After careful examination, I found that the missing rows are always those of a subgroup that is ineligible for certain grouping labels.
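The same diagnosis can be done without inspecting dates by hand. The following sketch is one possible vectorized variant, not the notebook's original function: the helper names `rows_per_date` and `missing_groups` are hypothetical, the toy frame is made-up data, and the code assumes the date column has already been converted with `pd.to_datetime`.

```python
import pandas as pd

def rows_per_date(frame, expected=60):
    """Row count for every calendar day whose count differs from `expected`.

    Days with no rows at all are included with a count of 0.
    """
    counts = frame.groupby("date").size()
    full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
    counts = counts.reindex(full_range, fill_value=0)
    return counts[counts != expected]

def missing_groups(frame, date, all_groups):
    """Group labels that have no row on the given date."""
    present = set(frame.loc[frame["date"] == date, "group"])
    return sorted(set(all_groups) - present)

# Made-up toy data: 2 rows expected per day, one short day, one empty day,
# and one day whose rows are duplicated
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02",
        "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-04",
    ]),
    "group": ["a", "b", "a", "a", "b", "a", "b"],
    "value": [1, 2, 3, 4, 5, 4, 5],
})

bad_dates = rows_per_date(toy, expected=2)
n_duplicates = int(toy.duplicated().sum())  # fully identical repeated rows
print(bad_dates)
print(n_duplicates, missing_groups(toy, pd.Timestamp("2020-01-02"), ["a", "b"]))
```

On the real data, `df.duplicated()` restricted to 2020-06-30 would show whether the 120 rows are exact copies, and `missing_groups` with `df["group"].unique()` would name the absent group on each 58-row date.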

In [17]:

# I checked as many dates as I could manually, but I do not display them here to keep the final PDF clean.
# Example code
# df_filtered_1 = df[df['date'] == '2020-06-30']
# df_filtered_1

In [18]:

# Alternatively, we can write a function to visualize the distribution of the different groups.
def plot_group_distribution(df, group_column='group'):
    group_counts = df[group_column].value_counts()
    # Plotting
    plt.figure(figsize=(10, 6))
    group_counts.plot(kind='bar')
    plt.title(f'Distribution of {group_column}')
    plt.xlabel('Groups')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

In [19]:

plot_group_distribution(df,group_column='group')

The graph shows that the groups have almost the same number of rows: the bars are nearly equal in height, so the dataset is balanced across groups, which is helpful for comparative analysis. However, the A_ha, A_la, and suspended groups differ slightly from the rest.

Section 2-2 Interesting Variables and More on Visualization¶

In [20]:

df['date'] = pd.to_datetime(df['date'])

In [21]:

# Write a function to visualize the number of users within a specific group

def plot_group_totals(df, group_name):
    group_totals = df[(df['group'] == group_name) & (df['stat'] == "total")].copy()
    suspension_start = pd.to_datetime('2021-01-06')
    suspension_end = pd.to_datetime('2021-01-12')

    # Remove duplicates and handle missing values in 'date' or 'nusers'
    group_totals = group_totals.drop_duplicates(subset=['date'])
    group_totals = group_totals.dropna(subset=['date', 'nusers'])
    group_totals = group_totals.sort_values(by='date')
    
    # Create a new figure and plot the data as a line plot
    plt.figure(figsize=(16, 2))  # Specify a wide but shallow figure size
    plt.plot(group_totals['date'], group_totals['nusers'], color="#00008B")
    plt.xlabel('Date')
    plt.ylabel('Number of Users')
    plt.axvline(suspension_start, color='r', linestyle='--', label='Suspension Starts (January 6th 2021)')
    plt.axvline(suspension_end, color='g', linestyle='--', label='Suspension Ends (January 12th 2021)')
    plt.legend()
    plt.title(f'{group_name.capitalize()} Group: Total Users Over Time')
    plt.show()

In [22]:

plot_group_totals(df, "suspended")
plot_group_totals(df, "qanon")
plot_group_totals(df, "fns")
plot_group_totals(df, "nfns")

The graphs display the total number of users over time for four groups (Suspended, QAnon, Fns, and Nfns) across the dataset's time range, 2019-11-30 to 2021-05-31. The number of users in the Suspended group drops sharply and remains at zero afterward, which reflects the mechanism of deplatforming. The QAnon group follows a similar trend. For the more general groups, Fns and Nfns, the numbers fluctuate during the period and decline slightly afterward. This pattern suggests that the suspension affected user participation in all four groups, but to different degrees.

In [23]:

### Apart from the above, I am also very interested in the Anti-Vaccine (av) group and would like to explore this subgroup further

In [24]:

## subset to just the Anti-Vaccine group
av = df[(df['group'] == "av")].copy()
av_totals = av[
    (av["stat"] == "total")
].copy()

In [25]:

# Subset to just tweets during 2021
av_totals_2021 = av_totals[av_totals.date >= "2021-01-01"]

# Make a wide figure for the timeseries
plt.figure(figsize=(16, 2))

# Plot date on the x axis, and n (total number of tweets) on the y axis
plt.bar(av_totals_2021.date, av_totals_2021.n, color='lightblue', label="All Tweets")

# Overlay a count of the misinformation tweets (the fake_merged variable) in a different color
plt.bar(
    av_totals_2021.date,
    av_totals_2021.fake_merged,
    color='magenta',
    alpha=0.75, # alpha controls the opacity (alpha = 1 is solid, alpha = 0 is completely transparent)
    label="Misinformation" # whatever string you put here will go into the legend
)
plt.legend()

Out[25]:

<matplotlib.legend.Legend at 0x7cdcd770fc50>

The bar chart shows the daily volume of tweets from January through May 2021, comparing all tweets (light blue) with tweets identified as misinformation (magenta). Initially the volume of tweets is high, with a noticeable proportion consisting of misinformation. Over time, both total tweets and misinformation tweets decrease, and the proportion of misinformation remains relatively low afterwards, suggesting that the deplatforming event also had some effect on these users.
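Because the chart overlays raw counts, the proportion of misinformation has to be read off by eye. A minimal sketch of how that share could be computed directly is shown below. The sample values are the fake_merged and n columns of the first five fns rows from the df.head() output earlier (not the av group), and the helper name `misinformation_share` is hypothetical.

```python
import pandas as pd

# fake_merged and n for the first five 'fns' rows, copied from df.head() above
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-11-30", "2019-12-01", "2019-12-02", "2019-12-03", "2019-12-04",
    ]),
    "fake_merged": [875.0, 3382.0, 3644.0, 4184.0, 4436.0],
    "n": [12387.0, 54897.0, 68505.0, 74502.0, 71762.0],
})

def misinformation_share(frame):
    """Daily share of tweets classified as misinformation (fake_merged / n)."""
    return frame["fake_merged"] / frame["n"]

sample["fake_share"] = misinformation_share(sample)
print(sample[["date", "fake_share"]])
```

Applied to `av_totals_2021`, the same function would give a daily series that could be plotted as a line to show how the proportion changes after the deplatforming event.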

In [26]:

# I would also like to see whether the choice of classification standard visibly affects the visualization.

In [27]:

# Build the Grinberg total with assign(), which returns a new DataFrame and avoids SettingWithCopyWarning
av_totals_2021 = av_totals_2021.assign(
    fake_grinberg=av_totals_2021["fake_grinberg_initiation"] + av_totals_2021["fake_grinberg_rt"]
)

In [28]:

# Make a wide figure for the timeseries
plt.figure(figsize=(16, 2))

# Plot date on the x axis, and n (total number of tweets) on the y axis
plt.bar(av_totals_2021.date, av_totals_2021.n, color='lightblue', label="All Tweets")

# Overlay a count of the misinformation tweets (the fake_grinberg variable) in a different color
plt.bar(
    av_totals_2021.date,
    av_totals_2021.fake_grinberg,
    color='magenta',
    alpha=0.75, # alpha controls the opacity (alpha = 1 is solid, alpha = 0 is completely transparent)
    label="Misinformation" # whatever string you put here will go into the legend
)
plt.legend()

Out[28]:

<matplotlib.legend.Legend at 0x7cdcd785a210>

The chart shows total tweets and misinformation tweets (classified with the Grinberg et al. list) from January to June 2021, with a clear decline in both over time. As in the first chart, there is a sharp drop in early January followed by a gradual decrease. The proportion of misinformation is highest in early January and then steadily decreases, mirroring the trend in total tweets. Both charts show that some misinformation persists throughout the period. This suggests that the general trend does not depend on which of these two classification lists is used.
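The two charts compare raw counts, so the statements about the proportion of misinformation can be checked directly by dividing one series by the other. The following is a minimal sketch, assuming a frame with `date`, `n`, and `fake_merged` columns laid out like `av_totals_2021`; the `toy` frame and its values are invented for illustration and are not the McCabe et al. data.

```python
import pandas as pd

# Toy stand-in for av_totals_2021 (values are invented for illustration only)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-13", "2021-01-14"]),
    "n": [1000, 1200, 800, 900],
    "fake_merged": [100, 150, 40, 36],
})

# Share of tweets flagged as misinformation on each day
toy["fake_share"] = toy["fake_merged"] / toy["n"]

# Compare the mean daily share before and after the end of the suspension period
cutoff = pd.Timestamp("2021-01-12")
share_before = toy.loc[toy["date"] <= cutoff, "fake_share"].mean()
share_after = toy.loc[toy["date"] > cutoff, "fake_share"].mean()
print(f"mean share before: {share_before:.4f}, after: {share_after:.4f}")
```

Plotting `fake_share` over time alongside the count charts would show whether misinformation fell faster than overall activity, which the overlaid bars alone cannot show.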

Section 3 Replication of Main DiD Results¶

In this section, you will perform at least one Difference in Differences analysis with the goal of conceptually replicating the key DiD analysis that McCabe et al performed to support their primary conclusion.

Section 3-1 Main DiD Results: Impacts on Deplatformed-User Followers and Non-followers¶

Let's first analyze the effect of deplatforming on the sharing behavior of two groups: the followers of deplatformed users (group B) and those not following them (group F). The outcome variable chosen for comparison is fake_merged_rt, which captures one important aspect of how misinformation is communicated: retweeting.

In [29]:

# Import the dataset again, and pay special attention to the stat type; here we use the followers and non-followers groups and the "total" stat
df = mccabe_data.copy()
df['date'] = pd.to_datetime(df['date'])
df = df[df["stat"] == "total"]
# Create a new dataframe that holds only the followers (B) and non-followers (F)
df_mdid = df[(df['group'] == 'B') | (df['group'] == 'F')].copy()
# The date that the deplatforming event occurred
suspension_start = pd.to_datetime('2021-01-06')
# For the DiD analysis, set up the post-treatment and treatment-group indicators
df_mdid['post_treatment'] = (df_mdid['date'] > suspension_start).astype(int)
df_mdid['treatment_group'] = (df_mdid['group'] == 'B').astype(int)
# Standard DiD formula with fake_merged_rt as the outcome; a*b in the formula already expands to a + b + a:b
formula = 'fake_merged_rt ~ post_treatment + treatment_group + post_treatment*treatment_group'
model = smf.ols(formula, data=df_mdid)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         fake_merged_rt   R-squared:                       0.802
Model:                            OLS   Adj. R-squared:                  0.801
Method:                 Least Squares   F-statistic:                     1476.
Date:                Sat, 19 Oct 2024   Prob (F-statistic):               0.00
Time:                        21:24:23   Log-Likelihood:                -9067.7
No. Observations:                1100   AIC:                         1.814e+04
Df Residuals:                    1096   BIC:                         1.816e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                        489.9012     45.800     10.696      0.000     400.035     579.768
post_treatment                  -207.8599     89.200     -2.330      0.020    -382.882     -32.837
treatment_group                 3958.7333     64.771     61.118      0.000    3831.643    4085.823
post_treatment:treatment_group -2049.0644    126.148    -16.243      0.000   -2296.584   -1801.545
==============================================================================
Omnibus:                      418.613   Durbin-Watson:                   0.309
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3516.415
Skew:                           1.519   Prob(JB):                         0.00
Kurtosis:                      11.215   Cond. No.                         6.44
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [30]:

# Interpretation of the above summary table has been attached (see below)

The model explains a large share of the variance (R-squared = 0.802, i.e., 80.2%). All predictors, including the interaction term, are statistically significant (p < 0.05). The intercept is the baseline: in the pre-treatment period, the not-followers (control group) retweeted about 490 misinformation tweets per day on average. In the post-treatment period (after suspension), this average falls by about 208 for the control group. In the pre-treatment period, the followers of deplatformed users retweeted much more misinformation than the not-followers (about 3959 more per day). Lastly, in the post-treatment period, the average for the treatment group decreases by about 2049 more, beyond what is captured by the main effects of post-treatment and treatment group alone. Under the parallel-trends assumption, this implies that deplatforming leads to a significant reduction in misinformation retweets for followers compared to non-followers.
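Because the model contains only the two indicators and their interaction, each coefficient is a combination of the four group-by-period means. With the table above, the treated group's pre-period mean is 489.9 + 3958.7 ≈ 4448.6, and the interaction is the treated group's change minus the control group's change. The sketch below illustrates this identity on an invented toy panel (the `toy` frame and its cell means are assumptions made for illustration), using a hand-built design matrix and `numpy.linalg.lstsq` instead of the formula interface.

```python
import numpy as np
import pandas as pd

# Toy 2x2 panel with invented numbers: control falls 10 -> 8, treated falls 50 -> 30
rng = np.random.default_rng(0)
cell_means = {(0, 0): 10.0, (0, 1): 8.0, (1, 0): 50.0, (1, 1): 30.0}
rows = []
for (treated, post), mean in cell_means.items():
    for y in mean + rng.normal(0, 1, size=40):
        rows.append({"treatment_group": treated, "post_treatment": post, "y": y})
toy = pd.DataFrame(rows)

# Design matrix for: y ~ post_treatment + treatment_group + post_treatment:treatment_group
X = np.column_stack([
    np.ones(len(toy)),
    toy["post_treatment"],
    toy["treatment_group"],
    toy["post_treatment"] * toy["treatment_group"],
])
intercept, b_post, b_treat, b_interaction = np.linalg.lstsq(
    X, toy["y"].to_numpy(), rcond=None
)[0]

# The same quantity recovered from the four group-by-period means
m = toy.groupby(["treatment_group", "post_treatment"])["y"].mean()
did_by_hand = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

print(f"interaction from least squares: {b_interaction:.3f}")
print(f"difference of differences from means: {did_by_hand:.3f}")
```

The two printed numbers agree up to floating-point error, which is why the interaction term can be read as "the extra change in the treated group".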

Section 3-2 Given Time Window and Exact Duplication for Visualization¶

In this subsection, I reproduce Fig. 4 of the paper: time series of misinformation retweeting for followers and not-followers.

In [31]:

# Here I have attached a screenshot of the DID performed by the authors

In [32]:

# Before reproducing the visualization, I align the time window and the statistical preprocessing (standardization)
# Copy the dataset and filter for 'total' stat
df = mccabe_data.copy()
df = df[df["stat"] == "total"]
df['date'] = pd.to_datetime(df['date'])

# Filter data for groups 'B_ha' and 'F_ha' within the date range
df = df[(df['group'].isin(['B_ha', 'F_ha'])) & 
        (df['date'] >= '2020-12-01') & 
        (df['date'] <= '2021-01-20')]

# Standardize 'fake_merged_rt' column separately for each group
df['fake_merged_rt_std'] = df.groupby('group')['fake_merged_rt'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Create pre- and post-treatment indicators
df['pre_treatment'] = df['date'] <= '2021-01-06'
df['post_treatment'] = df['date'] > '2021-01-12'

# Fit OLS regressions for group B_ha
ols_before_B_ha = ols('fake_merged_rt_std ~ date', data=df[(df['group'] == 'B_ha') & (df['pre_treatment'])]).fit()
ols_after_B_ha = ols('fake_merged_rt_std ~ date', data=df[(df['group'] == 'B_ha') & (df['post_treatment'])]).fit()

# Fit OLS regressions for group F_ha
ols_before_F_ha = ols('fake_merged_rt_std ~ date', data=df[(df['group'] == 'F_ha') & (df['pre_treatment'])]).fit()
ols_after_F_ha = ols('fake_merged_rt_std ~ date', data=df[(df['group'] == 'F_ha') & (df['post_treatment'])]).fit()

# Prepare pre-treatment and post-treatment dates for B_ha and F_ha
dates_pre_B_ha = df[(df['group'] == 'B_ha') & df['pre_treatment']]['date']
dates_post_B_ha = df[(df['group'] == 'B_ha') & df['post_treatment']]['date']

dates_pre_F_ha = df[(df['group'] == 'F_ha') & df['pre_treatment']]['date']
dates_post_F_ha = df[(df['group'] == 'F_ha') & df['post_treatment']]['date']

# Align fitted values with corresponding dates for B_ha and F_ha
fitted_pre_B_ha = pd.Series(ols_before_B_ha.fittedvalues, index=dates_pre_B_ha.index)
fitted_post_B_ha = pd.Series(ols_after_B_ha.fittedvalues, index=dates_post_B_ha.index)

fitted_pre_F_ha = pd.Series(ols_before_F_ha.fittedvalues, index=dates_pre_F_ha.index)
fitted_post_F_ha = pd.Series(ols_after_F_ha.fittedvalues, index=dates_post_F_ha.index)

# Plot setup
plt.figure(figsize=(12, 7))

# Plot actual data points for groups B_ha and F_ha
sns.lineplot(data=df[df['group'] == 'B_ha'], x='date', y='fake_merged_rt_std', 
             color='black', label='Group B_ha', marker='o', linestyle='-')
sns.lineplot(data=df[df['group'] == 'F_ha'], x='date', y='fake_merged_rt_std', 
             color='gray', label='Group F_ha', marker='o', linestyle='-')

# Plot fitted lines for groups B_ha and F_ha
plt.plot(dates_pre_B_ha, fitted_pre_B_ha, color='black', linestyle='-')
plt.plot(dates_post_B_ha, fitted_post_B_ha, color='black', linestyle='-')

plt.plot(dates_pre_F_ha, fitted_pre_F_ha, color='gray', linestyle='-')
plt.plot(dates_post_F_ha, fitted_post_F_ha, color='gray', linestyle='-')

# Mark the start and end of the suspension period with vertical dashed lines
plt.axvline(x=pd.to_datetime('2021-01-06'), color='black', linestyle='--', label='Suspension period (6-12 Jan 2021)')
plt.axvline(x=pd.to_datetime('2021-01-12'), color='black', linestyle='--')

# Add labels, title, and legend
plt.title('Time series of misinformation retweeting for followers and not-followers')
plt.xlabel('Date')
plt.ylabel('Total misinformation retweeted (std)')
plt.legend(title='Group', loc='upper right')
plt.grid(True)

# Display the plot
plt.show()

Here I followed the figure notes in the paper: the sample includes 51 observations (days) from 1 December 2020 to 20 January 2021. In the paper's figure, the counterfactual identified under the parallel-path assumption is shown as a dashed line after 12 January 2021, and the fitted straight lines are ordinary least squares regressions of standardized daily total retweeted misinformation, fitted separately before 6 January 2021 and after 12 January 2021, and by group. My visualization closely matches the one in the paper, which suggests that I have successfully reproduced the result after the additional standardization step. The standardized series also show roughly parallel trends before the suspension, which is an essential assumption for DiD analysis. During the suspension period, the two groups clearly diverge.

Section 4 Extensions and follow up analyses¶

In this section, you will perform follow-up analyses, summaries, or visualizations that you feel help shed light on the robustness of the conclusion reached by McCabe et al. You are welcome to draw on insights you gained through data simulation, and to draw on the questions we discussed in class surrounding the key assumptions and study decisions in Notebook 1: Data Acquisition.

Section 4-1 Impacts of Changes of Key Variables on Results¶

In [33]:

# The DiD is applied repeatedly in the following analysis, so I wrap it in a function
def run_did_analysis(df, stat, treatment_group, control_group, outcome_var):
    # Start over from the full dataset; note that the df argument is ignored and mccabe_data is always reloaded
    df = mccabe_data.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df[df['stat'] == stat]
    
    # Create a new dataframe for the treatment and control group
    df_mdid = df[(df['group'] == treatment_group) | (df['group'] == control_group)].copy()

    # Define the date of the deplatforming event
    suspension_start = pd.to_datetime('2021-01-06')

    # Set up post-treatment and treatment group indicators
    df_mdid['post_treatment'] = (df_mdid['date'] > suspension_start).astype(int)
    df_mdid['treatment_group'] = (df_mdid['group'] == treatment_group).astype(int)

    # Define the DID formula
    formula = f'{outcome_var} ~ post_treatment + treatment_group + post_treatment*treatment_group'

    # Run the DID model
    model = smf.ols(formula, data=df_mdid)
    results = model.fit()

    return results.summary()

In [34]:

# I was wondering if the outcome variables and the stat used for the datasets would affect the general outcomes, let's try them below

Another DiD¶

In the previous section, we examined the main DiD in the paper to see the "indirect" effects of deplatforming on followers of deplatformed users. Here we look at the effect from another angle: how it affects misinformation sharers compared with non-misinformation sharers.

In [35]:

run_did_analysis(df=df, stat='total',treatment_group='fns', control_group='nfns',outcome_var='fake_merged_initiation')

Out[35]:

OLS Regression Results
Dep. Variable:	fake_merged_initiation	R-squared:	0.865
Model:	OLS	Adj. R-squared:	0.864
Method:	Least Squares	F-statistic:	2333.
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	0.00
Time:	21:24:26	Log-Likelihood:	-7736.7
No. Observations:	1100	AIC:	1.548e+04
Df Residuals:	1096	BIC:	1.550e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	115.2568	13.658	8.439	0.000	88.459	142.055
post_treatment	-48.3258	26.599	-1.817	0.070	-100.517	3.866
treatment_group	1480.1975	19.315	76.636	0.000	1442.299	1518.096
post_treatment:treatment_group	-548.3699	37.617	-14.578	0.000	-622.180	-474.560

Omnibus:	158.913	Durbin-Watson:	0.501
Prob(Omnibus):	0.000	Jarque-Bera (JB):	1099.916
Skew:	0.449	Prob(JB):	1.43e-239
Kurtosis:	7.816	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [36]:

# Conclusion attached below

Again, the model explains a large share of the variance (R-squared = 0.865, i.e., 86.5%). The treatment_group and interaction terms are statistically significant (p < 0.05), while post_treatment is only marginally so (p = 0.070). The intercept is the baseline: in the pre-treatment period, the control group (non-misinformation sharers) initiated about 115 misinformation tweets per day on average. In the pre-treatment period, the misinformation sharers initiated far more (about 1480 more per day) than the non-misinformation sharers. Lastly, in the post-treatment period, the average for the treatment group decreases by about 548 more, beyond what is captured by the main effects of post-treatment and treatment group alone. This implies that deplatforming also leads to a significant reduction in misinformation sharing.

Changing Outcome Variables and(or) Stat¶

For ease of comparison, unless specified otherwise, the rest of this section uses the model from Section 3-1 as the benchmark. In terms of the function defined above, its parameters are df=df, stat='total', treatment_group='B', control_group='F', outcome_var='fake_merged_rt'.

In [37]:

# Here I changed the stat to avg, taking into account the user size of different groups after deplatforming
run_did_analysis(df=df,stat='avg',treatment_group='B',control_group='F',outcome_var='fake_merged_rt')

Out[37]:

OLS Regression Results
Dep. Variable:	fake_merged_rt	R-squared:	0.824
Model:	OLS	Adj. R-squared:	0.824
Method:	Least Squares	F-statistic:	1713.
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	0.00
Time:	21:24:27	Log-Likelihood:	1190.3
No. Observations:	1100	AIC:	-2373.
Df Residuals:	1096	BIC:	-2353.
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	0.0833	0.004	20.411	0.000	0.075	0.091
post_treatment	-0.0287	0.008	-3.609	0.000	-0.044	-0.013
treatment_group	0.3763	0.006	65.177	0.000	0.365	0.388
post_treatment:treatment_group	-0.1504	0.011	-13.377	0.000	-0.172	-0.128

Omnibus:	589.761	Durbin-Watson:	0.305
Prob(Omnibus):	0.000	Jarque-Bera (JB):	7582.192
Skew:	2.175	Prob(JB):	0.00
Kurtosis:	15.104	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [38]:

# Conclusion attached below

Interesting. The general findings still apply after changing the stat from "total" to "avg". The model explains a large share of the variance (R-squared = 0.824, i.e., 82.4%). All predictors, including the interaction term, are statistically significant (p < 0.05). The intercept is the baseline: in the pre-treatment period, the control group retweeted about 0.0833 misinformation tweets per user. In the post-treatment period (after suspension), this per-user average decreases by about 0.0287 for the control group. In the pre-treatment period, the followers of deplatformed users retweeted much more misinformation per user (about 0.3763 more) than the not-followers. Lastly, in the post-treatment period, the per-user average for the treatment group decreases by about 0.1504 more, beyond what is captured by the main effects of post-treatment and treatment group alone. This implies that deplatforming leads to a significant reduction in misinformation retweets for followers compared to non-followers, even when we take the fluctuating group size into account.

In [39]:

# How about the impact on initiating misinformation tweets?
run_did_analysis(df=df,stat='total',treatment_group='B',control_group='F',outcome_var='fake_merged_initiation')

Out[39]:

OLS Regression Results
Dep. Variable:	fake_merged_initiation	R-squared:	0.814
Model:	OLS	Adj. R-squared:	0.814
Method:	Least Squares	F-statistic:	1602.
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	0.00
Time:	21:24:27	Log-Likelihood:	-7450.4
No. Observations:	1100	AIC:	1.491e+04
Df Residuals:	1096	BIC:	1.493e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	255.2840	10.528	24.249	0.000	234.627	275.941
post_treatment	-104.6702	20.504	-5.105	0.000	-144.901	-64.440
treatment_group	918.0346	14.888	61.661	0.000	888.822	947.247
post_treatment:treatment_group	-221.5725	28.996	-7.641	0.000	-278.467	-164.678

Omnibus:	147.558	Durbin-Watson:	0.560
Prob(Omnibus):	0.000	Jarque-Bera (JB):	849.482
Skew:	0.460	Prob(JB):	3.45e-185
Kurtosis:	7.206	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [40]:

# Conclusion attached below

Once again, the general findings still apply when I change the outcome variable from retweets to initiated tweets. The model explains a large share of the variance (R-squared = 0.814, i.e., 81.4%). All predictors, including the interaction term, are statistically significant (p < 0.05). The intercept is the baseline: in the pre-treatment period, the control group created about 255.28 misinformation tweets per day on average. In the post-treatment period (after suspension), this average decreases by about 104.7 for the control group. In the pre-treatment period, the followers created about 918 more misinformation tweets per day than the non-followers. Lastly, in the post-treatment period, the average for the treatment group decreases by about 221.6 more than the control group's decrease. This implies that deplatforming leads to a significant reduction in misinformation tweets for followers compared to non-followers.

In [41]:

# How about changing the stat from the above modified version?
run_did_analysis(df=df,stat='avg',treatment_group='B',control_group='F',outcome_var='fake_merged_initiation')

Out[41]:

OLS Regression Results
Dep. Variable:	fake_merged_initiation	R-squared:	0.810
Model:	OLS	Adj. R-squared:	0.810
Method:	Least Squares	F-statistic:	1558.
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	0.00
Time:	21:24:28	Log-Likelihood:	2788.1
No. Observations:	1100	AIC:	-5568.
Df Residuals:	1096	BIC:	-5548.
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	0.0438	0.001	45.839	0.000	0.042	0.046
post_treatment	-0.0147	0.002	-7.886	0.000	-0.018	-0.011
treatment_group	0.0774	0.001	57.325	0.000	0.075	0.080
post_treatment:treatment_group	0.0038	0.003	1.449	0.148	-0.001	0.009

Omnibus:	556.340	Durbin-Watson:	0.625
Prob(Omnibus):	0.000	Jarque-Bera (JB):	8434.906
Skew:	1.952	Prob(JB):	0.00
Kurtosis:	15.992	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [42]:

# Conclusion attached below

Although the model still has a relatively high R-squared (0.810), the coefficient we care most about, the interaction term, is no longer statistically significant (0.0038, p = 0.148). While the benchmark model showed an effect, on a per-user basis the initiation of misinformation by followers does not fall measurably more than it does for non-followers. So this specification does not give the anticipated result, which shows that the conclusions can be sensitive to the choice of outcome variable and stat, even though the other tests so far remain robust.

In [43]:

# Changing the definition of misinformation by using different lists (summary in the end)
# How about changing the list for misinformation classification-Grinberg et al. (2019) list
run_did_analysis(df=df, stat='total',treatment_group='B',control_group='F',outcome_var='fake_grinberg_rt')

Out[43]:

OLS Regression Results
Dep. Variable:	fake_grinberg_rt	R-squared:	0.795
Model:	OLS	Adj. R-squared:	0.794
Method:	Least Squares	F-statistic:	1416.
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	0.00
Time:	21:24:29	Log-Likelihood:	-7910.8
No. Observations:	1100	AIC:	1.583e+04
Df Residuals:	1096	BIC:	1.585e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	146.5457	16.000	9.159	0.000	115.152	177.939
post_treatment	-65.9043	31.161	-2.115	0.035	-127.046	-4.762
treatment_group	1351.9654	22.627	59.750	0.000	1307.568	1396.363
post_treatment:treatment_group	-867.4689	44.068	-19.685	0.000	-953.936	-781.001

Omnibus:	335.154	Durbin-Watson:	0.455
Prob(Omnibus):	0.000	Jarque-Bera (JB):	2177.721
Skew:	1.239	Prob(JB):	0.00
Kurtosis:	9.432	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [44]:

# How about changing the list for misinformation classification-the fake_grinberg_rb variant of the Grinberg et al. (2019) list
run_did_analysis(df=df,stat='total',treatment_group='B',control_group='F',outcome_var='fake_grinberg_rb_rt')

Out[44]:

OLS Regression Results
Dep. Variable:	fake_grinberg_rb_rt	R-squared:	0.716
Model:	OLS	Adj. R-squared:	0.715
Method:	Least Squares	F-statistic:	919.0
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	1.42e-298
Time:	21:24:29	Log-Likelihood:	-7636.1
No. Observations:	1100	AIC:	1.528e+04
Df Residuals:	1096	BIC:	1.530e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	70.6099	12.464	5.665	0.000	46.153	95.067
post_treatment	-44.3478	24.276	-1.827	0.068	-91.980	3.284
treatment_group	840.7556	17.627	47.696	0.000	806.168	875.343
post_treatment:treatment_group	-619.5556	34.331	-18.047	0.000	-686.917	-552.194

Omnibus:	556.026	Durbin-Watson:	0.375
Prob(Omnibus):	0.000	Jarque-Bera (JB):	5543.562
Skew:	2.102	Prob(JB):	0.00
Kurtosis:	13.162	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [45]:

# How about changing the list for misinformation classification-Newsguard list
run_did_analysis(df=df,stat='total',treatment_group='B',control_group='F',outcome_var='fake_newsguard_rt')

Out[45]:

OLS Regression Results
Dep. Variable:	fake_newsguard_rt	R-squared:	0.789
Model:	OLS	Adj. R-squared:	0.788
Method:	Least Squares	F-statistic:	1365.
Date:	Sat, 19 Oct 2024	Prob (F-statistic):	0.00
Time:	21:24:29	Log-Likelihood:	-9043.7
No. Observations:	1100	AIC:	1.810e+04
Df Residuals:	1096	BIC:	1.812e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	455.5333	44.814	10.165	0.000	367.603	543.464
post_treatment	-202.7471	87.279	-2.323	0.020	-373.999	-31.495
treatment_group	3722.5951	63.376	58.738	0.000	3598.243	3846.947
post_treatment:treatment_group	-1938.4640	123.430	-15.705	0.000	-2180.651	-1696.277

Omnibus:	474.398	Durbin-Watson:	0.286
Prob(Omnibus):	0.000	Jarque-Bera (JB):	4450.884
Skew:	1.734	Prob(JB):	0.00
Kurtosis:	12.224	Cond. No.	6.44

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [46]:

# Conclusion goes below

Despite the changes in how misinformation is identified, the benchmark model still performs well: R-squared stays high (0.72 to 0.80) and the interaction term is statistically significant in every case. The only exception among the four coefficients is post_treatment under the fake_grinberg_rb_rt outcome, which is marginal (p = 0.068). The signs of the coefficients also remain the same, with a positive intercept, negative post_treatment, positive treatment_group, and a negative interaction term, which allows a consistent interpretation. However, the magnitude of the estimated effect differs depending on which list is used to classify misinformation.
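Raw interaction coefficients are hard to compare across lists because each list flags a different volume of tweets. One rough way to put them on a common scale is to divide each interaction coefficient by the treated group's pre-period mean (intercept plus treatment_group coefficient). The sketch below does this arithmetic with the coefficients copied from the regression tables for the benchmark model and the three alternative lists; the choice of denominator is an assumption made here for illustration, not a measure used in the paper.

```python
# Coefficients copied from the regression tables for the four outcome variables
# (stat='total', treatment group B, control group F)
reported = {
    "fake_merged_rt":      {"intercept": 489.9012, "treatment_group": 3958.7333, "interaction": -2049.0644},
    "fake_grinberg_rt":    {"intercept": 146.5457, "treatment_group": 1351.9654, "interaction": -867.4689},
    "fake_grinberg_rb_rt": {"intercept": 70.6099, "treatment_group": 840.7556, "interaction": -619.5556},
    "fake_newsguard_rt":   {"intercept": 455.5333, "treatment_group": 3722.5951, "interaction": -1938.4640},
}

relative_effect = {}
for outcome, coef in reported.items():
    # Pre-period mean of the treated group = intercept + treatment_group coefficient
    pre_treated_mean = coef["intercept"] + coef["treatment_group"]
    # Interaction expressed as a fraction of that pre-period mean
    relative_effect[outcome] = coef["interaction"] / pre_treated_mean

for outcome, rel in relative_effect.items():
    print(f"{outcome:22s} {rel:+.1%}")
```

On this scale the estimated reductions are roughly 46%, 58%, 68%, and 46% of the followers' pre-period daily level, so the lists differ in relative as well as absolute magnitude.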

Section 4-2 Impacts of Changes of Time Window¶

In [47]:

# Let me define a new function that incorporates the time window; this is simply a modification of the previous function

def run_did_analysis_with_time_windows(df, stat, treatment_group, control_group, outcome_var, 
                                       pre_treatment_window=None, post_treatment_window=None):
    """
    - pre_treatment_window: Number of days before the intervention (optional)
    - post_treatment_window: Number of days after the intervention (optional)
    """
    # Start over and import the dataset
    df = mccabe_data.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df[df['stat'] == stat]

    # Create a new dataframe for the treatment and control groups
    df_mdid = df[(df['group'] == treatment_group) | (df['group'] == control_group)].copy()

    # Define the date of the deplatforming event
    suspension_start = pd.to_datetime('2021-01-06')

    # Apply time window filtering if specified
    if pre_treatment_window:
        pre_treatment_start = suspension_start - pd.Timedelta(days=pre_treatment_window)
        df_mdid = df_mdid[df_mdid['date'] >= pre_treatment_start]

    if post_treatment_window:
        post_treatment_end = suspension_start + pd.Timedelta(days=post_treatment_window)
        df_mdid = df_mdid[df_mdid['date'] <= post_treatment_end]

    # Set up post-treatment and treatment group indicators
    df_mdid['post_treatment'] = (df_mdid['date'] > suspension_start).astype(int)
    df_mdid['treatment_group'] = (df_mdid['group'] == treatment_group).astype(int)

    # Define the DID formula
    formula = f'{outcome_var} ~ post_treatment + treatment_group + post_treatment*treatment_group'

    # Run the DID model
    model = smf.ols(formula, data=df_mdid)
    results = model.fit()

    return results

In [48]:

# I'm also very interested in the impact of time windows on the ultimate outcomes

coefficients = []

# Loop through pre_treatment_window values from 30 to 364 days, keeping post_treatment_window fixed at 30 days
for window in range(30, 365):
    # Run the DID analysis for each window value
    model = run_did_analysis_with_time_windows(
        df=df,
        stat='total',
        treatment_group='B',
        control_group='F',
        outcome_var='fake_merged_rt',
        pre_treatment_window=window,
        post_treatment_window=30,
    )
    
    # Extract the coefficient for 'post_treatment:treatment_group'
    coeff = model.params['post_treatment:treatment_group']
    # Store the coefficient in the list
    coefficients.append(coeff)

# Visualize the coefficients
plt.figure(figsize=(10, 6))
plt.plot(range(30, 365), coefficients, label='DiD Coefficients')
plt.xlabel('Pre-Treatment Window')
plt.ylabel('Coefficient Value')
plt.title('DiD Coefficient over Different Pre-Treatment Windows')
plt.grid(True)
plt.legend()
plt.show()

In [49]:

# Conclusion goes below

In the previous cells, I tried pre-treatment windows from 30 to 364 days. The visualization demonstrates that the length of the pre-treatment window really matters for the magnitude of the estimated treatment effect of the deplatforming event. The plot starts at a coefficient of around -1400 and initially drops sharply, reaching a minimum of around -2200. After this sharp decline, the coefficient increases steadily, eventually reaching around -1000 by the end of the range. The line fluctuates, particularly at shorter windows, before settling into the upward trend. This reflects how the estimated treatment effect changes as the pre-treatment window is extended.

Section 5 Conclusions and Reflections¶

Here is where you draw together insights you have gained by analyzing this dataset and reflections on the methods we have applied. You should provide a clear answer to the question:

What are your conclusions about the question posed in this assignment: Did deplatforming reduce misinformation on Twitter?

You are welcome to use the bullet points below to guide your reflections if they are helpful, and also to include any additional insights.

Is the current dataset sufficient to offer insight into this question? What are some key limitations of the dataset, and key merits?
Is the DiD method sufficient to support strong conclusions related to this question?
Overall, do you think the conclusions of McCabe et al. (2024) are justified?
More generally, do you feel that misinformation on social media is a substantial threat to discourse and society that data science can address, and how has this project influenced your view?

Based on the previous analyses, I think deplatforming did reduce misinformation on Twitter.

1. I think the current dataset is sufficient to offer insight into the question posed, but there are still areas for improvement.

The current dataset contains detailed information on misinformation sharing, retweets, and user categorization. The dataset includes a lot of variations that we could explore, including classification using different lists, different activity levels across subgroups (i.e., high-, medium-, and low-active), and the number of deplatformed users followed, among others. This offers a rich dataset to investigate misinformation spread on Twitter using different variations. Additionally, the dataset covers periods before and after key political events, like the January 6th insurrection, allowing for comparative analysis using quasi-experimental designs such as DiD.

However, there are some limitations that need improvement. The dataset was not preprocessed very carefully, as one date is duplicated: 2020-06-30. On this date, the values for the same subgroups even differ between the two copies, making it difficult to know which one is correct. This might introduce noise into an otherwise high-quality dataset. Moreover, there are some inconsistencies in the data composition. There are supposed to be 60 values for each day, yet several dates have only 58. After careful inspection, it appears that the missing rows always belong to groups for which no users were eligible. The dataset creators could flag this issue for readers and potential users up front.
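Both issues described above can be surfaced with a couple of pandas checks. The following is a minimal sketch on an invented toy frame; the column names `date`, `group`, and `stat` follow the notebook, while the `toy` rows and the choice of the maximum group count as the expected value are assumptions made for illustration.

```python
import pandas as pd

# Toy frame mimicking the long layout (one row per date, group, and stat); values are invented
toy = pd.DataFrame({
    "date": ["2020-06-29", "2020-06-29", "2020-06-30", "2020-06-30", "2020-06-30", "2020-07-01"],
    "group": ["B", "F", "B", "F", "B", "B"],
    "stat": ["total"] * 6,
    "fake_merged_rt": [10, 2, 12, 3, 15, 9],
})
toy["date"] = pd.to_datetime(toy["date"])

# Rows whose (date, group, stat) key appears more than once
key = ["date", "group", "stat"]
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)

# After dropping repeated keys, count distinct groups per date to spot incomplete days
groups_per_date = toy.drop_duplicates(subset=key).groupby("date")["group"].nunique()
expected = groups_per_date.max()
short_dates = groups_per_date[groups_per_date < expected]
print(short_dates)
```

In the toy data, the two conflicting rows for group B on 2020-06-30 are reported as duplicates, and 2020-07-01 is reported as a date with fewer groups than expected.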

In addition, the dataset might suffer from confounding due to simultaneous events (like the insurrection) that can affect behavior, making it difficult to isolate the effects of deplatforming alone (McCabe et al., 2024). And while the panel is filtered using voter data, there is potential bias due to the exclusion of non-human actors (bots), which could play a significant role in misinformation dissemination. Moreover, the dataset primarily focuses on retweet and tweet counts without fully capturing other forms of engagement (e.g., likes, replies) that could be influential. Last but not least, while deplatformed users have been removed, they could still create new accounts or shift to other social media platforms, which could affect the overall spread of misinformation, though such a pattern is difficult to capture.

2. The DiD method is sufficient to support strong conclusions related to this question, provided that its strong assumptions hold.

Previous analyses indicate that Twitter's deplatforming effectively reduced misinformation, particularly among followers of deplatformed users. Across several robustness checks, the estimated effect stays negative and statistically significant in most specifications, which adds confidence to the conclusions. The DiD approach accounts for baseline differences between the control and treatment groups; for instance, followers of deplatformed users share far more misinformation to begin with. The results also show a statistically significant decrease from the pre-treatment to the post-treatment period, which suggests that deplatforming plays a role. The deplatforming action regulates both the sources of and the exposure to misinformation. Deplatforming immediately reduces the ability of deplatformed users to share content. With the decreased amount of misinformation available, retweet behavior for those posts also decreases. Additionally, users may become more cautious about sharing misinformation due to the fear of facing similar suspensions.

However, the DiD approach has some limitations. It assumes that, in the absence of treatment, the average outcomes for the treated and control groups would have followed parallel trends over time. This assumption is often difficult to verify, and when it is violated, the causal estimates from DiD may be biased. In my attempt to replicate the visualization from the authors’ paper, the standardization is a reasonable way to show that the trends are similar, compared with plotting unadjusted values. The visualization by itself, however, does not provide a rigorous test of the parallel trends assumption. In addition, when treatment effects are heterogeneous, the DiD approach typically estimates an average treatment effect, which can hide important subgroup variation.
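To make the parallel trends discussion concrete, here is a minimal sketch of a 2x2 DiD estimate together with a simple pre-period slope comparison. It is not the authors' specification; the function names and the weekly numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def ols_slope(values):
    """Least-squares slope of values against their time index 0, 1, 2, ..."""
    times = list(range(len(values)))
    t_bar, v_bar = mean(times), mean(values)
    numerator = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    denominator = sum((t - t_bar) ** 2 for t in times)
    return numerator / denominator


def pre_trend_gap(treated_pre, control_pre):
    """Difference in pre-period slopes; a value near zero is consistent
    with (but does not prove) parallel trends."""
    return ols_slope(treated_pre) - ols_slope(control_pre)


# Hypothetical weekly misinformation retweets per user (illustrative only).
treated_pre = [4.0, 4.5, 5.0, 5.5]
control_pre = [1.0, 1.5, 2.0, 2.5]
control_post = [3.0, 3.5, 4.0, 4.5]   # control keeps its pre-period trend
treated_post = [4.0, 4.5, 5.0, 5.5]   # same trend, shifted down by 2

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
gap = pre_trend_gap(treated_pre, control_pre)
print("DiD estimate:", effect)
print("Pre-trend slope gap:", gap)
```

With these made-up numbers, both groups have the same pre-period slope, so the gap is zero and the DiD estimate recovers the built-in shift of -2. A zero gap in the pre-period is only suggestive: the assumption concerns the unobserved post-period counterfactual, which no pre-period check can confirm.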

3. Overall, I think the conclusions of McCabe et al. (2024) are largely justified.

In the paper, McCabe et al. (2024) use more than one approach: in addition to DiD, they use a sharp regression discontinuity (SRD) design. The SRD analysis indicates a significant decline in misinformation sharing by deplatformed users, as expected. Meanwhile, the DiD analysis shows a notable spillover effect, with a reduction in retweets by users who followed deplatformed accounts. This suggests that deplatforming reduces misinformation both directly and indirectly. The authors made efforts to verify some of the assumptions, although not all of them may hold. They ran a placebo test on tweets about shopping and sports: if the design is sound, users' behavior on these unrelated topics should not change because of the intervention. Such a null result indicates that the intervention did not affect behavior it was not intended to affect, which increases confidence in its estimated effect on misinformation sharing.

However, the study faces limitations that temper its causal claims. The SRD design is confounded by concurrent political events, such as the insurrection itself and media coverage of election certification, which complicates efforts to isolate the specific impact of deplatforming. The authors acknowledge that interpreting these results as causal depends on strong assumptions, such as continuity and parallel trends, which may not fully hold given the extraordinary context. Furthermore, although Twitter’s intervention appears effective in reducing misinformation, the findings may not generalize to other deplatforming events, because the effect here may have been amplified by media coverage and user awareness. Overall, while McCabe et al. (2024) provide compelling evidence of Twitter’s regulatory capacity, the results are better viewed as context-specific and contingent upon unverified assumptions.
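For readers less familiar with the SRD design discussed above, the following is a minimal sketch of the idea: fit a separate line on each side of the cutoff date and take the difference between the two fitted values at the cutoff. The function names, bandwidth, and daily numbers are hypothetical, and this is not the authors' estimator.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return y_bar - slope * x_bar, slope


def sharp_rd_jump(days, outcome, cutoff=0, bandwidth=5):
    """Jump at the cutoff: right-side fitted value minus left-side fitted value.

    Days are re-centered on the cutoff so each intercept is the fitted
    outcome exactly at the cutoff.
    """
    left = [(d - cutoff, y) for d, y in zip(days, outcome)
            if cutoff - bandwidth <= d < cutoff]
    right = [(d - cutoff, y) for d, y in zip(days, outcome)
             if cutoff <= d < cutoff + bandwidth]
    left_intercept, _ = fit_line(*zip(*left))
    right_intercept, _ = fit_line(*zip(*right))
    return right_intercept - left_intercept


# Hypothetical daily misinformation tweets; day 0 stands in for the intervention.
days = list(range(-5, 5))
outcome = [10 + 0.5 * d if d < 0 else 6 + 0.5 * d for d in days]

print("Estimated jump at cutoff:", sharp_rd_jump(days, outcome))
```

In this constructed example the estimated jump is -4, the drop built into the data. The sketch also shows why concurrent events matter: anything else that changes sharply at the same date is absorbed into the same jump and cannot be separated from the intervention by this design alone.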

4. I do believe that misinformation on social media is a substantial threat to discourse and society; however, I am pessimistic that data science can fully address this threat.

It is always tricky to define misinformation. Is it something that contradicts a truth we can verify? Or is it simply something taken out of context, making its authenticity difficult to validate? Even the authors did not do very well in this regard, as the study classified tweets as misinformation if they contained URLs from a predefined list of domains. These lists focused on domains that lack editorial norms or have low credibility scores. URLs in tweets were cross-referenced with this list, but the analysis did not evaluate the truthfulness of the content. As such, it might oversimplify the classification of misinformation. On the one hand, content from a listed domain is not necessarily misinformation but is still classified as such (false positives); on the other hand, the scope of misinformation could be much larger than fake news, so some misinformation goes unidentified (false negatives).
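The domain-list approach described above can be sketched as follows. This is a simplified illustration, not the study's pipeline: the domain names, the `FLAGGED_DOMAINS` set, and the helper functions are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical low-credibility domain list; the lists used in the study
# are not reproduced here.
FLAGGED_DOMAINS = {"lowcred.example"}


def extract_domain(url):
    """Return the lowercased host of a URL, without port or leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def tweet_is_flagged(urls, flagged_domains=FLAGGED_DOMAINS):
    """Flag a tweet if any of its URLs points to a listed domain.

    The label depends only on the domain, never on what the linked
    content actually says.
    """
    return any(extract_domain(u) in flagged_domains for u in urls)


# An accurate story hosted on a listed domain is still flagged.
print(tweet_is_flagged(["https://www.lowcred.example/accurate-weather-report"]))

# A false claim with no URL, or linking to an unlisted domain, is not flagged.
print(tweet_is_flagged([]))
print(tweet_is_flagged(["https://news.example/false-claim"]))
```

The first call illustrates the false-positive case and the last two illustrate false negatives: the classifier never inspects the content, so its errors follow directly from how well the domain list tracks truthfulness.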

Secondly, misinformation can cause harm by shaping false beliefs (Ecker, U. K. et al., 2012). People often rely on intuition rather than careful reasoning when determining what is true, which makes them prone to biases. Repetition of a claim makes it seem more believable, a phenomenon known as the illusory truth effect. This effect can persist over time, regardless of cognitive ability and prior knowledge. Misinformation can also continue to influence people’s thinking even after they receive a correction and accept that correction as true, which is known as the continued influence effect.

In conclusion, people should practice their critical thinking skills and make sound judgments. We are living in the era of artificial intelligence, and deepfakes make it even more challenging to distinguish truth from misinformation. People should be cautious about the information they consume and exercise the same caution when creating and spreading information. Meanwhile, social media platforms should also play a role in combating misinformation. This need not involve coercion or suppression; even a gentle reminder or downranking (with public voting) could work effectively. The main reason I am not confident that misinformation can be addressed solely by data science is its complexity and the role of human creativity. We should never underestimate human creativity in communication and the ability to create and understand coded language. Social media users often employ countermeasures to circumvent detection by platform algorithms. There could be an infinite number of variations, metaphors with historical roots, and other complexities. Moreover, there are complex ethical considerations about the right to freedom of speech, which add another layer of complication. Together, these factors make it almost impossible for data science to fully address these issues.

5. Other thoughts

Social media platforms hold significant power in regulating discourse through their terms of use. McCabe et al. (2024) note two instruments: content moderation and the enforcement of users' terms of use. The paper focuses mainly on enforcement, such as deplatforming; as a result, an important part of the communication landscape remains unaddressed. Prior to this project, I anticipated that the direct effects of deplatforming would be straightforward: targeted users are removed, and their "products" inevitably diminish. It was interesting to discover that there are spillover effects as well. But even if deplatforming appears to have indirect effects at first glance, what happens if users change their ways of expression? Misinformation could then continue to exist while becoming harder to detect, posing a new challenge and potentially undermining our conclusions. Lastly, data scientists alone cannot adequately address these issues. Tackling them requires broader and deeper collaboration among stakeholders, including the public and policymakers.