Draft:Tiffstatskcl

SUMMARY OF STATS

Types of Data

-       Categorical: Nominal or ordinal

o   Represented with frequency and percentages, bar chart or pie chart (only for nominal)

o   For ordinal with more than 5 points, use median/min-max or mean/SD

-       Numerical: Discrete or continuous

o   Represented with mean and SD (normal) OR median and min-max (skewed) and IQR

o   Represented with histogram or box plot

o   Create boxplot by ‘Graphs’ → ‘Legacy Dialogs’ → ‘Boxplot’ → ‘Simple’

Description of Data

-       Variance (S²) measures the average squared distance of the values from the mean

-       SD measures how spread out a group of numbers is from the mean = square root of variance

o   E.g., SD = 9 cm: heights are on average 9 cm away from the mean height
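The variance/SD relationship above can be sketched with the Python standard library (the heights are hypothetical):

```python
# Sample variance and SD: SD is the square root of the variance,
# expressed in the original units (cm here). Data are made up.
import math
import statistics

heights = [160, 165, 170, 175, 180]  # hypothetical heights in cm

variance = statistics.variance(heights)  # average squared distance from the mean
sd = statistics.stdev(heights)           # square root of the variance

assert math.isclose(sd, math.sqrt(variance))
# heights are on average ~7.9 cm away from the mean height of 170 cm
```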

Confidence and Significance Testing

Statistic and Parameter

-       A statistic is an estimator of a model parameter

Sampling and Error

-       Random error (noise, unpredictable) → can go in either direction; due to unknown factors

-       Systematic error (bias) → always the same direction of error; due to known factors

Sampling/Normal Distribution

-       Sampling distribution: Distribution of sample means estimated from a number of samples

-       Central limit theorem → given a sufficiently large number of samples, the sampling distribution will approximate the normal distribution

-       68% of values lie within ±1 SD, 95% within ±1.96 SD, 99% within ±2.58 SD

-       Variance of the sampling distribution = σ²/n

-       SD of the sampling distribution is called the standard error (SE) = σ/√n

o   Smaller variability in population, smaller SE = greater precision

o   Larger the sample size, the smaller the SE and the narrower the CI
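The SE = σ/√n relationship can be sketched directly (all values hypothetical): quadrupling the sample size halves the SE.

```python
# Standard error of the mean: SE = sigma / sqrt(n).
# sigma and n below are hypothetical.
import math

def standard_error(sigma: float, n: int) -> float:
    """SD of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

se_25 = standard_error(sigma=10, n=25)    # n = 25
se_100 = standard_error(sigma=10, n=100)  # n = 100 (4x larger)
assert se_100 == se_25 / 2  # larger n -> smaller SE -> narrower CI
```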

To Create Own Sample

-       ‘Transform’ → ‘Random Number Generators’ → input random 5 digits in ‘values’

-       ‘Data’ → ‘Select cases’ → ‘Random sample of cases’ → ‘Sample’ → exact number of cases required

Confidence Intervals

^ can only be calculated on normally distributed curves

-       Confidence about whether interval contains true population mean

-       Calculate on SPSS: ‘Analyse’ → ‘Descriptive stats’ → ‘Explore’ → input variable in dependent list → ‘Statistics’ → CI to 95%
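The 95% CI can be sketched with the normal approximation quoted earlier, mean ± 1.96 × SE (data hypothetical):

```python
# 95% confidence interval for a mean via the normal approximation.
# The sample values are made up for illustration.
import math
import statistics

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # SE = SD / sqrt(n)

ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
# We are 95% confident this interval contains the true population mean.
```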

Hypothesis Testing

-       Type 1 error (α): Null hypothesis rejected but was true → α usually set to 0.05

-       Type 2 error (β): Null hypothesis accepted but was false → power (1−β) usually set to 0.80/80%

-       Power to correctly reject the null is larger if the effect size is big or the sample is bigger

o   Power analysis to determine sufficient sample size for research

-       The type 1 error = Probability of rejecting a true null hypothesis (p-value)

-       The type 2 error = Probability of not rejecting a false null

-       The Power = Probability of rejecting a false null      

-       Steps to HT

o   Create null and alt hypothesis for population parameter

o   Sample from the population and compute the correct statistic to estimate the parameter (SE = SD/√n)

o   Create sampling distribution for stat under the null

o   Find the rejection area and check if the sampled value falls in the rejection area → always centre the sampling distribution under the null on the population mean
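The four steps above can be sketched as a one-sample z-test, assuming a known population SD (all numbers hypothetical):

```python
# One-sample z-test following the four hypothesis-testing steps.
# mu0, sigma, n and the sample mean are all hypothetical.
import math

# 1. H0: population mean = 100; H1: mean != 100
mu0, sigma = 100, 15

# 2. Sample and compute the standard error: sigma / sqrt(n)
n, sample_mean = 36, 106
se = sigma / math.sqrt(n)

# 3. Under H0 the sampling distribution is normal, centred on mu0
z = (sample_mean - mu0) / se

# 4. Reject H0 at alpha = 0.05 if the value falls in the rejection area |z| > 1.96
reject = abs(z) > 1.96
```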

COMPARING GROUPS (PARAMETRIC METHODS)

Types of T-Test – Equality of Means (Continuous)

-       One sample t-test: Test sample mean against test value e.g., height

-       Independent sample t-test: Difference in means between two groups

-       Paired samples t-test: Before/after, matched cases, twin studies

-       NOTE: Variable of interest needs to be normally distributed  

One Sample T-Test

-       Continuous variable: E.g., male/female income: ‘Analyse’ → ‘Compare means’ → ‘One sample t-test’ → key in test value

Independent Sample T-Test

-       Split data file to check suitability of data/variable

o   ‘Data’ → ‘Split file’ → ‘Gender’ (e.g.) and perform frequency test

o   Remember to unsplit the data → ‘Data’ → ‘Split file’ → ‘Analyse all groups’

o   ‘Analyse’ → ‘Compare means’ → ‘Independent samples t-test’

o   If Levene’s test is significant, equal variances are not assumed → use the second row of the output

Two Paired Samples T-Test

-       Calculate difference of ‘before’ and ‘after’ data through ‘Compute Variable’

o   Check normal distribution of new variable

-       ‘Analyse’ → ‘Compare means’ → ‘Paired samples t-test’
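The paired-samples logic above can be sketched by testing the mean of the before/after differences against 0 (data hypothetical):

```python
# Paired t-test as a one-sample t-test on the difference scores.
# The before/after values are made up for illustration.
import math
import statistics

before = [12, 15, 11, 14, 13, 16]
after  = [10, 13, 11, 12, 12, 14]
diff = [b - a for b, a in zip(before, after)]  # the 'Compute Variable' step

n = len(diff)
# t = mean(diff) / (SD(diff) / sqrt(n))
t = statistics.mean(diff) / (statistics.stdev(diff) / math.sqrt(n))
# Compare t against the t-distribution with n - 1 = 5 degrees of freedom.
```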

Types of χ2-Test – Equality of Proportions (Categorical)

-       One sample χ2-test: Test sample proportion against a test value, e.g., a %

o   No more than 20% of cells should have expected frequencies less than 5

o   Minimum expected frequency is at least 1

-       Independent sample χ2-test: Difference in proportions between two groups

o   Observations must not be paired + above two assumptions

-       Paired samples χ2-test: Before/after, matched cases, twin studies

o   At least 25 observations in discordant cells; data are paired

One Sample χ2-test

-       ‘Analyse’ → ‘Nonparametric tests’ → ‘Legacy Dialogs’ → ‘Chi-square’ (tick ‘All categories equal’ or enter values if there’s a specific test value)

o   When entering a specific test value, ensure the values total 100% but input the % that you’re not interested in first (e.g., if the test value is 20%, input 80 BEFORE 20)
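The one-sample χ² statistic is the sum of (observed − expected)²/expected over the categories; a minimal sketch with hypothetical counts:

```python
# One-sample chi-square goodness-of-fit statistic.
# Counts below are hypothetical: 100 observations, H0 split of 20%/80%.
observed = [30, 70]
expected = [20, 80]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Compare against the chi-square distribution with k - 1 = 1 df;
# the 5% critical value is 3.84, so 6.25 would be significant.
```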

Independent Sample χ2-test

-       ‘Analyse’ → ‘Descriptive Statistics’ → ‘Crosstabs’ → ‘Statistics’ → ‘Chi-square’ → ‘Cells’ → tick ‘Observed’ and ‘Column’/‘Row’ → ‘Exact’

-       Remember to interpret based on columns/rows that do not add up to 100%

Paired Samples χ2-test

-       ‘McNemar chi square test’ under statistics and ‘Total’ under percentages

COMPARING GROUPS (NON-PARAMETRIC METHODS)

Equality of Means

Wilcoxon Signed Rank Test

-       For skewed continuous data, ordinal (interval) or discrete data

-       Assumption → at least interval data

-       ‘Analyse’ → ‘Nonparametric tests’ → ‘One sample’ → ‘Fields’: add variable of interest → ‘Settings’ → ‘Compare median to hypothesized (input value)’

-       Report on Z score (i.e., standardized test statistic)

Mann-Whitney U Test

-       ‘Analyse’ → ‘Nonparametric tests’ → ‘Independent samples’ → ‘Fields’: add variable of interest and grouping variable → ‘Settings’ → ‘Customised tests (Mann-Whitney U)’

-       Report on Mann-Whitney U score
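As a rough sketch, the Mann-Whitney U statistic counts, over all cross-group pairs, how often one group's values exceed the other's (ties count 0.5; data hypothetical):

```python
# Mann-Whitney U by direct pair counting. Group values are made up.
group1 = [3, 5, 8, 9]
group2 = [1, 2, 4, 6]

# U1 = number of (x, y) pairs with x from group1 beating y from group2
u1 = sum(
    1.0 if x > y else 0.5 if x == y else 0.0
    for x in group1
    for y in group2
)
# SPSS reports min(U1, U2), where U1 + U2 = n1 * n2
u = min(u1, len(group1) * len(group2) - u1)
```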

Wilcoxon Matched-Pair Signed Rank Test

-       ‘Analyse’ → ‘Nonparametric tests’ → ‘Related samples’ → ‘Fields’: add variables of interest → ‘Settings’ → ‘Customised tests (Wilcoxon)’

-       Report on Z score (i.e., standardized test statistic)

Equality of Proportions

Binomial Exact Test

-       ‘Analyse’ → ‘Nonparametric tests’ → ‘Legacy Dialogs’ → ‘Chi-square’ (tick ‘All categories equal’ or enter value if there’s a specific value) → ‘Exact’

Fisher’s Exact Test

-       Fisher’s exact p-value appears in the output table of the normal chi-square test when ‘Exact’ is selected

-       For categorical variables with more than two levels (e.g., ethnicity), go to the ‘Exact’ tab again

McNemar (Binomial Test)

-       SPSS will automatically print out correct exact test

CORRELATION AND LINEAR REGRESSION

Scatterplots

-       To display relationship between two variables observed over a number of instances

-       To investigate the empirical relationship between x (IV) and y (DV) → attempt to predict Y from X

-       ‘Graph’ → ‘Legacy Dialogs’ → ‘Simple Scatter’

o   Double click graph and choose ‘linear fit line’ to draw line of best fit

Correlation

-       Pearson’s (parametric) and Spearman’s (non-parametric) correlation coefficient

Pearson’s  

-       Assumptions

o   Variables should be continuous, each observation should have a pair of values

o   No significant outliers in either variable

o   Linearity: a ‘straight line’ relationship should exist between the variables

-       ‘Correlate’ → ‘Bivariate’ → ‘Pearson’

Spearman’s

-       When one or both variables are not normally distributed

-       Measures strength and direction of monotonic relationship (i.e., variables increase/decrease but not at a constant rate)

-       Click ‘Spearman’ instead under ‘Bivariate’
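Both coefficients can be sketched in plain Python: Pearson on the raw values, Spearman as Pearson computed on the ranks (tie handling omitted; data hypothetical):

```python
# Pearson and Spearman correlation from first principles.
# The x/y values are made up and contain no ties.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    order = sorted(values)
    return [order.index(v) + 1 for v in values]  # 1-based; assumes no ties

def spearman(xs, ys):
    # Spearman = Pearson applied to the ranks (monotonic association)
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson(x, y)    # strength of the linear relationship
rho = spearman(x, y)
```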

Simple Linear Regression

-       Estimate the relationship between variables for one continuous outcome and one predictor → measures to what extent there is a linear relationship between two variables

-       Equation: y = β₀ + β₁x + ε

o   X = IV, predictor, explanatory variable or covariate (continuous or categorical)

o   Y = DV, outcome, or response (always continuous) → y ‘depends on’ x

o   Intercept β₀ is the value that y takes when x is 0

o   Slope β₁ determines the change in y when x changes by one unit

o   ε is the residual (distance between the points and the line)

o   β₀ and β₁ are known as the regression coefficients

-       The best linear regression line is the one closest to all data points, i.e., the residuals are as small as possible

-       Ordinary Least Squares can be used to estimate the regression line that minimizes the squared residuals, giving estimates for β₀ and β₁
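The OLS estimates can be sketched directly from the closed-form solution β₁ = cov(x, y)/var(x), β₀ = ȳ − β₁x̄ (data hypothetical):

```python
# Ordinary Least Squares for simple linear regression.
# The x/y data are made up so that y = 1 + 2x exactly (zero residuals).
def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
    beta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    beta0 = my - beta1 * mx
    return beta0, beta1

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
b0, b1 = ols(x, y)  # intercept and slope minimising squared residuals
```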

SLR Model

-       Null hypothesis: There is no linear association, e.g., the slope β₁ in the population equals 0

-       Assumptions

o   Linear relationship between DV and IV

o   Residuals are independent of one another and follow a normal distribution

o   Homogeneity of variance (homoscedasticity): size of error in prediction doesn’t change significantly across values of the IV

-       ‘Analyse’ → ‘Regression’ → ‘Linear’ → ‘Statistics’ → check ‘Estimates’ and ‘Confidence intervals’

-       R value represents degree of correlation, R2 value indicates how much of total variation in DV can be explained by IV in % (e.g., 0.270 = 27%)

-       Report on β₁ and t values

-       SLR can be used to predict new cases

o   Key in IV in new row

o   ‘Analyse’ → ‘Regression’ → ‘Linear’ → ‘Save’ → ‘Predicted Values’ (check ‘Unstandardized’ and ‘Mean’)

-       For categorical predictors with multiple levels

o   Need to recode into dummy variables that are binary

o   ‘Transform’ → ‘Recode into Different Variables’ → input one of the variables and create values → repeat for another variable

o   Compute linear regression using dummy variables created

§  ‘Analyse’ → ‘Regression’ → ‘Linear’ → add both dummy variables into IV → ‘Statistics’ → check ‘Estimates’ and ‘Confidence intervals’

§  Results will be relative to the reference category (e.g., if dummy variables are created for low and medium urbanicity, numbers will be compared to high urbanicity)

MULTIPLE LINEAR REGRESSION MODEL

-       When studying the relationship between 1 DV and two or more IVs simultaneously

o   Answers whether and how several factors are related to one other factor

o   Whether and how a set of IVs are related to one DV

-       y = β₀ + β₁x₁ + ⋯ + βₙxₙ + ε

o   βᵢ’s are the partial regression coefficients

o   βᵢ represents the change in average y for a one-unit change in xᵢ (holding (adjusting for) all other x’s/IVs fixed)

o   E.g., β₁ is the amount that the dependent variable y will increase (or decrease) for each unit increase in the independent variable x₁ while holding all other variables x₂, …, xₙ constant

-       ‘Analyse’ → ‘Regression’ → ‘Linear’ → ‘Statistics’ → tick CI and Estimates

-       Poverty not statistically significantly associated with crime when the poverty-crime relationship is adjusted for education (or education is held constant). Cannot generalise that poverty is associated with crime in the population (𝛽1= 23.927, t=1.621, p=0.112, 95%CI (-5.774, 53.627))

-       Can predict by ‘Analyse’ → ‘Regression’ → ‘Linear’ → ‘Save’ → ‘Predicted Values’ (check ‘Unstandardized’ and ‘Individual’)

Confounding Variables

-       When association between explanatory variable (e.g., exercise) and outcome (e.g., weight) is distorted by presence of another variable (e.g., hours of free time)

o   Can introduce bias in estimation of 𝛽1

-       Should have more than 10 observations per IV if dealing with multiple IVs

Coefficient of Determination

-       Coefficient of determination R² is a statistical measure of how well the regression line/hyperplane approximates the real data points (a.k.a. goodness of fit) → equivalent to r² for Pearson correlation

o   Interpreted as the proportion of the variance in the dependent variable that is “explained” by the independent variables in the model

o   Ranges from 0 to 1, lower = poorer fit

-       R²adj takes account of the phenomenon whereby R² increases every time an extra independent variable is added → a better indicator for model selection; the model with the higher R²adj fits the data better and should be selected

o   E.g., if R²adj is 0.100, the IVs explained 10% of the variance in the DV
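The adjustment can be sketched with the usual formula R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of IVs (values hypothetical):

```python
# Adjusted R-squared: penalises R-squared for each extra predictor.
# r2, n and p below are hypothetical.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = sample size, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.270                      # IVs explain 27% of the variance in the DV
adj = adjusted_r2(r2, n=50, p=3)
assert adj < r2                 # the penalty always pulls it below R-squared
```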

Assumptions for Multiple Regression

-       Relationship between DV and each continuous IV is linear

-       Residuals should be approximately normally distributed

o   Represent variation in Y that is not explained by the IVs

o   Can be seen from histogram and P-P plot → points should be close to the reference line

-       Homoscedasticity → no pattern between standardised residuals and standardised predicted values → error terms have the same variance irrespective of the values of the predictors

-       Independent observations

-       Assess assumptions by ‘Analyse’ → ‘Regression’ → ‘Linear’ → ‘Plots’ → ‘Histogram’, ‘Normal probability plot’, ‘Produce all partial plots’, ‘ZRESID in Y’ and ‘ZPRED in X’

-       Use ‘Save’ to output predicted values → ‘Unstandardized’, ‘Individual’, ‘CI’

MEDIATION

-       Explains portion of the association between Y and X1

o   Hypothesised causal mechanism by which one variable affects another

-       Total causal effect can be split into an indirect (or mediated) part with paths a and b, and a direct (non-mediated) path c′

-       Direct effect = c’

-       Indirect/mediated effect = a*b

-       c = total effect = direct + indirect effect = c′ + a*b
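The decomposition c = c′ + a·b can be checked numerically against the path coefficients from the worked housing example in these notes:

```python
# Mediation effect decomposition: total = direct + indirect.
# Coefficients are taken from the housing example in these notes.
a, b = 1.83, 1.398   # path a (X -> M) and path b (M -> Y, controlling for X)
c_direct = 4.00      # path c' (direct, non-mediated effect)

indirect = a * b               # mediated effect a*b
total = c_direct + indirect    # reproduces the estimated path c of 6.558
```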

Mediation Analysis

-       4 steps

o   Establish that X1 associated with Y

§  Estimate path c: Y = β0 + β1X1 + ε

o   Estimate path a: M = β0 + β1X1 +ε

§  Run SLM between M and X1

o   Estimate path b (M → Y, controlling for X1): Y = β0 + β2M + β3X1 + ε

§  Run MLM that adjusts for X1

o   Estimate path c’ → 2 ways

§  c = c’ + a*b, i.e., c = c’ + β1 * β2

§  From step 3, Y = β0 + β2M + β3X1 + ε, c’ = β3 (from SPSS)

·      If the p-value for β3 is >0.05, there is complete mediation, i.e., no association between X1 and Y when controlled for M

·      If the p-value for β3 is <0.05, there is partial mediation, i.e., c’ is smaller than c (in absolute value) but there is still an association between X1 and Y

§  c’ must be significantly smaller than c to establish a mediation effect

-       Steps in SPSS

o   ‘Analyse’ → ‘Regression’ → ‘Linear’

§  SLM for IV and DV (path c)

§  SLM for IV and M (path a)

§  MLM for IV, M and DV (paths b and c’)

-       Newer methods say that only steps 2 and 3 are essential for establishing mediation (i.e., indirect effect, H0: ab = 0)

o   Sobel test (normal theory approach) and nonparametric Sobel test (bootstrapping)

§  Z = ab/SE(ab)

§  SE(ab) is the SE of the estimated indirect effect; Sa and Sb are the standard errors of a and b taken from the regression models

§  If Z is greater than 1.96, reject the H0 that ab = 0

o   Sobel test only works in large samples because skewness is reduced

o   Bootstrapping is a better alternative; it provides an estimated CI for ab instead → significant if the CI does not contain 0
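The Sobel z can be sketched with the usual first-order SE formula, SE(ab) = √(b²·Sa² + a²·Sb²); the standard errors and all numbers below are hypothetical:

```python
# Sobel test for the indirect effect ab: z = ab / SE(ab).
# a, b and their standard errors se_a, se_b are hypothetical.
import math

def sobel_z(a: float, b: float, se_a: float, se_b: float) -> float:
    # First-order delta-method SE of the product ab
    se_ab = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    return (a * b) / se_ab

z = sobel_z(a=1.83, b=1.398, se_a=0.70, se_b=0.30)
significant = abs(z) > 1.96  # reject H0: ab = 0 at alpha = 0.05
```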

-       Step 1 passed: Path c (effect of treatment on stable housing) is equal to 6.558 (p value = 0.009), with a 95% confidence interval of [1.65 to 11.46]

-       Step 2 passed: Path a (effect of treatment on housing contact) is equal to 1.83 (p = 0.013), with a 95% confidence interval of [0.39 to 3.27]

-       Step 3 passed: Path b (effect of housing contacts on stable housing controlling for treatment) is equal to 1.398 (p <0.001), with a 95% confidence interval of [0.801 to 1.995]

-       Step 4 passed: Path c’ (effect of treatment on stable housing controlling for the mediator) is equal to 4.00 (p = 0.09), with a 95% confidence interval of [-0.63 to 8.62]

o   Controlling for the mediator substantially reduces the effect of treatment (c’= 4.00 < c=6.56).

o   Complete mediation, as the direct effect is not significantly different from 0




