Draft:Tiffstatskcl
Comment: In accordance with Wikipedia's Conflict of interest policy, I disclose that I have a conflict of interest regarding the subject of this article. Tiffanynandrea123 (talk) 13:37, 13 May 2025 (UTC)
SUMMARY OF STATS
Types of Data
- Categorical: Nominal or ordinal
o Represented with frequency and percentages, bar chart or pie chart (only for nominal)
o For ordinal with more than 5 points, use median/min-max or mean/SD
- Numerical: Discrete or continuous
o Represented with mean and SD (normal) OR median and min-max (skewed) and IQR
o Represented with histogram or box plot
o Create boxplot by ‘Graphs’ ‘Legacy dialogs’ ‘Boxplot’ ‘Simple’
Description of Data
- Variance (s²) measures the average squared distance of the values from the mean
- SD is how spread out a group of numbers is from the mean = square root of variance
o E.g., SD = 9 cm: heights are on average 9 cm away from the mean height
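The variance/SD relationship above can be sketched in Python with the standard library (heights are hypothetical, in cm):

```python
import statistics

heights = [160, 172, 155, 181, 167]      # hypothetical sample of heights (cm)
mean = statistics.mean(heights)
variance = statistics.variance(heights)  # sample variance (divides by n-1)
sd = statistics.stdev(heights)           # SD = square root of the variance
```

Note these are the sample (n−1) versions, matching what SPSS reports for descriptive statistics.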
Confidence and Significance Testing
Statistic and Parameter
- Statistic is estimator of model parameter
Sampling and Error
- Random error (noise, unpredictable) can go in either direction; due to unknown factors
- Systematic error (bias) always same direction of error, due to known factors
Sampling/Normal Distribution
- Sampling distribution: Distribution of these estimated means, from a number of samples
- Central limit theorem: given a sufficiently large sample size, the sampling distribution of the mean will approximate the normal distribution
- 68% around +/- 1 SD, 95% +/- 1.96 SD, 99% +/- 2.58 SD
- Variance of the sampling distribution = σ²/n
- SD of the sampling distribution is called the standard error = σ/√n
o Smaller variability in population, smaller SE = greater precision
o Larger sample size, smaller the SE and less wide CI
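The SE = σ/√n relationship (larger n, smaller SE) can be checked directly; σ = 10 is a hypothetical population SD:

```python
import math

sigma = 10.0  # hypothetical population SD
# standard error of the mean shrinks as the sample size grows
ses = {n: sigma / math.sqrt(n) for n in (25, 100, 400)}
```

Quadrupling the sample size halves the standard error, which is why larger samples give narrower CIs.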
To Create Own Sample
- ‘Transform’ ‘Random number generators’ input random 5 digits in ‘values’
- ‘Data’ ‘Select cases’ ‘Random sample of cases’ ‘Sample’ Exact number of samples required
Confidence Intervals
^ assumes the sampling distribution is (approximately) normal
- Confidence about whether interval contains true population mean
- Calculate on SPSS ‘Analyse’ ‘Descriptive stats’ ‘Explore’ Input variable in dependent list ‘statistics’ CI to 95%
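The 95% CI that SPSS ‘Explore’ produces is a t-based interval, mean ± t × SE; a sketch with hypothetical data:

```python
import math
import statistics
from scipy import stats

data = [4.1, 5.2, 6.3, 4.8, 5.9, 5.5, 4.4, 6.1]  # hypothetical sample
n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)        # standard error of the mean
# 95% CI using the t distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * se, mean + t_crit * se)
```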
Hypothesis Testing
- Type 1 error (α): Null hypothesis rejected but was true; α usually set to 0.05
- Type 2 error (β): Null hypothesis accepted but was false; power (1-β) usually set to 0.80/80%
- Power to correctly reject null is larger if the effect size is big or the sample is bigger
o Power analysis to determine sufficient sample size for research
- The type 1 error = Probability of rejecting a true null hypothesis (p-value)
- The type 2 error = Probability of not rejecting a false null
- The Power = Probability of rejecting a false null
- Steps to HT
o Create null and alt hypothesis for population parameter
o Sample from population and compute correct stat to estimate the parameter (SD/√n)
o Create sampling distribution for stat under the null
o Find rejection area and check if the sampled value (x̄) falls in the rejection area; under the null, the sampling distribution is always centred on the hypothesised population mean
COMPARING GROUPS (PARAMETRIC METHODS)
Types of T-Test – Equality of Means (Continuous)
- One sample t-test: Test sample mean against test value e.g., height
- Independent sample t-test: Difference in means between two groups
- Paired samples t-test: Before/after, matched cases, twin studies
- NOTE: Variable of interest needs to be normally distributed
One Sample T-Test
- Continuous variable: E.g., Male/Female income: ‘Analyse’ ‘Compare means’ ‘One sample t-test’ Key in test value
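The same one-sample t-test can be run outside SPSS; incomes and the test value 2000 are hypothetical:

```python
from scipy import stats

incomes = [2100, 2350, 1980, 2600, 2240, 2420, 2015, 2510]  # hypothetical
t_stat, p_value = stats.ttest_1samp(incomes, popmean=2000)  # test value 2000
```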
Independent Sample T-Test
- Split data file to check suitability of data/variable
o ‘Data’ ‘Split file’ ‘Gender’ (e.g.) and perform frequency test
o Recall to unsplit data ‘Data’ ‘Split file’ ‘Analyse all groups’
o ‘Analyse’ ‘Compare means’ ‘Independent samples t-test’
o If Levene’s test is sig, equal variances are not assumed; go with line 2 of the output (Welch’s t-test)
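The Levene-then-t decision above maps onto scipy's `equal_var` flag (groups are hypothetical):

```python
from scipy import stats

group_a = [5.1, 6.2, 5.8, 6.5, 5.4, 6.0]  # hypothetical groups
group_b = [4.2, 4.8, 5.0, 4.5, 4.9, 4.3]

# Levene's test: if significant, equal variances are not assumed
lev_stat, lev_p = stats.levene(group_a, group_b)
equal_var = lev_p >= 0.05
# equal_var=False gives Welch's t-test ("line 2" in SPSS output)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
```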
Two Paired Samples T-Test
- Calculate difference of ‘before’ and ‘after’ data through ‘Compute Variable’
o Check normal distribution of new variable
- ‘Analyse’ ‘Compare means’ ‘Paired samples t-test’
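The paired t-test is the same thing as computing the difference variable (as above) and running a one-sample t-test on it against 0; hypothetical before/after scores:

```python
from scipy import stats

before = [82, 75, 90, 68, 71, 88]  # hypothetical before/after scores
after = [78, 70, 85, 66, 70, 81]
t_stat, p_value = stats.ttest_rel(before, after)

# Equivalent: one-sample t-test on the differences against 0
diffs = [b - a for b, a in zip(before, after)]
t_diff, p_diff = stats.ttest_1samp(diffs, popmean=0)
```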
Types of χ2-Test – Equality of Proportions (Categorical)
- One sample χ2-test: Test sample proportions against hypothesised proportions (e.g., %)
o # of cells with expected frequencies less than 5 is less than 20%
o Minimum expected frequency is at least 1
- Independent sample χ2-test: Difference in proportions between two groups
o Observations must not be paired + above two assumptions
- Paired samples χ2-test: Before/after, matched cases, twin studies
o At least 25 observations in discordant cells; data are paired
One Sample χ2-test
- ‘Analyse’ ‘Non parametric test’ ‘Legacy Dialogs’ ‘Chi Square’ (Tick ‘all categories equal’ or enter value if there’s a specific value)
o When entering specific test value, ensure it totals 100% but input the % that you’re not interested in first (e.g., if test value is 20%, input 80 BEFORE 20)
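The one-sample χ²-test compares observed counts to expected counts under the hypothesised percentages; hypothetical example testing 30/100 "yes" against an expected 20% (note the 80% category listed first, as above):

```python
from scipy import stats

observed = [70, 30]  # hypothetical counts: "no" first, then "yes"
expected = [80, 20]  # 80% / 20% of n = 100 under the test value
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
```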
Independent Sample χ2-test
- ‘Analyse’ ‘Descriptive Statistics’ ‘Crosstabs’ ‘Statistics’ ‘Chi-square’ ‘Cells’ Tick ‘observed’ and ‘column’/‘row’ ‘Exact’
- Remember to interpret based on columns/rows that do not add up to 100%
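The crosstab χ²-test corresponds to a test of independence on the contingency table; a hypothetical 2×2 table:

```python
from scipy import stats

# Hypothetical 2x2 crosstab: rows = group, columns = outcome
table = [[30, 10],
         [20, 40]]
chi2, p, dof, expected = stats.chi2_contingency(table)
```

`expected` gives the expected frequencies, useful for checking the "<20% of cells below 5" assumption above.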
Paired Samples χ2-test
- ‘McNemar chi square test’ under statistics and ‘Total’ under percentages
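The McNemar test only uses the discordant cells of the paired 2×2 table, as noted above; a hypothetical paired table via statsmodels:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired 2x2 table: rows = before, columns = after
table = [[20, 15],
         [5, 60]]
# Only the discordant cells (15 and 5) drive the test
result = mcnemar(table, exact=True)  # exact binomial version
```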
COMPARING GROUPS (NON-PARAMETRIC METHODS)
Equality of Means
Wilcoxon Signed Rank Test
- For skewed continuous data, ordinal (interval) or discrete data
- Assumption at least interval data
- ‘Analyse’ ‘Nonparametric tests’ ‘One sample’ ‘Fields’ Add variable of interest in field’ ‘Settings’ ‘Compare median to hypothesized (input value)’
- Report on Z score (i.e., standardized test statistic)
Mann-Whitney U Test
- ‘Analyse’ ‘Nonparametric tests’ ‘Independent samples’ ‘Fields’ Add variable of interest in field and grouping variable ‘Settings’ ‘Customised tests (Mann Whitney U)’
- Report on Mann-Whitney U score
Wilcoxon Matched-Pair Signed Rank Test
- ‘Analyse’ ‘Nonparametric tests’ ‘Related samples’ ‘Fields’ Add variable of interest in field ‘Settings’ ‘Customised tests (Wilcoxon)’
- Report on Z score (i.e., standardized test statistic)
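Both rank-based tests above are available in scipy; the data here are hypothetical (skewed scores for two independent groups, plus paired before/after values):

```python
from scipy import stats

# Mann-Whitney U: two independent groups
group_a = [3, 5, 4, 6, 7, 5, 4]      # hypothetical skewed scores
group_b = [8, 9, 7, 10, 9, 8, 11]
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# Wilcoxon matched-pair signed rank: paired samples
before = [10, 12, 9, 14, 11, 13, 10]  # hypothetical pairs
after = [8, 11, 8, 12, 10, 11, 9]
w_stat, w_p = stats.wilcoxon(before, after)
```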
Equality of Proportions
Binomial Exact Test
- ‘Analyse’ ‘Non parametric test’ ‘Legacy Dialogs’ ‘Chi Square’ (Tick ‘all categories equal’ or enter value if there’s a specific value) ‘Exact’
Fisher’s Exact Test
- Can be read from the output table of the normal chi-square test; tick ‘Exact’
- For categories with more than 1 level (e.g., ethnicity), go to ‘Exact’ tab again
McNemar (Binomial Test)
- SPSS will automatically print out correct exact test
CORRELATION AND LINEAR REGRESSION
Scatterplots
- To display relationship between two variables observed over a number of instances
- To investigate empirical relationship between x (IV) and y (DV) attempt to predict Y from X
- ‘Graph’ ‘Legacy dialogs’ ‘Simple scatter’
o Double click graph and choose ‘linear fit line’ to draw line of best fit
Correlation
- Pearson’s (parametric) and Spearman’s (non-parametric) correlation coefficient
Pearson’s
- Assumptions
o Variables should be continuous, each observation should have a pair of values
o No significant outliers in either variable
o Linearity ‘straight line’ relationship between variables should be formed
- ‘Correlate’ ‘Bivariate’ ‘Pearson’
Spearman’s
- When one or both variables are not normally distributed
- Measures strength and direction of monotonic relationship (i.e., variables increase/decrease but not at a constant rate)
- Click ‘Spearman’ instead under ‘Bivariate’
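Both coefficients are one call each in scipy (hypothetical, roughly linear data; note Spearman's ρ hits exactly 1 for any strictly increasing relationship, linear or not):

```python
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # hypothetical, roughly linear in x

r, r_p = stats.pearsonr(x, y)        # parametric: strength of linear association
rho, rho_p = stats.spearmanr(x, y)   # non-parametric: monotonic association
```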
Simple Linear Regression
- Estimate relationship between variables for one continuous outcome and one predictor; measures to what extent there is a linear relationship between two variables
- Equation: 𝒚 = 𝜷𝟎 + 𝜷𝟏𝒙 + 𝜺
o X = IV, predictor, explanatory or covariate (continuous or categorical)
o Y = DV, outcome, or response (always continuous) y ‘depends on’ x
o Intercept 𝜷𝟎 is value that y takes when x is 0
o Slope 𝜷𝟏 determines the change in y when x changes by one unit
o 𝜺 is the residual (distance between points and the line)
o 𝜷𝟎 & 𝜷𝟏 Are known as regression coefficients
- Best linear regression line is closest to all data points i.e., residuals as small as possible
- Ordinary Least Squares estimates the regression line by minimizing the sum of squared residuals, giving estimates for 𝜷𝟎 & 𝜷𝟏
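The OLS estimates have a closed form for simple regression: slope β₁ = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)², intercept β₀ = ȳ − β₁x̄. A sketch with hypothetical data:

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]  # hypothetical data

xbar, ybar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sxy / sxx              # beta_1: change in y per unit change in x
intercept = ybar - slope * xbar  # beta_0: value of y when x = 0
```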
SLR Model
- Hypothesis: There is no linear association, i.e., the slope 𝛽1 in the population equals 0
- Assumptions
o Linear relationship between DV and IV
o Residuals are independent of one another and follow a normal distribution
o Homogeneity of variance (homoscedasticity): size of error in prediction doesn’t change significantly across values of the IV
- ‘Analyse’ ‘Regression’ ‘Linear’ ‘Statistics’ Check ‘estimates’ and ‘confidence intervals’
- R value represents degree of correlation, R2 value indicates how much of total variation in DV can be explained by IV in % (e.g., 0.270 = 27%)
- Report on 𝜷𝟏 and t values
- SLR can be used to predict new cases
o Key in IV in new row
o ‘Analyse’ ‘Regression’ ‘Linear’ ‘Save ‘Prediction values’ (check ‘unstandardized’ and ‘mean’)
- For categorical predictors
- For multiple categorical predictors
o Need to recode into dummy variables that are binary
o ‘Transform’ ‘Recode into diff variables’ Input one of the variables and create value repeat for another variable
o Compute linear regression using dummy variables created
§ ‘Analyse’ ‘Regression’ ‘Linear’ add both dummy variables into IV ‘Statistics’ Check ‘estimates’ and ‘confidence intervals’
§ Results will be against the reference (non-dummy) category (e.g., if dummy variables created for low and medium urbanicity, numbers will be compared to high urbanicity)
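The dummy-coding step above can be done in one line with pandas (the `urbanicity` variable and its levels are hypothetical; the reference category `high` is dropped so the remaining dummies are compared against it):

```python
import pandas as pd

# Hypothetical 3-level categorical variable
df = pd.DataFrame({'urbanicity': ['low', 'medium', 'high', 'low', 'high']})
# Keep only 'low' and 'medium' dummies; 'high' becomes the reference category
dummies = pd.get_dummies(df['urbanicity'])[['low', 'medium']]
```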
MULTIPLE LINEAR REGRESSION MODEL
- When studying the relationship between 1 DV and two or more IVs simultaneously
o Answer whether and how several factors are related with one other factor
o Whether and how a set of IVs are related with one DV
- 𝒚 = 𝜷𝟎 + 𝜷𝟏𝒙𝟏 + ⋯ + 𝜷𝒏𝒙𝒏 + 𝜺
o 𝜷𝒊’s are the partial regression coefficients.
o 𝜷𝒊 represents the change in average y for one unit change in 𝐱𝑖 (holding (adjusting for) all other x’s /IVs fixed)
o E.g. 𝜷𝟏 is the amount that the dependent variable y will increase (or decrease) for each unit increase in the independent variable x1 while holding all other variables x2, …, xn constant.
- ‘Analyse’ ‘Regression’ ‘Linear’ ‘Statistics’ Tick CI and Estimates
- Poverty not statistically significantly associated with crime when the poverty-crime relationship is adjusted for education (or education is held constant). Cannot generalise that poverty is associated with crime in the population (𝛽1= 23.927, t=1.621, p=0.112, 95%CI (-5.774, 53.627))
- Can predict by Analyse’ ‘Regression’ ‘Linear’ ‘Save ‘Prediction values’ (check ‘unstandardized’ and ‘individual)
Confounding Variables
- When association between explanatory variable (e.g., exercise) and outcome (e.g., weight) is distorted by presence of another variable (e.g., hours of free time)
o Can introduce bias in estimation of 𝛽1
- Should have more than 10 observations per IV if dealing with multiple IVs
Coefficient of Determination
- Coefficient of determination R2 is a statistical measure of how well the regression line/hyperplane approximates the real data points (a.k.a. goodness of fit) equivalent to r2 for Pearson correlation
o Interpreted as the proportion of the variance in the dependent variable that is “explained” by the independent variables in the model
o Ranges from 0 to 1, lower = poorer fit
- R²adj takes account of the phenomenon whereby R² increases every time an extra independent variable is added; better indicator for model selection, the model with the higher R²adj better fits the data and should be selected
o E.g., if R2adj is 0.100, IVs explained 10% of variance in DV
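The adjustment uses the standard formula R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1); the n = 50 and k = 3 below are hypothetical:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 from R^2, sample size n, and number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: R^2 = 0.270 with 50 observations and 3 IVs
r2_adj = adjusted_r2(0.270, n=50, k=3)
```

The penalty grows with k, so adding a useless IV lowers R²adj even though raw R² never decreases.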
Assumptions for Multiple Regression
- Relationship between DV and each continuous IV is linear
- Residuals should be approximately normally distributed
o Represents variation in Y that is not explained by the IVs
o Can be seen from histogram and P-P plot should be close to reference line
- Homoscedasticity: no pattern in standardised residuals vs standardized predicted values; error terms have same variance irrespective of values of the DV
- Independent observations
- Assess assumptions by ‘Analyse’ ‘Regression’ ‘Linear’ ‘Plots’ ‘Histogram’, ‘Normal probability plot’, ‘Produce all partial plots’, ‘ZRESID in Y’ and ‘ZPRED in X’
- Use ‘save’ to input predicted values ‘unstandardised’ , ‘Individual’ ‘CI’
MEDIATION
- Explains portion of the association between Y and X1
o Hypothesised causal mechanism by which one variable affects another
- Total causal effect can be split into indirect (or mediated) part with paths 𝑎 and 𝑏 and a direct (non-mediated) path 𝑐′
- Direct effect = c’
- Indirect/mediated effect = a*b
- 𝑐= Total effect = direct + indirect effect =𝑐′ + 𝑎 * 𝑏
Mediation Analysis
- 4 steps
o Establish that X1 associated with Y
§ Estimate path c: Y = β0 + β1X1 + ε
o Estimate path a: M = β0 + β1X1 +ε
§ Run SLM between M and X1
o Estimate path b (M Y, controlling for X1): Y = β0 + β2M +β3X1+ε
§ Run MLM that adjusts for X1
o Estimate path c’ 2 ways
§ From c = c’ + a*b: c’ = c − a*b
§ From step 3, Y = β0 + β2M + β3X1 + ε, c’ = β3 (from SPSS)
· If the p-value for β3 is >0.05, there is complete mediation i.e., no association between X1 and Y when controlled for M
· If the p-value for β3 is <0.05, partial mediation i.e., c’ is smaller than c (in absolute value); there is still an association between X1 and Y
§ C’ must be significantly smaller than c to establish mediation effect
- Steps in SPSS
o ‘Analyse’ ‘Regression’ ‘Linear’
§ SLM for IV and DV (path c)
§ SLM for IV and M (path a)
§ MLM for IV, M and DV (paths b and c’)
- Newer methods say that only steps 2 and 3 are essential for establishing mediation (i.e., indirect effect, H0: ab = 0)
o Sobel test (normal theory approach) and nonparametric Sobel test (bootstrapping)
§ Z = ab/SE(ab)
§ SE(ab) = √(b²·Sa² + a²·Sb²), where SE(ab) is the SE of the estimated indirect effect and Sa and Sb are the standard errors of a and b taken from the MLM
§ If Z is greater than 1.96, reject the H0 that ab = 0
o Sobel test only works in large samples because skewness is reduced
o Bootstrapping is a better alternative; provides an estimated CI for ab instead, significant if the CI does not contain 0
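The Sobel z statistic above is a short calculation; the path estimates and standard errors below are hypothetical (the SEs are invented for illustration, not taken from the worked example that follows):

```python
import math
from scipy import stats

# Hypothetical path estimates and standard errors from the two regressions
a, se_a = 1.83, 0.70    # path a: X -> M
b, se_b = 1.398, 0.30   # path b: M -> Y, controlling for X

# Sobel SE of the indirect effect and the z statistic for H0: ab = 0
se_ab = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
z = (a * b) / se_ab
p = 2 * (1 - stats.norm.cdf(abs(z)))
significant = abs(z) > 1.96
```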
- Step 1 passed: Path c (effect of treatment on stable housing) is equal to 6.558 (p value = 0.009), with a 95% confidence interval of [1.65 to 11.46]
- Step 2 passed: Path a (effect of treatment on housing contact) is equal to 1.83 (p = 0.013), with a 95% confidence interval of [0.39 to 3.27]
- Step 3 passed: Path b (effect of housing contacts on stable housing controlling for treatment) is equal to 1.398 (p <0.001), with a 95% confidence interval of [0.801 to 1.995]
- Step 4 passed: Path c’ (effect of treatment on stable housing controlling for the mediator) is equal to 4.00 (p =0.09), with a 95% confidence interval of -0.63 to 8.62.
o Controlling for the mediator substantially reduces the effect of treatment (c’= 4.00 < c=6.56).
o Complete mediation as direct effect is not sig different from 0.