Statistics Fundamentals
This document provides a comprehensive overview of key statistical concepts, scales of measurement, probability theorems, distributions, and inferential statistics, particularly relevant to business applications.
1. Types of Data
1.1 Qualitative Data / Categorical Data
Nominal Data: Categories without any inherent order.
Example: Gender (Male, Female, Other), Colors (Red, Blue, Green).
Ordinal Data: Categories with a meaningful order, but the difference between categories is not quantifiable.
Example: Education Level (High School, Bachelor's, Master's, PhD), Customer Satisfaction (Poor, Fair, Good, Excellent).
Binary Data (Dichotomous Data): A special case of nominal data with only two categories.
Example: Yes/No, True/False, Pass/Fail.
1.2 Quantitative Data
Interval Data: Ordered data where the difference between values is meaningful and constant, but there is no true zero point.
Example: Temperature in Celsius or Fahrenheit.
Ratio Data: Ordered data with a meaningful and constant difference between values, and a true zero point.
Example: Height, Weight, Income, Age.
1.3 Data by Number of Variables
Univariate Data: Data involving a single variable.
Bivariate Data: Data involving two variables.
Multivariate Data: Data involving more than two variables.
1.4 Data by Time Dimension
Time Series Data: Data collected over a period of time, where the order of observations matters.
Example: Monthly sales figures, Daily stock prices.
Cross-Sectional Data: Data collected at a specific point in time from multiple entities.
Example: Survey data collected from different individuals in a single day.
2. Scales of Measurement in Business Statistics
2.1 Nominal Scale: Categorical data without order.
2.2 Ordinal Scale: Categorical data with order.
2.3 Interval Scale: Numerical data with order and equal intervals, but no true zero.
2.4 Ratio Scale: Numerical data with order, equal intervals, and a true zero.
3. Measures of Central Tendency
These measures describe the typical or central value of a dataset.
Mean: The average of all values. $$ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} $$
Median: The middle value in an ordered dataset.
Mode: The most frequently occurring value in a dataset.
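As a quick illustration, all three measures can be computed with Python's standard library; the sample values below are made up for demonstration.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 15, 30]  # illustrative sample, not from the text

print(statistics.mean(data))    # arithmetic mean: 18.375
print(statistics.median(data))  # middle value of the sorted data: 16.5
print(statistics.mode(data))    # most frequent value: 15
```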
4. Measures of Dispersion
These measures describe the spread or variability of data.
Range: The difference between the maximum and minimum values.
Variance: The average of the squared deviations from the mean (divide by $n$ for a population, by $n-1$ for a sample). $$ \text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} $$
Standard Deviation: The square root of the variance, expressed in the same units as the data.
Interquartile Range (IQR): The difference between the third and first quartiles, $Q_3 - Q_1$.
Box Plot: A graphical representation of the distribution of data through quartiles. It shows the median, quartiles, and potential outliers.
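A minimal sketch of these measures in Python, assuming NumPy is available; the quartiles computed here are the same summary statistics a box plot displays.

```python
import numpy as np

data = np.array([12, 15, 15, 18, 20, 22, 15, 30])  # illustrative sample

data_range = data.max() - data.min()            # range: 18
variance = data.var(ddof=1)                     # sample variance (n - 1 denominator)
std_dev = data.std(ddof=1)                      # sample standard deviation
q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles shown by a box plot
iqr = q3 - q1                                   # interquartile range
```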
5. Relationship between AM, GM, and HM
5.1 Arithmetic Mean (AM): The sum of values divided by the number of values. $$ \text{AM} = \frac{\sum_{i=1}^{n} x_i}{n} $$
5.2 Geometric Mean (GM): The nth root of the product of n values, typically used for average growth rates. $$ \text{GM} = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n} $$
5.3 Harmonic Mean (HM): The reciprocal of the arithmetic mean of the reciprocals of the values, used for averaging rates. $$ \text{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} $$
5.4 Relationship between AM, GM, and HM: For a set of positive numbers, $AM \ge GM \ge HM$. Equality holds only when all the numbers are equal.
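The inequality can be checked directly with Python's statistics module; the growth factors below are illustrative.

```python
import statistics

factors = [1.10, 1.25, 1.05]  # illustrative yearly growth factors

am = statistics.mean(factors)
gm = statistics.geometric_mean(factors)  # the right average for growth factors
hm = statistics.harmonic_mean(factors)   # the right average for rates such as speed

assert am >= gm >= hm  # holds for any positive data; equality only if all values match
```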
6. Skewness – Measures and Interpretation
Skewness measures the asymmetry of the probability distribution of a real-valued random variable about its mean.
6.1 What is Skewness? A measure of the asymmetry of the probability distribution of a random variable.
6.2 Tests of Skewness: Various statistical tests can determine whether a distribution is significantly skewed.
6.3 Karl Pearson's Measure of Skewness: A common method to quantify skewness (formulas in 6.9).
6.4 Positive and Negative Skewness:
6.5 Positive Skewness (Right Skew): The tail on the right side of the distribution is longer or fatter. The mean is typically greater than the median.
6.6 Negative Skewness (Left Skew): The tail on the left side of the distribution is longer or fatter. The mean is typically less than the median.
6.7 Zero Skewness (Symmetrical Distribution): The distribution is perfectly symmetrical. The mean, median, and mode are equal.
6.8 Measurement of Skewness: Methods include Pearson's coefficients, Bowley's coefficient, and moment-based measures.
6.9 Karl Pearson's Measure: $$ \text{Pearson's Coefficient of Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} $$ or $$ \text{Pearson's Coefficient of Skewness} = 3 \times \frac{\text{Mean} - \text{Median}}{\text{Standard Deviation}} $$
6.10 Bowley's Measure: Based on quartiles. $$ \text{Bowley's Coefficient of Skewness} = \frac{Q_3 + Q_1 - 2 \times Q_2}{Q_3 - Q_1} $$ where $Q_1$, $Q_2$ (the median), and $Q_3$ are the first, second, and third quartiles.
6.11 Kelly's Measure: A percentile-based measure that uses the tails of the distribution. $$ \text{Kelly's Coefficient of Skewness} = \frac{P_{90} + P_{10} - 2 \times P_{50}}{P_{90} - P_{10}} $$
6.12 Interpretation of Skewness: Indicates the direction and degree of asymmetry.
6.13 Difference between Dispersion and Skewness: Dispersion measures spread, while skewness measures asymmetry.
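A short sketch computing Pearson's median-based coefficient and Bowley's coefficient, assuming NumPy; the right-skewed sample is illustrative.

```python
import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 9, 14, 21])  # illustrative right-skewed sample

mean, median = data.mean(), np.median(data)
std = data.std(ddof=1)
pearson = 3 * (mean - median) / std  # Pearson's median-based coefficient

q1, q2, q3 = np.percentile(data, [25, 50, 75])
bowley = (q3 + q1 - 2 * q2) / (q3 - q1)  # quartile-based, robust to outliers

print(pearson, bowley)  # both positive here, indicating right skew
```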
7. What is a Regression Line?
A regression line is a line that best fits the data points on a scatter plot, typically used to model the relationship between two variables.
7.1 What is a Regression Line? A line that represents the best linear approximation of the relationship between dependent and independent variables.
7.2 Equation of Regression Line: $$ y = a + bx $$ where:
$y$ is the dependent variable.
$x$ is the independent variable.
$a$ is the y-intercept.
$b$ is the slope of the line.
7.3 Graphical Representation of Regression Line: Plotted on a scatter plot showing the observed data points and the fitted line.
7.4 Examples of Regression Line: Predicting sales based on advertising spend.
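A minimal least-squares fit in Python (assuming NumPy), echoing the advertising-spend example; the numbers are illustrative.

```python
import numpy as np

# Illustrative data: advertising spend (x, in $1000s) vs. sales (y, in units)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([12, 15, 21, 24, 30, 33])

b, a = np.polyfit(x, y, deg=1)   # least-squares slope b and intercept a of y = a + bx
print(f"y = {a:.2f} + {b:.2f}x")

prediction = a + b * 7           # predicted sales for a $7000 spend
```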
8. Types of Regression Lines
8.1 Linear Regression Line: Assumes a linear relationship between variables.
8.2 Logistic Regression Line: Used for binary classification problems, modeling the probability of an event.
8.3 Polynomial Regression Line: Models a curved relationship between variables using polynomial functions (a short sketch follows this list).
8.4 Ridge and Lasso Regression: Regularization techniques used to prevent overfitting in linear regression models.
8.5 Non-Linear Regression Line: Models relationships that are not linear.
8.6 Multiple Regression Line: Extends linear regression to model relationships with more than one independent variable.
8.7 Exponential Regression Line: Models exponential growth or decay.
8.8 Piecewise Regression Line: Fits different linear segments to different parts of the data.
8.9 Time Series Regression Line: Regression models specifically applied to time series data.
8.10 Power Regression Line: Models relationships where one variable is a power function of another.
8.11 Applications of Regression Line: Forecasting, understanding relationships, prediction.
8.12 Importance of Regression Line: Quantifies relationships and allows for predictions.
8.13 Statistical Significance of Regression Line: Testing whether the relationship between variables is statistically significant.
8.14 Practice Questions on Regression Line: Exercises to reinforce understanding.
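To illustrate the difference between a linear and a polynomial regression line (items 8.1 and 8.3), here is a sketch using NumPy's polynomial fitting; the curved data is synthetic.

```python
import numpy as np

# Synthetic curved data: y = x^2 + 1
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 5, 10, 17, 26, 37, 50, 65])

linear = np.polyfit(x, y, deg=1)     # straight-line fit
quadratic = np.polyfit(x, y, deg=2)  # polynomial (degree-2) fit

# Sum of squared residuals: the quadratic fit captures the curvature
sse_linear = np.sum((y - np.polyval(linear, x)) ** 2)
sse_quadratic = np.sum((y - np.polyval(quadratic, x)) ** 2)
print(sse_linear, sse_quadratic)     # the quadratic error is essentially zero
```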
9. Probability Theorems | Theorems and Examples
9.1 What is Probability? The likelihood of an event occurring, expressed as a number between 0 and 1.
9.2 Probability Theorems: Fundamental rules governing probability calculations.
9.3 Theorem of Complementary Events: $P(A') = 1 - P(A)$, where $A'$ is the complement of event $A$.
9.4 Theorem of Addition: For mutually exclusive events, $P(A \cup B) = P(A) + P(B)$. For non-mutually exclusive events, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
9.5 Theorem of Multiplication (Statistical Independence): For independent events, $P(A \cap B) = P(A) \times P(B)$.
9.6 Theorem of Total Probability: If $B_1, B_2, \dots, B_n$ are mutually exclusive and exhaustive events, then $P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i)$.
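The four theorems can be verified with plain arithmetic; the probabilities below are assumed for illustration.

```python
# Assumed illustrative probabilities for two independent events A and B
p_a, p_b = 0.30, 0.50

p_not_a = 1 - p_a                 # complementary events: 0.70
p_a_and_b = p_a * p_b             # multiplication theorem (independence): 0.15
p_a_or_b = p_a + p_b - p_a_and_b  # addition theorem, general form: 0.65

# Total probability: reach A through the partition {B, B'} of the sample space
p_a_given_b, p_a_given_not_b = 0.20, 0.40  # assumed conditional probabilities
p_a_total = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)  # 0.30, matching P(A)
```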
10. Tree Diagram: Meaning, Features, Conditional Probability and Examples
A tree diagram is a visual tool used to represent probabilities of sequential events.
10.1 What is a Tree Diagram? A graphical representation of the outcomes of a sequence of events.
10.2 Features of Tree Diagram: Branches represent the possible outcomes of each event, with a probability assigned to every branch; nodes mark the points where the next event branches out.
10.3 How to Draw a Tree Diagram? Start with an initial node, branch out for each possible outcome of the first event, and continue branching for subsequent events.
10.4 Tree Diagram for Conditional Probability: Illustrates how the probability of an event changes based on the occurrence of a previous event; second-level branches carry conditional probabilities.
10.5 Tree Diagram in Probability Theory: Useful for calculating probabilities of compound events (multiply along a path) and of alternative routes to the same outcome (add across paths).
10.6 Examples of Tree Diagram: Calculating probabilities in games of chance, sequential decision-making.
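A tree-diagram calculation in code: two draws without replacement from an urn, multiplying along branches and adding across paths. The urn contents are illustrative.

```python
from fractions import Fraction

# Urn with 3 red and 2 blue balls; two draws without replacement.
p_red1 = Fraction(3, 5)             # first-level branch: red on draw 1
p_blue1 = Fraction(2, 5)            # first-level branch: blue on draw 1
p_red2_after_red = Fraction(2, 4)   # second-level branch: conditional probability
p_red2_after_blue = Fraction(3, 4)  # second-level branch: conditional probability

# Multiply along each path, then add the paths that end in "red on draw 2"
p_red2 = p_red1 * p_red2_after_red + p_blue1 * p_red2_after_blue
print(p_red2)  # 3/5
```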
11. Joint Probability | Concept, Formula, and Examples
Joint probability is the probability of two or more events occurring simultaneously.
11.1 What is Joint Probability in Business Statistics? The probability of two or more variables taking specific values or falling into specific categories at the same time.
11.2 Difference between Joint Probability and Conditional Probability: Joint probability is $P(A \cap B)$, the probability of both A and B occurring. Conditional probability is $P(A|B)$, the probability of A given that B has occurred; the two are linked by $P(A \cap B) = P(A|B) \times P(B)$.
11.3 Probability Density Function (PDF): A function describing the likelihood of a continuous random variable taking on a given value.
11.4 Probability Density Function Formula: Varies by distribution type (see the specific distributions below).
11.5 Properties of Probability Density Function: Non-negative; the total area under the curve is 1.
11.6 Probability Distribution Function of Discrete Distribution: Represented by a probability mass function (PMF).
11.7 Probability Distribution Function of Continuous Distribution: Represented by a probability density function (PDF).
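A small numeric sketch of joint, marginal, and conditional probabilities from a table of counts; the customer counts are invented for illustration.

```python
# Illustrative counts: customers by region and purchase outcome
counts = {("North", "Buy"): 30, ("North", "NoBuy"): 20,
          ("South", "Buy"): 10, ("South", "NoBuy"): 40}
total = sum(counts.values())  # 100

p_north_and_buy = counts[("North", "Buy")] / total                         # joint: 0.30
p_north = (counts[("North", "Buy")] + counts[("North", "NoBuy")]) / total  # marginal: 0.50
p_buy_given_north = p_north_and_buy / p_north                              # conditional: 0.60
```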
12. Bivariate Frequency Distribution | Calculation, Advantages, and Disadvantages
A bivariate frequency distribution shows the frequencies of occurrences for combinations of two variables.
12.1 Definition: A table that summarizes the relationship between two variables by showing the frequency of each combination of their values or categories.
12.2 Components:
Joint Frequencies: Frequencies for specific combinations of values of the two variables.
Marginal Frequencies: Frequencies for each value of a single variable, ignoring the other.
Conditional Frequencies: Frequencies of one variable given a specific value of the other.
12.3 Construction: Create a grid with classes/values of one variable on rows and the other on columns. Tally the occurrences for each combination.
12.4 Graphical Representation: Scatter plots or heatmaps.
12.5 Calculation: Involves tallying observations into appropriate cells and then calculating marginal and conditional frequencies. Used for correlation or association analysis.
12.6 Advantages: Helps visualize and understand the relationship between two variables, identify patterns, and check for independence.
12.7 Disadvantages: Can become complex with many categories or variables; requires sufficient data to be meaningful.
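A bivariate frequency table can be built in one call with pandas, assuming it is installed; the survey responses are illustrative. The margins=True option appends the marginal frequencies.

```python
import pandas as pd

# Illustrative survey responses: two categorical variables per respondent
df = pd.DataFrame({
    "education": ["HS", "BSc", "BSc", "MSc", "HS", "MSc", "BSc", "HS"],
    "satisfied": ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Cells hold joint frequencies; the "All" row/column holds marginal frequencies
table = pd.crosstab(df["education"], df["satisfied"], margins=True)
print(table)
```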
13. Bernoulli Distribution in Business Statistics – Mean and Variance
The Bernoulli distribution describes the probability of success or failure in a single trial.
13.1 Terminologies:
Bernoulli Trial: An experiment with only two possible outcomes (success or failure).
Success: The desired outcome.
Failure: The outcome other than success.
13.2 Formula of Bernoulli Distribution: $$ P(X=k) = p^k (1-p)^{1-k}, \quad \text{for } k \in \{0, 1\} $$ where $p$ is the probability of success.
13.3 Mean and Variance of Bernoulli Distribution:
Mean (Expected Value): $E(X) = p$
Variance: $Var(X) = p(1-p)$
13.4 Properties: Single trial, two outcomes, constant probability of success.
13.5 Bernoulli Distribution Graph: A simple bar chart showing probabilities for $X=0$ and $X=1$.
13.6 Bernoulli Trial: (See 13.1)
13.7 Examples: Flipping a coin (Heads/Tails), Pass/Fail on a test.
13.8 Applications: Modeling single events such as a customer clicking an ad or a machine failing on a production line.
13.9 Bernoulli Distribution and Binomial Distribution: The Binomial distribution is a sum of independent Bernoulli trials.
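A direct transcription of the Bernoulli formula into Python; the click probability is an assumed figure.

```python
def bernoulli_pmf(k: int, p: float) -> float:
    """P(X = k) for a single Bernoulli trial with success probability p; k is 0 or 1."""
    return p**k * (1 - p)**(1 - k)

p = 0.3                     # assumed probability that a customer clicks an ad
print(bernoulli_pmf(1, p))  # P(success) = 0.3
print(bernoulli_pmf(0, p))  # P(failure) = 0.7
print(p, p * (1 - p))       # mean 0.3 and variance 0.21
```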
14. Binomial Distribution in Business Statistics – Definition, Formula & Examples
The Binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials.
14.1 Formula of Binomial Distribution: $$ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \text{for } k = 0, 1, 2, \dots, n $$ where:
$n$ is the number of trials.
$k$ is the number of successes.
$p$ is the probability of success in a single trial.
$\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.
14.2 Properties: Fixed number of trials, independent trials, two outcomes per trial, constant probability of success.
14.3 Negative Binomial Distribution: Describes the number of trials needed to achieve a fixed number of successes (covered in Section 16).
14.4 Mean and Variance of Binomial Distribution:
Mean: $E(X) = np$
Variance: $Var(X) = np(1-p)$
14.5 Shape of Binomial Distribution: Bell-shaped and symmetric if $p=0.5$. Skewed if $p \neq 0.5$.
14.6 Solved Examples: Calculating the probability of getting exactly 3 heads in 5 coin flips (worked in the sketch after this list).
14.7 Uses in Business Statistics: Quality control, marketing response rates, customer purchasing behavior.
14.8 Real-Life Scenarios: Number of defective items in a batch, number of customers who respond to a campaign.
14.9 Difference Between Binomial Distribution and Normal Distribution: Binomial is for discrete trials, Normal is for continuous data. Normal can approximate Binomial for large $n$.
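The solved example from 14.6, worked in Python using the formula above; only the standard library is needed.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly 3 heads in 5 fair coin flips: C(5,3) * 0.5^5 = 10/32
print(binomial_pmf(3, 5, 0.5))  # 0.3125

n, p = 5, 0.5
print(n * p, n * p * (1 - p))   # mean 2.5 and variance 1.25
```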
15. Geometric Mean in Business Statistics | Concept, Properties, and Uses
The Geometric Mean is used for averaging rates of change or ratios.
15.1 Weighted Geometric Mean: A geometric mean in which each observation is assigned a weight. $$ \text{Weighted GM} = \left( \prod_{i=1}^{n} x_i^{w_i} \right)^{1 / \sum_{i=1}^{n} w_i} $$
15.2 Properties: Dampens the effect of very large values compared with the arithmetic mean, but is highly sensitive to values near zero and undefined if any value is zero or negative; for positive data, it is always less than or equal to the arithmetic mean.
15.3 Uses: Calculating average investment returns, average growth rates, index numbers.
16. Negative Binomial Distribution: Properties, Applications, and Examples
The Negative Binomial distribution models the number of trials required to achieve a fixed number of successes.
16.1 Properties: Defined by parameters $r$ (number of successes) and $p$ (probability of success).
16.2 Probability Mass Function (PMF): $$ P(X=k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad \text{for } k = r, r+1, \dots $$ where $X$ is the number of trials.
16.3 Mean and Variance:
Mean: $E(X) = \frac{r}{p}$
Variance: $Var(X) = \frac{r(1-p)}{p^2}$
16.4 Applications in Business Statistics: Modeling customer purchasing frequency, duration of service calls.
16.5 Examples: The number of calls a salesperson needs to make to achieve 5 sales, assuming a constant probability of sale per call.
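The salesperson example in code, transcribing the PMF above; the per-call success probability of 0.3 is assumed. Note that some libraries (e.g., scipy.stats.nbinom) count failures rather than total trials, so their parameterization differs from this one.

```python
from math import comb

def neg_binomial_pmf(k: int, r: int, p: float) -> float:
    """P(X = k): the r-th success occurs on trial k (k = r, r+1, ...)."""
    return comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)

# P(the 5th sale happens on the 12th call), assuming p = 0.3 per call
print(neg_binomial_pmf(12, 5, 0.3))

r, p = 5, 0.3
print(r / p, r * (1 - p) / p**2)  # mean ~16.67 calls, variance ~38.89
```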
17. Hypergeometric Distribution in Business Statistics: Meaning, Examples & Uses
The Hypergeometric distribution is used for sampling without replacement from a finite population.
17.1 Probability Mass Function (PMF): $$ P(X=k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} $$ where:
$N$ is the population size.
$K$ is the number of success states in the population.
$n$ is the number of draws (sample size).
$k$ is the number of observed successes.
17.2 Mean and Variance:
Mean: $E(X) = n \frac{K}{N}$
Variance: $Var(X) = n \frac{K}{N} \left(1 - \frac{K}{N}\right) \left(\frac{N-n}{N-1}\right)$
17.3 Examples: The probability of drawing a certain number of defective items from a batch without putting them back.
17.4 When to Use: When sampling without replacement from a finite population, and the probability of success changes with each draw.
17.5 Difference Between Hypergeometric Distribution and Binomial Distribution: Binomial assumes replacement or infinite population (constant probability of success), while Hypergeometric does not.
17.6 Conclusion: Crucial for scenarios where sampling affects probabilities.
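A sketch of the PMF above applied to the defective-items example; the batch composition is illustrative.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) successes in n draws without replacement from a population
    of size N that contains K success states."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Batch of N=20 items with K=4 defectives; draw n=5 without replacement
print(hypergeom_pmf(1, N=20, K=4, n=5))  # P(exactly one defective) ~ 0.47
```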
18. Poisson Distribution: Meaning, Characteristics, Shape, Mean, and Variance
The Poisson distribution models the number of events occurring in a fixed interval of time or space.
18.1 Probability Mass Function (PMF) of Poisson Distribution: $$ P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad \text{for } k = 0, 1, 2, \dots $$ where $\lambda$ is the average rate of events.
18.2 Characteristics: Events occur independently, the average rate is constant.
18.3 Shape of Poisson Distribution: Skewed to the right, becomes more symmetric as $\lambda$ increases.
18.4 Mean and Variance of Poisson Distribution:
Mean: $E(X) = \lambda$
Variance: $Var(X) = \lambda$
18.5 Fitting a Poisson Distribution: Estimating $\lambda$ from observed data and using the formula to calculate probabilities.
18.6 Poisson Distribution as an Approximation to Binomial Distribution: When $n$ is large and $p$ is small, Poisson can approximate Binomial with $\lambda = np$.
18.7 Examples: Number of customer calls per hour, number of defects per square meter of fabric.
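A sketch of the Poisson PMF and of the binomial approximation from 18.6; the call-center rate and the $n$, $p$ values are assumed.

```python
from math import comb, exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    return lam**k * exp(-lam) / factorial(k)

# Assumed rate: a call center averages 4 calls per hour
print(poisson_pmf(6, lam=4.0))  # P(exactly 6 calls in an hour)

# Poisson approximation to the binomial when n is large and p is small
n, p = 1000, 0.004
binomial = comb(n, 6) * p**6 * (1 - p)**(n - 6)
print(binomial, poisson_pmf(6, lam=n * p))  # the two values nearly coincide
```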
19. Gamma Distribution in Statistics
The Gamma distribution is a flexible, continuous probability distribution often used to model waiting times or the sum of exponential random variables.
19.1 What is Gamma Distribution: A continuous probability distribution characterized by a shape parameter and a rate or scale parameter.
19.2 Gamma Function: The distribution is defined in terms of the Gamma function, $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t}dt$.
19.3 Gamma Distribution Formula – Probability Density Function (PDF): $$ f(x; \alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}, \quad \text{for } x > 0 $$ where $\alpha$ is the shape parameter and $\beta$ is the rate parameter.
19.4 Gamma Distribution Mean and Variance:
Mean: $E(X) = \frac{\alpha}{\beta}$
Variance: $Var(X) = \frac{\alpha}{\beta^2}$
19.5 Special Case 1: Exponential Distribution: When $\alpha = 1$, Gamma becomes Exponential, modeling time until the first event.
19.6 Examples of Exponential Distribution: Time until a machine breaks down, time between customer arrivals.
19.7 Special Case 2: Chi-Square Distribution: When $\alpha = \nu/2$ and $\beta = 1/2$ (or scale parameter $2$), Gamma becomes Chi-Square, used in hypothesis testing.
19.8 Examples of Chi-Square Distribution: Goodness-of-fit tests, tests of independence.
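A direct implementation of the Gamma PDF using the standard library's math.gamma; setting $\alpha = 1$ reproduces the exponential special case. Parameter values are illustrative.

```python
from math import exp, gamma

def gamma_pdf(x: float, alpha: float, beta: float) -> float:
    """Gamma PDF with shape alpha and rate beta, for x > 0."""
    return beta**alpha * x**(alpha - 1) * exp(-beta * x) / gamma(alpha)

alpha, beta = 1.0, 0.5                # alpha = 1: exponential distribution
print(gamma_pdf(2.0, alpha, beta))    # equals beta * e^(-beta*x) ~ 0.1839

print(alpha / beta, alpha / beta**2)  # mean 2.0 and variance 4.0
```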
20. Normal Distribution in Business Statistics
The Normal distribution, or Gaussian distribution, is a fundamental continuous probability distribution characterized by its bell shape.
20.1 Probability Density Function (PDF) of Normal Distribution: $$ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$ where $\mu$ is the mean and $\sigma$ is the standard deviation.
20.2 Standard Normal Distribution: A normal distribution with mean $\mu=0$ and standard deviation $\sigma=1$, often denoted by $Z$.
20.3 Properties: Symmetric, unimodal, mean=median=mode, tails extend infinitely.
20.4 The Empirical Rule (68-95-99.7 Rule):
Approximately 68% of data falls within 1 standard deviation of the mean.
Approximately 95% falls within 2 standard deviations.
Approximately 99.7% falls within 3 standard deviations.
20.5 Parameters of Normal Distribution: Mean ($\mu$) and Standard Deviation ($\sigma$).
20.6 Curve of Normal Distribution: A bell-shaped curve, symmetric around the mean.
20.7 Examples: Heights of people, measurement errors, test scores.
20.8 Applications in Business Statistics: Statistical inference, modeling errors, financial modeling, quality control.
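The empirical rule can be verified from the normal CDF, which the standard library expresses through the error function; no external packages are needed.

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Normal CDF written in terms of the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    within = normal_cdf(k) - normal_cdf(-k)
    print(f"within {k} sd: {within:.4f}")  # ~0.6827, 0.9545, 0.9973
```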
21. Lognormal Distribution in Business Statistics
The Lognormal distribution describes variables whose logarithms are normally distributed.
21.1 Probability Density Function (PDF) of Lognormal Distribution: $$ f(x; \mu, \sigma) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \quad \text{for } x > 0 $$ where $\mu$ and $\sigma$ are the mean and standard deviation of the variable's natural logarithm.
21.2 Lognormal Distribution Curve: Skewed to the right.
21.3 Mean and Variance of Lognormal Distribution:
Mean: $E(X) = e^{\mu + \sigma^2/2}$
Variance: $Var(X) = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$
21.4 Applications: Modeling incomes, stock prices, response times, lifetimes of equipment.
21.5 Examples: Income distribution in a population, size of companies.
21.6 Difference Between Normal Distribution and Lognormal Distribution: A normal variable can take negative values, while a lognormal variable is strictly positive; the lognormal distribution is right-skewed.
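A simulation sketch (assuming NumPy) that checks the closed-form mean and variance above: exponentiating normal draws yields lognormal samples. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.4  # arbitrary parameters of the underlying normal

x = np.exp(rng.normal(mu, sigma, size=1_000_000))  # lognormal samples

# Simulated moments approach the closed-form expressions above
print(x.mean(), np.exp(mu + sigma**2 / 2))
print(x.var(), (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2))
```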
22. Inferential Statistics
Inferential statistics uses sample data to draw conclusions about a population.
22.1 Overview of Inferential Statistics: Making inferences and predictions about a population based on sample data.
22.2 Degrees of Freedom: The number of values in a calculation that are free to vary.
22.3 Central Limit Theorem: States that the distribution of sample means will approximate a normal distribution as the sample size becomes large, regardless of the population's distribution.
22.4 Parameters vs. Test Statistics: Parameters are population characteristics (e.g., population mean $\mu$), while test statistics are calculated from sample data (e.g., sample mean $\bar{x}$).
22.5 Test Statistics: Values calculated from sample data used to test hypotheses (e.g., t-statistic, z-statistic, F-statistic).
22.6 Estimation: The process of estimating population parameters from sample statistics (point estimation and interval estimation).
22.7 Standard Error: The standard deviation of the sampling distribution of a statistic.
22.8 Confidence Interval: A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.
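A short simulation (assuming NumPy) that ties 22.3, 22.7, and 22.8 together: sample means of a skewed population look approximately normal, and a 95% confidence interval is built from the standard error.

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # clearly non-normal

# CLT: means of many samples of size 50 are approximately normally distributed
sample_means = np.array([rng.choice(population, 50).mean() for _ in range(2000)])
print(sample_means.mean(), population.mean())  # both close to 2.0

# 95% confidence interval for the population mean from a single sample
sample = rng.choice(population, 50)
se = sample.std(ddof=1) / np.sqrt(50)  # standard error of the mean
ci = (sample.mean() - 1.96 * se, sample.mean() + 1.96 * se)
print(ci)
```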
23. Hypothesis Testing
Hypothesis testing is a statistical method to make decisions based on data, testing a claim about a population.
23.1 Hypothesis Testing Guide: A structured approach to testing hypotheses.
23.2 Null and Alternative Hypothesis:
Null Hypothesis ($H_0$): A statement of no effect or no difference.
Alternative Hypothesis ($H_a$ or $H_1$): A statement of an effect or difference that contradicts the null hypothesis.
23.3 Statistical Significance: A result is statistically significant when it is unlikely to have occurred by chance alone under the null hypothesis, i.e., when the p-value falls below the chosen significance level.
23.4 P-Value: The probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
23.5 Type I and Type II Errors:
Type I Error (False Positive): Rejecting a true null hypothesis.
Type II Error (False Negative): Failing to reject a false null hypothesis.
23.6 Statistical Power: The probability of correctly rejecting a false null hypothesis.
Decisions: Reject $H_0$ when the p-value is below the significance level ($\alpha$), or equivalently when the test statistic falls beyond the critical value.
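A one-sample t-test sketch, assuming SciPy is available; the sales figures and the hypothesized mean of 100 are illustrative.

```python
import numpy as np
from scipy import stats

# H0: mean daily sales = 100 units; Ha: mean daily sales != 100 (two-sided)
sales = np.array([102, 98, 110, 105, 95, 108, 112, 99, 104, 107])

result = stats.ttest_1samp(sales, popmean=100)
alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.3f} < {alpha}: reject H0")
else:
    print(f"p = {result.pvalue:.3f} >= {alpha}: fail to reject H0")
```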
24. Choosing the Right Statistical Test
Selecting the appropriate statistical test depends on the data type, research question, and assumptions.
24.1 Assumptions of Hypothesis Testing: Conditions that must be met for a test to be valid.
24.1.1 Skewness: Some tests assume symmetry or normality.
24.1.2 Kurtosis: Relates to the "tailedness" of the distribution, also relevant for normality assumptions.
24.2 Correlation: Measures the strength and direction of the linear relationship between two quantitative variables (a short sketch follows this sub-list).
Correlation Coefficient: A numerical measure of correlation (e.g., Pearson's $r$).
Correlation vs. Causation: Correlation does not imply causation.
Pearson Correlation: Measures linear association between two continuous variables.
Covariance vs. Correlation: Covariance indicates the direction of the linear relationship, while correlation standardizes it to a unit-free value between -1 and 1.
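A sketch contrasting covariance and Pearson's r, assuming NumPy; the spend/revenue pairs are illustrative.

```python
import numpy as np

# Illustrative pairs: advertising spend vs. monthly revenue
ads = np.array([10, 12, 15, 18, 20, 25])
revenue = np.array([110, 118, 131, 140, 148, 165])

cov = np.cov(ads, revenue)[0, 1]     # direction of the relationship, unit-dependent
r = np.corrcoef(ads, revenue)[0, 1]  # standardized to [-1, 1], unit-free
print(cov, r)                        # both positive; r is close to 1 here
```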
24.3 Regression Analysis: Used to model the relationship between dependent and independent variables.
24.3.1 t-Test: Used to compare means of two groups or test the significance of regression coefficients.
24.3.2 ANOVA (Analysis of Variance): Used to compare means of three or more groups.
24.3.2.1 One-Way ANOVA: Compares means of groups based on one factor.
24.3.2.2 Two-Way ANOVA: Compares means based on two factors and their interaction.
24.3.2.3 ANOVA in R: Implementation of ANOVA using the R programming language.
24.4 Chi-Square Test: Used for categorical data.
24.4.1 Overview of Chi-Square Test: Tests for association between categorical variables or goodness-of-fit.
24.4.2 Chi-Square Goodness of Fit Test: Tests if sample data fits a hypothesized distribution.
24.4.3 Chi-Square Test of Independence: Tests if there is a significant association between two categorical variables (see the sketch after this list).
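A sketch of a chi-square test of independence and a one-way ANOVA, assuming SciPy (the R implementation mentioned in 24.3.2.3 is analogous); all counts and samples are illustrative.

```python
import numpy as np
from scipy import stats

# Chi-square test of independence: region vs. purchase outcome (illustrative counts)
observed = np.array([[30, 20],
                     [10, 40]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

# One-way ANOVA: mean sales across three store formats (illustrative samples)
a = [23, 25, 28, 30, 27]
b = [31, 33, 29, 35, 34]
c = [22, 20, 24, 23, 21]
f_stat, p_anova = stats.f_oneway(a, b, c)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```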
25. Graphical Representation of Variables
Visualizing data is crucial for understanding patterns, trends, and relationships.
Graphs and Tables: Various graphical forms (histograms, scatter plots, bar charts, line graphs) and tabular formats are used to represent data.
This document is a compilation of common statistical topics. Specific applications and detailed formulas may vary based on context.