# Basic Statistical Terms Defined

Statistics as a discipline drives much of data

# Statistical Terms Related to Samples and Sampling

**Population** is a common statistical term that refers to the entire group of people, items, or events you want to draw conclusions about. In other words, if you’re studying something like how a drug affects people with a certain condition, everyone with that condition is the “population” because each person represents an event in which the drug worked in some way (or didn’t). The people actually enrolled in your study are usually only a portion of that population.

**Sample** and **sampling** are often used interchangeably, but they refer to slightly different things. A sample is whatever portion of a population is used for statistical analysis. Sampling is the process of deciding which members or values of the population will be selected to create the sample. **Sample statistics**, so long as they remain free from bias, can be a very efficient and economical way to draw inferences about the total population. Sampling can give you an unbiased picture of a population that would be far too large to study in full.

**Probability sampling** is a specific kind of sampling process in which every member of the population has a known, nonzero chance of being selected. A **random sample** is what you arrive at when you conduct simple random sampling, a process in which every item or person in the population has the same chance of being chosen and put into the final sample. Random sampling is considered a best practice because it tends to drastically reduce the chance of selection bias.
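To make random sampling concrete, here is a minimal Python sketch. The population of 1,000 patient ID numbers and the sample size of 50 are made up for illustration, and the fixed seed is only there so the result is repeatable.

```python
import random

# Hypothetical population: ID numbers for 1,000 patients.
population = list(range(1, 1001))

# Seeded generator so the sketch is reproducible; drop the seed in real use.
rng = random.Random(42)

# Simple random sampling without replacement: every patient has the same
# chance of ending up in the sample, and nobody is picked twice.
sample = rng.sample(population, k=50)

print(len(sample), len(set(sample)))  # prints: 50 50
```

Because the selection is left to chance rather than to the person running the study, no one can steer which patients get in.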

# Statistical Terms Related to Numbers

Numerical terms aren’t just common statistical terms; they’re common in mathematics, data analysis, and other areas of healthcare data, too. Here are some of the most common numerical terms you’ll come across.

A **variable**, at least in the statistical sense, is an attribute that can take on different values from one person or item to the next. You can have categorical variables, numerical variables, and so on. So, for instance, someone’s religious affiliation would be a categorical variable, while their income would be a numerical variable.

A **parameter** is a numerical measure that’s used to describe a specific characteristic of a given population. Parameters are often unobservable, since entire populations are usually not studied, and as such they must often be estimated. Keep in mind that an unbiased estimate is not guaranteed to equal the true population parameter. Unbiased means that, averaged over many repeated samples, the estimates equal the true parameter.

A **statistic** is a numerical measure, as well, but it describes some property of a sample rather than a population. You obtain a statistic from a sample, and if the sample is drawn without bias, the statistic is far more likely to be a trustworthy estimate. The goal is for the statistic obtained from the sample to be as close as possible to the value you would get from the entire population.
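The difference between a parameter and a statistic is easier to see with numbers. The sketch below invents a population of 10,000 incomes (the figures are made up, not real data), computes the population mean as the parameter, and then estimates it from a random sample of 200.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 10,000 incomes (made-up numbers for illustration).
population = [rng.gauss(50_000, 12_000) for _ in range(10_000)]

# Parameter: computed from the whole population (usually not observable).
population_mean = statistics.fmean(population)

# Statistic: computed from a random sample and used to estimate the parameter.
sample = rng.sample(population, k=200)
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.0f}")
print(f"statistic (sample mean):     {sample_mean:.0f}")
```

The two numbers will be close but rarely identical, which is why a statistic is treated as an estimate of the parameter rather than a replacement for it.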

# Statistical Terms Related to Data

We’ve talked about types of data when discussing big data and data analytics, but statistical data has many types, as well. There are many sources of statistical data, but several of these statistical terms are a bit more common than others, so we’ll go over those.

**Published sources of data** are collected from either a primary or a secondary source. The goal is usually to collect from a primary data source, since that data is gathered firsthand by the analyst. Secondary data can be equally valuable at times, though, so secondary published sources of data, which have been compiled from various primary sources, are also used.

**Experimental data** is any data that’s collected under controlled conditions, where only one variable or group of variables is allowed to change. In other words, everything else in the experiment is held constant. Experimental data is what most people think of when they think of “science” and is used most frequently in hard sciences. Non-experimental data is more common in social sciences, where it’s often impossible to hold everything except for one variable constant.

Another type of statistical data is **survey data**, which is pretty straightforward. **Frame data** is also used, which is data that’s been gathered using a specific list (often called a sampling frame) that defines which members of the population can be selected, so that the sample data will be as close as possible to the population data. Think of it like “framework” data, because there’s a framework (the list) that you have to work within when you conduct your sampling.

Finally, **cross-sectional data** is a set of data that’s collected at one point in time. You can make comparisons within the data or you can use a benchmark data point. It’s a way of asking what’s happening across the board at a given point in time regarding a specific sample. If you like, you can think of it as a sort of scientific version of “this day in history”.

# Statistical Terms Related to Visual Representation

**Bar charts** are charts of categorical data arranged into bars with varying heights. The heights usually represent the frequency, or relative frequency (such as percentages), of each category of the variable. In other words, a bar chart shows you how many observations fall into each category of a given variable. Bar heights can also show another value for each category. For instance, a bar graph of how much each of several demographics earned in a given year would have several bars, one for each demographic, and their heights would correspond to an income level on the y-axis. The widths of the bars have no meaning.
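The frequencies behind a bar chart are just counts per category. Here is a minimal sketch that tallies them and draws a rough text version of the chart; the ten survey responses are invented for illustration.

```python
from collections import Counter

# Hypothetical categorical data: primary device reported by 10 respondents.
responses = ["phone", "tablet", "phone", "laptop", "phone",
             "laptop", "phone", "tablet", "phone", "laptop"]

frequency = Counter(responses)                 # count per category
total = sum(frequency.values())
relative = {k: v / total for k, v in frequency.items()}

# Text "bar chart": the bar length is the frequency of each category.
for category, count in frequency.most_common():
    print(f"{category:<7}{'#' * count} {relative[category]:.0%}")
```

The same relative frequencies are what the slices of a pie chart would represent.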

A **histogram** is a graph made from quantitative data. The range of the data is usually divided into intervals, which are called bins, and bars are created above each bin. The heights of the bars represent the frequency of the data in the particular bin. It sounds similar to a bar chart, but the width of the bars is an important characteristic of the graph in a histogram whereas in a bar graph the widths have no meaning.
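Binning is the step that separates a histogram from a bar chart. The sketch below sorts a made-up list of ages into four bins of width 10; the bin edges are an assumption chosen to fit this particular data.

```python
# Hypothetical quantitative data: ages of 12 people.
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 51, 58, 59]

bin_width = 10
bin_start = 20   # left edge of the first bin
num_bins = 4     # bins: [20,30), [30,40), [40,50), [50,60)

counts = [0] * num_bins
for age in ages:
    index = (age - bin_start) // bin_width   # which bin the value falls into
    counts[index] += 1

for i, count in enumerate(counts):
    left = bin_start + i * bin_width
    print(f"[{left}, {left + bin_width}): {'#' * count}")

print(counts)  # prints: [3, 4, 2, 3]
```

Changing `bin_width` changes the shape of the graph, which is why the width of the bars matters in a histogram.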

**Box and whisker plots** incorporate the median and upper and lower quartiles in order to display a data range in a graphical representation. This plot is very useful for displaying outlying values when they’re present in the data.
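The numbers behind a box and whisker plot can be computed directly. This sketch uses invented data and assumes one common convention for flagging outliers (points more than 1.5 times the interquartile range beyond the box); other conventions exist, and different tools compute quartiles slightly differently.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [12, 15, 16, 18, 19, 20, 22, 23, 25, 60]

# Quartiles; statistics.quantiles uses the "exclusive" method by default,
# so other tools may report slightly different cut points.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}")
print("outliers:", outliers)  # prints: outliers: [60]
```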

**Pie charts** are circular graphs where wedge-shaped slices represent proportions of the total graph. The entire chart tends to represent a “whole”, such as people who live in a certain area, and various “slices” of the chart represent percentages of that whole, such as mobile device usage among people in that area. Various wedges could represent those who primarily use mobile phones, those who use tablets, and so on. Pie charts are most useful when you want to visualize parts of a whole.

Finally, **scatter diagrams** are graphs that plot coordinates from two series of data points. Typically, the x-axis represents units of one variable and the y-axis represents units of the second variable. These diagrams are useful when you’re looking for patterns among datasets and, as such, “scatter diagram” is a commonly used statistical term in all industries that use data analysis.

# Statistical Terms Related to Probability

**Probability** is the mathematical likelihood that an outcome will occur, expressed as a number between 0 and 1. As you might imagine, then, a **probability distribution** describes all the possible outcomes and how likely each one is. This is typically described by a probability function. A **discrete probability distribution** is a probability distribution in which the variable can take only distinct, countable values. For example, the outcomes might be limited to whole numbers.
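A fair six-sided die is a handy example of a discrete probability distribution, since the only possible outcomes are the whole numbers 1 through 6. A minimal sketch:

```python
from fractions import Fraction

# Discrete distribution for one roll of a fair six-sided die:
# each whole-number outcome 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible outcomes must sum to 1.
total_probability = sum(die.values())

# Probability of an event = sum of the probabilities of its outcomes.
p_even = sum(p for face, p in die.items() if face % 2 == 0)

print(total_probability, p_even)  # prints: 1 1/2
```

Using `Fraction` keeps the arithmetic exact, so the total comes out to exactly 1 rather than a rounded decimal.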

A **continuous probability distribution** is a probability distribution in which the variable can take any value within a range of possible values. A **symmetrical probability distribution** is one whose probability function is split into two halves by a vertical line, where either side is a mirror image of the other. If you’re familiar with the Normal distribution, the bell-shaped distribution that is fully described by its mean and standard deviation, that’s an example of a symmetrical probability distribution.

There are two other common types of probability distribution, as well. A **left-skewed probability distribution** is one in which the left tail of the distribution is longer than the right, hence the name left-skewed. In this case the mean is typically less than the median. A **right-skewed probability distribution**, then, is the opposite. The right tail of the distribution is longer, and the mean is typically greater than the median.
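You can see the pull of a long tail by comparing the mean and median of a small data set. The values below are invented: the first list has a long right tail, and the second is its mirror image with a long left tail.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 40]
# Mirror image of the same values, giving a long left tail instead.
left_skewed = [41 - x for x in right_skewed]

for name, values in [("right", right_skewed), ("left", left_skewed)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name}-skewed: mean={mean:.2f}, median={median}")
```

The few extreme values drag the mean toward the long tail, while the median barely moves.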

# Mean, Median, Mode, and More

The **mean** is simply the average and is computed the way all averages are. If you have a group of five values, you add them up and divide by five and you’ve found your mean. The **median** is the centermost or middle value in a series of data points or values once they’ve been sorted in order. The **mode** is the value that appears the most. These are three words that most people are familiar with thanks to high school mathematics. **Expected mean**, however, is a term that might not have made its way into our memories, but it’s easy enough to figure out when you know what the mean is. As the name implies, it’s what you expect the mean to be, and in practice it’s often estimated from a sample. It’s also known as the **expected value**.
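Here is a short sketch of all four terms using Python’s built-in `statistics` module. The five data values are made up, and the expected value is computed for a fair six-sided die, assuming the usual definition of each outcome weighted by its probability.

```python
import statistics

values = [2, 3, 3, 5, 7]             # five hypothetical data points

mean = statistics.mean(values)       # (2 + 3 + 3 + 5 + 7) / 5
median = statistics.median(values)   # middle value of the sorted data
mode = statistics.mode(values)       # most frequent value

# Expected value of a fair six-sided die: each outcome weighted by
# its probability (1/6 each).
expected_value = sum(face * (1 / 6) for face in range(1, 7))

print(mean, median, mode)
print(round(expected_value, 2))      # prints: 3.5
```

Notice that the expected value of 3.5 is not a face you can actually roll; it’s the long-run average you’d expect over many rolls.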

# Variations

Variations and variance have to be dealt with quite frequently if you’re working with data. There are many ways to describe variations and many terms within this category, but we’ll go over some of the more common terms.

**Population variance** is a statistical term that refers to the average of the squared differences between each data value and the population mean. You subtract the mean from each observation, square the result, add the squares together, and divide by N, the number of observations. **Sample variance** is the sum of the squared differences between each value and the sample mean, where the sum is divided by N-1 and N is the number of observations in the sample.
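The only difference between the two formulas is the divisor, which this sketch makes explicit. The eight observations are made up for illustration, and the results are checked against the `statistics` module’s own `pvariance` and `variance` functions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical observations
n = len(data)
mean = sum(data) / n

# Squared difference of each value from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_diffs) / n       # divide by N
sample_variance = sum(squared_diffs) / (n - 1)     # divide by N - 1

print(population_variance)           # prints: 4.0
print(round(sample_variance, 3))     # prints: 4.571

# Same results from the standard library.
print(statistics.pvariance(data), round(statistics.variance(data), 3))
```

Dividing by N-1 makes the sample variance slightly larger, which compensates for the fact that a sample tends to understate the spread of the full population.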

**Sample standard deviation** is the square root of the sample variance. It measures how spread out the values are and is often used as a measure of risk. The **sample coefficient of variation** is the ratio you get when you divide the sample standard deviation by the sample mean. If you’re choosing between two independent data sets on the basis of risk, you’ll usually pick the one with the lower coefficient of variation. In other words, you want the one with the least variation per unit of expected value.
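As a rough illustration, the sketch below compares two made-up sets of yearly returns. The helper name `coefficient_of_variation` is just a label chosen for this example, not a function from any library.

```python
import statistics

# Hypothetical yearly returns (in percent) for two investments.
returns_a = [4, 5, 6, 5, 5]
returns_b = [1, 9, 12, 15, 3]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)

cv_a = coefficient_of_variation(returns_a)
cv_b = coefficient_of_variation(returns_b)

print(f"A: {cv_a:.2f}  B: {cv_b:.2f}")
```

Investment B has the higher average return, but it also has far more variation per unit of return, so A comes out ahead on this measure.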

# Conclusion

We could spend hours discussing the many statistical terms you’ll come across when working with data. However, if you master these basic statistical terms, you should be able to understand statistical literature. It’s important to be able to navigate statistical writing and studies because they contain the information you’ll need to conduct research, develop questions, gather evidence, and make progress in your projects and processes. Knowing and understanding these terms, as well as taking the time to increase your knowledge of statistics as a whole, can help you deepen your current skill set and will set the foundation for a new level of professional development.