Lecture notes Probability and Statistics pdf

Dr.LeonBurns,New Zealand,Researcher
Published Date:21-07-2017
VŠB - Technical University of Ostrava
Faculty of Electrical Engineering and Computer Science
Department of Applied Mathematics

PROBABILITY AND STATISTICS FOR ENGINEERS
Radim Briš
Ostrava 2011

1 EXPLORATORY DATA ANALYSIS

Study Time: 70 minutes

Learning Objectives
· General Concepts of Exploratory (Preliminary) Statistics
· Data Variable Types
· Statistical Characteristics and Graphical Methods of Presenting Qualitative Variables
· Statistical Characteristics and Graphical Methods of Presenting Quantitative Variables

Explanation

The original goal of statistics was to collect data about a population based on population samples. By population we mean the group of all existing components available for observation during statistical research. For example, if statistical research is carried out on the physical height of 15-year-old girls, the population is all girls currently aged 15.

Because the number of population members is usually high, the research is based on a so-called sample examination, in which only part of the population is used. The examined part of the population is called a sample. What really matters is to make a selection that is as representative of the whole population as possible. There are several ways to achieve this. To avoid omitting some elements of the population, the so-called random sample is used, in which each element of the population has the same chance of being selected.

It goes without saying that a sample examination can never be as accurate as examining the whole population. Why do we prefer it then?
1. To save time and minimize costs (especially for large populations).
2. To avoid damaging samples in destructive testing (some tests, such as examining cholesterol in blood, lead to permanent damage of the examined elements).
3. Because the whole population is not available.
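A random sample in which every element has the same chance of being selected can be sketched in a few lines of Python. This is only an illustration: the population of 1000 ID numbers, the sample size of 50 and the function name `simple_random_sample` are assumptions made for the example, not part of the original notes.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    Every element of the population has the same chance of being selected.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(population), sample_size)


# Hypothetical population: ID numbers 1..1000 (e.g. all girls aged 15 in a region)
population = range(1, 1001)
sample = simple_random_sample(population, sample_size=50, seed=42)

print(len(sample))         # number of selected elements
print(sorted(sample)[:5])  # a few of the selected IDs
```

Drawing without replacement guarantees that no element is examined twice, which matches the idea of examining only part of the population.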
Now that you know that statistics can describe the whole population based on information gathered from a population sample, we will move on to Exploratory Data Analysis (EDA).

The data we observe will be called variables and their values variable variants. EDA is often the first step in revealing information hidden in a large amount of variables and their variants. Because the way of processing variables depends mostly on their type, we will now explore how variables are divided into categories. The division of variables is shown in the following diagram.

[Diagram: Variable splits into Qualitative (categorical, lexical...) and Quantitative (numerical...). General division: Qualitative into Nominal and Ordinal; Quantitative into Discrete (Finite, Denumerable) and Continuous. Division of qualitative variables by number of variants: Alternative and Plural.]

· Qualitative variable - its variants are expressed verbally and they split into two general subgroups according to the relation between their values:
§ Nominal variable - has equivalent variants: it is impossible either to compare them or to sort them (for example: sex, nationality, etc.)
§ Ordinal variable - forms a transition between qualitative and quantitative variables: individual variants can be sorted and compared with one another (for example: clothing sizes S, M, L, and XL)
The second way of dividing qualitative variables is based on the number of variants:
§ Alternative variable - has only two possible options (e.g. sex - male or female, etc.)
§ Plural variable - has more than two possible options (e.g. education, name, eye color, etc.)
· Quantitative variable - is expressed numerically and is divided into:
§ Discrete variable - it has a finite or denumerable number of variants
- Discrete finite variable - it has a finite number of variants (e.g. math grades - 1, 2, 3, 4, 5)
- Discrete denumerable variable - it has a denumerable number of variants (e.g. age (year), height (cm), weight (kg), etc.)
§ Continuous variable - it can take any value from $\mathbb{R}$ or from some subset of $\mathbb{R}$ (e.g. distance between cities, etc.)
Additional clues

Imagine that you have a large statistical group and you face the question of how best to describe it. Numerical summaries of the values are used to "replace" the group elements and they become the basic attributes of the group. This is what we call statistical characteristics. In the next chapters we are going to learn how to set up statistical characteristics for various types of variables and how to represent larger statistical groups.

1.1 Statistical Characteristics of Qualitative Variables

We know that a qualitative variable has two basic types - nominal and ordinal.

1.1.1 Nominal Variables

A nominal variable has different but equivalent variants in one group. The number of these variants is usually low, and that is why the first statistical characteristic we use to describe it is the frequency.

· Frequency $n_i$ (absolute frequency)
- is defined as the number of occurrences of a variant of the qualitative variable

If a qualitative variable has k different variants (we denote their frequencies $n_1, n_2, \dots, n_k$) in a statistical group of n values, it must be true that:

$n_1 + n_2 + \dots + n_k = \sum_{i=1}^{k} n_i = n$

If we want to express the proportion of a variant's frequency in the total number of occurrences, we use the relative frequency.

· Relative frequency $p_i$
- is defined as:

$p_i = \frac{n_i}{n}$

alternatively:

$p_i = \frac{n_i}{n} \cdot 100\,\%$

(We use the second formula to express the relative frequency in percent.) For the relative frequency it must be true that:

$p_1 + p_2 + \dots + p_k = \sum_{i=1}^{k} p_i = 1$

When qualitative variables are processed, it is good to arrange the frequency and relative frequency in a so-called frequency table:

FREQUENCY TABLE
Values $x_i$    Absolute frequency $n_i$       Relative frequency $p_i$
$x_1$           $n_1$                          $p_1$
$x_2$           $n_2$                          $p_2$
...             ...                            ...
$x_k$           $n_k$                          $p_k$
Total           $\sum_{i=1}^{k} n_i = n$       $\sum_{i=1}^{k} p_i = 1$

The last characteristic of a nominal variable is the mode.
· Mode
- is defined as the variant that occurs most frequently

The mode represents a typical element of the group. The mode cannot be determined if there are more values with maximum frequency in the statistical group.

1.1.2 Graphical Methods of Presenting Qualitative Variables

Statistics often uses graphs for better analysis of variables. There are two types of graphs for analyzing a nominal variable:
· Histogram (bar chart)
· Pie chart

A histogram is a standard graph where the variants of the variable are represented on one axis and the variable frequencies on the other axis. Individual values of the frequency are then displayed as bars (boxes, vectors, squared logs, cones, etc.)

Examples:

[Figure: five bar-chart variants titled "Classification"; horizontal axis: variants 1 to 4; vertical axis: frequency, 0 to 20]

A pie chart represents the relative frequencies of individual variants of a variable. Frequencies are presented as proportions in a sector of a circle. When we change the angle of the circle, we can get an elliptical, three-dimensional effect.

[Figure: four pie-chart variants titled "Classification"; legend: variants 1 to 4; sector labels 3, 6, 9 and 19]

REMEMBER
Describing the pie chart is necessary.
Marking individual sectors by relative frequencies only, without adding their absolute values, is not sufficient.

Example: An opinion poll has been carried out about introducing high school fees. Its results are shown on the following chart:

[Figure: pie chart with two sectors, YES 50% and NO 50%]

Aren't the results interesting? No matter how true they may be, it is recommended that the chart be modified as follows:

[Figure: pie chart with two sectors, YES 1 and NO 1]

What is the difference? From the second chart it is obvious that only two people were asked - the first one said YES and the second one said NO.

What can be learned from that? Make charts in such a way that their interpretation is absolutely clear. If you are presented with a pie chart without absolute frequencies marked on it, you can ask yourself whether it is because of the author's ignorance or whether it is a deliberate bias.

Example and Solution

An observational study has been undertaken on the use of an intersection. The collected data are in the table below. The data are the colours of cars that pass through the intersection. Analyze the data and interpret the results in a graphical form.

red, blue, red, green, blue, red, red, white, green, green, blue, red

Solution:
From the table it is obvious that the collected colours are a qualitative (lexical) variable, and because there is no order or comparison between them, we can say it is a nominal variable. For a better description we create a frequency table and we determine the mode. We are going to present the colours of the passing vehicles by a histogram and a pie chart.
FREQUENCY TABLE
Colours of passing cars    Absolute frequency $n_i$    Relative frequency $p_i$
red                        5                           5/12 = 0.42
blue                       3                           3/12 = 0.25
white                      1                           1/12 = 0.08
green                      3                           3/12 = 0.25
Total                      12                          1.00

We observed 12 cars in total.
Mode = red (i.e. in our sample most cars were red)

[Figure: histogram and pie chart titled "Colours of passing cars": red 5, blue 3, white 1, green 3]

1.1.3 Ordinal Variable

Now we are going to have a look at describing ordinal variables. The ordinal variable (just like the nominal variable) has various verbal variants in the group, but these variants can be sorted, i.e. we can tell which one is "smaller" and which one is "bigger".

For describing ordinal variables we use the same statistical characteristics and graphs as for nominal variables (frequency, relative frequency and mode, displayed by a histogram or pie chart) plus two other characteristics (cumulative frequency and cumulative relative frequency), which include information about how the variants are sorted.

· Cumulative frequency of the i-th variant $m_i$
- is the number of values of the variable that are less than or equal to the i-th variant

E.g. we have a variable called "grade from Statistics" that has the following variants: "1", "2", "3" or "4" (where 1 is the best and 4 the worst grade). Then, for example, the cumulative frequency for variant "3" is equal to the number of students who got grade "3" or better.

If the variants are sorted by their "size" ($x_1 < x_2 < \dots < x_k$), then the following must be true:

$m_i = \sum_{j=1}^{i} n_j$

So it is self-evident that the cumulative frequency of the k-th ("highest") variant is equal to the size of the group n:

$m_k = n$

The second special characteristic for an ordinal variable is the cumulative relative frequency.

· Cumulative relative frequency of the i-th variant $F_i$
- is the proportion of the group made up of values with the i-th and lower variants.
It is expressed by the following formula:

$F_i = \sum_{j=1}^{i} p_j$

This is nothing else than the relative expression of the cumulative frequency:

$F_i = \frac{m_i}{n}$

Just as in the case of nominal variables, we can present the statistical characteristics of ordinal variables using a frequency table. In comparison to the frequency table of nominal variables, it also contains the values of the cumulative and cumulative relative frequencies.

FREQUENCY TABLE
Values $x_i$   Absolute frequency $n_i$    Cumulative frequency $m_i$        Relative frequency $p_i$    Relative cumulative frequency $F_i$
$x_1$          $n_1$                       $m_1 = n_1$                       $p_1$                       $F_1 = p_1$
$x_2$          $n_2$                       $m_2 = n_1 + n_2 = m_1 + n_2$     $p_2$                       $F_2 = p_1 + p_2 = F_1 + p_2$
...            ...                         ...                               ...                         ...
$x_k$          $n_k$                       $m_k = m_{k-1} + n_k = n$         $p_k$                       $F_k = F_{k-1} + p_k = 1$
Total          $\sum_{i=1}^{k} n_i = n$    -                                 $\sum_{i=1}^{k} p_i = 1$    -

1.1.4 Graphical Presentation of Ordinal Variables

We briefly mentioned the histogram and the pie chart as good ways of presenting an ordinal variable. But these graphs do not reflect the sorting of the variants. To achieve that, we need to use the frequency polygon, the ogive (cumulative frequency polygon) and the Pareto graph.

Frequency Polygon - is a line chart. The frequency is placed along the vertical axis and the individual variants of the variable are placed along the horizontal axis (sorted in ascending order from the "lowest" to the "highest"). The plotted values are connected by lines.

[Figure: frequency polygon for the evaluation grades; horizontal axis: variant (1 to 4); vertical axis: frequency (0 to 20)]

Ogive (Cumulative Frequency Polygon) - is a frequency polygon of the cumulative frequency or the relative cumulative frequency. The vertical axis is the cumulative frequency or relative cumulative frequency. The horizontal axis represents the variants. The graph always starts at zero, at the lowest variant, and ends at the total frequency (for a cumulative frequency) or 1.00 (for a relative cumulative frequency).
[Figure: ogive for the evaluation grades; horizontal axis: variant (1 to 4); vertical axis: cumulative frequency (0 to 40)]

Pareto Graph - is a bar chart for a qualitative variable with the bars arranged by frequency - the variants are on the horizontal axis and are sorted from the "highest" importance to the "lowest".

[Figure: Pareto graph showing the frequency (bars) and the cumulative frequency (line)]

Notice how the growth of the cumulative frequency slows down: its increments shrink as the frequencies of the variants decrease.

Example and Solution

The following data represent the t-shirt sizes that a clothing retailer sold:

S, M, L, S, M, L, XL, XL, M, XL, XL, L, M, S, M, L, L, XL, XL, XL, L, M

a) Analyze the data and interpret the results in a graphical form.
b) Determine what percentage of people bought t-shirts of size L at most.

Solution:
a) The variable is qualitative (lexical) and t-shirt sizes can be sorted, therefore it is an ordinal variable. For its description you use the frequency table for an ordinal variable and you determine the mode.

FREQUENCY TABLE
T-shirt size    Absolute frequency $n_i$    Cumulative frequency $m_i$    Relative frequency $p_i$    Relative cumulative frequency $F_i$
S               3                           3                             3/22 = 0.14                 0.14
M               6                           9                             6/22 = 0.27                 0.41
L               6                           15                            6/22 = 0.27                 0.68
XL              7                           22                            7/22 = 0.32                 1.00
Total           22                          -                             1.00                        -

Mode = XL (most people bought t-shirts of size XL)

For graphical representation use a histogram, a pie chart and a cumulative frequency polygon (you don't create a Pareto graph because it is mostly used for technical data).

Graphical output:

[Figure: histogram, pie chart (S 14%, M 27%, L 27%, XL 32%) and cumulative frequency polygon, each titled "Sold t-shirt"; horizontal axis: variant (S, M, L, XL)]

Total sales were 22 t-shirts.

b) You get the answer from the value of the relative cumulative frequency for variant L. You see that 68% of people bought t-shirts of size L or smaller.
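The tallying behind this example can be reproduced with a short Python sketch. It is an illustration only: the function name `ordinal_frequency_table` and the tuple layout of its rows are assumptions made here, and the ascending order of the variants (S, M, L, XL) has to be supplied by hand because it cannot be derived from the labels themselves.

```python
from collections import Counter


def ordinal_frequency_table(values, order):
    """Return rows (variant, n_i, m_i, p_i, F_i) for variants given in ascending order."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for variant in order:
        n_i = counts[variant]   # absolute frequency
        cumulative += n_i       # cumulative frequency m_i
        rows.append((variant, n_i, cumulative, n_i / n, cumulative / n))
    return rows


sizes = "S M L S M L XL XL M XL XL L M S M L L XL XL XL L M".split()
table = ordinal_frequency_table(sizes, order=["S", "M", "L", "XL"])

for variant, n_i, m_i, p_i, F_i in table:
    print(f"{variant:>2}  n={n_i:2d}  m={m_i:2d}  p={p_i:.2f}  F={F_i:.2f}")

# Mode: the variant with the highest absolute frequency.
# Note: max() silently picks the first variant on a tie, whereas the notes
# treat the mode as undetermined when several variants share the maximum.
mode = max(table, key=lambda row: row[1])[0]
print("Mode:", mode)
```

The relative cumulative frequency in the row for L answers part b) directly, and the last row confirms that the cumulative frequency of the highest variant equals the total number of values.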
1.2 Statistical Characteristics of Quantitative Variables

To describe a quantitative variable, most of the statistical characteristics used for ordinal variables can be applied (frequency, relative frequency, cumulative frequency and cumulative relative frequency). Apart from those, there are two additional ones:
· Measures of location - those indicate a typical position of the variable values, and
· Measures of variability - those indicate the variability (variance) of the values around their typical position

1.2.1 Measures of Location and Variability

The most common measure of position is the mean of the variable. The mean represents the average or typical value of the sample population. The most famous mean of a quantitative variable is:

· Arithmetical mean $\bar{x}$

It is defined by the following formula:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

where:
$x_i$ ... are the values of the variable
$n$ ... is the size of the sample population (number of values of the variable)

Properties of the arithmetical mean:

1. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
- the sum of all deviations of the variable values from their arithmetical mean is equal to zero, which means that the arithmetical mean compensates for mistakes caused by random errors.

2. $\forall a \in \mathbb{R}: \; \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \Rightarrow \frac{\sum_{i=1}^{n} (a + x_i)}{n} = a + \bar{x}$
- if the same number is added to all the values of the variable, the arithmetical mean increases by the same number

3. $\forall b \in \mathbb{R}: \; \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \Rightarrow \frac{\sum_{i=1}^{n} (b x_i)}{n} = b \bar{x}$
- if all the variable values are multiplied by the same number, the arithmetical mean is multiplied by that number as well

The arithmetical mean is not always the best way to calculate the mean of the sample population. For example, if we work with a variable representing relative changes (cost indexes, etc.), we use the so-called geometrical mean. When the variable has the form of a rate (a quantity per unit), the harmonic mean is often used.

Considering that the mean uses the whole data set of variable values, it carries maximum information about the sample population. On the other hand, it is very sensitive to so-called outlying observations (outliers). Outliers are values that are substantially different from the rest of the values in a group, and they can distort the mean to such a degree that it no longer represents the sample population. We are going to have a closer look at outliers later.

Measures of location that are less dependent on outlying observations are:

· Mode $\hat{x}$

In the case of the mode we differentiate between discrete and continuous quantitative variables. For a discrete variable we define the mode as the most frequent value of the variable (similarly as with a qualitative variable).

But in the case of a continuous variable we think of the mode as the value around which most variable values are concentrated. To assess this value we use the shorth. The shorth is the shortest interval containing at least 50% of the variable values.
In the case of a sample with an even number of values, $n = 2k$ ($k \in \mathbb{N}$), k values lie within the shorth - which is n/2 (50%) of the variable values. In the case of a sample with an odd number of values, $n = 2k + 1$ ($k \in \mathbb{N}$), k + 1 values lie within the shorth - that is n/2 + 1/2 values, half a value more than 50% of the variable values.

Then the mode $\hat{x}$ can be defined as the centre of the shorth.

From what has been said so far it is clear that the shorth length (top boundary - bottom boundary) is unique but its location is not. If the mode can be determined unambiguously, we talk about a unimodal variable. When a variable has two modes, we call it bimodal. When there are two or more modes in a sample, it usually indicates heterogeneity of the variable values. This heterogeneity can be removed by dividing the sample into more subsamples (for example, a bimodal distribution of a person's height can be divided by sex into two unimodal ones - women's height and men's height).

Example and Solution

The following data show the ages of musicians who performed at a concert. Age is a continuous variable. Calculate the mean, shorth and mode of the variable.

22 82 27 43 19 47 41 34 34 42 35

Solution:
a) Mean:
In this case we use the arithmetical mean:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{22 + 82 + 27 + 43 + 19 + 47 + 41 + 34 + 34 + 42 + 35}{11} = \frac{426}{11} \approx 38.7$ years

The musicians' average age is 38.7 years.

b) Shorth:
Our sample population has 11 values. 11 is an odd number. 50% of 11 is 5.5 and the nearest higher natural number is 6 - in other words: n/2 + 1/2 = 11/2 + 1/2 = 12/2 = 6. That means that 6 values will lie in the shorth.

And what are the next steps?
· You need to sort the values of the variable
· You determine the size of all the intervals having 6 elements, $\langle x_i; x_{i+5} \rangle$, where $x_i \le x_{i+1} \le \dots \le x_{i+5}$
· The shortest of these intervals will be the shorth (size of the interval = $x_{i+5} - x_i$)

Original data    Sorted data    Size of intervals (having 6 elements)
22               19             16 (= 35 - 19)
82               22             19 (= 41 - 22)
27               27             15 (= 42 - 27)
43               34              9 (= 43 - 34)
19               34             13 (= 47 - 34)
47               35             47 (= 82 - 35)
41               41
34               42
34               43
42               47
35               82

From the table you can see that the shortest interval has a size of 9. There is only one interval of this size, and that is $\langle 34; 43 \rangle$.

Shorth = $\langle 34; 43 \rangle$, which means that about half of the musicians (6 of 11) are between 34 and 43 years of age.

c) Mode:
The mode is defined as the centre of the shorth:

$\hat{x} = \frac{34 + 43}{2} = 38.5$

Mode = 38.5 years, which means that the typical age of the musicians who performed at the concert was 38.5 years.

Among other characteristics describing quantitative variables are quantiles. They are used for a more detailed illustration of the distribution of the variable values within the scope of the population.

· Quantiles

Quantiles describe the location of individual values (within the scope of the variable) and, like the mode, are resistant to outlying observations. Generally, a quantile is defined as a value that divides the sample into two parts. The first part contains the values that are smaller than the given quantile and the second part the values that are larger than or equal to the given quantile. The data must be sorted in ascending order from the lowest to the highest value.

The quantile of variable x that separates the 100p% smallest values from the rest of the sample (i.e. from the remaining 100(1-p)% of values) is called the 100p% quantile and denoted $x_p$.

In real life you most often come across the following quantiles:

· Quartiles

In the case of a four-part division, the values of the variable corresponding to 25%, 50%, and 75% of the total distribution are called quartiles.
Lower quartile $x_{0.25}$ = 25% quantile - divides a sample of data in such a way that 25% of the values are smaller than the quartile, i.e. 75% are bigger (or equal)
Median $x_{0.5}$ = 50% quantile - divides a sample of data in such a way that 50% of the values are smaller than the median and 50% of the values are bigger (or equal)
Upper quartile $x_{0.75}$ = 75% quantile - divides a sample of data in such a way that 75% of the values are smaller than the quartile, i.e. 25% are bigger (or equal)

Example:
Data                       6 47 49 15 43 41 7 39 43 41 36
Data in ascending order    6 7 15 36 39 41 41 43 43 47 49
Median                     41
Upper quartile             43
Lower quartile             15

The difference between the 3rd (upper) and the 1st (lower) quartile is called the Inter-Quartile Range (IQR).

$IQR = x_{0.75} - x_{0.25}$

Example:
Data              2 3 4 5 6 6 6 7 7 8 9
Upper quartile    7
Lower quartile    4
IQR               7 - 4 = 3

· Deciles - $x_{0.1}; x_{0.2}; \dots; x_{0.9}$
The deciles divide the data into 10 equal regions.

· Percentiles - $x_{0.01}; x_{0.02}; \dots; x_{0.99}$
The percentiles divide the data into 100 equal regions.

For example, the 80th percentile is the number that has 80% of values below it and 20% above it. Rather than counting 80% from the bottom, count 20% from the top.

Note: The 50th percentile is the median.

· Minimum $x_{min}$ and Maximum $x_{max}$
$x_{min} = x_0$, i.e. 0% of values are less than the minimum
$x_{max} = x_1$, i.e. 100% of values are less than the maximum

The quantiles are determined by the following process:
1. The sample population needs to be ordered by size
2. The individual values are sequenced so that the smallest value is in the first place and the highest value is in the n-th place (n is the total number of values)
3. The 100p% quantile is equal to the variable value with the sequence number $z_p$, where:
$z_p = n \cdot p + 0.5$
$z_p$ has to be rounded to an integer

REMEMBER
In the case of a data set with an even number of values the median is not uniquely defined. Any number between the two middle values (including these values) can be accepted as the median. Most often the midpoint of the two middle values is used.
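The three-step process for determining quantiles can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the notes only say that $z_p$ "has to be rounded to an integer", so rounding ties upward is an assumption made here, as is clamping the rank to the range 1..n so that p = 0 and p = 1 return the minimum and the maximum. The function name `quantile` is illustrative.

```python
import math


def quantile(values, p):
    """100p% quantile by the rank rule z_p = n*p + 0.5 (1-based sequence number)."""
    ordered = sorted(values)       # steps 1 and 2: order and sequence the values
    n = len(ordered)
    z = n * p + 0.5                # step 3: sequence number of the quantile
    rank = math.floor(z + 0.5)     # assumption: a tie (x.5) is rounded up
    rank = min(max(rank, 1), n)    # assumption: keep the rank inside 1..n
    return ordered[rank - 1]


data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
lower, median, upper = (quantile(data, p) for p in (0.25, 0.5, 0.75))
print(lower, median, upper)  # 15 41 43
print(upper - lower)         # IQR = 28
```

For the first example data set the sketch returns the same quartiles as the table above. Note that for an even number of values it returns one of the two middle values as the median rather than their midpoint, which is still a valid median according to the remark above.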
We are now going to discuss the relation between quantiles and the cumulative relative frequency. The value p denotes the cumulative relative frequency of the quantile $x_p$, i.e. the relative frequency of those variable values that are smaller than the quantile $x_p$. Quantile and cumulative relative frequency are inverse concepts.

The graphical or tabular representation of the ordered variable and the corresponding cumulative frequencies is known as the distribution function of the cumulative frequency or the empirical distribution function.

· Empirical Distribution Function F(x) for the Quantitative Variable

We put the sample population in ascending order ($x_1 < x_2 < \dots < x_n$) and we denote by $p(x_i)$ the relative frequency of the value $x_i$. For the empirical distribution function F(x) it must then be true that:

$F(x) = \begin{cases} 0 & \text{for } x \le x_1 \\ \sum_{i=1}^{j} p(x_i) & \text{for } x_j < x \le x_{j+1},\ 1 \le j \le n-1 \\ 1 & \text{for } x_n < x \end{cases}$

The empirical distribution function is a monotonic, non-decreasing function and it is continuous from the left.

$p(x_i) = \lim_{x \to x_i^{+}} F(x) - F(x_i)$

[Figure: step graph of the empirical distribution function F(x) over the points $x_1, x_2, x_3, \dots, x_{n-1}, x_n$, rising from 0 to 1 with jumps $p(x_i)$]

· MAD

MAD is short for Median Absolute Deviation from the median. MAD is determined as follows:
1. Order the sample population by size
2. Determine the median of the sample population
3. For each value determine the absolute value of its deviation from the median
4. Put the absolute deviations from the median in ascending order by size
5. Determine the median of the absolute deviations from the median, i.e. MAD

Example and Solution

There is the following data set: 22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35 (the data from the previous example). Determine:
a) All quartiles
b) Inter-Quartile Range
c) MAD
d) Draw the Empirical Distribution Function

Solution:
a) You need to determine the Lower Quartile $x_{0.25}$, the Median $x_{0.5}$ and the Upper Quartile $x_{0.75}$. First, you order the data by size and assign a sequence number to each value.
Original data    Ordered data    Sequence
22               19              1
82               22              2
27               27              3
43               34              4
19               34              5
47               35              6
41               41              7
34               42              8
34               43              9
42               47              10
35               82              11

Now you can divide the data set into quartiles and mark their variable values accordingly:

Lower Quartile $x_{0.25}$: $p = 0.25;\ n = 11 \Rightarrow z_p = 11 \cdot 0.25 + 0.5 = 3.25 \approx 3 \Rightarrow x_{0.25} = 27$
i.e. 25% of the musicians are under 27 (75% of them are 27 years old or older).

Median $x_{0.5}$: $p = 0.5;\ n = 11 \Rightarrow z_p = 11 \cdot 0.5 + 0.5 = 6 \Rightarrow x_{0.5} = 35$
i.e. a half of the musicians are under 35 (50% of them are 35 years old or older).

Upper Quartile $x_{0.75}$: $p = 0.75;\ n = 11 \Rightarrow z_p = 11 \cdot 0.75 + 0.5 = 8.75 \approx 9 \Rightarrow x_{0.75} = 43$
i.e. 75% of the musicians are under 43 (25% of them are 43 years old or older).

b) Inter-Quartile Range IQR:
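Parts b) and c) of this example can be checked numerically. The sketch below is only an illustration: it reuses the sequence numbers 3 and 9 obtained for the lower and upper quartile in part a), and it assumes Python's standard `statistics.median`, which for an odd number of values returns the middle value and therefore agrees with the rank rule used in these notes. The variable names are illustrative.

```python
import statistics

ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
ordered = sorted(ages)

# b) Inter-Quartile Range from the sequence numbers found in part a)
lower_quartile = ordered[3 - 1]   # sequence number z = 3
upper_quartile = ordered[9 - 1]   # sequence number z = 9
iqr = upper_quartile - lower_quartile

# c) MAD, following the five steps of the MAD procedure
median = statistics.median(ordered)                             # steps 1 and 2
abs_deviations = sorted(abs(x - median) for x in ordered)       # steps 3 and 4
mad = statistics.median(abs_deviations)                         # step 5

print("IQR =", iqr)   # 43 - 27 = 16
print("MAD =", mad)   # median of 0, 1, 1, 6, 7, 8, 8, 12, 13, 16, 47 is 8
```

Both measures barely react to the outlying value 82, which is the reason they are preferred over the range when outliers are present.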
