Lecture notes on Simple Random Sampling

Prof. Kristian Hardy
Published Date: 26-07-2017
Chapter 3 Collecting, presenting and summarising data

Data are the key to many important management decisions. Is a new product selling well? Do potential customers like the new advertising campaign? Should we launch a new product? These are all questions that can be answered with data.

We begin this part of the course with some basic methods of collecting, representing and describing data. We will start by looking at the different kinds of data that exist, how we might obtain these data, and some basic methods for presenting them. But first, some important definitions.

3.1 Important definitions

For the work we will look at over the next six months or so, you must be familiar with the following words and phrases, and you must understand what they mean.

The quantities measured in a study are called random variables, and a particular outcome is called an observation. A collection of observations is the data. The collection of all possible outcomes is the population.

If we were interested in the height of people doing Accounting & Finance courses at Newcastle, that would be our random variable; a particular person's height would be the observation, and if we measured everyone doing ACC1012, those would be our data, which would form a sample from the population of all students registered on Accounting & Finance degrees.

In practice it is difficult to observe whole populations, unless we are interested in a very limited population, e.g. the students taking ACC1012. In reality we usually observe a subset of the population; we will come back to sampling in Section 3.2.

Once we have our data, it is important to understand what type it is, so that we can figure out exactly what to do with it. You should refer to Section 2.1 of the summer revision booklet for full details. The diagram overleaf provides a useful summary; you may want to annotate this.

✎ Lee Fawcett

3.2 Sampling

We can rarely observe the whole population. Instead, we observe some subset of this, that is, the sample. The difficulty is in obtaining a representative sample. For example, if you were to ask the people leaving a gym whether they exercised, this would produce a biased sample that would not be representative of the population as a whole. The importance of obtaining a representative sample cannot be stressed too highly. As we will see in Semester 2, we use the data from our samples to make inferences about the population, and these inferences influence the decision-making process.

There are three general forms of sampling techniques.

1. Random sampling, where the members of the sample are chosen by some random mechanism.
2. Quasi-random sampling, where the mechanism for choosing the sample is only partly random.
3. Non-random sampling, where the sample is specifically selected.

3.2.1 Simple Random Sampling

If we had a population of 200 students we could put all their names into a hat and draw out 20 names as our sample. Each name has an equal chance of being drawn and so the sample is completely random. Furthermore, each possible sample of 20 has an equal chance of being selected. In reality, the drawing of the names would be done by a computer and the population and samples would be considerably larger.

This method has several disadvantages. First, we often do not have a complete list of the population. For example, if you were surveying the market for some new software, the population would be everybody with a compatible computer. It would be almost impossible to obtain this information. Second, not all elements of the population are equally accessible, and hence you could waste time trying to obtain data from people who are unwilling to provide it.
Thirdly, it is possible that, purely by chance, you could pick an unrepresentative sample, either over- or under-representing elements of the population.

3.2.2 Stratified Sampling

This is a form of random sampling where clearly defined groups, or strata, exist within the population, for example males and females, working or not working, age groups etc. If we know the overall proportion of the population that falls into each of these groups, we can randomly sample from each of the groups and then adjust the results according to the known proportions. For example, assume that the population is 55% female and 45% male and we wanted a sample of 1000. We could first decide to have 550 females and 450 males in our sample. We would then pick the members of our sample from their respective groups randomly. We do not have to make the numbers in the samples proportional to the numbers in the strata, because we could adjust the results; but sampling within each stratum ensures that that stratum is properly represented in our results and gives us more precise information about the population as a whole. Such sampling should generally reflect the major groupings within the population.

The disadvantages are that we need clear information on the size and composition of each group or stratum, which can be difficult to obtain; and, as with simple random sampling, we still need to know the entire population so as to sample from it.

3.2.3 Systematic Sampling

This is a form of quasi-random sampling which can be used where the population is clearly structured. For example, if you were interested in obtaining a 10% sample from a batch of components being manufactured, you would select the first component at random from the first ten; after that, you pick every tenth item to come off the production line. The simplicity of selection makes this a particularly easy sampling scheme to implement, especially in a production setting.
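The three schemes met so far (simple random, stratified and systematic sampling) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the fixed seed, the stratum sizes (5500 and 4500, chosen to match the 55%/45% split) and the batch of 100 components are assumptions made for the example, not part of the course material.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n distinct members; every subset of size n is equally likely."""
    return rng.sample(population, n)


def stratified_sample(strata, n, rng):
    """Proportional allocation: sample each stratum in proportion to its size.

    Rounding means the stratum sizes may not sum exactly to n in general.
    """
    total = sum(len(members) for members in strata.values())
    return {
        name: rng.sample(members, round(n * len(members) / total))
        for name, members in strata.items()
    }


def systematic_sample(items, step, rng):
    """Random start among the first `step` items, then every `step`-th item."""
    start = rng.randrange(step)
    return items[start::step]


rng = random.Random(2017)  # fixed seed (an assumption) so the sketch is repeatable

# Simple random sampling: 20 names drawn from a population of 200 students.
students = [f"student_{i}" for i in range(200)]
srs = simple_random_sample(students, 20, rng)

# Stratified sampling: hypothetical population that is 55% female, 45% male.
strata = {
    "female": [f"f_{i}" for i in range(5500)],
    "male": [f"m_{i}" for i in range(4500)],
}
stratified = stratified_sample(strata, 1000, rng)

# Systematic sampling: a 10% sample from a hypothetical batch of 100 components.
components = list(range(100))
systematic = systematic_sample(components, 10, rng)

print(len(srs), len(stratified["female"]), len(stratified["male"]))
print(systematic)
```

With proportional allocation the stratified sample contains 550 females and 450 males, as in the example of Section 3.2.2, and the systematic sample contains every tenth component after a random start.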
The disadvantages of systematic sampling are that it is not fully random and, if there is a pattern in the process, it may produce a biased sample. It is only really applicable to structured populations.

3.2.4 Multi-stage Sampling

This is another form of quasi-random sampling. These types of sampling schemes are common where the population is spread over a wide geographic area which might be difficult or expensive to sample from. Multi-stage sampling works, for example, by dividing the area into geographically distinct smaller areas, randomly selecting one (or more) of these areas and then sampling within these areas, whether by random, stratified or systematic sampling schemes. For example, if we were interested in sampling school children, we might take a random (or stratified) sample of education authorities, then, within each selected authority, a random (or stratified) sample of schools, then, within each selected school, a random (or stratified) sample of pupils. This is likely to save time and cost less than sampling from the whole population.

The sample can be biased if the stages are not carefully selected. Indeed, the whole scheme needs to be carefully thought through and designed to be truly representative.

3.2.5 Cluster Sampling

This is a method of non-random sampling. For example, a geographic area is sub-divided into clusters and all the members of a particular cluster are then surveyed. This differs from the multi-stage sampling covered in Section 3.2.4, where the members of the cluster were sampled randomly. Here, no random sampling occurs.

The advantage of this method is that, because the sampling takes place in a concentrated area, it is relatively inexpensive to perform. The very fact that small clusters are picked to allow an entire cluster to be surveyed introduces the strong possibility of bias within the sample.
If you were interested in the take-up of organic foods and were sampling via the cluster method, you could easily get biased results; if, for example, you picked an economically deprived area, the proportion of those surveyed that ate organically might be very low, while if you picked a middle-class suburb the proportion is likely to be higher than in the overall population.

3.2.6 Judgemental Sampling

Here, the person interested in obtaining the data decides whom they are going to ask. This can provide a coherent and focused sample by choosing people with experience and relevant knowledge to provide their opinions. For example, the head of a service department might suggest particular clients to survey based on his judgement. They might be people he believes will be honest or have strong opinions. This methodology is non-random and relies on the judgement of the person making the choice. Hence, it cannot be guaranteed to be representative. It is prone to bias.

3.2.7 Accessibility Sampling

Here, only the most easily accessible individuals are sampled. This is clearly prone to bias and only has convenience and cheapness in its favour. For example, a sample of grain taken from the top of a silo might be quite unrepresentative of the silo as a whole in terms of moisture content.

3.2.8 Quota Sampling

This method is similar to stratified sampling but uses judgemental (or some other) sampling rather than random sampling within groups. We would classify the population by any set of criteria we choose, sample individuals who meet those criteria, and stop when we have reached our quota. For example, if we were interested in the purchasing habits of 18-23 year old male students, we would stop likely candidates in the street; if they matched the requirements we would ask our questions, continuing until we had reached our quota of 50 such students. This type of sampling can lead to very accurate results as it is specifically targeted, which saves time and expense.
The accurate identification of the appropriate quotas can be problematic. This method is highly reliant on the individual interviewer selecting people to fill the quota. If this is done poorly, bias can be introduced into the sample.

3.2.9 Sample Size

When considering data collection, it is important to ensure that the sample contains a sufficient number of members of the population for adequate analysis to take place. Larger samples will generally give more precise information about the population. Unfortunately, in reality, issues of expense and time tend to limit the size of the sample it is possible to take. For example, national opinion polls often rely on samples in the region of just 1000.

3.3 Frequency Tables

Once we have collected our data, often the first stage of any analysis is to present them in a simple and easily understood way. Tables are perhaps the simplest means of presenting data. There are many types of tables. For example, we have all seen tables listing sales of cars by type, or exchange rates, or the financial performance of companies. These types of tables can be very informative. However, they can also be difficult to interpret, especially those which contain vast amounts of data.

Frequency tables are amongst the most commonly used tables and are perhaps the most easily understood. They can be used with continuous, discrete, categorical and ordinal data. Frequency tables are also used in some of the techniques we will meet later in this chapter.

3.3.1 Frequency tables for categorical data

The following table presents the modes of transport used daily by 30 students to get to and from University (survey date: 3rd August 2012).
Student  Mode     Student  Mode     Student  Mode
1        Car      11       Walk     21       Walk
2        Walk     12       Walk     22       Metro
3        Car      13       Metro    23       Car
4        Walk     14       Bus      24       Car
5        Bus      15       Train    25       Car
6        Metro    16       Bike     26       Bus
7        Car      17       Bus      27       Car
8        Bike     18       Bike     28       Walk
9        Walk     19       Bike     29       Car
10       Car      20       Metro    30       Car

The table obviously contains much information. However, it is difficult to see which method of transport is the most widely used. One obvious next step would be to count the number of students using each mode of transport:

Mode     Frequency
Car      10
Walk     7
Bike     4
Bus      4
Metro    4
Train    1
Total    30

This gives us a much clearer picture of the methods of transport used. Also of interest might be the relative frequency of each of the modes of transport. The relative frequency is simply the frequency expressed as a proportion of the total number of students surveyed. If this is given as a percentage, as here, this is known as the percentage relative frequency.

Mode     Frequency   Relative Frequency (%)
Car      10          33.3
Walk     7           23.4
Bike     4           13.3
Bus      4           13.3
Metro    4           13.3
Train    1           3.4
Total    30          100

3.3.2 Frequency tables for count data

Example 3.1

The following table shows the raw data for car sales at a new car showroom over a two week period in July 2012.

Date       Cars Sold   Date       Cars Sold
01/07/12   9           08/07/12   10
02/07/12   8           09/07/12   5
03/07/12   6           10/07/12   8
04/07/12   7           11/07/12   4
05/07/12   7           12/07/12   6
06/07/12   10          13/07/12   8
07/07/12   11          14/07/12   9

Present these data in a relative frequency table by number of days on which different numbers of cars were sold.

✎

Cars Sold   Tally   Frequency   Relative Frequency %



Totals

3.3.3 Frequency tables for continuous data

With discrete data, and especially with small data sets, it is easy to count the quantities in the defined categories. With continuous data this is not possible. Strictly speaking, no two observations are precisely the same. With such observations we group the data together.
For example, the following data set represents the service time in seconds for callers to a credit card call centre.

214.8412  220.6484  216.7294  195.1217  211.4795  195.8980  201.1724  185.8529  183.4600  178.8625
196.3321  199.7596  206.7053  203.8093  203.1321  200.8080  201.3215  205.6930  181.6718  201.7461
180.2062  193.3125  188.2127  199.9597  204.7813  198.3838  193.1742  204.0352  197.2206  193.5201
205.5048  217.5945  208.8684  197.7658  212.3491  209.9000  197.6215  204.9101  203.1654  192.9706
208.9901  202.0090  195.0241  192.7098  219.8277  208.8920  200.7965  191.9784  188.8587  206.8912

To produce a continuous data frequency table we first need to divide the range of the variable into smaller ranges called class intervals. The class intervals should, between them, cover every possible value. There should be no gaps between the intervals. One way to ensure this is to include the boundary value as the smallest value in the next class above. This can be written as, for example, 20 ≤ obs < 30. This means we include all observations (represented by "obs") within this class interval that have a value of at least 20 up to values just below 30.

Some things to think about:

• Often for simplicity we would write the class intervals up to the number of decimal places in the data and avoid using the inequalities; for example, 20 up to 29.999 if we were working to 3 decimal places.
• We need to include the full range of data in our table and so we need to identify the minimum and maximum points (sometimes our last class might be "greater than such and such").
• The class interval width should be a convenient number, for example 5, 10, or 100, depending on the data. Obviously we do not want so many classes that each one has only one or two observations in it.
• The appropriate number of classes will vary from data set to data set; however, with simple examples that you would work through by hand, it is unlikely that you would have more than ten to fifteen classes.
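The rule "include the boundary value in the class above" (for example 20 ≤ obs < 30) can be sketched in Python as follows. The function name and the choice of classes of width 5 starting at 175 are assumptions made for illustration; only the first ten service times from the data set above are used, so the full table is still left as an exercise.

```python
import math


def class_interval_counts(data, lower, width, n_classes):
    """Count observations in the half-open classes
    lower + i*width <= obs < lower + (i+1)*width, for i = 0, ..., n_classes - 1."""
    counts = [0] * n_classes
    for obs in data:
        i = math.floor((obs - lower) / width)  # index of the class containing obs
        if not 0 <= i < n_classes:
            raise ValueError(f"{obs} lies outside the range covered by the table")
        counts[i] += 1
    return counts


# First ten service times (seconds) from the call centre data set.
service_times = [214.8412, 220.6484, 216.7294, 195.1217, 211.4795,
                 195.8980, 201.1724, 185.8529, 183.4600, 178.8625]

counts = class_interval_counts(service_times, lower=175, width=5, n_classes=10)
total = len(service_times)

for i, freq in enumerate(counts):
    lo = 175 + 5 * i
    print(f"{lo} <= time < {lo + 5}: frequency {freq}, "
          f"relative frequency {100 * freq / total:.1f}%")
```

Because each class includes its lower boundary and excludes its upper one, an observation of exactly 30 falls in the class 30 ≤ obs < 40 and never in 20 ≤ obs < 30, so the classes leave no gaps and never overlap.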
Example 3.2

Create a frequency table for the call centre data. Also, find the relative frequencies in each class interval.

✎

Class Interval   Tally   Frequency   Relative Frequency %



Totals

3.4 Graphical methods for presenting data

We have looked at ways of collecting data and then collating them into tables. Frequency tables are useful methods of presenting data; they do, however, have their limitations. With large amounts of data, graphical presentation methods are often clearer to understand. Here, we look at methods for producing graphical representations of data of the types we have seen previously.

3.4.1 Stem and Leaf plots

Stem and leaf plots are a quick and easy way of representing data graphically. They can be used with both discrete and continuous data. You should refer to Section 2.3.1 of the summer revision booklet for more details about these plots.

Example: Percentage returns on a share

The following numbers show the percentage returns on an ordinary share for 23 consecutive months:

 0.2  -2.1   1.0   0.1  -0.5   2.4  -2.3   1.5   1.2  -0.6   2.4  -1.2
 1.7  -1.3  -1.2   0.9   0.5   0.1  -0.1   0.3  -0.4   0.5   0.9

Here, the largest value is 2.4 and the smallest -2.3, and we have lots of decimal values in between. Thus, it seems sensible here to have a stem unit of 1 and a leaf unit of 0.1. A stem and leaf diagram for this set of returns then might look like:

Stem | Leaf
  -2 | 3 1
  -1 | 3 2 2
  -0 | 6 5 4 1
   0 | 1 1 2 3 5 5 9 9
   1 | 0 2 5 7
   2 | 4 4

n = 23, stem unit = 1, leaf unit = 0.1.

Example: Production line data

Consider the following data on lengths of items on a production line (in cm):

2.97  3.81  2.54  2.01  3.49  3.09  1.99  2.64  2.31  2.22

The stem and leaf plot for this is shown below. Notice that all figures have been rounded down, or cut, to one decimal place.

1 | 9
2 | 0 2 3 5 6 9
3 | 0 4 8

n = 10, stem unit = 1 cm, leaf unit = 0.1 cm.

Why do you think we cut the extra digits?
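As a rough check on the production line plot above, the cutting and grouping can be sketched in Python. The function name is hypothetical, and the sketch assumes non-negative data with one leaf digit per observation; it uses the standard `decimal` module so that cutting 2.97 to 2.9 is not affected by floating-point rounding.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_DOWN


def stem_and_leaf(values, leaf_unit="0.1"):
    """Return {stem: sorted leaves} for non-negative values.

    Extra digits are cut (truncated), not rounded, to the leaf unit.
    """
    unit = Decimal(leaf_unit)
    plot = defaultdict(list)
    for v in values:
        # Number of whole leaf units in v, e.g. 2.97 -> 29 tenths.
        n_units = int((Decimal(str(v)) / unit).to_integral_value(rounding=ROUND_DOWN))
        stem, leaf = divmod(n_units, 10)
        plot[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(plot.items())}


# Lengths (cm) of items on the production line.
lengths = [2.97, 3.81, 2.54, 2.01, 3.49, 3.09, 1.99, 2.64, 2.31, 2.22]

plot = stem_and_leaf(lengths)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
print(f"n = {len(lengths)}, stem unit = 1 cm, leaf unit = 0.1 cm")
```

The printed rows are `1 | 9`, `2 | 0 2 3 5 6 9` and `3 | 0 4 8`, matching the plot drawn by hand. Note that this simple sketch does not print stems that contain no observations.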
✎

Example 3.3

The observations in the table below are the recorded times it takes to get through to an operator at a telephone call centre (in seconds). Construct a stem-and-leaf plot for these data, and comment.

✎

54  56  50  67  55  38  49  45  39  50
45  51  47  53  29  42  44  61  51  50
30  39  65  54  44  54  72  65  58  62

Stem | Leaf



n =        stem unit =        leaf unit =

Example 3.4

The stem and leaf plot below represents the marks on a test for 52 students. Comment on the distribution of these marks.

1 | 4
1 | 5 7 7
2 | 1 1 2 3
2 | 5 5 6 7 8 8
3 | 2 3 3 3 4 4 4
3 | 5 5 6 7 7 8 8 9 9 9 9
4 | 0 0 1 1 1 2 2 4 4 4
4 | 5 7 7 8 8 8 9
5 | 0 0 0

n = 52, stem unit = 10, leaf unit = 1.

✎

3.4.2 Bar Charts

Bar charts are a commonly used and clear way of presenting categorical data or any ungrouped discrete frequency observations. See Section 2.3.2 of the summer revision booklet for more details.

3.4.3 Pie charts

Pie charts are simple diagrams for displaying categorical or grouped data. These charts are commonly used within industry to communicate simple ideas, for example market share. They are used to show the proportions of a whole. They are best used when there are only a handful of categories to display. See Section 2.3.3 of the summer revision booklet for more details.

3.4.4 Histograms

Bar charts have their limitations; for example, they cannot be used to present continuous data. When dealing with continuous random variables a different kind of graph is required. This is called a histogram. At first sight these look similar to bar charts. There are, however, two critical differences:

• the horizontal axis (x-axis) is a continuous scale. As a result of this there are no gaps between the bars (unless there are no observations within a class interval);
• the height of the rectangle is only proportional to the frequency if the class intervals are all equal.
With histograms it is the area of each rectangle that is proportional to the frequency.

The frequency table for the data on service times for a telephone call centre (Section 3.3.3) was

Service time         Frequency
175 ≤ time < 180     1
180 ≤ time < 185     3
185 ≤ time < 190     3
190 ≤ time < 195     6
195 ≤ time < 200     10
200 ≤ time < 205     12
205 ≤ time < 210     8
210 ≤ time < 215     3
215 ≤ time < 220     3
220 ≤ time < 225     1
Total                50

Notice that all the class intervals are the same width, and so the histogram for these data is:

[Figure: histogram of service times. x-axis: Time (s), 175 to 225; y-axis: Frequency, 0 to 12]

Example 3.5

The Holiday Hypermarket travel agency received 64 telephone calls yesterday morning. The table below gives information on the lengths, in minutes, of these telephone calls.

Length (x) minutes   Frequency   Frequency density
0 ≤ x < 5            4           0.8
5 ≤ x < 15           10          1.0
15 ≤ x < 30          24
30 ≤ x < 40          20
40 ≤ x < 45          6

Complete this table, and construct a histogram for these data. What is the modal class here?

[Blank axes for the histogram. x-axis: Length (minutes), 5 to 40]

3.4.5 Percentage Relative Frequency Histograms

When we produced frequency tables in Section 3.3, we included a column for percentage relative frequency. This contained values for the frequency of each group, relative to the overall sample size, expressed as a percentage. For example, a percentage relative frequency table for the data on service time (in seconds) for calls to a credit card service centre is:

Service time         Frequency   Relative Frequency (%)
175 ≤ time < 180     1           2
180 ≤ time < 185     3           6
185 ≤ time < 190     3           6
190 ≤ time < 195     6           12
195 ≤ time < 200     10          20
200 ≤ time < 205     12          24
205 ≤ time < 210     8           16
210 ≤ time < 215     3           6
215 ≤ time < 220     3           6
220 ≤ time < 225     1           2
Totals               50          100

You can plot these data like an ordinary histogram, or, instead of using frequency/frequency density on the vertical axis (y-axis), you could use the percentage relative frequency/percentage relative frequency density.
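The two quantities used on the vertical axis can be computed directly. The sketch below assumes the usual definitions (percentage relative frequency is 100 × frequency ÷ total; frequency density is frequency ÷ class width), which agree with the values printed in the tables above; the function names are hypothetical.

```python
def percentage_relative_frequencies(frequencies):
    """Each frequency as a percentage of the total number of observations."""
    total = sum(frequencies)
    return [100 * f / total for f in frequencies]


def frequency_density(frequency, lower, upper):
    """Frequency per unit of class width, so that bar area reflects frequency."""
    return frequency / (upper - lower)


# Frequencies for the ten service-time classes, 175 <= time < 180 up to 220 <= time < 225.
service_time_freqs = [1, 3, 3, 6, 10, 12, 8, 3, 3, 1]
percentages = percentage_relative_frequencies(service_time_freqs)
print(percentages)

# The two frequency densities already given in Example 3.5.
print(frequency_density(4, 0, 5), frequency_density(10, 5, 15))
```

The first line of output reproduces the percentage column of the table (2, 6, 6, 12, 20, 24, 16, 6, 6, 2), and the second gives 0.8 and 1.0, the two densities supplied in Example 3.5.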
[Figure: percentage relative frequency histogram of service times. x-axis: Time (s), 175 to 225; y-axis: Relative frequency (%), 0 to 24]

Note that the y-axis now contains the relative percentages rather than the frequencies. You might well ask "why would we want to do this?".

These percentage relative frequency histograms are useful when comparing two samples that have different numbers of observations. If one sample were larger than the other then a frequency histogram would show a difference simply because of the larger number of observations. Looking at percentages removes this difference and enables us to look at relative differences.

For example, in the following graph (produced in the computer package Minitab; see Semester 2) there are data from two groups and four times as many data points for one group as the other. The left-hand plot shows an ordinary histogram and it is clear that the comparison between groups is masked by the quite different sample sizes. The right-hand plot shows a histogram based on (percentage) relative frequencies and this enables a much more direct comparison of the distributions in the two groups.

Overlaying histograms on the same graph does not always produce such a clear picture, particularly if the values in both groups are close or overlap one another significantly.

3.4.6 Relative Frequency Polygons

These are a natural extension of the relative frequency histogram. They differ in that, rather than drawing bars, each class is represented by one point and these are joined together by straight lines. The method is similar to that for producing a histogram:

1. Produce a percentage relative frequency table.
2. Draw the axes
   – The x-axis needs to contain the full range of the classes used.
   – The y-axis needs to range from 0 to the maximum percentage relative frequency.
3. Plot points: pick the mid point of the class interval on the x-axis and go up until you reach the appropriate percentage value on the y-axis and mark the point. Do this for each class.
4. Join adjacent points together with straight lines.

The relative frequency polygon is exactly the same as the relative frequency histogram, but instead of having bars we join the mid-points of the top of each bar with a straight line. Consider the following simple example.

Class Interval   Mid Point   % Relative Frequency
0 ≤ x < 10       5           10
10 ≤ x < 20      15          20
20 ≤ x < 30      25          35
30 ≤ x < 40      35          25
40 ≤ x < 50      45          10

We can draw this easily by hand:

[Figure: relative frequency polygon. x-axis: x, 0 to 50; y-axis: Relative frequency (%), 0 to 40]

These percentage relative frequency polygons are very useful for comparing two or more samples: we can easily "overlay" many relative frequency polygons, but overlaying the corresponding histograms could get really messy.

Consider the following data on gross weekly income (in £) collected from two sites in Newcastle. Let us suppose that many more responses were collected in Jesmond so that a direct comparison of the frequencies using a standard histogram is not appropriate. Instead we use relative frequencies.

Weekly Income (£)       West Road (%)   Jesmond Road (%)
0 ≤ income < 100        9.3             0.0
100 ≤ income < 200      26.2            0.0
200 ≤ income < 300      21.3            4.5
300 ≤ income < 400      17.3            16.0
400 ≤ income < 500      11.3            29.7
500 ≤ income < 600      6.0             22.9
600 ≤ income < 700      4.0             17.7
700 ≤ income < 800      3.3             4.6
800 ≤ income < 900      1.3             2.3
900 ≤ income < 1000     0.0             2.3

The computer package Minitab (see Semester 2) was used to produce the following plot of the percentage relative frequency polygons for the two groups. We can clearly see the differences between the two samples. The line connecting the boxes represents the data from West Road and the line connecting the circles represents those for Jesmond Road.
The distribution of incomes on West Road is skewed towards lower values, whilst those on Jesmond Road are more symmetric. The graph clearly shows that income in the Jesmond Road area is higher than that in the West Road area.

3.4.7 Cumulative Frequency Polygons (Ogives)

Cumulative percentage relative frequency is also a useful tool. The cumulative percentage relative frequency is simply the sum of the percentage relative frequencies at the end of each class interval (i.e. we add the frequencies up as we go along). Consider the example from the previous section:

Class Interval   % Relative Frequency   Cumulative % Relative Frequency
0 ≤ x < 10       10                     10
10 ≤ x < 20      20                     30
20 ≤ x < 30      35                     65
30 ≤ x < 40      25                     90
40 ≤ x < 50      10                     100

At the upper limit of the first class the cumulative % relative frequency is simply the % relative frequency in the first class, i.e. 10. However, at the end of the second class, at 20, the cumulative % relative frequency is 10 + 20 = 30. The cumulative % relative frequency at the end of the last class must be 100.

The corresponding graph, or ogive, is simple to produce by hand:

1. Draw the axes.
2. Label the x-axis with the full range of the data and the y-axis from 0 to 100%.
3. Plot the cumulative % relative frequency at the end point of each class.
4. Join adjacent points, starting at 0% at the lowest class boundary.

[Figure: ogive. x-axis: x, 0 to 50; y-axis: Cumulative relative frequency (%), 0 to 100]

For example, Minitab was used to produce the ogive below for the income data from the West Road survey:

This graph instantly tells you many things. To see what percentage of respondents earn less than £x per week:

1. Find x on the x-axis and draw a line up from this value until you reach the ogive;
2. From this point trace across to the y-axis;
3. Read the percentage from the y-axis.
If we wanted to know what percentage of respondents in the survey in West Road earn less than £250 per week, we simply find £250 on the x-axis, trace up to the ogive and then trace across to the y-axis, where we can read a figure of about 47%.

The process obviously works in reverse. If we wanted to know what level of income 50% of respondents earned less than, we would trace across from 50% to the ogive and then down to the x-axis and read a value of about £300.
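Reading values from an ogive amounts to straight-line interpolation between the plotted points. The sketch below is illustrative only: the function names are hypothetical, and it assumes the ogive is drawn with straight lines through the cumulative percentages of the West Road column in Section 3.4.6, so its answers need not match readings taken by eye from the Minitab plot.

```python
from itertools import accumulate


def ogive_points(boundaries, percentages):
    """(class boundary, cumulative %) pairs, starting at 0% at the lowest boundary."""
    cumulative = [0.0] + list(accumulate(percentages))
    return list(zip(boundaries, cumulative))


def percent_below(points, x):
    """Trace up from x to the ogive and across: % of observations below x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range of the ogive")


def value_at_percent(points, p):
    """Trace across from p% to the ogive and down: value with p% of observations below it."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= p <= y1 and y1 > y0:
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)
    raise ValueError("p is outside the range of the ogive")


# Simple example from this section: cumulative values 10, 30, 65, 90, 100.
simple = ogive_points([0, 10, 20, 30, 40, 50], [10, 20, 35, 25, 10])

# West Road weekly income (GBP): percentage relative frequencies per class of width 100.
west_road = ogive_points(
    list(range(0, 1001, 100)),
    [9.3, 26.2, 21.3, 17.3, 11.3, 6.0, 4.0, 3.3, 1.3, 0.0],
)

print(round(percent_below(west_road, 250), 1))    # % earning less than 250 per week
print(round(value_at_percent(west_road, 50), 0))  # income with 50% of respondents below
```

Under these assumptions the interpolated figure at £250 is roughly 46%, close to the reading of about 47% quoted above, and the income with 50% of respondents below it comes out at roughly £268; readings taken by eye from a plotted ogive are necessarily approximate.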
