How Histogram excel

how do histograms work and how create histogram in excel
HalfoedGibbs Profile Pic
HalfoedGibbs,United Kingdom,Professional
Published Date:02-08-2017
Your Website URL(Optional)
Comment
9 histograms The shape of numbers Most of the action in this city concentrates right here. That’s why I’m so tall. So what? The important work is done in this area. If you understood the landscape, you’d see why How much can a bar graph tell you? There are about a zillion ways of showing data with pictures, but one of them is special. Histograms, which are kind of similar to bar graphs, are a super-fast and easy way to summarize data. You’re about to use these powerful little charts to measure your data’s spread, variability, central tendency, and more. No matter how large your data set is, if you draw a histogram with it, you’ll be able to “see” what’s happening inside of it. And you’re about to do it with a new, free, crazy-powerful software tool. this is a new chapter 251Oh boy, a self evaluation. Bet you’d score higher now than you would have in chapter 1 quest for recognition Your annual review is coming up You’ve been doing some really excellent analytical work lately, and it’s high time you got what’s coming to you. The powers that be want to know what you think about your own performance. You deserve a pat on the back. Not a literal pat on the back, though… something more. Some sort of real recognition. But what kind of recognition? Your work is totally solid. And how do you go about actually getting it? 252 Chapter 9 Starbuzz Analyst Self-review Thank you for filling out our self-review This document is important for our files and will help determine your future at Starbuzz. Date ____________________ Analyst Name ____________________ Circle the number that represents how well-developed you consider your abilities to be. A low score means you think you need some help, and a high score means you think your work is excellent. The overall quality of your analytical work. 1 2 3 4 5 Your ability to interpret the meaning and importance of past events. 1 2 3 4 5 Your ability to make level-headed judgements about the future. 1 2 3 4 5 Quality of written and oral communication. 1 2 3 4 5 Your ability to keep your client well-informed and making good choices. 1 2 3 4 5histograms You’d better brainstorm about strategies to get recognized. Write down how you’d respond to each of these questions. Should you just say thanks to your boss and hope for the best? If your boss really believes you’ve been valuable, he’ll reward you, right? Should you give yourself super-positive feedback, and maybe even exaggerate your talents a little? Then demand a big raise? Can you envision a data-based way of deciding on how to deal with this situation? you are here 4 253Here are your choices. This would be nice is more money in your future? We so deserve a raise. But how do we get the boss to give it to us? However you answered the questions on the last page, we think you should go for more money. You’re not doing this hard work for your health, after all. Going for more cash could play out in a bunch of different ways Anything could happen. People can be skittish about trying to get more money from their bosses. And who can blame them? There are lots of possible outcomes, and not all of them are good. Ask for a little Raise You’re “put back Does your boss in your place” Ask for a lot think you’re being reasonable? Is she delighted or disappointed? Do nothing Incredible raise You’ve been canned Could research help you predict the outcomes? Even though your case is unique to you, it still might make sense to get an idea of your boss’s baseline expectations. 254 Chapter 9 Right now, you have no idea what your boss will think or do.This data could be of use to you as you figure out what types of raises are reasonable to expect. Each line of the database represents someone’s raise for the specified year. You’re going to need to do something to make the data useful. histograms Here’s some data on raises Because you’re so plugged in to Starbuzz’s Load this data, you have access to some sweet numbers: Human Resource’s records about raises for the past three years. www.headfirstlabs.com/books/hfda/ hfda_ch09_employees.csv Here’s the person’s raise amount, measured as a percent increase. You might be able to wring some powerful insights out of this data. If you assume that your boss will act in a similar way to how previous bosses acted, this data could tell you what to expect. Problem is, with approximately 3,000 employees, the data set is pretty big. How would you deal with this data? Could you manage it to make it more useful? you are here 4 255 Your company’s raises This column says whether the person is male or female… you know, there might be a correlation between gender and raise amount. This column tells you whether the person asked for more money or not. TRUE means they asked for more cash, FALSE means they didn’t.data-based negotiation Jim: We should forget about the data and just go for as much as we can get. Nothing in there will tell us what they think we’re worth. There’s a range of numbers in the boss’s head, and we need to figure out how to get the upper end of that range. Joe: I agree that most of the data is useless to tell us what they think we are worth, and I don’t see how we find out. The data will tell us the average raise, and we can’t go wrong shooting for the average. Jim: The average? You’ve got to be joking. Why go for the middle? Aim higher Frank: I think a more subtle analysis is in order. There’s some rich information here, and who knows what it’ll tell us? Joe: We need to stay risk-averse and follow the herd. The middle is where we find safety. Just average the Raise column and ask for that amount. Jim: That’s a complete cop-out Frank: Look, the data shows whether people negotiated, the year of the raise, and people’s genders. All this information can be useful to us if we just massage it into the right format. Jim: OK, smarty pants. Show me. Frank: Not a problem. First we have to figure out how to collapse all these numbers into figures that make more sense… Better summarize the data. There’s just too much of it to read and understand all at once, and until you’ve summarized the data you don’t really know what’s in it. Where will you Start by breaking the data down into its basic constituent pieces. Once you have those pieces, begin your summary then you can look at averages or whatever other summary statistic you consider useful. of this data? 256 Chapter 9Here are some examples. histograms As you know, much of analysis consists of taking information and breaking it down into smaller, more manageable pieces. Draw a picture to describe how you would break these data fields down into smaller elements. Raises of 6–8% Women What statistics could you use to summarize these elements? Sketch some tables that incorporate your data fields with summary statistics. you are here 4 257 Draw pictures here to represent how you’d split the data into smaller pieces.How many people had a raise between 6 and 7 percent? …and you can combine those pieces of data with pieces from other columns. You’ve got loads of options here. Let’s actually create this last visualization… chunks of data What sort of pieces would you break your data into? Women Raises of Men 4–5% Men in Female 2008 negotiators Raises of Raises of 1% in ’07 10% Here are a few ways you might integrate your data chunks with summary statistics. 0–1% 1–2% Frequencies 3–4% (or counts) of 4–5% Men Women raises 5–6% 6–7+% Negotiators Average raise. Average raise. 258 Chapter 9 Here are some examples… your answers might be a little different. This chart counts the number of raises in each interval of raise possibilities. You can break the data in your columns into pieces… This table shows the average raise for male and female negotiators.histograms It sure is fun to imagine summarizing these pieces of the data, but here’s a thought. How about we actually do it? Using the groupings of data you imagined, you’re ready to start summarizing. When you need to slice, dice, and summarize a complex data set, you want to use your best software tools to do the dirty work. So let’s dive in and make your software reveal just what’s going on with all these raises. you are here 4 259With your data open in Excel, click the Data Analysis button under the Data tab. If you don’t see the Data Analysis button, see Appendix iii for how to install it. excel for histograms Test Drive A visualization of the number of people who fall in each category of raises will enable you to see the whole data set at once. So let’s create that summary… or even better, let’s do it graphically. 1 Open the Da ta Analysis dialogue box. In OpenOffice and older versions of Excel, you can find Data Analysis under the Tools menu. 2 Select Histogram. In the pop-up window, tell Excel that you want to create a histogram. 260 Chapter 9 Here it isSelect all your raise data under the received column. …to the bottom. histograms Test Drive Select y our data. 3 4 Run it. What happens when you create the chart? you are here 4 261 Press OK and let ‘er rip Be sure to check this box so that Excel makes a chart. When your raise data is selected, there should be a big “marching ants” selection box all the way from the top…Looks like a whole lot of people have raises in this area. The histogram shows us dramatically. histograms for counting Histograms show frequencies of groups of numbers Histograms are a powerful visualization because, no matter how large your data set is, they show you the distribution of data points across their range of values. For example, the table you envisioned in the last exercise would have told you how many people received raises at about 5 percent. 0–1% 1–2% Frequencies 3–4% of raises 4–5% 5–6% 6–7+% This histogram shows graphically how many people fall into each raise category, and it concisely shows you what people are getting across the spectrum of raises. On the other hand, there are some problems with what Excel did for you. The default settings for the bins (or “class intervals”) end up producing messy, noisy x-axis values. The graph would be easier to read with plain integers (rather than long decimals) on the x-axis to represent the bins. Sure, you can tweak the settings to get those bins looking more like the data table you initially envisioned. But even this histogram has a serious problem. Can you spot it? 262 Chapter 9 Here’s the output from Excel. Hmm… this x-axis sure looks like a mess. What kind of concentration of people get raises around 5%? These bins are nice whole numbers.These are more typical-looking histograms. histograms Gaps between bars in a histogram mean gaps among the data points In histograms, gaps mean that there is data missing between certain ranges. If, say, no one got a raise between 5.75 percent and 6.25 percent, there might be a gap. If the histogram showed that, it might really be worth investigating. In fact, there will always be gaps if there are more bins than data points (unless your data set is the same number repeated over and over). Histograms Up The problem with Excel’s function is Close that it creates these messy, artificial breaks that are really deceptive. And there’s a technical workaround for the issues (with Excel, there’s almost always a workaround if you have the time to write code using Microsoft’s proprietary language). But it’s chapter 9, and you’ve been kicking serious butt. You’re ready for a software tool with more power than Excel to manage and manipulate statistics. That’s exactly what the gap should mean, at least if the histogram is written correctly. If you assumed this The software you need is called R. histogram was correct, and that there were gaps between It’s a free, open source program these values, you’d get the totally wrong idea. You need a that might be the future of software tool to create a better histogram. statistical computing, and you’re about to dive into it you are here 4 263 Does this gap mean that there are no people who got raises between 3.3% and 3.8%?crank up a mean tool Install and run R Head on over to www.r-project.org to download R. You should have no problem finding a mirror near you that serves R for Windows, Mac, and Linux. This little cursor here represents the command prompt and is where you’ll be entering your commands into R. The command prompt is your friend. Working from the command prompt is something you get the hang of quickly, even though it requires you to think a little harder at first. And you can always pull up a spreadsheet-style visualization of your data by typing edit(yourdata). 264 Chapter 9 Click this download link. Once you’ve fired up the program, you’ll see a window that looks like this.Type the name of the data frame to get R to display it. What does this mean? histograms Load data into R For your first R command, try loading Load this the Head First Data Analysis script using the source command: source(“http://www.headfirstlabs.com/books/hfda/hfda.R”) That command will load the raise data you need for R. You’ll need to be connected to the Internet for it to work. If you want to save your R session so that you can come back to the Head First data when you’re not connected to the Internet you can type save.image(). So what did you download? First, take a look at the data frame from your download called “Employees.” Just type this command and press Enter: employees The output you see on the right is what R gives you in response. Generate a histogram in R by typing this command: hist(employeesreceived, breaks=50) What do you think the various elements of the command mean? Annotate your response. you are here 4 265 The command returns a listing of all the rows in the data frame.When you run the command, a window pops up showing this. hist tells R to run the histogram function. the maestro of histograms What do you think this histogram command means? hist(employeesreceived, breaks=50) The first argument specifies what data to use. R creates beautiful histograms With histograms, the areas under the bars don’t just measure the count (or frequency) of the thing being measured; they also show the percentage of the entire data set being represented by individual segments. Look carefully at the contour of the curve. A few things are obvious. Not a lot of people got Here are the highest raises below 0 percent, and not a percentage raises. lot of people got raises above 22 percent. What do you make But what’s happening in the middle of the distribution? of this histogram? 266 Chapter 9 The second argument tells R how to construct the groupings. The frequency here is the count of people. A lot of people got a raise around 5%. What’s happening here?Type help(sd) and help(summary) to find out what the commands do. histograms These commands will tell you a little more about your data set and what people’s raises look like. What happens when you run the commands? sd(employeesreceived) summary(employeesreceived) What do the two commands do? Look closely at the histogram. How does what you see on the histogram compare with what R tells you from these two commands? you are here 4 267 Why do you think R responds to each of these the way it does?summary() gives you some basic summary stats for the raises people received. There are two “humps,” a big tall one and this little one on the right. The mean is here. The right side is a little bigger than the left and pulls the mean to the rgiht. summary commands You just ran some commands to illustrate the summary statistics for your data set about raises. What do you think these commands did? What do the two commands do? The sd command returns the standard deviation of the data range you specify, and the summary() command shows you summary statistics about the received column. Look closely at the histogram. How does what you see on the histogram compare with what R tells you from these two commands? The histogram does a good job of visually showing mean, median, and standard deviation. Looking at it, you can’t see the exact figures, but you can get a sense of those numbers by looking at the shape of the curve. 268 Chapter 9 On average, the raises are 2.43 percentage points from the mean. The median is here. Here’s one standard deviation.histograms Joe: If the histogram were symmetrical, the mean and median would be in the same place—in the dead center. Frank: Right. But in this case, the small hump on the right side is pulling the mean away from the center of the larger hump, where most of the observations are. Joe: I’m struggling with those two humps. What do they mean? Frank: Maybe we should take another look at those pieces of data we identified earlier and see if they have any relevance to the histogram. Joe: Good idea. Female negotiators Raises of 4–5% Raises of Men 1% in ’07 Women Men in Raises of 2008 10% Can you think of any ways that the groups you identified earlier might explain the two humps on the histogram? you are here 4 269 The data groupings you imagined earlier.sorting out your histogram How might the groupings of data you identified earlier account for the two humps on your histogram? There could be variation among years: for example, raises in 2007 could be on average much higher than raises from 2006. And there could be gender variation, too: men could, on average, get higher raises than women, or vice versa. Of course, all the data is observational, so any relationships you discover won’t necessarily be as strong as what experimental data would show. Q: So it seems like we have a lot A: That’s a great question. Usually, A: And that shape must have some sort of flexibility when it comes to how the when we think of bell curves, we’re talking of meaning. The question is, why is the histograms look. about the normal or Gaussian distribution. distribution shaped that way? How will you But there are other types of bell-shaped find out? distributions, and a lot of other types of A: It’s true. You should think of the distributions that aren’t shaped like a bell. Q: very act of creating a histogram as an Can you draw histograms to interpretation, not something you do before represent small portions of the data to Q: interpretation. Then what’s the big deal about the evaluate individually? If we do that, we normal distribution? might be able to figure out why there are two humps. Q: Are the defaults that R uses for creating a histogram generally good? A: A lot of powerful and simple statistics can come into play if your data is normally A: That’s the right intuition. Give it a shot distributed, and a lot of natural and business A: Generally, yes. R tries to figure out the data follows a natural distribution (or can be number of breaks and the scale that will best “transformed” in a way that makes it naturally represent the data, but R doesn’t understand distributed). the meaning of the data it’s plotting. Just as with the summary functions, there’s Q: nothing wrong with running a quick and dirty So is our data normally histogram to see what’s there, but before distributed? you draw any big conclusions about what you see, you need to use the histogram (and A: The histogram you’ve been evaluating redraw the histogram) in a way that remains is definitely not normally distributed. As long Can you break the raise mindful of what you’re looking at and what as there’s more than one hump, there’s no you hope to gain from your analysis. data down in a way that way you can call the distribution bell-shaped. isolates the two humps Q: Are either of those humps the “bell Q: But there are definitely two humps and explains why they curve?” in the data that look like bells exist? 270 Chapter 9