How Data analysis is done

how to do data analysis for qualitative research and what data analysis to use and what data analysis to use for questionnaires and what is data analysis procedures
HalfoedGibbs Profile Pic
HalfoedGibbs,United Kingdom,Professional
Published Date:02-08-2017
Your Website URL(Optional)
Comment
13 cleaning data Impose order I do my best work when everything’s where it’s supposed to be. Your data is useless… …if it has messy structure. And a lot of people who collect data do a crummy job of maintaining a neat structure. If your data’s not neat, you can’t slice it or dice it, run formulas on it, or even really see it. You might as well just ignore it completely, right? Actually, you can do better. With a clear vision of how you need it to look and a few text manipulation tools, you can take the funkiest, craziest mess of data and whip it into something useful. this is a new chapter 385Look at all this stuff salvage a client list Just got a client list from a defunct competitor Your newest client, Head First Head Hunters, just Load this received a list of job seekers from a defunct competitor. They had to spend big bucks to get it, but it’s hugely valuable. The people on this list are the best of the best, the most employable people www.headfirstlabs.com/books/hfda/ around. hfda_ch13_raw_data.csv This list could be a gold mine… …too bad the data is a mess In its current form, there’s not much they can do with this data. That’s why they called you. Can you help? 386 Chapter 13 What are you going to do with this data?The visionary at work cleaning data The dirty secret of data analysis The dirty secret of data analysis is that as analyst you might spend more time cleaning data than analyzing it. Data often doesn’t arrive perfectly organized, so you’ll have to do some heavy text manipulation to get it into a useful format for analysis. What will be your first step for dealing with this messy data? Take a look at each of these possibilities and write the pros and cons of each. 1 Start retyping it. 2 Ask the client w hat she wants to do with the data once it’s cleaned. 3 W rite a formula to whip the data into shape. you are here 4 387 But your work as a data analyst may actually involve a lot of this… This is the fun part of data analysis. head hunter goals Which of these options did you choose as your first step? Start r etyping it. 1 This sucks. It’ll take forever, and there’s a good chance I’ll transcribe it incorrectly, messing up the data. If this is the only way to fix the data, we’d better be sure before going this route. 2 Ask the client what she wants to do with the data once it’s cleaned. This is the way to go. With an idea of what the client wants to do with the data, I can make sure that whatever I do puts the data in exactly the form that they need. 3 Write a formula to whip the data into shape. A powerful formula or two would definitely help out, once we have an idea of what the data needs to look like from the client. But let’s talk to the client first. Head First Head Hunters wants the list for their sales team We need a call list for our sales team to use to contact prospects we don’t know. The list is of job seekers who have been placed by our old competitor, and we want to be the ones who put them in their next job. Even though the raw data is a mess, it looks like they just want to extract names and phone numbers. Shouldn’t be too much of a problem. Let’s get started… 388 Chapter 13Add a few lines of sample data to show what you’d like the records to look like. Hmmm... the Time and CallID fields don’t really seem relevant.. cleaning data The data looks like a list of names, which is what we’d expect from the client’s description of it. What you need is a clean layout of those names. Draw a picture that shows some columns and sample data for what you want the messy data to look like. you are here 4 389 Looks like these are field headings up top. Draw your ideal data layout here. Put the column headings up here.…what you need is for that information to be split into columns. You need the name and phone fields separated from each other. previsualize your columns How would you like your data to look once you’ve cleaned it up? PersonID FirstName LastName Phone 127 Alexia Rasmussen 718-534-2403 98 Brenden Rasmussen 646-812-7298 Etc… Etc… Etc… Etc… 390 Chapter 13 You can see the information you want that’s been all mashed together in Column A... When everything’s separate, you can sort the data by field, filter it, or pipe it to a mail merge or web page or whatever else. Gotta have the phone numbers... that’s the most important thing for the sales team This ID field is useful, since it will let you make sure that the records are unique.cleaning data News flash The data’s still a mess. How are we going to fix it? It’s true: thinking about what neat data looks like won’t actually make it neat. But we needed to previsualize a solution before getting down into the messy data. Let’s take a look at our general strategy for fixing messy data and then start coding it… you are here 4 391plan your attack Cleaning messy data is all about preparation It may go without saying, but cleaning data should begin like any other data 2 project: making sure you have copies of Previsualize what you want your the original data so that you can go back final data set to look like. and check your work. 1 Original Data My Data Save a copy of your original data. Original Data My Copy 3 Identify repetitive patterns in the messy data. Once you’ve figured out what you need your data to look like, you can then proceed to identify patterns in the Original Data messiness. The last thing you want to do is go back and change the data line by line—that would take forever—so if you can identify repetition in the messiness you can write formulas and functions that exploit the patterns to make the data neat. 392 Chapter 13 Already did thiscleaning data Once you’re organized, you can fix the data itself 4 Clean and restructure Clean Restructure 5 Use your finished data My Data Then, patterns in hand, you can get down to the work of actually fixing the data. As you’ll discover, this process is often iterative, meaning you will clean and restructure the data over and over again until you get the results you need. First, let’s split up the fields. Is there a pattern to how the fields are separated from each other? you are here 4 393 Here’s where the magic happens.Tell Excel you have a delimiter. split the columns What patterns did you find in the data? Definitely All the data is in Column A with the fields mashed together. Between each field there is a single character: the pound (). Use the sign as a delimiter Excel has a handy tool for splitting data into columns when the fields are separated by a delimiter (the technical term for a character that makes the space between fields). Select Column A in your data and press the Text to Columns button under the Data tab… … and now you’ve started the Wizard. In the first step, tell Excel that your data is split up by a delimiter. In the second step, tell Excel that your delimiter is the character. What happens when you click Finish? 394 Chapter 13 Select Column A and click this button. Specify the delimiter.The data is now neatly separated into columns. cleaning data Excel split your data into columns using the delimiter And it was no big deal. Using Excel’s Convert Text to Column Wizard is a great thing to do if you have simple delimiters separating your fields. But the data still has a few problems. The first and last names, for example, both appear to have junk characters inside the fields. You’ll have to come up with a way to get rid of them What’s the pattern you’d use to fix the FirstName column? you are here 4 395 Now that the pieces of data are separated, you can manipulate them individually if you want to. What are you going to do to fix the FirstName field? What about the LastName field?kill the carat Is there a pattern to the messiness in the FirstName field? At the beginning of every name there is a character. We need to get rid of all of them in order to have neat first names. FirstName Here’s the This character is useful stuff. just in our way. You need some software tool to pull out all the carat characters. Let’s see what Excel has for us… 396 Chapter 13 This character is everywherecleaning data Match each Excel text formula with its function. Which function do you think you’ll need to use the clean up the Name column? Tells you the length of a cell. FIND Returns a numerical value for a number LEFT stored as text. Grabs characters on the right side of a cell. RIGHT Replaces text you don’t want in a cell with TRIM new text that you specify. Tells you where to find a search string LEN within a cell. Takes two values and sticks them together. CONCATENATE Grabs characters on the left side of a cell. VALUE Removes excess blank spaces from a cell. SUBSTITUTE you are here 4 397text formula spectacular SOlUT ion Match each Excel text formula with its function. Which function do you think you’ll need to use the clean up the Name column? Tells you the length of a cell. FIND Returns a numerical value for a number LEFT stored as text. Grabs characters on the right side of a cell. RIGHT Replaces text you don’t want in a cell with TRIM new text that you specify. Tells you where to find a search string LEN within a cell. Takes two values and sticks them together. CONCATENATE Grabs characters on the left side of a cell. VALUE Removes excess blank spaces from a cell. SUBSTITUTE 398 Chapter 13 Here’s the formula we need to use to replace the characters in the Name column.cleaning data Use SUBSTITUTE to replace the carat character Do this 1 T o fix the FirstName field, type this formula into cell H2: =SUBSTITUTE(B2,“”,“”) 2 Cop y this formula and paste it all the way down to the end of the data in Column H. What happens? Q: Q: Q: Am I limited to just these formulas? So I can nest a whole bunch of text Then how do I get around the What if I want to take the characters formulas inside of each other? problem of formulas that are big and on the left and right of a cell and stick unreadable? them together? It doesn’t look there’s a A: You can, and it’s a powerful way to formula that does just that. manipulate text. There’s a problem, though: A: Instead of packing all your smaller if your data is really messy and you have formulas into one big formula, you can break A: There isn’t, but if you nest the text to nest a whole bunch of formulas inside apart the small formulas into different cells functions inside of each other you can of each other, your entire formula can be and have a “final” formula that puts them all achieve much more complicated text almost impossible to read. together. That way, if something is a little off, manipulations. For example, if you wanted it’ll be easier to find the formula that needs to take the first and last characters inside of to be tweaked. Q: Who cares? As long as it works, I’m cell A1 and stick them together, you’d use not going to be reading it anyway. this formula: Q: You know, I bet R has much more powerful ways of doing text manipulation. A: Well, the more complex your formula, CONCATENATE(LEFT(A1,1), the more likely you’ll need to do subtle RIGHT(A1,1)) tweaking of it. And the less readable your A: It does, but why bother learning them? formula is, the harder that tweaking will be. If Excel’s SUBSTITUTE formula handles your issue, you can save your self some time by skipping R. you are here 4 399 Put the formula here.Here are your corrected first names. fixed first names You cleaned up all the first names Using the SUBSTITUTE formula, you had Excel grab the symbol from each first name and replace it with nothing, which you specified by two quotation marks (“”). Lots of different software lets you get rid of crummy characters by replacing those characters with nothing. To make the original first name data go away forever copy the H column and then Paste Special Values to turn these values into actual text rather than formula outputs. After that you can delete the FirstName column so that you never have to see those pesky symbols again. 400 Chapter 13 Here’s your original first name data. All of these values are outputs from the SUBSTITUTE formula. You can delete away... as long as you saved a copy of the original file so you can refer back to it if you made a mistake. Your last names are still screwy, though.cleaning data Hmpf. That first name pattern was easy because it was just a single character at the beginning that had to be removed. But the last name is going to be harder, because it’s a tougher pattern. Let’s try using SUBSTITUTE again, this time to fix the last names. First, look for the pattern in this messiness. What would you tell SUBSTITUTE to replace? Here’s the syntax again: =SUBSTITUTE(your reference cell, the text you want to replace, what you want to replace it with) Can you write a formula that works? you are here 4 401…and here it starts on the third character The length of this text is seven characters. tough pattern Could you fix the LastName field using SUBSTITUTE? SUBSTITUTE won’t work here Every cell has different messy text. In order to make SUBSTITUTE work, you’d have to write a separate formula for each last name. =SUBSTITUTE(C2, “(ID 127)”, “”) =SUBSTITUTE(C3, “(ID 98)”, “”) =SUBSTITUTE(C4, “(ID 94)”, “”) And typing a bajillion formulas like this defeats the purpose of using formulas to begin with. Formulas are supposed to save you the trouble of typing and retyping The last name pattern is too These text strings complex for SUBSTITUTE are different.. The SUBSTITUTE function looks for a pattern Rasmuss(ID 98)en in the form of a single text string to replace. The problem with the last names are that each has Co(ID 156)ok a different text string to replace. And that’s not all: the pattern of messiness in the LastName field is more complex in that the messy strings show up in different positions within each cell and they have different lengths. Rasmuss(ID 98)en Co(ID 156)ok 402 Chapter 13 You can’t just type in the value you want replaced, because that value changes from cell to cell. The messiness here starts on the eighth character of the cell… This one is eight characters long.cleaning data Handle complex patterns with nested text formulas The FIND formula returns a number that represents Once you get familiar with Excel text formulas, you the position of the “(“. can nest them inside of each other to do complex operations on your messy data. Here’s an example: FIND("(",C3) Rasmuss(ID 98)en LEFT(C3,FIND("(",C3)-1) Rasmuss(ID 98)en RIGHT grabs the rightmost text. RIGHT(C3,LEN(C3)-FIND(")",C3)) Rasmuss(ID 98)en CONCATENATE(LEFT(C3,FIND("(",C3)-1), Rasmussen RIGHT(C3,LEN(C3)-FIND(")",C3))) CONCATENATE The formula works, but there’s a problem: it’s starting to puts it all together. get really hard to read. That’s not a problem if you write formulas perfectly the first time, but you’d be better off with a tool that has power and simplicity, unlike this nested CONCATENATE formula. Wouldn't it be dreamy if there were an easier way to fix complex messes than with long, unreadable formulas like that one. But I know it's just a fantasy… you are here 4 403 LEFT grabs the leftmost text.? ? ? h h h c c c t t t a a a m m m a a a regex in r Geek Bits R can use regular expressions to To learn more about the full regex crunch complex data patterns specification and syntax, type ?regex in R. Regular expressions are a programming tool that allows you to specify complex patterns to match and replace strings of text, and R has a powerful facility for using them. Here’s a simple regular expression pattern that looks for the letter “a”. When you give this pattern to R, it’ll say whether there’s a match. Pattern “a+” dandelion Compare Text text and dandelion pattern. Match Pattern “a+” turkey Compare Text text and turkey pattern. Nope. Pattern “a+” aardvark Compare Text text and aardvark pattern. Match Regular expressions are the ultimate tool for cleaning up messy data. Lots of platforms and languages implement regular expressions, even though Excel doesn’t. 404 Chapter 13 Here’s R’s regular expression reference in the help files. e e e r r r e e e h h h t t t s s s I I I