Data Encodings and File Formats List (2019)

There are many typical file format categories. This blog walks through 20+ data encodings and file formats that are used for different types of data and media.

 

Let me make something clear up front: people are always dreaming up new data types and formats, and you will forever be playing catch‐up on them.

 

However, there are several formats that are common enough that you should know them.

First, I will talk about specific file formats that you are likely to encounter as a data scientist. This will include sample code for parsing them, discussions about when they are useful, and some thoughts about the future of data formats.

 

Typical File Format Categories


There are many, many different specific file formats out there. However, they fall under several broad categories. This section will go over the most important ones for a data scientist. The list is not exhaustive, and neither are the categories mutually exclusive, but this should give you a broad lay of the land.

 

Text Files

Most raw data files seen by data scientists are, at least in my experience, text files. This is the most common format for CSV files, JSON, XML and web pages. Pulls from databases, data from the web, and log files generated by machines are all typically text.

 

The advantage of a text file is that it is readable by a human being, meaning that it is very easy to write scripts that generate it or parse it. Text files work best for data with a relatively simple format.

 

There are limitations though. In particular, text is a notoriously inefficient way to store numbers. The string "938238234232425123" takes up 18 bytes (one per character), but the number it represents fits in memory as a single 8‐byte integer.

 

Not only do you pay this price in storage, but the number must also be converted from text into its native binary format before a machine can operate on it.
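
As a rough illustration (the value here is just for demonstration), you can see the size difference directly in Python by packing the number into its native 8‐byte form with the standard struct module:

import struct

s = "938238234232425123"            # the number stored as text
print(len(s))                        # 18 bytes as a string

packed = struct.pack("q", int(s))    # the same number as a native 64-bit integer
print(len(packed))                   # 8 bytes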

 

Dense Numerical Arrays


If you are storing large arrays of numbers, it is much more space‐ and performance‐efficient to store them in the native binary format that computers use for processing numbers.

 

Most image files or sound files consist mostly of dense arrays of numbers, packed adjacent to each other in memory. Many scientific datasets fall into this category too. In my experience, you don’t see these datasets as often in data science, but they do come up.
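
As a minimal sketch (the file names here are hypothetical), NumPy makes the difference easy to see: np.save writes an array in its native binary layout, while np.savetxt spells every number out as text, which is both larger and slower to parse back in.

import numpy as np

arr = np.random.rand(1000, 1000)     # a dense array of one million floats

np.save("myarray.npy", arr)          # native binary: 8 bytes per float, fast to reload
np.savetxt("myarray.txt", arr)       # text: ~25 characters per float, larger and slower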

 

Program‐Specific Data Formats

Many computer programs have their own specialized file format. This category would include things such as Excel files, db files, and similar formats. Typically, you will need to look up a tool to open one of these files.

 

In my experience, opening them often takes a while computationally, since the format supports a lot of bells and whistles that may or may not be present in any particular dataset.

 

This makes it a pain to reparse them every time you rerun your analysis scripts – often, it takes much longer than the actual analysis does. What I typically do is make CSV versions of them right up front and use those as the input to my analyses.
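
For example, here is a hedged sketch of that workflow using pandas (the file names are hypothetical, and read_excel requires an Excel‐reading backend such as openpyxl to be installed):

import pandas as pd

# Parse the slow, program-specific format once...
df = pd.read_excel("quarterly_report.xlsx")

# ...then write a plain CSV that later analysis scripts can load quickly
df.to_csv("quarterly_report.csv", index=False)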

 

Compressed or Archived Data

Many data files, as stored in a particular format, take up far more space than the information they contain logically needs; for example, most lines in a large text file may be exactly the same, or a dense numerical array may consist mostly of 0s. In these cases, we want to compress the large file into a smaller one, so that it can be stored and transferred more easily.

 

A related problem is when we have a large collection of files that we want to condense into a single file for easier management, often called data archiving. There are a variety of ways that we can encode the raw data into these more manageable forms.

 

There is a lot more to data compression than just reducing the size. A “perfect” algorithm would have the following properties:

It generally reduces the size of the data, easing storage requirements.

If it can’t compress the data much (or at all), then at least it doesn’t balloon it to take up much MORE space.

 

You can decompress it quickly. If you do this really well, it might take you less time to load the compressed data compared to the raw data itself, even with the decompression step. This is because decompression in RAM can be fairly quick, but it takes a long time to pull extra data off the disk.

 

You can decompress it “one line at a time,” rather than loading the entire file. This helps you deal with corrupt data and typically makes decompression go faster since you’re operating on less data at a time.

 

You can recompress it quickly.

In the real world, there is a wide range of compression algorithms available, which balance these interests in a lot of different ways. Compression becomes especially important in Big Data settings, where datasets are typically large and reloaded from disk every time the code runs.

 


 

CSV Files


CSV files are the workhorse data format for data science. “CSV” usually stands for “comma‐separated value,” but it really should be “character‐separated value” since characters other than commas do get used.

 

Sometimes, you will see “.tsv” if tabs are used or “.psv” if pipes (the “|” character) are used. More often though, in my experience, everything gets called CSV regardless of the delimiter.

 

CSV files are pretty straightforward conceptually – just a table with rows and columns. There are a few complications you should be aware of though:

 

Headers. Sometimes, the first line gives names for all the columns, and sometimes, it gets right into the data.

 

Quotes. In many files, the data elements are surrounded by quotes or another character. This is done largely so that commas (or whatever the delimiting character is) can be included in the data fields.

 

Nondata rows. In many file formats, the data itself is CSV, but there are a certain number of nondata lines at the beginning of the file. Typically, these encode metadata about the file and need to be stripped out when the file is loaded into a table.

 

Comments. Many CSV files will contain human‐readable comments, as source code does. Typically, these are denoted by a single character, such as the # in Python.

 

The following Python code shows how to read a basic CSV file into a data frame using Pandas:

import pandas

df = pandas.read_csv("myfile.csv")


If your CSV file has weird complexities associated with it, then read_csv has a number of optional arguments that let you deal with them. Here is a more complicated call to read_csv:

import pandas
df = pandas.read_csv("myfile.csv",
    sep="|",              # the delimiter. Default is the comma
    header=None,          # the file has no header row
    quotechar='"',
    compression="gzip",
    comment='#'
)

In my work, the optional arguments I use most are sep and header.

 

JSON Files

JSON is probably my single favorite data format, prized for its simplicity and flexibility. It is a way to take hierarchical data structures and serialize them into a plain‐text format. Every JSON data structure is one of the following:

 

An atomic type, such as a number, a string, or a Boolean.

A JSONObject, which is just a map from strings to JSON data structures. This is similar to a Python dictionary, except that the keys must all be strings.

An array of JSON data structures. This is similar to a Python list.

Here is an example of some valid JSON, which encodes a JSONObject map with a lot of substructures:

{
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "children": ["alice", "john", {"name": "alice", "birth_order": 2}],
    "spouse": null
}

 

Note a few things about this example:

The fact that I’ve made it all pretty with the newlines and indentations is purely to make it easier to read. This could have all been on one long line and any JSON parser would parse it equally well. A lot of programs for viewing JSON will automatically format it in this more legible way.

 


 

The overall object is conceptually similar to a Python dictionary, where the keys are all strings and the values are JSON objects. The overall object could have been an array too though.

 

A difference between JSON objects and Python dictionaries is that all the field names have to be strings. In Python, the keys can be any hashable type.

 

The fields in the object can be ordered arrays, such as “children.” These arrays are analogous to Python lists. You can mix and match types in the object, just as in Python.

You can have Boolean types. Note though that they are declared in lower case.

  • There are also numerical types.
  • A null value is supported.
  • You can nest the objects arbitrarily deeply.

 

Parsing JSON is a cinch in Python. You can either “load” a JSON string into a Python object (a dictionary at the highest level, with JSON arrays mapping to Python lists, etc.) or “dump” a Python dictionary into a JSON string.

 

The JSON string can either be a Python string or be stored in a file, in which case you write from/to a file object. The code looks as follows:

>>> import json
>>> json_str = """{"name": "Field", "height": 6.0}"""
>>> my_obj = json.loads(json_str)
>>> my_obj
{u'name': u'Field', u'height': 6.0}
>>> str_again = json.dumps(my_obj)
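
If the JSON lives in a file rather than a string, the counterparts are json.load and json.dump, which take a file object (the file name below is hypothetical):

import json

record = {"name": "Field", "height": 6.0}

# Write the Python object out as a JSON file
with open("record.json", "w") as f:
    json.dump(record, f)

# Read it back into a Python dictionary
with open("record.json") as f:
    same_record = json.load(f)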

 


Historically, JSON was invented as a way to serialize objects from the JavaScript language. Think of the keys in a JSONObject as the names of the members in an object. However, JSON does NOT support notions such as pointers, classes, and functions.

 

XML Files

XML is similar to JSON: a text‐based format that lets you store hierarchical data in a format that can be read by both humans and machines. However, it’s significantly more complicated than JSON – part of the reason that JSON has been eclipsing it as a data transfer standard on the web.

 

Let’s jump in with an example:

<GroupOfPeople>
    <person gender="male">
        <Name>Field Cady</Name>
        <Profession>Data Scientist</Profession>
    </person>
    <person gender="female">
        <Name>Ryna</Name>
        <Profession>Engineer</Profession>
    </person>
</GroupOfPeople>

 

Everything enclosed in angle brackets is called a “tag.” Every section of the document is begun and ended by a matching pair of tags, which tell what type of section it is. The closing tag contains a slash “/” after the “<”.

 

The opening tag can contain other pieces of information about the section – in this case, “gender” is such an attribute. Because you can have whatever tag names or additional attributes you like, XML lends itself to making domain‐specific description languages.

 

XML sections must be fully nested into each other, so something such as the following is invalid:

<a><b></a></b>

because the “b” section begins in the middle of the “a” section but doesn’t end until the “a” is already over.

 

For this reason, it is conventional to think of an XML document as a tree structure. Every nonleaf node in the tree corresponds to a pair of opening/closing tags, of some type and possibly with some attributes, and the leaf nodes are the actual data.

 

Sometimes, we want the start and end tag of a section to be adjacent to each other. In this case, there is a little bit of syntactic sugar, where you put the closing “/” before the closing angle bracket. So,

<foo a="bar"></foo>
is equivalent to
<foo a="bar"/>

 

A big difference between JSON and XML is that the content in XML is ordered. Every node in the tree has its children in a particular order – the order in which they come in the document. They can be of any types and come in any order, but there is AN order.

 

Processing XML is a little more finicky than processing JSON, in my experience. This is for two reasons:

 

It’s easier to refer to a named field in a JSON object than to search through all the children of an XML node and find the one you’re looking for.

XML nodes often have additional attributes, which are handled separately from the node’s children.

This isn’t inherent to the data formats, but in practice, JSON tends to be used in small snippets, for smaller applications where the data has a regular structure.

 

So, you typically know exactly how to extract the data you’re looking for. In contrast, XML is liable to be a massive document with many parts, and you have to sift through the whole thing.

 

In Python, the XML library offers a variety of ways of processing XML data. The simplest is the ElementTree sublibrary, which gives us direct access to the parse tree of the XML.

 

It is shown in this code example, where we parse XML data from a string into a tree object, access and modify the data, and then re‐encode it back into an XML string:

import xml.etree.ElementTree as ET

xml_str = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""

>>> root = ET.fromstring(xml_str)
>>> root.tag
'data'
>>> root[0]                # gives the zeroth child
<Element 'country' at 0x1092d4410>
>>> root.attrib            # dictionary of the node's attributes
{}
>>> root.getchildren()
[<Element 'country' at 0x1092d4410>, <Element 'country' at 0x1092d47d0>, <Element 'country' at 0x1092d4910>]
>>> del root[0]            # deletes the zeroth child from the tree
>>> modified_xml_str = ET.tostring(root)
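
As a small follow‐on sketch reusing the same xml_str, the find and findall methods are the usual way to search a node’s children by tag name, and attrib gives you the attributes from the opening tag:

root = ET.fromstring(xml_str)
for country in root.findall("country"):
    name = country.attrib["name"]        # attribute on the opening <country> tag
    rank = country.find("rank").text     # text inside the <rank> child
    print(name, rank)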

 

The “right” way to manage XML data is called the “Document Object Model.” It is a little more standardized across programming languages and web browsers, but it is also more complicated to master. The ElementTree is fine for simple applications and capable of doing whatever you need it to do.
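
If you do want a DOM‐style interface from the standard library, xml.dom.minidom offers one. Here is a minimal sketch, again reusing xml_str from above:

from xml.dom import minidom

dom = minidom.parseString(xml_str)
countries = dom.getElementsByTagName("country")
print(countries[0].getAttribute("name"))   # 'Liechtenstein'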

 

HTML Files


By far the most important variant of XML is HTML, the language for describing pages on the web. Practically speaking, the definition of “valid” HTML is that your web browser will parse it as intended.

 

There are differences between browsers, some intentional and some not, and that’s why the same page might look different in Chrome and Internet Explorer.

 

But browsers have largely converged on a standard version of HTML (the most recent official standard is HTML5), and to a first approximation, that standard is a variant of XML. Many web pages could be parsed with an XML parser library.

 

I mentioned in the last section that XML can be used to create domain‐specific languages, each of which is defined by its own set of valid tags and their associated attributes. This is the way HTML works. Some of the more notable tags are given in the following table:

 

Tag: <a>
Meaning: Hyperlink
Example: Click <a href="www.google.com">here</a> to go to Google

Tag: <img>
Meaning: Image
Example: <img src="smiley.gif">

Tag: <h1> through <h6>
Meaning: Headings of text
Example: <h1>The Title</h1>

Tag: <div>
Meaning: Division. It doesn’t get rendered but helps to organize the document. Often, the “class” attribute is used to associate the contents of the division with a desired style of text formatting.
Example: <div class="main-text">My body of text</div>

Tag: <ul> and <li>
Meaning: Unordered lists (usually rendered as bulleted lists) and list items
Example:
Here is a list:
<ul>
<li>Item One</li>
<li>Item Two</li>
</ul>

 

The practical problem with processing HTML data is that, unlike JSON or even XML, HTML documents tend to be extremely messy. They are often individually made, edited by humans, and tweaked until they look “just right.”

 

This means that there is almost no regularity in structure from one HTML document to the next, so the tools for processing HTML lean toward combing through the entire document to find what it is you’re looking for.

 

The default HTML tool for Python is the HTMLParser class, which you use by creating a subclass that inherits from it. An HTMLParser works by walking through the document, performing some action each time it hits a start or an end tag or another piece of text.

 

These actions will be user‐defined methods on the class, and they work by modifying the parser’s internal state.

 

When the parser has walked through the entire document, its internal state can be queried for whatever it is you were looking for. One very important note is that it’s up to the user to keep track of things such as how deeply nested you are within the document’s sections.

 

To illustrate, the following code will pull down the HTML for a Wikipedia page, step through its content, and count all hyperlinks that are embedded in the body of the text (i.e., they are within paragraph tags):

 

from HTMLParser import HTMLParser
import urllib

TOPIC = "Dangiwa_Umar"
url = "https://en.wikipedia.org/wiki/%s" % TOPIC

class LinkCountingParser(HTMLParser):
    in_paragraph = False
    link_count = 0
    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
        elif tag == 'a' and self.in_paragraph:
            self.link_count += 1
    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

html = urllib.urlopen(url).read()
parser = LinkCountingParser()
parser.feed(html)
print "there were", parser.link_count, "links in the article"

 

Tar Files

Tar is the most popular example of an “archive file” format. The idea is to take an entire directory full of data, possibly including nested subdirectories, and combine it all into a single file that you can send in an e‐mail, store somewhere, or whatever you want.

 

There are a number of other archive file formats, such as ISO, but in my experience, tar is the most common example. A similar idea shows up in the Java programming language and its relatives: compiled Java classes are bundled into JAR files, which are archive files (actually built on the ZIP format rather than Tar) that combine Java class files instead of arbitrary file types.

 

Tarring a directory doesn’t actually compress the data – it just combines the files into one file that takes up about as much space as the data did originally. So in practice, Tar files are almost always then zipped. GZipping, in particular, is popular. 

The “.tgz” file extension is used as a shorthand for “.tar.gz”, that is, the directory has been put into a Tar file, which was then compressed using the GZIP algorithm.

 

Tar files are typically opened from the command line, such as the following:

$ # This will expand the contents of
$ # my_directory.tar into the local directory
$ tar -xvf my_directory.tar
$ # This will untar and unzip a directory
$ # that has been tarred and gzipped
$ tar -zxf file.tar.gz
$ # This will tar the Homework3 directory
$ # into the file ILoveHomework.tar
$ tar -cf ILoveHomework.tar Homework3
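
You can also do the same thing from Python with the standard tarfile module. Here is a minimal sketch mirroring the shell commands above:

import tarfile

# Extract an existing archive into the current directory
with tarfile.open("my_directory.tar") as tf:
    tf.extractall(".")

# Create a new, gzipped archive from the Homework3 directory
with tarfile.open("ILoveHomework.tar.gz", "w:gz") as tf:
    tf.add("Homework3")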

 

GZip Files

Gzip is the most common compression format that you will see on Unix‐like systems such as Mac and Linux. Often, it’s used in conjunction with Tar to archive the contents of an entire directory. Encoding data with gzip is comparatively slow, but the format has the following advantages:

  • It compresses data super well.
  • Data can be decompressed quickly.
  • It can also be decompressed one line at a time, in case you want to operate on only part of the data without decompressing the whole file.

 

Under the hood, gzip runs on a compression algorithm called DEFLATE. A compressed gzip file is broken into blocks. The first part of each block contains some data about the block, including how the rest of the block is encoded (it will be some type of Huffman code, but you don’t need to worry about the details of those).

 

Once the gzip program has parsed this header, it can read the rest of the block 1 byte at a time. This means minimal RAM is being used up, so the decompression can happen near the top of the memory hierarchy, in cache, and hence proceed at breakneck speed.

 

The typical commands for gzipping/unzipping from the shell are simple:

$ gunzip myfile.txt.gz  # creates the raw file myfile.txt
$ gzip myfile.txt       # compresses the file into myfile.txt.gz

However, you can typically also just double‐click on a file – most operating systems can open gzip files natively.
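
The line‐at‐a‐time property is also easy to exploit from Python with the standard gzip module. In this sketch, process is a hypothetical stand‐in for whatever work you need to do on each line:

import gzip

# Stream a gzipped text file one line at a time,
# without ever holding the whole decompressed file in memory
with gzip.open("myfile.txt.gz", "rt") as f:
    for line in f:
        process(line)   # hypothetical per-line processing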

 

Zip Files

Zip files are very similar to Gzip files. In fact, they even use the same DEFLATE algorithm under the hood! There are some differences though, such as the fact that ZIP can compress an entire directory rather than just individual files.

 

Zipping and unzipping files is as easy with ZIP as with GZIP:

$ # This puts several files into a single zip file
$ zip filename.zip input1.txt input2.txt resume.doc pic1.jpg

$ # This will open the zip file and put
$ # all of its contents into the current directory
$ unzip filename.zip
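
Python’s standard zipfile module can do the same thing programmatically. Here is a minimal sketch mirroring the commands above:

import zipfile

# Put several files into a single zip archive
with zipfile.ZipFile("filename.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for fname in ["input1.txt", "input2.txt", "resume.doc", "pic1.jpg"]:
        zf.write(fname)

# Extract everything into the current directory
with zipfile.ZipFile("filename.zip") as zf:
    zf.extractall(".")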

 

Image Files

Image files can be broken down into two broad categories: rasterized and vectorized. 

Rasterized files break an image down into an array of pixels and encode things such as the brightness or color of each individual pixel. Sometimes, the image file will store the pixel array directly, and other times, it will store some compressed version of the pixel array. Almost all machine‐generated data will be rasterized.

 

Vectorized files, on the other hand, are a mathematical description of what the image should look like, complete with perfect circles, straight lines, and so on. They can be scaled to any size without losing resolution. Vectorized files are more likely to be company logos, animations, and similar things.

 

The most common vectorized image format you’re likely to run into is SVG, which is actually just an XML file under the hood (as I mentioned before, XML is great for domain‐specific languages!). However, in daily work as a data scientist, you’re most likely to encounter rasterized files.

 

A rasterized image is an array of pixels that, depending on the format, can be combined with metadata and then possibly subjected to some form of compression (sometimes using the DEFLATE algorithm, the same one GZIP uses). There are several considerations that differentiate the available formats:

 

Lossy versus lossless.

Many formats (such as BMP and PNG) encode the pixel array exactly – these are called lossless. But others (such as JPEG) allow you to reduce the size of the file by degrading the resolution of your image.

 

Grayscale versus RGB.

If images are black‐and‐white, then you only need one number per pixel. But if you have a colored image, then there needs to be some way to specify the color. Typically, this is done by using RGB encoding, where a pixel is specified by how much red, how much green, and how much blue it contains.

 

Transparency. Many images allow pixels to be partly transparent. The “alpha” of a pixel ranges from 0 to 1, with 0 being completely transparent and 1 being completely opaque.

 

Some of the most important image formats you should be aware of are as follows:

 

JPEG.

This is probably the single most important one in web traffic, prized for its ability to massively compress an image with almost invisible degradation. It is a lossy compression format, stores RGB colors, and does not allow for transparency.

 

PNG.

This is maybe the next most ubiquitous format. It is lossless and allows for transparent pixels. Personally, I find the transparent pixels make PNG files super useful when I’m putting together slide decks.

 

TIFF.

TIFF files are not common on the Internet, but they are a frequent format for storing high‐resolution pictures in the context of photography or science. They can be lossy or lossless.

 

The following Python code will read an image file. It takes care of any decompression or format‐specific stuff under the hood and returns the image as a NumPy array of integers. It will be a three‐dimensional array, with the first two dimensions corresponding to the normal width and height.

 

The image is read in as RGB by default, and the third dimension of the array indicates whether we are measuring the red, green, or blue content. The integers themselves will range from 0 to 255 since each is encoded with a single byte.

from scipy.ndimage import imread
img = imread('mypic.jpg')

 

If you want to read the image as grayscale, you can pass mode="F" and get a two‐dimensional array. If you instead want to include the alpha opacity as a fourth value for each pixel, pass in mode="RGBA".
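
One caveat: scipy.ndimage.imread was deprecated and then removed in newer versions of SciPy. If you’re on a recent environment, the third‐party imageio package offers a near drop‐in replacement (assuming it is installed); a minimal sketch:

import imageio

img = imageio.imread("mypic.jpg")   # NumPy array of shape (height, width, 3)
gray = img.mean(axis=2)             # one crude way to collapse RGB to grayscale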
