Twitter Cookbook

This cookbook is a collection of recipes for mining Twitter data. Each recipe is designed to be as simple and atomic as possible in solving a particular problem so that multiple recipes can be composed into more complex recipes with minimal effort. Think of each recipe as being a building block that, while useful in its own right, is even more useful in concert with other building blocks that collectively constitute more complex units of analysis. Unlike previous chapters, which contain a lot more prose than code, this cookbook provides a relatively light discussion and lets the code do more of the talking. The thought is that you'll likely be manipulating and composing the code in various ways to achieve particular objectives.

While most recipes involve little more than issuing a parameterized API call and post-processing the response into a convenient format, some recipes are even simpler (involving little more than a couple of lines of code), and others are considerably more complex. This cookbook is designed to help you by presenting some common problems and their solutions. In some cases, it may not be common knowledge that the data you desire is really just a couple of lines of code away. The value proposition is in giving you code that you can trivially adapt to suit your own purposes.

One fundamental software dependency you'll need for all of the recipes in this chapter is the twitter package, which you can install with pip per the rather predictable pip install twitter command from a terminal. Other software dependencies will be noted as they are introduced in individual recipes. If you're taking advantage of the book's virtual machine (which you are highly encouraged to do), the twitter package and all other dependencies will be preinstalled for you.

As you know from Chapter 1, Twitter's v1.1 API requires all requests to be authenticated, so each recipe assumes that you take advantage of Section 9.1 or Section 9.2 to first gain an authenticated API connector to use in each of the other recipes.

Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book's virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.

9.1. Accessing Twitter's API for Development Purposes

9.1.1. Problem

You want to mine your own account data or otherwise gain quick and easy API access for development purposes.

9.1.2. Solution

Use the twitter package and the OAuth 1.0a credentials provided in the application's settings to gain API access to your own account without any HTTP redirects.

9.1.3. Discussion

Twitter implements OAuth 1.0a, an authorization mechanism that's expressly designed so that users can grant third parties access to their data without having to do the unthinkable: doling out their usernames and passwords. While you can certainly take advantage of Twitter's OAuth implementation for production situations in which you'll need users to authorize your application to access their accounts, you can also use the credentials in your application's settings to gain instant access for development purposes or to mine the data in your own account.
Register an application under your Twitter account at http://dev.twitter.com/apps and take note of the consumer key, consumer secret, access token, and access token secret, which constitute the four credentials that any OAuth 1.0a-enabled application needs to ultimately gain account access. Figure 9-1 provides a screen capture of a Twitter application's settings. With these credentials in hand, you can use any OAuth 1.0a library to access Twitter's RESTful API, but we'll opt to use the twitter package, which provides a minimalist and Pythonic API wrapper around Twitter's RESTful API interface. When registering your application, you don't need to specify the callback URL since we are effectively bypassing the entire OAuth flow and simply using the credentials to immediately access the API. Example 9-1 demonstrates how to use these credentials to instantiate a connector to the API.

Example 9-1. Accessing Twitter's API for development purposes

import twitter

def oauth_login():
    # XXX: Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information
    # on Twitter's OAuth implementation.

    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

# Sample usage
twitter_api = oauth_login()

# Nothing to see by displaying twitter_api except that it's now a
# defined variable
print twitter_api

Keep in mind that the credentials used to connect are effectively the same as the username and password combination, so guard them carefully and specify the minimal level of access required in your application's settings. Read-only access is sufficient for mining your own account data.
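As a quick sanity check that the connector is actually authenticated, you can request your own profile. The following is a minimal sketch (not part of Example 9-1) that assumes the oauth_login function defined above; it relies on the twitter package's convention of mapping attribute access onto REST API paths, so account.verify_credentials() issues a request to Twitter's account/verify_credentials endpoint.

import json

# A sketch of a sanity check, assuming oauth_login from Example 9-1.
# The twitter package maps attribute access onto REST API paths, so this
# call hits Twitter's account/verify_credentials endpoint for the
# authenticated user.
twitter_api = oauth_login()
profile = twitter_api.account.verify_credentials()

print 'Authenticated as @%s' % (profile['screen_name'],)
print json.dumps(profile, indent=1)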
While convenient for accessing your own data from your own account, this shortcut provides no benefit if your goal is to write a client program for accessing someone else's data. You'll need to perform the full OAuth dance, as demonstrated in Example 9-2, for that situation.

9.2. Doing the OAuth Dance to Access Twitter's API for Production Purposes

9.2.1. Problem

You want to use OAuth so that your application can access another user's account data.

9.2.2. Solution

Implement the "OAuth dance" with the twitter package.

9.2.3. Discussion

The twitter package provides a built-in implementation of the so-called OAuth dance that works for a console application. It does so by implementing an out of band (oob) OAuth flow in which an application that does not run in a browser, such as a Python program, can securely gain these four credentials to access the API, and it allows you to easily request access to a particular user's account data as a standard "out of the box" capability. However, if you'd like to write a web application that accesses another user's account data, you may need to lightly adapt its implementation.

Although there may not be many practical reasons to actually implement an OAuth dance from within IPython Notebook (unless perhaps you were running a hosted IPython Notebook service that was used by other people), this recipe uses Flask as an embedded web server to demonstrate the dance with the same toolchain as the rest of the book. It could be easily adapted to work with an arbitrary web application framework of your choice, since the concepts are the same.

Figure 9-1 provides a screen capture of a Twitter application's settings. In an OAuth 1.0a flow, the consumer key and consumer secret values that were introduced as part of Section 9.1 uniquely identify your application. You provide these values to Twitter when requesting access to a user's data so that Twitter can then prompt the user with information about the nature of your request. Assuming the user approves your application, Twitter redirects back to the callback URL that you specify in your application settings and includes an OAuth verifier that is then exchanged for an access token and access token secret, which are used in concert with the consumer key and consumer secret to ultimately enable your application to access the account data. (For oob OAuth flows, you don't need to include a callback URL; Twitter provides the user with a PIN code as an OAuth verifier that must be copied/pasted back into the application as a manual intervention.) See Appendix B for additional details on an OAuth 1.0a flow.
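For an ordinary console program, the built-in dance mentioned above is only a few lines. The following sketch assumes the oauth_dance helper exposed by the version of the twitter package used throughout this book; it prints an authorization URL to visit and prompts you to paste back the PIN code (the oob OAuth verifier).

import twitter
from twitter.oauth_dance import oauth_dance

# Placeholder values; substitute your own application's credentials
CONSUMER_KEY = ''
CONSUMER_SECRET = ''

# Prompts you to authorize the app in a browser and paste back the PIN
# code that Twitter displays (the oob OAuth verifier), then returns the
# final access token pair. A sketch, assuming the oauth_dance helper
# behaves as in the version of the twitter package used in this book.
oauth_token, oauth_token_secret = oauth_dance('Your App Name',
                                              CONSUMER_KEY, CONSUMER_SECRET)

auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)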
Figure 9-1. Sample OAuth settings for a Twitter application

Example 9-2 illustrates how to use the consumer key and consumer secret to do the OAuth dance with the twitter package and gain access to a user's data. The access token and access token secret are written to disk, which streamlines future authorizations. According to Twitter's Development FAQ, Twitter does not currently expire access tokens, which means that you can reliably store them and use them on behalf of the user indefinitely, as long as you comply with the applicable terms of service.

Example 9-2. Doing the OAuth dance to access Twitter's API for production purposes

import json
from flask import Flask, request
import multiprocessing
from threading import Timer
from IPython.display import IFrame
from IPython.display import display
from IPython.display import Javascript as JS

import twitter
from twitter.oauth_dance import parse_oauth_tokens
from twitter.oauth import read_token_file, write_token_file

# Note: This code is exactly the flow presented in the _AppendixB notebook

OAUTH_FILE = "resources/ch09-twittercookbook/twitter_oauth"

# XXX: Go to http://twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
# See https://dev.twitter.com/docs/auth/oauth for more information
# on Twitter's OAuth implementation, and ensure that oauth_callback
# is defined in your application settings as shown next if you are
# using Flask in this IPython Notebook.

# Define a few variables that will bleed into the lexical scope of a
# couple of functions that follow
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
oauth_callback = 'http://127.0.0.1:5000/oauth_helper'

# Set up a callback handler for when Twitter redirects back to us after
# the user authorizes the app

webserver = Flask("TwitterOAuth")

@webserver.route("/oauth_helper")
def oauth_helper():

    oauth_verifier = request.args.get('oauth_verifier')

    # Pick back up credentials from ipynb_oauth_dance
    oauth_token, oauth_token_secret = read_token_file(OAUTH_FILE)

    _twitter = twitter.Twitter(
        auth=twitter.OAuth(
            oauth_token, oauth_token_secret, CONSUMER_KEY, CONSUMER_SECRET),
        format='', api_version=None)

    oauth_token, oauth_token_secret = parse_oauth_tokens(
        _twitter.oauth.access_token(oauth_verifier=oauth_verifier))

    # This web server only needs to service one request, so shut it down
    shutdown_after_request = request.environ.get('werkzeug.server.shutdown')
    shutdown_after_request()

    # Write out the final credentials that can be picked up after the
    # following blocking call to webserver.run()
    write_token_file(OAUTH_FILE, oauth_token, oauth_token_secret)
    return "%s %s written to %s" % (oauth_token, oauth_token_secret, OAUTH_FILE)

# To handle Twitter's OAuth 1.0a implementation, we'll just need to
# implement a custom "oauth dance" and will closely follow the pattern
# defined in twitter.oauth_dance

def ipynb_oauth_dance():

    _twitter = twitter.Twitter(
        auth=twitter.OAuth('', '', CONSUMER_KEY, CONSUMER_SECRET),
        format='', api_version=None)

    oauth_token, oauth_token_secret = parse_oauth_tokens(
        _twitter.oauth.request_token(oauth_callback=oauth_callback))

    # Need to write these interim values out to a file to pick up on the
    # callback from Twitter that is handled by the web server in
    # /oauth_helper
    write_token_file(OAUTH_FILE, oauth_token, oauth_token_secret)

    oauth_url = ('http://api.twitter.com/oauth/authorize?oauth_token=' +
                 oauth_token)

    # Tap the browser's native capabilities to access the web server
    # through a new window to get user authorization
    display(JS("window.open('%s')" % oauth_url))

# After the webserver.run() blocking call, start the OAuth dance that will
# ultimately cause Twitter to redirect a request back to it. Once that
# request is serviced, the web server will shut down and program flow will
# resume with the OAUTH_FILE containing the necessary credentials.
Timer(1, lambda: ipynb_oauth_dance()).start()

webserver.run(host='0.0.0.0')

# The values that are read from this file are written out at the end of
# /oauth_helper
oauth_token, oauth_token_secret = read_token_file(OAUTH_FILE)

# These four credentials are what is needed to authorize the application
auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)

print twitter_api

You should be able to observe that the access token and access token secret that your application retrieves are the same values as the ones in your application's settings, and this is no coincidence. Guard these values carefully, as they are effectively the same thing as a username and password combination.
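Because Example 9-2 persists the final token pair to disk, subsequent sessions can skip the dance entirely. Here is a minimal sketch of that shortcut; the helper name is hypothetical, and it assumes the same OAUTH_FILE path that Example 9-2 writes.

import os
import twitter
from twitter.oauth import read_token_file

OAUTH_FILE = "resources/ch09-twittercookbook/twitter_oauth"

# Hypothetical convenience wrapper: reuse the token pair persisted by
# Example 9-2 instead of repeating the OAuth dance on every run
def login_from_token_file(consumer_key, consumer_secret):
    if not os.path.exists(OAUTH_FILE):
        raise IOError('No stored tokens. Do the OAuth dance first.')
    oauth_token, oauth_token_secret = read_token_file(OAUTH_FILE)
    auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                               consumer_key, consumer_secret)
    return twitter.Twitter(auth=auth)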
9.3. Discovering the Trending Topics

9.3.1. Problem

You want to know what is trending on Twitter for a particular geographic area such as the United States, another country or group of countries, or possibly even the entire world.

9.3.2. Solution

Twitter's Trends API enables you to get the trending topics for geographic areas that are designated by a Where On Earth (WOE) ID, as defined and maintained by Yahoo.

9.3.3. Discussion

A place is an essential concept in Twitter's development platform, and trending topics are accordingly constrained by geography to provide the best API possible for querying for trending topics (as shown in Example 9-3). Like all other APIs, it returns the trending topics as JSON data, which can be converted to standard Python objects and then manipulated with list comprehensions or similar techniques. This means it's fairly easy to explore the API responses. Try experimenting with a variety of WOE IDs to compare and contrast the trends from various geographic regions. For example, compare and contrast trends in two different countries, or compare a trend in a particular country to a trend in the world.

You'll need to complete a short registration with Yahoo in order to access and look up Where On Earth (WOE) IDs as part of one of their developer products. It's painless and well worth the couple of minutes that it takes to do.

Example 9-3. Discovering the trending topics

import json
import twitter

def twitter_trends(twitter_api, woe_id):
    # Prefix ID with the underscore for query string parameterization.
    # Without the underscore, the twitter package appends the ID value
    # to the URL itself as a special-case keyword argument.
    return twitter_api.trends.place(_id=woe_id)

# Sample usage

twitter_api = oauth_login()

# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/ for details on
# Yahoo Where On Earth ID

WORLD_WOE_ID = 1
world_trends = twitter_trends(twitter_api, WORLD_WOE_ID)
print json.dumps(world_trends, indent=1)

US_WOE_ID = 23424977
us_trends = twitter_trends(twitter_api, US_WOE_ID)
print json.dumps(us_trends, indent=1)
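As the discussion notes, the response converts to ordinary Python objects that list comprehensions handle nicely. A quick post-processing sketch, assuming the shape returned by the v1.1 trends/place endpoint (a one-element list whose trends field holds the individual topics):

# A sketch of post-processing the Example 9-3 results: pull out just the
# trend names. The outer list contains a single object whose 'trends'
# field is a list of objects, each with a 'name' field.
world_trends = twitter_trends(twitter_api, WORLD_WOE_ID)
trend_names = [ trend['name'] for trend in world_trends[0]['trends'] ]

print trend_names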
9.4. Searching for Tweets

9.4.1. Problem

You want to search Twitter for tweets using specific keywords and query constraints.

9.4.2. Solution

Use the Search API to perform a custom query.

9.4.3. Discussion

Example 9-4 illustrates how to use the Search API to perform a custom query against the entire Twitterverse. Similar to the way that search engines work, Twitter's Search API returns results in batches, and you can configure the number of results per batch up to a maximum value of 100 by using the count keyword parameter. It is possible that more results than the value you specify for count may be available for any given query, and in the parlance of Twitter's API, you'll need to use a cursor to navigate to the next batch of results.

Cursors are a new enhancement to Twitter's v1.1 API and provide a more robust scheme than the pagination paradigm offered by the v1.0 API, which involved specifying a page number and a results-per-page constraint. The essence of the cursor paradigm is that it is able to better accommodate the dynamic and real-time nature of the Twitter platform. For example, Twitter's API cursors are designed to inherently take into account the possibility that updated information may become available in real time while you are navigating a batch of search results. In other words, it could be the case that while you are navigating a batch of query results, relevant information becomes available that you would want to have included in your current results while you are navigating them, rather than needing to dispatch a new query.

Example 9-4 illustrates how to use the Search API and navigate the cursor that's included in a response to fetch more than one batch of results.

Example 9-4. Searching for tweets

def twitter_search(twitter_api, q, max_results=200, **kw):

    # See https://dev.twitter.com/docs/api/1.1/get/search/tweets and
    # https://dev.twitter.com/docs/using-search for details on advanced
    # search criteria that may be useful for keyword arguments

    search_results = twitter_api.search.tweets(q=q, count=100, **kw)

    statuses = search_results['statuses']

    # Iterate through batches of results by following the cursor until we
    # reach the desired number of results, keeping in mind that OAuth users
    # can "only" make 180 search queries per 15-minute interval. See
    # https://dev.twitter.com/docs/rate-limiting/1.1/limits for details.
    # A reasonable number of results is ~1000, although that number of
    # results may not exist for all queries.

    # Enforce a reasonable limit
    max_results = min(1000, max_results)

    for _ in range(10): # 10*100 = 1000
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError, e: # No more results when next_results doesn't exist
            break

        # Create a dictionary from next_results, which has the following
        # form: ?max_id=313519052523986943&q=NCAA&include_entities=1
        kwargs = dict([ kv.split('=')
                        for kv in next_results[1:].split("&") ])

        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']

        if len(statuses) > max_results:
            break

    return statuses

# Sample usage

twitter_api = oauth_login()

q = "CrossFit"

results = twitter_search(twitter_api, q, max_results=10)

# Show one sample search result by slicing the list...
print json.dumps(results[0], indent=1)
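Since OAuth users can "only" make 180 search queries per 15-minute interval, long-running scripts will eventually trip the rate limit. The following defensive wrapper is a sketch of one way to cope (the function name is hypothetical); it assumes the twitter package raises TwitterHTTPError with the underlying HTTP status code available, and it simply sleeps out the window before retrying once.

import sys
import time
import twitter

# Hypothetical wrapper around twitter_search from Example 9-4: if Twitter
# responds with HTTP 429 (rate limit exceeded), sleep out the 15-minute
# window and retry once; any other error propagates as usual
def twitter_search_with_backoff(twitter_api, q, **kw):
    try:
        return twitter_search(twitter_api, q, **kw)
    except twitter.api.TwitterHTTPError, e:
        if e.e.code == 429:
            print >> sys.stderr, 'Rate limit exceeded. Sleeping 15 minutes.'
            time.sleep(60*15 + 5)
            return twitter_search(twitter_api, q, **kw)
        raise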
9.5. Constructing Convenient Function Calls

9.5.1. Problem

You want to bind certain parameters to function calls and pass around a reference to the bound function in order to simplify coding patterns.

9.5.2. Solution

Use Python's functools.partial to create fully or partially bound functions that can be elegantly passed around and invoked by other code without the need to pass additional parameters.

9.5.3. Discussion

Although not a technique that is exclusive to design patterns with the Twitter API, functools.partial is a pattern that you'll find incredibly convenient to use in combination with the twitter package and many of the patterns in this cookbook and in your other Python programming experiences. For example, you may find it cumbersome to continually pass around a reference to an authenticated Twitter API connector (twitter_api, as illustrated in these recipes, is usually the first argument to most functions) and want to create a function that partially satisfies the function arguments so that you can freely pass around a function that can be invoked with its remaining parameters.

Another example that illustrates the convenience of partially binding parameters is that you may want to bind a Twitter API connector and a WOE ID for a geographic area to the Trends API as a single function call that can be passed around and simply invoked as is. Yet another possibility is that you may find that routinely typing json.dumps(..., indent=1) is rather cumbersome, so you could go ahead and partially apply the keyword argument and rename the function to something shorter like pp (pretty-print) to save some repetitive typing.

The possibilities are vast, and while you could opt to use Python's def keyword to define functions as a possibility that usually achieves the same end, you may find that it's more concise and elegant to use functools.partial in some situations. Example 9-5 demonstrates a few possibilities that you may find useful.

Example 9-5. Constructing convenient function calls

from functools import partial

pp = partial(json.dumps, indent=1)

twitter_world_trends = partial(twitter_trends, twitter_api, WORLD_WOE_ID)

print pp(twitter_world_trends())

authenticated_twitter_search = partial(twitter_search, twitter_api)
results = authenticated_twitter_search("iPhone")
print pp(results)

authenticated_iphone_twitter_search = partial(authenticated_twitter_search,
                                              "iPhone")
results = authenticated_iphone_twitter_search()
print pp(results)

9.6. Saving and Restoring JSON Data with Text Files

9.6.1. Problem

You want to store relatively small amounts of data that you've fetched from Twitter's API for recurring analysis or archival purposes.

9.6.2. Solution

Write the data out to a text file in a convenient and portable JSON representation.

9.6.3. Discussion

Although text files won't be appropriate for every occasion, they are a portable and convenient option to consider if you need to just dump some data out to disk to save it for experimentation or analysis. In fact, this would be considered a best practice so that you minimize the number of requests to Twitter's API and avoid the inevitable rate-limiting issues that you'll likely encounter. After all, it certainly would not be in your best interest or Twitter's best interest to repetitively hit the API and request the same data over and over again.

Example 9-6 demonstrates a fairly routine use of Python's io package to ensure that any data that you write to and read from disk is properly encoded and decoded as UTF-8 so that you can avoid the (often dreaded and not often well understood) UnicodeDecodeError exceptions that commonly occur with serialization and deserialization of text data in Python 2.x applications.

Example 9-6. Saving and restoring JSON data with text files

import io, json

def save_json(filename, data):
    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename),
                 'w', encoding='utf-8') as f:
        f.write(unicode(json.dumps(data, ensure_ascii=False)))

def load_json(filename):
    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename),
                 encoding='utf-8') as f:
        # Parse the JSON text back into Python objects rather than
        # returning the raw string
        return json.loads(f.read())

# Sample usage

q = 'CrossFit'

twitter_api = oauth_login()
results = twitter_search(twitter_api, q, max_results=10)

save_json(q, results)
results = load_json(q)

print json.dumps(results, indent=1)
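The ensure_ascii=False argument in save_json is what keeps non-ASCII text readable in the file, which is why it's paired with io.open(..., encoding='utf-8'). A quick illustration of the difference, using a hypothetical tweet fragment:

# -*- coding: utf-8 -*-
import json

status = {u'text': u'¡Hola, Twitter!'}

# The default escapes anything outside of ASCII...
print json.dumps(status)

# ...whereas ensure_ascii=False preserves readable UTF-8 text; encode
# explicitly for printing, since the result is a unicode string
print json.dumps(status, ensure_ascii=False).encode('utf-8')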
9.7. Saving and Accessing JSON Data with MongoDB

9.7.1. Problem

You want to store and access nontrivial amounts of JSON data from Twitter API responses.

9.7.2. Solution

Use a document-oriented database such as MongoDB to store the data in a convenient JSON format.

9.7.3. Discussion

While a directory containing a relatively small number of properly encoded JSON files may work well for trivial amounts of data, you may be surprised at how quickly you start to amass enough data that flat files become unwieldy. Fortunately, document-oriented databases such as MongoDB are ideal for storing Twitter API responses, since they are designed to efficiently store JSON data.

MongoDB is a robust and well-documented database that works well for small or large amounts of data. It provides powerful query operators and indexing capabilities that significantly streamline the amount of analysis that you'll need to do in custom Python code. In most cases, if you put some thought into how to index and query your data, MongoDB will be able to outperform your custom manipulations through its use of indexes and efficient BSON representation on disk. See Chapter 6 for a fairly extensive introduction to MongoDB in the context of storing (JSONified mailbox) data and using MongoDB's aggregation framework to query it in nontrivial ways. Example 9-7 illustrates how to connect to a running MongoDB database to save and load data.

MongoDB is easy to install and contains excellent online documentation for both installation/configuration and query/indexing operations. The virtual machine for this book takes care of installing and starting it for you if you'd like to jump right in.

Example 9-7. Saving and accessing JSON data with MongoDB

import json
import pymongo # pip install pymongo

def save_to_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):

    # Connects to the MongoDB server running on localhost:27017 by default
    client = pymongo.MongoClient(**mongo_conn_kw)

    # Get a reference to a particular database
    db = client[mongo_db]

    # Reference a particular collection in the database
    coll = db[mongo_db_coll]

    # Perform a bulk insert and return the IDs
    return coll.insert(data)

def load_from_mongo(mongo_db, mongo_db_coll, return_cursor=False,
                    criteria=None, projection=None, **mongo_conn_kw):

    # Optionally, use criteria and projection to limit the data that is
    # returned as documented in
    # http://docs.mongodb.org/manual/reference/method/db.collection.find/

    # Consider leveraging MongoDB's aggregations framework for more
    # sophisticated queries.

    client = pymongo.MongoClient(**mongo_conn_kw)
    db = client[mongo_db]
    coll = db[mongo_db_coll]

    if criteria is None:
        criteria = {}

    if projection is None:
        cursor = coll.find(criteria)
    else:
        cursor = coll.find(criteria, projection)

    # Returning a cursor is recommended for large amounts of data
    if return_cursor:
        return cursor
    else:
        return [ item for item in cursor ]

# Sample usage

q = 'CrossFit'

twitter_api = oauth_login()
results = twitter_search(twitter_api, q, max_results=10)

save_to_mongo(results, 'search_results', q)

load_from_mongo('search_results', q)
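To see the criteria and projection parameters in action, here is a hypothetical refinement of the sample usage that pulls back only the fields of interest from matching documents, using standard MongoDB query operators:

# Hypothetical follow-up to Example 9-7's sample usage: fetch only the
# text and retweet counts of tweets whose text matches a pattern,
# letting MongoDB's $regex operator do the filtering
criteria = {'text': {'$regex': 'crossfit', '$options': 'i'}}
projection = {'text': 1, 'retweet_count': 1, '_id': 0}

for doc in load_from_mongo('search_results', q,
                           criteria=criteria, projection=projection):
    print doc['retweet_count'], doc['text']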
9.8. Sampling the Twitter Firehose with the Streaming API

9.8.1. Problem

You want to analyze what people are tweeting about right now from a real-time stream of tweets as opposed to querying the Search API for what might be slightly (or very) dated information. Or, you want to begin accumulating nontrivial amounts of data about a particular topic for later analysis.

9.8.2. Solution

Use Twitter's Streaming API to sample public data from the Twitter firehose.

9.8.3. Discussion

Twitter makes up to 1% of all tweets available in real time through a random sampling technique that represents the larger population of tweets and exposes these tweets through the Streaming API. Unless you want to go to a third-party provider such as GNIP or DataSift (which may actually be well worth the cost in many situations), this is about as good as it gets. Although you might think that 1% seems paltry, take a moment to realize that during peak loads, tweet velocity can be tens of thousands of tweets per second. For a broad enough topic, actually storing all of the tweets you sample could quickly become more of a problem than you might think. Access to up to 1% of all public tweets is significant.

Whereas the Search API is a little bit easier to use and queries for "historical" information (which in the Twitterverse could mean data that is minutes or hours old, given how fast trends emerge and dissipate), the Streaming API provides a way to sample from worldwide information in as close to real time as you'll ever be able to get. The twitter package exposes the Streaming API in an easy-to-use manner in which you can filter the firehose based upon keyword constraints, which is an intuitive and convenient way to access this information. As opposed to constructing a twitter.Twitter connector, you construct a twitter.TwitterStream connector, which takes a keyword argument that's the same twitter.oauth.OAuth type as previously introduced in Section 9.1 and Section 9.2. The sample code in Example 9-8 demonstrates how to get started with Twitter's Streaming API.

Example 9-8. Sampling the Twitter firehose with the Streaming API

# Finding topics of interest by using the filtering capabilities the
# Streaming API offers

import sys
import twitter

# Query terms

q = 'CrossFit' # Comma-separated list of terms

print >> sys.stderr, 'Filtering the public timeline for track="%s"' % (q,)

# Returns an instance of twitter.Twitter
twitter_api = oauth_login()

# Reference the self.auth parameter
twitter_stream = twitter.TwitterStream(auth=twitter_api.auth)

# See https://dev.twitter.com/docs/streaming-apis
stream = twitter_stream.statuses.filter(track=q)

# For illustrative purposes, when all else fails, search for Justin Bieber
# and something is sure to turn up (at least, on Twitter)

for tweet in stream:
    print tweet['text']
    # Save to a database in a particular collection
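The comment at the end of Example 9-8 hints at the natural next step: persisting the sampled tweets as they arrive. A sketch of that composition, assuming the stream from Example 9-8 and save_to_mongo from Example 9-7 (the 'streaming_results' database name is illustrative):

# A sketch composing Example 9-8 with Example 9-7: write each status to
# MongoDB as it arrives from the stream, keyed by the query term
for tweet in stream:
    print tweet['text']
    save_to_mongo([tweet], 'streaming_results', q)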
9.9. Collecting Time-Series Data

9.9.1. Problem

You want to periodically query Twitter's API for specific results or trending topics and store the data for time-series analysis.

9.9.2. Solution

Use Python's built-in time.sleep function inside of an infinite loop to issue a query and store the results to a database such as MongoDB if the use of the Streaming API as illustrated in Section 9.8 won't work.

9.9.3. Discussion

Although it's easy to get caught up in pointwise queries on particular keywords at a particular instant in time, the ability to sample data that's collected over time and detect trends and patterns gives us access to a radically powerful form of analysis that is commonly overlooked. Every time you look back and say, "I wish I'd known..." could have been a potential opportunity if you'd had the foresight to preemptively collect data that might have been useful for extrapolation or making predictions about the future (where applicable).

Time-series analysis of Twitter data can be truly fascinating given the ebbs and flows of topics and updates that can occur. Although it may be useful for many situations to sample from the firehose and store the results to a document-oriented database like MongoDB, it may be easier or more appropriate in some situations to periodically issue queries and record the results into discrete time intervals. For example, you might query the trending topics for a variety of geographic regions throughout a 24-hour period and measure the rate at which various trends change, compare rates of change across geographies, find the longest- and shortest-lived trends, and more.

Another compelling possibility that is being actively explored is correlations between sentiment as expressed on Twitter and stock markets. It's easy enough to zoom in on particular keywords, hashtags, or trending topics and later correlate the data against actual stock market changes; this could be an early step in building a bot to make predictions about markets and commodities.

Example 9-9 is essentially a composition of Section 9.1, Example 9-3, and Example 9-7, and it demonstrates how you can use recipes as primitive building blocks to create more complex scripts with a little bit of creativity and copy/pasting.

Example 9-9. Collecting time-series data

import sys
import datetime
import time
import twitter

def get_time_series_data(api_func, mongo_db_name, mongo_db_coll,
                         secs_per_interval=60, max_intervals=15,
                         **mongo_conn_kw):

    # Default settings of 15 intervals and 1 API call per interval ensure
    # that you will not exceed the Twitter rate limit.

    interval = 0

    while True:

        # A timestamp of the form "2013-06-14 12:52:07"
        now = str(datetime.datetime.now()).split(".")[0]

        ids = save_to_mongo(api_func(), mongo_db_name,
                            mongo_db_coll + "-" + now, **mongo_conn_kw)

        print >> sys.stderr, "Wrote {0} trends".format(len(ids))
        print >> sys.stderr, "Zzz..."
        sys.stderr.flush()

        time.sleep(secs_per_interval) # seconds
        interval += 1

        if interval >= max_intervals:
            break

# Sample usage

get_time_series_data(twitter_world_trends, 'time-series',
                     'twitter_world_trends')
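Each interval in Example 9-9 lands in its own timestamp-suffixed collection, so a later analysis can walk those collections in order. A minimal inspection sketch, assuming a default local MongoDB connection and the collection_names method available in the pymongo version used with this book:

import pymongo

# A sketch that inspects what Example 9-9 wrote: each interval of trends
# lives in its own timestamp-suffixed collection of the 'time-series' db
client = pymongo.MongoClient()
db = client['time-series']

for coll_name in sorted(db.collection_names()):
    if coll_name.startswith('twitter_world_trends'):
        print coll_name, db[coll_name].count()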
9.10. Extracting Tweet Entities

9.10.1. Problem

You want to extract entities such as username mentions, hashtags, and URLs from tweets for analysis.

9.10.2. Solution

Extract the tweet entities from the entities field of tweets.

9.10.3. Discussion

Twitter's API now provides tweet entities as a standard field for most of its API responses, where applicable. The entities field, illustrated in Example 9-10, includes user mentions, hashtags, references to URLs, media objects (such as images and videos), and financial symbols such as stock tickers. At the current time, not all fields may apply for all situations. For example, the media field will appear and be populated in a tweet only if a user embeds the media using a Twitter client that specifically uses a particular API for embedding the content; simply copying/pasting a link to a YouTube video won't necessarily populate this field.

See the Tweet Entities API documentation for more details, including information on some of the additional fields that are available for each type of entity. For example, in the case of a URL, Twitter offers several variations, including the shortened and expanded forms as well as a value that may be more appropriate for displaying in a user interface for certain situations.

Example 9-10. Extracting tweet entities

def extract_tweet_entities(statuses):

    # See https://dev.twitter.com/docs/tweet-entities for more details
    # on tweet entities

    if len(statuses) == 0:
        return [], [], [], [], []

    screen_names = [ user_mention['screen_name']
                         for status in statuses
                             for user_mention in status['entities']['user_mentions'] ]

    hashtags = [ hashtag['text']
                     for status in statuses
                         for hashtag in status['entities']['hashtags'] ]

    urls = [ url['expanded_url']
                 for status in statuses
                     for url in status['entities']['urls'] ]

    symbols = [ symbol['text']
                    for status in statuses
                        for symbol in status['entities']['symbols'] ]

    # In some circumstances (such as search results), the media entity
    # may not appear, so fall back to an empty list when it's absent
    media = [ media['url']
                  for status in statuses
                      for media in status['entities'].get('media', []) ]

    return screen_names, hashtags, urls, media, symbols

# Sample usage

q = 'CrossFit'

statuses = twitter_search(twitter_api, q)
screen_names, hashtags, urls, media, symbols = extract_tweet_entities(statuses)

# Explore the first five items for each...

print json.dumps(screen_names[0:5], indent=1)
print json.dumps(hashtags[0:5], indent=1)
print json.dumps(urls[0:5], indent=1)
print json.dumps(media[0:5], indent=1)
print json.dumps(symbols[0:5], indent=1)

9.11. Finding the Most Popular Tweets in a Collection of Tweets

9.11.1. Problem

You want to determine which tweets are the most popular among a collection of search results or any other batch of tweets, such as a user timeline.

9.11.2. Solution

Analyze the retweet_count field of a tweet to determine whether or not a tweet was retweeted and, if so, how many times.

9.11.3. Discussion

Analyzing the retweet_count field of a tweet, as shown in Example 9-11, is perhaps the most straightforward measure of a tweet's popularity because it stands to reason that popular tweets will be shared with others. Depending on your particular interpretation of "popular," however, another possible value that you could incorporate into a formula for determining a tweet's popularity is its favorite_count, which is the number of times a user has bookmarked a tweet.

For example, you might weight the retweet_count at 1.0 and the favorite_count at 0.1 to add a marginal amount of weight to tweets that have been both retweeted and favorited if you wanted to use favorite_count as a tiebreaker. The particular choice of values in a formula is entirely up to you and will depend on how important you think each of these fields is in the overall context of the problem that you are trying to solve. Other possibilities, such as incorporating an exponential decay that accounts for time and weights recent tweets more heavily than less recent tweets, may prove useful in certain analyses.

See also Section 9.14 and Section 9.15 for some additional discussion that may be helpful in navigating the space of analyzing and applying attribution to retweets, which can be slightly more confusing than it initially seems.
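Example 9-11 itself falls outside this excerpt, but the approach the discussion describes takes only a few lines. A sketch under those assumptions (the retweet threshold and the 0.1 weighting are illustrative values drawn from the discussion, not the book's Example 9-11):

# A sketch of ranking tweets as described above (not the book's
# Example 9-11): retweet_count is the primary measure of popularity,
# with favorite_count weighted at 0.1 as a tiebreaker
def find_popular_tweets(statuses, min_retweets=3):
    popular = [ status for status in statuses
                    if status['retweet_count'] > min_retweets ]
    return sorted(popular,
                  key=lambda s: s['retweet_count'] + 0.1*s.get('favorite_count', 0),
                  reverse=True)

# Sample usage

q = 'CrossFit'

twitter_api = oauth_login()
statuses = twitter_search(twitter_api, q, max_results=200)

for status in find_popular_tweets(statuses)[:5]:
    print status['retweet_count'], status['text']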