diff --git a/LICENSE b/LICENSE old mode 100644 new mode 100755 diff --git a/README.rst b/README.rst index 2bd7022..64dd40d 100644 --- a/README.rst +++ b/README.rst @@ -1,195 +1,241 @@ -Python Twitter Search API -========================= +.. .. image:: https://img.shields.io/endpoint?url=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Ftwbadges.glitch.me%2Fbadges%2Fv2 +.. :target: https://developer.twitter.com/en/docs/twitter-api +.. :alt: Twitter API v2 -This project serves as a wrapper for the `Twitter premium and enterprise -search -APIs `__, -providing a command-line utility and a Python library. Pretty docs can -be seen `here `__. +Python client for the Twitter API v2 search endpoints +=========================================================== -Features -======== +Welcome to the ``v2`` branch of the Python search client. This branch was born from the main branch that supports +premium and enterprise tiers of Twitter search. This branch supports the `Twitter API v2 'recent' and 'all' search endpoints `__ only, and drops support for the premium and enterprise tiers. -- Supports 30-day Search and Full Archive Search (not the standard - Search API at this time). -- Command-line utility is pipeable to other tools (e.g., ``jq``). -- Automatically handles pagination of search results with specifiable - limits -- Delivers a stream of data to the user for low in-memory requirements -- Handles enterprise and premium authentication methods -- Flexible usage within a python program -- Compatible with our group's `Tweet - Parser `__ for rapid - extraction of relevant data fields from each tweet payload -- Supports the Search Counts endpoint, which can reduce API call usage - and provide rapid insights if you only need Tweet volumes and not - Tweet payloads +This project serves as a wrapper for the Twitter API v2 search endpoints (/search/recent and /search/all), providing a command-line utility and a Python library. -Installation -============ +The search endpoint you want to hit is specified in the library's YAML file: -The ``searchtweets`` library is on Pypi: +.. code:: yaml -.. code:: bash + search_tweets_v2: + endpoint: https://api.twitter.com/2/tweets/search/recent #Or https://api.twitter.com/2/tweets/search/all - pip install searchtweets -Or you can install the development version locally via +The 'recent' search endpoint provides Tweets from the past 7 days. The 'all' search endpoint, launched in January 2021 as part of the 'academic research' tier of Twitter API v2 access, provides access to all publicly available Tweets posted since March 2006. -.. code:: bash +To learn more about the Twitter academic research program, see this `Twitter blog post <https://blog.twitter.com/developer/en_us/topics/tips/2021/enabling-the-future-of-academic-research-with-the-twitter-api.html>`__. - git clone https://github.com/twitterdev/search-tweets-python - cd search-tweets-python - pip install -e . +To download and install this package, go to: https://pypi.org/project/searchtweets-v2/ --------------- +If you are looking for the original version that works with premium and enterprise versions of search, head on over to +the main or ``enterprise-premium`` branch. (Soon, the v2 version will be promoted to the main branch.)
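For orientation, here is a minimal sketch of the Python workflow documented later in this README (loading credentials, generating request parameters, and collecting Tweets). The query string and limits are illustrative only, and the credential file is assumed to follow the YAML layout shown above:

.. code:: python

    from searchtweets import load_credentials, gen_request_parameters, collect_results

    # Credentials (and the chosen 'recent' or 'all' endpoint) come from the YAML file.
    search_args = load_credentials("~/.twitter_keys.yaml",
                                   yaml_key="search_tweets_v2",
                                   env_overwrite=False)

    # Build the request parameters for a simple query.
    query = gen_request_parameters("snow has:media -is:retweet", results_per_call=100)

    # Page through results until up to 500 Tweets have been collected.
    tweets = collect_results(query, max_tweets=500, result_stream_args=search_args)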
-Credential Handling -=================== -The premium and enterprise Search APIs use different authentication -methods and we attempt to provide a seamless way to handle -authentication for all customers. We know credentials can be tricky or -annoying - please read this in its entirety. +Features +======== -Premium clients will require the ``bearer_token`` and ``endpoint`` -fields; Enterprise clients require ``username``, ``password``, and -``endpoint``. If you do not specify the ``account_type``, we attempt to -discern the account type and declare a warning about this behavior. +- Supports Twitter API v2 'recent' and 'all' search. +- Supports the configuration of v2 `expansions `_ and `fields `_. +- Supports multiple output formats: + * Original API responses (new default) + * Stream of messages (previous default in versions <1.0.7) + * New 'atomic' format with expansions included in tweets. +- Supports a new "polling" mode using the ``since-id`` search request parameter. The ``since-id``, along with the new ``until-id`` provide a way to navigate the public Tweet archive by Tweet ID. +- Supports additional ways to specify ``start-time`` and ``end-time`` request parameters: -For premium search products, we are using app-only authentication and -the bearer tokens are not delivered with an expiration time. You can -provide either: - your application key and secret (the library will -handle bearer-token authentication) - a bearer token that you get -yourself + - #d - For example, '2d' sets ``start-time`` to (exactly) two days ago. + - #h - For example, '12h' sets ``start-time`` to (exactly) twelve hours ago. + - #m - For example, '15m' sets ``start-time`` to (exactly) fifteen minutes ago. -Many developers might find providing your application key and secret -more straightforward and letting this library manage your bearer token -generation for you. Please see -`here `__ -for an overview of the premium authentication method. + These are handy for kicking off searches with a backfill period, and also work with the ``end-time`` request parameter. -We support both YAML-file based methods and environment variables for -storing credentials, and provide flexible handling with sensible -defaults. +These features were inherited from the enterprise/premium version: -YAML method ------------ +- Command-line utility is pipeable to other tools (e.g., ``jq``). +- Automatically handles pagination of search results with specifiable limits. +- Delivers a stream of data to the user for low in-memory requirements. +- Handles OAuth 2 and Bearer Token authentication. +- Flexible usage within a python program. + + +Twitter API v2 search updates +==================================== + +Twitter API v2 represents an opportunity to apply previous learnings from building Twitter API v1.1. and the premium and enterprise tiers of endpoints, and redesign and rebuild from the ground up. While building this v2 version of the `search-tweets-python` library, +we took the opportunity to update fundamental things. This library provides example scripts, and one example is updating their command-line arguments to better match new v2 conventions. Instead of setting search periods with `start-datetime` and `end-datetime`, +they have been shortened to match current search request parameters: `start-time` and `end-time`. Throughout the code, we no longer use parlance that references `rules` and `PowerTrack`, and now reference `queries` and the v2 recent search endpoint. 
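To make the updated conventions concrete, here is a sketch of how the relative ``start-time`` shorthand and the ``since-id`` polling mode listed under Features translate into request parameters when building queries in Python. The keyword arguments (``start_time``, ``since_id``) are assumed here to mirror the request parameters, and the Tweet ID shown is a placeholder:

.. code:: python

    from searchtweets import gen_request_parameters

    # Backfill run: start the search window (exactly) two days ago.
    backfill_query = gen_request_parameters("(snow OR rain) -is:retweet",
                                            results_per_call=100,
                                            start_time="2d")

    # Polling run: only request Tweets posted after the newest ID already collected.
    polling_query = gen_request_parameters("(snow OR rain) -is:retweet",
                                           results_per_call=100,
                                           since_id="1270572563505254404")  # placeholder ID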
+ +When migrating this Python search client to v2 from the enterprise and premium tiers, the following updates were made: + +- Added support for GET requests (and removed POST support for now). +- Added support for ``since_id`` and ``until_id`` request parameters. +- Updated pagination details. +- Updated app command-line parlance: + - --start-datetime → --start-time + - --end-datetime → --end-time + - --filter-rule → --query + - --max-results → --max-tweets + - Dropped --account-type. No longer required since support for Premium and Enterprise search tiers has been dropped. + - Dropped --count-bucket. Removed search 'counts' endpoint support. This endpoint is currently not available in v2. + +In the spirit of updating the parlance used, note that a core method provided by searchtweets/result_stream.py has been renamed. The method `gen_rule_payload` has been updated to `gen_request_parameters`. + +**One key update is handling the changes in how the search endpoint returns its data.** The v2 search endpoint returns matching Tweets in a `data` array, along with an `includes` array that provides supporting objects that result from specifying `expansions`. +These expanded objects include Users, referenced Tweets, and attached media. In addition to the `data` and `includes` arrays, the search endpoint also provides a `meta` object that provides the max and min Tweet IDs included in the response, +along with a `next_token` if there is another 'page' of data to request. + +Currently, the v2 client returns the original API responses. Optionally, it can output a stream of Tweet objects with all expansions included in each tweet. Alternatively, it can output a stream of messages, yielding the individual Tweet objects, arrays of User, Tweet, and media objects from the `includes` array, followed by the `meta` object. This matches the behavior of the original search client, and was the default output format in versions 1.0.7 and earlier. + +Finally, the original version of search-tweets-python used a `Tweet Parser `__ to help manage the differences between two different JSON formats ("original" and "Activity Stream"). With v2, there is just one version of Tweet JSON, so this Tweet Parser is not used. +In the original code, this Tweet parser was invoked with a `tweetify=True` directive. With this v2 version, this use of the Tweet Parser is turned off by instead using `tweetify=False`. + + +Command-line options +==================== + +.. code:: + +usage: search_tweets.py + [-h] [--credential-file CREDENTIAL_FILE] [--credential-file-key CREDENTIAL_YAML_KEY] [--env-overwrite ENV_OVERWRITE] [--config-file CONFIG_FILENAME] [--query QUERY] + [--start-time START_TIME] [--end-time END_TIME] [--since-id SINCE_ID] [--until-id UNTIL_ID] [--results-per-call RESULTS_PER_CALL] [--expansions EXPANSIONS] + [--tweet-fields TWEET_FIELDS] [--user-fields USER_FIELDS] [--media-fields MEDIA_FIELDS] [--place-fields PLACE_FIELDS] [--poll-fields POLL_FIELDS] + [--output-format OUTPUT_FORMAT] [--max-tweets MAX_TWEETS] [--max-pages MAX_PAGES] [--results-per-file RESULTS_PER_FILE] [--filename-prefix FILENAME_PREFIX] + [--no-print-stream] [--print-stream] [--extra-headers EXTRA_HEADERS] [--debug] + +optional arguments: + -h, --help show this help message and exit + --credential-file CREDENTIAL_FILE + Location of the yaml file used to hold your credentials. + --credential-file-key CREDENTIAL_YAML_KEY + the key in the credential file used for this session's credentials.
Defaults to search_tweets_v2 + --env-overwrite ENV_OVERWRITE + Overwrite YAML-parsed credentials with any set environment variables. See API docs or readme for details. + --config-file CONFIG_FILENAME + configuration file with all parameters. Far, easier to use than the command-line args version., If a valid file is found, all args will be populated, from there. Remaining + command-line args, will overrule args found in the config, file. + --query QUERY Search query. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/search-queries) + --start-time START_TIME + Start of datetime window, format 'YYYY-mm-DDTHH:MM' (default: -7 days for /recent, -30 days for /all) + --end-time END_TIME End of datetime window, format 'YYYY-mm-DDTHH:MM' (default: to 30 seconds before request time) + --since-id SINCE_ID Tweet ID, will start search from Tweets after this one. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/pagination) + --until-id UNTIL_ID Tweet ID, will end search from Tweets before this one. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/pagination) + --results-per-call RESULTS_PER_CALL + Number of results to return per call (default 10; max 100) - corresponds to 'max_results' in the API + --expansions EXPANSIONS + A comma-delimited list of expansions. Specified expansions results in full objects in the 'includes' response object. + --tweet-fields TWEET_FIELDS + A comma-delimited list of Tweet JSON attributes to include in endpoint responses. (API default:"id,text") + --user-fields USER_FIELDS + A comma-delimited list of User JSON attributes to include in endpoint responses. (API default:"id") + --media-fields MEDIA_FIELDS + A comma-delimited list of media JSON attributes to include in endpoint responses. (API default:"id") + --place-fields PLACE_FIELDS + A comma-delimited list of Twitter Place JSON attributes to include in endpoint responses. (API default:"id") + --poll-fields POLL_FIELDS + A comma-delimited list of Twitter Poll JSON attributes to include in endpoint responses. (API default:"id") + --output-format OUTPUT_FORMAT + Set output format: 'r' Unmodified API [R]esponses. (default). 'a' [A]tomic Tweets: Tweet objects with expansions inline. 'm' [M]essage stream: Tweets, expansions, and + pagination metadata as a stream of messages. + --max-tweets MAX_TWEETS + Maximum number of Tweets to return for this session of requests. + --max-pages MAX_PAGES + Maximum number of pages/API calls to use for this session. + --results-per-file RESULTS_PER_FILE + Maximum tweets to save per file. + --filename-prefix FILENAME_PREFIX + prefix for the filename where tweet json data will be stored. + --no-print-stream disable print streaming + --print-stream Print tweet stream to stdout + --extra-headers EXTRA_HEADERS + JSON-formatted str representing a dict of additional HTTP request headers + --debug print all info and warning messages -For premium customers, the simplest credential file should look like -this: -.. code:: yaml - search_tweets_api: - account_type: premium - endpoint: - consumer_key: - consumer_secret: +Installation +============= -For enterprise customers, the simplest credential file should look like -this: +The updated Pypi install package for the v2 version is at: -.. 
code:: yaml +https://pypi.org/project/searchtweets-v2/ - search_tweets_api: - account_type: enterprise - endpoint: - username: - password: +Another option is to work directly with this code by cloning the repository, installing the required Python packages, setting up your credentials, and making requests. +For those not using the Pypi package and instead cloning the repository, a ``requirements.txt`` is provided. Dependencies can be installed with the ``pip install -r requirements.txt`` command. -By default, this library expects this file at -``"~/.twitter_keys.yaml"``, but you can pass the relevant location as -needed, either with the ``--credential-file`` flag for the command-line -app or as demonstrated below in a Python program. +To confirm your code is ready to go, run the ``$python3 scripts/search_tweets.py -h`` command. You should see the help details shown above. -Both above examples require no special command-line arguments or -in-program arguments. The credential parsing methods, unless otherwise -specified, will look for a YAML key called ``search_tweets_api``. -For developers who have multiple endpoints and/or search products, you -can keep all credentials in the same file and specify specific keys to -use. ``--credential-file-key`` specifies this behavior in the command -line app. An example: -.. code:: yaml +Credential Handling +=================== - search_tweets_30_day_dev: - account_type: premium - endpoint: - consumer_key: - consumer_secret: - (optional) bearer_token: +The Twitter API v2 search endpoints use app-only authentication. You have the choice to configure your application consumer key and secret, or a Bearer Token you have generated. If you supply the application key and secret, the client will generate a Bearer Token for you. +Many developers may find it more straightforward to provide their application key and secret and let this library manage Bearer Token generation for them. Please see `HERE `_ for an overview of the app-only authentication method. - search_tweets_30_day_prod: - account_type: premium - endpoint: - bearer_token: +We support both YAML-file based methods and environment variables for storing credentials, and provide flexible handling with sensible defaults. - search_tweets_fullarchive_dev: - account_type: premium - endpoint: - bearer_token: +YAML method +=========== - search_tweets_fullarchive_prod: - account_type: premium - endpoint: - bearer_token: +The simplest credential file should look like this: -Environment Variables ---------------------- +.. code:: yaml -If you want or need to pass credentials via environment variables, you -can set the appropriate variables for your product of the following: + search_tweets_v2: + endpoint: https://api.twitter.com/2/tweets/search/recent + consumer_key: + consumer_secret: + bearer_token: -:: +By default, this library expects this file at "~/.twitter_keys.yaml", but you can pass the relevant location as needed, either with the --credential-file flag for the command-line app or as demonstrated below in a Python program. - export SEARCHTWEETS_ENDPOINT= - export SEARCHTWEETS_USERNAME= - export SEARCHTWEETS_PASSWORD= - export SEARCHTWEETS_BEARER_TOKEN= - export SEARCHTWEETS_ACCOUNT_TYPE= - export SEARCHTWEETS_CONSUMER_KEY= - export SEARCHTWEETS_CONSUMER_SECRET= +Both above examples require no special command-line arguments or in-program arguments.
The credential parsing methods, unless otherwise specified, will look for a YAML key called search_tweets_v2. -The ``load_credentials`` function will attempt to find these variables -if it cannot load fields from the YAML file, and it will **overwrite any -credentials from the YAML file that are present as environment -variables** if they have been parsed. This behavior can be changed by -setting the ``load_credentials`` parameter ``env_overwrite`` to -``False``. +For developers who have multiple endpoints and/or search products, you can keep all credentials in the same file and specify specific keys to use. --credential-file-key specifies this behavior in the command line app. An example: -The following cells demonstrates credential handling in the Python -library. +.. code:: yaml -.. code:: python + search_tweets_v2: + endpoint: https://api.twitter.com/2/tweets/search/recent + consumer_key: + consumer_secret: + (optional) bearer_token: - from searchtweets import load_credentials + search_tweets_labsv2: + endpoint: https://api.twitter.com/labs/2/tweets/search + consumer_key: + consumer_secret: + (optional) bearer_token: -.. code:: python +Environment Variables +===================== - load_credentials(filename="./search_tweets_creds_example.yaml", - yaml_key="search_tweets_ent_example", - env_overwrite=False) +If you want or need to pass credentials via environment variables, you can set the appropriate variables: :: - {'username': '', - 'password': '', - 'endpoint': ''} + export SEARCHTWEETS_ENDPOINT= + export SEARCHTWEETS_BEARER_TOKEN= + export SEARCHTWEETS_CONSUMER_KEY= + export SEARCHTWEETS_CONSUMER_SECRET= + +The ``load_credentials`` function will attempt to find these variables if it cannot load fields from the YAML file, and it will **overwrite any credentials from the YAML file that are present as environment variables** if they have been parsed. This behavior can be changed by setting the ``load_credentials`` parameter ``env_overwrite`` to ``False``. + +The following cells demonstrates credential handling in the Python library. .. code:: python - load_credentials(filename="./search_tweets_creds_example.yaml", - yaml_key="search_tweets_premium_example", - env_overwrite=False) + from searchtweets import load_credentials + +.. code:: python + + load_credentials(filename="./search_tweets_creds_example.yaml", + yaml_key="search_tweets_v2_example", + env_overwrite=False) :: - {'bearer_token': '', - 'endpoint': 'https://api.twitter.com/1.1/tweets/search/30day/dev.json', - 'extra_headers_dict': None} + {'bearer_token': '', + 'endpoint': 'https://api.twitter.com/2/tweets/search/recent', + 'extra_headers_dict': None} Environment Variable Overrides ------------------------------ @@ -200,8 +246,7 @@ regardless of a YAML file's validity or existence. .. code:: python import os - os.environ["SEARCHTWEETS_USERNAME"] = "" - os.environ["SEARCHTWEETS_PASSWORD"] = "" + os.environ["SEARCHTWEETS_BEARER_TOKEN"] = "" os.environ["SEARCHTWEETS_ENDPOINT"] = "" load_credentials(filename="nothing_here.yaml", yaml_key="no_key_here") @@ -213,8 +258,7 @@ regardless of a YAML file's validity or existence. :: - {'username': '', - 'password': '', + {'bearer_token': '', 'endpoint': ''} Command-line app @@ -233,279 +277,159 @@ are used to control credential behavior from the command-line app. 
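The same key selection is available from a Python program by passing the alternate key to ``load_credentials``. This sketch assumes the ``search_tweets_labsv2`` entry from the example file above:

.. code:: python

    from searchtweets import load_credentials

    # Load the Labs v2 credential block instead of the default search_tweets_v2 key.
    labs_search_args = load_credentials(filename="~/.twitter_keys.yaml",
                                        yaml_key="search_tweets_labsv2",
                                        env_overwrite=False)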
Using the Comand Line Application ================================= -The library includes an application, ``search_tweets.py``, that provides -rapid access to Tweets. When you use ``pip`` to install this package, -``search_tweets.py`` is installed globally. The file is located in the -``tools/`` directory for those who want to run it locally. +The library includes an application, ``search_tweets.py``, that provides rapid access to Tweets. When you use ``pip`` to install this package, ``search_tweets.py`` is installed globally. The file is located in the ``scripts/`` directory for those who want to run it locally. -Note that the ``--results-per-call`` flag specifies an argument to the -API ( ``maxResults``, results returned per CALL), not as a hard max to -number of results returned from this program. The argument -``--max-results`` defines the maximum number of results to return from a -given call. All examples assume that your credentials are set up -correctly in the default location - ``.twitter_keys.yaml`` or in -environment variables. +Note that the ``--results-per-call`` flag specifies an argument to the API, not a hard maximum for the number of results returned from this program. The argument ``--max-tweets`` defines the maximum number of results to return from a single run of the ``search_tweets.py`` script. All examples assume that your credentials are set up correctly in the default location - ``.twitter_keys.yaml`` or in environment variables. **Stream json results to stdout without saving** .. code:: bash - search_tweets.py \ - --max-results 1000 \ - --results-per-call 100 \ - --filter-rule "beyonce has:hashtags" \ - --print-stream + search_tweets.py \ + --max-tweets 10000 \ + --results-per-call 100 \ + --query "(snow OR rain) has:media -is:retweet" \ + --print-stream **Stream json results to stdout and save to a file** .. code:: bash - search_tweets.py \ - --max-results 1000 \ - --results-per-call 100 \ - --filter-rule "beyonce has:hashtags" \ - --filename-prefix beyonce_geo \ - --print-stream + search_tweets.py \ + --max-tweets 10000 \ + --results-per-call 100 \ + --query "(snow OR rain) has:media -is:retweet" \ + --filename-prefix weather_pics \ + --print-stream **Save to file without output** .. code:: bash - search_tweets.py \ - --max-results 100 \ - --results-per-call 100 \ - --filter-rule "beyonce has:hashtags" \ - --filename-prefix beyonce_geo \ - --no-print-stream + search_tweets.py \ + --max-tweets 10000 \ + --results-per-call 100 \ + --query "(snow OR rain) has:media -is:retweet" \ + --filename-prefix weather_pics \ + --no-print-stream -One or more custom headers can be specified from the command line, using -the ``--extra-headers`` argument and a JSON-formatted string -representing a dictionary of extra headers: +One or more custom headers can be specified from the command line, using the ``--extra-headers`` argument and a JSON-formatted string representing a dictionary of extra headers: .. code:: bash - search_tweets.py \ - --filter-rule "beyonce has:hashtags" \ - --extra-headers '{"":""}' + search_tweets.py \ + --query "(snow OR rain) has:media -is:retweet" \ + --extra-headers '{"":""}' -Options can be passed via a configuration file (either ini or YAML).
Example files can be found in the ``config/api_config_example.config`` or ``config/api_yaml_example.yaml`` files, which might look like this: .. code:: bash - [search_rules] - from_date = 2017-06-01 - to_date = 2017-09-01 - pt_rule = beyonce has:geo + [search_rules] + start_time = 2020-05-01 + end_time = 2020-05-01 + query = (snow OR rain) has:media -is:retweet - [search_params] - results_per_call = 500 - max_results = 500 + [search_params] + results_per_call = 100 + max_tweets = 10000 - [output_params] - save_file = True - filename_prefix = beyonce - results_per_file = 10000000 + [output_params] + save_file = True + filename_prefix = weather_pics + results_per_file = 10000000 Or this: -.. code:: yaml +.. code:: bash - search_rules: - from-date: 2017-06-01 - to-date: 2017-09-01 01:01 - pt-rule: kanye + search_rules: + start_time: 2020-05-01 + end_time: 2020-05-01 01:01 + query: (snow OR rain) has:media -is:retweet - search_params: - results-per-call: 500 - max-results: 500 + search_params: + results_per_call: 100 + max_results: 500 - output_params: - save_file: True - filename_prefix: kanye - results_per_file: 10000000 + output_params: + save_file: True + filename_prefix: (snow OR rain) has:media -is:retweet + results_per_file: 10000000 -Custom headers can be specified in a config file, under a specific -credentials key: +Custom headers can be specified in a config file, under a specific credentials key: .. code:: yaml - search_tweets_api: - account_type: premium - endpoint: - username: - password: - extra_headers: - : + search_tweets_v2: + endpoint: + bearer_token: + extra_headers: + : -When using a config file in conjunction with the command-line utility, -you need to specify your config file via the ``--config-file`` -parameter. Additional command-line arguments will either be *added* to -the config file args or **overwrite** the config file args if both are -specified and present. +When using a config file in conjunction with the command-line utility, you need to specify your config file via the ``--config-file`` parameter. Additional command-line arguments will either be added to the config file args or overwrite the config file args if both are specified and present. Example: :: - search_tweets.py \ - --config-file myapiconfig.config \ - --no-print-stream + search_tweets.py \ + --config-file myapiconfig.config \ + --no-print-stream --------------- - -Full options are listed below: - -:: - - $ search_tweets.py -h - usage: search_tweets.py [-h] [--credential-file CREDENTIAL_FILE] - [--credential-file-key CREDENTIAL_YAML_KEY] - [--env-overwrite ENV_OVERWRITE] - [--config-file CONFIG_FILENAME] - [--account-type {premium,enterprise}] - [--count-bucket COUNT_BUCKET] - [--start-datetime FROM_DATE] [--end-datetime TO_DATE] - [--filter-rule PT_RULE] - [--results-per-call RESULTS_PER_CALL] - [--max-results MAX_RESULTS] [--max-pages MAX_PAGES] - [--results-per-file RESULTS_PER_FILE] - [--filename-prefix FILENAME_PREFIX] - [--no-print-stream] [--print-stream] - [--extra-headers EXTRA_HEADERS] [--debug] - - optional arguments: - -h, --help show this help message and exit - --credential-file CREDENTIAL_FILE - Location of the yaml file used to hold your - credentials. - --credential-file-key CREDENTIAL_YAML_KEY - the key in the credential file used for this session's - credentials. Defaults to search_tweets_api - --env-overwrite ENV_OVERWRITE - Overwrite YAML-parsed credentials with any set - environment variables. See API docs or readme for - details. 
- --config-file CONFIG_FILENAME - configuration file with all parameters. Far, easier to - use than the command-line args version., If a valid - file is found, all args will be populated, from there. - Remaining command-line args, will overrule args found - in the config, file. - --account-type {premium,enterprise} - The account type you are using - --count-bucket COUNT_BUCKET - Bucket size for counts API. Options:, day, hour, - minute (default is 'day'). - --start-datetime FROM_DATE - Start of datetime window, format 'YYYY-mm-DDTHH:MM' - (default: -30 days) - --end-datetime TO_DATE - End of datetime window, format 'YYYY-mm-DDTHH:MM' - (default: most recent date) - --filter-rule PT_RULE - PowerTrack filter rule (See: http://support.gnip.com/c - ustomer/portal/articles/901152-powertrack-operators) - --results-per-call RESULTS_PER_CALL - Number of results to return per call (default 100; max - 500) - corresponds to 'maxResults' in the API - --max-results MAX_RESULTS - Maximum number of Tweets or Counts to return for this - session (defaults to 500) - --max-pages MAX_PAGES - Maximum number of pages/API calls to use for this - session. - --results-per-file RESULTS_PER_FILE - Maximum tweets to save per file. - --filename-prefix FILENAME_PREFIX - prefix for the filename where tweet json data will be - stored. - --no-print-stream disable print streaming - --print-stream Print tweet stream to stdout - --extra-headers EXTRA_HEADERS - JSON-formatted str representing a dict of additional - request headers - --debug print all info and warning messages - --------------- +------------------ Using the Twitter Search APIs' Python Wrapper ============================================= -Working with the API within a Python program is straightforward both for -Premium and Enterprise clients. +Working with the API within a Python program is straightforward. We'll assume that credentials are in the default location, ``~/.twitter_keys.yaml``. .. code:: python - from searchtweets import ResultStream, gen_rule_payload, load_credentials + from searchtweets import ResultStream, gen_request_parameters, load_credentials -Enterprise setup ----------------- - -.. code:: python - - enterprise_search_args = load_credentials("~/.twitter_keys.yaml", - yaml_key="search_tweets_enterprise", - env_overwrite=False) -Premium Setup -------------- +Twitter API v2 Setup +-------------------- .. code:: python - premium_search_args = load_credentials("~/.twitter_keys.yaml", - yaml_key="search_tweets_premium", + search_args = load_credentials("~/.twitter_keys.yaml", + yaml_key="search_tweets_v2", env_overwrite=False) + -There is a function that formats search API rules into valid json -queries called ``gen_rule_payload``. It has sensible defaults, such as -pulling more Tweets per call than the default 100 (but note that a -sandbox environment can only have a max of 100 here, so if you get -errors, please check this) not including dates, and defaulting to hourly -counts when using the counts api. Discussing the finer points of -generating search rules is out of scope for these examples; I encourage -you to see the docs to learn the nuances within, but for now let's see -what a rule looks like. +There is a function that formats search API rules into valid json queries called ``gen_request_parameters``. It has sensible defaults, such as pulling more Tweets per call than the default 10, and not including dates. 
Discussing the finer points of +generating search rules is out of scope for these examples; we encourage you to see the docs to learn the nuances within, but for now let's see what a query looks like. .. code:: python - rule = gen_rule_payload("beyonce", results_per_call=100) # testing with a sandbox account - print(rule) + query = gen_request_parameters("snow", results_per_call=100) + print(query) :: - {"query":"beyonce","maxResults":100} + {"query":"snow","max_results":100} -This rule will match tweets that have the text ``beyonce`` in them. +This rule will match tweets that have the text ``snow`` in them. -From this point, there are two ways to interact with the API. There is a -quick method to collect smaller amounts of Tweets to memory that -requires less thought and knowledge, and interaction with the -``ResultStream`` object which will be introduced later. +From this point, there are two ways to interact with the API. There is a quick method to collect smaller amounts of Tweets to memory that requires less thought and knowledge, and interaction with the ``ResultStream`` object which will be introduced later. Fast Way -------- -We'll use the ``search_args`` variable to power the configuration point -for the API. The object also takes a valid PowerTrack rule and has -options to cutoff search when hitting limits on both number of Tweets -and API calls. +We'll use the ``search_args`` variable to power the configuration point for the API. The object also takes a valid search query and has options to cutoff search when hitting limits on both number of Tweets and endpoint calls. -We'll be using the ``collect_results`` function, which has three -parameters. +We'll be using the ``collect_results`` function, which has three parameters. -- rule: a valid PowerTrack rule, referenced earlier +- query: a valid search query, referenced earlier - max_results: as the API handles pagination, it will stop collecting when we get to this number - result_stream_args: configuration args that we've already specified. -For the remaining examples, please change the args to either premium or -enterprise depending on your usage. - Let's see how it goes: .. code:: python @@ -514,324 +438,90 @@ Let's see how it goes: .. code:: python - tweets = collect_results(rule, - max_results=100, - result_stream_args=enterprise_search_args) # change this if you need to + tweets = collect_results(query, + max_tweets=100, + result_stream_args=search_args) # change this if you need to -By default, Tweet payloads are lazily parsed into a ``Tweet`` -`object `__. An overwhelming -number of Tweet attributes are made available directly, as such: +An overwhelming number of Tweet attributes are made available directly, as such: .. code:: python - [print(tweet.all_text, end='\n\n') for tweet in tweets[0:10]]; + [print(tweet.text, end='\n\n') for tweet in tweets[0:10]] :: - Jay-Z & Beyoncé sat across from us at dinner tonight and, at one point, I made eye contact with Beyoncé. My limbs turned to jello and I can no longer form a coherent sentence. I have seen the eyes of the lord. - - Beyoncé and it isn't close. https://t.co/UdOU9oUtuW - - As you could guess.. Signs by Beyoncé will always be my shit. + @CleoLoughlin Rain after the snow? Do you have ice now? 
- When Beyoncé adopts a dog 🙌🏾 https://t.co/U571HyLG4F + @koofltxr Rain, 134340, still with you, winter bear, Seoul, crystal snow, sea, outro:blueside - Hold up, you can't just do that to Beyoncé - https://t.co/3p14DocGqA + @TheWxMeister Sorry it ruined your camping. I was covering plants in case we got snow in the Mountain Shadows area. Thankfully we didn\u2019t. At least it didn\u2019t stick to the ground. The wind was crazy! Got just over an inch of rain. Looking forward to better weather. - Why y'all keep using Rihanna and Beyoncé gifs to promote the show when y'all let Bey lose the same award she deserved 3 times and let Rihanna leave with nothing but the clothes on her back? https://t.co/w38QpH0wma + @brettlorenzen And, the reliability of \u201cNeither snow nor rain nor heat nor gloom of night stays these couriers (the #USPS) from the swift completion of their appointed rounds.\u201d + + Because black people get killed in the rain, black lives matter in the rain. It matters all the time. Snow, rain, sleet, sunny days. We're not out here because it's sunny. We're not out here for fun. We're out here because black lives matter. + + Some of the master copies of the film \u201cGone With the Wind\u201d are archived at the @librarycongress near \u201cSnow White and the Seven Dwarfs\u201d and \u201cSingin\u2019 in the Rain.\u201d GWTW isn\u2019t going to vanish off the face of the earth. + + Snow Man\u306eD.D.\u3068\nSixTONES\u306eImitation Rain\n\u6d41\u308c\u305f\u301c + + @Nonvieta Yup I work in the sanitation industry. I'm in the office however. Life would not go on without our garbage men and women out there. All day everyday rain snow or shine they out there. + + This picture of a rainbow in WA proves nothing. How do we know if this rainbow was not on Mars or the ISS? Maybe it was drawn in on the picture. WA has mail-in voting so we do have to worry aboug rain, snow, poll workers not showing up or voting machines broke on election day !! https://t.co/5WdHx0acS0 https://t.co/BEKtTpBW9g + + Weather in Oslo at 06:00: Clear Temp: 10.6\u00b0C Min today: 9.1\u00b0C Rain today:0.0mm Snow now: 0.0cm Wind N Conditions: Clear Daylight:18:39 hours Sunset: 22:36 - 30) anybody tell you that you look like Beyoncé https://t.co/Vo4Z7bfSCi - - Mi Beyoncé favorita https://t.co/f9Jp600l2B - Beyoncé necesita ver esto. Que diosa @TiniStoessel 🔥🔥🔥 https://t.co/gadVJbehQZ - - Joanne Pearce Is now playing IF I WAS A BOY - BEYONCE.mp3 by ! - - I'm trynna see beyoncé's finsta before I die - -.. code:: python - - [print(tweet.created_at_datetime) for tweet in tweets[0:10]]; - -:: - - 2018-01-17 00:08:50 - 2018-01-17 00:08:49 - 2018-01-17 00:08:44 - 2018-01-17 00:08:42 - 2018-01-17 00:08:42 - 2018-01-17 00:08:42 - 2018-01-17 00:08:40 - 2018-01-17 00:08:38 - 2018-01-17 00:08:37 - 2018-01-17 00:08:37 - -.. code:: python - - [print(tweet.generator.get("name")) for tweet in tweets[0:10]]; - -:: - - Twitter for iPhone - Twitter for iPhone - Twitter for iPhone - Twitter for iPhone - Twitter for iPhone - Twitter for iPhone - Twitter for Android - Twitter for iPhone - Airtime Pro - Twitter for iPhone - -Voila, we have some Tweets. 
For interactive environments and other cases -where you don't care about collecting your data in a single load or -don't need to operate on the stream of Tweets or counts directly, I -recommend using this convenience function. +Voila, we have some Tweets. For interactive environments and other cases where you don't care about collecting your data in a single load or don't need to operate on the stream of Tweets directly, I recommend using this convenience function. Working with the ResultStream ----------------------------- -The ResultStream object will be powered by the ``search_args``, and -takes the rules and other configuration parameters, including a hard -stop on number of pages to limit your API call usage. +The ResultStream object will be powered by the ``search_args``, and takes the query and other configuration parameters, including a hard stop on number of pages to limit your API call usage. .. code:: python - rs = ResultStream(rule_payload=rule, + rs = ResultStream(request_parameters=query, max_results=500, max_pages=1, - **premium_search_args) + **search_args) print(rs) - -:: - - ResultStream: + + :: + + ResultStream: { - "username":null, - "endpoint":"https:\/\/fanyv88.com:443\/https\/api.twitter.com\/1.1\/tweets\/search\/30day\/dev.json", - "rule_payload":{ - "query":"beyonce", - "maxResults":100 + "endpoint":"https:\/\/fanyv88.com:443\/https\/api.twitter.com\/2\/tweets\/search\/recent", + "request_parameters":{ + "query":"snow", + "max_results":100 }, - "tweetify":true, - "max_results":500 + "tweetify":false, + "max_results":1000 } - -There is a function, ``.stream``, that seamlessly handles requests and -pagination for a given query. It returns a generator, and to grab our -500 Tweets that mention ``beyonce`` we can do this: + +There is a function, ``.stream``, that seamlessly handles requests and pagination for a given query. It returns a generator, and to grab our 1000 Tweets that mention ``snow`` we can do this: .. code:: python tweets = list(rs.stream()) -Tweets are lazily parsed using our `Tweet -Parser `__, so tweet data is -very easily extractable. - .. code:: python # using unidecode to prevent emoji/accents printing - [print(tweet.all_text) for tweet in tweets[0:10]]; - -:: - - gente socorro kkkkkkkkkk BEYONCE https://t.co/kJ9zubvKuf - Jay-Z & Beyoncé sat across from us at dinner tonight and, at one point, I made eye contact with Beyoncé. My limbs turned to jello and I can no longer form a coherent sentence. I have seen the eyes of the lord. - Beyoncé and it isn't close. https://t.co/UdOU9oUtuW - As you could guess.. Signs by Beyoncé will always be my shit. - When Beyoncé adopts a dog 🙌🏾 https://t.co/U571HyLG4F - Hold up, you can't just do that to Beyoncé - https://t.co/3p14DocGqA - Why y'all keep using Rihanna and Beyoncé gifs to promote the show when y'all let Bey lose the same award she deserved 3 times and let Rihanna leave with nothing but the clothes on her back? https://t.co/w38QpH0wma - 30) anybody tell you that you look like Beyoncé https://t.co/Vo4Z7bfSCi - Mi Beyoncé favorita https://t.co/f9Jp600l2B - Beyoncé necesita ver esto. Que diosa @TiniStoessel 🔥🔥🔥 https://t.co/gadVJbehQZ - Joanne Pearce Is now playing IF I WAS A BOY - BEYONCE.mp3 by ! - -Counts Endpoint ---------------- - -We can also use the Search API Counts endpoint to get counts of Tweets -that match our rule. 
Each request will return up to *30* results, and -each count request can be done on a minutely, hourly, or daily basis. -The underlying ``ResultStream`` object will handle converting your -endpoint to the count endpoint, and you have to specify the -``count_bucket`` argument when making a rule to use it. - -The process is very similar to grabbing Tweets, but has some minor -differences. - -*Caveat - premium sandbox environments do NOT have access to the Search -API counts endpoint.* - -.. code:: python - - count_rule = gen_rule_payload("beyonce", count_bucket="day") - - counts = collect_results(count_rule, result_stream_args=enterprise_search_args) - -Our results are pretty straightforward and can be rapidly used. - -.. code:: python - - counts - -:: - - [{'count': 366, 'timePeriod': '201801170000'}, - {'count': 44580, 'timePeriod': '201801160000'}, - {'count': 61932, 'timePeriod': '201801150000'}, - {'count': 59678, 'timePeriod': '201801140000'}, - {'count': 44014, 'timePeriod': '201801130000'}, - {'count': 46607, 'timePeriod': '201801120000'}, - {'count': 41523, 'timePeriod': '201801110000'}, - {'count': 47056, 'timePeriod': '201801100000'}, - {'count': 65506, 'timePeriod': '201801090000'}, - {'count': 95251, 'timePeriod': '201801080000'}, - {'count': 162883, 'timePeriod': '201801070000'}, - {'count': 106344, 'timePeriod': '201801060000'}, - {'count': 93542, 'timePeriod': '201801050000'}, - {'count': 110415, 'timePeriod': '201801040000'}, - {'count': 127523, 'timePeriod': '201801030000'}, - {'count': 131952, 'timePeriod': '201801020000'}, - {'count': 176157, 'timePeriod': '201801010000'}, - {'count': 57229, 'timePeriod': '201712310000'}, - {'count': 72277, 'timePeriod': '201712300000'}, - {'count': 72051, 'timePeriod': '201712290000'}, - {'count': 76371, 'timePeriod': '201712280000'}, - {'count': 61578, 'timePeriod': '201712270000'}, - {'count': 55118, 'timePeriod': '201712260000'}, - {'count': 59115, 'timePeriod': '201712250000'}, - {'count': 106219, 'timePeriod': '201712240000'}, - {'count': 114732, 'timePeriod': '201712230000'}, - {'count': 73327, 'timePeriod': '201712220000'}, - {'count': 89171, 'timePeriod': '201712210000'}, - {'count': 192381, 'timePeriod': '201712200000'}, - {'count': 85554, 'timePeriod': '201712190000'}, - {'count': 57829, 'timePeriod': '201712180000'}] - -Dated searches / Full Archive Search ------------------------------------- - -**Note that this will only work with the full archive search option**, -which is available to my account only via the enterprise options. Full -archive search will likely require a different endpoint or access -method; please see your developer console for details. - -Let's make a new rule and pass it dates this time. - -``gen_rule_payload`` takes timestamps of the following forms: - -- ``YYYYmmDDHHMM`` -- ``YYYY-mm-DD`` (which will convert to midnight UTC (00:00) -- ``YYYY-mm-DD HH:MM`` -- ``YYYY-mm-DDTHH:MM`` - -Note - all Tweets are stored in UTC time. - -.. code:: python - - rule = gen_rule_payload("from:jack", - from_date="2017-09-01", #UTC 2017-09-01 00:00 - to_date="2017-10-30",#UTC 2017-10-30 00:00 - results_per_call=500) - print(rule) - -:: - - {"query":"from:jack","maxResults":500,"toDate":"201710300000","fromDate":"201709010000"} - -.. code:: python - - tweets = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args) - -.. code:: python - - [print(tweet.all_text) for tweet in tweets[0:10]]; - -:: - - More clarity on our private information policy and enforcement. 
Working to build as much direct context into the product too https://t.co/IrwBexPrBA - To provide more clarity on our private information policy, we’ve added specific examples of what is/is not a violation and insight into what we need to remove this type of content from the service. https://t.co/NGx5hh2tTQ - Launching violent groups and hateful images/symbols policy on November 22nd https://t.co/NaWuBPxyO5 - We will now launch our policies on violent groups and hateful imagery and hate symbols on Nov 22. During the development process, we received valuable feedback that we’re implementing before these are published and enforced. See more on our policy development process here 👇 https://t.co/wx3EeH39BI - @WillStick @lizkelley Happy birthday Liz! - Off-boarding advertising from all accounts owned by Russia Today (RT) and Sputnik. - - We’re donating all projected earnings ($1.9mm) to support external research into the use of Twitter in elections, including use of malicious automation and misinformation. https://t.co/zIxfqqXCZr - @TMFJMo @anthonynoto Thank you - @gasca @stratechery @Lefsetz letter - @gasca @stratechery Bridgewater’s Daily Observations - Yup!!!! ❤️❤️❤️❤️ #davechappelle https://t.co/ybSGNrQpYF - @ndimichino Sometimes - Setting up at @CampFlogGnaw https://t.co/nVq8QjkKsf - -.. code:: python - - rule = gen_rule_payload("from:jack", - from_date="2017-09-20", - to_date="2017-10-30", - count_bucket="day", - results_per_call=500) - print(rule) - -:: - - {"query":"from:jack","toDate":"201710300000","fromDate":"201709200000","bucket":"day"} - -.. code:: python - - counts = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args) - -.. 
code:: python - - [print(c) for c in counts]; + [print(tweet) for tweet in tweets[0:10]] :: - {'timePeriod': '201710290000', 'count': 0} - {'timePeriod': '201710280000', 'count': 0} - {'timePeriod': '201710270000', 'count': 3} - {'timePeriod': '201710260000', 'count': 6} - {'timePeriod': '201710250000', 'count': 4} - {'timePeriod': '201710240000', 'count': 4} - {'timePeriod': '201710230000', 'count': 0} - {'timePeriod': '201710220000', 'count': 0} - {'timePeriod': '201710210000', 'count': 3} - {'timePeriod': '201710200000', 'count': 2} - {'timePeriod': '201710190000', 'count': 1} - {'timePeriod': '201710180000', 'count': 6} - {'timePeriod': '201710170000', 'count': 2} - {'timePeriod': '201710160000', 'count': 2} - {'timePeriod': '201710150000', 'count': 1} - {'timePeriod': '201710140000', 'count': 64} - {'timePeriod': '201710130000', 'count': 3} - {'timePeriod': '201710120000', 'count': 4} - {'timePeriod': '201710110000', 'count': 8} - {'timePeriod': '201710100000', 'count': 4} - {'timePeriod': '201710090000', 'count': 1} - {'timePeriod': '201710080000', 'count': 0} - {'timePeriod': '201710070000', 'count': 0} - {'timePeriod': '201710060000', 'count': 1} - {'timePeriod': '201710050000', 'count': 3} - {'timePeriod': '201710040000', 'count': 5} - {'timePeriod': '201710030000', 'count': 8} - {'timePeriod': '201710020000', 'count': 5} - {'timePeriod': '201710010000', 'count': 0} - {'timePeriod': '201709300000', 'count': 0} - {'timePeriod': '201709290000', 'count': 0} - {'timePeriod': '201709280000', 'count': 9} - {'timePeriod': '201709270000', 'count': 41} - {'timePeriod': '201709260000', 'count': 13} - {'timePeriod': '201709250000', 'count': 6} - {'timePeriod': '201709240000', 'count': 7} - {'timePeriod': '201709230000', 'count': 3} - {'timePeriod': '201709220000', 'count': 0} - {'timePeriod': '201709210000', 'count': 1} - {'timePeriod': '201709200000', 'count': 7} +{"id": "1270572563505254404", "text": "@CleoLoughlin Rain after the snow? Do you have ice now?"} +{"id": "1270570767038599168", "text": "@koofltxr Rain, 134340, still with you, winter bear, Seoul, crystal snow, sea, outro:blueside"} +{"id": "1270570621282340864", "text": "@TheWxMeister Sorry it ruined your camping. I was covering plants in case we got snow in the Mountain Shadows area. Thankfully we didn\u2019t. At least it didn\u2019t stick to the ground. The wind was crazy! Got just over an inch of rain. Looking forward to better weather."} +{"id": "1270569070287630337", "text": "@brettlorenzen And, the reliability of \u201cNeither snow nor rain nor heat nor gloom of night stays these couriers (the #USPS) from the swift completion of their appointed rounds.\u201d"} +{"id": "1270568690447257601", "text": "\"Because black people get killed in the rain, black lives matter in the rain. It matters all the time. Snow, rain, sleet, sunny days. We're not out here because it's sunny. We're not out here for fun. We're out here because black lives matter.\" @wisn12news https://t.co/3kZZ7q2MR9"} +{"id": "1270568607605575680", "text": "Some of the master copies of the film \u201cGone With the Wind\u201d are archived at the @librarycongress near \u201cSnow White and the Seven Dwarfs\u201d and \u201cSingin\u2019 in the Rain.\u201d GWTW isn\u2019t going to vanish off the face of the earth."} +{"id": "1270568437916426240", "text": "Snow Man\u306eD.D.\u3068\nSixTONES\u306eImitation Rain\n\u6d41\u308c\u305f\u301c"} +{"id": "1270568195519373313", "text": "@Nonvieta Yup I work in the sanitation industry. 
I'm in the office however. Life would not go on without our garbage men and women out there. All day everyday rain snow or shine they out there."} {"id": "1270567737283117058", "text": "This picture of a rainbow in WA proves nothing. How do we know if this rainbow was not on Mars or the ISS? Maybe it was drawn in on the picture. WA has mail-in voting so we do have to worry aboug rain, snow, poll workers not showing up or voting machines broke on election day !! https://t.co/5WdHx0acS0 https://t.co/BEKtTpBW9g"} {"id": "1270566386524356608", "text": "Weather in Oslo at 06:00: Clear Temp: 10.6\u00b0C Min today: 9.1\u00b0C Rain today:0.0mm Snow now: 0.0cm Wind N Conditions: Clear Daylight:18:39 hours Sunset: 22:36"} Contributing ============ @@ -860,6 +550,12 @@ commands, ran from the root directory in the repo: python setup.py sdist twine upload dist/* +If you receive an error during the ``twine upload`` step, it may be due to the README.rst +having something invalid in its RST format. Using an RST linter will help fix that. + +Also, as Pypi updates are made, you may want to clear out previous build artifacts of the package. +This can be done with this command: ``rm -rf build dist *.egg-info`` + How to build the documentation: Building the documentation requires a few Sphinx packages to build the diff --git a/config/api_yaml_example.yaml b/config/api_yaml_example.yaml new file mode 100644 index 0000000..ed02a61 --- /dev/null +++ b/config/api_yaml_example.yaml @@ -0,0 +1,16 @@ +#search_rules: +# start-time: 2020-01-06 +# end-time: 2020-01-10 +# query: snow colorado -is:retweet has:media + +search_params: + results-per-call: 100 + max-tweets: 10000 + tweet-fields: id,created_at,author_id,text,public_metrics,attachments,entities + user-fields: description,location,public_metrics + expansions: author_id,referenced_tweets.id,attachments.media_keys + +output_params: + save_file: False + filename_prefix: snow_tweets + results_per_file: 100000 diff --git a/examples/api_example.ipynb b/examples/api_example.ipynb index a75aa62..cd5b31c 100644 --- a/examples/api_example.ipynb +++ b/examples/api_example.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Working with the API within a Python program is straightforward both for Premium and Enterprise clients.\n", + "Working with the API within a Python program is straightforward for the v2 client.\n", "\n", "We'll assume that credentials are in the default location, `~/.twitter_keys.yaml`."
] @@ -17,14 +17,14 @@ }, "outputs": [], "source": [ - "from searchtweets import ResultStream, gen_rule_payload, load_credentials" + "from searchtweets import ResultStream, gen_request_parameters, load_credentials" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Enterprise setup" + "## v2 setup" ] }, { @@ -35,8 +35,8 @@ }, "outputs": [], "source": [ - "enterprise_search_args = load_credentials(\"~/.twitter_keys.yaml\",\n", - " yaml_key=\"search_tweets_enterprise\",\n", + "v2_search_args = load_credentials(\"~/.twitter_keys.yaml\",\n", + " yaml_key=\"search_tweets_v2\",\n", " env_overwrite=False)" ] }, @@ -44,27 +44,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Premium Setup\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "premium_search_args = load_credentials(\"~/.twitter_keys.yaml\",\n", - " yaml_key=\"search_tweets_premium\",\n", - " env_overwrite=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There is a function that formats search API rules into valid json queries called `gen_rule_payload`. It has sensible defaults, such as pulling more Tweets per call than the default 100 (but note that a sandbox environment can only have a max of 100 here, so if you get errors, please check this) not including dates, and defaulting to hourly counts when using the counts api. Discussing the finer points of generating search rules is out of scope for these examples; I encourage you to see the docs to learn the nuances within, but for now let's see what a rule looks like." + "There is a function that formats search API rules into valid json queries called `gen_request_parameters`. It has sensible defaults, such as pulling more Tweets per call than the default 10 and not including dates. Discussing the finer points of generating search rules is out of scope for these examples; I encourage you to see the docs to learn the nuances within, but for now let's see what a rule looks like." ] }, { @@ -81,7 +61,7 @@ } ], "source": [ - "rule = gen_rule_payload(\"beyonce\", results_per_call=100) # testing with a sandbox account\n", + "query = gen_request_parameters(\"beyonce\", results_per_call=100) # testing with a sandbox account\n", "print(rule)" ] }, @@ -89,7 +69,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This rule will match tweets that have the text `beyonce` in them." + "This query will match tweets that have the text `beyonce` in them." ] }, { @@ -101,11 +81,11 @@ "\n", "## Fast Way\n", "\n", - "We'll use the `search_args` variable to power the configuration point for the API. The object also takes a valid PowerTrack rule and has options to cutoff search when hitting limits on both number of Tweets and API calls.\n", + "We'll use the `search_args` variable to power the configuration point for the API. 
The object also takes a valid query and has options to cutoff search when hitting limits on both number of Tweets and API calls.\n", "\n", "We'll be using the `collect_results` function, which has three parameters.\n", "\n", - "- rule: a valid PowerTrack rule, referenced earlier\n", + "- query: a valid search query, referenced earlier\n", "- max_results: as the API handles pagination, it will stop collecting when we get to this number\n", "- result_stream_args: configuration args that we've already specified.\n", "\n", @@ -135,9 +115,9 @@ }, "outputs": [], "source": [ - "tweets = collect_results(rule,\n", + "tweets = collect_results(query,\n", " max_results=100,\n", - " result_stream_args=enterprise_search_args) # change this if you need to" + " result_stream_args=v2_search_args) # change this if you need to" ] }, { @@ -261,13 +241,13 @@ "ResultStream: \n", "\t{\n", " \"username\":null,\n", - " \"endpoint\":\"https:\\/\\/api.twitter.com\\/1.1\\/tweets\\/search\\/30day\\/dev.json\",\n", + " \"endpoint\":\"https:\\/\\/api.twitter.com\\/2\\/tweets\\/search\\/recent\",\n", " \"rule_payload\":{\n", " \"query\":\"beyonce\",\n", " \"maxResults\":100\n", " },\n", - " \"tweetify\":true,\n", - " \"max_results\":500\n", + " \"tweetify\":false,\n", + " \"max_results\":100\n", "}\n" ] } @@ -335,90 +315,6 @@ "[print(tweet.all_text) for tweet in tweets[0:10]];" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Counts Endpoint\n", - "\n", - "We can also use the Search API Counts endpoint to get counts of Tweets that match our rule. Each request will return up to *30* results, and each count request can be done on a minutely, hourly, or daily basis. The underlying `ResultStream` object will handle converting your endpoint to the count endpoint, and you have to specify the `count_bucket` argument when making a rule to use it.\n", - "\n", - "The process is very similar to grabbing Tweets, but has some minor differences.\n", - "\n", - "\n", - "_Caveat - premium sandbox environments do NOT have access to the Search API counts endpoint._" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "count_rule = gen_rule_payload(\"beyonce\", count_bucket=\"day\")\n", - "\n", - "counts = collect_results(count_rule, result_stream_args=enterprise_search_args)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our results are pretty straightforward and can be rapidly used." 
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'count': 366, 'timePeriod': '201801170000'},\n", - " {'count': 44580, 'timePeriod': '201801160000'},\n", - " {'count': 61932, 'timePeriod': '201801150000'},\n", - " {'count': 59678, 'timePeriod': '201801140000'},\n", - " {'count': 44014, 'timePeriod': '201801130000'},\n", - " {'count': 46607, 'timePeriod': '201801120000'},\n", - " {'count': 41523, 'timePeriod': '201801110000'},\n", - " {'count': 47056, 'timePeriod': '201801100000'},\n", - " {'count': 65506, 'timePeriod': '201801090000'},\n", - " {'count': 95251, 'timePeriod': '201801080000'},\n", - " {'count': 162883, 'timePeriod': '201801070000'},\n", - " {'count': 106344, 'timePeriod': '201801060000'},\n", - " {'count': 93542, 'timePeriod': '201801050000'},\n", - " {'count': 110415, 'timePeriod': '201801040000'},\n", - " {'count': 127523, 'timePeriod': '201801030000'},\n", - " {'count': 131952, 'timePeriod': '201801020000'},\n", - " {'count': 176157, 'timePeriod': '201801010000'},\n", - " {'count': 57229, 'timePeriod': '201712310000'},\n", - " {'count': 72277, 'timePeriod': '201712300000'},\n", - " {'count': 72051, 'timePeriod': '201712290000'},\n", - " {'count': 76371, 'timePeriod': '201712280000'},\n", - " {'count': 61578, 'timePeriod': '201712270000'},\n", - " {'count': 55118, 'timePeriod': '201712260000'},\n", - " {'count': 59115, 'timePeriod': '201712250000'},\n", - " {'count': 106219, 'timePeriod': '201712240000'},\n", - " {'count': 114732, 'timePeriod': '201712230000'},\n", - " {'count': 73327, 'timePeriod': '201712220000'},\n", - " {'count': 89171, 'timePeriod': '201712210000'},\n", - " {'count': 192381, 'timePeriod': '201712200000'},\n", - " {'count': 85554, 'timePeriod': '201712190000'},\n", - " {'count': 57829, 'timePeriod': '201712180000'}]" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "counts" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -516,84 +412,14 @@ } ], "source": [ - "rule = gen_rule_payload(\"from:jack\",\n", + "query = gen_request_parameters(\"from:jack\",\n", " from_date=\"2017-09-20\",\n", " to_date=\"2017-10-30\",\n", " count_bucket=\"day\",\n", " results_per_call=500)\n", - "print(rule)" + "print(query)" ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "counts = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args)" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "scrolled": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'timePeriod': '201710290000', 'count': 0}\n", - "{'timePeriod': '201710280000', 'count': 0}\n", - "{'timePeriod': '201710270000', 'count': 3}\n", - "{'timePeriod': '201710260000', 'count': 6}\n", - "{'timePeriod': '201710250000', 'count': 4}\n", - "{'timePeriod': '201710240000', 'count': 4}\n", - "{'timePeriod': '201710230000', 'count': 0}\n", - "{'timePeriod': '201710220000', 'count': 0}\n", - "{'timePeriod': '201710210000', 'count': 3}\n", - "{'timePeriod': '201710200000', 'count': 2}\n", - "{'timePeriod': '201710190000', 'count': 1}\n", - "{'timePeriod': '201710180000', 'count': 6}\n", - "{'timePeriod': '201710170000', 'count': 2}\n", - "{'timePeriod': '201710160000', 'count': 2}\n", - "{'timePeriod': '201710150000', 'count': 1}\n", - "{'timePeriod': '201710140000', 'count': 64}\n", - "{'timePeriod': 
'201710130000', 'count': 3}\n", - "{'timePeriod': '201710120000', 'count': 4}\n", - "{'timePeriod': '201710110000', 'count': 8}\n", - "{'timePeriod': '201710100000', 'count': 4}\n", - "{'timePeriod': '201710090000', 'count': 1}\n", - "{'timePeriod': '201710080000', 'count': 0}\n", - "{'timePeriod': '201710070000', 'count': 0}\n", - "{'timePeriod': '201710060000', 'count': 1}\n", - "{'timePeriod': '201710050000', 'count': 3}\n", - "{'timePeriod': '201710040000', 'count': 5}\n", - "{'timePeriod': '201710030000', 'count': 8}\n", - "{'timePeriod': '201710020000', 'count': 5}\n", - "{'timePeriod': '201710010000', 'count': 0}\n", - "{'timePeriod': '201709300000', 'count': 0}\n", - "{'timePeriod': '201709290000', 'count': 0}\n", - "{'timePeriod': '201709280000', 'count': 9}\n", - "{'timePeriod': '201709270000', 'count': 41}\n", - "{'timePeriod': '201709260000', 'count': 13}\n", - "{'timePeriod': '201709250000', 'count': 6}\n", - "{'timePeriod': '201709240000', 'count': 7}\n", - "{'timePeriod': '201709230000', 'count': 3}\n", - "{'timePeriod': '201709220000', 'count': 0}\n", - "{'timePeriod': '201709210000', 'count': 1}\n", - "{'timePeriod': '201709200000', 'count': 7}\n" - ] - } - ], - "source": [ - "[print(c) for c in counts];" - ] - } - ], + }], "metadata": { "kernelspec": { "display_name": "Python 3", diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..c4f9676 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,3 @@ +requests +PyYAML +python-dateutil diff --git a/scripts/poll_tweets.py b/scripts/poll_tweets.py new file mode 100644 index 0000000..6652673 --- /dev/null +++ b/scripts/poll_tweets.py @@ -0,0 +1,277 @@ +#!/usr/bin/env python +# Copyright 2021 Twitter, Inc. +# Licensed under the Apache License, Version 2.0 +# http://www.apache.org/licenses/LICENSE-2.0 +import os +import argparse +import json +import sys +import logging +import time +from searchtweets import (ResultStream, + load_credentials, + merge_dicts, + read_config, + write_result_stream, + gen_params_from_config) + +logger = logging.getLogger() +# we want to leave this here and have it command-line configurable via the +# --debug flag +logging.basicConfig(level=os.environ.get("LOGLEVEL", "ERROR")) + +REQUIRED_KEYS = {"query", "endpoint"} + +def parse_cmd_args(): + argparser = argparse.ArgumentParser() + help_msg = """configuration file with all parameters. Far, + easier to use than the command-line args version., + If a valid file is found, all args will be populated, + from there. Remaining command-line args, + will overrule args found in the config, + file.""" + + argparser.add_argument("--credential-file", + dest="credential_file", + default=None, + help=("Location of the yaml file used to hold " + "your credentials.")) + + argparser.add_argument("--credential-file-key", + dest="credential_yaml_key", + default="search_tweets_v2", + help=("the key in the credential file used " + "for this session's credentials. " + "Defaults to search_tweets_v2")) + + argparser.add_argument("--env-overwrite", + dest="env_overwrite", + default=True, + help=("""Overwrite YAML-parsed credentials with + any set environment variables. See API docs or + readme for details.""")) + + argparser.add_argument("--config-file", + dest="config_filename", + default=None, + help=help_msg) + + argparser.add_argument("--query", + dest="query", + default=None, + help="Search query. 
(See: https://developer.twitter.com/en/docs/labs/recent-search/guides/search-queries)") + + argparser.add_argument("--interval", + dest="interval", + default=5, + help="""Polling interval in minutes. (default: 5 minutes)""") + + argparser.add_argument("--start-time", + dest="start_time", + default=None, + help="""Start of datetime window, format 'YYYY-mm-DDTHH:MM' (default: -7 days for /recent, -30 days for /all)""") + + argparser.add_argument("--end-time", + dest="end_time", + default=None, + help="""End of datetime window, format + 'YYYY-mm-DDTHH:MM' (default: to 30 seconds before request time)""") + + argparser.add_argument("--since-id", + dest="since_id", + default=None, + help="Tweet ID, will start search from Tweets after this one. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/pagination)") + + argparser.add_argument("--until-id", + dest="until_id", + default=None, + help="Tweet ID, will end search from Tweets before this one. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/pagination)") + + argparser.add_argument("--results-per-call", + dest="results_per_call", + help="Number of results to return per call " + "(default 10; max 100) - corresponds to " + "'max_results' in the API") + + argparser.add_argument("--expansions", + dest="expansions", + default=None, + help="""A comma-delimited list of expansions. Specified expansions results in full objects in the 'includes' response object.""") + + argparser.add_argument("--tweet-fields", + dest="tweet_fields", + default=None, + help="""A comma-delimited list of Tweet JSON attributes to include in endpoint responses. (API default:"id,text")""") + + argparser.add_argument("--user-fields", + dest="user_fields", + default=None, + help="""A comma-delimited list of User JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--media-fields", + dest="media_fields", + default=None, + help="""A comma-delimited list of media JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--place-fields", + dest="place_fields", + default=None, + help="""A comma-delimited list of Twitter Place JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--poll-fields", + dest="poll_fields", + default=None, + help="""A comma-delimited list of Twitter Poll JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--output-format", + dest="output_format", + default="r", + help="""Set output format: + 'r' Unmodified API [R]esponses. (default). + 'a' [A]tomic Tweets: Tweet objects with expansions inline. + 'm' [M]essage stream: Tweets, expansions, and pagination metadata as a stream of messages.""") + + #client options. 
+ argparser.add_argument("--max-tweets", dest="max_tweets", + type=int, + help="Maximum number of Tweets to return for this session of requests.") + + argparser.add_argument("--max-pages", + dest="max_pages", + type=int, + default=None, + help="Maximum number of pages/API calls to " + "use for this session.") + + argparser.add_argument("--results-per-file", dest="results_per_file", + default=None, + type=int, + help="Maximum tweets to save per file.") + + argparser.add_argument("--filename-prefix", + dest="filename_prefix", + default=None, + help="prefix for the filename where tweet " + " json data will be stored.") + + argparser.add_argument("--no-print-stream", + dest="print_stream", + action="store_false", + help="disable print streaming") + + argparser.add_argument("--print-stream", + dest="print_stream", + action="store_true", + default=True, + help="Print tweet stream to stdout") + + argparser.add_argument("--extra-headers", + dest="extra_headers", + type=str, + default=None, + help="JSON-formatted str representing a dict of additional HTTP request headers") + + argparser.add_argument("--debug", + dest="debug", + action="store_true", + default=False, + help="print all info and warning messages") + return argparser + + +def _filter_sensitive_args(dict_): + sens_args = ("consumer_key", "consumer_secret", "bearer_token") + return {k: v for k, v in dict_.items() if k not in sens_args} + +def main(): + args_dict = vars(parse_cmd_args().parse_args()) + if args_dict.get("debug") is True: + logger.setLevel(logging.DEBUG) + logger.debug("command line args dict:") + logger.debug(json.dumps(args_dict, indent=4)) + + if args_dict.get("config_filename") is not None: + configfile_dict = read_config(args_dict["config_filename"]) + else: + configfile_dict = {} + + extra_headers_str = args_dict.get("extra_headers") + if extra_headers_str is not None: + args_dict['extra_headers_dict'] = json.loads(extra_headers_str) + del args_dict['extra_headers'] + + logger.debug("config file ({}) arguments sans sensitive args:".format(args_dict["config_filename"])) + logger.debug(json.dumps(_filter_sensitive_args(configfile_dict), indent=4)) + + creds_dict = load_credentials(filename=args_dict["credential_file"], + yaml_key=args_dict["credential_yaml_key"], + env_overwrite=args_dict["env_overwrite"]) + + dict_filter = lambda x: {k: v for k, v in x.items() if v is not None} + + config_dict = merge_dicts(dict_filter(configfile_dict), + dict_filter(creds_dict), + dict_filter(args_dict)) + + logger.debug("combined dict (cli, config, creds):") + logger.debug(json.dumps(_filter_sensitive_args(config_dict), indent=4)) + + if len(dict_filter(config_dict).keys() & REQUIRED_KEYS) < len(REQUIRED_KEYS): + print(REQUIRED_KEYS - dict_filter(config_dict).keys()) + logger.error("ERROR: not enough arguments for the script to work") + sys.exit(1) + + stream_params = gen_params_from_config(config_dict) + logger.debug("full arguments passed to the ResultStream object sans credentials") + logger.debug(json.dumps(_filter_sensitive_args(stream_params), indent=4)) + + while True: + + start = time.time() + rs = ResultStream(tweetify=False, **stream_params) + + logger.debug(str(rs)) + + if config_dict.get("filename_prefix") is not None: + stream = write_result_stream(rs, + filename_prefix=config_dict.get("filename_prefix"), + results_per_file=config_dict.get("results_per_file")) + else: + stream = rs.stream() + + first_tweet = True + tweets_num = 0 + + #Iterate through Tweet array and handle output. 
+ for tweet in stream: + tweets_num = tweets_num + 1 + #Get Tweet ID from first Tweet + if first_tweet: + newest_id = tweet['id'] + first_tweet = False + if config_dict["print_stream"] is True: + print(json.dumps(tweet)) + + #This polling script switches to a since_id requests and removes the start_time parameter if it is used for backfill. + #Prepare next query, by setting the since_id request parameter. + print(f"{tweets_num} new Tweets. Newest_id: {newest_id}") + + request_json = json.loads(stream_params['request_parameters']) + + if 'start_time' in request_json.keys(): + del request_json['start_time'] + + request_json.update(since_id = newest_id) + stream_params['request_parameters'] = json.dumps(request_json) + + duration = time.time() - start + + sleep_interval = (float(config_dict["interval"]) * 60) - duration + + if sleep_interval < 0: + sleep_interval = (float(config_dict["interval"]) * 60) + + time.sleep(sleep_interval) + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/tools/search_tweets.py b/scripts/search_tweets.py similarity index 59% rename from tools/search_tweets.py rename to scripts/search_tweets.py index c2b699e..4cba8eb 100644 --- a/tools/search_tweets.py +++ b/scripts/search_tweets.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright 2017 Twitter, Inc. +# Copyright 2021 Twitter, Inc. # Licensed under the Apache License, Version 2.0 # http://www.apache.org/licenses/LICENSE-2.0 import os @@ -20,7 +20,7 @@ logging.basicConfig(level=os.environ.get("LOGLEVEL", "ERROR")) -REQUIRED_KEYS = {"pt_rule", "endpoint"} +REQUIRED_KEYS = {"query", "endpoint"} def parse_cmd_args(): @@ -40,10 +40,10 @@ def parse_cmd_args(): argparser.add_argument("--credential-file-key", dest="credential_yaml_key", - default=None, + default="search_tweets_v2", help=("the key in the credential file used " "for this session's credentials. " - "Defaults to search_tweets_api")) + "Defaults to search_tweets_v2")) argparser.add_argument("--env-overwrite", dest="env_overwrite", @@ -57,52 +57,94 @@ def parse_cmd_args(): default=None, help=help_msg) - argparser.add_argument("--account-type", - dest="account_type", + argparser.add_argument("--query", + dest="query", default=None, - choices=["premium", "enterprise"], - help="The account type you are using") + help="Search query. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/search-queries)") - argparser.add_argument("--count-bucket", - dest="count_bucket", + #Use of this command triggers a search count request. + argparser.add_argument("--granularity", + dest="granularity", default=None, - help=("""Bucket size for counts API. Options:, - day, hour, minute (default is 'day').""")) + help=("""Set this to make a 'counts' request. 'Bucket' size for the search counts API. Options: + day, hour, minute. 
Aligned to midnight UTC.""")) - argparser.add_argument("--start-datetime", - dest="from_date", + argparser.add_argument("--start-time", + dest="start_time", default=None, help="""Start of datetime window, format - 'YYYY-mm-DDTHH:MM' (default: -30 days)""") + 'YYYY-mm-DDTHH:MM' (default: -7 days for /recent, -30 days for /all)""") - argparser.add_argument("--end-datetime", - dest="to_date", + argparser.add_argument("--end-time", + dest="end_time", default=None, help="""End of datetime window, format - 'YYYY-mm-DDTHH:MM' (default: most recent - date)""") + 'YYYY-mm-DDTHH:MM' (default: to 30 seconds before request time)""") + + argparser.add_argument("--since-id", + dest="since_id", + default=None, + help="Tweet ID, will start search from Tweets after this one. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/pagination)") - argparser.add_argument("--filter-rule", - dest="pt_rule", + argparser.add_argument("--until-id", + dest="until_id", default=None, - help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)") + help="Tweet ID, will end search from Tweets before this one. (See: https://developer.twitter.com/en/docs/labs/recent-search/guides/pagination)") argparser.add_argument("--results-per-call", dest="results_per_call", help="Number of results to return per call " - "(default 100; max 500) - corresponds to " - "'maxResults' in the API") + "(default 10; max 100) - corresponds to " + "'max_results' in the API") + + argparser.add_argument("--expansions", + dest="expansions", + default=None, + help="""A comma-delimited list of expansions. Specified expansions results in full objects in the 'includes' response object.""") + + argparser.add_argument("--tweet-fields", + dest="tweet_fields", + default=None, + help="""A comma-delimited list of Tweet JSON attributes to include in endpoint responses. (API default:"id,text")""") + + argparser.add_argument("--user-fields", + dest="user_fields", + default=None, + help="""A comma-delimited list of User JSON attributes to include in endpoint responses. (API default:"id")""") - argparser.add_argument("--max-results", dest="max_results", + argparser.add_argument("--media-fields", + dest="media_fields", + default=None, + help="""A comma-delimited list of media JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--place-fields", + dest="place_fields", + default=None, + help="""A comma-delimited list of Twitter Place JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--poll-fields", + dest="poll_fields", + default=None, + help="""A comma-delimited list of Twitter Poll JSON attributes to include in endpoint responses. (API default:"id")""") + + argparser.add_argument("--output-format", + dest="output_format", + default="r", + help="""Set output format: + 'r' Unmodified API [R]esponses. (default). + 'a' [A]tomic Tweets: Tweet objects with expansions inline. 
+ 'm' [M]essage stream: Tweets, expansions, and pagination metadata as a stream of messages.""") + + argparser.add_argument("--max-tweets", dest="max_tweets", type=int, - help="Maximum number of Tweets or Counts to return for this session") + help="Maximum number of Tweets to return for this session of requests.") argparser.add_argument("--max-pages", dest="max_pages", type=int, default=None, help="Maximum number of pages/API calls to " - "use for this session.") + "use for this session.") argparser.add_argument("--results-per-file", dest="results_per_file", default=None, @@ -113,7 +155,7 @@ def parse_cmd_args(): dest="filename_prefix", default=None, help="prefix for the filename where tweet " - " json data will be stored.") + " json data will be stored.") argparser.add_argument("--no-print-stream", dest="print_stream", @@ -130,7 +172,7 @@ def parse_cmd_args(): dest="extra_headers", type=str, default=None, - help="JSON-formatted str representing a dict of additional request headers") + help="JSON-formatted str representing a dict of additional HTTP request headers") argparser.add_argument("--debug", dest="debug", @@ -141,7 +183,7 @@ def parse_cmd_args(): def _filter_sensitive_args(dict_): - sens_args = ("password", "consumer_key", "consumer_secret", "bearer_token") + sens_args = ("consumer_key", "consumer_secret", "bearer_token") return {k: v for k, v in dict_.items() if k not in sens_args} def main(): @@ -155,7 +197,7 @@ def main(): configfile_dict = read_config(args_dict["config_filename"]) else: configfile_dict = {} - + extra_headers_str = args_dict.get("extra_headers") if extra_headers_str is not None: args_dict['extra_headers_dict'] = json.loads(extra_headers_str) @@ -165,7 +207,6 @@ def main(): logger.debug(json.dumps(_filter_sensitive_args(configfile_dict), indent=4)) creds_dict = load_credentials(filename=args_dict["credential_file"], - account_type=args_dict["account_type"], yaml_key=args_dict["credential_yaml_key"], env_overwrite=args_dict["env_overwrite"]) @@ -175,16 +216,16 @@ def main(): dict_filter(creds_dict), dict_filter(args_dict)) - logger.debug("combined dict (cli, config, creds) sans password:") + logger.debug("combined dict (cli, config, creds):") logger.debug(json.dumps(_filter_sensitive_args(config_dict), indent=4)) if len(dict_filter(config_dict).keys() & REQUIRED_KEYS) < len(REQUIRED_KEYS): print(REQUIRED_KEYS - dict_filter(config_dict).keys()) - logger.error("ERROR: not enough arguments for the program to work") + logger.error("ERROR: not enough arguments for the script to work") sys.exit(1) stream_params = gen_params_from_config(config_dict) - logger.debug("full arguments passed to the ResultStream object sans password") + logger.debug("full arguments passed to the ResultStream object sans credentials") logger.debug(json.dumps(_filter_sensitive_args(stream_params), indent=4)) rs = ResultStream(tweetify=False, **stream_params) @@ -202,6 +243,5 @@ def main(): if config_dict["print_stream"] is True: print(json.dumps(tweet)) - if __name__ == '__main__': main() diff --git a/searchtweets/__init__.py b/searchtweets/__init__.py index d68af5a..db3cb47 100644 --- a/searchtweets/__init__.py +++ b/searchtweets/__init__.py @@ -1,4 +1,4 @@ -# Copyright 2018 Twitter, Inc. +# Copyright 2020 Twitter, Inc. 
# Licensed under the MIT License # https://opensource.org/licenses/MIT from .result_stream import ResultStream, collect_results diff --git a/searchtweets/_version.py b/searchtweets/_version.py index ebbfa45..7c5168b 100644 --- a/searchtweets/_version.py +++ b/searchtweets/_version.py @@ -1,5 +1,5 @@ # -*- coding: utf-8 -*- -# Copyright 2018 Twitter, Inc. +# Copyright 2020 Twitter, Inc. # Licensed under the MIT License # https://opensource.org/licenses/MIT -VERSION = "1.7.4" +VERSION = "1.0.7" diff --git a/searchtweets/api_utils.py b/searchtweets/api_utils.py index c61392a..5e0ba3a 100644 --- a/searchtweets/api_utils.py +++ b/searchtweets/api_utils.py @@ -1,41 +1,49 @@ # -*- coding: utf-8 -*- -# Copyright 2018 Twitter, Inc. +# Copyright 2021 Twitter, Inc. # Licensed under the MIT License # https://opensource.org/licenses/MIT """ Module containing the various functions that are used for API calls, -rule generation, and related. +request payload generation, and related. """ import re import datetime +from dateutil.relativedelta import * import logging try: import ujson as json except ImportError: import json -__all__ = ["gen_rule_payload", "gen_params_from_config", - "infer_endpoint", "convert_utc_time", - "validate_count_api", "change_to_count_endpoint"] +__all__ = ["gen_request_parameters", + "gen_params_from_config", + "infer_endpoint", + "change_to_count_endpoint", + "validate_count_api", + "convert_utc_time"] logger = logging.getLogger(__name__) def convert_utc_time(datetime_str): """ - Handles datetime argument conversion to the GNIP API format, which is - `YYYYMMDDHHSS`. Flexible passing of date formats in the following types:: + Handles datetime argument conversion to the Labs API format, which is + `YYYY-MM-DDTHH:mm:ssZ`. + Flexible passing of date formats in the following types:: - YYYYmmDDHHMM - YYYY-mm-DD - YYYY-mm-DD HH:MM - YYYY-mm-DDTHH:MM + - 3d (set start_time to three days ago) + - 12h (set start_time to twelve hours ago) + - 15m (set start_time to fifteen minutes ago) Args: datetime_str (str): valid formats are listed above. Returns: - string of GNIP API formatted date. + string of ISO formatted date. Example: >>> from searchtweets.utils import convert_utc_time @@ -48,108 +56,147 @@ def convert_utc_time(datetime_str): >>> convert_utc_time("2017-08-02T00:00") '201708020000' """ + if not datetime_str: return None - if not set(['-', ':']) & set(datetime_str): - _date = datetime.datetime.strptime(datetime_str, "%Y%m%d%H%M") - else: - try: - if "T" in datetime_str: - # command line with 'T' - datetime_str = datetime_str.replace('T', ' ') + try: + if len(datetime_str) <= 5: + _date = datetime.datetime.utcnow() + #parse out numeric character. 
+ num = float(datetime_str[:-1]) + if 'd' in datetime_str: + _date = (_date + relativedelta(days=-num)) + elif 'h' in datetime_str: + _date = (_date + relativedelta(hours=-num)) + elif 'm' in datetime_str: + _date = (_date + relativedelta(minutes=-num)) + elif not set(['-', ':']) & set(datetime_str): + _date = datetime.datetime.strptime(datetime_str, "%Y%m%d%H%M") + elif 'T' in datetime_str: + # command line with 'T' + datetime_str = datetime_str.replace('T', ' ') + _date = datetime.datetime.strptime(datetime_str, "%Y-%m-%d %H:%M") + else: _date = datetime.datetime.strptime(datetime_str, "%Y-%m-%d %H:%M") - except ValueError: - _date = datetime.datetime.strptime(datetime_str, "%Y-%m-%d") - return _date.strftime("%Y%m%d%H%M") - - -def change_to_count_endpoint(endpoint): - """Utility function to change a normal endpoint to a ``count`` api - endpoint. Returns the same endpoint if it's already a valid count endpoint. - Args: - endpoint (str): your api endpoint - - Returns: - str: the modified endpoint for a count endpoint. - """ - tokens = filter(lambda x: x != '', re.split("[/:]", endpoint)) - filt_tokens = list(filter(lambda x: x != "https", tokens)) - last = filt_tokens[-1].split('.')[0] # removes .json on the endpoint - filt_tokens[-1] = last # changes from *.json -> '' for changing input - if last == 'counts': - return endpoint - else: - return "https://" + '/'.join(filt_tokens) + '/' + "counts.json" + except ValueError: + _date = datetime.datetime.strptime(datetime_str, "%Y-%m-%d") + return _date.strftime("%Y-%m-%dT%H:%M:%SZ") -def gen_rule_payload(pt_rule, results_per_call=None, - from_date=None, to_date=None, count_bucket=None, - tag=None, - stringify=True): +def gen_request_parameters(query, granularity, results_per_call=None, + start_time=None, end_time=None, since_id=None, until_id=None, + tweet_fields=None, user_fields=None, media_fields=None, + place_fields=None, poll_fields=None, + expansions=None, + stringify=True): """ - Generates the dict or json payload for a PowerTrack rule. + Generates the dict or json payload for a search query. Args: - pt_rule (str): The string version of a powertrack rule, - e.g., "beyonce has:geo". Accepts multi-line strings + query (str): The string version of a search query, + e.g., "snow has:media -is:retweet". Accepts multi-line strings for ease of entry. results_per_call (int): number of tweets or counts returned per API - call. This maps to the ``maxResults`` search API parameter. - Defaults to 500 to reduce API call usage. - from_date (str or None): Date format as specified by + call. This maps to the `max_results`` search API parameter. + Defaults to 100 (maximum supported in Labs). + start_time (str or None): Date format as specified by `convert_utc_time` for the starting time of your search. - to_date (str or None): date format as specified by `convert_utc_time` + end_time (str or None): date format as specified by `convert_utc_time` for the end time of your search. - count_bucket (str or None): If using the counts api endpoint, - will define the count bucket for which tweets are aggregated. + tweet_fields (string): comma-delimted list of Tweet JSON attributes wanted in endpoint responses. Default is "id,created_at,text"). + Also user_fields, media_fields, place_fields, poll_fields + expansions (string): comma-delimited list of object expansions. stringify (bool): specifies the return type, `dict` or json-formatted `str`. 
Example: - >>> from searchtweets.utils import gen_rule_payload - >>> gen_rule_payload("beyonce has:geo", - ... from_date="2017-08-21", - ... to_date="2017-08-22") - '{"query":"beyonce has:geo","maxResults":100,"toDate":"201708220000","fromDate":"201708210000"}' + >>> from searchtweets.utils import gen_request_parameters + >>> gen_request_parameters("snow has:media -is:retweet", + ... from_date="2020-02-18", + ... to_date="2020-02-21") + '{"query":"snow has:media -is:retweet","max_results":100,"start_time":"202002180000","end_time":"202002210000"}' """ - pt_rule = ' '.join(pt_rule.split()) # allows multi-line strings - payload = {"query": pt_rule} + #Set endpoint request parameter to command-line arguments. This is where 'translation' happens. + query = ' '.join(query.split()) # allows multi-line strings + payload = {"query": query} if results_per_call is not None and isinstance(results_per_call, int) is True: - payload["maxResults"] = results_per_call - if to_date: - payload["toDate"] = convert_utc_time(to_date) - if from_date: - payload["fromDate"] = convert_utc_time(from_date) - if count_bucket: - if set(["day", "hour", "minute"]) & set([count_bucket]): - payload["bucket"] = count_bucket - del payload["maxResults"] - else: - logger.error("invalid count bucket: provided {}" - .format(count_bucket)) - raise ValueError - if tag: - payload["tag"] = tag + payload["max_results"] = results_per_call + if start_time: + payload["start_time"] = convert_utc_time(start_time) + if end_time: + payload["end_time"] = convert_utc_time(end_time) + if since_id: + payload["since_id"] = since_id + if until_id: + payload["until_id"] = until_id + if tweet_fields: + payload["tweet.fields"] = tweet_fields + if user_fields: + payload["user.fields"] = user_fields + if media_fields: + payload["media.fields"] = media_fields + if place_fields: + payload["place.fields"] = place_fields + if poll_fields: + payload["poll.fields"] = poll_fields + if expansions: + payload["expansions"] = expansions + if granularity: + payload["granularity"] = granularity return json.dumps(payload) if stringify else payload +def infer_endpoint(request_parameters): + """ + Infer which endpoint should be used for a given rule payload. + """ + if 'granularity' in request_parameters.keys(): + return 'counts' + else: + return 'search' #TODO: else "Tweets" makes more sense? + +def change_to_count_endpoint(endpoint): + """Utility function to change a normal 'get Tweets' endpoint to a ``count`` api + endpoint. Returns the same endpoint if it's already a valid count endpoint. + Args: + endpoint (str): your api endpoint + Returns: + str: the modified endpoint for a count endpoint. + + Recent search Tweet endpoint: https://api.twitter.com/2/tweets/search/recent + Recent search Counts endpoint: https://api.twitter.com/2/tweets/counts/recent + + FAS Tweet endpoint: https://api.twitter.com/2/tweets/search/all + FAS Counts endpoint: https://api.twitter.com/2/tweets/counts/all + + """ + if 'counts' in endpoint: + return endpoint + else: #Add in counts to endpoint URL. #TODO: update to *build* URL by injecting 'counts' to handle FAS. + #Insert 'counts' token as the second to last token. 
+ #tokens = filter(lambda x: x != '', re.split("[/:]", endpoint)) + tokens = endpoint.split('/') + search_type = tokens[-1] + base = endpoint.split('tweets') + endpoint = base[0] + 'tweets/counts/' + search_type + return endpoint def gen_params_from_config(config_dict): """ Generates parameters for a ResultStream from a dictionary. """ - if config_dict.get("count_bucket"): - logger.warning("change your endpoint to the count endpoint; this is " - "default behavior when the count bucket " - "field is defined") - endpoint = change_to_count_endpoint(config_dict.get("endpoint")) - else: - endpoint = config_dict.get("endpoint") + # if config_dict.get("count_bucket"): + # logger.warning("change your endpoint to the count endpoint; this is " + # "default behavior when the count bucket " + # "field is defined") + # endpoint = change_to_count_endpoint(config_dict.get("endpoint")) + # else: + endpoint = config_dict.get("endpoint") def intify(arg): @@ -158,47 +205,46 @@ def intify(arg): else: return arg - # this parameter comes in as a string when it's parsed + # This numeric parameter comes in as a string when it's parsed results_per_call = intify(config_dict.get("results_per_call", None)) - rule = gen_rule_payload(pt_rule=config_dict["pt_rule"], - from_date=config_dict.get("from_date", None), - to_date=config_dict.get("to_date", None), - results_per_call=results_per_call, - count_bucket=config_dict.get("count_bucket", None)) + query = gen_request_parameters(query=config_dict["query"], + granularity=config_dict.get("granularity", None), + start_time=config_dict.get("start_time", None), + end_time=config_dict.get("end_time", None), + since_id=config_dict.get("since_id", None), + until_id=config_dict.get("until_id", None), + tweet_fields=config_dict.get("tweet_fields", None), + user_fields=config_dict.get("user_fields", None), + media_fields=config_dict.get("media_fields", None), + place_fields=config_dict.get("place_fields", None), + poll_fields=config_dict.get("poll_fields", None), + expansions=config_dict.get("expansions", None), + results_per_call=results_per_call) _dict = {"endpoint": endpoint, - "username": config_dict.get("username"), - "password": config_dict.get("password"), "bearer_token": config_dict.get("bearer_token"), "extra_headers_dict": config_dict.get("extra_headers_dict",None), - "rule_payload": rule, + "request_parameters": query, "results_per_file": intify(config_dict.get("results_per_file")), - "max_results": intify(config_dict.get("max_results")), - "max_pages": intify(config_dict.get("max_pages", None))} - return _dict - - -def infer_endpoint(rule_payload): - """ - Infer which endpoint should be used for a given rule payload. - """ - bucket = (rule_payload if isinstance(rule_payload, dict) - else json.loads(rule_payload)).get("bucket") - return "counts" if bucket else "search" + "max_tweets": intify(config_dict.get("max_tweets")), + "max_pages": intify(config_dict.get("max_pages", None)), + "output_format": config_dict.get("output_format")} + return _dict -def validate_count_api(rule_payload, endpoint): +#TODO: Check if this is still needed, when code dynamically checks/updates endpoint based on use of 'granularity.' +def validate_count_api(request_parameters, endpoint): """ Ensures that the counts api is set correctly in a payload. 
""" - rule = (rule_payload if isinstance(rule_payload, dict) - else json.loads(rule_payload)) - bucket = rule.get('bucket') + rule = (request_parameters if isinstance(request_parameters, dict) + else json.loads(request_parameters)) + granularity = rule.get('granularity') counts = set(endpoint.split("/")) & {"counts.json"} - if len(counts) == 0: - if bucket is not None: - msg = ("""There is a count bucket present in your payload, + if 'counts' not in endpoint: + if granularity is not None: + msg = ("""There is a 'granularity' present in your request, but you are using not using the counts API. Please check your endpoints and try again""") logger.error(msg) diff --git a/searchtweets/credentials.py b/searchtweets/credentials.py index 081c5db..cddfc5a 100644 --- a/searchtweets/credentials.py +++ b/searchtweets/credentials.py @@ -1,5 +1,5 @@ # -*- coding: utf-8 -*- -# Copyright 2017 Twitter, Inc. +# Copyright 2020 Twitter, Inc. # Licensed under the Apache License, Version 2.0 # http://www.apache.org/licenses/LICENSE-2.0 """This module handles credential management and parsing for the API. As we @@ -45,11 +45,7 @@ def _load_yaml_credentials(filename=None, yaml_key=None): def _load_env_credentials(): vars_ = ["SEARCHTWEETS_ENDPOINT", - "SEARCHTWEETS_ACCOUNT", - "SEARCHTWEETS_USERNAME", - "SEARCHTWEETS_PASSWORD", "SEARCHTWEETS_BEARER_TOKEN", - "SEARCHTWEETS_ACCOUNT_TYPE", "SEARCHTWEETS_CONSUMER_KEY", "SEARCHTWEETS_CONSUMER_SECRET" ] @@ -60,44 +56,22 @@ def _load_env_credentials(): return parsed -def _parse_credentials(search_creds, account_type): - - if account_type is None: - account_type = search_creds.get("account_type", None) - # attempt to infer account type - if account_type is None: - if search_creds.get("bearer_token") is not None: - account_type = "premium" - elif search_creds.get("password") is not None: - account_type = "enterprise" - else: - pass - - if account_type not in {"premium", "enterprise"}: - msg = """Account type is not specified and cannot be inferred. - Please check your credential file, arguments, or environment variables - for issues. The account type must be 'premium' or 'enterprise'. - """ - logger.error(msg) - raise KeyError +def _parse_credentials(search_creds, api_version=None): try: - if account_type == "premium": - if "bearer_token" not in search_creds: - if "consumer_key" in search_creds \ - and "consumer_secret" in search_creds: - search_creds["bearer_token"] = _generate_bearer_token( - search_creds["consumer_key"], - search_creds["consumer_secret"]) - - search_args = { - "bearer_token": search_creds["bearer_token"], - "endpoint": search_creds["endpoint"], - "extra_headers_dict": search_creds.get("extra_headers",None)} - if account_type == "enterprise": - search_args = {"username": search_creds["username"], - "password": search_creds["password"], - "endpoint": search_creds["endpoint"]} + + if "bearer_token" not in search_creds: + if "consumer_key" in search_creds \ + and "consumer_secret" in search_creds: + search_creds["bearer_token"] = _generate_bearer_token( + search_creds["consumer_key"], + search_creds["consumer_secret"]) + + search_args = { + "bearer_token": search_creds["bearer_token"], + "endpoint": search_creds["endpoint"], + "extra_headers_dict": search_creds.get("extra_headers",None)} + except KeyError: logger.error("Your credentials are not configured correctly and " " you are missing a required field. 
Please see the " @@ -106,8 +80,7 @@ def _parse_credentials(search_creds, account_type): return search_args - -def load_credentials(filename=None, account_type=None, +def load_credentials(filename=None, yaml_key=None, env_overwrite=True): """ Handles credential management. Supports both YAML files and environment @@ -118,12 +91,9 @@ def load_credentials(filename=None, account_type=None, : endpoint: - username: - password: consumer_key: consumer_secret: bearer_token: - account_type: extra_headers: : @@ -136,10 +106,8 @@ def load_credentials(filename=None, account_type=None, .. code: yaml SEARCHTWEETS_ENDPOINT - SEARCHTWEETS_USERNAME - SEARCHTWEETS_PASSWORD SEARCHTWEETS_BEARER_TOKEN - SEARCHTWEETS_ACCOUNT_TYPE + SEARCHTWEETS_API_VERSION ... Again, set the variables that correspond to your account information and @@ -149,8 +117,8 @@ def load_credentials(filename=None, account_type=None, Args: filename (str): pass a filename here if you do not want to use the default ``~/.twitter_keys.yaml`` - account_type (str): your account type, "premium" or "enterprise". We - will attempt to infer the account info if left empty. + api_version (str): API version, "labs_v1" or "labs_v2". We + will attempt to infer the version info if left empty. yaml_key (str): the top-level key in the YAML file that has your information. Defaults to ``search_tweets_api``. env_overwrite: any found environment variables will overwrite values @@ -161,21 +129,16 @@ def load_credentials(filename=None, account_type=None, Example: >>> from searchtweets.api_utils import load_credentials - >>> search_args = load_credentials(account_type="premium", - env_overwrite=False) + >>> search_args = load_credentials(env_overwrite=False) >>> search_args.keys() dict_keys(['bearer_token', 'endpoint']) >>> import os >>> os.environ["SEARCHTWEETS_ENDPOINT"] = "https://endpoint" - >>> os.environ["SEARCHTWEETS_USERNAME"] = "areallybadpassword" - >>> os.environ["SEARCHTWEETS_PASSWORD"] = "" >>> load_credentials() - {'endpoint': 'https://endpoint', - 'password': '', - 'username': 'areallybadpassword'} + {'endpoint': 'https://endpoint'} """ - yaml_key = yaml_key if yaml_key is not None else "search_tweets_api" + yaml_key = yaml_key if yaml_key is not None else "search_tweets_v2" filename = "~/.twitter_keys.yaml" if filename is None else filename yaml_vars = _load_yaml_credentials(filename=filename, yaml_key=yaml_key) @@ -186,7 +149,7 @@ def load_credentials(filename=None, account_type=None, merged_vars = (merge_dicts(yaml_vars, env_vars) if env_overwrite else merge_dicts(env_vars, yaml_vars)) - parsed_vars = _parse_credentials(merged_vars, account_type=account_type) + parsed_vars = _parse_credentials(merged_vars) return parsed_vars @@ -204,3 +167,4 @@ def _generate_bearer_token(consumer_key, consumer_secret): resp.raise_for_status() return resp.json()['access_token'] + diff --git a/searchtweets/result_stream.py b/searchtweets/result_stream.py index dcc995c..dee943d 100644 --- a/searchtweets/result_stream.py +++ b/searchtweets/result_stream.py @@ -1,45 +1,39 @@ # -*- coding: utf-8 -*- -# Copyright 2018 Twitter, Inc. +# Copyright 2020 Twitter, Inc. # Licensed under the MIT License # https://opensource.org/licenses/MIT """ This module contains the request handing and actual API wrapping functionality. - Its core method is the ``ResultStream`` object, which takes the API call arguments and returns a stream of results to the user. 
""" import time -import re import logging import requests +from urllib.parse import urlencode try: import ujson as json except ImportError: import json -from tweet_parser.tweet import Tweet from .utils import merge_dicts - from .api_utils import infer_endpoint, change_to_count_endpoint +from collections import defaultdict from ._version import VERSION logger = logging.getLogger(__name__) -def make_session(username=None, password=None, bearer_token=None, extra_headers_dict=None): +def make_session(bearer_token=None, extra_headers_dict=None): """Creates a Requests Session for use. Accepts a bearer token - for premiums users and will override username and password information if - present. - + for v2. Args: - username (str): username for the session - password (str): password for the user - bearer_token (str): token for a premium API user. + bearer_token (str): token for a v2 user. """ - if password is None and bearer_token is None: + if bearer_token is None: logger.error("No authentication information provided; " "please check your object") raise KeyError @@ -47,35 +41,33 @@ def make_session(username=None, password=None, bearer_token=None, extra_headers_ session = requests.Session() session.trust_env = False headers = {'Accept-encoding': 'gzip', - 'User-Agent': 'twitterdev-search-tweets-python/' + VERSION} + 'User-Agent': 'twitterdev-search-tweets-python-labs/' + VERSION} + if bearer_token: logger.info("using bearer token for authentication") headers['Authorization'] = "Bearer {}".format(bearer_token) session.headers = headers - else: - logger.info("using username and password for authentication") - session.auth = username, password - session.headers = headers + if extra_headers_dict: - headers.update(extra_headers_dict) + headers.update(extra_headers_dict) return session - def retry(func): """ - Decorator to handle API retries and exceptions. Defaults to three retries. - + Decorator to handle API retries and exceptions. Defaults to five retries. + Rate-limit (429) and server-side errors (5XX) implement a retry design. + Other 4XX errors are a 'one and done' type error. + Retries implement an exponential backoff... Args: func (function): function for decoration - Returns: decorated function - """ def retried_func(*args, **kwargs): max_tries = 10 tries = 0 total_sleep_seconds = 0 + while True: try: resp = func(*args, **kwargs) @@ -92,26 +84,25 @@ def retried_func(*args, **kwargs): tries += 1 - logger.error(f"HTTP Error code: {resp.status_code}: {resp.text}") - logger.error(f"Request payload: {kwargs['rule_payload']}") + logger.error(f" HTTP Error code: {resp.status_code}: {resp.text} | {resp.reason}") + logger.error(f" Request payload: {kwargs['request_parameters']}") if resp.status_code == 429: - logger.warning("Rate limit hit... Will retry...") - #print("Rate limit hit... Will retry...") - sleep_seconds = min(((tries * 2) ** 2), 900 - total_sleep_seconds) + logger.error("Rate limit hit... Will retry...") + #Expontential backoff, but within a 15-minute (900 seconds) period. No sense in backing off for more than 15 minutes. + sleep_seconds = min(((tries * 2) ** 2), max(900 - total_sleep_seconds,30)) total_sleep_seconds = total_sleep_seconds + sleep_seconds elif resp.status_code >= 500: - logger.warning("Server-side error... Will retry...") - #print("Server-side error... Will retry...") + logger.error("Server-side error... Will retry...") sleep_seconds = 30 else: #Other errors are a "one and done", no use in retrying error... + logger.error('Quitting... 
') raise requests.exceptions.HTTPError - # mini exponential backoff here. - logger.warning(f"Will retry in {sleep_seconds} seconds...") - #print(f"Will retry in {sleep_seconds} seconds...") + + logger.error(f"Will retry in {sleep_seconds} seconds...") time.sleep(sleep_seconds) continue @@ -123,20 +114,30 @@ def retried_func(*args, **kwargs): @retry -def request(session, url, rule_payload, **kwargs): +def request(session, url, request_parameters, **kwargs): """ Executes a request with the given payload and arguments. - Args: session (requests.Session): the valid session object url (str): Valid API endpoint - rule_payload (str or dict): rule package for the POST. If you pass a + request_parameters (str or dict): rule package for the POST. If you pass a dictionary, it will be converted into JSON. """ - if isinstance(rule_payload, dict): - rule_payload = json.dumps(rule_payload) + + if isinstance(request_parameters, dict): + request_parameters = json.dumps(request_parameters) logger.debug("sending request") - result = session.post(url, data=rule_payload, **kwargs) + + request_json = json.loads(request_parameters) + + #Using POST command, not yet supported in v2. + #result = session.post(url, data=request_parameters, **kwargs) + + #New v2-specific code in support of GET requests. + request_url = urlencode(request_json) + url = f"{url}?{request_url}" + + result = session.get(url, **kwargs) return result @@ -145,71 +146,202 @@ class ResultStream: Class to represent an API query that handles two major functionality pieces: wrapping metadata around a specific API call and automatic pagination of results. - Args: - username (str): username for enterprise customers - password (str): password for enterprise customers - bearer_token (str): bearer token for premium users - endpoint (str): API endpoint; see your console at developer.twitter.com - rule_payload (json or dict): payload for the post request - max_results (int): max number results that will be returned from this + bearer_token (str): bearer token for v2. + + endpoint (str): API endpoint. + + request_parameters (json or dict): payload for the post request + + max_tweets (int): max number results that will be returned from this instance. Note that this can be slightly lower than the total returned - from the API call - e.g., setting ``max_results = 10`` would return - ten results, but an API call will return at minimum 100 results. - tweetify (bool): If you are grabbing tweets and not counts, use the - tweet parser library to convert each raw tweet package to a Tweet - with lazy properties. - max_requests (int): A hard cutoff for the number of API calls this - instance will make. Good for testing in sandbox premium environments. - extra_headers_dict (dict): custom headers to add + from the API call - e.g., setting ``max_tweets = 10`` would return + ten results, but an API call will return at minimum 100 results by default. + max_requests (int): A hard cutoff for the number of API calls this + instance will make. Good for testing in v2 environment. + extra_headers_dict (dict): custom headers to add Example: - >>> rs = ResultStream(**search_args, rule_payload=rule, max_pages=1) + >>> rs = ResultStream(**search_args, request_parameters=rule, max_pages=1) >>> results = list(rs.stream()) - """ # leaving this here to have an API call counter for ALL objects in your # session, helping with usage of the convenience functions in the library. 
session_request_counter = 0 - def __init__(self, endpoint, rule_payload, username=None, password=None, - bearer_token=None, extra_headers_dict=None, max_results=500, - tweetify=True, max_requests=None, **kwargs): + def __init__(self, endpoint, request_parameters, bearer_token=None, extra_headers_dict=None, max_tweets=500, + max_requests=None, output_format="r", **kwargs): - self.username = username - self.password = password - self.bearer_token = bearer_token + self.bearer_token = bearer_token #TODO: Add support for user tokens. self.extra_headers_dict = extra_headers_dict - if isinstance(rule_payload, str): - rule_payload = json.loads(rule_payload) - self.rule_payload = rule_payload - self.tweetify = tweetify + if isinstance(request_parameters, str): + request_parameters = json.loads(request_parameters) + self.request_parameters = request_parameters # magic number of max tweets if you pass a non_int - self.max_results = (max_results if isinstance(max_results, int) - else 10 ** 15) + self.max_tweets = (max_tweets if isinstance(max_tweets, int) + else 10 ** 15) self.total_results = 0 self.n_requests = 0 self.session = None + self.current_response = None self.current_tweets = None self.next_token = None self.stream_started = False - self._tweet_func = Tweet if tweetify else lambda x: x + self._tweet_func = lambda x: x # magic number of requests! self.max_requests = (max_requests if max_requests is not None else 10 ** 9) + + + + #Branching to counts or Tweets endpoint. + #TODO: unit testing + self.search_type = 'tweets' + #infer_endpoint(request_parameters) + #change_to_count_endpoint(endpoint) self.endpoint = (change_to_count_endpoint(endpoint) - if infer_endpoint(rule_payload) == "counts" + if infer_endpoint(request_parameters) == "counts" else endpoint) - # validate_count_api(self.rule_payload, self.endpoint) + + if 'counts' in self.endpoint: + self.search_type = 'counts' + + self.output_format = output_format + + def formatted_output(self): + + def extract_includes(expansion, _id="id"): + """ + Return empty objects for things missing in includes. + """ + if self.includes is not None and expansion in self.includes: + return defaultdict( + lambda: {}, + {include[_id]: include for include in self.includes[expansion]}, + ) + else: + return defaultdict(lambda: {}) + + #TODO - counts does not have extractions.... So, skip if you caunt. + # Users extracted both by id and by username for expanding mentions + includes_users = merge_dicts(extract_includes("users"), extract_includes("users", "username")) + # Tweets in includes will themselves be expanded + includes_tweets = extract_includes("tweets") + # Media is by media_key, not id + includes_media = extract_includes("media", "media_key") + includes_polls = extract_includes("polls") + includes_places = extract_includes("places") + # Errors are returned but unused here + includes_errors = extract_includes("errors") + + def expand_payload(payload): + """ + Recursively step through an object and sub objects and append extra data. 
+ """ + + # Don't try to expand on primitive values, return strings as is: + if isinstance(payload, (str, bool, int, float)): + return payload + # expand list items individually: + elif isinstance(payload, list): + payload = [expand_payload(item) for item in payload] + return payload + # Try to expand on dicts within dicts: + elif isinstance(payload, dict): + for key, value in payload.items(): + payload[key] = expand_payload(value) + + if "author_id" in payload: + payload["author"] = includes_users[payload["author_id"]] + + if "in_reply_to_user_id" in payload: + payload["in_reply_to_user"] = includes_users[payload["in_reply_to_user_id"]] + + if "media_keys" in payload: + payload["media"] = list(includes_media[media_key] for media_key in payload["media_keys"]) + + if "poll_ids" in payload: + poll_id = payload["poll_ids"][-1] # always 1, only 1 poll per tweet. + payload["poll"] = includes_polls[poll_id] + + if "geo" in payload: + place_id = payload["geo"]['place_id'] + payload["geo"] = merge_dicts(payload["geo"], includes_places[place_id]) + + if "mentions" in payload: + payload["mentions"] = list(merge_dicts(referenced_user, includes_users[referenced_user['username']]) for referenced_user in payload["mentions"]) + + if "referenced_tweets" in payload: + payload["referenced_tweets"] = list(merge_dicts(referenced_tweet, includes_tweets[referenced_tweet['id']]) for referenced_tweet in payload["referenced_tweets"]) + + if "pinned_tweet_id" in payload: + payload["pinned_tweet"] = includes_tweets[payload["pinned_tweet_id"]] + + return payload + + #TODO: Tweets or Counts? + # First, expand the included tweets, before processing actual result tweets: + if self.search_type == 'tweets': + for included_id, included_tweet in extract_includes("tweets").items(): + includes_tweets[included_id] = expand_payload(included_tweet) + + def output_response_format(): + """ + output the response as 1 "page" per line + """ + #TODO: counts details + if self.search_type == 'tweets': + if self.total_results >= self.max_tweets: + return + yield self.current_response + + #With counts, there is nothing to count here... we aren't counting Tweets (but should count requests) + if self.search_type == 'tweets': + self.total_results += self.meta['result_count'] + + def output_atomic_format(): + """ + Format the results with "atomic" objects: + """ + for tweet in self.current_tweets: + if self.total_results >= self.max_tweets: + break + yield self._tweet_func(expand_payload(tweet)) + self.total_results += 1 + + def output_message_stream_format(): + """ + output as a stream of messages, + the way it was implemented originally + """ + # Serve up data.tweets. + for tweet in self.current_tweets: + if self.total_results >= self.max_tweets: + break + yield self._tweet_func(tweet) + self.total_results += 1 + + # Serve up "includes" arrays, this includes errors + if self.includes != None: + yield self.includes + + # Serve up meta structure. + if self.meta != None: + yield self.meta + + response_format = {"r": output_response_format, + "a": output_atomic_format, + "m": output_message_stream_format} + + return response_format.get(self.output_format)() def stream(self): """ Main entry point for the data from the API. Will automatically paginate - through the results via the ``next`` token and return up to ``max_results`` + through the results via the ``next`` token and return up to ``max_tweets`` tweets or up to ``max_requests`` API calls, whichever is lower. 
- Usage: >>> result_stream = ResultStream(**kwargs) >>> stream = result_stream.stream() @@ -218,25 +350,35 @@ def stream(self): >>> results = list(ResultStream(**kwargs).stream()) """ self.init_session() - self.check_counts() + #self.check_counts() #TODO: not needed if no Tweet Parser being used. self.execute_request() self.stream_started = True + while True: - for tweet in self.current_tweets: - if self.total_results >= self.max_results: - break - yield self._tweet_func(tweet) - self.total_results += 1 - if self.next_token and self.total_results < self.max_results and self.n_requests <= self.max_requests: - self.rule_payload = merge_dicts(self.rule_payload, - {"next": self.next_token}) + if self.current_tweets == None: + break + yield from self.formatted_output() + + if self.next_token and self.total_results < self.max_tweets and self.n_requests <= self.max_requests: + self.request_parameters = merge_dicts(self.request_parameters, + {"next_token": self.next_token}) logger.info("paging; total requests read so far: {}" .format(self.n_requests)) + + #If hitting the "all" search endpoint, wait one second since that endpoint is currently + #limited to one request per sleep. + #Revisit and make configurable when the requests-per-second gets revisited. + if "tweets/search/all" in self.endpoint: + time.sleep(2) + self.execute_request() + else: break + logger.info("ending stream at {} tweets".format(self.total_results)) + self.current_response = None self.current_tweets = None self.session.close() @@ -246,11 +388,11 @@ def init_session(self): """ if self.session: self.session.close() - self.session = make_session(self.username, - self.password, - self.bearer_token, + self.session = make_session(self.bearer_token, self.extra_headers_dict) + + #TODO: not needed if no Tweet Parser being used. def check_counts(self): """ Disables tweet parsing if the count API is used. @@ -259,6 +401,7 @@ def check_counts(self): logger.info("disabling tweet parsing due to counts API usage") self._tweet_func = lambda x: x + def execute_request(self): """ Sends the request to the API and parses the json response. @@ -271,52 +414,54 @@ def execute_request(self): resp = request(session=self.session, url=self.endpoint, - rule_payload=self.rule_payload) + request_parameters=self.request_parameters) self.n_requests += 1 ResultStream.session_request_counter += 1 - resp = json.loads(resp.content.decode(resp.encoding)) - self.next_token = resp.get("next", None) - self.current_tweets = resp["results"] + try: + resp = json.loads(resp.content.decode(resp.encoding)) + + self.current_response = resp + self.current_tweets = resp.get("data", None) + self.includes = resp.get("includes", None) + self.meta = resp.get("meta", None) + self.next_token = self.meta.get("next_token", None) + + except: + print("Error parsing content as JSON.") def __repr__(self): - repr_keys = ["username", "endpoint", "rule_payload", - "tweetify", "max_results"] + repr_keys = ["endpoint", "request_parameters", "max_tweets"] str_ = json.dumps(dict([(k, self.__dict__.get(k)) for k in repr_keys]), indent=4) str_ = "ResultStream: \n\t" + str_ return str_ - -def collect_results(rule, max_results=500, result_stream_args=None): +def collect_results(query, max_tweets=1000, result_stream_args=None): """ Utility function to quickly get a list of tweets from a ``ResultStream`` without keeping the object around. Requires your args to be configured prior to using. 
-
     Args:
-        rule (str): valid powertrack rule for your account, preferably
-            generated by the `gen_rule_payload` function.
-        max_results (int): maximum number of tweets or counts to return from
+        query (str): valid Twitter API v2 search query, preferably
+            generated by the `gen_request_parameters` function.
+        max_tweets (int): maximum number of tweets to return from
             the API / underlying ``ResultStream`` object.
         result_stream_args (dict): configuration dict that has connection
             information for a ``ResultStream`` object.
-
     Returns:
         list of results
-
     Example:
         >>> from searchtweets import collect_results
-        >>> tweets = collect_results(rule,
-                                     max_results=500,
+        >>> tweets = collect_results(query,
+                                     max_tweets=500,
                                      result_stream_args=search_args)
-
     """
     if result_stream_args is None:
         logger.error("This function requires a configuration dict for the "
                      "inner ResultStream object.")
         raise KeyError
 
-    rs = ResultStream(rule_payload=rule,
-                      max_results=max_results,
+    rs = ResultStream(request_parameters=query,
+                      max_tweets=max_tweets,
                       **result_stream_args)
     return list(rs.stream())
 
diff --git a/searchtweets/utils.py b/searchtweets/utils.py
index 2efd664..a6e7ce1 100644
--- a/searchtweets/utils.py
+++ b/searchtweets/utils.py
@@ -1,7 +1,7 @@
 """
 Utility functions that are used in various parts of the program.
 """
-# Copyright 2018 Twitter, Inc.
+# Copyright 2020 Twitter, Inc.
 # Licensed under the MIT License
 # https://opensource.org/licenses/MIT
 
@@ -71,10 +71,10 @@ def merge_dicts(*dicts):
 
     Example:
         >>> from searchtweets.utils import merge_dicts
-        >>> d1 = {"rule": "something has:geo"}
-        >>> d2 = {"maxResults": 1000}
+        >>> d1 = {"query": "snow has:media -is:retweet"}
+        >>> d2 = {"max_tweets": 1000}
         >>> merge_dicts(*[d1, d2])
-        {"maxResults": 1000, "rule": "something has:geo"}
+        {"query": "snow has:media -is:retweet", "max_tweets": 1000}
     """
     def _merge_dicts(dict1, dict2):
         merged = dict1.copy()
@@ -148,32 +148,31 @@ def read_config(filename):
     search_rules:
         from-date: 2017-06-01
         to-date: 2017-09-01 01:01
-        pt-rule: kanye
+        query: snow
 
     search_params:
-        results-per-call: 500
-        max-results: 500
+        results-per-call: 100
+        max-tweets: 500
 
     output_params:
         save_file: True
-        filename_prefix: kanye
+        filename_prefix: snow
         results_per_file: 10000000
 
     or::
 
-
     [search_rules]
     from_date = 2017-06-01
     to_date = 2017-09-01
-    pt_rule = beyonce has:geo
+    query = snow has:geo
 
     [search_params]
-    results_per_call = 500
-    max_results = 500
+    results_per_call = 100
+    max_tweets = 500
 
     [output_params]
     save_file = True
-    filename_prefix = beyonce
+    filename_prefix = snow_geo
     results_per_file = 10000000
 
     Args:
@@ -187,7 +186,7 @@ def read_config(filename):
 
     if file_type == "yaml":
         with open(os.path.expanduser(filename)) as f:
-            config_dict = yaml.load(f)
+            config_dict = yaml.safe_load(f)
 
         config_dict = merge_dicts(*[dict(config_dict[s])
                                     for s in config_dict.keys()])
@@ -203,10 +202,10 @@ def read_config(filename):
     # ensure args are renamed correctly:
     config_dict = {k.replace('-', '_'): v for k, v in config_dict.items()}
 
-    # YAML will parse datestrings as datetimes; we'll convert them here if they
-    # exist
-    if config_dict.get("to_date") is not None:
-        config_dict["to_date"] = str(config_dict["to_date"])
-    if config_dict.get("from_date") is not None:
-        config_dict["from_date"] = str(config_dict["from_date"])
+    # YAML will parse datestrings as datetimes; we'll convert them here if they exist.
+
+    if config_dict.get("start_time") is not None:
+        config_dict["start_time"] = str(config_dict["start_time"])
+    if config_dict.get("end_time") is not None:
+        config_dict["end_time"] = str(config_dict["end_time"])
 
     return config_dict
diff --git a/setup.py b/setup.py
index 5831758..5c0887d 100644
--- a/setup.py
+++ b/setup.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# Copyright 2018 Twitter, Inc.
+# Copyright 2020 Twitter, Inc.
 # Licensed under the MIT License
 # https://opensource.org/licenses/MIT
 import re
@@ -22,16 +22,16 @@ def parse_version(str_):
                      if line.startswith("VERSION")][0].strip()
 VERSION = parse_version(_version_line)
 
-setup(name='searchtweets',
-      description="Wrapper for Twitter's Premium and Enterprise search APIs",
+setup(name='searchtweets-v2',
+      description="Wrapper for the Twitter API v2 search endpoints.",
       url='https://github.com/twitterdev/search-tweets-python',
-      author='Fiona Pigott, Jeff Kolb, Josh Montague, Aaron Gonzales',
+      author='Fiona Pigott, Jeff Kolb, Josh Montague, Aaron Gonzales, Jim Moffitt',
       long_description=open('README.rst', 'r', encoding="utf-8").read(),
-      author_email='agonzales@twitter.com',
+      author_email='dev-support@twitter.com',
      license='MIT',
      version=VERSION,
      python_requires='>=3.3',
-      install_requires=["requests", "tweet_parser", "pyyaml"],
+      install_requires=["requests", "pyyaml", "python-dateutil"],
      packages=find_packages(),
-      scripts=["tools/search_tweets.py"],
+      scripts=["scripts/search_tweets.py", "scripts/poll_tweets.py"],
      )
diff --git a/tools/api_config_example.config b/tools/api_config_example.config
deleted file mode 100644
index 230d731..0000000
--- a/tools/api_config_example.config
+++ /dev/null
@@ -1,13 +0,0 @@
-[search_rules]
-from_date = 2017-06-01
-to_date = 2017-09-01
-pt_rule = beyonce has:geo
-
-[search_params]
-results_per_call = 500
-max_results = 500
-
-[output_params]
-save_file = True
-filename_prefix = beyonce
-results_per_file = 10000000
diff --git a/tools/api_yaml_example.yaml b/tools/api_yaml_example.yaml
deleted file mode 100644
index d1bf9e6..0000000
--- a/tools/api_yaml_example.yaml
+++ /dev/null
@@ -1,13 +0,0 @@
-search_rules:
-    from-date: 2017-06-01
-    to-date: 2017-09-01 01:01
-    pt-rule: kanye
-
-search_params:
-    results-per-call: 500
-    max-results: 500
-
-output_params:
-    save_file: True
-    filename_prefix: kanye
-    results_per_file: 10000000
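The two example configuration files deleted above (``tools/api_config_example.config`` and ``tools/api_yaml_example.yaml``) still used the premium-era ``pt_rule``/``max_results`` names, and no v2 replacement appears in these hunks. For illustration only, an equivalent v2-style file could follow the parameter names shown in the updated ``read_config`` docstring (``query``, ``results-per-call``, ``max-tweets``) and the ``start_time``/``end_time`` handling added to ``read_config``; the file name, location, and exact dates below are assumptions, not part of this change.

.. code:: yaml

    # Hypothetical v2-style replacement for tools/api_yaml_example.yaml
    search_rules:
        start-time: 2021-01-01      # replaces the old from-date
        end-time: 2021-01-07        # replaces the old to-date
        query: snow has:media -is:retweet

    search_params:
        results-per-call: 100
        max-tweets: 500

    output_params:
        save_file: True
        filename_prefix: snow
        results_per_file: 10000000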
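To show how the renamed arguments fit together, here is a minimal usage sketch. It assumes ``gen_request_parameters`` accepts the query string plus ``results_per_call`` (only the function's name appears in the hunks above), that ``ResultStream`` exposes an ``output_format`` keyword matching the ``"r"``/``"a"``/``"m"`` dispatch table in ``formatted_output``, and that ``bearer_token`` and ``endpoint`` are supplied via the ``result_stream_args`` dict; none of these signatures are spelled out in this diff.

.. code:: python

    # Minimal usage sketch; argument names follow the renamed v2 parameters
    # (request_parameters, max_tweets). output_format="a" is an assumption:
    # it would select the "atomic" formatter that merges expansions into each Tweet.
    from searchtweets import ResultStream, collect_results, gen_request_parameters

    search_args = {"bearer_token": "<YOUR_BEARER_TOKEN>",
                   "endpoint": "https://api.twitter.com/2/tweets/search/recent"}

    # Assumed signature: query string first, page size via results_per_call.
    query = gen_request_parameters("snow has:media -is:retweet", results_per_call=100)

    stream = ResultStream(request_parameters=query,
                          max_tweets=500,
                          output_format="a",
                          **search_args)
    tweets = list(stream.stream())

    # Or, without keeping the ResultStream object around:
    tweets = collect_results(query, max_tweets=500, result_stream_args=search_args)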