At the end of Part 4 we finished parsing all the fields we want to extract from a Tweet. We’re now going to integrate our changes into the Scraper module and look at how to retrieve subsequent pages of Tweets.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
Getting started
If you followed along with part 4, just continue on with the code you created there. If not, and you'd rather jump right into part 5, you can use a clone of the part 4 code as a starting point.
Clone the Repo
If you're grabbing the code from GitHub instead of continuing on from part 4, the first step is to clone the repo.
Terminal
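Something like the following should do the trick (whether the repo provides a part 4 branch or tag to check out is not shown here, so that step is left as a comment):

```
$ git clone https://github.com/riebeekn/elixir-twitter-scraper.git
$ cd elixir-twitter-scraper
# if the repo tags a part 4 starting point, check it out here
$ mix deps.get
```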
OK, you’ve either gotten the code from GitHub or are using the existing code you created in Part 4, let’s get to it!
Back to the Scraper
We’ve mostly been working with the Parser
module lately; we’re going to switch gears and look at how we can integrate the changes we’ve made into the Scraper
module.
Starting with some testing
The first thing we’ll do is update our Scraper
tests. If you recall, we currently have pretty sparse testing of the Scraper
module and we’re currently not even running one of them.
Terminal
So let’s concentrate on that skipped test for now. We’re going to replace the test altogether as currently it really doesn’t do much:
Instead we will create a test based on the test data we placed in test/data
back in part 3, let’s get to it!
/test/twitter_feed/scraper_test.exs
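A minimal sketch of what the test might look like (the test module name, the mock wiring, and the shape of the returned tweet structs are assumptions; the expected id comes straight from the test data):

```elixir
defmodule TwitterFeed.ScraperTest do
  use ExUnit.Case

  alias TwitterFeed.Scraper

  test "scraping the first page of tweets" do
    # 0 as the second argument means "first page"; the handle is
    # ignored by our mock, so any value will do
    tweets = Scraper.scrape("lolagil", 0)

    # Twitter hands back 20 tweets per page; the id of the last tweet
    # comes from test/data/twitter.html
    assert Enum.count(tweets) == 20
    assert List.last(tweets).id == 948759471467118592
  end
end
```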
So based on the data in test/data/twitter.html we can reliably set up our test to expect certain values… and this is exactly what we are doing in the "scraping the first page of tweets" test.
Let's have a look at the call to Scraper.scrape.
The second parameter, with a value of 0, indicates we want to process the first page of tweets. The first parameter can be any value, as we ignore the handle in our mock:
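For example, a sketch of the relevant mock function (the full mock appears later in this post; the response shape is an assumption):

```elixir
# The underscore-prefixed argument means the handle is ignored; the
# mock always returns the saved test/data/twitter.html content
def get_home_page(_handle) do
  {:ok,
   %HTTPoison.Response{
     status_code: 200,
     body: File.read!("test/data/twitter.html")
   }}
end
```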
We now have 17 tests, with nothing being skipped.
Terminal
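Running the suite (output trimmed down to the summary line):

```
$ mix test
...
17 tests, 0 failures
```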
Scraping the 2nd page of Tweets
So this works great, but what if we want to retrieve the 2nd page of tweets? Let's set up a test for this and see what happens. We'll create a test similar to the one for the first page; once again we'll need to set up our mock, but this time it will return the JSON result.
Adding some test data
Based on the data in test/data/twitter.html, the last tweet we retrieved had an id of 948759471467118592. So our JSON request for the second page of tweets is going to be: https://twitter.com/i/profiles/show/lolagil/timeline/tweets?include_available_features=1&include_entities=1&max_position=948759471467118592&reset_error_state=false
You can run this in your browser to get the JSON response and move it into the test/data folder, or just download the response I've previously retrieved and saved on GitHub:
Terminal
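Either way works; for example, pulling the response down with curl:

```
$ curl "https://twitter.com/i/profiles/show/lolagil/timeline/tweets?include_available_features=1&include_entities=1&max_position=948759471467118592&reset_error_state=false" -o test/data/twitter.json
```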
Adding the JSON call
First off let’s update our API to include a new call to grab the JSON response.
/lib/twitter_feed/twitter_api/api.ex
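A sketch of the updated behaviour (the module name follows the file path; the typespecs and the existing get_home_page callback's signature are assumptions):

```elixir
defmodule TwitterFeed.TwitterApi.Api do
  @moduledoc "Behaviour implemented by both the real HTTP client and the test mock."

  # Existing callback from earlier parts: fetches a profile's home page HTML
  @callback get_home_page(handle :: String.t()) :: {:ok, HTTPoison.Response.t()}

  # New callback: fetches the next 20 tweets as JSON, starting after
  # the given tweet id
  @callback get_tweets(handle :: String.t(), last_tweet_id :: integer) ::
              {:ok, HTTPoison.Response.t()}
end
```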
We’ve added a get_tweets
method that will be used to get the next 20 tweets (recall Twitter sends us back 20 tweets at a time) based on the last tweet we’ve previously retrieved.
We’ll update the concrete implementation.
/lib/twitter_feed/twitter_api/http_client.ex
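A sketch of the new function (the UrlBuilder function name is an assumption; the existing get_home_page implementation is elided):

```elixir
defmodule TwitterFeed.TwitterApi.HttpClient do
  @behaviour TwitterFeed.TwitterApi.Api

  alias TwitterFeed.TwitterApi.UrlBuilder

  # ...existing get_home_page/1 implementation...

  @impl true
  def get_tweets(handle, last_tweet_id) do
    # Build the JSON-specific URL and let HTTPoison make the request
    handle
    |> UrlBuilder.build_tweets_url(last_tweet_id)
    |> HTTPoison.get()
  end
end
```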
Pretty simple, we just call into the UrlBuilder to build the JSON-specific URL and then use HTTPoison to make the request.
Our mock is also straightforward. We just change the extension of the file we're loading depending on whether we want the JSON or HTML file.
/test/mocks/twitter_api_mock.ex
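A sketch of the mock (the module name and the response shape are assumptions):

```elixir
defmodule TwitterFeed.TwitterApiMock do
  @behaviour TwitterFeed.TwitterApi.Api

  @impl true
  def get_home_page(_handle), do: read_data_file("html")

  @impl true
  def get_tweets(_handle, _last_tweet_id), do: read_data_file("json")

  # Shared file loading, pulled into a private function so the two
  # callbacks don't duplicate the File handling
  defp read_data_file(extension) do
    body = File.read!("test/data/twitter.#{extension}")
    {:ok, %HTTPoison.Response{status_code: 200, body: body}}
  end
end
```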
We’ve also done a bit of refactoring; moving the File consumption code to a private function to reduce code duplication.
Updating the Scraper
Now we just need to update our Scraper code to handle grabbing the second page of tweets. We're going to start by changing our scrape function to explicitly match our scenarios. So let's update the current scrape function.
/lib/twitter_feed/scraper.ex
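A sketch of the updated clause (how the API module is injected is an assumption; the parsing call matches what we had coming out of part 4):

```elixir
# @twitter_api is assumed to resolve to the HTTP client (or the mock
# in tests) via application config
def scrape(handle, 0 = _start_after_tweet) do
  # Matching on 0 for start_after_tweet means "give me the first page",
  # so we fetch the profile's home page HTML
  {:ok, response} = @twitter_api.get_home_page(handle)
  Parser.parse_tweets(response.body)
end
```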
We are now explicitly pattern matching on the scenario where we are requesting the first page of tweets (as 0 is being passed in as the start_after_tweet value). In this case we know we're looking to retrieve the first page, and thus call into get_home_page.
Let’s add a 2nd scrape
method that we can use for grabbing subsequent pages. For now we won’t worry about parsing the response, we’ll just return the body of the response.
/lib/twitter_feed/scraper.ex
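And a sketch of the second clause:

```elixir
# Any non-zero start_after_tweet means a subsequent page, which comes
# back as JSON; for now we simply return the raw response body
def scrape(handle, start_after_tweet) do
  {:ok, response} = @twitter_api.get_tweets(handle, start_after_tweet)
  response.body
end
```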
Pretty simple: when the 2nd parameter is non-zero we know we want to execute a JSON request, so we call into the new API function, get_tweets, to do so.
If we try this in iex we'll see a JSON response.
Terminal
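For example (the module name and exact output are assumptions; the body is truncated):

```
$ iex -S mix
iex> TwitterFeed.Scraper.scrape("lolagil", 948759471467118592)
"{...\"items_html\":\"...\"}"
```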
Looks good, we’re retrieving the JSON as expected.
Next we’re going to want to actually parse the JSON response, similar to what we do with the HTML response… so let’s start by adding a new test.
/test/twitter_feed/scraper_test.exs
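A sketch of the new test (the test name and the specific assertions are assumptions; the expected values should come from test/data/twitter.json):

```elixir
test "scraping the second page of tweets" do
  # A non-zero second argument triggers the JSON path, and the mock
  # returns test/data/twitter.json
  tweets = Scraper.scrape("lolagil", 948759471467118592)

  # 20 tweets per page, as with the HTML response
  assert Enum.count(tweets) == 20
end
```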
The values in the test are based on the test/data/twitter.json data.
If we run the test, it will of course currently fail.
Terminal
In order to get this test passing we're going to need to parse the JSON content, and we'll use a third-party library, Poison, to do so.
Let’s add it to mix.exs
and run mix deps.get
to update the dependencies.
/mix.exs
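Something like the following (the version constraint is an assumption; existing entries are elided):

```elixir
defp deps do
  [
    # ...existing dependencies (HTTPoison, etc.)...
    {:poison, "~> 3.1"}
  ]
end
```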
Terminal
With that out of the way, let's update our Scraper code to call into the Parser.
/lib/twitter_feed/scraper.ex
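A sketch of both clauses after the change (the argument order to parse_tweets is an assumption):

```elixir
def scrape(handle, 0 = _start_after_tweet) do
  {:ok, response} = @twitter_api.get_home_page(handle)
  # The first page comes back as HTML
  Parser.parse_tweets(response.body, :html)
end

def scrape(handle, start_after_tweet) do
  {:ok, response} = @twitter_api.get_tweets(handle, start_after_tweet)
  # Subsequent pages come back as JSON
  Parser.parse_tweets(response.body, :json)
end
```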
So both implementations of scrape now call into the Parser module, and similar to what we did with our mock, we pass in an indication of whether the data is in HTML or JSON format, i.e. Parser.parse_tweets(:html) for our first scrape function and Parser.parse_tweets(:json) for the second.
Now let’s update the Parser
. We’ll add a second parse
function that will handle the JSON response. Also our first function will change slightly to take in an :html
atom indicating we are parsing an HTML response.
/lib/twitter_feed/parser.ex
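A sketch of the two clauses (the HTML clause reflects the part 4 code; Floki, the CSS selector, and the parse_tweet/1 helper name are assumptions):

```elixir
# HTML responses: unchanged from part 4 apart from the new :html atom
def parse_tweets(html, :html) do
  html
  |> Floki.find(".tweet")
  |> Enum.map(&parse_tweet/1)
end

# JSON responses: the tweets' HTML lives in the items_html field, so
# decode with Poison, pluck the field out, and re-use the HTML path
def parse_tweets(json, :json) do
  json
  |> Poison.decode!()
  |> Map.get("items_html")
  |> parse_tweets(:html)
end
```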
Our parse functions take in :html or :json to differentiate them. The format of the JSON response is very similar to the HTML response; the big difference is that the HTML for the tweets is contained in the items_html field of the response. So all we need to do is parse out that field and then pass it along to our existing parse_tweet function.
We should now have a passing test.
Terminal
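Running the suite again (assuming the new test brings the count to 18):

```
$ mix test
...
18 tests, 0 failures
```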
If we go into iex we'll see our second page of tweets being parsed.
Terminal
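For example (the module and struct names are assumptions; output truncated):

```
$ iex -S mix
iex> TwitterFeed.Scraper.scrape("lolagil", 948759471467118592)
[%TwitterFeed.Tweet{...}, ...]
```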
Summary
So with that, we’ve figured out how to grab subsequent pages of tweets. Next time we’ll update the information we are returning to make it easy for applications that are consuming our code to know whether more tweets exist and how to get them.
Thanks for reading and I hope you enjoyed the post!