In part 5 we used JSON calls to retrieve tweets from pages other than the Twitter home page. In this post we’ll add a new structure to our application to make it easy for applications that are consuming our code to know whether more tweets exist and how to retrieve them.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
Getting started
If you followed along with part 5, just continue on with that code. If not and you’d rather jump right into part 6, you can use a clone of part 5 as a starting point.
Clone the Repo
If grabbing the code from GitHub instead of continuing along from part 5, the first step is to clone the repo.
Terminal
OK, you’ve either gotten the code from GitHub or are using the existing code you created in part 5. Let’s get to it!
A review of where we’re at
Let’s start by having a look at where we’re currently at. At the moment, a call into our application yields a plain list of Tweets.
Terminal
The main problem with this approach is that a consumer of our application has no indication of whether more Tweets exist; nor is it super easy for them to determine the value they should be passing into the start_after_tweet parameter. Remember, the start_after_tweet value needs to be the id of the last tweet retrieved, so a consuming application would need to keep track of this in order to grab subsequent pages of tweets. Not a big deal, but not very user friendly either.
So let’s think about how we can improve things.
Restructuring our return values
To remedy this, we’ll add a new struct that will contain our list of Tweets, along with some general information about the feed: namely, whether more tweets exist and what the id of the last tweet retrieved was.
Adding the new struct
Let’s add our new struct above the definition for our existing Tweet struct.
/lib/twitter_feed/structs.ex
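A minimal sketch of the new struct, given the fields we need (the default values here are my assumption):

```elixir
defmodule TwitterFeed.Feed do
  @moduledoc """
  Wraps a page of scraped tweets along with paging information.
  """
  defstruct last_tweet_retrieved: 0,
            more_tweets_exist: false,
            tweets: []
end
```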
Pretty straight-forward: we’ve added a new struct called Feed to structs.ex. This will supply us with the last tweet retrieved and whether more tweets exist, along with our list of tweets.
Making use of the new struct
The next step is to update parser.ex to make use of the new structure. We now need to determine and populate the last_tweet_retrieved and more_tweets_exist fields of our new struct. Let’s have a look at how we can go about accomplishing this.
We’re going to need to deal with parsing both the HTML and JSON responses; let’s start with the HTML response.
Parsing our new fields from the HTML response
We’ll add two new private methods in parser.ex to handle the new fields: parse_html_min_position and parse_html_has_more_items. Of course we want to add tests for these, so let’s start off with the tests for figuring out the min_position value.
/test/twitter_feed/parser_test.exs
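A sketch of what these tests might look like. To keep the sketch self-contained, a Regex-based stand-in for parse_html_min_position is defined inline; the real tests would call the function in parser.ex (exposed via Publicist), and the fixture markup here is illustrative only:

```elixir
ExUnit.start()

defmodule ParserMinPositionSketchTest do
  use ExUnit.Case

  # Stand-in for TwitterFeed.Parser.parse_html_min_position/1 so this
  # sketch runs on its own.
  defp parse_html_min_position(html) do
    case Regex.run(~r/data-min-position="([^"]*)"/, html) do
      [_, ""] -> 0
      [_, position] -> String.to_integer(position)
      nil -> 0
    end
  end

  test "returns the id of the last tweet from data-min-position" do
    html = ~s(<div id="timeline" data-min-position="123456789"></div>)
    assert parse_html_min_position(html) == 123_456_789
  end

  test "returns 0 when the page contains no tweets" do
    html = ~s(<div id="timeline" data-min-position=""></div>)
    assert parse_html_min_position(html) == 0
  end
end
```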
For the HTML response, the data-min-position attribute contains the id of the last tweet retrieved. We have also added a test for the situation where the scraper is hitting a Twitter page that has no tweets; in this case we expect a 0 to be returned for the id.
Next we need some tests for determining if more tweets exist:
/test/twitter_feed/parser_test.exs
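Again a self-contained sketch, with an inline stand-in for the real parse_html_has_more_items function (fixture markup is illustrative):

```elixir
ExUnit.start()

defmodule ParserHasMoreItemsSketchTest do
  use ExUnit.Case

  # Stand-in for TwitterFeed.Parser.parse_html_has_more_items/1; in the
  # real tests this comes from the Publicist-exposed private function.
  defp parse_html_has_more_items(html) do
    String.contains?(html, ~s(class="timeline-end has-more-items"))
  end

  test "more tweets exist when timeline-end has the has-more-items class" do
    html = ~s(<div class="timeline-end has-more-items"></div>)
    assert parse_html_has_more_items(html)
  end

  test "no more tweets exist when has-more-items is absent" do
    html = ~s(<div class="timeline-end"></div>)
    refute parse_html_has_more_items(html)
  end
end
```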
Looking at our tests, we can see that when the div with the class of timeline-end contains the has-more-items class, we have more tweets; when it is missing, we don’t.
Let’s add the implementation of our new methods. Note: make sure you place these new methods under the use Publicist directive so that they can be accessed in the tests.
/lib/twitter_feed/parser.ex
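A sketch of what the two helpers might look like, assuming the project parses HTML with Floki; the selectors are my guesses based on the markup described in the tests, and the functions are shown as public defs here so the sketch compiles standalone (in the real module they are private, below use Publicist):

```elixir
defmodule TwitterFeed.Parser do
  # Pull the data-min-position attribute; an empty or missing value means
  # no tweets were retrieved, so we return 0.
  def parse_html_min_position(html) do
    case Floki.attribute(html, "[data-min-position]", "data-min-position") do
      [] -> 0
      [""] -> 0
      [position | _] -> String.to_integer(position)
    end
  end

  # More tweets exist when the timeline-end div carries the
  # has-more-items class.
  def parse_html_has_more_items(html) do
    Floki.find(html, "div.timeline-end.has-more-items") != []
  end
end
```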
Pretty simple: with the parse_html_has_more_items function we simply check for the existence of the has-more-items class. In the case of parse_html_min_position we parse and convert the data-min-position value to an integer, while taking into account the scenario where it is empty (i.e. no tweets were retrieved).
Our tests should now be passing.
Terminal
Parsing our new fields from the JSON response
In the case of the JSON response there is very little to do; the response itself contains the information we’re after in the min_position and has_more_items fields:
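The relevant part of the JSON response looks roughly like this (the values are illustrative, and the items_html key name is an assumption on my part):

```json
{
  "min_position": "123456789",
  "has_more_items": true,
  "items_html": "...tweet markup..."
}
```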
The only thing we need to do is convert the min_position value to an integer, and return 0 in cases where it is missing.
So let’s add a few new tests for that.
/test/twitter_feed/parser_test.exs
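A sketch of the JSON-side tests. As before, a stand-in helper is inlined so the sketch runs on its own; it operates on the map that Poison produces from the response:

```elixir
ExUnit.start()

defmodule ParserJsonMinPositionSketchTest do
  use ExUnit.Case

  # Stand-in for the real helper in TwitterFeed.Parser.
  defp parse_json_min_position(%{"min_position" => nil}), do: 0
  defp parse_json_min_position(%{"min_position" => pos}), do: String.to_integer(pos)
  defp parse_json_min_position(_), do: 0

  test "converts the min_position value to an integer" do
    assert parse_json_min_position(%{"min_position" => "123456789"}) == 123_456_789
  end

  test "returns 0 when min_position is missing" do
    assert parse_json_min_position(%{}) == 0
  end
end
```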
And now the implementation.
/lib/twitter_feed/parser.ex
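The implementation might look roughly like this; the function name and clause structure are assumptions, but the behaviour matches what we described (convert when present, 0 when missing):

```elixir
defmodule TwitterFeed.Parser do
  # Convert min_position to an integer when it is a non-empty string.
  def parse_json_min_position(%{"min_position" => position})
      when is_binary(position) and position != "" do
    String.to_integer(position)
  end

  # Missing or empty min_position: no tweets were retrieved.
  def parse_json_min_position(_response), do: 0
end
```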
And with that we should have another 2 passing tests.
Terminal
Updating the main parser methods
Now it is time to update the main parse_tweets methods to make use of the new struct and parse methods. Let’s start by updating the HTML version of parse_tweets.
/lib/twitter_feed/parser.ex
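A self-contained sketch of the updated HTML parse_tweets. The tweet-list parsing is stubbed out (in the real module it is the same pipeline as before), and simple Regex/string stand-ins replace the HTML helpers so the sketch runs on its own:

```elixir
defmodule TwitterFeed.Feed do
  defstruct last_tweet_retrieved: 0, more_tweets_exist: false, tweets: []
end

defmodule TwitterFeed.ParserSketch do
  alias TwitterFeed.Feed

  # The updated parse_tweets now wraps everything in a Feed struct.
  def parse_tweets(html) do
    %Feed{
      last_tweet_retrieved: parse_html_min_position(html),
      more_tweets_exist: parse_html_has_more_items(html),
      tweets: parse_tweet_list(html)
    }
  end

  # Stand-in for the existing tweet-parsing pipeline from earlier parts.
  defp parse_tweet_list(_html), do: []

  defp parse_html_min_position(html) do
    case Regex.run(~r/data-min-position="([^"]*)"/, html) do
      [_, ""] -> 0
      [_, position] -> String.to_integer(position)
      _ -> 0
    end
  end

  defp parse_html_has_more_items(html) do
    String.contains?(html, "has-more-items")
  end
end
```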
First we’ve added a new alias to include the Feed structure, i.e. alias TwitterFeed.{ Feed, Tweet }. In the function itself we’re now returning our new structure, making use of the newly created functions that parse out the min position and more items values; the tweets list is populated in the same way as before.
Now calling into the method with iex yields the following:
Terminal
Let’s also update the JSON version of the method.
/lib/twitter_feed/parser.ex
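A sketch of the JSON variant. In the real parser the raw body is first decoded with Poison.decode!; this sketch starts from the already-decoded map so it runs without the dependency. The items_html key name is a guess, and tweet parsing is stubbed out:

```elixir
defmodule TwitterFeed.Feed do
  defstruct last_tweet_retrieved: 0, more_tweets_exist: false, tweets: []
end

defmodule TwitterFeed.JsonParserSketch do
  alias TwitterFeed.Feed

  # In the real code: response = Poison.decode!(json)
  def parse_tweets(%{} = response) do
    %Feed{
      last_tweet_retrieved: parse_json_min_position(response),
      more_tweets_exist: Map.get(response, "has_more_items", false),
      tweets: parse_tweet_list(Map.get(response, "items_html", ""))
    }
  end

  # Stand-in for the existing tweet-parsing pipeline.
  defp parse_tweet_list(_html), do: []

  defp parse_json_min_position(%{"min_position" => position})
       when is_binary(position) and position != "" do
    String.to_integer(position)
  end

  defp parse_json_min_position(_response), do: 0
end
```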
Pretty similar to what we’ve done with the HTML version. The only real difference is that we parse the JSON using Poison and then retrieve the tweet HTML, min position and has more items data from the results returned by Poison.
If we have a look at a JSON-based response from iex we’ll see our new struct.
Terminal
What about our scraper tests?
The last thing we need to deal with is our scraper tests; the return value change will be causing them to fail. The fix is very easy, however: we just need to take into account the new structure and, while we’re at it, add some additional checks for the last_tweet_retrieved and more_tweets_exist values.
In the updated test we’re changing the name of the local variable assigned to the Scraper.scrape result from tweets to feed, as the return value is now our new Feed structure. We then perform assertions against the new fields in our structure, i.e. last_tweet_retrieved and more_tweets_exist. For the existing statements where we were assigning the first and last tweets, we now do so via feed.tweets instead of directly. The full updated listing for scraper_test.exs is below.
/test/twitter_feed/scraper_test.exs
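The updated test might look roughly like this. A stub scrape function is inlined so the sketch runs on its own; the handle, ids, and asserted values are illustrative stand-ins, not the project’s real fixtures:

```elixir
ExUnit.start()

defmodule ScraperSketchTest do
  use ExUnit.Case

  # Stand-in for TwitterFeed.Scraper.scrape/2; the shape of the return
  # value (our new Feed-style structure) is what matters here.
  defp scrape(_handle, _start_after_tweet) do
    %{
      last_tweet_retrieved: 123,
      more_tweets_exist: true,
      tweets: [%{id: 122}, %{id: 123}]
    }
  end

  test "returns a feed with paging information" do
    feed = scrape("some_handle", 0)

    # New assertions against the paging fields.
    assert feed.last_tweet_retrieved == 123
    assert feed.more_tweets_exist == true

    # First and last tweets now come via feed.tweets.
    first_tweet = List.first(feed.tweets)
    last_tweet = List.last(feed.tweets)
    assert first_tweet.id == 122
    assert last_tweet.id == 123
  end
end
```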
And with that, our tests are passing once again.
Terminal
Summary
We now have a much more user friendly implementation. At this point we are very close to being done with this series of posts. Next time we’ll add some error handling, at which point our scraper will be done and dusted!
Thanks for reading and I hope you enjoyed the post!