At the end of Part 4 we finished parsing all the fields we want to extract from a Tweet. We’re now going to integrate our changes into the Scraper module and look at how to retrieve subsequent pages of Tweets.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
If you followed along with Part 4, just continue with the code you created there. If you'd rather jump right into Part 5, you can use a clone of Part 4 as a starting point.
Clone the Repo
If grabbing the code from GitHub instead of continuing along from part 4, the first step is to clone the repo.
OK, whether you've cloned the code from GitHub or are continuing with your existing Part 4 code, let's get to it!
Back to the Scraper
We’ve mostly been working with the
Parser module lately; now we’re going to switch gears and look at how we can integrate the changes we’ve made into the Scraper module.
Starting with some testing
The first thing we’ll do is update our
Scraper tests. If you recall, we currently have pretty sparse test coverage of the
Scraper module, and one of the existing tests is being skipped.
So let’s concentrate on that skipped test. We’re going to replace it altogether, as it currently doesn’t do much:
Instead we will create a test based on the test data we placed in
test/data back in Part 3.
Based on the data in
test/data/twitter.html, we can reliably set up our test to expect certain values, and this is exactly what we’re doing in the
"scraping the first page of tweets" test.
Let’s have a look at the call to
Scraper.scrape. The second parameter, with a value of
0, indicates we want to process the first page of tweets. The first parameter can be any value, as we ignore the handle in our mock:
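As a rough sketch (the module names and expected values here are assumptions; the real expectations come from the data in test/data/twitter.html), the test looks something like:

```elixir
# Sketch of the first-page test; treat the expected count and any field
# values as placeholders derived from test/data/twitter.html.
test "scraping the first page of tweets" do
  # 0 as the second argument means "start from the first page";
  # the handle is ignored by the mock, so any value works
  tweets = Scraper.scrape("any_handle", 0)

  # Twitter returns 20 tweets per page
  assert Enum.count(tweets) == 20
end
```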
We now have 17 tests, with none being skipped.
Scraping the 2nd page of Tweets
This works great, but what if we want to retrieve the 2nd page of tweets? Let’s set up a test and see what happens. The test will be similar to the one for the first page; once again we’ll need to set up our mock, but this time we’ll retrieve the JSON result.
Adding some test data
Based on the data in
test/data/twitter.html the last tweet we retrieved had an id of
948759471467118592. So our JSON request for the second page of tweets is going to be: https://twitter.com/i/profiles/show/lolagil/timeline/tweets?include_available_features=1&include_entities=1&max_position=948759471467118592&reset_error_state=false
You can run this in your browser to get the JSON response and move it into the
test/data folder, or just download the response I’ve previously retrieved and saved on GitHub:
Adding the JSON call
First off let’s update our API to include a new call to grab the JSON response.
We’ve added a
get_tweets method that will be used to get the next 20 tweets (recall Twitter sends us back 20 tweets at a time) based on the last tweet we’ve previously retrieved.
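Assuming the API is defined as a behaviour, as in earlier parts (the module and callback names here are a sketch, not the article's exact code), the addition might look like:

```elixir
defmodule TwitterFeed.TwitterApi do
  @moduledoc """
  Behaviour for retrieving data from Twitter.
  Module and existing callback names are assumptions based on the article.
  """

  # existing callback: fetch the HTML timeline page for a handle
  @callback get_home_page(handle :: String.t()) :: String.t()

  # new callback: fetch the JSON payload containing the next 20 tweets,
  # starting after the given tweet id
  @callback get_tweets(handle :: String.t(), last_tweet_id :: integer) :: String.t()
end
```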
We’ll update the concrete implementation.
Pretty simple, we just call into the
UrlBuilder to build the JSON specific URL and then use
HTTPoison to make the request.
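A minimal sketch of what that implementation could look like (the module name and the UrlBuilder.build_json_url/2 helper name are assumptions based on the article's description):

```elixir
defmodule TwitterFeed.HttpTwitterApi do
  @behaviour TwitterFeed.TwitterApi

  # Build the JSON-specific URL, make the request, and return the body.
  def get_tweets(handle, last_tweet_id) do
    url = UrlBuilder.build_json_url(handle, last_tweet_id)
    %HTTPoison.Response{body: body} = HTTPoison.get!(url)
    body
  end
end
```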
Our mock is also straightforward: we just change the extension of the file we’re loading depending on whether we want the JSON or the HTML file.
We’ve also done a bit of refactoring, moving the file-reading code into a private function to reduce duplication.
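The mock could be sketched like this (module and function names are assumptions; the key idea is the shared private helper that varies only the file extension):

```elixir
defmodule TwitterFeed.MockTwitterApi do
  @behaviour TwitterFeed.TwitterApi

  # first page: serve the canned HTML fixture
  def get_home_page(_handle), do: read_data_file("html")

  # subsequent pages: serve the canned JSON fixture
  def get_tweets(_handle, _last_tweet_id), do: read_data_file("json")

  # shared file-reading logic, pulled out to avoid duplication
  defp read_data_file(extension) do
    File.read!("test/data/twitter.#{extension}")
  end
end
```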
Updating the Scraper
Now we just need to update our
Scraper code to handle grabbing the second page of tweets. We’re going to start by changing our
scrape method to explicitly match our scenarios, so let’s update the current implementation:
We are now explicitly pattern matching on the scenario where we are requesting the first page of tweets (as
0 is being passed in as the
start_after_tweet value). In this case we know we are looking to retrieve the first page of tweets, and thus call into the existing API method that retrieves the HTML.
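A sketch of that first clause (the @twitter_api module attribute and the Parser call are assumptions carried over from earlier parts of the series):

```elixir
# Matching 0 explicitly means this clause only runs when the caller
# wants the very first page of tweets.
def scrape(handle, 0 = _start_after_tweet) do
  handle
  |> @twitter_api.get_home_page()
  |> Parser.parse_tweets()
end
```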
Let’s add a 2nd
scrape method that we can use for grabbing subsequent pages. For now we won’t worry about parsing the response, we’ll just return the body of the response.
Pretty simple, when the 2nd parameter is non-zero we know we want to execute a JSON request so we call into the new API method,
get_tweets, to do so.
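Something along these lines (a sketch; for now the raw response body is returned unparsed):

```elixir
# Any non-zero start_after_tweet value falls through to this clause,
# which issues the JSON request via the new API method.
def scrape(handle, start_after_tweet) do
  @twitter_api.get_tweets(handle, start_after_tweet)
end
```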
If we try this in
iex we’ll see a JSON response.
Looks good, we’re retrieving the JSON as expected.
Next we’re going to want to actually parse the JSON response, similar to what we do with the HTML response… so let’s start by adding a new test.
The values in the test are based on the JSON data we saved to test/data.
If we run the test, it will of course currently fail.
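As a sketch (the expected count is a placeholder; real expectations would come from the saved JSON fixture), the test might look like:

```elixir
# Sketch of the second-page test, using the id of the last tweet from
# the first page as the starting point.
test "scraping the second page of tweets" do
  tweets = Scraper.scrape("any_handle", 948759471467118592)

  assert Enum.count(tweets) == 20
end
```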
In order to get this test passing we’re going to need to parse the JSON content, and we’ll use a third-party library, Poison, to do so.
Let’s add it to
mix.exs and run
mix deps.get to update the dependencies.
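The dependency list in mix.exs would look something like this (the version numbers are examples; use whatever versions are current):

```elixir
# In mix.exs; adds Poison alongside the existing HTTPoison dependency.
defp deps do
  [
    {:httpoison, "~> 1.0"},
    {:poison, "~> 3.1"}
  ]
end
```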
With that out of the way, let’s update our
Scraper methods to call into the Parser.
So both implementations of
scrape now call into the
Parser module, and similar to what we did with our mock, we pass in an indication of whether the data is in HTML or JSON format, i.e.
Parser.parse_tweets(:html) for our first
scrape function and
Parser.parse_tweets(:json) for the second.
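Sketched out, the two clauses end up along these lines (the @twitter_api attribute is an assumption from earlier parts; the format atoms match the article's description):

```elixir
# First page: fetch the HTML and tell the Parser it's HTML.
def scrape(handle, 0 = _start_after_tweet) do
  handle
  |> @twitter_api.get_home_page()
  |> Parser.parse_tweets(:html)
end

# Subsequent pages: fetch the JSON and tell the Parser it's JSON.
def scrape(handle, start_after_tweet) do
  handle
  |> @twitter_api.get_tweets(start_after_tweet)
  |> Parser.parse_tweets(:json)
end
```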
Now let’s update the
Parser. We’ll add a second
parse function that will handle the JSON response. Also our first function will change slightly to take in an
:html atom indicating we are parsing an HTML response.
Our parse methods take in
:html or :json to differentiate them. The format of the JSON response is very similar to the HTML response, the big difference being that the HTML for the tweets is contained in the
items_html field of the response. So all we need to do is parse out that field and then pass it along to our existing HTML parsing code.
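The JSON clause can therefore be sketched as a thin wrapper over the HTML one (function names follow the article's parse_tweets naming; the exact signature is an assumption):

```elixir
# Decode the JSON body with Poison, pull out the items_html field,
# then re-use the existing HTML parsing via the :html clause.
def parse_tweets(json, :json) do
  json
  |> Poison.decode!()
  |> Map.get("items_html")
  |> parse_tweets(:html)
end
```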
We should now have a passing test.
If we go into
iex we’ll see our second page of tweets being parsed.
So with that, we’ve figured out how to grab subsequent pages of tweets. Next time we’ll update the information we are returning to make it easy for applications that are consuming our code to know whether more tweets exist and how to get them.
Thanks for reading and I hope you enjoyed the post!