In Part 1 we created our basic project structure and added code to build the URLs to retrieve data from Twitter. In this post we’ll get started on actually doing the data retrieval.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
If you followed along with part 1 just continue on with the code you created in part 1. If not and you’d rather jump right into part 2, you can use a clone of part 1 as a starting point.
Clone the Repo
If grabbing the code from GitHub instead of continuing along from part 1, the first step is to clone the repo.
What we’re starting with
OK, you’ve either gotten the code from GitHub or are using the existing code you created in Part 1. Let’s see where we’re starting from.
Once iex loads, we can run the get_tweets function that we created in part 1.
As of now it just returns an
Let’s figure out how to retrieve some actual data.
For now we’re going to concentrate on retrieving the first page of tweets for a particular user. If you recall, this involves going to the main Twitter page for the user; for now we won’t worry about making the JSON requests needed for grabbing subsequent pages of data. We’ll be using HTTPoison to retrieve our data, so let’s add the package to our project.
Update the deps section of mix.exs to include HTTPoison:
And now we need to update our dependencies:
Sweet, we’re good to go!
Retrieving the first page of Tweets
We’ll create a new module in our twitter_api directory for handling HTTP requests; we’ll call it HttpClient.
Nothing complicated going on here: we’re using the UrlBuilder module we created earlier to build the appropriate URL, and then passing that off to HTTPoison.
Next we update scraper.ex to make use of our new module.
Notice we’ve removed the underscore from the handle parameter and we’re passing it into the HttpClient call that we’ve added.
Start up iex again (we need to restart it since we added a new dependency) and let’s see if we are retrieving data.
If you look carefully at the response we get back, you’ll notice there are two pieces of information being returned to us:
- An :ok status.
- A response structure that itself contains a number of elements:
  - A body section
  - A headers section
  - A request_url and status_code
With this in mind, let’s update our scraper code slightly.
We’re using a case statement and some pattern matching to return the body on a 200 response; in the case of a 404 response we return an error.
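As a rough sketch of the pattern matching described above (not the code from the repo), the function below takes the kind of tuple an HTTP client returns and branches on status_code. The module name ResponseSketch, the function name extract_body, and the tagged-tuple return values are all assumptions made for illustration. Plain maps stand in for the HTTPoison response struct so the snippet runs without any dependencies.

```elixir
defmodule ResponseSketch do
  # Hypothetical helper that mirrors the case statement described in the post.
  def extract_body(result) do
    case result do
      # 200: hand back the page body
      {:ok, %{status_code: 200, body: body}} ->
        {:ok, body}

      # 404: the handle does not exist (this error shape is an assumption)
      {:ok, %{status_code: 404}} ->
        {:error, "404 error, that handle does not exist"}
    end
  end
end

# Plain maps stand in for the real response struct here.
IO.inspect(ResponseSketch.extract_body({:ok, %{status_code: 200, body: "<html>...</html>"}}))
IO.inspect(ResponseSketch.extract_body({:ok, %{status_code: 404, body: ""}}))
```

Note that any other status code would raise a CaseClauseError in this sketch, since only the two cases discussed so far are matched.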
Let’s recompile and run our function again in iex.
We now have our body returning in the case where the handle is found:
And as expected we get an error back with an invalid handle:
What about testing?
Running things from iex is all well and good, but it isn’t the way we want to build out our code going forward. We want to be able to validate the functionality of our code via tests, not by typing things into the console.
We’re pretty much OK not testing the HttpClient module, as it doesn’t really do anything other than call off into HTTPoison. The Scraper module, however, already has some logic in it and will continue to grow, so let’s create some tests for it.
So we’re testing both valid and invalid handles, checking the reason field when an invalid handle is passed in, and the body when a valid handle is passed in.
If you run mix test our tests pass…
We have a few issues with our Scraper tests, however.
- They rely on an external resource (Twitter).
- Our valid handle test is not really testing for much of anything.
Let’s see how we can remove our dependence on Twitter when it comes to our testing. We’ll deal with creating a better valid handle test in a future post.
Mocking out our Twitter interaction
José Valim wrote a great article on using mocks in Elixir; it even happens to use Twitter as an example! We’ll use the technique explained in José’s article in our code.
We’ll be creating a test-specific implementation of HttpClient that returns known values. Depending on whether we are running under a test or non-test environment, the Scraper module will either call into the test or concrete implementation of our code.
The process for getting this all set up is:
- Create an API that both our concrete and test implementations will implement; we’ll use an Elixir behaviour.
- Update our existing HttpClient module to make use of the API.
- Create a test implementation of the API.
- Add some configuration that will direct us to the test or concrete version of HttpClient.
- Update Scraper to use the API instead of directly calling into HttpClient.
Yikes! I know that sounds a little involved, but it isn’t too bad and will be worth it in the long run, so let’s get going!
The first step is to define an API that HttpClient and our mock will implement.
All this does is indicate the function(s) we want to define in both our concrete and mocked implementation. For now, the only function we need is get_home_page.
Next we update HttpClient to make use of the API.
Simple! We just need to add a behaviour line to HttpClient. This indicates it implements the behaviour specified in our API, and since we already have a get_home_page function in HttpClient, we’re all good to go.
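To make the behaviour mechanism a little more concrete, here is a minimal sketch of a behaviour and a module that adopts it. The module names ApiSketch and ConcreteClientSketch and the callback’s return type are assumptions for illustration, not the names used in the repo. The stand-in client returns a canned response rather than calling HTTPoison, so the snippet has no dependencies.

```elixir
defmodule ApiSketch do
  # The contract: every implementation must export get_home_page/1.
  # The return type is an assumption, written loosely as a tagged tuple.
  @callback get_home_page(handle :: String.t()) :: {:ok, map()} | {:error, term()}
end

defmodule ConcreteClientSketch do
  # Declares that this module implements the ApiSketch contract; the
  # compiler warns if a required callback is missing.
  @behaviour ApiSketch

  @impl ApiSketch
  def get_home_page(handle) do
    # The real HttpClient would build a URL and call HTTPoison here.
    # This stand-in returns a canned response instead.
    {:ok, %{status_code: 200, body: "home page for #{handle}"}}
  end
end

IO.inspect(ConcreteClientSketch.get_home_page("some_handle"))
```

The behaviour itself contains no logic; it only names the functions that both the real client and the mock have to provide.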
Next we need to create our Mock. Let’s place it in a specific directory we’ll use for mocks.
So our mock also uses the behaviour we defined in the API, but this time we are explicitly setting the return values of our get_home_page function. A call to get_home_page with a handle of :non_existant_handle is going to return the 404 status that we get from Twitter when the real version of HttpClient attempts to grab data for an invalid handle. For any other handle we’re going to return a status of 200 and some body text.
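A mock along the lines described above might look something like the following sketch. The module names MockApiSketch and HttpClientMockSketch, the map-based responses, and the body text are assumptions; only the :non_existant_handle atom and the 404/200 split come from the description.

```elixir
defmodule MockApiSketch do
  # Same idea as the API behaviour: one required function.
  @callback get_home_page(handle :: atom() | String.t()) :: {:ok, map()}
end

defmodule HttpClientMockSketch do
  @behaviour MockApiSketch

  # The special handle simulates Twitter's reply for an unknown user.
  @impl MockApiSketch
  def get_home_page(:non_existant_handle) do
    {:ok, %{status_code: 404, body: ""}}
  end

  # Any other handle gets a 200 and fixed body text that tests can check for.
  def get_home_page(_handle) do
    {:ok, %{status_code: 200, body: "mock body text"}}
  end
end

IO.inspect(HttpClientMockSketch.get_home_page(:non_existant_handle))
IO.inspect(HttpClientMockSketch.get_home_page("anything"))
```

Because the clauses are matched top to bottom, the specific :non_existant_handle clause has to come before the catch-all clause.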
In order to use our mock we need to reference it in the
We now want to update Scraper to make use of our API, instead of directly calling into HttpClient.
This will require a bit of configuration set-up. We’ll be updating the existing configuration file as well as adding dev, prod and test configurations.
Let’s start by creating the new files.
Now we need to update the main config file so that it loads our newly created files based on the current environment (test, dev, or prod). Note: there are a lot of comments in the default config.exs file; feel free to remove them as I’ve done or keep them hanging around if you wish.
Now we’ll specify whether to use the real or mock version of HttpClient in our environment-specific configuration files.
dev and prod both use the concrete implementation; test uses the mock.
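As an illustration of what such an environment-specific setting could look like, here is a hedged config fragment. The application name :twitter_feed, the :twitter_api key, and the mock module name TwitterFeed.TwitterApi.HttpClientMock are assumptions; check them against your own project. The fragment uses import Config, which is available from Elixir 1.9 onward; older projects use `use Mix.Config` instead.

```elixir
# config/test.exs (sketch): tests talk to the mock.
# The app name, config key and mock module name are assumptions.
import Config

config :twitter_feed, :twitter_api, TwitterFeed.TwitterApi.HttpClientMock

# config/dev.exs and config/prod.exs would instead point at the real client:
# config :twitter_feed, :twitter_api, TwitterFeed.TwitterApi.HttpClient
```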
The next step is to update Scraper to take into account these configurations.
We’ve added a module attribute, @twitter_api, that gets assigned to our concrete or mock implementation of HttpClient based on the current environment we are running under. Notice we no longer need the HttpClient alias (i.e. alias TwitterFeed.TwitterApi.HttpClient), so we have removed it.
The only other change is the case statement, where instead of calling HttpClient.get_home_page we use the module attribute, i.e. @twitter_api.get_home_page.
In production and development we will use the concrete implementation; under test, the mock.
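The lookup through a module attribute can be sketched in a standalone script as follows. Everything named with a Sketch suffix is hypothetical, as are the :twitter_feed application name, the :twitter_api key, and the return values. Application.put_env stands in for what the environment config file would normally set.

```elixir
# Stand-in for the value the config file would provide
# (app name and key are assumptions).
Application.put_env(:twitter_feed, :twitter_api, FakeClientSketch)

defmodule FakeClientSketch do
  # Plays the role of the mock: always returns a canned 200 response.
  def get_home_page(_handle), do: {:ok, %{status_code: 200, body: "mock body text"}}
end

defmodule ScraperSketch do
  # Read once, when this module is compiled.
  @twitter_api Application.get_env(:twitter_feed, :twitter_api)

  def get_tweets(handle) do
    # Calls whichever module the attribute holds.
    case @twitter_api.get_home_page(handle) do
      {:ok, %{status_code: 200, body: body}} -> {:ok, body}
      {:ok, %{status_code: 404}} -> {:error, "404 error, that handle does not exist"}
    end
  end
end

IO.inspect(ScraperSketch.get_tweets("any_handle"))
```

Because the attribute is read at compile time, changing the configured module only takes effect after a recompile. Newer Elixir releases also offer Application.compile_env/3 for this kind of compile-time read and may warn when Application.get_env is used in a module body.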
Let’s update our Scraper tests and see if they work.
The invalid_handle test requires no changes; we’ve just updated the valid handle test. Since the mock ignores the handle value, we can pass in anything for the handle. Then we just need to check for the text explicitly being returned from the mock.
If we run our tests, all is good.
Yet running from iex we still get the real data.
We now have a consistent way of running tests against our Scraper module going forward and our unit test coverage is looking good.
The two modules we are running tests against, Scraper and UrlBuilder, both have 100% coverage.
Thanks for reading and I hope you enjoyed the post!