In Part 2 we figured out how to retrieve data from Twitter, and also worked on setting up some of our test infrastructure. In this post we’ll look at how to consume and format the data we’re retrieving.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
If you followed along with part 2, just continue with the code you created there. If not and you’d rather jump right into part 3, you can use a clone of the part 2 code as a starting point.
Clone the Repo
If grabbing the code from GitHub instead of continuing along from part 2, the first step is to clone the repo.
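Something along these lines should do the trick; note that the branch name used to check out the part 2 state of the code is a guess, so check the repo’s README for how the parts are actually organized:

```shell
git clone https://github.com/riebeekn/elixir-twitter-scraper.git
cd elixir-twitter-scraper
# check out the code as it stood at the end of part 2
# (the branch / tag name is an assumption; see the repo for specifics)
git checkout part-02
```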
OK, now that you’ve either grabbed the code from GitHub or are using your existing code from Part 2, let’s get parsing!
Determining what we need to parse
One thing we haven’t really addressed yet is what the data we pull from Twitter actually looks like. We’ll have a hard time figuring out how to parse the data if we don’t know what it looks like!
Getting a sample of our data
The easiest way to determine the format of our data is to navigate to a Twitter page, and then view and save the HTML source. For grabbing some test data, I navigated to https://twitter.com/lolagil, viewed the source and saved the contents. Since this Twitter page has likely changed since I downloaded it, I would suggest you download the copy I’ve saved as per the instructions below. This way we’ll be working off the same sample.
First let’s create our new project directory.
Then download the file.
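The commands might look something like the following; the directory name is an assumption, and the placeholder URL needs to be swapped for the actual location of the saved sample file:

```shell
# create a directory to hold the sample data (path is an assumption)
mkdir -p test/sample_data
# download the saved copy of the page; replace the placeholder
# with the URL of the sample file
curl -o test/sample_data/sample.html <url-of-saved-sample-file>
```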
Perfect, we now have some sample data to work with.
Updating the Twitter mock
If you recall from part 2, we’re not returning particularly realistic data from our mock.
Although we won’t be concentrating on the
Scraper tests for a bit, this is a good time to update our mock so it’ll be ready to go in the future.
To create a more realistic mock, we’ll return the contents of our sample data file.
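A sketch of what the updated mock might look like (the module name, return shape, and file path here are assumptions based on the surrounding text, not the exact code from the repo):

```elixir
defmodule TwitterFeed.TwitterApi.InMemory do
  @moduledoc false

  # Path to the saved sample page (adjust to wherever you stored the download).
  @sample_file "test/sample_data/sample.html"

  # Mimics the real client: returns the page HTML in the :body field.
  def get_home_page(_handle) do
    {:ok, %{status_code: 200, body: File.read!(@sample_file)}}
  end
end
```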
get_home_page we’re now reading our sample file and returning its contents via the
body return value. This is exactly what happens in the real implementation; the only difference is that the data is retrieved from Twitter rather than a local file.
This change will cause our
Scraper test to fail. For now, just to get it passing, we’ll loosen the assertion to check only that the response contains a snippet we expect from the sample data.
With that, our tests should all be passing.
Now that we have some sample data, let’s see if we can parse and format the data being returned. We’ll be looking to return a list of tweets when our public API is called.
Structuring our data
Let’s create a struct for the tweet data. We’ll place it in a general struct file, as we’ll need some more structs down the road. Also note that we are including comments for our struct: someone interacting with our application is likely to appreciate knowing what our struct contains, and this is something we’d want ExDoc to generate documentation for.
There are a number of fields that we plan to return. Everything is pretty self-explanatory; maybe the only tricky aspect is the handle_id field. The handle_id refers to the id of the current feed we are retrieving data from. For example, say we are looking at https://twitter.com/bobsFeed and Bob has an id of 1, and Bob has retweeted a tweet from https://twitter.com/sallysFeed, where Sally has an id of 2. In that case the handle_id would be 1, whereas the user_id would be 2, i.e. Sally’s id. This is because the user_id always refers to the user who created the original tweet. Similarly, the user_name refers to the original author of a tweet, so in our example these fields would also refer to Sally.
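As a sketch, the struct might look like the following, showing only the fields discussed here (the real file defines a few more, and the module name is an assumption):

```elixir
defmodule TwitterFeed.Tweet do
  @moduledoc """
  Represents a single tweet retrieved from a feed.
  """

  defstruct [
    # id of the feed (handle) the tweet was retrieved from
    handle_id: nil,
    # id of the user who authored the original tweet
    user_id: nil,
    # user name of the original author of the tweet
    user_name: nil,
    # display name of the original author of the tweet
    display_name: nil
  ]
end
```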
Hopefully that is clear as mud! The next step is to create a
Parser module that will be used to take the data we’re retrieving and populate our struct.
Let’s create a simple function that takes in the HTML we retrieve and returns an empty list… in the future we’ll update this to return a list of Tweet structs.
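A minimal first cut of the module might look like this (the file location and @moduledoc are up to you):

```elixir
defmodule TwitterFeed.Parser do
  @moduledoc false

  # Takes the raw HTML of a feed page; for now we just return an empty list.
  def parse_tweets(_html) do
    []
  end
end
```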
Next, we’ll update Scraper.scrape/3 to call into the Parser.
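We won’t reproduce the full scrape/3 from part 2 here, but the relevant change is roughly the following; the parameter names and the exact shape of the case statement are guesses, and the real implementation does more than is shown:

```elixir
# lib/twitter_feed/scraper.ex (sketch)
defmodule TwitterFeed.Scraper do
  alias TwitterFeed.Parser

  # scrape/3 as referenced in the text; parameter names are assumptions.
  def scrape(handle, _start_tweet, twitter_api) do
    case twitter_api.get_home_page(handle) do
      {:ok, response} ->
        # hand the raw page HTML off to the new Parser
        Parser.parse_tweets(response.body)

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```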
We’ve added an alias for
TwitterFeed.Parser, and updated the
:ok clause of our case statement to call into our new parse_tweets function. If we now try things out in
iex, we should get an empty list back.
Preparing to parse
So we’ve added a Tweet struct, created a
Parser module, and updated the
Scraper module to use the
Parser… we’re finally ready to get to the meat of our task: the actual parsing of tweets! We’ll be using a Hex package called Floki to help with our HTML parsing. Let’s add it to our mix.exs file.
And then update our project with our new dependency:
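In mix.exs that means a new entry in deps; the version constraint below is just an example, so check Hex for the current Floki release:

```elixir
# mix.exs
defp deps do
  [
    # ...existing dependencies from part 2...
    {:floki, "~> 0.20"}  # version is an assumption; use the latest from Hex
  ]
end
```

Then a mix deps.get pulls Floki down.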
Now we need to update the
Parser module to handle each Tweet we retrieve.
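The updated module might look roughly like this, assuming the struct lives at TwitterFeed.Tweet:

```elixir
defmodule TwitterFeed.Parser do
  @moduledoc false

  alias TwitterFeed.Tweet

  # Find every div with a class of .tweet and parse each one individually.
  def parse_tweets(html) do
    html
    |> Floki.find(".tweet")
    |> Enum.map(&parse_tweet/1)
  end

  # For now each tweet simply becomes an empty struct.
  defp parse_tweet(_tweet_html) do
    %Tweet{}
  end
end
```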
We can see
Floki in action here. If you look at our sample data file, you’ll notice that each
Tweet is contained in a
div with a class of
.tweet. So we’re using Floki to find each tweet div and then passing each div to
parse_tweet. For now the function just returns an empty struct.
Let’s see it in action:
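An iex session at this point might look something like the following (output summarized as a comment; the exact struct fields depend on your Tweet definition):

```elixir
iex> html = ~s(<div class="tweet"></div><div class="tweet"></div>)
iex> TwitterFeed.Parser.parse_tweets(html)
# => a list of two empty Tweet structs
```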
Perfect, for each tweet we’re returning an empty struct.
Note that our
ScraperTest is now going to be failing. We could update it so that it passes; however, we’re going to be changing the return value of our
Parser.parse_tweets function quite a bit over the next while, and it won’t be very efficient to keep updating the
Scraper test each time. So let’s ignore it for now. How can we do that?
Pretty simple: first update test/test_helper.exs.
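The change is a one-liner:

```elixir
# test/test_helper.exs
ExUnit.start(exclude: [:skip])
```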
We’ve added an
exclude parameter to
ExUnit.start. This is going to allow us to skip tests by adding a
@tag :skip directive as demonstrated below.
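For example, with a stand-in test (the real ScraperTest body is from part 2):

```elixir
defmodule TwitterFeed.ScraperTest do
  use ExUnit.Case

  # With exclude: [:skip] in test_helper.exs, this test won't be run.
  @tag :skip
  test "scrape returns a list of tweets" do
    # existing assertions from part 2 would go here
    assert true
  end
end
```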
With that, we can run our tests and avoid having a bunch of error output getting in our way.
Parsing the individual fields of a tweet
The last thing we’ll tackle today is figuring out how to parse the individual fields of each tweet we retrieve. For now we’ll just grab the display name of the tweet. The display name is different from the handle; for example, the handle lolagil has its own separate display name shown on the feed.
Being able to test the parsing of each individual field will be helpful, so let’s set up some testing for the Parser module.
First we’ll create the test file.
Now let’s create a test for the display name.
If we look in our sample data, we’ll see that the display name is an attribute within the
.tweet div. So let’s set up our test to reflect this.
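The test might look something like this; the data-name attribute is my assumption about where the display name lives in Twitter’s markup, so use whatever attribute your sample file actually contains:

```elixir
# test/parser_test.exs
defmodule TwitterFeed.ParserTest do
  use ExUnit.Case

  alias TwitterFeed.Parser

  test "parses the display name" do
    # a minimal snippet mirroring the sample data (attribute name assumed)
    [tweet] = Floki.find(~s(<div class="tweet" data-name="Lola"></div>), ".tweet")

    assert Parser.parse_display_name(tweet) == "Lola"
  end
end
```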
So we’re feeding a snippet of HTML into our
parser and expecting an appropriate result.
Let’s update the Parser module accordingly.
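In the Parser, the new function and the updated parse_tweet might look roughly like this (again, data-name is an assumed attribute):

```elixir
# in lib/twitter_feed/parser.ex (sketch)

# Populate the display_name field of the struct for each tweet.
defp parse_tweet(tweet_html) do
  %Tweet{display_name: parse_display_name(tweet_html)}
end

# Pull the display name out of the tweet div's data-name attribute.
defp parse_display_name(tweet_html) do
  tweet_html
  |> Floki.attribute("data-name")
  |> List.first()
end
```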
We’ve added a
parse_display_name function which, surprise surprise, parses the display name. We call into it from
parse_tweet in order to populate the
display_name value of our structure.
Let’s run our tests.
Hmm, we’ve got a bit of a problem here as we’re trying to test a private function. Luckily there is a
Hex package we can use that will help us out.
Now, some people would argue that testing private functions is not appropriate. I think it depends on the situation, and in this case it makes sense for the following reasons:
- It allows us to quickly determine if we have an issue with the parsing of a particular field.
- It avoids having to write a large and likely complicated test against
parse_tweets. To test
parse_tweets we would need to create a comprehensive HTML snippet in the test that captured all the fields we need to parse, and we’d then need to check all of those fields within a single test.
- It helps to document in our code what we are parsing within the HTML. For instance with our display name test in place it is very easy to see what we expect to be returned in the Twitter HTML when it comes to the display name.
- Another option would be to make the functions we want to test public instead of private… but this would add confusion as to how the
Parser module is expected to be used within our code base. We don’t want modules external to the
Parser to call the individual parse functions.
In any case, let’s update
mix.exs and get the test passing. We’ll be using Publicist. As per the
Publicist documentation, it maps private items to public when running under the Elixir test environment.
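The dependency entry might look like the following (the version constraint is an example, so check Hex for the current release):

```elixir
# mix.exs
defp deps do
  [
    # ...existing dependencies...
    {:publicist, "~> 1.1"}  # version is an assumption
  ]
end
```

Followed by a mix deps.get, of course.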
We need to make a small change to the Parser module.
Anything under the
use Publicist line will be made public when running in the test environment.
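In the Parser that looks something like this sketch (exactly which helpers sit below the use line is up to you):

```elixir
defmodule TwitterFeed.Parser do
  # ...public functions (parse_tweets, etc.) stay above this line...

  use Publicist

  # Everything below use Publicist is exposed publicly in the :test
  # environment, so our tests can call the private helpers directly.
  defp parse_display_name(tweet_html) do
    tweet_html
    |> Floki.attribute("data-name")
    |> List.first()
  end
end
```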
Now we should have a passing test.
Finally, if you run the application with
iex, the display name will show up in our returned data.
Notice that at the point in time that I ran this, the third tweet was a re-tweet, so it has the display name of the original tweeter.
To end things off, let’s see how easy it is to create documentation with Elixir. All we need to do is update our dependencies in the
mix.exs file, adding an entry for ex_doc.
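That entry might look like the following (the version constraint is an example):

```elixir
# mix.exs
defp deps do
  [
    # ...existing dependencies...
    {:ex_doc, "~> 0.19", only: :dev, runtime: false}  # version is an assumption
  ]
end
```

After a mix deps.get, running mix docs generates the documentation into the doc directory; open doc/index.html in your browser to view it.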
Now we’re good to go!
Check out the generated docs in your browser; our comments for the
Tweet struct are showing up as expected.
It seems like it’s been a long time coming, but we are finally starting to see our Twitter scraper come together. Slowly but surely we’re building things up, and now that we’ve managed to parse out our first field, things should start to move along at a good pace.
In the next installment we’ll tackle the rest of our fields and eventually get back to our
Scraper tests and functionality.
Thanks for reading and I hope you enjoyed the post!