At the end of Part 3 we finally got around to parsing out some data from the HTML we retrieved from Twitter. We’ll continue building out our Tweet struct and aim to have all our fields populated by the end of this post.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
Getting started
If you followed along with Part 3, just continue on with the code you created there. If not, and you’d rather jump right into Part 4, you can use a clone of Part 3 as a starting point.
Clone the Repo
If grabbing the code from GitHub instead of continuing along from part 3, the first step is to clone the repo.
Terminal
OK, you’ve either gotten the code from GitHub or are using the existing code you created in Part 3. Let’s get to parsing!
Parsing the rest of our fields
You’ll recall from Part 3 that we got to the point where we had parsed the display_name field of our struct.
We have 8 more fields to populate, and we’ll be mucking about pretty much exclusively in the Parser module today. Let’s get at it!
Parsing the user_id
The user_id, like many of the fields we want to capture, is an attribute of the .tweet div.
Let’s put together a test.
/test/twitter_feed/parser_test.exs
Very simple: we’ve added a new test, parsing of user_id, that contains an HTML snippet with the data-user-id attribute and checks for the value of the attribute upon parsing.
Now for the implementation.
/lib/twitter_feed/parser.ex
Very similar to parse_display_name. We’re just using Floki to grab the value of the attribute, and then in this case, we convert it to an integer before returning it.
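As a rough illustration of that pattern, here is a minimal standalone sketch. It is not the code from the repository: the ParserSketch module name and the sample HTML are made up for this example, and the Floki version pin is an assumption. It uses Mix.install so it can run as a script outside the project; inside the project you would rely on the Floki dependency you already have.

```elixir
# Standalone script: fetches Floki on first run (needs network access).
# The version requirement is an assumption, not taken from the project.
Mix.install([{:floki, "~> 0.36"}])

defmodule ParserSketch do
  # Hypothetical stand-in for the user_id parsing described above.
  def parse_user_id(tweet_html) do
    tweet_html
    # Returns a list of attribute values for every element matching .tweet
    |> Floki.attribute(".tweet", "data-user-id")
    |> hd()
    # Attribute values are strings, so convert to an integer
    |> String.to_integer()
  end
end

{:ok, tweet} = Floki.parse_document(~s(<div class="tweet" data-user-id="2916"></div>))
IO.inspect(ParserSketch.parse_user_id(tweet))
```

Note that hd/1 raises if the attribute is missing, which is fine for a sketch but worth keeping in mind when the markup is not under your control.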
With that all in place we should now have a new test that passes.
Terminal
Looking good! We’ll be following a similar pattern for the rest of the fields. The code should be pretty similar to the two fields we’ve completed, so explanations will be kept to a minimum from here on out. Also, I suggest you re-run the tests after each field is added, but I’ll be skipping that for the sake of brevity.
Parsing the user_name
The user_name is contained within the .tweet div in a data-screen-name attribute.
Let’s add a test.
/test/twitter_feed/parser_test.exs
And now the implementation.
/lib/twitter_feed/parser.ex
Parsing the tweet_id
The tweet_id is contained within the .tweet div in a data-tweet-id attribute.
The test.
/test/twitter_feed/parser_test.exs
The implementation.
/lib/twitter_feed/parser.ex
Parsing the timestamp
The timestamp, unlike most of our fields, is not a direct attribute of the .tweet div. We need to search out a span with a class of _timestamp within the .tweet div and then grab the data-time-ms attribute.
The test.
/test/twitter_feed/parser_test.exs
The implementation.
/lib/twitter_feed/parser.ex
Not too tricky. We’ve got a few wacky manipulations to get the timestamp into the format we want, but nothing crazy. If you are curious about what some of these functions do, check out the documentation in iex, for example:
Terminal
Terminal
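To make the conversion step concrete, here is a small sketch that assumes the data-time-ms value is a string of milliseconds since the Unix epoch and that a readable UTC string is the target format. The TimestampSketch module and format_timestamp/1 are hypothetical names, and the chain of functions is one plausible way to do it rather than necessarily the one used in the Parser module.

```elixir
defmodule TimestampSketch do
  # Turns a millisecond epoch string into a UTC date/time string.
  def format_timestamp(time_ms) when is_binary(time_ms) do
    time_ms
    |> String.to_integer()
    # Interpret the integer as milliseconds rather than the default seconds
    |> DateTime.from_unix!(:millisecond)
    |> DateTime.to_string()
  end
end

IO.puts(TimestampSketch.format_timestamp("1519339506000"))
```

Each of these functions can be looked up from an iex session with the h helper, for example h DateTime.from_unix!.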
Parsing the tweet text
We’ll derive the text_summary from the Tweet text, which we find within a p tag with a class of tweet-text.
The test.
/test/twitter_feed/parser_test.exs
The implementation.
/lib/twitter_feed/parser.ex
Nothing we haven’t seen before. We’re applying a String.trim() at the end in order to clean up any extraneous whitespace.
Parsing the image_url
The image_url is nested in a div within the .tweet div. The nested div has a class of AdaptiveMedia-photoContainer and the URL is specified in the data-image-url attribute.
The test.
/test/twitter_feed/parser_test.exs
The implementation.
/lib/twitter_feed/parser.ex
Again pretty simple.
Parsing the retweet field
In the case of a retweet, the .tweet div will contain a data-retweeter attribute. So we just need to check whether that attribute exists or not.
We need two tests for this field, one to check the correct functionality when the Tweet is not a retweet and another for when it is.
/test/twitter_feed/parser_test.exs
The implementation.
/lib/twitter_feed/parser.ex
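The attribute check can be sketched as follows. This is an illustration only: the RetweetSketch module and the sample HTML are made up, the Floki version pin is an assumption, and Mix.install is used so the snippet runs as a standalone script.

```elixir
# Standalone script: fetches Floki on first run (needs network access).
Mix.install([{:floki, "~> 0.36"}])

defmodule RetweetSketch do
  # A Tweet counts as a retweet when its .tweet element carries a
  # data-retweeter attribute, whatever the value is.
  def is_retweet(tweet_html) do
    retweeters = Floki.attribute(tweet_html, ".tweet", "data-retweeter")
    # An empty list means the attribute was not found on any .tweet element
    retweeters != []
  end
end

{:ok, original} = Floki.parse_document(~s(<div class="tweet"></div>))
{:ok, retweet} = Floki.parse_document(~s(<div class="tweet" data-retweeter="someone"></div>))

IO.inspect(RetweetSketch.is_retweet(original))
IO.inspect(RetweetSketch.is_retweet(retweet))
```

Returning a plain boolean keeps the function easy to use later on as the first argument to a pattern-matched function.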
If you haven’t been running the tests as we’ve been adding the fields, now would be a good time to check that everything is working!
Terminal
Populating the Struct
The next step is to populate the Tweet structure with the fields we’ve parsed. This is simple: all we need to do is call into our parse functions.
/lib/twitter_feed/parser.ex
Notice we also have a truncate/1 function we are calling to create the text summary. We’ll need to create this function, so let’s add a few more tests and an implementation for this.
/test/twitter_feed/parser_test.exs
We can see from the tests that we are expecting the truncate/1 function to truncate any text that exceeds 30 characters. It will add 3 trailing periods to the text to indicate it has been truncated, meaning the truncated text will be 33 characters total.
The implementation.
/lib/twitter_feed/parser.ex
Not too tricky, we’re just using some existing String functions to accomplish the truncation.
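For reference, the behaviour described above (cut at 30 characters, then append 3 periods) can be sketched with String.length/1 and String.slice/3. The TruncateSketch module name is hypothetical, and this may not match the implementation in the repository line for line.

```elixir
defmodule TruncateSketch do
  @max_length 30

  # Text longer than @max_length is cut to @max_length characters and
  # gets "..." appended, so truncated results are 33 characters long.
  def truncate(text) do
    if String.length(text) > @max_length do
      String.slice(text, 0, @max_length) <> "..."
    else
      # Short text is returned unchanged
      text
    end
  end
end

IO.puts(TruncateSketch.truncate("short text"))
IO.puts(TruncateSketch.truncate(String.duplicate("a", 40)))
```

String.length/1 and String.slice/3 count graphemes rather than bytes, so multi-byte characters such as emoji are not split in the middle.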
Let’s make sure everything is still passing:
Terminal
Great! Now let’s see how this looks in iex.
Terminal
Terminal
Perfect, things are looking good. Notice the 4th tweet in the screenshot above is correctly identified as a retweet. The display_name, user_name and user_id fields are all correctly populated with the values for the original author.
Parsing the handle_id
You’ve probably noticed we have yet to parse the handle_id. That’s because it is a little tricky and actually depends on some of the fields that we’ve now parsed.
Regardless of whether a Tweet is a retweet or not, the handle_id should always refer to the current feed we are retrieving data from. When the Tweet is not a retweet, the handle_id is the same as the user_id. When it is a retweet, we can parse out the handle_id from an anchor tag that refers to the user who retweeted the Tweet.
Let’s see some tests, hopefully a concrete example will help to clarify what is going on.
/test/twitter_feed/parser_test.exs
Looking at the call to the parse_handle_id/3 function in the tests, we see it takes the following parameters:
- A boolean to indicate whether the Tweet is a retweet or not.
- An integer representing the user_id of the Tweet.
- The .tweet div for the current Tweet we are processing, just like all our other parse functions.
We can see in the first test, where the Tweet is not a retweet, that we expect the handle_id to equal the value passed in to the user_id parameter.
In the second test we’re handling a retweet and need to parse the handle_id out of the anchor tag.
The implementation.
/lib/twitter_feed/parser.ex
We’re using pattern matching to determine which version of parse_handle_id/3 to run. The implementations themselves are straightforward.
We should now have a couple more passing tests.
Terminal
Now let’s see how we can hook this into the Tweet struct.
/lib/twitter_feed/parser.ex
So we’ve pulled out the calls to parse_user_id and is_retweet in order to have their return values available when we call into parse_handle_id. We also replace the function calls that we were making within the struct for user_id and retweet with the user_id and is_retweet variables.
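The shape of that change can be sketched without any HTML at all. In the example below the parse functions are stubs that read from a plain map, and TweetSketch is a cut-down, hypothetical version of the Tweet struct; only the wiring (bind the two values first, then reuse them) mirrors what is described above.

```elixir
defmodule TweetSketch do
  # Hypothetical, trimmed-down struct with only the fields involved here.
  defstruct [:handle_id, :user_id, :retweet]
end

defmodule WiringSketch do
  # Stubbed parsers reading from a map; the real ones take the .tweet HTML.
  defp parse_user_id(tweet), do: tweet.user_id
  defp is_retweet(tweet), do: Map.has_key?(tweet, :retweeter_id)
  defp parse_handle_id(false, user_id, _tweet), do: user_id
  defp parse_handle_id(true, _user_id, tweet), do: tweet.retweeter_id

  def parse_tweet(tweet) do
    # Computed once up front so both values can be passed to parse_handle_id
    user_id = parse_user_id(tweet)
    is_retweet = is_retweet(tweet)

    %TweetSketch{
      handle_id: parse_handle_id(is_retweet, user_id, tweet),
      user_id: user_id,
      retweet: is_retweet
    }
  end
end

IO.inspect(WiringSketch.parse_tweet(%{user_id: 1}))
IO.inspect(WiringSketch.parse_tweet(%{user_id: 1, retweeter_id: 2}))
```

Binding the values first also means each parse function runs only once per Tweet.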
Let’s have another look in iex.
Terminal
Terminal
And there we go, the handle_id is now properly populated, both for original and retweeted Tweets.
Summary
Slowly things are starting to come together: all our fields are parsed and being returned via the Tweet structure. In the next installment we’ll switch back to working on the Scraper module.
Thanks for reading and I hope you enjoyed the post!