At the end of Part 3 we finally got around to parsing out some data from the HTML we retrieved from Twitter. We’ll continue building out our Tweet struct and aim to have all our fields populated by the end of this post.
I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.
If you followed along with part 3, just continue with the code you created there. If not, and you’d rather jump right into part 4, you can use a clone of part 3 as a starting point.
Clone the Repo
If grabbing the code from GitHub instead of continuing along from part 3, the first step is to clone the repo.
OK, you’ve either gotten the code from GitHub or are using the existing code you created in Part 3. Let’s get to parsing!
Parsing the rest of our fields
You’ll recall from part 3 we got to the point where we had parsed the display_name field of our struct.
We have 8 more fields to populate, and we’ll be mucking about pretty much exclusively in the Parser module today. Let’s get at it!
Parsing the user_id
user_id, like many of the fields we want to capture, is an attribute of the
Let’s put together a test.
Very simple: we’ve added a new test, parsing of user_id, that contains an HTML snippet with the data-user-id attribute and checks the value of that attribute after parsing.
Now for the implementation.
Very similar to parse_display_name: we’re just using Floki to grab the value of the attribute and then, in this case, converting it to an integer before returning it.
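As a rough illustration of that pattern, here is a minimal standalone sketch. It assumes Floki's `parse_fragment!/1` and `attribute/3` functions; the module name `UserIdSketch` and the sample HTML are made up for the example, and the actual implementation in the repo may differ.

```elixir
# Standalone script: pulls Floki from Hex. Inside the project, Floki is
# already a dependency, so this line would not be needed.
Mix.install([{:floki, "~> 0.30"}])

defmodule UserIdSketch do
  # Returns the data-user-id attribute of the .tweet div as an integer.
  def parse_user_id(tweet_html) do
    tweet_html
    |> Floki.attribute(".tweet", "data-user-id") # list of matching attribute values
    |> hd()                                      # take the first (only) value
    |> String.to_integer()
  end
end

# Hypothetical snippet, similar in spirit to the one used in the test.
html = Floki.parse_fragment!(~s(<div class="tweet" data-user-id="12345"></div>))
IO.inspect(UserIdSketch.parse_user_id(html))
# prints 12345
```

Note that `hd/1` raises if the attribute is missing, which is acceptable for a sketch but worth keeping in mind.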
With that all in place we should now have a new test that passes.
Looking good! We’ll be following a similar pattern for the rest of the fields. The code should be pretty similar to the 2 fields we’ve completed, so explanations will be kept to a minimum from here on out. Also, I suggest you re-run the tests after each field is added, but I’ll be skipping that for the sake of brevity.
Parsing the user_name
user_name is contained within the
.tweet div in a
Let’s add a test.
And now the implementation.
Parsing the tweet_id
tweet_id is contained within the
.tweet div in a
Parsing the timestamp
timestamp unlike most of our fields is not a direct attribute of the
.tweet div. We need to search out a span with a class of
_timestamp within the
.tweet div and then grab the
Not too tricky: we’ve got a few wacky manipulations to get the timestamp into the format we want, but nothing crazy. If you are curious about what some of these functions do, check out the documentation in
iex, for example.
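To give a feel for the kind of manipulation involved, here is a hedged sketch. It assumes the raw attribute value is a Unix timestamp in seconds, stored as a string; that is an assumption for illustration only, and the module and function names are hypothetical.

```elixir
defmodule TimestampSketch do
  # Converts a string holding a Unix timestamp (in seconds, an assumption)
  # into a DateTime in UTC.
  def to_datetime(raw) do
    raw
    |> String.trim()          # drop any stray whitespace around the value
    |> String.to_integer()    # "1500000000" -> 1500000000
    |> DateTime.from_unix!()  # seconds since the epoch -> DateTime
  end
end

"1500000000"
|> TimestampSketch.to_datetime()
|> DateTime.to_iso8601()
|> IO.puts()
# prints 2017-07-14T02:40:00Z
```

Each of these functions has built-in documentation, so typing something like `h DateTime.from_unix!` in iex will show what it does.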
Parsing the tweet text
We’ll derive the
text_summary from the Tweet text which we find within a
p tag with a class of
Nothing we haven’t seen before. We’re applying a String.trim() at the end in order to clean up any extraneous whitespace.
Parsing the image_url
image_url is nested in a div within the
.tweet div. The nested div has a class of
.AdaptiveMedia-photoContainer and the URL is specified in the
Again pretty simple.
Parsing the retweet field
In the case of a retweet, the
.tweet will contain a
data-retweeter attribute. So we just need to check if that attribute exists or not.
We need two tests for this field, one to check the correct functionality when the Tweet is not a retweet and another for when it is.
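The existence check can be sketched roughly as follows. This relies on `Floki.attribute/3` returning an empty list when no matching attribute is found; the module name, the function name `retweet?/1`, and the sample HTML are hypothetical.

```elixir
# Standalone script: pulls Floki from Hex. Inside the project, Floki is
# already a dependency, so this line would not be needed.
Mix.install([{:floki, "~> 0.30"}])

defmodule RetweetSketch do
  # true when the .tweet div carries a data-retweeter attribute, false otherwise.
  def retweet?(tweet_html) do
    Floki.attribute(tweet_html, ".tweet", "data-retweeter") != []
  end
end

# One snippet per case, mirroring the two tests described above.
original = Floki.parse_fragment!(~s(<div class="tweet"></div>))
retweet = Floki.parse_fragment!(~s(<div class="tweet" data-retweeter="some_user"></div>))

IO.inspect(RetweetSketch.retweet?(original)) # prints false
IO.inspect(RetweetSketch.retweet?(retweet))  # prints true
```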
If you haven’t been running the tests as we’ve been adding the fields, now would be a good time to check that everything is working!
Populating the Struct
The next step is to populate the Tweet structure with the fields we’ve parsed. This is simple: all we need to do is call into our parse functions.
Notice we also have a
truncate/1 function we are calling to create the text summary. We’ll need to create this function, so let’s add a few more tests and an implementation for this.
We can see by the tests that we are expecting the
truncate/1 function to truncate any text that exceeds 30 characters. It will add 3 trailing periods to the text to indicate it has been truncated, meaning the truncated text will be 33 characters total.
Not too tricky: we’re just using some existing String functions to accomplish the truncation.
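For reference, a minimal sketch of a `truncate/1` that behaves as described (a cut-off at 30 characters plus three trailing periods) could look like the following. The module name and the `@max_length` attribute are illustrative, and the version in the repo may be written differently.

```elixir
defmodule TruncateSketch do
  @max_length 30

  # Returns the text unchanged when it fits within @max_length characters,
  # otherwise the first @max_length characters followed by "...".
  def truncate(text) do
    if String.length(text) > @max_length do
      String.slice(text, 0, @max_length) <> "..."
    else
      text
    end
  end
end

long_text = String.duplicate("a", 40)

IO.puts(TruncateSketch.truncate("short tweet"))
# prints short tweet
IO.inspect(String.length(TruncateSketch.truncate(long_text)))
# prints 33
```

`String.length/1` and `String.slice/3` both count graphemes rather than bytes, so multi-byte characters such as emoji are not split in half.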
Let’s make sure everything is still passing:
Great! Now let’s see how this looks in
Perfect, things are looking good. Notice the 4th tweet in the screenshot above is correctly identified as a retweet. The
user_id fields are all correctly populated with the values for the original author.
Parsing the handle_id
You’ve probably noticed we have yet to parse the
handle_id. This is because it is a little tricky and actually depends on some of the fields that we’ve now parsed.
Regardless of whether a Tweet is a retweet or not, the
handle_id should always refer to the current feed we are retrieving data from. In the case when the Tweet is not a retweet, the
handle_id is the same as the
user_id. When it is a retweet, we can parse out the
handle_id from an anchor tag that refers to the user who retweeted the Tweet.
Let’s see some tests. Hopefully a concrete example will help to clarify what is going on.
Looking at the call to the
parse_handle_id/3 function in the tests we see it takes in the following parameters:
- A boolean to indicate whether the Tweet is a retweet or not.
- An integer representing the user_id of the Tweet.
- The .tweet div for the current Tweet we are processing, just like all our other parse functions.
We can see in the first test where the Tweet is not a retweet, we expect the
handle_id to equal the value passed in to the
In the second test we’re handling a retweet and need to parse the
handle_id out of the anchor tag.
We’re using pattern matching to determine which version of parse_handle_id/3 to run. The implementations themselves are straightforward.
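The two clauses might be sketched like this. The argument order follows the parameter list above; the `a.retweeter-link` selector, the attribute read from the anchor, and the sample HTML are hypothetical placeholders, since the real markup is not reproduced here.

```elixir
# Standalone script: pulls Floki from Hex. Inside the project, Floki is
# already a dependency, so this line would not be needed.
Mix.install([{:floki, "~> 0.30"}])

defmodule HandleIdSketch do
  # Not a retweet: the feed owner wrote the Tweet, so reuse the user_id.
  def parse_handle_id(false, user_id, _tweet_html), do: user_id

  # Retweet: read the retweeter's id from an anchor tag.
  # The selector and attribute name are placeholders for illustration.
  def parse_handle_id(true, _user_id, tweet_html) do
    tweet_html
    |> Floki.attribute("a.retweeter-link", "data-user-id")
    |> hd()
    |> String.to_integer()
  end
end

html =
  Floki.parse_fragment!(
    ~s(<div class="tweet" data-user-id="111"><a class="retweeter-link" data-user-id="222">Retweeter</a></div>)
  )

IO.inspect(HandleIdSketch.parse_handle_id(false, 111, html)) # prints 111
IO.inspect(HandleIdSketch.parse_handle_id(true, 111, html))  # prints 222
```

Matching on the literal `true` and `false` in the function heads keeps each clause free of conditionals, which is the main appeal of this approach.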
We should now have a couple more passing tests.
Now let’s see how we can hook this into the
So we’ve pulled out the calls to
is_retweet in order to make their return values available when we call into
parse_handle_id. We also replace the function calls that we were making within the struct for
retweet with the
Let’s have another look in
And there we go, the
handle_id is now properly populated, both for original and retweeted Tweets.
Slowly things are starting to come together. All our fields are parsed and being returned via the
Tweet structure. In the next installment we’ll switch back to working on the
Thanks for reading and I hope you enjoyed the post!