What is Data Science?
“Data Science is about extraction, preparation, analysis, visualization, and maintenance of information. It is a cross-disciplinary field which uses scientific methods and processes to draw insights from data. ”
i. R
R is a scripting language that is specifically tailored for statistical computing. It is widely used for data analysis, statistical modeling, time-series forecasting, clustering etc. R is mostly used for statistical operations.
It also possesses the features of an object-oriented programming language. R is an interpreter based language and is widely popular across multiple industries
ii. Python
Like R, Python is an interpreter based high-level programming language. Python is a versatile language. It is mostly used for Data Science and Software Development. Python has gained popularity due to its ease of use and code readability.
As a result, Python is widely used for Data Analysis, Natural Language Processing, and Computer Vision. Python comes with various graphical and statistical packages like Matplotlib, Numpy, SciPy and more advanced packages for Deep Learning such as TensorFlow, PyTorch, Keras etc.
For the purpose of data mining, wrangling, visualizations and developing predictive models, we utilize Python. This makes Python a very flexible programming language.
iii. SQL
SQL stands for Structured Query Language. Data Scientists use SQL for managing and querying data stored in databases. Being able to extract information from databases is the first step towards analyzing the data. Relational Databases are a collection of data organized in tables.
We use SQL for extracting, managing and manipulating the data. For example A Data Scientist working in the banking industry uses SQL for extracting information of customers. While Relational Databases use SQL, ‘NoSQL’ is a popular choice for non-relational or distributed databases.
Recently NoSQL has been gaining popularity due to its flexible scalability, dynamic design, and open source nature. MongoDB, Redis, and Cassandra are some of the popular NoSQL languages.
iv. Hadoop
Big data is another trending term that deals with management and storage of huge amount of data. Data is either structured or unstructured. A Data Scientist must have a familiarity with complex data and must know tools that regulate the storage of massive datasets.
One such tool is Hadoop. While being open-source software, Hadoop utilizes a distributed storage system using a model called ‘MapReduce’. There are several packages in Hadoop such as Apache Pig, Hive, HBase etc.
Due to its ability to process colossal data quickly, its scalable architecture and low-cost deployment, Hadoop has grown to become the most popular software for Big Data.
v. Tableau
Tableau is a Data Visualization software specializing in graphical analysis of data. It allows its users to create interactive visualizations and dashboards.
This makes Tableau an ideal choice for showing various trends and insights of the data in the form of interactable charts such as Treemaps, Histograms, Box plots etc. An important feature of Tableau is its ability to connect with spreadsheets, relational databases, and cloud platforms.
This allows Tableau to process data directly, making it easier for the users.